Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The various performance criteria applied in this analysis include the probability of reaching the ultimate target, the costs, elapsed times and system vulnerability resulting from any intrusion. This Excel file contains all the logical, probabilistic and statistical data entered by a user, and required for the evaluation of the criteria. It also reports the results of all the computations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Companion data for the creation of a banksia plot.

Background: In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses, both within and across datasets.

Methods: The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, and the confidence intervals are then scaled to span a range of one. The point estimates and confidence intervals from matching comparator analyses are adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the differences in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) datasets with widely varying characteristics, while the second assesses data extraction accuracy by comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs in the accompanying manuscripts.

Results: In the banksia plot of the statistical method comparison, it was clear that there was no difference, on average, in point estimates, and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data with those from the original data, it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.

Conclusions: The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified. This collection of files allows the user to create the images used in the companion paper and to amend the code to create their own banksia plots using either Stata version 17 or R version 4.3.1.
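As a minimal illustration of the centring and scaling described in the Methods (a Python sketch with made-up numbers, not the Stata/R code supplied with the paper):

def centre_and_scale(ref_est, ref_lo, ref_hi, comp_est, comp_lo, comp_hi):
    """Centre the reference estimate to zero, scale its CI to a width of one,
    then apply the identical shift and scale to the comparator analysis."""
    shift = ref_est
    scale = ref_hi - ref_lo  # reference CI width becomes 1 after division
    transform = lambda x: (x - shift) / scale
    ref = tuple(map(transform, (ref_est, ref_lo, ref_hi)))
    comp = tuple(map(transform, (comp_est, comp_lo, comp_hi)))
    return ref, comp

# Hypothetical example: reference estimate 2.0 (95% CI 1.0 to 3.0),
# comparator estimate 2.5 (95% CI 1.0 to 4.0).
ref, comp = centre_and_scale(2.0, 1.0, 3.0, 2.5, 1.0, 4.0)
print(ref)   # (0.0, -0.5, 0.5)   -> centred at zero, CI spans one
print(comp)  # (0.25, -0.5, 1.0)  -> relative position and width preserved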
Hydrology Graphs

This repository contains the code for the manuscript "A Graph Formulation for Tracing Hydrological Pollutant Transport in Surface Waters." There are three main folders containing code and data, outlined below. We call the framework for building a graph of these hydrological systems "Hydrology Graphs". Several of the data files for building this framework are large and cannot be stored on GitHub. To conserve space, the notebook get_and_unpack_data.ipynb or the script get_and_unpack_data.py can be used to download the data from the Watershed Boundary Dataset (WBD), the National Hydrography Dataset (NHDPlusV2), and the agricultural land dataset for the state of Wisconsin. The files WILakes.df and WIRivers.df mentioned in section 1 below are contained within the WI_lakes_rivers.zip folder, and the files of the 24k Hydro Waterbodies dataset are contained in a zip file under the directory DNR_data/Hydro_Waterbodies. These files can also be unpacked by running the corresponding cells in get_and_unpack_data.ipynb or get_and_unpack_data.py.

1. graph_construction: This folder contains the data and code for building a graph of the watershed-river-waterbody hydrological system. It uses data from the Watershed Boundary Dataset (link here) and the National Hydrography Dataset (link here) as a basis and builds a list of directed edges. We use NetworkX to build and visualize the list as a graph.

2. case_studies: This folder contains three .ipynb files for three separate case studies. These case studies focus on how "Hydrology Graphs" can be used to analyze pollutant impacts in surface waters. Details of these case studies can be found in the manuscript above.

3. DNR_data: This folder contains data from the Wisconsin Department of Natural Resources (DNR) on water quality in several Wisconsin lakes. The data were obtained from here using the file Web_scraping_script.py. The original downloaded reports are found in the folder original_lake_reports. These reports were then cleaned and reformatted using the script DNR_data_filter.ipynb. The resulting, cleaned reports are found in the Lakes folder. Each subfolder of the Lakes folder contains data for a single lake. The two .csv files (lake_index_WBIC.csv and lake_index_WBIC_COMID.csv) provide an index of which lake each numbered subfolder corresponds to. In addition, we added the corresponding COMID in lake_index_WBIC_COMID.csv by matching the NHDPlusV2 data to the Wisconsin DNR's 24k Hydro Waterbodies dataset, which we downloaded from here. The DNR's reported data only matches lakes to a waterbody identification code (WBIC), so we use HYDROLakes (indexed by WBIC) to match to the COMID. This is also done in the DNR_data_filter.ipynb script.

Python Versions: The .py files in graph_construction/ were run using Python version 3.9.7. The scripts used the following packages and version numbers: geopandas (0.10.2), shapely (1.8.1.post1), tqdm (4.63.0), networkx (2.7.1), pandas (1.4.1), numpy (1.21.2).

This dataset is associated with the following publication: Cole, D.L., G.J. Ruiz-Mercado, and V.M. Zavala. A graph-based modeling framework for tracing hydrological pollutant transport in surface waters. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 179: 108457, (2023).
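As a rough, self-contained illustration of the directed edge-list idea used in graph_construction (the node identifiers and in-memory table below are assumptions for illustration, not the repository's actual data files):

import networkx as nx
import pandas as pd

# Hypothetical (upstream, downstream) pairs between watersheds, rivers, and lakes.
edges = pd.DataFrame(
    {"from_id": ["HUC12_A", "HUC12_A", "COMID_1"],
     "to_id":   ["COMID_1", "COMID_2", "LAKE_9"]}
)

G = nx.DiGraph()
G.add_edges_from(zip(edges["from_id"], edges["to_id"]))

# Downstream tracing: every node reachable from a potential pollutant source.
print(nx.descendants(G, "HUC12_A"))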
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years, up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, driven by increased demand during the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

Storage capacity also growing

Only a small percentage of this newly created data is kept, though, as just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
The data explorer allows users to create bespoke cross tabs and charts on consumption by property attributes and characteristics, based on the data available from NEED. Two variables can be selected at once (for example property age and property type), with mean, median or number of observations shown in the table. There is also a choice of fuel (electricity or gas). The data spans 2008 to 2022.
Figures provided in the latest version of the tool (June 2024) are based on data used in the June 2023 National Energy Efficiency Data-Framework (NEED) publication. More information on the development of the framework, headline results and data quality are available in the publication. There are also additional detailed tables including distributions of consumption and estimates at local authority level. The data are also available as a comma separated value (csv) file.
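For users working with the published csv directly, the cross tabs described above can be approximated with a short script. This is a hypothetical sketch assuming pandas; the file name and column names (need_data.csv, PROP_AGE, PROP_TYPE, GAS_CONSUMPTION) are placeholders and should be checked against the actual NEED file.

import pandas as pd

need = pd.read_csv("need_data.csv")  # assumed local copy of the NEED csv

# Mean gas consumption cross-tabulated by property age and property type.
table = need.pivot_table(
    index="PROP_AGE",
    columns="PROP_TYPE",
    values="GAS_CONSUMPTION",
    aggfunc="mean",  # or "median", or "count" for the number of observations
)
print(table)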
If you have any queries or comments on these outputs please contact: energyefficiency.stats@energysecurity.gov.uk.
<p class="gem-c-attachment_metadata"><span class="gem-c-attachment_attribute">2.56 MB</span></p>
<p class="gem-c-attachment_metadata">This file may not be suitable for users of assistive technology.</p>
<details data-module="ga4-event-tracker" data-ga4-event='{"event_name":"select_content","type":"detail","text":"Request an accessible format.","section":"Request an accessible format.","index_section":1}' class="gem-c-details govuk-details govuk-!-margin-bottom-0" title="Request an accessible format.">
Request an accessible format.
If you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format, please email <a href="mailto:alt.formats@energysecurity.gov.uk" target="_blank" class="govuk-link">alt.formats@energysecurity.gov.uk</a>. Please tell us what format you need. It will help us if you say what assistive technology you use.
In this Zenodo repository we present the results of using KROWN to benchmark popular RDF Graph Materialization systems such as RMLMapper, RMLStreamer, Morph-KGC, SDM-RDFizer, and Ontop (in materialization mode).
What is KROWN 👑?
KROWN 👑 is a benchmark for materialization systems to construct Knowledge Graphs from (semi-)heterogeneous data sources using declarative mappings such as RML.
Many benchmarks already exist for virtualization systems, e.g. GTFS-Madrid-Bench, NPD, and BSBM, which focus on complex queries with a single declarative mapping. However, materialization systems are unaffected by complex queries since their input is the dataset and the mappings to generate a Knowledge Graph. Some specialized datasets exist to benchmark specific limitations of materialization systems, such as duplicated or empty values in datasets, e.g. GENOMICS, but they do not cover all aspects of materialization systems. Therefore, it is hard to compare materialization systems with each other in general, which is where KROWN 👑 comes in!
Results
The raw results are available as ZIP archives; the analysis of the results is available in the spreadsheet results.ods.
Evaluation setup
We generated several scenarios using KROWN’s data generator and executed them 5 times with KROWN’s execution framework. All experiments were performed on Ubuntu 22.04 LTS machines (Linux 5.15.0, x86_64), each with an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 48 GB of RAM, and 2 GB of swap memory. The output of each materialization system was set to N-Triples.
Materialization systems
We selected the most popular maintained materialization systems for constructing RDF graphs to perform our experiments with KROWN:
RMLMapper
RMLStreamer
Morph-KGC
SDM-RDFizer
OntopM (Ontop in materialization mode)
Note: KROWN is flexible and allows adding any other materialization system, see KROWN’s execution framework documentation for more information.
Scenarios
We consider the following scenarios:
Raw data: number of rows, columns and cell size
Duplicates & empty values: percentage of the data containing duplicates or empty values
Mappings: Triples Maps (TM), Predicate Object Maps (POM), Named Graph Maps (NG).
Joins: relations (1-N, N-1, N-M), conditions, and duplicates during joins
Note: KROWN is flexible and allows adding any other scenario, see KROWN’s data generator documentation for more information.
In the table below we list all parameter values we used to configure our scenarios:
| Scenario | Parameter values |
| --- | --- |
| Raw data: rows | 10K, 100K, 1M, 10M |
| Raw data: columns | 1, 10, 20, 30 |
| Raw data: cell size | 500, 1K, 5K, 10K |
| Duplicates: percentage | 0%, 25%, 50%, 75%, 100% |
| Empty values: percentage | 0%, 25%, 50%, 75%, 100% |
| Mappings: TMs + 5POMs | 1, 10, 20, 30 TMs |
| Mappings: 20TMs + POMs | 1, 3, 5, 10 POMs |
| Mappings: NG in SM | 1, 5, 10, 15 NGs |
| Mappings: NG in POM | 1, 5, 10, 15 NGs |
| Mappings: NG in SM/POM | 1/1, 5/5, 10/10, 15/15 NGs |
| Joins: 1-N relations | 1-1, 1-5, 1-10, 1-15 |
| Joins: N-1 relations | 1-1, 5-1, 10-1, 15-1 |
| Joins: N-M relations | 3-3, 3-5, 5-3, 10-5, 5-10 |
| Joins: join conditions | 1, 5, 10, 15 |
| Joins: join duplicates | 0, 5, 10, 15 |
biogas/biogas_0/supplydata197.csv
in step 2 where supply data are specified). This dataset is associated with the following publication: Hu, Y., W. Zhang, P. Tominac, M. Shen, D. Göreke, E. Martín-Hernández, M. Martín, G.J. Ruiz-Mercado, and V.M. Zavala. ADAM: A web platform for graph-based modeling and optimization of supply chains. COMPUTERS AND CHEMICAL ENGINEERING. Elsevier Science Ltd, New York, NY, USA, 165: 107911, (2022). CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
FOCUS ON LONDON 2010: INCOME AND SPENDING AT HOME. Household income in London far exceeds that of any other region in the UK. At £900 per week, London’s gross weekly household income is 15 per cent higher than the next highest region. Despite this, the costs to each household are also higher in the capital. Londoners pay a greater amount of their income in tax and national insurance than the UK average, as well as footing a higher bill for housing and everyday necessities. All of which leaves London households less well off than the headline figures suggest. This chapter, authored by Richard Walker in the GLA Intelligence Unit, begins with an analysis of income at both individual and household level, before discussing the distribution and sources of income. This is followed by a look at wealth and borrowing and, finally, a focus on expenditure, including an insight into the cost of housing in London compared with other regions in the UK. See other reports from this Focus on London series. REPORT: To view the report online, click on the link: Income and Spending Report PDF. PRESENTATION: This interactive presentation answers the question: who really is better off, an average London or UK household? The analysis takes into account available data from all types of income and expenditure. Click on the link to access the Prezi; a plain text version of the Prezi is also available. RANKINGS:
This interactive chart shows some key borough-level income and expenditure data and helps show the relationships between five datasets. Users can rank each of the indicators in turn (Borough rankings Tableau Chart). MAP: These interactive borough maps help to geographically present a range of income and expenditure data within London (Interactive Maps - Instant Atlas). DATA: All the data contained within the Income and Spending at Home report, as well as the data used to create the charts and maps, can be accessed in this spreadsheet (Report data). FACTS: Some interesting facts from the data…
● Five boroughs with the highest median gross weekly pay per person in 2009:
1. Kensington & Chelsea - £809
2. City of London - £767
3. Westminster - £675
4. Wandsworth - £636
5. Richmond - £623
32. Brent - £439
33. Newham - £422
● Five boroughs with the highest median weekly rent for a 2 bedroom property in October 2010:
1. Kensington & Chelsea - £550
2. Westminster - £500
3. City of London - £450
4. Camden - £375
5. Islington - £360
32. Havering - £183
33. Bexley - £173
● Five boroughs with the highest percentage of households that own their home outright in 2009:
1. Bexley - 38 per cent
2. Havering - 36 per cent
3. Richmond - 32 per cent
4. Bromley - 31 per cent
5. Barnet - 28 per cent
31. Tower Hamlets - 9 per cent
32. Southwark - 9 per cent
These datasets contain reviews from the Goodreads book review website, and a variety of attributes describing the items. Critically, these datasets capture multiple levels of user interaction, ranging from adding a book to a shelf to rating and reading it.
Metadata includes
reviews
add-to-shelf, read, review actions
book attributes: title, isbn
graph of similar books
Basic Statistics:
Items: 1,561,465
Users: 808,749
Interactions: 225,394,930
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Global Biotic Interactions: Interpreted Data Products
Global Biotic Interactions (GloBI, https://globalbioticinteractions.org, [1]) aims to facilitate access to existing species interaction records (e.g., predator-prey, plant-pollinator, virus-host). This data publication provides interpreted species interaction data products. These products are the result of a process in which versioned, existing species interaction datasets ([2]) are linked to the so-called GloBI Taxon Graph ([3]) and transformed into various aggregate formats (e.g., tsv, csv, neo4j, rdf/nquad, darwin core-ish archives). In addition, the applied name maps are included to make the applied taxonomic linking explicit.
Citation
--------
GloBI is made possible by researchers, collections, projects and institutions openly sharing their datasets. When using this data, please make sure to attribute these *original data contributors*, including citing the specific datasets in derivative work. Each species interaction record indexed by GloBI contains a reference and dataset citation. Also, a full list of all references can be found in the citations.csv/citations.tsv files in this publication. If you have ideas on how to make it easier to cite original datasets, please open/join a discussion via https://globalbioticinteractions.org or related projects.
To credit GloBI for more easily finding interaction data, please use the following citation to reference GloBI:
Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2014.08.005.
Bias and Errors
--------
As with any analysis and processing workflow, care should be taken to understand the bias and error propagation of data sources and related data transformation processes. The datasets indexed by GloBI are biased geospatially, temporally and taxonomically ([4], [5]). Also, the mapping of verbatim names from datasets to known name concepts may contain errors due to synonym mismatches, outdated name lists, typos or conflicting name authorities. Finally, bugs may introduce bias and errors in the resulting integrated data product.
To help better understand where bias and errors are introduced, only versioned data and code are used as inputs: the datasets ([2]), name maps ([3]) and integration software ([6]) are versioned so that the integration processes can be reproduced if needed. This way, the steps taken to compile an integrated data record can be traced and the sources of bias and errors can be more easily found.
This version was preceded by [7].
Contents
--------
README:
this file
citations.csv.gz:
contains data citations in a gzipped comma-separated values format.
citations.tsv.gz:
contains data citations in a gzipped tab-separated values format.
datasets.csv.gz:
contains list of indexed datasets in a gzipped comma-separated values format.
datasets.tsv.gz:
contains list of indexed datasets in a gzipped tab-separated values format.
verbatim-interactions.csv.gz:
contains species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are *not* interpreted, but included as documented in their sources.
verbatim-interactions.tsv.gz:
contains species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are *not* interpreted, but included as documented in their sources.
interactions.csv.gz:
contains species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
interactions.tsv.gz:
contains species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
refuted-interactions.csv.gz:
contains refuted species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
refuted-interactions.tsv.gz:
contains refuted species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
refuted-verbatim-interactions.csv.gz:
contains refuted species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are *not* interpreted, but included as documented in their sources.
refuted-verbatim-interactions.tsv.gz:
contains refuted species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are *not* interpreted, but included as documented in their sources.
interactions.nq.gz:
contains species interactions expressed in the resource description framework in a gzipped rdf/quads format.
dwca-by-study.zip:
contains species interactions data as a Darwin Core Archive aggregated by study using a custom, occurrence level, association extension.
dwca.zip:
contains species interactions data as a Darwin Core Archive using a custom, occurrence level, association extension.
neo4j-graphdb.zip:
contains a neo4j v3.5.32 graph database snapshot containing a graph representation of the species interaction data.
taxonCache.tsv.gz:
contains hierarchies and identifiers associated with names from naming schemes in a gzipped tab-separated values format.
taxonMap.tsv.gz:
describes how names in existing datasets were mapped into existing naming schemes in a gzipped tab-separated values format.
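As a minimal, hedged example of working with the tabular products listed above, the interpreted interactions table can be loaded with pandas. The file name follows the contents list; the exact column layout should be checked against the file header before filtering.

import pandas as pd

interactions = pd.read_csv("interactions.tsv.gz", sep="\t", compression="gzip",
                           low_memory=False)
print(interactions.columns.tolist())  # inspect the available columns
print(interactions.head())            # first few pair-wise interaction records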
References
-----
[1] Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. doi: 10.1016/j.ecoinf.2014.08.005.
[2] Poelen, J. H. (2020) Global Biotic Interactions: Elton Dataset Cache. Zenodo. doi: 10.5281/ZENODO.3950557.
[3] Poelen, J. H. (2021). Global Biotic Interactions: Taxon Graph (Version 0.3.28) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4451472
[4] Hortal, J. et al. (2015) Seven Shortfalls that Beset Large-Scale Knowledge of Biodiversity. Annual Review of Ecology, Evolution, and Systematics, 46(1), pp.523–549. doi: 10.1146/annurev-ecolsys-112414-054400.
[5] Cains, M. et al. (2017) Ivmooc 2017 - Gap Analysis Of Globi: Identifying Research And Data Sharing Opportunities For Species Interactions. Zenodo. Zenodo. doi: 10.5281/ZENODO.814978.
[6] Poelen, J. et al. (2022) globalbioticinteractions/globalbioticinteractions v0.24.6. Zenodo. doi: 10.5281/ZENODO.7327955.
[7] GloBI Community. (2023). Global Biotic Interactions: Interpreted Data Products hash://md5/89797a5a325ac5c50990581689718edf hash://sha256/946178b36c3ea2f2daa105ad244cf5d6cd236ec8c99956616557cf4e6666545b (0.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8284068
Content References
-----
hash://sha256/fb4e5f2d0288ab9936dc2298b0a7a22526f405e540e55c3de9c1cbd01afa9a00 citations.csv.gz
hash://sha256/12a154440230203b9d54f5233d4bda20c482d9d2a34a8363c6d7efdf4281ee47 citations.tsv.gz
hash://sha256/236882c394ff15eda4fe2e994a8f07cb9c0c42bd77d9a5339c9fac217b16a004 datasets.csv.gz
hash://sha256/236882c394ff15eda4fe2e994a8f07cb9c0c42bd77d9a5339c9fac217b16a004 datasets.tsv.gz
hash://sha256/42d50329eca99a6ded1b3fc63af5fa99b029b44ffeba79a02187311422c8710c dwca-by-study.zip
hash://sha256/77f7e1db20e977287ed6983ce7ea1d8b35bd88fe148372b9886ce62989bc2c22 dwca.zip
hash://sha256/4fb8f91d5638ef94ddc0b301e891629802e8080f01e3040bf3d0e819e0bfbe9e interactions.csv.gz
hash://sha256/c83ffa45ffc8e32f1933d23364c108fff92d8b9480401d54e2620a961ad9f0c5 interactions.nq.gz
hash://sha256/ce0d1ce3bebf94198996f471a03a15ad54a8c1aac5a5a6905e0f2fd4687427ac interactions.tsv.gz
hash://sha256/e4adf8c0fe545410c08e497d3189075a262f086977556c0f0fd229f8a2f39ffe neo4j-graphdb.zip
hash://sha256/8cbf6cd70ecbd724f1a4184aeeb0ba78b67747a627e5824d960fe98651871b34 refuted-interactions.csv.gz
hash://sha256/caa0f7bcf91531160fda7c4fc14020154ce6183215f77aacb8dbb0b823295022 refuted-interactions.tsv.gz
hash://sha256/29ed2703c0696d0d6ab1f1a00fcdce6da7c86d0a85ddd6e8bb00a3b1017daac9 refuted-verbatim-interactions.csv.gz
hash://sha256/5542136e32baa935ffa4834889f6af07989fab94db763ab01a3e135886a23556 refuted-verbatim-interactions.tsv.gz
hash://sha256/af742d945a1ecdb698926589fceb8147e99f491d7475b39e9b516ce1cfe2599b taxonCache.tsv.gz
hash://sha256/1a85b81dc9312994695e63966dec06858bbcd3c084f5044c29371b1c14f15c3d taxonMap.tsv.gz
hash://sha256/5f9ebc62be68f7ffb097c4ff168e6b7b45b1e835843c90a2af6b30d7e2a9eab1 verbatim-interactions.csv.gz
hash://sha256/d29704b6275a2f7aaffbd131d63009914bdbbf1d9bc2667ff4ce0713d586f4f6 verbatim-interactions.tsv.gz
hash://md5/735599feaf18a416a375d985a27f51bb citations.csv.gz
hash://md5/328049ca46682b8aee2611fe3ef2e3c9 citations.tsv.gz
hash://md5/8a645af66bf9cf8ddae0c3d6bc3ccb30 datasets.csv.gz
hash://md5/8a645af66bf9cf8ddae0c3d6bc3ccb30 datasets.tsv.gz
hash://md5/654eb9d9445ed382036f0e45398ec6bb dwca-by-study.zip
hash://md5/291e517d3ca72b727d85501a289d7d59 dwca.zip
hash://md5/4dbfb8605adce1c0e2165d5bdb918f95
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tutorial will teach you how to take time-series data from many field sites and create a shareable online map, where clicking on a field location brings you to a page with interactive graph(s).
The tutorial can be completed with a sample dataset (provided via a Google Drive link within the document) or with your own time-series data from multiple field sites.
Part 1 covers how to make interactive graphs in Google Data Studio and Part 2 covers how to link data pages to an interactive map with ArcGIS Online. The tutorial will take 1-2 hours to complete.
An example interactive map and data portal can be found at: https://temple.maps.arcgis.com/apps/View/index.html?appid=a259e4ec88c94ddfbf3528dc8a5d77e8
Replication files for "Job-to-Job Mobility and Inflation"
Authors: Renato Faccini and Leonardo Melosi
Review of Economics and Statistics
Date: February 2, 2023

ORDER OF TOPICS
Section 1. We explain the code to replicate all the figures in the paper (except Figure 6).
Section 2. We explain how Figure 6 is constructed.
Section 3. We explain how the data are constructed.

SECTION 1
Replication_Main.m is used to reproduce all the figures of the paper except Figure 6. All the primitive variables are defined in the code and all the steps are commented in the code to facilitate the replication of our results. Replication_Main.m should be run in Matlab. The authors tested it on a DELL XPS 15 7590 laptop with the following characteristics:
Processor: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz 2.40 GHz
Installed RAM: 64.0 GB
System type: 64-bit operating system, x64-based processor
It took 2 minutes and 57 seconds for this machine to construct Figures 1, 2, 3, 4a, 4b, 5, 7a, and 7b. The following versions of Matlab and Matlab toolboxes were used for the test:
MATLAB Version: 9.7.0.1190202 (R2019b)
MATLAB License Number: 363305
Operating System: Microsoft Windows 10 Enterprise Version 10.0 (Build 19045)
Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
MATLAB Version 9.7 (R2019b)
Financial Toolbox Version 5.14 (R2019b)
Optimization Toolbox Version 8.4 (R2019b)
Statistics and Machine Learning Toolbox Version 11.6 (R2019b)
Symbolic Math Toolbox Version 8.4 (R2019b)
The replication code uses auxiliary files and saves the figures in various subfolders:
\JL_models: contains the equations describing the model, including the observation equations, and the routine used to solve the model. To do so, the routine in this folder calls other routines located in some of the subfolders below.
\gensystoama: contains a set of codes that allow us to solve linear rational expectations models. We use the AMA solver. More information is provided in the file AMASOLVE.m. The codes in this subfolder were developed by Alejandro Justiniano.
\filters: contains the Kalman filter augmented with a routine to make sure that the zero lower bound constraint for the nominal interest rate is satisfied in every period of our sample.
\SteadyStateSolver: contains a set of routines that are used to solve the steady state of the model numerically.
\NLEquations: contains some of the equations of the model that are log-linearized using the symbolic toolbox of Matlab.
\NberDates: contains a set of routines that add shaded areas to graphs to denote NBER recessions.
\Graphics: contains useful codes enabling features to construct some of the graphs in the paper.
\Data: contains the data set used in the paper.
\Params: contains a spreadsheet with the values attributed to the model parameters.
\VAR_Estimation: contains the forecasts implied by the Bayesian VAR model of Section 2.
The outputs of Replication_Main.m are the figures of the paper, which are stored in the subfolder \Figures.

SECTION 2
The Excel file "Figure-6.xlsx" is used to create the charts in Figure 6. All three panels of the charts (A, B, and C) plot a measure of unexpected wage inflation against the unemployment rate, then fit separate linear regressions for the periods 1960-1985, 1986-2007, and 2008-2009. Unexpected wage inflation is given by the difference between wage growth and a measure of expected wage growth. In all three panels, the unemployment rate used is the civilian unemployment rate (UNRATE), seasonally adjusted, from the BLS. The sheet "Panel A" uses quarterly manufacturing sector average hourly earnings growth data, seasonally adjusted (CES3000000008), from the Bureau of Labor Statistics (BLS) Employment Situation report as the measure of wage inflation. The unexpected wage inflation is given by the difference between earnings growth at time t and the average of earnings growth across the previous four months. Growth rates are annualized quarterly values. The sheet "Panel B" uses quarterly Nonfarm Business Sector Compensation Per Hour, seasonally adjusted (COMPNFB), from the BLS Productivity and Costs report as its measure of wage inflation. As in Panel A, expected wage inflation is given by the... Visit https://dataone.org/datasets/sha256%3A44c88fe82380bfff217866cac93f85483766eb9364f66cfa03f1ebdaa0408335 for complete metadata about this dataset.
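As a rough illustration of the Panel A calculation described above (a Python sketch with placeholder values, not code from the replication package), unexpected wage inflation can be computed as wage growth at time t minus the average of the previous four observations:

import pandas as pd

# Hypothetical annualized wage growth series (percent), quarterly frequency.
wage_growth = pd.Series(
    [3.1, 2.8, 3.4, 3.0, 3.6, 2.9],
    index=pd.period_range("2006Q1", periods=6, freq="Q"),
)

expected = wage_growth.rolling(window=4).mean().shift(1)  # mean of the previous 4 periods
unexpected = wage_growth - expected                        # unexpected wage inflation
print(unexpected)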
Our consumer data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.
Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.
1. Geography - City, State, ZIP, County, CBSA, Census Tract, etc.
2. Demographics - Gender, Age Group, Marital Status, Language, etc.
3. Financial - Income Range, Credit Rating Range, Credit Type, Net Worth Range, etc.
4. Persona - Consumer Type, Communication Preferences, Family Type, etc.
5. Interests - Content, Brands, Shopping, Hobbies, Lifestyle, etc.
6. Household - Number of Children, Number of Adults, IP Address, etc.
7. Behaviours - Brand Affinity, App Usage, Web Browsing, etc.
8. Firmographics - Industry, Company, Occupation, Revenue, etc.
9. Retail Purchase - Store, Category, Brand, SKU, Quantity, Price, etc.
10. Auto - Car Make, Model, Type, Year, etc.
11. Housing - Home Type, Home Value, Renter/Owner, Year Built, etc.
Consumer Graph Schema & Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU & Monthly Location Pings:
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).
Consumer Graph Use Cases: 360-Degree Customer View: Get a comprehensive image of customers by means of internal and external data aggregation. Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting through user data enrichment. Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity. Advertising & Marketing: Understand audience demographics, interests, lifestyle, hobbies, and behaviors to build targeted marketing campaigns.
Here's the schema of Consumer Data:
person_id
first_name
last_name
age
gender
linkedin_url
twitter_url
facebook_url
city
state
address
zip
zip4
country
delivery_point_bar_code
carrier_route
walk_seuqence_code
fips_state_code
fips_country_code
country_name
latitude
longtiude
address_type
metropolitan_statistical_area
core_based+statistical_area
census_tract
census_block_group
census_block
primary_address
pre_address
streer
post_address
address_suffix
address_secondline
address_abrev
census_median_home_value
home_market_value
property_build+year
property_with_ac
property_with_pool
property_with_water
property_with_sewer
general_home_value
property_fuel_type
year
month
household_id
Census_median_household_income
household_size
marital_status
length+of_residence
number_of_kids
pre_school_kids
single_parents
working_women_in_house_hold
homeowner
children
adults
generations
net_worth
education_level
occupation
education_history
credit_lines
credit_card_user
newly_issued_credit_card_user
credit_range_new
credit_cards
loan_to_value
mortgage_loan2_amount
mortgage_loan_type
mortgage_loan2_type
mortgage_lender_code
mortgage_loan2_render_code
mortgage_lender
mortgage_loan2_lender
mortgage_loan2_ratetype
mortgage_rate
mortgage_loan2_rate
donor
investor
interest
buyer
hobby
personal_email
work_email
devices
phone
employee_title
employee_department
employee_job_function
skills
recent_job_change
company_id
company_name
company_description
technologies_used
office_address
office_city
office_country
office_state
office_zip5
office_zip4
office_carrier_route
office_latitude
office_longitude
office_cbsa_code
office_census_block_group
office_census_tract
office_county_code
company_phone
company_credit_score
company_csa_code
company_dpbc
company_franchiseflag
company_facebookurl
company_linkedinurl
company_twitterurl
company_website
company_fortune_rank
company_government_type
company_headquarters_branch
company_home_business
company_industry
company_num_pcs_used
company_num_employees
company_firm_individual
company_msa
company_msa_name
company_naics_code
company_naics_description
company_naics_code2
company_naics_description2
company_sic_code2
company_sic_code2_description
company_sic_code4
company_sic_code4_description
company_sic_code6
company_sic_code6_description
company_sic_code8
company_sic_code8_description
company_parent_company
company_parent_company_location
company_public_private
company_subsidiary_company
company_residential_business_code
company_revenue_at_side_code
company_revenue_range
company_revenue
company_sales_volume
company_small_business
company_stock_ticker
company_year_founded
company_minorityowned
company_female_owned_or_operated
company_franchise_code
company_dma
company_dma_name
company_hq_address
company_hq_city
company_hq_duns
company_hq_state
company_hq_zip5
company_hq_zip4
co...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets contain six sheets, which are: (1) samples and their statistical descriptions, (2) two grading methods and their co-occurrence maps, (3)-(6) rating results under four output conditions.
This dataset consists of weekly trajectory information for Gulf Stream Warm Core Rings (WCRs) that existed between 2021 and 2023. This work builds upon two previous datasets: (i) Warm Core Ring trajectory information from 2000 to 2010 -- Porter et al. (2022) (https://doi.org/10.5281/zenodo.7406675); (ii) Warm Core Ring trajectory information from 2011 to 2020 -- Silver et al. (2022a) (https://doi.org/10.5281/zenodo.6436380). Combining these three datasets (the previous two and this one), a total of 24 years of weekly Warm Core Ring trajectories are now available. An example of how to use such a dataset can be found in Silver et al. (2022b). The format of the dataset is similar to that of Porter et al. (2022) and Silver et al. (2022a), and the following description is adapted from those datasets. This dataset comprises individual files containing each ring's weekly center location and its surface area for 81 WCRs that existed and were tracked between January 1, 2021 and December 31, 2023 (5 WCRs formed in 2020 and still existed in 2021; 28 formed in 2021; 30 formed in 2022; 18 formed in 2023). Each Warm Core Ring is identified by a unique alphanumeric code 'WEyyyymmddX', where 'WE' represents a Warm Eddy (as identified in the analysis charts); 'yyyymmdd' is the year, month and day of formation; and the last character 'X' represents the sequential sighting (formation) of the eddy in that particular year. Continuity of a ring that passes from one year to the next is maintained by keeping its character from the previous year, with the initial letters of the next year's sequence absorbed by the carried-over rings. For example, the first ring formed in 2022 has a trailing letter of 'H', which signifies that a total of seven rings were carried over from 2021, were still present on January 1, 2022, and were assigned the initial seven letters (A, B, C, D, E, F and G). Each ring has its own netCDF (.nc) file named after its alphanumeric code. Each file contains 4 variables for every week: "Lon" - the ring center's longitude, "Lat" - the ring center's latitude, "Area" - the ring's size in km^2, and "Date" in days - the number of days since Jan 01, 0000. Five rings formed in the year 2020 that carried over into the year 2021 are included in this dataset: 'WE20200724Q', 'WE20200826R', 'WE20200911S', 'WE20200930T', and 'WE20201111W'. The two rings that formed in 2023 and carried over into the following year are included with their full trajectories going into the year 2024: 'WE20231006U' and 'WE20231211W'. The process of creating the WCR tracking dataset follows the same methodology as the previously generated WCR census (Gangopadhyay et al., 2019, 2020). The Jenifer Clark Gulf Stream Charts (Gangopadhyay et al., 2019) used to create this dataset were produced 2-3 times a week from 2021-2023; thus, we used approximately 360+ charts for the 3 years of analysis. All of these charts were reanalyzed between -75° and -55°W using QGIS 2.18.16 (2016) and geo-referenced on a WGS84 coordinate system (Decker, 1986).
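A minimal sketch of reading one ring file is shown below, assuming Python with the netCDF4 package and that "Date" follows a MATLAB-style day count since Jan 01, 0000; the file name is one of the ring codes listed above, and the conversion to calendar dates is an assumption to be verified against the data.

from netCDF4 import Dataset
from datetime import datetime, timedelta

with Dataset("WE20231006U.nc") as nc:
    lon = nc.variables["Lon"][:]        # ring center longitude
    lat = nc.variables["Lat"][:]        # ring center latitude
    area = nc.variables["Area"][:]      # ring surface area in km^2
    date_num = nc.variables["Date"][:]  # days since year 0000-01-01

# MATLAB-style datenum conversion: day 1 corresponds to 0000-01-01.
dates = [datetime.fromordinal(int(d)) - timedelta(days=366) for d in date_num]
print(dates[0], float(lon[0]), float(lat[0]), float(area[0]))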
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global semantic knowledge graphing market size is USD 1512.2 million in 2024 and will expand at a compound annual growth rate (CAGR) of 14.80% from 2024 to 2031.
North America held the largest share of the market, around 40% of global revenue, with a market size of USD 604.88 million in 2024, and will grow at a compound annual growth rate (CAGR) of 13.0% from 2024 to 2031.
Europe accounted for a share of over 30% of the global revenue, with a market size of USD 453.66 million in 2024.
Asia Pacific held around 23% of the global revenue with a market size of USD 347.81 million in 2024 and will grow at a compound annual growth rate (CAGR) of 16.8% from 2024 to 2031.
Latin America held around 5% of the global revenue with a market size of USD 75.61 million in 2024 and will grow at a compound annual growth rate (CAGR) of 14.2% from 2024 to 2031.
Middle East and Africa held around 2% of the global revenue with a market size of USD 30.24 million in 2024 and will grow at a compound annual growth rate (CAGR) of 14.5% from 2024 to 2031.
Natural language processing knowledge graphing held the highest growth rate in the semantic knowledge graphing market in 2024.
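As a quick arithmetic check of the figures above, the sketch below compounds the stated global CAGR from the 2024 base through 2031; this only illustrates the growth-rate math under the assumption of annual compounding and is not a figure reported by the source.

base_2024_musd = 1512.2           # global market size in 2024, USD million
cagr = 0.148                      # stated CAGR of 14.80%
projection_2031 = base_2024_musd * (1 + cagr) ** (2031 - 2024)
print(round(projection_2031, 1))  # roughly 3,974 million USD under this assumption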
Market Dynamics of Semantic Knowledge Graphing Market
Key Drivers of Semantic Knowledge Graphing Market
Growing Volumes of Structured, Semi-structured, and Unstructured Data to Increase the Global Demand
The global demand for semantic knowledge graphing is escalating in response to the exponential growth of structured, semi-structured, and unstructured data. Enterprises are inundated with vast amounts of data from diverse sources such as social media, IoT devices, and enterprise applications. Structured data from databases, semi-structured data like XML and JSON, and unstructured data from documents, emails, and multimedia files present significant challenges in terms of organization, analysis, and deriving actionable insights. Semantic knowledge graphing addresses these challenges by providing a unified framework for representing, integrating, and analyzing disparate data types. By leveraging semantic technologies, businesses can unlock the value hidden within their data, enabling advanced analytics, natural language processing, and knowledge discovery. As organizations increasingly recognize the importance of harnessing data for strategic decision-making, the demand for semantic knowledge graphing solutions continues to surge globally.
Demand for Contextual Insights to Propel the Growth
The burgeoning demand for contextual insights is propelling the growth of semantic knowledge graphing solutions. In today's data-driven landscape, businesses are striving to extract deeper contextual meaning from their vast datasets to gain a competitive edge. Semantic knowledge graphing enables organizations to connect disparate data points, understand relationships, and derive valuable insights within the appropriate context. This contextual understanding is crucial for various applications such as personalized recommendations, predictive analytics, and targeted marketing campaigns. By leveraging semantic technologies, companies can not only enhance decision-making processes but also improve customer experiences and operational efficiency. As industries across sectors increasingly recognize the importance of contextual insights in driving innovation and business success, the adoption of semantic knowledge graphing solutions is poised to witness significant growth. This trend underscores the pivotal role of semantic technologies in unlocking the true potential of data for strategic advantage in today's dynamic marketplace.
Restraint Factors Of Semantic Knowledge Graphing Market
Stringent Data Privacy Regulations to Hinder the Market Growth
Stringent data privacy regulations present a significant hurdle to the growth of the Semantic Knowledge Graphing market. Regulations such as GDPR (General Data Protection Regulation) in Europe and CCPA (California Consumer Privacy Act) in the United States impose strict requirements on how organizations collect, store, process, and share personal data. Compliance with these regulations necessitates robust data protection measures, including anonymization, encryption, and access controls, which can complicate the implementation of semantic knowledge graphing systems. Moreover, concerns about data breach...
CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.
Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.
Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.
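A hedged sketch of how such node and edge data could be assembled into a weighted, undirected NetworkX graph is shown below; the in-memory table and column names are assumptions for illustration, not the dataset's documented file format.

import networkx as nx
import pandas as pd

# Hypothetical edge table: company indices, one of the 15 relation types, and
# the real-numbered weight approximating that type's similarity level.
edges = pd.DataFrame({
    "src": [0, 0, 1],
    "dst": [1, 2, 2],
    "edge_type": [3, 7, 3],
    "weight": [0.8, 0.1, 0.5],
})

G = nx.Graph()
for row in edges.itertuples(index=False):
    G.add_edge(row.src, row.dst, edge_type=row.edge_type, weight=row.weight)

print(G.number_of_nodes(), G.number_of_edges())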
Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:
Background and Motivation
In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.
While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.
In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies that are similar to the seed companies can then be searched for in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
However, a graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.
Source Code and Tutorial:
https://github.com/llcresearch/CompanyKG2
Paper: to be published
2.2 Full Mall Graph Clustering

Train

The sample training data for this problem is a set of 106981 fingerprints (task2_train_fingerprints.json) and some edges between them. We have provided files that indicate three different edge types, all of which should be treated differently.
task2_train_steps.csv indicates edges that connect subsequent steps within a trajectory. These edges should be highly trusted as they indicate a certainty that two fingerprints were recorded from the same floor.
task2_train_elevations.csv indicates the opposite of the steps. These elevations indicate that the fingerprints are almost definitely from a different floor. You can thus extrapolate that if fingerprint $N$ from trajectory $n$ is on a different floor to fingerprint $M$ from trajectory $m$, then all other fingerprints in both trajectories $m$ and $n$ must also be on separate floors.
task2_train_estimated_wifi_distances.csv are the pre-computed distances that we have calculated using our own distance metric. This metric is imperfect and as such we know that many of these edges will be incorrect (i.e. they will connect two floors together). We suggest that initially you use the edges in this file to construct your initial graph and compute some solution. However, if you get a high score on task1 then you might consider computing your own wifi distances to build a graph.
Your graph can be at one of two levels of detail, either trajectory level or fingerprint level; you can choose what representation you want to use, but ultimately we want to know the trajectory clusters. Trajectory level would have every node as a trajectory, and edges between nodes would occur if fingerprints in their trajectories had high similarity. Fingerprint level would have each fingerprint as a node. You can look up the trajectory id of a fingerprint using task2_train_lookup.json to convert between representations.
To help you debug and train your solution we have provided a ground truth for some of the trajectories in task2_train_GT.json. In this file the keys are the trajectory ids (the same as in task2_train_lookup.json) and the values are the real floor id of the building.
Test

The test set is in the exact same format as the training set (for a separate building; we weren't going to make it that easy ;) ), but we haven't included the equivalent ground truth file. This will be withheld to allow us to score your solution.
Points to consider
- When doing this on real data we do not know the exact number of floors to expect, so your model will need to decide this for itself as well. For this data, do not expect to find more than 20 floors or fewer than 3 floors.
- Sometimes in balcony areas the similarity between fingerprints on different floors can be deceivingly high. In these cases it may be wise to try to rely on the graph information rather than the individual similarity (e.g. what is the similarity of the other neighbour nodes to this candidate other-floor node?).
- To the best of our knowledge there are no outlier fingerprints in the data that do not belong to the building. Every fingerprint belongs to a floor.
2.3 Loading the data

In this section we provide some example code to open the files and construct both types of graph.
import os
import json
import csv

import networkx as nx
from tqdm import tqdm

path_to_data = "task2_for_participants/train"

with open(os.path.join(path_to_data, "task2_train_estimated_wifi_distances.csv")) as f:
    wifi = []
    reader = csv.DictReader(f)
    for line in tqdm(reader):
        wifi.append([line['id1'], line['id2'], float(line['estimated_distance'])])

with open(os.path.join(path_to_data, "task2_train_elevations.csv")) as f:
    elevs = []
    reader = csv.DictReader(f)
    for line in tqdm(reader):
        elevs.append([line['id1'], line['id2']])

with open(os.path.join(path_to_data, "task2_train_steps.csv")) as f:
    steps = []
    reader = csv.DictReader(f)
    for line in tqdm(reader):
        steps.append([line['id1'], line['id2'], float(line['displacement'])])

fp_lookup_path = os.path.join(path_to_data, "task2_train_lookup.json")
gt_path = os.path.join(path_to_data, "task2_train_GT.json")

with open(fp_lookup_path) as f:
    fp_lookup = json.load(f)

with open(gt_path) as f:
    gt = json.load(f)
Fingerprint graph

This is one way to construct the fingerprint-level graph, where each node in the graph is a fingerprint. We have added edge weights that correspond to the estimated/true distances from the wifi and pdr edges respectively. We have also added elevation edges to indicate this relationship. You might want to explicitly enforce that there are none of these edges (or any valid elevation edge between trajectories) when developing your solution.
G = nx.Graph()

# Step edges ("s"): subsequent fingerprints within the same trajectory.
for id1, id2, dist in tqdm(steps):
    G.add_edge(id1, id2, ty="s", weight=dist)

# Wifi edges ("w"): estimated distances from the provided metric.
for id1, id2, dist in tqdm(wifi):
    G.add_edge(id1, id2, ty="w", weight=dist)

# Elevation edges ("e"): fingerprints almost certainly on different floors.
for id1, id2 in tqdm(elevs):
    G.add_edge(id1, id2, ty="e")
Trajectory graph

The trajectory graph is arguably not as simple, as you need to think of a way to represent the many wifi connections between trajectories. In the example graph below we just take the mean distance as a weight, but is this really the best representation?
B = nx.Graph()

# Get all the trajectory ids from the lookup
valid_nodes = set(fp_lookup.values())
for node in valid_nodes:
    B.add_node(node)

# Either add an edge or append the distance to the edge data
for id1, id2, dist in tqdm(wifi):
    if not B.has_edge(fp_lookup[str(id1)], fp_lookup[str(id2)]):
        B.add_edge(fp_lookup[str(id1)],
                   fp_lookup[str(id2)],
                   ty="w", weight=[dist])
    else:
        B[fp_lookup[str(id1)]][fp_lookup[str(id2)]]['weight'].append(dist)

# Compute the mean edge weight
for edge in B.edges(data=True):
    B[edge[0]][edge[1]]['weight'] = sum(B[edge[0]][edge[1]]['weight']) / len(B[edge[0]][edge[1]]['weight'])

# If you have made a wifi connection between trajectories with an elev, delete the edge
for id1, id2 in tqdm(elevs):
    if B.has_edge(fp_lookup[str(id1)], fp_lookup[str(id2)]):
        B.remove_edge(fp_lookup[str(id1)],
                      fp_lookup[str(id2)])
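Below is one possible clustering baseline on the trajectory graph B constructed above, using modularity-based community detection from NetworkX. It is a sketch rather than a reference solution: the stored weights are mean distances (smaller means more similar), so they are inverted into similarity-style weights first, and the number of detected communities stands in for the number of floors.

from networkx.algorithms.community import greedy_modularity_communities

H = B.copy()
for u, v, data in H.edges(data=True):
    # Convert the mean wifi distance into a similarity-style weight.
    data["weight"] = 1.0 / (1.0 + data["weight"])

communities = greedy_modularity_communities(H, weight="weight")
clusters = {traj: floor for floor, nodes in enumerate(communities) for traj in nodes}
print(len(communities), "candidate floors")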
Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problems using GNNs. To address this issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of factory issues. By utilizing this dataset, researchers can employ GNNs to address numerous supply chain problems, thereby advancing the field of supply chain analytics and planning.
Dataset · GitHub · arXiv · PDF on arXiv
Read the paper to learn more details and data statistics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PheKnowLator (PKT) Human Disease KG (PKT-KG) was built to model mechanisms of human disease, which includes the Central Dogma and represents multiple biological scales of organization including molecular, cellular, tissue, and organ. The knowledge representation was designed in collaboration with a PhD-level molecular biologist (Figure).
The PKT Human Disease KG was constructed using 12 OBO Foundry ontologies, 31 Linked Open Data sets, and results from two large-scale experiments (Supplementary Material). The 12 OBO Foundry ontologies were selected to represent chemicals and vaccines (i.e., ChEBI and Vaccine Ontology), cells and cell lines (i.e., Cell Ontology, Cell Line Ontology), gene/gene product attributes (i.e., Gene Ontology), phenotypes and diseases (i.e., Human Phenotype Ontology, Mondo Disease Ontology), proteins, including complexes and isoforms (i.e., Protein Ontology), pathways (i.e., Pathway Ontology), types and attributes of biological sequences (i.e., Sequence Ontology), and anatomical entities (Uberon ontology). The RO is used to provide relationships between the core OBO Foundry ontologies and database entities.
The PKT Human Disease KG contained 18 node types and 33 edge types. Note that the number of node and edge types reflects those that are explicitly added to the core set of OBO Foundry ontologies and does not take into account the node and edge types provided by the ontologies. These node and edge types were used to construct 12 different PKT Human Disease benchmark KGs by altering the Knowledge Model (i.e., class- vs. instance-based), Relation Strategy (i.e., standard vs. inverse relations), and Semantic Abstraction (i.e., OWL-NETS (yes/no) with and without Knowledge Model harmonization [OWL-NETS Only vs. OWL-NETS + Harmonization]) parameters. Benchmarks within the PheKnowLator ecosystem are different versions of a KG that can be built under alternative knowledge models, relation strategies, and with or without semantic abstraction. They provide users with the ability to evaluate different modeling decisions (based on the prior mentioned parameters) and to examine the impact of these decisions on different downstream tasks.
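For reference, one reading of how these parameters combine into the 12 benchmark builds is sketched below (2 knowledge models x 2 relation strategies x 3 semantic abstraction settings); the option labels are paraphrased from the description above, not taken from the build scripts.

from itertools import product

knowledge_models = ["class-based", "instance-based"]
relation_strategies = ["standard relations", "inverse relations"]
semantic_abstraction = ["no OWL-NETS", "OWL-NETS only", "OWL-NETS + harmonization"]

builds = list(product(knowledge_models, relation_strategies, semantic_abstraction))
for km, rel, abstraction in builds:
    print(f"{km} | {rel} | {abstraction}")
print(len(builds), "benchmark KGs")  # 12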
The Figures and Tables explaining attributes in the builds can be found here.
The benchmarks were originally built and stored using Google Cloud Platform (GCP) resources. Details and a complete description of this process can be found on GitHub (here). Note that we have developed this Zenodo-based archive for the builds. While the original GCP resources contained all of the resources needed to generate the builds, due to the file size upload limits associated with each archive, we have limited the uploaded files to the KGs, associated metadata, and log files. The list of resources, including their URLs and dates of download, can all be found in the logs associated with each build.
🗂 For additional information on the KG file types please see the following Wiki page, which is also available as a download from this repository (PheKnowLator_HumanDiseaseKG_Output_FileInformation.xlsx).
Class-based Builds
Standard Relations
Inverse Relations
Instance-based Builds
Standard Relations
Inverse Relations