71 datasets found

Data from: S1 Datasets -
plos.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Jun 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhe Zhang; Yuhao Chen; Huixue Wang; Qiming Fu; Jianping Chen; You Lu (2023). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0286770.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0286770.s001
Dataset updated
Jun 8, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Zhe Zhang; Yuhao Chen; Huixue Wang; Qiming Fu; Jianping Chen; You Lu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A critical issue in intelligent building control is detecting energy consumption anomalies based on intelligent device status data. The building field is plagued by energy consumption anomalies caused by a number of factors, many of which are associated with one another in apparent temporal relationships. For the detection of abnormalities, most traditional detection methods rely solely on a single variable of energy consumption data and its time series changes. Therefore, they are unable to examine the correlation between the multiple characteristic factors that affect energy consumption anomalies and their relationship in time. The outcomes of anomaly detection are one-sided. To address the above problems, this paper proposes an anomaly detection method based on multivariate time series. Firstly, in order to extract the correlation between different feature variables affecting energy consumption, this paper introduces a graph convolutional network to build an anomaly detection framework. Secondly, as different feature variables have different influences on each other, the framework is enhanced by a graph attention mechanism so that time series features with higher influence on energy consumption are given more attention weights, resulting in better anomaly detection of building energy consumption. Finally, the effectiveness of this paper’s method and existing methods for detecting energy consumption anomalies in smart buildings are compared using standard data sets. The experimental results show that the model has better detection accuracy.
Goodreads Radial Bar Chart Values
kaggle.com
zip
Updated Jul 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Justinyouth (2023). Goodreads Radial Bar Chart Values [Dataset]. https://www.kaggle.com/datasets/justinyouth/goodreadsdataset/code
Explore at:
zip(207 bytes)Available download formats
Dataset updated
Jul 29, 2023
Authors
Justinyouth
Description
Datasets for beginners.

The radial chart dataset for Goodreads comprises two main variables: "Value" and "Angle." Each data point represents a specific observation with corresponding values for "Value" and "Angle."

The "Value" variable denotes the magnitude or quantity associated with each data point. In this dataset, the values range from 0 to 270, indicating diverse levels or measurements of a certain aspect related to Goodreads.

The "Angle" variable represents the angular position of each data point on the radial chart. The angular position is crucial for plotting the data points correctly on the circular layout of the chart.

This dataset aims to visualize and explore the distribution or relationships of the "Value" variable in a radial chart format. Radial charts, also known as circular or polar charts, are effective tools for displaying data in a circular layout, enabling quick comprehension of patterns, trends, or anomalies.

By leveraging the radial chart visualization, users can gain insights into the distribution of values, identify potential outliers, and observe any cyclical patterns or clusters within the dataset. Additionally, the radial chart provides an intuitive representation of the dataset, allowing stakeholders to grasp the overall structure and characteristics of the data at a glance.

The dataset's radial chart visualization can be particularly valuable for Goodreads, a popular platform for book enthusiasts, as it offers a unique perspective on various aspects of the platform's data. Further analysis and exploration of the dataset can help users make data-driven decisions, identify areas for improvement, and uncover hidden trends that may influence user engagement and experience on the platform.
S
CBCD:A Chinese Bar Chart Dataset for Data Extraction
scidb.cn
Updated Nov 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan (2025). CBCD:A Chinese Bar Chart Dataset for Data Extraction [Dataset]. http://doi.org/10.57760/sciencedb.j00240.00052
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.j00240.00052
Dataset updated
Nov 14, 2025
Dataset provided by
Science Data Bank
Authors
Ma Qiuping; Zhang Qi; Bi Hangshuo; Zhao Xiaofan
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Currently, in the field of chart datasets, most existing resources are mainly in English, and there are almost no open-source Chinese chart datasets, which brings certain limitations to research and applications related to Chinese charts. This dataset draws on the construction method of the DVQA dataset to create a chart dataset focused on the Chinese environment. To ensure the authenticity and practicality of the dataset, we first referred to the authoritative website of the National Bureau of Statistics and selected 24 widely used data label categories in practical applications, totaling 262 specific labels. These tag categories cover multiple important areas such as socio-economic, demographic, and industrial development. In addition, in order to further enhance the diversity and practicality of the dataset, this paper sets 10 different numerical dimensions. These numerical dimensions not only provide a rich range of values, but also include multiple types of values, which can simulate various data distributions and changes that may be encountered in real application scenarios. This dataset has carefully designed various types of Chinese bar charts to cover various situations that may be encountered in practical applications. Specifically, the dataset not only includes conventional vertical and horizontal bar charts, but also introduces more challenging stacked bar charts to test the performance of the method on charts of different complexities. In addition, to further increase the diversity and practicality of the dataset, the text sets diverse attribute labels for each chart type. These attribute labels include but are not limited to whether they have data labels, whether the text is rotated 45 °, 90 °, etc. The addition of these details makes the dataset more realistic for real-world application scenarios, while also placing higher demands on data extraction methods. In addition to the charts themselves, the dataset also provides corresponding data tables and title text for each chart, which is crucial for understanding the content of the chart and verifying the accuracy of the extracted results. This dataset selects Matplotlib, the most popular and widely used data visualization library in the Python programming language, to be responsible for generating chart images required for research. Matplotlib has become the preferred tool for data scientists and researchers in data visualization tasks due to its rich features, flexible configuration options, and excellent compatibility. By utilizing the Matplotlib library, every detail of the chart can be precisely controlled, from the drawing of data points to the annotation of coordinate axes, from the addition of legends to the setting of titles, ensuring that the generated chart images not only meet the research needs, but also have high readability and attractiveness visually. The dataset consists of 58712 pairs of Chinese bar charts and corresponding data tables, divided into training, validation, and testing sets in a 7:2:1 ratio.
2
MCS3
datacatalogue.ukdataservice.ac.uk
Updated May 17, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of London, Institute of Education, Centre for Longitudinal Studies (2024). MCS3 [Dataset]. http://doi.org/10.5255/UKDA-SN-8240-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-8240-1
Dataset updated
May 17, 2024
Dataset provided by
UK Data Servicehttps://ukdataservice.ac.uk/
Authors
University of London, Institute of Education, Centre for Longitudinal Studies
Time period covered
Jan 1, 2006
Description
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
to collect information on previously neglected topics, such as fathers' involvement in children's care and development
to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
to emphasise intergenerational links including those back to the parents' own childhood
to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
Additional objectives subsequently included for MCS were:
to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

Safeguarded versions of MCS studies:
The Safeguarded versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

Polygenic Indices
Polygenic indices are available under Special Licence SN 9437. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.

Sub-sample studies:
Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

Release of Sweeps 1 to 4 to Long Format (Summer 2020)
To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.

How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

Secure Access datasets:
Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard Safeguarded Licence or Special Licence (see 'Access data' tab above).

Secure Access versions of the MCS include:
detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714;
Banded Distances to English Grammar Schools for MCS5 held under SN 8394
linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
linked Hospital of Birth data held under SN 5724.
The linked education administrative datasets held under SNs 8481,7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application.

Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).
The Millennium Cohort Study: Sweep 3 Banded Distances to Current, First, Second, and Third Choice Schools study provides banded distances to the current, first, second, and third choice school of MCS cohort members at sweep 3 (2006). The cohort members would therefore be aged between four and six years old, and have entered the primary school education system.

Plotly Dashboard Healthcare

kaggle.com

zip

Updated Jan 4, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

A SURESH (2022). Plotly Dashboard Healthcare [Dataset]. https://www.kaggle.com/datasets/sureshmecad/plotly-dashboard-healthcare

Explore at:

zip(1741234 bytes)Available download formats

Dataset updated

Jan 4, 2022

Authors

A SURESH

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

Data Visualization

Content

a. Scatter plot

  i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between 2 variables(expression/copy_number/chronos) for 
    any pair of genes.

  ii. The user should be able to filter and color data points using metadata information available in the file “metadata.csv”.

  iii. The visualization could be interactive - It would be great if the user can hover over the data-points on the plot and get the relevant information (hint - 
    visit https://plotly.com/r/, https://plotly.com/python)

  iv. Here is a quick reference for you. The scatter plot is between chronos score for TTBK2 gene and expression for MORC2 gene with coloring defined by
    Gender/Sex column from the metadata file.

b. Boxplot/violin plot

  i. User should be able to select a gene and a variable (expression / chronos / copy_number) and generate a boxplot to display its distribution across 
   multiple categories as defined by user selected variable (a column from the metadata file)

 ii. Here is an example for your reference where violin plot for CHRONOS score for gene CCL22 is plotted and grouped by ‘Lineage’

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?

2
MCS2
datacatalogue.ukdataservice.ac.uk
Updated Jun 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of London, Institute of Education, Centre for Longitudinal Studies (2024). MCS2 [Dataset]. http://doi.org/10.5255/UKDA-SN-7759-2
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-7759-2
Dataset updated
Jun 10, 2024
Dataset provided by
UK Data Servicehttps://ukdataservice.ac.uk/
Authors
University of London, Institute of Education, Centre for Longitudinal Studies
Time period covered
Jan 1, 2003 - Jan 1, 2005
Area covered
United Kingdom
Description
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
to collect information on previously neglected topics, such as fathers' involvement in children's care and development
to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
to emphasise intergenerational links including those back to the parents' own childhood
to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
Additional objectives subsequently included for MCS were:
to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

Safeguarded versions of MCS studies:
The Safeguarded versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

Polygenic Indices
Polygenic indices are available under Special Licence SN 9437. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.

Sub-sample studies:
Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

Release of Sweeps 1 to 4 to Long Format (Summer 2020)
To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.

How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

Secure Access datasets:
Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard Safeguarded Licence or Special Licence (see 'Access data' tab above).

Secure Access versions of the MCS include:
detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714;
Banded Distances to English Grammar Schools for MCS5 held under SN 8394
linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
linked Hospital of Birth data held under SN 5724.
The linked education administrative datasets held under SNs 8481,7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application.

Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).
International Data Access Network (IDAN)
These data are now available to researchers based outside the UK. Selected UKDS SecureLab/controlled datasets from the Institute for Social and Economic Research (ISER) and the Centre for Longitudinal Studies (CLS) have been made available under the International Data Access Network (IDAN) scheme, via a Safe Room access point at one of the UKDS IDAN partners. Prospective users should read the UKDS SecureLab application guide for non-ONS data for researchers outside of the UK via Safe Room Remote Desktop Access. Further details about the IDAN scheme can be found on the UKDS International Data Access Network webpage and on the IDAN website.

Latest edition information:
For the second edition (February 2017), the data file has been updated to correct discrepancies found in the first edition.
S1 Data -
figshare.com
zip
Updated Mar 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chubing Guo; Jian Wang; Yongping Zhang; Haozhe Zhang; Haochun Yang (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0291460.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0291460.s001
Dataset updated
Mar 7, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Chubing Guo; Jian Wang; Yongping Zhang; Haozhe Zhang; Haochun Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In air pollution studies, the correlation analysis of environmental variables has usually been challenged by parametric diversity. Such variable variations are not only from the extrinsic meteorological conditions and industrial activities but also from the interactive influences between the multiple parameters. A promising solution has been motivated by the recent development of visibility graph (VG) on multi-variable data analysis, especially for the characterization of pollutants’ correlation in the temporal domain, the multiple visibility graph (MVG) for nonlinear multivariate time series analysis has been verified effectively in different realistic scenarios. To comprehensively study the correlation between pollutant data and season, in this work, we propose a multi-layer complex network with a community division strategy based on the joint analysis of the atmospheric pollutants. Compared to the single-layer-based complex networks, our proposed method can integrate multiple different atmospheric pollutants for analysis, and combine them with multivariate time series data to obtain higher temporary community division for ground air pollutants interpretation. Substantial experiments have shown that this method effectively utilizes air pollution data from multiple representative indicators. By mining community information in the data, it successfully achieves reasonable and strong interpretive analysis of air pollution data.

causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in...

zenodo.org

zip

Updated Sep 5, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Carl Willy Mehling; Carl Willy Mehling; Sven Pieper; Sven Pieper; Tobias Lüke; Tobias Lüke (2025). causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in Machinery [Dataset]. http://doi.org/10.5281/zenodo.15876410

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.15876410

Dataset updated

Sep 5, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Carl Willy Mehling; Carl Willy Mehling; Sven Pieper; Sven Pieper; Tobias Lüke; Tobias Lüke

License

http://www.apache.org/licenses/LICENSE-2.0http://www.apache.org/licenses/LICENSE-2.0

Description

causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in Machinery

causRCA is a collection of time series datasets recorded from the CNC control of an industrial vertical lathe.

The datasets comprise real-world recordings from normal factory operation and labeled fault data from a hardware-in-the-loop simulation. The fault datasets come with labels for the underlying (simulated) cause of the failure, a labeled diagnosis, and a causal model of all variables in the datasets.

The extensive metadata and provided ground truth causal structure enable benchmarking of methods in causal discovery, root cause analysis, anomaly detection, and fault diagnosis in general.

Use Cases & Applications

Causal Discovery: Benchmark learned causal graphs against an expert-derived causal graph.
Supervised Root Cause Analysis: Train and test models on labeled diagnosis for different fault scenarios.
Unsupervised Root Cause Analysis: Identify manipulated variables in different fault scenarios with known ground truth.

Data & File Overview

data/
 ┣ real_op/
 ┣ dig_twin/
 ┃ ┣ exp_coolant/
 ┃ ┣ exp_hydraulics/
 ┃ ┗ exp_probe/
 ┣ expert_graph/
 ┗ README_DATASET.md

The data folder contains:

real_op/: CSV files with time series data from normal operation.
dig_twin/: Data from the digital twin experiments. Each group (coolant,hydraulics,probe) contains a causal subgraph as ground truth, different fault scenarios and multiple runs per scenario:
- exp_coolant/: Coolant system faults
- exp_hydraulics/: Hydraulic system faults
- exp_probe/: Probe system faults
expert_graph/: GML and interactive HTML file with the expert-derived causal graph and lists of nodes and edges.
README_DATASET.md: Dataset description

Datasets summary

(Sub-)graph	#Nodes	#Edges	#Datasets normal	#Datasets Fault	#Fault Scenarios	#Different Diagnoses	#Causing Variables
Lathe (Full graph)	92	104	170	100	19	10	14
--Probe	11	15	170	34	6	3	2
--Hydraulics	17	18	170	41	9	5	6
--Coolant	15	10	170	25	4	2	6
--(Other Vars)	49	61	170	-	-	-	-

*datasets from normal operation contain all machine variables and therefore all subgraphs and their respective variables within it.

Methodological Information

Real Operation Data (`real_op`)

Data were recorded through an OPC UA interface during normal production cycles on a vertical lathe. These files capture baseline machine behavior under standard operating conditions, without induced or known faults.

Digital Twin Data (`dig_twin`)

A hardware-in-the-loop digital twin was developed by connecting the original machine controller to a real-time simulation. Faults (e.g., valve leaks, filter clogs) were injected by manipulating specific twin variables, providing known ground-truth causes. Data were recorded via the same OPC UA interface to ensure consistent structure.

Known limitations

Data was sampled via an OPC UA interface. The timestamps only reflect the published time of value change by the CNC and do not necessarily reflect the exact time of value changes.

Consequently, the chronological order of changes across different variables is not strictly guaranteed. This may impact time-series analyses that are highly sensitive to precise temporal ordering.

Methods for Processing

see the causRCA GitHub Repository

Acknowledgements

The authors gratefully acknowledge the contributions of:

KAMAX Holding GmbH & Co. KG for providing real production data from the vertical lathe.
Schuster Maschinenbau GmbH for supporting the digital twin development with knowledge and the PLC project.
ISG Industrielle Steuerungstechnik GmbH for developing the digital twin implementation.
SEITEC GmbH for hosting the hardware-in-the-loop setup and developing the OPC UA data recording solution.

Declaration of GenAI and AI-assisted Technologies

During the preparation of the dataset, the author(s) used generative AI tools to enhance the dataset's applicability by structuring data in an accessible format with extensive metadata, assist in coding transformations, and draft description content. All AI-generated output was reviewed and edited under human oversight, and no original dataset content was created by AI.

z
Classification of web-based Digital Humanities projects leveraging...
zenodo.org
data-staging.niaid.nih.gov
csv, tsv
Updated Nov 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tommaso Battisti; Tommaso Battisti (2025). Classification of web-based Digital Humanities projects leveraging information visualisation techniques [Dataset]. http://doi.org/10.5281/zenodo.14192758
Explore at:
tsv, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14192758
Dataset updated
Nov 10, 2025
Dataset provided by
Zenodo
Authors
Tommaso Battisti; Tommaso Battisti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description

This dataset contains a list of 186 Digital Humanities projects leveraging information visualisation techniques. Each project has been classified according to visualisation and interaction methods, narrativity and narrative solutions, domain, methods for the representation of uncertainty and interpretation, and the employment of critical and custom approaches to visually represent humanities data.

Classification schema: categories and columns

The project_id column contains unique internal identifiers assigned to each project. Meanwhile, the last_access column records the most recent date (in DD/MM/YYYY format) on which each project was reviewed based on the web address specified in the url column.
The remaining columns can be grouped into descriptive categories aimed at characterising projects according to different aspects:

Narrativity. It reports the presence of information visualisation techniques employed within narrative structures. Here, the term narrative encompasses both author-driven linear data stories and more user-directed experiences where the narrative sequence is determined by user exploration [1]. We define 2 columns to identify projects using visualisation techniques in narrative, or non-narrative sections. Both conditions can be true for projects employing visualisations in both contexts. Columns:

non_narrative (boolean)

narrative (boolean)

Domain. The humanities domain to which the project is related. We rely on [2] and the chapters of the first part of [3] to abstract a set of general domains. Column:

domain (categorical):

History and archaeology

Art and art history

Language and literature

Music and musicology

Multimedia and performing arts

Philosophy and religion

Other: both extra-list domains and cases of collections without a unique or specific thematic focus.

Visualisation of uncertainty and interpretation. Buiding upon the frameworks proposed by [4] and [5], a set of categories was identified, highlighting a distinction between precise and impressional communication of uncertainty. Precise methods explicitly represent quantifiable uncertainty such as missing, unknown, or uncertain data, precisely locating and categorising it using visual variables and positioning. Two sub-categories are interactive distinction, when uncertain data is not visually distinguishable from the rest of the data but can be dynamically isolated or included/excluded categorically through interaction techniques (usually filters); and visual distinction, when uncertainty visually “emerges” from the representation by means of dedicated glyphs and spatial or visual cues and variables. On the other hand, impressional methods communicate the constructed and situated nature of data [6], exposing the interpretative layer of the visualisation and indicating more abstract and unquantifiable uncertainty using graphical aids or interpretative metrics. Two sub-categories are: ambiguation, when the use of graphical expedients—like permeable glyph boundaries or broken lines—visually convey the ambiguity of a phenomenon; and interpretative metrics, when expressive, non-scientific, or non-punctual metrics are used to build a visualisation. Column:

uncertainty_interpretation (categorical):

Interactive distinction

Visual distinction

Ambiguation

Interpretative metrics

Critical adaptation. We identify projects in which, with regards to at least a visualisation, the following criteria are fulfilled: 1) avoid repurposing of prepackaged, generic-use, or ready-made solutions; 2) being tailored and unique to reflect the peculiarities of the phenomena at hand; 3) avoid simplifications to embrace and depict complexity, promoting time-consuming visualisation-based inquiry. Column:

critical_adaptation (boolean)

Non-temporal visualisation techniques. We adopt and partially adapt the terminology and definitions from [7]. A column is defined for each type of visualisation and accounts for its presence within a project, also including stacked layouts and more complex variations. Columns and inclusion criteria:

plot (boolean): visual representations that map data points onto a two-dimensional coordinate system.

cluster_or_set (boolean): sets or cluster-based visualisations used to unveil possible inter-object similarities.

map (boolean): geographical maps used to show spatial insights. While we do not specify the variants of maps (e.g., pin maps, dot density maps, flow maps, etc.), we make an exception for maps where each data point is represented by another visualisation (e.g., a map where each data point is a pie chart) by accounting for the presence of both in their respective columns.

network (boolean): visual representations highlighting relational aspects through nodes connected by links or edges.

hierarchical_diagram (boolean): tree-like structures such as tree diagrams, radial trees, but also dendrograms. They differ from networks for their strictly hierarchical structure and absence of closed connection loops.

treemap (boolean): still hierarchical, but highlighting quantities expressed by means of area size. It also includes circle packing variants.

word_cloud (boolean): clouds of words, where each instance’s size is proportional to its frequency in a related context

bars (boolean): includes bar charts, histograms, and variants. It coincides with “bar charts” in [7] but with a more generic term to refer to all bar-based visualisations.

line_chart (boolean): the display of information as sequential data points connected by straight-line segments.

area_chart (boolean): similar to a line chart but with a filled area below the segments. It also includes density plots.

pie_chart (boolean): circular graphs divided into slices which can also use multi-level solutions.

plot_3d (boolean): plots that use a third dimension to encode an additional variable.

proportional_area (boolean): representations used to compare values through area size. Typically, using circle- or square-like shapes.

other (boolean): it includes all other types of non-temporal visualisations that do not fall into the aforementioned categories.

Temporal visualisations and encodings. In addition to non-temporal visualisations, a group of techniques to encode temporality is considered in order to enable comparisons with [7]. Columns:

timeline (boolean): the display of a list of data points or spans in chronological order. They include timelines working either with a scale or simply displaying events in sequence. As in [7], we also include structured solutions resembling Gantt chart layouts.

temporal_dimension (boolean): to report when time is mapped to any dimension of a visualisation, with the exclusion of timelines. We use the term “dimension” and not “axis” as in [7] as more appropriate for radial layouts or more complex representational choices.

animation (boolean): temporality is perceived through an animation changing the visualisation according to time flow.

visual_variable (boolean): another visual encoding strategy is used to represent any temporality-related variable (e.g., colour).

Interactions. A set of categories to assess affordable interactions based on the concept of user intent [8] and user-allowed perceptualisation data actions [9]. The following categories roughly match the manipulative subset of methods of the “how” an interaction is performed in the conception of [10]. Only interactions that affect the aspect of the visualisation or the visual representation of its data points, symbols, and glyphs are taken into consideration. Columns:

basic_selection (boolean): the demarcation of an element either for the duration of the interaction or more permanently until the occurrence of another selection.

advanced_selection (boolean): the demarcation involves both the selected element and connected elements within the visualisation or leads to brush and link effects across views. Basic selection is tacitly implied.

navigation (boolean): interactions that allow moving, zooming, panning, rotating, and scrolling the view but only when applied to the visualisation and not to the web page. It also includes “drill” interactions (to navigate through different levels or portions of data detail, often generating a new view that replaces or accompanies the original) and “expand” interactions generating new perspectives on data by expanding and collapsing nodes.

arrangement (boolean): the organisation of visualisation elements (symbols, glyphs, etc.) or multi-visualisation layouts spatially through drag and drop or
f
UC_vs_US Statistic Analysis.xlsx
figshare.com
xlsx
Updated Jul 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.23644/uu.12631628.v1
Dataset updated
Jul 9, 2020
Dataset provided by
Utrecht University
Authors
F. (Fabiano) Dalpiaz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the used measures described in the paper. For each subject, it includes multiple columns: A. a sequential student ID B an ID that defines a random group label and the notation C. the used notation: user Story or use Cases D. the case they were assigned to: IFA, Sim, or Hos E. the subject's exam grade (total points out of 100). Empty cells mean that the subject did not take the first exam F. a categorical representation of the grade L/M/H, where H is greater or equal to 80, M is between 65 included and 80 excluded, L otherwise G. the total number of classes in the student's conceptual model H. the total number of relationships in the student's conceptual model I. the total number of classes in the expert's conceptual model J. the total number of relationships in the expert's conceptual model K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below) P. the researchers' judgement on how well the derivation process explanation was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping ), or not present.

Tagging scheme: Aligned (AL) - A concept is represented as a class in both models, either

with the same name or using synonyms or clearly linkable names; Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than class, or (ii) using a generic term (e.g., user'' instead ofurban planner''); System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent legacy system or the system under design (portal, simulator) are legitimate; Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud; Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

All the calculations and information provided in the following sheets

originate from that raw data.

Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,

including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

Sheet 3 (Size-Ratio):

The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade) . The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.

Sheet 4 (Overall):

Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that is fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.

For sheet 4 as well as for the following four sheets, diverging stacked bar

charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated witch solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

Sheet 5 (By-Notation):

Model correctness and model completeness is compared by notation - UC, US.

Sheet 6 (By-Case):

Model correctness and model completeness is compared by case - SIM, HOS, IFA.

Sheet 7 (By-Process):

Model correctness and model completeness is compared by how well the derivation process is explained - well explained, partially explained, not present.

Sheet 8 (By-Grade):

Model correctness and model completeness is compared by the exam grades, converted to categorical values High, Low , and Medium.
F1-score under the model without different components.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhe Zhang; Yuhao Chen; Huixue Wang; Qiming Fu; Jianping Chen; You Lu (2023). F1-score under the model without different components. [Dataset]. http://doi.org/10.1371/journal.pone.0286770.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0286770.t004
Dataset updated
Jun 8, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Zhe Zhang; Yuhao Chen; Huixue Wang; Qiming Fu; Jianping Chen; You Lu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
F1-score under the model without different components.
T
United States Stock Market Index Data
tradingeconomics.com
ar.tradingeconomics.com
+12more
csv, excel, json, xml
Updated Dec 2, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TRADING ECONOMICS (2025). United States Stock Market Index Data [Dataset]. https://tradingeconomics.com/united-states/stock-market
Explore at:
excel, xml, json, csvAvailable download formats
Dataset updated
Dec 2, 2025
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 3, 1928 - Dec 2, 2025
Area covered
United States
Description
The main stock market index of United States, the US500, rose to 6818 points on December 2, 2025, gaining 0.08% from the previous session. Over the past month, the index has declined 0.50%, though it remains 12.70% higher than a year ago, according to trading on a contract for difference (CFD) that tracks this benchmark index from United States. United States Stock Market Index - values, historical data, forecasts and news - updated on December of 2025.
r
Data from: Cold pool driven convective initiation: using causal graph...
resodate.org
data.ub.uni-muenchen.de
+1more
Updated Jan 1, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirjam Hirt; George C. Craig; Sophia Schäfer; Julien Savre; Rieke Heinze (2020). Cold pool driven convective initiation: using causal graph analysis to determine what convection permitting models are missing [Dataset]. http://doi.org/10.5282/UBM/DATA.178
Explore at:
Unique identifier
https://doi.org/10.5282/UBM/DATA.178
Dataset updated
Jan 1, 2020
Dataset provided by
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Authors
Mirjam Hirt; George C. Craig; Sophia Schäfer; Julien Savre; Rieke Heinze
Description
The data in this folder comprises all data necessary to produce the Figures presented in our paper (Hirt et al, 2020, in review, Quarterly Journal of the Royal Meteorological Society). Corresponding Jupyter notebooks, which were used to analyse and plot the data, are available at https://github.com/HirtM/cold_pool_driven_convection_initiation. The datasets are netcdf files and should contain all relevant metadata. cp_aggregates2*: These datasets contain different variables of cold pool objects. For each variable, several different statistics are available, e.g. the average/median/some percentile over the area of each cold pool object. Note that the data does not contain tracked cold pools. Any sequence of cold pool indices is hence meaningless. Each cold pool index does not only have information about its cold pool, but also its edges (see mask dimension). P_ci_* These datasets contain information on convection initiation within cold pool areas, cold pool edge areas or no cold pool areas. No single cold pool objects are identified here. prec_* As P_ci_*, but for precipitation. synoptic_conditions_variables.nc This dataset contains domain averaged (total domain, not cold pool objects) timeseries of selected variables. The selected variables were chosen in order to describe the synoptic and diurnal conditions of the days of interest. This dataset is used for the causal regression analysis. All the data here is derived from the ICON-LEM simulation conducted within HDCP2: http://hdcp2.eu/index.php?id=5013 Heinze, R., Dipankar, A., Carbajal Henken, C., Moseley, C., Sourdeval, O., Trömel, S., Xie, X., Adamidis, P., Ament, F., Baars, H., Barthlott, C., Behrendt, A., Blahak, U., Bley, S., Brdar, S., Brueck, M., Crewell, S., Deneke, H., Di Girolamo, P., Evaristo, R., Fischer, J., Frank, C., Friederichs, P., Göcke, T., Gorges, K., Hande, L., Hanke, M., Hansen, A., Hege, H.-C., Hoose, C., Jahns, T., Kalthoff, N., Klocke, D., Kneifel, S., Knippertz, P., Kuhn, A., van Laar, T., Macke, A., Maurer, V., Mayer, B., Meyer, C. I., Muppa, S. K., Neggers, R. A. J., Orlandi, E., Pantillon, F., Pospichal, B., Röber, N., Scheck, L., Seifert, A., Seifert, P., Senf, F., Siligam, P., Simmer, C., Steinke, S., Stevens, B., Wapler, K., Weniger, M., Wulfmeyer, V., Zängl, G., Zhang, D. and Quaas, J. (2016): Large-eddy simulations over Germany using ICON: A comprehensive evaluation. Q.J.R. Meteorol. Soc., doi:10.1002/qj.2947 M.Hirt, 9 Jan 2020
m
Create SD Holdings Co Ltd - Inventory
macro-rankings.com
csv, excel
Updated Aug 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
macro-rankings (2025). Create SD Holdings Co Ltd - Inventory [Dataset]. https://www.macro-rankings.com/Markets/Stocks/3148-TSE/Balance-Sheet/Inventory
Explore at:
csv, excelAvailable download formats
Dataset updated
Aug 23, 2025
Dataset authored and provided by
macro-rankings
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
japan
Description
Inventory Time Series for Create SD Holdings Co Ltd. Create SD Holdings Co., Ltd., through its subsidiaries, engages in the drug store, dispensing pharmacy, nursing care, and related businesses in Japan. The company operates drugstores and pharmacies that sell pharmaceuticals, cosmetics, food products, and daily necessities, etc., as well as supermarkets. It also engages in the operation and management of nursing care homes for the elderly and functional training day service centers; and the provision of visiting medication counseling at nursing homes and homes. In addition, the company offers cleaning work for shops and administrative support for stores, etc. Create SD Holdings Co., Ltd. was founded in 1983 and is headquartered in Yokohama, Japan.
Z
Data from: A dataset to model Levantine landcover and land-use change...
data.niaid.nih.gov
data-staging.niaid.nih.gov
+1more
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kempf, Michael (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10396147
Explore at:
Dataset updated
Dec 16, 2023
Dataset provided by
University of Basel
Authors
Kempf, Michael
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Levant
Description
Overview

This dataset is the repository for the following paper submitted to Data in Brief:

Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplement information and is the related data paper to:

Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

Folder structure

The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:

“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

“yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

Code structure

1_MODIS_NDVI_hdf_file_extraction.R

This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.

2_MERGE_MODIS_tiles.R

In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").

3_CROP_MODIS_merged_tiles.R

Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS. The repository provides the already clipped and merged NDVI datasets.

4_TREND_analysis_NDVI.R

Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.

5_BUILT_UP_change_raster.R

Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.

6_POPULATION_numbers_plot.R

For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.

7_YIELD_plot.R

In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.

8_GLDAS_read_extract_trend

The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection). Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.

(9_workflow_diagramme) this simple code can be used to plot a workflow diagram and is detached from the actual analysis.

Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, and Funding acquisition: Michael
m
Beijing Compass Technology Develop - Stock Price Series
macro-rankings.com
csv, excel
Updated Dec 31, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
macro-rankings (2024). Beijing Compass Technology Develop - Stock Price Series [Dataset]. https://www.macro-rankings.com/markets/stocks/300803-she
Explore at:
csv, excelAvailable download formats
Dataset updated
Dec 31, 2024
Dataset authored and provided by
macro-rankings
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
china, Beijing
Description
Stock Price Time Series for Beijing Compass Technology Develop. Beijing Compass Technology Development Co., Ltd. develops and delivers securities analysis software and securities information solutions in China. It provides investors with financial data analysis and securities investment consulting services through securities tool software terminals as a carrier and the Internet as a tool. The company's securities tool software includes the Quanying series and the Caifuzhangmen series of products. The company engages in financial information service, securities, and advertising service business. It serves individual professional and small and medium-sized investors. Beijing Compass Technology Development Co., Ltd. was formerly known as Beijing Compass Securities Research Co., Ltd. and changed its name to Beijing Compass Technology Development Co., Ltd. in April 2001. Beijing Compass Technology Development Co., Ltd. was founded in 1997 and is based in Beijing, China.
f
Descriptive data of the values used to build graph (Fig 2).
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Medei, Emiliano; Vilar-Pereira, Glaucia; Feijo, Daniel F.; Silverio, Jaline Coutinho; Moreno-Loaiza, Oscar; Mata-Santos, Hilton Antônio; Carneiro, Vitor Coutinho; Paiva, Claudia N.; Bozza, Marcelo T.; Lannes-Vieira, Joseli; Vanzan, Daniel Figueiredo; Oliveira, Camila Victória Sousa; Ramos, Isalira P. (2024). Descriptive data of the values used to build graph (Fig 2). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001432678
Explore at:
Dataset updated
Nov 7, 2024
Authors
Medei, Emiliano; Vilar-Pereira, Glaucia; Feijo, Daniel F.; Silverio, Jaline Coutinho; Moreno-Loaiza, Oscar; Mata-Santos, Hilton Antônio; Carneiro, Vitor Coutinho; Paiva, Claudia N.; Bozza, Marcelo T.; Lannes-Vieira, Joseli; Vanzan, Daniel Figueiredo; Oliveira, Camila Victória Sousa; Ramos, Isalira P.
Description
Descriptive data of the values used to build graph (Fig 2).
h
Millennium Cohort Study: Linked Health Administrative Datasets (SAIL),...
harmonydata.ac.uk
Updated Oct 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University College London, Institute of Education, Centre for Longitudinal Studies (2025). Millennium Cohort Study: Linked Health Administrative Datasets (SAIL), Wales: Secure Access / MCS; SAIL [Dataset]. http://doi.org/10.5255/UKDA-SN-9310-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-9310-1
Dataset updated
Oct 21, 2025
Dataset provided by
SAIL Databank
University College London, Institute of Education, Centre for Longitudinal Studies
Description
Background: The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will requireto provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)to collect information on previously neglected topics, such as fathers' involvement in children's care and developmentto focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may beto emphasise intergenerational links including those back to the parents' own childhoodto investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when availableAdditional objectives subsequently included for MCS were:to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of EnglandFurther information about the MCS can be found on the Centre for Longitudinal Studies web pages.The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website. The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old. End User Licence versions of MCS studies:The End User Licence (EUL) versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

Sub-sample studies: Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).Release of Sweeps 1 to 4 to Long Format (Summer 2020)To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.

How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

Secure Access datasets: Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard End User Licence or Special Licence (see 'Access data' tab above).

Secure Access versions of the MCS include:detailed sensitive variables not available under EUL.
2
MCS4
datacatalogue.ukdataservice.ac.uk
Updated Dec 3, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
University of London, Institute of Education, Centre for Longitudinal Studies (2024). MCS4 [Dataset]. http://doi.org/10.5255/UKDA-SN-6411-9
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-6411-9
Dataset updated
Dec 3, 2024
Dataset provided by
UK Data Servicehttps://ukdataservice.ac.uk/
Authors
University of London, Institute of Education, Centre for Longitudinal Studies
Area covered
United Kingdom
Description
Background:
The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:
to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
to collect information on previously neglected topics, such as fathers' involvement in children's care and development
to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
to emphasise intergenerational links including those back to the parents' own childhood
to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
Additional objectives subsequently included for MCS were:
to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England
Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.
The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.
The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

Safeguarded versions of MCS studies:
The Safeguarded versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

Polygenic Indices
Polygenic indices are available under Special Licence SN 9437. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These polygenic scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.

Sub-sample studies:
Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

Release of Sweeps 1 to 4 to Long Format (Summer 2020)
To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.

How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

Secure Access datasets:
Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard Safeguarded Licence or Special Licence (see 'Access data' tab above).

Secure Access versions of the MCS include:
detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714;
Banded Distances to English Grammar Schools for MCS5 held under SN 8394
linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
linked Hospital of Birth data held under SN 5724.
The linked education administrative datasets held under SNs 8481,7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application.

Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).

MCS4:
The objectives of MCS4 were the same as MCS3, namely:

to continue tracking the child's physical, cognitive and behavioural development
to chart continuity and change in the child's family circumstances and physical environment up to age seven
to record the child's transition to primary school and their experience of the first years at school
to track the child's previous experience of early education and day-care, along with current out-of-school care arrangements
record the progress of the child's older siblings
to re-contact families who had participated in at least one of the earlier surveys but who may not have participated in all sweeps
to ask the children directly about their thoughts and experiences at age seven

This study now includes the data and documentation from the Teacher Survey completed at Sweep 4 which were previously available under SN 6848.

Latest edition information
For the ninth edition (October 2022), a new data file mcs4_family_interview has been added due to the family level data being split out from the parent-level data to make future merging with MCS8 onwards easier. Two data files (mcs4_parent_interview and mcs4_parent_cm_interview) have been updated to include variables that were missed from the previous edition (mainly from the income and employment module) due to a technical error. There have also been edits to some variable labels that had been found to
Z
Data from: KGCW 2023 Challenge @ ESWC 2023
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated May 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689309
Explore at:
Dataset updated
May 17, 2023
Dataset provided by
Universidad Politécnica de Madrid
KU Leuven
STI Insbruck
IDLab - Ghent University - imec
Authors
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics e.g. CPU or memory usage are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have more insights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from the public transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps for each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. Query timeout is set to 1 hour and knowledge graph construction timeout to 24 hours. The execution is performed with the following tool , you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with the following files:

Input dataset as CSV.

Mapping file as RML.

Queries as SPARQL.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

SPARQL queries to retrieve the results for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples 0 percent 2000000 triples 25 percent 1500020 triples 50 percent 1000020 triples 75 percent 500020 triples 100 percent 20 triples

Empty values

Scale Number of Triples 0 percent 2000000 triples 25 percent 1500000 triples 50 percent 1000000 triples 75 percent 500000 triples 100 percent 0 triples

Mappings

Scale Number of Triples 1TM + 15POM 1500000 triples 3TM + 5POM 1500000 triples 5TM + 3POM 1500000 triples 15TM + 1POM 1500000 triples

Properties

Scale Number of Triples 1M rows 1 column 1000000 triples 1M rows 10 columns 10000000 triples 1M rows 20 columns 20000000 triples 1M rows 30 columns 30000000 triples

Records

Scale Number of Triples 10K rows 20 columns 200000 triples 100K rows 20 columns 2000000 triples 1M rows 20 columns 20000000 triples 10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples 0 percent 0 triples 25 percent 125000 triples 50 percent 250000 triples 75 percent 375000 triples 100 percent 500000 triples

1-N joins

Scale Number of Triples 1-10 0 percent 0 triples 1-10 25 percent 125000 triples 1-10 50 percent 250000 triples 1-10 75 percent 375000 triples 1-10 100 percent 500000 triples 1-5 50 percent 250000 triples 1-10 50 percent 250000 triples 1-15 50 percent 250005 triples 1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples 10-1 0 percent 0 triples 10-1 25 percent 125000 triples 10-1 50 percent 250000 triples 10-1 75 percent 375000 triples 10-1 100 percent 500000 triples 5-1 50 percent 250000 triples 10-1 50 percent 250000 triples 15-1 50 percent 250005 triples 20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples 5-5 50 percent 1374085 triples 10-5 50 percent 1375185 triples 5-10 50 percent 1375290 triples 5-5 25 percent 718785 triples 5-5 50 percent 1374085 triples 5-5 75 percent 1968100 triples 5-5 100 percent 2500000 triples 5-10 25 percent 719310 triples 5-10 50 percent 1375290 triples 5-10 75 percent 1967660 triples 5-10 100 percent 2500000 triples 10-5 25 percent 719370 triples 10-5 50 percent 1375185 triples 10-5 75 percent 1968235 triples 10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples 1 395953 triples 10 3959530 triples 100 39595300 triples 1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000 Q1 58540 results 585400 results No results available No results available Q2 636 results 11998 results 125565 results 1261368 results Q3 421 results 4207 results 42067 results 420667 results Q4 13 results 130 results 1300 results 13000 results Q5 35 results 350 results 3500 results 35000

Facebook

Twitter

Click to copy link

Link copied

Cite

Zhe Zhang; Yuhao Chen; Huixue Wang; Qiming Fu; Jianping Chen; You Lu (2023). S1 Datasets - [Dataset]. http://doi.org/10.1371/journal.pone.0286770.s001

Data from: S1 Datasets -

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.1371/journal.pone.0286770.s001

Dataset updated

Jun 8, 2023

Dataset provided by

PLOShttp://plos.org/

Authors

Zhe Zhang; Yuhao Chen; Huixue Wang; Qiming Fu; Jianping Chen; You Lu

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A critical issue in intelligent building control is detecting energy consumption anomalies based on intelligent device status data. The building field is plagued by energy consumption anomalies caused by a number of factors, many of which are associated with one another in apparent temporal relationships. For the detection of abnormalities, most traditional detection methods rely solely on a single variable of energy consumption data and its time series changes. Therefore, they are unable to examine the correlation between the multiple characteristic factors that affect energy consumption anomalies and their relationship in time. The outcomes of anomaly detection are one-sided. To address the above problems, this paper proposes an anomaly detection method based on multivariate time series. Firstly, in order to extract the correlation between different feature variables affecting energy consumption, this paper introduces a graph convolutional network to build an anomaly detection framework. Secondly, as different feature variables have different influences on each other, the framework is enhanced by a graph attention mechanism so that time series features with higher influence on energy consumption are given more attention weights, resulting in better anomaly detection of building energy consumption. Finally, the effectiveness of this paper’s method and existing methods for detecting energy consumption anomalies in smart buildings are compared using standard data sets. The experimental results show that the model has better detection accuracy.

Clear search

Close search

Google apps

Main menu

Data from: S1 Datasets -

Goodreads Radial Bar Chart Values

CBCD:A Chinese Bar Chart Dataset for Data Extraction

MCS3

Plotly Dashboard Healthcare

Context

Content

Acknowledgements

Inspiration

MCS2

S1 Data -

causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in...

causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in Machinery

Use Cases & Applications

Data & File Overview

Datasets summary

Methodological Information

Real Operation Data (real_op)

Digital Twin Data (dig_twin)

Known limitations

Methods for Processing

Acknowledgements

Declaration of GenAI and AI-assisted Technologies

Classification of web-based Digital Humanities projects leveraging...

Description

Classification schema: categories and columns

UC_vs_US Statistic Analysis.xlsx

F1-score under the model without different components.

United States Stock Market Index Data

Data from: Cold pool driven convective initiation: using causal graph...

Create SD Holdings Co Ltd - Inventory

Data from: A dataset to model Levantine landcover and land-use change...

Beijing Compass Technology Develop - Stock Price Series

Descriptive data of the values used to build graph (Fig 2).

Millennium Cohort Study: Linked Health Administrative Datasets (SAIL),...

MCS4

Data from: KGCW 2023 Challenge @ ESWC 2023

Data from: S1 Datasets -See More Versions

Real Operation Data (`real_op`)

Digital Twin Data (`dig_twin`)

Data from: S1 Datasets -