27 datasets found
  1. Synthetic Data for graphdb-benchmark

    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Sotiris Beis; Symeon Papadopoulos; Yannis Kompatsiaris (2023). Synthetic Data for graphdb-benchmark [Dataset]. http://doi.org/10.6084/m9.figshare.1221760.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sotiris Beis; Symeon Papadopoulos; Yannis Kompatsiaris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data we used to evaluate the Louvain method in the study Benchmarking Graph Databases on the Problem of Community Detection. These data were synthetically generated using the LFR-Benchmark (3rd link). There are two types of files, networkX.dat and communityX.dat. The networkX.dat file contains the list of edges (nodes are labelled from 1 to the number of nodes; the edges are ordered and repeated twice, i.e. source-target and target-source). The first four lines of the networkX.dat file list the parameters we used to generate the data. The communityX.dat file contains a list of the nodes and their memberships (memberships are labelled by integer numbers >= 1). Note: X corresponds to the number of nodes each dataset contains.
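
    Given that layout, loading one of these datasets takes only a few lines of Python. A minimal sketch, assuming whitespace-separated columns as described above (the file names for X = 1000 are illustrative; nx.Graph() deduplicates the twice-listed edges):

    import networkx as nx

    def load_lfr(network_path: str, community_path: str, header_lines: int = 4):
        """Load an LFR network/community file pair as described above."""
        G = nx.Graph()  # undirected; each edge is listed in both directions
        with open(network_path) as f:
            for line in f.readlines()[header_lines:]:  # skip the 4 parameter lines
                u, v = line.split()[:2]
                G.add_edge(int(u), int(v))
        membership = {}
        with open(community_path) as f:
            for line in f:
                node, community = line.split()[:2]
                membership[int(node)] = int(community)
        return G, membership

    G, communities = load_lfr("network1000.dat", "community1000.dat")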

  2. GraphXAI

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Queen, Owen (2023). GraphXAI [Dataset]. http://doi.org/10.7910/DVN/KULOS8
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Queen, Owen
    Description

    As post hoc explanations are increasingly used to understand the behavior of Graph Neural Networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and additional XAI-ready real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition, GraphXAI provides a broader ecosystem of data loaders, data processing functions, synthetic and real-world graph datasets with ground-truth explanations, visualizers, GNN model implementations, and a set of evaluation metrics to benchmark the performance of any given GNN explainer.
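
    As a rough illustration, generating one benchmark graph might look like the following hypothetical sketch; the import path, class name, and every constructor argument here are assumptions to be checked against the GraphXAI documentation:

    # Hypothetical usage sketch: all names below are assumptions, not the
    # confirmed GraphXAI API; consult the library documentation.
    from graphxai.datasets import ShapeGGen  # assumed entry point

    dataset = ShapeGGen(
        model_layers=3,        # assumed: GNN depth the ground truth targets
        num_subgraphs=100,     # assumed: number of motif-bearing subgraphs
        prob_connection=0.06,  # assumed: inter-subgraph connection probability
        subgraph_size=12,      # assumed: nodes per subgraph
    )
    data = dataset.get_graph()       # assumed accessor for the generated graph
    gt_exp = dataset.explanations    # assumed ground-truth explanations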

  3. Datasets of synthetic task flow graphs for evaluating a latency/energy optimization task allocation framework

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Datasets of synthetic task flow graphs for evaluating a latency/energy optimization task allocation framework [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10654551?locale=da
    Explore at:
    Available download formats: unknown (1,386,672 bytes)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets of synthetic task flow graphs were generated to evaluate the performance and scalability of an optimal task allocation approach for applications of various structures and sizes in an environment following the edge/hub/cloud paradigm. The system under study comprised an edge device (e.g., a single-board computer attached to an unmanned aerial vehicle (UAV)) interacting with a hub device (e.g., a laptop), which in turn communicated with a more computationally capable cloud server. The objective was the minimization of either overall latency or overall energy consumption, under memory, storage, energy, and task precedence constraints. We considered that a percentage of the tasks required fixed allocation on the edge or hub device. We generated 18 task flow graphs of parallel, serial, and mixed (a combination of parallel and serial) structure with 10, 100, and 1000 nodes, and various in/out degrees, utilizing the Task Graphs For Free (TGFF) random task graph generator [1], [2]. Additional task parameters (e.g., execution time, power consumption, memory, storage, output data size) were included post-generation, using representative random values. More details are provided in README.txt and in [3].

    Note: These datasets are released under a Creative Commons Attribution license. If you utilize these datasets in your work, please cite us using the corresponding Zenodo DOI: https://doi.org/10.5281/zenodo.10654551.

    References:

    [1] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE), 1998, pp. 97-101, doi: 10.1109/HSC.1998.666245.

    [2] R. P. Dick, D. L. Rhodes, and K. Vallerio, "TGFF," https://robertdick.org/projects/tgff/.

    [3] A. Kouloumpris, G. L. Stavrinides, M. K. Michael, and T. Theocharides, "An optimization framework for task allocation in the edge/hub/cloud paradigm," Future Generation Computer Systems, vol. 155, pp. 354-366, Jun. 2024, doi: 10.1016/j.future.2024.02.005.
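
    The post-generation annotation step described above (a task DAG decorated with per-task and per-edge attributes) can be sketched in Python; the attribute names and value ranges below are illustrative assumptions, not the values used in these datasets:

    import random
    import networkx as nx

    def annotate_task_graph(G: nx.DiGraph, seed: int = 0) -> nx.DiGraph:
        """Attach representative random parameters to a TGFF-style task DAG."""
        rng = random.Random(seed)
        for task in G.nodes:
            G.nodes[task].update(
                exec_time_ms=rng.uniform(1.0, 50.0),
                power_mw=rng.uniform(100.0, 2000.0),
                memory_kb=rng.randint(64, 4096),
                storage_kb=rng.randint(64, 8192),
            )
        for u, v in G.edges:
            G.edges[u, v]["output_kb"] = rng.randint(1, 512)  # data sent u -> v
        return G

    # A small serial chain standing in for a TGFF-generated graph.
    chain = annotate_task_graph(nx.path_graph(10, create_using=nx.DiGraph))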

  4. Variant data used during semi-synthetic data generation.

    • plos.figshare.com
    xls
    Updated Sep 5, 2025
    Cite
    Eric V. Strobl; Eric R. Gamazon (2025). Variant data used during semi-synthetic data generation. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013461.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Eric V. Strobl; Eric R. Gamazon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Variant data used during semi-synthetic data generation.

  5. Datasets of synthetic task graphs for evaluating a reliability and latency multi-objective task allocation framework

    • data.europa.eu
    • data-staging.niaid.nih.gov
    • +1more
    unknown
    Cite
    Zenodo, Datasets of synthetic task graphs for evaluating a reliability and latency multi-objective task allocation framework [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10357101?locale=hr
    Explore at:
    Available download formats: unknown (642,181 bytes)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets of synthetic task graphs were generated to evaluate the performance and scalability of a multi-objective task allocation approach for workflow applications of various structures and sizes in a system based on the edge-hub-cloud paradigm. The targeted architecture comprised an edge device (e.g., a single-board computer attached to an unmanned aerial vehicle (UAV)) interacting with a hub device (e.g., a laptop), which in turn communicated with a more computationally capable cloud server. The objectives were the maximization of the overall reliability and the minimization of the overall latency of the application, under memory, storage, energy, and task precedence constraints. We considered that a percentage of the tasks required fixed allocation on the edge or hub device. Each task had a different vulnerability factor (i.e., probability of failure) on each device. We generated nine task graphs of serial, parallel, and mixed (a combination of serial and parallel) structure with 10, 100, and 1000 nodes, utilizing the Task Graphs For Free (TGFF) random task graph generator [1]. Additional task parameters (e.g., execution time, power consumption, vulnerability factor, memory, storage, output data size) were included post-generation, using representative random values. More details are provided in README.txt.

    Note: These datasets are released under a Creative Commons Attribution license. If you utilize these datasets in your work, please cite us using the corresponding Zenodo DOI: https://doi.org/10.5281/zenodo.10357101.

    References:

    [1] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE'98), Seattle, WA, USA, 1998, pp. 97-101, doi: 10.1109/HSC.1998.666245.

  6. Data from: Synthetic Multimodal Dataset for Daily Life Activities

    • data.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Ugai, Takanori; Egami, Shusaku; Swe Nwe Nwe Htun; Kozaki, Kouji; Kawamura, Takahiro; Fukuda, Ken (2024). Synthetic Multimodal Dataset for Daily Life Activities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8046266
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    National Institute of Advanced Industrial Science and Technology
    Fujitsu
    Osaka Electro-Communication University
    National Agriculture and Food Research Organization
    Authors
    Ugai, Takanori; Egami, Shusaku; Swe Nwe Nwe Htun; Kozaki, Kouji; Kawamura, Takahiro; Fukuda, Ken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline

    This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI). It comprises:

    • Video data that simulates daily-life actions in a virtual space, generated from the scenario data.

    • Knowledge graphs and transcriptions of the video content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).

    • Knowledge graph embedding data, created for reasoning based on machine learning.

    This data is open to the public as open data.

    Details

    Videos

    mp4 format

    203 action scenarios

    For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and fixed camera views placed in each corner of the room (file names ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes), for a total of 1,218 videos.

    Videos with slowly moving characters simulate the movements of elderly people.

    Knowledge Graphs

    RDF format

    203 knowledge graphs corresponding to the videos

    Includes the schema and supplementary location information

    The schema is described below

    SPARQL endpoints and query examples are available

    Script Data

    txt format

    Data provided to VirtualHome2KG to generate videos and knowledge graphs

    Includes the action title and a brief description in text format.

    Embedding

    Embedding vectors in TransE, ComplEx, and RotatE, created with DGL-KE (https://dglke.dgl.ai/doc/).

    Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).
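
    A minimal sketch for working with the TransE vectors, assuming DGL-KE's usual NumPy output files (the file names below are illustrative; use the ones shipped with this dataset):

    import numpy as np

    entity_emb = np.load("vh2kg_TransE_entity.npy")      # shape: (num_entities, dim)
    relation_emb = np.load("vh2kg_TransE_relation.npy")  # shape: (num_relations, dim)

    def transe_score(h: int, r: int, t: int) -> float:
        """TransE plausibility of the triple (h, r, t): higher is more plausible."""
        return -float(np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t]))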

    Specification of Ontology

    Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm

    Related Resources

    KGRC4SI Final Presentations with automatic English subtitles (YouTube)

    VirtualHome2KG (Software)

    VirtualHome-AIST (Unity)

    VirtualHome-AIST (Python API)

    Visualization Tool (Software)

    Script Editor (Software)

  7. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets: (i) Wikidata-TekGen, with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG, with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
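
    A compliance check in the spirit of the benchmark can be sketched in a few lines of Python: keep only the extracted triples whose relation is defined in the given ontology. The relation list below is illustrative, not the full Music Ontology:

    ALLOWED_RELATIONS = {"publication date", "lyrics by", "performer"}

    def filter_compliant(prediction: dict) -> list:
        """Keep only triples whose relation the ontology defines."""
        return [t for t in prediction["triples"] if t["rel"] in ALLOWED_RELATIONS]

    pred = {
        "id": "ont_k_music_test_n",
        "triples": [
            {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
            {"sub": "The Loco-Motion", "rel": "genre", "obj": "pop"},  # not in ontology
        ],
    }
    assert len(filter_compliant(pred)) == 1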
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repository is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

  8. GAP Graphs Part 1 (A-T)

    • kaggle.com
    zip
    Updated Dec 5, 2021
    Cite
    Subhajit Sahu (2021). GAP Graphs Part 1 (A-T) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-gap-01/code
    Explore at:
    Available download formats: zip (24,502,979,638 bytes)
    Dataset updated
    Dec 5, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The unfortunate lack of a widely used graph benchmark suite forces each research publication to create its own evaluation methodology, and this often results in mistakes or unnecessary differences. Common serious mistakes we have observed include: using trivially small input graphs, using only a single input graph topology, or using low-performance implementations as baselines. These methodological issues make it difficult for good ideas to stand out and cloud the reasoning behind why these ideas are beneficial.

    In order for the research community to make progress on accelerating graph processing, it is important to be able to properly and reliably compare results. We created the GAP Benchmark Suite to standardize evaluations in order to alleviate the methodological issues we observed. Through standardization, we hope to not only make results easier to compare, but to also prevent common evaluation mistakes. We provide both a benchmark specification to standardize the methodology and a high-performance reference implementation to be used as a baseline. Our benchmark was co-designed with our workload characterization, and it has undergone multiple revisions guided by community feedback.

    GAP benchmark matrices: Scott Beamer, Krste Asanović, and David Patterson, as described in "The GAP Benchmark Suite", https://arxiv.org/abs/1508.03619.

    (1) GAP-twitter (|V|=61.6M, |E|=1,468.4M, directed) is an example of a social network topology [18]. This particular crawl of Twitter has been commonly used by researchers and thus eases comparisons with prior work. By virtue of it coming from real-world data, it has interesting irregularities and the skew in its degree distribution can be a challenge for some implementations.

    [18] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
    What is Twitter, a social network or a news media? International
    World Wide Web Conference (WWW), 2010.
    
    A permuted version of this matrix appears as SNAP/twitter7 in
    the SuiteSparse Matrix Collection.
    

    (2) GAP-web (|V|=50.6M, |E|=1,949.4M, directed) is a web-crawl of the .sk domain (sk-2005) [9]. Despite its large size, it exhibits substantial locality due to its topology and high average degree.

    The matrix comes from the Laboratory for Web Algorithmics (LAW), Università
    degli Studi di Milano, http://law.di.unimi.it/index.php.
    
    The pattern of this GAP-web matrix also appears as LAW/sk-2005, in the
    SuiteSparse Matrix Collection.
    

    (3) GAP-road (|V|=23.9M, |E|=58.3M, directed) encodes the distances of all of the roads in the USA [10]. Although it is substantially smaller than the rest of the graphs, it has a high diameter, which can cause some synchronous implementations to have long runtimes.

    [10] 9th DIMACS implementation challenge -- shortest paths.
    http://www.dis.uniroma1.it/challenge9/, 2006.
    
    The pattern of the GAP-road matrix also appears as DIMACS10/road_usa
    in the SuiteSparse Matrix Collection.
    

    (4) GAP-kron (|V|=134.2M, |E|=2,111.6M, undirected) uses the Kronecker synthetic graph generator [19] with the same parameters as Graph 500 (A=0.57, B=C=0.19, D=0.05) [14]. It has been used frequently in research due to Graph 500, so it also provides continuity with prior work.

    [19] Jurij Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and
    Christos Faloutsos. Realistic, mathematically tractable graph
    generation and evolution, using Kronecker multiplication.
    European Conference on Principles and Practice of Knowledge
    Discovery in Databases, 2005.
    
    [14] Graph500 benchmark. www.graph500.org.
    

    (5) GAP-urand (|V|=134.2M, |E|=2,147.4M, undirected) is synthetically generated by the Erdős–Rényi model (uniform random) [11]. With respect to locality, it represents the worst case, as every vertex has equal probability of being a neighbor of every other vertex. When contrasted with the similarly sized kron graph, it demonstrates the impact of kron's scale-free property.

    [11] Paul Erdős and Alfréd Rényi. On random graphs I.
    Publicationes Mathematicae, 6:290–297, 1959.
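
    For reference, the Graph 500 parameters quoted above drive a recursive quadrant-sampling (R-MAT/Kronecker) procedure. A minimal Python sketch with toy sizes (GAP-kron itself uses scale 27, i.e. |V| = 2^27):

    import random

    def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
        """Sample one edge of a 2^scale-vertex Kronecker graph (d = 1 - a - b - c)."""
        u = v = 0
        for _ in range(scale):  # choose one quadrant per bit, most significant first
            u <<= 1
            v <<= 1
            r = rng.random()
            if r < a:
                pass        # top-left quadrant: bits (0, 0)
            elif r < a + b:
                v |= 1      # top-right quadrant: bits (0, 1)
            elif r < a + b + c:
                u |= 1      # bottom-left quadrant: bits (1, 0)
            else:
                u |= 1      # bottom-right quadrant: bits (1, 1)
                v |= 1
        return u, v

    rng = random.Random(42)
    edges = [rmat_edge(10, rng) for _ in range(16 * (1 << 10))]  # edge factor 16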
    
  9. clinical-synthetic-text-kg

    • huggingface.co
    Updated Jun 23, 2024
    Cite
    Ran Xu (2024). clinical-synthetic-text-kg [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-kg
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 23, 2024
    Authors
    Ran Xu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge used during generation is drawn from knowledge graphs.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000 synthetic… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-kg.

  10. Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2.

    • plos.figshare.com
    xls
    Updated Jun 5, 2025
    Cite
    Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi (2025). Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2. [Dataset]. http://doi.org/10.1371/journal.pone.0319031.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2.

  11. Data from: IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation

    • data.niaid.nih.gov
    Updated Jun 14, 2023
    Cite
    Thiviyan Thanapalasingam; Emile van Krieken; Peter Bloem; Paul Groth (2023). IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7824817
    Explore at:
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    University of Amsterdam
    Vrije Universiteit Amsterdam
    Authors
    Thiviyan Thanapalasingam; Emile van Krieken; Peter Bloem; Paul Groth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IntelliGraphs is a collection of datasets for benchmarking Knowledge Graph Generation models. It consists of three synthetic datasets (syn-paths, syn-tipr, syn-types) and two real-world datasets (wd-movies, wd-articles). There is also a Python package available that loads these datasets and verifies new graphs using the semantics pre-defined for each dataset. It can also be used as a testbed for developing new generative models.

  12. Summary of the functions in the package EATME.

    • plos.figshare.com
    xls
    Updated Oct 3, 2024
    Cite
    Li-Pang Chen; Cheng-Kuan Lin (2024). Summary of the functions in the package EATME. [Dataset]. http://doi.org/10.1371/journal.pone.0308828.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Li-Pang Chen; Cheng-Kuan Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, we introduce an R package, EATME (Exponentially weighted moving average (EWMA) control chart with Adjustments To Measurement Error). The main purpose of this package is to correct for measurement error effects in continuous or binary random variables and to develop corrected control charts based on the EWMA statistic. The corrected control charts can accurately detect an out-of-control process. The package contains a function to generate synthetic data and includes functions to determine a reasonable control-limit coefficient as well as to estimate the average run length. Moreover, for visualization, we also provide control charts showing the monitoring of in-control and out-of-control processes. Finally, the functions in this package are clearly demonstrated, and numerical studies show the validity of the package.
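
    For orientation, the statistic such charts monitor is z_t = lambda*x_t + (1 - lambda)*z_{t-1}, flagged when it leaves its control limits. A minimal Python sketch of a plain EWMA chart, without EATME's measurement-error correction (lambda_ and L are the usual smoothing constant and control-limit coefficient):

    import math

    def ewma_out_of_control(x, mu0, sigma, lambda_=0.2, L=2.7):
        """Return the 1-based time indices where the EWMA statistic exceeds its limits."""
        z = mu0
        flagged = []
        for t, xt in enumerate(x, start=1):
            z = lambda_ * xt + (1 - lambda_) * z
            # time-varying half width of the EWMA control limits
            half_width = L * sigma * math.sqrt(
                lambda_ / (2 - lambda_) * (1 - (1 - lambda_) ** (2 * t)))
            if abs(z - mu0) > half_width:
                flagged.append(t)
        return flagged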

  13. Code underlying the publication: Online graph filter design over expanding graphs

    • data.4tu.nl
    zip
    Updated Nov 13, 2024
    Cite
    Bishwadeep Das (2024). Code underlying the publication: Online graph filter design over expanding graphs [Dataset]. http://doi.org/10.4121/aabf6ecd-ce11-4427-9fbc-9a769e16de49.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Bishwadeep Das
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Dataset funded by
    NWO
    Description

    The research objective is to design online algorithms for graph filter design over expanding graphs, under conditions of both known and unknown connectivity. The datasets used in this paper are available online, and code for generating synthetic data is included. The folder Recsys_new contains the experimental setup and online algorithms for movie rating prediction on MovieLens100k. The folder Stochastic_Synthetic_New contains the experimental setup and online algorithms for signal interpolation on synthetic expanding graphs. The folder Stochastic_covid contains the code for COVID case count prediction over a growing city network.


  14. Dataset Artifact for paper "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?"

    • data.niaid.nih.gov
    Updated Aug 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pham, Luan; Ha, Huong; Zhang, Hongyu (2024). Dataset Artifact for paper "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13305662
    Explore at:
    Dataset updated
    Aug 25, 2024
    Dataset provided by
    Chongqing University
    RMIT University
    Authors
    Pham, Luan; Ha, Huong; Zhang, Hongyu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artifacts for the paper titled "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?".

    This artifact repository contains 9 compressed folders, as follows:

    1. syn_circa.zip: CIRCA10 and CIRCA50 datasets for causal discovery

    2. syn_rcd.zip: RCD10 and RCD50 datasets for causal discovery

    3. syn_causil.zip: CausIL10 and CausIL50 datasets for causal discovery

    4. rca_circa.zip: CIRCA10 and CIRCA50 datasets for RCA

    5. rca_rcd.zip: RCD10 and RCD50 datasets for RCA

    6. online-boutique.zip: Online Boutique dataset for RCA

    7. sock-shop-1.zip: Sock Shop 1 dataset for RCA

    8. sock-shop-2.zip: Sock Shop 2 dataset for RCA

    9. train-ticket.zip: Train Ticket dataset for RCA

    Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).

    Details about the generation of our datasets

    1. Synthetic datasets

    We use three different synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: the CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:

    1. The CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node is generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.

    2. The RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.

    3. The CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator does not have the capability to inject faults.

    To create our synthetic datasets, we first generate 10 DAGs whose nodes range from 10 to 50 for each of the synthetic data generators. Next, we generate fault-free datasets using these DAGs with different seedings, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., syn_rcd, syn_circa) are used to evaluate causal discovery methods, while the faulty datasets (e.g., rca_rcd, rca_circa) are used to assess RCA methods.
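
    The CIRCA-style mechanism, for instance, can be sketched in a few lines of Python; the coefficients, sizes, and fault magnitude below are illustrative assumptions:

    import numpy as np

    def simulate(dag_parents, T=500, fault_node=None, fault_t=(250, 251), seed=0):
        """VAR-like series over a DAG; dag_parents maps node -> parent list,
        keyed 0..n-1 in topological order. The fault inflates one noise term."""
        rng = np.random.default_rng(seed)
        n = len(dag_parents)
        X = np.zeros((T, n))
        for t in range(1, T):
            for i in range(n):  # parents come earlier in the ordering
                noise = rng.normal()
                if i == fault_node and t in fault_t:
                    noise += 10.0  # the injected fault
                X[t, i] = 0.5 * X[t - 1, i] + sum(
                    0.8 * X[t, p] for p in dag_parents[i]) + noise
        return X

    data = simulate({0: [], 1: [0], 2: [1]}, fault_node=1)  # chain 0 -> 1 -> 2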

    2. Data collected from benchmark microservice systems

    We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted by AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to explore all services with 100 to 200 users concurrently. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the figure below.

    Code

    The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.

    References

    As in our paper.

  15. Human Labeled OHLCV Stock Market Data

    • kaggle.com
    zip
    Updated Mar 26, 2025
    Cite
    Barathan Aslan (2025). Human Labeled OHLCV Stock Market Data [Dataset]. https://www.kaggle.com/datasets/barathanaslan/human-labeled-synthetic-stock-market-data
    Explore at:
    Available download formats: zip (9,914,465 bytes)
    Dataset updated
    Mar 26, 2025
    Authors
    Barathan Aslan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    This dataset provides synthetically generated financial time series data, presented as OHLCV (Open-High-Low-Close-Volume) candlestick charts. A key feature of this dataset is the inclusion of technical analysis annotations (labels) meticulously created by a human analyst for each chart.

    The primary goal is to offer a resource for training and evaluating machine learning models focused on automated technical analysis and chart pattern recognition. By providing synthetic data with high-quality human labels, this dataset aims to facilitate research and development in areas like algorithmic trading and financial visualization analysis.

    This is an evolving dataset. It represents the initial phase of a larger labeling effort, and future updates are planned to incorporate a greater number and variety of labeled chart patterns.

    Content

    The dataset is provided entirely as a collection of JSON files. Each file represents a single 300-candle chart window and contains:

    1. metadata: Contains basic information related to the generation of the file (e.g., generation timestamp, version).
    2. ohlcv_data: A sequence of 300 data points. Each point is a dictionary representing one time candle and includes:
      • time: Timestamp string (ISO 8601 format). Note: These timestamps maintain realistic intra-day time progression (hours, minutes), but the specific dates (Day, Month, Year) are entirely synthetic and do not align with real-world calendar dates.
      • open, high, low, close: Numerical values representing the candle's price range. Note: These values are synthetic and are not tied to any real financial instrument's price.
      • volume: A numerical value representing activity during the candle's period. Note: This is also a synthetic value.
    3. labels: A dictionary containing the human-provided technical analysis annotations for the corresponding chart window:
      • horizontal_lines: A list of structures, each containing a price key. These typically denote significant horizontal levels identified by the labeler, such as support or resistance.
      • ray_lines: A list of structures, each defining a line segment via start_date, start_price, end_date, and end_price. These are used to represent patterns like trendlines, channel boundaries, or other linear formations observed by the labeler.
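
    Reading one chart window is then straightforward; a minimal sketch following the structure above (the file name is a placeholder):

    import json

    with open("chart_0001.json") as f:  # placeholder file name
        window = json.load(f)

    candles = window["ohlcv_data"]  # 300 dicts: time/open/high/low/close/volume
    closes = [c["close"] for c in candles]
    levels = [h["price"] for h in window["labels"]["horizontal_lines"]]
    rays = window["labels"]["ray_lines"]  # start_date/start_price/end_date/end_price
    print(len(candles), len(levels), len(rays))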

    Data Generation Approach

    The dataset features synthetically generated candlestick patterns. The generation process focuses on creating structurally plausible chart sequences. Human analysts then carefully review these sequences and apply relevant technical analysis labels (support, resistance, trendlines).

    While the patterns may resemble those seen in financial markets, the underlying numerical data (price, volume, and the associated timestamps) is artificial and intentionally detached from any real-world financial data. Users should focus on the relative structure of the candles and the associated human-provided labels, rather than interpreting the absolute values as representative of any specific market or time.

    Acknowledgements

    This dataset is made possible through ongoing human labeling efforts and custom data generation software.

    Inspiration

    • Train models (e.g., CNNs, Transformers) to recognize support/resistance levels and trendlines directly from chart data.
    • Develop and benchmark algorithms for automated technical analysis pattern detection.
    • Use as a basis for generating further augmented chart data for ML training.
    • Explore novel approaches to financial time series analysis using labeled, synthetic data.
  16. difficult_problem_dataset_v3

    • huggingface.co
    Updated Sep 30, 2025
    Cite
    ikedachin (2025). difficult_problem_dataset_v3 [Dataset]. https://huggingface.co/datasets/ikedachin/difficult_problem_dataset_v3
    Explore at:
    Dataset updated
    Sep 30, 2025
    Authors
    ikedachin
    Description

    Overview

    This is a synthetic dataset created using the Scalable Data Generation (SDG) framework. It is structured for use with a thinking model; the input and output form question-and-answer pairs.

      Data Generation Pipeline
    

    Question Generation

    Model: Qwen/Qwen3-30B-A3B-Instruct-2507
    The model assumes the role of an academic graph expert and generates Ph.D.-level questions.

    Reasoning Process Generation

    Model: openai/gpt-oss-120b
    Reassign the role of… See the full description on the dataset page: https://huggingface.co/datasets/ikedachin/difficult_problem_dataset_v3.

  17. Summary of the arguments of the functions in the package EATME.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Oct 3, 2024
    Cite
    Li-Pang Chen; Cheng-Kuan Lin (2024). Summary of the arguments of the functions in the package EATME. [Dataset]. http://doi.org/10.1371/journal.pone.0308828.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Li-Pang Chen; Cheng-Kuan Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of the arguments of the functions in the package EATME.

  18. reasoning-biochem

    • huggingface.co
    Updated Oct 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Extrasensory AI (2024). reasoning-biochem [Dataset]. https://huggingface.co/datasets/extrasensory/reasoning-biochem
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Extrasensory AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is a synthetic reasoning dataset generated from the PrimeKG biomedical knowledge graph. It contains verifiable reasoning traces generated using the approach outlined in Synthetic CoT Reasoning Trace Generation from Knowledge Graphs. The synthetic chain-of-thought data is generated procedurally using program synthesis and logic programming which is able to produce vast quantities of verifiable forward reasoning traces with minimal human oversight. The benchmark is intended to be used to… See the full description on the dataset page: https://huggingface.co/datasets/extrasensory/reasoning-biochem.
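
    The flavor of such procedural generation can be illustrated with a toy forward-chaining step over triples; this is a sketch of the general idea, not the authors' pipeline:

    # Toy Horn rule: (X r1 Y) and (Y r2 Z) => (X r3 Z), with the derivation recorded.
    triples = {("aspirin", "treats", "pain"), ("pain", "symptom_of", "injury")}
    rule = ("treats", "symptom_of", "indicated_for")

    trace = []
    for (x, r1, y) in sorted(triples):
        for (y2, r2, z) in sorted(triples):
            if (r1, r2) == rule[:2] and y == y2:
                derived = (x, rule[2], z)
                trace.append(f"{x} {r1} {y}; {y} {r2} {z} => {' '.join(derived)}")
                triples.add(derived)

    print("\n".join(trace))  # a verifiable step-by-step derivation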

  19. Data from: The Least Cost Directed Perfect Awareness Problem - Benchmark Instances and Solutions

    • data.mendeley.com
    Updated Nov 11, 2024
    Cite
    Felipe Pereira (2024). The Least Cost Directed Perfect Awareness Problem - Benchmark Instances and Solutions [Dataset]. http://doi.org/10.17632/xgtjgzf28r.3
    Explore at:
    Dataset updated
    Nov 11, 2024
    Authors
    Felipe Pereira
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset contains complementary data to the paper "The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations" [1]. Here, we make available two sets of instances of the combinatorial optimization problem studied in that paper, which deals with the spread of information on social networks. We also provide the best known solutions and bounds obtained through computational experiments for each instance.

    The first input set includes 300 synthetic instances composed of graphs that resemble real-world social networks. These graphs were produced with a generator proposed in [2]. The second set consists of 14 instances built from graphs obtained by crawling Twitter [3].

    The directories "synthetic_instances" and "twitter_instances" contain files that describe both sets of instances, all of which follow a common format.

    The directories "solutions_for_synthetic_instances" and "solutions_for_twitter_instances" contain files that describe the best known solutions for both sets of instances, all of which also follow a common format: the first line gives the number of vertices in the solution, and each subsequent line describes one of those vertices.

    Lastly, two files, namely "bounds_for_synthetic_instances.csv" and "bounds_for_twitter_instances.csv", enumerate the values of the best known lower and upper bounds for both sets of instances.

    This work was supported by grants from Santander Bank, Brazil, Brazilian National Council for Scientific and Technological Development (CNPq), Brazil, São Paulo Research Foundation (FAPESP), Brazil.

    Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, or FAPESP.

    References

    [1] F. C. Pereira, P. J. de Rezende. The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations. Submitted. 2023.

    [2] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed scale-free graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 132–139, 2003.

    [3] C. Schweimer, C. Gfrerer, F. Lugstein, D. Pape, J. A. Velimsky, R. Elsässer, and B. C. Geiger. Generating simple directed social network graphs for information spreading. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 1475–1485, 2022.

  20. Datasets of synthetic workflows for evaluating a multi-objective and multi-constrained scheduling approach for cyber-physical applications

    • data.europa.eu
    • ieee-dataport.org
    • +1more
    unknown
    Updated Apr 15, 2024
    Cite
    Zenodo (2024). Datasets of synthetic workflows for evaluating a multi-objective and multi-constrained scheduling approach for cyber-physical applications [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10978009?locale=en
    Explore at:
    Available download formats: unknown (306,914 bytes)
    Dataset updated
    Apr 15, 2024
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets of synthetic workflows (task graphs) were generated to evaluate the performance and scalability of a multi-objective and multi-constrained scheduling approach for workflow applications of various structures, sizes, and sensing/actuating requirements in a cyber-physical system (CPS) based on the edge-hub-cloud paradigm. The examined CPS comprised four edge devices (i.e., single-board computers, each attached to an unmanned aerial vehicle (UAV) equipped with sensors/actuators) interacting with a hub device (e.g., a laptop), which in turn communicated with a more computationally capable cloud server. All system devices featured heterogeneous multicore processors with different processing core failure rates and varied sensing/actuating or other specialized capabilities. Our objectives were the minimization of the overall latency, the minimization of the overall energy consumption, and the maximization of the overall reliability of the workflow application in the specific CPS, under deadline, reliability, memory, storage, energy, capability, and task precedence constraints. We generated 25 random task graphs with 10, 20, 30, 40, and 50 nodes (5 task graphs for each size), utilizing the Task Graphs For Free (TGFF) random task graph generator [1], [2]. Additional task parameters (e.g., execution time, power consumption, memory, storage, output data size, capability, reliability threshold) were included post-generation, using appropriate values. More details are provided in README.txt.

    References:

    [1] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE), 1998, pp. 97-101, doi: 10.1109/HSC.1998.666245.

    [2] R. P. Dick, D. L. Rhodes, and K. Vallerio, "TGFF," https://robertdick.org/projects/tgff/.
