27 datasets found
  1. Synthetic Data for graphdb-benchmark

    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Sotiris Beis; Symeon Papadopoulos; Yannis Kompatsiaris (2023). Synthetic Data for graphdb-benchmark [Dataset]. http://doi.org/10.6084/m9.figshare.1221760.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sotiris Beis; Symeon Papadopoulos; Yannis Kompatsiaris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data we used to evaluate the Louvain method in the study Benchmarking Graph Databases on the Problem of Community Detection. These data were synthetically generated using the LFR-Benchmark (3rd link). There are two types of files, networkX.dat and communityX.dat. The networkX.dat file contains the list of edges (nodes are labelled from 1 to the number of nodes; the edges are ordered and repeated twice, i.e. source-target and target-source). The first four lines of the networkX.dat file list the parameters we used to generate the data. The communityX.dat file contains a list of the nodes and their memberships (memberships are labelled by integer numbers >= 1). Note: X corresponds to the number of nodes each dataset contains.
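
    Given that layout, loading one of these datasets takes only a few lines of Python. A minimal sketch, assuming whitespace-separated columns as described above (the file names for X = 1000 are illustrative; nx.Graph() deduplicates the twice-listed edges):

    import networkx as nx

    def load_lfr(network_path: str, community_path: str, header_lines: int = 4):
        """Load an LFR network/community file pair as described above."""
        G = nx.Graph()  # undirected; each edge is listed in both directions
        with open(network_path) as f:
            for line in f.readlines()[header_lines:]:  # skip the 4 parameter lines
                u, v = line.split()[:2]
                G.add_edge(int(u), int(v))
        membership = {}
        with open(community_path) as f:
            for line in f:
                node, community = line.split()[:2]
                membership[int(node)] = int(community)
        return G, membership

    G, communities = load_lfr("network1000.dat", "community1000.dat")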

  2. GraphXAI

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Queen, Owen (2023). GraphXAI [Dataset]. http://doi.org/10.7910/DVN/KULOS8
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Queen, Owen
    Description

    As post hoc explanations are increasingly used to understand the behavior of Graph Neural Networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and additional XAI-ready real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition, GraphXAI provides a broader ecosystem of data loaders, data processing functions, synthetic and real-world graph datasets with ground-truth explanations, visualizers, GNN model implementations, and a set of evaluation metrics to benchmark the performance of any given GNN explainer.
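
    As a rough illustration, generating one benchmark graph might look like the following hypothetical sketch; the import path, class name, and every constructor argument here are assumptions to be checked against the GraphXAI documentation:

    # Hypothetical usage sketch: all names below are assumptions, not the
    # confirmed GraphXAI API; consult the library documentation.
    from graphxai.datasets import ShapeGGen  # assumed entry point

    dataset = ShapeGGen(
        model_layers=3,        # assumed: GNN depth the ground truth targets
        num_subgraphs=100,     # assumed: number of motif-bearing subgraphs
        prob_connection=0.06,  # assumed: inter-subgraph connection probability
        subgraph_size=12,      # assumed: nodes per subgraph
    )
    data = dataset.get_graph()       # assumed accessor for the generated graph
    gt_exp = dataset.explanations    # assumed ground-truth explanations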

  3. Datasets of synthetic task flow graphs for evaluating a latency/energy optimization task allocation framework

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Datasets of synthetic task flow graphs for evaluating a latency/energy optimization task allocation framework [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10654551?locale=da
    Explore at:
    Available download formats: unknown (1,386,672 bytes)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets of synthetic task flow graphs were generated to evaluate the performance and scalability of an optimal task allocation approach for applications of various structures and sizes in an environment following the edge/hub/cloud paradigm. The system under study comprised an edge device (e.g., a single-board computer attached to an unmanned aerial vehicle (UAV)) interacting with a hub device (e.g., a laptop), which in turn communicated with a more computationally capable cloud server. The objective was the minimization of either overall latency or overall energy consumption, under memory, storage, energy, and task precedence constraints. We considered that a percentage of the tasks required fixed allocation on the edge or hub device. We generated 18 task flow graphs of parallel, serial, and mixed (a combination of parallel and serial) structure with 10, 100, and 1000 nodes, and various in/out degrees, utilizing the Task Graphs For Free (TGFF) random task graph generator [1], [2]. Additional task parameters (e.g., execution time, power consumption, memory, storage, output data size) were included post-generation, using representative random values. More details are provided in README.txt and in [3].

    Note: These datasets are released under a Creative Commons Attribution license. If you utilize these datasets in your work, please cite us using the corresponding Zenodo DOI: https://doi.org/10.5281/zenodo.10654551.

    References:

    [1] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE), 1998, pp. 97-101, doi: 10.1109/HSC.1998.666245.

    [2] R. P. Dick, D. L. Rhodes, and K. Vallerio, "TGFF," https://robertdick.org/projects/tgff/.

    [3] A. Kouloumpris, G. L. Stavrinides, M. K. Michael, and T. Theocharides, "An optimization framework for task allocation in the edge/hub/cloud paradigm," Future Generation Computer Systems, vol. 155, pp. 354-366, Jun. 2024, doi: 10.1016/j.future.2024.02.005.
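
    The post-generation annotation step described above (a task DAG decorated with per-task and per-edge attributes) can be sketched in Python; the attribute names and value ranges below are illustrative assumptions, not the values used in these datasets:

    import random
    import networkx as nx

    def annotate_task_graph(G: nx.DiGraph, seed: int = 0) -> nx.DiGraph:
        """Attach representative random parameters to a TGFF-style task DAG."""
        rng = random.Random(seed)
        for task in G.nodes:
            G.nodes[task].update(
                exec_time_ms=rng.uniform(1.0, 50.0),
                power_mw=rng.uniform(100.0, 2000.0),
                memory_kb=rng.randint(64, 4096),
                storage_kb=rng.randint(64, 8192),
            )
        for u, v in G.edges:
            G.edges[u, v]["output_kb"] = rng.randint(1, 512)  # data sent u -> v
        return G

    # A small serial chain standing in for a TGFF-generated graph.
    chain = annotate_task_graph(nx.path_graph(10, create_using=nx.DiGraph))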

  4. Variant data used during semi-synthetic data generation.

    • plos.figshare.com
    xls
    Updated Sep 5, 2025
    Cite
    Eric V. Strobl; Eric R. Gamazon (2025). Variant data used during semi-synthetic data generation. [Dataset]. http://doi.org/10.1371/journal.pcbi.1013461.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Eric V. Strobl; Eric R. Gamazon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Variant data used during semi-synthetic data generation.

  5. Datasets of synthetic task graphs for evaluating a reliability and latency multi-objective task allocation framework

    • data.europa.eu
    • data-staging.niaid.nih.gov
    • +1more
    unknown
    Cite
    Zenodo, Datasets of synthetic task graphs for evaluating a reliability and latency multi-objective task allocation framework [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10357101?locale=hr
    Explore at:
    Available download formats: unknown (642,181 bytes)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets of synthetic task graphs were generated to evaluate the performance and scalability of a multi-objective task allocation approach for workflow applications of various structures and sizes in a system based on the edge-hub-cloud paradigm. The targeted architecture comprised an edge device (e.g., a single-board computer attached to an unmanned aerial vehicle (UAV)) interacting with a hub device (e.g., a laptop), which in turn communicated with a more computationally capable cloud server. The objectives were the maximization of the overall reliability and the minimization of the overall latency of the application, under memory, storage, energy, and task precedence constraints. We considered that a percentage of the tasks required fixed allocation on the edge or hub device. Each task had a different vulnerability factor (i.e., probability of failure) on each device. We generated nine task graphs of serial, parallel, and mixed (a combination of serial and parallel) structure with 10, 100, and 1000 nodes, utilizing the Task Graphs For Free (TGFF) random task graph generator [1]. Additional task parameters (e.g., execution time, power consumption, vulnerability factor, memory, storage, output data size) were included post-generation, using representative random values. More details are provided in README.txt.

    Note: These datasets are released under a Creative Commons Attribution license. If you utilize these datasets in your work, please cite us using the corresponding Zenodo DOI: https://doi.org/10.5281/zenodo.10357101.

    References:

    [1] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE'98), Seattle, WA, USA, 1998, pp. 97-101, doi: 10.1109/HSC.1998.666245.

  6. Data from: Synthetic Multimodal Dataset for Daily Life Activities

    • data.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Ugai, Takanori; Egami, Shusaku; Swe Nwe Nwe Htun; Kozaki, Kouji; Kawamura, Takahiro; Fukuda, Ken (2024). Synthetic Multimodal Dataset for Daily Life Activities [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8046266
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    National Institute of Advanced Industrial Science and Technology
    Fujitsu
    Osaka Electro-Communication University
    National Agriculture and Food Research Organization
    Authors
    Ugai, Takanori; Egami, Shusaku; Swe Nwe Nwe Htun; Kozaki, Kouji; Kawamura, Takahiro; Fukuda, Ken
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline

    This dataset was originally created for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI). It comprises:

    • Video data that simulates daily-life actions in a virtual space, generated from the scenario data.

    • Knowledge graphs and transcriptions of the video content ("who" did what "action" with what "object," when and where, and the resulting "state" or "position" of the object).

    • Knowledge graph embedding data, created for reasoning based on machine learning.

    This data is open to the public as open data.

    Details

    Videos

    mp4 format

    203 action scenarios

    For each scenario, there is a character rear view (file name ending in 0), an indoor camera switching view (file name ending in 1), and fixed camera views placed in each corner of the room (file names ending in 2-5). Also, for each action scenario, data was generated for a minimum of 1 to a maximum of 7 patterns with different room layouts (scenes), for a total of 1,218 videos.

    Videos with slowly moving characters simulate the movements of elderly people.

    Knowledge Graphs

    RDF format

    203 knowledge graphs corresponding to the videos

    Includes the schema and supplementary location information

    The schema is described below

    SPARQL endpoints and query examples are available

    Script Data

    txt format

    Data provided to VirtualHome2KG to generate videos and knowledge graphs

    Includes the action title and a brief description in text format.

    Embedding

    Embedding vectors in TransE, ComplEx, and RotatE, created with DGL-KE (https://dglke.dgl.ai/doc/).

    Embedding Vectors created with jRDF2vec (https://github.com/dwslab/jRDF2Vec).
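
    A minimal sketch for working with the TransE vectors, assuming DGL-KE's usual NumPy output files (the file names below are illustrative; use the ones shipped with this dataset):

    import numpy as np

    entity_emb = np.load("vh2kg_TransE_entity.npy")      # shape: (num_entities, dim)
    relation_emb = np.load("vh2kg_TransE_relation.npy")  # shape: (num_relations, dim)

    def transe_score(h: int, r: int, t: int) -> float:
        """TransE plausibility of the triple (h, r, t): higher is more plausible."""
        return -float(np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t]))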

    Specification of Ontology

    Please refer to the specification for descriptions of all classes, instances, and properties: https://aistairc.github.io/VirtualHome2KG/vh2kg_ontology.htm

    Related Resources

    KGRC4SI Final Presentations with automatic English subtitles (YouTube)

    VirtualHome2KG (Software)

    VirtualHome-AIST (Unity)

    VirtualHome-AIST (Python API)

    Visualization Tool (Software)

    Script Editor (Software)

  7. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets: (i) Wikidata-TekGen, with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG, with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
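
    A compliance check in the spirit of the benchmark can be sketched in a few lines of Python: keep only the extracted triples whose relation is defined in the given ontology. The relation list below is illustrative, not the full Music Ontology:

    ALLOWED_RELATIONS = {"publication date", "lyrics by", "performer"}

    def filter_compliant(prediction: dict) -> list:
        """Keep only triples whose relation the ontology defines."""
        return [t for t in prediction["triples"] if t["rel"] in ALLOWED_RELATIONS]

    pred = {
        "id": "ont_k_music_test_n",
        "triples": [
            {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
            {"sub": "The Loco-Motion", "rel": "genre", "obj": "pop"},  # not in ontology
        ],
    }
    assert len(filter_compliant(pred)) == 1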
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repository is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

  8. GAP Graphs Part 1 (A-T)

    • kaggle.com
    zip
    Updated Dec 5, 2021
    Cite
    Subhajit Sahu (2021). GAP Graphs Part 1 (A-T) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-gap-01/code
    Explore at:
    Available download formats: zip (24,502,979,638 bytes)
    Dataset updated
    Dec 5, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The unfortunate lack of a widely used graph benchmark suite forces each research publication to create its own evaluation methodology, and this often results in mistakes or unnecessary differences. Common serious mistakes we have observed include: using trivially small input graphs, using only a single input graph topology, or using low-performance implementations as baselines. These methodological issues make it difficult for good ideas to stand out and cloud the reasoning behind why these ideas are beneficial.

    In order for the research community to make progress on accelerating graph processing, it is important to be able to properly and reliably compare results. We created the GAP Benchmark Suite to standardize evaluations in order to alleviate the methodological issues we observed. Through standardization, we hope to not only make results easier to compare, but to also prevent common evaluation mistakes. We provide both a benchmark specification to standardize the methodology and a high-performance reference implementation to be used as a baseline. Our benchmark was co-designed with our workload characterization, and it has undergone multiple revisions guided by community feedback.

    GAP benchmark matrices: Scott Beamer, Krste Asanović, and David Patterson, as described in "The GAP Benchmark Suite", https://arxiv.org/abs/1508.03619.

    (1) GAP-twitter (|V|=61.6M, |E|=1,468.4M, directed) is an example of a social network topology [18]. This particular crawl of Twitter has been commonly used by researchers and thus eases comparisons with prior work. By virtue of it coming from real-world data, it has interesting irregularities and the skew in its degree distribution can be a challenge for some implementations.

    [18] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
    What is Twitter, a social network or a news media? International
    World Wide Web Conference (WWW), 2010.
    
    A permuted version of this matrix appears as SNAP/twitter7 in
    the SuiteSparse Matrix Collection.
    

    (2) GAP-web (|V|=50.6M, |E|=1,949.4M, directed) is a web-crawl of the .sk domain (sk-2005) [9]. Despite its large size, it exhibits substantial locality due to its topology and high average degree.

    The matrix comes from the Laboratory for Web Algorithmics (LAW), Università
    degli Studi di Milano, http://law.di.unimi.it/index.php.
    
    The pattern of this GAP-web matrix also appears as LAW/sk-2005, in the
    SuiteSparse Matrix Collection.
    

    (3) GAP-road (|V|=23.9M, |E|=58.3M, directed) encodes the distances of all of the roads in the USA [10]. Although it is substantially smaller than the rest of the graphs, it has a high diameter, which can cause some synchronous implementations to have long runtimes.

    [10] 9th DIMACS implementation challenge -- shortest paths.
    http://www.dis.uniroma1.it/challenge9/, 2006.
    
    The pattern of the GAP-road matrix also appears as DIMACS10/road_usa
    in the SuiteSparse Matrix Collection.
    

    (4) GAP-kron (|V|=134.2M, |E|=2,111.6M, undirected) uses the Kronecker synthetic graph generator [19] with the same parameters as Graph 500 (A=0.57, B=C=0.19, D=0.05) [14]. It has been used frequently in research due to Graph 500, so it also provides continuity with prior work.

    [19] Jurij Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and
    Christos Faloutsos. Realistic, mathematically tractable graph
    generation and evolution, using Kronecker multiplication.
    European Conference on Principles and Practice of Knowledge
    Discovery in Databases, 2005.
    
    [14] Graph500 benchmark. www.graph500.org.
    

    (5) GAP-urand (|V|=134.2M, |E|=2,147.4M, undirected) is synthetically generated by the Erdős–Rényi model (uniform random) [11]. With respect to locality, it represents the worst case, as every vertex has equal probability of being a neighbor of every other vertex. When contrasted with the similarly sized kron graph, it demonstrates the impact of kron's scale-free property.

    [11] Paul Erdős and Alfréd Rényi. On random graphs I.
    Publicationes Mathematicae, 6:290–297, 1959.
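
    For reference, the Graph 500 parameters quoted above drive a recursive quadrant-sampling (R-MAT/Kronecker) procedure. A minimal Python sketch with toy sizes (GAP-kron itself uses scale 27, i.e. |V| = 2^27):

    import random

    def rmat_edge(scale, rng, a=0.57, b=0.19, c=0.19):
        """Sample one edge of a 2^scale-vertex Kronecker graph (d = 1 - a - b - c)."""
        u = v = 0
        for _ in range(scale):  # choose one quadrant per bit, most significant first
            u <<= 1
            v <<= 1
            r = rng.random()
            if r < a:
                pass        # top-left quadrant: bits (0, 0)
            elif r < a + b:
                v |= 1      # top-right quadrant: bits (0, 1)
            elif r < a + b + c:
                u |= 1      # bottom-left quadrant: bits (1, 0)
            else:
                u |= 1      # bottom-right quadrant: bits (1, 1)
                v |= 1
        return u, v

    rng = random.Random(42)
    edges = [rmat_edge(10, rng) for _ in range(16 * (1 << 10))]  # edge factor 16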
    
  9. clinical-synthetic-text-kg

    • huggingface.co
    Updated Jun 23, 2024
    Cite
    Ran Xu (2024). clinical-synthetic-text-kg [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-kg
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 23, 2024
    Authors
    Ran Xu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge used during generation is drawn from knowledge graphs.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000 synthetic… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-kg.

  10. Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2.

    • plos.figshare.com
    xls
    Updated Jun 5, 2025
    Cite
    Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi (2025). Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2. [Dataset]. http://doi.org/10.1371/journal.pone.0319031.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrea Ranieri; Floriana Pichiorri; Emma Colamarino; Febo Cincotti; Donatella Mattia; Jlenia Toppi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetic data generating parameters. The table summarizes the generating parameters for synthetic networks showing the corresponding symbol, name and range after the application of the constraints in Section e.2.

  11. Data from: IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation

    • data.niaid.nih.gov
    Updated Jun 14, 2023
    Cite
    Thiviyan Thanapalasingam; Emile van Krieken; Peter Bloem; Paul Groth (2023). IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7824817
    Explore at:
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    University of Amsterdam
    Vrije Universiteit Amsterdam
    Authors
    Thiviyan Thanapalasingam; Emile van Krieken; Peter Bloem; Paul Groth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IntelliGraphs is a collection of datasets for benchmarking Knowledge Graph Generation models. It consists of three synthetic datasets (syn-paths, syn-tipr, syn-types) and two real-world datasets (wd-movies, wd-articles). There is also a Python package available that loads these datasets and verifies new graphs using the semantics pre-defined for each dataset. It can also be used as a testbed for developing new generative models.

  12. Summary of the functions in the package EATME.

    • plos.figshare.com
    xls
    Updated Oct 3, 2024
    Cite
    Li-Pang Chen; Cheng-Kuan Lin (2024). Summary of the functions in the package EATME. [Dataset]. http://doi.org/10.1371/journal.pone.0308828.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Li-Pang Chen; Cheng-Kuan Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, we introduce an R package, EATME (Exponentially weighted moving average (EWMA) control chart with Adjustments To Measurement Error). The main purpose of this package is to correct for measurement error effects in continuous or binary random variables and to develop corrected control charts based on the EWMA statistic. The corrected control charts can accurately detect an out-of-control process. The package contains a function to generate synthetic data and includes functions to determine a reasonable control-limit coefficient as well as to estimate the average run length. Moreover, for visualization, we also provide control charts showing the monitoring of in-control and out-of-control processes. Finally, the functions in this package are clearly demonstrated, and numerical studies show the validity of the package.
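
    For orientation, the statistic such charts monitor is z_t = lambda*x_t + (1 - lambda)*z_{t-1}, flagged when it leaves its control limits. A minimal Python sketch of a plain EWMA chart, without EATME's measurement-error correction (lambda_ and L are the usual smoothing constant and control-limit coefficient):

    import math

    def ewma_out_of_control(x, mu0, sigma, lambda_=0.2, L=2.7):
        """Return the 1-based time indices where the EWMA statistic exceeds its limits."""
        z = mu0
        flagged = []
        for t, xt in enumerate(x, start=1):
            z = lambda_ * xt + (1 - lambda_) * z
            # time-varying half width of the EWMA control limits
            half_width = L * sigma * math.sqrt(
                lambda_ / (2 - lambda_) * (1 - (1 - lambda_) ** (2 * t)))
            if abs(z - mu0) > half_width:
                flagged.append(t)
        return flagged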

  13. Code underlying the publication: Online graph filter design over expanding graphs

    • data.4tu.nl
    zip
    Updated Nov 13, 2024
    Cite
    Bishwadeep Das (2024). Code underlying the publication: Online graph filter design over expanding graphs [Dataset]. http://doi.org/10.4121/aabf6ecd-ce11-4427-9fbc-9a769e16de49.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Bishwadeep Das
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Dataset funded by
    NWO
    Description

    The research objective is to design online algorithms for graph filter design over expanding graphs, under conditions of both known and unknown connectivity. The datasets used in this paper are available online, and code for generating synthetic data is included. The folder Recsys_new contains the experimental setup and online algorithms for movie rating prediction on MovieLens100k. The folder Stochastic_Synthetic_New contains the experimental setup and online algorithms for signal interpolation on synthetic expanding graphs. The folder Stochastic_covid contains the code for COVID case count prediction over a growing city network.


  14. Dataset Artifact for paper "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?"

    • data.niaid.nih.gov
    Updated Aug 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pham, Luan; Ha, Huong; Zhang, Hongyu (2024). Dataset Artifact for paper "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13305662
    Explore at:
    Dataset updated
    Aug 25, 2024
    Dataset provided by
    Chongqing University
    RMIT University
    Authors
    Pham, Luan; Ha, Huong; Zhang, Hongyu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artifacts for the paper titled "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?".

    This artifact repository contains 9 compressed folders, as follows:

    1. syn_circa.zip: CIRCA10 and CIRCA50 datasets for causal discovery

    2. syn_rcd.zip: RCD10 and RCD50 datasets for causal discovery

    3. syn_causil.zip: CausIL10 and CausIL50 datasets for causal discovery

    4. rca_circa.zip: CIRCA10 and CIRCA50 datasets for RCA

    5. rca_rcd.zip: RCD10 and RCD50 datasets for RCA

    6. online-boutique.zip: Online Boutique dataset for RCA

    7. sock-shop-1.zip: Sock Shop 1 dataset for RCA

    8. sock-shop-2.zip: Sock Shop 2 dataset for RCA

    9. train-ticket.zip: Train Ticket dataset for RCA

    Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).

    Details about the generation of our datasets

    1. Synthetic datasets

    We use three different synthetic data generators from three previous RCA studies [15, 25, 28] to create the synthetic datasets: the CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:

    1. The CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node is generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.

    2. The RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.

    3. The CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator does not have the capability to inject faults.

    To create our synthetic datasets, we first generate 10 DAGs whose nodes range from 10 to 50 for each of the synthetic data generators. Next, we generate fault-free datasets using these DAGs with different seedings, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., syn_rcd, syn_circa) are used to evaluate causal discovery methods, while the faulty datasets (e.g., rca_rcd, rca_circa) are used to assess RCA methods.
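
    The CIRCA-style mechanism, for instance, can be sketched in a few lines of Python; the coefficients, sizes, and fault magnitude below are illustrative assumptions:

    import numpy as np

    def simulate(dag_parents, T=500, fault_node=None, fault_t=(250, 251), seed=0):
        """VAR-like series over a DAG; dag_parents maps node -> parent list,
        keyed 0..n-1 in topological order. The fault inflates one noise term."""
        rng = np.random.default_rng(seed)
        n = len(dag_parents)
        X = np.zeros((T, n))
        for t in range(1, T):
            for i in range(n):  # parents come earlier in the ordering
                noise = rng.normal()
                if i == fault_node and t in fault_t:
                    noise += 10.0  # the injected fault
                X[t, i] = 0.5 * X[t - 1, i] + sum(
                    0.8 * X[t, p] for p in dag_parents[i]) + noise
        return X

    data = simulate({0: [], 1: [0], 2: [1]}, fault_node=1)  # chain 0 -> 1 -> 2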

    2. Data collected from benchmark microservice systems

    We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted by AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to explore all services with 100 to 200 users concurrently. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the figure below.

    Code

    The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.

    References

    As in our paper.

  15. Human Labeled OHLCV Stock Market Data

    • kaggle.com
    zip
    Updated Mar 26, 2025
    Cite
    Barathan Aslan (2025). Human Labeled OHLCV Stock Market Data [Dataset]. https://www.kaggle.com/datasets/barathanaslan/human-labeled-synthetic-stock-market-data
    Explore at:
    Available download formats: zip (9,914,465 bytes)
    Dataset updated
    Mar 26, 2025
    Authors
    Barathan Aslan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    This dataset provides synthetically generated financial time series data, presented as OHLCV (Open-High-Low-Close-Volume) candlestick charts. A key feature of this dataset is the inclusion of technical analysis annotations (labels) meticulously created by a human analyst for each chart.

    The primary goal is to offer a resource for training and evaluating machine learning models focused on automated technical analysis and chart pattern recognition. By providing synthetic data with high-quality human labels, this dataset aims to facilitate research and development in areas like algorithmic trading and financial visualization analysis.

    This is an evolving dataset. It represents the initial phase of a larger labeling effort, and future updates are planned to incorporate a greater number and variety of labeled chart patterns.

    Content

    The dataset is provided entirely as a collection of JSON files. Each file represents a single 300-candle chart window and contains:

    1. metadata: Contains basic information related to the generation of the file (e.g., generation timestamp, version).
    2. ohlcv_data: A sequence of 300 data points. Each point is a dictionary representing one time candle and includes:
      • time: Timestamp string (ISO 8601 format). Note: These timestamps maintain realistic intra-day time progression (hours, minutes), but the specific dates (Day, Month, Year) are entirely synthetic and do not align with real-world calendar dates.
      • open, high, low, close: Numerical values representing the candle's price range. Note: These values are synthetic and are not tied to any real financial instrument's price.
      • volume: A numerical value representing activity during the candle's period. Note: This is also a synthetic value.
    3. labels: A dictionary containing the human-provided technical analysis annotations for the corresponding chart window:
      • horizontal_lines: A list of structures, each containing a price key. These typically denote significant horizontal levels identified by the labeler, such as support or resistance.
      • ray_lines: A list of structures, each defining a line segment via start_date, start_price, end_date, and end_price. These are used to represent patterns like trendlines, channel boundaries, or other linear formations observed by the labeler.
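
    Reading one chart window is then straightforward; a minimal sketch following the structure above (the file name is a placeholder):

    import json

    with open("chart_0001.json") as f:  # placeholder file name
        window = json.load(f)

    candles = window["ohlcv_data"]  # 300 dicts: time/open/high/low/close/volume
    closes = [c["close"] for c in candles]
    levels = [h["price"] for h in window["labels"]["horizontal_lines"]]
    rays = window["labels"]["ray_lines"]  # start_date/start_price/end_date/end_price
    print(len(candles), len(levels), len(rays))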

    Data Generation Approach

    The dataset features synthetically generated candlestick patterns. The generation process focuses on creating structurally plausible chart sequences. Human analysts then carefully review these sequences and apply relevant technical analysis labels (support, resistance, trendlines).

    While the patterns may resemble those seen in financial markets, the underlying numerical data (price, volume, and the associated timestamps) is artificial and intentionally detached from any real-world financial data. Users should focus on the relative structure of the candles and the associated human-provided labels, rather than interpreting the absolute values as representative of any specific market or time.

    Acknowledgements

    This dataset is made possible through ongoing human labeling efforts and custom data generation software.

    Inspiration

    • Train models (e.g., CNNs, Transformers) to recognize support/resistance levels and trendlines directly from chart data.
    • Develop and benchmark algorithms for automated technical analysis pattern detection.
    • Use as a basis for generating further augmented chart data for ML training.
    • Explore novel approaches to financial time series analysis using labeled, synthetic data.
  16. difficult_problem_dataset_v3

    • huggingface.co
    Updated Sep 30, 2025
    Cite
    ikedachin (2025). difficult_problem_dataset_v3 [Dataset]. https://huggingface.co/datasets/ikedachin/difficult_problem_dataset_v3
    Explore at:
    Dataset updated
    Sep 30, 2025
    Authors
    ikedachin
    Description

    Overview

    This is a synthetic dataset created using the Scalable Data Generation (SDG) framework. It is structured for use with a thinking model; the input and output form question-and-answer pairs.

      Data Generation Pipeline
    

    Question Generation

    Model: Qwen/Qwen3-30B-A3B-Instruct-2507
    The model assumes the role of an academic graph expert and generates Ph.D.-level questions.

    Reasoning Process Generation

    Model: openai/gpt-oss-120b
    Reassign the role of… See the full description on the dataset page: https://huggingface.co/datasets/ikedachin/difficult_problem_dataset_v3.

  17. Summary of the arguments of the functions in the package EATME.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Oct 3, 2024
    Cite
    Li-Pang Chen; Cheng-Kuan Lin (2024). Summary of the arguments of the functions in the package EATME. [Dataset]. http://doi.org/10.1371/journal.pone.0308828.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 3, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Li-Pang Chen; Cheng-Kuan Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of the arguments of the functions in the package EATME.

  18. reasoning-biochem

    • huggingface.co
    Updated Oct 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Extrasensory AI (2024). reasoning-biochem [Dataset]. https://huggingface.co/datasets/extrasensory/reasoning-biochem
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Extrasensory AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This is a synthetic reasoning dataset generated from the PrimeKG biomedical knowledge graph. It contains verifiable reasoning traces generated using the approach outlined in Synthetic CoT Reasoning Trace Generation from Knowledge Graphs. The synthetic chain-of-thought data is generated procedurally using program synthesis and logic programming which is able to produce vast quantities of verifiable forward reasoning traces with minimal human oversight. The benchmark is intended to be used to… See the full description on the dataset page: https://huggingface.co/datasets/extrasensory/reasoning-biochem.
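
    The flavor of such procedural generation can be illustrated with a toy forward-chaining step over triples; this is a sketch of the general idea, not the authors' pipeline:

    # Toy Horn rule: (X r1 Y) and (Y r2 Z) => (X r3 Z), with the derivation recorded.
    triples = {("aspirin", "treats", "pain"), ("pain", "symptom_of", "injury")}
    rule = ("treats", "symptom_of", "indicated_for")

    trace = []
    for (x, r1, y) in sorted(triples):
        for (y2, r2, z) in sorted(triples):
            if (r1, r2) == rule[:2] and y == y2:
                derived = (x, rule[2], z)
                trace.append(f"{x} {r1} {y}; {y} {r2} {z} => {' '.join(derived)}")
                triples.add(derived)

    print("\n".join(trace))  # a verifiable step-by-step derivation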

  19. Data from: The Least Cost Directed Perfect Awareness Problem - Benchmark Instances and Solutions

    • data.mendeley.com
    Updated Nov 11, 2024
    Cite
    Felipe Pereira (2024). The Least Cost Directed Perfect Awareness Problem - Benchmark Instances and Solutions [Dataset]. http://doi.org/10.17632/xgtjgzf28r.3
    Explore at:
    Dataset updated
    Nov 11, 2024
    Authors
    Felipe Pereira
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    This dataset contains complementary data to the paper "The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations" [1]. Here, we make available two sets of instances of the combinatorial optimization problem studied in that paper, which deals with the spread of information on social networks. We also provide the best known solutions and bounds obtained through computational experiments for each instance.

    The first input set includes 300 synthetic instances composed of graphs that resemble real-world social networks. These graphs were produced with a generator proposed in [2]. The second set consists of 14 instances built from graphs obtained by crawling Twitter [3].

    The directories "synthetic_instances" and "twitter_instances" contain files that describe both sets of instances, all of which follow a common format.

    The directories "solutions_for_synthetic_instances" and "solutions_for_twitter_instances" contain files that describe the best known solutions for both sets of instances, all of which also follow a common format: the first line gives the number of vertices in the solution, and each subsequent line describes one of those vertices.

    Lastly, two files, namely "bounds_for_synthetic_instances.csv" and "bounds_for_twitter_instances.csv", enumerate the values of the best known lower and upper bounds for both sets of instances.

    This work was supported by grants from Santander Bank, Brazil, Brazilian National Council for Scientific and Technological Development (CNPq), Brazil, São Paulo Research Foundation (FAPESP), Brazil.

    Caveat: the opinions, hypotheses and conclusions or recommendations expressed in this material are the responsibility of the authors and do not necessarily reflect the views of Santander, CNPq, or FAPESP.

    References

    [1] F. C. Pereira, P. J. de Rezende. The Least Cost Directed Perfect Awareness Problem: Complexity, Algorithms and Computations. Submitted. 2023.

    [2] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan. Directed scale-free graphs. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pages 132–139, 2003.

    [3] C. Schweimer, C. Gfrerer, F. Lugstein, D. Pape, J. A. Velimsky, R. Elsässer, and B. C. Geiger. Generating simple directed social network graphs for information spreading. In Proceedings of the ACM Web Conference 2022, WWW ’22, pages 1475–1485, 2022.

  20. Datasets of synthetic workflows for evaluating a multi-objective and multi-constrained scheduling approach for cyber-physical applications

    • data.europa.eu
    • ieee-dataport.org
    • +1more
    unknown
    Updated Apr 15, 2024
    Cite
    Zenodo (2024). Datasets of synthetic workflows for evaluating a multi-objective and multi-constrained scheduling approach for cyber-physical applications [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10978009?locale=en
    Explore at:
    Available download formats: unknown (306,914 bytes)
    Dataset updated
    Apr 15, 2024
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets of synthetic workflows (task graphs) were generated to evaluate the performance and scalability of a multi-objective and multi-constrained scheduling approach for workflow applications of various structures, sizes, and sensing/actuating requirements in a cyber-physical system (CPS) based on the edge-hub-cloud paradigm. The examined CPS comprised four edge devices (i.e., single-board computers, each attached to an unmanned aerial vehicle (UAV) equipped with sensors/actuators) interacting with a hub device (e.g., a laptop), which in turn communicated with a more computationally capable cloud server. All system devices featured heterogeneous multicore processors with different processing core failure rates and varied sensing/actuating or other specialized capabilities. Our objectives were the minimization of the overall latency, the minimization of the overall energy consumption, and the maximization of the overall reliability of the workflow application in the specific CPS, under deadline, reliability, memory, storage, energy, capability, and task precedence constraints. We generated 25 random task graphs with 10, 20, 30, 40, and 50 nodes (5 task graphs for each size), utilizing the Task Graphs For Free (TGFF) random task graph generator [1], [2]. Additional task parameters (e.g., execution time, power consumption, memory, storage, output data size, capability, reliability threshold) were included post-generation, using appropriate values. More details are provided in README.txt.

    References:

    [1] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE), 1998, pp. 97-101, doi: 10.1109/HSC.1998.666245.

    [2] R. P. Dick, D. L. Rhodes, and K. Vallerio, "TGFF," https://robertdick.org/projects/tgff/.
