54 datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contain actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • The common identifier dataset_30 indicates that both files belong to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    UniProt ID | string | protein identification
    label | string | protein label (type of node)
    properties | string | a dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    Relationship ID | string | relationship identification
    Source ID | string | identification of the source protein in the relationship
    Target ID | string | identification of the target protein in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.

    Metadata

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_30* | 30 | 47 | Y
    dataset_60* | 60 | 181 | Y
    dataset_120* | 120 | 689 | Y
    dataset_240* | 240 | 2819 | Y
    dataset_300* | 300 | 4658 | Y
    dataset_600* | 600 | 18004 | Y
    dataset_1200* | 1200 | 71785 | Y
    dataset_2400* | 2400 | 288600 | Y
    dataset_3000* | 3000 | 449727 | Y
    dataset_6000* | 6000 | 1799413 | Y
    dataset_12000* | 12000 | 7199863 | Y
    dataset_24000* | 24000 | 28792361 | Y
    dataset_30000* | 30000 | 44991744 | Y

    This repository also includes two (2) additional tiny graph datasets to experiment with before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | node identification
    label | string | node label (type of node)
    properties | string | a dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | relationship identification
    source | string | identification of the source node in the relationship
    target | string | identification of the target node in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy* | 3 | 6 | N
    dataset_dummy2* | 3 | 6 | N
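
    A minimal loading sketch (assuming pandas and networkx are installed; the exact CSV header strings matching the column names listed above is an assumption):

        import pandas as pd
        import networkx as nx

        # One file per graph part: nodes and edges share the dataset_30 identifier.
        nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
        edges = pd.read_csv("dataset_30_edges_interactions.csv")

        g = nx.DiGraph()
        for _, row in nodes.iterrows():
            g.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])
        for _, row in edges.iterrows():
            g.add_edge(row["Source ID"], row["Target ID"],
                       id=row["Relationship ID"], label=row["label"])

        print(g.number_of_nodes(), g.number_of_edges())  # expect 30 and 47 for dataset_30
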
  2. doc-formats-csv-3

    • huggingface.co
    Updated Nov 23, 2023
    Cite
    Datasets examples (2023). doc-formats-csv-3 [Dataset]. https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3
    Explore at:
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 23, 2023
    Dataset authored and provided by
    Datasets examples
    Description

    [doc] formats - csv - 3

    This dataset contains one csv file at the root:

    data.csv

    ignored comment

    col1|col2
    dog|woof
    cat|meow
    pokemon|pika
    human|hello

    We define the config name in the YAML config, as well as the exact location of the file, the separator ("|"), the names of the columns, and the number of rows to ignore (row #1 is the row of column headers, which will be replaced by the names option, and row #0 is ignored). The reference for the options is the documentation… See the full description on the dataset page: https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3.
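
    A hedged sketch (this is not the dataset's actual YAML config, and the column names chosen here are illustrative) showing the same parsing choices with pandas: skip the comment row, split on "|", and replace the header row with explicit column names:

        import pandas as pd

        df = pd.read_csv(
            "data.csv",
            sep="|",
            skiprows=1,                 # row #0 is the ignored comment line
            header=0,                   # row #1 holds the original col1|col2 headers...
            names=["kind", "sound"],    # ...which are replaced, as the names option does
        )
        print(df)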

  3. Data from: KGCW 2024 Challenge @ ESWC 2024

    • data.niaid.nih.gov
    • investigacion.usc.gal
    • + 2 more
    Updated Jun 11, 2024
    Cite
    Iglesias, Ana (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10721874
    Explore at:
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Chaves-Fraga, David
    Van Assche, Dylan
    Iglesias, Ana
    Serles, Umutcan
    Dimou, Anastasia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge Graph Construction Workshop 2024: challenge

    Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics, e.g. CPU or memory usage, are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.

    Task description

    The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge, as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso, which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

    Track 1: Conformance

    The new set of specifications for the RDF Mapping Language (RML), established by the W3C Community Group on Knowledge Graph Construction, provides a set of test cases for each module:

    RML-Core

    RML-IO

    RML-CC

    RML-FNML

    RML-Star

    These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementations yet; thus, it may contain bugs and issues. If you find problems with the mappings, output, etc., please report them to the corresponding repository of each module.

    Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

    Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

    Track 2: Performance

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

    Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps for each parameter:

    1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
    2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

    The pipeline is executed 5 times, from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. The knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool; you can adapt the execution plans for this example pipeline to your own needs.

    Each parameter has its own directory in the ground truth dataset with the following files:

    Input dataset as CSV.

    Mapping file as RML.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

    Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.
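
    A hedged sketch (illustrative only, not the challenge tool's own aggregation code) of how these three metrics can be derived from raw per-step measurements:

        def summarize_step(begin_wall, end_wall, begin_cpu, end_cpu, memory_samples_mb):
            return {
                "execution_time_s": end_wall - begin_wall,  # difference of begin and end wall-clock time
                "cpu_time_s": end_cpu - begin_cpu,          # difference of begin and end CPU time
                "memory_min_mb": min(memory_samples_mb),    # minimal memory consumption
                "memory_max_mb": max(memory_samples_mb),    # maximal memory consumption
            }

        # Example: a step that ran from t=10.0s to t=42.5s, used 30.1s of CPU,
        # and was sampled at 512, 1024 and 768 MB of memory.
        print(summarize_step(10.0, 42.5, 0.0, 30.1, [512, 1024, 768]))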

    Expected output

    Duplicate values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500020 triples
    50 percent | 1000020 triples
    75 percent | 500020 triples
    100 percent | 20 triples

    Empty values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500000 triples
    50 percent | 1000000 triples
    75 percent | 500000 triples
    100 percent | 0 triples

    Mappings

    Scale | Number of Triples
    1TM + 15POM | 1500000 triples
    3TM + 5POM | 1500000 triples
    5TM + 3POM | 1500000 triples
    15TM + 1POM | 1500000 triples

    Properties

    Scale | Number of Triples
    1M rows 1 column | 1000000 triples
    1M rows 10 columns | 10000000 triples
    1M rows 20 columns | 20000000 triples
    1M rows 30 columns | 30000000 triples

    Records

    Scale | Number of Triples
    10K rows 20 columns | 200000 triples
    100K rows 20 columns | 2000000 triples
    1M rows 20 columns | 20000000 triples
    10M rows 20 columns | 200000000 triples

    Joins

    1-1 joins

    Scale | Number of Triples
    0 percent | 0 triples
    25 percent | 125000 triples
    50 percent | 250000 triples
    75 percent | 375000 triples
    100 percent | 500000 triples

    1-N joins

    Scale | Number of Triples
    1-10 0 percent | 0 triples
    1-10 25 percent | 125000 triples
    1-10 50 percent | 250000 triples
    1-10 75 percent | 375000 triples
    1-10 100 percent | 500000 triples
    1-5 50 percent | 250000 triples
    1-10 50 percent | 250000 triples
    1-15 50 percent | 250005 triples
    1-20 50 percent | 250000 triples

    N-1 joins

    Scale | Number of Triples
    10-1 0 percent | 0 triples
    10-1 25 percent | 125000 triples
    10-1 50 percent | 250000 triples
    10-1 75 percent | 375000 triples
    10-1 100 percent | 500000 triples
    5-1 50 percent | 250000 triples
    10-1 50 percent | 250000 triples
    15-1 50 percent | 250005 triples
    20-1 50 percent | 250000 triples

    N-M joins

    Scale | Number of Triples
    5-5 50 percent | 1374085 triples
    10-5 50 percent | 1375185 triples
    5-10 50 percent | 1375290 triples
    5-5 25 percent | 718785 triples
    5-5 50 percent | 1374085 triples
    5-5 75 percent | 1968100 triples
    5-5 100 percent | 2500000 triples
    5-10 25 percent | 719310 triples
    5-10 50 percent | 1375290 triples
    5-10 75 percent | 1967660 triples
    5-10 100 percent | 2500000 triples
    10-5 25 percent | 719370 triples
    10-5 50 percent | 1375185 triples
    10-5 75 percent | 1968235 triples
    10-5 100 percent | 2500000 triples

    GTFS Madrid Bench

    Generated Knowledge Graph

    Scale | Number of Triples
    1 | 395953 triples
    10 | 3959530 triples
    100 | 39595300 triples
    1000 | 395953000 triples

    Queries

    Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000
    Q1 | 58540 results | 585400 results | No results available | No results available
    Q2 | 636 results | 11998 results | 125565 results | 1261368 results
    Q3 | 421 results | 4207 results | 42067 results | 420667 results
    Q4 | 13 results | 130 results | 1300 results | 13000 results
    Q5 | 35 results | 350 results | 3500 results | 35000 results
    Q6 | 1 result | 1 result | 1 result | 1 result
    Q7 | 68 results | 67 results | 67 results | 53 results
    Q8 | 35460 results | 354600 results | No results available | No results available
    Q9 | 130 results | 1300

  4. ENTSO-E Hydropower modelling data (PECD) in CSV format

    • zenodo.org
    csv
    Updated Aug 14, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 14, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice; Matteo De Felice
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PECD Hydro modelling

    This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

    The original URLs:

    The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

    As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

    Data description

    The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

    In this repository you can find 5 CSV files:

    • PECD-hydro-capacities.csv: installed capacities
    • PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
    • PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
    • PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
    • PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

    Capacities

    The file PECD-hydro-capacities.csv contains: run-of-river capacity (MW) and storage capacity (GWh); reservoir plant capacity (MW) and storage capacity (GWh); closed-loop pumping/turbining (MW) and storage capacity; and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
    • sheet Reservoir, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3
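
    A hedged extraction sketch (hypothetical file name; assumes the sheet name above, 1-based row/column numbers, and that openpyxl is installed), mirroring the run-of-river block at rows 5 to 7 and columns 2 to 5:

        import pandas as pd

        raw = pd.read_excel(
            "PEMM_example_zone.xlsx",
            sheet_name="Run-of-River and pondage",
            header=None,                 # keep the sheet as a plain grid
        )
        ror_block = raw.iloc[4:7, 1:5]   # rows 5-7, columns 2-5 in 0-based slicing
        print(ror_block)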

    Inflows

    The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 16 to 51
    • sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

    Daily run-of-river

    The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

    Minimum and maximum reservoir generation

    The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 196 to 231
    • sheet Reservoir, rows from 13 to 66, columns from 232 to 267

    Minimum/Maximum reservoir levels

    The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at the beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 14 to 66, column 12
    • sheet Reservoir, rows from 14 to 66, column 13

    CHANGELOG

    [2020/07/17] Added maximum generation for the reservoir

  5. WBPHS segment counts and segment effort, 1955-Present

    • catalog.data.gov
    Updated Feb 22, 2025
    Cite
    U.S. Fish and Wildlife Service (2025). WBPHS segment counts and segment effort, 1955-Present [Dataset]. https://catalog.data.gov/dataset/wbphs-segment-counts-and-segment-effort-1955-present
    Explore at:
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    U.S. Fish and Wildlife Service
    Description

    The segment counts by social group and species or species group for the Waterfowl Breeding Population and Habitat Survey, and the associated segment effort information. Three data files are included with their associated metadata (html and xml formats). Segment counts are summed counts of waterfowl per segment and are separated into two files, described below, along with the effort table needed to analyze recent segment count information.

    wbphs_segment_counts_1955to1999_forDistribution.csv represents the period prior to the collection of geolocated data. There is no associated effort file for these counts; segments with zero birds are included in the segment counts table, so effort can be inferred. There is no information to determine the proportion of each segment surveyed for this period, and it must be presumed they were surveyed completely. Number of rows in table = 1,988,290.

    wbphs_segment_counts_forDistribution.csv contains positive segment records only, by species or species group, beginning with 2000. The wbphs_segment_effort_forDistribution.csv file is important for this segment counts file and can be used to infer zero-value segments, by species or species group. Number of rows in table = 365,863.

    wbphs_segment_effort_forDistribution.csv contains the segment survey effort and location from the Waterfowl Breeding Population and Habitat Survey beginning with 2000. If a segment was not flown, it is absent from the table for the corresponding year. Number of rows in table = 65,122.

    Also included here is a small R code file, createSingleSegmentCountTable.R, which can be run to format the 2000+ data to match the 1955-1999 format and combine the data over the two time periods. Please consult the metadata for an explanation of the fields and other information to understand the limitations of the data.
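
    A hypothetical Python analogue of that last step (the repository's own tool is the R script createSingleSegmentCountTable.R; this sketch only stacks the two periods on their shared columns and skips the zero-inference that the R code handles properly):

        import pandas as pd

        early = pd.read_csv("wbphs_segment_counts_1955to1999_forDistribution.csv")
        recent = pd.read_csv("wbphs_segment_counts_forDistribution.csv")

        # Keep only the columns the two periods share, then append the 2000+ records.
        common = [c for c in early.columns if c in recent.columns]
        combined = pd.concat([early[common], recent[common]], ignore_index=True)
        print(len(combined))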

  6. KGCW 2023 Challenge @ ESWC 2023

    • zenodo.org
    • investigacion.usc.gal
    application/gzip
    Updated Apr 15, 2024
    Cite
    Dylan Van Assche; Dylan Van Assche; David Chaves-Fraga; David Chaves-Fraga; Anastasia Dimou; Anastasia Dimou; Umutcan Şimşek; Umutcan Şimşek; Ana Iglesias; Ana Iglesias (2024). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. http://doi.org/10.5281/zenodo.7837289
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dylan Van Assche; Dylan Van Assche; David Chaves-Fraga; David Chaves-Fraga; Anastasia Dimou; Anastasia Dimou; Umutcan Şimşek; Umutcan Şimşek; Ana Iglesias; Ana Iglesias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge Graph Construction Workshop 2023: challenge

    Knowledge graph construction of heterogeneous data has seen a lot of uptake
    in the last decade from compliance to performance optimizations with respect
    to execution time. Besides execution time as a metric for comparing knowledge
    graph construction, other metrics e.g. CPU or memory usage are not considered.
    This challenge aims at benchmarking systems to find which RDF graph
    construction system optimizes for metrics e.g. execution time, CPU,
    memory usage, or a combination of these metrics.

    Task description

    The task is to reduce and report the execution time and computing resources
    (CPU and memory usage) for the parameters listed in this challenge, compared
    to the state-of-the-art of the existing tools and the baseline results provided
    by this challenge. This challenge is not limited to execution times to create
    the fastest pipeline, but also computing resources to achieve the most efficient
    pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also
    collects and aggregates the metrics such as execution time, CPU and memory
    usage, necessary for this challenge as CSV files. Moreover, the information
    about the hardware used during the execution of the pipeline is available as
    well to allow fairly comparing different pipelines. Your pipeline should consist
    of Docker images which can be executed on Linux to run the tool. The tool is
    already tested with existing systems, relational databases e.g. MySQL and
    PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso
    which can be combined in any configuration. It is strongly encouraged to use
    this tool for participating in this challenge. If you prefer to use a different
    tool or our tool imposes technical requirements you cannot solve, please contact
    us directly.

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    • Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
    • Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
    • Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    • Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
    • Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
    • Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    • GTFS-1 SQL
    • GTFS-10 SQL
    • GTFS-100 SQL
    • GTFS-1000 SQL

    Heterogeneity

    • GTFS-100 XML + JSON
    • GTFS-100 CSV + XML
    • GTFS-100 CSV + JSON
    • GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps
    for each parameter:

    1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
    2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.
    3. The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.
    4. The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

    The pipeline is executed 5 times from which the median execution time of each
    step is calculated and reported. Each step with the median execution time is
    then reported in the baseline results with all its measured metrics.
    Query timeout is set to 1 hour and knowledge graph construction timeout
    to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,
    you can adapt the execution plans for this example pipeline to your own needs.

    Each parameter has its own directory in the ground truth dataset with the
    following files:

    • Input dataset as CSV.
    • Mapping file as RML.
    • Queries as SPARQL.
    • Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    • Input dataset as CSV for each parameter.
    • Mapping file as RML for each parameter.
    • SPARQL queries to retrieve the results for each parameter.
    • Baseline results for each parameter with the example pipeline.
    • Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV, depending on the parameter that is being
    evaluated, the number of rows and columns may differ. The first row is always
    the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    • Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.
    • Mapping file as RML for both scaling and heterogeneity.
    • SPARQL queries to retrieve the results.
    • Baseline results with the example pipeline.
    • Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row.
    JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    • Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.
    • CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.
    • Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.

    Expected output

    Duplicate values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500020 triples
    50 percent | 1000020 triples
    75 percent | 500020 triples
    100 percent | 20 triples

    Empty values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500000 triples
    50 percent | 1000000 triples
    75 percent | 500000 triples
    100 percent | 0 triples

    Mappings

    Scale | Number of Triples
    1TM + 15POM | 1500000 triples
    3TM + 5POM | 1500000 triples
    5TM + 3POM | 1500000 triples
    15TM + 1POM | 1500000 triples

    Properties

    Scale | Number of Triples
    1M rows 1 column | 1000000 triples
    1M rows 10 columns | 10000000 triples
    1M rows 20 columns | 20000000 triples
    1M rows 30 columns | 30000000 triples

    Records

    Scale | Number of Triples
    10K rows 20 columns | 200000 triples
    100K rows 20 columns | 2000000 triples
    1M rows 20 columns | 20000000 triples
    10M rows 20 columns | 200000000 triples

    Joins

    1-1 joins

    Scale | Number of Triples
    0 percent | 0 triples
    25 percent | 125000 triples
    50 percent | 250000 triples
    75 percent | 375000 triples
    100 percent | 500000 triples

    1-N joins

    Scale | Number of Triples
    1-10 0 percent | 0 triples
    1-10 25 percent | 125000 triples
    1-10 50 percent | 250000 triples
    1-10 75 percent | 375000

  7. BF skip indexes for Ethereum

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 26, 2024
    Cite
    Loporchio, Matteo (2024). BF skip indexes for Ethereum [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957140
    Explore at:
    Dataset updated
    Dec 26, 2024
    Dataset authored and provided by
    Loporchio, Matteo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General information

    This repository includes all data needed to reproduce the experiments presented in [1]. The paper describes the BF skip index, a data structure based on Bloom filters [2] that can be used for answering inter-block queries on blockchains efficiently. The article also includes a historical analysis of logsBloom filters included in the Ethereum block headers, as well as an experimental analysis of the proposed data structure. The latter was conducted using the data set of events generated by the CryptoKitties Core contract, a popular decentralized application launched in 2017 (and also one of the first applications based on NFTs).

    In this description, we use the following abbreviations (also adopted throughout the paper) to denote two different sets of Ethereum blocks.

    D1: set of all Ethereum blocks between height 0 and 14999999.

    D2: set of all Ethereum blocks between height 14000000 and 14999999.

    Moreover, in accordance with the terminology adopted in the paper, we define the set of keys of a block as the set of all contract addresses and log topics of the transactions in the block. As defined in [3], log topics comprise event signature digests and the indexed parameters associated with the event occurrence.

    Data set description

    File | Description
    filters_ones_0-14999999.csv.xz | Compressed CSV file containing the number of ones for each logsBloom filter in D1.
    receipt_stats_0-14999999.csv.xz | Compressed CSV file containing statistics about all transaction receipts in D1.
    Approval.csv | CSV file containing the Approval event occurrences for the CryptoKitties Core contract in D2.
    Birth.csv | CSV file containing the Birth event occurrences for the CryptoKitties Core contract in D2.
    Pregnant.csv | CSV file containing the Pregnant event occurrences for the CryptoKitties Core contract in D2.
    Transfer.csv | CSV file containing the Transfer event occurrences for the CryptoKitties Core contract in D2.
    events.xz | Compressed binary file containing information about all contract events in D2.
    keys.xz | Compressed binary file containing information about all keys in D2.

    File structure

    We now describe the structure of the files included in this repository.

    filters_ones_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 3 columns. Note that it is not necessary to decompress this file, as the provided code is capable of processing it directly in its compressed form. The columns have the following meaning.

    blockId: the identifier of the block.

    timestamp: timestamp of the block.

    numOnes: number of bits set to 1 in the logsBloom filter of the block.

    receipt_stats_0-14999999.csv.xz is a compressed CSV file with 15 million rows (one for each block in D1) and 5 columns. As for the previous file, it is not necessary to decompress this file.

    blockId: the identifier of the block.

    txCount: number of transactions included in the block.

    numLogs: number of event logs included in the block.

    numKeys: number of keys included in the block.

    numUniqueKeys: number of distinct keys in the block (useful as the same key may appear multiple times).

    All CSV files related to the CryptoKitties Core events (i.e., Approval.csv, Birth.csv, Pregnant.csv, Transfer.csv) have the same structure. They consist of 1 million rows (one for each block in D2) and 2 columns, namely:

    blockId: identifier of the block.

    numOcc: number of event occurrences in the block.

    events.xz is a compressed binary file describing all unique event occurrences in the blocks of D2. The file contains 1 million data chunks (i.e., one for each Ethereum block). Each chunk includes the following information. Do note that this file only records unique event occurrences in each block, meaning that if an event from a contract is triggered more than once within the same block, there will be only one sequence within the corresponding chunk.

    blockId: identifier of the block (4 bytes).

    numEvents: number of event occurrences in the block (4 bytes).

    A list of numEvents sequences, each made up of 52 bytes. A sequence represents an event occurrence and is the concatenation of two fields, namely:

    Address of the contract triggering the event (20 bytes).

    Event signature digest (32 bytes).

    keys.xz is a compressed binary file describing all unique keys in the blocks of D2. As for the previous file, duplicate keys only appear once. The file contains 1 million data chunks, each representing an Ethereum block and including the following information.

    blockId: identifier of the block (4 bytes)

    numAddr: number of unique contract addresses (4 bytes).

    numTopics: number of unique topics (4 bytes).

    A sequence of numAddr addresses, each represented using 20 bytes.

    A sequence of numTopics topics, each represented using 32 bytes.
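
    A hedged parsing sketch for the events.xz chunk layout described above (assumptions not stated in the description: integers are big-endian and the decompressed file is a plain concatenation of chunks; flip the ">" prefix to "<" for little-endian):

        import lzma
        import struct

        with lzma.open("events.xz", "rb") as f:
            data = f.read()

        offset = 0
        while offset < len(data):
            block_id, num_events = struct.unpack_from(">II", data, offset)  # blockId, numEvents
            offset += 8
            for _ in range(num_events):
                address = data[offset:offset + 20]         # contract address (20 bytes)
                signature = data[offset + 20:offset + 52]  # event signature digest (32 bytes)
                offset += 52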

    Notes

    For space reasons, some of the files in this repository have been compressed using the XZ compression utility. Unless otherwise specified, these files need to be decompressed before they can be read. Please make sure you have an application installed on your system that is capable of decompressing such files.

    Cite this work

    If the data included in this repository have been useful, please cite the following article in your work.

    @article{loporchio2025skip, title={Skip index: Supporting efficient inter-block queries and query authentication on the blockchain}, author={Loporchio, Matteo and Bernasconi, Anna and Di Francesco Maesa, Damiano and Ricci, Laura}, journal={Future Generation Computer Systems}, volume={164}, pages={107556}, year={2025}, publisher={Elsevier} }

    References

    Loporchio, Matteo et al. "Skip index: supporting efficient inter-block queries and query authentication on the blockchain". Future Generation Computer Systems 164 (2025): 107556. https://doi.org/10.1016/j.future.2024.107556

    Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13.7 (1970): 422-426.

    Wood, Gavin. "Ethereum: A secure decentralised generalised transaction ledger." Ethereum project yellow paper 151.2014 (2014): 1-32.

  8. Geochemical data supporting analysis of geochemical conditions and nitrogen...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Geochemical data supporting analysis of geochemical conditions and nitrogen transport in nearshore groundwater and the subterranean estuary at a Cape Cod embayment, East Falmouth, Massachusetts, 2013 [Dataset]. https://catalog.data.gov/dataset/geochemical-data-supporting-analysis-of-geochemical-conditions-and-nitrogen-transport-in-n
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    East Falmouth, Cape Cod, Massachusetts, Falmouth
    Description

    This data release provides analytical and other data in support of an analysis of nitrogen transport and transformation in groundwater and in a subterranean estuary in the Eel River and onshore locations on the Seacoast Shores peninsula, Falmouth, Massachusetts. The analysis is described in U.S. Geological Survey Scientific Investigations Report 2018-5095 by Colman and others (2018).

    This data release is structured as a set of comma-separated values (CSV) files, each of which contains data columns for laboratory (if applicable), USGS Site Name, date sampled, time sampled, and columns of specific analytical and(or) other data. The .csv data files have the same number of rows, and each row in each .csv file corresponds to the same sample. Blank cells in a .csv file indicate that the sample was not analyzed for that constituent.

    The data release also provides a Data Dictionary (Data_Dictionary.csv) that provides the following information for each constituent (analyte): laboratory or data source, data type, description of units, method, minimum reporting limit, limit of quantitation if appropriate, method reference citations, and minimum, maximum, median, and average values for each analyte. The data release also contains a file called Abbreviations in Data_Dictionary.pdf that contains all of the abbreviations in the Data Dictionary and in the well characteristics file in the companion report, Colman and others (2018).
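
    A hedged sketch (hypothetical file selection) of combining the row-aligned CSV files side by side, relying on the stated guarantee that every data file has the same number of rows and that each row refers to the same sample:

        import glob
        import pandas as pd

        # Exclude the Data Dictionary, which is not row-aligned with the sample files.
        paths = [p for p in sorted(glob.glob("*.csv")) if p != "Data_Dictionary.csv"]
        frames = [pd.read_csv(p) for p in paths]
        combined = pd.concat(frames, axis=1)   # align by row position, not by a key
        print(combined.shape)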

  9. Replication data for "High life satisfaction reported among small-scale...

    • dataverse.csuc.cat
    csv, txt
    Updated Feb 7, 2024
    Cite
    Eric Galbraith; Eric Galbraith; Victoria Reyes Garcia; Victoria Reyes Garcia (2024). Replication data for "High life satisfaction reported among small-scale societies with low incomes" [Dataset]. http://doi.org/10.34810/data904
    Explore at:
    Available download formats: csv(1620), csv(7829), txt(7017), csv(227502)
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Eric Galbraith; Eric Galbraith; Victoria Reyes Garcia; Victoria Reyes Garcia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2021 - Oct 24, 2023
    Area covered
    Mongolia, Bulgan soum, Darjeeling, India, Puna, Argentina, Laprak, Nepal, Kumbungu, Ghana, United Republic of, Tanzania, Mafia Island, Bassari country, Senegal, China, Shangri-la, Ba, Fiji, Guatemala, Western highlands
    Dataset funded by
    European Commission
    Description

    This dataset was created in order to document self-reported life evaluations among small-scale societies that exist on the fringes of mainstream industrialized societies. The data were produced as part of the LICCI project, through fieldwork carried out by LICCI partners. The data include individual responses to a life satisfaction question, and household asset values. Data from the Gallup World Poll and the World Values Survey are also included, as used for comparison.

    TABULAR DATA-SPECIFIC INFORMATION

    1. File name: LICCI_individual.csv
    Number of rows and columns: 2814, 7
    Variable list:
    - User, Site, village: identification of investigator and location
    - Well.being.general: numerical score for life satisfaction question
    - HH_Assets_US, HH_Assets_USD_capita: estimated value of representative assets in the household of respondent, total and per capita (accounting for number of household inhabitants)

    2. File name: LICCI_bySite.csv
    Number of rows and columns: 19, 8
    Variable list:
    - Site, N: site name and number of respondents at the site
    - SWB_mean, SWB_SD: mean and standard deviation of life satisfaction score
    - HHAssets_USD_mean, HHAssets_USD_sd: site mean and standard deviation of household asset value
    - PerCapAssets_USD_mean, PerCapAssets_USD_sd: site mean and standard deviation of per capita asset value

    3. File name: gallup_WVS_GDP_pk.csv
    Number of rows and columns: 146, 8
    Variable list:
    - Happiness Score, Whisker-high, Whisker-low: from the Gallup World Poll as documented in the World Happiness Report 2022
    - GDP-PPP2017: Gross Domestic Product per capita for year 2020 at PPP (constant 2017 international $), accessed May 2022
    - pk: produced capital per capita for year 2018 (in 2018 US$) for available countries, as estimated by the World Bank (accessed February 2022)
    - WVS7_mean, WVS7_std: results of Question 49 in the World Values Survey, Wave 7
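
    A minimal sketch (assuming pandas and the variable names listed above; whether the published by-site file was computed exactly this way is an assumption) that rebuilds the structure of LICCI_bySite.csv from the individual-level file:

        import pandas as pd

        ind = pd.read_csv("LICCI_individual.csv")
        by_site = ind.groupby("Site").agg(
            N=("Well.being.general", "size"),
            SWB_mean=("Well.being.general", "mean"),
            SWB_SD=("Well.being.general", "std"),
            HHAssets_USD_mean=("HH_Assets_US", "mean"),
            HHAssets_USD_sd=("HH_Assets_US", "std"),
        )
        print(by_site.head())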

  10. Hottest Kaggle Datasets

    • kaggle.com
    zip
    Updated Jan 30, 2021
    Cite
    Abeer Alzuhair (2021). Hottest Kaggle Datasets [Dataset]. https://www.kaggle.com/abeeralzuhair2020/hottest-kaggle-datasets
    Explore at:
    Available download formats: zip (351617 bytes)
    Dataset updated
    Jan 30, 2021
    Authors
    Abeer Alzuhair
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This data was collected as a course project for the immersive data science course (by General Assembly and Misk Academy).

    Content

    This dataset is in CSV format; it consists of 5717 rows and 15 columns, where each row is a dataset on Kaggle and each column represents a feature of that dataset.

    Feature | Description
    title | dataset name
    usability | dataset usability rating by Kaggle
    num_of_files | number of files associated with the dataset
    types_of_files | types of files associated with the dataset
    files_size | size of the dataset files
    vote_counts | total votes count by the dataset viewers
    medal | reward to popular datasets measured by the number of upvotes (votes by novices are excluded from medal calculation) [Bronze = 5 Votes, Silver = 20 Votes, Gold = 50 Votes]
    url_reference | reference to the dataset page on Kaggle in the format: www.kaggle.com/url_reference
    keywords | topics tagged with the dataset
    num_of_columns | number of features in the dataset
    views | number of views
    downloads | number of downloads
    download_per_view | download per view ratio
    date_created | dataset creation date
    last_updated | date of the last update

    Acknowledgements

    I would like to thank all my GA instructors for their continuous help and support

    All data were taken from https://www.kaggle.com, collected on 30 Jan 2021.

    Inspiration

    Using this dataset, we could try to predict the upcoming datasets uploaded, number of votes, number of downloads, medal type, etc.

  11. Finance Dataset by Faker Library

    • kaggle.com
    Updated Feb 20, 2024
    Cite
    Hamza Obaydallah (2024). Finance Dataset by Faker Library [Dataset]. https://www.kaggle.com/datasets/hamzazaki/finance-dataset-by-faker-library/data
    Explore at:
    Available download formats: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hamza Obaydallah
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Finance dataset with fake information such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. It can be used for educational purposes as well as for testing.

    This script generates a dataset with fake finance information for each transaction: transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. Adjust the num_rows variable to specify the number of rows you want in your dataset. Finally, the dataset is saved to a CSV file named finance_dataset.csv. You can modify the fields or add additional fields according to your requirements.

    from faker import Faker
    import random
    import pandas as pd

    fake = Faker()

    # Define the number of rows for your dataset
    num_rows = 15000

    # Generate fake finance data
    data = {
        'Transaction_ID': [fake.uuid4() for _ in range(num_rows)],
        'Date': [fake.date_time_this_year() for _ in range(num_rows)],
        'Amount': [round(random.uniform(10, 10000), 2) for _ in range(num_rows)],
        'Currency': [fake.currency_code() for _ in range(num_rows)],
        'Description': [fake.bs() for _ in range(num_rows)],
        'Category': [random.choice(['Food', 'Transport', 'Shopping', 'Entertainment', 'Utilities']) for _ in range(num_rows)],
        'Merchant': [fake.company() for _ in range(num_rows)],
        'Customer': [fake.name() for _ in range(num_rows)],
        'City': [fake.city() for _ in range(num_rows)],
        'Country': [fake.country() for _ in range(num_rows)]
    }

    # Create a DataFrame
    df = pd.DataFrame(data)

    # Save the DataFrame to a CSV file
    df.to_csv('finance_dataset.csv', index=False)

    # Display the DataFrame
    df.head()

  12. Key generic technology prediction in patent citation using graph neural...

    • dataone.org
    • data.niaid.nih.gov
    • + 1 more
    Updated Jun 5, 2024
    Cite
    M. L. Ding (2024). Key generic technology prediction in patent citation using graph neural networks [Dataset]. http://doi.org/10.5061/dryad.nk98sf803
    Explore at:
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    M. L. Ding
    Time period covered
    Jan 11, 2024
    Description

    With the rapid advancement of the Fourth Industrial Revolution, international competition in technology and industry is intensifying. However, in the era of big data and large-scale science, making accurate judgments about the key areas of technology and innovative trends has become exceptionally difficult. This paper constructs a patent indicator evaluation system based on the dimensions of key and generic patent citation, integrates graph neural network modeling to predict key common technologies, and confirms the effectiveness of the method using the field of genetic engineering as an example. According to the LDA topic model, the main technical R&D directions in genetic engineering are genetic analysis and detection technologies, the application of microorganisms in industrial production, virology research involving vaccine development and immune responses, high-throughput sequencing and analysis technologies in genomics, and targeted drug design and molecular therapeutic strategies, among others.

    These datasets were obtained from the Incopat patent database for cited patents (2013-2022) in the field of genetic engineering. Details for the datasets are provided in the README file. This directory contains the selection of the patent datasets.

    1) Table of key generic indicators for nodes (partial 1).csv
    This file consists of 10 indicators of patents: technical coverage, patent families, patent family citation, patent cooperation, enterprise-enterprise cooperation, industry-university-research cooperation, claims, citation frequency, layout countries, and layout countries.

    2) Table of key generic indicators for nodes (partial 2).csv
    This file consists of 10 indicators of patents: technical convergence, cited countries, inventors, citations, homologous countries/areas, degree centrality, closeness centrality, betweenness centrality, eigenvector centrality, and PageRank.

    3) patent.content
    The content file contains descriptions of the patents in the following format:

    This README file was generated on 2023-11-25 by Mingli Ding.

    GENERAL INFORMATION

    1. Author Information / Investigator Contact Information
       Name: Mingli Ding; Wangke Yu; Shuhua Wang
       Institution: Jingdezhen Ceramic University
       Address: Jingdezhen, Jiangxi, China
       Email: mlding1@163.com
    2. Date of data collection: 2013-2022

    DATA & FILE OVERVIEW

    1. File List:

    A) Table of key generic indicators for nodes (partial 1).csv

    B) Table of key generic indicators for nodes (partial 2).csv

    C) patent.content

    D) patent.cites

    E) Graph neural network modeling highest accuracy for different dimensions.csv

    F) Prediction effects of key generic technologies.csv

    DATA-SPECIFIC INFORMATION FOR: Table of key generic indicators for nodes (partial 1).csv

    1. Number of variables: 10
    2. Number of cases/rows: 72489
    3. Variable List:
    • technical coverage: number ...
  13. Data curation materials in "Daily life in the Open Biologist's second job,...

    • zenodo.org
    bin, tiff, txt
    Updated Sep 18, 2024
    Cite
    Livia C T Scorza; Livia C T Scorza; Tomasz Zieliński; Tomasz Zieliński; Andrew J Millar; Andrew J Millar (2024). Data curation materials in "Daily life in the Open Biologist's second job, as a Data Curator" [Dataset]. http://doi.org/10.5281/zenodo.13321937
    Explore at:
    Available download formats: tiff, txt, bin
    Dataset updated
    Sep 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Livia C T Scorza; Livia C T Scorza; Tomasz Zieliński; Tomasz Zieliński; Andrew J Millar; Andrew J Millar
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is the supplementary material accompanying the manuscript "Daily life in the Open Biologist’s second job, as a Data Curator", published in Wellcome Open Research.

    It contains:

    - Python_scripts.zip: Python scripts used for data cleaning and organization:

    - add_headers.py: adds specified headers automatically to a list of csv files, creating new output files containing a "_with_headers" suffix.

    - count_NaN_values.py: counts the total number of rows containing null values in a csv file and prints the location of null values in the (row, column) format.

    - remove_rowsNaN_file.py: removes rows containing null values in a single csv file and saves the modified file with a "_dropNaN" suffix (a hedged sketch of this step appears after this list).

    - remove_rowsNaN_list.py: removes rows containing null values in a list of csv files and saves the modified files with a "_dropNaN" suffix.

    - README_template.txt: a template for a README file to be used to describe and accompany a dataset.

    - template_for_source_data_information.xlsx: a spreadsheet to help manuscript authors to keep track of data used for each figure (e.g., information about data location and links to dataset description).

    - Supplementary_Figure_1.tif: Example of a dataset shared by us on Zenodo. The elements that make the dataset FAIR are indicated by the respective letters. Findability (F) is achieved by the dataset unique and persistent identifier (DOI), as well as by the related identifiers for the publication and dataset on GitHub. Additionally, the dataset is described with rich metadata, (e.g., keywords). Accessibility (A) is achieved by the ease of visualization and downloading using a standardised communications protocol (https). Also, the metadata are publicly accessible and licensed under the public domain. Interoperability (I) is achieved by the open formats used (CSV; R), and metadata are harvestable using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a low-barrier mechanism for repository interoperability. Reusability (R) is achieved by the complete description of the data with metadata in README files and links to the related publication (which contains more detailed information, as well as links to protocols on protocols.io). The dataset has a clear and accessible data usage license (CC-BY 4.0).
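
    A hedged sketch of the row-dropping step performed by scripts like remove_rowsNaN_file.py (illustrative only, not the authors' code):

        import pandas as pd

        def drop_nan_rows(path):
            df = pd.read_csv(path)
            cleaned = df.dropna()                            # drop rows containing any null value
            out_path = path.replace(".csv", "_dropNaN.csv")  # keep the "_dropNaN" suffix convention
            cleaned.to_csv(out_path, index=False)
            return out_path

        # Example: drop_nan_rows("measurements.csv") writes "measurements_dropNaN.csv"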

  14. Windy City Business Names

    • data.cityofchicago.org
    Updated Jun 7, 2025
    Cite
    City of Chicago (2025). Windy City Business Names [Dataset]. https://data.cityofchicago.org/Community-Economic-Development/Windy-City-Business-Names/eghd-qvdp
    Explore at:
    Available download formats: csv, xml, tsv, application/rssxml, application/rdfxml, application/geo+json, kml, kmz
    Dataset updated
    Jun 7, 2025
    Authors
    City of Chicago
    Description

    This dataset contains all current and active business licenses issued by the Department of Business Affairs and Consumer Protection. It contains a large number of records/rows and may not be viewable in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or WordPad, to view and search.

    Data fields requiring description are detailed below.

    APPLICATION TYPE: 'ISSUE' is the record associated with the initial license application. 'RENEW' is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. 'C_LOC' is a change of location record, meaning the business moved. 'C_CAPA' is a change of capacity record; only a few license types may file this type of application. 'C_EXPA' only applies to businesses that have liquor licenses and means the business location expanded.

    LICENSE STATUS: 'AAI' means the license was issued.

    Business license owner information may be accessed at http://data.cityofchicago.org/Community-Economic-Development/Business-Owners/ezma-pppn. To identify the owner of a business, you will need the account number or legal name.

    Data Owner: Business Affairs and Consumer Protection

    Time Period: Current

    Frequency: Data is updated daily

  15. Data from: A Deep Learning and XGBoost-based Method for Predicting...

    • narcis.nl
    • data.mendeley.com
    Updated Aug 3, 2021
    + more versions
    Cite
    wang, P (via Mendeley Data) (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
    Explore at:
    Dataset updated
    Aug 3, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    wang, P (via Mendeley Data)
    Description

    local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 344 columns; rows represent samples, the first 343 columns are features and the last column is the label.

    local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 344 columns; rows represent samples, the first 343 columns are features and the last column is the label.

    global&local_feature_training_set.csv: preprocessed feature-extractor data with 65869 rows and 1028 columns; rows represent samples, the first 1027 columns are features and the last column is the label.

    global&local_feature_testing_set.csv: preprocessed feature-extractor data with 11791 rows and 1028 columns; rows represent samples, the first 1027 columns are features and the last column is the label.
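
    A short sketch of splitting these files into features and label, as described above (the last column is the label); it assumes the CSVs have no header row, so adjust the header argument if needed.

    import pandas as pd

    train = pd.read_csv("local_feature_training_set.csv", header=None)
    X_train = train.iloc[:, :-1].values    # first 343 columns: features
    y_train = train.iloc[:, -1].values     # last column: label
    print(X_train.shape, y_train.shape)    # expected: (65869, 343) (65869,)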

  16. Retailrocket recommender system dataset

    • kaggle.com
    Updated Nov 8, 2022
    + more versions
    Cite
    Roman Zykov (2022). Retailrocket recommender system dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/4471234
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 8, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Roman Zykov
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    The dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.csv) and a file which describes the category tree (category_tree.csv). The data has been collected from a real-world e-commerce website. It is raw data, i.e. without any content transformations; however, all values are hashed for confidentiality reasons. The purpose of publishing is to motivate research in the field of recommender systems with implicit feedback.

    Content

    The behaviour data, i.e. events like clicks, add to carts, transactions, represent interactions that were collected over a period of 4.5 months. A visitor can make three types of events, namely “view”, “addtocart” or “transaction”. In total there are 2 756 101 events including 2 664 312 views, 69 332 add to carts and 22 457 transactions produced by 1 407 580 unique visitors. For about 90% of events corresponding properties can be found in the “item_properties.csv” file.

    For example:

    • “1439694000000,1,view,100,” means visitorId = 1, clicked the item with id = 100 at 1439694000000 (Unix timestamp)
    • “1439694000000,2,transaction,1000,234” means visitorId = 2 purchased the item with id = 1000 in transaction with id = 234 at 1439694000000 (Unix timestamp)
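
    A minimal sketch of loading the behaviour data and tallying the three event types; the column names (timestamp, visitorid, event, itemid, transactionid) are assumed from the examples above.

    import pandas as pd

    events = pd.read_csv("events.csv")
    print(events["event"].value_counts())            # view / addtocart / transaction counts
    print(events["visitorid"].nunique(), "unique visitors")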

    The file with item properties (item_properties.csv) includes 20 275 902 rows, i.e. different properties, describing 417 053 unique items. The file is divided into two parts due to file size limitations. Since the property of an item can vary in time (e.g., price changes over time), every row in the file has a corresponding timestamp. In other words, the file consists of concatenated snapshots for every week in the file with the behaviour data. However, if a property of an item is constant over the observed period, only a single snapshot value will be present in the file. For example, suppose we have three properties for a single item and 4 weekly snapshots, like below:

    timestamp,itemid,property,value
    1439694000000,1,100,1000
    1439695000000,1,100,1000
    1439696000000,1,100,1000
    1439697000000,1,100,1000
    1439694000000,1,200,1000
    1439695000000,1,200,1100
    1439696000000,1,200,1200
    1439697000000,1,200,1300
    1439694000000,1,300,1000
    1439695000000,1,300,1000
    1439696000000,1,300,1100
    1439697000000,1,300,1100
    

    After the snapshot merge it would look like:

    1439694000000,1,100,1000
    1439694000000,1,200,1000
    1439695000000,1,200,1100
    1439696000000,1,200,1200
    1439697000000,1,200,1300
    1439694000000,1,300,1000
    1439696000000,1,300,1100
    

    Here property=100 is constant over time (kept once), property=200 has a different value in every snapshot (all rows kept), and property=300 changed once (two rows kept).

    The item properties file contains a timestamp column because properties are time dependent and may change over time, e.g. price, category, etc. Initially, this file consisted of snapshots for every week in the events file and contained over 200 million rows. We merged consecutive constant property values, converting it from snapshot form to change-log form, so constant values appear only once. This reduced the number of rows roughly tenfold.
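
    A sketch of that snapshot-to-change-log merge, assuming data in the snapshot form shown above (columns timestamp, itemid, property, value): within each (itemid, property) pair, keep only rows whose value differs from the previous snapshot.

    import pandas as pd

    props = pd.read_csv("item_properties_snapshots.csv")   # hypothetical snapshot-form file
    props = props.sort_values(["itemid", "property", "timestamp"])
    changed = props["value"] != props.groupby(["itemid", "property"])["value"].shift()
    change_log = props[changed]                            # constant values now appear only once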

    All values in the “item_properties.csv” file excluding the "categoryid" and "available" properties were hashed. The value of the "categoryid" property contains the item category identifier. The value of the "available" property contains the availability of the item, i.e. 1 means the item was available, otherwise 0. All numerical values were marked with an "n" char at the beginning and have 3 digits of precision after the decimal point, e.g. "5" becomes "n5.000" and "-3.67584" becomes "n-3.675". All words in text values were normalized (stemming procedure: https://en.wikipedia.org/wiki/Stemming) and hashed, and numbers were processed as above, e.g. the text "Hello world 2017!" becomes "24214 44214 n2017.000".

    The category tree file has 1669 rows. Every row in the file specifies a child categoryid and the corresponding parent (a sketch of walking this tree follows the examples). For example:

    • Line “100,200” means that categoryid=100 has a parent with categoryid=200
    • Line “300,” means that categoryid=300 has no parent in the tree
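
    As noted above, a small sketch of walking the category tree from a child category up to its root; the column names (categoryid, parentid) are assumed.

    import pandas as pd

    tree = pd.read_csv("category_tree.csv")
    parent = dict(zip(tree["categoryid"], tree["parentid"]))

    def path_to_root(cat):
        path = [cat]
        while not pd.isna(parent.get(cat)):    # rows like "300," have no parent value
            cat = int(parent[cat])
            path.append(cat)
        return path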

    Acknowledgements

    Retail Rocket (retailrocket.io) helps web shoppers make better shopping decisions by providing personalized real-time recommendations through multiple channels, serving over 100MM unique monthly users and 1000+ retail partners around the world.

    Inspiration

  17. Tuberculosis X-Ray Dataset (Synthetic)

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Arif Miah (2025). Tuberculosis X-Ray Dataset (Synthetic) [Dataset]. https://www.kaggle.com/datasets/miadul/tuberculosis-x-ray-dataset-synthetic
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Arif Miah
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Dataset Summary

    This synthetic dataset contains 20,000 records of X-ray data labeled as "Normal" or "Tuberculosis". It is specifically created for training and evaluating classification models in the field of medical image analysis. The dataset aims to aid in building machine learning and deep learning models for detecting tuberculosis from X-ray data.

    💡 Context

    Tuberculosis (TB) is a highly infectious disease that primarily affects the lungs. Accurate detection of TB using chest X-rays can significantly enhance medical diagnostics. However, real-world datasets are often scarce or restricted due to privacy concerns. This synthetic dataset bridges that gap by providing simulated patient data while maintaining realistic distributions and patterns commonly observed in TB cases.

    🗃️ Dataset Details

    • Number of Rows: 20,000
    • Number of Columns: 15
    • File Format: CSV
    • Resolution: Simulated patient data, not real X-ray images
    • Size: Approximately 10 MB

    🏷️ Columns and Descriptions

    Column Name: Description
    Patient_ID: Unique ID for each patient (e.g., PID000001)
    Age: Age of the patient (in years)
    Gender: Gender of the patient (Male/Female)
    Chest_Pain: Presence of chest pain (Yes/No)
    Cough_Severity: Severity of cough (Scale: 0-9)
    Breathlessness: Severity of breathlessness (Scale: 0-4)
    Fatigue: Level of fatigue experienced (Scale: 0-9)
    Weight_Loss: Weight loss (in kg)
    Fever: Level of fever (Mild, Moderate, High)
    Night_Sweats: Whether night sweats are present (Yes/No)
    Sputum_Production: Level of sputum production (Low, Medium, High)
    Blood_in_Sputum: Presence of blood in sputum (Yes/No)
    Smoking_History: Smoking status (Never, Former, Current)
    Previous_TB_History: Previous tuberculosis history (Yes/No)
    Class: Target variable indicating the condition (Normal, Tuberculosis)

    🔍 Data Generation Process

    The dataset was generated using Python with the following libraries:
    - Pandas: To create and save the dataset as a CSV file
    - NumPy: To generate random numbers and simulate realistic data
    - Random Seed: Set to ensure reproducibility

    The target variable "Class" has a 70-30 distribution between Normal and Tuberculosis cases. The data is randomly generated with realistic patterns that mimic typical TB symptoms and demographic distributions.
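
    For illustration only, a hedged sketch of how a synthetic table with a 70-30 class split could be generated; this is not the creator's actual script, and only a few of the 15 columns are shown.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)                      # fixed seed for reproducibility
    n = 20_000
    df = pd.DataFrame({
        "Patient_ID": [f"PID{i:06d}" for i in range(1, n + 1)],
        "Age": rng.integers(1, 91, size=n),
        "Cough_Severity": rng.integers(0, 10, size=n),
        "Class": rng.choice(["Normal", "Tuberculosis"], size=n, p=[0.7, 0.3]),
    })
    df.to_csv("tb_synthetic_sample.csv", index=False)    # placeholder output file name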

    🔧 Usage

    This dataset is intended for:
    - Machine Learning and Deep Learning classification tasks
    - Data exploration and feature analysis
    - Model evaluation and comparison
    - Educational and research purposes

    📊 Potential Applications

    1. Tuberculosis Detection Models: Train CNNs or other classification algorithms to detect TB.
    2. Healthcare Research: Analyze the correlation between symptoms and TB outcomes.
    3. Data Visualization: Perform EDA to uncover patterns and insights.
    4. Model Benchmarking: Compare various algorithms for TB detection.

    📑 License

    This synthetic dataset is open for educational and research use. Please credit the creator if used in any public or academic work.

    🙌 Acknowledgments

    This dataset was generated as a synthetic alternative to real-world data to help developers and researchers practice building and fine-tuning classification models without the constraints of sensitive patient data.

  18. WBPHS geolocated data and segment effort, 2000+

    • catalog.data.gov
    Updated Feb 21, 2025
    + more versions
    Cite
    U.S. Fish and Wildlife Service (2025). WBPHS geolocated data and segment effort, 2000+ [Dataset]. https://catalog.data.gov/dataset/wbphs-geolocated-data-and-segment-effort-2000
    Explore at:
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    U.S. Fish and Wildlife Servicehttp://www.fws.gov/
    Description

    The geolocated counts for the Waterfowl Breeding Population and Habitat Survey and associated segment effort information from 2000 to present. The survey was not conducted in 2020-21 due to the COVID pandemic. Two data files are included with their associated metadata (html and xml formats). wbphs_geolocated_counts_forDistribution.csv includes the locations of the plane when survey species observations were made; for each observation, the social group and count are recorded, along with a description of the location quality (number of rows in table = 1,820,628). wbphs_segment_effort_forDistribution.csv includes the midpoint latitude and longitude of each segment when known, which can differ by year (as indicated by the version number); if a segment was not flown, it is absent from the table for the corresponding year (number of rows in table = 65,122). Not all geolocated records have locations. Please consult the metadata for an explanation of the fields and other information to understand the limitations of the data.

  19. National Register of Road Haulage and Road Passenger Transport Operators

    • data.wu.ac.at
    • data.europa.eu
    csv
    Updated Jan 8, 2015
    Cite
    Driver & Vehicle Standards Agency (2015). National Register of Road Haulage and Road Passenger Transport Operators [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/NWFjNWExYjgtZDA1Yi00YWE2LWFjYTUtMTk0YWQ2N2QxZjZi
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 8, 2015
    Dataset provided by
    Driver & Vehicle Standards Agency
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This data contains information on known Goods and Passenger Vehicle Operators that run a business for hire and reward (holders of Standard Licences) within GB.

    This CSV file contains a data extract from the Secretary of State's National Register to comply with EU Regulation 1071/2009 and is required to be published by the Member State that the data represents.

    Note: The CSV file contains a large number of rows and will need to be opened within a suitable software package.

  20. shakespeare_words_google_big_query

    • kaggle.com
    Updated Jul 20, 2023
    Cite
    Crypto D NVarChar (2023). shakespeare_words_google_big_query [Dataset]. https://www.kaggle.com/datasets/cryptodnvarchar/shakespeare-words-google-big-query
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Crypto D NVarChar
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Google BigQuery

    SQL: SELECT word, corpus FROM `bigquery-public-data.samples.shakespeare` LIMIT 100000000

    df.shape returns the number of rows/columns: (164656, 2). CSV format, contains headers; file size: 3.4 MB.
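
    A hedged sketch of reproducing this extract with the BigQuery Python client; it assumes google-cloud-bigquery, pandas and db-dtypes are installed and that Google Cloud credentials are configured.

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = "SELECT word, corpus FROM `bigquery-public-data.samples.shakespeare`"
    df = client.query(sql).to_dataframe()                # run the query and load into pandas
    df.to_csv("shakespeare_words.csv", index=False)      # placeholder output file name
    print(df.shape)                                      # (164656, 2) per the description above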
