4 datasets found
  1. SQLite Sakila Sample Database

    • kaggle.com
    Updated Mar 14, 2021
    Cite
    Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/atanaskanev/sqlite-sakila-sample-database/tasks
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Mar 14, 2021
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Atanas Kanev
    Description

    SQLite Sakila Sample Database

    Database Description

    The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

    Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

    • Oracle
    • SQL Server
    • SQLite
    • Interbase/Firebird
    • Microsoft Access

    Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators decide which database to use for development of new products. The user can run the same SQL against different kinds of databases and compare the performance.

    License: BSD. Copyright DB Software Laboratory, http://www.etl-tools.com

    Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

    Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

    Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

    https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

    Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

    Files Description

    The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

        sqlite> .open sqlite-sakila.db                  # creates the .db file
        sqlite> .read sqlite-sakila-schema.sql          # creates the database schema
        sqlite> .read sqlite-sakila-insert-data.sql     # inserts the data

    Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be executed against it right away. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: data for the film_text table is not provided in the script files, thus the film_text table is empty; instead, the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
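
    For example, a minimal Python sketch of loading the generated database and running a query (it assumes sqlite-sakila.db is in the working directory and uses the standard Sakila tables category, film_category, inventory and rental):

        import sqlite3

        # Open the pre-built database file (assumed to be in the working directory).
        conn = sqlite3.connect("sqlite-sakila.db")
        cur = conn.cursor()

        # Count rentals per film category using standard Sakila tables.
        cur.execute("""
            SELECT c.name, COUNT(*) AS rentals
            FROM rental r
            JOIN inventory i      ON r.inventory_id = i.inventory_id
            JOIN film_category fc ON i.film_id = fc.film_id
            JOIN category c       ON fc.category_id = c.category_id
            GROUP BY c.name
            ORDER BY rentals DESC
            LIMIT 5
        """)
        for name, rentals in cur.fetchall():
            print(name, rentals)

        conn.close()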

  2. Data from: KGCW 2023 Challenge @ ESWC 2023

    • zenodo.org
    • investigacion.usc.gal
    application/gzip
    Updated Apr 15, 2024
    + more versions
    Cite
    Dylan Van Assche; David Chaves-Fraga; Anastasia Dimou; Umutcan Şimşek; Ana Iglesias (2024). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. http://doi.org/10.5281/zenodo.7837289
    Explore at: application/gzip (available download format)
    Dataset updated: Apr 15, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Dylan Van Assche; David Chaves-Fraga; Anastasia Dimou; Umutcan Şimşek; Ana Iglesias
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Knowledge Graph Construction Workshop 2023: challenge

    Knowledge graph construction from heterogeneous data has seen a lot of uptake
    in the last decade, from compliance work to performance optimizations with respect
    to execution time. Beyond execution time as a metric for comparing knowledge
    graph construction, other metrics, e.g. CPU or memory usage, are usually not
    considered. This challenge aims at benchmarking systems to find which RDF graph
    construction system optimizes for metrics such as execution time, CPU, memory
    usage, or a combination of these.

    Task description

    The task is to reduce and report the execution time and computing resources
    (CPU and memory usage) for the parameters listed in this challenge, compared
    to the state of the art of existing tools and the baseline results provided
    by this challenge. The challenge is not limited to execution time, i.e. building
    the fastest pipeline; it also covers computing resources, i.e. achieving the
    most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. The tool also
    collects and aggregates the metrics necessary for this challenge, such as
    execution time and CPU and memory usage, as CSV files. Moreover, information
    about the hardware used during the execution of the pipeline is recorded as
    well, to allow fairly comparing different pipelines. Your pipeline should consist
    of Docker images which can be executed on Linux by the tool. The tool has
    already been tested with existing systems: relational databases, e.g. MySQL and
    PostgreSQL, and triplestores, e.g. Apache Jena Fuseki and OpenLink Virtuoso,
    which can be combined in any configuration. You are strongly encouraged to use
    this tool when participating in this challenge. If you prefer to use a different
    tool, or our tool imposes technical requirements you cannot meet, please contact
    us directly.

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more
    insight into their influence on the pipeline.

    Data

    • Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
    • Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
    • Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    • Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
    • Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
    • Number and type of joins: scaling the number of joins and the type of joins (1-1, N-1, 1-N, N-M).

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from
    the public transport domain in Madrid.

    Scaling

    • GTFS-1 SQL
    • GTFS-10 SQL
    • GTFS-100 SQL
    • GTFS-1000 SQL

    Heterogeneity

    • GTFS-100 XML + JSON
    • GTFS-100 CSV + XML
    • GTFS-100 CSV + JSON
    • GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps
    for each parameter:

    1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
    2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.
    3. The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.
    4. The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

    The pipeline is executed 5 times, from which the median execution time of each
    step is calculated and reported. For each step, the run with the median execution
    time is then reported in the baseline results with all its measured metrics.
    The query timeout is set to 1 hour and the knowledge graph construction timeout
    to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool;
    you can adapt the execution plans for this example pipeline to your own needs.
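
    As an illustration of the median-of-five reporting, a minimal Python sketch (the per-run file names and the step/execution_time columns are assumptions for illustration, not the challenge tool's actual CSV layout):

        import csv
        import statistics
        from collections import defaultdict

        # Collect the execution time of each pipeline step across the 5 runs
        # (file names and column names are hypothetical).
        times = defaultdict(list)
        for run in range(1, 6):
            with open(f"metrics_run{run}.csv", newline="") as f:
                for row in csv.DictReader(f):
                    times[row["step"]].append(float(row["execution_time"]))

        # Report the median execution time per step, as in the baseline results.
        for step, values in times.items():
            print(step, statistics.median(values))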

    Each parameter has its own directory in the ground truth dataset with the
    following files:

    • Input dataset as CSV.
    • Mapping file as RML.
    • Queries as SPARQL.
    • Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    • Input dataset as CSV for each parameter.
    • Mapping file as RML for each parameter.
    • SPARQL queries to retrieve the results for each parameter.
    • Baseline results for each parameter with the example pipeline.
    • Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter that is being
    evaluated, the number of rows and columns may differ. The first row is always
    the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    • Input dataset as CSV with SQL schema for the scaling; a combination of XML, CSV, and JSON is provided for the heterogeneity.
    • Mapping file as RML for both scaling and heterogeneity.
    • SPARQL queries to retrieve the results.
    • Baseline results with the example pipeline.
    • Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row.
    JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    • Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of the step.
    • CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of the step.
    • Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of the step.
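
    A minimal Python sketch of measuring one step along these lines, using psutil (illustrative only; it is not the challenge's own measurement tool and it samples only the step's main process):

        import time
        import psutil

        def measure_step(cmd):
            """Run one pipeline step; report wall-clock time, CPU time and memory extremes."""
            start_wall = time.time()
            proc = psutil.Popen(cmd)
            min_mem, max_mem, cpu_time = float("inf"), 0, 0.0
            while proc.poll() is None:                    # sample while the step is running
                try:
                    mem = proc.memory_info().rss
                    t = proc.cpu_times()
                    cpu_time = t.user + t.system          # latest CPU-time sample
                    min_mem, max_mem = min(min_mem, mem), max(max_mem, mem)
                except psutil.NoSuchProcess:
                    break
                time.sleep(0.1)
            return {
                "execution_time": time.time() - start_wall,  # end time minus begin time
                "cpu_time": cpu_time,                        # time spent in the CPU
                "min_memory": min_mem,
                "max_memory": max_mem,
            }

        print(measure_step(["sleep", "2"]))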

    Expected output

    Duplicate values

    Scale         | Number of Triples
    0 percent     | 2000000 triples
    25 percent    | 1500020 triples
    50 percent    | 1000020 triples
    75 percent    | 500020 triples
    100 percent   | 20 triples

    Empty values

    Scale         | Number of Triples
    0 percent     | 2000000 triples
    25 percent    | 1500000 triples
    50 percent    | 1000000 triples
    75 percent    | 500000 triples
    100 percent   | 0 triples

    Mappings

    Scale         | Number of Triples
    1TM + 15POM   | 1500000 triples
    3TM + 5POM    | 1500000 triples
    5TM + 3POM    | 1500000 triples
    15TM + 1POM   | 1500000 triples

    Properties

    Scale                | Number of Triples
    1M rows, 1 column    | 1000000 triples
    1M rows, 10 columns  | 10000000 triples
    1M rows, 20 columns  | 20000000 triples
    1M rows, 30 columns  | 30000000 triples

    Records

    Scale                 | Number of Triples
    10K rows, 20 columns  | 200000 triples
    100K rows, 20 columns | 2000000 triples
    1M rows, 20 columns   | 20000000 triples
    10M rows, 20 columns  | 200000000 triples

    Joins

    1-1 joins

    Scale         | Number of Triples
    0 percent     | 0 triples
    25 percent    | 125000 triples
    50 percent    | 250000 triples
    75 percent    | 375000 triples
    100 percent   | 500000 triples

    1-N joins

    Scale             | Number of Triples
    1-10 0 percent    | 0 triples
    1-10 25 percent   | 125000 triples
    1-10 50 percent   | 250000 triples
    1-10 75 percent   | 375000 triples

  3. Data from: KGCW 2023 Challenge @ ESWC 2023

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 17, 2023
    Cite
    Dylan Van Assche; David Chaves-Fraga; Anastasia Dimou; Umutcan Şimşek; Ana Iglesias (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. http://doi.org/10.5281/zenodo.7689310
    Explore at: application/gzip (available download format)
    Dataset updated: May 17, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Dylan Van Assche; David Chaves-Fraga; Anastasia Dimou; Umutcan Şimşek; Ana Iglesias
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Knowledge Graph Construction Workshop 2023: challenge

    Knowledge graph construction from heterogeneous data has seen a lot of uptake
    in the last decade, from compliance work to performance optimizations with respect
    to execution time. Beyond execution time as a metric for comparing knowledge
    graph construction, other metrics, e.g. CPU or memory usage, are usually not
    considered. This challenge aims at benchmarking systems to find which RDF graph
    construction system optimizes for metrics such as execution time, CPU, memory
    usage, or a combination of these.

    Task description

    The task is to reduce and report the execution time and computing resources
    (CPU and memory usage) for the parameters listed in this challenge, compared
    to the state of the art of existing tools and the baseline results provided
    by this challenge. The challenge is not limited to execution time, i.e. building
    the fastest pipeline; it also covers computing resources, i.e. achieving the
    most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. The tool also
    collects and aggregates the metrics necessary for this challenge, such as
    execution time and CPU and memory usage, as CSV files. Moreover, information
    about the hardware used during the execution of the pipeline is recorded as
    well, to allow fairly comparing different pipelines. Your pipeline should consist
    of Docker images which can be executed on Linux by the tool. The tool has
    already been tested with existing systems: relational databases, e.g. MySQL and
    PostgreSQL, and triplestores, e.g. Apache Jena Fuseki and OpenLink Virtuoso,
    which can be combined in any configuration. You are strongly encouraged to use
    this tool when participating in this challenge. If you prefer to use a different
    tool, or our tool imposes technical requirements you cannot meet, please contact
    us directly.

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more
    insight into their influence on the pipeline.

    Data

    • Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
    • Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
    • Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    • Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
    • Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
    • Number and type of joins: scaling the number of joins and the type of joins (1-1, N-1, 1-N, N-M).

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from
    the public transport domain in Madrid.

    Scaling

    • GTFS-1 SQL
    • GTFS-10 SQL
    • GTFS-100 SQL
    • GTFS-1000 SQL

    Heterogeneity

    • GTFS-100 XML + JSON
    • GTFS-100 CSV + XML
    • GTFS-100 CSV + JSON
    • GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps
    for each parameter:

    1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
    2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.
    3. The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.
    4. The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.
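
    Step 4 queries the SPARQL endpoint exposed by Virtuoso; a minimal Python sketch of issuing such a query over HTTP (the endpoint URL and the query are placeholders, not the queries provided by the challenge):

        import requests

        endpoint = "http://localhost:8890/sparql"        # Virtuoso's default SPARQL endpoint
        query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

        resp = requests.get(
            endpoint,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=3600,                                # the query timeout is 1 hour
        )
        resp.raise_for_status()
        for binding in resp.json()["results"]["bindings"]:
            print(binding["triples"]["value"])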

    The pipeline is executed 5 times, from which the median execution time of each
    step is calculated and reported. For each step, the run with the median execution
    time is then reported in the baseline results with all its measured metrics.
    The query timeout is set to 1 hour and the knowledge graph construction timeout
    to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool;
    you can adapt the execution plans for this example pipeline to your own needs.

    Each parameter has its own directory in the ground truth dataset with the
    following files:

    • Input dataset as CSV.
    • Mapping file as RML.
    • Queries as SPARQL.
    • Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    • Input dataset as CSV for each parameter.
    • Mapping file as RML for each parameter.
    • SPARQL queries to retrieve the results for each parameter.
    • Baseline results for each parameter with the example pipeline.
    • Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter that is being
    evaluated, the number of rows and columns may differ. The first row is always
    the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    • Input dataset as CSV with SQL schema for the scaling; a combination of XML, CSV, and JSON is provided for the heterogeneity.
    • Mapping file as RML for both scaling and heterogeneity.
    • SPARQL queries to retrieve the results.
    • Baseline results with the example pipeline.
    • Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row.
    JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    • Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of the step.
    • CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of the step.
    • Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of the step.

    Expected output

    Duplicate values

    Scale         | Number of Triples
    0 percent     | 2000000 triples
    25 percent    | 1500020 triples
    50 percent    | 1000020 triples
    75 percent    | 500020 triples
    100 percent   | 20 triples

    Empty values

    Scale         | Number of Triples
    0 percent     | 2000000 triples
    25 percent    | 1500000 triples
    50 percent    | 1000000 triples
    75 percent    | 500000 triples
    100 percent   | 0 triples

    Mappings

    Scale         | Number of Triples
    1TM + 15POM   | 1500000 triples
    3TM + 5POM    | 1500000 triples
    5TM + 3POM    | 1500000 triples
    15TM + 1POM   | 1500000 triples

    Properties

    Scale                | Number of Triples
    1M rows, 1 column    | 1000000 triples
    1M rows, 10 columns  | 10000000 triples
    1M rows, 20 columns  | 20000000 triples
    1M rows, 30 columns  | 30000000 triples

    Records

    Scale                 | Number of Triples
    10K rows, 20 columns  | 200000 triples
    100K rows, 20 columns | 2000000 triples
    1M rows, 20 columns   | 20000000 triples
    10M rows, 20 columns  | 200000000 triples

    Joins

    1-1 joins

    Scale         | Number of Triples
    0 percent     | 0 triples

  4. KGCW 2024 Challenge @ ESWC 2024

    • zenodo.org
    application/gzip
    Updated Mar 11, 2024
    + more versions
    Cite
    Dylan Van Assche; David Chaves-Fraga; Anastasia Dimou; Umutcan Serles; Ana Iglesias (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. http://doi.org/10.5281/zenodo.10721875
    Explore at: application/gzip (available download format)
    Dataset updated: Mar 11, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Dylan Van Assche; David Chaves-Fraga; Anastasia Dimou; Umutcan Serles; Ana Iglesias
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically.

    Description

    Knowledge Graph Construction Workshop 2024: challenge

    Knowledge graph construction from heterogeneous data has seen a lot of uptake
    in the last decade, from compliance work to performance optimizations with respect
    to execution time. Beyond execution time as a metric for comparing knowledge
    graph construction, other metrics, e.g. CPU or memory usage, are usually not
    considered. This challenge aims at benchmarking systems to find which RDF graph
    construction system optimizes for metrics such as execution time, CPU, memory
    usage, or a combination of these.

    Task description

    The task is to reduce and report the execution time and computing resources
    (CPU and memory usage) for the parameters listed in this challenge, compared
    to the state of the art of existing tools and the baseline results provided
    by this challenge. The challenge is not limited to execution time, i.e. building
    the fastest pipeline; it also covers computing resources, i.e. achieving the
    most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. The tool also
    collects and aggregates the metrics necessary for this challenge, such as
    execution time and CPU and memory usage, as CSV files. Moreover, information
    about the hardware used during the execution of the pipeline is recorded as
    well, to allow fairly comparing different pipelines. Your pipeline should consist
    of Docker images which can be executed on Linux by the tool. The tool has
    already been tested with existing systems: relational databases, e.g. MySQL and
    PostgreSQL, and triplestores, e.g. Apache Jena Fuseki and OpenLink Virtuoso,
    which can be combined in any configuration. You are strongly encouraged to use
    this tool when participating in this challenge. If you prefer to use a different
    tool, or our tool imposes technical requirements you cannot meet, please contact
    us directly.

    Track 1: Conformance

    The set of new specifications for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provides a set of test-cases for each module:

    These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, and it may thus contain bugs and issues. If you find problems with the mappings, output, etc., please report them to the corresponding repository of each module.

    Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

    Track 2: Performance

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more
    insight into their influence on the pipeline; a small generator sketch follows
    the Data list below.

    Data

    • Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
    • Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
    • Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
    • Number of input files: scaling the number of datasets (1, 5, 10, 15).
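
    As a rough illustration of these scaling dimensions, a minimal Python sketch that generates a synthetic CSV with a configurable share of duplicate and empty values (column names, value scheme and file name are assumptions, not the challenge's actual generator):

        import csv
        import random

        def generate_csv(path, rows=10_000, cols=20, duplicate_pct=0, empty_pct=0):
            """Write a synthetic CSV with the given share of duplicate and empty values."""
            random.seed(42)
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["id"] + [f"p{i}" for i in range(1, cols)])  # header row
                for r in range(rows):
                    row = [r]
                    for c in range(1, cols):
                        roll = random.random() * 100
                        if roll < empty_pct:
                            row.append("")                   # empty value
                        elif roll < empty_pct + duplicate_pct:
                            row.append("duplicate")          # repeated value
                        else:
                            row.append(f"value-{r}-{c}")     # unique value
                    writer.writerow(row)

        generate_csv("records_10k_20cols.csv", rows=10_000, cols=20, duplicate_pct=25)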

    Mappings

    • Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
    • Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
    • Number and type of joins: scaling the number of joins and the type of joins (1-1, N-1, 1-N, N-M).

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from
    the public transport domain in Madrid.

    Scaling

    • GTFS-1 SQL
    • GTFS-10 SQL
    • GTFS-100 SQL
    • GTFS-1000 SQL

    Heterogeneity

    • GTFS-100 XML + JSON
    • GTFS-100 CSV + XML
    • GTFS-100 CSV + JSON
    • GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps
    for each parameter:

    1. The provided CSV files and SQL schema are loaded into a MySQL relational database.
    2. Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

    The pipeline is executed 5 times, from which the median execution time of each
    step is calculated and reported. For each step, the run with the median execution
    time is then reported in the baseline results with all its measured metrics.
    The knowledge graph construction timeout is set to 24 hours.
    The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool;
    you can adapt the execution plans for this example pipeline to your own needs.
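
    For illustration of step 2's output format, a minimal Python sketch that emits N-Triples for the rows of a CSV file with rdflib (the vocabulary, the file name and the column-to-predicate mapping are hypothetical; the challenge itself uses the provided RML mappings):

        import csv
        from rdflib import Graph, Literal, Namespace, URIRef

        EX = Namespace("http://example.com/")            # hypothetical vocabulary

        g = Graph()
        with open("records_10k_20cols.csv", newline="") as f:
            for row in csv.DictReader(f):
                subject = URIRef(EX[f"record/{row['id']}"])
                for column, value in row.items():
                    if column != "id" and value:         # skip the key column and empty cells
                        g.add((subject, EX[column], Literal(value)))

        # One triple per non-empty cell, serialised as N-Triples.
        g.serialize(destination="output.nt", format="nt")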

    Each parameter has its own directory in the ground truth dataset with the
    following files:

    • Input dataset as CSV.
    • Mapping file as RML.
    • Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    • Input dataset as CSV for each parameter.
    • Mapping file as RML for each parameter.
    • Baseline results for each parameter with the example pipeline.
    • Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter that is being
    evaluated, the number of rows and columns may differ. The first row is always
    the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    • Input dataset as CSV with SQL schema for the scaling; a combination of XML, CSV, and JSON is provided for the heterogeneity.
    • Mapping file as RML for both scaling and heterogeneity.
    • SPARQL queries to retrieve the results.
    • Baseline results with the example pipeline.
    • Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row.
    JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    • Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of the step.
    • CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of the step.
    • Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of the step.

    Expected output

    Duplicate values

    Scale         | Number of Triples
    0 percent     | 2000000 triples
    25 percent    | 1500020 triples
    50 percent    | 1000020 triples
    75 percent    | 500020 triples
    100 percent   | 20 triples

    Empty values

    Scale         | Number of Triples
    0 percent     | 2000000 triples
    25 percent    | 1500000 triples
    50 percent    | 1000000 triples
    75 percent    | 500000 triples
    100 percent   | 0 triples

    Mappings

    Scale         | Number of Triples
    1TM + 15POM   | 1500000 triples
    3TM + 5POM    | 1500000 triples
    5TM + 3POM    | 1500000 triples
    15TM + 1POM   | 1500000 triples

    Properties

    Scale                | Number of Triples
    1M rows, 1 column    | 1000000 triples
    1M rows, 10 columns  | 10000000 triples
    1M rows, 20 columns  | 20000000 triples
    1M rows, 30 columns  | 30000000 triples

