24 datasets found

Bike Store Relational Database | SQL
kaggle.com
zip
Updated Aug 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dillon Myrick (2023). Bike Store Relational Database | SQL [Dataset]. https://www.kaggle.com/datasets/dillonmyrick/bike-store-sample-database
Explore at:
zip(94412 bytes)Available download formats
Dataset updated
Aug 21, 2023
Authors
Dillon Myrick
Description
This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.

Database Diagram:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media" alt="">

Terms of Use

The sample database is copyrighted and cannot be used for commercial purposes. For example, it cannot be used for the following but is not limited to the purposes: - Selling - Including in paid courses
classicmodels
kaggle.com
zip
Updated Apr 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ambreen (2024). classicmodels [Dataset]. https://www.kaggle.com/datasets/ambreenabdulraheem/classicmodels
Explore at:
zip(879935 bytes)Available download formats
Dataset updated
Apr 22, 2024
Authors
Ambreen
Description
MySQL Sample Database Schema. The MySQL sample database schema consists of the following tables:

customers: stores customer’s data.

products: stores a list of scale model cars.

productlines: stores a list of product lines.

orders: stores sales orders placed by customers.

orderdetails: stores sales order line items for every sales order.

payments: stores payments made by customers based on their accounts.

employees: stores employee information and the organization structure such as who reports to whom.

offices: stores sales office data.
SQLite Sakila Sample Database
kaggle.com
zip
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/datasets/atanaskanev/sqlite-sakila-sample-database/code
Explore at:
zip(4495190 bytes)Available download formats
Dataset updated
Mar 14, 2021
Authors
Atanas Kanev
Description
SQLite Sakila Sample Database

Database Description

The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

Oracle

SQL Server

SQLIte

Interbase/Firebird

Microsoft Access

Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators to decide which database to use for development of new products The user can run the same SQL against different kind of databases and compare the performance

License: BSD Copyright DB Software Laboratory http://www.etl-tools.com

Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

Files Description

The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

sqlite> .open sqlite-sakila.db # creates the .db file sqlite> .read sqlite-sakila-schema.sql # creates the database schema sqlite> .read sqlite-sakila-insert-data.sql # inserts the data

Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
Employees
kaggle.com
zip
Updated Nov 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sudhir Singh (2021). Employees [Dataset]. https://www.kaggle.com/datasets/crepantherx/employees
Explore at:
zip(31992550 bytes)Available download formats
Dataset updated
Nov 12, 2021
Authors
Sudhir Singh
Description
Dataset

This dataset was created by Sudhir Singh

Released under Data files © Original Authors

Contents
Z
Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and...
data.niaid.nih.gov
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V. (2024). Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and KDE [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_400614
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Ryerson University
Authors
Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousands of defect reports over a period of 18 years (1999-2017) to capture the inter-relationships among duplicate defects.

File Descriptions

apache.csv - Apache Defect Rediscovery dataset

eclipse.csv - Eclipse Defect Rediscovery dataset

kde.csv - KDE Defect Rediscovery dataset

apache.relations.csv - Inter-relations of rediscovered defects of Apache

eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse

kde.relations.csv - Inter-relations of rediscovered defects of KDE

create_and_populate_neo4j_objects.cypher - Populates Neo4j graphDB by importing all the data from the CSV files. Note that you have to set dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping

create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files

rediscovery_db_mysql.zip - For your convenience, we also provide full backup of the MySQL database

neo4j_examples.txt - Sample Neo4j queries

mysql_examples.txt - Sample MySQL queries

rediscovery_eclipse_6325.png - Output of Neo4j example #1

distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
Z
FooDrugs database: A database with molecular and text information about food...
data.niaid.nih.gov
zenodo.org
Updated Jul 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Garranzo, Marco; Piette Gómez, Óscar; Lacruz Pleguezuelos, Blanca; Pérez, David; Laguna Lobo, Teresa; Carrillo de Santa Pau, Enrique (2023). FooDrugs database: A database with molecular and text information about food - drug interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6638469
Explore at:
Dataset updated
Jul 28, 2023
Dataset provided by
IMDEA Food Institute
Authors
Garranzo, Marco; Piette Gómez, Óscar; Lacruz Pleguezuelos, Blanca; Pérez, David; Laguna Lobo, Teresa; Carrillo de Santa Pau, Enrique
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
FooDrugs database is a development done by the Computational Biology Group at IMDEA Food Institute (Madrid, Spain), in the context of the Food Nutrition Security Cloud (FNS-Cloud) project. Food Nutrition Security Cloud (FNS-Cloud) has received funding from the European Union's Horizon 2020 Research and Innovation programme (H2020-EU.3.2.2.3. – A sustainable and competitive agri-food industry) under Grant Agreement No. 863059 – www.fns-cloud.eu (See more details about FNS-Cloud below)

FooDrugs stores information extracted from transcriptomics and text documents for foo-drug interactiosn and it is part of a demonstrator to be done in the FNS-Cloud project. The database was built using MySQL, an open source relational database management system. FooDrugs host information for a total of 161 transcriptomics GEO series with 585 conditions for food or bioactive compounds. Each condition is defined as a food/biocomponent per time point, per concentration, per cell line, primary culture or biopsy per study. FooDrugs includes information about a bipartite network with 510 nodes and their similarity scores (tau score; https://clue.io/connectopedia/connectivity_scores) related with possible drug interactions with drugs assayed in conectivity map (https://www.broadinstitute.org/connectivity-map-cmap). The information is stored in eight tables:

Table “study” : This table contains basic information about study identifiers from GEO, pubmed or platform, study type, title and abstract

Table “sample”: This table contains basic information about the different experiments in a study, like the identifier of the sample, treatment, origin type, time point or concentration.

Table “misc_study”: This table contains additional information about different attributes of the study.

Table “misc_sample”: This table contains additional information about different attributes of the sample.

Table “cmap”: This table contains information about 70895 nodes, compromising drugs, foods or bioactives, overexpressed and knockdown genes (see section 3.4). The information includes cell line, compound and perturbation type.

Table “cmap_foodrugs”: This table contains information about the tau score (see section 3.4) that relates food with drugs or genes and the node identifier in the FooDrugs network.

Table “topTable”: This table contains information about 150 over and underexpressed genes from each GEO study condition, used to calculate the tau score (see section 3.4). The information stored is the logarithmic fold change, average expression, t-statistic, p-value, adjusted p-value and if the gene is up or downregulated.

Table “nodes”: This table stores the information about the identification of the sample and the node in the bipartite network connecting the tables “sample”, “cmap_foodrugs” and “topTable”.

In addition, FooDrugs database stores a total of 6422 food/drug interactions from 2849 text documents, obtained from three different sources: 2312 documents from PubMed, 285 from DrugBank, and 252 from drugs.com. These documents describe potential interactions between 1464 food/bioactive compounds and 3009 drugs. The information is stored in two tables:

Table “texts”: This table contains all the documents with its identifiers where interactions have been identified with strategy described in section 4.

Table “TM_interactions”: This table contains information about interaction identifiers, the food and drug entities, and the start and the end positions of the context for the interaction in the document.

FNS-Cloud will overcome fragmentation problems by integrating existing FNS data, which is essential for high-end, pan-European FNS research, addressing FNS, diet, health, and consumer behaviours as well as on sustainable agriculture and the bio-economy. Current fragmented FNS resources not only result in knowledge gaps that inhibit public health and agricultural policy, and the food industry from developing effective solutions, making production sustainable and consumption healthier, but also do not enable exploitation of FNS knowledge for the benefit of European citizens. FNS-Cloud will, through three Demonstrators; Agri-Food, Nutrition & Lifestyle and NCDs & the Microbiome to facilitate: (1) Analyses of regional and country-specific differences in diet including nutrition, (epi)genetics, microbiota, consumer behaviours, culture and lifestyle and their effects on health (obesity, NCDs, ethnic and traditional foods), which are essential for public health and agri-food and health policies; (2) Improved understanding agricultural differences within Europe and what these means in terms of creating a sustainable, resilient food systems for healthy diets; and (3) Clear definitions of boundaries and how these affect the compositions of foods and consumer choices and, ultimately, personal and public health in the future. Long-term sustainability of the FNS-Cloud will be based on Services that have the capacity to link with new resources and enable cross-talk amongst them; access to FNS-Cloud data will be open access, underpinned by FAIR principles (findable, accessible, interoperable and re-useable). FNS-Cloud will work closely with the proposed Food, Nutrition and Health Research Infrastructure (FNHRI) as well as METROFOOD-RI and other existing ESFRI RIs (e.g. ELIXIR, ECRIN) in which several FNS-Cloud Beneficiaries are involved directly. (https://cordis.europa.eu/project/id/863059)

***** changes between version FooDrugs_v2 and FooDrugs_V3 (31st January 2023) are:

Increased the amount of text documents by 85.675 from PubMed and ClinicalTrials.gov, and the amount of Text Mining interactions by 168.826.

Increased the amount of transcriptomic studies by 32 GEO series.

Removed all rows in table cmap_foodrugs representing interactions with values of tau=0

Removed 43 GEO series that after manually checking didn't correspond to food compounds.

Added a new column to the table texts: citation to hold the citation of the text.

Added these columns to the table study: contributor to contain the authors of the study, publication_date to store the date of publication of the study in GEO and pubmed_id to reference the publication associated with the study if any.

Added a new column to topTable to hold the top 150 up-regulated and 150 down-regulated genes.
s
Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics
orda.shef.ac.uk
txt
Updated Oct 22, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthew Hanchard (2021). Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics [Dataset]. http://doi.org/10.15131/shef.data.16447326.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.15131/shef.data.16447326.v1
Dataset updated
Oct 22, 2021
Dataset provided by
The University of Sheffield
Authors
Matthew Hanchard
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises of two .csv format files used within workstream 2 of the Wellcome Trust funded ‘Orphan drugs: High prices, access to medicines and the transformation of biopharmaceutical innovation’ project (219875/Z/19/Z). They appear in various outputs, e.g. publications and presentations.

The deposited data were gathered using the University of Amsterdam Digital Methods Institute’s ‘Twitter Capture and Analysis Toolset’ (DMI-TCAT) before being processed and extracted from Gephi. DMI-TCAT queries Twitter’s STREAM Application Programming Interface (API) using SQL and retrieves data on a pre-set text query. It then sends the returned data for storage on a MySQL database. The tool allows for output of that data in various formats. This process aligns fully with Twitter’s service user terms and conditions. The query for the deposited dataset gathered a 1% random sample of all public tweets posted between 10-Feb-2021 and 10-Mar-2021 containing the text ‘Rare Diseases’ and/or ‘Rare Disease Day’, storing it on a local MySQL database managed by the University of Sheffield School of Sociological Studies (http://dmi-tcat.shef.ac.uk/analysis/index.php), accessible only via a valid VPN such as FortiClient and through a permitted active directory user profile. The dataset was output from the MySQL database raw as a .gexf format file, suitable for social network analysis (SNA). It was then opened using Gephi (0.9.2) data visualisation software and anonymised/pseudonymised in Gephi as per the ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee on 02-Jun-201 (reference: 039187). The deposited dataset comprises of two anonymised/pseudonymised social network analysis .csv files extracted from Gephi, one containing node data (Issue-networks as excluded publics – Nodes.csv) and another containing edge data (Issue-networks as excluded publics – Edges.csv). Where participants explicitly provided consent, their original username has been provided. Where they have provided consent on the basis that they not be identifiable, their username has been replaced with an appropriate pseudonym. All other usernames have been anonymised with a randomly generated 16-digit key. The level of anonymity for each Twitter user is provided in column C of deposited file ‘Issue-networks as excluded publics – Nodes.csv’.

This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 26-Aug-2021 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman institute/School of Sociological Studies. ORDA has full permission to store this dataset and to make it open access for public re-use without restriction under a CC BY license, in line with the Wellcome Trust commitment to making all research data Open Access.

The University of Sheffield are the designated data controller for this dataset.
f
A sample of datasets available from the EcoData Retriever.
figshare.com
plos.figshare.com
xls
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Benjamin D. Morris; Ethan P. White (2023). A sample of datasets available from the EcoData Retriever. [Dataset]. https://figshare.com/articles/dataset/_A_sample_of_datasets_available_from_the_EcoData_Retriever_/1074128
Explore at:
xlsAvailable download formats
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Benjamin D. Morris; Ethan P. White
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Tested using MySQL on a machine with 4 GB RAM and 4 x 2.4GHz processor.Includes time required to download and reformat data and import to MySQL
c
Data Base Management Systems market size was USD 50.5 billion in 2022 !
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Oct 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cognitive Market Research (2025). Data Base Management Systems market size was USD 50.5 billion in 2022 ! [Dataset]. https://www.cognitivemarketresearch.com/data-base-management-systems-market-report
Explore at:
pdf,excel,csv,pptAvailable download formats
Dataset updated
Oct 29, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
The global Data Base Management Systems market was valued at USD 50.5 billion in 2022 and is projected to reach USD 120.6 Billion by 2030, registering a CAGR of 11.5 % for the forecast period 2023-2030. Factors Affecting Data Base Management Systems Market Growth

Growing inclination of organizations towards adoption of advanced technologies like cloud-based technology favours the growth of global DBMS market

The cloud-based data base management system solutions offer the organizations with an ability to scale their database infrastructure up or down as per requirement. In a crucial business environment data volume can vary over time. Here, the cloud allows organizations to allocate resources in a dynamic and systematic manner, thereby, ensuring optimal performance without underutilization. In addition, these cloud-based solutions are cost-efficient. As, these cloud-based DBMS solutions eliminate the need for companies to maintain and invest in physical infrastructure and hardware. It helps in reducing ongoing operational costs and upfront capital expenditures. Organizations can choose pay-as-you-go pricing models, where they need to pay only for the resources they consume. Therefore, it has been a cost-efficient option for both smaller businesses and large-enterprises. Moreover, the cloud-based data base management system platforms usually come with management tools which streamline administrative tasks such as backup, provisioning, recovery, and monitoring. It allows IT teams to concentrate on more of strategic tasks rather than routine maintenance activities, thereby, enhancing operational efficiency. Whereas, these cloud-based data base management systems allow users to remote access and collaboration among teams, irrespective of their physical locations. Thus, in regards with today's work environment, which focuses on distributed and remote workforces. These cloud-based DBMS solution enables to access data and update in real-time through authorized personnel, allowing collaboration and better decision-making. Thus, owing to all the above factors, the rising adoption of advanced technologies like cloud-based DBMS is favouring the market growth.

Availability of open-source solutions is likely to restrain the global data base management systems market growth

Open-source data base management system solutions such as PostgreSQL, MongoDB, and MySQL, offer strong functionality at minimal or no licensing costs. It makes open-source solutions an attractive option for companies, especially start-ups or smaller businesses with limited budgets. As these open-source solutions offer similar capabilities to various commercial DBMS offerings, various organizations may opt for this solutions in order to save costs. The open-source solutions may benefit from active developer communities which contribute to their development, enhancement, and maintenance. This type of collaborative environment supports continuous innovation and improvement, which results into solutions that are slightly competitive with commercial offerings in terms of performance and features. Thus, the open-source solutions create competition for commercial DBMS market, they thrive in the market by offering unique value propositions, addressing needs of organizations which prioritize professional support, seamless integration into complex IT ecosystems, and advanced features. Introduction of Data Base Management Systems

A Database Management System (DBMS) is a software which is specifically designed to organize and manage data in a structured manner. This system allows users to create, modify, and query a database, and also manage the security and access controls for that particular database. The DBMS offers tools for creating and modifying data models, that define the structure and relationships of data in a database. This system is also responsible for storing and retrieving data from the database, and also provide several methods for searching and querying the data. The data base management system also offers mechanisms to control concurrent access to the database, in order to ensure that number of users may access the data. The DBMS provides tools to enforce security constraints and data integrity, such as the constraints on the value of data and access controls that restricts who can access the data. The data base management system also provides mechanisms for recovering and backing up the data when a system failure occurs....
h
text_to_sql_ko
huggingface.co
Updated Dec 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Won Heo (2024). text_to_sql_ko [Dataset]. https://huggingface.co/datasets/won75/text_to_sql_ko
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 2, 2024
Authors
Won Heo
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Korean Text to MySQL Dataset

Dataset Summary

Korean Text to MySQL is a dataset comprising approximately 3,300 samples generated using OpenAI's gpt-4o model. This dataset is designed to train models that convert natural language questions in Korean into MySQL queries. The data generation process was inspired by the Self-Instruct method and followed the steps outlined below.

Data Generation Process

Creation of SEED Dataset

Approximately 100 SEED samples were… See the full description on the dataset page: https://huggingface.co/datasets/won75/text_to_sql_ko.
Clean Meta Kaggle
kaggle.com
Updated Sep 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yoni Kremer
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Cleaned Meta-Kaggle Dataset

The Original Dataset - Meta-Kaggle

Explore our public data on competitions, datasets, kernels (code / notebooks) and more Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

https://i.imgur.com/2Egeb8R.png" alt="" title="a title">

This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

August 2023 update

In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

The Problems with the Original Dataset

The original dataset is 32 CSV files, with 268 colums and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.

The data is not normalized, so when you join tables you get a lot of errors.

Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.

There are missing values.

There are duplicate values.

There are values that are not valid. For example, Ids that are not positive integers.

The date and time columns are not in the right format.

Some columns only have the same value for all rows, so they are not useful.

The boolean columns have string values True or False.

Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.

Users upvote their own messages.

The Solution

To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.

The steps to create the database are:

Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.

Downloading the CSV files from Kaggle using the Kaggle API.

Cleaning the data using pandas. I do that by running the clean_data.py script. The script does the following steps for each table:

Drops the columns that are not needed.

Converts each column to the right data type.

Replaces foreign keys that do not exist with NULL.

Replaces some of the missing values with default values.

Removes rows where there are missing values in the primary key/not null columns.

Removes duplicate rows.

Loading the data into the database using the LOAD DATA INFILE command.

Checks that the number of rows in the database tables is the same as the number of rows in the CSV files.

Adds foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.

Update the Total columns in the database tables. I do that by running the update_totals.sql script.

Backup the database.
s
Data from: Peatland Mid-Infrared Database (1.0.0)
repository.soilwise-he.eu
Updated Sep 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Peatland Mid-Infrared Database (1.0.0) [Dataset]. http://doi.org/10.5281/zenodo.17092587
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.17092587
Dataset updated
Sep 10, 2025
Description
README 2025-09-10 Introduction The peatland mid infrared database (pmird) stores data from peat, vegetation, litter, and dissolved organic matter samples, in particular mid infrared spectra and other variables, from previously published and unpublished data sources. The majority of samples in the database are peat samples from northern bogs. Currently, the database contains entries from 26 studies, 11216 samples, and 3877 mid infrared spectra. The aim is to provide a harmonized data source that can be useful to re-analyse existing data, analyze peat chemistry, develop and test spectral prediction models, and provide data on various peat properties. Usage notes Download and Setup The peatland mid infrared database can be downloaded from https://doi.org/10.5281/zenodo.17092587. The publication contains the following files and folders: pmird-backup-2025-09-10.sql: A mysqldump backup of the pmird database. pmird_prepared_data: A folder that contains: Folders like c00001-2020-08-17-Hodgkins with the raw spectra for samples from each dataset in the pmird database (see below for how to import the spectra). Files like pmird_prepare_data_c00001-2020-08-17-Hodgkins.Rmd that contain the R code used to process and import the data from each dataset into the database. Corresponding html files contain the compiled scripts. pmird_prepare_data.Rmd: An Rmarkdown script that was used to run the scripts that created the database (the top level script). mysql_scripts: A folder that contains: pmird_mysql_initialization.sql: MariaDB script to initialize the database. 001-db-initialize.Rmd: Rmarkdown script that executes pmird_mysql_initialization.sql and populated dataset-independent tables. add-citations.Rmd: Rmarkdown script that adds information on references to the database. add-licenses.Rmd: Rmarkdown script that adds information on licenses to the database. add-mir-metadata-quality.Rmd Rmarkdown script that adds information on the quality of the infrared spectra to the database. Dockerfile: A Dockerfile that defines the computing environment used to create the database. renv.lock A renv.lock file that lists the R packages used to create the database. The database can be set up as follows: The downloaded database needs to be imported in a running MariaDB instance. In a linux terminal, the downloaded sql file can be imported like so: mysql -u

Data from: SQL Injection Attack Netflow

zenodo.org
portalcientifico.unileon.es
+3more

Updated Sep 28, 2022

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Ignacio Crespo; Ignacio Crespo; Adrián Campazas; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. http://doi.org/10.5281/zenodo.6907252

Explore at:

Unique identifier

https://doi.org/10.5281/zenodo.6907252

Dataset updated

Sep 28, 2022

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Ignacio Crespo; Ignacio Crespo; Adrián Campazas; Adrián Campazas

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Introduction

This datasets have SQL injection attacks (SLQIA) as malicious Netflow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.

NetFlow traffic has generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data generated. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

Datasets

The firts dataset was colleted to train the detection models (D1) and other collected using different attacks than those used in training to test the models and ensure their generalization (D2).

The datasets contain both benign and malicious traffic. All collected datasets are balanced.

The version of NetFlow used to build the datasets is 5.

Dataset	Aim	Samples	Benign-malicious traffic ratio
D1	Training	400,003	50%
D2	Test	57,239	50%

Infrastructure and implementation

Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.

DOROTHEA is configured to use Netflow V5 and export the flow after it is inactive for 15 seconds or after the flow is active for 1800 seconds (30 minutes)

Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).

The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

The attacks were executed on 16 nodes and launch SQLMAP with the parameters of the following table.

Parameters	Description
'--banner','--current-user','--current-db','--hostname','--is-dba','--users','--passwords','--privileges','--roles','--dbs','--tables','--columns','--schema','--count','--dump','--comments', --schema'	Enumerate users, password hashes, privileges, roles, databases, tables and columns
--level=5	Increase the probability of a false positive identification
--risk=3	Increase the probability of extracting data
--random-agent	Select the User-Agent randomly
--batch	Never ask for user input, use the default behavior
--answers="follow=Y"	Predefined answers to yes

Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MYSQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).

The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.

However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.

To run the MySQL server we ran MariaDB version 10.4.12.
Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.

Z
Dataset of raw experiments data of large triaxial-experiments
data.niaid.nih.gov
Updated Jan 18, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Biebricher, Sven F. (2021). Dataset of raw experiments data of large triaxial-experiments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4382099
Explore at:
Dataset updated
Jan 18, 2021
Dataset provided by
RWTH Aachen - Lehrstuhl für Geotechnik
Authors
Biebricher, Sven F.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set contains the measurement data of 20 large-scale triaxial (permeability included) tests carried out. Synthetic sand-lime bricks (KS XL) from the Silka company and the natural Nievelstein sandstone were sampled. The samples have a size of about 25 cm in diameter and a height of 45 cm. The experiments where carried out at the Chair of Geotechnical Engeneering at the RWTH Aachen University in Germany.

The following data were recorded for the rock samples:

geomechanical probe properties

size and weight

density (dry and wet)

porosity

permeability coefficient

uniaxial strength

During the tests the following data were recorded:

axial strain

confining and axial pressure

fluid pressure

fluid flow through specimen

temperatures

Data sets are stored in a MySQL-Database. Data sets can be interpreted with the devloped Triaxial Test Evaluation Tool.
Sample Superstore cleaned dataset
kaggle.com
zip
Updated Apr 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Prixpam (2025). Sample Superstore cleaned dataset [Dataset]. https://www.kaggle.com/prixpam/eda-on-sample-superstore-data-set-using-mysql
Explore at:
zip(563700 bytes)Available download formats
Dataset updated
Apr 4, 2025
Authors
Prixpam
Description
A Cleaned Dataset and SQL Scripts for Business Insights from the Sample Superstore

Dataset Description This project features the Sample Superstore dataset, originally sourced from Kaggle, enhanced with MySQL-based data cleaning and analysis. The dataset includes 4,929 records of sales, profit, customer, and product data from a fictional retail superstore. It has been cleaned to remove duplicate orders and paired with a comprehensive set of SQL queries to uncover actionable business insights.
p
Royal Institute for Cultural Heritage Radiocarbon and stable isotope...
pandora.earth
Updated Jul 12, 2011
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2011). Royal Institute for Cultural Heritage Radiocarbon and stable isotope measurements - Dataset - Pandora [Dataset]. https://pandora.earth/gl_ES/dataset/royal-institute-for-cultural-heritage-radiocarbon-and-stable-isotope-measurements
Explore at:
Dataset updated
Jul 12, 2011
Description
The Radiocarbon dating laboratory of IRPA/KIK was founded in the 1960s. Initially dates were reported at more or less regular intervals in the journal Radiocarbon (Schreurs 1968). Since the advent of radiocarbon dating in the 1950s it had been a common practice amongst radiocarbon laboratories to publish their dates in so-called ‘date-lists’ that were arranged per laboratory. This was first done in the Radiocarbon Supplement of the American Journal of Science and later in the specialised journal Radiocarbon. In the course of time the latter, with the added subtitle An International Journal of Cosmogenic Isotope Research, became a regular scientific journal shifting focus from date-lists to articles. Furthermore the world-wide exponential increase of radiocarbon dates made it almost impossible to publish them all in the same journal, even more so because of the broad range of applications that use radiocarbon analysis, ranging from archaeology and art history to geology and oceanography and recently also biomedical studies.The IRPA/KIK database From 1995 onwards IRPA/KIK’s Radiocarbon laboratory started to publish its dates in small publications, continuing the numbering of the preceding lists in Radiocarbon. The first booklet in this series was “Royal Institute for Cultural Heritage Radiocarbon dates XV” (Van Strydonck et al. 1995), followed by three more volumes (XVI, XVII, XVIII). The next list (XIX, 2005) was no longer printed but instead handed out as a PDF file on CD-rom. The ever increasing number of dates and the difficulties in handling all the data, however, made us look for a more permanent and easier solution. In order to improve data management and consulting, it was thus decided to gather all our dates in a web-based database. List XIX was in fact already a Microsoft Access database that was converted into a reader friendly style and could also be printed as a PDF file. However a Microsoft Access database is not the most practical solution to make information publicly available. Hence the structure of the database was recreated in Mysql and the existing content was transferred into the corresponding fields. To display the records, a web-based front-end was programmed in PHP/Apache. It features a full-text search function that allows for partial word-matching. In addition the records can be consulted in PDF format. Old records from the printed date-lists as well as new records are now added using the same Microsoft Acces back-end, which is now connected directly to the Mysql database. The main problem with introducing the old data was that not all the current criteria were available in the past (e.g. stable isotope measurements). Furthermore since all the sample information is given by the submitter, its quality largely depends on the persons willingness to contribute as well as on the accuracy and correctness of the information he provides. Sometimes problems arrive from the fact that a certain investigation (like an excavation) is carried out over a relatively long period (sometimes even more than ten years) and is directed by different people or even institutions. This can lead to differences in the labeling procedure of the samples, but also in the interpretation of structures and artifacts and in the orthography of the site’s name. Finally the submitter might change address, while the names of institutions or even regions and countries might change as well (e.g.Zaire - Congo)
D
Replication Data for: Two origins of the prefix IZ- and how they affect the...
dataverse.azure.uit.no
dataverse.no
pdf, txt, xlsx
Updated Jul 18, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anna Endresen; Anna Endresen (2022). Replication Data for: Two origins of the prefix IZ- and how they affect the VY- vs. IZ- correlation in Modern Russian. [Dataset]. http://doi.org/10.18710/NFNB8D
Explore at:
xlsx(354229), txt(8075), txt(491558), pdf(61023)Available download formats
Unique identifier
https://doi.org/10.18710/NFNB8D
Dataset updated
Jul 18, 2022
Dataset provided by
DataverseNO
Authors
Anna Endresen; Anna Endresen
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
Jan 1, 1950 - Jun 14, 2014
Description
This is the data examined in the study of Modern Russian verbs formed with the prefixes VY- and IZ-, a native East Slavic prefix and a loan Church Slavonic prefix, both of which mean ‘out of’. The study provides a synchronic contrastive analysis of the two prefixes and discusses how much they are semantically similar and what determines their distribution across Russian verbs. The dataset “VY_IZ_DATABASE_2019” provides replication data for the article “Two origins of the prefix IZ- and how they affect the VY- vs. IZ- correlation in Modern Russian” accepted for publication in Russian Linguistics. International Journal for the Study of Russian and other Slavic Languages 43(3). The amount of data examined in this study exceeds all previous accounts of the issue. The database contains 989 prefixed verbs. The verbs were culled from the Modern Subcorpus of the Russian National Corpus (www.ruscorpora.ru) and manually tagged for a number of parameters. The data was extracted automatically via the software management program MySQL. After that each verb was double-checked in the corpus and analyzed. In the database, each verb is accompanied with an English gloss, simplex base, corpus frequency, corpus example of its use, and a number of tags relevant for this study (type of perfective, submeaning of the prefix, etc.). The structure of the database is described in detail in the document “ReadMe”. Here is the abstract of the article: This article reports on a synchronic study of 989 Modern Russian verbs formed with the prefixes VY- and IZ-, including standard lexemes, obsolete verbs, and newly-formed coinages culled from the Russian National Corpus. I argue that the hypothesis about the two historical origins of the prefix IZ- may explain the ambivalent behavior of this prefix in Modern Russian, which shows both semantic overlap and semantic contrast with the prefix VY-. I revisit the most detailed semantic account of the two prefixes (Nesset et al 2011) and provide additional support for their model of polysemy in terms of type and token frequencies of the analyzed verbs. I further propose that VY- and IZ- encode different spatial image schemas and thus explain why the prefix IZ- is compatible with verbs of multidirectional motion, whereas VY- preferably attaches to verbs of unidirectional motion; why the verbs prefixed in IZ- often carry a more evocative flavor and refer to more intensive activities than those described by parallel verbs in VY-; why IZ- encodes multiplication of an action named by the base and why this is not common for VY-; and finally how it is possible for IZ- to have both bookish and colloquial uses, being very obsolete and highly productive in different submeanings.
CHINOOK Music
kaggle.com
zip
Updated Sep 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
willian oliveira (2024). CHINOOK Music [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/chinook-music
Explore at:
zip(9603 bytes)Available download formats
Dataset updated
Sep 19, 2024
Authors
willian oliveira
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The Chinook Database is a sample database designed for use with multiple database platforms, such as SQL Server, Oracle, MySQL, and others. It can be easily set up by running a single SQL script, making it a convenient alternative to the popular Northwind database. Chinook is widely used in demos and testing environments, particularly for Object-Relational Mapping (ORM) tools that target both single and multiple database servers.

Supported Database Servers Chinook supports several database servers, including:

DB2 MySQL Oracle PostgreSQL SQL Server SQL Server Compact SQLite Download Instructions You can download the SQL scripts for each supported database server from the latest release assets. The appropriate SQL script file(s) for your database vendor are provided, which can be executed using your preferred database management tool.

Data Model The Chinook Database represents a digital media store, containing tables that include:

Artists Albums Media tracks Invoices Customers Sample Data The media data in Chinook is derived from a real iTunes Library, providing a realistic dataset for users. Additionally, users can generate their own SQL scripts using their personal iTunes Library by following specific instructions. Customer and employee details in the database were manually crafted with fictitious names, addresses (mappable via Google Maps), and well-structured contact information such as phone numbers, faxes, and emails. Sales data is auto-generated and spans a four-year period, using random values.

Why is it Called Chinook? The Chinook Database's name is a nod to its predecessor, the Northwind database. Chinooks are warm, dry winds found in the interior regions of North America, particularly over southern Alberta in Canada, where the Canadian Prairies meet mountain ranges. This natural phenomenon inspired the choice of name, reflecting the idea that Chinook serves as a refreshing alternative to the Northwind database.
g
Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus...
search.gesis.org
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Brentel, Inga; Kampes, Céline Fabienne; Jandura, Olaf, Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus (2014-2016) [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2030
Explore at:
Dataset provided by
GESIS search
GESIS, Köln
Authors
Brentel, Inga; Kampes, Céline Fabienne; Jandura, Olaf
License
https://www.gesis.org/en/institute/data-usage-termshttps://www.gesis.org/en/institute/data-usage-terms
Description
Bei dem aufbereiteten Längsschnitt-Datensatzes 2014 bis 2016 handelt es sich um „Big-Data“, weshalb der Gesamtdatensatz nur in Form einer Datenbank (MySQL) verfügbar sein wird. In dieser Datenbank liegt die Information verschiedener Variablen eines Befragten untereinander. Die vorliegende Publikation umfasst eine SQL-Datenbank mit den Meta-Daten des Sample des Gesamtdatensatzes, das einen Ausschnitt der verfügbaren Variablen des Gesamtdatensatzes darstellt und die Struktur der aufbereiteten Daten darlegen soll, und eine Datendokumentation des Samples. Für diesen Zweck beinhaltet das Sample alle Variablen der Soziodemographie, dem Freizeitverhalten, der Zusatzinformation zu einem Befragten und dessen Haushalt sowie den interviewspezifischen Variablen und Gewichte. Lediglich bei den Variablen bezüglich der Mediennutzung des Befragten, handelt es sich um eine kleine Auswahl: Für die Onlinemediennutzung wurden die Variablen aller Gesamtangebote sowie der Einzelangebote der Genre Politik und Digital aufgenommen. Die Mediennutzung von Radio, Print und TV wurde im Sample nicht berücksichtigt, da deren Struktur anhand der veröffentlichten Längsschnittdaten der Media-Analyse MA Radio, MA Pressemedien und MA Intermedia nachvollzogen werden kann.
Die Datenbank mit den tatsächlichen Befragungsdaten wäre auf Grund der Größe des Datenmaterials bereits im kritischen Bereich der Dateigröße für den normalen Up- und Download. Die tatsächlichen Befragungsergebnisse, die zur Analyse nötig sind, werden dann 2021 in Form des Gesamtdatensatzes der Media-Analyse-Daten: IntermediaPlus (2014-2016) im DBK bei GESIS veröffentlicht werden.

Die Daten sowie deren Datenaufbereitung sind ein Vorschlag eines Best-Practice Cases für Big-Data Management bzw. den Umgang mit Big-Data in den Sozialwissenschaften und mit sozialwissenschaftlichen Daten. Unter Verwendung der GESIS Software CharmStats, die im Rahmen dieses Projektes um Big-Data Features erweitert wurde, erfolgt die Dokumentation und Herstellung der Transparenz der Harmonisierungsarbeit. Durch ein Python-Skript sowie ein html-Template wurde der Arbeitsprozess um und mit CharmStats zudem stärker automatisiert.

Der aufbereitete Längsschnitt des Gesamtdatensatzes der MA IntermediaPlus für 2014 bis 2016 wird 2021 in Kooperation mit GESIS herausgegeben werden und den FAIR-Prinzipien (Wilkinson et al. 2016) entsprechend verfügbar gemacht werden. Ziel ist es durch die Harmonisierung der einzelnen Querschnitte die Datenquelle der Media-Analyse, die im Rahmen des Dissertationsprojektes „Angebots- und Publikumsfragmentierung online“ durch Inga Brentel und Céline Fabienne Kampes erfolgt, für Forschung zum sozialen und medialen Wandel in der Bundesrepublik Deutschland zugänglich zu machen.

Künftige Studiennummer des Gesamtdatensatzes der IndermediaPlus im DBK der GESIS: ZA5769 (Version 1-0-0) und der doi: https://dx.doi.org/10.4232/1.13530

****************English Version****************

The prepared Longitudinal IntermediaPlus dataset 2014 to 2016 is a "big data", which is why the entire dataset will only be available in the form of a database (MySQL). In this database, the information of different variables of a respondent is organized in one column, one below the other. The present publication includes a SQL-Database with the meta data of a sample of the full database, which represents a section of the available variables of the total data set and is intended to show the structure of the prepared data and the data-documentation (codebook) of the sample. For this purpose, the sample contains all variables of sociodemography, free-time activities, additional information on a respondent and his household as well as the interview-specific variables and weights. Only the variables concerning the respondent's media use are a small selection: For online media use, the variables of all overall offerings as well as the individual offerings of the genres politics and digital were included. The media use of radio, print and TV was not included in the sample because its structure can be traced using the published longitudinal data of the media analysis MA Radio, MA Pressemedien and MA Intermedia.
Due to the size of the datafile, the database with the actual survey data would already be in the critical range of the file size for the common upload and download. The actual survey result...
Z
Data from: KGCW 2023 Challenge @ ESWC 2023
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated May 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689309
Explore at:
Dataset updated
May 17, 2023
Dataset provided by
Universidad Politécnica de Madrid
KU Leuven
STI Insbruck
IDLab - Ghent University - imec
Authors
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics e.g. CPU or memory usage are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have more insights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from the public transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps for each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. Query timeout is set to 1 hour and knowledge graph construction timeout to 24 hours. The execution is performed with the following tool , you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with the following files:

Input dataset as CSV.

Mapping file as RML.

Queries as SPARQL.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

SPARQL queries to retrieve the results for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples 0 percent 2000000 triples 25 percent 1500020 triples 50 percent 1000020 triples 75 percent 500020 triples 100 percent 20 triples

Empty values

Scale Number of Triples 0 percent 2000000 triples 25 percent 1500000 triples 50 percent 1000000 triples 75 percent 500000 triples 100 percent 0 triples

Mappings

Scale Number of Triples 1TM + 15POM 1500000 triples 3TM + 5POM 1500000 triples 5TM + 3POM 1500000 triples 15TM + 1POM 1500000 triples

Properties

Scale Number of Triples 1M rows 1 column 1000000 triples 1M rows 10 columns 10000000 triples 1M rows 20 columns 20000000 triples 1M rows 30 columns 30000000 triples

Records

Scale Number of Triples 10K rows 20 columns 200000 triples 100K rows 20 columns 2000000 triples 1M rows 20 columns 20000000 triples 10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples 0 percent 0 triples 25 percent 125000 triples 50 percent 250000 triples 75 percent 375000 triples 100 percent 500000 triples

1-N joins

Scale Number of Triples 1-10 0 percent 0 triples 1-10 25 percent 125000 triples 1-10 50 percent 250000 triples 1-10 75 percent 375000 triples 1-10 100 percent 500000 triples 1-5 50 percent 250000 triples 1-10 50 percent 250000 triples 1-15 50 percent 250005 triples 1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples 10-1 0 percent 0 triples 10-1 25 percent 125000 triples 10-1 50 percent 250000 triples 10-1 75 percent 375000 triples 10-1 100 percent 500000 triples 5-1 50 percent 250000 triples 10-1 50 percent 250000 triples 15-1 50 percent 250005 triples 20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples 5-5 50 percent 1374085 triples 10-5 50 percent 1375185 triples 5-10 50 percent 1375290 triples 5-5 25 percent 718785 triples 5-5 50 percent 1374085 triples 5-5 75 percent 1968100 triples 5-5 100 percent 2500000 triples 5-10 25 percent 719310 triples 5-10 50 percent 1375290 triples 5-10 75 percent 1967660 triples 5-10 100 percent 2500000 triples 10-5 25 percent 719370 triples 10-5 50 percent 1375185 triples 10-5 75 percent 1968235 triples 10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples 1 395953 triples 10 3959530 triples 100 39595300 triples 1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000 Q1 58540 results 585400 results No results available No results available Q2 636 results 11998 results 125565 results 1261368 results Q3 421 results 4207 results 42067 results 420667 results Q4 13 results 130 results 1300 results 13000 results Q5 35 results 350 results 3500 results 35000

Facebook

Twitter

Click to copy link

Link copied

Cite

Dillon Myrick (2023). Bike Store Relational Database | SQL [Dataset]. https://www.kaggle.com/datasets/dillonmyrick/bike-store-sample-database

Bike Store Relational Database | SQL

Sample database from sqlservertutorial.net for a retail bike store.

Explore at:

zip(94412 bytes)Available download formats

Dataset updated

Aug 21, 2023

Authors

Dillon Myrick

Description

This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.

Database Diagram:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media" alt="">

The sample database is copyrighted and cannot be used for commercial purposes. For example, it cannot be used for the following but is not limited to the purposes: - Selling - Including in paid courses

Clear search

Close search

Google apps

Main menu

Bike Store Relational Database | SQL

classicmodels

SQLite Sakila Sample Database

SQLite Sakila Sample Database

Database Description

Files Description

Employees

Dataset

Contents

Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and...

FooDrugs database: A database with molecular and text information about food...

Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics

A sample of datasets available from the EcoData Retriever.

Data Base Management Systems market size was USD 50.5 billion in 2022 !

text_to_sql_ko

Clean Meta Kaggle

Cleaned Meta-Kaggle Dataset

The Original Dataset - Meta-Kaggle

August 2023 update

The Problems with the Original Dataset

The Solution

Data from: Peatland Mid-Infrared Database (1.0.0)

Data from: SQL Injection Attack Netflow

Dataset of raw experiments data of large triaxial-experiments

Sample Superstore cleaned dataset

Royal Institute for Cultural Heritage Radiocarbon and stable isotope...

Replication Data for: Two origins of the prefix IZ- and how they affect the...

CHINOOK Music

Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus...

Data from: KGCW 2023 Challenge @ ESWC 2023

Bike Store Relational Database | SQL

Sample database from sqlservertutorial.net for a retail bike store.