24 datasets found
  1. SQLite Sakila Sample Database

    • kaggle.com
    zip
    Updated Mar 14, 2021
    Cite
    Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/datasets/atanaskanev/sqlite-sakila-sample-database/code
    Explore at:
    Available download formats: zip (4495190 bytes)
    Dataset updated
    Mar 14, 2021
    Authors
    Atanas Kanev
    Description

    SQLite Sakila Sample Database

    Database Description

    The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

    Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

    • Oracle
    • SQL Server
    • SQLite
    • Interbase/Firebird
    • Microsoft Access

    Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators decide which database to use for the development of new products. The user can run the same SQL against different kinds of databases and compare the performance.

    License: BSD Copyright DB Software Laboratory http://www.etl-tools.com

    Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

    Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

    Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

    https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

    Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

    Files Description

    The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

    sqlite> .open sqlite-sakila.db                # creates the .db file
    sqlite> .read sqlite-sakila-schema.sql        # creates the database schema
    sqlite> .read sqlite-sakila-insert-data.sql   # inserts the data

    Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
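
    As a minimal sketch of querying the prebuilt file from Python, the standard-library sqlite3 module can open a .db file directly. The helper name list_tables is hypothetical, and the demo uses an in-memory stand-in with a cut-down film table (only the three columns named above) so that it runs without the download; the real schema has many more tables and columns.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Assumption: an in-memory stand-in. Replace ":memory:" with the path to
# sqlite-sakila.db (and skip the CREATE/INSERT statements) to query the real file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, description TEXT)"
)
conn.execute("CREATE TABLE film_text (film_id INTEGER, title TEXT, description TEXT)")
conn.execute(
    "INSERT INTO film (film_id, title, description) "
    "VALUES (1, 'EXAMPLE TITLE', 'Illustrative row, not real data')"
)

tables = list_tables(conn)
film_rows = conn.execute("SELECT COUNT(*) FROM film").fetchone()[0]
film_text_rows = conn.execute("SELECT COUNT(*) FROM film_text").fetchone()[0]
print(tables, film_rows, film_text_rows)
```

    Counting the rows of film_text in the real file is a quick way to confirm the note above that this table is empty in this version.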

  2. classicmodels

    • kaggle.com
    zip
    Updated Apr 22, 2024
    + more versions
    Cite
    Ambreen (2024). classicmodels [Dataset]. https://www.kaggle.com/datasets/ambreenabdulraheem/classicmodels
    Explore at:
    Available download formats: zip (879935 bytes)
    Dataset updated
    Apr 22, 2024
    Authors
    Ambreen
    Description

    MySQL Sample Database Schema. The MySQL sample database schema consists of the following tables:

    customers: stores customer’s data.

    products: stores a list of scale model cars.

    productlines: stores a list of product lines.

    orders: stores sales orders placed by customers.

    orderdetails: stores sales order line items for every sales order.

    payments: stores payments made by customers based on their accounts.

    employees: stores employee information and the organization structure such as who reports to whom.

    offices: stores sales office data.
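
    The relationships between these tables can be sketched with a small join. This is an illustration only: the column names below (customerNumber, orderNumber, quantityOrdered, priceEach) and the sample rows are assumptions made for the sketch, so check the dataset's own schema before relying on them. SQLite is used in memory here so that the example runs without a MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: column names and rows are illustrative, not taken from the dataset.
conn.executescript(
    """
    CREATE TABLE customers (customerNumber INTEGER PRIMARY KEY, customerName TEXT);
    CREATE TABLE orders (
        orderNumber INTEGER PRIMARY KEY,
        customerNumber INTEGER REFERENCES customers (customerNumber)
    );
    CREATE TABLE orderdetails (
        orderNumber INTEGER REFERENCES orders (orderNumber),
        productCode TEXT,
        quantityOrdered INTEGER,
        priceEach REAL
    );
    INSERT INTO customers VALUES (1, 'Example Customer');
    INSERT INTO orders VALUES (100, 1), (101, 1);
    INSERT INTO orderdetails VALUES
        (100, 'P1', 2, 10.0),
        (100, 'P2', 1, 5.0),
        (101, 'P1', 3, 10.0);
    """
)

# Each order has one row in orders and one row per line item in orderdetails,
# so the order total is the sum over its line items.
order_totals = conn.execute(
    """
    SELECT o.orderNumber, SUM(d.quantityOrdered * d.priceEach) AS total
    FROM orders AS o
    JOIN orderdetails AS d ON d.orderNumber = o.orderNumber
    GROUP BY o.orderNumber
    ORDER BY o.orderNumber
    """
).fetchall()
print(order_totals)
```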

  3. Bike Store Relational Database | SQL

    • kaggle.com
    zip
    Updated Aug 21, 2023
    Cite
    Dillon Myrick (2023). Bike Store Relational Database | SQL [Dataset]. https://www.kaggle.com/datasets/dillonmyrick/bike-store-sample-database
    Explore at:
    Available download formats: zip (94412 bytes)
    Dataset updated
    Aug 21, 2023
    Authors
    Dillon Myrick
    Description

    This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.

    Database Diagram:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media

    Terms of Use

    The sample database is copyrighted and cannot be used for commercial purposes. Prohibited uses include, but are not limited to:
    • Selling
    • Including in paid courses

  4. FooDrugs database: A database with molecular and text information about food...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 28, 2023
    Cite
    Garranzo, Marco; Piette Gómez, Óscar; Lacruz Pleguezuelos, Blanca; Pérez, David; Laguna Lobo, Teresa; Carrillo de Santa Pau, Enrique (2023). FooDrugs database: A database with molecular and text information about food - drug interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6638469
    Explore at:
    Dataset updated
    Jul 28, 2023
    Dataset provided by
    IMDEA Food Institute
    Authors
    Garranzo, Marco; Piette Gómez, Óscar; Lacruz Pleguezuelos, Blanca; Pérez, David; Laguna Lobo, Teresa; Carrillo de Santa Pau, Enrique
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FooDrugs database is a development done by the Computational Biology Group at IMDEA Food Institute (Madrid, Spain), in the context of the Food Nutrition Security Cloud (FNS-Cloud) project. Food Nutrition Security Cloud (FNS-Cloud) has received funding from the European Union's Horizon 2020 Research and Innovation programme (H2020-EU.3.2.2.3. – A sustainable and competitive agri-food industry) under Grant Agreement No. 863059 – www.fns-cloud.eu (See more details about FNS-Cloud below)

    FooDrugs stores information extracted from transcriptomics and text documents for food-drug interactions and it is part of a demonstrator to be done in the FNS-Cloud project. The database was built using MySQL, an open source relational database management system. FooDrugs hosts information for a total of 161 transcriptomics GEO series with 585 conditions for food or bioactive compounds. Each condition is defined as a food/biocomponent per time point, per concentration, per cell line, primary culture or biopsy per study. FooDrugs includes information about a bipartite network with 510 nodes and their similarity scores (tau score; https://clue.io/connectopedia/connectivity_scores) related to possible interactions with drugs assayed in the Connectivity Map (https://www.broadinstitute.org/connectivity-map-cmap). The information is stored in eight tables:

    Table “study” : This table contains basic information about study identifiers from GEO, pubmed or platform, study type, title and abstract

    Table “sample”: This table contains basic information about the different experiments in a study, like the identifier of the sample, treatment, origin type, time point or concentration.

    Table “misc_study”: This table contains additional information about different attributes of the study.

    Table “misc_sample”: This table contains additional information about different attributes of the sample.

    Table “cmap”: This table contains information about 70895 nodes, comprising drugs, foods or bioactives, overexpressed and knockdown genes (see section 3.4). The information includes cell line, compound and perturbation type.

    Table “cmap_foodrugs”: This table contains information about the tau score (see section 3.4) that relates food with drugs or genes and the node identifier in the FooDrugs network.

    Table “topTable”: This table contains information about 150 over and underexpressed genes from each GEO study condition, used to calculate the tau score (see section 3.4). The information stored is the logarithmic fold change, average expression, t-statistic, p-value, adjusted p-value and if the gene is up or downregulated.

    Table “nodes”: This table stores the information about the identification of the sample and the node in the bipartite network connecting the tables “sample”, “cmap_foodrugs” and “topTable”.

    In addition, FooDrugs database stores a total of 6422 food/drug interactions from 2849 text documents, obtained from three different sources: 2312 documents from PubMed, 285 from DrugBank, and 252 from drugs.com. These documents describe potential interactions between 1464 food/bioactive compounds and 3009 drugs. The information is stored in two tables:

    Table “texts”: This table contains all the documents, with their identifiers, in which interactions have been identified with the strategy described in section 4.

    Table “TM_interactions”: This table contains information about interaction identifiers, the food and drug entities, and the start and the end positions of the context for the interaction in the document.

    FNS-Cloud will overcome fragmentation problems by integrating existing FNS data, which is essential for high-end, pan-European FNS research, addressing FNS, diet, health, and consumer behaviours as well as sustainable agriculture and the bio-economy. Current fragmented FNS resources not only result in knowledge gaps that inhibit public health and agricultural policy, and the food industry, from developing effective solutions, making production sustainable and consumption healthier, but also do not enable exploitation of FNS knowledge for the benefit of European citizens. FNS-Cloud will, through three Demonstrators (Agri-Food, Nutrition & Lifestyle, and NCDs & the Microbiome), facilitate: (1) Analyses of regional and country-specific differences in diet including nutrition, (epi)genetics, microbiota, consumer behaviours, culture and lifestyle and their effects on health (obesity, NCDs, ethnic and traditional foods), which are essential for public health and agri-food and health policies; (2) Improved understanding of agricultural differences within Europe and what these mean in terms of creating sustainable, resilient food systems for healthy diets; and (3) Clear definitions of boundaries and how these affect the compositions of foods and consumer choices and, ultimately, personal and public health in the future. Long-term sustainability of the FNS-Cloud will be based on Services that have the capacity to link with new resources and enable cross-talk amongst them; access to FNS-Cloud data will be open access, underpinned by FAIR principles (findable, accessible, interoperable and re-useable). FNS-Cloud will work closely with the proposed Food, Nutrition and Health Research Infrastructure (FNHRI) as well as METROFOOD-RI and other existing ESFRI RIs (e.g. ELIXIR, ECRIN) in which several FNS-Cloud Beneficiaries are involved directly. (https://cordis.europa.eu/project/id/863059)

    Changes between versions FooDrugs_v2 and FooDrugs_V3 (31st January 2023):

    Increased the number of text documents by 85.675 from PubMed and ClinicalTrials.gov, and the number of text-mining interactions by 168.826.

    Increased the amount of transcriptomic studies by 32 GEO series.

    Removed all rows in table cmap_foodrugs representing interactions with values of tau=0

    Removed 43 GEO series that, after manual checking, did not correspond to food compounds.

    Added a new column to the table texts: citation to hold the citation of the text.

    Added these columns to the table study: contributor to contain the authors of the study, publication_date to store the date of publication of the study in GEO and pubmed_id to reference the publication associated with the study if any.

    Added a new column to topTable to hold the top 150 up-regulated and 150 down-regulated genes.

  5. Employees

    • kaggle.com
    zip
    Updated Nov 12, 2021
    Cite
    Sudhir Singh (2021). Employees [Dataset]. https://www.kaggle.com/datasets/crepantherx/employees
    Explore at:
    Available download formats: zip (31992550 bytes)
    Dataset updated
    Nov 12, 2021
    Authors
    Sudhir Singh
    Description

    Dataset

    This dataset was created by Sudhir Singh

    Released under Data files © Original Authors

    Contents

  6. Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and...

    • data.niaid.nih.gov
    Updated Aug 3, 2024
    Cite
    Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V. (2024). Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and KDE [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_400614
    Explore at:
    Dataset updated
    Aug 3, 2024
    Dataset provided by
    Ryerson University
    Authors
    Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousand defect reports over a period of 18 years (1999-2017) to capture the inter-relationships among duplicate defects.

    File Descriptions

    apache.csv - Apache Defect Rediscovery dataset

    eclipse.csv - Eclipse Defect Rediscovery dataset

    kde.csv - KDE Defect Rediscovery dataset

    apache.relations.csv - Inter-relations of rediscovered defects of Apache

    eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse

    kde.relations.csv - Inter-relations of rediscovered defects of KDE

    create_and_populate_neo4j_objects.cypher - Populates the Neo4j graph database by importing all the data from the CSV files. Note that you have to set the dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files, as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping

    create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files

    rediscovery_db_mysql.zip - For your convenience, we also provide full backup of the MySQL database

    neo4j_examples.txt - Sample Neo4j queries

    mysql_examples.txt - Sample MySQL queries

    rediscovery_eclipse_6325.png - Output of Neo4j example #1

    distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project

  7. MySQL Java Computer Programs

    • figshare.com
    zip
    Updated Jul 3, 2017
    Cite
    Suhailan Safei (2017). MySQL Java Computer Programs [Dataset]. http://doi.org/10.6084/m9.figshare.2813497.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 3, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Suhailan Safei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This MySQL database contains a list of submitted Java programs based on a series of online lab exercises from 2013 to 2015. The programs were submitted by first-year computer science students from the Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Malaysia, who were taking the Introductory Computer Programming subject. There were 67, 18 and 47 participating students in 2013, 2014 and 2015 respectively. The submitted programs are all of their solution attempts at answering a computational programming question. The question was as follows:

    Write a program that will read string. Then your program should show all the string character using * except for character 2, output its real character.
    Sample input: Apology
    Sample output: p****

  8. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more

    Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    https://i.imgur.com/2Egeb8R.png

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
    • Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.
    • Downloading the CSV files from Kaggle using the Kaggle API.
    • Cleaning the data using pandas. I do that by running the clean_data.py script. The script does the following steps for each table:
      • Drops the columns that are not needed.
      • Converts each column to the right data type.
      • Replaces foreign keys that do not exist with NULL.
      • Replaces some of the missing values with default values.
      • Removes rows where there are missing values in the primary key/not null columns.
      • Removes duplicate rows.
    • Loading the data into the database using the LOAD DATA INFILE command.
    • Checking that the number of rows in the database tables is the same as the number of rows in the CSV files.
    • Adding foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.
    • Updating the Total columns in the database tables. I do that by running the update_totals.sql script.
    • Backing up the database.
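
    Three of the cleaning steps listed above (removing rows with a missing primary key, replacing dangling foreign keys with NULL, and removing duplicate rows) can be sketched as follows. This is not the author's clean_data.py, which uses pandas; it is a dependency-free illustration on plain Python dictionaries, and the function name clean_rows, its parameters, and the sample rows are all hypothetical.

```python
def clean_rows(rows, primary_key, foreign_keys):
    """Clean a list of dict rows.

    foreign_keys maps a column name to the set of ids that exist in the
    referenced table (hypothetical inputs for this sketch).
    """
    cleaned, seen = [], set()
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        # Remove rows where the primary key is missing.
        if row.get(primary_key) is None:
            continue
        # Replace foreign keys that do not exist with None (NULL in SQL).
        for column, valid_ids in foreign_keys.items():
            if row.get(column) not in valid_ids:
                row[column] = None
        # Remove duplicate rows (same value in every column).
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Illustrative data only: a tiny stand-in for ForumMessages and Users.
user_ids = {1, 2}
forum_messages = [
    {"Id": 10, "UserId": 1},
    {"Id": 11, "UserId": 99},   # UserId 99 does not exist in Users
    {"Id": None, "UserId": 2},  # missing primary key
    {"Id": 10, "UserId": 1},    # duplicate of the first row
]
cleaned_messages = clean_rows(forum_messages, "Id", {"UserId": user_ids})
print(cleaned_messages)
```

    Nulling out dangling keys before loading is what allows the foreign key constraints to be added afterwards without violations.
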
  9. Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics

    • orda.shef.ac.uk
    txt
    Updated Oct 22, 2021
    Cite
    Matthew Hanchard (2021). Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics [Dataset]. http://doi.org/10.15131/shef.data.16447326.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    The University of Sheffield
    Authors
    Matthew Hanchard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises two .csv format files used within workstream 2 of the Wellcome Trust funded ‘Orphan drugs: High prices, access to medicines and the transformation of biopharmaceutical innovation’ project (219875/Z/19/Z). They appear in various outputs, e.g. publications and presentations.

    The deposited data were gathered using the University of Amsterdam Digital Methods Institute’s ‘Twitter Capture and Analysis Toolset’ (DMI-TCAT) before being processed and extracted from Gephi. DMI-TCAT queries Twitter’s STREAM Application Programming Interface (API) using SQL and retrieves data on a pre-set text query. It then sends the returned data for storage on a MySQL database. The tool allows for output of that data in various formats. This process aligns fully with Twitter’s service user terms and conditions.

    The query for the deposited dataset gathered a 1% random sample of all public tweets posted between 10-Feb-2021 and 10-Mar-2021 containing the text ‘Rare Diseases’ and/or ‘Rare Disease Day’, storing it on a local MySQL database managed by the University of Sheffield School of Sociological Studies (http://dmi-tcat.shef.ac.uk/analysis/index.php), accessible only via a valid VPN such as FortiClient and through a permitted active directory user profile. The dataset was output from the MySQL database raw as a .gexf format file, suitable for social network analysis (SNA). It was then opened using Gephi (0.9.2) data visualisation software and anonymised/pseudonymised in Gephi as per the ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee on 02-Jun-201 (reference: 039187).

    The deposited dataset comprises two anonymised/pseudonymised social network analysis .csv files extracted from Gephi, one containing node data (Issue-networks as excluded publics – Nodes.csv) and another containing edge data (Issue-networks as excluded publics – Edges.csv). Where participants explicitly provided consent, their original username has been provided. Where they have provided consent on the basis that they not be identifiable, their username has been replaced with an appropriate pseudonym. All other usernames have been anonymised with a randomly generated 16-digit key. The level of anonymity for each Twitter user is provided in column C of deposited file ‘Issue-networks as excluded publics – Nodes.csv’.
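
    To illustrate the labelling scheme described above, the sketch below keeps the original username for users who consented to being named and assigns every other user a randomly generated 16-digit key. It is not the procedure used for the deposited files (that was carried out in Gephi), it omits the intermediate case of hand-chosen pseudonyms, and the names random_key, anonymise and keep_original are hypothetical.

```python
import secrets


def random_key(length=16):
    """Return a random string of decimal digits of the given length."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def anonymise(usernames, keep_original):
    """Map each username to itself (if consented) or to a unique random key."""
    mapping = {}
    used_keys = set()
    for name in usernames:
        if name in mapping:
            continue  # repeated usernames reuse the label already assigned
        if name in keep_original:
            mapping[name] = name
            continue
        key = random_key()
        while key in used_keys:  # regenerate on the (very unlikely) collision
            key = random_key()
        used_keys.add(key)
        mapping[name] = key
    return mapping


# Illustrative usernames only, not taken from the dataset.
labels = anonymise(["alice", "bob", "carol", "bob"], keep_original={"alice"})
print(labels)
```

    Using one mapping for both the node file and the edge file keeps the network structure intact, because every edge endpoint is replaced by the same label as its node.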

    This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 26-Aug-2021 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman institute/School of Sociological Studies. ORDA has full permission to store this dataset and to make it open access for public re-use without restriction under a CC BY license, in line with the Wellcome Trust commitment to making all research data Open Access.

    The University of Sheffield are the designated data controller for this dataset.

  10. Royal Institute for Cultural Heritage Radiocarbon and stable isotope...

    • pandora.earth
    Updated Jul 12, 2011
    + more versions
    Cite
    (2011). Royal Institute for Cultural Heritage Radiocarbon and stable isotope measurements - Dataset - Pandora [Dataset]. https://pandora.earth/gl_ES/dataset/royal-institute-for-cultural-heritage-radiocarbon-and-stable-isotope-measurements
    Explore at:
    Dataset updated
    Jul 12, 2011
    Description

    The Radiocarbon dating laboratory of IRPA/KIK was founded in the 1960s. Initially dates were reported at more or less regular intervals in the journal Radiocarbon (Schreurs 1968). Since the advent of radiocarbon dating in the 1950s it had been a common practice amongst radiocarbon laboratories to publish their dates in so-called ‘date-lists’ that were arranged per laboratory. This was first done in the Radiocarbon Supplement of the American Journal of Science and later in the specialised journal Radiocarbon. In the course of time the latter, with the added subtitle An International Journal of Cosmogenic Isotope Research, became a regular scientific journal shifting focus from date-lists to articles. Furthermore the world-wide exponential increase of radiocarbon dates made it almost impossible to publish them all in the same journal, even more so because of the broad range of applications that use radiocarbon analysis, ranging from archaeology and art history to geology and oceanography and recently also biomedical studies.

    The IRPA/KIK database

    From 1995 onwards IRPA/KIK’s Radiocarbon laboratory started to publish its dates in small publications, continuing the numbering of the preceding lists in Radiocarbon. The first booklet in this series was “Royal Institute for Cultural Heritage Radiocarbon dates XV” (Van Strydonck et al. 1995), followed by three more volumes (XVI, XVII, XVIII). The next list (XIX, 2005) was no longer printed but instead handed out as a PDF file on CD-ROM. The ever increasing number of dates and the difficulties in handling all the data, however, made us look for a more permanent and easier solution. In order to improve data management and consulting, it was thus decided to gather all our dates in a web-based database. List XIX was in fact already a Microsoft Access database that was converted into a reader-friendly style and could also be printed as a PDF file.

    However, a Microsoft Access database is not the most practical solution to make information publicly available. Hence the structure of the database was recreated in MySQL and the existing content was transferred into the corresponding fields. To display the records, a web-based front-end was programmed in PHP/Apache. It features a full-text search function that allows for partial word-matching. In addition the records can be consulted in PDF format. Old records from the printed date-lists as well as new records are now added using the same Microsoft Access back-end, which is now connected directly to the MySQL database.

    The main problem with introducing the old data was that not all the current criteria were available in the past (e.g. stable isotope measurements). Furthermore, since all the sample information is given by the submitter, its quality largely depends on the person's willingness to contribute as well as on the accuracy and correctness of the information provided. Sometimes problems arise from the fact that a certain investigation (like an excavation) is carried out over a relatively long period (sometimes even more than ten years) and is directed by different people or even institutions. This can lead to differences in the labeling procedure of the samples, but also in the interpretation of structures and artifacts and in the orthography of the site’s name. Finally the submitter might change address, while the names of institutions or even regions and countries might change as well (e.g. Zaire - Congo).

  11. Heparome

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Oct 11, 2025
    Cite
    (2025). Heparome [Dataset]. http://identifiers.org/RRID:SCR_008615
    Explore at:
    Dataset updated
    Oct 11, 2025
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 17, 2013. A database which contains information on the heparin-binding proteins of E. coli K-12 MG1655 cells. Heparin affinity columns were applied to enrich and fractionate proteins. Identification of proteins was done in collaboration with David Russell's lab. Because heparin is a negatively charged sulfated glycosaminoglycan, polyanion-binding proteins, which include nucleic acid-binding proteins, are expected to bind to heparin columns. Study of the expression pattern of heparin-binding proteins will help to study the nucleic acid-binding proteins, most of which are related to regulation. Moreover, heparin affinity columns will also enrich low abundance proteins. The Heparome database is constructed using MySQL. The website interface is built using HTML and PHP. Queries between the MySQL database and the website interface are executed using PHP. Besides information on the identified proteins, such as Swiss accession number, gene name, molecular weight, isoelectric point, codon adaptation index (CAI), functional classification, etc., it also includes information on the experiments, such as sample preparation, heparin-HPLC chromatography, SDS-PAGE gel separation and MALDI-MS.

  12. Data from: KGCW 2023 Challenge @ ESWC 2023

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated May 17, 2023
    + more versions
    Cite
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7689309
    Explore at:
    Dataset updated
    May 17, 2023
    Dataset provided by
    IDLab - Ghent University - imec
    KU Leuven
    STI Innsbruck
    Universidad Politécnica de Madrid
    Authors
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge Graph Construction Workshop 2023: challenge

    Knowledge graph construction from heterogeneous data has seen a lot of uptake in the last decade, ranging from compliance to performance optimizations with respect to execution time. Besides execution time, however, other metrics for comparing knowledge graph construction systems, e.g. CPU or memory usage, are not considered. This challenge aims to benchmark systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU usage, memory usage, or a combination of these.

    Task description

    The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of existing tools and the baseline results provided by this challenge. The challenge is not limited to execution time, which yields the fastest pipeline; it also covers computing resources, which yield the most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

    Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps for each parameter:

    The provided CSV files and SQL schema are loaded into a MySQL relational database.

    Mappings are executed by accessing the MySQL relational database to construct a knowledge graph as RDF in N-Triples format.

    The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

    The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

    The pipeline is executed 5 times, and the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. The query timeout is set to 1 hour and the knowledge graph construction timeout to 24 hours. The execution is performed with the tool provided by the challenge; you can adapt the execution plans for this example pipeline to your own needs.
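
    The median selection described above can be sketched as follows. This is a minimal illustration and not the challenge tool's code; the function name `median_run` and the sample timings are hypothetical. With an odd number of runs, the median is always one of the measured values, so a concrete run can be reported together with its other metrics.

```python
from statistics import median


def median_run(execution_times):
    """Return (index, time) of the run whose execution time is the median."""
    if len(execution_times) % 2 == 0:
        raise ValueError("expected an odd number of runs")
    target = median(execution_times)
    # For an odd number of values the median is one of the inputs,
    # so it can be mapped back to a concrete run.
    return execution_times.index(target), target


# Hypothetical execution times (in seconds) of one step over 5 runs.
times = [12.4, 11.9, 13.1, 12.0, 12.7]
run_index, run_time = median_run(times)
print(run_index, run_time)
```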

    Each parameter has its own directory in the ground truth dataset with the following files:

    Input dataset as CSV.

    Mapping file as RML.

    Queries as SPARQL.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    SPARQL queries to retrieve the results for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV. Depending on the parameter being evaluated, the number of rows and columns may differ. The first row is always the CSV header.

    GTFS-Madrid-Bench

    The dataset consists of:

    Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

    Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.
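
    The three metrics above can be derived from periodic measurements taken while a step runs. The sketch below assumes such samples are already available; the `Sample` type, the field names, and the numbers are hypothetical and are not taken from the challenge tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement taken while a step is running."""
    wall_time: float  # seconds since a fixed reference point
    cpu_time: float   # cumulative CPU seconds
    memory: float     # memory in use, e.g. in MB


def step_metrics(samples):
    """Compute execution time, CPU time and memory range for one step.

    `samples` must be ordered in time; the first and last samples mark
    the begin and end of the step.
    """
    first, last = samples[0], samples[-1]
    return {
        "execution_time": last.wall_time - first.wall_time,
        "cpu_time": last.cpu_time - first.cpu_time,
        "memory_min": min(s.memory for s in samples),
        "memory_max": max(s.memory for s in samples),
    }


# Hypothetical samples for a single step.
samples = [
    Sample(wall_time=100.0, cpu_time=40.0, memory=512.0),
    Sample(wall_time=101.0, cpu_time=40.5, memory=768.0),
    Sample(wall_time=102.5, cpu_time=41.5, memory=640.0),
]
print(step_metrics(samples))
```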

    Expected output

    Duplicate values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500020 triples
    50 percent | 1000020 triples
    75 percent | 500020 triples
    100 percent | 20 triples
    

    Empty values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500000 triples
    50 percent | 1000000 triples
    75 percent | 500000 triples
    100 percent | 0 triples
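
    The counts in the Duplicate values and Empty values tables above follow a simple pattern, sketched below. The formulas are one reading that reproduces the published numbers, not something the challenge description states: empty values are assumed to produce no triples, and all duplicated values are assumed to collapse into a single set of 20 triples.

```python
BASE_TRIPLES = 2_000_000     # count reported at 0 percent in both tables
TRIPLES_PER_DUPLICATE = 20   # assumption: duplicates collapse to 20 triples


def expected_empty(percent):
    """Triples left when `percent` of the values are empty."""
    return BASE_TRIPLES * (100 - percent) // 100


def expected_duplicates(percent):
    """Triples left when `percent` of the values are duplicates."""
    unique = BASE_TRIPLES * (100 - percent) // 100
    return unique + (TRIPLES_PER_DUPLICATE if percent > 0 else 0)


for p in (0, 25, 50, 75, 100):
    print(p, expected_duplicates(p), expected_empty(p))
```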
    

    Mappings

    Scale | Number of Triples
    1TM + 15POM | 1500000 triples
    3TM + 5POM | 1500000 triples
    5TM + 3POM | 1500000 triples
    15TM + 1POM | 1500000 triples
    

    Properties

    Scale | Number of Triples
    1M rows 1 column | 1000000 triples
    1M rows 10 columns | 10000000 triples
    1M rows 20 columns | 20000000 triples
    1M rows 30 columns | 30000000 triples
    

    Records

    Scale | Number of Triples
    10K rows 20 columns | 200000 triples
    100K rows 20 columns | 2000000 triples
    1M rows 20 columns | 20000000 triples
    10M rows 20 columns | 200000000 triples
    

    Joins

    1-1 joins

    Scale | Number of Triples
    0 percent | 0 triples
    25 percent | 125000 triples
    50 percent | 250000 triples
    75 percent | 375000 triples
    100 percent | 500000 triples
    

    1-N joins

    Scale | Number of Triples
    1-10 0 percent | 0 triples
    1-10 25 percent | 125000 triples
    1-10 50 percent | 250000 triples
    1-10 75 percent | 375000 triples
    1-10 100 percent | 500000 triples
    1-5 50 percent | 250000 triples
    1-10 50 percent | 250000 triples
    1-15 50 percent | 250005 triples
    1-20 50 percent | 250000 triples
    

    N-1 joins

    Scale | Number of Triples
    10-1 0 percent | 0 triples
    10-1 25 percent | 125000 triples
    10-1 50 percent | 250000 triples
    10-1 75 percent | 375000 triples
    10-1 100 percent | 500000 triples
    5-1 50 percent | 250000 triples
    10-1 50 percent | 250000 triples
    15-1 50 percent | 250005 triples
    20-1 50 percent | 250000 triples
    

    N-M joins

    Scale | Number of Triples
    5-5 50 percent | 1374085 triples
    10-5 50 percent | 1375185 triples
    5-10 50 percent | 1375290 triples
    5-5 25 percent | 718785 triples
    5-5 50 percent | 1374085 triples
    5-5 75 percent | 1968100 triples
    5-5 100 percent | 2500000 triples
    5-10 25 percent | 719310 triples
    5-10 50 percent | 1375290 triples
    5-10 75 percent | 1967660 triples
    5-10 100 percent | 2500000 triples
    10-5 25 percent | 719370 triples
    10-5 50 percent | 1375185 triples
    10-5 75 percent | 1968235 triples
    10-5 100 percent | 2500000 triples
    

    GTFS Madrid Bench

    Generated Knowledge Graph

    Scale | Number of Triples
    1 | 395953 triples
    10 | 3959530 triples
    100 | 39595300 triples
    1000 | 395953000 triples
    

    Queries

    Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000
    Q1 | 58540 results | 585400 results | No results available | No results available
    Q2 | 636 results | 11998 results | 125565 results | 1261368 results
    Q3 | 421 results | 4207 results | 42067 results | 420667 results
    Q4 | 13 results | 130 results | 1300 results | 13000 results
    Q5 | 35 results | 350 results | 3500 results | 35000
    
  13. Usage Statistics for University of Tasmania EPrints Repository

    • researchdata.edu.au
    Updated Apr 27, 2017
    Cite
    Sale, Arthur (2017). Usage Statistics for University of Tasmania EPrints Repository [Dataset]. https://researchdata.edu.au/usage-statistics-university-eprints-repository/927350
    Dataset updated
    Apr 27, 2017
    Dataset provided by
    University of Tasmania, Australia
    Authors
    Sale, Arthur
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The dataset is an active collection of access data for information items in the University of Tasmania’s EPrints repository. Each night a scheduled task runs and resumes reading the Apache access logs from where it left off the previous night. Each download of an open access full-text item generates a record in the MySQL database, together with a timestamp and the approximate location of the computer system that made the download. The location is obtained by looking up the IP address in the GeoIP database, with one significant difference: downloads originating from a University of Tasmania IP address are identified separately and removed from the ‘Australia’ category. This prevents vanity searches from gaining undue significance. Countries are coded using the ISO 3166 two-letter code.
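
    The per-download processing described above might look roughly like the sketch below. It is written in Python for illustration only (the original software is in PHP and Perl) and leaves out the nightly resume logic and the MySQL insert. The log format, the address range standing in for the university network, the `UTAS` label, and the small lookup table standing in for GeoIP are all assumptions.

```python
import ipaddress
import re
from datetime import datetime

# Assumption: the access log uses the Apache Common/Combined Log Format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

# Hypothetical values: a documentation-only network stands in for the
# university's address range, and a tiny dict stands in for GeoIP.
UNIVERSITY_NET = ipaddress.ip_network("192.0.2.0/24")
FAKE_GEOIP = {"198.51.100.7": "AU", "203.0.113.9": "DE"}


def classify(ip):
    """Return a two-letter country code, or 'UTAS' for university addresses."""
    if ipaddress.ip_address(ip) in UNIVERSITY_NET:
        return "UTAS"  # kept apart from the 'AU' category
    return FAKE_GEOIP.get(ip, "??")


def parse_download(line):
    """Turn one access-log line into a (timestamp, path, origin) record."""
    match = LOG_LINE.match(line)
    if match is None or match["status"] != "200":
        return None  # not a successful download
    stamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return stamp, match["path"], classify(match["ip"])


line = '192.0.2.15 - - [27/Apr/2017:02:10:00 +1000] "GET /123/1/thesis.pdf HTTP/1.1" 200 5120'
print(parse_download(line))
```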

    The dataset has been used to analyse the usage made of the repository and to tune it to achieve maximal visibility for the University of Tasmania. Researchers with items in the repository have used it to identify the types of use being made of their work, and to find potential collaborators. The citation of a work in a journal or conference article, for example, causes a typical step in usage, and the citing article can be searched in Google or Google Scholar to identify the authors. This enhances the dissemination experience and its value.

    The software was written at the University of Tasmania by Professor Arthur Sale (in PHP), based on earlier work by the University of Melbourne (with permission). Mr Christian McGee wrote some critical sections of the code in Perl and set up the cron scheduling.

    The dataset is generated by a computer program written by Professor Arthur Sale. The software was a test bed for ideas, and subsequently resulted in an official software set included in the EPrints distribution. This set expanded on the concepts significantly.

  14. CHINOOK Music

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    willian oliveira (2024). CHINOOK Music [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/chinook-music
    Available download formats: zip (9603 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    willian oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Chinook Database is a sample database designed for use with multiple database platforms, such as SQL Server, Oracle, MySQL, and others. It can be easily set up by running a single SQL script, making it a convenient alternative to the popular Northwind database. Chinook is widely used in demos and testing environments, particularly for Object-Relational Mapping (ORM) tools that target both single and multiple database servers.

    Supported Database Servers

    Chinook supports several database servers, including:

    • DB2
    • MySQL
    • Oracle
    • PostgreSQL
    • SQL Server
    • SQL Server Compact
    • SQLite

    Download Instructions

    You can download the SQL scripts for each supported database server from the latest release assets. The appropriate SQL script file(s) for your database vendor are provided, which can be executed using your preferred database management tool.

    Data Model

    The Chinook Database represents a digital media store, containing tables that include:

    • Artists
    • Albums
    • Media tracks
    • Invoices
    • Customers

    Sample Data

    The media data in Chinook is derived from a real iTunes Library, providing a realistic dataset for users. Additionally, users can generate their own SQL scripts using their personal iTunes Library by following specific instructions. Customer and employee details in the database were manually crafted with fictitious names, addresses (mappable via Google Maps), and well-structured contact information such as phone numbers, faxes, and emails. Sales data is auto-generated and spans a four-year period, using random values.

    Why is it Called Chinook?

    The Chinook Database's name is a nod to its predecessor, the Northwind database. Chinooks are warm, dry winds found in the interior regions of North America, particularly over southern Alberta in Canada, where the Canadian Prairies meet mountain ranges. This natural phenomenon inspired the choice of name, reflecting the idea that Chinook serves as a refreshing alternative to the Northwind database.
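
    To give a feel for the data model described in this entry, the sketch below builds a two-table miniature in an in-memory SQLite database and counts albums per artist. The table and column names follow the commonly distributed Chinook schema but should be treated as assumptions here, and the sample rows are invented for illustration; the real database is created by running the provided SQL script.

```python
import sqlite3

# Miniature stand-in for two Chinook tables; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Album (
        AlbumId INTEGER PRIMARY KEY,
        Title TEXT,
        ArtistId INTEGER REFERENCES Artist(ArtistId)
    );
    INSERT INTO Artist VALUES (1, 'Example Artist');
    INSERT INTO Album VALUES (1, 'First Album', 1), (2, 'Second Album', 1);
    """
)

# Count the albums of each artist with a join across the two tables.
rows = conn.execute(
    """
    SELECT Artist.Name, COUNT(Album.AlbumId)
    FROM Artist JOIN Album ON Album.ArtistId = Artist.ArtistId
    GROUP BY Artist.ArtistId
    """
).fetchall()
print(rows)
```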

  15. KGCW 2024 Challenge @ ESWC 2024

    • data.niaid.nih.gov
    • investigacion.usc.gal
    • +3 more
    Updated Jun 11, 2024
    Cite
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10721874
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    KU Leuven
    STI Innsbruck
    Universidad Politécnica de Madrid
    IDLab
    Universidade de Santiago de Compostela
    Authors
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge Graph Construction Workshop 2024: challenge

    Knowledge graph construction from heterogeneous data has seen a lot of uptake in the last decade, ranging from compliance to performance optimizations with respect to execution time. Besides execution time, however, other metrics for comparing knowledge graph construction systems, e.g. CPU or memory usage, are not considered. This challenge aims to benchmark systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU usage, memory usage, or a combination of these.

    Task description

    The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of existing tools and the baseline results provided by this challenge. The challenge is not limited to execution time, which yields the fastest pipeline; it also covers computing resources, which yield the most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

    Track 1: Conformance

    The new specifications for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provide a set of test-cases for each module:

    RML-Core

    RML-IO

    RML-CC

    RML-FNML

    RML-Star

    These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.

    Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

    Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

    Track 2: Performance

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

    Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps for each parameter:

    The provided CSV files and SQL schema are loaded into a MySQL relational database.

    Mappings are executed by accessing the MySQL relational database to construct a knowledge graph as RDF in N-Triples format.
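
    The idea behind these two steps, turning tabular records into N-Triples, can be sketched as follows. This is a deliberately simplified stand-in: it reads CSV text directly instead of querying MySQL, and it hard-codes one triple per non-empty cell instead of executing an RML mapping. The base IRI, the function names, and the sample data are hypothetical, and record identifiers are assumed to be safe to use in an IRI.

```python
import csv
import io

BASE = "http://example.com/"  # hypothetical base IRI


def escape(value):
    """Escape a string for use as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text, id_column):
    """Yield one triple per non-empty cell, keyed by `id_column`."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}record/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # empty cells produce no triple
            yield f'{subject} <{BASE}{column}> "{escape(value)}" .'


data = "id,name,city\n1,Alice,Madrid\n2,Bob,\n"
for triple in rows_to_ntriples(data, "id"):
    print(triple)
```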

    The pipeline is executed 5 times, and the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. The knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool. You can adapt the execution plans for this example pipeline to your own needs.

    Each parameter has its own directory in the ground truth dataset with the following files:

    Input dataset as CSV.

    Mapping file as RML.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV. Depending on the parameter being evaluated, the number of rows and columns may differ. The first row is always the CSV header.

    GTFS-Madrid-Bench

    The dataset consists of:

    Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

    Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

    Expected output

    Duplicate values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500020 triples
    50 percent | 1000020 triples
    75 percent | 500020 triples
    100 percent | 20 triples

    Empty values

    Scale | Number of Triples
    0 percent | 2000000 triples
    25 percent | 1500000 triples
    50 percent | 1000000 triples
    75 percent | 500000 triples
    100 percent | 0 triples

    Mappings

    Scale | Number of Triples
    1TM + 15POM | 1500000 triples
    3TM + 5POM | 1500000 triples
    5TM + 3POM | 1500000 triples
    15TM + 1POM | 1500000 triples

    Properties

    Scale | Number of Triples
    1M rows 1 column | 1000000 triples
    1M rows 10 columns | 10000000 triples
    1M rows 20 columns | 20000000 triples
    1M rows 30 columns | 30000000 triples

    Records

    Scale | Number of Triples
    10K rows 20 columns | 200000 triples
    100K rows 20 columns | 2000000 triples
    1M rows 20 columns | 20000000 triples
    10M rows 20 columns | 200000000 triples
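
    The Properties and Records counts above are consistent with one triple per input cell. The check below encodes that reading; the rule is an assumption inferred from the numbers, not something the challenge description states explicitly.

```python
# (rows, columns) -> triples, copied from the Properties and Records tables.
EXPECTED = {
    (1_000_000, 1): 1_000_000,
    (1_000_000, 10): 10_000_000,
    (1_000_000, 20): 20_000_000,
    (1_000_000, 30): 30_000_000,
    (10_000, 20): 200_000,
    (100_000, 20): 2_000_000,
    (10_000_000, 20): 200_000_000,
}


def expected_triples(rows, columns):
    """Assumed rule: every cell of the input yields exactly one triple."""
    return rows * columns


# Collect any table entry that the assumed rule does not reproduce.
mismatches = [key for key, value in EXPECTED.items() if expected_triples(*key) != value]
print(mismatches)
```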

    Joins

    1-1 joins

    Scale | Number of Triples
    0 percent | 0 triples
    25 percent | 125000 triples
    50 percent | 250000 triples
    75 percent | 375000 triples
    100 percent | 500000 triples

    1-N joins

    Scale | Number of Triples
    1-10 0 percent | 0 triples
    1-10 25 percent | 125000 triples
    1-10 50 percent | 250000 triples
    1-10 75 percent | 375000 triples
    1-10 100 percent | 500000 triples
    1-5 50 percent | 250000 triples
    1-10 50 percent | 250000 triples
    1-15 50 percent | 250005 triples
    1-20 50 percent | 250000 triples

    N-1 joins

    Scale | Number of Triples
    10-1 0 percent | 0 triples
    10-1 25 percent | 125000 triples
    10-1 50 percent | 250000 triples
    10-1 75 percent | 375000 triples
    10-1 100 percent | 500000 triples
    5-1 50 percent | 250000 triples
    10-1 50 percent | 250000 triples
    15-1 50 percent | 250005 triples
    20-1 50 percent | 250000 triples

    N-M joins

    Scale | Number of Triples
    5-5 50 percent | 1374085 triples
    10-5 50 percent | 1375185 triples
    5-10 50 percent | 1375290 triples
    5-5 25 percent | 718785 triples
    5-5 50 percent | 1374085 triples
    5-5 75 percent | 1968100 triples
    5-5 100 percent | 2500000 triples
    5-10 25 percent | 719310 triples
    5-10 50 percent | 1375290 triples
    5-10 75 percent | 1967660 triples
    5-10 100 percent | 2500000 triples
    10-5 25 percent | 719370 triples
    10-5 50 percent | 1375185 triples
    10-5 75 percent | 1968235 triples
    10-5 100 percent | 2500000 triples

    GTFS Madrid Bench

    Generated Knowledge Graph

    Scale | Number of Triples
    1 | 395953 triples
    10 | 3959530 triples
    100 | 39595300 triples
    1000 | 395953000 triples

    Queries

    Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000
    Q1 | 58540 results | 585400 results | No results available | No results available
    Q2 | 636 results | 11998 results | 125565 results | 1261368 results
    Q3 | 421 results | 4207 results | 42067 results | 420667 results
    Q4 | 13 results | 130 results | 1300 results | 13000 results
    Q5 | 35 results | 350 results | 3500 results | 35000 results
    Q6 | 1 result | 1 result | 1 result | 1 result
    Q7 | 68 results | 67 results | 67 results | 53 results
    Q8 | 35460 results | 354600 results | No results available | No results available
    Q9 | 130 results | 1300

  16. daily - sales - expenses database&tables

    • kaggle.com
    zip
    Updated Nov 11, 2025
    Cite
    OIE (2025). daily - sales - expenses database&tables [Dataset]. https://www.kaggle.com/datasets/emmyofh/daily-sales-expenses-database-and-tables
    Available download formats: zip (666 bytes)
    Dataset updated
    Nov 11, 2025
    Authors
    OIE
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This file contains the database and table creation scripts used in the Daily Sales & Expenses System project. It defines the structure of the MySQL database where all sales, expenses, inventory, and login records are stored before being accessed, analyzed, and visualized using Python.

    The file includes:

    - Commands to create the database
    - SQL statements to create the users, sales, expenses and inventory tables
    - (Optionally) sample INSERT statements to populate the tables with test data

    This file is essential for anyone who wants to replicate or test the Python-MySQL integration in the project.
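
    As a rough idea of what such a setup and a first analysis step could look like, the sketch below creates the four tables and computes sales minus expenses. Only the table names come from the description; every column name, the sample rows, and the use of an in-memory SQLite database in place of MySQL are assumptions made to keep the example self-contained.

```python
import sqlite3

# Only the table names are taken from the description; all columns are
# hypothetical placeholders.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE, password_hash TEXT);
CREATE TABLE sales (id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL);
CREATE TABLE expenses (id INTEGER PRIMARY KEY, expense_date TEXT, amount REAL);
CREATE TABLE inventory (id INTEGER PRIMARY KEY, item TEXT, quantity INTEGER);
"""

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript(SCHEMA)

# Invented sample rows, in the spirit of the optional INSERT statements.
conn.executemany(
    "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
    [("2025-11-10", 120.0), ("2025-11-10", 80.0)],
)
conn.execute(
    "INSERT INTO expenses (expense_date, amount) VALUES (?, ?)",
    ("2025-11-10", 50.0),
)

total_sales = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
total_expenses = conn.execute("SELECT SUM(amount) FROM expenses").fetchone()[0]
print(total_sales - total_expenses)
```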

  17. Data from: MWSTAT: A MODULATED WEB-BASED STATISTICAL SYSTEM

    • scielo.figshare.com
    jpeg
    Updated Jun 1, 2023
    Cite
    Francisco Louzada; Anderson Ara (2023). MWSTAT: A MODULATED WEB-BASED STATISTICAL SYSTEM [Dataset]. http://doi.org/10.6084/m9.figshare.6967682.v1
    Available download formats: jpeg
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    SciELO journals
    Authors
    Francisco Louzada; Anderson Ara
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT In this paper we present the development of a modulated web-based statistical system, hereafter MWStat, which shifts the statistical paradigm of analyzing data into a real-time structure. The MWStat system is useful for both online data storage and questionnaire analysis, as well as for making results available in real time from analyses based on several statistical methodologies in a customizable fashion. Overall, it can be seen as a useful technical solution that can be applied to a large range of statistical applications that need a scheme for returning real-time results, accessible to anyone with internet access. We present here the step-by-step instructions for implementing the system. The structure is accessible, built with an easily interpretable language, and it can be strategically applied to online statistical applications. We rely on the combination of several free tools, namely PHP, R, a MySQL database and an Apache HTTP server, and on the use of software tools such as phpMyAdmin. We present three didactic examples of the MWStat system on institutional evaluation, statistical quality control and multivariate analysis. The methodology is also illustrated with a real example on institutional evaluation. An MWStat module was built specifically to provide a real-time poll for teacher evaluation at the Federal University of São Carlos (Brazil).

  18. Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus...

    • search.gesis.org
    Cite
    Brentel, Inga; Kampes, Céline Fabienne; Jandura, Olaf, Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus (2014-2016) [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2030
    Explore at:
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    Brentel, Inga; Kampes, Céline Fabienne; Jandura, Olaf
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    The prepared longitudinal dataset for 2014 to 2016 is "big data", which is why the full dataset will only be available in the form of a database (MySQL). In this database, the values of a respondent's different variables are stored in a single column, one below the other. The present publication comprises an SQL database with the metadata of a sample of the full dataset, which represents a subset of the variables available in the full dataset and is intended to show the structure of the prepared data, together with data documentation (codebook) for the sample. For this purpose, the sample contains all variables on sociodemographics, leisure behaviour, additional information on a respondent and their household, as well as the interview-specific variables and weights. Only the variables on the respondent's media use are a small selection: for online media use, the variables of all overall offerings and of the individual offerings in the genres politics and digital were included. Media use of radio, print and TV was not included in the sample, because its structure can be traced from the published longitudinal data of the media analysis MA Radio, MA Pressemedien and MA Intermedia.
    Because of the size of the data, the database with the actual survey data would already be in the critical range of file size for normal upload and download. The actual survey results needed for analysis will be published in 2021 as the full dataset Media-Analyse-Daten: IntermediaPlus (2014-2016) in the DBK at GESIS.

    The data and their preparation are proposed as a best-practice case for big data management, that is, for handling big data in the social sciences and with social science data. The harmonization work is documented and made transparent using the GESIS software CharmStats, which was extended with big data features as part of this project. A Python script and an HTML template were also used to further automate the workflow around and with CharmStats.

    The prepared longitudinal full dataset of the MA IntermediaPlus for 2014 to 2016 will be published in 2021 in cooperation with GESIS and made available in accordance with the FAIR principles (Wilkinson et al. 2016). The aim is to make the Media-Analyse data source accessible for research on social and media change in the Federal Republic of Germany by harmonizing the individual cross-sections, work carried out by Inga Brentel and Céline Fabienne Kampes as part of the dissertation project „Angebots- und Publikumsfragmentierung online“ ("Supply and audience fragmentation online").

    Future study number of the full IntermediaPlus dataset in the DBK of GESIS: ZA5769 (version 1-0-0), with the DOI: https://dx.doi.org/10.4232/1.13530
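
The long layout described above, in which a respondent's variables are stored one below the other, usually has to be pivoted before analysis. The following is only a minimal sketch of that idea: it uses an in-memory SQLite database as a stand-in for the MySQL database, and the table name `answers`, its columns and the sample values are hypothetical, not taken from the GESIS data.

```python
import sqlite3

# Hypothetical long-format table: one row per (respondent, variable).
# Table name, column names and values are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (respondent_id INTEGER, variable TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?)",
    [
        (1, "age", "34"),
        (1, "gender", "f"),
        (2, "age", "51"),
        (2, "gender", "m"),
    ],
)

# Conditional aggregation turns the stacked variables into one row per respondent.
wide_rows = conn.execute(
    """
    SELECT respondent_id,
           MAX(CASE WHEN variable = 'age' THEN value END) AS age,
           MAX(CASE WHEN variable = 'gender' THEN value END) AS gender
    FROM answers
    GROUP BY respondent_id
    ORDER BY respondent_id
    """
).fetchall()

for row in wide_rows:
    print(row)
```

Conditional aggregation is used here because it works in both SQLite and MySQL; with many variables, the pivot would more likely be done in the analysis tool itself.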

  19. SQL Analytics Case Study (Employees Database)

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    Priyank Barbhaya (2025). SQL Analytics Case Study (Employees Database) [Dataset]. https://www.kaggle.com/datasets/priyankbarbhaya/sql-analytics-case-study-employees-database
    Explore at:
    Available download formats: zip (7449546 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    Priyank Barbhaya
    Description

    This dataset contains the complete MySQL Employees Database, a widely used sample dataset for learning SQL, data analysis, business intelligence, and database design. It includes employee information, salaries, job titles, departments, managers, and department history, making it ideal for real-world analytical practice.

    The dataset is structured into multiple tables that represent a real corporate environment with employee records spanning several decades. Users can practice SQL joins, window functions, aggregation, CTEs, subqueries, business KPIs, HR analytics, trend analysis, and more.
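
As a rough illustration of the kind of practice query mentioned above (a CTE combined with a join and aggregation), here is a small sketch. The listing does not give the schema, so the `employees` and `salaries` tables, their columns and all rows below are assumptions loosely modeled on the usual Employees sample layout, and SQLite stands in for MySQL.

```python
import sqlite3

# In-memory stand-in; table layout and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE salaries (emp_no INTEGER, salary INTEGER, from_date TEXT, to_date TEXT);
    INSERT INTO employees VALUES (1, 'Ada', 'Example'), (2, 'Bo', 'Sample');
    INSERT INTO salaries VALUES
        (1, 60000, '2000-01-01', '2001-01-01'),
        (1, 65000, '2001-01-01', '9999-01-01'),
        (2, 50000, '2000-06-01', '9999-01-01');
    """
)

# The CTE aggregates salary history per employee; the join adds the names.
query = """
WITH salary_stats AS (
    SELECT emp_no, COUNT(*) AS n_records, MAX(salary) AS top_salary
    FROM salaries
    GROUP BY emp_no
)
SELECT e.first_name, e.last_name, s.n_records, s.top_salary
FROM employees AS e
JOIN salary_stats AS s ON s.emp_no = e.emp_no
ORDER BY s.top_salary DESC
"""
salary_summary = conn.execute(query).fetchall()

for row in salary_summary:
    print(row)
```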

  20. Chinook Database

    • kaggle.com
    zip
    Updated Nov 7, 2023
    Cite
    Rana Sabry (2023). Chinook Database [Dataset]. https://www.kaggle.com/datasets/ranasabrii/chinook/data
    Explore at:
    Available download formats: zip (448874 bytes)
    Dataset updated
    Nov 7, 2023
    Authors
    Rana Sabry
    Description

    The Chinook database was created as an alternative to the Northwind database. It represents a digital media store, including tables for artists, albums, media tracks, invoices and customers.

    The Chinook database is available on GitHub. It’s available for various DBMSs including MySQL, SQL Server, SQL Server Compact, PostgreSQL, Oracle, DB2, and of course, SQLite.

Cite
Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/datasets/atanaskanev/sqlite-sakila-sample-database/code

SQLite Sakila Sample Database

SQLite Port of the Original MySQL Sakila Sample Database

Explore at:
Available download formats: zip (4495190 bytes)
Dataset updated
Mar 14, 2021
Authors
Atanas Kanev
Description

SQLite Sakila Sample Database

Database Description

The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

  • Oracle
  • SQL Server
  • SQLite
  • Interbase/Firebird
  • Microsoft Access

Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators decide which database to use for the development of new products. The user can run the same SQL against different kinds of databases and compare the performance.
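
One way to compare performance across systems is to time the same statement through each database's Python DB-API 2.0 driver. The sketch below is an assumption about how that could be done, not part of the project: `time_query` is a hypothetical helper, and a tiny in-memory `film` table stands in for the real data.

```python
import sqlite3
import time


def time_query(conn, sql, repeats=5):
    """Return (best_seconds, row_count) for a query on a DB-API 2.0 connection."""
    best = float("inf")
    row_count = 0
    for _ in range(repeats):
        cur = conn.cursor()
        start = time.perf_counter()
        cur.execute(sql)
        rows = cur.fetchall()  # include fetch time, not just execution
        elapsed = time.perf_counter() - start
        cur.close()
        best = min(best, elapsed)
        row_count = len(rows)
    return best, row_count


# Stand-in data; with the real file one would open sqlite-sakila.db instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO film (title) VALUES (?)",
    [("Film A",), ("Film B",), ("Film C",)],
)

best_seconds, n_rows = time_query(conn, "SELECT film_id, title FROM film")
print(f"{n_rows} rows, best of 5 runs: {best_seconds:.6f} s")
```

Because the helper only uses `cursor()`, `execute()` and `fetchall()`, the same function should accept a connection from another DB-API driver, although timings on toy data like this say little about real workloads.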

License: BSD Copyright DB Software Laboratory http://www.etl-tools.com

Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

Files Description

The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

sqlite> .open sqlite-sakila.db # creates the .db file
sqlite> .read sqlite-sakila-schema.sql # creates the database schema
sqlite> .read sqlite-sakila-insert-data.sql # inserts the data

Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
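
A quick sanity check after loading the database is to count the rows in each table, which should also show the empty film_text table noted above. This is a minimal sketch under stated assumptions: `table_row_counts` is a hypothetical helper, and the in-memory database with placeholder rows stands in for sqlite-sakila.db so the example runs on its own.

```python
import sqlite3


def table_row_counts(conn):
    """Map each user table name to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# In-memory stand-in with placeholder rows; for the real data one would use
# sqlite3.connect("sqlite-sakila.db") instead.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, description TEXT);
    CREATE TABLE film_text (film_id INTEGER PRIMARY KEY, title TEXT, description TEXT);
    INSERT INTO film (title, description) VALUES
        ('Film A', 'First placeholder row'),
        ('Film B', 'Second placeholder row');
    """
)

counts = table_row_counts(conn)
print(counts)
```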
