29 datasets found

SQLite Sakila Sample Database
kaggle.com
zip
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/datasets/atanaskanev/sqlite-sakila-sample-database/code
Explore at:
zip(4495190 bytes)Available download formats
Dataset updated
Mar 14, 2021
Authors
Atanas Kanev
Description
SQLite Sakila Sample Database

Database Description

The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

Oracle

SQL Server

SQLIte

Interbase/Firebird

Microsoft Access

Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators to decide which database to use for development of new products The user can run the same SQL against different kind of databases and compare the performance

License: BSD Copyright DB Software Laboratory http://www.etl-tools.com

Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

Files Description

The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

sqlite> .open sqlite-sakila.db # creates the .db file sqlite> .read sqlite-sakila-schema.sql # creates the database schema sqlite> .read sqlite-sakila-insert-data.sql # inserts the data

Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
classicmodels
kaggle.com
zip
Updated Apr 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ambreen (2024). classicmodels [Dataset]. https://www.kaggle.com/datasets/ambreenabdulraheem/classicmodels
Explore at:
zip(879935 bytes)Available download formats
Dataset updated
Apr 22, 2024
Authors
Ambreen
Description
MySQL Sample Database Schema. The MySQL sample database schema consists of the following tables:

customers: stores customer’s data.

products: stores a list of scale model cars.

productlines: stores a list of product lines.

orders: stores sales orders placed by customers.

orderdetails: stores sales order line items for every sales order.

payments: stores payments made by customers based on their accounts.

employees: stores employee information and the organization structure such as who reports to whom.

offices: stores sales office data.
Bike Store Relational Database | SQL
kaggle.com
zip
Updated Aug 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dillon Myrick (2023). Bike Store Relational Database | SQL [Dataset]. https://www.kaggle.com/datasets/dillonmyrick/bike-store-sample-database
Explore at:
zip(94412 bytes)Available download formats
Dataset updated
Aug 21, 2023
Authors
Dillon Myrick
Description
This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.

Database Diagram:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media" alt="">

Terms of Use

The sample database is copyrighted and cannot be used for commercial purposes. For example, it cannot be used for the following but is not limited to the purposes: - Selling - Including in paid courses
Z
FooDrugs database: A database with molecular and text information about food...
data.niaid.nih.gov
zenodo.org
Updated Jul 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Garranzo, Marco; Piette Gómez, Óscar; Lacruz Pleguezuelos, Blanca; Pérez, David; Laguna Lobo, Teresa; Carrillo de Santa Pau, Enrique (2023). FooDrugs database: A database with molecular and text information about food - drug interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6638469
Explore at:
Dataset updated
Jul 28, 2023
Dataset provided by
IMDEA Food Institute
Authors
Garranzo, Marco; Piette Gómez, Óscar; Lacruz Pleguezuelos, Blanca; Pérez, David; Laguna Lobo, Teresa; Carrillo de Santa Pau, Enrique
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
FooDrugs database is a development done by the Computational Biology Group at IMDEA Food Institute (Madrid, Spain), in the context of the Food Nutrition Security Cloud (FNS-Cloud) project. Food Nutrition Security Cloud (FNS-Cloud) has received funding from the European Union's Horizon 2020 Research and Innovation programme (H2020-EU.3.2.2.3. – A sustainable and competitive agri-food industry) under Grant Agreement No. 863059 – www.fns-cloud.eu (See more details about FNS-Cloud below)

FooDrugs stores information extracted from transcriptomics and text documents for foo-drug interactiosn and it is part of a demonstrator to be done in the FNS-Cloud project. The database was built using MySQL, an open source relational database management system. FooDrugs host information for a total of 161 transcriptomics GEO series with 585 conditions for food or bioactive compounds. Each condition is defined as a food/biocomponent per time point, per concentration, per cell line, primary culture or biopsy per study. FooDrugs includes information about a bipartite network with 510 nodes and their similarity scores (tau score; https://clue.io/connectopedia/connectivity_scores) related with possible drug interactions with drugs assayed in conectivity map (https://www.broadinstitute.org/connectivity-map-cmap). The information is stored in eight tables:

Table “study” : This table contains basic information about study identifiers from GEO, pubmed or platform, study type, title and abstract

Table “sample”: This table contains basic information about the different experiments in a study, like the identifier of the sample, treatment, origin type, time point or concentration.

Table “misc_study”: This table contains additional information about different attributes of the study.

Table “misc_sample”: This table contains additional information about different attributes of the sample.

Table “cmap”: This table contains information about 70895 nodes, compromising drugs, foods or bioactives, overexpressed and knockdown genes (see section 3.4). The information includes cell line, compound and perturbation type.

Table “cmap_foodrugs”: This table contains information about the tau score (see section 3.4) that relates food with drugs or genes and the node identifier in the FooDrugs network.

Table “topTable”: This table contains information about 150 over and underexpressed genes from each GEO study condition, used to calculate the tau score (see section 3.4). The information stored is the logarithmic fold change, average expression, t-statistic, p-value, adjusted p-value and if the gene is up or downregulated.

Table “nodes”: This table stores the information about the identification of the sample and the node in the bipartite network connecting the tables “sample”, “cmap_foodrugs” and “topTable”.

In addition, FooDrugs database stores a total of 6422 food/drug interactions from 2849 text documents, obtained from three different sources: 2312 documents from PubMed, 285 from DrugBank, and 252 from drugs.com. These documents describe potential interactions between 1464 food/bioactive compounds and 3009 drugs. The information is stored in two tables:

Table “texts”: This table contains all the documents with its identifiers where interactions have been identified with strategy described in section 4.

Table “TM_interactions”: This table contains information about interaction identifiers, the food and drug entities, and the start and the end positions of the context for the interaction in the document.

FNS-Cloud will overcome fragmentation problems by integrating existing FNS data, which is essential for high-end, pan-European FNS research, addressing FNS, diet, health, and consumer behaviours as well as on sustainable agriculture and the bio-economy. Current fragmented FNS resources not only result in knowledge gaps that inhibit public health and agricultural policy, and the food industry from developing effective solutions, making production sustainable and consumption healthier, but also do not enable exploitation of FNS knowledge for the benefit of European citizens. FNS-Cloud will, through three Demonstrators; Agri-Food, Nutrition & Lifestyle and NCDs & the Microbiome to facilitate: (1) Analyses of regional and country-specific differences in diet including nutrition, (epi)genetics, microbiota, consumer behaviours, culture and lifestyle and their effects on health (obesity, NCDs, ethnic and traditional foods), which are essential for public health and agri-food and health policies; (2) Improved understanding agricultural differences within Europe and what these means in terms of creating a sustainable, resilient food systems for healthy diets; and (3) Clear definitions of boundaries and how these affect the compositions of foods and consumer choices and, ultimately, personal and public health in the future. Long-term sustainability of the FNS-Cloud will be based on Services that have the capacity to link with new resources and enable cross-talk amongst them; access to FNS-Cloud data will be open access, underpinned by FAIR principles (findable, accessible, interoperable and re-useable). FNS-Cloud will work closely with the proposed Food, Nutrition and Health Research Infrastructure (FNHRI) as well as METROFOOD-RI and other existing ESFRI RIs (e.g. ELIXIR, ECRIN) in which several FNS-Cloud Beneficiaries are involved directly. (https://cordis.europa.eu/project/id/863059)

***** changes between version FooDrugs_v2 and FooDrugs_V3 (31st January 2023) are:

Increased the amount of text documents by 85.675 from PubMed and ClinicalTrials.gov, and the amount of Text Mining interactions by 168.826.

Increased the amount of transcriptomic studies by 32 GEO series.

Removed all rows in table cmap_foodrugs representing interactions with values of tau=0

Removed 43 GEO series that after manually checking didn't correspond to food compounds.

Added a new column to the table texts: citation to hold the citation of the text.

Added these columns to the table study: contributor to contain the authors of the study, publication_date to store the date of publication of the study in GEO and pubmed_id to reference the publication associated with the study if any.

Added a new column to topTable to hold the top 150 up-regulated and 150 down-regulated genes.
Z
Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and...
data.niaid.nih.gov
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V. (2024). Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and KDE [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_400614
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Ryerson University
Authors
Sadat, Mefta; Bener, Ayse Basar; Miranskyy, Andriy V.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousands of defect reports over a period of 18 years (1999-2017) to capture the inter-relationships among duplicate defects.

File Descriptions

apache.csv - Apache Defect Rediscovery dataset

eclipse.csv - Eclipse Defect Rediscovery dataset

kde.csv - KDE Defect Rediscovery dataset

apache.relations.csv - Inter-relations of rediscovered defects of Apache

eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse

kde.relations.csv - Inter-relations of rediscovered defects of KDE

create_and_populate_neo4j_objects.cypher - Populates Neo4j graphDB by importing all the data from the CSV files. Note that you have to set dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping

create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files

rediscovery_db_mysql.zip - For your convenience, we also provide full backup of the MySQL database

neo4j_examples.txt - Sample Neo4j queries

mysql_examples.txt - Sample MySQL queries

rediscovery_eclipse_6325.png - Output of Neo4j example #1

distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
Clean Meta Kaggle
kaggle.com
Updated Sep 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Yoni Kremer
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Cleaned Meta-Kaggle Dataset

The Original Dataset - Meta-Kaggle

Explore our public data on competitions, datasets, kernels (code / notebooks) and more Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

https://i.imgur.com/2Egeb8R.png" alt="" title="a title">

This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

August 2023 update

In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

The Problems with the Original Dataset

The original dataset is 32 CSV files, with 268 colums and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.

The data is not normalized, so when you join tables you get a lot of errors.

Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.

There are missing values.

There are duplicate values.

There are values that are not valid. For example, Ids that are not positive integers.

The date and time columns are not in the right format.

Some columns only have the same value for all rows, so they are not useful.

The boolean columns have string values True or False.

Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.

Users upvote their own messages.

The Solution

To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.

The steps to create the database are:

Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.

Downloading the CSV files from Kaggle using the Kaggle API.

Cleaning the data using pandas. I do that by running the clean_data.py script. The script does the following steps for each table:

Drops the columns that are not needed.

Converts each column to the right data type.

Replaces foreign keys that do not exist with NULL.

Replaces some of the missing values with default values.

Removes rows where there are missing values in the primary key/not null columns.

Removes duplicate rows.

Loading the data into the database using the LOAD DATA INFILE command.

Checks that the number of rows in the database tables is the same as the number of rows in the CSV files.

Adds foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.

Update the Total columns in the database tables. I do that by running the update_totals.sql script.

Backup the database.
s
Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics
orda.shef.ac.uk
txt
Updated Oct 22, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matthew Hanchard (2021). Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics [Dataset]. http://doi.org/10.15131/shef.data.16447326.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.15131/shef.data.16447326.v1
Dataset updated
Oct 22, 2021
Dataset provided by
The University of Sheffield
Authors
Matthew Hanchard
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises of two .csv format files used within workstream 2 of the Wellcome Trust funded ‘Orphan drugs: High prices, access to medicines and the transformation of biopharmaceutical innovation’ project (219875/Z/19/Z). They appear in various outputs, e.g. publications and presentations.

The deposited data were gathered using the University of Amsterdam Digital Methods Institute’s ‘Twitter Capture and Analysis Toolset’ (DMI-TCAT) before being processed and extracted from Gephi. DMI-TCAT queries Twitter’s STREAM Application Programming Interface (API) using SQL and retrieves data on a pre-set text query. It then sends the returned data for storage on a MySQL database. The tool allows for output of that data in various formats. This process aligns fully with Twitter’s service user terms and conditions. The query for the deposited dataset gathered a 1% random sample of all public tweets posted between 10-Feb-2021 and 10-Mar-2021 containing the text ‘Rare Diseases’ and/or ‘Rare Disease Day’, storing it on a local MySQL database managed by the University of Sheffield School of Sociological Studies (http://dmi-tcat.shef.ac.uk/analysis/index.php), accessible only via a valid VPN such as FortiClient and through a permitted active directory user profile. The dataset was output from the MySQL database raw as a .gexf format file, suitable for social network analysis (SNA). It was then opened using Gephi (0.9.2) data visualisation software and anonymised/pseudonymised in Gephi as per the ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee on 02-Jun-201 (reference: 039187). The deposited dataset comprises of two anonymised/pseudonymised social network analysis .csv files extracted from Gephi, one containing node data (Issue-networks as excluded publics – Nodes.csv) and another containing edge data (Issue-networks as excluded publics – Edges.csv). Where participants explicitly provided consent, their original username has been provided. Where they have provided consent on the basis that they not be identifiable, their username has been replaced with an appropriate pseudonym. All other usernames have been anonymised with a randomly generated 16-digit key. The level of anonymity for each Twitter user is provided in column C of deposited file ‘Issue-networks as excluded publics – Nodes.csv’.

This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 26-Aug-2021 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman institute/School of Sociological Studies. ORDA has full permission to store this dataset and to make it open access for public re-use without restriction under a CC BY license, in line with the Wellcome Trust commitment to making all research data Open Access.

The University of Sheffield are the designated data controller for this dataset.
c
Data Base Management Systems market size was USD 50.5 billion in 2022 !
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Oct 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cognitive Market Research (2025). Data Base Management Systems market size was USD 50.5 billion in 2022 ! [Dataset]. https://www.cognitivemarketresearch.com/data-base-management-systems-market-report
Explore at:
pdf,excel,csv,pptAvailable download formats
Dataset updated
Oct 29, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
The global Data Base Management Systems market was valued at USD 50.5 billion in 2022 and is projected to reach USD 120.6 Billion by 2030, registering a CAGR of 11.5 % for the forecast period 2023-2030. Factors Affecting Data Base Management Systems Market Growth

Growing inclination of organizations towards adoption of advanced technologies like cloud-based technology favours the growth of global DBMS market

The cloud-based data base management system solutions offer the organizations with an ability to scale their database infrastructure up or down as per requirement. In a crucial business environment data volume can vary over time. Here, the cloud allows organizations to allocate resources in a dynamic and systematic manner, thereby, ensuring optimal performance without underutilization. In addition, these cloud-based solutions are cost-efficient. As, these cloud-based DBMS solutions eliminate the need for companies to maintain and invest in physical infrastructure and hardware. It helps in reducing ongoing operational costs and upfront capital expenditures. Organizations can choose pay-as-you-go pricing models, where they need to pay only for the resources they consume. Therefore, it has been a cost-efficient option for both smaller businesses and large-enterprises. Moreover, the cloud-based data base management system platforms usually come with management tools which streamline administrative tasks such as backup, provisioning, recovery, and monitoring. It allows IT teams to concentrate on more of strategic tasks rather than routine maintenance activities, thereby, enhancing operational efficiency. Whereas, these cloud-based data base management systems allow users to remote access and collaboration among teams, irrespective of their physical locations. Thus, in regards with today's work environment, which focuses on distributed and remote workforces. These cloud-based DBMS solution enables to access data and update in real-time through authorized personnel, allowing collaboration and better decision-making. Thus, owing to all the above factors, the rising adoption of advanced technologies like cloud-based DBMS is favouring the market growth.

Availability of open-source solutions is likely to restrain the global data base management systems market growth

Open-source data base management system solutions such as PostgreSQL, MongoDB, and MySQL, offer strong functionality at minimal or no licensing costs. It makes open-source solutions an attractive option for companies, especially start-ups or smaller businesses with limited budgets. As these open-source solutions offer similar capabilities to various commercial DBMS offerings, various organizations may opt for this solutions in order to save costs. The open-source solutions may benefit from active developer communities which contribute to their development, enhancement, and maintenance. This type of collaborative environment supports continuous innovation and improvement, which results into solutions that are slightly competitive with commercial offerings in terms of performance and features. Thus, the open-source solutions create competition for commercial DBMS market, they thrive in the market by offering unique value propositions, addressing needs of organizations which prioritize professional support, seamless integration into complex IT ecosystems, and advanced features. Introduction of Data Base Management Systems

A Database Management System (DBMS) is a software which is specifically designed to organize and manage data in a structured manner. This system allows users to create, modify, and query a database, and also manage the security and access controls for that particular database. The DBMS offers tools for creating and modifying data models, that define the structure and relationships of data in a database. This system is also responsible for storing and retrieving data from the database, and also provide several methods for searching and querying the data. The data base management system also offers mechanisms to control concurrent access to the database, in order to ensure that number of users may access the data. The DBMS provides tools to enforce security constraints and data integrity, such as the constraints on the value of data and access controls that restricts who can access the data. The data base management system also provides mechanisms for recovering and backing up the data when a system failure occurs....
MySQL Java Computer Programs
figshare.com
zip
Updated Jul 3, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Suhailan Safei (2017). MySQL Java Computer Programs [Dataset]. http://doi.org/10.6084/m9.figshare.2813497.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.2813497.v1
Dataset updated
Jul 3, 2017
Dataset provided by
Figsharehttp://figshare.com/
Authors
Suhailan Safei
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This mysql database contains list of submitted Java programs based on series of online lab exercises from year 2013 to 2015. The programs were submitted by first year computer science students from Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Malaysia who undertaking Introductory Computer Programming subject. There were 67, 18 and 47 of participated students in 2013, 2014 and 2015 respectively. The submitted programs were all of their solution attempts in answering a computational programming question. The question was as the following:

Write a program that will read string. Then your program should show all the string character using * except for character 2, output its real character. sample input. Apology sample output. p****
Employees
kaggle.com
zip
Updated Nov 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sudhir Singh (2021). Employees [Dataset]. https://www.kaggle.com/datasets/crepantherx/employees
Explore at:
zip(31992550 bytes)Available download formats
Dataset updated
Nov 12, 2021
Authors
Sudhir Singh
Description
Dataset

This dataset was created by Sudhir Singh

Released under Data files © Original Authors

Contents
p
Royal Institute for Cultural Heritage Radiocarbon and stable isotope...
pandora.earth
Updated Jul 12, 2011
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2011). Royal Institute for Cultural Heritage Radiocarbon and stable isotope measurements - Dataset - Pandora [Dataset]. https://pandora.earth/gl_ES/dataset/royal-institute-for-cultural-heritage-radiocarbon-and-stable-isotope-measurements
Explore at:
Dataset updated
Jul 12, 2011
Description
The Radiocarbon dating laboratory of IRPA/KIK was founded in the 1960s. Initially dates were reported at more or less regular intervals in the journal Radiocarbon (Schreurs 1968). Since the advent of radiocarbon dating in the 1950s it had been a common practice amongst radiocarbon laboratories to publish their dates in so-called ‘date-lists’ that were arranged per laboratory. This was first done in the Radiocarbon Supplement of the American Journal of Science and later in the specialised journal Radiocarbon. In the course of time the latter, with the added subtitle An International Journal of Cosmogenic Isotope Research, became a regular scientific journal shifting focus from date-lists to articles. Furthermore the world-wide exponential increase of radiocarbon dates made it almost impossible to publish them all in the same journal, even more so because of the broad range of applications that use radiocarbon analysis, ranging from archaeology and art history to geology and oceanography and recently also biomedical studies.The IRPA/KIK database From 1995 onwards IRPA/KIK’s Radiocarbon laboratory started to publish its dates in small publications, continuing the numbering of the preceding lists in Radiocarbon. The first booklet in this series was “Royal Institute for Cultural Heritage Radiocarbon dates XV” (Van Strydonck et al. 1995), followed by three more volumes (XVI, XVII, XVIII). The next list (XIX, 2005) was no longer printed but instead handed out as a PDF file on CD-rom. The ever increasing number of dates and the difficulties in handling all the data, however, made us look for a more permanent and easier solution. In order to improve data management and consulting, it was thus decided to gather all our dates in a web-based database. List XIX was in fact already a Microsoft Access database that was converted into a reader friendly style and could also be printed as a PDF file. However a Microsoft Access database is not the most practical solution to make information publicly available. Hence the structure of the database was recreated in Mysql and the existing content was transferred into the corresponding fields. To display the records, a web-based front-end was programmed in PHP/Apache. It features a full-text search function that allows for partial word-matching. In addition the records can be consulted in PDF format. Old records from the printed date-lists as well as new records are now added using the same Microsoft Acces back-end, which is now connected directly to the Mysql database. The main problem with introducing the old data was that not all the current criteria were available in the past (e.g. stable isotope measurements). Furthermore since all the sample information is given by the submitter, its quality largely depends on the persons willingness to contribute as well as on the accuracy and correctness of the information he provides. Sometimes problems arrive from the fact that a certain investigation (like an excavation) is carried out over a relatively long period (sometimes even more than ten years) and is directed by different people or even institutions. This can lead to differences in the labeling procedure of the samples, but also in the interpretation of structures and artifacts and in the orthography of the site’s name. Finally the submitter might change address, while the names of institutions or even regions and countries might change as well (e.g.Zaire - Congo)
m
Membership of the IMO Commission for Agricultural Meteorology (1913-1947)
data.mendeley.com
Updated May 6, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giuditta Parolini (2020). Membership of the IMO Commission for Agricultural Meteorology (1913-1947) [Dataset]. http://doi.org/10.17632/pds6tz443t.1
Explore at:
Unique identifier
https://doi.org/10.17632/pds6tz443t.1
Dataset updated
May 6, 2020
Authors
Giuditta Parolini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset has been built as part of a research project on the history of agricultural meteorology in the first half of the twentieth century. The dataset provides information on the members of the technical commission on agriculture established by the International Meteorological Organization (IMO) in the years 1913-1947. The main sources used to build the dataset are the proceedings of the meetings held by the IMO, and in particular the membership lists printed in these proceedings. Information provided by these primary sources has been constantly examined for consistency and correctness and, whenever possible, biographical resources on individual members have been consulted and mentioned in the dataset here presented. The sql files in the dataset allow to re-build the MySQL database that was created to investigate the commission membership and its transformation over time. A copy of the data tables is also available in csv format for the users who wish to access only the data. In the dataset there are twelve tables. Eleven tables provide information (affiliation, nation, city, role in the commission) on the scientists listed as members of the commission for a specific time period. There is a table for each year (1913, 1919, 1921, 1923, 1926, 1929, 1932, 1935, 1937, 1946, 1947) in which a membership list is available. NationH and NationG stands for Nation(History) and Nation(Geography), similarly for cityH and cityG. In this way, it is possible to place commission members also within modern countries and cities, not only their historical counterparts, if one wishes to build a map of the members’ location using current geodata. The role of each member within the commission has only three possible options: president, secretary/vice-president, member. The last table in the dataset, m_all, provides a comprehensive list of all the over one-hundred members of the commission with some biographical details on them (when available). The idmembers value is the unique identifier for each member within this dataset. This dataset has been used in an extensive investigation on the role that the IMO had in promoting international collaboration in agricultural meteorology during the first half of the twentieth century. The data here gathered, however, can be of interest beyond the history of agricultural meteorology. They also offer relevant materials to scholars more generally concerned with the work of the IMO, and the database structure provides a template for similar data collection work on other IMO technical commissions. These commissions were key places for sharing meteorological and climatological knowledge between mid-nineteenth century and mid-twentieth century and they certainly deserve more attention than the one they have so far received from scholars.

I gratefully acknowledge the financial support of the German Research Foundation (DFG) (Project No. 321660352) in the preparation of this dataset.
n
Heparome
neuinfo.org
dknet.org
+2more
Updated Oct 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Heparome [Dataset]. http://identifiers.org/RRID:SCR_008615
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_008615
Dataset updated
Oct 11, 2025
Description
THIS RESOURCE IS NO LONGER IN SERVICE, documented on July 17, 2013. A database which contains the information of heparin-binding proteins of E. coli K-12 MG1655 cells. Heparin affinity columns were applied to enrich and fractionate proteins. Identification of proteins was done via the collaboration with David Russell''s lab. Because heparin is negatively charged sulfated glucosaminoglycan, polyamion binding proteins, which contain nucleic acid-binding proteins, are expected to bind to heparin columns. Study of the expression pattern of heparin-binding proteins will help to study the nucleic acid-binding proteins, most of which are related to regulation. Moreover, heparin affinity columns will also erich low abundance proteins. Heparome database is constructed using MySQL. Website interface is built using HTML and PHP. Queries between MySQL database and website interface are executed using PHP. Besides including information of identified proteins, such as swiss accession number, gene name, molecular weight, isoelectric point, condon adaptation index (CAI), functional classification, et. al. , it also includes information of experiments, such as sample preparation, heparin-HPLC chromatography, SDS-PAGE gel separation and MALDI-MS.
u
Data from: KGCW 2023 Challenge @ ESWC 2023
investigacion.usc.gal
Updated 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana; Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://investigacion.usc.gal/documentos/67321d88aea56d4af0484859
Explore at:
Dataset updated
2023
Authors
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana; Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
Description
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptakein the last decade from compliance to performance optimizations with respectto execution time. Besides execution time as a metric for comparing knowledgegraph construction, other metrics e.g. CPU or memory usage are not considered.This challenge aims at benchmarking systems to find which RDF graphconstruction system optimizes for metrics e.g. execution time, CPU,memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources(CPU and memory usage) for the parameters listed in this challenge, comparedto the state-of-the-art of the existing tools and the baseline results providedby this challenge. This challenge is not limited to execution times to createthe fastest pipeline, but also computing resources to achieve the most efficientpipeline.

We provide a tool which can execute such pipelines end-to-end. This tool alsocollects and aggregates the metrics such as execution time, CPU and memoryusage, necessary for this challenge as CSV files. Moreover, the informationabout the hardware used during the execution of the pipeline is available aswell to allow fairly comparing different pipelines. Your pipeline should consistof Docker images which can be executed on Linux to run the tool. The tool isalready tested with existing systems, relational databases e.g. MySQL andPostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuosowhich can be combined in any configuration. It is strongly encouraged to usethis tool for participating in this challenge. If you prefer to use a differenttool or our tool imposes technical requirements you cannot solve, please contactus directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have moreinsights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from thepublic transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different stepsfor each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times from which the median execution time of eachstep is calculated and reported. Each step with the median execution time isthen reported in the baseline results with all its measured metrics.Query timeout is set to 1 hour and knowledge graph construction timeoutto 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with thefollowing files:

Input dataset as CSV.

Mapping file as RML.

Queries as SPARQL.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

SPARQL queries to retrieve the results for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is beingevaluated, the number of rows and columns may differ. The first row is alwaysthe header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row.JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500020 triples

50 percent 1000020 triples

75 percent 500020 triples

100 percent 20 triples

Empty values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500000 triples

50 percent 1000000 triples

75 percent 500000 triples

100 percent 0 triples

Mappings

Scale Number of Triples

1TM + 15POM 1500000 triples

3TM + 5POM 1500000 triples

5TM + 3POM 1500000 triples

15TM + 1POM 1500000 triples

Properties

Scale Number of Triples

1M rows 1 column 1000000 triples

1M rows 10 columns 10000000 triples

1M rows 20 columns 20000000 triples

1M rows 30 columns 30000000 triples

Records

Scale Number of Triples

10K rows 20 columns 200000 triples

100K rows 20 columns 2000000 triples

1M rows 20 columns 20000000 triples

10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples

0 percent 0 triples

25 percent 125000 triples

50 percent 250000 triples

75 percent 375000 triples

100 percent 500000 triples

1-N joins

Scale Number of Triples

1-10 0 percent 0 triples

1-10 25 percent 125000 triples

1-10 50 percent 250000 triples

1-10 75 percent 375000 triples

1-10 100 percent 500000 triples

1-5 50 percent 250000 triples

1-10 50 percent 250000 triples

1-15 50 percent 250005 triples

1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples

10-1 0 percent 0 triples

10-1 25 percent 125000 triples

10-1 50 percent 250000 triples

10-1 75 percent 375000 triples

10-1 100 percent 500000 triples

5-1 50 percent 250000 triples

10-1 50 percent 250000 triples

15-1 50 percent 250005 triples

20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples

5-5 50 percent 1374085 triples

10-5 50 percent 1375185 triples

5-10 50 percent 1375290 triples

5-5 25 percent 718785 triples

5-5 50 percent 1374085 triples

5-5 75 percent 1968100 triples

5-5 100 percent 2500000 triples

5-10 25 percent 719310 triples

5-10 50 percent 1375290 triples

5-10 75 percent 1967660 triples

5-10 100 percent 2500000 triples

10-5 25 percent 719370 triples

10-5 50 percent 1375185 triples

10-5 75 percent 1968235 triples

10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples

1 395953 triples

10 3959530 triples

100 39595300 triples

1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000

Q1 58540 results 585400 results No results available No results available

Q2 636 results 11998 results
125565 results 1261368 results

Q3 421 results 4207 results 42067 results 420667 results

Q4 13 results 130 results 1300 results 13000 results

Q5 35 results 350 results 3500 results 35000 results

Q6 1 result 1 result 1 result 1 result

Q7 68 results 67 results 67 results 53 results

Q8 35460 results 354600 results No results available No results available

Q9 130 results 1300 results 13000 results 130000 results

Q10 1 result 1 result 1 result 1 result

Q11 130 results 260 results 260 results 260 results

Q12 13 results 130 results 1300 results 13000 results

Q13 265 results 2650 results 26500 results 265000 results

Q14 2234 results 22340 results 223400 results No results available

Q15 592 results 8684 results 35502 results 206628 results

Q16 390 results 780 results 260 results 780 results

Q17 855 results 8550 results 85500 results 855000 results

Q18 104 results 1300 results 13000 results 130000 results
r
Usage Statistics for University of Tasmania EPrints Repository
researchdata.edu.au
Updated Apr 27, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sale, Arthur; Sale, Arthur (2017). Usage Statistics for University of Tasmania EPrints Repository [Dataset]. https://researchdata.edu.au/usage-statistics-university-eprints-repository/927350
Explore at:
Dataset updated
Apr 27, 2017
Dataset provided by
University of Tasmania, Australia
Authors
Sale, Arthur; Sale, Arthur
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
The dataset is an active collection of access data to information items in the University of Tasmania’s EPrints repository. Each night a task is scheduled to run, and this picks up in the Apache access logs from where it left off the previous night. Each download of an open access full-text item causes the generation of a database record in the MySQL database, together with a timestamp, and an approximate location of the computer system generating the download. This is achieved by looking up the IP address against the GeoIP database, with one significant difference. Downloads originating from a University of Tasmania IP address are separately identified, and removed from the ‘Australia’ category. This eliminates vanity searches from achieving high significance. Countries are coded using the ISO3166 two-letter code.

The dataset has been used to analyse the usage made of the repository and to tune it to achieve maximal visibility for the University of Tasmania. Researchers with items in the repository have used it to identify the types of use being made of their work, and to find potential collaborators. The citation of a work in a journal or conference article, for example, causes a typical step in usage, and the citing article can be searched in Google or Google Scholar to identify the authors. This enhances the dissemination experience and its value.

The software was written in the University of Tasmania by Professor Arthur Sale (in php) based on earlier work by the University of Melbourne (with permission). Mr Christian McGee wrote some critical sections of the code in perl, and set up the cron scheduling.

The dataset is generated by a computer program written by Professor Arthur Sale. The software was a test bed for ideas, and subsequently resulted in an official software set included in the EPrints distribution. This set expanded on the concepts significantly
CHINOOK Music
kaggle.com
zip
Updated Sep 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
willian oliveira (2024). CHINOOK Music [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/chinook-music
Explore at:
zip(9603 bytes)Available download formats
Dataset updated
Sep 19, 2024
Authors
willian oliveira
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The Chinook Database is a sample database designed for use with multiple database platforms, such as SQL Server, Oracle, MySQL, and others. It can be easily set up by running a single SQL script, making it a convenient alternative to the popular Northwind database. Chinook is widely used in demos and testing environments, particularly for Object-Relational Mapping (ORM) tools that target both single and multiple database servers.

Supported Database Servers Chinook supports several database servers, including:

DB2 MySQL Oracle PostgreSQL SQL Server SQL Server Compact SQLite Download Instructions You can download the SQL scripts for each supported database server from the latest release assets. The appropriate SQL script file(s) for your database vendor are provided, which can be executed using your preferred database management tool.

Data Model The Chinook Database represents a digital media store, containing tables that include:

Artists Albums Media tracks Invoices Customers Sample Data The media data in Chinook is derived from a real iTunes Library, providing a realistic dataset for users. Additionally, users can generate their own SQL scripts using their personal iTunes Library by following specific instructions. Customer and employee details in the database were manually crafted with fictitious names, addresses (mappable via Google Maps), and well-structured contact information such as phone numbers, faxes, and emails. Sales data is auto-generated and spans a four-year period, using random values.

Why is it Called Chinook? The Chinook Database's name is a nod to its predecessor, the Northwind database. Chinooks are warm, dry winds found in the interior regions of North America, particularly over southern Alberta in Canada, where the Canadian Prairies meet mountain ranges. This natural phenomenon inspired the choice of name, reflecting the idea that Chinook serves as a refreshing alternative to the Northwind database.
a
DREAMS Impact Evaluation Project – Nested Cohort – Follow Up Round 01 -...
data.ahri.org
Updated Apr 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr Maryam, Shahmanesh (2020). DREAMS Impact Evaluation Project – Nested Cohort – Follow Up Round 01 - South Africa [Dataset]. https://data.ahri.org/index.php/catalog/925
Explore at:
Dataset updated
Apr 13, 2020
Dataset provided by
Dr Maryam, Shahmanesh
Time period covered
2018
Area covered
South Africa
Description
Abstract

The main aim of this study is to evaluate the impact and effectiveness of the scale up of the DREAMS HIV prevention package of biological, behavioural and social interventions in reducing HIV incidence in adolescent girls and young women residing in the uMkhanyakude district of KwaZulu-Natal. To achieve this aim, the changes in different outcomes will be assessed over time. The primary outcome will be HIV incidence and other key secondary outcomes will include knowledge of own HIV status, sexual debut, HSV-2, number of sexual partners, age-disparity with sexual partners, ever been pregnant, condom use, unmet need for contraception, transactional sex, education (remaining in school) and experiences of violence.

Geographic coverage

Demographic surveillance area of the Africa Health Research Institute; KwaZulu-Natal, uMkhanyakude district.

Analysis unit

Individual

Universe

Closed cohorts of 800 AGYW will be followed prospectively at three time points over the two-year study period - baseline, 12 months and 24 months - at points closest aligned with periods before, during and after DREAMS implementation. In ACDIS cohorts of 400 girls aged 13-17 years and 400 young women aged 14-23 years will undergo informed consent, recruited undergo a baseline questionnaire and provide dry blood spots for HSV2 at the same time they provide a sample for the HIV testing in the surveillance and then reviewed annually for the next two years.

Kind of data

Longitudinal survey data

Sampling procedure

Adolescent girls and young women aged 14-23 years who were residents in the demographic surveillance area of the Africa Health Research Institute. A total of 3013 participants were randomly selected to obtain a target sample size of 800 after 2 years of follow-up, allowing for 40% non-contact/loss-to-follow-up. Sampling was stratified by age group and area (week-blocks).

Cleaning operations

All data will be managed using electronic data management tools. The data management system for these will be based on REDCap (research electronic data capture) developed at Vanderbilt University. The REDCap database resides within a single MySQL database server within a secure server cluster at the AHRI. Survey data are synchronised by the REDCap application from the mobile device to a central MySQL server. Access control is managed through Microsoft Active Directory with minimum password complexity and compulsory password change policies.
D
Replication Data for: Two origins of the prefix IZ- and how they affect the...
dataverse.azure.uit.no
dataverse.no
pdf, txt, xlsx
Updated Jul 18, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anna Endresen; Anna Endresen (2022). Replication Data for: Two origins of the prefix IZ- and how they affect the VY- vs. IZ- correlation in Modern Russian. [Dataset]. http://doi.org/10.18710/NFNB8D
Explore at:
xlsx(354229), txt(8075), txt(491558), pdf(61023)Available download formats
Unique identifier
https://doi.org/10.18710/NFNB8D
Dataset updated
Jul 18, 2022
Dataset provided by
DataverseNO
Authors
Anna Endresen; Anna Endresen
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
Jan 1, 1950 - Jun 14, 2014
Description
This is the data examined in the study of Modern Russian verbs formed with the prefixes VY- and IZ-, a native East Slavic prefix and a loan Church Slavonic prefix, both of which mean ‘out of’. The study provides a synchronic contrastive analysis of the two prefixes and discusses how much they are semantically similar and what determines their distribution across Russian verbs. The dataset “VY_IZ_DATABASE_2019” provides replication data for the article “Two origins of the prefix IZ- and how they affect the VY- vs. IZ- correlation in Modern Russian” accepted for publication in Russian Linguistics. International Journal for the Study of Russian and other Slavic Languages 43(3). The amount of data examined in this study exceeds all previous accounts of the issue. The database contains 989 prefixed verbs. The verbs were culled from the Modern Subcorpus of the Russian National Corpus (www.ruscorpora.ru) and manually tagged for a number of parameters. The data was extracted automatically via the software management program MySQL. After that each verb was double-checked in the corpus and analyzed. In the database, each verb is accompanied with an English gloss, simplex base, corpus frequency, corpus example of its use, and a number of tags relevant for this study (type of perfective, submeaning of the prefix, etc.). The structure of the database is described in detail in the document “ReadMe”. Here is the abstract of the article: This article reports on a synchronic study of 989 Modern Russian verbs formed with the prefixes VY- and IZ-, including standard lexemes, obsolete verbs, and newly-formed coinages culled from the Russian National Corpus. I argue that the hypothesis about the two historical origins of the prefix IZ- may explain the ambivalent behavior of this prefix in Modern Russian, which shows both semantic overlap and semantic contrast with the prefix VY-. I revisit the most detailed semantic account of the two prefixes (Nesset et al 2011) and provide additional support for their model of polysemy in terms of type and token frequencies of the analyzed verbs. I further propose that VY- and IZ- encode different spatial image schemas and thus explain why the prefix IZ- is compatible with verbs of multidirectional motion, whereas VY- preferably attaches to verbs of unidirectional motion; why the verbs prefixed in IZ- often carry a more evocative flavor and refer to more intensive activities than those described by parallel verbs in VY-; why IZ- encodes multiplication of an action named by the base and why this is not common for VY-; and finally how it is possible for IZ- to have both bookish and colloquial uses, being very obsolete and highly productive in different submeanings.
Z
KGCW 2024 Challenge @ ESWC 2024
data.niaid.nih.gov
investigacion.usc.gal
+3more
Updated Jun 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10721874
Explore at:
Dataset updated
Jun 11, 2024
Dataset provided by
Universidad Politécnica de Madrid
KU Leuven
STI Insbruck
IDLab
Universidade de Santiago de Compostela
Authors
Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Knowledge Graph Construction Workshop 2024: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptakein the last decade from compliance to performance optimizations with respectto execution time. Besides execution time as a metric for comparing knowledgegraph construction, other metrics e.g. CPU or memory usage are not considered.This challenge aims at benchmarking systems to find which RDF graphconstruction system optimizes for metrics e.g. execution time, CPU,memory usage, or a combination of these metrics.

Task description

The task is to reduce and report the execution time and computing resources(CPU and memory usage) for the parameters listed in this challenge, comparedto the state-of-the-art of the existing tools and the baseline results providedby this challenge. This challenge is not limited to execution times to createthe fastest pipeline, but also computing resources to achieve the most efficientpipeline.

We provide a tool which can execute such pipelines end-to-end. This tool alsocollects and aggregates the metrics such as execution time, CPU and memoryusage, necessary for this challenge as CSV files. Moreover, the informationabout the hardware used during the execution of the pipeline is available aswell to allow fairly comparing different pipelines. Your pipeline should consistof Docker images which can be executed on Linux to run the tool. The tool isalready tested with existing systems, relational databases e.g. MySQL andPostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuosowhich can be combined in any configuration. It is strongly encouraged to usethis tool for participating in this challenge. If you prefer to use a differenttool or our tool imposes technical requirements you cannot solve, please contactus directly.

Track 1: Conformance

The set of new specification for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provide a set of test-cases for each module:

RML-Core

RML-IO

RML-CC

RML-FNML

RML-Star

These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.

Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

Track 2: Performance

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetic generated data to have moreinsights of their influence on the pipeline.

Data

Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings

Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights in the pipeline with real data from thepublic transport domain in Madrid.

Scaling

GTFS-1 SQL

GTFS-10 SQL

GTFS-100 SQL

GTFS-1000 SQL

Heterogeneity

GTFS-100 XML + JSON

GTFS-100 CSV + XML

GTFS-100 CSV + JSON

GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different stepsfor each parameter:

The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format

The pipeline is executed 5 times from which the median execution time of eachstep is calculated and reported. Each step with the median execution time isthen reported in the baseline results with all its measured metrics.Knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool,you can adapt the execution plans for this example pipeline to your own needs.

Each parameter has its own directory in the ground truth dataset with thefollowing files:

Input dataset as CSV.

Mapping file as RML.

Execution plan for the pipeline in metadata.json.

Datasets

Knowledge Graph Construction Parameters

The dataset consists of:

Input dataset as CSV for each parameter.

Mapping file as RML for each parameter.

Baseline results for each parameter with the example pipeline.

Ground truth dataset for each parameter generated with the example pipeline.

Format

All input datasets are provided as CSV, depending on the parameter that is beingevaluated, the number of rows and columns may differ. The first row is alwaysthe header of the CSV.

GTFS-Madrid-Bench

The dataset consists of:

Input dataset as CSV with SQL schema for the scaling and a combination of XML,

CSV, and JSON is provided for the heterogeneity.

Mapping file as RML for both scaling and heterogeneity.

SPARQL queries to retrieve the results.

Baseline results with the example pipeline.

Ground truth dataset generated with the example pipeline.

Format

CSV datasets always have a header as their first row.JSON and XML datasets have their own schema.

Evaluation criteria

Submissions must evaluate the following metrics:

Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step is the minimum and maximum calculated of the memory consumption during the execution of a step.

Expected output

Duplicate values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500020 triples

50 percent 1000020 triples

75 percent 500020 triples

100 percent 20 triples

Empty values

Scale Number of Triples

0 percent 2000000 triples

25 percent 1500000 triples

50 percent 1000000 triples

75 percent 500000 triples

100 percent 0 triples

Mappings

Scale Number of Triples

1TM + 15POM 1500000 triples

3TM + 5POM 1500000 triples

5TM + 3POM 1500000 triples

15TM + 1POM 1500000 triples

Properties

Scale Number of Triples

1M rows 1 column 1000000 triples

1M rows 10 columns 10000000 triples

1M rows 20 columns 20000000 triples

1M rows 30 columns 30000000 triples

Records

Scale Number of Triples

10K rows 20 columns 200000 triples

100K rows 20 columns 2000000 triples

1M rows 20 columns 20000000 triples

10M rows 20 columns 200000000 triples

Joins

1-1 joins

Scale Number of Triples

0 percent 0 triples

25 percent 125000 triples

50 percent 250000 triples

75 percent 375000 triples

100 percent 500000 triples

1-N joins

Scale Number of Triples

1-10 0 percent 0 triples

1-10 25 percent 125000 triples

1-10 50 percent 250000 triples

1-10 75 percent 375000 triples

1-10 100 percent 500000 triples

1-5 50 percent 250000 triples

1-10 50 percent 250000 triples

1-15 50 percent 250005 triples

1-20 50 percent 250000 triples

1-N joins

Scale Number of Triples

10-1 0 percent 0 triples

10-1 25 percent 125000 triples

10-1 50 percent 250000 triples

10-1 75 percent 375000 triples

10-1 100 percent 500000 triples

5-1 50 percent 250000 triples

10-1 50 percent 250000 triples

15-1 50 percent 250005 triples

20-1 50 percent 250000 triples

N-M joins

Scale Number of Triples

5-5 50 percent 1374085 triples

10-5 50 percent 1375185 triples

5-10 50 percent 1375290 triples

5-5 25 percent 718785 triples

5-5 50 percent 1374085 triples

5-5 75 percent 1968100 triples

5-5 100 percent 2500000 triples

5-10 25 percent 719310 triples

5-10 50 percent 1375290 triples

5-10 75 percent 1967660 triples

5-10 100 percent 2500000 triples

10-5 25 percent 719370 triples

10-5 50 percent 1375185 triples

10-5 75 percent 1968235 triples

10-5 100 percent 2500000 triples

GTFS Madrid Bench

Generated Knowledge Graph

Scale Number of Triples

1 395953 triples

10 3959530 triples

100 39595300 triples

1000 395953000 triples

Queries

Query Scale 1 Scale 10 Scale 100 Scale 1000

Q1 58540 results 585400 results No results available No results available

Q2 636 results 11998 results
125565 results 1261368 results

Q3 421 results 4207 results 42067 results 420667 results

Q4 13 results 130 results 1300 results 13000 results

Q5 35 results 350 results 3500 results 35000 results

Q6 1 result 1 result 1 result 1 result

Q7 68 results 67 results 67 results 53 results

Q8 35460 results 354600 results No results available No results available

Q9 130 results 1300
f
Data from: MWSTAT: A MODULATED WEB-BASED STATISTICAL SYSTEM
scielo.figshare.com
jpeg
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francisco Louzada; Anderson Ara (2023). MWSTAT: A MODULATED WEB-BASED STATISTICAL SYSTEM [Dataset]. http://doi.org/10.6084/m9.figshare.6967682.v1
Explore at:
jpegAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6967682.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SciELO journals
Authors
Francisco Louzada; Anderson Ara
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT In this paper we present the development of a modulated web based statistical system, hereafter MWStat, which shifts the statistical paradigm of analyzing data into a real time structure. The MWStat system is useful for both online storage data and questionnaires analysis, as well as to provide real time disposal of results from analysis related to several statistical methodologies in a customizable fashion. Overall, it can be seem as a useful technical solution that can be applied to a large range of statistical applications, which needs of a scheme of devolution of real time results, accessible to anyone with internet access. We display here the step-by-step instructions for implementing the system. The structure is accessible, built with an easily interpretable language and it can be strategically applied to online statistical applications. We rely on the relationship of several free languages, namely, PHP, R, MySQL database and an Apache HTTP server, and on the use of software tools such as phpMyAdmin. We expose three didactical examples of the MWStat system on institutional evaluation, statistical quality control and multivariate analysis. The methodology is also illustrated in a real example on institutional evaluation. A MWStat module was specifically built for providing a real time poll for teacher evaluation at the Federal University of São Carlos (Brazil).

Facebook

Twitter

Click to copy link

Link copied

Cite

Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/datasets/atanaskanev/sqlite-sakila-sample-database/code

SQLite Sakila Sample Database

SQLite Port of the Original MySQL Sakila Sample Database

Explore at:

zip(4495190 bytes)Available download formats

Dataset updated

Mar 14, 2021

Authors

Atanas Kanev

Description

SQLite Sakila Sample Database

Database Description

The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

Oracle
SQL Server
SQLIte
Interbase/Firebird
Microsoft Access

Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators to decide which database to use for development of new products The user can run the same SQL against different kind of databases and compare the performance

License: BSD Copyright DB Software Laboratory http://www.etl-tools.com

Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

Files Description

The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

sqlite> .open sqlite-sakila.db # creates the .db file sqlite> .read sqlite-sakila-schema.sql # creates the database schema sqlite> .read sqlite-sakila-insert-data.sql # inserts the data

Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.

Clear search

Close search

Google apps

Main menu

SQLite Sakila Sample Database

SQLite Sakila Sample Database

Database Description

Files Description

classicmodels

Bike Store Relational Database | SQL

FooDrugs database: A database with molecular and text information about food...

Rediscovery Datasets: Connecting Duplicate Reports of Apache, Eclipse, and...

Clean Meta Kaggle

Cleaned Meta-Kaggle Dataset

The Original Dataset - Meta-Kaggle

August 2023 update

The Problems with the Original Dataset

The Solution

Orphan Drugs - Dataset 1: Twitter issue-networks as excluded publics

Data Base Management Systems market size was USD 50.5 billion in 2022 !

MySQL Java Computer Programs

Employees

Dataset

Contents

Royal Institute for Cultural Heritage Radiocarbon and stable isotope...

Membership of the IMO Commission for Agricultural Meteorology (1913-1947)

Heparome

Data from: KGCW 2023 Challenge @ ESWC 2023

Usage Statistics for University of Tasmania EPrints Repository

CHINOOK Music

DREAMS Impact Evaluation Project – Nested Cohort – Follow Up Round 01 -...

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Cleaning operations

Replication Data for: Two origins of the prefix IZ- and how they affect the...

KGCW 2024 Challenge @ ESWC 2024

Data from: MWSTAT: A MODULATED WEB-BASED STATISTICAL SYSTEM

SQLite Sakila Sample Database

SQLite Port of the Original MySQL Sakila Sample Database

SQLite Sakila Sample Database

Database Description

Files Description