7 datasets found
  1. Clean Meta Kaggle (Kaggle's public data as a cleaned MySQL database)

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.


    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7 GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
    • Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.
    • Downloading the CSV files from Kaggle using the Kaggle API.
    • Cleaning the data using pandas. I do that by running the clean_data.py script. The script does the following steps for each table (see the sketch after this list):
      • Drops the columns that are not needed.
      • Converts each column to the right data type.
      • Replaces foreign keys that do not exist with NULL.
      • Replaces some of the missing values with default values.
      • Removes rows with missing values in the primary key / not-null columns.
      • Removes duplicate rows.
    • Loading the data into the database using the LOAD DATA INFILE command.
    • Checking that the number of rows in the database tables matches the number of rows in the CSV files.
    • Adding foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.
    • Updating the Total columns in the database tables. I do that by running the update_totals.sql script.
    • Backing up the database.
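
    A minimal sketch of the per-table cleaning pass described above, using pandas. It is not the actual clean_data.py; the example table (ForumMessages), the column names (Medal, PostDate), and the output path are assumptions used for illustration:

    ```python
    import pandas as pd

    # Illustrative cleaning pass for a single table (ForumMessages); clean_data.py
    # applies the same kind of steps to every table. Column names are assumptions.
    users = pd.read_csv("Users.csv", usecols=["Id"])
    msgs = pd.read_csv("ForumMessages.csv")

    # Drop columns that are not needed.
    msgs = msgs.drop(columns=["Medal"], errors="ignore")

    # Convert each column to the right data type.
    msgs["Id"] = pd.to_numeric(msgs["Id"], errors="coerce").astype("Int64")
    msgs["UserId"] = pd.to_numeric(msgs["UserId"], errors="coerce").astype("Int64")
    msgs["PostDate"] = pd.to_datetime(msgs["PostDate"], errors="coerce")

    # Replace foreign keys that do not exist in Users with NULL.
    msgs.loc[~msgs["UserId"].isin(users["Id"]), "UserId"] = pd.NA

    # Remove rows with missing primary keys, then remove duplicates.
    msgs = msgs.dropna(subset=["Id"]).drop_duplicates()

    # Write a clean file that LOAD DATA INFILE can ingest (\N is read as NULL).
    msgs.to_csv("clean/ForumMessages.csv", index=False, na_rep="\\N")
    ```
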
  2. BioTIME

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 24, 2021
    Cite
    BioTIME Consortium (2021). BioTIME [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1095627
    Dataset updated
    Jun 24, 2021
    Dataset provided by
    Centre for Biological Diversity and Scottish Oceans Institute, School of Biology, University of St. Andrews, St. Andrews, UK
    Authors
    BioTIME Consortium
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BioTIME database contains raw data on species identities and abundances in ecological assemblages through time. The database consists of 11 tables; one raw data table plus ten related meta data tables. For further information please see our associated data paper.

    This data consists of several elements:

    BioTIMESQL_02_04_2018.sql - an SQL file for the full public version of BioTIME which can be imported into any MySQL database.

    BioTIMEQuery_02_04_2018.csv - the raw data file; although too large to view in Excel, it can be read into several software applications such as R or various database packages.

    BioTIMEMetadata_02_04_2018.csv - file containing the meta data for all studies.

    BioTIMECitations_02_04_2018.csv - file containing the citation list for all studies.

    BioTIMECitations_02_04_2018.xlsx - file containing the citation list for all studies (some special characters are not supported in the csv format).

    BioTIMEInteractions_02_04_2018.Rmd - an R Markdown file providing a brief overview of how to interact with the database and associated .csv files (this will not work until field paths and database connections have been added/updated).
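
    As a rough, unofficial sketch of working with the large query file outside of a database, the CSV can be streamed in chunks with pandas; the chunk size below is an arbitrary choice and no column names are assumed:

    ```python
    import pandas as pd

    # Stream the raw data file in chunks instead of loading it all at once.
    total_rows = 0
    for chunk in pd.read_csv("BioTIMEQuery_02_04_2018.csv", chunksize=500_000):
        total_rows += len(chunk)  # replace with real per-chunk processing
    print("rows read:", total_rows)
    ```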

    Please note: any users of any of this material should cite the associated data paper in addition to the DOI listed here.

    To cite the data paper use the following:

    Dornelas M, Antão LH, Moyes F, Bates AE, Magurran AE, et al. BioTIME: A database of biodiversity time series for the Anthropocene. Global Ecol Biogeogr. 2018;27:760-786. https://doi.org/10.1111/geb.12729

  3. Data from: Spatial Modeling for Resources Framework (SMRF)

    • catalog.data.gov
    • gimi9.com
    • +1 more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Spatial Modeling for Resources Framework (SMRF) [Dataset]. https://catalog.data.gov/dataset/spatial-modeling-for-resources-framework-smrf-1db41
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    Spatial Modeling for Resources Framework (SMRF) was developed at the USDA Agricultural Research Service (ARS) in Boise, ID, and was designed to increase the flexibility of taking measured weather data and distributing the point measurements across a watershed. SMRF was developed to be used as an operational or research framework, where ease of use, efficiency, and ability to run in near real time are high priorities.

    Highlights
    • Robust meteorological spatial forcing data development for physically based models
    • The Python framework can be used for research or operational applications
    • Parallel processing and multi-threading allow for large modeling domains at high resolution
    • Real-time and historical applications for water supply resources

    Features
    • SMRF was developed as a modular framework to enable new modules to be easily integrated and utilized.
    • Load data into SMRF from a MySQL database, CSV files, or gridded climate models (e.g. WRF)
    • Variables currently implemented: air temperature; vapor pressure; precipitation mass, phase, density, and percent snow; wind speed and direction; solar radiation; thermal radiation
    • Output variables to NetCDF files
    • Data queue for multithreaded application
    • Computation tasks implemented in C

    Resources in this dataset:
    • Resource Title: SMRF GitHub repository. File Name: Web Page, URL: https://github.com/USDA-ARS-NWRC/smrf
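
    SMRF writes its distributed variables to NetCDF, so output from a run can be inspected with standard tools. The sketch below uses xarray; the file and variable names (air_temp.nc, air_temp) are assumptions about a typical run, not part of this dataset:

    ```python
    import xarray as xr

    # Open one of SMRF's NetCDF output files (file and variable names assumed).
    ds = xr.open_dataset("air_temp.nc")
    print(ds)  # dimensions, coordinates, and variables in the file

    # Pull the distributed air-temperature field for the first time step
    # and summarise it over the modeling domain.
    temp0 = ds["air_temp"].isel(time=0)
    print("domain mean at t0:", float(temp0.mean()))
    ```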

  4. Using Virtuoso as an alternate triple store for a VIVO instance

    • vivo.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Paul Albert; Eliza Chan; Prakesh Adekkanattu; Mohammad Mansour (2023). Using Virtuoso as an alternate triple store for a VIVO instance [Dataset]. http://doi.org/10.6084/m9.figshare.2002032.v2
    Explore at:
    pdf (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    VIVO
    Authors
    Paul Albert; Eliza Chan; Prakesh Adekkanattu; Mohammad Mansour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: For some time, the VIVO for Weill Cornell Medical College (WCMC) had struggled with both unacceptable page load times and unreliable uptime. With some individual profiles containing upwards of 800 publications, WCMC VIVO has relatively large profiles, but no profile was so large that it could account for this performance. The WCMC VIVO Implementation Team explored a number of options for improving performance including caching, better hardware, query optimization, limiting user access to large pages, using another instance of Tomcat, throttling bots, and blocking IPs issuing too many requests. But none of these avenues were fruitful.

    Analysis of triple stores: With the 1.7 version, VIVO ships with the Jena SDB triple store, but the SDB version of Jena is no longer supported by its developers. In April, we reviewed various published analyses and benchmarks suggesting there were alternatives to Jena such as Virtuoso that perform better than even Jena's successor, TDB. In particular, the Berlin SPARQL Benchmark v. 3.1 [1] showed that Virtuoso had the strongest performance compared to the other data stores measured, including BigData, BigOwlim, and Jena TDB. In addition, Virtuoso is used on dbpedia.org, which serves up 3 billion triples compared to only 12 million in WCMC's VIVO site. Whereas Jena SDB stores its triples in a MySQL database, Virtuoso manages its triples in a binary file. The software is available in open source and commercial editions.

    Configuration: In late 2014, we installed Virtuoso on a local machine and loaded data from our production VIVO. Some queries completed in about 10% of the time as compared to our production VIVO. However, we noticed that the listview queries invoked whenever profile pages were loaded were still slow. After soliciting feedback from members of both the Virtuoso and VIVO communities, we modified these queries to rely on the OPTIONAL instead of UNION construct. This modification, which wasn't possible in a Jena SDB environment, reduced by eight-fold the number of queries that the application makes of the triple store. About four or five additional steps were required for VIVO and Virtuoso to work optimally with one another; these are documented in the VIVO Duraspace wiki.

    Results: On March 31, WCMC launched Virtuoso in its production environment. According to our instance of New Relic, VIVO has an average page load of about four seconds and 99% uptime, both of which are dramatic improvements. There are opportunities for further tuning: the four-second average includes pages such as the visualizations as well as pages served up to logged-in users, which are slower than other types of pages.

    [1] http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/#comparison
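
    As an illustrative sketch only (not the actual VIVO listview query), the snippet below queries a Virtuoso SPARQL endpoint from Python with SPARQLWrapper and uses OPTIONAL blocks in place of UNION alternatives; the endpoint URL and the vivo:overview property are assumptions:

    ```python
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Virtuoso's default SPARQL endpoint on a local install (assumed URL).
    sparql = SPARQLWrapper("http://localhost:8890/sparql")

    # One query with OPTIONAL blocks instead of several UNION alternatives:
    # missing labels or overviews come back unbound rather than requiring
    # extra round-trips to the triple store.
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX vivo: <http://vivoweb.org/ontology/core#>
        SELECT ?person ?label ?overview WHERE {
            ?person a <http://xmlns.com/foaf/0.1/Person> .
            OPTIONAL { ?person rdfs:label ?label }
            OPTIONAL { ?person vivo:overview ?overview }
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"], row.get("label", {}).get("value", ""))
    ```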

  5. MedSynora DW - Medical Data Warehouse

    • kaggle.com
    zip
    Updated Mar 14, 2025
    Cite
    BenMebrar (2025). MedSynora DW - Medical Data Warehouse [Dataset]. https://www.kaggle.com/datasets/mebrar21/medsynora-dw
    Explore at:
    zip, 89253728 bytes (available download formats)
    Dataset updated
    Mar 14, 2025
    Authors
    BenMebrar
    License

    Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    MedSynora DW – A Comprehensive Synthetic Hospital Patient Data Warehouse

    Overview
    MedSynora DW is a large synthetic dataset designed to simulate the operational flow of a large hospital from a patient-centred perspective. It covers patient encounters, treatments, lab tests, vital signs, cost details, and more over the full year of 2024. It is developed to support data science, machine learning, and business intelligence projects in the healthcare domain.

    Project Highlights
    • Realistic Simulation: Generated using advanced Python scripts and statistical models, the dataset reflects realistic hospital operations and patient flows without using any real patient data.
    • Comprehensive Schema: The data warehouse includes multiple fact and dimension tables:
      • Fact Tables: Encounter, Treatment, Lab Tests, Special Tests, Vitals, and Cost.
      • Dimension Tables: Patient, Doctor, Disease, Insurance, Room, Date, Chronic Diseases, Allergies, and Additional Services.
      • Bridge Tables: For managing many-to-many relationships (e.g., doctors per encounter) and some other…
    • Synthetic & Scalable: The dataset is entirely synthetic, ensuring privacy and compliance. It is designed to be scalable; the current version simulates around 145,000 encounter records.

    Data Generation
    • Data Sources & Methods: Data is generated using a set of Python libraries. Highly customized algorithms simulate realistic patient demographics, doctor assignments, treatment choices, lab test results, cost breakdowns, and more.
    • Diverse Scenarios: With over 300 diseases and thousands of treatment variations, along with dozens of lab and special tests, the dataset offers rich variability to support complex analytical projects.

    How to Use This Dataset
    • For Data Modeling & ETL Testing: Import the CSV files into your favorite database system (e.g., PostgreSQL, MySQL, or directly into a BI tool like Power BI) and set up relationships as described in the accompanying documentation (see the sketch after this list).
    • For Machine Learning Projects: Use the dataset to build predictive models related to patient outcomes, cost analysis, or treatment efficacy.
    • For Educational Purposes: Ideal for learning about data warehousing, star schema design, and advanced analytics in healthcare.
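
    A minimal sketch of checking one fact-to-dimension relationship in pandas before loading the CSVs into a database; the file names (FactEncounter.csv, DimPatient.csv) and the PatientID key are hypothetical stand-ins for whatever the accompanying documentation specifies:

    ```python
    import pandas as pd

    # Hypothetical file and key names; use the ones from the documentation.
    encounters = pd.read_csv("FactEncounter.csv")
    patients = pd.read_csv("DimPatient.csv")

    # Join the encounter fact table to the patient dimension on its key.
    enc = encounters.merge(patients, on="PatientID", how="left", indicator=True)

    # Star-schema sanity check: every encounter should match exactly one patient.
    unmatched = (enc["_merge"] == "left_only").sum()
    print(unmatched, "encounters without a matching patient row")
    ```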

    Final Note
    MedSynora DW offers a unique opportunity to experiment with a comprehensive, realistic hospital data warehouse without compromising real patient information. Enjoy exploring, analyzing, and building with this dataset, and feel free to reach out if you have any questions or suggestions. In particular, feedback from experts in the field about inconsistencies, deficiencies, or possible improvements will contribute to future versions.

  6. E-Commerce Data

    • kaggle.com
    zip
    Updated Aug 17, 2017
    Cite
    Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
    Explore at:
    zip, 7548686 bytes (available download formats)
    Dataset updated
    Aug 17, 2017
    Authors
    Carrie
    Description

    Context

    Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

    Content

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

    Acknowledgements

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Image from stocksnap.io.

    Inspiration

    Analyses for this dataset could include time series, clustering, classification and more.
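
    For example, a quick time-series starting point with pandas, assuming the file is data.csv with the standard Online Retail columns (InvoiceDate, Quantity, UnitPrice) and the Latin-1 encoding commonly reported for this file:

    ```python
    import pandas as pd

    # Column names and encoding are assumptions based on the UCI "Online Retail" data.
    df = pd.read_csv("data.csv", encoding="ISO-8859-1", parse_dates=["InvoiceDate"])

    # Revenue per line item, then aggregated into a monthly time series.
    df["Revenue"] = df["Quantity"] * df["UnitPrice"]
    monthly = df.set_index("InvoiceDate")["Revenue"].resample("M").sum()
    print(monthly)
    ```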

  7. Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus...

    • search.gesis.org
    Cite
    Brentel, Inga; Kampes, Céline Fabienne; Jandura, Olaf, Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus (2014-2016) [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-2030
    Dataset provided by
    GESIS, Köln
    GESIS search
    Authors
    Brentel, Inga; Kampes, Céline Fabienne; Jandura, Olaf
    License

    GESIS data usage terms: https://www.gesis.org/en/institute/data-usage-terms

    Description

    The prepared longitudinal dataset for 2014 to 2016 is "big data", which is why the full dataset will only be made available as a (MySQL) database. In this database, the values of a respondent's different variables are stored one below the other. The present publication comprises an SQL database holding the metadata for a sample of the full dataset, which covers a subset of the available variables and is intended to illustrate the structure of the prepared data, together with a data documentation (codebook) of the sample. For this purpose, the sample contains all variables on sociodemography, leisure behaviour, additional information on a respondent and their household, as well as the interview-specific variables and weights. Only the variables on the respondent's media use are a small selection: for online media use, the variables for all overall offerings as well as the individual offerings of the genres politics and digital were included. The media use of radio, print and TV was not included in the sample, since its structure can be traced using the published longitudinal data of the media analyses MA Radio, MA Pressemedien and MA Intermedia.
    Due to the size of the data material, a database with the actual survey data would already be in the critical file-size range for ordinary upload and download. The actual survey results needed for analysis will therefore be published in 2021 as the full dataset Media-Analyse-Daten: IntermediaPlus (2014-2016) in the GESIS Data Archive (DBK).

    The data and their preparation are proposed as a best-practice case for big-data management, i.e. for handling big data in the social sciences and with social-science data. Using the GESIS software CharmStats, which was extended with big-data features as part of this project, the harmonisation work is documented and made transparent. A Python script and an HTML template further automated the workflow around and with CharmStats.

    The prepared longitudinal version of the full MA IntermediaPlus dataset for 2014 to 2016 will be published in 2021 in cooperation with GESIS and made available in accordance with the FAIR principles (Wilkinson et al. 2016). By harmonising the individual cross-sections, the aim is to make the Media-Analyse data source, which is being prepared by Inga Brentel and Céline Fabienne Kampes as part of the dissertation project "Angebots- und Publikumsfragmentierung online", accessible for research on social and media change in the Federal Republic of Germany.

    Future study number of the full IntermediaPlus dataset in the GESIS DBK: ZA5769 (version 1-0-0); doi: https://dx.doi.org/10.4232/1.13530
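
    Since the full dataset is distributed only as a MySQL database, a rough sketch of reading it from Python in chunks follows; the connection string, credentials, and table name below are placeholders, not values from this publication:

    ```python
    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection details; substitute your own MySQL instance.
    engine = create_engine("mysql+pymysql://user:password@localhost/intermediaplus")

    # Stream a (hypothetical) table of the sample in chunks to keep memory bounded.
    for chunk in pd.read_sql("SELECT * FROM sample_meta", engine, chunksize=100_000):
        print(len(chunk), "rows in this chunk")  # replace with real processing
    ```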
