17 datasets found
  1. mlcourse.ai - Dota 2 - winner prediction Dataset

    • kaggle.com
    Format: zip (759,868,828 bytes)
    Updated: Sep 8, 2019
    Authors: Sushma Biswas
    Cite: Sushma Biswas (2019). mlcourse.ai - Dota 2 - winner prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sushmabiswas/mlcourseai-dota-2-winner-prediction-dataset
    Description

    Context

    Hello! I am currently taking the mlcourse.ai course, and this dataset was required for one of its in-class Kaggle competitions. The data is originally hosted on git, but I like to have my data right here on Kaggle; hence this dataset.

    If you find this dataset useful, do upvote. Thank you and happy learning!

    Content

    This dataset contains 6 files in total (a minimal loading sketch follows the list):
    1. Sample_submission.csv
    2. Train_features.csv
    3. Test_features.csv
    4. Train_targets.csv
    5. Train_matches.jsonl
    6. Test_matches.jsonl
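
    Illustrative only: a minimal pandas/json loading sketch for the files listed above, assuming they sit in the working directory (capitalization on disk may differ).

    import json
    import pandas as pd

    # Tabular files: features, targets and the submission template.
    train_features = pd.read_csv("Train_features.csv")
    test_features = pd.read_csv("Test_features.csv")
    train_targets = pd.read_csv("Train_targets.csv")

    # The .jsonl files hold one JSON-encoded match per line.
    with open("Train_matches.jsonl") as f:
        train_matches = [json.loads(line) for line in f]

    print(train_features.shape, len(train_matches))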

    Acknowledgements

    All of the data in this dataset is originally hosted on git and the same can also be found on the in-class competition's 'data' page here.

    Inspiration

    • to be updated.
  2. How to Win Data Science Competition

    • kaggle.com
    Format: zip (15,845,091 bytes)
    Updated: Jan 30, 2018
    Authors: Budi Ryan
    License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
    Cite: Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain

  3. ‘Kaggle Competitions Top 100’ analyzed by Analyst-2

    • analyst-2.ai
    Updated: Feb 14, 2022
    Dataset authored and provided by: Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/); license information was derived automatically
    Cite: Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Kaggle Competitions Top 100’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-competitions-top-100-961d/latest

    Description

    Analysis of ‘Kaggle Competitions Top 100’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-competitions-top-100 on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.

    Content

    100 rows and 13 columns. Column descriptions are listed below; a short loading sketch follows the list.

    • User : Name of the user
    • Tier : Grandmaster, Master or Expert
    • Company/School : Company/school info of the user, if mentioned
    • Country : Country info of the user, if mentioned
    • Competitions_Num : Number of competitions joined
    • Competitions_Gold : Number of competition gold medals won
    • Competitions_Silver : Number of competition silver medals won
    • Competitions_Bronze : Number of competition bronze medals won
    • Datasets_Num : Number of public datasets
    • Notebooks_Num : Number of public notebooks
    • Discussions_Num : Number of topics/comments posted
    • Points : Total points
    • Profile : Link to the Kaggle profile
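
    Illustrative only: a quick pandas look at the table. The file name used here is hypothetical; use the CSV shipped with the dataset.

    import pandas as pd

    df = pd.read_csv("kaggle_competitions_top_100.csv")  # hypothetical name
    print(df.shape)                    # expected: (100, 13)
    print(df["Tier"].value_counts())   # Grandmaster / Master / Expert counts
    print(df.nlargest(10, "Points")[["User", "Country", "Points"]])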

    Acknowledgements

    Data from Kaggle. Image from Smartcat.

    If you're reading this, please upvote.

    --- Original source retains full ownership of the source dataset ---

  4. CrunchDAO Competition Unified Dataset

    • kaggle.com
    Updated: Jun 15, 2023
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Joakim Arvidsson
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: Joakim Arvidsson (2023). CrunchDAO Competition Unified Dataset [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/crunchdao-competition-unified-dataset/discussion
    Description

    This dataset is for creating predictive models for the CrunchDAO tournament. Registration is required to participate in the competition and to be eligible to earn $CRUNCH tokens.

    See notebooks (Code tab) for how to import and explore the data, and build predictive models.

    See Terms of Use for data license.

  5. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1 more
    Format: zip
    Updated: Jan 24, 2020 (+ more versions)
    Provided by: Zenodo (http://zenodo.org/)
    Authors: Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Cite: Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set. (A short filtering sketch follows this list.)
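
    Illustrative only: separating verified from non-verified train annotations with pandas, assuming the flag column is named manually_verified and the category column label (both are assumptions based on the description above).

    import pandas as pd

    train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")
    verified = train[train["manually_verified"] == 1]   # assumed flag column
    noisy = train[train["manually_verified"] == 0]
    print(len(verified), "verified /", len(noisy), "non-verified clips")
    print(train["label"].value_counts().head())         # clips per category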

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of the 41 categories, but in a few cases they could be within it.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                     Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                      Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                            Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                 Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv    Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                  The dataset description file you are reading
        │
        └───LICENSE-DATASET

  6. Kaggle EyePACS Dataset

    • paperswithcode.com
    Updated: Oct 28, 2020
    Cite: (2020). Kaggle EyePACS Dataset [Dataset]. https://paperswithcode.com/dataset/kaggle-eyepacs
    Description

    Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.

    The US Centers for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes, and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time; however, this can be difficult, as the disease often shows few symptoms until it is too late to provide effective treatment.

    Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment.

    Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.

    The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible – ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection.

    Acknowledgements This competition is sponsored by the California Healthcare Foundation.

    Retinal images were provided by EyePACS, a free platform for retinopathy screening.

  7. Data from: Microsoft Malware Classification Challenge Dataset

    • paperswithcode.com
    Authors: Royi Ronen; Marian Radu; Corina Feuerstein; Elad Yom-Tov; Mansour Ahmadi
    Cite: Royi Ronen; Marian Radu; Corina Feuerstein; Elad Yom-Tov; Mansour Ahmadi. Microsoft Malware Classification Challenge Dataset [Dataset]. https://paperswithcode.com/dataset/microsoft-malware-classification-challenge
    Description

    The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.

  8. LLM - Detect AI Datamix

    • kaggle.com
    Updated: Feb 2, 2024
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Raja Biswas
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically
    Cite: Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity in the data. Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature & large values of top-k
    • Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a small illustrative sketch follows):
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back translation
    • Random capitalization
    • Sentence swapping
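
    Illustrative only: minimal versions of three of the augmentations named above (character deletion/insertion/swapping and random capitalization); the function names are ours, not the team's.

    import random
    import string

    def corrupt_chars(text: str, p: float = 0.02) -> str:
        # Randomly delete a character, or insert a random one before it.
        out = []
        for ch in text:
            r = random.random()
            if r < p:
                continue                  # deletion
            if r < 2 * p:
                out.append(random.choice(string.ascii_lowercase))  # insertion
            out.append(ch)
        return "".join(out)

    def swap_adjacent(text: str, p: float = 0.02) -> str:
        # Randomly swap neighbouring characters.
        chars = list(text)
        for i in range(len(chars) - 1):
            if random.random() < p:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def random_capitalization(text: str, p: float = 0.05) -> str:
        return "".join(ch.upper() if random.random() < p else ch for ch in text)

    print(random_capitalization(swap_adjacent(corrupt_chars("The essay begins here."))))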

  9. Netflix Prize data

    • kaggle.com
    Format: zip (0 bytes)
    Updated: Jul 19, 2017
    Dataset authored and provided by: Netflix (http://netflix.com/)
    Cite: Netflix (2017). Netflix Prize data [Dataset]. https://www.kaggle.com/netflix-inc/netflix-prize-data
    Description

    Context

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Content

    This comes directly from the README:

    TRAINING DATASET FILE DESCRIPTION

    The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

    CustomerID,Rating,Date

    • MovieIDs range from 1 to 17770 sequentially.
    • CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
    • Ratings are on a five star (integral) scale from 1 to 5.
    • Dates have the format YYYY-MM-DD.
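
    Illustrative only: a minimal parser for the MovieID:/CustomerID,Rating,Date layout described above. The Kaggle upload packs the per-movie files into combined_data_(1,2,3,4).txt (see Acknowledgements), which use the same layout; note the files are large.

    import pandas as pd

    rows = []
    movie_id = None
    with open("combined_data_1.txt") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.endswith(":"):        # a new movie block starts
                movie_id = int(line[:-1])
            else:
                customer_id, rating, date = line.split(",")
                rows.append((movie_id, int(customer_id), int(rating), date))

    ratings = pd.DataFrame(rows, columns=["MovieID", "CustomerID", "Rating", "Date"])
    print(ratings.head())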

    MOVIES FILE DESCRIPTION

    Movie information in "movie_titles.txt" is in the following format:

    MovieID,YearOfRelease,Title

    • MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
    • YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
    • Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.

    QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

    The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

    MovieID1:
    CustomerID11,Date11
    CustomerID12,Date12
    ...
    MovieID2:
    CustomerID21,Date21
    CustomerID22,Date22

    For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.

    The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

    For example, if the qualifying dataset looked like:

    111:
    3245,2005-12-19
    5666,2005-12-23
    6789,2005-03-14
    225:
    1234,2005-05-26
    3456,2005-11-07

    then a prediction file should look something like:

    111:
    3.0
    3.4
    4.0
    225:
    1.0
    2.0

    which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

    You must make predictions for all customers for all movies in the qualifying dataset.

    THE PROBE DATASET FILE DESCRIPTION

    To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

    MovieID1:
    CustomerID11
    CustomerID12
    ...
    MovieID2:
    CustomerID21
    CustomerID22

    Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

    If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.

    Acknowledgements

    The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt

    The contest was originally hosted at http://netflixprize.com/index.html

    The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar

    Inspiration

    This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here

  10. 2016 March ML Mania Predictions

    • kaggle.com
    Format: zip (28,950,066 bytes)
    Updated: Nov 15, 2017
    Authors: Will Cukierski
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/); license information was derived automatically
    Cite: Will Cukierski (2017). 2016 March ML Mania Predictions [Dataset]. https://www.kaggle.com/datasets/wcukierski/2016-march-ml-mania

    Description

    Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.

    How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.

    First round predictions

    The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.

    Data Description

    Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:

    TeamName_TeamId_SubmissionId.csv

    The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:

    Id,Pred
    2016_1112_1114,0.6
    2016_1112_1122,0
    ...

    The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
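
    Illustrative only: splitting the Id column into season and team ids and looking up one matchup's probability. The file name is a placeholder for any prediction file in the 'predictions' folder.

    import pandas as pd

    preds = pd.read_csv("predictions/TeamName_TeamId_SubmissionId.csv")  # placeholder
    preds[["Season", "LowId", "HighId"]] = preds["Id"].str.split("_", expand=True).astype(int)

    # P(team with the lower id beats the team with the higher id):
    row = preds[(preds["LowId"] == 1112) & (preds["HighId"] == 1114)]
    print(row["Pred"].iloc[0])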

    For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.

  11. Feature Extraction

    • kaggle.com
    Updated: Sep 4, 2019
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Jason
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
    Cite: Jason (2019). Feature Extraction [Dataset]. https://www.kaggle.com/jclchan/feature-extraction/notebooks

    Description

    The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classifying eye fundus images into 5 levels of diabetic retinopathy severity.

    Unlike most participants, who used a deep learning approach to this classification problem, here we tried using Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) to extract features from images, as inputs to simpler ML algorithms like SVM. This approach shows some promising results.

    There are three files in this dataset:

    1. Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.

    2. train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.

    Columns in train_features.RDS & test_features.RDS (a minimal Python loading sketch follows the list):

    1. id_code - image id

    2. diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS

    3. n - number of persistent homology components detected from the image

    4. fd1 to fd21 - proportion of sliding windows having a specific fractal dimension: fd1 = proportion of windows having FD=2; fd2 = proportion of windows having FD in (2, 2.05]; ... fd21 = proportion of windows having FD in (2.95, 3.00]

    5. l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
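
    Illustrative only: the .RDS files are R objects; in Python they can be read with the third-party pyreadr package, an assumption of this sketch (the author worked in R).

    import pyreadr

    result = pyreadr.read_r("train_features.RDS")  # dict of R objects
    train = result[None]                           # a plain .RDS file has key None
    print(train.shape)
    print(train[["id_code", "diagnosis", "n"]].head())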

  12. SmartMem_features

    • kaggle.com
    Updated: Mar 7, 2025
    Provided by: Kaggle (http://kaggle.com/)
    Authors: SmartMem
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/); license information was derived automatically
    Cite: SmartMem (2025). SmartMem_features [Dataset]. https://www.kaggle.com/datasets/smartmem/smartmem-features/discussion

    Description

    Competition Overview

    This dataset is associated with the WWW 2025 SmartMem Competition. The competition task is to predict whether a memory module (DIMM) will experience a failure in the future based on its historical logs. For more details, please visit the competition homepage or the promotion homepage.

    Dataset Background

    Considering the large scale of the competition dataset and the long processing time required to run the full baseline for feature generation, we have released a feature dataset produced by the baseline. This dataset contains high-level features (referred to as "new features") extracted from the raw data of all DIMMs. Each DIMM's new feature set consists of 100 columns. Participants can leverage this dataset to explore memory failure prediction strategies—for example, by incorporating additional features to enhance model performance, using data augmentation or sampling methods to address class imbalance, or testing different models to tackle challenges from distribution drift in event sequence data.

    Terminology Definitions

    • DIMM: Refers to a memory module, identified by its serial_number (abbreviated as SN). The memory failure prediction task is to forecast whether a given DIMM will fail in the future.
    • CE (Correctable Error): Each DIMM generates several log entries:
      • Read CE: Occurs when a memory fault leads to data errors during data exchanges in business processes.
      • SCRUB CE: Detected during memory inspections conducted by the Intel CPU while the server is running.
    • RdErrLogParity: A 32-bit binary number that records the 8-bit data transmitted in each cycle across 4 data buses (DQ) of an x4 granularity DDR4 memory during CPU-memory data exchanges; a bit value of 1 is considered an error.
    • Deduplication Rule: If the same DIMM records the identical RetryRdErrLogParity error on the same cell within a single observation window, only the earliest CE is retained.
    • Failure Mode:
      • Multiple failures in lower-level modules within a higher-level module are treated as a failure mode of the higher-level module. For example, if a Device (which contains multiple Banks) shows multiple Bank failures within an observation window, the Device failure mode is set to 1.
      • Module Hierarchy: Others > Device > Bank > Row/Column > Cell.
      • Only the failure mode of the highest-level module is recorded in each observation window.

    New Feature Generation Process

    • Features are generated every 15 minutes, with the generation timestamp denoted as T.
    • For each generation at time T, data from the preceding 15 minutes, 1 hour, and 6 hours are used to compute features.
    • Each DIMM's new feature set is presented as tabular data, comprising:
      • 1 column: LogTime (the feature generation time T, considered as the timestamp of the last CE used).
      • 99 columns: 33 features for each of the three time windows (15 minutes, 1 hour, and 6 hours); see the expansion sketch after the feature category descriptions below.

    Feature Category Descriptions

    Temporal Features

    • {read/scrub/all}_ce_log_num_{window_size}: The total number of de-duplicated CE log entries (read, scrub, or all) within the window.
    • {read/scrub/all}_ce_count_{window_size}: The total count of CE entries (read, scrub, or all) before deduplication within the window.
    • log_happen_frequency_{window_size}: The log frequency, defined as the observation window duration divided by the total number of CEs.
    • ce_storm_count_{window_size}: The number of CE storms (for details, see the baseline method _calculate_ce_storm_count).

    Macro-level Spatial Features

    • fault_mode_{others/device/bank/row/column/cell}_{window_size}: Indicates whether a failure mode occurred at the corresponding module level (for details, see _get_spatio_features).
    • fault_{row/column}_num_{window_size}: The number of columns experiencing simultaneous row failures or the number of rows experiencing simultaneous column failures.

    Micro-level Spatial Features

    • error_{bit/dq/burst}_count_{window_size}: Total count of errors (bit, dq, or burst) within the window.
    • max_{dq/burst}_interval_{window_size}: The maximum interval between parity errors (dq or burst) within the window.
    • dq_count={1/2/3/4}_{window_size}: The total number of occurrences where the dq error count equals n (with n ∈ {1, 2, 3, 4}).
    • burst_count={1/2/3/4/5/6/7/8}_{window_size}: The total number of occurrences where the burst error count equals n (with n ∈ {1, 2, 3, 4, 5, 6, 7, 8}).
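
    As a consistency check, the name patterns above expand to exactly 33 features per window, i.e. 1 LogTime column + 3 windows × 33 features = 100 columns, matching the description. Illustrative only; the window-size suffixes are an assumption and the actual column names in the files may differ in detail.

    WINDOWS = ["15m", "1h", "6h"]  # assumed window-size suffixes

    def features_for(w: str) -> list[str]:
        names = [f"{k}_ce_log_num_{w}" for k in ("read", "scrub", "all")]
        names += [f"{k}_ce_count_{w}" for k in ("read", "scrub", "all")]
        names += [f"log_happen_frequency_{w}", f"ce_storm_count_{w}"]
        names += [f"fault_mode_{m}_{w}" for m in ("others", "device", "bank", "row", "column", "cell")]
        names += [f"fault_{m}_num_{w}" for m in ("row", "column")]
        names += [f"error_{k}_count_{w}" for k in ("bit", "dq", "burst")]
        names += [f"max_{k}_interval_{w}" for k in ("dq", "burst")]
        names += [f"dq_count={n}_{w}" for n in range(1, 5)]
        names += [f"burst_count={n}_{w}" for n in range(1, 9)]
        return names

    columns = ["LogTime"] + [c for w in WINDOWS for c in features_for(w)]
    print(len(features_for("15m")), len(columns))  # 33, 100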

    How to use?

    In the baseline program (baseline_en.py), place the feature files in the directory specified by Config.feature_path. Then, remove the call to ...

  13. NetFlix-Prize-Lite

    • kaggle.com
    Updated: Jul 13, 2023
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Dhirendra Yadav
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: Dhirendra Yadav (2023). NetFlix-Prize-Lite [Dataset]. https://www.kaggle.com/datasets/mlpedia/netflix-prize-lite
    Description

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Full data: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

  14. hdbscan-0.8.28.whl

    • kaggle.com
    Updated: Oct 18, 2022
    Provided by: Kaggle (http://kaggle.com/)
    Authors: something4kag
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: something4kag (2022). hdbscan-0.8.28.whl [Dataset]. https://www.kaggle.com/datasets/something4kag/hdbscan0828whl
    Description

    Context

    For use in code competitions with no Internet access. Add this dataset via "Add Data", then install with:

    !mkdir -p /tmp/pip/cache/
    !cp ../input/hdbscan0828whl/hdbscan-0.8.28-cp37-cp37m-linux_x86_64.whl /tmp/pip/cache/
    !pip install --no-index --find-links /tmp/pip/cache/ hdbscan

    Inspiration

    Latest version 0.8.28, like https://www.kaggle.com/datasets/something4kag/hdbscan0827-whl. import hdbscan no longer worked in notebooks, as the package is not available in the Docker image now. Made it work!

    Acknowledgements

    https://github.com/scikit-learn-contrib/hdbscan

    License for HDBSCAN

    scikit-learn-contrib/hdbscan is licensed under the BSD 3-Clause "New" or "Revised" License

    A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the copyright holder or its contributors to promote derived products without written consent.

  15. Tokyo 2020 Summer Paralympics

    • kaggle.com
    Format: zip (311,851 bytes)
    Updated: Sep 5, 2021 (+ more versions)
    Authors: Petro
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/); license information was derived automatically
    Area covered: Tokyo
    Cite: Petro (2021). Tokyo 2020 Summer Paralympics [Dataset]. https://www.kaggle.com/datasets/piterfm/tokyo-2020-paralympics
    Description

    This is a Paralympic Games dataset that describes medals and athletes for Tokyo 2020. The data comes from the Tokyo 2020 Paralympics.

    All medals and more than 4,500 athletes (with some personal data: date and place of birth, height, etc.) of the Paralympic Games can be found here. Apart from that, coaches and technical officials are present.

    Please click the upvote button at the top right of the dataset; it will help the dataset stay near the top.

    Data:
    1. medals_total.csv - dataset contains all medals grouped by country as here.
    2. medals.csv - dataset includes general information on all athletes who won a medal.
    3. athletes.csv - dataset includes some personal information of all athletes.
    4. coaches.csv - dataset includes some personal information of all coaches.
    5. technical_officials - dataset includes some personal information of all technical officials.

    Related Datasets

    Data Visualization

    Tokyo 2020 Paralympics

    Dataset History

    2021-09-05 - dataset is updated. Contains full information.
    2021-08-30 - dataset is updated. Contains information for the first 6 days of competitions.
    2021-08-27 - dataset is created. Contains information for the first 3 days of competitions.

    Q&A

    If you have some questions please start a discussion.

  16. Listening-to-Earthquakes

    • kaggle.com
    Updated: Apr 30, 2025
    Provided by: Kaggle (http://kaggle.com/)
    Authors: PenguinGUI
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: MIT License (https://opensource.org/licenses/MIT); license information was derived automatically
    Cite: PenguinGUI (2025). Listening-to-Earthquakes [Dataset]. https://www.kaggle.com/datasets/penguingui/listening-to-earthquakes/versions/3

    Description

    Data Card: LANL Earthquake Prediction

    1. Data Source and Description

    • Source: The data originates from the LANL Earthquake Prediction Kaggle competition, provided by Los Alamos National Laboratory.
    • Description: The dataset comprises continuous seismic acoustic data collected from laboratory experiments simulating earthquake conditions. The objective is to predict the "time to failure"—the time remaining until the next laboratory earthquake occurs—making this a regression task.
    • Licensing and Ethical Considerations: The data is publicly available under the competition’s terms of use on Kaggle. Ethically, any models or insights derived should be applied responsibly, considering the potential implications of earthquake prediction in real-world scenarios.

    2. Data Preprocessing

    • Sliding-Window Approach: Given the large size and continuous nature of the seismic data (each segment contains 150,000 data points), training time series models like Temporal Convolutional Networks (TCN) or Long Short-Term Memory (LSTM) networks on full-length samples is computationally intensive. To address this, a sliding-window strategy was implemented:
      • Window Sizes: Multiple sizes were used—150,000, 15,000, and 1,500 data points—to capture patterns across different temporal scales.
      • Strides: Strides of 150,000, 15,000, 7,500, 1,500, and 750 were applied, creating overlapping or non-overlapping windows and resulting in five distinct processed datasets.
    • Data Segmentation: The continuous acoustic signals were segmented into fixed-length chunks of 150,000 data points each. For training data, the label assigned to each segment was the "time to failure" at its endpoint, aligning with the prediction task.
    • Data Storage: Extracted feature sequences were saved in NumPy’s compressed .npz format, ensuring efficient storage, accessibility, and consistency across training and testing phases.

    3. Feature Extraction

    • Statistical Features: For each window, eleven statistical features were calculated to summarize the seismic signals (a minimal extraction sketch follows this section):
      • Mean, standard deviation, minimum, maximum, median, skewness, and kurtosis.
      • Quantile-based features at 1%, 5%, 95%, and 99% to detect variations and potential anomalies linked to seismic events.
    • Advanced Techniques: Drawing from top competition solutions (e.g., the 26th place approach), consider enhancing your feature set with:
      • Matched Filtering: Identifies recurring patterns in the time series, which could signal precursors to earthquakes.
      • Hilbert Transform: Extracts the analytical envelope of the signal, highlighting peak behaviors and dynamic changes.
    • Multi-Scale Analysis: Using varied window sizes and strides enables the capture of both short-term fluctuations and longer-term trends, critical for understanding seismic signal dynamics.
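
    Illustrative only: computing the eleven per-window statistics listed above over a strided sliding window with numpy/scipy.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def window_features(x: np.ndarray) -> np.ndarray:
        # mean, std, min, max, median, skewness, kurtosis + 4 quantiles = 11
        q = np.quantile(x, [0.01, 0.05, 0.95, 0.99])
        return np.array([x.mean(), x.std(), x.min(), x.max(), np.median(x),
                         skew(x), kurtosis(x), *q])

    def extract(signal: np.ndarray, window: int = 1500, stride: int = 750) -> np.ndarray:
        feats = [window_features(signal[i:i + window])
                 for i in range(0, len(signal) - window + 1, stride)]
        return np.stack(feats)  # shape: (n_windows, 11)

    rng = np.random.default_rng(0)
    demo = rng.normal(size=150_000)  # stand-in for one 150,000-point segment
    print(extract(demo).shape)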

    4. Predictive Modeling

    • Model Selection: A diverse set of models was planned to leverage their unique strengths:
      • Random Forest: Offers a baseline and insights into feature importance.
      • Multi-Layer Perceptron (MLP): Provides a simple neural network baseline.
      • Temporal Convolutional Network (TCN): Excels at modeling temporal dependencies with computational efficiency.
      • Long Short-Term Memory (LSTM): Captures long-term sequential relationships, ideal for time series like seismic data.
    • Ensemble Potential: The competition’s winning team combined neural networks with gradient boosted decision trees (e.g., LightGBM or XGBoost), suggesting that integrating tree-based models could boost performance.
    • Training Considerations:
      • Use cross-validation strategies tailored to the data’s structure, such as "Leave One Earthquake Out," to ensure generalization (inspired by the 1st place writeup).
      • Tune hyperparameters carefully, especially for tree-based models, to optimize for the target metric.

    5. Evaluation

    • Metric: The competition evaluates submissions using Mean Absolute Error (MAE), measuring the accuracy of predicted "time to failure" against actual values.
    • Validation: Robust validation is key. Consider splitting the data by earthquake cycles to mimic the test set’s structure and avoid overfitting.
  17. Brazilian Volleyball Superliga 2021/22 (Women)

    • kaggle.com
    Updated: Jul 28, 2022
    Provided by: Kaggle
    Authors: ils
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: ils (2022). Brazilian Volleyball Superliga 2021/22 (Women) [Dataset]. https://www.kaggle.com/datasets/smnlgn/superliga-202122
    Description

    Although soccer is the sport most commonly associated with Brazil, volleyball is not that far behind. The national women's team is a five-time Olympic medalist, including two gold medals (2008 and 2012). Superliga is the top-level Brazilian professional volleyball competition. This dataset contains information about every Superliga match for the 2021/2022 season.

    • Set *X*: Starting position in set X. * is for substitute players.
    • Jogadora: Player's name. "(L)" indicates libero.
    • Time: Team's name.
    • Partida: Match.
    • Serviço Err: Service errors.
    • Serviço Ace: Service aces.
    • Recepção Total: Reception total.
    • Recepção Err: Reception errors.
    • Ataque Exc: Attacks.
    • Ataque Err: Attack errors.
    • Ataque Blk: Blocked attacks.
    • Bloqueio Pts: Block points.
    • Fase: Primary competition stage: classificatoria (pool stage) or playoffs.
    • Cat: Secondary competition stage: turno or returno (pool stage), or quartas, semi, final (quarter finals, semifinals, final).
    • VV: Did the player win the Viva Vôlei trophy? 1 for yes, 0 for no. (Viva Vôlei is a "best player trophy" awarded at every game by popular vote.)

