17 datasets found
  1. mlcourse.ai - Dota 2 - winner prediction Dataset

    • kaggle.com
    Format: zip (759,868,828 bytes)
    Updated: Sep 8, 2019
    Authors: Sushma Biswas
    Cite: Sushma Biswas (2019). mlcourse.ai - Dota 2 - winner prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sushmabiswas/mlcourseai-dota-2-winner-prediction-dataset
    Description

    Context

    Hello! I am currently taking the mlcourse.ai course, and this dataset was required for one of its in-class Kaggle competitions. The data is originally hosted on git, but I like to have my data right here on Kaggle; hence this dataset.

    If you find this dataset useful, do upvote. Thank you and happy learning!

    Content

    This dataset contains 6 files in total (a minimal loading sketch follows the list):
    1. Sample_submission.csv
    2. Train_features.csv
    3. Test_features.csv
    4. Train_targets.csv
    5. Train_matches.jsonl
    6. Test_matches.jsonl
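
    Illustrative only: a minimal pandas/json loading sketch for the files listed above, assuming they sit in the working directory (capitalization on disk may differ).

    import json
    import pandas as pd

    # Tabular files: features, targets and the submission template.
    train_features = pd.read_csv("Train_features.csv")
    test_features = pd.read_csv("Test_features.csv")
    train_targets = pd.read_csv("Train_targets.csv")

    # The .jsonl files hold one JSON-encoded match per line.
    with open("Train_matches.jsonl") as f:
        train_matches = [json.loads(line) for line in f]

    print(train_features.shape, len(train_matches))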

    Acknowledgements

    All of the data in this dataset is originally hosted on git and the same can also be found on the in-class competition's 'data' page here.

    Inspiration

    • to be updated.
  2. How to Win Data Science Competition

    • kaggle.com
    Format: zip (15,845,091 bytes)
    Updated: Jan 30, 2018
    Authors: Budi Ryan
    License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
    Cite: Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain

  3. ‘Kaggle Competitions Top 100’ analyzed by Analyst-2

    • analyst-2.ai
    Updated: Feb 14, 2022
    Dataset authored and provided by: Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/); license information was derived automatically
    Cite: Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Kaggle Competitions Top 100’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-competitions-top-100-961d/latest

    Description

    Analysis of ‘Kaggle Competitions Top 100’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-competitions-top-100 on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.

    Content

    100 rows and 13 columns. Column descriptions are listed below; a short loading sketch follows the list.

    • User : Name of the user
    • Tier : Grandmaster, Master or Expert
    • Company/School : Company/school info of the user, if mentioned
    • Country : Country info of the user, if mentioned
    • Competitions_Num : Number of competitions joined
    • Competitions_Gold : Number of competition gold medals won
    • Competitions_Silver : Number of competition silver medals won
    • Competitions_Bronze : Number of competition bronze medals won
    • Datasets_Num : Number of public datasets
    • Notebooks_Num : Number of public notebooks
    • Discussions_Num : Number of topics/comments posted
    • Points : Total points
    • Profile : Link to the Kaggle profile
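
    Illustrative only: a quick pandas look at the table. The file name used here is hypothetical; use the CSV shipped with the dataset.

    import pandas as pd

    df = pd.read_csv("kaggle_competitions_top_100.csv")  # hypothetical name
    print(df.shape)                    # expected: (100, 13)
    print(df["Tier"].value_counts())   # Grandmaster / Master / Expert counts
    print(df.nlargest(10, "Points")[["User", "Country", "Points"]])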

    Acknowledgements

    Data from Kaggle. Image from Smartcat.

    If you're reading this, please upvote.

    --- Original source retains full ownership of the source dataset ---

  4. CrunchDAO Competition Unified Dataset

    • kaggle.com
    Updated: Jun 15, 2023
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Joakim Arvidsson
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: Joakim Arvidsson (2023). CrunchDAO Competition Unified Dataset [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/crunchdao-competition-unified-dataset/discussion
    Description

    This dataset is for creating predictive models for the CrunchDAO tournament. Registration is required to participate in the competition and to be eligible to earn $CRUNCH tokens.

    See notebooks (Code tab) for how to import and explore the data, and build predictive models.

    See Terms of Use for data license.

  5. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1 more
    Format: zip
    Updated: Jan 24, 2020 (+ more versions)
    Provided by: Zenodo (http://zenodo.org/)
    Authors: Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Cite: Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set. (A short filtering sketch follows this list.)
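
    Illustrative only: separating verified from non-verified train annotations with pandas, assuming the flag column is named manually_verified and the category column label (both are assumptions based on the description above).

    import pandas as pd

    train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")
    verified = train[train["manually_verified"] == 1]   # assumed flag column
    noisy = train[train["manually_verified"] == 0]
    print(len(verified), "verified /", len(noisy), "non-verified clips")
    print(train["label"].value_counts().head())         # clips per category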

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of the 41 categories, but in a few cases they could be within it.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                     Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                      Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                            Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                 Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv    Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                  The dataset description file you are reading
        │
        └───LICENSE-DATASET

  6. Kaggle EyePACS Dataset

    • paperswithcode.com
    Updated: Oct 28, 2020
    Cite: (2020). Kaggle EyePACS Dataset [Dataset]. https://paperswithcode.com/dataset/kaggle-eyepacs
    Description

    Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.

    The US Centers for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes, and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time; however, this can be difficult, as the disease often shows few symptoms until it is too late to provide effective treatment.

    Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment.

    Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.

    The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible – ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection.

    Acknowledgements This competition is sponsored by the California Healthcare Foundation.

    Retinal images were provided by EyePACS, a free platform for retinopathy screening.

  7. Data from: Microsoft Malware Classification Challenge Dataset

    • paperswithcode.com
    Authors: Royi Ronen; Marian Radu; Corina Feuerstein; Elad Yom-Tov; Mansour Ahmadi
    Cite: Royi Ronen; Marian Radu; Corina Feuerstein; Elad Yom-Tov; Mansour Ahmadi. Microsoft Malware Classification Challenge Dataset [Dataset]. https://paperswithcode.com/dataset/microsoft-malware-classification-challenge
    Description

    The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.

  8. LLM - Detect AI Datamix

    • kaggle.com
    Updated: Feb 2, 2024
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Raja Biswas
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically
    Cite: Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:
    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity & complexity in the data. Generated essays leveraged a combination of the following:
    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature & large values of top-k
    • Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a small illustrative sketch follows):
    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back translation
    • Random capitalization
    • Sentence swapping
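
    Illustrative only: minimal versions of three of the augmentations named above (character deletion/insertion/swapping and random capitalization); the function names are ours, not the team's.

    import random
    import string

    def corrupt_chars(text: str, p: float = 0.02) -> str:
        # Randomly delete a character, or insert a random one before it.
        out = []
        for ch in text:
            r = random.random()
            if r < p:
                continue                  # deletion
            if r < 2 * p:
                out.append(random.choice(string.ascii_lowercase))  # insertion
            out.append(ch)
        return "".join(out)

    def swap_adjacent(text: str, p: float = 0.02) -> str:
        # Randomly swap neighbouring characters.
        chars = list(text)
        for i in range(len(chars) - 1):
            if random.random() < p:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def random_capitalization(text: str, p: float = 0.05) -> str:
        return "".join(ch.upper() if random.random() < p else ch for ch in text)

    print(random_capitalization(swap_adjacent(corrupt_chars("The essay begins here."))))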

  9. Netflix Prize data

    • kaggle.com
    Format: zip (0 bytes)
    Updated: Jul 19, 2017
    Dataset authored and provided by: Netflix (http://netflix.com/)
    Cite: Netflix (2017). Netflix Prize data [Dataset]. https://www.kaggle.com/netflix-inc/netflix-prize-data
    Description

    Context

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Content

    This comes directly from the README:

    TRAINING DATASET FILE DESCRIPTION

    The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

    CustomerID,Rating,Date

    • MovieIDs range from 1 to 17770 sequentially.
    • CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
    • Ratings are on a five star (integral) scale from 1 to 5.
    • Dates have the format YYYY-MM-DD.
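
    Illustrative only: a minimal parser for the MovieID:/CustomerID,Rating,Date layout described above. The Kaggle upload packs the per-movie files into combined_data_(1,2,3,4).txt (see Acknowledgements), which use the same layout; note the files are large.

    import pandas as pd

    rows = []
    movie_id = None
    with open("combined_data_1.txt") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.endswith(":"):        # a new movie block starts
                movie_id = int(line[:-1])
            else:
                customer_id, rating, date = line.split(",")
                rows.append((movie_id, int(customer_id), int(rating), date))

    ratings = pd.DataFrame(rows, columns=["MovieID", "CustomerID", "Rating", "Date"])
    print(ratings.head())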

    MOVIES FILE DESCRIPTION

    Movie information in "movie_titles.txt" is in the following format:

    MovieID,YearOfRelease,Title

    • MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
    • YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
    • Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.

    QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

    The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

    MovieID1:
    CustomerID11,Date11
    CustomerID12,Date12
    ...
    MovieID2:
    CustomerID21,Date21
    CustomerID22,Date22

    For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.

    The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

    For example, if the qualifying dataset looked like:

    111:
    3245,2005-12-19
    5666,2005-12-23
    6789,2005-03-14
    225:
    1234,2005-05-26
    3456,2005-11-07

    then a prediction file should look something like:

    111:
    3.0
    3.4
    4.0
    225:
    1.0
    2.0

    which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

    You must make predictions for all customers for all movies in the qualifying dataset.

    THE PROBE DATASET FILE DESCRIPTION

    To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

    MovieID1:
    CustomerID11
    CustomerID12
    ...
    MovieID2:
    CustomerID21
    CustomerID22

    Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

    If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.

    Acknowledgements

    The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt

    The contest was originally hosted at http://netflixprize.com/index.html

    The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar

    Inspiration

    This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here

  10. 2016 March ML Mania Predictions

    • kaggle.com
    Format: zip (28,950,066 bytes)
    Updated: Nov 15, 2017
    Authors: Will Cukierski
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/); license information was derived automatically
    Cite: Will Cukierski (2017). 2016 March ML Mania Predictions [Dataset]. https://www.kaggle.com/datasets/wcukierski/2016-march-ml-mania

    Description

    Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.

    How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.

    First round predictions

    The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.

    Data Description

    Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:

    TeamName_TeamId_SubmissionId.csv

    The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:

    Id,Pred
    2016_1112_1114,0.6
    2016_1112_1122,0
    ...

    The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
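
    Illustrative only: splitting the Id column into season and team ids and looking up one matchup's probability. The file name is a placeholder for any prediction file in the 'predictions' folder.

    import pandas as pd

    preds = pd.read_csv("predictions/TeamName_TeamId_SubmissionId.csv")  # placeholder
    preds[["Season", "LowId", "HighId"]] = preds["Id"].str.split("_", expand=True).astype(int)

    # P(team with the lower id beats the team with the higher id):
    row = preds[(preds["LowId"] == 1112) & (preds["HighId"] == 1114)]
    print(row["Pred"].iloc[0])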

    For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.

  11. Feature Extraction

    • kaggle.com
    Updated: Sep 4, 2019
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Jason
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
    Cite: Jason (2019). Feature Extraction [Dataset]. https://www.kaggle.com/jclchan/feature-extraction/notebooks

    Description

    The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classifying eye fundus images into 5 levels of diabetic retinopathy severity.

    Unlike most participants, who used a deep learning approach to this classification problem, here we tried using Fractal Dimensions and Persistent Homology (one of the major tools in Topological Data Analysis, TDA) to extract features from images, as inputs to simpler ML algorithms like SVM. This approach shows some promising results.

    There are three files in this dataset:

    1. Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.

    2. train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.

    Columns in train_features.RDS & test_features.RDS (a minimal Python loading sketch follows the list):

    1. id_code - image id

    2. diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS

    3. n - number of persistent homology components detected from the image

    4. fd1 to fd21 - proportion of sliding windows having a specific fractal dimension: fd1 = proportion of windows having FD=2; fd2 = proportion of windows having FD in (2, 2.05]; ... fd21 = proportion of windows having FD in (2.95, 3.00]

    5. l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
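
    Illustrative only: the .RDS files are R objects; in Python they can be read with the third-party pyreadr package, an assumption of this sketch (the author worked in R).

    import pyreadr

    result = pyreadr.read_r("train_features.RDS")  # dict of R objects
    train = result[None]                           # a plain .RDS file has key None
    print(train.shape)
    print(train[["id_code", "diagnosis", "n"]].head())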

  12. SmartMem_features

    • kaggle.com
    Updated: Mar 7, 2025
    Provided by: Kaggle (http://kaggle.com/)
    Authors: SmartMem
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/); license information was derived automatically
    Cite: SmartMem (2025). SmartMem_features [Dataset]. https://www.kaggle.com/datasets/smartmem/smartmem-features/discussion

    Description

    Competition Overview

    This dataset is associated with the WWW 2025 SmartMem Competition. The competition task is to predict whether a memory module (DIMM) will experience a failure in the future based on its historical logs. For more details, please visit the competition homepage or the promotion homepage.

    Dataset Background

    Considering the large scale of the competition dataset and the long processing time required to run the full baseline for feature generation, we have released a feature dataset produced by the baseline. This dataset contains high-level features (referred to as "new features") extracted from the raw data of all DIMMs. Each DIMM's new feature set consists of 100 columns. Participants can leverage this dataset to explore memory failure prediction strategies—for example, by incorporating additional features to enhance model performance, using data augmentation or sampling methods to address class imbalance, or testing different models to tackle challenges from distribution drift in event sequence data.

    Terminology Definitions

    • DIMM: Refers to a memory module, identified by its serial_number (abbreviated as SN). The memory failure prediction task is to forecast whether a given DIMM will fail in the future.
    • CE (Correctable Error): Each DIMM generates several log entries:
      • Read CE: Occurs when a memory fault leads to data errors during data exchanges in business processes.
      • SCRUB CE: Detected during memory inspections conducted by the Intel CPU while the server is running.
    • RdErrLogParity: A 32-bit binary number that records the 8-bit data transmitted in each cycle across 4 data buses (DQ) of an x4 granularity DDR4 memory during CPU-memory data exchanges; a bit value of 1 is considered an error.
    • Deduplication Rule: If the same DIMM records the identical RetryRdErrLogParity error on the same cell within a single observation window, only the earliest CE is retained.
    • Failure Mode:
      • Multiple failures in lower-level modules within a higher-level module are treated as a failure mode of the higher-level module. For example, if a Device (which contains multiple Banks) shows multiple Bank failures within an observation window, the Device failure mode is set to 1.
      • Module Hierarchy: Others > Device > Bank > Row/Column > Cell.
      • Only the failure mode of the highest-level module is recorded in each observation window.

    New Feature Generation Process

    • Features are generated every 15 minutes, with the generation timestamp denoted as T.
    • For each generation at time T, data from the preceding 15 minutes, 1 hour, and 6 hours are used to compute features.
    • Each DIMM's new feature set is presented as tabular data, comprising:
      • 1 column: LogTime (the feature generation time T, considered as the timestamp of the last CE used).
      • 99 columns: 33 features for each of the three time windows (15 minutes, 1 hour, and 6 hours); see the expansion sketch after the feature category descriptions below.

    Feature Category Descriptions

    Temporal Features

    • {read/scrub/all}_ce_log_num_{window_size}: The total number of de-duplicated CE log entries (read, scrub, or all) within the window.
    • {read/scrub/all}_ce_count_{window_size}: The total count of CE entries (read, scrub, or all) before deduplication within the window.
    • log_happen_frequency_{window_size}: The log frequency, defined as the observation window duration divided by the total number of CEs.
    • ce_storm_count_{window_size}: The number of CE storms (for details, see the baseline method _calculate_ce_storm_count).

    Macro-level Spatial Features

    • fault_mode_{others/device/bank/row/column/cell}_{window_size}: Indicates whether a failure mode occurred at the corresponding module level (for details, see _get_spatio_features).
    • fault_{row/column}_num_{window_size}: The number of columns experiencing simultaneous row failures or the number of rows experiencing simultaneous column failures.

    Micro-level Spatial Features

    • error_{bit/dq/burst}_count_{window_size}: Total count of errors (bit, dq, or burst) within the window.
    • max_{dq/burst}_interval_{window_size}: The maximum interval between parity errors (dq or burst) within the window.
    • dq_count={1/2/3/4}_{window_size}: The total number of occurrences where the dq error count equals n (with n ∈ {1, 2, 3, 4}).
    • burst_count={1/2/3/4/5/6/7/8}_{window_size}: The total number of occurrences where the burst error count equals n (with n ∈ {1, 2, 3, 4, 5, 6, 7, 8}).
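
    As a consistency check, the name patterns above expand to exactly 33 features per window, i.e. 1 LogTime column + 3 windows × 33 features = 100 columns, matching the description. Illustrative only; the window-size suffixes are an assumption and the actual column names in the files may differ in detail.

    WINDOWS = ["15m", "1h", "6h"]  # assumed window-size suffixes

    def features_for(w: str) -> list[str]:
        names = [f"{k}_ce_log_num_{w}" for k in ("read", "scrub", "all")]
        names += [f"{k}_ce_count_{w}" for k in ("read", "scrub", "all")]
        names += [f"log_happen_frequency_{w}", f"ce_storm_count_{w}"]
        names += [f"fault_mode_{m}_{w}" for m in ("others", "device", "bank", "row", "column", "cell")]
        names += [f"fault_{m}_num_{w}" for m in ("row", "column")]
        names += [f"error_{k}_count_{w}" for k in ("bit", "dq", "burst")]
        names += [f"max_{k}_interval_{w}" for k in ("dq", "burst")]
        names += [f"dq_count={n}_{w}" for n in range(1, 5)]
        names += [f"burst_count={n}_{w}" for n in range(1, 9)]
        return names

    columns = ["LogTime"] + [c for w in WINDOWS for c in features_for(w)]
    print(len(features_for("15m")), len(columns))  # 33, 100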

    How to use?

    In the baseline program (baseline_en.py), place the feature files in the directory specified by Config.feature_path. Then, remove the call to ...

  13. NetFlix-Prize-Lite

    • kaggle.com
    Updated: Jul 13, 2023
    Provided by: Kaggle (http://kaggle.com/)
    Authors: Dhirendra Yadav
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: Dhirendra Yadav (2023). NetFlix-Prize-Lite [Dataset]. https://www.kaggle.com/datasets/mlpedia/netflix-prize-lite
    Description

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Full data: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

  14. hdbscan-0.8.28.whl

    • kaggle.com
    Updated: Oct 18, 2022
    Provided by: Kaggle (http://kaggle.com/)
    Authors: something4kag
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: something4kag (2022). hdbscan-0.8.28.whl [Dataset]. https://www.kaggle.com/datasets/something4kag/hdbscan0828whl
    Description

    Context

    For use in code competitions with no Internet access. Add this dataset via "Add Data", then install with:

    !mkdir -p /tmp/pip/cache/
    !cp ../input/hdbscan0828whl/hdbscan-0.8.28-cp37-cp37m-linux_x86_64.whl /tmp/pip/cache/
    !pip install --no-index --find-links /tmp/pip/cache/ hdbscan

    Inspiration

    Latest version 0.8.28, like https://www.kaggle.com/datasets/something4kag/hdbscan0827-whl. import hdbscan no longer worked in notebooks, as the package is not available in the Docker image now. Made it work!

    Acknowledgements

    https://github.com/scikit-learn-contrib/hdbscan

    License for HDBSCAN

    scikit-learn-contrib/hdbscan is licensed under the BSD 3-Clause "New" or "Revised" License

    A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the copyright holder or its contributors to promote derived products without written consent.

  15. Tokyo 2020 Summer Paralympics

    • kaggle.com
    Format: zip (311,851 bytes)
    Updated: Sep 5, 2021 (+ more versions)
    Authors: Petro
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/); license information was derived automatically
    Area covered: Tokyo
    Cite: Petro (2021). Tokyo 2020 Summer Paralympics [Dataset]. https://www.kaggle.com/datasets/piterfm/tokyo-2020-paralympics
    Description

    This is a Paralympic Games dataset that describes medals and athletes for Tokyo 2020. The data comes from the Tokyo 2020 Paralympics.

    All medals and more than 4,500 athletes (with some personal data: date and place of birth, height, etc.) of the Paralympic Games can be found here. Apart from that, coaches and technical officials are present.

    Please click the upvote button at the top right of the dataset; it will help the dataset stay near the top.

    Data:
    1. medals_total.csv - dataset contains all medals grouped by country as here.
    2. medals.csv - dataset includes general information on all athletes who won a medal.
    3. athletes.csv - dataset includes some personal information of all athletes.
    4. coaches.csv - dataset includes some personal information of all coaches.
    5. technical_officials - dataset includes some personal information of all technical officials.

    Related Datasets

    Data Visualization

    Tokyo 2020 Paralympics

    Dataset History

    2021-09-05 - dataset is updated. Contains full information.
    2021-08-30 - dataset is updated. Contains information for the first 6 days of competitions.
    2021-08-27 - dataset is created. Contains information for the first 3 days of competitions.

    Q&A

    If you have some questions please start a discussion.

  16. Listening-to-Earthquakes

    • kaggle.com
    Updated: Apr 30, 2025
    Provided by: Kaggle (http://kaggle.com/)
    Authors: PenguinGUI
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    License: MIT License (https://opensource.org/licenses/MIT); license information was derived automatically
    Cite: PenguinGUI (2025). Listening-to-Earthquakes [Dataset]. https://www.kaggle.com/datasets/penguingui/listening-to-earthquakes/versions/3

    Description

    Data Card: LANL Earthquake Prediction

    1. Data Source and Description

    • Source: The data originates from the LANL Earthquake Prediction Kaggle competition, provided by Los Alamos National Laboratory.
    • Description: The dataset comprises continuous seismic acoustic data collected from laboratory experiments simulating earthquake conditions. The objective is to predict the "time to failure"—the time remaining until the next laboratory earthquake occurs—making this a regression task.
    • Licensing and Ethical Considerations: The data is publicly available under the competition’s terms of use on Kaggle. Ethically, any models or insights derived should be applied responsibly, considering the potential implications of earthquake prediction in real-world scenarios.

    2. Data Preprocessing

    • Sliding-Window Approach: Given the large size and continuous nature of the seismic data (each segment contains 150,000 data points), training time series models like Temporal Convolutional Networks (TCN) or Long Short-Term Memory (LSTM) networks on full-length samples is computationally intensive. To address this, a sliding-window strategy was implemented:
      • Window Sizes: Multiple sizes were used—150,000, 15,000, and 1,500 data points—to capture patterns across different temporal scales.
      • Strides: Strides of 150,000, 15,000, 7,500, 1,500, and 750 were applied, creating overlapping or non-overlapping windows and resulting in five distinct processed datasets.
    • Data Segmentation: The continuous acoustic signals were segmented into fixed-length chunks of 150,000 data points each. For training data, the label assigned to each segment was the "time to failure" at its endpoint, aligning with the prediction task.
    • Data Storage: Extracted feature sequences were saved in NumPy’s compressed .npz format, ensuring efficient storage, accessibility, and consistency across training and testing phases.

    3. Feature Extraction

    • Statistical Features: For each window, eleven statistical features were calculated to summarize the seismic signals (a minimal extraction sketch follows this section):
      • Mean, standard deviation, minimum, maximum, median, skewness, and kurtosis.
      • Quantile-based features at 1%, 5%, 95%, and 99% to detect variations and potential anomalies linked to seismic events.
    • Advanced Techniques: Drawing from top competition solutions (e.g., the 26th place approach), consider enhancing your feature set with:
      • Matched Filtering: Identifies recurring patterns in the time series, which could signal precursors to earthquakes.
      • Hilbert Transform: Extracts the analytical envelope of the signal, highlighting peak behaviors and dynamic changes.
    • Multi-Scale Analysis: Using varied window sizes and strides enables the capture of both short-term fluctuations and longer-term trends, critical for understanding seismic signal dynamics.
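
    Illustrative only: computing the eleven per-window statistics listed above over a strided sliding window with numpy/scipy.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def window_features(x: np.ndarray) -> np.ndarray:
        # mean, std, min, max, median, skewness, kurtosis + 4 quantiles = 11
        q = np.quantile(x, [0.01, 0.05, 0.95, 0.99])
        return np.array([x.mean(), x.std(), x.min(), x.max(), np.median(x),
                         skew(x), kurtosis(x), *q])

    def extract(signal: np.ndarray, window: int = 1500, stride: int = 750) -> np.ndarray:
        feats = [window_features(signal[i:i + window])
                 for i in range(0, len(signal) - window + 1, stride)]
        return np.stack(feats)  # shape: (n_windows, 11)

    rng = np.random.default_rng(0)
    demo = rng.normal(size=150_000)  # stand-in for one 150,000-point segment
    print(extract(demo).shape)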

    4. Predictive Modeling

    • Model Selection: A diverse set of models was planned to leverage their unique strengths:
      • Random Forest: Offers a baseline and insights into feature importance.
      • Multi-Layer Perceptron (MLP): Provides a simple neural network baseline.
      • Temporal Convolutional Network (TCN): Excels at modeling temporal dependencies with computational efficiency.
      • Long Short-Term Memory (LSTM): Captures long-term sequential relationships, ideal for time series like seismic data.
    • Ensemble Potential: The competition’s winning team combined neural networks with gradient boosted decision trees (e.g., LightGBM or XGBoost), suggesting that integrating tree-based models could boost performance.
    • Training Considerations:
      • Use cross-validation strategies tailored to the data’s structure, such as "Leave One Earthquake Out," to ensure generalization (inspired by the 1st place writeup).
      • Tune hyperparameters carefully, especially for tree-based models, to optimize for the target metric.

    5. Evaluation

    • Metric: The competition evaluates submissions using Mean Absolute Error (MAE), measuring the accuracy of predicted "time to failure" against actual values.
    • Validation: Robust validation is key. Consider splitting the data by earthquake cycles to mimic the test set’s structure and avoid overfitting.
  17. Brazilian Volleyball Superliga 2021/22 (Women)

    • kaggle.com
    Updated: Jul 28, 2022
    Provided by: Kaggle
    Authors: ils
    Format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Cite: ils (2022). Brazilian Volleyball Superliga 2021/22 (Women) [Dataset]. https://www.kaggle.com/datasets/smnlgn/superliga-202122
    Description

    Although soccer is the sport most commonly associated with Brazil, volleyball is not that far behind. The national women's team is a five-time Olympic medalist, including two gold medals (2008 and 2012). Superliga is the top-level Brazilian professional volleyball competition. This dataset contains information about every Superliga match for the 2021/2022 season.

    • Set *X*: Starting position in set X. * is for substitute players.
    • Jogadora: Player's name. "(L)" indicates libero.
    • Time: Team's name.
    • Partida: Match.
    • Serviço Err: Service errors.
    • Serviço Ace: Service aces.
    • Recepção Total: Reception total.
    • Recepção Err: Reception errors.
    • Ataque Exc: Attacks.
    • Ataque Err: Attack errors.
    • Ataque Blk: Blocked attacks.
    • Bloqueio Pts: Block points.
    • Fase: Primary competition stage: classificatoria (pool stage) or playoffs.
    • Cat: Secondary competition stage: turno or returno (pool stage), or quartas, semi, final (quarter finals, semifinals, final).
    • VV: Did the player win the Viva Vôlei trophy? 1 for yes, 0 for no. (Viva Vôlei is a "best player trophy" awarded at every game by popular vote.)

