Hello! I am currently taking the mlcourse.ai course, and this dataset was required for one of its in-class Kaggle competitions. The data is originally hosted in a Git repository, but I like to have my data right here on Kaggle, hence this dataset.
If you find this dataset useful, do upvote. Thank you and happy learning!
This dataset contains 6 files in total:
1. Sample_submission.csv
2. Train_features.csv
3. Test_features.csv
4. Train_targets.csv
5. Train_matches.jsonl
6. Test_matches.jsonl
All of the data in this dataset is originally hosted in a Git repository; the same data can also be found on the in-class competition's 'Data' page here.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Budi Ryan
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Kaggle Competitions Top 100’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-competitions-top-100 on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.
It has 100 rows and 13 columns. Column descriptions are listed below.
Data from Kaggle. Image from Smartcat.
If you're reading this, please upvote.
--- Original source retains full ownership of the source dataset ---
This data set is for creating predictive models for the CrunchDAO tournament. Registration is required in order to participate in the competition, and to be eligible to earn $CRUNCH tokens.
See notebooks (Code tab) for how to import and explore the data, and build predictive models.
See Terms of Use for data license.
FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.
Citation
If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
About this dataset
Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.
The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.
All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.
The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:
"Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".
Some other relevant characteristics of FSDKaggle2018:
The dataset is split into a train set and a test set.
The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.
Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.
Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.
The test set is composed of 1.6k samples with manually-verified annotations and a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.
All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
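As a minimal sketch (not official starter code) of how the metadata and one clip might be loaded, assuming train.csv exposes columns named fname, label, and manually_verified and that the clips live in the audio folders described in the Files section:

    # Load train metadata and inspect the verified subset (column names assumed).
    import pandas as pd
    from scipy.io import wavfile

    train = pd.read_csv("train.csv")
    verified = train[train["manually_verified"] == 1]   # ~3.7k curated annotations
    print(train["label"].value_counts())                 # 41 categories, 94-300 clips each

    # Clips are uncompressed PCM 16-bit, 44.1 kHz, mono WAV files.
    sr, audio = wavfile.read("FSDKaggle2018.audio_train/" + train.loc[0, "fname"])
    print(sr, audio.shape)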
Data labeling process
The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.
Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.
Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.
The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. Some of these non-verified audio samples may contain several sound sources even though only one label is provided as ground truth. These additional sources are typically outside the set of 41 categories, but in a few cases they could fall within it.
More details about the data labeling process can be found in [3].
License
FSDKaggle2018 has licenses at two different levels, as explained next.
All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of the audio clips included in FSDKaggle2018 and their corresponding licenses. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.
In addition, FSDKaggle2018 as a whole is the result of a curation process and has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file included in the FSDKaggle2018.doc zip file.
Files
FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2018.audio_train/                      Audio clips in the train set
│
└───FSDKaggle2018.audio_test/                       Audio clips in the test set
│
└───FSDKaggle2018.meta/                             Files for evaluation setup
│   │
│   └───train_post_competition.csv                  Data split and ground truth for the train set
│   │
│   └───test_post_competition_scoring_clips.csv     Ground truth for the test set
│
└───FSDKaggle2018.doc/
    │
    └───README.md                                   The dataset description file you are reading
    │
    └───LICENSE-DATASET
Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
retina
The US Center for Disease Control and Prevention estimates that 29.1 million people in the US have diabetes and the World Health Organization estimates that 347 million people have the disease worldwide. Diabetic Retinopathy (DR) is an eye disease associated with long-standing diabetes. Around 40% to 45% of Americans with diabetes have some stage of the disease. Progression to vision impairment can be slowed or averted if DR is detected in time, however this can be difficult as the disease often shows few symptoms until it is too late to provide effective treatment.
Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow up, miscommunication, and delayed treatment.
Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.
The need for a comprehensive and automated method of DR screening has long been recognized, and previous efforts have made good progress using image classification, pattern recognition, and machine learning. With color fundus photography as input, the goal of this competition is to push an automated detection system to the limit of what is possible – ideally resulting in models with realistic clinical potential. The winning models will be open sourced to maximize the impact such a model can have on improving DR detection.
Acknowledgements This competition is sponsored by the California Healthcare Foundation.
Retinal images were provided by EyePACS, a free platform for retinopathy screening.
The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the datamix created by Team 🔍 📝 🕵️♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It facilitates a text-classification task that separates LLM-generated essays from student-written ones.
It was developed in an incremental way, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.
To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.
Sources for our generated essays can be grouped under four categories:
- Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
- Open source LLMs (llama, falcon, mistral, mixtral)
- Existing LLM generated text datasets: DAIGT V2 subset, OUTFOX, Ghostbuster, gpt-2-output-dataset
- Synthetic dataset made by T5
We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following (a minimal decoding sketch follows this list):
- Contrastive search
- Use of guidance scale, typical_p, suppress_tokens
- High temperature and large values of top-k
- Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
- Prompting without source texts
- Prompting with source texts
- Prompting to rewrite existing essays
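As an illustration only (the model name, prompt, and parameter values below are placeholders, not the exact configurations used for the datamix), mixing such decoding settings with the Hugging Face transformers API could look like:

    # Illustrative decoding-diversity sketch; model and values are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder open-source LLM
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Write an essay arguing for later school start times."
    ids = tok(prompt, return_tensors="pt").input_ids

    # High temperature + large top-k sampling, with typical_p filtering
    sampled = model.generate(ids, do_sample=True, temperature=1.3, top_k=500,
                             typical_p=0.95, max_new_tokens=400)

    # Contrastive search (penalty_alpha with a small top_k)
    contrastive = model.generate(ids, penalty_alpha=0.6, top_k=4, max_new_tokens=400)

    print(tok.decode(sampled[0], skip_special_tokens=True))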
Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:
- Spelling correction
- Deletion/insertion/swapping of characters
- Replacement with synonyms
- Introduction of obfuscations
- Back translation
- Random capitalization
- Sentence swapping
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
This comes directly from the README:
The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:
CustomerID,Rating,Date
Movie information in "movie_titles.txt" is in the following format:
MovieID,YearOfRelease,Title
The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.
MovieID1:
CustomerID11,Date11
CustomerID12,Date12
...
MovieID2:
CustomerID21,Date21
CustomerID22,Date22
For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.
The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.
For example, if the qualifying dataset looked like:
111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07
then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0
which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.
You must make predictions for all customers for all movies in the qualifying dataset.
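A minimal sketch (not the original tooling) of producing a submission in this layout with a constant predicted rating:

    # Write a prediction file in the qualifying-set layout, predicting 3.6 stars
    # for every (movie, customer) pair.
    with open("qualifying.txt") as fin, open("predictions.txt", "w") as fout:
        for line in fin:
            line = line.strip()
            if line.endswith(":"):        # movie id line, e.g. "111:"
                fout.write(line + "\n")
            elif line:                    # "CustomerID,Date" line -> replace with a rating
                fout.write("3.6\n")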
To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.
MovieID1:
CustomerID11
CustomerID12
...
MovieID2:
CustomerID21
CustomerID22
Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.
If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
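A minimal sketch of that check, where true_rating() and predict() stand in for your own lookup of training-set ratings and your own model:

    # RMSE of your predictions on probe.txt; true_rating() and predict()
    # are placeholders for your own rating lookup and model.
    import math

    se, n = 0.0, 0
    with open("probe.txt") as f:
        for line in f:
            line = line.strip()
            if line.endswith(":"):
                movie_id = int(line[:-1])
            elif line:
                cust_id = int(line)
                se += (predict(movie_id, cust_id) - true_rating(movie_id, cust_id)) ** 2
                n += 1
    print("RMSE:", math.sqrt(se / n))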
The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt
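The combined files keep the same per-movie layout ("MovieID:" followed by "CustomerID,Rating,Date" lines), so a minimal sketch for loading one of them into a flat table is:

    # Parse combined_data_1.txt into (movie_id, customer_id, rating, date) rows.
    import pandas as pd

    rows = []
    with open("combined_data_1.txt") as f:
        for line in f:
            line = line.strip()
            if line.endswith(":"):
                movie_id = int(line[:-1])
            elif line:
                cust, rating, date = line.split(",")
                rows.append((movie_id, int(cust), int(rating), date))

    ratings = pd.DataFrame(rows, columns=["movie_id", "customer_id", "rating", "date"])
    print(ratings.head())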
The contest was originally hosted at http://netflixprize.com/index.html
The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar
This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.
How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.
The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.
Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:
TeamName_TeamId_SubmissionId.csv
The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:
Id,Pred
2016_1112_1114,0.6
2016_1112_1122,0
...
The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
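A minimal sketch for reading one prediction file and splitting its Id column back into season and team ids (the file name below is just the naming pattern; substitute an actual submission file):

    # Split Id values like "2016_1112_1114" into season, lower team id, higher team id.
    import pandas as pd

    sub = pd.read_csv("predictions/TeamName_TeamId_SubmissionId.csv")
    parts = sub["Id"].str.split("_", expand=True).astype(int)
    sub["season"], sub["team_low"], sub["team_high"] = parts[0], parts[1], parts[2]
    # Pred is the probability that team_low beats team_high.
    print(sub.head())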
For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.
https://creativecommons.org/publicdomain/zero/1.0/
The datasets are derived from eye fundus images provided in Kaggle's 'APTOS 2019 Blindness Detection' competition. The competition involves classification of eye fundus images into 5 levels of severity in diabetic retinopathy.
Unlike most participants, who used a deep learning approach to this classification problem, here we tried using fractal dimensions and persistent homology (one of the major tools in Topological Data Analysis, TDA) to extract features from the images, which serve as inputs to simpler ML algorithms like SVM. This approach shows some promising results.
There are three files in this dataset:
Process_Images.html - R scripts for extracting Fractal Dimensions and Persistent Homology features from images.
train_features.RDS and test_features.RDS - the output RDS (R dataset files) for training and testing images for the above Kaggle competition.
Columns in train_features.RDS & test_features.RDS:
id_code - image id
diagnosis - severity of diabetic retinopathy on a scale of 0 to 4: 0=No DR; 1=Mild; 2=Moderate; 3=Severe; 4=Proliferative DR; Artificially set to be 0 for test_features.RDS
n - number of persistent homology components detected from the image
fd1 to fd21 - proportion of sliding windows having a specific fractal dimension: fd1 = proportion of windows having FD=2; fd2 = proportion of windows having FD in (2, 2.05]; ... fd21 = proportion of windows having FD in (2.95, 3.00]
l1_2 to l1_499 - silhouette (p=0.1, dim=1) at various time steps.
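A minimal sketch of using these features from Python (the pyreadr package for reading RDS files is an assumption; the original processing was done in R):

    # Load the RDS feature files via pyreadr (assumed installed) and fit a simple SVM.
    import pyreadr
    from sklearn.svm import SVC

    train = pyreadr.read_r("train_features.RDS")[None]   # RDS files hold a single unnamed object
    X = train.drop(columns=["id_code", "diagnosis"])
    y = train["diagnosis"].astype(int)

    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.score(X, y))   # training accuracy only, as a quick sanity check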
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is associated with the WWW 2025 SmartMem Competition. The competition task is to predict whether a memory module (DIMM) will experience a failure in the future based on its historical logs. For more details, please visit the competition homepage or the promotion homepage.
Considering the large scale of the competition dataset and the long processing time required to run the full baseline for feature generation, we have released a feature dataset produced by the baseline. This dataset contains high-level features (referred to as "new features") extracted from the raw data of all DIMMs. Each DIMM's new feature set consists of 100 columns. Participants can leverage this dataset to explore memory failure prediction strategies—for example, by incorporating additional features to enhance model performance, using data augmentation or sampling methods to address class imbalance, or testing different models to tackle challenges from distribution drift in event sequence data.
- serial_number (abbreviated as SN). The memory failure prediction task is to forecast whether a given DIMM will fail in the future.
- RetryRdErrLogParity ... error on the same cell within a single observation window, only the earliest CE is retained.
- Others > Device > Bank > Row/Column > Cell.
- LogTime (the feature generation time T, considered as the timestamp of the last CE used).
- {read/scrub/all}_ce_log_num_{window_size}: The total number of de-duplicated CE log entries (read, scrub, or all) within the window.
- {read/scrub/all}_ce_count_{window_size}: The total count of CE entries (read, scrub, or all) before deduplication within the window.
- log_happen_frequency_{window_size}: The log frequency, defined as the observation window duration divided by the total number of CEs.
- ce_storm_count_{window_size}: The number of CE storms (for details, see the baseline method _calculate_ce_storm_count).
- fault_mode_{others/device/bank/row/column/cell}_{window_size}: Indicates whether a failure mode occurred at the corresponding module level (for details, see _get_spatio_features).
- fault_{row/column}_num_{window_size}: The number of columns experiencing simultaneous row failures or the number of rows experiencing simultaneous column failures.
- error_{bit/dq/burst}_count_{window_size}: Total count of errors (bit, dq, or burst) within the window.
- max_{dq/burst}_interval_{window_size}: The maximum interval between parity errors (dq or burst) within the window.
- dq_count={1/2/3/4}_{window_size}: The total number of occurrences where the dq error count equals n (with n ∈ {1, 2, 3, 4}).
- burst_count={1/2/3/4/5/6/7/8}_{window_size}: The total number of occurrences where the burst error count equals n (with n ∈ {1, 2, 3, 4, 5, 6, 7, 8}).

In the baseline program (baseline_en.py), place the feature files in the directory specified by Config.feature_path. Then, remove the call to ...
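As a minimal sketch only (the file layout under the feature directory and the label column are assumptions, not the competition's actual schema), loading the released feature columns and fitting a simple classifier could look like:

    # Sketch: load released DIMM features and train a simple failure predictor.
    # The per-DIMM CSV layout and the "will_fail" label column are hypothetical placeholders.
    import glob
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    frames = [pd.read_csv(p) for p in glob.glob("feature_path/*.csv")]
    feats = pd.concat(frames, ignore_index=True)

    y = feats.pop("will_fail")            # hypothetical label: whether the DIMM failed later
    X = feats.select_dtypes("number")     # the ~100 released feature columns

    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
    clf.fit(X, y)                         # class_weight helps with the class imbalance noted above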
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
full data https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data
For use in code competitions with no Internet access. Add this dataset via 'Add Data', then install with:
!mkdir -p /tmp/pip/cache/
!cp ../input/hdbscan0828whl/hdbscan-0.8.28-cp37-cp37m-linux_x86_64.whl /tmp/pip/cache/
!pip install --no-index --find-links /tmp/pip/cache/ hdbscan
Latest version 0.8.28. As with https://www.kaggle.com/datasets/something4kag/hdbscan0827-whl, import hdbscan no longer worked in notebooks (the package is no longer available in the Docker image), so this dataset makes it work again.
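Once installed, a quick usage check (synthetic data, purely illustrative) could be:

    # Sanity check that the offline-installed package imports and runs.
    import numpy as np
    import hdbscan

    X = np.random.RandomState(0).normal(size=(200, 2))
    labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)
    print(set(labels))   # label -1 marks noise points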
https://github.com/scikit-learn-contrib/hdbscan
scikit-learn-contrib/hdbscan is licensed under the BSD 3-Clause "New" or "Revised" License
A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the copyright holder or its contributors to promote derived products without written consent.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a Paralympic Games dataset that describes medals and athletes for Tokyo 2020. The data was created from the Tokyo Paralympics.
Here you can find all medals and more than 4,500 athletes of the Paralympic Games (with some personal data: date and place of birth, height, etc.). Coaches and technical officials are included as well.
Please click the upvote button at the top right of the dataset page; it helps the dataset stay near the top.
Data:
1. medals_total.csv - dataset contains all medals grouped by country as here.
2. medals.csv - dataset includes general information on all athletes who won a medal.
3. athletes.csv - dataset includes some personal information of all athletes.
4. coaches.csv - dataset includes some personal information of all coaches.
5. technical_officials - dataset includes some personal information of all technical officials.
2021-09-05 - dataset is updated. Contains full information.
2021-08-30 - dataset is updated. Contains information for the first 6 days of competitions.
2021-08-27 - dataset is created. Contains information for the first 3 days of competitions.
If you have some questions please start a discussion.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data is stored in .npz format, ensuring efficient storage, accessibility, and consistency across training and testing phases.

Although soccer is the sport most commonly associated with Brazil, volleyball is not that far behind. The national women's team is a five-time Olympic medalist, including two gold medals (2008 and 2012). Superliga is the top-level Brazilian professional volleyball competition. This dataset contains information about every Superliga match of the 2021/2022 season.
Columns (a small usage sketch follows this list):
- Set *X*: Starting position in set X. * is for substitute players.
- Jogadora: Player's name. "(L)" indicates libero.
- Time: Team's name.
- Partida: Match.
- Serviço Err: Service errors.
- Serviço Ace: Service aces.
- Recepção Total: Reception total.
- Recepção Err: Reception errors.
- Ataque Exc: Attacks.
- Ataque Err: Attack errors.
- Ataque Blk: Blocked attacks.
- Bloqueio Pts: Block points.
- Fase: Primary competition stage: classificatoria (pool stage) or playoffs.
- Cat: Secondary competition stage: turno or returno (pool stage), or quartas, semi, final (quarterfinals, semifinals, final).
- VV: Did the player win the Viva Vôlei trophy? 1 for yes, 0 for no (Viva Vôlei is a "best player" trophy awarded at every game by popular vote).
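As a small illustration (the CSV file name and exact column headers below are assumptions; adjust them to the file shipped with the dataset):

    # Total service aces per player across the season.
    # "superliga_2021_2022.csv" and the column headers are placeholders.
    import pandas as pd

    df = pd.read_csv("superliga_2021_2022.csv")
    aces = df.groupby("Jogadora")["Serviço Ace"].sum().sort_values(ascending=False)
    print(aces.head(10))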