22 datasets found
  1. Kaggle Winning Solutions Methods

    • kaggle.com
    zip
    Updated Jul 15, 2023
    Cite
    Darek Kłeczek (2023). Kaggle Winning Solutions Methods [Dataset]. https://www.kaggle.com/thedrcat/kaggle-winning-solutions-methods
    Explore at: zip (7817540 bytes)
    Dataset updated
    Jul 15, 2023
    Authors
    Darek Kłeczek
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Machine learning methods associated with Kaggle winning solution writeups. The dataset was obtained using OpenAI models and the Kaggle Solutions website.

    You can use this dataset to analyze methods needed to win a Kaggle competition :)

    Article describing the process used to collect this data. Notebook demonstrating how the data was collected.

  2. How to Win Data Science Competition

    • kaggle.com
    zip
    Updated Jan 30, 2018
    Cite
    Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition
    Explore at: zip (15845091 bytes)
    Dataset updated
    Jan 30, 2018
    Authors
    Budi Ryan
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain

    Contents

  3. Kaggle Analytics Competitions - Metadata

    • kaggle.com
    zip
    Updated Nov 1, 2022
    Cite
    Andrada (2022). Kaggle Analytics Competitions - Metadata [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/kaggle-analytics-competitions-metadata
    Explore at: zip (183843 bytes)
    Dataset updated
    Nov 1, 2022
    Authors
    Andrada
    Description

    Context

    I have gathered this data to create a small analysis (an analysis within an analysis - inception like situation) to understand what makes a notebook win a Kaggle Analytics Competition.

    Furthermore, the data lets us explore some differences in approaches between competitions and the evolution through time.

    Of course, since we are talking about an analytical approach (which, unlike a normal Kaggle competition with a KPI, cannot be quantified), there can never be an EXACT recipe. However, if we look at some quantitative features (and then at quality, by reading the notebooks), we can quickly see a pattern within the winning notebooks.

    This knowledge might help you when you approach a new challenge, and guide you onto the "right" path.

    Note: the dataset contains only PAST competitions that have already ended and the winners have been announced.

  4. Kaggle Blog: Winners' Posts

    • kaggle.com
    zip
    Updated Sep 21, 2016
    Cite
    Kaggle (2016). Kaggle Blog: Winners' Posts [Dataset]. https://www.kaggle.com/kaggle/kaggle-blog-winners-posts
    Explore at: zip (530977 bytes)
    Dataset updated
    Sep 21, 2016
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In 2010, Kaggle launched its first competition, which was won by Jure Zbontar, who used a simple linear model. Since then a lot has changed. We've seen the rebirth of neural networks, the rise of Python, the creation of powerful libraries like XGBoost, Keras and Tensorflow.

    This data set is a dump of all winners' posts from the Kaggle blog, starting with Jure Zbontar's. It allows us to track trends in the techniques, tools and libraries that win competitions.

    This is a simple dump. If there's demand, I can upload more detail (including comments and tags).

  5. mlcourse.ai - Dota 2 - winner prediction Dataset

    • kaggle.com
    zip
    Updated Sep 8, 2019
    Cite
    Sushma Biswas (2019). mlcourse.ai - Dota 2 - winner prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sushmabiswas/mlcourseai-dota-2-winner-prediction-dataset
    Explore at: zip (759868828 bytes)
    Dataset updated
    Sep 8, 2019
    Authors
    Sushma Biswas
    Description

    Context

    Hello! I am currently taking the mlcourse.ai course, and this dataset was required for one of its in-class Kaggle competitions. The data is originally hosted on git, but I like to have my data right here on Kaggle; hence this dataset.

    If you find this dataset useful, do upvote. Thank you and happy learning!

    Content

    This dataset contains 6 files in total:
    1. Sample_submission.csv
    2. Train_features.csv
    3. Test_features.csv
    4. Train_targets.csv
    5. Train_matches.jsonl
    6. Test_matches.jsonl
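    The two *_matches.jsonl files are presumably line-delimited JSON (one raw match record per line), which pandas.read_csv will not parse; a minimal loader sketch (file names taken from the list above, paths not verified against the archive):

```python
import json

def read_jsonl(path):
    """Parse a .jsonl file: one JSON object (here, one raw match) per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# The CSV files (Train_features.csv, Train_targets.csv, ...) load directly with
# pandas.read_csv; the *_matches.jsonl files need line-by-line parsing, e.g.:
# train_matches = read_jsonl("Train_matches.jsonl")
```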

    Acknowledgements

    All of the data in this dataset is originally hosted on git and the same can also be found on the in-class competition's 'data' page here.

    Inspiration

    • to be updated.
  6. LLM 20 Questions Games

    • kaggle.com
    zip
    Updated Aug 7, 2024
    Cite
    waechter (2024). LLM 20 Questions Games [Dataset]. https://www.kaggle.com/datasets/waechter/llm-20-questions-games
    Explore at: zip (189837141 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    waechter
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Episode games from https://www.kaggle.com/competitions/llm-20-questions. This dataset can be used to analyze winning strategies, or as training data.

    description:

    • index is {episodeId}_{guesser}_{answer} (2 rows for each episodeId, one per team)
    • answers: list (length nb_round) of answers given by the answerer agent
    • questions: list (length nb_round) of questions asked by the guesser agent
    • guesses: list (length nb_round) of guesses made by the guesser agent
    • keyword: keyword to be guessed
    • category: category of the keyword
    • guesser: name of guesser/asker team
    • answerer: name of answerer team
    • nb_round: int number of rounds (<20 means victory or error)
    • game_num: episodeId
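    As a sketch of how these columns fit together, a team's win can be detected by checking whether the keyword appears among its guesses (the rows below are made up for illustration; the real data comes from the zip archive, and exact string matching may need normalization):

```python
import pandas as pd

# Made-up rows using the documented columns:
games = pd.DataFrame({
    "keyword":  ["paris", "banana"],
    "guesses":  [["london", "paris"], ["apple", "pear"]],
    "guesser":  ["team_a", "team_b"],
    "nb_round": [2, 20],   # < 20 means the game ended early (victory or error)
})

# A guesser team wins when the keyword appears among its guesses.
games["won"] = [kw in gs for kw, gs in zip(games["keyword"], games["guesses"])]
win_rate = games.groupby("guesser")["won"].mean()
print(win_rate)
```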

    source

    Notebook: https://www.kaggle.com/code/waechter/llm-20-questions-games-dataset/notebook
    Meta Kaggle dataset

  7. meta-kaggle-top-voted-posts

    • kaggle.com
    zip
    Updated Jul 7, 2023
    Cite
    Vcol (2023). meta-kaggle-top-voted-posts [Dataset]. https://www.kaggle.com/datasets/vcolliym/meta-kaggle-top-voted-posts
    Explore at: zip (4662718 bytes)
    Dataset updated
    Jul 7, 2023
    Authors
    Vcol
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
    • High-level analysis results of discussion posts/writeups of winning solutions, in HTML format
    • Word count of competition tags stratified by competition types:
      • Research
      • Playground
      • Feature (no tag found under community competitions)
  8. Predicting The Lottery

    • kaggle.com
    zip
    Updated Dec 8, 2022
    Cite
    The Devastator (2022). Predicting The Lottery [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-insights-from-us-state-lottery-scratc
    Explore at: zip (425238 bytes)
    Dataset updated
    Dec 8, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Predicting The Lottery

    US State Lottery History. Impossible To Predict Or Not, Let's try!


    About this dataset

    This dataset consists of comprehensive data on U.S. state lottery scratcher games in California, Missouri, New Mexico, Oklahoma and Virginia. It gives researchers the information needed to evaluate the probability of winning each lottery game, along with related statistics. Columns include price, gameNumber, topPrize, overallOdds, topPrizeAvail, ExtraChances, secondChance and more. Also included are detailed data on winning tickets at start (regardless of whether they were claimed), total prize money at start, and total prize money unclaimed at the end date. Users will also find odds and probability calculations, including probability of winning any prize + 3 standard deviations, max tickets to buy, and expected value of any prize (as % of cost), as well as rankings from best probability of winning any prize to overall rank. Studying this dataset gives players an informed basis for making smarter choices when taking their chances on state lotteries. May the odds be ever in your favor!


    How to use the dataset

    This guide will explain how to use this dataset in detail. It will provide step-by-step instructions on how to interact with and analyze the data contained in this Kaggle dataset so that you can gain insight into your own research or project related to state lottery scratchers!

    Understand the Dataset Contents

    The contents include price information (e.g., price per play); game name and number; top prize amounts, overall odds, top prize availability, extra chances and second-chance options offered by a particular game; date fields indicating when a ticket was started, ended, and exported; the different prize amounts available per ticket with their corresponding probabilities and expected values; total prize money at start and remaining (as %); and ranks according to best probability of winning any prize or best change in those probabilities. This level of detail lets an experienced researcher perform sophisticated analysis of U.S. state lottery tickets' success rates and how they change over time.
    In short, understanding what variables are included in this dataset is necessary for analyzing them effectively!

    Describe each variable and its categories properly. Describing individual variables gives users more detailed insight into those variables and their categories, especially when many different categories are associated with a single variable (like prizes won). Furthermore, formulae should be introduced where applicable, since users may not understand why certain calculations were done (such as calculating expected value). All such things should be clarified in descriptions rather than just listing numerical values without explanation!
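    For instance, the "Expected Value Of Any Prize (as % of cost)" column can be explained with a small worked example (the ticket price and prize tiers below are made-up numbers, not values from the dataset):

```python
# Expected value = sum over prize tiers of (prize amount * probability of winning it),
# expressed as a percentage of the ticket price.
ticket_price = 5.0
prize_tiers = [(5.0, 0.10), (10.0, 0.05), (100.0, 0.001)]  # (amount, probability)

ev = sum(amount * prob for amount, prob in prize_tiers)
ev_pct_of_cost = 100 * ev / ticket_price
print(f"expected value ${ev:.2f} = {ev_pct_of_cost:.0f}% of cost")
```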

    Analyze differences between states using appropriate graphs and diagrams. Data visualization plays an essential role while trying out various

    Research Ideas

    • Analyzing the effectiveness of marketing campaigns for various state lotteries by examining sales of different scratcher tickets.
    • Examining the lottery scratcher game price points to identify selling opportunities or trends in preferences across states.
    • Utilizing the data to apply statistics and modeling techniques to project future expected values from similar scratch games across different states

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: CAratingstable.csv | Column name | Description ...

  9. YIEDL Competition Data (updated daily)

    • kaggle.com
    zip
    Updated Jan 10, 2025
    Cite
    Joakim Arvidsson (2025). YIEDL Competition Data (updated daily) [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/yiedl-competition/versions/80
    Explore at: zip (9274033415 bytes)
    Dataset updated
    Jan 10, 2025
    Authors
    Joakim Arvidsson
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset updates daily for the Numerai Crypto data (daily competition), and weekly on Mondays for the Yiedl.ai weekly competition. The Yiedl data contains the most recent dataset from yiedl.ai, as well as a quickstarter notebook. It now also includes the Numerai Crypto daily data (including historical), which may be useful in both competitions. It should be everything you need to get started in these cryptocurrency prediction competitions.

    You can apply for an airdrop of 100 $YIEDL tokens here, which you can use to stake on your predictions to earn more tokens if your predictions are correct (or burn tokens if they are not).

    Experienced data scientists can apply for a grant of an additional 5000 $YIEDL tokens, if approved.

    The $YIEDL token is a recently launched token on the Polygon blockchain. More information can be found at the below links.

  10. 2016 March ML Mania Predictions

    • kaggle.com
    zip
    Updated Nov 15, 2017
    Cite
    Will Cukierski (2017). 2016 March ML Mania Predictions [Dataset]. https://www.kaggle.com/datasets/wcukierski/2016-march-ml-mania
    Explore at: zip (28950066 bytes)
    Dataset updated
    Nov 15, 2017
    Authors
    Will Cukierski
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.

    How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.

    First round predictions

    The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.

    Data Description

    Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:

    TeamName_TeamId_SubmissionId.csv

    The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:

    Id,Pred
    2016_1112_1114,0.6
    2016_1112_1122,0
    ...

    The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
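    A minimal sketch of reading a prediction in this format (the helper functions are mine, not part of the dataset):

```python
def parse_game_id(game_id):
    """Split an Id like '2016_1112_1114' into (year, lower_team_id, higher_team_id)."""
    year, low, high = game_id.split("_")
    return int(year), int(low), int(high)

def win_probability(game_id, pred, team_id):
    """P(team_id wins), given Pred = P(lower-Id team beats higher-Id team)."""
    _, low, high = parse_game_id(game_id)
    if team_id == low:
        return pred
    if team_id == high:
        return 1.0 - pred
    raise ValueError(f"team {team_id} is not in game {game_id}")

# "2016_1112_1114,0.6": team 1112 beats team 1114 with probability 0.6,
# so team 1114 wins with probability 1 - 0.6.
print(win_probability("2016_1112_1114", 0.6, 1114))
```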

    For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.

  11. MCTS | Extra training data

    • kaggle.com
    zip
    Updated Dec 4, 2024
    Cite
    James Day (2024). MCTS | Extra training data [Dataset]. https://www.kaggle.com/datasets/jsday96/mcts-extra-training-data
    Explore at: zip (22372652 bytes)
    Dataset updated
    Dec 4, 2024
    Authors
    James Day
    Description

    Contains the extra training data used in the 1st place solution to the MCTS competition.

    More specifically, it contains the following types of files:

    • ExtraAnnotatedGames_v{version_number}.csv - holds the generated rulesets, features describing those rulesets, and labels computed by simulating matches between pairs of agents. "v6" is the full-scale version used in the winning solution; "v4" is the half-scale version used in earlier experiments and discussed in a couple of forum threads.
    • StartingPositionEvals/{rulesets_origin}_{mcts_config}_{runtime_per_ruleset}s_v2_r{run_id}.json - game balance metrics, examined action counts, and search iteration counts for each ruleset in each dataset (those provided by the competition organizer plus extra rulesets I generated).
    • RecomputedFeatureEstimates.json - estimates of the values of all the nondeterministic features for all rulesets from both data sources (organizer + generated), computed by re-annotating all the rulesets 5 times with 15 trials per run, scaling the hardware-speed-specific features to account for hardware differences, and averaging the estimated feature values from all 5 runs to produce less-noisy values.
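    A hedged sketch of how the per-run evaluation files might be gathered: the regex mirrors the naming scheme quoted above, while the directory layout and JSON structure are assumptions, not verified against the archive.

```python
import glob
import json
import re

# Pattern for StartingPositionEvals/{origin}_{mcts_config}_{runtime}s_v2_r{run_id}.json
name_re = re.compile(
    r"(?P<origin>.+?)_(?P<mcts_config>.+)_(?P<runtime>\d+)s_v2_r(?P<run_id>\d+)\.json$"
)

for path in sorted(glob.glob("StartingPositionEvals/*_v2_r*.json")):
    m = name_re.search(path)
    if m:
        with open(path) as f:
            evals = json.load(f)  # assumed: one JSON document per run
        print(m.group("mcts_config"), m.group("run_id"), len(evals))
```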

  12. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at: zip (121674455 bytes)
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    # https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r

    ### Variables for downloaded files
    data.dir <- ' '
    train.file <- paste0(data.dir, 'training.csv')
    test.file <- paste0(data.dir, 'test.csv')

    ### Load the csv files -- each creates a data.frame, where each column can have a different type.
    d.train <- read.csv(train.file, stringsAsFactors = F)
    d.test <- read.csv(test.file, stringsAsFactors = F)

    ### In training.csv we have 7049 rows, each with 31 columns.
    ### The first 30 columns are keypoint locations, which R correctly identified as numbers.
    ### The last one is a string representation of the image.

    ### To look at samples of the data:
    head(d.train)

    ### Save the Image column as a separate variable and remove it from the data frames.
    ### Assigning NULL to a column removes it from a data frame.
    im.train <- d.train$Image
    d.train$Image <- NULL  # removes 'Image' from the data frame
    im.test <- d.test$Image
    d.test$Image <- NULL   # removes 'Image' from the data frame

    ### Each image is represented as a series of numbers stored as a string.
    ### Convert these strings to integers by splitting them and converting the result:
    ### strsplit splits the string, unlist simplifies its output to a vector of strings,
    ### and as.integer converts that to a vector of integers.
    as.integer(unlist(strsplit(im.train[1], " ")))
    as.integer(unlist(strsplit(im.test[1], " ")))

    ### Install and load the appropriate libraries.
    ### The tutorial is meant for Linux and OS X, which use a different parallel backend,
    ### so here replace all instances of %dopar% with %do%.
    install.packages('foreach')
    library("foreach", lib.loc="~/R/win-library/3.3")

    ### Convert all images
    im.train <- foreach(im = im.train, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    im.test <- foreach(im = im.test, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    # The foreach loop evaluates the inner expression for each image and combines the
    # results with rbind (combine by rows). With %dopar% the evaluations would run in
    # parallel; %do% runs them sequentially.
    # im.train is now a matrix with 7049 rows (one per image) and 9216 columns (one per pixel).

    ### Save all four variables in a data.Rd file; reload them at any time with load('data.Rd').
    save(d.train, im.train, d.test, im.test, file='data.Rd')
    load('data.Rd')

    # Each image is a vector of 96*96 = 9216 pixels.
    # Convert these 9216 integers into a 96x96 matrix:
    im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
    # im.train[1,] returns the first row of im.train, i.e. the first training image.
    # rev reverses the vector to match R's image function, which expects the origin
    # to be in the lower left corner.

    # Visualize the image with R's image function:
    image(1:96, 1:96, im, col=gray((0:255)/255))

    # Color the coordinates of the eyes and nose:
    points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
    points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
    points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

    # Another good check is to see how variable the data is.
    # For example, where are the centers of the noses in the 7049 images? (slow):
    for(i in 1:nrow(d.train)) {
      points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
    }

    # There are quite a few outliers -- they could be labeling errors. Looking at one
    # extreme example shows no labeling error, but also that not all faces are centered:
    idx <- which.max(d.train$nose_tip_x)
    im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
    image(1:96, 1:96, im, col=gray((0:255)/255))
    points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

    # One of the simplest baselines: compute the mean coordinates of each keypoint in
    # the training set and use those as the prediction for all images.
    colMeans(d.train, na.rm=T)

    # To build a submission file, apply these computed coordinates to the test instances:
    p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
    colnames(p) <- names(d.train)
    predictions <- data.frame(ImageId = 1:nrow(d.test), p)
    head(predictions)

    # The expected submission format has one keypoint per row, which we can get with
    # the help of the reshape2 library:
    install.packages('reshape2')
    library(...

  13. Argoverse-HD

    • kaggle.com
    zip
    Updated May 8, 2021
    Cite
    Martin Li (2021). Argoverse-HD [Dataset]. https://www.kaggle.com/mtlics/argoversehd
    Explore at: zip (31334725392 bytes)
    Dataset updated
    May 8, 2021
    Authors
    Martin Li
    Description

    This dataset is built for streaming object detection, for more details please check out the dataset webpage.

    Competition

    The competition on this dataset is hosted on Eval.AI; enter the challenge to win prizes and present at the CVPR 2021 Workshop on Autonomous Driving.


    Dataset comparison chart: http://www.cs.cmu.edu/~mengtial/proj/streaming/img/dataset-compare.png

  14. ICPC WF Ranking Results (1999 - Present) Datasets

    • kaggle.com
    zip
    Updated Sep 23, 2024
    Cite
    Hoang Le Ngoc (2024). ICPC WF Ranking Results (1999 - Present) Datasets [Dataset]. https://www.kaggle.com/datasets/justinianus/icpc-world-finals-ranking-since-1999
    Explore at: zip (284532 bytes)
    Dataset updated
    Sep 23, 2024
    Authors
    Hoang Le Ngoc
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Information

    The ICPC World Finals Ranking Dataset, available on Kaggle, provides an extensive overview of performance metrics for teams from global universities participating in the International Collegiate Programming Contest (ICPC) World Finals since 1999. This dataset includes information such as team rank, representing university, competition year, and the university's country.

    The ICPC is an internationally renowned programming competition where student teams tackle algorithmic problems within a set timeframe. The contest, organized by the Association for Computing Machinery (ACM), progresses through multiple rounds, including regional and online contests, culminating in the finals.

    This dataset offers insights into the universities that have performed outstandingly in the ICPC World Finals from 1999 to the present. It features 21 attributes, such as team name, rank, university name, university's region and country, and the number of problems solved during the contest. Additionally, it contains data on the teams' rankings in the regional contests, which serve as qualifiers for the world finals.

    The dataset is an invaluable tool for statistical and trend analysis, as well as for developing machine learning models. Researchers can utilize it to pinpoint universities and countries with consistent high performance over the years, examine the distribution of problems solved by teams across various years, and forecast future contest results based on past achievements. Moreover, educators and mentors can leverage this dataset to discern essential concepts to aid students in contest preparation.

    *Created by Microsoft Copilot, an AI Language Model*

    Changelog

    • 03/04/2023: Initial dataset creation with icpc-full.csv, detailing results across all years.
    • 10/04/2023: Introduction of new column Prize, detailing champions from all world and regional contests. Segregation of results by year into separate files icpc-xxxx.csv.
    • 17/04/2023: The dataset received a Bronze medal. Published the first notebook. Thanks for the upvotes!
    • 23/04/2023: Changed the data type of the Rank column to integer (no longer contains string data).
    • 30/04/2024: Updated the ICPC WF Luxor 2022-2023 data with the files icpc-2022.csv, icpc-2023.csv, and the revised icpc-full.csv.
    • 23/09/2024: Updated the ICPC WF Astana 2024 data with the files icpc-2024.csv and the revised icpc-full.csv.

    About the Author

    I created this dataset to preserve all information about the ICPC World Finals, which was my first passion when I started in IT. I haven't had the chance to attend the World Finals, but I have won some awards in the ICPC Asia Regional contests and qualified for the World Finals in the Asia Pacific region, held by OLP/ICPC Vietnam:
    • Competed in the ICPC 2020 Asia Can Tho Regional Contest. (Team: No Girl No AC - Hoang Le Ngoc, Huy Nguyen Nhat, Man Ha Xuan)
    • Competed in the ICPC 2021 Asia Hanoi Regional Contest. (Team: The Phoenix Rises - Hoang Le Ngoc, Huy Nguyen Nhat, Phuoc Cao Xuan)
    • Won a Bronze medal in the ICPC 2022 Asia Ho Chi Minh City Regional Contest. (Team: HUSC.[401]_UnauthorizeD - Hoang Le Ngoc, Toan Le Sy, Hai Ngo Van)
    • Won a Third Prize (Vietnam teams) in the ICPC 2023 Asia Hue City Regional Contest. (Team: HUSC.GreedForSpeed - Hoang Le Ngoc, Toan Le Sy, Hai Ngo Van)
    • Won a Consolation Prize (Vietnam teams) in the ICPC 2024 Asia Hanoi Regional Contest. (Team: HUSC.Newbie - Hoang Le Ngoc, Toan Le Sy, Hai Ngo Van)

    My Challenge:
    • Predict the university that the next champion team will come from.
    • Determine if your university or country can win a medal in future World Finals.

    I hope you find it useful. Feel free to upvote and comment if you have any questions. With love from Vietnam <3

  15. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    https://i.imgur.com/2Egeb8R.png

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
      • Creating the database tables with the right data types and constraints, by running the db_abd_create_tables.sql script.
      • Downloading the CSV files from Kaggle using the Kaggle API.
      • Cleaning the data using pandas, by running the clean_data.py script. The script performs the following steps for each table:
        • Drops the columns that are not needed.
        • Converts each column to the right data type.
        • Replaces foreign keys that do not exist with NULL.
        • Replaces some of the missing values with default values.
        • Removes rows with missing values in the primary key / not-null columns.
        • Removes duplicate rows.
      • Loading the data into the database using the LOAD DATA INFILE command.
      • Checking that the number of rows in each database table matches the number of rows in the corresponding CSV file.
      • Adding foreign key constraints to the database tables, by running the add_foreign_keys.sql script.
      • Updating the Total columns in the database tables, by running the update_totals.sql script.
      • Backing up the database.
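The per-table cleaning steps described above can be sketched with pandas. This is an illustrative sketch, not the actual clean_data.py API: the function name and its parameters are assumptions.

```python
import pandas as pd

def clean_table(df, keep_cols, dtypes, pk_cols, fk_checks):
    """Clean one Meta-Kaggle table following the steps above.

    fk_checks maps a foreign-key column to the set of valid ids;
    dangling ids are replaced with NA instead of breaking joins.
    """
    df = df[keep_cols].copy()                      # drop columns that are not needed
    for col, dtype in dtypes.items():              # convert each column to the right type
        df[col] = df[col].astype(dtype)
    for col, valid_ids in fk_checks.items():       # null out foreign keys that do not exist
        df.loc[~df[col].isin(valid_ids), col] = pd.NA
    df = df.dropna(subset=pk_cols)                 # rows must have a primary key
    df = df.drop_duplicates()                      # remove duplicate rows
    return df
```

For example, cleaning a toy ForumMessages table against a Users table nulls out the UserId values with no matching user, drops the row missing its Id, and removes the duplicate row.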
  16. BirdCLEF 2024 | Best Working Note - Source Code

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    Hugo de Heer (2024). BirdCLEF 2024 | Best Working Note - Source Code [Dataset]. https://www.kaggle.com/datasets/hugodeheer/bird-source/code
    Explore at:
    zip(83018855 bytes)Available download formats
    Dataset updated
    Jun 20, 2024
    Authors
    Hugo de Heer
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Source Code of Team Epoch IV's submission for BirdCLEF 2024.

    Since we worked on a Python project rather than a notebook for easier collaboration, we developed our code locally and uploaded it to Kaggle for submissions. By uploading our source code to a dataset, we could run this code from a notebook. Additionally, we train our models locally and add these to this dataset as we only need to run inference on the Kaggle notebook.

    To use this code in your own notebook, add this dataset and run:

    !python3 "submit.py"
    

    For the full source code, which also includes the code for training models, as well as our award-winning working note, see our repository on GitHub.

  17. ML Competition on Cryptocurrency Market Data

    • kaggle.com
    zip
    Updated Nov 23, 2021
    Cite
    YIEDL (2021). ML Competition on Cryptocurrency Market Data [Dataset]. https://www.kaggle.com/datasets/rocketcapital/ml-competition-on-cryptocurrency-market-data
    Explore at:
    zip(291744236 bytes)Available download formats
    Dataset updated
    Nov 23, 2021
    Authors
    YIEDL
    Description

    Context

    The world of Asset Management today, from a technological point of view, is mainly linked to mature but inefficient supply chains, which merge discretionary and quantitative forecasting models. The financial industry has been working for years to overcome this paradigm, pushing beyond technology and making use not only of automated models (trading systems and dynamic asset allocation systems) but also of modern Machine Learning techniques for Time Series Forecasting and Unsupervised Learning for the classification of financial instruments. However, in most cases it relies on proprietary technologies that are limited by definition (workforce, technology investment, scalability).

    Numerai, an offshoot of Jim Simons’ Renaissance Technologies, was the first to blaze a new path by building a centralized machine learning competition, gathering a swarm of predictors outside the company to integrate with internal intelligence. The discretionary contribution was therefore eliminated, and the information content generated internally was enriched by thousands of external contributors, in many cases from sectors unrelated to the financial industry, such as energy, aerospace, or biotechnology. This overcomes the notion that good market forecasts require only financial-industry skills.

    What we have just described is the starting point of Rocket Capital Investment. To overcome the limits of Numerai's approach, a new competition has been engineered, with the ambition to make this project even more “democratic”. How? By decentralizing, thanks to the Blockchain, the entire chain of participant management, collection, and validation of forecasts, as well as decisions relating to the evaluation and remuneration of the participants themselves. In this way, every aspect of the competition is completely transparent and inviolable. Everything is managed by a Smart Contract, whose rules are known and shared. Let’s find out in more detail what it is.

    Starting from the idea of Numerai, we have completely re-engineered all aspects related to the management of participants, Scoring, and Reward, following the concept of decentralization of the production chain. To this end, a proprietary token (MUSA token) has been created which acts as an exchange currency and which integrates a smart contract that acts as an autonomous competition manager. The communication interface between the users and the smart contract is a DApp (“Decentralized Application”). But let’s see in more detail how all these elements combine with each other, like in a puzzle.

    Competition Technicalities

    A suitably normalized dataset is issued every week, containing data from over 400 cryptocurrencies. For each asset, the data relating to prices, volumes traded, quantitative elements, as well as alternative data (information on the blockchain and on the sentiment of the various providers) are aggregated. Another difference with Numerai is the ability to distinguish assets for each row (the first column shows the related ticker). The last column instead contains the question to which the Data Scientists are asked to give an answer: the relative strength ranking of each asset, built on the forecast of the percentage change expected in the following week.

    Registration for the Competition takes place by providing, in a completely anonymous way, the address of a crypto wallet on which the MUSA tokens are loaded. From that moment on, the MUSAs become, to all intents and purposes, the currency of exchange between participants and organizers. Every Monday a new Challenge opens, and all Data Scientists registered in the Contest are asked to use their models to generate predictions. By accessing the DApp, the participant can download the new dataset, complete with the history of the previous weeks and the last useful week. At this point the participant can perform two actions in sequence directly from the DApp:
    • Staking: MUSA tokens are placed on your prediction.
    • Submission: the forecast for the following week is uploaded to the blockchain.

    Since the forecast consists of a series of numbers between 0 and 1 associated with each asset, it is very easy, the following week, to calculate the error committed in terms of RMSE (“Root Mean Square Error”). This allows creating a ranking of the participants, in order to reward them accordingly with additional MUSA tokens. But let’s see in more detail how the Smart Contract allows us to differentiate the reward based on different items (all, again, in a completely transparent and verifiable way):
    • Staking Reward: the mere fact of participating in the competition is remunerated. In future versions, it will also be possible to bet on the goodness of the other participants’ predictions.
    • Challenge Rew...
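A minimal sketch of the RMSE scoring described above, using numpy. This only illustrates the metric; the actual on-chain computation by the Smart Contract is not shown here.

```python
import numpy as np

def rmse(predicted, realized):
    """Root Mean Square Error between a submitted forecast and the realized one.

    Both inputs are per-asset scores in [0, 1]; a lower RMSE ranks higher.
    """
    predicted = np.asarray(predicted, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.sqrt(np.mean((predicted - realized) ** 2)))

# A perfect forecast scores 0; the worst possible forecast on [0, 1] scores 1.
assert rmse([0.2, 0.8], [0.2, 0.8]) == 0.0
assert rmse([1.0, 0.0], [0.0, 1.0]) == 1.0
```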

  18. nfl-big-data-bowl-2021 Feather files

    • kaggle.com
    zip
    Updated Oct 15, 2020
    Cite
    Mathurin Aché (2020). nfl-big-data-bowl-2021 Feather files [Dataset]. https://www.kaggle.com/mathurinache/nflbigdatabowl2021-feather-files
    Explore at:
    zip(475231363 bytes)Available download formats
    Dataset updated
    Oct 15, 2020
    Authors
    Mathurin Aché
    Description

    When a quarterback takes a snap and drops back to pass, what happens next may seem like chaos. As offensive players move in various patterns, the defense works together to prevent successful pass completions and then to quickly tackle receivers that do catch the ball. In this year’s Kaggle competition, your goal is to use data science to better understand the schemes and players that make for a successful defense against passing plays.

    In American football, there are a plethora of defensive strategies and outcomes. The National Football League (NFL) has used previous Kaggle competitions to focus on offensive plays, but as the old proverb goes, “defense wins championships.” Though metrics for analyzing quarterbacks, running backs, and wide receivers are consistently a part of public discourse, techniques for analyzing the defensive part of the game lag behind. Identifying player, team, or strategic advantages on the defensive side of the ball would be a significant breakthrough for the game.

    This competition uses NFL’s Next Gen Stats data, which includes the position and speed of every player on the field during each play. You’ll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays. There are several different directions for participants to ‘tackle’ (ha), which may require varying levels of football savvy, data aptitude, and creativity. As examples:

    • What are the coverage schemes (man, zone, etc.) that the defense employs? Which coverage options tend to perform better?
    • Which players are the best at closely tracking receivers as they try to get open?
    • Which players are the best at closing on receivers when the ball is in the air?
    • Which players are the best at defending pass plays when the ball arrives?
    • Is there any way to use player tracking data to predict whether or not certain penalties – for example, defensive pass interference – will be called?
    • Who are the NFL’s best players against the pass?
    • How does a defense react to certain types of offensive plays?
    • Is there anything about a player – for example, their height, weight, experience, speed, or position – that can be used to predict their performance on defense?

    What does data tell us about defending the pass play? You are about to find out.

    Note: Are you a university participant? Students have the option to participate in a college-only Competition, where you’ll work on the identical themes above. Students can opt-in for either the Open or College Competitions, but not both.

  19. How could we win the next UK National Lottery ?

    • kaggle.com
    zip
    Updated Jun 12, 2025
    Cite
    Patrick L Ford (2025). How could we win the next UK National Lottery ? [Dataset]. https://www.kaggle.com/datasets/patricklford/how-could-we-win-the-next-uk-national-lottery/code
    Explore at:
    zip(59204 bytes)Available download formats
    Dataset updated
    Jun 12, 2025
    Authors
    Patrick L Ford
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Introduction

    The National Lottery is the state-franchised lottery in the United Kingdom, established in 1994. It is regulated by the Gambling Commission and operated by Allwyn Entertainment, which took over from Camelot Group on 1 February 2024. The National Lottery has since become one of the most popular forms of gambling in the UK. Prizes are generally paid as a lump sum, except for the Set For Life game, which provides winnings over a fixed period. All prizes are tax-free. Of the total money spent on National Lottery games, approximately 53% is allocated to the prize fund, while 25% supports "good causes" as designated by Parliament. However, some critics consider this a "stealth tax" funding the National Lottery Community Fund. Additionally, 12% is collected as lottery duty by the UK government, 4% is paid to retailers as commission, and 5% goes to the operator, with 4% covering operational costs and 1% taken as profit. Since 22 April 2021, the minimum age to purchase National Lottery tickets and scratchcards has been 18, an increase from the previous age limit of 16.

    Recommended reading: A previous project of mine where I look at lotteries. link - Kaggle

    History

    Origins and Early Development: - Lotteries in England were largely illegal under a statute from 1698 unless specifically authorised by law. However, state lotteries were introduced to raise funds for government initiatives and war efforts. The Bank of England established early lotteries such as the Million Lottery (1694) and the Malt Lottery (1697). Later, the Betting and Lotteries Act of 1934, amended in 1956 and 1976, allowed for small-scale lotteries.

    Establishment of the National Lottery: - The modern National Lottery was created under the National Lottery etc. Act 1993, initiated by John Major’s government. The franchise was awarded to Camelot Group on 25 May 1994, and the first official draw took place on 19 November 1994. The first winning numbers were 30, 3, 5, 44, 14, and 22, with the bonus ball being 10. The jackpot was shared by seven winners, with a total prize of £5,874,778. The National Lottery remains a central aspect of UK gambling culture.

    Operational Changes and Developments: - Camelot initially used Beitel Criterion draw machines, later replaced by Smartplay Magnum I models in 2003 and Magnum II models in 2009. One of the original Beitel Criterion machines, named Guinevere, was donated to the Science Museum in London in 2022. Cyber-security has been a concern, with a notable breach in March 2018 affecting 150 accounts, though no financial losses were reported. On 1 February 2024, Allwyn Entertainment took over National Lottery operations from Camelot Group.

    Eligibility and Ticket Purchases:

    • Be at least 18 years old (requirement since April 2021).
    • Purchase tickets in person at authorised retailers in the UK or Isle of Man, or online through the National Lottery website.
    • Have a UK bank account for online purchases and be physically present in the UK or Isle of Man at the time of purchase.
    • If part of a syndicate, the ticket purchaser must meet all eligibility criteria.
    • Lottery tickets are non-transferable, and commercial syndicates charging additional fees are not permitted.
    • From its inception in November 1994 until April 2021, the minimum age to purchase National Lottery tickets and scratch cards was 16. This was increased to 18 to align with responsible gambling measures.
    • The National Lottery continues to be a significant source of entertainment and funding for public projects, with millions participating in hopes of winning life-changing prizes.

    Calculating the Probability of Winning the Jackpot:

    • Players pick six different numbers from 1 to 59, so we are counting combinations of 6 numbers from a pool of 59, denoted 59C6.
    • The first ball drawn can take any of 59 values.
    • As the first ball is not replaced, there are only 58 possible values for the second one.
    • There are 57 possible values for the third ball, 56 for the fourth, 55 for the fifth and 54 for the last ball.
    • In total there are 59 × 58 × 57 × 56 × 55 × 54 = 32,441,381,280 possible ordered draws.
    • We have to take into account the fact that it does not matter what order the numbers are drawn in.
    • Six numbers can be arranged in 6 × 5 × 4 × 3 × 2 × 1 = 720 ways.
    • This means for a 59C6 lottery, the calculation is C(59, 6) = (59 × 58 × 57 × 56 × 55 × 54) / (6 × 5 × 4 × 3 × 2 × 1).
    • Or 32,441,381,280 / 720 = 45,057,474 different combinations of six numbers.
    • Which gives us a 1 in 45,057,474 chance of winning the UK National Lottery jackpot.
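The arithmetic above can be checked in a few lines of Python using the standard library:

```python
from math import comb, factorial

ordered_draws = 59 * 58 * 57 * 56 * 55 * 54   # draws without replacement, order kept
orderings = factorial(6)                       # 720 ways to arrange the six numbers

assert ordered_draws == 32_441_381_280
assert orderings == 720
assert ordered_draws // orderings == comb(59, 6) == 45_057_474
```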

    Shiny App to Predict UK National Lottery Winning Numbers

    I've been developing the below prediction app f...

  20. UEFA 1960 TO 2022-23

    • kaggle.com
    zip
    Updated Jan 17, 2023
    Cite
    Scott Zonto (2023). UEFA 1960 TO 2022-23 [Dataset]. https://www.kaggle.com/datasets/scottzonto/uefa-1960-2022
    Explore at:
    zip(113246 bytes)Available download formats
    Dataset updated
    Jan 17, 2023
    Authors
    Scott Zonto
    Description

    "Champions of Europe: A retrospective journey through UEFA's history from 1960 to 2022-2023 - The ultimate data list" is a comprehensive collection of data on the history of the UEFA Champions League, Europe's premier club football competition. The dataset includes information on all the teams that have participated in the competition since its inception in 1960, including the home and away teams, match results, stadiums, attendance, and special win conditions. It also includes detailed information on teams' appearances, record streaks, active streaks, debut, most recent and best results. This dataset is an invaluable resource for football fans, researchers, analysts, and journalists, providing a wealth of historical data on one of the most prestigious and popular competitions in world football.
