22 datasets found
  1. Kaggle Winning Solutions Methods

    • kaggle.com
    zip
    Updated Jul 15, 2023
    Cite
    Darek Kłeczek (2023). Kaggle Winning Solutions Methods [Dataset]. https://www.kaggle.com/thedrcat/kaggle-winning-solutions-methods
    Explore at: zip (7817540 bytes)
    Dataset updated
    Jul 15, 2023
    Authors
    Darek Kłeczek
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Machine learning methods associated with Kaggle winning solution writeups. The dataset was obtained using OpenAI models and the Kaggle Solutions website.

    You can use this dataset to analyze methods needed to win a Kaggle competition :)

    Article describing the process used to collect this data. Notebook demonstrating how the data was collected.

  2. How to Win Data Science Competition

    • kaggle.com
    zip
    Updated Jan 30, 2018
    Cite
    Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition
    Explore at: zip (15845091 bytes)
    Dataset updated
    Jan 30, 2018
    Authors
    Budi Ryan
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain

    Contents

  3. Kaggle Analytics Competitions - Metadata

    • kaggle.com
    zip
    Updated Nov 1, 2022
    Cite
    Andrada (2022). Kaggle Analytics Competitions - Metadata [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/kaggle-analytics-competitions-metadata
    Explore at: zip (183843 bytes)
    Dataset updated
    Nov 1, 2022
    Authors
    Andrada
    Description

    Context

    I have gathered this data to create a small analysis (an analysis within an analysis - inception like situation) to understand what makes a notebook win a Kaggle Analytics Competition.

    Furthermore, the data lets us explore some differences in approaches between competitions and the evolution through time.

    Of course, since we are talking about an analytical approach (which, unlike a normal Kaggle competition with a KPI, cannot be quantified), there can never be an EXACT recipe. However, if we look at some quantitative features (and then at quality, by reading the notebooks), we can quickly see a pattern within the winning notebooks.

    This knowledge might help you when you approach a new challenge, and guide you onto the "right" path.

    Note: the dataset contains only PAST competitions that have already ended and the winners have been announced.

  4. Kaggle Blog: Winners' Posts

    • kaggle.com
    zip
    Updated Sep 21, 2016
    Cite
    Kaggle (2016). Kaggle Blog: Winners' Posts [Dataset]. https://www.kaggle.com/kaggle/kaggle-blog-winners-posts
    Explore at: zip (530977 bytes)
    Dataset updated
    Sep 21, 2016
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In 2010, Kaggle launched its first competition, which was won by Jure Zbontar, who used a simple linear model. Since then a lot has changed. We've seen the rebirth of neural networks, the rise of Python, the creation of powerful libraries like XGBoost, Keras and Tensorflow.

    This data set is a dump of all winners' posts from the Kaggle blog, starting with Jure Zbontar's. It allows us to track trends in the techniques, tools and libraries that win competitions.

    This is a simple dump. If there's demand, I can upload more detail (including comments and tags).

  5. mlcourse.ai - Dota 2 - winner prediction Dataset

    • kaggle.com
    zip
    Updated Sep 8, 2019
    Cite
    Sushma Biswas (2019). mlcourse.ai - Dota 2 - winner prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sushmabiswas/mlcourseai-dota-2-winner-prediction-dataset
    Explore at: zip (759868828 bytes)
    Dataset updated
    Sep 8, 2019
    Authors
    Sushma Biswas
    Description

    Context

    Hello! I am currently taking the mlcourse.ai course, and this dataset was required for one of its in-class Kaggle competitions. The data is originally hosted on git, but I like to have my data right here on Kaggle; hence this dataset.

    If you find this dataset useful, do upvote. Thank you and happy learning!

    Content

    This dataset contains 6 files in total:
    1. Sample_submission.csv
    2. Train_features.csv
    3. Test_features.csv
    4. Train_targets.csv
    5. Train_matches.jsonl
    6. Test_matches.jsonl
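    The two *_matches.jsonl files are presumably line-delimited JSON (one raw match record per line), which pandas.read_csv will not parse; a minimal loader sketch (file names taken from the list above, paths not verified against the archive):

```python
import json

def read_jsonl(path):
    """Parse a .jsonl file: one JSON object (here, one raw match) per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# The CSV files (Train_features.csv, Train_targets.csv, ...) load directly with
# pandas.read_csv; the *_matches.jsonl files need line-by-line parsing, e.g.:
# train_matches = read_jsonl("Train_matches.jsonl")
```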

    Acknowledgements

    All of the data in this dataset is originally hosted on git and the same can also be found on the in-class competition's 'data' page here.

    Inspiration

    • to be updated.
  6. LLM 20 Questions Games

    • kaggle.com
    zip
    Updated Aug 7, 2024
    Cite
    waechter (2024). LLM 20 Questions Games [Dataset]. https://www.kaggle.com/datasets/waechter/llm-20-questions-games
    Explore at: zip (189837141 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    waechter
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Episode games from https://www.kaggle.com/competitions/llm-20-questions. This dataset can be used to analyze winning strategies, or as training data.

    description:

    • index is {episodeId}_{guesser}_{answer} (2 rows for each episodeId, one per team)
    • answers: list (length nb_round) of answers given by the answerer agent
    • questions: list (length nb_round) of questions asked by the guesser agent
    • guesses: list (length nb_round) of guesses made by the guesser agent
    • keyword: keyword to be guessed
    • category: category of the keyword
    • guesser: name of guesser/asker team
    • answerer: name of answerer team
    • nb_round: int number of rounds (<20 means victory or error)
    • game_num: episodeId
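    As a sketch of how these columns fit together, a team's win can be detected by checking whether the keyword appears among its guesses (the rows below are made up for illustration; the real data comes from the zip archive, and exact string matching may need normalization):

```python
import pandas as pd

# Made-up rows using the documented columns:
games = pd.DataFrame({
    "keyword":  ["paris", "banana"],
    "guesses":  [["london", "paris"], ["apple", "pear"]],
    "guesser":  ["team_a", "team_b"],
    "nb_round": [2, 20],   # < 20 means the game ended early (victory or error)
})

# A guesser team wins when the keyword appears among its guesses.
games["won"] = [kw in gs for kw, gs in zip(games["keyword"], games["guesses"])]
win_rate = games.groupby("guesser")["won"].mean()
print(win_rate)
```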

    source

    Notebook: https://www.kaggle.com/code/waechter/llm-20-questions-games-dataset/notebook
    Meta Kaggle dataset

  7. meta-kaggle-top-voted-posts

    • kaggle.com
    zip
    Updated Jul 7, 2023
    Cite
    Vcol (2023). meta-kaggle-top-voted-posts [Dataset]. https://www.kaggle.com/datasets/vcolliym/meta-kaggle-top-voted-posts
    Explore at: zip (4662718 bytes)
    Dataset updated
    Jul 7, 2023
    Authors
    Vcol
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
    • High-level analysis results of discussion posts/writeups of winning solutions, in HTML format
    • Word count of competition tags stratified by competition types:
      • Research
      • Playground
      • Feature (no tag found under community competitions)
  8. Predicting The Lottery

    • kaggle.com
    zip
    Updated Dec 8, 2022
    Cite
    The Devastator (2022). Predicting The Lottery [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-insights-from-us-state-lottery-scratc
    Explore at: zip (425238 bytes)
    Dataset updated
    Dec 8, 2022
    Authors
    The Devastator
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Predicting The Lottery

    US State Lottery History. Impossible To Predict Or Not, Let's try!


    About this dataset

    This dataset consists of comprehensive data on U.S. state lottery scratcher games in California, Missouri, New Mexico, Oklahoma and Virginia. It gives researchers the information needed to evaluate the probability of winning each lottery game, along with related statistics. Columns include price, gameNumber, topPrize, overallOdds, topPrizeAvail, ExtraChances, secondChance and more. Also included are detailed data on winning tickets at start (regardless of whether they were claimed), total prize money at start, and total prize money unclaimed at the end date. Users will also find odds and probability calculations, including probability of winning any prize + 3 standard deviations, max tickets to buy, and expected value of any prize (as % of cost), as well as rankings from best probability of winning any prize to overall rank. Studying this dataset gives players an informed basis for making smarter choices when taking their chances on state lotteries. May the odds be ever in your favor!


    How to use the dataset

    This guide will explain how to use this dataset in detail. It will provide step-by-step instructions on how to interact with and analyze the data contained in this Kaggle dataset so that you can gain insight into your own research or project related to state lottery scratchers!

    Understand the Dataset Contents

    The contents include price information (e.g., price per play); game name and number; top prize amounts, overall odds, top prize availability, extra chances and second-chance options offered by a particular game; date fields indicating when a ticket was started, ended, and exported; the different prize amounts available per ticket with their corresponding probabilities and expected values; total prize money at start and remaining (as %); and ranks according to best probability of winning any prize or best change in those probabilities. This level of detail lets an experienced researcher perform sophisticated analysis of U.S. state lottery tickets' success rates and how they change over time.
    In short, understanding what variables are included in this dataset is necessary for analyzing them effectively!

    Describe each variable and its categories properly. Describing individual variables gives users more detailed insight into those variables and their categories, especially when many different categories are associated with a single variable (like prizes won). Furthermore, formulae should be introduced where applicable, since users may not understand why certain calculations were done (such as calculating expected value). All such things should be clarified in descriptions rather than just listing numerical values without explanation!
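    For instance, the "Expected Value Of Any Prize (as % of cost)" column can be explained with a small worked example (the ticket price and prize tiers below are made-up numbers, not values from the dataset):

```python
# Expected value = sum over prize tiers of (prize amount * probability of winning it),
# expressed as a percentage of the ticket price.
ticket_price = 5.0
prize_tiers = [(5.0, 0.10), (10.0, 0.05), (100.0, 0.001)]  # (amount, probability)

ev = sum(amount * prob for amount, prob in prize_tiers)
ev_pct_of_cost = 100 * ev / ticket_price
print(f"expected value ${ev:.2f} = {ev_pct_of_cost:.0f}% of cost")
```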

    Analyze differences between states using appropriate graphs and diagrams. Data visualization plays an essential role while trying out various

    Research Ideas

    • Analyzing the effectiveness of marketing campaigns for various state lotteries by examining sales of different scratcher tickets.
    • Examining the lottery scratcher game price points to identify selling opportunities or trends in preferences across states.
    • Utilizing the data to apply statistics and modeling techniques to project future expected values from similar scratch games across different states

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: CAratingstable.csv | Column name | Description ...

  9. YIEDL Competition Data (updated daily)

    • kaggle.com
    zip
    Updated Jan 10, 2025
    Cite
    Joakim Arvidsson (2025). YIEDL Competition Data (updated daily) [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/yiedl-competition/versions/80
    Explore at: zip (9274033415 bytes)
    Dataset updated
    Jan 10, 2025
    Authors
    Joakim Arvidsson
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset updates daily for the Numerai Crypto data (daily competition), and weekly on Mondays for the Yiedl.ai weekly competition. The Yiedl data contains the most recent dataset from yiedl.ai, as well as a quickstarter notebook. It now also includes the Numerai Crypto daily data (including historical), which may be useful in both competitions. It should be everything you need to get started in these cryptocurrency prediction competitions.

    You can apply for an airdrop of 100 $YIEDL tokens here, which you can use to stake on your predictions to earn more tokens if your predictions are correct (or burn tokens if they are not).

    Experienced data scientists can apply for a grant of an additional 5000 $YIEDL tokens, if approved.

    The $YIEDL token is a recently launched token on the Polygon blockchain. More information can be found at the below links.

  10. 2016 March ML Mania Predictions

    • kaggle.com
    zip
    Updated Nov 15, 2017
    Cite
    Will Cukierski (2017). 2016 March ML Mania Predictions [Dataset]. https://www.kaggle.com/datasets/wcukierski/2016-march-ml-mania
    Explore at: zip (28950066 bytes)
    Dataset updated
    Nov 15, 2017
    Authors
    Will Cukierski
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.

    How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.

    First round predictions

    The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.

    Data Description

    Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:

    TeamName_TeamId_SubmissionId.csv

    The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:

    Id,Pred
    2016_1112_1114,0.6
    2016_1112_1122,0
    ...

    The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
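    A minimal sketch of reading a prediction in this format (the helper functions are mine, not part of the dataset):

```python
def parse_game_id(game_id):
    """Split an Id like '2016_1112_1114' into (year, lower_team_id, higher_team_id)."""
    year, low, high = game_id.split("_")
    return int(year), int(low), int(high)

def win_probability(game_id, pred, team_id):
    """P(team_id wins), given Pred = P(lower-Id team beats higher-Id team)."""
    _, low, high = parse_game_id(game_id)
    if team_id == low:
        return pred
    if team_id == high:
        return 1.0 - pred
    raise ValueError(f"team {team_id} is not in game {game_id}")

# "2016_1112_1114,0.6": team 1112 beats team 1114 with probability 0.6,
# so team 1114 wins with probability 1 - 0.6.
print(win_probability("2016_1112_1114", 0.6, 1114))
```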

    For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.

  11. MCTS | Extra training data

    • kaggle.com
    zip
    Updated Dec 4, 2024
    Cite
    James Day (2024). MCTS | Extra training data [Dataset]. https://www.kaggle.com/datasets/jsday96/mcts-extra-training-data
    Explore at: zip (22372652 bytes)
    Dataset updated
    Dec 4, 2024
    Authors
    James Day
    Description

    Contains the extra training data used in the 1st place solution to the MCTS competition.

    More specifically, it contains the following types of files:

    • ExtraAnnotatedGames_v{version_number}.csv - holds the generated rulesets, features describing those rulesets, and labels computed by simulating matches between pairs of agents. "v6" is the full-scale version used in the winning solution; "v4" is the half-scale version used in earlier experiments and discussed in a couple of forum threads.
    • StartingPositionEvals/{rulesets_origin}_{mcts_config}_{runtime_per_ruleset}s_v2_r{run_id}.json - game balance metrics, examined action counts, and search iteration counts for each ruleset in each dataset (those provided by the competition organizer plus extra rulesets I generated).
    • RecomputedFeatureEstimates.json - estimates of the values of all the nondeterministic features for all rulesets from both data sources (organizer + generated), computed by re-annotating all the rulesets 5 times with 15 trials per run, scaling the hardware-speed-specific features to account for hardware differences, and averaging the estimated feature values from all 5 runs to produce less-noisy values.
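    A hedged sketch of how the per-run evaluation files might be gathered: the regex mirrors the naming scheme quoted above, while the directory layout and JSON structure are assumptions, not verified against the archive.

```python
import glob
import json
import re

# Pattern for StartingPositionEvals/{origin}_{mcts_config}_{runtime}s_v2_r{run_id}.json
name_re = re.compile(
    r"(?P<origin>.+?)_(?P<mcts_config>.+)_(?P<runtime>\d+)s_v2_r(?P<run_id>\d+)\.json$"
)

for path in sorted(glob.glob("StartingPositionEvals/*_v2_r*.json")):
    m = name_re.search(path)
    if m:
        with open(path) as f:
            evals = json.load(f)  # assumed: one JSON document per run
        print(m.group("mcts_config"), m.group("run_id"), len(evals))
```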

  12. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at: zip (121674455 bytes)
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    # https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r

    ### Variables for downloaded files
    data.dir <- ' '
    train.file <- paste0(data.dir, 'training.csv')
    test.file <- paste0(data.dir, 'test.csv')

    ### Load the csv files -- each creates a data.frame, where each column can have a different type.
    d.train <- read.csv(train.file, stringsAsFactors = F)
    d.test <- read.csv(test.file, stringsAsFactors = F)

    ### In training.csv we have 7049 rows, each with 31 columns.
    ### The first 30 columns are keypoint locations, which R correctly identified as numbers.
    ### The last one is a string representation of the image.

    ### To look at samples of the data:
    head(d.train)

    ### Save the Image column as a separate variable and remove it from the data frames.
    ### Assigning NULL to a column removes it from a data frame.
    im.train <- d.train$Image
    d.train$Image <- NULL  # removes 'Image' from the data frame
    im.test <- d.test$Image
    d.test$Image <- NULL   # removes 'Image' from the data frame

    ### Each image is represented as a series of numbers stored as a string.
    ### Convert these strings to integers by splitting them and converting the result:
    ### strsplit splits the string, unlist simplifies its output to a vector of strings,
    ### and as.integer converts that to a vector of integers.
    as.integer(unlist(strsplit(im.train[1], " ")))
    as.integer(unlist(strsplit(im.test[1], " ")))

    ### Install and load the appropriate libraries.
    ### The tutorial is meant for Linux and OS X, which use a different parallel backend,
    ### so here replace all instances of %dopar% with %do%.
    install.packages('foreach')
    library("foreach", lib.loc="~/R/win-library/3.3")

    ### Convert all images
    im.train <- foreach(im = im.train, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    im.test <- foreach(im = im.test, .combine=rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    # The foreach loop evaluates the inner expression for each image and combines the
    # results with rbind (combine by rows). With %dopar% the evaluations would run in
    # parallel; %do% runs them sequentially.
    # im.train is now a matrix with 7049 rows (one per image) and 9216 columns (one per pixel).

    ### Save all four variables in a data.Rd file; reload them at any time with load('data.Rd').
    save(d.train, im.train, d.test, im.test, file='data.Rd')
    load('data.Rd')

    # Each image is a vector of 96*96 = 9216 pixels.
    # Convert these 9216 integers into a 96x96 matrix:
    im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
    # im.train[1,] returns the first row of im.train, i.e. the first training image.
    # rev reverses the vector to match R's image function, which expects the origin
    # to be in the lower left corner.

    # Visualize the image with R's image function:
    image(1:96, 1:96, im, col=gray((0:255)/255))

    # Color the coordinates of the eyes and nose:
    points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
    points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
    points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

    # Another good check is to see how variable the data is.
    # For example, where are the centers of the noses in the 7049 images? (slow):
    for(i in 1:nrow(d.train)) {
      points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
    }

    # There are quite a few outliers -- they could be labeling errors. Looking at one
    # extreme example shows no labeling error, but also that not all faces are centered:
    idx <- which.max(d.train$nose_tip_x)
    im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
    image(1:96, 1:96, im, col=gray((0:255)/255))
    points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

    # One of the simplest baselines: compute the mean coordinates of each keypoint in
    # the training set and use those as the prediction for all images.
    colMeans(d.train, na.rm=T)

    # To build a submission file, apply these computed coordinates to the test instances:
    p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
    colnames(p) <- names(d.train)
    predictions <- data.frame(ImageId = 1:nrow(d.test), p)
    head(predictions)

    # The expected submission format has one keypoint per row, which we can get with
    # the help of the reshape2 library:
    install.packages('reshape2')
    library(...

  13. Argoverse-HD

    • kaggle.com
    zip
    Updated May 8, 2021
    Cite
    Martin Li (2021). Argoverse-HD [Dataset]. https://www.kaggle.com/mtlics/argoversehd
    Explore at: zip (31334725392 bytes)
    Dataset updated
    May 8, 2021
    Authors
    Martin Li
    Description

    This dataset is built for streaming object detection, for more details please check out the dataset webpage.

    Competition

    The competition on this dataset is hosted on Eval.AI; enter the challenge to win prizes and present at the CVPR 2021 Workshop on Autonomous Driving.


    Dataset comparison chart: http://www.cs.cmu.edu/~mengtial/proj/streaming/img/dataset-compare.png

  14. ICPC WF Ranking Results (1999 - Present) Datasets

    • kaggle.com
    zip
    Updated Sep 23, 2024
    Cite
    Hoang Le Ngoc (2024). ICPC WF Ranking Results (1999 - Present) Datasets [Dataset]. https://www.kaggle.com/datasets/justinianus/icpc-world-finals-ranking-since-1999
    Explore at: zip (284532 bytes)
    Dataset updated
    Sep 23, 2024
    Authors
    Hoang Le Ngoc
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Information

    The ICPC World Finals Ranking Dataset, available on Kaggle, provides an extensive overview of performance metrics for teams from global universities participating in the International Collegiate Programming Contest (ICPC) World Finals since 1999. This dataset includes information such as team rank, representing university, competition year, and the university's country.

    The ICPC is an internationally renowned programming competition where student teams tackle algorithmic problems within a set timeframe. The contest, organized by the Association for Computing Machinery (ACM), progresses through multiple rounds, including regional and online contests, culminating in the finals.

    This dataset offers insights into the universities that have performed outstandingly in the ICPC World Finals from 1999 to the present. It features 21 attributes, such as team name, rank, university name, university's region and country, and the number of problems solved during the contest. Additionally, it contains data on the teams' rankings in the regional contests, which serve as qualifiers for the world finals.

    The dataset is an invaluable tool for statistical and trend analysis, as well as for developing machine learning models. Researchers can utilize it to pinpoint universities and countries with consistent high performance over the years, examine the distribution of problems solved by teams across various years, and forecast future contest results based on past achievements. Moreover, educators and mentors can leverage this dataset to discern essential concepts to aid students in contest preparation.

    *Created by Microsoft Copilot, an AI Language Model*

    Changelog

    • 03/04/2023: Initial dataset creation with icpc-full.csv, detailing results across all years.
    • 10/04/2023: Introduction of new column Prize, detailing champions from all world and regional contests. Segregation of results by year into separate files icpc-xxxx.csv.
    • 17/04/2023: The dataset received a Bronze medal. Published the first notebook. Thanks for the upvotes!
    • 23/04/2023: Changed the data type of the Rank column to integer (no longer contains string data).
    • 30/04/2024: Updated the ICPC WF Luxor 2022-2023 data with the files icpc-2022.csv, icpc-2023.csv, and the revised icpc-full.csv.
    • 23/09/2024: Updated the ICPC WF Astana 2024 data with the files icpc-2024.csv and the revised icpc-full.csv.

    About the Author

    I created this dataset to preserve all information about the ICPC World Finals, which was my first passion when I started in IT. I haven't had the chance to attend the World Finals, but I have won some awards in the ICPC Asia Regional contests and qualified for the World Finals in the Asia Pacific region, held by OLP/ICPC Vietnam:
    • Competed in the ICPC 2020 Asia Can Tho Regional Contest. (Team: No Girl No AC - Hoang Le Ngoc, Huy Nguyen Nhat, Man Ha Xuan)
    • Competed in the ICPC 2021 Asia Hanoi Regional Contest. (Team: The Phoenix Rises - Hoang Le Ngoc, Huy Nguyen Nhat, Phuoc Cao Xuan)
    • Won a Bronze medal in the ICPC 2022 Asia Ho Chi Minh City Regional Contest. (Team: HUSC.[401]_UnauthorizeD - Hoang Le Ngoc, Toan Le Sy, Hai Ngo Van)
    • Won a Third Prize (Vietnam teams) in the ICPC 2023 Asia Hue City Regional Contest. (Team: HUSC.GreedForSpeed - Hoang Le Ngoc, Toan Le Sy, Hai Ngo Van)
    • Won a Consolation Prize (Vietnam teams) in the ICPC 2024 Asia Hanoi Regional Contest. (Team: HUSC.Newbie - Hoang Le Ngoc, Toan Le Sy, Hai Ngo Van)

    My Challenge:
    • Predict the university that the next champion team will come from.
    • Determine if your university or country can win a medal in future World Finals.

    I hope you find it useful. Feel free to upvote and comment if you have any questions. With love from Vietnam <3

  15. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    https://i.imgur.com/2Egeb8R.png

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
      • Creating the database tables with the right data types and constraints, by running the db_abd_create_tables.sql script.
      • Downloading the CSV files from Kaggle using the Kaggle API.
      • Cleaning the data using pandas, by running the clean_data.py script. The script performs the following steps for each table:
        • Drops the columns that are not needed.
        • Converts each column to the right data type.
        • Replaces foreign keys that do not exist with NULL.
        • Replaces some of the missing values with default values.
        • Removes rows with missing values in the primary key / not-null columns.
        • Removes duplicate rows.
      • Loading the data into the database using the LOAD DATA INFILE command.
      • Checking that the number of rows in each database table matches the number of rows in the corresponding CSV file.
      • Adding foreign key constraints to the database tables, by running the add_foreign_keys.sql script.
      • Updating the Total columns in the database tables, by running the update_totals.sql script.
      • Backing up the database.
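The per-table cleaning steps described above can be sketched with pandas. This is an illustrative sketch, not the actual clean_data.py API: the function name and its parameters are assumptions.

```python
import pandas as pd

def clean_table(df, keep_cols, dtypes, pk_cols, fk_checks):
    """Clean one Meta-Kaggle table following the steps above.

    fk_checks maps a foreign-key column to the set of valid ids;
    dangling ids are replaced with NA instead of breaking joins.
    """
    df = df[keep_cols].copy()                      # drop columns that are not needed
    for col, dtype in dtypes.items():              # convert each column to the right type
        df[col] = df[col].astype(dtype)
    for col, valid_ids in fk_checks.items():       # null out foreign keys that do not exist
        df.loc[~df[col].isin(valid_ids), col] = pd.NA
    df = df.dropna(subset=pk_cols)                 # rows must have a primary key
    df = df.drop_duplicates()                      # remove duplicate rows
    return df
```

For example, cleaning a toy ForumMessages table against a Users table nulls out the UserId values with no matching user, drops the row missing its Id, and removes the duplicate row.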
  16. BirdCLEF 2024 | Best Working Note - Source Code

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    Hugo de Heer (2024). BirdCLEF 2024 | Best Working Note - Source Code [Dataset]. https://www.kaggle.com/datasets/hugodeheer/bird-source/code
    Explore at:
    zip(83018855 bytes)Available download formats
    Dataset updated
    Jun 20, 2024
    Authors
    Hugo de Heer
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Source Code of Team Epoch IV's submission for BirdCLEF 2024.

    Since we worked on a Python project rather than a notebook for easier collaboration, we developed our code locally and uploaded it to Kaggle for submissions. By uploading our source code to a dataset, we could run this code from a notebook. Additionally, we train our models locally and add these to this dataset as we only need to run inference on the Kaggle notebook.

    To use this code in your own notebook, add this dataset and run:

    !python3 "submit.py"
    

    For the full source code, which also includes the code for training models, as well as our award-winning working note, see our repository on GitHub.

  17. ML Competition on Cryptocurrency Market Data

    • kaggle.com
    zip
    Updated Nov 23, 2021
    Cite
    YIEDL (2021). ML Competition on Cryptocurrency Market Data [Dataset]. https://www.kaggle.com/datasets/rocketcapital/ml-competition-on-cryptocurrency-market-data
    Explore at:
    zip(291744236 bytes)Available download formats
    Dataset updated
    Nov 23, 2021
    Authors
    YIEDL
    Description

    Context

    The world of Asset Management today, from a technological point of view, is mainly linked to mature but inefficient supply chains, which merge discretionary and quantitative forecasting models. The financial industry has been working for years to overcome this paradigm, pushing beyond technology and making use not only of automated models (trading systems and dynamic asset allocation systems) but also of modern Machine Learning techniques for Time Series Forecasting and Unsupervised Learning for the classification of financial instruments. However, in most cases it relies on proprietary technologies that are limited by definition (workforce, technology investment, scalability).

    Numerai, an offshoot of Jim Simons’ Renaissance Technologies, was the first to blaze a new path by building a centralized machine learning competition, gathering a swarm of predictors outside the company to integrate with internal intelligence. The discretionary contribution was therefore eliminated, and the information content generated internally was enriched by thousands of external contributors, in many cases from sectors unrelated to the financial industry, such as energy, aerospace, or biotechnology. This overcomes the notion that good market forecasts require only financial-industry skills.

    What we have just described is the starting point of Rocket Capital Investment. To overcome the limits of Numerai's approach, a new competition has been engineered, with the ambition to make this project even more “democratic”. How? By decentralizing, thanks to the Blockchain, the entire chain of participant management, collection, and validation of forecasts, as well as decisions relating to the evaluation and remuneration of the participants themselves. In this way, every aspect of the competition is completely transparent and inviolable. Everything is managed by a Smart Contract, whose rules are known and shared. Let’s find out in more detail what it is.

    Starting from the idea of Numerai, we have completely re-engineered all aspects related to the management of participants, Scoring, and Reward, following the concept of decentralization of the production chain. To this end, a proprietary token (MUSA token) has been created which acts as an exchange currency and which integrates a smart contract that acts as an autonomous competition manager. The communication interface between the users and the smart contract is a DApp (“Decentralized Application”). But let’s see in more detail how all these elements combine with each other, like in a puzzle.

    Competition Technicalities

    A suitably normalized dataset is issued every week, containing data from over 400 cryptocurrencies. For each asset, the data relating to prices, volumes traded, quantitative elements, as well as alternative data (information on the blockchain and on the sentiment of the various providers) are aggregated. Another difference with Numerai is the ability to distinguish assets for each row (the first column shows the related ticker). The last column instead contains the question to which the Data Scientists are asked to give an answer: the relative strength ranking of each asset, built on the forecast of the percentage change expected in the following week.

    Registration for the Competition takes place by providing, in a completely anonymous way, the address of a crypto wallet on which the MUSA tokens are loaded. From that moment on, the MUSAs become, to all intents and purposes, the currency of exchange between participants and organizers. Every Monday a new Challenge opens, and all Data Scientists registered in the Contest are asked to use their models to generate predictions. By accessing the DApp, the participant can download the new dataset, complete with the history of the previous weeks and the last useful week. At this point the participant can perform two actions in sequence directly from the DApp:
    • Staking: MUSA tokens are placed on your prediction.
    • Submission: the forecast for the following week is uploaded to the blockchain.

    Since the forecast consists of a series of numbers between 0 and 1 associated with each asset, it is very easy, the following week, to calculate the error committed in terms of RMSE (“Root Mean Square Error”). This allows creating a ranking of the participants, in order to reward them accordingly with additional MUSA tokens. But let’s see in more detail how the Smart Contract allows us to differentiate the reward based on different items (all, again, in a completely transparent and verifiable way):
    • Staking Reward: the mere fact of participating in the competition is remunerated. In future versions, it will also be possible to bet on the goodness of the other participants’ predictions.
    • Challenge Rew...
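A minimal sketch of the RMSE scoring described above, using numpy. This only illustrates the metric; the actual on-chain computation by the Smart Contract is not shown here.

```python
import numpy as np

def rmse(predicted, realized):
    """Root Mean Square Error between a submitted forecast and the realized one.

    Both inputs are per-asset scores in [0, 1]; a lower RMSE ranks higher.
    """
    predicted = np.asarray(predicted, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.sqrt(np.mean((predicted - realized) ** 2)))

# A perfect forecast scores 0; the worst possible forecast on [0, 1] scores 1.
assert rmse([0.2, 0.8], [0.2, 0.8]) == 0.0
assert rmse([1.0, 0.0], [0.0, 1.0]) == 1.0
```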

  18. nfl-big-data-bowl-2021 Feather files

    • kaggle.com
    zip
    Updated Oct 15, 2020
    Cite
    Mathurin Aché (2020). nfl-big-data-bowl-2021 Feather files [Dataset]. https://www.kaggle.com/mathurinache/nflbigdatabowl2021-feather-files
    Explore at:
    zip(475231363 bytes)Available download formats
    Dataset updated
    Oct 15, 2020
    Authors
    Mathurin Aché
    Description

    When a quarterback takes a snap and drops back to pass, what happens next may seem like chaos. As offensive players move in various patterns, the defense works together to prevent successful pass completions and then to quickly tackle receivers that do catch the ball. In this year’s Kaggle competition, your goal is to use data science to better understand the schemes and players that make for a successful defense against passing plays.

    In American football, there are a plethora of defensive strategies and outcomes. The National Football League (NFL) has used previous Kaggle competitions to focus on offensive plays, but as the old proverb goes, “defense wins championships.” Though metrics for analyzing quarterbacks, running backs, and wide receivers are consistently a part of public discourse, techniques for analyzing the defensive part of the game lag behind. Identifying player, team, or strategic advantages on the defensive side of the ball would be a significant breakthrough for the game.

    This competition uses NFL’s Next Gen Stats data, which includes the position and speed of every player on the field during each play. You’ll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays. There are several different directions for participants to ‘tackle’ (ha), which may require varying levels of football savvy, data aptitude, and creativity. As examples:

    • What are the coverage schemes (man, zone, etc.) that the defense employs? Which coverage options tend to perform better?
    • Which players are the best at closely tracking receivers as they try to get open?
    • Which players are the best at closing on receivers when the ball is in the air?
    • Which players are the best at defending pass plays when the ball arrives?
    • Is there any way to use player tracking data to predict whether or not certain penalties – for example, defensive pass interference – will be called?
    • Who are the NFL’s best players against the pass?
    • How does a defense react to certain types of offensive plays?
    • Is there anything about a player – for example, their height, weight, experience, speed, or position – that can be used to predict their performance on defense?

    What does data tell us about defending the pass play? You are about to find out.

    Note: Are you a university participant? Students have the option to participate in a college-only Competition, where you’ll work on the identical themes above. Students can opt-in for either the Open or College Competitions, but not both.

  19. How could we win the next UK National Lottery ?

    • kaggle.com
    zip
    Updated Jun 12, 2025
    Cite
    Patrick L Ford (2025). How could we win the next UK National Lottery ? [Dataset]. https://www.kaggle.com/datasets/patricklford/how-could-we-win-the-next-uk-national-lottery/code
    Explore at:
    zip(59204 bytes)Available download formats
    Dataset updated
    Jun 12, 2025
    Authors
    Patrick L Ford
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Introduction

    The National Lottery is the state-franchised lottery in the United Kingdom, established in 1994. It is regulated by the Gambling Commission and operated by Allwyn Entertainment, which took over from Camelot Group on 1 February 2024. The National Lottery has since become one of the most popular forms of gambling in the UK. Prizes are generally paid as a lump sum, except for the Set For Life game, which provides winnings over a fixed period. All prizes are tax-free. Of the total money spent on National Lottery games, approximately 53% is allocated to the prize fund, while 25% supports "good causes" as designated by Parliament. However, some critics consider this a "stealth tax" funding the National Lottery Community Fund. Additionally, 12% is collected as lottery duty by the UK government, 4% is paid to retailers as commission, and 5% goes to the operator, with 4% covering operational costs and 1% taken as profit. Since 22 April 2021, the minimum age to purchase National Lottery tickets and scratchcards has been 18, an increase from the previous age limit of 16.

    Recommended reading: A previous project of mine where I look at lotteries. link - Kaggle

    History

    Origins and Early Development: - Lotteries in England were largely illegal under a statute from 1698 unless specifically authorised by law. However, state lotteries were introduced to raise funds for government initiatives and war efforts. The Bank of England established early lotteries such as the Million Lottery (1694) and the Malt Lottery (1697). Later, the Betting and Lotteries Act of 1934, amended in 1956 and 1976, allowed for small-scale lotteries.

    Establishment of the National Lottery: - The modern National Lottery was created under the National Lottery etc. Act 1993, initiated by John Major’s government. The franchise was awarded to Camelot Group on 25 May 1994, and the first official draw took place on 19 November 1994. The first winning numbers were 30, 3, 5, 44, 14, and 22, with the bonus ball being 10. The jackpot was shared by seven winners, with a total prize of £5,874,778. The National Lottery remains a central aspect of UK gambling culture.

    Operational Changes and Developments: - Camelot initially used Beitel Criterion draw machines, later replaced by Smartplay Magnum I models in 2003 and Magnum II models in 2009. One of the original Beitel Criterion machines, named Guinevere, was donated to the Science Museum in London in 2022. Cyber-security has been a concern, with a notable breach in March 2018 affecting 150 accounts, though no financial losses were reported. On 1 February 2024, Allwyn Entertainment took over National Lottery operations from Camelot Group.

    Eligibility and Ticket Purchases:

    • Be at least 18 years old (requirement since April 2021).
    • Purchase tickets in person at authorised retailers in the UK or Isle of Man, or online through the National Lottery website.
    • Have a UK bank account for online purchases and be physically present in the UK or Isle of Man at the time of purchase.
    • If part of a syndicate, the ticket purchaser must meet all eligibility criteria.
    • Lottery tickets are non-transferable, and commercial syndicates charging additional fees are not permitted.
    • From its inception in November 1994 until April 2021, the minimum age to purchase National Lottery tickets and scratch cards was 16. This was increased to 18 to align with responsible gambling measures.
    • The National Lottery continues to be a significant source of entertainment and funding for public projects, with millions participating in hopes of winning life-changing prizes.

    Calculating the Probability of Winning the Jackpot:

    • Players pick six different numbers from 1 to 59, so we are counting combinations of 6 numbers from a pool of 59, denoted 59C6.
    • The first ball drawn can take any of 59 values.
    • As the first ball is not replaced, there are only 58 possible values for the second one.
    • There are 57 possible values for the third ball, 56 for the fourth, 55 for the fifth and 54 for the last ball.
    • In total there are 59 × 58 × 57 × 56 × 55 × 54 = 32,441,381,280 possible ordered draws.
    • We have to take into account the fact that it does not matter what order the numbers are drawn in.
    • Six numbers can be arranged in 6 × 5 × 4 × 3 × 2 × 1 = 720 ways.
    • This means for a 59C6 lottery, the calculation is C(59, 6) = (59 × 58 × 57 × 56 × 55 × 54) / (6 × 5 × 4 × 3 × 2 × 1).
    • Or 32,441,381,280 / 720 = 45,057,474 different combinations of six numbers.
    • Which gives us a 1 in 45,057,474 chance of winning the UK National Lottery jackpot.
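The arithmetic above can be checked in a few lines of Python using the standard library:

```python
from math import comb, factorial

ordered_draws = 59 * 58 * 57 * 56 * 55 * 54   # draws without replacement, order kept
orderings = factorial(6)                       # 720 ways to arrange the six numbers

assert ordered_draws == 32_441_381_280
assert orderings == 720
assert ordered_draws // orderings == comb(59, 6) == 45_057_474
```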

    Shiny App to Predict UK National Lottery Winning Numbers

    I've been developing the below prediction app f...

  20. UEFA 1960 TO 2022-23

    • kaggle.com
    zip
    Updated Jan 17, 2023
    Cite
    Scott Zonto (2023). UEFA 1960 TO 2022-23 [Dataset]. https://www.kaggle.com/datasets/scottzonto/uefa-1960-2022
    Explore at:
    zip(113246 bytes)Available download formats
    Dataset updated
    Jan 17, 2023
    Authors
    Scott Zonto
    Description

    "Champions of Europe: A retrospective journey through UEFA's history from 1960 to 2022-2023 - The ultimate data list" is a comprehensive collection of data on the history of the UEFA Champions League, Europe's premier club football competition. The dataset includes information on all the teams that have participated in the competition since its inception in 1960, including the home and away teams, match results, stadiums, attendance, and special win conditions. It also includes detailed information on teams' appearances, record streaks, active streaks, debut, most recent and best results. This dataset is an invaluable resource for football fans, researchers, analysts, and journalists, providing a wealth of historical data on one of the most prestigious and popular competitions in world football.
