License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
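For example, the path for a given KernelVersion id can be derived as in this minimal Python sketch (the file extension and the unpadded folder names are assumptions to verify against the actual files):

```python
# Sketch: map a KernelVersion id to its location in the two-level layout.
# Assumptions: folder names are unpadded integers; extension varies (.py/.R/.ipynb).
def kernel_version_path(version_id: int, ext: str = "ipynb") -> str:
    top = version_id // 1_000_000            # millions bucket, e.g. 123
    sub = (version_id % 1_000_000) // 1_000  # thousands bucket, e.g. 456
    return f"{top}/{sub}/{version_id}.{ext}"

print(kernel_version_path(123456789))  # -> 123/456/123456789.ipynb
```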
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
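For instance, a requester-pays download with the google-cloud-storage Python client might look like the following sketch (the object path inside the bucket is an assumption; the project you pass is the one billed for the transfer):

```python
from google.cloud import storage

project = "YOUR_GCP_PROJECT"  # GCP project with billing enabled; it pays for the transfer
client = storage.Client(project=project)

# user_project enables requester-pays billing against your project.
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project=project)
blob = bucket.blob("123/456/123456789.ipynb")  # illustrative object path, not confirmed
blob.download_to_filename("123456789.ipynb")
```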
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
| --- | --- |
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
| --- | --- |
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
| --- | --- |
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
| --- | --- |
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv links to kernels_meta.csv via the kernel_id column, and kernels_meta.csv links to competitions_meta.csv via comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
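A minimal pandas sketch of these joins (file and column names as listed in the tables above):

```python
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels = pd.read_csv("kernels_meta.csv")
competitions = pd.read_csv("competitions_meta.csv")

# code block -> notebook metadata (kernel_id), then -> competition metadata (comp_name)
blocks_with_meta = code_blocks.merge(kernels, on="kernel_id", how="left")
full = blocks_with_meta.merge(competitions, on="comp_name", how="left")
```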
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
If you need help setting this up to use in a notebook with the internet off, check this notebook: https://www.kaggle.com/code/narnaoot/installing-packages-without-internet-for-kaggle
I have gathered this data to create a small analysis (an analysis within an analysis, an inception-like situation) to understand what makes a notebook win a Kaggle Analytics Competition.
Furthermore, the data lets us explore some differences in approaches between competitions and the evolution through time.
Of course, as we are talking about an analytical approach (which, unlike a normal Kaggle competition with a KPI, cannot be quantified), there can never be an EXACT recipe. However, if we look at some quantitative features (and then qualitative ones, by reading the notebooks), we can quickly see a pattern within the winning notebooks.
This knowledge might help you when you approach a new challenge, as well as guide you on the "right" path.
Note: the dataset contains only PAST competitions that have already ended and the winners have been announced.
Dataset Summary
Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data
This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.
Columns
id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.
It has 100 rows and 13 columns. The columns' descriptions are listed below.
Data from Kaggle. Image from Smartcat.
If you're reading this, please upvote.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created to provide a stable, reliable data source for notebooks, avoiding the 'deleted-dataset' errors that can occur with the frequently-updated official Meta Kaggle dataset.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
I produced the dataset whilst working on the 2023 Kaggle AI report. The Meta Kaggle dataset provides helpful information about the Kaggle competitions but not the original descriptive text from the Kaggle web pages for each competition. We have information about the solutions but not the original problem. So, I wrote some web scraping scripts to collect and store that information.
Not all Kaggle web pages have that information available; some are missing or broken, hence the nulls in the data. Also note that not all previous Kaggle competitions exist in the Meta Kaggle data, which was used to collect the webpage slugs.
The scraping scripts iterate over the IDs in Meta Kaggle's competitions.csv data and attempt to collect the webpage data for that competition if it is currently null in the database. Hence, new IDs will cause the scripts to go and collect their data, and each week the scripts will try to fill in any links that were not working previously.
I have recently converted the original local scraping scripts on my machine into a Kaggle notebook that now updates this dataset weekly on Mondays. The notebook also explains the scraping procedure and its automation to keep this dataset up-to-date.
Note that the CompetitionId field joins to the Id of the competitions.csv of the Meta Kaggle dataset so that this information can be combined with the rest of Meta Kaggle.
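For example, the join could be done in pandas as sketched below (the file name of this dataset's CSV is hypothetical):

```python
import pandas as pd

competitions = pd.read_csv("competitions.csv")  # from Meta Kaggle
descriptions = pd.read_csv("competition_descriptions.csv")  # hypothetical name for this dataset's file

# CompetitionId in this dataset joins to Id in Meta Kaggle's competitions.csv.
merged = descriptions.merge(competitions, left_on="CompetitionId", right_on="Id", how="left")
```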
My primary reason for collecting the data was for some text classification work I wanted to do, and I will publish it here soon. I hope that the data is useful to some other projects as well :-)
The 2023 Kaggle AI Report Competition required all notebooks to be made public prior to the July 5th deadline. This dataset contains a preliminary list of all of those notebooks, sorted by category.
See the competition overview, data, evaluation, submission instructions, and timeline pages for more detail about the competition itself.
Description 👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial that walks you through step by step how to make your first submission!
The Challenge The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Recommended Tutorial: We highly recommend Alexis Cook’s Titanic Tutorial, which walks you through making your very first submission step by step, and this starter notebook to get started.
How Kaggle’s Competitions Work
- Join the Competition: Read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
- Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs), and generate a prediction file.
- Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.
- Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.
- Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.

Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!
What Data Will I Use in This Competition? In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.
How to Submit your Prediction to Kaggle Once you’re ready to make a submission and get on the leaderboard:
Click on the “Submit Predictions” button
Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.
Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)

Got it! I’m ready to get started. Where do I get help if I need it? For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
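As a concrete sketch of the submission format described above, a valid file can be produced with pandas (the all-zeros prediction is just an illustrative baseline):

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Illustrative baseline: predict that no passenger survived.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": 0,  # binary prediction: 1 for survived, 0 for deceased
})
submission.to_csv("submission.csv", index=False)  # 418 rows plus a header
```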
A Last Word on Kaggle Notebooks As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...
This dataset was obtained using four similar web scrapers written in Python; more information in the content below.
topKagglersCompetitions.csv: contains the top Kagglers in Competitions, without biography data. Scraper used: https://www.kaggle.com/ajpass/web-scrapping-vol-7-kaggle-competitions
topKagglersDatasets.csv: contains the top Kagglers in Datasets, without biography data. Scraper used: https://www.kaggle.com/ajpass/data-mining-web-scrapping-vol-4-kaggle-datasets2
topKagglersDiscussion.csv: contains the top Kagglers in Discussions, without biography data. Scraper used: https://www.kaggle.com/ajpass/web-scrapping-vol-6-kaggle-discussions
topKagglersNotebooks.csv: contains the top Kagglers in Notebooks, without biography data. Scraper used: https://www.kaggle.com/ajpass/data-mining-web-scrapping-vol-5-kaggle-notebooks
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains the pre-downloaded GPT-2 model and tokenizer files for offline use in Kaggle notebooks. It enables participants to use GPT-2 without requiring internet access, ensuring compliance with competition rules that restrict internet usage.
The dataset includes:
- GPT-2 Model: Config file, weights (model.safetensors), and other necessary files.
- GPT-2 Tokenizer: Vocabulary, merges, and tokenizer configuration files.
Use this dataset to load GPT-2 seamlessly into your notebook for generating text or other applications.
Contents:
- gpt2_model.zip: Contains model weights and configuration files.
- gpt2_tokenizer.zip: Contains tokenizer configuration and vocabulary files.
Usage:
Add this dataset to your notebook via the Kaggle dataset panel. Unzip the files and load them using the Hugging Face Transformers library with the from_pretrained method, pointing to the unzipped directories.
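For example, after unzipping gpt2_model.zip and gpt2_tokenizer.zip, loading might look like this sketch (the extraction paths are assumptions):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Paths assume the zips were extracted to these directories; adjust to your setup.
model = GPT2LMHeadModel.from_pretrained("/kaggle/working/gpt2_model")
tokenizer = GPT2Tokenizer.from_pretrained("/kaggle/working/gpt2_tokenizer")

inputs = tokenizer("Hello, Kaggle!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```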
Licenses: The dataset reuses open-source GPT-2 files available under the original licensing terms provided by Hugging Face.
Purpose: This dataset was created for use in competitions where internet access is disabled to facilitate the usage of pre-trained models.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Episode games from https://www.kaggle.com/competitions/llm-20-questions. This dataset can be used to analyze winning strategies, or as training data.
File naming: {episodeId}_{guesser}_{answer} (2 rows for each episodeId, one per team).
Notebook: https://www.kaggle.com/code/waechter/llm-20-questions-games-dataset/notebook
Source: Meta Kaggle dataset
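A small sketch of parsing that naming pattern (the example values are illustrative):

```python
# Sketch: split a file name of the form {episodeId}_{guesser}_{answer}.
# Assumes the guesser name itself contains no underscore.
name = "12345678_team-alpha_team-beta"  # illustrative values
episode_id, guesser, answer = name.split("_", 2)
```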
This is the top 1000 users' data for the four types of rankings (i.e., Competitions, Datasets, Notebooks, and Discussion) from October 2021 to September 2023. The data was scraped from the Kaggle Ranking every month. The scraping code is on GitHub.
Note: Only the top 20 users' data have been stored in August 2023.
Note: Data collection ended in September 2023.
In 2021:
- Competitions: Oct. 4, Nov. 21, Dec. 16
- Datasets: Oct. 12, Nov. 21, Dec. 16
- Notebooks: Oct. 13, Nov. 23, Dec. 16
- Discussion: Oct. 17, Nov. 23, Dec. 16

In 2022:
- Competitions: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul 15, Aug 15, Sep 19, Oct 15, Nov 15, Dec 16
- Datasets: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul 15, Aug 15, Sep 19, Oct 15, Nov 15, Dec 16
- Notebooks: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul 15, Aug 15, Sep 19, Oct 15, Nov 15, Dec 16
- Discussion: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul 15, Aug 15, Sep 19, Oct 15, Nov 15, Dec 18

In 2023:
- Competitions: Jan. 13, Feb. 21, Mar 14, Apr 15, May 17, Jun 20, Jul 20, Aug 20, Sep 12
- Datasets: Jan. 13, Feb. 21, Mar 14, Apr 16, May 17, Jun 20, Jul 20, Aug 20, Sep 12
- Notebooks: Jan. 13, Feb. 21, Mar 15, Apr 15, May 16, Jun 20, Jul 20, Aug 20, Sep 12
- Discussion: Jan. 13, Feb. 23, Mar 16, Apr 15, May 16, Jun 20, Jul 20, Aug 20, Sep 12
This dataset contains the data for the SnakeCLEF2023 HuggingFace dataset.
https://huggingface.co/spaces/competitions/SnakeCLEF2023
This dataset does not contain the 60 GB of full-size image training data. I wanted everyone to be able to use the data in Kaggle notebooks and participate in the competition.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Related discussion: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/464765
Related notebooks:
- Version 1: detailed results, to check how/if the dataset is saved and reloaded step-by-step: https://www.kaggle.com/code/chg0901/saveeverything-with-daigtext961-notebook?scriptVersionId=157295700
- Version 2: clean code with a dataset containing the saved results from the original notebook: https://www.kaggle.com/code/chg0901/saveeverything-with-daigtext961-notebook/notebook
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created by Sumukh
Released under CC0: Public Domain
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This AI-Kaggle-Assistant-File dataset is part of a notebook specially prepared for use in the Google - Gemini Long Context competition.
The following files can be found here:
Dataset generated by https://www.kaggle.com/steubk/meetings-are-boring-the-notebook. See https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/396068 for details.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Context
Online e-commerce is rapidly growing in Pakistan. Sellers list thousands of products across multiple categories, each with different prices, ratings, and sales numbers. Understanding the patterns of product sales, pricing, and customer feedback is crucial for businesses and data scientists alike.
This dataset simulates a realistic snapshot of online product sales in Pakistan, including diverse categories like Electronics, Clothing, Home & Kitchen, Books, Beauty, and Sports.
Source
- Generated synthetically using Python and NumPy for learning and practice purposes.
- No real personal or private data is included.
- Designed specifically for Kaggle competitions, notebooks, and ML/EDA exercises.
About the File
File name: Pakistan_Online_Product_Sales.csv
Rows: 1000+
Columns: 6
Purpose:
- Train machine learning models (regression/classification)
- Explore data through EDA and visualizations
- Practice feature engineering and data preprocessing