62 datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (167219625372 bytes)
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle, which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
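    Since file names match KernelVersions ids, a single code file can be joined back to its metadata row. A minimal sketch, assuming both datasets are attached to a Kaggle notebook under /kaggle/input/ (the example file path is hypothetical):

```python
# Sketch: link one Meta Kaggle Code file back to its KernelVersions row.
from pathlib import Path
import pandas as pd

kernel_versions = pd.read_csv("/kaggle/input/meta-kaggle/KernelVersions.csv")

code_file = Path("/kaggle/input/meta-kaggle-code/123/456/123456789.py")  # hypothetical
version_id = int(code_file.stem)  # the file name is the KernelVersions id

print(kernel_versions.loc[kernel_versions["Id"] == version_id])
```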

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
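    In other words, a version id determines its folders by integer division. A small sketch of that mapping (the file extension depends on the notebook language and is not derivable from the id alone):

```python
# Sketch: expected sub-directory for a kernel version id,
# per the layout above (id 123,456,789 -> 123/456/).
def version_dir(version_id: int) -> str:
    top = version_id // 1_000_000        # top-level folder: blocks of 1M ids
    sub = (version_id // 1_000) % 1_000  # sub-folder: blocks of 1K ids
    return f"{top}/{sub}"

assert version_dir(123_456_789) == "123/456"
```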

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
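    One way to fetch a single file from that bucket is the google-cloud-storage client, passing the billed project as user_project. A hedged sketch; the project id and object path are placeholder assumptions:

```python
# Sketch: requester-pays download with google-cloud-storage.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("kaggle-meta-kaggle-code-downloads",
                       user_project="my-gcp-project")  # project with billing enabled
bucket.blob("123/456/123456789.ipynb").download_to_filename("123456789.ipynb")  # hypothetical path
```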

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column. A minimal sketch of these joins follows.
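    Assuming the CSVs sit in the working directory:

```python
# Sketch: join the Code4ML tables along the documented keys.
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

# snippets -> notebook metadata -> competition metadata
full = (code_blocks
        .merge(kernels_meta, on="kernel_id")
        .merge(competitions_meta, on="comp_name"))
print(full[["code_block", "kaggle_score", "comp_name"]].head())
```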

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  3. Sentence-transformers-2.2.0

    • kaggle.com
    zip
    Updated Jun 1, 2022
    Cite
    Nabil Arnaoot (2022). Sentence-transformers-2.2.0 [Dataset]. https://www.kaggle.com/datasets/narnaoot/sentencetransformers220/discussion
    Explore at:
    zip (519799 bytes)
    Dataset updated
    Jun 1, 2022
    Authors
    Nabil Arnaoot
    Description

    If you need help setting this up to use in a notebook with the internet off, check this notebook: https://www.kaggle.com/code/narnaoot/installing-packages-without-internet-for-kaggle
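    For orientation, an offline install from an attached dataset typically looks like the notebook cell below; the input path and the presence of installable package files are assumptions, and the linked notebook is the authoritative walkthrough:

```python
# Notebook cell sketch: install from local files only (internet off).
# The /kaggle/input path is an assumption; see the linked notebook for specifics.
!pip install sentence-transformers --no-index --find-links=/kaggle/input/sentencetransformers220
```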

  4. Kaggle Analytics Competitions - Metadata

    • kaggle.com
    zip
    Updated Nov 1, 2022
    Cite
    Andrada (2022). Kaggle Analytics Competitions - Metadata [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/kaggle-analytics-competitions-metadata
    Explore at:
    zip (183843 bytes)
    Dataset updated
    Nov 1, 2022
    Authors
    Andrada
    Description

    Context

    I gathered this data to create a small analysis (an analysis within an analysis, an Inception-like situation) to understand what makes a notebook win a Kaggle Analytics Competition.

    Furthermore, the data lets us explore some differences in approaches between competitions and the evolution through time.

    Of course, as we are talking about an analytical approach (which cannot be quantified like a normal Kaggle Competition with a KPI), there can never be an EXACT recipe. However, if we look at some quantitative features (and then at quality, by reading the notebooks), we can quickly see a pattern within the winning notebooks.

    This knowledge might help you when you approach a new challenge, as well as guide you on the "right" path.

    Note: the dataset contains only PAST competitions that have already ended and the winners have been announced.

  5. kaggle-nlp-getting-start

    • huggingface.co
    Cite
    hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    hui
    Description

    Dataset Summary

    Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

    Columns

    id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.
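    A minimal loading sketch with the Hugging Face datasets library; the split name and field list follow the Kaggle data page and are assumptions here:

```python
# Sketch: load the dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("gdwangh/kaggle-nlp-getting-start")
print(ds)              # available splits
print(ds["train"][0])  # expected fields per the Kaggle page: id, keyword, location, text, target
```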

  6. Kaggle Competitions Top 100

    • kaggle.com
    zip
    Updated May 1, 2022
    Cite
    Vivo Vinco (2022). Kaggle Competitions Top 100 [Dataset]. https://www.kaggle.com/vivovinco/kaggle-competitions-top-100
    Explore at:
    zip (15932 bytes)
    Dataset updated
    May 1, 2022
    Authors
    Vivo Vinco
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Context

    This dataset contains the top 100 of the Kaggle competitions ranking. The dataset will be updated every month.

    Content

    100 rows and 13 columns. The columns are described below; a small loading sketch follows the list.

    • User: Name of the user
    • Tier: Grandmaster, Master or Expert
    • Company/School: Company/School info of the user, if mentioned
    • Country: Country info of the user, if mentioned
    • Competitions_Num: Number of competitions joined
    • Competitions_Gold: Number of competition gold medals won
    • Competitions_Silver: Number of competition silver medals won
    • Competitions_Bronze: Number of competition bronze medals won
    • Datasets_Num: Number of public datasets
    • Notebooks_Num: Number of public notebooks
    • Discussions_Num: Number of topics/comments posted
    • Points: Total points
    • Profile: Link to the Kaggle profile
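    A small loading sketch, assuming the archive holds a single CSV (the file name is hypothetical):

```python
# Sketch: load the rankings and summarize medals by tier.
import pandas as pd

df = pd.read_csv("kaggle_competitions_top_100.csv")  # hypothetical file name
print(df.sort_values("Points", ascending=False)[["User", "Tier", "Points"]].head(10))
print(df.groupby("Tier")[["Competitions_Gold", "Competitions_Silver",
                          "Competitions_Bronze"]].sum())
```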

    Acknowledgements

    Data from Kaggle. Image from Smartcat.

    If you're reading this, please upvote.

  7. Meta Kaggle Competitions

    • kaggle.com
    zip
    Updated Nov 11, 2025
    Cite
    Pau Fortiana Chico (2025). Meta Kaggle Competitions [Dataset]. https://www.kaggle.com/datasets/paufortiana/meta-kaggle-competitions
    Explore at:
    zip (26645981 bytes)
    Dataset updated
    Nov 11, 2025
    Authors
    Pau Fortiana Chico
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset was created to provide a stable, reliable data source for notebooks, avoiding the 'deleted-dataset' errors that can occur with the frequently-updated official Meta Kaggle dataset.

  8. Webpage Information for 5000+ Kaggle Competitions

    • kaggle.com
    zip
    Updated Nov 8, 2023
    Cite
    Anthony Wynne (2023). Webpage Information for 5000+ Kaggle Competitions [Dataset]. https://www.kaggle.com/anthony35813/webpage-data-for-kaggle-competitions
    Explore at:
    zip (102059495 bytes)
    Dataset updated
    Nov 8, 2023
    Authors
    Anthony Wynne
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    I produced the dataset whilst working on the 2023 Kaggle AI report. The Meta Kaggle dataset provides helpful information about the Kaggle competitions but not the original descriptive text from the Kaggle web pages for each competition. We have information about the solutions but not the original problem. So, I wrote some web scraping scripts to collect and store that information.

    Not all Kaggle web pages have that information available; some are missing or broken. Hence the nulls in the data. Secondly, note that not all previous Kaggle competitions exist in the Meta Kaggle data, which was used to collect the webpage slugs.

    The scraping scripts iterate over the IDs in the Meta Kaggle competitions.csv data and attempt to collect the webpage data for each competition if it is currently null in the database. Hence, new IDs cause the scripts to go and collect their data, and each week the scripts try to fill in any links that were not working previously.

    I have recently converted the original local scraping scripts on my machine into a Kaggle notebook that now updates this dataset weekly on Mondays. The notebook also explains the scraping procedure and its automation to keep this dataset up-to-date.

    Note that the CompetitionId field joins to the Id field of the competitions.csv in the Meta Kaggle dataset, so this information can be combined with the rest of Meta Kaggle (see the sketch below).
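    A minimal sketch of that join (file names and paths are assumptions):

```python
# Sketch: combine the scraped webpage text with Meta Kaggle's Competitions.csv.
import pandas as pd

webpages = pd.read_csv("webpage_data.csv")  # hypothetical file name
competitions = pd.read_csv("/kaggle/input/meta-kaggle/Competitions.csv")

merged = webpages.merge(competitions, left_on="CompetitionId",
                        right_on="Id", how="left")
print(merged.head())
```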

    My primary reason for collecting the data was for some text classification work I wanted to do, and I will publish it here soon. I hope that the data is useful to some other projects as well :-)

  9. List of public notebooks: AI Report competition

    • kaggle.com
    zip
    Updated Jul 7, 2023
    Cite
    Paul Mooney (2023). List of public notebooks: AI Report competition [Dataset]. https://www.kaggle.com/datasets/paultimothymooney/list-of-public-notebooks-ai-report-competition
    Explore at:
    zip (9779 bytes)
    Dataset updated
    Jul 7, 2023
    Authors
    Paul Mooney
    Description

    The 2023 Kaggle AI Report Competition required all notebooks to be made public prior to the July 5th deadline. This dataset contains a preliminary list of all of those notebooks, sorted by category.

    See the competition overview, data, evaluation, submission instructions, and timeline pages for more detail about the competition itself.

  10. Titanic Dataset

    • kaggle.com
    zip
    Updated Apr 25, 2025
    Cite
    Muhammad Mudasar Sabir (2025). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/mudasarsabir/titanic-dataset
    Explore at:
    zip (8350 bytes)
    Dataset updated
    Apr 25, 2025
    Authors
    Muhammad Mudasar Sabir
    Description

    👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

    If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle

    The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

    Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial, which walks you through, step by step, how to make your first submission!

    The Challenge

    The sinking of the Titanic is one of the most infamous shipwrecks in history.

    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

    While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

    In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

    Recommended Tutorial

    We highly recommend Alexis Cook’s Titanic Tutorial, which walks you through making your very first submission step by step, and this starter notebook to get started.

    How Kaggle’s Competitions Work

    • Join the Competition: Read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
    • Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file.
    • Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.
    • Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.
    • Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.

    Kaggle Lingo Video

    You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!

    What Data Will I Use in This Competition?

    In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

    Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.

    The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

    Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.

    Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

    How to Submit your Prediction to Kaggle

    Once you’re ready to make a submission and get on the leaderboard:

    Click on the “Submit Predictions” button

    Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.

    Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

    The file should have exactly 2 columns:

    • PassengerId (sorted in any order)
    • Survived (contains your binary predictions: 1 for survived, 0 for deceased)

    Got it! I’m ready to get started. Where do I get help if I need it?

    For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
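    For illustration, a correctly shaped submission can be produced from test.csv as below; the all-zeros prediction is a placeholder, not a real model:

```python
# Sketch: build a valid submission file (418 rows plus a header).
import pandas as pd

test = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": 0,  # placeholder prediction for every passenger
})
submission.to_csv("submission.csv", index=False)
```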

    A Last Word on Kaggle Notebooks

    As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...

  11. Top Kagglers Rankings

    • kaggle.com
    zip
    Updated Aug 20, 2020
    Cite
    AJ Pass (2020). Top Kagglers Rankings [Dataset]. https://www.kaggle.com/ajpass/top-kagglers-rankings
    Explore at:
    zip (352259 bytes)
    Dataset updated
    Aug 20, 2020
    Authors
    AJ Pass
    Description

    Context

    This dataset was obtained using four similar web scrapers written in Python; more information under Content.

    Content

    topKagglersCompetitions.csv: the top Kagglers in Competitions, with no biography data. Scraper used: https://www.kaggle.com/ajpass/web-scrapping-vol-7-kaggle-competitions

    topKagglersDatasets.csv: the top Kagglers in Datasets, with no biography data. Scraper used: https://www.kaggle.com/ajpass/data-mining-web-scrapping-vol-4-kaggle-datasets2

    topKagglersDiscussion.csv: the top Kagglers in Discussions, with no biography data. Scraper used: https://www.kaggle.com/ajpass/web-scrapping-vol-6-kaggle-discussions

    topKagglersNotebooks.csv: the top Kagglers in Notebooks, with no biography data. Scraper used: https://www.kaggle.com/ajpass/data-mining-web-scrapping-vol-5-kaggle-notebooks

  12. GPT-2 Offline Model and Tokenizer for Kaggle

    • kaggle.com
    zip
    Updated Dec 4, 2024
    Cite
    Rahul Bhat (2024). GPT-2 Offline Model and Tokenizer for Kaggle [Dataset]. https://www.kaggle.com/datasets/rahulbhat44/gpt-2-offline-model-and-tokenizer-for-kaggle/code
    Explore at:
    zip (463253154 bytes)
    Dataset updated
    Dec 4, 2024
    Authors
    Rahul Bhat
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the pre-downloaded GPT-2 model and tokenizer files for offline use in Kaggle notebooks. It enables participants to use GPT-2 without requiring internet access, ensuring compliance with competition rules that restrict internet usage.

    The dataset includes:

    • GPT-2 Model: config file, weights (model.safetensors), and other necessary files.
    • GPT-2 Tokenizer: vocabulary, merges, and tokenizer configuration files.

    Use this dataset to load GPT-2 seamlessly into your notebook for generating text or other applications.

    Contents:

    • gpt2_model.zip: contains model weights and configuration files.
    • gpt2_tokenizer.zip: contains tokenizer configuration and vocabulary files.

    Usage: Add this dataset to your notebook via the Kaggle dataset panel. Unzip the files and load them using the Hugging Face Transformers library with the from_pretrained method, pointing to the unzipped directories.
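    A sketch of that workflow; the unzip destination paths are assumptions:

```python
# Sketch: load the unzipped GPT-2 files with Hugging Face Transformers.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("/kaggle/working/gpt2_tokenizer")  # assumed path
model = GPT2LMHeadModel.from_pretrained("/kaggle/working/gpt2_model")        # assumed path

inputs = tokenizer("Kaggle notebooks are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```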

    Licenses: The dataset reuses open-source GPT-2 files available under the original licensing terms provided by Hugging Face.

    Purpose: This dataset was created to facilitate the use of pre-trained models in competitions where internet access is disabled.

  13. LLM 20 Questions Games

    • kaggle.com
    zip
    Updated Aug 7, 2024
    Cite
    waechter (2024). LLM 20 Questions Games [Dataset]. https://www.kaggle.com/datasets/waechter/llm-20-questions-games
    Explore at:
    zip (189837141 bytes)
    Dataset updated
    Aug 7, 2024
    Authors
    waechter
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Episode games from https://www.kaggle.com/competitions/llm-20-questions. This dataset can be used to analyze winning strategies, or as training data. A small analysis sketch follows the column list below.

    Description of columns:

    • index: {episodeId}_{guesser}_{answer} (2 rows for each episodeId, one per team)
    • answers: list (length nb_round) of answers given by the answerer agent
    • questions: list (length nb_round) of questions asked by the guesser agent
    • guesses: list (length nb_round) of guesses made by the guesser agent
    • keyword: the keyword to be guessed
    • category: category of the keyword
    • guesser: name of the guesser/asker team
    • answerer: name of the answerer team
    • nb_round: number of rounds (< 20 means victory or error)
    • game_num: episodeId
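    As one example of the "winning strategies" angle, the sketch below loads the games and compares early finishes by keyword category; the CSV file name inside the dataset is an assumption:

```python
# Sketch: share of games ending before round 20 (victory or error), by category.
import pandas as pd

games = pd.read_csv("/kaggle/input/llm-20-questions-games/games.csv")  # hypothetical name
games["ended_early"] = games["nb_round"] < 20
print(games.groupby("category")["ended_early"].mean().sort_values(ascending=False))
```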

    Source

    Notebook: https://www.kaggle.com/code/waechter/llm-20-questions-games-dataset/notebook
    Meta Kaggle dataset

  14. Kaggle ranking datasets

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    tropicbird (2023). Kaggle ranking datasets [Dataset]. https://www.kaggle.com/datasets/hdsk38/comp-top-1000-data/discussion
    Explore at:
    zip (3780060 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    tropicbird
    Description

    About

    This is the top 1,000 user data for the four types of rankings (i.e., Competitions, Datasets, Notebooks, and Discussion) from October 2021 to September 2023. The data was scraped from the Kaggle Ranking every month. The scraping code is on GitHub.

    Note: Only the top 20 users' data were stored for August 2023.

    Note: Data collection ended in September 2023.

    Dates the ranking data was scraped

    In 2021:
    • Competitions: Oct. 4, Nov. 21, Dec. 16
    • Datasets: Oct. 12, Nov. 21, Dec. 16
    • Notebooks: Oct. 13, Nov. 23, Dec. 16
    • Discussion: Oct. 17, Nov. 23, Dec. 16

    In 2022:
    • Competitions: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul. 15, Aug. 15, Sep. 19, Oct. 15, Nov. 15, Dec. 16
    • Datasets: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul. 15, Aug. 15, Sep. 19, Oct. 15, Nov. 15, Dec. 16
    • Notebooks: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul. 15, Aug. 15, Sep. 19, Oct. 15, Nov. 15, Dec. 16
    • Discussion: Jan. 16, Feb. 20, Mar. 15, Apr. 15, May 15, June 15, Jul. 15, Aug. 15, Sep. 19, Oct. 15, Nov. 15, Dec. 18

    In 2023:
    • Competitions: Jan. 13, Feb. 21, Mar. 14, Apr. 15, May 17, Jun. 20, Jul. 20, Aug. 20, Sep. 12
    • Datasets: Jan. 13, Feb. 21, Mar. 14, Apr. 16, May 17, Jun. 20, Jul. 20, Aug. 20, Sep. 12
    • Notebooks: Jan. 13, Feb. 21, Mar. 15, Apr. 15, May 16, Jun. 20, Jul. 20, Aug. 20, Sep. 12
    • Discussion: Jan. 13, Feb. 23, Mar. 16, Apr. 15, May 16, Jun. 20, Jul. 20, Aug. 20, Sep. 12

  15. SnakeCLEF2023HF

    • kaggle.com
    zip
    Updated Mar 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Raghvender (2023). SnakeCLEF2023HF [Dataset]. https://www.kaggle.com/datasets/raghvender/snakeclef2023hf
    Explore at:
    zip (35376564179 bytes)
    Dataset updated
    Mar 20, 2023
    Authors
    Raghvender
    Description

    This dataset mirrors the SnakeCLEF2023 HuggingFace dataset.

    https://huggingface.co/spaces/competitions/SnakeCLEF2023

    This dataset does not contain the 60 GB of full-size image training data. It was uploaded so that everyone can use the data in Kaggle notebooks and participate in the competition.

  16. DAIGT-SaveEverything

    • kaggle.com
    zip
    Updated Jan 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    HongCheng (2024). DAIGT-SaveEverything [Dataset]. https://www.kaggle.com/datasets/chg0901/daigt-saveeverything
    Explore at:
    zip (719677868 bytes)
    Dataset updated
    Jan 1, 2024
    Authors
    HongCheng
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Related discussion: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/464765

    Related notebooks:

    Version 1 (detailed results, showing step by step how and whether the dataset is saved and reloaded): https://www.kaggle.com/code/chg0901/saveeverything-with-daigtext961-notebook?scriptVersionId=157295700

    Version 2 (clean code, with a dataset containing the saved results from the original notebook): https://www.kaggle.com/code/chg0901/saveeverything-with-daigtext961-notebook/notebook

  17. Nvidia Apex

    • kaggle.com
    Updated Jun 23, 2020
    Cite
    Sumukh (2020). Nvidia Apex [Dataset]. https://www.kaggle.com/ii5m0k3ii/nvidia-apex/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumukh
    License

    CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Dataset

    This dataset was created by Sumukh

    Released under CC0: Public Domain


  18. AI-Kaggle-Assistant-File

    • kaggle.com
    zip
    Updated Nov 11, 2024
    Cite
    Mateusz (2024). AI-Kaggle-Assistant-File [Dataset]. https://www.kaggle.com/datasets/mateo252/ai-kaggle-assistant-file
    Explore at:
    zip (64505 bytes)
    Dataset updated
    Nov 11, 2024
    Authors
    Mateusz
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This AI-Kaggle-Assistant-File dataset is part of a notebook that has been specially prepared for use in the competition task Google - Gemini Long Context.

    The following files can be found here:

    • all-css-style.html - HTML file containing only the CSS used to style notebook elements,
    • cache_animation.gif - GIF image used as an additional visual element in the notebook,
    • generate_notebook_prompt.txt - special instructions for the Gemini model to generate new data for the notebook, plus the required format of the returned data,
    • generated_notebook_template.txt - template for proper display of the data returned by the model,
    • improve_notebook_prompt.txt - second special instruction for the Gemini model to return the correct data,
    • improved_notebook_template.txt - second template for proper display of the data returned by the model,
    • kaggle_notebook_template.txt - another template for proper display of the data returned by the model,
    • my_titanic_markdown_notebook.md - my notebook with one of my projects, containing an analysis of the popular Titanic collection; it is used as an example in the project.
  19. "Meetings are BORING!"

    • kaggle.com
    zip
    Updated Mar 20, 2023
    Cite
    steubk (2023). "Meetings are BORING!" [Dataset]. https://www.kaggle.com/datasets/steubk/meetings-are-boring
    Explore at:
    zip (2656 bytes)
    Dataset updated
    Mar 20, 2023
    Authors
    steubk
    Description
  20. Pakistan Online Product Sales Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    Aliza Brand (2025). Pakistan Online Product Sales Dataset [Dataset]. https://www.kaggle.com/datasets/shahzadi786/pakistan-online-product-sales-dataset
    Explore at:
    zip (13739 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    Aliza Brand
    License

    CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)

    Area covered
    Pakistan
    Description

    Context

    Online e-commerce is rapidly growing in Pakistan. Sellers list thousands of products across multiple categories, each with different prices, ratings, and sales numbers. Understanding the patterns of product sales, pricing, and customer feedback is crucial for businesses and data scientists alike.

    This dataset simulates a realistic snapshot of online product sales in Pakistan, including diverse categories like Electronics, Clothing, Home & Kitchen, Books, Beauty, and Sports.

    Source

    Generated synthetically using Python and NumPy for learning and practice purposes.

    No real personal or private data is included.

    Designed specifically for Kaggle competitions, notebooks, and ML/EDA exercises.
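    For context, a comparable table can be generated in a few lines of NumPy. The column names and value ranges below are illustrative assumptions, not the dataset's actual schema or generator:

```python
# Sketch: generate a synthetic sales table similar in spirit to this dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
categories = ["Electronics", "Clothing", "Home & Kitchen", "Books", "Beauty", "Sports"]

df = pd.DataFrame({
    "product_id": np.arange(1, n + 1),          # assumed column names throughout
    "category": rng.choice(categories, size=n),
    "price": rng.uniform(200, 50_000, size=n).round(2),  # PKR, assumed range
    "rating": rng.uniform(1, 5, size=n).round(1),
    "units_sold": rng.integers(0, 500, size=n),
})
df["revenue"] = (df["price"] * df["units_sold"]).round(2)  # sixth column
df.to_csv("Pakistan_Online_Product_Sales.csv", index=False)
```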

    About the File

    File name: Pakistan_Online_Product_Sales.csv
    Rows: 1000+
    Columns: 6

    Purpose:

    • Train machine learning models (regression/classification)
    • Explore data through EDA and visualizations
    • Practice feature engineering and data preprocessing
