100+ datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Aug 14, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (153009997518 bytes)
    Dataset updated
    Aug 14, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
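
    A minimal sketch of this join, using only the Python standard library. It assumes that the identifier column in the KernelVersions csv is named `Id` and that each code file is named `<id>.<extension>`; the sample rows and the `TotalVotes` column are illustrative stand-ins, not values taken from the dataset.

```python
import csv
import io
from pathlib import Path


def index_kernel_versions(csv_file, id_column="Id"):
    """Map each KernelVersions id (kept as a string) to its metadata row.

    `id_column` is an assumption about the csv header; adjust it if needed.
    """
    return {row[id_column]: row for row in csv.DictReader(csv_file)}


def metadata_for(code_path, versions_by_id):
    """Return the row whose id equals the file name without its extension."""
    return versions_by_id.get(Path(code_path).stem)


# Tiny in-memory stand-in for the KernelVersions csv (made-up values).
sample_csv = io.StringIO("Id,TotalVotes\n123456789,7\n123456790,0\n")
versions = index_kernel_versions(sample_csv)
row = metadata_for("123/456/123456789.ipynb", versions)
```

    Files whose id has no matching row return `None`, which makes it easy to spot mismatches between the two datasets.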

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
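
    The folder for a given version id follows from integer division, as in the sketch below. Whether folder names are zero-padded (for example `001` rather than `1`) is not stated in the description, so the unpadded form here is an assumption.

```python
def version_folder(version_id):
    """Return the 'top/sub' folder that should hold a kernel version id.

    The top-level folder covers one million ids, the sub folder one thousand.
    Folder names are left unpadded, which is an assumption.
    """
    top = version_id // 1_000_000
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}"


folder = version_folder(123_456_789)
```

    Every id from 123,456,000 to 123,456,999 maps to the same folder, matching the example in the description.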

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
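
    One way to access a requester-pays bucket is the `gsutil` command-line tool, whose `-u` option names the project to bill. The sketch below only builds the command; the project name `my-billing-project` is a placeholder, and the layout of objects inside the bucket is not described here, so no object paths are assumed.

```python
import shlex

BUCKET = "kaggle-meta-kaggle-code-downloads"


def requester_pays_command(billing_project, action="ls", prefix=""):
    """Build a gsutil command that bills `billing_project` for the request."""
    return ["gsutil", "-u", billing_project, action, f"gs://{BUCKET}/{prefix}"]


# "my-billing-project" is a placeholder for a GCP project with billing enabled.
command = requester_pays_command("my-billing-project")
printable = shlex.join(command)
```

    The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `gsutil` is installed and authenticated.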

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Data from: KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

    • zenodo.org
    • data.niaid.nih.gov
    bin, bz2, pdf
    Updated Jul 19, 2024
    Cite
    Luigi Quaranta; Fabio Calefato; Filippo Lanubile (2024). KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle [Dataset]. http://doi.org/10.5281/zenodo.4468523
    Explore at:
    Available download formats: bz2, pdf, bin
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luigi Quaranta; Fabio Calefato; Filippo Lanubile
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.

    The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.

    In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.

    More specifically, the package comprises the following three compressed archives:

    1. KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;

    2. KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;

    3. MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.

    Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
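
    Because the archives are bz2-compressed tarballs, their contents can be inspected before extraction with Python's standard `tarfile` module. This is a generic sketch, not a tool shipped with KGTorrent, and the internal layout of the archives is not assumed.

```python
import tarfile


def list_archive_members(archive_path, limit=10):
    """Return up to `limit` member names of a .tar.bz2 archive without extracting it."""
    names = []
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

    For example, `list_archive_members("KGT_dataset.tar.bz2")` would preview the first ten entries of the notebook archive once it has been downloaded.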

  3. issues-kaggle-notebooks

    • huggingface.co
    Updated Aug 12, 2025
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks

    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

  4. kaggle-notebooks-edu-v0

    • huggingface.co
    Updated May 31, 2025
    Cite
    Data Agents (2025). kaggle-notebooks-edu-v0 [Dataset]. https://huggingface.co/datasets/data-agents/kaggle-notebooks-edu-v0
    Dataset updated
    May 31, 2025
    Dataset authored and provided by
    Data Agents
    Description

    Kaggle Notebooks LLM Filtered

    Model: meta-llama/Meta-Llama-3.1-70B-Instruct
    Sample: 12,400
    Source dataset: data-agents/kaggle-notebooks
    Prompt:

    Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.

    The notebooks are formatted with the following tokens:

    START

    Here comes markdown content

    Here comes python code

    Here comes code output

    More… See the full description on the dataset page: https://huggingface.co/datasets/data-agents/kaggle-notebooks-edu-v0.

  5. Data from: DistilKaggle: a distilled dataset of Kaggle Jupyter notebooks

    • zenodo.org
    application/gzip, bin +1
    Updated Jan 27, 2024
    Cite
    Mojtaba Mostafavi Ghahfarokhi; Arash Asgari; Mohammad Abolnejadian; Abbas Heydarnoori (2024). DistilKaggle: a distilled dataset of Kaggle Jupyter notebooks [Dataset]. http://doi.org/10.5281/zenodo.10317389
    Explore at:
    Available download formats: bin, csv, application/gzip
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mojtaba Mostafavi Ghahfarokhi; Arash Asgari; Mohammad Abolnejadian; Abbas Heydarnoori
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    DistilKaggle is a curated dataset extracted from Kaggle Jupyter notebooks spanning from September 2015 to October 2023. This dataset is a distilled version derived from the download of over 300GB of Kaggle kernels, focusing on essential data for research purposes. The dataset exclusively comprises publicly available Python Jupyter notebooks from Kaggle. The information needed to locate and download the notebooks is obtained from the MetaKaggle dataset provided by Kaggle.

    Contents

    The DistilKaggle dataset consists of three main CSV files:

    code.csv: Contains over 12 million rows of code cells extracted from the Kaggle kernels. Each row is identified by the kernel's ID and cell index for reproducibility.

    markdown.csv: Includes over 5 million rows of markdown cells extracted from Kaggle kernels. Similar to code.csv, each row is identified by the kernel's ID and cell index.

    notebook_metrics.csv: This file provides notebook features described in the accompanying paper released with this dataset. It includes metrics for over 517,000 Python notebooks.
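
    Since code and markdown cells live in separate files, a notebook's cell sequence can be rebuilt by filtering both files on the kernel ID and sorting by cell index. The sketch below assumes hypothetical column names (`kernel_id`, `cell_index`, `source`) and assumes that the cell index is shared across the two files; check the actual csv headers before relying on it.

```python
import csv
import io


def notebook_cells(code_rows, markdown_rows, kernel_id):
    """Return (cell_index, kind, source) tuples for one kernel, in cell order.

    Column names are hypothetical; the real csv headers may differ.
    """
    cells = []
    for kind, rows in (("code", code_rows), ("markdown", markdown_rows)):
        for row in rows:
            if row["kernel_id"] == kernel_id:
                cells.append((int(row["cell_index"]), kind, row["source"]))
    return sorted(cells)


# Made-up miniature versions of code.csv and markdown.csv.
code_csv = "kernel_id,cell_index,source\n42,1,import pandas as pd\n42,3,df.head()\n7,0,print(1)\n"
markdown_csv = "kernel_id,cell_index,source\n42,0,# Title\n42,2,Preview the data\n"

cells = notebook_cells(
    list(csv.DictReader(io.StringIO(code_csv))),
    list(csv.DictReader(io.StringIO(markdown_csv))),
    kernel_id="42",
)
```

    With files of 12 million and 5 million rows, a real pipeline would likely stream the rows or use a dataframe library instead of loading everything into lists.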

    Directory Structure

    The kernels directory is organized based on Kaggle's Performance Tiers (PTs), a ranking system in Kaggle that classifies users. The structure includes PT-specific directories, each containing user ids that belong to this PT, download logs, and the essential data needed for downloading the notebooks.

    The utility directory contains two important files:

    aggregate_data.py: A Python script for aggregating data from different PTs into the mentioned CSV files.

    application.ipynb: A Jupyter notebook serving as a simple example application using the metrics dataframe. It demonstrates predicting the PT of the author based on notebook metrics.

    DistilKaggle.tar.gz: A compressed archive of the whole dataset. If you have already downloaded all of the other files individually, there is no need to download this file.

    Usage

    Researchers can leverage this distilled dataset for various analyses without dealing with the bulk of the original 300GB dataset. For access to the raw, unprocessed Kaggle kernels, researchers can request the dataset directly.

    Note

    The original dataset of Kaggle kernels is substantial, exceeding 300GB, making it impractical for direct upload to Zenodo. Researchers interested in the full dataset can contact the dataset maintainers for access.

    Citation

    If you use this dataset in your research, please cite the accompanying paper or provide appropriate acknowledgment as outlined in the documentation.

    If you have any questions regarding the dataset, don't hesitate to contact me at mohammad.abolnejadian@gmail.com

    Thank you for using DistilKaggle!

  6. Data from: NEW notebook

    • kaggle.com
    Updated Apr 28, 2021
    Cite
    Rahab Uwamahoro (2021). NEW notebook [Dataset]. https://www.kaggle.com/rahabuwamahoro/new-notebook/code
    Dataset updated
    Apr 28, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rahab Uwamahoro
    Description

    Dataset

    This dataset was created by Rahab Uwamahoro

    Contents

  7. Latest polars package for offline installation

    • kaggle.com
    Updated Jun 19, 2025
    Cite
    Steven Van Ingelgem (2025). Latest polars package for offline installation [Dataset]. https://www.kaggle.com/datasets/svaningelgem/offline-polars
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Steven Van Ingelgem
    Description

    Latest polars sources. This can be included when you are on an offline notebook.

    I'll try to keep this up-to-date, but if there is a newer version, please do tell me and I'll update it.

    Current version: 1.31.0

    To install in your notebook:

    !pip install -q --no-index --find-links /kaggle/input/offline-polars polars

    License: MIT

  8. kaggle-notebooks-outputs-filtered-0

    • huggingface.co
    Updated May 3, 2025
    + more versions
    Cite
    Computer Intelligence Project (2025). kaggle-notebooks-outputs-filtered-0 [Dataset]. https://huggingface.co/datasets/bigcomputer/kaggle-notebooks-outputs-filtered-0
    Dataset updated
    May 3, 2025
    Dataset authored and provided by
    Computer Intelligence Project
    Description

    bigcomputer/kaggle-notebooks-outputs-filtered-0 dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. kaggle_notebook

    • kaggle.com
    zip
    Updated May 7, 2020
    Cite
    JadenQ (2020). kaggle_notebook [Dataset]. https://www.kaggle.com/jadenqee/kaggle-notebook
    Explore at:
    Available download formats: zip (932513669 bytes)
    Dataset updated
    May 7, 2020
    Authors
    JadenQ
    Description

    Dataset

    This dataset was created by JadenQ

    Contents

    It contains the following files:

  10. ‘Kaggle Notebooks Ranking’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Kaggle Notebooks Ranking’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-notebooks-ranking-ebf3/e7f75ea8/?iid=002-618&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Kaggle Notebooks Ranking’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-notebooks-ranking on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset contains Kaggle ranking of notebooks.

    Content

    More than 3,000 rows and 8 columns. The columns are described below.

    • Rank : Rank of the user
    • Tier : Grandmaster, Master or Expert
    • Username : Name of the user
    • Join Date : Year of join
    • Gold Medals : Number of gold medals
    • Silver Medals : Number of silver medals
    • Bronze Medals : Number of bronze medals
    • Points : Total points

    Acknowledgements

    Data from Kaggle. Image from Wikiwand.

    If you're reading this, please upvote.

    --- Original source retains full ownership of the source dataset ---

  11. Assessing Computational Notebook Understandability through Code Metrics...

    • zenodo.org
    zip
    Updated Oct 12, 2023
    Cite
    Mojtaba Mostafavi (2023). Assessing Computational Notebook Understandability through Code Metrics Analysis [Dataset]. http://doi.org/10.5281/zenodo.8435192
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mojtaba Mostafavi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in these notebooks is often of poor quality. Considering the importance of maintenance and reusability, it is crucial to pay attention to the comprehension of the notebook code and identify the notebook metrics that play a significant role in their comprehension. The level of code comprehension is a qualitative variable closely associated with the user's opinion about the code. Previous studies have typically employed two approaches to measure it. One approach involves using limited questionnaire methods to review a small number of code pieces. Another approach relies solely on metadata, such as the number of likes and user votes for a project in the software repository. In our approach, we enhanced the measurement of the understandability level of notebook code by leveraging user comments within a software repository. As a case study, we started with 248,761 Kaggle Jupyter notebooks introduced in previous studies and their relevant metadata. To identify user comments associated with code comprehension within the notebooks, we utilized a fine-tuned DistilBERT transformer. We established a user-comment-based criterion for measuring code understandability by considering the number of code understandability-related comments, the upvotes on those comments, the total views of the notebook, and the total upvotes received by the notebook. This criterion has proven to be more effective than alternative methods, making it the ground truth for evaluating the code comprehension of our notebook set. In addition, we collected a total of 34 metrics for 10,857 notebooks, categorized as script-based and notebook-based metrics. These metrics were utilized as features in our dataset. Using the Random Forest classifier, our predictive model achieved 85% accuracy in predicting code comprehension levels in computational notebooks, identifying developer expertise and markdown-based metrics as key factors.

  12. 📊 Best Open Source LLM Starter Pack 🧙🚀

    • kaggle.com
    Updated Aug 17, 2023
    Cite
    Radek Osmulski (2023). 📊 Best Open Source LLM Starter Pack 🧙🚀 [Dataset]. https://www.kaggle.com/datasets/radek1/best-llm-starter-pack
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Radek Osmulski
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a couple of great open source models!

    • version 2 -- the best open source LLM at the time of writing (NousResearch/Nous-Hermes-Llama2-13b) that we can load on Kaggle! didn't manage to load anything larger than 13B
    • version 14 -- loading models using a new library, curated-transformers that should allow for easier modifications of the underlying architectures.

    This dataset also includes all the dependencies we need to load the model in 8bit, if that is what you would like to do (updated versions of transformers, accelerate, etc).

    I show how to load and run Nous-Hermes-Llama2-13b in the following notebook:

    👉 💡 Best Open Source LLM Starter Pack 🧪🚀

    If you find this dataset helpful, please leave an upvote! 🙂 Thank you! 🙏

  13. Kaggle LLMSE Dataset

    • kaggle.com
    Updated Oct 18, 2023
    Cite
    Haoquan Fang (2023). Kaggle LLMSE Dataset [Dataset]. https://www.kaggle.com/datasets/hqfang/kaggle-llmse-dataset
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Haoquan Fang
    Description

    deberta-billy is trained locally by @hqfang primarily using @radek1's notebook.

    deberta-lora-lindsey is trained locally by @lindseywei using the LoRA technique.

    deberta-openbook-eric-088 comes from @yuekaixueirc's dataset.

    deberta-openbook-eric-0897 comes from @yuekaixueirc's dataset.

    deberta-openbook-eric-0916 comes from @yuekaixueirc's dataset.

    54k_with_context_v1.csv was created by dropping duplicates from @cdeotte's 60k training data (all_12_with_context2.csv) in this dataset.

    54k.csv was created by dropping the context column from the 54k_with_context_v1.csv.

    val_with_context_v1.csv was created by adding a context column to @itsuki9180's validation dataset.

  14. Python Package List

    • kaggle.com
    zip
    Updated Sep 9, 2019
    Cite
    Otto Schnurr (2019). Python Package List [Dataset]. https://www.kaggle.com/ottoschnurr/kaggle-notebook-python-package-manifest
    Explore at:
    Available download formats: zip (19563 bytes)
    Dataset updated
    Sep 9, 2019
    Authors
    Otto Schnurr
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A significant amount of software is available in Kaggle's Python notebook. I had hoped to find a reference somewhere listing which Python packages were available and what each one did.

    When I didn't find what I was looking for, I decided to build this dataset instead.

    Content

    This dataset was assembled in four steps:

    1. Code inside a Kaggle notebook was used to gather the names of over 600 installed packages.
    2. A package list was scraped from Anaconda and cross-referenced against the notebook package list.
    3. The roughly 400 packages that remained were carefully queried from the Python Package Index using its JSON API.
    4. The results were collated into a manifest.
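
    Steps 1 and 3 can be approximated with the standard library, as sketched below. This is not the author's actual code: `importlib.metadata` is one possible way to enumerate installed packages, and the URL pattern is the public PyPI JSON endpoint for a project.

```python
from importlib import metadata


def installed_package_names():
    """Return the sorted names of distributions visible to this interpreter."""
    names = {dist.metadata["Name"] for dist in metadata.distributions()}
    # Skip distributions with missing or broken metadata.
    return sorted((name for name in names if name), key=str.lower)


def pypi_json_url(package_name):
    """Build the PyPI JSON API URL that describes a package."""
    return f"https://pypi.org/pypi/{package_name}/json"


names = installed_package_names()
url = pypi_json_url("polars")
```

    Fetching each URL (for example with `urllib.request.urlopen`) returns a JSON document whose `info` section holds fields such as the package summary, which could then be collated into a manifest.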

    Reference

    Acknowledgements

  15. images

    • kaggle.com
    Updated Jan 2, 2021
    Cite
    George Zoto (2021). images [Dataset]. https://www.kaggle.com/datasets/georgezoto/images/versions/9
    Dataset updated
    Jan 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    George Zoto
    Description

    Context and Content

    Images used in several of my Kaggle notebooks, each source is mentioned explicitly.

  16. Hidden Gems Dataset

    • kaggle.com
    Updated Apr 19, 2022
    Cite
    Andrada (2022). Hidden Gems Dataset [Dataset]. https://www.kaggle.com/datasets/andradaolteanu/hidden-gems-dataset
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Andrada
    Description

    This dataset accompanies my notebook 🎨Hidden Gems: SpecialSauces to create amazing EDA specially created as a Guide to create a successful notebook in any Analytics Competition.

  17. Styling a notebook with CSS - CSS file

    • kaggle.com
    Updated Jun 22, 2023
    Cite
    Nick Potter (2023). Styling a notebook with CSS - CSS file [Dataset]. https://www.kaggle.com/datasets/nnjjpp/styling-a-notebook-with-css
    Dataset updated
    Jun 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nick Potter
    Description

    Dataset

    This dataset was created by Nick Potter

    Contents

  18. ‘NYC Most Popular Baby Names Over the Years’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘NYC Most Popular Baby Names Over the Years’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-nyc-most-popular-baby-names-over-the-years-94c5/3fb35e8b/?iid=003-998&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Analysis of ‘NYC Most Popular Baby Names Over the Years’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/most-popular-baby-names-in-nyce on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    Popular Baby Name Data In NYC from 2011-2014

    Rows: 13962; Columns: 6

    The data include items, such as:

    • BRTH_YR: birth year of the baby
    • GNDR: gender
    • ETHCTY: mother's ethnicity
    • NM: baby's name
    • CNT: count of the name
    • RNK: ranking of the name

    Source: NYC Open Data

    https://data.cityofnewyork.us/Health/Most-Popular-Baby-Names-by-Sex-and-Mother-s-Ethnic/25th-nujf

    This dataset was created by Data Society and contains around 10000 samples along with Nm, Rnk, technical information and other features such as: - Gndr - Ethcty - and more.

    How to use this dataset

    • Analyze Brth Yr in relation to Cnt
    • Study the influence of Nm on Rnk
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit Data Society

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  19. New Ensemble Of Public Notebooks_1

    • kaggle.com
    Updated Oct 27, 2021
    + more versions
    Cite
    yu5uke (2021). New Ensemble Of Public Notebooks_1 [Dataset]. https://www.kaggle.com/mochiymochi/new-ensemble-of-public-notebooks-1/code
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    yu5uke
    Description

    Dataset

    This dataset was created by yu5uke

    Contents

  20. ‘🏈 NFL Favorite Team’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘🏈 NFL Favorite Team’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-nfl-favorite-team-b15c/996bf36f/?iid=001-750&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘🏈 NFL Favorite Team’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/nfl-favorite-teame on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    See Readme for more details.
    This repository contains a selection of the data -- and the data-processing scripts -- behind the articles, graphics and interactives at FiveThirtyEight.

    We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License. If you do find it useful, please let us know (andrei.scheinkman@fivethirtyeight.com).

    Source: https://github.com/fivethirtyeight/data

    This dataset was created by FiveThirtyEight and contains around 0 samples along with Cch, Slp, technical information and other features such as: - Nyp - Beh - and more.

    How to use this dataset

    • Analyze Cch in relation to Slp
    • Study the influence of Nyp on Beh
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit FiveThirtyEight

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---
