100+ datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (167219625372 bytes)
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
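    For example, a minimal pandas sketch of this join (assuming the KernelVersions.csv file from Meta Kaggle is in the working directory and uses its usual Id column; the version id here is hypothetical):

      import pandas as pd

      kernel_versions = pd.read_csv("KernelVersions.csv")  # from Meta Kaggle

      # A code file named 123456789.py in this dataset corresponds to
      # the KernelVersions row with Id == 123456789.
      version_id = 123456789  # hypothetical id
      print(kernel_versions.loc[kernel_versions["Id"] == version_id])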

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
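    A minimal sketch of deriving the folder for a given version id from this layout:

      def version_dir(kernel_version_id: int) -> str:
          # 123456789 -> top-level folder 123, subfolder 456
          top = kernel_version_id // 1_000_000
          sub = (kernel_version_id % 1_000_000) // 1_000
          return f"{top}/{sub}"

      print(version_dir(123456789))  # 123/456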

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
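    A minimal sketch of downloading one file with the google-cloud-storage Python client (assumes application-default credentials and a billing-enabled GCP project; the object path is hypothetical):

      from google.cloud import storage

      client = storage.Client()
      # Requester pays: the named project is billed for the download.
      bucket = client.bucket("kaggle-meta-kaggle-code-downloads",
                             user_project="your-gcp-project-id")
      blob = bucket.blob("123/456/123456789.ipynb")  # hypothetical object path
      blob.download_to_filename("123456789.ipynb")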

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Data from: KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

    • zenodo.org
    • dataon.kisti.re.kr
    • +1more
    bin, bz2, pdf
    Updated Jul 19, 2024
    Cite
    Luigi Quaranta; Fabio Calefato; Filippo Lanubile (2024). KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle [Dataset]. http://doi.org/10.5281/zenodo.4468523
    Explore at:
    bz2, pdf, bin
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luigi Quaranta; Fabio Calefato; Filippo Lanubile
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.

    The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.

    In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.

    More specifically, the package comprises the following three compressed archives:

    1. KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;

    2. KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;

    3. MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.

    Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
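    A minimal sketch of unpacking the notebook archive with Python's standard library (assumes the archive has been downloaded to the current directory):

      import tarfile

      with tarfile.open("KGT_dataset.tar.bz2", mode="r:bz2") as tar:
          tar.extractall("KGT_dataset")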

  3. Top Kaggle Notebooks dataset: R

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Md. Eyasen Arafat (2024). Top Kaggle Notebooks dataset: R [Dataset]. https://www.kaggle.com/datasets/mdearafat/top-kaggle-notebooks-dataset-r
    Explore at:
    zip (37357 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Md. Eyasen Arafat
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Source: Kaggle
    Content: Information about R notebooks
    Ranking: Top 500 (criteria, OUTPUTS: Visualizations)
    Programming Language: R
    Last Update: April 23, 2024, at 7:32 AM GMT+6

    This dataset can be useful for exploring popular R notebooks on Kaggle, finding inspiration for your own projects, and learning from other data scientists. By looking at the notebooks with high upvotes, views, and medals, you can get an idea of what topics are trending and what makes a successful Kaggle Notebook.

  4. issues-kaggle-notebooks

    • huggingface.co
    Updated Aug 12, 2025
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks


    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
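    A minimal sketch of streaming the dataset with the Hugging Face datasets library (the split name is an assumption; check the dataset page for the exact configurations):

      from datasets import load_dataset

      ds = load_dataset("HuggingFaceTB/issues-kaggle-notebooks",
                        split="train", streaming=True)
      print(next(iter(ds)))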

  5. kaggle-notebooks-edu-v0

    • huggingface.co
    Updated May 31, 2025
    Cite
    Jupyter Agent (2025). kaggle-notebooks-edu-v0 [Dataset]. https://huggingface.co/datasets/jupyter-agent/kaggle-notebooks-edu-v0
    Explore at:
    Dataset updated
    May 31, 2025
    Dataset provided by
    Project Jupyter (https://jupyter.org/)
    Authors
    Jupyter Agent
    Description

    Kaggle Notebooks LLM Filtered

    Model: meta-llama/Meta-Llama-3.1-70B-Instruct
    Sample: 12,400
    Source dataset: data-agents/kaggle-notebooks
    Prompt:

    Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.

    The notebooks are formatted with the following tokens:

    START

    Here comes markdown content

    Here comes python code

    Here comes code output

    More… See the full description on the dataset page: https://huggingface.co/datasets/jupyter-agent/kaggle-notebooks-edu-v0.

  6. Arcade Natural Language to Code Challenge

    • kaggle.com
    zip
    Updated Feb 22, 2023
    Cite
    Google AI (2023). Arcade Natural Language to Code Challenge [Dataset]. https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset
    Explore at:
    zip (3921922 bytes)
    Dataset updated
    Feb 22, 2023
    Dataset authored and provided by
    Google AI
    Description

    Arcade: Natural Language to Code Generation in Interactive Computing Notebooks

    Arcade is a collection of natural-language-to-code problems on interactive data science notebooks. Each problem features an NL intent as the problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of code-generating large language models in producing data science programs from natural language instructions. Please read our paper for more details.

    Note👉 This Kaggle dataset only contains the dataset files of Arcade. Refer to our main GitHub repository for detailed instructions on using this dataset.

    Folder Structure

    Below is the structure of its content:

    └── ./
      ├── existing_tasks # Problems derived from existing data science notebooks on GitHub/
      │  ├── metadata.json # Metadata produced by `build_existing_tasks_split.py` to reproduce this split.
      │  ├── artifacts/ # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
      │  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      ├── new_tasks/
      │  ├── dataset.json # Original, unpreprocessed dataset
      │  ├── kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
      │  ├── artifacts/ # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
      │  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      └── checksums.txt # Table of MD5 checksums of dataset files.
    

    Dataset File Structure

    All the dataset '*.json' files follow the same structure. Each dataset file is a JSON-serialized list of episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:

    {
      "notebook_name": "Name of the notebook.",
      "work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
      "annotator": "Anonymized annotator Id."
      "turns": [
        # A list of natural language to code examples using the current notebook context.
        {
          "input": "Prompt to a code generation model.",
          "turn": {
            "intent": {
              "value": "Annotated NL intent for the current turn.",
              "is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
              "cell_idx": "Index of the intent Markdown cell.",
              "line_span": "Line span of the intent.",
              "not_sure": "Annotation confidence.",
              "output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
            },
            "code": {
              "value": "Reference code solution.",
              "cell_idx": "Cell index of the code cell containing the solution.",
              "num_lines": "Number of lines in the reference solution.",
              "line_span": "Line span.",
            },
            "code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
            "delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
            "metadata": {
              "annotator_id": "Annotator Id",
              "num_code_lines": "Metadata, please ignore.",
              "utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
            },
          },
          "notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
          "metadata": {
            # A dict of metadata of this turn.
            "context_cells": [ # A list of context cells before the problem.
              {
                "cell_type": "code|markdown",
                "source": "Cell content."
              },
            ],
            "delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
            # The following fields only occur in datasets inlined with schema descriptions.
            "context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
            "inten...
    
  7. kaggle-notebook-requirements

    • kaggle.com
    zip
    Updated Dec 6, 2022
    Cite
    yama (2022). kaggle-notebook-requirements [Dataset]. https://www.kaggle.com/datasets/askeeee/kagglenotebookrequirements
    Explore at:
    zip (9392 bytes)
    Dataset updated
    Dec 6, 2022
    Authors
    yama
    Description

    Dataset

    This dataset was created by yama

    Contents

  8. kaggle-notebooks-conversations-hq

    • huggingface.co
    Updated Mar 7, 2025
    + more versions
    Cite
    Computer Intelligence Project (2025). kaggle-notebooks-conversations-hq [Dataset]. https://huggingface.co/datasets/bigcomputer/kaggle-notebooks-conversations-hq
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 7, 2025
    Dataset authored and provided by
    Computer Intelligence Project
    Description

    bigcomputer/kaggle-notebooks-conversations-hq dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. numerai data V5.0 Universe

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Josef Švenda (2025). numerai data V5.0 Universe [Dataset]. https://www.kaggle.com/datasets/svendaj/numerai-latest-tournament-data
    Explore at:
    zip (10867970835 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Josef Švenda
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Weekly updated dataset with the latest version of the Numerai tournament data. The dataset contains a directory named after the latest data version; currently that is V5.0. The data are downloaded weekly by the public Kaggle notebook numerai data whenever new data become available (at the opening of the Saturday round). Whenever that notebook's output changes, this dataset is updated automatically, so you can add this dataset (or the output of the numerai data notebook) to your notebooks as a data source and do not need to download anything yourself.

    Older versions of the data are available elsewhere:

    • V4 and V4.1: dataset and producing notebook
    • V4.2 Rain: dataset and producing notebook
    • V4.3 Midnight: dataset and producing notebook

    Text file current_round.txt contains the number of tournament round when data were successfully downloaded.

    In addition to all the data files provided by Numerai, the downloading notebook creates four partitions of non-overlapping eras for the training and validation data. These are stored as f"train_no{split}.parquet" and f"validation_no{split}.parquet" files. Since Round 864 the polars library is used to produce the downsampled files. Because polars has no index concept, the saved files store the index id as an ordinary column. If you need the same index as in the original files, add the check shown below right after reading the file.
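    A minimal sketch of that check (the file name is a hypothetical example of one of the split files):

      import pandas as pd

      filename = "train_no0.parquet"  # hypothetical split file name
      df = pd.read_parquet(filename)
      # Restore the original "id" index if it was saved as a plain column.
      if "id" not in df.index.names:
          df.set_index("id", inplace=True)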

  10. meta-kaggle-notebook

    • huggingface.co
    Updated May 15, 2015
    + more versions
    Cite
    Computer Intelligence Project (2015). meta-kaggle-notebook [Dataset]. https://huggingface.co/datasets/bigcomputer/meta-kaggle-notebook
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 15, 2015
    Dataset authored and provided by
    Computer Intelligence Project
    Description

    bigcomputer/meta-kaggle-notebook dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonimous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
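    A minimal sketch of these joins in pandas (assumes the CSVs are unpacked in the working directory):

      import pandas as pd

      code_blocks = pd.read_csv("code_blocks.csv")
      kernels = pd.read_csv("kernels_meta.csv")
      competitions = pd.read_csv("competitions_meta.csv")

      # code blocks -> notebook metadata -> competition metadata
      blocks = code_blocks.merge(kernels, on="kernel_id")
      blocks = blocks.merge(competitions, on="comp_name")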

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  12. Python Package List

    • kaggle.com
    zip
    Updated Sep 9, 2019
    Cite
    Otto Schnurr (2019). Python Package List [Dataset]. https://www.kaggle.com/ottoschnurr/kaggle-notebook-python-package-manifest
    Explore at:
    zip (19563 bytes)
    Dataset updated
    Sep 9, 2019
    Authors
    Otto Schnurr
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    A significant amount of software is available in Kaggle's Python notebook environment. I had hoped to find a reference somewhere listing which Python packages were available and what each one did.

    When I didn't find what I was looking for, I decided to build this dataset instead.

    Content

    This dataset was assembled in four steps:

    1. Code inside a Kaggle notebook was used to gather the names of over 600 installed packages.
    2. A package list was scraped from Anaconda and cross-referenced against the notebook package list.
    3. The roughly 400 packages that remained were carefully queried from the Python Package Index using its JSON API (see the sketch after this list).
    4. The results were collated into a manifest.
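    A minimal sketch of step 3, querying the PyPI JSON API for package summaries (the package names here are placeholders):

      import requests

      for name in ["numpy", "pandas"]:
          info = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10).json()["info"]
          print(name, "-", info["summary"])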

    Reference

    Acknowledgements

  13. kaggle-notebooks-outputs-filtered-25

    • huggingface.co
    Updated Dec 28, 2024
    + more versions
    Cite
    Computer Intelligence Project (2024). kaggle-notebooks-outputs-filtered-25 [Dataset]. https://huggingface.co/datasets/bigcomputer/kaggle-notebooks-outputs-filtered-25
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 28, 2024
    Dataset authored and provided by
    Computer Intelligence Project
    Description

    bigcomputer/kaggle-notebooks-outputs-filtered-25 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Assessing Computational Notebook Understandability through Code Metrics...

    • zenodo.org
    zip
    Updated Oct 12, 2023
    Cite
    Mojtaba Mostafavi; Mojtaba Mostafavi (2023). Assessing Computational Notebook Understandability through Code Metrics Analysis [Dataset]. http://doi.org/10.5281/zenodo.8435192
    Explore at:
    zip
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mojtaba Mostafavi; Mojtaba Mostafavi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computational notebooks have become the primary coding environment for data scientists. Despite their popularity, research on the code quality of these notebooks is still in its infancy, and the code shared in these notebooks is often of poor quality. Considering the importance of maintenance and reusability, it is crucial to pay attention to the comprehension of notebook code and to identify the notebook metrics that play a significant role in comprehension.

    The level of code comprehension is a qualitative variable closely associated with the user's opinion about the code. Previous studies have typically employed two approaches to measure it. One approach involves using limited questionnaire methods to review a small number of code pieces. Another approach relies solely on metadata, such as the number of likes and user votes for a project in the software repository.

    In our approach, we enhanced the measurement of the understandability level of notebook code by leveraging user comments within a software repository. As a case study, we started with 248,761 Kaggle Jupyter notebooks introduced in previous studies and their relevant metadata. To identify user comments associated with code comprehension within the notebooks, we utilized a fine-tuned DistilBERT transformer. We established a user-comment-based criterion for measuring code understandability by considering the number of code-comprehension-related comments, the upvotes on those comments, the total views of the notebook, and the total upvotes received by the notebook. This criterion has proven to be more effective than alternative methods, making it the ground truth for evaluating the code comprehension of our notebook set.

    In addition, we collected a total of 34 metrics for 10,857 notebooks, categorized as script-based and notebook-based metrics. These metrics were utilized as features in our dataset. Using the Random Forest classifier, our predictive model achieved 85% accuracy in predicting code comprehension levels in computational notebooks, identifying developer expertise and markdown-based metrics as key factors.
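    A minimal sketch of the final modeling step on synthetic stand-in data (in the real study the features are the 34 collected metrics and the labels come from the user-comment-based criterion):

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      X = rng.normal(size=(10_857, 34))    # one row per notebook, 34 metrics
      y = rng.integers(0, 2, size=10_857)  # comprehension level (binary stand-in)

      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
      clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
      print("held-out accuracy:", clf.score(X_test, y_test))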

  15. Tensorflow's Global and Operation level seeds

    • kaggle.com
    zip
    Updated May 20, 2023
    Cite
    Deepak Ahire (2023). Tensorflow's Global and Operation level seeds [Dataset]. https://www.kaggle.com/datasets/adeepak7/tensorflow-global-and-operation-level-seeds
    Explore at:
    zip (2984 bytes)
    Dataset updated
    May 20, 2023
    Authors
    Deepak Ahire
    License

    Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description

    This dataset contains the Python files with the snippets required for the Kaggle kernel https://www.kaggle.com/code/adeepak7/tensorflow-s-global-and-operation-level-seeds/

    Since the kernel is about setting/re-setting global and operation-level seeds, nullifying the effect of these seeds in subsequent cells wasn't possible. Hence, the snippets are provided as separate Python files, and these files are executed independently in separate cells.

  16. Data from: Dataset of paper "Why do Machine Learning Notebooks Crash?"

    • data-staging.niaid.nih.gov
    • nde-dev.biothings.io
    Updated Mar 12, 2025
    Cite
    Wang, Yiran; Meijer, Willem; López, José Antonio Hernández; Nilsson, Ulf; Varró, Dániel (2025). Dataset of paper "Why do Machine Learning Notebooks Crash?" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14070487
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Linköping University
    Authors
    Wang, Yiran; Meijer, Willem; López, José Antonio Hernández; Nilsson, Ulf; Varró, Dániel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All the related data of our paper "Why do Machine Learning Notebooks Crash?" includes:

    • GitHub and Kaggle notebooks that contain error outputs. GitHub notebooks are from The Stack repository [1]; Kaggle notebooks are public notebooks on the Kaggle platform from the year 2023, downloaded via KGTorrent [2].
    • Identified programming-language results for the GitHub notebooks.
    • Identified ML-library results for the Kaggle notebooks.
    • Datasets of crashes from GitHub and Kaggle.
    • Clustering results of crashes, from all crashes together and from GitHub and Kaggle respectively.
    • Sampled crashes and associated notebooks (organized by cluster id).
    • Manual labeling and reviewing results.
    • Reproduction results.

    The related code repository can be found here.

  17. Malware DataSet

    • kaggle.com
    zip
    Updated May 25, 2021
    + more versions
    Cite
    Oronzo Comi (2021). Malware DataSet [Dataset]. https://www.kaggle.com/oronzo/malware-dataset
    Explore at:
    zip (53084715 bytes)
    Dataset updated
    May 25, 2021
    Authors
    Oronzo Comi
    Description

    Dataset

    This dataset was created by Oronzo Comi

    Contents

    It contains the following files:

  18. No Data Sources

    • kaggle.com
    zip
    Updated Apr 12, 2017
    Cite
    Kaggle (2017). No Data Sources [Dataset]. https://www.kaggle.com/kaggle/no-data-sources
    Explore at:
    zip (159 bytes)
    Dataset updated
    Apr 12, 2017
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    Description

    This isn't a dataset; it is a collection of kernels written on Kaggle that use no data at all.

  19. Kaggle Notebook User Rankings

    • kaggle.com
    zip
    Updated Aug 17, 2021
    Cite
    James Trotman (2021). Kaggle Notebook User Rankings [Dataset]. https://www.kaggle.com/jtrotman/kaggle-notebook-user-rankings
    Explore at:
    zip (66237 bytes)
    Dataset updated
    Aug 17, 2021
    Authors
    James Trotman
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Created just for this: How To Compute Notebook Author Rankings

    But feel free to re-use it if you wish!

  20. Update test

    • kaggle.com
    zip
    Updated Apr 23, 2024
    Cite
    Nick Potter (2024). Update test [Dataset]. https://www.kaggle.com/datasets/nnjjpp/update-test
    Explore at:
    zip (4719 bytes)
    Dataset updated
    Apr 23, 2024
    Authors
    Nick Potter
    Description

    Example of updating a dataset using a notebook. I run a new version of this notebook, and the "log.csv" file gets updated with a new row.
