59 datasets found
  1. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    | Column | Description |
    | --- | --- |
    | code_blocks_index | Global index linking code blocks to markup_data.csv. |
    | kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
    | code_block_id | Position of the code block within the notebook. |
    | code_block | The actual machine learning code snippet. |

    Table 2. kernels_meta.csv structure

    | Column | Description |
    | --- | --- |
    | kernel_id | Identifier for the Kaggle Jupyter notebook. |
    | kaggle_score | Performance metric of the notebook. |
    | kaggle_comments | Number of comments on the notebook. |
    | kaggle_upvotes | Number of upvotes the notebook received. |
    | kernel_link | URL to the notebook. |
    | comp_name | Name of the associated Kaggle competition. |

    Table 3. competitions_meta.csv structure

    | Column | Description |
    | --- | --- |
    | comp_name | Name of the Kaggle competition. |
    | description | Overview of the competition task. |
    | data_type | Type of data used in the competition. |
    | comp_type | Classification of the competition. |
    | subtitle | Short description of the task. |
    | EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
    | data_sources | Links to datasets used. |
    | metric type | Class label for the assessment metric. |

    Table 4. markup_data.csv structure

    | Column | Description |
    | --- | --- |
    | code_block | Machine learning code block. |
    | too_long | Flag indicating whether the block spans multiple semantic types. |
    | marks | Confidence level of the annotation. |
    | graph_vertex_id | ID of the semantic type. |

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
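    These joins can be sketched in pure Python. The column names come from the tables above; the toy rows are hypothetical stand-ins for the real CSV contents:

```python
import csv
import io

# Hypothetical toy rows standing in for the real CSV files.
code_blocks_csv = io.StringIO(
    "code_blocks_index,kernel_id,code_block_id,code_block\n"
    "0,k1,0,import pandas as pd\n"
)
kernels_meta_csv = io.StringIO(
    "kernel_id,kaggle_score,comp_name\n"
    "k1,0.97,titanic\n"
)
competitions_meta_csv = io.StringIO(
    "comp_name,description\n"
    "titanic,Predict survival on the Titanic\n"
)

kernels = {r["kernel_id"]: r for r in csv.DictReader(kernels_meta_csv)}
comps = {r["comp_name"]: r for r in csv.DictReader(competitions_meta_csv)}

# code_blocks -> kernels_meta via kernel_id, then -> competitions_meta via comp_name.
joined = []
for block in csv.DictReader(code_blocks_csv):
    kernel = kernels[block["kernel_id"]]
    comp = comps[kernel["comp_name"]]
    joined.append({**block, **kernel, **comp})

print(joined[0]["description"])  # -> Predict survival on the Titanic
```

    On the real data the same two joins are usually done with pandas `merge` on `kernel_id` and then `comp_name`.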

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  2. Titanic Dataset - cleaned

    • kaggle.com
    zip
    Updated Aug 8, 2019
    Cite
    WinstonSDodson (2019). Titanic Dataset - cleaned [Dataset]. https://www.kaggle.com/winstonsdodson/titanic-dataset-cleaned
    Explore at:
    zip (41906 bytes)
    Dataset updated
    Aug 8, 2019
    Authors
    WinstonSDodson
    Description

    This is the classic Titanic dataset provided in the Kaggle competition, cleaned in one of the most popular kernels there. Please see the kernel titled "A Data Science Framework: To Achieve 99% Accuracy" for a great lesson in data science. That kernel gives a great explanation of the thinking behind this data cleaning, as well as a very professional demonstration of the technologies and skills required to do it. It then continues with an overview of many ML techniques, and it is copiously and meticulously documented with many useful citations.

    Of course, data cleaning is an essential skill in data science, but I wanted to use this data for a study of other machine learning techniques. So I found and used this set of data, which is well known and cleaned to a benchmark accepted by many.

  3. Kaggle Survey Challenge - All Kernels

    • kaggle.com
    zip
    Updated Nov 22, 2022
    Cite
    KlemenVodopivec (2022). Kaggle Survey Challenge - All Kernels [Dataset]. https://www.kaggle.com/datasets/klemenvodopivec/kaggle-survey-challenge-all-kernels/data
    Explore at:
    zip (206438 bytes)
    Dataset updated
    Nov 22, 2022
    Authors
    KlemenVodopivec
    Description

    A collection of kernel submissions for the Kaggle survey competitions from 2017 to 2022. As this data was collected during the 2022 survey competition, it does not contain all the kernels for 2022.

  4. PlaygroundS4E04|OriginalData

    • kaggle.com
    zip
    Updated Apr 1, 2024
    Cite
    Ravi Ramakrishnan (2024). PlaygroundS4E04|OriginalData [Dataset]. https://www.kaggle.com/datasets/ravi20076/playgrounds4e04originaldata
    Explore at:
    zip (67811 bytes)
    Dataset updated
    Apr 1, 2024
    Authors
    Ravi Ramakrishnan
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was created using the reference below:
    https://archive.ics.uci.edu/dataset/1/abalone
    We import the corresponding repository in a Kaggle kernel and populate the dataset from it. Users may instead import the corresponding dataset with a simple read_csv in pandas and proceed with the solution.
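    A minimal sketch of the pandas route, with two hypothetical abalone-style rows standing in for the downloaded file (the column set is abbreviated; on Kaggle you would point read_csv at the actual CSV path):

```python
import io

import pandas as pd

# Hypothetical sample in place of the real abalone CSV download.
sample = io.StringIO(
    "Sex,Length,Diameter,Height,Rings\n"
    "M,0.455,0.365,0.095,15\n"
    "F,0.530,0.420,0.135,9\n"
)

df = pd.read_csv(sample)  # real data: pd.read_csv("<path-to-abalone-csv>")
print(df.shape)  # -> (2, 5)
```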

    Best wishes!

  5. Pickled Crawl-300D-2M For Kernel Competitions

    • kaggle.com
    zip
    Updated Jun 4, 2019
    Cite
    Budi Ryan (2019). Pickled Crawl-300D-2M For Kernel Competitions [Dataset]. https://www.kaggle.com/budiryan/pickled-crawl300d2m-for-kernel-competitions
    Explore at:
    zip (1820206270 bytes)
    Dataset updated
    Jun 4, 2019
    Authors
    Budi Ryan
    Description

    Dataset

    This dataset was created by Budi Ryan

    Contents

  6. Lyft Best Performing Public Kernels

    • kaggle.com
    zip
    Updated Sep 10, 2020
    Cite
    kkiller (2020). Lyft Best Performing Public Kernels [Dataset]. https://www.kaggle.com/kneroma/lyft-best-performing-public-kernels
    Explore at:
    zip (157791531 bytes)
    Dataset updated
    Sep 10, 2020
    Authors
    kkiller
    Description

    Context

    This is a collection of the best-performing kernels in the Lyft l5kit competition.

  7. Competitions Shake-up

    • kaggle.com
    zip
    Updated Sep 27, 2020
    Cite
    Daniboy370 (2020). Competitions Shake-up [Dataset]. https://www.kaggle.com/daniboy370/competitions-shakeup
    Explore at:
    zip (388789 bytes)
    Dataset updated
    Sep 27, 2020
    Authors
    Daniboy370
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Shake-what ?!

    The Shake phenomenon occurs when the competition shifts between two different test sets:

    \[ \text{Public test set} \ \Rightarrow \ \text{Private test set} \quad \Leftrightarrow \quad \text{LB-public} \ \Rightarrow \ \text{LB-private} \]

    The private test set, unavailable until then, becomes available, and the models' scores are re-calculated. This re-evaluation elicits a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset, and to act to improve their models until the deadline.

    Unable to find a uniform, conventional term for this mechanism, I will use common sense to define the following intuition:

                 <img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">
    

    From the starter kernel :

                   <img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">
    

    Content

    Seven datasets of competitions scraped from Kaggle:

    | Competition | Name of file |
    | --- | --- |
    | Elo Merchant Category Recommendation | df_{Elo} |
    | Human Protein Atlas Image Classification | df_{Protein} |
    | Humpback Whale Identification | df_{Humpback} |
    | Microsoft Malware Prediction | df_{Microsoft} |
    | Quora Insincere Questions Classification | df_{Quora} |
    | TGS Salt Identification Challenge | df_{TGS} |
    | VSB Power Line Fault Detection | df_{VSB} |

    As an example, consider the following dataframe from the Quora competition:

    | Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
    | --- | --- | --- | --- | --- | --- |
    | The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
    | ... | ... | ... | ... | ... | ... |
    | D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
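    Judging from the Quora rows above, the Shake column is simply the public rank minus the private rank (positive means the team climbed after the private re-scoring); a minimal sketch:

```python
def shake(rank_public: int, rank_private: int) -> int:
    """Positive shake: the team climbed after the private re-scoring."""
    return rank_public - rank_private

# Values taken from the Quora example rows.
print(shake(7, 1))      # The Zoo       -> 6
print(shake(65, 1401))  # D.J. Trump    -> -1336
```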

    I encourage everybody to investigate the dataset thoroughly in search of interesting findings!

    \[ \text{Enjoy !}\]

  8. COVID19 pretrained

    • kaggle.com
    zip
    Updated Apr 15, 2020
    Cite
    Jordi Mas (2020). COVID19 pretrained [Dataset]. https://www.kaggle.com/jordimas/covid19-pretrained
    Explore at:
    zip (1571087 bytes)
    Dataset updated
    Apr 15, 2020
    Authors
    Jordi Mas
    Description

    This dataset contains data for use in COVID-19 competition kernels:

    • Pretrained models, consisting of several sets of initial populations for use in DEoptim evolutions. They were built with, and can be recreated by, the same kernel scripts that use them; see the kernels for instructions.

    • World population data, in the file population.csv, all obtained from Wikipedia.

  9. all_kernels_cleaned

    • kaggle.com
    zip
    Updated Nov 16, 2022
    Cite
    KlemenVodopivec (2022). all_kernels_cleaned [Dataset]. https://www.kaggle.com/datasets/klemenvodopivec/all-kernels-cleaned
    Explore at:
    zip (79146 bytes)
    Dataset updated
    Nov 16, 2022
    Authors
    KlemenVodopivec
    Description

    Dataset

    This dataset was created by KlemenVodopivec

    Contents

  10. PyTorch Model Zoo

    • kaggle.com
    zip
    Updated Apr 3, 2019
    Cite
    Igor Krashenyi (2019). PyTorch Model Zoo [Dataset]. https://www.kaggle.com/igorkrashenyi/pytorch-model-zoo
    Explore at:
    zip (8811991691 bytes)
    Dataset updated
    Apr 3, 2019
    Authors
    Igor Krashenyi
    Description

    Dataset

    This dataset was created by Igor Krashenyi

    Contents

  11. Mlcourse.ai-2020

    • kaggle.com
    zip
    Updated Oct 14, 2020
    Cite
    anas qais (2020). Mlcourse.ai-2020 [Dataset]. https://www.kaggle.com/anasqais/mlcourseai2020
    Explore at:
    zip (15881 bytes)
    Dataset updated
    Oct 14, 2020
    Authors
    anas qais
    Description

    Open Machine Learning Course mlcourse.ai is designed to perfectly balance theory and practice; therefore, each topic is followed by an assignment with a deadline in a week. You can also take part in several Kaggle Inclass competitions held during the course and write your own tutorials. The next session launches in September, 2019. For more info go to the mlcourse.ai main page.

    Outline

    This is the list of published articles on medium.com (English), habr.com (Russian), and jqr.com (Chinese). See Kernels of this Dataset for the same material in English.

    1. Exploratory Data Analysis with Pandas (uk, ru, cn; Kaggle Kernel)
    2. Visual Data Analysis with Python (uk, ru, cn; Kaggle Kernels: part1, part2)
    3. Classification, Decision Trees and k Nearest Neighbors (uk, ru, cn; Kaggle Kernel)
    4. Linear Classification and Regression (uk, ru, cn; Kaggle Kernels: part1, part2, part3, part4, part5)
    5. Bagging and Random Forest (uk, ru, cn; Kaggle Kernels: part1, part2, part3)
    6. Feature Engineering and Feature Selection (uk, ru, cn; Kaggle Kernel)
    7. Unsupervised Learning: Principal Component Analysis and Clustering (uk, ru, cn; Kaggle Kernel)
    8. Vowpal Wabbit: Learning with Gigabytes of Data (uk, ru, cn; Kaggle Kernel)
    9. Time Series Analysis with Python, part 1 (uk, ru, cn); Predicting future with Facebook Prophet, part 2 (uk, cn); Kaggle Kernels: part1, part2
    10. Gradient Boosting (uk, ru, cn; Kaggle Kernel)

    Assignments

    Each topic is followed by an assignment. See demo versions in this Dataset. Solutions will be discussed in the upcoming run of the course.

    Kaggle competitions

    1. Catch Me If You Can: Intruder Detection through Webpage Session Tracking (Kaggle Inclass)
    2. How good is your Medium article? (Kaggle Inclass)

    Rating

    Throughout the course we maintain a student rating. It takes into account credits scored in assignments and Kaggle competitions. Top students (according to the final rating) will be listed on a special Wiki page.

    Community

    Discussions between students are held in the #mlcourse_ai channel of the OpenDataScience Slack team. A registration form will be shared prior to the start of the new session.

    Collaboration

    You can publish Kernels using this Dataset, but please respect others' interests: don't share solutions to assignments or well-performing solutions to Kaggle Inclass competitions. If you notice any typos or errors in the course material, please open an Issue or make a pull request in the course repo. The course is free, but you can support the organizers by making a pledge on Patreon (monthly support) or a one-time payment on Ko-fi.

  12. Pickled glove.840B.300d

    • kaggle.com
    zip
    Updated Apr 16, 2019
    Cite
    عثمان (2019). Pickled glove.840B.300d [Dataset]. https://www.kaggle.com/datasets/authman/pickled-glove840b300d-for-10sec-loading/discussion
    Explore at:
    zip (2505925996 bytes)
    Dataset updated
    Apr 16, 2019
    Authors
    عثمان
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    https://nlp.stanford.edu/projects/glove/

    Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

    GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
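    The point of pickling these vectors is to skip the text parse: instead of reading millions of "word v1 ... v300" lines, you load a ready dict of word → vector in one call. A minimal sketch with a hypothetical two-word vocabulary (the real table maps ~2.2M words to 300-dimensional vectors):

```python
import pickle
import tempfile

# Hypothetical tiny embedding table standing in for the full GloVe dict.
embeddings = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}

# One-time conversion step: serialize the parsed dict.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)
    path = f.name

# In a kernel, loading is then a single fast call.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["cat"])  # -> [0.4, 0.5, 0.6]
```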

  13. PlaygroundS4E07|ModelCollation

    • kaggle.com
    zip
    Updated Aug 1, 2024
    Cite
    Ravi Ramakrishnan (2024). PlaygroundS4E07|ModelCollation [Dataset]. https://www.kaggle.com/datasets/ravi20076/playgrounds4e07modelcollation
    Explore at:
    zip (11810135111 bytes)
    Dataset updated
    Aug 1, 2024
    Authors
    Ravi Ramakrishnan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Competition link- https://www.kaggle.com/competitions/playground-series-s4e7

    This dataset contains 3 directories, as below:

    | Directory Label | Contents |
    | --- | --- |
    | V1 | All single models from private experiments V1-V8; 60+ single models are stored here |
    | V4 | All blends and stacks through the last 2 weeks of the competition; final submissions are also stored here (refer to the ones ending with V9/V10) |
    | V5 | More private experiments and their results; we wanted to subsume everything into the V1 tables but found the dataset difficult to maintain due to its size, so we created a new version with additional models (30+ models) |

    CV scores across all of these models, and the final dataset for stacking, are presented in the kernel linked below:
    https://www.kaggle.com/code/ravi20076/playgrounds4e07-modelpp

  14. dsbowl19_features

    • kaggle.com
    zip
    Updated Nov 8, 2019
    Cite
    Andrew Lukyanenko (2019). dsbowl19_features [Dataset]. https://www.kaggle.com/artgor/dsbowl19-features
    Explore at:
    zip (4650353 bytes)
    Dataset updated
    Nov 8, 2019
    Authors
    Andrew Lukyanenko
    Description

    These are features for DSbowl19 competition. The code for generation is in my kernel: https://www.kaggle.com/artgor/oop-approach-to-fe-and-models

  15. Cat in dat 2: Public Kernels

    • kaggle.com
    zip
    Updated Mar 25, 2020
    Cite
    Pavel Prokhorov (2020). Cat in dat 2: Public Kernels [Dataset]. https://www.kaggle.com/datasets/pavelvpster/cat-in-dat-2-public-kernels
    Explore at:
    zip (30866178 bytes)
    Dataset updated
    Mar 25, 2020
    Authors
    Pavel Prokhorov
    Description

    Content

    This dataset contains submissions from popular public kernels of the https://www.kaggle.com/c/cat-in-the-dat-ii competition.

  16. PyTorch Resnext50 Pretrained Model

    • kaggle.com
    zip
    Updated Jan 27, 2020
    Cite
    Wei Hao Khoong (2020). PyTorch Resnext50 Pretrained Model [Dataset]. https://www.kaggle.com/khoongweihao/pytorch-resnext50-pretrained-model
    Explore at:
    zip (93271178 bytes)
    Dataset updated
    Jan 27, 2020
    Authors
    Wei Hao Khoong
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description
  17. Cat in dat: Kernels

    • kaggle.com
    zip
    Updated Sep 6, 2019
    Cite
    Pavel Prokhorov (2019). Cat in dat: Kernels [Dataset]. https://www.kaggle.com/pavelvpster/cat-in-dat-kernels
    Explore at:
    zip (11470236 bytes)
    Dataset updated
    Sep 6, 2019
    Authors
    Pavel Prokhorov
    Description

    Context

    This dataset contains submissions (and scores) obtained from well-performing kernels published in the https://www.kaggle.com/c/cat-in-the-dat competition.

    Links are in 'kernels.csv' file.

    Thanks to the great authors!

  18. TMDB Competition Additional Features 2

    • kaggle.com
    zip
    Updated Jul 25, 2019
    + more versions
    Cite
    Kamal Chhirang (2019). TMDB Competition Additional Features 2 [Dataset]. https://www.kaggle.com/kamalchhirang/tmdb-competition-additional-features-2
    Explore at:
    zip (132049 bytes)
    Dataset updated
    Jul 25, 2019
    Authors
    Kamal Chhirang
    Description

    Dataset

    This dataset was created by Kamal Chhirang

    Contents

    It contains the following files:

  19. Gemma-Data Science Agent- Instruct- Dataset

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset
    Explore at:
    zip (9680013 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    ian cecil akoto
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

    Dataset Details

    Columns:
    • Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
    • Answer: The corresponding answer to the generated question.
    • Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
    • Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
    • Title: Title of the Kaggle competition or topic.

    Sources and Inspiration

    Sources:
    • Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
    • Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
    • Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

    Inspiration:

    The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

    Dataset Specifics
    • Total Records: [Specify the total number of question-answer pairs in the dataset]
    • Format: CSV (Comma Separated Values)
    • Size: [Specify the size of the dataset in MB or GB]
    • License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
    • Download Link: [Provide a link to download the dataset]

    Acknowledgments

    We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
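    Given the documented columns, one plausible way to turn a row into an instruction-tuning example is sketched below. The row contents and the prompt template are assumptions for illustration, not part of the dataset:

```python
import csv
import io

# Hypothetical row using the dataset's documented columns.
rows_csv = io.StringIO(
    "Question,Answer,Context,Subtitle,Title\n"
    '"What CV scheme was used?","Stratified 5-fold.",'
    '"The winner used stratified k-fold...","Tabular playground","Spaceship Titanic"\n'
)

def to_example(row: dict) -> dict:
    # Hypothetical instruction/response template for fine-tuning.
    prompt = (
        f"Competition: {row['Title']} ({row['Subtitle']})\n"
        f"Context: {row['Context']}\n"
        f"Question: {row['Question']}"
    )
    return {"prompt": prompt, "response": row["Answer"]}

examples = [to_example(r) for r in csv.DictReader(rows_csv)]
print(examples[0]["response"])  # -> Stratified 5-fold.
```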

  20. covid19 week 2 - pretrained model

    • kaggle.com
    zip
    Updated Apr 1, 2020
    Cite
    Jordi Mas (2020). covid19 week 2 - pretrained model [Dataset]. https://www.kaggle.com/jordimas/covid19
    Explore at:
    zip (532932 bytes)
    Dataset updated
    Apr 1, 2020
    Authors
    Jordi Mas
    Description

    Pretrained model for use with version 4 of this kernel. It consists of a set of initial populations for use in a DEoptim evolution, and it was built with, and can be recreated by, the same kernel script.
