Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
| --- | --- |
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
| --- | --- |
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
| --- | --- |
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
| --- | --- |
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv can be linked to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
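As an illustration, here is a minimal pandas sketch of how the tables could be joined (the file paths are assumptions; adjust them to your local copy of the dataset):

```python
import pandas as pd

# Load the three core tables (paths are assumptions).
code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

# code_blocks.csv -> kernels_meta.csv via kernel_id,
# kernels_meta.csv -> competitions_meta.csv via comp_name.
snippets = (
    code_blocks
    .merge(kernels_meta, on="kernel_id", how="inner")
    .merge(competitions_meta, on="comp_name", how="left")
)

# Each row now pairs a code snippet with its notebook's Kaggle
# score and the metadata of the associated competition.
print(snippets[["code_block", "kaggle_score", "comp_name"]].head())
```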
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling the training and evaluation of machine learning models across a variety of tasks.
This is the classic Titanic dataset provided in the Kaggle Titanic competition and then cleaned in one of the most popular kernels there. Please see the kernel titled "A Data Science Framework: To Achieve 99% Accuracy" for a great lesson in data science. That kernel gives a great explanation of the thinking behind this data cleaning, as well as a very professional demonstration of the technologies and skills involved. It then continues with an overview of many ML techniques, and it is copiously and meticulously documented with many useful citations.
Of course, data cleaning is an essential skill in data science, but I wanted to use this data to study other machine learning techniques. So I found and used this well-known dataset, cleaned to a benchmark accepted by many.
A collection of kernel submissions for the Kaggle survey competitions from 2017 to 2022. As this data was collected during the 2022 survey competition, it does not contain all the kernels for 2022.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created using the reference below:
https://archive.ics.uci.edu/dataset/1/abalone
We import the corresponding repository in a Kaggle kernel and populate the dataset from it. Users may instead import the data with a simple read_csv in pandas and proceed with their solution, as sketched below.
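A minimal sketch of that pandas route (the direct file URL and column names follow the usual UCI layout for this repository; verify them against the source before relying on them):

```python
import pandas as pd

# The UCI abalone file has no header row, so column names are
# supplied explicitly (standard UCI order for this dataset).
columns = [
    "Sex", "Length", "Diameter", "Height", "Whole_weight",
    "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings",
]
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
    header=None,
    names=columns,
)
print(df.head())
```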
Best wishes!
This dataset was created by Budi Ryan
This is a shelter for the best-performing kernels in the Lyft l5kit competition.
https://creativecommons.org/publicdomain/zero/1.0/
The Shake phenomenon occurs when the competition shifts between two different datasets:
\[ \text{Public test set} \ \Rightarrow \ \text{Private test set} \quad \Leftrightarrow \quad \text{LB-public} \ \Rightarrow \ \text{LB-private} \]
The private test set, which was unavailable until then, becomes available, and the models' scores are re-calculated. This re-evaluation elicits a corresponding re-ranking of the contestants in the competition. The shake allows participants to assess the severity of their overfitting to the public dataset and act to improve their models until the deadline.
Unable to find a uniform conventional term for this mechanism, I will use my common sense to define the following intuition:
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/images/latex.png?raw=true" width="550">
From the starter kernel:
<img src="https://github.com/Daniboy370/Uploads/blob/master/Kaggle-shake-ups/vids/shakeup_VID.gif?raw=true" width="625">
Seven datasets of competitions scraped from Kaggle:
| Competition | Name of file |
|---|---|
| Elo Merchant Category Recommendation | df_{Elo} |
| Human Protein Atlas Image Classification | df_{Protein} |
| Humpback Whale Identification | df_{Humpback} |
| Microsoft Malware Prediction | df_{Microsoft} |
| Quora Insincere Questions Classification | df_{Quora} |
| TGS Salt Identification Challenge | df_{TGS} |
| VSB Power Line Fault Detection | df_{VSB} |
As an example, consider the following dataframe from the Quora competition:

| Team Name | Rank-private | Rank-public | Shake | Score-private | Score-public |
| --- | --- | --- | --- | --- | --- |
| The Zoo | 1 | 7 | 6 | 0.71323 | 0.71123 |
| ... | ... | ... | ... | ... | ... |
| D.J. Trump | 1401 | 65 | -1336 | 0.000 | 0.70573 |
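A minimal pandas sketch of how the Shake column relates to the two rank columns (the file name is an assumption; column names follow the table above):

```python
import pandas as pd

# Load one competition's dataframe (file name is an assumption).
df = pd.read_csv("df_Quora.csv")

# Shake = public rank - private rank: positive means the team
# climbed after the private re-evaluation, negative means it fell.
df["Shake"] = df["Rank-public"] - df["Rank-private"]

# The biggest fallers expose the most severe public-LB overfitting.
print(df.sort_values("Shake").head())
```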
I encourage everybody to investigate the dataset thoroughly in search of interesting findings!
\[ \text{Enjoy!} \]
This dataset contains data for use in COVID-19 competition kernels:
Pretrained models, consisting of several sets of initial populations for use in DEoptim evolutions. They were built with, and can be recreated by, the same kernel scripts that use them; see the kernels for instructions.
World population data, in the file population.csv, all obtained from Wikipedia.
This dataset was created by KlemenVodopivec
This dataset was created by Igor Krashenyi
mlcourse.ai, the Open Machine Learning Course, is designed to perfectly balance theory and practice; therefore, each topic is followed by an assignment with a deadline in a week. You can also take part in several Kaggle Inclass competitions held during the course and write your own tutorials. The next session launches in September 2019. For more info, go to the mlcourse.ai main page.

Outline
This is the list of published articles on medium.com (English), habr.com (Russian), and jqr.com (Chinese). See Kernels of this Dataset for the same material in English.
1. Exploratory Data Analysis with Pandas: uk, ru, cn, Kaggle Kernel
2. Visual Data Analysis with Python: uk, ru, cn, Kaggle Kernels: part1, part2
3. Classification, Decision Trees and k Nearest Neighbors: uk, ru, cn, Kaggle Kernel
4. Linear Classification and Regression: uk, ru, cn, Kaggle Kernels: part1, part2, part3, part4, part5
5. Bagging and Random Forest: uk, ru, cn, Kaggle Kernels: part1, part2, part3
6. Feature Engineering and Feature Selection: uk, ru, cn, Kaggle Kernel
7. Unsupervised Learning: Principal Component Analysis and Clustering: uk, ru, cn, Kaggle Kernel
8. Vowpal Wabbit: Learning with Gigabytes of Data: uk, ru, cn, Kaggle Kernel
9. Time Series Analysis with Python, part 1: uk, ru, cn. Predicting the future with Facebook Prophet, part 2: uk, cn. Kaggle Kernels: part1, part2
10. Gradient Boosting: uk, ru, cn, Kaggle Kernel

Assignments
Each topic is followed by an assignment. See demo versions in this Dataset. Solutions will be discussed in the upcoming run of the course.

Kaggle competitions
1. Catch Me If You Can: Intruder Detection through Webpage Session Tracking. Kaggle Inclass
2. How good is your Medium article? Kaggle Inclass

Rating
Throughout the course we maintain a student rating. It takes into account credits scored in assignments and Kaggle competitions. Top students (according to the final rating) will be listed on a special Wiki page.

Community
Discussions between students are held in the #mlcourse_ai channel of the OpenDataScience Slack team. A registration form will be shared prior to the start of the new session.

Collaboration
You can publish Kernels using this Dataset, but please respect others' interests: don't share solutions to assignments or well-performing solutions for Kaggle Inclass competitions. If you notice any typos or errors in the course material, please open an Issue or make a pull request in the course repo.

The course is free, but you can support the organizers by making a pledge on Patreon (monthly support) or a one-time payment on Ko-fi.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
https://nlp.stanford.edu/projects/glove/
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
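A minimal sketch for loading the vectors into a Python dict (the file name matches the Common Crawl release above; the right-split parsing is a common convention for this release, since a few of its tokens contain spaces):

```python
import numpy as np

# Parse the GloVe text format: each line is a token followed by
# 300 float components. Split from the right so that the rare
# tokens containing spaces are still handled correctly.
embeddings = {}
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word = " ".join(parts[:-300])
        embeddings[word] = np.asarray(parts[-300:], dtype=np.float32)

print(embeddings["king"][:5])
```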
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Competition link: https://www.kaggle.com/competitions/playground-series-s4e7
This dataset contains 3 directories, as described below:
| Directory Label | Contents |
| --- | --- |
| V1 | All single models from private experiments V1-V8; 60+ single models are stored here. |
| V4 | All blends and stacks through the last 2 weeks of the competition; final submissions are also stored here (refer to the ones ending with V9/V10). |
| V5 | More private experiments and their results. We wanted to subsume everything into the V1 tables but found it difficult to maintain the dataset due to its size, so we created a new version with additional models (30+ models). |
CV scores across all of these models and the final dataset for stacking are presented in the kernel linked below: https://www.kaggle.com/code/ravi20076/playgrounds4e07-modelpp
These are features for the DSbowl19 competition. The code for generating them is in my kernel: https://www.kaggle.com/artgor/oop-approach-to-fe-and-models
This dataset contains submissions from popular public kernels of the https://www.kaggle.com/c/cat-in-the-dat-ii competition.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains submissions (and scores) obtained from well-performing kernels published in the https://www.kaggle.com/c/cat-in-the-dat competition.
Links are in the 'kernels.csv' file.
Regards to the great authors!
This dataset was created by Kamal Chhirang
It contains the following files:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Overview
This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

Dataset Details
Columns:
- Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
- Answer: The corresponding answer to the generated question.
- Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
- Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
- Title: Title of the Kaggle competition or topic.

Sources and Inspiration
Sources:
- Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
- Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
- Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

Inspiration:
The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

Dataset Specifics
- Total Records: [Specify the total number of question-answer pairs in the dataset]
- Format: CSV (Comma Separated Values)
- Size: [Specify the size of the dataset in MB or GB]
- License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
- Download Link: [Provide a link to download the dataset]

Acknowledgments
We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
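As a quick illustration of how such a dataset might be consumed, here is a minimal sketch that turns rows into fine-tuning prompts (the CSV file name and the prompt template are assumptions; only the column names come from the description above):

```python
import pandas as pd

# The CSV file name is an assumption; the columns follow the
# description above (Question, Answer, Context, Subtitle, Title).
df = pd.read_csv("kaggle_qa_pairs.csv")

def to_prompt(row):
    # A simple context/question/answer template; the exact format
    # a model expects is a modeling choice, not fixed by this dataset.
    return (
        f"Context: {row['Context']}\n"
        f"Question: {row['Question']}\n"
        f"Answer: {row['Answer']}"
    )

prompts = df.apply(to_prompt, axis=1)
print(prompts.iloc[0])
```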
Pretrained model for use with version 4 of this kernel. It consists of a set of initial populations for use in a DEoptim evolution; it was built with, and can be recreated by, the same kernel script.