3 datasets found

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous authors; Anonymous authors
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
Updated Sep 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anastasia Drozdova; Polina Guseva; Ekaterina Trofimova; Ekaterina Trofimova; Anna Scherbakov; Andrey Ustyuzhanin; Anastasia Gorodilova; Valery Berezovsky; Anastasia Drozdova; Polina Guseva; Anna Scherbakov; Andrey Ustyuzhanin; Anastasia Gorodilova; Valery Berezovsky (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.7312771
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7312771
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anastasia Drozdova; Polina Guseva; Ekaterina Trofimova; Ekaterina Trofimova; Anna Scherbakov; Andrey Ustyuzhanin; Anastasia Gorodilova; Valery Berezovsky; Anastasia Drozdova; Polina Guseva; Anna Scherbakov; Andrey Ustyuzhanin; Anastasia Gorodilova; Valery Berezovsky
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

The corpus can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
Updated May 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.11213783
Dataset updated
May 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ekaterina Trofimova; Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv) and competitions meta information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (murkup_data.csv). As this table contains the numeric id of the code block semantic type, we also provide a mapping from the id to semantic class and subclass (vertices.csv).

Snippets information (code_blocks.csv) can be mapped with kernels meta-data via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data kernels_meta.csv includes only notebooks with an available Kaggle score.

Automatic classification of code_blocks are stored in data_with_preds.csv. The mapping of this table with code_blocks.csv can be doe through code_blocks_index column, which corresponds to code_blocks indices.

The updated Code4ML 2.0 corpus includes kernels retrieved from Code Kaggle Meta. These kernels correspond to the kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the aim of LLM.

kernels_meta2.csv may contain kernels without Kaggle score, but with the place in the leader board (rank).

Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6607065

Dataset updated

Sep 15, 2023

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Anonymous authors; Anonymous authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

Clear search

Close search

Google apps

Main menu

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code