100+ datasets found

Dataset_Python_Question_Answer
kaggle.com
zip
Updated Mar 29, 2024
Chinmaya (2024). Dataset_Python_Question_Answer [Dataset]. https://www.kaggle.com/datasets/chinmayadatt/dataset-python-question-answer
Available download formats: zip (189137 bytes)
Dataset updated
Mar 29, 2024
Authors
Chinmaya
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
This dataset is about Python programming. The questions and answers were generated using Gemma. There are more than four hundred questions and their corresponding answers about Python programming.

The questions range from basic concepts such as data types, variables, and keywords to regular expressions and threading.

I have used this dataset here

The code used to generate the dataset is available here
python-code-dataset-500k
huggingface.co
Updated Jan 22, 2024
James (2024). python-code-dataset-500k [Dataset]. https://huggingface.co/datasets/jtatman/python-code-dataset-500k
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 22, 2024
Authors
James
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Attention: This dataset is a summary and reformat pulled from GitHub code.

You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:

Out of 500k Python-related items, most of them are python-ish, not pythonic.

The majority of the items here contain excessive licensing inclusion of original code.

The items here are sometimes not even Python but have references.

There's a whole lot of gpl summaries… See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
All Seaborn Built-in Datasets 📊✨
kaggle.com
zip
Updated Aug 27, 2024
Abdelrahman Mohamed (2024). All Seaborn Built-in Datasets 📊✨ [Dataset]. https://www.kaggle.com/datasets/abdoomoh/all-seaborn-built-in-datasets
Available download formats: zip (1383218 bytes)
Dataset updated
Aug 27, 2024
Authors
Abdelrahman Mohamed
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.

Included Datasets:

Anagrams: Analysis of word anagram patterns.

Anscombe: Anscombe's quartet demonstrating the importance of data visualization.

Attention: Data on attention span variations in different scenarios.

Brain Networks: Connectivity data within brain networks.

Car Crashes: US car crash statistics.

Diamonds: Data on diamond properties including price, cut, and clarity.

Dots: Randomly generated data for scatter plot visualization.

Dow Jones: Historical records of the Dow Jones Industrial Average.

Exercise: The relationship between exercise and health metrics.

Flights: Monthly passenger numbers on flights.

FMRI: Functional MRI data capturing brain activity.

Geyser: Eruption times of the Old Faithful geyser.

Glue: Strength of glue under different conditions.

Health Expenditure: Health expenditure statistics across countries.

Iris: Famous dataset for classifying Iris species.

MPG: Miles per gallon for various vehicles.

Penguins: Data on penguin species and their features.

Planets: Characteristics of discovered exoplanets.

Sea Ice: Measurements of sea ice extent.

Taxis: Taxi trips data in a city.

Tips: Tipping data collected from a restaurant.

Titanic: Survival data from the Titanic disaster.

This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
Pandas Practice Dataset
kaggle.com
zip
Updated Jan 27, 2023
Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset/discussion
Available download formats: zip (493 bytes)
Dataset updated
Jan 27, 2023
Authors
Mrityunjay Pathak
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

What Can Pandas Do?

Pandas gives you answers about the data, such as:

Is there a correlation between two or more columns?

What is the average value?

Max value?

Min value?
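
The questions above map onto a handful of DataFrame methods. The following is a minimal sketch, assuming pandas is installed; the column names and values are hypothetical and are not taken from the practice dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration; not taken from the practice dataset.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4],
    "exam_score": [52, 61, 70, 79],
})

# Is there a correlation between two or more columns?
# DataFrame.corr() returns the pairwise Pearson correlation matrix.
correlations = df.corr()

# Average, maximum, and minimum of a single column.
average_score = df["exam_score"].mean()
highest_score = df["exam_score"].max()
lowest_score = df["exam_score"].min()

print(correlations)
print(average_score, highest_score, lowest_score)
```

Calling df.describe() instead would report the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column in one table.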
python-qa-instructions-dataset
huggingface.co
Updated Sep 13, 2023
Ketan (2023). python-qa-instructions-dataset [Dataset]. https://huggingface.co/datasets/iamketan25/python-qa-instructions-dataset
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 13, 2023
Authors
Ketan
Description
iamketan25/python-qa-instructions-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
datasets
figshare.com
txt
Updated Oct 5, 2017
Carlos Rodriguez-Contreras (2017). datasets [Dataset]. http://doi.org/10.6084/m9.figshare.5472970.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5472970.v1
Dataset updated
Oct 5, 2017
Dataset provided by
figshare
Authors
Carlos Rodriguez-Contreras
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for practising in class
code-search-net-python
huggingface.co
Updated Dec 27, 2023
Fernando Tarin Morales (2023). code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-python
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 27, 2023
Authors
Fernando Tarin Morales
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for "code-search-net-python"

Dataset Description

Homepage: None
Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
Paper: None
Leaderboard: None
Point of Contact: @Nan-Do

Dataset Summary

This dataset is the Python portion of the CodeSearchNet dataset, annotated with a summary column. The code-search-net dataset includes open source functions that include comments found on GitHub. The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
#PraCegoVer dataset
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 19, 2023
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Dataset updated
Jan 19, 2023
Dataset provided by
Institute of Computing, University of Campinas
Authors
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
Description
Automatically describing images using natural sentences is an essential task for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

The PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. We also provide a sample with five instances, so users can download it to get an overview of the dataset before downloading it completely.
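
The structure described above can be read with the Python standard library alone. This is a minimal sketch, assuming dataset.json is a JSON list of objects carrying the attributes listed (user, filename, raw_caption, caption, date); the function names and the added image_path key are hypothetical helpers, not part of the dataset or its repository.

```python
import json
from pathlib import Path


def load_instances(dataset_path, images_dir):
    """Read dataset.json and pair each instance with its image path.

    Assumes the file holds a JSON list of objects, each with a
    "filename" attribute naming one image in images_dir.
    """
    with open(dataset_path, encoding="utf-8") as handle:
        instances = json.load(handle)

    images_dir = Path(images_dir)
    for instance in instances:
        # "image_path" is a helper key added here; it is not in the dataset.
        instance["image_path"] = images_dir / instance["filename"]
    return instances


def mean_caption_length(instances):
    """Average number of whitespace-separated words in the clean captions."""
    lengths = [len(instance["caption"].split()) for instance in instances]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Running mean_caption_length on the five-instance sample is a cheap way to check that the file parses as expected before downloading the full image archive.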

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
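
On systems without cat and tar, the same two steps can be reproduced with the Python standard library. This is a sketch under assumptions: the function name join_and_extract is hypothetical, and it assumes that sorting the part file names lexicographically gives the correct order, as the shell glob does.

```python
import shutil
import tarfile
from pathlib import Path


def join_and_extract(parts_dir, dest_dir):
    """Concatenate images.tar.gz.part* into one archive and extract it."""
    parts_dir = Path(parts_dir)
    dest_dir = Path(dest_dir)

    # Assumption: lexicographic order of the part names is the right order.
    parts = sorted(parts_dir.glob("images.tar.gz.part*"))
    if not parts:
        raise FileNotFoundError(f"no images.tar.gz.part* files in {parts_dir}")

    # Step 1: equivalent of `cat images.tar.gz.part* > images.tar.gz`.
    archive = parts_dir / "images.tar.gz"
    with open(archive, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)

    # Step 2: equivalent of `tar -xzvf images.tar.gz`.
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

The function returns the member names so the caller can compare them against the filename attributes in dataset.json.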

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
Project Python- Data Cleaning - EDA- Visualization
kaggle.com
zip
Updated Dec 10, 2023
Hussein Al Chami (2023). Project Python- Data Cleaning - EDA- Visualization [Dataset]. https://www.kaggle.com/datasets/husseinalchami/project-python-data-cleaning-eda-visualization
Available download formats: zip (322085 bytes)
Dataset updated
Dec 10, 2023
Authors
Hussein Al Chami
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Hussein Al Chami

Released under MIT

Contents
Datasets for manuscript "A data engineering framework for chemical flow...
catalog.data.gov
gimi9.com
Updated Nov 7, 2021
U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
Dataset updated
Nov 7, 2021
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The file Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and for tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was obtained using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to gather, at scale, information about exposure limits and physical properties from different publicly available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All GitHub repositories also describe the Python libraries required for running their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
python-raw-dataset
huggingface.co
Updated Nov 22, 2023
+ more versions
srivastava (2023). python-raw-dataset [Dataset]. https://huggingface.co/datasets/greatdarklord/python-raw-dataset
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 22, 2023
Authors
srivastava
Description
greatdarklord/python-raw-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Data from: KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle
zenodo.org
dataon.kisti.re.kr
+1 more
bin, bz2, pdf
Updated Jul 19, 2024
Luigi Quaranta; Fabio Calefato; Filippo Lanubile (2024). KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle [Dataset]. http://doi.org/10.5281/zenodo.4468523
Available download formats: bz2, pdf, bin
Unique identifier
https://doi.org/10.5281/zenodo.4468523
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Luigi Quaranta; Fabio Calefato; Filippo Lanubile
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.

The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.

In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.

More specifically, the package comprises the following three compressed archives:

KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;

KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;

MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.

Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
VegeNet - Image datasets and Codes
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 27, 2022
Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.7254508
Dataset updated
Oct 27, 2022
Dataset provided by
Zenodo http://zenodo.org/
Authors
Jo Yen Tan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Compilation of Python code for data preprocessing and building VegeNet, as well as image datasets (zip files).

Image datasets:

(1) vege_original: Images of vegetables captured manually in the data acquisition stage.

(2) vege_cropped_renamed: Images in (1) cropped to remove background areas, with image labels renamed.

(3) non-vege images: Images of non-vegetable foods, so that the CNN can recognize foods other than vegetables.

(4) food_image_dataset: Complete set of vege (2) and non-vege (3) images for architecture building.

(5) food_image_dataset_split: Image dataset (4) split into train and test sets.

(6) process: Images created when cropping (pre-processing step) to create dataset (2).
datasets
figshare.com
txt
Updated Sep 27, 2017
Carlos Rodriguez-Contreras (2017). datasets [Dataset]. http://doi.org/10.6084/m9.figshare.5447167.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5447167.v1
Dataset updated
Sep 27, 2017
Dataset provided by
Figshare http://figshare.com/
Authors
Carlos Rodriguez-Contreras
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains datasets for students to download for practice with R and Python.
Ecommerce Dataset for Data Analysis
kaggle.com
zip
Updated Sep 19, 2024
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Available download formats: zip (2028853 bytes)
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset:

- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
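
The field descriptions imply that Net Amount should equal Gross Amount minus Discount Amount (INR), which makes a simple first data-quality check. The sketch below is illustrative only: it assumes the CSV headers match the field labels above exactly and that an empty discount cell means no discount, and the sample rows are hypothetical rather than taken from the dataset.

```python
import csv
import io


def find_inconsistent_rows(csv_text, tolerance=0.01):
    """Return TIDs whose Net Amount differs from Gross Amount minus discount."""
    inconsistent = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        gross = float(row["Gross Amount"])
        net = float(row["Net Amount"])
        # Assumption: an empty discount cell means no discount was applied.
        discount = float(row["Discount Amount (INR)"] or 0)
        if abs(gross - discount - net) > tolerance:
            inconsistent.append(row["TID"])
    return inconsistent


# Hypothetical rows for illustration; not taken from the dataset.
sample = """TID,Gross Amount,Discount Amount (INR),Net Amount
T1,1000.00,50.00,950.00
T2,500.00,,500.00
T3,750.00,100.00,700.00
"""

print(find_inconsistent_rows(sample))
```

In the sample, only T3 is flagged, because 750.00 minus 100.00 is 650.00 rather than the recorded 700.00.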

Use Cases:

1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning
codeparrot
huggingface.co
Updated Sep 1, 2021
+ more versions
Natural Language Processing with Transformers (2021). codeparrot [Dataset]. https://huggingface.co/datasets/transformersbook/codeparrot
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 1, 2021
Dataset authored and provided by
Natural Language Processing with Transformers
Description
CodeParrot 🦜 Dataset

What is it?

This is the full CodeParrot dataset. It contains Python files used to train the code generation model in Chapter 10: Training Transformers from Scratch in the NLP with Transformers book. You can find the full code in the accompanying GitHub repository.

Creation

It was created with the GitHub dataset available via Google's BigQuery. It contains approximately 22 million Python files and is 180 GB in size (50 GB compressed). The… See the full description on the dataset page: https://huggingface.co/datasets/transformersbook/codeparrot.
mnist
tensorflow.org
universe.roboflow.com
+4 more
Updated Jun 1, 2024
(2024). mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/mnist
Dataset updated
Jun 1, 2024
Description
The MNIST database of handwritten digits.

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png

Code4ML 2.0

zenodo.org

csv, txt

Updated May 19, 2025

Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737

Available download formats: csv, txt

Unique identifier

https://doi.org/10.5281/zenodo.15465737

Dataset updated

May 19, 2025

Dataset provided by

Zenodo http://zenodo.org/

Authors

Anonimous authors

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

The original dataset is organized into multiple CSV files, each containing structured data on different entities:

code_blocks.csv: Contains raw code snippets extracted from Kaggle.
kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

Table 1. code_blocks.csv structure

Column	Description
code_blocks_index	Global index linking code blocks to markup_data.csv.
kernel_id	Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
code_block_id	Position of the code block within the notebook.
code_block	The actual machine learning code snippet.

Table 2. kernels_meta.csv structure

Column	Description
kernel_id	Identifier for the Kaggle Jupyter notebook.
kaggle_score	Performance metric of the notebook.
kaggle_comments	Number of comments on the notebook.
kaggle_upvotes	Number of upvotes the notebook received.
kernel_link	URL to the notebook.
comp_name	Name of the associated Kaggle competition.

Table 3. competitions_meta.csv structure

Column	Description
comp_name	Name of the Kaggle competition.
description	Overview of the competition task.
data_type	Type of data used in the competition.
comp_type	Classification of the competition.
subtitle	Short description of the task.
EvaluationAlgorithmAbbreviation	Metric used for assessing competition submissions.
data_sources	Links to datasets used.
metric type	Class label for the assessment metric.

Table 4. markup_data.csv structure

Column	Description
code_block	Machine learning code block.
too_long	Flag indicating whether the block spans multiple semantic types.
marks	Confidence level of the annotation.
graph_vertex_id	ID of the semantic type.

The dataset allows mapping between these tables. For example:

code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
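
The links between the tables can be followed with plain dictionary lookups. The following is a minimal sketch using only the standard library; it assumes the column names given in Tables 1 to 3, and the helper names read_rows and join_blocks as well as the sample rows are hypothetical, not part of the dataset.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def join_blocks(code_blocks, kernels_meta, competitions_meta):
    """Attach notebook and competition metadata to each code block."""
    kernels = {row["kernel_id"]: row for row in kernels_meta}
    competitions = {row["comp_name"]: row for row in competitions_meta}

    joined = []
    for block in code_blocks:
        kernel = kernels.get(block["kernel_id"])
        if kernel is None:
            # kernels_meta.csv keeps only notebooks with a Kaggle score,
            # so some code blocks have no matching notebook row.
            continue
        competition = competitions.get(kernel["comp_name"], {})
        joined.append({**block, **kernel, **competition})
    return joined


# Hypothetical rows for illustration; not taken from the dataset.
blocks = read_rows(
    "code_blocks_index,kernel_id,code_block_id,code_block\n"
    "0,k1,0,import pandas as pd\n"
    "1,k2,0,print(1)\n"
)
kernels = read_rows("kernel_id,kaggle_score,comp_name\nk1,0.91,titanic\n")
competitions = read_rows("comp_name,data_type\ntitanic,tabular\n")

rows = join_blocks(blocks, kernels, competitions)
print(rows)
```

With the real files, the same two joins could be written as pandas merges on kernel_id and comp_name; the dictionary version is shown here to keep the sketch free of dependencies.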

Code4ML 2.0 Enhancements

The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

Applications

The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

Code generation
Code understanding
Natural language processing of code-related tasks

python-reasoning-dataset
huggingface.co
Updated Feb 10, 2025
+ more versions
Sara Han Díaz (2025). python-reasoning-dataset [Dataset]. https://huggingface.co/datasets/sdiazlor/python-reasoning-dataset
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 10, 2025
Authors
Sara Han Díaz
Description
Dataset Card for my-distiset-986461

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-986461/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/python-reasoning-dataset.
xlcost-text-to-code
huggingface.co
Updated Nov 3, 2022
+ more versions
CodeParrot (2022). xlcost-text-to-code [Dataset]. https://huggingface.co/datasets/codeparrot/xlcost-text-to-code
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 3, 2022
Dataset authored and provided by
CodeParrot
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
XLCoST is a machine learning benchmark dataset that contains fine-grained parallel data in 7 commonly used programming languages (C++, Java, Python, C#, Javascript, PHP, C), and natural language (English).
