MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Attention: This dataset is a summary and reformat pulled from GitHub code.
You should make your own judgments based on this. In fact, there is another dataset I formed through parsing that addresses several points:
- out of 500k Python-related items, most of them are python-ish, not pythonic
- the majority of the items here contain excessive licensing inclusion from the original code
- the items here are sometimes not even Python but have references to it
- there's a whole lot of GPL summaries…
See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
Repair AST parse (syntax) errors in Python code
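A minimal sketch of how such syntax errors can be detected with Python's standard ast module (illustrative only; the dataset's own repair pipeline is not described here):

```python
import ast

def has_syntax_error(source: str) -> bool:
    """Return True if the snippet fails to parse as Python."""
    try:
        ast.parse(source)
        return False
    except SyntaxError:
        return True

# A broken snippet of the kind such a repair task targets:
broken = "def add(a, b)\n    return a + b"
print(has_syntax_error(broken))                               # True: missing colon
print(has_syntax_error("def add(a, b):\n    return a + b"))   # False: parses cleanly
```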
The iamketan25/python-qa-instructions-dataset dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
The Python dataset introduced in the Parallel Corpus paper (A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation), commonly used for evaluating automated code summarization.
This repository contains programming data collected from 15 students during November and December of 2019 at Bielefeld University. Students were asked to implement gradient descent. Note that this data set contains only source code snapshots and neither timestamps nor personal information. All students programmed in a web environment, which is also contained in this repository.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-python"
Dataset Description
Homepage: None
Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
Paper: None
Leaderboard: None
Point of Contact: @Nan-Do
Dataset Summary
This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open-source functions with comments found on GitHub. The summary is a short description of what… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
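A minimal loading sketch, assuming the Hugging Face datasets library; the split name and any field name other than the summary column described above are assumptions:

```python
from datasets import load_dataset

# Repository path taken from the card above; the split name and the
# "summary" field name are assumed and may differ in the actual dataset.
ds = load_dataset("Nan-Do/code-search-net-python", split="train")
example = ds[0]
print(example.keys())       # inspect the real column names first
print(example["summary"])   # short description of the function (assumed field)
```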
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for python_code_instructions_18k_alpaca
The dataset contains problem descriptions and code in the Python language. It is taken from sahil2801/code_instructions_120k, which adds a prompt column in Alpaca style. Refer to the source here.
Python Programming Puzzles (P3) is an open-source dataset where each puzzle is defined by a short Python program, and the goal is to find an input that makes the program return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier, so running the verifier is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding.
The dataset is comprehensive in that it spans problems across a range of difficulties and domains: from trivial string-manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping.
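For illustration, a hypothetical puzzle in the P3 style (not drawn from the dataset): the verifier alone specifies the problem, and checking a candidate means simply running it.

```python
# Hypothetical P3-style puzzle: f fully specifies the problem, and any
# input making f return True is a valid solution.
def f(s: str) -> bool:
    return s.startswith("b") and s.count("a") == 3

candidate = "baaa"           # a proposed solution
assert f(candidate) is True  # testing it requires only running the verifier
```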
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains datasets used for the paper Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent. The code for easily loading these datasets will be made available here: http://github.com/nokia/codesearch.
There are three types of datasets:
- snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated.
- code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test.
- training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20.
The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora:
staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset.
conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/.
The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange).
This dataset contains samples to generate Python code for security exploits. To make the dataset representative of real exploits, it includes code snippets drawn from exploits in public databases. Unlike the general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Real exploits therefore make extensive use of Python instructions for converting data between different encodings, for performing low-level arithmetic and logical operations, and for bit-level slicing, none of which can be found in previous general-purpose Python datasets.
In total, we built a dataset of 1,114 original samples of exploit-tailored Python snippets and their corresponding intents in English. These samples include complex and nested instructions, as is typical of Python programming. To enable more realistic training and a fair evaluation, we left the developers' original code snippets untouched and did not decompose them; we provided English intents that describe nested instructions altogether.
To bootstrap the training process for the NMT model, we include in our dataset both the original, exploit-oriented snippets and snippets from a previous general-purpose Python dataset, which enables the NMT model to generate code that mixes general-purpose and exploit-oriented instructions. Among the several datasets for Python code generation, we chose the Django dataset due to its large size. This corpus contains 14,426 unique pairs of Python statements from the Django web application framework and their corresponding descriptions in English. Our final dataset therefore contains 15,540 unique pairs of Python code snippets alongside their intents in natural language.
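For illustration, a hypothetical snippet of the exploit-tailored kind described above (not an actual sample from the corpus), paired with the sort of English intent the dataset provides:

```python
# Hypothetical intent: "xor each byte of the shellcode with the key and
# print the result as an escaped hex string".
shellcode = b"\x31\xc0\x50\x68"           # raw shellcode bytes
key = 0x5A
# XOR-encode each byte to obfuscate the payload
encoded = bytes(b ^ key for b in shellcode)
# convert the encoded bytes to an escaped hex string representation
escaped = "".join("\\x{:02x}".format(b) for b in encoded)
print(escaped)
```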
This dataset was created by BHARGAV NATH
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open-source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
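A minimal inspection sketch, assuming pandas is available; the column names are not specified above, so the code only inspects the schema before filtering:

```python
import pandas as pd

# "NICHE.csv" is named above; its column names are not, so inspect first.
df = pd.read_csv("NICHE.csv")
print(df.columns.tolist())  # discover the actual schema
# e.g., filtering projects labelled as engineered (column and value names
# are assumptions for illustration):
# engineered = df[df["label"] == "engineered"]
```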
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by BrucePayton
Released under Apache 2.0
The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development, and 1,805 test annotations. Each data point consists of a line of Python code together with a manually created natural language description.
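For illustration, a hypothetical annotation pair in the Django-dataset style (not an actual entry from the corpus):

```python
# One line of Python code paired with its natural language description,
# mirroring the structure described above.
example = {
    "code": "if request.method == 'POST':",
    "description": "if the request method equals the string 'POST',",
}
```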
This dataset was created by Italo Marcelo
It contains the following files:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
========================================================================
Thomas Else, Sarah Bohndiek (seb53@cam.ac.uk),
CRUK Cambridge Institute and Department of Physics, University of Cambridge.
Data collected between October 2020 and December 2022 in Cambridge, United Kingdom.
Please note: All animal procedures used to acquire the data described below were conducted in accordance with project (PE12C2B96) and personal licenses (I33984279) issued under the United Kingdom Animals (Scientific Procedures) Act, 1986, and were approved locally under compliance form number CFSB2022.
These datasets were collected using two different commercial photoacoustic imaging systems, details of which can be found on the vendor website. More information regarding the experimental acquisition of these files can be found via the PATATO Python toolkit repository, which is freely available on GitHub (https://github.com/tomelse/patato).
clinical_phantom.hdf5: Data collected 02/08/2022. A photoacoustic imaging dataset containing raw data from a scan taken on a tissue-mimicking phantom. The data were acquired using the iThera Medical GmbH MSOT Acuity CE device. The file format is HDF5.
preclinical_phantom.hdf5: Data collected 02/09/2021. A photoacoustic imaging dataset containing raw data from a scan taken on a tissue-mimicking phantom. The data were acquired using the iThera Medical GmbH MSOT inVision system. The file format is HDF5.
invivo_oe.hdf5: Data collected 01/10/2020. A photoacoustic imaging dataset containing raw data from a scan of a mouse, with oxygen-enhanced imaging whereby the breathing gas of the mouse was changed during the scan. The data were acquired using the iThera Medical MSOT inVision system. The file format is HDF5.
invivo_dce.hdf5: Data collected 01/10/2020. A photoacoustic imaging dataset containing raw data from a scan of a mouse, with dynamic contrast-enhanced imaging using indocyanine green, whereby the contrast agent was introduced intravenously during the scan. The data were acquired using the iThera Medical GmbH MSOT inVision system. The file format is HDF5.
ithera_invivo_oe.zip: The same as invivo_oe.hdf5 but in a different format. The zip file contains imaging data in a proprietary format provided by the device manufacturer. It can be loaded using proprietary software such as iThera ViewMSOT, or by converting it to an open format.
ithera_invivo_dce.zip: The same as invivo_dce.hdf5 but in a different format. The zip file contains imaging data in a proprietary format provided by the device manufacturer. It can be loaded using proprietary software such as iThera ViewMSOT, or by converting it to an open format.
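A minimal sketch for inspecting one of the HDF5 files, assuming the h5py library; the internal group layout is not specified above, so this only walks the file tree (the PATATO toolkit is the intended loader):

```python
import h5py

# Walk the HDF5 tree to discover groups and datasets before loading anything.
with h5py.File("clinical_phantom.hdf5", "r") as f:
    f.visit(print)  # prints the name of every group and dataset in the file
```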
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by shrey pachauri
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lifestyle data: people, word, food, leisure, and transportation are placed in separate columns in Excel.
Column description:
- people: the number of confirmed COVID-19 cases
- word: the number of searches
- food: the amount paid by people who ate at restaurants
- leisure: the amount paid by people who enjoyed leisure activities
- transportation: the amount paid by people who used public transportation
Python code 1: Python code using a DNN algorithm to predict the number of confirmed COVID-19 cases (written in a Jupyter notebook).
Python code 2: Python code using an LSTM algorithm to predict the number of confirmed COVID-19 cases (written in a Jupyter notebook).
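A minimal sketch of the LSTM approach described above, assuming TensorFlow/Keras, a hypothetical Excel file name, and the column names listed; the notebooks in the dataset are the authoritative reference:

```python
import numpy as np
import pandas as pd
from tensorflow import keras

# File name is hypothetical; column names follow the description above.
df = pd.read_excel("lifestyle.xlsx")
features = df[["word", "food", "leisure", "transportation"]].to_numpy("float32")
target = df["people"].to_numpy("float32")

# Frame as a sequence problem: predict case counts from the previous 7 rows.
window = 7
X = np.stack([features[i : i + window] for i in range(len(features) - window)])
y = target[window:]

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, features.shape[1])),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=16)
```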