MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Attention: This dataset is a summary and reformat pulled from GitHub code.
You should make your own judgments based on this. In fact, there is another dataset I formed through parsing that addresses several points:
- out of 500k Python-related items, most of them are python-ish, not pythonic
- the majority of the items here contain excessive licensing inclusion from the original code
- the items here are sometimes not even Python but have references to it
- there's a whole lot of GPL summaries…
See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
Repair AST parse (syntax) errors in Python code
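A minimal sketch of how such syntax errors can be detected with Python's standard ast module (illustrative only; the dataset's own repair pipeline is not described here):

```python
import ast

def has_syntax_error(source: str) -> bool:
    """Return True if the snippet fails to parse as Python."""
    try:
        ast.parse(source)
        return False
    except SyntaxError:
        return True

# A broken snippet of the kind such a repair task targets:
broken = "def add(a, b)\n    return a + b"
print(has_syntax_error(broken))                               # True: missing colon
print(has_syntax_error("def add(a, b):\n    return a + b"))   # False: parses cleanly
```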
The iamketan25/python-qa-instructions-dataset dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
The Python dataset introduced in the Parallel Corpus paper (A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation), commonly used for evaluating automated code summarization.
This repository contains programming data collected from 15 students during November and December of 2019 at Bielefeld University. Students were asked to implement gradient descent. Note that this data set contains only source code snapshots and neither timestamps nor personal information. All students programmed in a web environment, which is also contained in this repository.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-python"
Dataset Description
Homepage: None
Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
Paper: None
Leaderboard: None
Point of Contact: @Nan-Do
Dataset Summary
This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open-source functions with comments found on GitHub. The summary is a short description of what… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
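A minimal loading sketch, assuming the Hugging Face datasets library; the split name and any field name other than the summary column described above are assumptions:

```python
from datasets import load_dataset

# Repository path taken from the card above; the split name and the
# "summary" field name are assumed and may differ in the actual dataset.
ds = load_dataset("Nan-Do/code-search-net-python", split="train")
example = ds[0]
print(example.keys())       # inspect the real column names first
print(example["summary"])   # short description of the function (assumed field)
```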
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for python_code_instructions_18k_alpaca
The dataset contains problem descriptions and code in the Python language. It is taken from sahil2801/code_instructions_120k, which adds a prompt column in Alpaca style. Refer to the source here.
Python Programming Puzzles (P3) is an open-source dataset where each puzzle is defined by a short Python program, and the goal is to find an input that makes the program return True. The puzzles are objective in that each one is specified entirely by the source code of its verifier, so running the verifier is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding.
The dataset is comprehensive in that it spans problems across a range of difficulties and domains: from trivial string-manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping.
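For illustration, a hypothetical puzzle in the P3 style (not drawn from the dataset): the verifier alone specifies the problem, and checking a candidate means simply running it.

```python
# Hypothetical P3-style puzzle: f fully specifies the problem, and any
# input making f return True is a valid solution.
def f(s: str) -> bool:
    return s.startswith("b") and s.count("a") == 3

candidate = "baaa"           # a proposed solution
assert f(candidate) is True  # testing it requires only running the verifier
```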
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains datasets used for the paper Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent. The code for easily loading these datasets will be made available here: http://github.com/nokia/codesearch.
There are three types of datasets:
- snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated.
- code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test.
- training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20.
The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora:
staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset.
conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/.
The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange).
This dataset contains samples to generate Python code for security exploits. To make the dataset representative of real exploits, it includes code snippets drawn from exploits in public databases. Unlike the general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Real exploits therefore make extensive use of Python instructions for converting data between different encodings, for performing low-level arithmetic and logical operations, and for bit-level slicing, none of which can be found in previous general-purpose Python datasets.
In total, we built a dataset of 1,114 original samples of exploit-tailored Python snippets and their corresponding intents in English. These samples include complex and nested instructions, as is typical of Python programming. To enable more realistic training and a fair evaluation, we left the developers' original code snippets untouched and did not decompose them; we provided English intents that describe nested instructions altogether.
To bootstrap the training process for the NMT model, we include in our dataset both the original, exploit-oriented snippets and snippets from a previous general-purpose Python dataset, which enables the NMT model to generate code that mixes general-purpose and exploit-oriented instructions. Among the several datasets for Python code generation, we chose the Django dataset due to its large size. This corpus contains 14,426 unique pairs of Python statements from the Django web application framework and their corresponding descriptions in English. Our final dataset therefore contains 15,540 unique pairs of Python code snippets alongside their intents in natural language.
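For illustration, a hypothetical snippet of the exploit-tailored kind described above (not an actual sample from the corpus), paired with the sort of English intent the dataset provides:

```python
# Hypothetical intent: "xor each byte of the shellcode with the key and
# print the result as an escaped hex string".
shellcode = b"\x31\xc0\x50\x68"           # raw shellcode bytes
key = 0x5A
# XOR-encode each byte to obfuscate the payload
encoded = bytes(b ^ key for b in shellcode)
# convert the encoded bytes to an escaped hex string representation
escaped = "".join("\\x{:02x}".format(b) for b in encoded)
print(escaped)
```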
This dataset was created by BHARGAV NATH
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open-source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
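A minimal inspection sketch, assuming pandas is available; the column names are not specified above, so the code only inspects the schema before filtering:

```python
import pandas as pd

# "NICHE.csv" is named above; its column names are not, so inspect first.
df = pd.read_csv("NICHE.csv")
print(df.columns.tolist())  # discover the actual schema
# e.g., filtering projects labelled as engineered (column and value names
# are assumptions for illustration):
# engineered = df[df["label"] == "engineered"]
```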
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by BrucePayton
Released under Apache 2.0
The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development, and 1,805 test annotations. Each data point consists of a line of Python code together with a manually created natural language description.
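For illustration, a hypothetical annotation pair in the Django-dataset style (not an actual entry from the corpus):

```python
# One line of Python code paired with its natural language description,
# mirroring the structure described above.
example = {
    "code": "if request.method == 'POST':",
    "description": "if the request method equals the string 'POST',",
}
```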
This dataset was created by Italo Marcelo
It contains the following files:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
========================================================================
Thomas Else, Sarah Bohndiek (seb53@cam.ac.uk),
CRUK Cambridge Institute and Department of Physics, University of Cambridge.
Data collected between October 2020 and December 2022 in Cambridge, United Kingdom.
Please note: All animal procedures used to acquire the data described below were conducted in accordance with project (PE12C2B96) and personal licenses (I33984279) issued under the United Kingdom Animals (Scientific Procedures) Act, 1986, and were approved locally under compliance form number CFSB2022.
These datasets were collected using two different commercial photoacoustic imaging systems, details of which can be found on the vendor website. More information regarding the experimental acquisition of these files can be found via the PATATO Python toolkit repository, which is freely available on GitHub (https://github.com/tomelse/patato).
clinical_phantom.hdf5: Data collected 02/08/2022. A photoacoustic imaging dataset containing raw data from a scan taken on a tissue-mimicking phantom. The data were acquired using the iThera Medical GmbH MSOT Acuity CE device. The file format is HDF5.
preclinical_phantom.hdf5: Data collected 02/09/2021. A photoacoustic imaging dataset containing raw data from a scan taken on a tissue-mimicking phantom. The data were acquired using the iThera Medical GmbH MSOT inVision system. The file format is HDF5.
invivo_oe.hdf5: Data collected 01/10/2020. A photoacoustic imaging dataset containing raw data from a scan of a mouse, with oxygen-enhanced imaging whereby the breathing gas of the mouse was changed during the scan. The data were acquired using the iThera Medical MSOT inVision system. The file format is HDF5.
invivo_dce.hdf5: Data collected 01/10/2020. A photoacoustic imaging dataset containing raw data from a scan of a mouse, with dynamic contrast-enhanced imaging using indocyanine green, whereby the contrast agent was introduced intravenously during the scan. The data were acquired using the iThera Medical GmbH MSOT inVision system. The file format is HDF5.
ithera_invivo_oe.zip: The same as invivo_oe.hdf5 but in a different format. The zip file contains imaging data in a proprietary format provided by the device manufacturer. It can be loaded using proprietary software such as iThera ViewMSOT, or by converting it to an open format.
ithera_invivo_dce.zip: The same as invivo_dce.hdf5 but in a different format. The zip file contains imaging data in a proprietary format provided by the device manufacturer. It can be loaded using proprietary software such as iThera ViewMSOT, or by converting it to an open format.
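A minimal sketch for inspecting one of the HDF5 files, assuming the h5py library; the internal group layout is not specified above, so this only walks the file tree (the PATATO toolkit is the intended loader):

```python
import h5py

# Walk the HDF5 tree to discover groups and datasets before loading anything.
with h5py.File("clinical_phantom.hdf5", "r") as f:
    f.visit(print)  # prints the name of every group and dataset in the file
```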
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by shrey pachauri
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lifestyle data: people, word, food, leisure, and transportation are placed in separate columns in Excel.
Column description:
- people: the number of confirmed COVID-19 cases
- word: the number of searches
- food: the amount paid by people who ate at restaurants
- leisure: the amount paid by people who enjoyed leisure activities
- transportation: the amount paid by people who used public transportation
Python code 1: Python code using a DNN algorithm to predict the number of confirmed COVID-19 cases (written in a Jupyter notebook).
Python code 2: Python code using an LSTM algorithm to predict the number of confirmed COVID-19 cases (written in a Jupyter notebook).
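A minimal sketch of the LSTM approach described above, assuming TensorFlow/Keras, a hypothetical Excel file name, and the column names listed; the notebooks in the dataset are the authoritative reference:

```python
import numpy as np
import pandas as pd
from tensorflow import keras

# File name is hypothetical; column names follow the description above.
df = pd.read_excel("lifestyle.xlsx")
features = df[["word", "food", "leisure", "transportation"]].to_numpy("float32")
target = df["people"].to_numpy("float32")

# Frame as a sequence problem: predict case counts from the previous 7 rows.
window = 7
X = np.stack([features[i : i + window] for i in range(len(features) - window)])
y = target[window:]

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, features.shape[1])),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=16)
```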