100+ datasets found
  1. python-code-dataset-500k

    • huggingface.co
    Updated Jan 24, 2024
    Cite
    James (2024). python-code-dataset-500k [Dataset]. https://huggingface.co/datasets/jtatman/python-code-dataset-500k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 24, 2024
    Authors
    James
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Attention: This dataset is a summary and reformat pulled from github code.

    You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:

    - out of 500k python related items, most of them are python-ish, not pythonic
    - the majority of the items here contain excessive licensing inclusion of original code
    - the items here are sometimes not even python but have references
    - there's a whole lot of gpl summaries…

    See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
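
    Since this is a Hugging Face dataset, a minimal loading sketch with the datasets library follows; the split name "train" and the exact column names are assumptions to verify against the dataset features:

        from datasets import load_dataset

        # Stream the records instead of downloading all ~500k at once.
        ds = load_dataset("jtatman/python-code-dataset-500k", split="train", streaming=True)

        # Inspect one record to see which fields are present before relying on them.
        first = next(iter(ds))
        print(list(first.keys()))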

  2. GitHub-Python Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 15, 2021
    Cite
    Michihiro Yasunaga; Percy Liang (2021). GitHub-Python Dataset [Dataset]. https://paperswithcode.com/dataset/github-python
    Explore at:
    Dataset updated
    Jun 15, 2021
    Authors
    Michihiro Yasunaga; Percy Liang
    Description

    Repair AST parse (syntax) errors in Python code
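
    The task description above targets code that fails AST parsing; a quick way to reproduce that failure signal with the standard library is sketched below (this snippet only detects the error, it does not repair it):

        import ast

        broken = "def add(a, b)\n    return a + b\n"  # missing colon after the signature

        try:
            ast.parse(broken)
        except SyntaxError as err:
            # The repair task in this dataset starts from exactly this kind of failure.
            print(f"SyntaxError on line {err.lineno}: {err.msg}")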

  3. python-qa-instructions-dataset

    • huggingface.co
    Updated Sep 13, 2023
    Cite
    Ketan (2023). python-qa-instructions-dataset [Dataset]. https://huggingface.co/datasets/iamketan25/python-qa-instructions-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2023
    Authors
    Ketan
    Description

    iamketan25/python-qa-instructions-dataset is a dataset hosted on Hugging Face and contributed by the HF Datasets community.

  4. ParallelCorpus-Python Dataset

    • paperswithcode.com
    Updated Jan 12, 2022
    Cite
    (2022). ParallelCorpus-Python Dataset [Dataset]. https://paperswithcode.com/dataset/parallelcorpus-python
    Explore at:
    Dataset updated
    Jan 12, 2022
    Description

    The Python dataset introduced in the Parallel Corpus paper (A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation), commonly used for evaluating automated code summarization.

  5. Python Programming Dataset

    • pub.uni-bielefeld.de
    • commons.datacite.org
    Updated Feb 10, 2020
    Cite
    Benjamin Paaßen (2020). Python Programming Dataset [Dataset]. https://pub.uni-bielefeld.de/record/2941052
    Explore at:
    Dataset updated
    Feb 10, 2020
    Authors
    Benjamin Paaßen
    Description

    This repository contains programming data collected from 15 students during November and December of 2019 at Bielefeld University. Students were asked to implement gradient descent. Note that this data set contains only source code snapshots and neither timestamps nor personal information. All students programmed in a web environment, which is also contained in this repository.
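
    For context, a minimal version of the gradient descent task the students were asked to implement might look like the sketch below (purely illustrative; this is not code from the dataset):

        import numpy as np

        def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
            """Repeatedly step against the gradient to minimize a function."""
            x = np.asarray(x0, dtype=float)
            for _ in range(steps):
                x = x - learning_rate * grad(x)
            return x

        # Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); the result approaches 3.
        print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))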

  6. code-search-net-python

    • huggingface.co
    Updated Dec 27, 2023
    Cite
    Fernando Tarin Morales (2023). code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 27, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-python"

      Dataset Description
    

    Homepage: None
    Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
    Paper: None
    Leaderboard: None
    Point of Contact: @Nan-Do

      Dataset Summary
    

    This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open source functions that include comments found on GitHub. The summary is a short description of what… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
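
    A minimal loading sketch with the datasets library; the "summary" column is named in the card above, but the split name and the remaining column names are assumptions to check against the dataset features:

        from datasets import load_dataset

        ds = load_dataset("Nan-Do/code-search-net-python", split="train")
        print(ds.column_names)       # confirm which columns are available
        print(ds[0].get("summary"))  # the added natural-language summary for the first function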

  7. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

    • zenodo.org
    bin
    Updated Aug 24, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044636
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Check out the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset was gathered on Sep. 17th, 2020.
    • The dataset has more than 5.4K Python repositories that are hosted on GitHub.
    • It contains more than 1.1M type annotations (ordinary Python type hints of the kind sketched below).
    • Please note that this is the first version of the dataset. In the second version, we will provide processed Python projects in JSON files that contain relevant features and hints for the ML-based type inference task.
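
    A small sketch of extracting such annotations with the standard ast module (illustrative only, not the dataset's own pipeline; ast.unparse needs Python 3.9+):

        import ast

        source = (
            "def scale(xs: list[float], factor: float = 2.0) -> list[float]:\n"
            "    return [x * factor for x in xs]\n"
        )

        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                for arg in node.args.args:
                    if arg.annotation is not None:
                        print(arg.arg, "->", ast.unparse(arg.annotation))
                if node.returns is not None:
                    print("return ->", ast.unparse(node.returns))
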
  8. python_code_instructions_18k_alpaca

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Jul 27, 2023
    Cite
    Tarun Bisht (2023). python_code_instructions_18k_alpaca [Dataset]. https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2023
    Authors
    Tarun Bisht
    Description

    Dataset Card for python_code_instructions_18k_alpaca

    The dataset contains problem descriptions and code in the Python language. This dataset is taken from sahil2801/code_instructions_120k, which adds a prompt column in Alpaca style. Refer to the source here.
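
    A rough sketch of what an Alpaca-style prompt typically looks like for such records; the template text and the field names below are assumptions, so compare them against the dataset's own prompt column after loading:

        from datasets import load_dataset

        ALPACA_TEMPLATE = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n{output}"
        )

        ds = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")
        example = ds[0]
        print(list(example.keys()))  # check for instruction/output fields and the added prompt column
        print(ALPACA_TEMPLATE.format(instruction=example.get("instruction", ""),
                                     output=example.get("output", "")))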

  9. Python Programming Puzzles (P3) Dataset

    • paperswithcode.com
    Cite
    Tal Schuster; Ashwin Kalyan; Oleksandr Polozov; Adam Tauman Kalai, Python Programming Puzzles (P3) Dataset [Dataset]. https://paperswithcode.com/dataset/python-programming-puzzles-p3
    Explore at:
    Authors
    Tal Schuster; Ashwin Kalyan; Oleksandr Polozov; Adam Tauman Kalai
    Description

    Python Programming Puzzles (P3) is an open-source dataset where each puzzle is defined by a short Python program f, and the goal is to find an input which makes f output "True". The puzzles are objective in that each one is specified entirely by the source code of its verifier, so evaluating f is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding.

    The dataset is comprehensive in that it spans problems of a range of difficulties and domains, ranging from trivial string manipulation problems that are immediately obvious to human programmers (but not necessarily to AI), to classic programming puzzles (e.g., Towers of Hanoi), to interview/competitive-programming problems (e.g., dynamic programming), to longstanding open problems in algorithms and mathematics (e.g., factoring). The objective nature of P3 readily supports self-supervised bootstrapping.
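
    A toy puzzle in the spirit described above (invented for illustration, not drawn from P3 itself): the verifier is a plain Python function, and any input that makes it return True counts as a solution.

        def sat(s: str) -> bool:
            """Puzzle: find a 10-character string containing exactly three 'a' characters."""
            return len(s) == 10 and s.count("a") == 3

        candidate = "aaabbbbbbb"
        print(sat(candidate))  # True, so this candidate solves the puzzle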

  10. Python Annotated Code Search (PACS) Datasets

    • live.european-language-grid.eu
    json
    Updated Jan 2, 2022
    Cite
    (2022). Python Annotated Code Search (PACS) Datasets [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/9172
    Explore at:
    Available download formats: json
    Dataset updated
    Jan 2, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains datasets used for the paper Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent. The code for easily loading these datasets will be made available here: http://github.com/nokia/codesearch.

    There are three types of datasets:

    - snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated.

    - code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test.

    - training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20.

    The staqc-py-cleaned snippet collection, and the conala-curated datasets were derived from existing corpora:

    staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset.

    conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/.

    The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange).
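
    The downloads are JSON files; a minimal inspection sketch with the standard library is shown below (the file name and record structure are assumptions, since the exact schema is documented in the nokia/codesearch repository):

        import json

        # Hypothetical local path to one of the snippet collections listed above.
        with open("so-ds-feb20.json", encoding="utf-8") as fh:
            snippets = json.load(fh)

        print(len(snippets))
        print(snippets[0])  # inspect the fields (code snippet plus its natural-language description)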

  11. explore the data.world python sdk

    • data.world
    zip
    Updated Sep 12, 2023
    Cite
    Noah Rippner (2023). explore the data.world python sdk [Dataset]. https://data.world/nrippner/explore-the-data-world-python-sdk
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    data.world, Inc.
    Authors
    Noah Rippner
    Description

    Getting Started with the data.world Python SDK

    * Seamless integration with Python and R

    * Effortlessly load data

    * SQL queries to pandas DataFrames

    * data.world and python side by side -- a nice way to work with your data!

    Check out the notebook below!
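
    As a quick orientation, a small sketch of the "SQL queries to pandas DataFrames" workflow with the datadotworld package (the table name below is a placeholder, and an API token configured via dw configure is assumed):

        import datadotworld as dw

        # Run a SQL query against a data.world dataset and get the results as a pandas DataFrame.
        results = dw.query(
            "nrippner/explore-the-data-world-python-sdk",  # dataset key from the URL above
            "SELECT * FROM some_table LIMIT 10",           # hypothetical table name
        )
        print(results.dataframe.head())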

  12. EVIL-Encoders Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Aug 31, 2021
    Cite
    (2021). EVIL-Encoders Dataset [Dataset]. https://paperswithcode.com/dataset/evil-encoders
    Explore at:
    Dataset updated
    Aug 31, 2021
    Description

    This dataset contains samples to generate Python code for security exploits. In order to make the dataset representative of real exploits, it includes code snippets drawn from exploits from public databases. Differing from general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Therefore, real exploits make extensive use of Python instructions for converting data between different encoders, for performing low-level arithmetic and logical operations, and for bit-level slicing, which cannot be found in the previous general-purpose Python datasets.

    In total, we built a dataset that consists of 1,114 original samples of exploit-tailored Python snippets and their corresponding intent in the English language. These samples include complex and nested instructions, as typical of Python programming. In order to perform more realistic training and for a fair evaluation, we left untouched the developers' original code snippets and did not decompose them. We provided English intents to describe nested instructions altogether.

    In order to bootstrap the training process for the NMT model, we include in our dataset both the original, exploit-oriented snippets and snippets from a previous general-purpose Python dataset. This enables the NMT model to generate code that can mix general-purpose and exploit-oriented instructions. Among the several datasets for Python code generation, we choose the Django dataset due to its large size. This corpus contains 14,426 unique pairs of Python statements from the Django Web application framework and their corresponding description in English. Therefore, our final dataset contains 15,540 unique pairs of Python code snippets alongside their intents in natural language.

  13. Pokkemon Dataset_csv

    • kaggle.com
    zip
    Updated Aug 18, 2021
    Cite
    BHARGAV NATH (2021). Pokkemon Dataset_csv [Dataset]. https://www.kaggle.com/bhargavnath/new-dataset-fr-pandas
    Explore at:
    Available download formats: zip (13955 bytes)
    Dataset updated
    Aug 18, 2021
    Authors
    BHARGAV NATH
    Description

    Dataset

    This dataset was created by BHARGAV NATH

  14. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
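
    A minimal sketch of loading the labels file with pandas; the exact column names are an assumption, so check the header of NICHE.csv first:

        import pandas as pd

        niche = pd.read_csv("NICHE.csv")
        print(niche.columns.tolist())  # e.g. project name, engineered/non-engineered label, stars, commits
        print(niche.head())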

  15. Glaive Python Code QA DataSet

    • kaggle.com
    zip
    Updated Feb 27, 2024
    Cite
    BrucePayton (2024). Glaive Python Code QA DataSet [Dataset]. https://www.kaggle.com/datasets/brucepayton/glaive-python-code-qa-dataset
    Explore at:
    Available download formats: zip (62220335 bytes)
    Dataset updated
    Feb 27, 2024
    Authors
    BrucePayton
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by BrucePayton

    Released under Apache 2.0

  16. Django Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 7, 2022
    Cite
    Yusuke Oda; Hiroyuki Fudaba; Graham Neubig; Hideaki Hata; Sakriani Sakti; Tomoki Toda; Satoshi Nakamura (2022). Django Dataset [Dataset]. https://paperswithcode.com/dataset/django
    Explore at:
    Dataset updated
    Feb 7, 2022
    Authors
    Yusuke Oda; Hiroyuki Fudaba; Graham Neubig; Hideaki Hata; Sakriani Sakti; Tomoki Toda; Satoshi Nakamura
    Description

    The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development and 1,805 test annotations. Each data point consists of a line of Python code together with a manually created natural language description.

  17. Dataset Sales - Aleatory Data - by python numpy

    • kaggle.com
    zip
    Updated Jul 4, 2020
    Cite
    Italo Marcelo (2020). Dataset Sales - Aleatory Data - by python numpy [Dataset]. https://kaggle.com/italomarcelo/dataset-sales-aleatory-data-by-python-numpy
    Explore at:
    Available download formats: zip (4756152 bytes)
    Dataset updated
    Jul 4, 2020
    Authors
    Italo Marcelo
    Description

    Dataset

    This dataset was created by Italo Marcelo

  18. Data from: Dataset for: PATATO: A Python Photoacoustic Analysis Toolkit

    • repository.cam.ac.uk
    • commons.datacite.org
    bin, zip
    Updated Jan 27, 2023
    Cite
    Else, Thomas; Groehl, Janek; Hacker, Lina; Bohndiek, Sarah (2023). Dataset for: PATATO: A Python Photoacoustic Analysis Toolkit [Dataset]. http://doi.org/10.17863/CAM.93181
    Explore at:
    Available download formats: bin (1142092248 bytes), zip (1352132309 bytes), bin (1156744350 bytes), bin (3145 bytes), zip (800750333 bytes)
    Dataset updated
    Jan 27, 2023
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Else, Thomas; Groehl, Janek; Hacker, Lina; Bohndiek, Sarah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ========================================================================

    Example Data for PATATO: Python Photoacoustic Tomography Analysis toolkit

    Thomas Else, Sarah Bohndiek (seb53@cam.ac.uk),

    CRUK Cambridge Institute and Department of Physics, University of Cambridge.

    Data collected between October 2020 and December 2022 in Cambridge, United Kingdom.

    Please note: All animal procedures used to acquire the data described below were conducted in accordance with project (PE12C2B96) and personal licenses (I33984279) issued under the United Kingdom Animals (Scientific Procedures) Act, 1986, and were approved locally under compliance form number CFSB2022.

    Description of Files

    These datasets were collected using two different commercial photoacoustic imaging systems, details of which can be found on the vendor website. More information regarding the experimental acquisition of these files can be found via the PATATO Python toolkit repository, which is freely available on GitHub (https://github.com/tomelse/patato).

    clinical_phantom.hdf5: Data collected 02/08/2022. A photoacoustic imaging dataset containing raw data from a scan taken on a tissue-mimicking phantom. The data were acquired using the iThera Medical GmbH MSOT Acuity CE device. The file format is HDF5.

    preclinical_phantom.hdf5: Data collected 02/09/2021. A photoacoustic imaging dataset containing raw data from a scan taken on a tissue-mimicking phantom. The data were acquired using the iThera Medical GmbH MSOT inVision system. The file format is HDF5.

    invivo_oe.hdf5: Data collected 01/10/2020. A photoacoustic imaging dataset containing raw data from a scan of a mouse, with oxygen-enhanced imaging whereby the breathing gas of the mouse was changed during the scan time. The data were acquired using the iThera Medical MSOT inVision system. The file format is HDF5.

    invivo_dce.hdf5: Data collected 01/10/2020. A photoacoustic imaging dataset containing raw data from a scan of a mouse, with dynamic-contrast enhanced imaging using indocyanine green, whereby the contrast agent was introduced intravenously during the scan time. The data were acquired using the iThera Medical GmbH MSOT inVision system. The file format is HDF5.

    ithera_invivo_oe.zip: The same as invivo_oe.hdf5 but in a different format. The zip file contains imaging data in a proprietary format provided by the device manufacturer. It can be loaded using proprietary software such as iThera ViewMSOT, or by converting it to an open format.

    ithera_invivo_dce.zip: The same as invivo_dce.hdf5 but in a different format. The zip file contains imaging data in a proprietary format provided by the device manufacturer. It can be loaded using proprietary software such as iThera ViewMSOT, or by converting it to an open format.
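
    Since the raw files are HDF5, a generic first-look sketch with h5py works for inspecting them (the internal group layout is specific to PATATO, so the toolkit's own loaders are the intended way to read these files):

        import h5py

        # Print every group and dataset path inside one of the downloaded files.
        with h5py.File("clinical_phantom.hdf5", "r") as f:
            f.visit(print)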

  19. python-question-and-answer-preprocessed-dataset

    • kaggle.com
    zip
    Updated Apr 13, 2024
    Cite
    shrey pachauri (2024). python-question-and-answer-preprocessed-dataset [Dataset]. https://www.kaggle.com/datasets/shreypachauri123/python-question-and-answer-preprocessed-dataset
    Explore at:
    Available download formats: zip (506766 bytes)
    Dataset updated
    Apr 13, 2024
    Authors
    shrey pachauri
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by shrey pachauri

    Released under Apache 2.0

  20. Lifestyle data and python code - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 23, 2023
    + more versions
    Cite
    (2023). Lifestyle data and python code - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/efe390a2-7109-583f-9570-8aebb0e1af41
    Explore at:
    Dataset updated
    Oct 23, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Lifestyle data: People, word, food, leisure, and transportation are placed in each column in Excel.

    Column description:
    - people: the number of confirmed COVID-19 cases
    - word: number of searches
    - food: the amount paid by people who ate at restaurants
    - leisure: the amount paid by people who enjoyed leisure activities
    - transportation: the amount paid by people who used public transportation

    Python code 1: Python code using a DNN algorithm to predict the number of confirmed COVID-19 cases (written in a Jupyter notebook).

    Python code 2: Python code using an LSTM algorithm to predict the number of confirmed COVID-19 cases (written in a Jupyter notebook).
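
    The notebooks described above use DNN and LSTM models; a generic, minimal LSTM sketch in Keras is shown below (purely illustrative, unrelated to the authors' code, and using made-up numbers in place of the Excel columns):

        import numpy as np
        import tensorflow as tf

        # Made-up daily counts shaped into (samples, window, features) for an LSTM.
        series = np.random.rand(200).astype("float32")
        window = 7
        X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
        y = series[window:]

        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(32, input_shape=(window, 1)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        model.fit(X, y, epochs=2, verbose=0)
        print(model.predict(X[-1:]))  # next-step prediction for the most recent window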
