100+ datasets found

h
python-codes-25k
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FLOCK4H, python-codes-25k [Dataset]. https://huggingface.co/datasets/flytech/python-codes-25k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
FLOCK4H
Description
License

MIT

This is a Cleaned Python Dataset Covering 25,000 Instructional Tasks Overview

The dataset has 4 key features (fields): instruction, input, output, and text.It's a rich source for Python codes, tasks, and extends into behavioral aspects.

Dataset Statistics

Total Entries: 24,813 Unique Instructions: 24,580 Unique Inputs: 3,666 Unique Outputs: 24,581 Unique Texts: 24,813 Average Tokens per example: 508

Features… See the full description on the dataset page: https://huggingface.co/datasets/flytech/python-codes-25k.
Datasets for manuscript "A data engineering framework for chemical flow...
catalog.data.gov
gimi9.com
Updated Nov 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
Explore at:
Dataset updated
Nov 7, 2021
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The EPA GitHub repository PAU4ChemAs as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flows transfers, releases estimation, and identification of potential occupation exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the obtained outputs files after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
h
CodeExercise-Python-27k
huggingface.co
Updated Sep 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CodeFuse AI (2023). CodeExercise-Python-27k [Dataset]. https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k
Explore at:
Dataset updated
Sep 9, 2023
Dataset authored and provided by
CodeFuse AI
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for CodeFuse-CodeExercise-Python-27k

[中文] [English]

Dataset Description

This dataset consists of 27K Python programming exercises (in English), covering hundreds of Python-related topics including basic syntax and data structures, algorithm applications, database queries, machine learning, and more. Please note that this dataset was generated with the help of a teacher model and Camel, and has not undergone strict validation. There may be errors or… See the full description on the dataset page: https://huggingface.co/datasets/codefuse-ai/CodeExercise-Python-27k.
f
datasets
figshare.com
txt
Updated Oct 5, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carlos Rodriguez-Contreras (2017). datasets [Dataset]. http://doi.org/10.6084/m9.figshare.5472970.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5472970.v1
Dataset updated
Oct 5, 2017
Dataset provided by
figshare
Authors
Carlos Rodriguez-Contreras
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for practising in class
Data Pre-Processing : Data Integration
kaggle.com
Updated Aug 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 2, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mr.Machine
Description
In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student dataset contains columns such as Age, Gender, Grade, and Employed. The marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise
f
datasets
figshare.com
txt
Updated Sep 27, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carlos Rodriguez-Contreras (2017). datasets [Dataset]. http://doi.org/10.6084/m9.figshare.5447167.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5447167.v1
Dataset updated
Sep 27, 2017
Dataset provided by
figshare
Authors
Carlos Rodriguez-Contreras
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains datasets to be downloaded from students for their practices with R and Python
Z
Data from: #PraCegoVer dataset
data.niaid.nih.gov
Updated Jan 19, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Explore at:
Dataset updated
Jan 19, 2023
Dataset provided by
Gabriel Oliveira dos Santos
Esther Luna Colombini
Sandra Avila
Description
Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

containing the images. The file dataset.json comprehends a list of json objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
P
EVIL-Encoders Dataset
paperswithcode.com
opendatalab.com
Updated Aug 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). EVIL-Encoders Dataset [Dataset]. https://paperswithcode.com/dataset/evil-encoders
Explore at:
Dataset updated
Aug 31, 2021
Description
This dataset contains samples to generate Python code for security exploits. In order to make the dataset representative of real exploits, it includes code snippets drawn from exploits from public databases. Differing from general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Therefore, real exploits make extensive use of Python instructions for converting data between different encoders, for performing low-level arithmetic and logical operations, and for bit-level slicing, which cannot be found in the previous general-purpose Python datasets. In total, we built a dataset that consists of 1,114 original samples of exploit-tailored Python snippets and their corresponding intent in the English language. These samples include complex and nested instructions, as typical of Python programming. In order to perform more realistic training and for a fair evaluation, we left untouched the developers' original code snippets and did not decompose them. We provided English intents to describe nested instructions altogether. In order to bootstrap the training process for the NMT model, we include in our dataset both the original, exploit-oriented snippets and snippets from a previous general-purpose Python dataset. This enables the NMT model to generate code that can mix general-purpose and exploit-oriented instructions. Among the several datasets for Python code generation, we choose the Django dataset due to its large size. This corpus contains 14,426 unique pairs of Python statements from the Django Web application framework and their corresponding description in English. Therefore, our final dataset contains 15,540 unique pairs of Python code snippets alongside their intents in natural language.
python qa datasets
kaggle.com
Updated Apr 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
terrychan (2023). python qa datasets [Dataset]. https://www.kaggle.com/datasets/terrychanorg/python-qa-datasets/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 12, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
terrychan
Description
Dataset

This dataset was created by terrychan

Contents
T
mnist
tensorflow.org
universe.roboflow.com
+3more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/mnist
Explore at:
Dataset updated
Jun 1, 2024
Description
The MNIST database of handwritten digits.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('mnist', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png" alt="Visualization" width="500px">
Vector datasets for workshop "Introduction to Geospatial Raster and Vector...
figshare.com
Updated Oct 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ryan Avery (2022). Vector datasets for workshop "Introduction to Geospatial Raster and Vector Data with Python" [Dataset]. http://doi.org/10.6084/m9.figshare.21273837.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21273837.v1
Dataset updated
Oct 5, 2022
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ryan Avery
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cadaster data from PDOK used to illustrate the use of geopandas and shapely, geospatial python packages for manipulating vector data. The brpgewaspercelen_definitief_2020.gpkg file has been subsetted in order to make the download manageable for workshops. Other datasets are copies of those available from PDOK.
Sample data files for Python Course
figshare.com
txt
Updated Nov 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peter Verhaar (2022). Sample data files for Python Course [Dataset]. http://doi.org/10.6084/m9.figshare.21501549.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21501549.v1
Dataset updated
Nov 4, 2022
Dataset provided by
Figsharehttp://figshare.com/
Authors
Peter Verhaar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sample data set used in an introductory course on Programming in Python
VegeNet - Image datasets and Codes
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jo Yen Tan; Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7254508
Dataset updated
Oct 27, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jo Yen Tan; Jo Yen Tan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

Image datasets:

vege_original : Images of vegetables captured manually in data acquisition stage

vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed

non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods

food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.

food_image_dataset_split : Image dataset (4) split into train and test sets

process : Images created when cropping (pre-processing step) to create dataset (2).
Meta Kaggle Code
kaggle.com
zip
Updated Jul 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
Explore at:
zip(148301844275 bytes)Available download formats
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Kagglehttp://kaggle.com/
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Explore our public notebook content!

Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebooks versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

Why we’re releasing this dataset

By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

Sensitive data

While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

Joining with Meta Kaggle

The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

File organization

The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.

The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

Questions / Comments

We love feedback! Let us know in the Discussion tab.

Happy Kaggling!
T
cifar10
tensorflow.org
opendatalab.com
+3more
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
Explore at:
Dataset updated
Jun 1, 2024
Description
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('cifar10', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png" alt="Visualization" width="500px">
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous authors; Anonymous authors
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Datasets for manuscript "Data engineering for tracking chemicals and...
catalog.data.gov
gimi9.com
Updated Feb 10, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "Data engineering for tracking chemicals and releases at industrial end-of-life activities" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-data-engineering-for-tracking-chemicals-and-releases-at-industrial
Explore at:
Dataset updated
Feb 10, 2021
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
The GitHub repository contains a Python code (MC_Case_Study.py) to support and replicate the case study results shown in the manuscript entitled Data engineering for tracking chemicals and releases at industrial end-of-life activities. Also, it indicates the free-available Python libraries that are required for running the code "MC_Case_Study.py." The dataset "EoL_database_for_MC.csv" contains all data to execute the Python code and obtain "Figure 6: 6-level Sankey diagram for the case study", "Figure 7: Box plot for the case study", and "Figure 8: Histogram for the case study." A Table describing the data name entry and data type for the dataset "EoL_database_for_MC.csv" is shown. Also, this dataset information and Python code are provided in the manuscript Supporting Info file (see supporting documents). This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, J.P. Abraham, M. Martin, W.W. Ingwersen, and R.L. Smith. Data engineering for tracking chemicals and releases at industrial end-of-life activities. JOURNAL OF HAZARDOUS MATERIALS. Elsevier Science Ltd, New York, NY, USA, 405: 124270, (2021).
h
wikihow
huggingface.co
paperswithcode.com
+2more
Updated Mar 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
William Yang Wang (2024). wikihow [Dataset]. https://huggingface.co/datasets/wangwilliamyang/wikihow
Explore at:
Dataset updated
Mar 15, 2024
Authors
William Yang Wang
Description
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features: - text: wikihow answers texts. - headline: bold lines as summary.

There are two separate versions: - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries. - sep: consisting of each paragraph and its summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
Z
Curated list of HAR datasets
data.niaid.nih.gov
zenodo.org
Updated May 18, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Králik, Matej (2020). Curated list of HAR datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3831957
Explore at:
Dataset updated
May 18, 2020
Dataset authored and provided by
Králik, Matej
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A curated list of preprocessed & ready to use under a minute Human Activity Recognition datasets.

All the datasets are preprocessed in HDF5 format, created using the h5py python library. Scripts used for data preprocessing are provided as well (Load.ipynb and load_jordao.py)

Each HDF5 file contains at least the keys:

x a single array of size [sample count, temporal length, sensor channel count], contains the actual sensor data. Metadata contains the names of individual sensor channel count. All samples are zero-padded for constant length in the file, original lengths before padding available under the meta keys.

y a single array of size [sample count] with integer values for target classes (zero-based). Metadata contains the names of the target classes.

meta contain various metadata, depends on the dataset (original length before padding, subject no., trial no., etc.)

Usage example

import h5py

with h5py.File(f'data/waveglove_multi.h5', 'r') as h5f: x = h5f['x'] y = h5f['y']['class'] print(f'WaveGlove-multi: {x.shape[0]} samples') print(f'Sensor channels: {h5f["x"].attrs["channels"]}') print(f'Target classes: {h5f["y"].attrs["labels"]}') first_sample = x[0]

Output:

WaveGlove-multi: 10044 samples

Sensor channels: ['acc1-x' 'acc1-y' 'acc1-z' 'gyro1-x' 'gyro1-y' 'gyro1-z' 'acc2-x'

'acc2-y' 'acc2-z' 'gyro2-x' 'gyro2-y' 'gyro2-z' 'acc3-x' 'acc3-y'

'acc3-z' 'gyro3-x' 'gyro3-y' 'gyro3-z' 'acc4-x' 'acc4-y' 'acc4-z'

'gyro4-x' 'gyro4-y' 'gyro4-z' 'acc5-x' 'acc5-y' 'acc5-z' 'gyro5-x'

'gyro5-y' 'gyro5-z']

Target classes: ['null' 'hand swipe left' 'hand swipe right' 'pinch in' 'pinch out'

'thumb double tap' 'grab' 'ungrab' 'page flip' 'peace' 'metal']

Current list of datasets:

WaveGlove-single (waveglove_single.h5)

WaveGlove-multi (waveglove_multi.h5)

uWave (uwave.h5)

OPPORTUNITY (opportunity.h5)

PAMAP2 (pamap2.h5)

SKODA (skoda.h5)

MHEALTH (non overlapping windows) (mhealth.h5)

Six datasets with all four predefined train/test folds as preprocessed by Jordao et al. originally in WearableSensorData (FNOW, LOSO, LOTO and SNOW prefixed .h5 files)
P
Django Dataset
paperswithcode.com
opendatalab.com
Updated Feb 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yusuke Oda; Hiroyuki Fudaba; Graham Neubig; Hideaki Hata; Sakriani Sakti; Tomoki Toda; Satoshi Nakamura (2022). Django Dataset [Dataset]. https://paperswithcode.com/dataset/django
Explore at:
Dataset updated
Feb 7, 2022
Authors
Yusuke Oda; Hiroyuki Fudaba; Graham Neubig; Hideaki Hata; Sakriani Sakti; Tomoki Toda; Satoshi Nakamura
Description
The Django dataset is a dataset for code generation comprising of 16000 training, 1000 development and 1805 test annotations. Each data point consists of a line of Python code together with a manually created natural language description.

Facebook

Twitter

Click to copy link

Link copied

Cite

FLOCK4H, python-codes-25k [Dataset]. https://huggingface.co/datasets/flytech/python-codes-25k

python-codes-25k

flytech/python-codes-25k

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Authors

FLOCK4H

Description

License

MIT

  This is a Cleaned Python Dataset Covering 25,000 Instructional Tasks





  Overview

The dataset has 4 key features (fields): instruction, input, output, and text.It's a rich source for Python codes, tasks, and extends into behavioral aspects.

  Dataset Statistics

Total Entries: 24,813 Unique Instructions: 24,580 Unique Inputs: 3,666 Unique Outputs: 24,581 Unique Texts: 24,813 Average Tokens per example: 508

  Features… See the full description on the dataset page: https://huggingface.co/datasets/flytech/python-codes-25k.

Clear search

Close search

Google apps

Main menu

python-codes-25k

Datasets for manuscript "A data engineering framework for chemical flow...

CodeExercise-Python-27k

datasets

Data Pre-Processing : Data Integration

datasets

Data from: #PraCegoVer dataset

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

EVIL-Encoders Dataset

python qa datasets

Dataset

Contents

mnist

Vector datasets for workshop "Introduction to Geospatial Raster and Vector...

Sample data files for Python Course

VegeNet - Image datasets and Codes

Meta Kaggle Code

Explore our public notebook content!

Why we’re releasing this dataset

Sensitive data

Joining with Meta Kaggle

File organization

Questions / Comments

cifar10

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

Datasets for manuscript "Data engineering for tracking chemicals and...

wikihow

Curated list of HAR datasets

Output:

WaveGlove-multi: 10044 samples

Sensor channels: ['acc1-x' 'acc1-y' 'acc1-z' 'gyro1-x' 'gyro1-y' 'gyro1-z' 'acc2-x'

'acc2-y' 'acc2-z' 'gyro2-x' 'gyro2-y' 'gyro2-z' 'acc3-x' 'acc3-y'

'acc3-z' 'gyro3-x' 'gyro3-y' 'gyro3-z' 'acc4-x' 'acc4-y' 'acc4-z'

'gyro4-x' 'gyro4-y' 'gyro4-z' 'acc5-x' 'acc5-y' 'acc5-z' 'gyro5-x'

'gyro5-y' 'gyro5-z']

Target classes: ['null' 'hand swipe left' 'hand swipe right' 'pinch in' 'pinch out'

'thumb double tap' 'grab' 'ungrab' 'page flip' 'peace' 'metal']

Django Dataset

python-codes-25k

flytech/python-codes-25k