Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source
Here's an example of how the data looks (each class takes three rows):
![Visualized Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)
The dataset includes a train set (86% of images, 60,000 images) and a test set (14% of images, 10,000 images) only. The train set is split to provide 80% of its images to the training set and 20% of its images to the validation set.

@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
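As a usage illustration, here is a minimal sketch (not part of the original release) that loads Fashion-MNIST with torchvision and reproduces the 80/20 train/validation split described above; the data root and random seed are illustrative:

```python
# Minimal sketch, assuming torchvision is available: load Fashion-MNIST and
# carve the 80/20 train/validation split out of the 60,000-image train set.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_full = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST("data", train=False, download=True, transform=transform)

n_train = int(0.8 * len(train_full))   # 48,000 images
n_val = len(train_full) - n_train      # 12,000 images
train_set, val_set = random_split(
    train_full, [n_train, n_val], generator=torch.Generator().manual_seed(0))

print(len(train_set), len(val_set), len(test_set))  # 48000 12000 10000
```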
Using machine learning techniques in general, and deep learning techniques in particular, requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust machine-learning-based classification and wear-prognostics models, there is a need for real-world datasets to train and test models on.

The dataset contains 1,104 three-channel images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types. One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences of 69 images each.

Instructions for the dataset split: The authors of this dataset provide 3 different types of dataset splits. To get a data split you have to run the Python script split_dataset.py. Script inputs: split-type (mandatory), output directory (mandatory). The different split types are: train_test_split, which splits the dataset into train and test data (80%/20%); wear_dev_split, which splits the dataset into 27 wear developments; and type_split, which splits the dataset into the different BSD types. Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
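Since the annotations follow the standard labelme JSON layout, a hedged sketch of reading one annotation file in Python is shown below; the file name is purely illustrative:

```python
# Minimal sketch: parse one labelme JSON annotation and collect the polygon
# points labeled "pitting". The path is a placeholder; the keys follow the
# standard labelme format ("shapes", "label", "points", "imagePath").
import json

with open("label/example_image.json") as f:   # placeholder path
    ann = json.load(f)

pitting_polygons = [s["points"] for s in ann["shapes"] if s["label"] == "pitting"]
print(ann["imagePath"], "->", len(pitting_polygons), "pitting annotations")
```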
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208).

These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250.

The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
This benchmark data comprises 50 different datasets of materials properties obtained from 16 previous publications. The data includes both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6,354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.
For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the leave-one-out cross-validation method.
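A minimal sketch of the two splitting strategies described above, written with scikit-learn; the placeholder arrays stand in for one of the benchmark datasets:

```python
# 5-fold CV for datasets with more than 100 samples, leave-one-out otherwise,
# mirroring the splitting rule described above. X and y are placeholder data.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

rng = np.random.default_rng(0)
X, y = rng.random((120, 5)), rng.random(120)   # placeholder dataset

splitter = KFold(n_splits=5, shuffle=True, random_state=0) if len(X) > 100 else LeaveOneOut()
for fold, (train_idx, test_idx) in enumerate(splitter.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
```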
For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80%/20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example Image:
![Example Image](https://i.imgur.com/7spoIJT.png)
One could use this dataset to, for example, build a classifier of workers who are abiding by the safety code within a workplace versus those who may not be. It is also a good general dataset for practice.
Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Image Preprocessing | Image Augmentation | Modify Classes
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
Choosing Between Computer Vision Model Sizes | Roboflow Train
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gramatika is a synthetic GEC dataset for Indonesian. The Gramatika dataset has a total of 1.5 million sentences with 4,666,185 errors. Of all sentences, only 30,000 (2%) are correct sentences with no mistakes. Each sentence has a maximum of 6 errors, and there can only be 2 of the same error type in each sentence. We also split the dataset into three splits: train, dev, and test, with a proportion of 8:1:1 (with sizes of 1,199,705, 150,171, and 150,124 sentences, respectively). The proportion of valid sentences in each split is 2%: 24,000 in the train split, and 3,000 in each of the dev and test splits. Moreover, we also set the proportion of each error type to be the same in all splits, as shown in Table 3.3.2. For example, the proportion of noun errors is 7.5% in all splits, while the proportion of particle errors is only 0.3% in all splits.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
A fully annotated subset of the SO242/2_233-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:
For a definition of the classes see [1].
Related datasets:
This dataset contains the following files:
* annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
* annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
* images/: Directory that contains all the original image files.
* dataset.json: JSON file that contains information about the dataset.
  * name: The name of the dataset.
  * images_dir: Name of the directory that contains the original image files.
  * metadata_file: Path to the CSV file that contains image metadata.
  * test_annotations_file: Path to the CSV file that contains the test annotations.
  * train_annotations_file: Path to the CSV file that contains the train annotations.
  * annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
  * crop_dimension: Edge length of an annotation or style patch in pixels.
* metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Here we supply the training and test data as used in the publication "Convolutional Neural Network Applied for Nanoparticle Classification using Coherent Scatterometry Data" by D. Kolenov, D. Davidse, J. Le Cam, and S.F. Pereira.
We present the "main dataset" samples in the pixel size of both 150x150 and 100x100, and for the three "fooling datasets" the pixel size is 100x100. On average each dataset contains 1100 images with the .mat extension. The .mat extension is straightforward with MatLab, but it could also be opened in Python or MS Excel. For the "main dataset" the pixels represent the sampling points, and the magnitude of these pixels represent the em field registered as the photocurrent on the split-detector. For the three types of "fooling data" the images of a 1) noisy and 2) mirrored set are also based on the photocurrent; 3) the elephant set is based on the open-source Animal-10 data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
The train set is split to provide 80% of its images to the training set and 20% of its images to the validation set. The digit classes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are remapped to one, two, three, four, five, six, seven, eight, nine. The dataset includes a train set (86% of images, 60,000 images) and a test set (14% of images, 10,000 images) only.

@article{lecun2010mnist,
title={MNIST handwritten digit database},
author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
volume={2},
year={2010}
}
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for
* object detection: bounding boxes and per-instance segmentation masks with 80 object categories
* captioning: natural language descriptions of the images (see MS COCO Captions)
* keypoints detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle)
* stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff)
* panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road)
* dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations; each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model

The annotations are publicly available only for training and validation images.
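As an illustration of how these annotations are typically consumed, here is a minimal sketch using the pycocotools API; the file path assumes the standard COCO 2017 directory layout:

```python
# Minimal sketch: load the 2017 validation instance annotations and list the
# category name and bounding box of each annotation on one image.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")   # path assumes the standard layout
cats = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}

img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
print([(cats[a["category_id"]], a["bbox"]) for a in anns])  # bbox is [x, y, w, h]
```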
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of ids for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
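A hedged sketch of the stratified validation draw mentioned above; the file name and column names ("pair_id", "label") are assumptions about the corpus layout rather than confirmed details:

```python
# Minimal sketch: draw a stratified validation split from one training set so
# the match / no-match ratio is preserved. File and column names are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # assumed file name
train_ids, val_ids = train_test_split(
    pairs["pair_id"], test_size=0.2, stratify=pairs["label"], random_state=0)
print(len(train_ids), "training pairs,", len(val_ids), "validation pairs")
```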
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
The LIAR dataset has been widely followed by fake news detection researchers since its release, and along with a great deal of research, the community has provided a variety of feedback on the dataset to improve it. We adopted this feedback and released the LIAR2 dataset, a new benchmark dataset of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We used a split ratio of 8:1:1 to distinguish between the training set, the test set, and the validation set, details of which are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed at Huggingface and Github, and statistical information for LIAR and LIAR2 is provided in the table below:
Statistics | LIAR | LIAR2 |
---|---|---|
Training set size | 10,269 | 18,369 |
Validation set size | 1,284 | 2,297 |
Testing set size | 1,283 | 2,296 |
Avg. statement length (tokens) | 17.9 | 17.7 |
Avg. speaker description length (tokens) | \ | 39.4 |
Avg. justification length (tokens) | \ | 94.4 |
Labels | ||
Pants on fire | 1,050 | 3,031 |
False | 2,511 | 6,605 |
Barely-true | 2,108 | 3,603 |
Half-true | 2,638 | 3,709 |
Mostly-true | 2,466 | 3,429 |
True | 2,063 | 2,585 |
Ablation Experiment

The LIAR2 dataset is an upgrade of the LIAR dataset: it inherits the ideas of the LIAR dataset, refines the details and architecture, and expands the size of the dataset to make it more responsive to the needs of fake news detection tasks. We believe that the LIAR2 dataset will enable better fake news detection. The analysis and baseline information about the LIAR2 dataset is provided below.
Feature | Val. Accuracy | Val. F1-Macro | Val. F1-Micro | Test Accuracy | Test F1-Macro | Test F1-Micro | Mean |
---|---|---|---|---|---|---|---|
Statement | 0.3174 | 0.1957 | 0.3117 | 0.3197 | 0.2380 | 0.3197 | 0.2837 |
Date | 0.2912 | 0.1879 | 0.2912 | 0.3079 | 0.1775 | 0.3079 | 0.2606 |
Subject | 0.3243 | 0.2311 | 0.3183 | 0.3267 | 0.2271 | 0.3267 | 0.2924 |
Speaker | 0.3283 | 0.2250 | 0.3174 | 0.3310 | 0.2462 | 0.3310 | 0.2965 |
Speaker Description | 0.3322 | 0.2444 | 0.3250 | 0.3280 | 0.2444 | 0.3280 | 0.3003 |
State Info | 0.2930 | 0.1577 | 0.2950 | 0.2979 | 0.1521 | 0.2979 | 0.2489 |
Credibility History | 0.5007 | 0.4696 | 0.4985 | 0.5057 | 0.4656 | 0.5057 | 0.4910 |
Context | 0.2982 | 0.1817 | 0.2982 | 0.3132 | 0.1791 | 0.3132 | 0.2639 |
Justification | 0.5964 | 0.5657 | 0.5827 | 0.6115 | 0.5968 | 0.6115 | 0.5941 |
All without | |||||||
Statement | 0.7079 | 0.6734 | 0.6822 | 0.7182 | 0.7108 | 0.7182 | 0.7018 |
Date | 0.6931 | 0.6572 | 0.6680 | 0.7078 | 0.6993 | 0.7078 | 0.6889 |
Subject | 0.7000 | 0.6579 | 0.6681 | 0.7078 | 0.7013 | 0.7078 | 0.6905 |
Speaker | 0.6944 | 0.6648 | 0.6757 | 0.7043 | 0.6942 | 0.7043 | 0.6896 |
Speaker Description | 0.6892 | 0.6640 | 0.6739 | 0.7169 | 0.7073 | 0.7169 | 0.6947 |
State Info | 0.7074 | 0.6625 | 0.6729 | 0.7099 | 0.7016 | 0.7099 | 0.6940 |
Credibility History | 0.6025 | 0.5717 | 0.5900 | 0.6185 | 0.6046 | 0.6185 | 0.6010 |
Context | 0.7005 | 0.6622 | 0.6720 | 0.7043 | 0.6967 | 0.7043 | 0.6900 |
Justification | 0.5285 | 0.4898 | 0.5153 | 0.5340 | 0.5148 | 0.5340 | 0.5194 |
Statement + | |||||||
Date | 0.3431 | 0.2540 | 0.3343 | 0.3380 | 0.2514 | 0.3380 | 0.3098 |
Subject | 0.3548 | 0.2759 | 0.3513 | 0.3375 | 0.2580 | 0.3375 | 0.3192 |
Speaker | 0.3618 | 0.2862 | 0.3539 | 0.3476 | 0.2640 | 0.3476 | 0.3269 |
Speaker Description | 0.3583 | 0.2814 | 0.3531 | 0.3667 | 0.2886 | 0.3667 | 0.3358 |
State Info | 0.3317 | 0.2367 | 0.3294 | 0.3328 | 0.2362 | 0.3328 | 0.2999 |
Credibility History | 0.5067 | 0.4737 | 0.5084 | 0.5244 | 0.5000 | 0.5244 | 0.5063 |
Context | 0.3361 | 0.2682 | 0.3391 | 0.3458 | 0.2560 | 0.3458 | 0.3152 |
Justification | 0.6017 | 0.5578 | 0.5796 | 0.6176 | 0.6026 | 0.6176 | 0.5962 |
All | 0.6974 | 0.6570 | 0.6676 | 0.7021 | 0.6961 | 0.7021 | 0.6871 |
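A hedged sketch of loading the LIAR2 splits with the Hugging Face datasets library; the repository ID and field names below are assumptions, so substitute the actual ID from the Huggingface link above:

```python
# Minimal sketch: load LIAR2 from the Hugging Face hub. The repository ID and
# field names are assumptions; check the dataset card for the real values.
from datasets import load_dataset

liar2 = load_dataset("chengxuphd/liar2")   # hypothetical repository ID
print({split: len(ds) for split, ds in liar2.items()})  # expect ~18,369 / 2,297 / 2,296
sample = liar2["train"][0]
print(sample["statement"], "->", sample["label"])       # assumed field names
```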
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
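A minimal sketch of loading MMLU from the linked Hugging Face page, assuming the datasets library is installed; the "all" configuration aggregates the individual subject configurations:

```python
# Minimal sketch: load the aggregated MMLU configuration and print one
# multiple-choice question from the test split.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")
example = mmlu["test"][0]
print(example["subject"], "-", example["question"])
print(example["choices"], "answer index:", example["answer"])
```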
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, the wikikg dataset was created from OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate a relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% compared to the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time, the more complex the task is. The Wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files:
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is supposed to be evaluated in the inference-only regime, being pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
This data set consists of 6691 images spanning 24 classes that were collected by the Mars Science Laboratory (MSL, Curiosity) rover using three instruments (Mastcam Right eye, Mastcam Left eye, and MAHLI). These images are the "browse" version of each original data product, not full resolution. They are roughly 256x256 pixels each. We divided the MSL images into train, validation, and test data sets according to their sol (Martian day) of acquisition. This strategy was chosen to model how the system will be used operationally with an image archive that grows over time. The images were collected from sols 3 to 1060 (August 2012 to July 2015). The exact train/validation/test splits are given in individual files. Full-size images can be obtained from the PDS at https://pds-imaging.jpl.nasa.gov/search/ .
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As more hydrocarbon production from hydraulic fracturing and other methods produce large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than PWGD9 dataset, suggesting that either a larger sample size and/or fewer attributes lead to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggests that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
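A hedged sketch of the classification workflow described above (an 80/20 split plus a random forest), written with scikit-learn; the file name and column names are assumptions about the PWGD9 table layout, not confirmed details of the study:

```python
# Minimal sketch: predict geologic province from the nine water-chemistry
# attributes with an 80/20 split and a random forest. File and column names
# are placeholders; max_features=5 loosely mirrors the reported mtry = 5.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("pwgd9_cleaned.csv")                 # placeholder file name
features = ["SPGRAV", "PH", "HCO3", "Na", "Mg", "Ca", "Cl", "SO4", "TDS"]  # assumed columns
X, y = df[features], df["PROVINCE"]                   # assumed province column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```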
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated using the Roboflow platform. The annotations are compatible with the PyTorch YOLOv5 architecture.
Dataset details:
Image Split:
Train / Test Split: 92
Training Set: 1.1k
Preprocessing
Auto-Orient: Applied
Resize: Stretch to 416x416
Augmentations
Outputs per training example: 5
Flip: Horizontal, Vertical
Crop: 0% Minimum Zoom, 49% Maximum Zoom
Grayscale: Apply to 47% of images
Hue: Between -25° and +25°
Saturation: Between -42% and +42%
Exposure: Between -22% and +22%
Blur: Up to 3.25px
Cutout: 8 boxes with 10% size each
Mosaic: Applied
Details
Version Name: 2022-07-24 12:50am
Version ID: 1
Generated: Jul 24, 2022