67 datasets found
  1. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +4more
    zip
    Updated Aug 10, 2022
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Authors:

    • Han Xiao
    • Kashif Rasul
    • Roland Vollgraf

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. (Source: https://github.com/zalandoresearch/fashion-mnist)
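    For reference, a minimal sketch of loading the canonical train/test splits in Python, assuming torchvision is installed (the Roboflow export above ships as image folders instead, so this is only an illustration of the underlying dataset):

    # Minimal sketch: pull the canonical Fashion-MNIST splits with torchvision.
    from torchvision import datasets, transforms

    transform = transforms.ToTensor()
    train_set = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
    test_set = datasets.FashionMNIST(root="data", train=False, download=True, transform=transform)

    print(len(train_set), len(test_set))  # 60000 10000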

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original Fashion-MNIST splits: train (60,000 images, 86% of the data) and test (10,000 images, 14% of the data) only.
    • This version was not trained

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Train/valid/test split rebalancing: https://blog.roboflow.com/train-test-split/ (illustration: https://i.imgur.com/angfheJ.png)

    Citation:

    @online{xiao2017/online,
     author    = {Han Xiao and Kashif Rasul and Roland Vollgraf},
     title    = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
     date     = {2017-08-28},
     year     = {2017},
     eprintclass = {cs.LG},
     eprinttype  = {arXiv},
     eprint    = {cs.LG/1708.07747},
    }
    
  2. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 24, 2023
    + more versions
    Cite
    (2023). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Explore at:
    Dataset updated
    Oct 24, 2023
    Description

    Applying machine learning techniques in general, and deep learning techniques in particular, requires an amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks that companies often want to automate. To automate these classification processes and to develop reliable and robust machine-learning-based classification and wear-prognostics models, real-world datasets are needed to train and test models on.

    The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and are hence convertible to VOC and COCO format. All images come from two BSD types. The download is organized into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model.

    One of the two BSD types is represented by 69 images with 55 different image sizes; all images of this type come in either a clean or a soiled condition. The other BSD type is shown in 325 images with two image sizes; since these images were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting-development sequences of 69 images each.

    Dataset split instructions: the authors provide three different dataset splits, produced by running the python script split_dataset.py.

    Script inputs:

    • split-type (mandatory)
    • output directory (mandatory)

    Split types:

    • train_test_split: splits the dataset into train and test data (80%/20%), the same split the authors used for the baseline model
    • wear_dev_split: splits the dataset into the 27 wear developments
    • type_split: splits the data into the occurring BSD types

    Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
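    For orientation, here is a hedged sketch of the kind of 80/20 partition the train_test_split option produces; the paths and file layout below are assumptions for illustration, and the authors' split_dataset.py remains the authoritative tool for reproducing their exact split:

    # Hedged sketch of an 80/20 image split; paths are hypothetical.
    import random
    import shutil
    from pathlib import Path

    random.seed(42)
    images = sorted(Path("data").glob("*.jpg"))  # assumes the "data" folder holds all JPEG images
    random.shuffle(images)

    cut = int(0.8 * len(images))
    for subset, files in (("train", images[:cut]), ("test", images[cut:])):
        out = Path("BSD_split_folder") / subset
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy(f, out / f.name)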

  3. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

    These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

    From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
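    Because the audio files are grouped into folders named by four-letter species code, a small sketch like the following can tabulate the number of recordings per class (the root folder name here is an assumption, not part of the release):

    # Hedged sketch: count WAV recordings per species-code folder.
    from pathlib import Path

    root = Path("nabat_training_data")  # hypothetical location of the unpacked release
    counts = {d.name: len(list(d.glob("*.wav"))) for d in root.iterdir() if d.is_dir()}
    for code, n in sorted(counts.items()):
        print(f"{code}: {n} recordings")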

  4. Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 6, 2021
    Cite
    Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
    Explore at:
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Sparks, D. Taylor
    Kauwe, K. Steven
    Henderson, N. Ashley
    Description

    This benchmark comprises 50 different datasets for materials properties obtained from 16 previous publications. The data contain both experimental and computational values, tasks suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

    For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the leave-one-out cross-validation method.
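    A hedged sketch of that splitting rule with scikit-learn follows; the fold counts in the benchmark follow the original publications, so the threshold and numbers below are only illustrative:

    # Illustrative sketch: k-fold CV for larger datasets, leave-one-out CV for small ones.
    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    def make_splitter(n_samples: int, n_folds: int = 5):
        """Return a CV splitter following the benchmark's rule of thumb."""
        if n_samples > 100:
            return KFold(n_splits=n_folds, shuffle=True, random_state=0)
        return LeaveOneOut()

    X = np.random.rand(60, 4)  # toy feature matrix with fewer than 100 samples
    splitter = make_splitter(len(X))
    print(sum(1 for _ in splitter.split(X)), "folds")  # 60 (leave-one-out)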

    For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  5. BUTTER - Empirical Deep Learning Dataset

    • osti.gov
    Updated May 20, 2022
    + more versions
    Cite
    BUTTER - Empirical Deep Learning Dataset [Dataset]. https://www.osti.gov/biblio/1872441
    Explore at:
    Dataset updated
    May 20, 2022
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    United States Department of Energy (http://energy.gov/)
    DOE Open Energy Data Initiative (OEDI)
    National Renewable Energy Laboratory (NREL), Golden, CO (United States)
    Description

    The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80%/20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.

  6. Hard Hat Workers Dataset

    • universe.roboflow.com
    zip
    Updated Sep 30, 2022
    + more versions
    Cite
    Joseph Nelson (2022). Hard Hat Workers Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/hard-hat-workers/model/13
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 30, 2022
    Dataset authored and provided by
    Joseph Nelson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Workers Bounding Boxes
    Description

    Overview

    The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.

    The original dataset has a 75/25 train-test split.

    Example Image: https://i.imgur.com/7spoIJT.png

    Use Cases

    One could use this dataset to, for example, build a classifier that distinguishes workers who are abiding by safety code within a workplace from those who may not be. It is also a good general dataset for practice.

    Using this Dataset

    Use the Fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Dataset Versions:

    Image Preprocessing | Image Augmentation | Modify Classes

    • v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
    • v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
    • v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
    • v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
    • v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
    • v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
    • v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
    • v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
    • v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
    • v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
    • v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
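    The Modify Classes steps above amount to dropping some labels and renaming others before export. A minimal sketch of that idea on a simple list of (label, bounding-box) annotations, purely illustrative and not Roboflow's implementation:

    # Hypothetical sketch of "Modify Classes": drop some labels, remap others.
    DROP = {"person"}           # e.g. v5/v8: drop the person class
    REMAP = {"helmet": "head"}  # e.g. v14: relabel helmet as head

    def modify_classes(annotations):
        kept = []
        for label, box in annotations:
            if label in DROP:
                continue
            kept.append((REMAP.get(label, label), box))
        return kept

    sample = [("helmet", (10, 10, 50, 50)), ("person", (0, 0, 100, 200)), ("head", (12, 8, 40, 40))]
    print(modify_classes(sample))  # [('head', (10, 10, 50, 50)), ('head', (12, 8, 40, 40))]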

    Choosing Between Computer Vision Model Sizes | Roboflow Train

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

    Roboflow Workmark

  7. Gramatika

    • ieee-dataport.org
    Updated Mar 6, 2025
    Cite
    Michael Felix Haryono (2025). Gramatika [Dataset]. http://doi.org/10.21227/h056-hb64
    Explore at:
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    IEEE Dataport
    Authors
    Michael Felix Haryono
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gramatika is a synthetic GEC (grammatical error correction) dataset for Indonesian. The Gramatika dataset has a total of 1.5 million sentences with 4,666,185 errors; of all sentences, only 30,000 (2%) are correct sentences with no mistakes. Each sentence has a maximum of 6 errors, and there can be at most 2 of the same error type in each sentence. We also split the dataset into three splits: train, dev, and test, with a proportion of 8:1:1 (1,199,705, 150,171, and 150,124 sentences, respectively). The proportion of correct sentences in each split is 2%: 24,000 in the train split and 3,000 in each of the dev and test splits. Moreover, we also set the proportion of each error type to be the same across all splits, as shown in Table 3.3.2 of the accompanying paper. For example, the proportion of noun errors is 7.5% in all splits, while the proportion of particle errors is only 0.3% in all splits.
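    A hedged illustration of such an 8:1:1 split that keeps per-class proportions constant, using scikit-learn's stratified splitting (the toy sentences and labels below are invented placeholders, not the authors' code):

    # Illustrative 8:1:1 stratified split on placeholder data.
    from sklearn.model_selection import train_test_split

    sentences = [f"sentence {i}" for i in range(1000)]
    error_types = ["noun" if i % 2 == 0 else "particle" for i in range(1000)]  # placeholder labels

    # Carve off 20%, then split it half-and-half into dev and test (8:1:1 overall).
    train_x, rest_x, train_y, rest_y = train_test_split(
        sentences, error_types, test_size=0.2, stratify=error_types, random_state=0)
    dev_x, test_x, dev_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=0)

    print(len(train_x), len(dev_x), len(test_x))  # 800 100 100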

  8. S233

    • zenodo.org
    tar
    Updated Oct 6, 2020
    + more versions
    Cite
    Martin Zurowietz; Martin Zurowietz (2020). S233 [Dataset]. http://doi.org/10.5281/zenodo.3603815
    Explore at:
    Available download formats: tar
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Zurowietz; Martin Zurowietz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A fully annotated subset of the SO242/2_233-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:

    • anemone
    • coral
    • crustacean
    • ipnops fish
    • litter
    • ophiuroid
    • other fauna
    • sea cucumber
    • sponge
    • stalked crinoid

    For a definition of the classes see [1].

    Related datasets:

    This dataset contains the following files:

    • annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
    • annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
    • images/: Directory that contains all the original image files.
    • dataset.json: JSON file that contains information about the dataset.
      • name: The name of the dataset.
      • images_dir: Name of the directory that contains the original image files.
      • metadata_file: Path to the CSV file that contains image metadata.
      • test_annotations_file: Path to the CSV file that contains the test annotations.
      • train_annotations_file: Path to the CSV file that contains the train annotations.
      • annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
      • crop_dimension: Edge length of an annotation or style patch in pixels.
    • metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.
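    Given the file layout above, a small sketch of reading dataset.json and the two BIIGLE CSV annotation reports in Python (pandas assumed; the exact CSV columns follow the BIIGLE report format and are not spelled out here):

    # Hedged sketch: read dataset.json and the train/test annotation reports it points to.
    import json
    import pandas as pd

    with open("dataset.json") as f:
        meta = json.load(f)

    train_annotations = pd.read_csv(meta["train_annotations_file"])
    test_annotations = pd.read_csv(meta["test_annotations_file"])
    print(meta["name"], len(train_annotations), len(test_annotations))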
  9. Training and test data for the preparation of the article: Convolutional...

    • 4tu.edu.hpc.n-helix.com
    • data.4tu.nl
    zip
    Updated May 29, 2020
    Cite
    Dmytro Kolenov; D. (Davy) Davidse (2020). Training and test data for the preparation of the article: Convolutional Neural Network Applied for Nanoparticle Classification using Coherent Scaterometry Data [Dataset]. http://doi.org/10.4121/uuid:516ab2fa-4c47-42f8-b614-5e283889b218
    Explore at:
    Available download formats: zip
    Dataset updated
    May 29, 2020
    Dataset provided by
    4TU (https://www.4tu.nl/)
    Authors
    Dmytro Kolenov; D. (Davy) Davidse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Here we supply the training and test data as used in the prepared publication "Convolutional Neural Network Applied for Nanoparticle Classification using Coherent Scatterometry Data" by D. Kolenov, D. Davidse, J. Le Cam, and S.F. Pereira.

    We present the "main dataset" samples at both 150x150 and 100x100 pixels, and the three "fooling datasets" at 100x100 pixels. On average each dataset contains 1100 images with the .mat extension. The .mat files are straightforward to open in MATLAB, but they can also be opened in Python or MS Excel. For the "main dataset", the pixels represent the sampling points, and the magnitude of each pixel represents the electromagnetic field registered as the photocurrent on the split detector. For the three types of "fooling data", the images of the 1) noisy and 2) mirrored sets are also based on the photocurrent; 3) the elephant set is based on the open-source Animal-10 data.
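    As noted, the .mat files can be opened outside MATLAB; a minimal sketch with SciPy (the file name and folder below are placeholders, not actual file names from the archive):

    # Hedged sketch: inspect one .mat sample with SciPy; path is a placeholder.
    from scipy.io import loadmat

    sample = loadmat("main_dataset/example_0001.mat")
    for key, value in sample.items():
        if not key.startswith("__"):  # skip MATLAB header entries
            print(key, getattr(value, "shape", type(value)))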

  10. Mnist Dataset

    • universe.roboflow.com
    • tensorflow.org
    • +5more
    zip
    Updated Aug 8, 2022
    Cite
    Popular Benchmarks (2022). Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/mnist-cjkff/model/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Digits
    Description

    THE MNIST DATABASE of handwritten digits

    Authors:

    • Yann LeCun, Courant Institute, NYU
    • Corinna Cortes, Google Labs, New York
    • Christopher J.C. Burges, Microsoft Research, Redmond

    Dataset Obtained From: http://yann.lecun.com/exdb/mnist/

    All images were sized 28x28 in the original dataset

    The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

    It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

    Version 1 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Trained from Roboflow Classification Model's ImageNet training checkpoint

    Version 2 (original-images_ModifiedClasses_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Modify Classes, a Roboflow preprocessing feature, was employed to change class names from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 to one, two, three, four, five, six, seven, eight, nine
    • Trained from the Roboflow Classification Model's ImageNet training checkpoint

    Version 3 (original-images_Original-MNIST-Splits):

    • Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.
    • This version was not trained

    Citation:

    @article{lecun2010mnist,
     title={MNIST handwritten digit database},
     author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
     journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
     volume={2},
     year={2010}
    }
    
  11. MS COCO Dataset

    • paperswithcode.com
    Updated Apr 15, 2024
    Cite
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár, MS COCO Dataset [Dataset]. https://paperswithcode.com/dataset/coco
    Explore at:
    Dataset updated
    Apr 15, 2024
    Authors
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár
    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

    Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.

    Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.

    Annotations: The dataset has annotations for

    • Object detection: bounding boxes and per-instance segmentation masks with 80 object categories
    • Captioning: natural language descriptions of the images (see MS COCO Captions)
    • Keypoints detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle)
    • Stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff)
    • Panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road)
    • Dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations; each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model

    The annotations are publicly available only for training and validation images.
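    The detection annotations are commonly consumed through the pycocotools COCO API; a brief sketch, assuming the standard 2017 annotation file layout:

    # Sketch: load COCO instance annotations with pycocotools (2017 file layout assumed).
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2017.json")
    cats = coco.loadCats(coco.getCatIds())
    print(len(coco.getImgIds()), "images,", len(cats), "object categories")

    img_id = coco.getImgIds()[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    print(f"image {img_id} has {len(anns)} annotated instances")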

  12. WDC LSPM Dataset

    • paperswithcode.com
    Updated May 31, 2022
    Cite
    WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Explore at:
    Dataset updated
    May 31, 2022
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

    In order to support the evaluation of machine-learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.

  13. gigaspeech

    • huggingface.co
    • paperswithcode.com
    • +1more
    + more versions
    Cite
    SpeechColab, gigaspeech [Dataset]. https://huggingface.co/datasets/speechcolab/gigaspeech
    Explore at:
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
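    The training subsets map onto Hugging Face dataset configurations; a hedged loading sketch follows. The configuration name "xs" for the 10h subset is an assumption from the dataset card, and access is gated, so a Hugging Face token with accepted terms is required:

    # Hedged sketch: load a GigaSpeech subset via the Hugging Face datasets library.
    from datasets import load_dataset

    gs = load_dataset("speechcolab/gigaspeech", "xs", split="train", token=True)
    print(gs[0]["text"])  # field names follow the dataset card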

  14. LIAR2 Dataset

    • paperswithcode.com
    + more versions
    Cite
    Cheng Xu; M-Tahar Kechadi, LIAR2 Dataset [Dataset]. https://paperswithcode.com/dataset/liar2
    Explore at:
    Authors
    Cheng Xu; M-Tahar Kechadi
    Description

    The LIAR dataset has been widely followed by fake news detection researchers since its release, and along with a great deal of research, the community has provided a variety of feedback to improve it. We adopted this feedback and released the LIAR2 dataset, a new benchmark of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We used a split ratio of 8:1:1 to separate the training set, the test set, and the validation set; details are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed on Hugging Face and GitHub, and statistical information for LIAR and LIAR2 is provided in the table below:

    Statistics | LIAR | LIAR2
    Training set size | 10,269 | 18,369
    Validation set size | 1,284 | 2,297
    Testing set size | 1,283 | 2,296
    Avg. statement length (tokens) | 17.9 | 17.7
    Avg. speaker description length (tokens) | n/a | 39.4
    Avg. justification length (tokens) | n/a | 94.4
    Labels:
    Pants on fire | 1,050 | 3,031
    False | 2,511 | 6,605
    Barely-true | 2,108 | 3,603
    Half-true | 2,638 | 3,709
    Mostly-true | 2,466 | 3,429
    True | 2,063 | 2,585

    Ablation Experiment

    The LIAR2 dataset is an upgrade of the LIAR dataset: it inherits the ideas of LIAR, refines the details and architecture, and expands the size of the dataset to make it more responsive to the needs of fake news detection tasks. We believe that, with the help of the LIAR2 dataset, it will be possible to perform fake news detection better. Analysis and baseline information for the LIAR2 dataset is provided below.

    Feature | Val. Accuracy | Val. F1-Macro | Val. F1-Micro | Test Accuracy | Test F1-Macro | Test F1-Micro | Mean
    Statement | 0.3174 | 0.1957 | 0.3117 | 0.3197 | 0.2380 | 0.3197 | 0.2837
    Date | 0.2912 | 0.1879 | 0.2912 | 0.3079 | 0.1775 | 0.3079 | 0.2606
    Subject | 0.3243 | 0.2311 | 0.3183 | 0.3267 | 0.2271 | 0.3267 | 0.2924
    Speaker | 0.3283 | 0.2250 | 0.3174 | 0.3310 | 0.2462 | 0.3310 | 0.2965
    Speaker Description | 0.3322 | 0.2444 | 0.3250 | 0.3280 | 0.2444 | 0.3280 | 0.3003
    State Info | 0.2930 | 0.1577 | 0.2950 | 0.2979 | 0.1521 | 0.2979 | 0.2489
    Credibility History | 0.5007 | 0.4696 | 0.4985 | 0.5057 | 0.4656 | 0.5057 | 0.4910
    Context | 0.2982 | 0.1817 | 0.2982 | 0.3132 | 0.1791 | 0.3132 | 0.2639
    Justification | 0.5964 | 0.5657 | 0.5827 | 0.6115 | 0.5968 | 0.6115 | 0.5941
    All without:
    Statement | 0.7079 | 0.6734 | 0.6822 | 0.7182 | 0.7108 | 0.7182 | 0.7018
    Date | 0.6931 | 0.6572 | 0.6680 | 0.7078 | 0.6993 | 0.7078 | 0.6889
    Subject | 0.7000 | 0.6579 | 0.6681 | 0.7078 | 0.7013 | 0.7078 | 0.6905
    Speaker | 0.6944 | 0.6648 | 0.6757 | 0.7043 | 0.6942 | 0.7043 | 0.6896
    Speaker Description | 0.6892 | 0.6640 | 0.6739 | 0.7169 | 0.7073 | 0.7169 | 0.6947
    State Info | 0.7074 | 0.6625 | 0.6729 | 0.7099 | 0.7016 | 0.7099 | 0.6940
    Credibility History | 0.6025 | 0.5717 | 0.5900 | 0.6185 | 0.6046 | 0.6185 | 0.6010
    Context | 0.7005 | 0.6622 | 0.6720 | 0.7043 | 0.6967 | 0.7043 | 0.6900
    Justification | 0.5285 | 0.4898 | 0.5153 | 0.5340 | 0.5148 | 0.5340 | 0.5194
    Statement +:
    Date | 0.3431 | 0.2540 | 0.3343 | 0.3380 | 0.2514 | 0.3380 | 0.3098
    Subject | 0.3548 | 0.2759 | 0.3513 | 0.3375 | 0.2580 | 0.3375 | 0.3192
    Speaker | 0.3618 | 0.2862 | 0.3539 | 0.3476 | 0.2640 | 0.3476 | 0.3269
    Speaker Description | 0.3583 | 0.2814 | 0.3531 | 0.3667 | 0.2886 | 0.3667 | 0.3358
    State Info | 0.3317 | 0.2367 | 0.3294 | 0.3328 | 0.2362 | 0.3328 | 0.2999
    Credibility History | 0.5067 | 0.4737 | 0.5084 | 0.5244 | 0.5000 | 0.5244 | 0.5063
    Context | 0.3361 | 0.2682 | 0.3391 | 0.3458 | 0.2560 | 0.3458 | 0.3152
    Justification | 0.6017 | 0.5578 | 0.5796 | 0.6176 | 0.6026 | 0.6176 | 0.5962
    All | 0.6974 | 0.6570 | 0.6676 | 0.7021 | 0.6961 | 0.7021 | 0.6871
  15. mmlu

    • huggingface.co
    Updated Jul 31, 2021
    + more versions
    Cite
    mmlu [Dataset]. https://huggingface.co/datasets/cais/mmlu
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2021
    Dataset authored and provided by
    Center for AI Safety (https://safe.ai/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for MMLU

      Dataset Summary
    

    Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

  16. InductiveQE Datasets

    • zenodo.org
    zip
    Updated Nov 9, 2022
    Cite
    Mikhail Galkin; Mikhail Galkin (2022). InductiveQE Datasets [Dataset]. http://doi.org/10.5281/zenodo.7306046
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mikhail Galkin; Mikhail Galkin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    InductiveQE datasets

    UPD 2.0: Regenerated datasets free of potential test set leakages

    UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs

    This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). Nine datasets (106-550) were created from FB15k-237; the wikikg dataset was created from the OGB WikiKG 2 graph. In all datasets, the inference graphs extend the training graphs and include new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph; e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task is. The wikikg split has a fixed 133% ratio.

    Each dataset is a zip archive containing 17 files:

    • train_graph.txt (pt for wikikg) - original training graph
    • val_inference.txt (pt) - inference graph (validation split), new nodes in validation are disjoint with the test inference graph
    • val_predict.txt (pt) - missing edges in the validation inference graph to be predicted.
    • test_inference.txt (pt) - inference graph (test split), new nodes in test are disjoint with the validation inference graph
    • test_predict.txt (pt) - missing edges in the test inference graph to be predicted.
    • train/valid/test_queries.pkl - queries of the respective split, 14 query types for fb-derived datasets, 9 types for Wikikg (EPFO-only)
    • *_answers_easy.pkl - easy answers to respective queries that do not require predicting missing links but only edge traversal
    • *_answers_hard.pkl - hard answers to respective queries that DO require predicting missing links and against which the final metrics will be computed
    • train_answers_val.pkl - the extended set of answers for training queries on the bigger validation graph; most training queries have at least one new answer. This is intended as an inference-only dataset to measure the faithfulness of trained models
    • train_answers_test.pkl - the extended set of answers for training queries on the bigger test graph; most training queries have at least one new answer. This is intended as an inference-only dataset to measure the faithfulness of trained models
    • og_mappings.pkl - contains entity2id / relation2id dictionaries mapping local node/relation IDs from a respective dataset to the original fb15k237 / wikikg2
    • stats.txt - a small file with dataset stats
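    A minimal sketch of reading one of the pickled query/answer files listed above (the structure of the loaded objects follows the paper's repository and is not spelled out here):

    # Hedged sketch: load test queries and their hard answers from an unpacked dataset archive.
    import pickle

    with open("test_queries.pkl", "rb") as f:
        queries = pickle.load(f)
    with open("test_answers_hard.pkl", "rb") as f:
        hard_answers = pickle.load(f)

    print(type(queries), type(hard_answers))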

    Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.

    The wikikg dataset is intended to be evaluated in an inference-only regime, with models pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.

    Paper pre-print: https://arxiv.org/abs/2210.08008

    The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE

  17. Mars surface image (Curiosity rover) labeled data set version 1

    • s.cnmilf.com
    • data.nasa.gov
    • +2more
    Updated Dec 6, 2023
    + more versions
    Cite
    NASA (2023). Mars surface image (Curiosity rover) labeled data set version 1 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/mars-surface-image-curiosity-rover-labeled-data-set-version-1
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This data set consists of 6691 images spanning 24 classes that were collected by the Mars Science Laboratory (MSL, Curiosity) rover with three instruments (Mastcam Right eye, Mastcam Left eye, and MAHLI). These images are the "browse" version of each original data product, not full resolution; they are roughly 256x256 pixels each. We divided the MSL images into train, validation, and test data sets according to their sol (Martian day) of acquisition. This strategy was chosen to model how the system will be used operationally with an image archive that grows over time. The images were collected from sols 3 to 1060 (August 2012 to July 2015). The exact train/validation/test splits are given in individual files. Full-size images can be obtained from the PDS at https://pds-imaging.jpl.nasa.gov/search/.

  18. Data from: World's Fastest Brain-Computer Interface: Combining EEG2Code with...

    • figshare.com
    bin
    Updated Feb 11, 2019
    Cite
    Sebastian Nagel; Martin Spüler (2019). World's Fastest Brain-Computer Interface: Combining EEG2Code with Deep Learning [Dataset]. http://doi.org/10.6084/m9.figshare.7701065.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 11, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sebastian Nagel; Martin Spüler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description
    1. General description

    Data was recorded using BCI2000 with a g.USBamp (g.tec, Austria) EEG amplifier and 32 electrodes. The sampling rate was set to 600 Hz; data was bandpass-filtered by the amplifier between 0.1 Hz and 60 Hz using a Chebyshev filter of order 8 and notch-filtered at 50 Hz. Data was stored as a MATLAB mat-file.

    2. Experimental description

    The experiment was split into a training phase and a testing phase. During both, the participant had to focus on a target which was modulated with fully random stimulation patterns, presented at 60 bits per second. For training, the participant had to perform 96 runs, each with 4 s of stimulation, which means a total of 96*4*60 = 23,040 bits were presented. For testing, the participant also had to perform 96 runs, but with 5 s of stimulation, which results in 96*5*60 = 28,800 bits.

    3. Variable description

    The file VP1.mat contains the following variables:

    • train_data_x: the raw EEG data of the training runs, split by run. Dimensions: #runs x #channels x #samples
    • train_data_y: the stimulation pattern for each training run, upsampled to be synchronized with the EEG data. Dimensions: #runs x #samples
    • test_data_x: the raw EEG data of the test runs, split by run. Dimensions: #runs x #channels x #samples
    • test_data_y: the stimulation pattern for each test run, upsampled to be synchronized with the EEG data. Dimensions: #runs x #samples

    The file VP1.hdf5 is the Keras CNN model which was trained during the online experiment. The file EEG2Code.py is a python script which takes the MAT-file as input and outputs the pattern prediction accuracy for each of the test runs. Note that the script searches for a Keras model with the same name as the MAT-file (but with the hdf5 file extension); if the model exists, it will be loaded, otherwise a new model will be trained.
  19. Input Files and Code for: Machine learning can accurately assign geologic...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters [Dataset]. https://catalog.data.gov/dataset/input-files-and-code-for-machine-learning-can-accurately-assign-geologic-basin-to-produced
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    As hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey's National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province.

    Two datasets were used: one with fewer attributes (n = 7) but more samples (n = 58,541), named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271), named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than on the PWGD9 dataset, suggesting that a larger sample size and/or fewer attributes lead to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9.

    Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity, while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
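    As a rough illustration of the modeling setup described (an 80/20 split plus a random-forest classifier), the sketch below uses scikit-learn; the file name, column names, and hyperparameters are placeholders, not the study's exact configuration:

    # Hedged sketch of the described workflow: 80/20 split + random forest on geochemical features.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("pwgd7.csv")  # hypothetical export of the PWGD7 attribute subset
    features = ["specific_gravity", "pH", "HCO3", "Na", "Ca", "Cl", "TDS"]  # placeholder columns
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["province"], test_size=0.2, stratify=df["province"], random_state=0)

    model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
    print(balanced_accuracy_score(y_test, model.predict(X_test)))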

  20. Underwater Plastic dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 27, 2022
    Cite
    Machado, Pedro (2022). Underwater Plastic dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6907229
    Explore at:
    Dataset updated
    Jul 27, 2022
    Dataset authored and provided by
    Machado, Pedro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was generated using the Roboflow platform. The annotations are compatible with the PyTorch YOLOv5 architecture.

    Dataset details:

    Images: 1220 images

    Image Split:

    Train / Test Split: 92

    Training Set: 1.1k

    Preprocessing

    Auto-Orient: Applied

    Resize: Stretch to 416x416

    Augmentations

    Outputs per training example: 5

    Flip: Horizontal, Vertical

    Crop: 0% Minimum Zoom, 49% Maximum Zoom

    Grayscale: Apply to 47% of images

    Hue: Between -25° and +25°

    Saturation: Between -42% and +42%

    Exposure: Between -22% and +22%

    Blur: Up to 3.25px

    Cutout: 8 boxes with 10% size each

    Mosaic: Applied

    Details

    Version Name: 2022-07-24 12:50am

    Version ID: 1

    Generated: Jul 24, 2022
