Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source
Here's an example of how the data looks (each class takes three rows):
![Visualized Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png)
The dataset includes a train set (86% of images, 60,000 images) and a test set (14% of images, 10,000 images) only. The train set is split to provide 80% of its images to the training set and 20% of its images to the validation set.

@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
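As a usage illustration, here is a minimal sketch (not part of the original release) that loads Fashion-MNIST with torchvision and reproduces the 80/20 train/validation split described above; the data root and random seed are illustrative:

```python
# Minimal sketch, assuming torchvision is available: load Fashion-MNIST and
# carve the 80/20 train/validation split out of the 60,000-image train set.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_full = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST("data", train=False, download=True, transform=transform)

n_train = int(0.8 * len(train_full))   # 48,000 images
n_val = len(train_full) - n_train      # 12,000 images
train_set, val_set = random_split(
    train_full, [n_train, n_val], generator=torch.Generator().manual_seed(0))

print(len(train_set), len(val_set), len(test_set))  # 48000 12000 10000
```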
Using machine learning techniques in general, and deep learning techniques in particular, requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust machine-learning-based classification and wear-prognostics models, there is a need for real-world datasets to train and test models on.

The dataset contains 1,104 three-channel images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types. One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences of 69 images each.

Instructions for the dataset split: The authors of this dataset provide 3 different types of dataset splits. To get a data split you have to run the Python script split_dataset.py. Script inputs: split-type (mandatory), output directory (mandatory). The different split types are: train_test_split, which splits the dataset into train and test data (80%/20%); wear_dev_split, which splits the dataset into 27 wear developments; and type_split, which splits the dataset into the different BSD types. Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
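Since the annotations follow the standard labelme JSON layout, a hedged sketch of reading one annotation file in Python is shown below; the file name is purely illustrative:

```python
# Minimal sketch: parse one labelme JSON annotation and collect the polygon
# points labeled "pitting". The path is a placeholder; the keys follow the
# standard labelme format ("shapes", "label", "points", "imagePath").
import json

with open("label/example_image.json") as f:   # placeholder path
    ann = json.load(f)

pitting_polygons = [s["points"] for s in ann["shapes"] if s["label"] == "pitting"]
print(ann["imagePath"], "->", len(pitting_polygons), "pitting annotations")
```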
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208).

These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250.

The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
This benchmark data comprises 50 different datasets of materials properties obtained from 16 previous publications. The data includes both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6,354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.
For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the leave-one-out cross-validation method.
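A minimal sketch of the two splitting strategies described above, written with scikit-learn; the placeholder arrays stand in for one of the benchmark datasets:

```python
# 5-fold CV for datasets with more than 100 samples, leave-one-out otherwise,
# mirroring the splitting rule described above. X and y are placeholder data.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

rng = np.random.default_rng(0)
X, y = rng.random((120, 5)), rng.random(120)   # placeholder dataset

splitter = KFold(n_splits=5, shuffle=True, random_state=0) if len(X) > 100 else LeaveOneOut()
for fold, (train_idx, test_idx) in enumerate(splitter.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
```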
For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
The BUTTER Empirical Deep Learning Dataset represents an empirical study of deep learning phenomena on dense fully connected networks, scanning across thirteen datasets, eight network shapes, fourteen depths, twenty-three network sizes (number of trainable parameters), four learning rates, six minibatch sizes, four levels of label noise, and fourteen levels each of L1 and L2 regularization. Multiple repetitions (typically 30, sometimes 10) of each combination of hyperparameters were performed, and statistics including training and test loss (using an 80%/20% shuffled train-test split) are recorded at the end of each training epoch. In total, this dataset covers 178 thousand distinct hyperparameter settings ("experiments"), 3.55 million individual training runs (an average of 20 repetitions of each experiment), and a total of 13.3 billion training epochs (three thousand epochs were covered by most runs). Accumulating this dataset consumed 5,448.4 CPU core-years, 17.8 GPU-years, and 111.2 node-years.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example Image:
![Example Image](https://i.imgur.com/7spoIJT.png)
One could use this dataset to, for example, build a classifier of workers who are abiding by the safety code within a workplace versus those who may not be. It is also a good general dataset for practice.
Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Image Preprocessing | Image Augmentation | Modify Classes
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
Choosing Between Computer Vision Model Sizes | Roboflow Train
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gramatika is a synthetic GEC dataset for Indonesian. The Gramatika dataset has a total of 1.5 million sentences with 4,666,185 errors. Of all sentences, only 30,000 (2%) are correct sentences with no mistakes. Each sentence has a maximum of 6 errors, and there can only be 2 of the same error type in each sentence. We also split the dataset into three splits: train, dev, and test, with a proportion of 8:1:1 (with sizes of 1,199,705, 150,171, and 150,124 sentences, respectively). The proportion of valid sentences in each split is 2%: 24,000 in the train split, and 3,000 in each of the dev and test splits. Moreover, we also set the proportion of each error type to be the same in all splits, as shown in Table 3.3.2. For example, the proportion of noun errors is 7.5% in all splits, while the proportion of particle errors is only 0.3% in all splits.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
A fully annotated subset of the SO242/2_233-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:
For a definition of the classes see [1].
Related datasets:
This dataset contains the following files:
* annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
* annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
* images/: Directory that contains all the original image files.
* dataset.json: JSON file that contains information about the dataset.
  * name: The name of the dataset.
  * images_dir: Name of the directory that contains the original image files.
  * metadata_file: Path to the CSV file that contains image metadata.
  * test_annotations_file: Path to the CSV file that contains the test annotations.
  * train_annotations_file: Path to the CSV file that contains the train annotations.
  * annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
  * crop_dimension: Edge length of an annotation or style patch in pixels.
* metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Here we supply the training and test data as used in the publication "Convolutional Neural Network Applied for Nanoparticle Classification using Coherent Scatterometry Data" by D. Kolenov, D. Davidse, J. Le Cam, and S.F. Pereira.
We present the "main dataset" samples in the pixel size of both 150x150 and 100x100, and for the three "fooling datasets" the pixel size is 100x100. On average each dataset contains 1100 images with the .mat extension. The .mat extension is straightforward with MatLab, but it could also be opened in Python or MS Excel. For the "main dataset" the pixels represent the sampling points, and the magnitude of these pixels represent the em field registered as the photocurrent on the split-detector. For the three types of "fooling data" the images of a 1) noisy and 2) mirrored set are also based on the photocurrent; 3) the elephant set is based on the open-source Animal-10 data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
The train set is split to provide 80% of its images to the training set and 20% of its images to the validation set. The digit classes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 are remapped to one, two, three, four, five, six, seven, eight, nine. The dataset includes a train set (86% of images, 60,000 images) and a test set (14% of images, 10,000 images) only.

@article{lecun2010mnist,
title={MNIST handwritten digit database},
author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
volume={2},
year={2010}
}
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for
* object detection: bounding boxes and per-instance segmentation masks with 80 object categories
* captioning: natural language descriptions of the images (see MS COCO Captions)
* keypoints detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle)
* stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff)
* panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road)
* dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations; each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model

The annotations are publicly available only for training and validation images.
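As an illustration of how these annotations are typically consumed, here is a minimal sketch using the pycocotools API; the file path assumes the standard COCO 2017 directory layout:

```python
# Minimal sketch: load the 2017 validation instance annotations and list the
# category name and bounding box of each annotation on one image.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")   # path assumes the standard layout
cats = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}

img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=[img_id]))
print([(cats[a["category_id"]], a["bbox"]) for a in anns])  # bbox is [x, y, w, h]
```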
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of ids for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
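A hedged sketch of the stratified validation draw mentioned above; the file name and column names ("pair_id", "label") are assumptions about the corpus layout rather than confirmed details:

```python
# Minimal sketch: draw a stratified validation split from one training set so
# the match / no-match ratio is preserved. File and column names are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split

pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # assumed file name
train_ids, val_ids = train_test_split(
    pairs["pair_id"], test_size=0.2, stratify=pairs["label"], random_state=0)
print(len(train_ids), "training pairs,", len(val_ids), "validation pairs")
```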
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
The LIAR dataset has been widely followed by fake news detection researchers since its release, and along with a great deal of research, the community has provided a variety of feedback on the dataset to improve it. We adopted this feedback and released the LIAR2 dataset, a new benchmark dataset of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We used a split ratio of 8:1:1 to distinguish between the training set, the test set, and the validation set, details of which are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed at Huggingface and Github, and statistical information for LIAR and LIAR2 is provided in the table below:
Statistics | LIAR | LIAR2 |
---|---|---|
Training set size | 10,269 | 18,369 |
Validation set size | 1,284 | 2,297 |
Testing set size | 1,283 | 2,296 |
Avg. statement length (tokens) | 17.9 | 17.7 |
Avg. speaker description length (tokens) | \ | 39.4 |
Avg. justification length (tokens) | \ | 94.4 |
Labels | ||
Pants on fire | 1,050 | 3,031 |
False | 2,511 | 6,605 |
Barely-true | 2,108 | 3,603 |
Half-true | 2,638 | 3,709 |
Mostly-true | 2,466 | 3,429 |
True | 2,063 | 2,585 |
Ablation Experiment

The LIAR2 dataset is an upgrade of the LIAR dataset: it inherits the ideas of the LIAR dataset, refines the details and architecture, and expands the size of the dataset to make it more responsive to the needs of fake news detection tasks. We believe that the LIAR2 dataset will enable better fake news detection. The analysis and baseline information about the LIAR2 dataset is provided below.
Feature | Val. Accuracy | Val. F1-Macro | Val. F1-Micro | Test Accuracy | Test F1-Macro | Test F1-Micro | Mean |
---|---|---|---|---|---|---|---|
Statement | 0.3174 | 0.1957 | 0.3117 | 0.3197 | 0.2380 | 0.3197 | 0.2837 |
Date | 0.2912 | 0.1879 | 0.2912 | 0.3079 | 0.1775 | 0.3079 | 0.2606 |
Subject | 0.3243 | 0.2311 | 0.3183 | 0.3267 | 0.2271 | 0.3267 | 0.2924 |
Speaker | 0.3283 | 0.2250 | 0.3174 | 0.3310 | 0.2462 | 0.3310 | 0.2965 |
Speaker Description | 0.3322 | 0.2444 | 0.3250 | 0.3280 | 0.2444 | 0.3280 | 0.3003 |
State Info | 0.2930 | 0.1577 | 0.2950 | 0.2979 | 0.1521 | 0.2979 | 0.2489 |
Credibility History | 0.5007 | 0.4696 | 0.4985 | 0.5057 | 0.4656 | 0.5057 | 0.4910 |
Context | 0.2982 | 0.1817 | 0.2982 | 0.3132 | 0.1791 | 0.3132 | 0.2639 |
Justification | 0.5964 | 0.5657 | 0.5827 | 0.6115 | 0.5968 | 0.6115 | 0.5941 |
All without | |||||||
Statement | 0.7079 | 0.6734 | 0.6822 | 0.7182 | 0.7108 | 0.7182 | 0.7018 |
Date | 0.6931 | 0.6572 | 0.6680 | 0.7078 | 0.6993 | 0.7078 | 0.6889 |
Subject | 0.7000 | 0.6579 | 0.6681 | 0.7078 | 0.7013 | 0.7078 | 0.6905 |
Speaker | 0.6944 | 0.6648 | 0.6757 | 0.7043 | 0.6942 | 0.7043 | 0.6896 |
Speaker Description | 0.6892 | 0.6640 | 0.6739 | 0.7169 | 0.7073 | 0.7169 | 0.6947 |
State Info | 0.7074 | 0.6625 | 0.6729 | 0.7099 | 0.7016 | 0.7099 | 0.6940 |
Credibility History | 0.6025 | 0.5717 | 0.5900 | 0.6185 | 0.6046 | 0.6185 | 0.6010 |
Context | 0.7005 | 0.6622 | 0.6720 | 0.7043 | 0.6967 | 0.7043 | 0.6900 |
Justification | 0.5285 | 0.4898 | 0.5153 | 0.5340 | 0.5148 | 0.5340 | 0.5194 |
Statement + | |||||||
Date | 0.3431 | 0.2540 | 0.3343 | 0.3380 | 0.2514 | 0.3380 | 0.3098 |
Subject | 0.3548 | 0.2759 | 0.3513 | 0.3375 | 0.2580 | 0.3375 | 0.3192 |
Speaker | 0.3618 | 0.2862 | 0.3539 | 0.3476 | 0.2640 | 0.3476 | 0.3269 |
Speaker Description | 0.3583 | 0.2814 | 0.3531 | 0.3667 | 0.2886 | 0.3667 | 0.3358 |
State Info | 0.3317 | 0.2367 | 0.3294 | 0.3328 | 0.2362 | 0.3328 | 0.2999 |
Credibility History | 0.5067 | 0.4737 | 0.5084 | 0.5244 | 0.5000 | 0.5244 | 0.5063 |
Context | 0.3361 | 0.2682 | 0.3391 | 0.3458 | 0.2560 | 0.3458 | 0.3152 |
Justification | 0.6017 | 0.5578 | 0.5796 | 0.6176 | 0.6026 | 0.6176 | 0.5962 |
All | 0.6974 | 0.6570 | 0.6676 | 0.7021 | 0.6961 | 0.7021 | 0.6871 |
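A hedged sketch of loading the LIAR2 splits with the Hugging Face datasets library; the repository ID and field names below are assumptions, so substitute the actual ID from the Huggingface link above:

```python
# Minimal sketch: load LIAR2 from the Hugging Face hub. The repository ID and
# field names are assumptions; check the dataset card for the real values.
from datasets import load_dataset

liar2 = load_dataset("chengxuphd/liar2")   # hypothetical repository ID
print({split: len(ds) for split, ds in liar2.items()})  # expect ~18,369 / 2,297 / 2,296
sample = liar2["train"][0]
print(sample["statement"], "->", sample["label"])       # assumed field names
```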
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
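A minimal sketch of loading MMLU from the linked Hugging Face page, assuming the datasets library is installed; the "all" configuration aggregates the individual subject configurations:

```python
# Minimal sketch: load the aggregated MMLU configuration and print one
# multiple-choice question from the test split.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")
example = mmlu["test"][0]
print(example["subject"], "-", example["question"])
print(example["choices"], "answer index:", example["answer"])
```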
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, the wikikg dataset was created from OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate a relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% compared to the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time, the more complex the task is. The Wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files:
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The Wikikg dataset is supposed to be evaluated in the inference-only regime, being pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
This data set consists of 6691 images spanning 24 classes that were collected by the Mars Science Laboratory (MSL, Curiosity) rover using three instruments (Mastcam Right eye, Mastcam Left eye, and MAHLI). These images are the "browse" version of each original data product, not full resolution. They are roughly 256x256 pixels each. We divided the MSL images into train, validation, and test data sets according to their sol (Martian day) of acquisition. This strategy was chosen to model how the system will be used operationally with an image archive that grows over time. The images were collected from sols 3 to 1060 (August 2012 to July 2015). The exact train/validation/test splits are given in individual files. Full-size images can be obtained from the PDS at https://pds-imaging.jpl.nasa.gov/search/ .
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As more hydrocarbon production from hydraulic fracturing and other methods produce large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than PWGD9 dataset, suggesting that either a larger sample size and/or fewer attributes lead to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggests that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
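A hedged sketch of the classification workflow described above (an 80/20 split plus a random forest), written with scikit-learn; the file name and column names are assumptions about the PWGD9 table layout, not confirmed details of the study:

```python
# Minimal sketch: predict geologic province from the nine water-chemistry
# attributes with an 80/20 split and a random forest. File and column names
# are placeholders; max_features=5 loosely mirrors the reported mtry = 5.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("pwgd9_cleaned.csv")                 # placeholder file name
features = ["SPGRAV", "PH", "HCO3", "Na", "Mg", "Ca", "Cl", "SO4", "TDS"]  # assumed columns
X, y = df[features], df["PROVINCE"]                   # assumed province column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```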
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated using the Roboflow platform. The annotations are compatible with the PyTorch YOLOv5 architecture.
Dataset details:
Image Split:
Train / Test Split: 92
Training Set: 1.1k
Preprocessing
Auto-Orient: Applied
Resize: Stretch to 416x416
Augmentations
Outputs per training example: 5
Flip: Horizontal, Vertical
Crop: 0% Minimum Zoom, 49% Maximum Zoom
Grayscale: Apply to 47% of images
Hue: Between -25° and +25°
Saturation: Between -42% and +42%
Exposure: Between -22% and +22%
Blur: Up to 3.25px
Cutout: 8 boxes with 10% size each
Mosaic: Applied
Details
Version Name: 2022-07-24 12:50am
Version ID: 1
Generated: Jul 24, 2022