100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1,000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
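
    The difference between the biased and unbiased protocols is easy to see in code. Below is a minimal sketch of nested CV with scikit-learn; the synthetic data and parameter grid are illustrative assumptions, not the study's setup:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    # Small, high-dimensional synthetic data: many features, few samples.
    X, y = make_classification(n_samples=60, n_features=500, n_informative=10, random_state=0)

    inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tunes hyper-parameters
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

    # The outer folds never see the data used for tuning, so the estimate is
    # unbiased; any feature selection should likewise sit inside the inner loop
    # (e.g., within a Pipeline) rather than run on the pooled data.
    tuned_svc = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
    scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)
    print(f"Nested CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")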

  2. AIME_2024-train-test

    • huggingface.co
    Updated Nov 3, 2025
    Cite
    Jeong Seong Cheol (2025). AIME_2024-train-test [Dataset]. https://huggingface.co/datasets/lejelly/AIME_2024-train-test
    Explore at:
    Dataset updated
    Nov 3, 2025
    Authors
    Jeong Seong Cheol
    Description

    AIME 2024 (train/test 1:9)

    This dataset is derived from Maxwell-Jia/AIME_2024 by splitting the original single train split into train and test with a ratio of 1:9.

      Source

    Original dataset: Maxwell-Jia/AIME_2024
    Link: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

      License

    Inherits the original dataset's license (MIT) unless otherwise noted in this repository.

      Splitting Details

    Method: datasets.Dataset.train_test_split
    Test size: 90.0%… See the full description on the dataset page: https://huggingface.co/datasets/lejelly/AIME_2024-train-test.
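
    A minimal sketch of reproducing this split with the datasets library (the seed value is an assumption; the author's exact setting is not stated):

    from datasets import load_dataset

    # Load the original single train split, then carve out 90% as test (train:test = 1:9).
    ds = load_dataset("Maxwell-Jia/AIME_2024", split="train")
    splits = ds.train_test_split(test_size=0.9, seed=42)  # seed is illustrative
    train_ds, test_ds = splits["train"], splits["test"]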

  3. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  4. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  5. landslide-train-test-valid-img-mask-2

    • kaggle.com
    zip
    Updated Nov 21, 2025
    Cite
    Equinn Yip (2025). landslide-train-test-valid-img-mask-2 [Dataset]. https://www.kaggle.com/datasets/equinnyip/landslide-train-test-valid-img-mask-2
    Explore at:
    zip (3058599924 bytes)
    Dataset updated
    Nov 21, 2025
    Authors
    Equinn Yip
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Landslide4Sense dataset has three splits, training/validation/test, consisting of 3799, 245, and 800 image patches, respectively. Each image patch is a composite of 14 bands that include:

    Multispectral data from Sentinel-2: B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12.

    Slope data from ALOS PALSAR: B13.

    Digital elevation model (DEM) from ALOS PALSAR: B14.

    All bands in the competition dataset are resized to a resolution of ~10 m per pixel. The image patches are 128 x 128 pixels and are labeled pixel-wise.
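
    As a sketch, the band grouping above maps onto array slices as follows, assuming channel-last patches (the on-disk file format is not specified here):

    import numpy as np

    patch = np.zeros((128, 128, 14))  # placeholder for one 14-band image patch
    sentinel2 = patch[..., :12]       # B1-B12: Sentinel-2 multispectral bands
    slope = patch[..., 12]            # B13: slope from ALOS PALSAR
    dem = patch[..., 13]              # B14: DEM from ALOS PALSAR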

  6. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt; Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt; Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research work: it documents the methods used to obtain the results. Here, the entire research implementation is done in Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The dataset can be downloaded as a zip file from the link provided and consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output, and the dataset has multivariate characteristics. Data preprocessing is performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The dataset has numerical features. The target value is a five-level personality consisting of serious, lively, responsible, dependable & extraverted. The preprocessed dataset is further split into training and testing datasets. This is achieved by passing the feature values, target values, and test size to the train-test split method of the scikit-learn package. After splitting the data, the training data is used to fit the Logistic Regression & SVM models, and the test data is then used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets: a training set (80%) and a testing set (20%). You train the model using the training set. In this model, we trained our dataset using linear_model.LogisticRegression() & svm.SVC() from the sklearn package.

    5. Personality Prediction Output

    After training, the Logistic Regression & SVM models are evaluated on the test data using cohen_kappa_score & accuracy_score.
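
    A minimal sketch of steps 2-5 with scikit-learn; the file and column names are hypothetical and the actual CSV layout may differ:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    df = pd.read_csv("train.csv")         # hypothetical path
    X = df.drop(columns=["personality"])  # hypothetical target column
    y = df["personality"]

    # 80/20 train/test split as described above.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for model in (LogisticRegression(max_iter=1000), SVC()):
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(type(model).__name__,
              accuracy_score(y_test, pred),
              cohen_kappa_score(y_test, pred))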

  7. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +3more
    zip
    Updated Aug 10, 2022
    + more versions
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Explore at:
    zip
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Authors: Han Xiao, Kashif Rasul, Roland Vollgraf

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. (Source: https://github.com/zalandoresearch/fashion-mnist)

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png (visualized Fashion-MNIST dataset)
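
    One common way to load the dataset with these exact splits (via Keras; this is not specific to the Roboflow version):

    import tensorflow as tf

    # Returns the standard 60,000/10,000 train/test split of 28x28 grayscale images.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)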

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.
    • This version was not trained

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Train/test split guide: https://blog.roboflow.com/train-test-split/ (Train/Valid/Test split rebalancing: https://i.imgur.com/angfheJ.png)

    Citation:

    @online{xiao2017/online,
      author      = {Han Xiao and Kashif Rasul and Roland Vollgraf},
      title       = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
      date        = {2017-08-28},
      year        = {2017},
      eprintclass = {cs.LG},
      eprinttype  = {arXiv},
      eprint      = {cs.LG/1708.07747},
    }
    
  8. Fake News Detection

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    KranNaik777 (2025). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/krannaik777/train-news
    Explore at:
    zip (38846301 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    KranNaik777
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." The articles were collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. Articles vary in length, providing rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification, and such corpora often contain thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Some versions of the dataset may also include image URLs for multimodal analysis, extending detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.

    Here is a description for each column header of the fake news dataset:

    id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.

    headline: The title or headline of the news article, summarizing the key news story in brief.

    written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.

    news: The full text content of the news article, which is the main body used for analysis and classification.

    label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.

    This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.
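
    A minimal sketch of inspecting these columns with pandas (the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("train.csv")  # hypothetical file name
    print(df[["id", "headline", "written by", "news", "label"]].head())
    print(df["label"].value_counts())  # check the fake/real balance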

  9. Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 6, 2021
    Cite
    Henderson, N. Ashley; Kauwe, K. Steven; Sparks, D. Taylor (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4903957
    Explore at:
    Dataset updated
    Jun 6, 2021
    Authors
    Henderson, N. Ashley; Kauwe, K. Steven; Sparks, D. Taylor
    Description

    This benchmark data comprises 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6,354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

    For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method.
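
    A sketch of the two split regimes described above using scikit-learn; the 100-sample threshold follows the description, and the fold count varies by source paper:

    from sklearn.model_selection import KFold, LeaveOneOut

    def make_splitter(n_samples, n_folds=5):
        # Larger datasets use 5- or 10-fold CV; under 100 samples, Leave-One-Out.
        if n_samples > 100:
            return KFold(n_splits=n_folds, shuffle=True, random_state=0)
        return LeaveOneOut()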

    For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  10. PGD MNIST

    • kaggle.com
    zip
    Updated Mar 2, 2024
    Cite
    Kishore Sudula (2024). PGD MNIST [Dataset]. https://www.kaggle.com/datasets/sudulakishore/pgd-mnist
    Explore at:
    zip (55759208 bytes)
    Dataset updated
    Mar 2, 2024
    Authors
    Kishore Sudula
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    PGD MNIST is a dataset of adversarial images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. PGD MNIST intends to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms on adversarial examples. It shares the same image size and structure of training and testing splits.

    The original MNIST dataset contains a lot of handwritten digits. Members of the AI/ML/Data Science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset researchers try. "If it doesn't work on MNIST, it won't work at all", they said. "Well, if it does work on MNIST, it may still fail on others."

    Check your model's accuracy on this dataset and improve its adversarial robustness.

    Content

    Each image is 28 pixels in height and 28 pixels in width, 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column contains the class label (see above); the remaining columns contain the pixel-values of the associated image.

    To locate a pixel in the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is then located on row i and column j of a 28 x 28 matrix. For example, x = 31 indicates the pixel in the fourth column from the left and the second row from the top.

    Columns description

    • Each row is a separate image.
    • Column 1 is the class label (integers 0-9).
    • The remaining 784 columns are pixel values.
    • Each value is the darkness of the pixel (an integer from 0 to 255).
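
    A short sketch of the index arithmetic and reshaping described above:

    import numpy as np

    # Recover row/column from a flat pixel index x = i * 28 + j.
    x = 31
    i, j = divmod(x, 28)  # i = 1 (second row), j = 3 (fourth column), 0-indexed

    # Reshape one 784-value row back into a 28x28 image.
    pixels = np.arange(784)  # placeholder for one row of pixel values
    image = pixels.reshape(28, 28)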

  11. Probing Datasets for Noisy Texts

    • federation.figshare.com
    • researchdata.edu.au
    Updated Mar 14, 2021
    Cite
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
    Explore at:
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Federation University Australia
    Authors
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Probing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model's performance will vary depending on the richness of the linguistic properties crammed into the representation.

    This dataset contains five new probing datasets consisting of noisy texts (Tweets) which can serve as benchmark datasets for researchers to study the linguistic characteristics of unstructured and noisy texts.

    File Structure

    Format: a tab-separated text file

    Column 1: train/test/validation split (tr - train, te - test, va - validation)
    Column 2: class label (refer to the Content section for the class labels of each task file)
    Column 3: Tweet message (text)
    Column 4: a unique ID

    Content

    sent_len.tsv: In this classification task, the goal is to predict which of 8 bins (0-7) a sentence's length falls into; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called "SentLen" in the paper.

    word_content.tsv: We consider a 10-way classification task with 10 words as targets, considering the available manually annotated instances. The task is predicting which of the target words appears in the given sentence. We considered only words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as "WC" in the paper.

    bigram_shift.tsv: The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as "BShift" in the paper.

    tree_depth.tsv: The Tree Depth task evaluates the encoded sentence's ability to capture hierarchical structure by asking the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The task is referred to as "TreeDepth" in the paper.

    odd_man_out.tsv: The Tweets are modified by replacing a random noun or verb o with another noun or verb r. The task of the classifier is to identify whether the sentence was modified by this change. Class label O refers to unmodified sentences, while C refers to modified sentences. The task is called "SOMO" in the paper.
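
    A minimal loading sketch for one task file, following the column layout above (column names are illustrative; adjust header handling if the files include a header row):

    import pandas as pd

    cols = ["split", "label", "tweet", "id"]  # columns 1-4 as described above
    df = pd.read_csv("bigram_shift.tsv", sep="\t", header=None, names=cols)
    train = df[df["split"] == "tr"]
    test = df[df["split"] == "te"]
    valid = df[df["split"] == "va"]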

  12. Train, validation and test split size used in this experiment.

    • plos.figshare.com
    xls
    Updated Oct 10, 2024
    Cite
    Yuanqing Ding; Fanliang Bu; Hanming Zhai; Zhiwen Hou; Yifan Wang (2024). Train, validation and test split size used in this experiment. [Dataset]. http://doi.org/10.1371/journal.pone.0311720.t002
    Explore at:
    xls
    Dataset updated
    Oct 10, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yuanqing Ding; Fanliang Bu; Hanming Zhai; Zhiwen Hou; Yifan Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Train, validation and test split size used in this experiment.

  13. test-olmocr2

    • huggingface.co
    Updated Oct 23, 2025
    Cite
    Daniel van Strien (2025). test-olmocr2 [Dataset]. https://huggingface.co/datasets/davanstrien/test-olmocr2
    Explore at:
    Dataset updated
    Oct 23, 2025
    Authors
    Daniel van Strien
    Description

    Document OCR using olmOCR-2-7B-1025-FP8

    This dataset contains markdown-formatted OCR results from images in davanstrien/test-olmocr2 using olmOCR-2-7B.

      Processing Details

    Source Dataset: davanstrien/test-olmocr2
    Model: allenai/olmOCR-2-7B-1025-FP8
    Number of Samples: 100
    Processing Time: 0h 3m 32s
    Processing Date: 2025-10-23 17:00 UTC

      Configuration

    Image Column: image
    Output Column: markdown
    Dataset Split: train
    Batch Size: 512
    Max Model Length: 16,384… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/test-olmocr2.

  14. rolm-test

    • huggingface.co
    Updated Aug 4, 2025
    Cite
    Daniel van Strien (2025). rolm-test [Dataset]. https://huggingface.co/datasets/davanstrien/rolm-test
    Explore at:
    Dataset updated
    Aug 4, 2025
    Authors
    Daniel van Strien
    Description

    OCR Text Extraction using RolmOCR

    This dataset contains extracted text from images in davanstrien/playbills-pdf-images-text using RolmOCR.

      Processing Details

    Source Dataset: davanstrien/playbills-pdf-images-text
    Model: reducto/RolmOCR
    Number of Samples: 10
    Processing Time: 5.8 minutes
    Processing Date: 2025-08-04 17:08 UTC

      Configuration

    Image Column: image
    Output Column: rolmocr_text
    Dataset Split: train
    Batch Size: 16
    Max Model Length: 24,000 tokens
    Max… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/rolm-test.

  15. Glaucoma Dataset: EyePACS-AIROGS-light-V2

    • kaggle.com
    zip
    Updated Mar 9, 2024
    Cite
    Riley Kiefer (2024). Glaucoma Dataset: EyePACS-AIROGS-light-V2 [Dataset]. https://www.kaggle.com/datasets/deathtrooper/glaucoma-dataset-eyepacs-airogs-light-v2/code
    Explore at:
    zip (549533071 bytes)
    Dataset updated
    Mar 9, 2024
    Authors
    Riley Kiefer
    Description

    News: Now with a 10.0 Kaggle usability score: supplemental metadata.csv file added to dataset.

    Overview: This is an improved machine-learning-ready glaucoma dataset using a balanced subset of standardized fundus images from the Rotterdam EyePACS AIROGS [1] set. The dataset is split into training, validation, and test folders, which contain 4,000 (~84%), 385 (~8%), and 385 (~8%) fundus images per class, respectively. Each split has a folder for each class: referable glaucoma (RG) and non-referable glaucoma (NRG). This dataset is designed to make it easy to benchmark your glaucoma classification models in Kaggle. Please make a contribution in the code tab; I have created a template to make it even easier!
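
    A minimal loading sketch for this folder layout using torchvision; folder paths and transform choices are assumptions based on the description:

    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])
    train_ds = datasets.ImageFolder("train", transform=tfm)  # RG/ and NRG/ subfolders
    val_ds = datasets.ImageFolder("validation", transform=tfm)
    test_ds = datasets.ImageFolder("test", transform=tfm)
    print(train_ds.classes)  # e.g., ['NRG', 'RG']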

    Please cite the dataset and at least the first of my related works if you found this dataset useful!

    • Riley Kiefer. "EyePACS-AIROGS-light-V2". Kaggle, 2024, doi: 10.34740/KAGGLE/DSV/7802508.
    • Riley Kiefer. "EyePACS-AIROGS-light-V1". Kaggle, 2023, doi: 10.34740/kaggle/ds/3222646.
    • Riley Kiefer. "Standardized Multi-Channel Dataset for Glaucoma, v19 (SMDG-19)". Kaggle, 2023, doi: 10.34740/kaggle/ds/2329670
    • Steen, J., Kiefer, R., Ardali, M., Abid, M. & Amjadian, E. Standardized and Open-Access Glaucoma Dataset for Artificial Intelligence Applications. Invest. Ophthalmol. Vis. Sci. 64, 384–384 (2023).
    • Amjadian, E., Ardali, M. R., Kiefer, R., Abid, M. & Steen, J. Ground truth validation of publicly available datasets utilized in artificial intelligence models for glaucoma detection. Invest. Ophthalmol. Vis. Sci. 64, 392–392 (2023).
    • R. Kiefer, M. Abid, M. R. Ardali, J. Steen and E. Amjadian, "Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm," 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 2023, pp. 460-465, doi: 10.1109/ICIVC58118.2023.10270429.
    • Kiefer, Riley, et al. "A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images." Proceedings of the 2023 7th International Conference on Information System and Data Mining. 2023.
    • R. Kiefer, J. Steen, M. Abid, M. R. Ardali and E. Amjadian, "A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images," 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2022, pp. 0191-0196, doi: 10.1109/IEMCON56893.2022.9946629.
    • E. Amjadian, R. Kiefer, J. Steen, M. Abid, M. Ardali, "A Comprehensive Survey of Publicly Available Glaucoma Datasets for Automated Glaucoma Detection". American Academy of Optometry. 2022.

    Improvements from v1:

    • Images are standardized according to the CROP methodology (remove black background before resizing), based on an ablation study of the image standardization methods applied to dataset v1 [3]. This method yields more of the actual fundus foreground in the resultant image.
    • Increased the image resize dimensions from 256x256 pixels to 512x512 pixels. Reason: provides greater model input flexibility, detail, and size, and better supports the ONH-cropping models.
    • Added 3000 images from the Rotterdam EyePACS AIROGS dev set. Reason: more data samples can improve model generalizability.
    • Readjusted the train/val/test split. Reason: the validation and test split sizes were different.
    • Improved sampling from the source dataset. Reason: v1 NRG samples were not randomly selected.

    Drawbacks of Rotterdam EyePACS AIROGS: One of the largest drawbacks of the original dataset is its accessibility. It requires a long download and a large amount of storage, spans several folders, and is not machine-learning-ready (it requires data processing and splitting). The dataset also contains raw fundus images in their original dimensions; these often include a large amount of black background, and the dimensions are too large for machine learning inputs. The proposed dataset addresses these concerns through image sampling and image standardization, to balance and reduce the dataset size respectively.

    Origin: The images in this dataset are sourced from the Rotterdam EyePACS AIROGS [1] dataset, which contains 113,893 color fundus images from 60,357 subjects and approximately 500 different sites with a heterogeneous ethnicity; this impressive dataset is over 60GB when compressed. The first lightweight version of the dataset is known as EyePACS-AIROGS-light (v1) [2].

    About Me: I have studied glaucoma-related research for my computer science master's thesis. Since my graduation, I have dedicated my time to keeping my research up-to-date and relevant for fellow glaucoma researchers. I hope that my research can provi...

  16. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD; Ik Hee Ryu, MD, MS; Tae Keun Yoo, MD; Jung Sub Kim, MD; In Sik Lee, MD, PhD; Jin Kook Kim, MD; Wakako Ando, CO; Nobuyuki Shoji, MD, PhD; Tomofusa Yamauchi, MD, PhD; Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Connect data in your Google Drive.
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preoperative measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size, 2. ICL power, 3. LV, 4. CLR, 5. ACD, 6. ATA,
    # 7. MSE, 8. Age, 9. Pupil size, 10. WTW, 11. CCT, 12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, split 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    # In our study, the training (B&VIIT Eye Center, n=1455) and test
    # (Kitasato University, n=290) datasets were already defined, so this
    # split was not necessary for our analysis.

    # Optimal parameter search could be performed in this section.
    parameters = {'bootstrap': True,
                  'min_samples_leaf': 3,
                  'n_estimators': 500,
                  'criterion': 'mae',
                  'min_samples_split': 10,
                  'max_features': 'sqrt',
                  'max_depth': 6,
                  'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  17. Autism Facial Emotion Recognition

    • kaggle.com
    zip
    Updated Jul 29, 2025
    Cite
    Md. Hasibur Rahman (2025). Autism Facial Emotion Recognition [Dataset]. https://www.kaggle.com/datasets/hasibur013/autism-facial-emotion-recognition
    Explore at:
    zip (208400945 bytes)
    Dataset updated
    Jul 29, 2025
    Authors
    Md. Hasibur Rahman
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Autism Facial Recognition Dataset (Augmented & Split)

    Overview

    This dataset is a significantly augmented and meticulously split version of an existing Autism Facial Recognition Dataset, designed for training and evaluating machine learning models to detect Autism Spectrum Disorder (ASD) from facial images. Through extensive data augmentation and a structured train/test split, this dataset aims to provide a robust and diverse foundation for developing highly accurate and generalizable facial recognition models in the context of ASD research.

    Context

    Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that impacts social interaction, communication, and behavior. Early detection and intervention are crucial for improving outcomes for individuals with ASD. Recent research has explored the potential of computer vision and facial recognition techniques to aid in the preliminary screening or identification of ASD-related facial cues. This dataset is tailored to support such research by providing a rich collection of processed facial images.

    Data Augmentation

    The original dataset has been expanded through a comprehensive augmentation pipeline using the Albumentations library. This process generates 10 augmented versions for each original image, dramatically increasing the dataset size and variety. The augmentation techniques applied include:

    • Geometric Augmentations:
      • HorizontalFlip: Randomly flips images horizontally (p=0.5).
      • Rotate: Rotates images by a random angle within -10 to +10 degrees (p=0.7), filling new pixels with a constant border.
    • Color and Brightness Augmentations:
      • HueSaturationValue: Randomly shifts hue, saturation, and value (brightness) to simulate varying lighting conditions and camera settings (p=0.5).
      • RandomGamma: Adjusts image contrast by applying gamma correction (gamma limit 85-115, p=0.5).
    • Quality and Noise Augmentations:
      • GaussianBlur: Applies a Gaussian blur to introduce slight blurring effects (sigma limit 0.1-1.4, p=0.3).
      • GaussNoise: Adds random Gaussian noise to images, simulating sensor noise (p=0.3).
    • Resizing and Cropping:
      • RandomResizedCrop: Crops a random portion of the image and resizes it to a uniform size of (224, 224) pixels. This helps the model become more robust to variations in image scale and composition (scale 0.91-0.95, p=1.0).

    These augmentations ensure that the trained models are less prone to overfitting and can generalize better to unseen data, accounting for variations in real-world image capture.
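
    A sketch of such a pipeline with Albumentations; the parameter names and values below are reconstructed from the description and are assumptions, not the author's exact script:

    import albumentations as A

    transform = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.Rotate(limit=10, border_mode=0, p=0.7),  # constant border fill
        A.HueSaturationValue(p=0.5),
        A.RandomGamma(gamma_limit=(85, 115), p=0.5),
        A.GaussianBlur(sigma_limit=(0.1, 1.4), p=0.3),
        A.GaussNoise(p=0.3),
        # Newer Albumentations versions use size=(224, 224) instead of height/width.
        A.RandomResizedCrop(height=224, width=224, scale=(0.91, 0.95), p=1.0),
    ])
    # Apply to a numpy image array: augmented = transform(image=image)["image"]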

    Dataset Structure

    The processed dataset is organized into a standard train/test split, making it immediately usable for machine learning workflows. The directory structure is as follows:

    Autism Facial Emotion Recognition Dataset/
    ├── train/
    │  ├── Autistic/
    │  │  ├── image_001_aug_1.jpg
    │  │  ├── ...
    │  └── Non_Autistic/
    │    ├── image_001_aug_1.jpg
    │    └── ...
    └── test/
      ├── Autistic/
      │  ├── image_xxx_aug_y.jpg
      │  ├── ...
      └── Non_Autistic/
        ├── image_xxx_aug_y.jpg
        └── ...
    
    • train/: Contains the training set, comprising approximately 80% of the augmented images from each class.
    • test/: Contains the testing set, comprising the remaining 20% of the augmented images from each class.

    Each class (Autistic and Non_Autistic) maintains its proportional representation within both the training and testing sets to ensure balanced evaluation.

    Image Format

    All images are processed and saved as JPEG files. They are resized to 224x224 pixels, a common input size for many pre-trained deep learning models (e.g., ResNet, VGG, MobileNet).

    Usage

    This dataset is ideal for:

    • Training deep learning models for facial recognition and classification (e.g., CNNs, Vision Transformers).
    • Transfer learning experiments using pre-trained models.
    • Research into facial biomarkers for ASD.
    • Developing explainable AI (XAI) techniques to understand model decisions related to ASD features.

    Acknowledgements

    The original dataset from which this augmented version was derived should be credited appropriately. Please refer to the source of the "Autism Facial Recognition Dataset" you used as INPUT_DIR.

    Citation

    If you use this dataset in your research, please cite this Kaggle dataset.

  18. Hard Hat Workers Dataset

    • universe.roboflow.com
    zip
    Updated Sep 30, 2022
    + more versions
    Cite
    Joseph Nelson (2022). Hard Hat Workers Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/hard-hat-workers/model/13
    Explore at:
    zip
    Dataset updated
    Sep 30, 2022
    Dataset authored and provided by
    Joseph Nelson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Workers Bounding Boxes
    Description

    Overview

    The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.

    The original dataset has a 75/25 train-test split.

    Example Image: https://i.imgur.com/7spoIJT.png" alt="Example Image">

    Use Cases

    One could use this dataset to, for example, build a classifier that distinguishes workers who are abiding by the safety code within a workplace from those who may not be. It is also a good general dataset for practice.

    Using this Dataset

    Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Dataset Versions:

    Image Preprocessing | Image Augmentation | Modify Classes

    • v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
    • v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
    • v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
    • v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
    • v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
    • v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
    • v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
    • v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
    • v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
    • v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
    • v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

    Choosing Between Computer Vision Model Sizes | Roboflow Train

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  19. S155

    • zenodo.org
    tar
    Updated Oct 6, 2020
    + more versions
    Cite
    Martin Zurowietz; Martin Zurowietz (2020). S155 [Dataset]. http://doi.org/10.5281/zenodo.3603803
    Explore at:
    tar
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Zurowietz; Martin Zurowietz
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A fully annotated subset of the SO242/2_155-1 image dataset. The annotations are given as train and test splits that can be used to evaluate machine learning methods. The following classes of fauna were used for annotation:

    • anemone
    • coral
    • crustacean
    • ipnops fish
    • litter
    • ophiuroid
    • other fauna
    • sea cucumber
    • sponge
    • stalked crinoid

    For a definition of the classes see [1].

    Related datasets:

    This dataset contains the following files:

    • annotations/test.csv: The BIIGLE CSV annotation report of the annotations of the test split of this dataset. These annotations are used to test the performance of the trained Mask R-CNN model.
    • annotations/train.csv: The BIIGLE CSV annotation report of the annotations of the train split of this dataset. These annotations are used to generate the annotation patches which are transformed with scale and style transfer to be used to train the Mask R-CNN model.
    • images/: Directory that contains all the original image files.
    • dataset.json: JSON file that contains information about the dataset.
      • name: The name of the dataset.
      • images_dir: Name of the directory that contains the original image files.
      • metadata_file: Path to the CSV file that contains image metadata.
      • test_annotations_file: Path to the CSV file that contains the test annotations.
      • train_annotations_file: Path to the CSV file that contains the train annotations.
      • annotation_patches_dir: Name of the directory that should contain the scale- and style-transferred annotation patches.
      • crop_dimension: Edge length of an annotation or style patch in pixels.
    • metadata.csv: A CSV file that contains metadata for each original image file. In this case the distance of the camera to the sea floor is given for each image.
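
    A minimal sketch of reading these files using the keys listed above:

    import json
    import pandas as pd

    with open("dataset.json") as f:
        cfg = json.load(f)

    metadata = pd.read_csv(cfg["metadata_file"])
    train_annotations = pd.read_csv(cfg["train_annotations_file"])
    test_annotations = pd.read_csv(cfg["test_annotations_file"])
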
  20. StreetSurfaceVis: a dataset of street-level imagery with annotations of road...

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, zip
    Updated Jan 20, 2025
    + more versions
    Cite
    Alexandra Kapp; Alexandra Kapp; Edith Hoffmann; Edith Hoffmann; Esther Weigmann; Esther Weigmann; Helena Mihaljevic; Helena Mihaljevic (2025). StreetSurfaceVis: a dataset of street-level imagery with annotations of road surface type and quality [Dataset]. http://doi.org/10.5281/zenodo.11449977
    Explore at:
    zip, pdf, csv
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexandra Kapp; Alexandra Kapp; Edith Hoffmann; Edith Hoffmann; Esther Weigmann; Esther Weigmann; Helena Mihaljevic; Helena Mihaljevic
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    StreetSurfaceVis

    StreetSurfaceVis is an image dataset containing 9,122 street-level images from Germany with labels on road surface type and quality. The CSV file streetSurfaceVis_v1_0.csv contains all image metadata, and four folders contain the image files. All images are available in four sizes based on image width: 256px, 1024px, 2048px, and the original size.
    Folders containing the images are named according to the respective image size. Image files are named based on the mapillary_image_id.

    You can find the corresponding publication here: StreetSurfaceVis: a dataset of crowdsourced street-level imagery with semi-automated annotations of road surface type and quality

    Image metadata

    Each CSV record contains information about one street-level image with the following attributes:

    • mapillary_image_id: ID provided by Mapillary (see information below on Mapillary)
    • user_id: Mapillary user ID of contributor
    • user_name: Mapillary user name of contributor
    • captured_at: timestamp, capture time of image
    • longitude, latitude: location the image was taken at
    • train: Suggestion to split train and test data. `True` for train data and `False` for test data. Test data contains data from 5 cities which are excluded from the training data.
    • surface_type: Surface type of the road in the focal area (the center of the lower image half) of the image. Possible values: asphalt, concrete, paving_stones, sett, unpaved
    • surface_quality: Surface quality of the road in the focal area of the image. Possible values: (1) excellent, (2) good, (3) intermediate, (4) bad, (5) very bad (see the attached Labeling Guide document for details)

    Image source

    Images are obtained from Mapillary, a crowd-sourcing platform for street-level imagery. More metadata about each image can be obtained via the Mapillary API. User-generated images are shared by Mapillary under the CC-BY-SA License.

    For each image, the dataset contains the mapillary_image_id and user_name.
    You can access user information on the Mapillary website at https://www.mapillary.com/app/user/<user_name>
    and image information at https://www.mapillary.com/app/?focus=photo&pKey=<mapillary_image_id>

    If you use the provided images, please adhere to the terms of use of Mapillary.

    Instances per class

    Total number of images: 9,122

    Surface type  | excellent | good | intermediate | bad | very bad
    asphalt       |       971 | 1697 |          821 | 246 |        -
    concrete      |       314 |  350 |          250 |  58 |        -
    paving stones |       385 | 1063 |          519 |  70 |        -
    sett          |         - |  129 |          694 | 540 |        -
    unpaved       |         - |    - |          326 | 387 |      303

    For modeling, we recommend using a train-test split where the test data includes geospatially distinct areas, thereby ensuring the model's ability to generalize to unseen regions is tested. We propose five cities varying in population size and from different regions in Germany for testing - images are tagged accordingly.

    Number of test images (train-test split): 776
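
    A sketch of applying the suggested split via the train flag in the metadata CSV:

    import pandas as pd

    df = pd.read_csv("streetSurfaceVis_v1_0.csv")
    train_df = df[df["train"]]   # True: training images
    test_df = df[~df["train"]]   # False: the 776 test images from 5 held-out cities
    print(len(train_df), len(test_df))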

    Inter-rater reliability

    Three annotators labeled the dataset, such that each image was annotated by one person. Annotators were encouraged to consult each other for a second opinion when uncertain.
    1,800 images were annotated by all three annotators, resulting in a Krippendorff's alpha of 0.96 for surface type and 0.74 for surface quality.

    Recommended image preprocessing

    As the focal road located in the bottom center of the street-level image is labeled, it is recommended to crop images to their lower middle half prior to using them for classification tasks.

    This is exemplary code for the recommended image preprocessing in Python:

    from PIL import Image
    img = Image.open(image_path)
    width, height = img.size
    # Keep the lower middle half: bottom 50% of the height, middle 50% of the width.
    img_cropped = img.crop((0.25 * width, 0.5 * height, 0.75 * width, height))


    License

    CC-BY-SA

    Citation

    If you use this dataset, please cite as:

    Kapp, A., Hoffmann, E., Weigmann, E. et al. StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality. Sci Data 12, 92 (2025). https://doi.org/10.1038/s41597-024-04295-9

    @article{kapp_streetsurfacevis_2025,
    title = {{StreetSurfaceVis}: a dataset of crowdsourced street-level imagery annotated by road surface type and quality},
    volume = {12},
    issn = {2052-4463},
    url = {https://doi.org/10.1038/s41597-024-04295-9},
    doi = {10.1038/s41597-024-04295-9},
    pages = {92},
    number = {1},
    journaltitle = {Scientific Data},
    shortjournal = {Scientific Data},
    author = {Kapp, Alexandra and Hoffmann, Edith and Weigmann, Esther and Mihaljević, Helena},
    date = {2025-01-16},
    }


    This is part of the SurfaceAI project at the University of Applied Sciences, HTW Berlin.


    - Prof. Dr. Helena Mihaljević
    - Alexandra Kapp
    - Edith Hoffmann
    - Esther Weigmann

    Contact: surface-ai@htw-berlin.de

    https://surfaceai.github.io/surfaceai/

    Funding: SurfaceAI is an mFund project funded by the German Federal Ministry for Digital and Transport.
