100+ datasets found
  1. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    text/x-python
    Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
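    The feature-selection bias described above is easy to reproduce: on pure-noise data, selecting features on the pooled (training plus testing) data before cross-validation inflates accuracy well above chance, while refitting the selector inside each training fold does not. A minimal sketch with scikit-learn (the dataset sizes, selector, and classifier are illustrative choices, not the paper's exact protocol):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))          # pure noise: no real signal
y = np.repeat([0, 1], 25)

# Biased: features chosen on the pooled data before CV
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# Unbiased: selection refit inside each training fold via a pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"pooled selection: {biased:.2f}, in-fold selection: {honest:.2f}")
```

    On random labels the honest estimate hovers near chance (0.5), while the pooled-selection estimate is substantially higher, mirroring the bias the abstract describes.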

  2. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    .json, .csv, .xlsx
    Available download formats
    Dataset updated
    Dec 23, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  3. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
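    The per-combination cap described above can be sketched with pandas; the column names (`species`, `grid_cell`) are hypothetical, and the 1,250-file cap is the figure quoted in the description:

```python
import pandas as pd

def cap_per_combination(files: pd.DataFrame, cap: int = 1250, seed: int = 0) -> pd.DataFrame:
    """Randomly keep at most `cap` recordings per species/grid-cell combination."""
    return (files.groupby(["species", "grid_cell"], group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), cap), random_state=seed)))
```

    Groups smaller than the cap are kept whole, which matches "until files for each species/grid cell combination were exhausted."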

  4. Data from: S1 Dataset

    • plos.figshare.com
    rar
    Updated Aug 26, 2024
    Cite
    Adel H. Elmetwalli; Asaad Derbala; Ibtisam Mohammed Alsudays; Eman A. Al-Shahari; Mahmoud Elhosary; Salah Elsayed; Laila A. Al-Shuraym; Farahat S. Moghanm; Osama Elsherbiny (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0308826.s001
    Explore at:
    rar
    Available download formats
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Adel H. Elmetwalli; Asaad Derbala; Ibtisam Mohammed Alsudays; Eman A. Al-Shahari; Mahmoud Elhosary; Salah Elsayed; Laila A. Al-Shuraym; Farahat S. Moghanm; Osama Elsherbiny
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Estimation of fruit quality parameters is usually based on destructive techniques, which are tedious, costly and unreliable when dealing with huge amounts of fruits. Alternatively, non-destructive techniques such as image processing and spectral reflectance would be useful in rapid detection of fruit quality parameters. This research study aimed to assess the potential of image processing, spectral reflectance indices (SRIs), and machine learning models such as decision tree (DT) and random forest (RF) to quantitatively estimate characteristics of mandarin and tomato fruits at different ripening stages. Quality parameters such as chlorophyll a (Chl a), chlorophyll b (Chl b), total soluble solids (TSS), titratable acidity (TA), TSS/TA, carotenoids (car), lycopene and firmness were measured. The results showed that Red-Green-Blue (RGB) indices and newly developed SRIs demonstrated high efficiency for quantifying different fruit properties. For example, the R2 of the relationships between all RGB indices (RGBI) and measured parameters varied between 0.62 and 0.96 for mandarin and between 0.29 and 0.90 for tomato. RGBI such as the visible atmospheric resistant index (VARI) and normalized red (Rn) presented the highest R2 = 0.96 with car of mandarin fruits, while the excess red vegetation index (ExR) presented the highest R2 = 0.84 with car of tomato fruits. SRIs such as RSI710,600 and R730,650 showed the greatest R2 values with respect to Chl a (R2 = 0.80) for mandarin fruits, while the GI had the greatest R2 with Chl a (R2 = 0.68) for tomato fruits. Combining RGB indices and SRIs with DT and RF models would be a robust strategy for estimating the eight observed variables with reasonable accuracy. Regarding mandarin fruits, in the task of predicting Chl a, the DT-2HV model delivered exceptional results, registering an R2 of 0.993 with an RMSE of 0.149 for the training set, and an R2 of 0.991 with an RMSE of 0.114 for the validation set.
Similarly, for tomato fruits, the DT-5HV model demonstrated exemplary performance in Chl a prediction, achieving an R2 of 0.905 and an RMSE of 0.077 for the training dataset, and an R2 of 0.785 with an RMSE of 0.077 for the validation dataset. The overall outcomes showed that the RGB indices, the newly developed SRIs, and DT and RF models based on RGBI and SRIs could be used to evaluate the measured parameters of mandarin and tomato fruits.
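    For reference, the RGB indices named above have simple closed forms. The definitions below are the standard ones from the remote-sensing literature and are illustrative; the paper may compute them on normalized channel values:

```python
def vari(r, g, b):
    # Visible atmospheric resistant index: (G - R) / (G + R - B)
    return (g - r) / (g + r - b)

def normalized_red(r, g, b):
    # Rn: red channel normalized by total brightness
    return r / (r + g + b)

def excess_red(r, g, b):
    # ExR (excess red vegetation index): 1.4 * R - G
    return 1.4 * r - g
```

    For example, with channel means (R, G, B) = (0.2, 0.4, 0.1), VARI evaluates to 0.2 / 0.5 = 0.4.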

  5. Emails dataset for Spam Detection

    • kaggle.com
    zip
    Updated Jul 1, 2023
    Cite
    Gokul Raja (2023). Emails dataset for Spam Detection [Dataset]. https://www.kaggle.com/datasets/gokulraja84/emails-dataset-for-spam-detection
    Explore at:
    zip (2994798 bytes)
    Available download formats
    Dataset updated
    Jul 1, 2023
    Authors
    Gokul Raja
    Description

    Contains emails and email descriptions that can be used for spam detection using machine learning.

    The Emails dataset for Spam Detection is a collection of emails that have been labeled as either "spam" or "ham" (non-spam). It is commonly used in the field of machine learning and natural language processing to develop and evaluate models for spam email detection.

    The dataset consists of a large number of emails, each represented as a document or text sample, along with its corresponding label indicating whether it is spam or ham. The emails in the dataset vary in length and content, mimicking the diversity found in real-world email communication.

    The purpose of this dataset is to train and test machine learning models to accurately classify incoming emails as either spam or ham. By analyzing the textual content and other features of the emails, models can learn patterns and characteristics that distinguish spam emails from legitimate ones. This helps in automating the process of filtering unwanted and potentially malicious emails, saving users' time and protecting them from scams, phishing attempts, and other fraudulent activities.

    The dataset provides a valuable resource for developing and evaluating different spam detection algorithms and techniques. Researchers and practitioners can apply various machine learning algorithms, such as decision trees, support vector machines, naive Bayes, or deep learning approaches, to build models that can effectively classify emails as spam or ham.

    To use the dataset, researchers can split it into training, validation, and testing sets. The training set is used to train the machine learning models, the validation set is used for tuning hyperparameters and evaluating model performance during development, and the testing set is used to assess the final performance of the trained models on unseen data.
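    The three-way split described above can be sketched with stdlib tools only (the 70/15/15 ratios are an illustrative choice):

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle and split a sequence into train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

    Fixing the seed keeps the split reproducible, so the held-out test set stays untouched across experiments.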

    When working with the dataset, it is important to preprocess the emails by performing tasks such as removing stop words, stemming or lemmatizing words, handling special characters, and converting the text into numerical representations that can be fed into machine learning algorithms. Additionally, feature engineering techniques can be applied to extract relevant features from the emails, such as word frequencies, presence of certain keywords, or structural properties of the email.
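    A minimal, dependency-free sketch of the preprocessing just described: lower-casing, stop-word removal, and word-frequency features. The stop-word list here is a tiny illustrative subset, not a standard list:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}  # illustrative subset

def featurize(email_text: str) -> Counter:
    """Tokenize an email, drop stop words, and return word frequencies."""
    tokens = re.findall(r"[a-z']+", email_text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```

    The resulting counts can be fed to a classifier directly or vectorized against a fixed vocabulary; stemming, lemmatization, and keyword indicators would slot in as extra steps.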

    By leveraging the Emails dataset for Spam Detection, researchers and practitioners can develop robust and accurate spam detection systems that can effectively filter out unwanted emails, enhance email security, and improve overall user experience in managing their inbox.

  6. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, png
    Available download formats
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing machine learning models.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
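      Given the naming conventions above, loading the three subsets is straightforward with pandas. The assumption that the label sits in the last column is illustrative; check the actual files:

```python
import pandas as pd

def load_split(folder: str):
    """Load the train/validation/test CSVs described in the dataset structure."""
    train = pd.read_csv(f"{folder}/train_data.csv")
    val = pd.read_csv(f"{folder}/validation_data.csv")
    test = pd.read_csv(f"{folder}/test_data.csv")
    # Assumption: features in all but the last column, label in the last column
    split = lambda df: (df.iloc[:, :-1], df.iloc[:, -1])
    return split(train), split(val), split(test)
```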

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  7. Resized (224x224) COVID and Non-COVID X-Ray Images

    • data.mendeley.com
    Updated Jun 6, 2022
    Cite
    Reymond Mesuga (2022). Resized (224x224) COVID and Non-COVID X-Ray Images [Dataset]. http://doi.org/10.17632/db4wdy7cdf.1
    Explore at:
    Dataset updated
    Jun 6, 2022
    Authors
    Reymond Mesuga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the resized version of several datasets from Mendeley and Kaggle. The dataset from Mendeley entitled "Extensive COVID-19 X-Ray and CT Chest Images Dataset", published by El-Shafai and El-Samie, was used as the Train Set. Meanwhile, the dataset from Kaggle entitled "COVID19 with Pneumonia and Normal Chest Xray (PA) Dataset", published by Asraf, was used as the Validation Set and Test Set. This dataset has been prepared for machine learning and deep learning purposes. Please check the sources of these datasets via the link below.

    This dataset contains the following:

    • Train Set (COVID: 4044; Non-COVID: 5500; Total: 9544)
    • Validation Set (COVID: 763; Non-COVID: 763; Total: 1526)
    • Test Set (COVID: 762; Non-COVID: 762; Total: 1524)

    Since this dataset is the modified (i.e., resized) version of the datasets stated above, please cite this dataset (see citation on the right side), including the other sources.

  8. Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...

    • data.mendeley.com
    Updated Apr 2, 2024
    + more versions
    Cite
    Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Md Shoaib Ahmed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

    • Fresh salmon 🐟
    • Infected salmon 🐠

    This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

    So, dive in and explore the fascinating world of salmon health and disease!

    The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

    The SalmonScan dataset (augmented) consists of 1,208 images of salmon fish, classified into two classes:

    • Fresh salmon (healthy fish with no visible signs of disease), 456 images
    • Infected Salmon containing disease, 752 images

    Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

    Data Preprocessing

    The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

    Resizing 📏: All images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm.

    Image Augmentation 📸: To overcome the small number of images, various image augmentation techniques were applied to the input images:

    • Horizontal Flip ↩️: The images were horizontally flipped to create additional samples.
    • Vertical Flip ⬆️: The images were vertically flipped to create additional samples.
    • Rotation 🔄: The images were rotated to create additional samples.
    • Cropping 🪓: A portion of each image was randomly cropped to create additional samples.
    • Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples.
    • Shearing 🌆: The images were sheared to create additional samples.
    • Contrast Adjustment (Gamma) ⚖️: Gamma correction was applied to the images to adjust their contrast.
    • Contrast Adjustment (Sigmoid) ⚖️: A sigmoid function was applied to the images to adjust their contrast.
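    Several of these augmentations can be sketched with NumPy alone. The rotation here is a simple 90° `rot90` and the crop keeps 80% of each side; the dataset's actual rotations, shears, and contrast adjustments would typically use an imaging library such as Pillow, which is an assumption on our part:

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator):
    """Yield simple augmented variants of an H x W x C image array."""
    yield np.fliplr(img)                        # horizontal flip
    yield np.flipud(img)                        # vertical flip
    yield np.rot90(img)                         # 90-degree rotation
    noisy = img + rng.normal(0, 10, img.shape)  # Gaussian noise
    yield np.clip(noisy, 0, 255).astype(img.dtype)
    h, w = img.shape[:2]                        # random crop to 80% of each side
    top = rng.integers(0, h // 5 + 1)
    left = rng.integers(0, w // 5 + 1)
    yield img[top:top + (h * 4) // 5, left:left + (w * 4) // 5]
```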

    Usage

    To use the salmon scan dataset in your ML and DL projects, follow these steps:

    • Clone or download the salmon scan dataset repository from GitHub.
    • Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.
    • Split the dataset into training, validation, and test sets as per your requirement.
    • Preprocess the data as needed, such as resizing and normalizing the images.
    • Train your ML/DL model using the preprocessed training data.
    • Evaluate the model on the test set and make predictions on new, unseen data.
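    Steps 2-4 above might look like this in outline; synthetic arrays stand in for decoded images (reading the actual files would use a library such as Pillow), and the 70/15/15 ratios are an illustrative choice:

```python
import numpy as np

# Stand-ins for decoded images (600 x 250 per the preprocessing notes)
images = [np.random.randint(0, 256, (250, 600, 3), dtype=np.uint8) for _ in range(20)]
labels = np.array([0, 1] * 10)  # 0 = fresh, 1 = infected

X = np.stack(images).astype(np.float32) / 255.0   # normalize to [0, 1]

# 70/15/15 train/validation/test split
idx = np.random.default_rng(0).permutation(len(X))
n_tr, n_va = int(0.7 * len(X)), int(0.15 * len(X))
train_idx, val_idx, test_idx = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
X_train, y_train = X[train_idx], labels[train_idx]
```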
  9. Federated Learning for Distributed Intrusion Detection Systems in Public...

    • zenodo.org
    • data.europa.eu
    bz2
    Updated May 23, 2023
    + more versions
    Cite
    Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos (2023). Federated Learning for Distributed Intrusion Detection Systems in Public Networks - Validation Dataset [Dataset]. http://doi.org/10.5281/zenodo.7956304
    Explore at:
    bz2
    Available download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alireza Bakhshi Zadi Mahmoodi; Panos Kostakos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset has been meticulously prepared and utilized as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models.

    The validation dataset comprises a comprehensive collection of labeled entries indicating whether each packet is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.

    To simplify distribution and storage, the dataset has been broken into three separate batches, each containing a portion of the data and provided as an individual compressed file; this makes downloading and managing the dataset more convenient.

    To extract the data, follow these steps:

    • Download and install bzip2 (if not already installed) from the official website or your package manager.
    • Place the compressed dataset file in a directory of your choice.
    • Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
    • Execute the following command to uncompress the dataset:
      • bzip2 -d filename.bz2
    • Replace "filename.bz2" with the actual name of the compressed dataset file.

    Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training. The total storage required for extraction is approximately 800 GB: roughly 302 GB for the first batch, 203 GB for the second, and 297 GB for the third.
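    If the ~800 GB of extracted storage is a constraint, the batches can also be read in a streaming fashion without decompressing to disk, using Python's stdlib `bz2` module (the filename below is a placeholder, not the actual batch name):

```python
import bz2

def stream_lines(path: str):
    """Iterate over a .bz2 file line by line without extracting it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")

# for line in stream_lines("batch1.bz2"): process(line)
```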

    The first batch contains 1,049,527,992 entries, whereas the second batch contains 711,043,331 entries and the third batch contains 1,029,303,062 entries. The following table lists the feature names of the extracted dataset, along with an explanation and an example value for each.

    Feature | Description | Example value
    ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17
    ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5
    frame.time_epoch | Epoch time of the frame | 1676165569.930869
    arp.hw.type | Hardware type | 1
    arp.hw.size | Hardware size | 6
    arp.proto.size | Protocol size | 4
    arp.opcode | Opcode | 2
    data.len | Length | 2713
    eth.dst.lg | Destination LG bit | 1
    eth.dst.ig | Destination IG bit | 1
    eth.src.lg | Source LG bit | 1
    eth.src.ig | Source IG bit | 1
    frame.offset_shift | Time shift for this packet | 0
    frame.len | Frame length on the wire | 1208
    frame.cap_len | Frame length stored into the capture file | 215
    frame.marked | Frame is marked | 0
    frame.ignored | Frame is ignored | 0
    frame.encap_type | Encapsulation type | 1
    gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)'
    ip.version | Version | 6
    ip.hdr_len | Header length | 24
    ip.dsfield.dscp | Differentiated Services Codepoint | 56
    ip.dsfield.ecn | Explicit Congestion Notification | 2
    ip.len | Total length | 614
    ip.flags.rb | Reserved bit | 0
    ip.flags.df | Don't fragment | 1
    ip.flags.mf | More fragments | 0
    ip.frag_offset | Fragment offset | 0
    ip.ttl | Time to live | 31
    ip.proto | Protocol | 47
    ip.checksum.status | Header checksum status | 2
    tcp.srcport | TCP source port | 53425
    tcp.flags | Flags | 0x00000098
    tcp.flags.ns | Nonce | 0
    tcp.flags.cwr | Congestion Window Reduced (CWR) | 1
    udp.srcport | UDP source port | 64413
    udp.dstport | UDP destination port | 54087
    udp.stream | Stream index | 1345
    udp.length | Length | 225
    udp.checksum.status | Checksum status | 3
    packet_type | Type of the packet, either "benign" or "malicious" | 0

    Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
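    The 64-character hexadecimal values in the `ip.src`/`ip.dst` examples are consistent with SHA-256 output, so the anonymization presumably looked something like the sketch below. Whether a secret salt was used is not stated; without one, the small IPv4 address space makes the hashes enumerable by brute force:

```python
import hashlib

def anonymize_ip(ip: str, salt: str = "") -> str:
    """One-way hash of an IP address; a secret salt resists brute-force reversal."""
    return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()
```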

    Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain.

    By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.

  10. Pothole Dataset

    • universe.roboflow.com
    zip
    Updated Nov 1, 2020
    + more versions
    Cite
    Brad Dwyer (2020). Pothole Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/pothole-voxrl/model/1
    Explore at:
    zip
    Available download formats
    Dataset updated
    Nov 1, 2020
    Dataset authored and provided by
    Brad Dwyer
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Variables measured
    Potholes Bounding Boxes
    Description

    Pothole Dataset

    Example image: https://i.imgur.com/7Xz8d5M.gif

    This is a collection of 665 images of roads with the potholes labeled. The dataset was created and shared by Atikur Rahman Chitholian as part of his undergraduate thesis and was originally shared on Kaggle.

    Note: The original dataset did not contain a validation set; we have re-shuffled the images into a 70/20/10 train-valid-test split.

    Usage

    This dataset could be used for automatically finding and categorizing potholes in city streets so the worst ones can be fixed faster.

    The dataset is provided in a wide variety of formats for various common machine learning models.

  11. Model performance per dataset.

    • plos.figshare.com
    xls
    Updated May 2, 2024
    + more versions
    Cite
    Erik D. Huckvale; Hunter N. B. Moseley (2024). Model performance per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0299583.t010
    Explore at:
    xls
    Available download formats
    Dataset updated
    May 2, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Erik D. Huckvale; Hunter N. B. Moseley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
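    The remedy the description points to, de-duplicating before generating CV folds, is a one-line operation if entries are keyed on their SMILES strings (the column name below is hypothetical):

```python
import pandas as pd

def deduplicate(df: pd.DataFrame, key: str = "smiles") -> pd.DataFrame:
    """Drop duplicate molecules before building k-fold CV splits."""
    return df.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
```

    Running cross-validation on the de-duplicated frame prevents the same molecule from appearing in both a training and a testing fold.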

  12. Complete Blood Count (CBC)

    • kaggle.com
    zip
    Updated Aug 1, 2024
    Cite
    Muhammad Noukhez (2024). Complete Blood Count (CBC) [Dataset]. https://www.kaggle.com/datasets/mdnoukhej/complete-blood-count-cbc
    Explore at:
    zip (12859168 bytes)
    Available download formats
    Dataset updated
    Aug 1, 2024
    Authors
    Muhammad Noukhez
    Description

    Dataset Description:

    This dataset is a comprehensive collection of Complete Blood Count (CBC) images, meticulously organized to support machine learning and deep learning projects, especially in the domain of medical image analysis. The dataset's structure ensures a balanced and systematic approach to model development, validation, and testing.

    Dataset Breakdown:

    • Training Images: 300
    • Validation Images: 60
    • Test Images: 60
    • Annotations: Detailed annotations included for all images

    Overview:

    The Complete Blood Count (CBC) is a crucial test used in medical diagnostics to evaluate the overall health and detect a variety of disorders, including anemia, infection, and many other diseases. This dataset provides a rich source of CBC images that can be used to train machine learning models to automate the analysis and interpretation of these tests.

    Data Composition:

    1. Training Set:

      • Contains 300 images
      • These images are used to train machine learning models, enabling them to learn and recognize patterns associated with various blood cell types and conditions.
    2. Validation Set:

      • Contains 60 images
      • Used to tune the models and optimize their performance, ensuring that the models generalize well to new, unseen data.
    3. Test Set:

      • Contains 60 images
      • Used to evaluate the final model performance, providing an unbiased assessment of how well the model performs on new data.

    Annotations:

    Each image in the dataset is accompanied by detailed annotations, which include information about the different types of blood cells present and any relevant diagnostic features. These annotations are essential for supervised learning, allowing models to learn from labeled examples and improve their accuracy and reliability.

    Key Features:

    • High-Quality Images: All images are of high quality, making them suitable for a variety of machine learning tasks, including image classification, object detection, and segmentation.
    • Comprehensive Annotations: Each image is thoroughly annotated, providing valuable information that can be used to train and validate models.
    • Balanced Dataset: The dataset is carefully balanced with distinct sets for training, validation, and testing, ensuring that models trained on this data will be robust and generalizable.

    Applications:

    This dataset is ideal for researchers and practitioners in the fields of machine learning, deep learning, and medical image analysis. Potential applications include: - Automated CBC Analysis: Developing algorithms to automatically analyze CBC images and provide diagnostic insights. - Blood Cell Classification: Training models to accurately classify different types of blood cells, which is critical for diagnosing various blood disorders. - Educational Purposes: Using the dataset as a teaching tool to help students and new practitioners understand the complexities of CBC image analysis.

    Usage Notes:

    • Data Augmentation: Users may consider applying data augmentation techniques to increase the diversity of the training data and improve model robustness.
    • Preprocessing: Proper preprocessing, such as normalization and noise reduction, can enhance model performance.
    • Evaluation Metrics: It is recommended to use standard evaluation metrics such as accuracy, precision, recall, and F1-score to assess model performance.
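    The metrics suggested above can be computed with scikit-learn's standard helpers; the labels below are invented placeholders, not taken from the dataset's annotations.

```python
# Hedged sketch: accuracy and macro precision/recall/F1 on made-up CBC labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["rbc", "wbc", "platelet", "rbc", "wbc", "rbc"]   # hypothetical
y_pred = ["rbc", "wbc", "rbc",      "rbc", "wbc", "wbc"]   # hypothetical

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```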

    Conclusion:

    This CBC dataset is a valuable resource for anyone looking to advance the field of automated medical diagnostics through machine learning and deep learning. With its high-quality images, detailed annotations, and balanced composition, it provides the necessary foundation for developing accurate and reliable models for CBC analysis.

  13. Data from: Robust Validation: Confident Predictions Even When Distributions...

    • tandf.figshare.com
    bin
    Updated Dec 26, 2023
    Cite
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi (2023). Robust Validation: Confident Predictions Even When Distributions Shift* [Dataset]. http://doi.org/10.6084/m9.figshare.24904721.v1
    Explore at:
    bin
    Available download formats
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.
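    The paper's f-divergence-robust machinery does not fit in a short snippet, but the split conformal baseline it builds on can be sketched on synthetic data (this is plain conformal inference, not the robust variant, and none of it is the authors' code):

```python
# Split conformal prediction sets on synthetic binary data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

X_tr, y_tr = X[:300], y[:300]          # fit the base model
X_cal, y_cal = X[300:500], y[300:500]  # calibrate nonconformity scores
X_te, y_te = X[500:], y[500:]

clf = LogisticRegression().fit(X_tr, y_tr)

# Nonconformity score: 1 - predicted probability of the true class.
cal_p = clf.predict_proba(X_cal)
scores = 1.0 - cal_p[np.arange(len(y_cal)), y_cal]

alpha = 0.1  # target 90% coverage
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction set: all classes whose score falls within the quantile.
pred_sets = [np.where(1.0 - p <= q)[0] for p in clf.predict_proba(X_te)]
coverage = np.mean([yt in s for yt, s in zip(y_te, pred_sets)])
print(f"empirical coverage: {coverage:.2f}")
```

    Under exchangeability this construction covers the true label with probability at least 1 - alpha; the robust method in the paper additionally widens the quantile to survive shifts of the test distribution.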

  14. Mathematical Problems Dataset: Various

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). Mathematical Problems Dataset: Various [Dataset]. https://www.kaggle.com/datasets/thedevastator/mathematical-problems-dataset-various-mathematic/code
    Explore at:
    zip (2498203187 bytes)
    Available download formats
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mathematical Problems Dataset: Various Mathematical Problems and Solutions

    Mathematical Problems Dataset: Questions and Answers

    By math_dataset (From Huggingface) [source]

    About this dataset

    This dataset comprises a collection of mathematical problems and their solutions designed for training and testing purposes. Each problem is presented in the form of a question, followed by its corresponding answer. The dataset covers various mathematical topics such as arithmetic, polynomials, and prime numbers. For instance, the arithmetic_nearest_integer_root_test.csv file focuses on problems involving finding the nearest integer root of a given number. Similarly, the polynomials_simplify_power_test.csv file deals with problems related to simplifying polynomials with powers. Additionally, the dataset includes the numbers_is_prime_train.csv file containing math problems that require determining whether a specific number is prime or not. The questions and answers are provided in text format to facilitate analysis and experimentation with mathematical problem-solving algorithms or models.

    How to use the dataset

    • Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.

    • Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:

      • question: This column contains the text representation of the mathematical problem or equation.
      • answer: This column contains the text representation of the solution or answer to the corresponding problem.
    • Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for prime in the question column.

    • Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.

    • Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.

    • Validation and evaluation: As with any other machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against known answers provided in this dataset's answer column.

    • Sharing insights and findings: After working with this dataset, it would be beneficial for researchers or educators to share their insights and the approaches taken during analysis and modelling as Kaggle notebooks, discussions, blogs, or tutorials, so that others can benefit from these shared resources too.

    Note: Please note that the dataset does not include dates.

    By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!
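    Step 3 above (filtering by keyword) can be sketched with pandas; the tiny inline frame stands in for one of the CSV files such as numbers_is_prime_train.csv, and the column names follow the dataset description.

```python
# Hedged sketch: keyword filtering on the "question" column.
import pandas as pd

df = pd.DataFrame({
    "question": ["Is 7 prime?", "Simplify (x**2)**3.", "Is 15 prime?"],
    "answer": ["True", "x**6", "False"],
})  # stand-in for a real CSV loaded with pd.read_csv(...)

prime_problems = df[df["question"].str.contains("prime", case=False)]
print(len(prime_problems))  # → 2
```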

    Research Ideas

    • Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
    • Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
    • Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purpos...

  15. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a Tesseract-based script to run text extraction from detected text rows. This is included in our code archive (code.tar) as text_recognition_multipro.py.

    We used a Java tool provided by Falk Böschen, adapted to our file structure. We included it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  16. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Feb 28, 2021
    Cite
    Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
    Explore at:
    unknown (395470535 bytes)
    Available download formats
    Dataset updated
    Feb 28, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17th, 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file. All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
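    Grouping source files by their assigned set from dataset_split.csv can be sketched as below; the description does not fix the CSV's header, so the "filename" and "set" column names are assumptions to check against the actual file.

```python
# Hedged sketch: bucket files by split using an in-memory stand-in CSV.
import csv
from collections import defaultdict
from io import StringIO

sample = StringIO(
    "filename,set\n"
    "proj/a.py,train\n"
    "proj/b.py,valid\n"
    "proj/c.py,test\n"
    "proj/d.py,train\n"
)  # replace with open("dataset_split.csv") for the real dataset

splits = defaultdict(list)
for row in csv.DictReader(sample):
    splits[row["set"]].append(row["filename"])

print({k: len(v) for k, v in splits.items()})  # → {'train': 2, 'valid': 1, 'test': 1}
```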

  17. Results of applying optimized machine learning approach for multi-tasks...

    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    Cite
    Ahmadreza Keihani; Amin Mohammad Mohammadi; Hengameh Marzbani; Shahriar Nafissi; Mohsen Reza Haidari; Amir Homayoun Jafari (2023). Results of applying optimized machine learning approach for multi-tasks classification. [Dataset]. http://doi.org/10.1371/journal.pone.0270757.t003
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ahmadreza Keihani; Amin Mohammad Mohammadi; Hengameh Marzbani; Shahriar Nafissi; Mohsen Reza Haidari; Amir Homayoun Jafari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of applying optimized machine learning approach for multi-tasks classification.

  18. Downsized camera trap images for automated classification

    • data.niaid.nih.gov
    Updated Dec 1, 2022
    Cite
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M (2022). Downsized camera trap images for automated classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6627706
    Explore at:
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Imperial College London
    Authors
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions.

    Funding: These data were collected as part of research funded by:

    NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A) This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here. Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip. CT_image_data_info2.xlsx contains dataset metadata and 1 data table:

    Dataset Images (described in worksheet Dataset_images)
    Description: This worksheet details the composition of each dataset used in the analyses
    Number of fields: 69
    Number of data rows: 270287
    Fields:

    • filename: Root ID (Field type: id)
    • camera_trap_site: Site ID for the camera trap location (Field type: location)
    • taxon: Taxon recorded by camera trap (Field type: taxa)
    • dist_level: Level of disturbance at site (Field type: ordered categorical)
    • baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
    • dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance
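    Selecting the baseline training/validation/test images from the metadata worksheet can be sketched with pandas; the tiny frame below stands in for CT_image_data_info2.xlsx (normally loaded with pd.read_excel), using the filename and baseline fields documented above.

```python
# Hedged sketch: split images by the "baseline" label (train/val/test/NA).
import pandas as pd

meta = pd.DataFrame({
    "filename": ["img1", "img2", "img3", "img4"],
    "taxon": ["macaque", "civet", "macaque", "pig"],
    "baseline": ["train", "val", "test", "NA"],
})  # stand-in for pd.read_excel("CT_image_data_info2.xlsx", sheet_name="Dataset_images")

baseline = meta[meta["baseline"] != "NA"]  # drop images not in the baseline analysis
by_split = baseline.groupby("baseline")["filename"].apply(list).to_dict()
print(by_split)  # → {'test': ['img3'], 'train': ['img1'], 'val': ['img2']}
```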

  19. Optimized parameter values for play detection.

    • plos.figshare.com
    xls
    Updated Apr 18, 2024
    + more versions
    Cite
    Jonas Bischofberger; Arnold Baca; Erich Schikuta (2024). Optimized parameter values for play detection. [Dataset]. http://doi.org/10.1371/journal.pone.0298107.t004
    Explore at:
    xls
    Available download formats
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Jonas Bischofberger; Arnold Baca; Erich Schikuta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tackles comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy with F-scores of 0.71 for passes and 0.65 for shots if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is currently overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers.
We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
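    The F-scores quoted above combine precision and recall; as a worked example with invented detection counts (not figures from the paper):

```python
# F-score from raw detection counts: true positives, false positives, misses.
def f_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)   # fraction of detections that are real events
    recall = tp / (tp + fn)      # fraction of real events that are detected
    return 2 * precision * recall / (precision + recall)

# e.g. 460 correctly detected passes, 40 false detections, 70 missed passes
print(round(f_score(tp=460, fp=40, fn=70), 2))  # → 0.89
```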

  20. IUST-PDFCorpus

    • zenodo.org
    • live.european-language-grid.eu
    zip
    Updated Apr 24, 2025
    Cite
    Morteza Zakeri-Nasrabadi; Morteza Zakeri-Nasrabadi (2025). IUST-PDFCorpus [Dataset]. http://doi.org/10.5281/zenodo.3484013
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Morteza Zakeri-Nasrabadi; Morteza Zakeri-Nasrabadi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About

    IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format while streams are in binary format; together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats. PDF data objects are organized into three categories. The first category contains all objects in the corpus: each file in this category holds all PDF objects extracted from one PDF file, without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing; this dataset is split into train, test, and validation sets, which is useful for machine learning tasks. The third category is the same as the second but smaller, for use during the development stage of different algorithms. IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs used in AFL as initial seeds, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
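    A split like the one in the second category can be sketched as follows. The 80/10/10 ratio and the in-memory list representation of extracted PDF objects are assumptions for illustration; the exact ratios and file layout used in the corpus are not restated here:

```python
import random

def split_corpus(objects, train_frac=0.8, val_frac=0.1, seed=42):
    """Split a list of extracted PDF data objects into train, validation,
    and test sets for machine learning tasks.

    The 80/10/10 split and fixed seed are illustrative assumptions,
    not the exact configuration used to build the published corpus.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = objects[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],                      # training set
            shuffled[n_train:n_train + n_val],       # validation set
            shuffled[n_train + n_val:])              # test set
```

    Shuffling before splitting matters here because the objects are extracted file by file; a sequential split would otherwise leak objects from the same source PDF across sets.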

    Citing IUST-PDFCorpus

    If IUST-PDFCorpus is used in your work in any form, please cite the relevant paper: https://arxiv.org/abs/1812.09961v2
