100+ datasets found
  1. Data from: Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
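    The contrast between random and time-split validation can be illustrated with a short sketch. This is a minimal illustration of the idea only, assuming a pandas DataFrame with hypothetical descriptor columns, a numeric "activity" target and a "date" column; it is not the paper's actual pipeline.

      # Minimal sketch: random split vs. time split for estimating prospective R2.
      # Column names ("activity", "date") and the model choice are assumptions.
      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import r2_score
      from sklearn.model_selection import train_test_split

      def random_split_r2(df: pd.DataFrame, feature_cols: list) -> float:
          X, y = df[feature_cols], df["activity"]
          X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
          model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
          return r2_score(y_te, model.predict(X_te))

      def time_split_r2(df: pd.DataFrame, feature_cols: list) -> float:
          df = df.sort_values("date")                  # oldest compounds first
          cut = int(len(df) * 0.75)                    # train on the earliest 75%
          train, test = df.iloc[:cut], df.iloc[cut:]   # test on the newest 25%
          model = RandomForestRegressor(random_state=0).fit(train[feature_cols], train["activity"])
          return r2_score(test["activity"], model.predict(test[feature_cols]))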

  2. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

    These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

    From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
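    The selection-and-split procedure described above can be sketched roughly as follows. The metadata column names ("species", "grid_cell") and the per-combination interpretation of the 1,250-recording cap are assumptions for illustration; this is not the USGS pipeline itself.

      # Rough sketch: cap recordings per species/grid-cell combination, then assign
      # random train/validation/test splits (the test split is the excluded holdout).
      import numpy as np
      import pandas as pd

      def cap_and_split(meta: pd.DataFrame, cap: int = 1250, seed: int = 0) -> pd.DataFrame:
          rng = np.random.default_rng(seed)
          capped = (meta.groupby(["species", "grid_cell"], group_keys=False)
                        .apply(lambda g: g.sample(min(len(g), cap), random_state=seed)))
          split = rng.choice(["train", "val", "test"], size=len(capped), p=[0.7, 0.15, 0.15])
          return capped.assign(split=split)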

  3. WDC LSPM Dataset

    • library.toponeai.link
    • paperswithcode.com
    Updated Feb 8, 2025
    Cite
    (2025). WDC LSPM Dataset [Dataset]. https://library.toponeai.link/dataset/wdc-products
    Explore at:
    Dataset updated
    Feb 8, 2025
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
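    A stratified validation draw of the kind mentioned above can be sketched as follows. The file name, the "pair_id" and "label" column names, and the JSON-lines format are assumptions for illustration; the actual WDC files and provided ID lists may differ.

      # Sketch: draw a stratified validation split from one training set of product pairs.
      import pandas as pd
      from sklearn.model_selection import train_test_split

      pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # hypothetical file
      train_ids, val_ids = train_test_split(
          pairs["pair_id"],                # hypothetical id column
          test_size=0.2,
          stratify=pairs["label"],         # preserve the match/no-match ratio in both parts
          random_state=42,
      )
      train_pairs = pairs[pairs["pair_id"].isin(train_ids)]
      val_pairs = pairs[pairs["pair_id"].isin(val_ids)]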

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.

  4. Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

    • da-ra.de
    Updated Oct 2019
    Cite
    Christian Bizer; Anna Primpeli; Ralph Peeters (2019). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.7801/351
    Explore at:
    Dataset updated
    Oct 2019
    Dataset provided by
    da|ra
    Mannheim University Library
    Authors
    Christian Bizer; Anna Primpeli; Ralph Peeters
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  5. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need a Python environment such as VS Code or Jupyter, together with tools like the following (a minimal loading sketch is given after this list):

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
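    As a starting point, the splits named above can be loaded as in the sketch below. The folder layout and the name of the label column ("target") are assumptions; adjust them to the actual files.

      # Minimal sketch: load the training/validation/test CSVs and fit a baseline
      # classifier; column and folder names are assumptions, not part of the dataset spec.
      import pandas as pd
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score

      train = pd.read_csv("Training Data/train_data.csv")
      val = pd.read_csv("Validation Data/validation_data.csv")
      test = pd.read_csv("Test Data/test_data.csv")

      X_train, y_train = train.drop(columns=["target"]), train["target"]
      X_val, y_val = val.drop(columns=["target"]), val["target"]

      clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
      print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
      # Score the held-out test set only once, after model selection is finished.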

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  6. DataSheet_1_Automated data preparation for in vivo tumor characterization with machine learning.docx

    • frontiersin.figshare.com
    docx
    Updated Jun 13, 2023
    Cite
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp (2023). DataSheet_1_Automated data preparation for in vivo tumor characterization with machine learning.docx [Dataset]. http://doi.org/10.3389/fonc.2022.1017911.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: This study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts.

    Methods: A collection of well-established DP methods were incorporated for building the DP pipelines for various clinical cohorts prior to machine learning. Evolutionary algorithm principles combined with hyperparameter optimization were employed to iteratively select the best fitting subset of data preparation algorithms for the given dataset. The proposed method was validated for glioma and prostate single center cohorts by 100-fold Monte Carlo (MC) cross-validation scheme with 80-20% training-validation split ratio. In addition, a dual-center diffuse large B-cell lymphoma (DLBCL) cohort was utilized with Center 1 as training and Center 2 as independent validation datasets to predict cohort-specific clinical endpoints. Five machine learning (ML) classifiers were employed for building prediction models across all analyzed cohorts. Predictive performance was estimated by confusion matrix analytics over the validation sets of each cohort. The performance of each model with and without MLDP, as well as with manually-defined DP were compared in each of the four cohorts.

    Results: Sixteen of twenty established predictive models demonstrated area under the receiver operator characteristics curve (AUC) performance increase utilizing the MLDP. The MLDP resulted in the highest performance increase for random forest (RF) (+0.16 AUC) and support vector machine (SVM) (+0.13 AUC) model schemes for predicting 36-months survival in the glioma cohort. Single center cohorts resulted in complex (6-7 DP steps) DP pipelines, with a high occurrence of outlier detection, feature selection and synthetic majority oversampling technique (SMOTE). In contrast, the optimal DP pipeline for the dual-center DLBCL cohort only included outlier detection and SMOTE DP steps.

    Conclusions: This study demonstrates that data preparation prior to ML prediction model building in cancer cohorts shall be ML-driven itself, yielding optimal prediction models in both single and multi-centric settings.
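    The evaluation scheme described in the Methods (100-fold Monte Carlo cross-validation with an 80-20% split) can be sketched as follows. The features, labels, and classifier here are placeholders, not the study's MLDP pipeline.

      # Sketch: 100-repeat Monte Carlo cross-validation with an 80/20 split, reporting
      # mean and standard deviation of AUC. X and y are assumed to be NumPy arrays.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import ShuffleSplit

      def monte_carlo_auc(X, y, n_repeats=100, seed=0):
          aucs = []
          splitter = ShuffleSplit(n_splits=n_repeats, test_size=0.2, random_state=seed)
          for train_idx, val_idx in splitter.split(X):
              clf = RandomForestClassifier(random_state=seed).fit(X[train_idx], y[train_idx])
              scores = clf.predict_proba(X[val_idx])[:, 1]
              aucs.append(roc_auc_score(y[val_idx], scores))
          return float(np.mean(aucs)), float(np.std(aucs))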

  7. Data from: Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation

    • zenodo.org
    • redu.unicamp.br
    zip
    Updated Dec 4, 2023
    Cite
    Luís Fernando Lopes Grim; André Leon Sampaio Gradvohl (2023). Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation [Dataset]. http://doi.org/10.5281/zenodo.10246577
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luís Fernando Lopes Grim; André Leon Sampaio Gradvohl
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation".

    Our work employed PyTorch, a framework for training deep learning models with GPU support and automatic back-propagation, to load the MViTv2-S models with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each a sequence of 16 three-channel images resized to 224 × 224 pixels and normalized from 0 to 1.
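    Loading the pretrained backbone and checking the input shape can be sketched with torchvision as below. Replacing the classification head with a two-class output layer is an assumption about the setup, not a statement of the authors' exact code.

      # Sketch: load MViTv2-S with Kinetics-400 weights and feed one batch shaped
      # (10 samples, 3 channels, 16 frames, 224, 224), already normalized to [0, 1].
      import torch
      from torchvision.models.video import MViT_V2_S_Weights, mvit_v2_s

      model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
      # Assumed head replacement for a 2-class (flare / no flare) output.
      model.head[-1] = torch.nn.Linear(model.head[-1].in_features, 2)

      batch = torch.rand(10, 3, 16, 224, 224)
      with torch.no_grad():
          logits = model(batch)
      print(logits.shape)  # expected: torch.Size([10, 2])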

    Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples and applying 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data).
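    The hybrid split reads roughly as in the following sketch: the first 50,000 chronologically ordered samples are used for 5-fold cross-validation, and the last 9,834 form a fixed, chronologically later test set.

      # Sketch of the hybrid split: 5-fold CV on the known data, chronological test set.
      import numpy as np
      from sklearn.model_selection import KFold

      n_known, n_test = 50_000, 9_834
      indices = np.arange(n_known + n_test)          # samples assumed in chronological order
      known, test_idx = indices[:n_known], indices[n_known:]

      kfold = KFold(n_splits=5, shuffle=True, random_state=0)
      for fold, (train_idx, val_idx) in enumerate(kfold.split(known)):
          # 40,000 training and 10,000 validation indices per fold; one model per fold.
          print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, test={len(test_idx)}")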

    We develop three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without using oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both training and validation sets.

    We also trained a model oversampling the entire dataset. We called it the "SF_MViT_oTV Test" to verify how resampling or adopting a test set with unreal data may bias the results positively.

    GitHub version

    The .zip hosted here contains all files from the project, including the checkpoint and the output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.

    Folders Structure

    In the Root directory of the project, we have two folders:

    • magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes. However, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to note that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
    • Seq_Magnetogram: contains the references to source images with the corresponding labels for the next 24 h and 48 h in the M24 and M48 sub-folders, respectively.
      • M24/M48: both present the following sub-folders structure:
        • Seqs16;
        • SF_MViT;
        • SF_MViT_oT;
        • SF_MViT_oTV;
        • SF_MViT_oTV_Test.

    There are also two files in root:

    • inst_packages.sh: install the packages and dependencies to run the models.
    • download_MViTS.py: download the pre-trained MViTv2_S from PyTorch and store it in the cache.

    M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folders or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify head info and check the number of samples categorized by the label of the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files.

    The Seqs16 folder holds reference text files, in which each file contains a sequence of images pointing to the magnetogram_jpg folders.

    All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint files (sample-FLARE...ckpt). Executed model training codes generate output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.

    Naming pattern for the files:

    • magnetogram_jpg: follows the format "hmi.sharp_720s.
    • Seqs16: follows the format "hmi.sharp_720s.<SHARP-ID>.
    • Reference text files in M24 and M48 or inside SF_MViT... folders follows the format "
    • All SF_MViT...folders:
      • Model training codes: "SF_MViT_
      • Job submission files: "jobMViT_
      • Temporary inputs: "Seq16_flare_Mclass_
      • Outputs: "saida_MViT_Adam_10-7
      • Error files: "err_MViT_Adam_10-7
      • Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=

  8. SignLanguage_MiniProject

    • huggingface.co
    Updated Dec 8, 2024
    Cite
    Jonas Jensen (2024). SignLanguage_MiniProject [Dataset]. https://huggingface.co/datasets/Jonasbj99/SignLanguage_MiniProject
    Explore at:
    Dataset updated
    Dec 8, 2024
    Authors
    Jonas Jensen
    Description

    Dataset used for training a model to classify Danish Sign Language signs, based on MediaPipe hand landmark data. The data is not split into training, test and validation sets.
    The dataset consists of four classes: 'unknown', 'hello', 'bye' and 'thanks'. There are 30 data points for each class. Each data point is 30 frames of data stored in an individual NumPy file with x, y and z values for each hand landmark.
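    A minimal sketch for loading the per-sample NumPy files and creating the missing train/test split is given below. The folder layout (one directory per class) and the per-frame landmark layout are assumptions about how the files are organized.

      # Sketch: load per-sequence .npy files, flatten each sequence, and split.
      from pathlib import Path

      import numpy as np
      from sklearn.model_selection import train_test_split

      classes = ["unknown", "hello", "bye", "thanks"]
      X, y = [], []
      for label, name in enumerate(classes):
          for npy_file in sorted(Path(name).glob("*.npy")):   # assumed folder-per-class layout
              sample = np.load(npy_file)        # expected shape: (30 frames, landmarks * 3)
              X.append(sample.reshape(-1))      # one flat feature vector per sequence
              y.append(label)

      X, y = np.asarray(X), np.asarray(y)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, stratify=y, random_state=0
      )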

  9. Iris Dataset - Logistic Regression

    • kaggle.com
    Updated Mar 8, 2019
    Cite
    Tanya Ganesan (2019). Iris Dataset - Logistic Regression [Dataset]. https://www.kaggle.com/tanyaganesan/iris-dataset-logistic-regression/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 8, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tanya Ganesan
    Description

    Visualization of Iris Species Dataset:

    Figure: https://i.imgur.com/XqkskaX.png

    • The data has four features.
    • Each subplot considers two features.
    • From the figure it can be observed that the data points for species Iris-setosa are clubbed together and for the other two species they sort of overlap.

    Classification using Logistic Regression:

    • There are 50 samples for each of the species. The data for each species is split into three sets - training, validation and test.
      • The training data is prepared separately for the three species. For instance, if the species is Iris-Setosa, then the corresponding outputs are set to 1 and for the other two species they are set to 0.
      • The training data sets are modeled separately. Three sets of model parameters(theta) are obtained. A sigmoid function is used to predict the output.
      • Gradient descent method is used to converge on 'theta' using a cost function.

    Figures: https://i.imgur.com/USfd26D.png, https://i.imgur.com/AAxz3Ma.png, https://i.imgur.com/kLNQPu1.png
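    The one-vs-rest scheme described above (a sigmoid model per species, fitted by gradient descent on a cost function) can be sketched in NumPy as follows; the learning rate, iteration count, and use of plain batch gradient descent are illustrative assumptions.

      # Minimal one-vs-rest logistic regression fitted by batch gradient descent.
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def fit_one_vs_rest(X, y_binary, lr=0.1, n_iters=5000):
          Xb = np.hstack([np.ones((len(X), 1)), X])     # add intercept column
          theta = np.zeros(Xb.shape[1])
          for _ in range(n_iters):
              h = sigmoid(Xb @ theta)
              grad = Xb.T @ (h - y_binary) / len(Xb)    # gradient of the log-loss cost
              theta -= lr * grad
          return theta

      def predict_species(X, thetas):
          Xb = np.hstack([np.ones((len(X), 1)), X])
          probs = np.column_stack([sigmoid(Xb @ t) for t in thetas])
          return probs.argmax(axis=1)                   # highest probability wins

      # thetas = [fit_one_vs_rest(X_train, (y_train == k).astype(float)) for k in range(3)]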

    Choosing best model:

    • Polynomial features are included to train the model better. Including more polynomial features will better fit the training set, but it may not give good results on the validation set. The cost for the training data decreases as more polynomial features are included.
      • So, to know which one is the best fit, the training data set is first used to find the model parameters, which are then used on the validation set. Whichever gives the least cost on the validation set is chosen as the better fit to the data.
      • A regularization term is included to keep a check on overfitting as more polynomial features are added.

    Observations: - For Iris-Setosa, inclusion of polynomial features did not do well on the cross-validation set. - For Iris-Versicolor, it seems more polynomial features need to be included to be more conclusive. However, polynomial features up to the third degree were already being used, hence the idea of adding more features was dropped.

    Figures: https://i.imgur.com/RT0rsHU.png, https://i.imgur.com/wsOFfi0.png, https://i.imgur.com/tQkla35.png

    Figures: https://i.imgur.com/GzPuAsT.png, https://i.imgur.com/CBnjTki.png, https://i.imgur.com/tF103lm.png

    Bias-Variance trade off:

    • A check is done to see if the model will perform better if more features are included. The number of samples is increased in steps, and the corresponding model parameters and cost are calculated. The model parameters obtained can then be used to get the cost on the validation set.
    • So if the costs for both sets converge, it is an indication that fit is good.

    Figures: https://i.imgur.com/UNh0Veo.png, https://i.imgur.com/Ae9ObBR.png, https://i.imgur.com/oHrjRLF.png

    Training error:

    • The heuristic function should ideally be 1 for positive outputs and 0 for negative.
    • It is acceptable if the heuristic function is >=0.5 for positive outputs and < 0.5 for negative outputs.
    • The training error is calculated for all the sets. Observations: It performs very well for Iris-Setosa and Iris-Virginica. Except for the validation set for Iris-Versicolor, the rest have been modeled pretty well.

    Figures: https://i.imgur.com/WwB6B55.png, https://i.imgur.com/Pj0c0NJ.png, https://i.imgur.com/i3Wpzt8.png

    Figures: https://i.imgur.com/62HanTn.png, https://i.imgur.com/jj5sATL.png, https://i.imgur.com/yVJvpkW.png

    Figures: https://i.imgur.com/HyCRIb7.png, https://i.imgur.com/MblLr1C.png, https://i.imgur.com/zcDHt58.png

    Accuracy: The highest probability (from heuristic function) obtained is predicted to be the species it belongs to. The accuracy came out to be 93.33% for validation data. And surprisingly 100% for test data.

    Improvements that can be done: A more sophisticated algorithm for finding the model parameters can be used instead of gradient descent. The training data, validation and test data can be chosen randomly to get the best performance.

  10. X-Ray and Non X-Ray Image Classification Data

    • kaggle.com
    Updated Apr 28, 2024
    Cite
    Madushani Rodrigo (2024). X-Ray and Non X-Ray Image Classification Data [Dataset]. https://www.kaggle.com/datasets/bmadushanirodrigo/x-ray-and-non-x-ray-image-classification-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Madushani Rodrigo
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed for X-ray and non-X-ray image classification tasks, specifically tailored for the identification of X-ray images. It includes a comprehensive collection of data split into training, validation, and test sets. Each set is organized into folders, with subfolders dedicated to X-ray and non-X-ray images respectively. This structured arrangement facilitates seamless training and evaluation of classification models aimed at distinguishing between X-ray and non-X-ray images.
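    Given the folder-per-class arrangement described above, the splits can be loaded with torchvision as in the sketch below; the exact directory and class-folder names ("data/train", "data/val", "data/test") are assumptions.

      # Sketch: load the train/validation/test folders with ImageFolder.
      import torch
      from torchvision import datasets, transforms

      preprocess = transforms.Compose([
          transforms.Resize((224, 224)),
          transforms.ToTensor(),
      ])

      train_ds = datasets.ImageFolder("data/train", transform=preprocess)
      val_ds = datasets.ImageFolder("data/val", transform=preprocess)
      test_ds = datasets.ImageFolder("data/test", transform=preprocess)

      train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)
      print(train_ds.classes)  # class names are inferred from the subfolder names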

  11. Downsized camera trap images for automated classification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 1, 2022
    Cite
    Chapman, Philip M (2022). Downsized camera trap images for automated classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6627706
    Explore at:
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Chapman, Philip M
    Heon, Sui P
    Norman, Danielle L
    Wearne, Oliver R
    Ewers, Robert M
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions.

    Funding: These data were collected as part of research funded by:

    NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A)

    This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI compliant metadata for this dataset is available here.

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx: This file contains dataset metadata and 1 data table:

    Dataset Images (described in worksheet Dataset_images)
    Description: This worksheet details the composition of each dataset used in the analyses
    Number of fields: 69
    Number of data rows: 270287
    Fields:

    • filename: Root ID (Field type: id)
    • camera_trap_site: Site ID for the camera trap location (Field type: location)
    • taxon: Taxon recorded by camera trap (Field type: taxa)
    • dist_level: Level of disturbance at site (Field type: ordered categorical)
    • baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_1: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 1' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 2' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 3' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 4' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance level 5' training or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_2: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 2 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 3 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 4 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 4 and 5 (pair)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_3: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 3 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 4 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 3, 4 and 5 (triple)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_3_4: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 4 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_3_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 2, 3, 4 and 5 (quad)' training set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
    • dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance

  12. Data from: DCASE 2021 Task 5: Few-shot Bioacoustic Event Detection Development Set

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 4, 2021
    + more versions
    Cite
    Sridhar, Sripathi (2021). DCASE 2021 Task 5: Few-shot Bioacoustic Event Detection Development Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4543503
    Explore at:
    Dataset updated
    Sep 4, 2021
    Dataset provided by
    Morfi, Veronica
    Nolasco, Ines
    Farnsworth, Andrew
    Sridhar, Sripathi
    Duteil, Mathieu
    Benvent, David
    Gill, Lisa
    Pamula, Hanna
    Singh, Shubhr
    Strandburg-Peshkin, Ariana
    Stowell, Dan
    Lostanlen, Vincent
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    The development set for task 5 of DCASE 2021 "Few-shot Bioacoustic Event Detection" consists of 19 audio files acquired from different bioacoustic sources. The dataset is split into training and validation sets.

    Multi-class annotations are provided for the training set with positive (POS), negative (NEG) and unknown (UNK) values for each class. UNK indicates uncertainty about a class.

    Single-class (class of interest) annotations are provided for the validation set, with events marked as positive (POS) or unknown (UNK) provided for the class of interest.

    Folder Structure

    Development_Set.zip

    |_Development_Set/

        |_Training_Set/

            |_BV/
                |_*.wav
                |_*.csv
            |_HT/
                |_*.wav
                |_*.csv
            |_JD/
                |_*.wav
                |_*.csv
            |_MT/
                |_*.wav
                |_*.csv

        |_Validation_Set/

            |_HV/
                |_*.wav
                |_*.csv
            |_PB/
                |_*.wav
                |_*.csv
    

    Development_Set_Audio.zip has the same structure but contains only the *.wav files.

    Development_Set_Annotations.zip has the same structure but contains only the *.csv files

    Dataset statistics

    Some statistics on this dataset are as follows, split between training and validation set and their sub-folders:

    TRAINING SET
    Number of audio recordings | 11
    Total duration | 14 hours and 20 mins
    Total classes (excl. UNK) | 19
    Total events (excl. UNK) | 4,686

    TRAINING SET/BV
    Number of audio recordings | 5
    Total duration | 10 hours
    Total classes (excl. UNK) | 11
    Total events (excl. UNK) | 2,662
    Sampling rate | 24,000 Hz

    TRAINING SET/HT
    Number of audio recordings | 3
    Total duration | 3 hours
    Total classes (excl. UNK) | 3
    Total events (excl. UNK) | 435
    Sampling rate | 6,000 Hz

    TRAINING SET/JD
    Number of audio recordings | 1
    Total duration | 10 mins
    Total classes (excl. UNK) | 1
    Total events (excl. UNK) | 355
    Sampling rate | 22,050 Hz

    TRAINING SET/MT
    Number of audio recordings | 2
    Total duration | 1 hour and 10 mins
    Total classes (excl. UNK) | 4
    Total events (excl. UNK) | 1,234
    Sampling rate | 8,000 Hz

    VALIDATION SET
    Number of audio recordings | 8
    Total duration | 5 hours
    Total classes (excl. UNK) | 4
    Total events (excl. UNK) | 310

    VALIDATION SET/HV
    Number of audio recordings | 2
    Total duration | 2 hours
    Total classes (excl. UNK) | 2
    Total events (excl. UNK) | 50
    Sampling rate | 6,000 Hz

    VALIDATION SET/PB
    Number of audio recordings | 6
    Total duration | 3 hours
    Total classes (excl. UNK) | 2
    Total events (excl. UNK) | 260
    Sampling rate | 44,100 Hz

    Annotation structure

    Each line of the annotation csv represents an event in the audio file. The column descriptions are as follows:

    TRAINING SET

    Audiofilename, Starttime, Endtime, CLASS_1, CLASS_2, ...CLASS_N

    VALIDATION SET

    Audiofilename, Starttime, Endtime, Q
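    The annotation layout above can be read directly with pandas, as in the sketch below; the file path is a placeholder, and the POS/NEG/UNK values follow the description.

      # Sketch: collect POS events per class from one training annotation file.
      import pandas as pd

      ann = pd.read_csv("Development_Set/Training_Set/BV/some_recording.csv")  # placeholder path
      class_cols = [c for c in ann.columns if c not in ("Audiofilename", "Starttime", "Endtime")]

      pos_events = {
          cls: ann.loc[ann[cls] == "POS", ["Starttime", "Endtime"]].to_numpy()
          for cls in class_cols
      }
      for cls, events in pos_events.items():
          print(cls, len(events), "positive events")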

    Classes

    DCASE2021_task5_training_set_classes.csv and DCASE2021_task5_validation_set_classes.csv provide a table with class code correspondence to class name for all classes in the Development set.

    DCASE2021_task5_training_set_classes.csv

    dataset, class_code, class_name

    DCASE2021_task5_validation_set_classes.csv

    dataset, recording, class_code, class_name

    Evaluation Set

    The Evaluation set for the same task can be found at: https://doi.org/10.5281/zenodo.5413149

    Open Access

    This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Contact info

    Please send any feedback or questions to: Veronica Morfi: g.v.morfi@qmul.ac.uk

  13. MER Opportunity and Spirit Rovers Pancam Images Labeled Data Set

    • explore.openaire.eu
    Updated Dec 3, 2020
    Cite
    Brandon Zhao; Shoshanna Cole; Steven Lu (2020). MER Opportunity and Spirit Rovers Pancam Images Labeled Data Set [Dataset]. http://doi.org/10.5281/zenodo.4302759
    Explore at:
    Dataset updated
    Dec 3, 2020
    Authors
    Brandon Zhao; Shoshanna Cole; Steven Lu
    Description

    Introduction

    The data set is based on 3,004 images collected by the Pancam instruments mounted on the Opportunity and Spirit rovers from NASA's Mars Exploration Rovers (MER) mission. We used rotation, skewing, and shearing augmentation methods to increase the total collection to 70,864 (see the Augmentation section for more information). Based on the MER Data Catalog User Survey [1], we identified 25 classes of both scientific (e.g. soil trench, float rocks, etc.) and engineering (e.g. rover deck, Pancam calibration target, etc.) interests (see the Classes section for more information). The 3,004 images were labeled on the Zooniverse platform, and each image is allowed to be assigned with multiple labels. The images are either 512 x 512 or 1024 x 1024 pixels in size (see the Image Sampling section for more information).

    Classes

    There is a total of 25 classes for this data set. See the list below for class names, counts, and percentages (the percentages are computed as count divided by 3,004). Note that the total counts don't sum up to 3,004 and the percentages don't sum up to 1.0 because each image may be assigned with more than one class.

    Class name, count, percentage of dataset:
    Rover Deck, 222, 7.39%
    Pancam Calibration Target, 14, 0.47%
    Arm Hardware, 4, 0.13%
    Other Hardware, 116, 3.86%
    Rover Tracks, 301, 10.02%
    Soil Trench, 34, 1.13%
    RAT Brushed Target, 17, 0.57%
    RAT Hole, 30, 1.00%
    Rock Outcrop, 1915, 63.75%
    Float Rocks, 860, 28.63%
    Clasts, 1676, 55.79%
    Rocks (misc), 249, 8.29%
    Bright Soil, 122, 4.06%
    Dunes/Ripples, 1000, 33.29%
    Rock (Linear Features), 943, 31.39%
    Rock (Round Features), 219, 7.29%
    Soil, 2891, 96.24%
    Astronomy, 12, 0.40%
    Spherules, 868, 28.89%
    Distant Vista, 903, 30.23%
    Sky, 954, 31.76%
    Close-up Rock, 23, 0.77%
    Nearby Surface, 2006, 66.78%
    Rover Parts, 301, 10.02%
    Artifacts, 28, 0.93%

    Image Sampling

    Images in the MER rover Pancam archive are of sizes ranging from 64x64 to 1024x1024 pixels. The largest size, 1024x1024, was by far the most common size in the archive. For the deep learning dataset, we elected to sample only 1024x1024 and 512x512 images as the higher resolution would be beneficial to feature extraction. In order to ensure that the data set is representative of the total image archive of 4.3 million images, we elected to sample via "site code". Each Pancam image has a corresponding two-digit alphanumeric "site code" which is used to track location throughout its mission. Since each "site code" corresponds to a different general location, sampling a fixed proportion of images taken from each site ensured that the data set contained some images from each location. In this way, we could ensure that a model performing well on this dataset would generalize well to the unlabeled archive data as a whole. We randomly sampled 20% of the images at each site within the subset of Pancam data fitting all other image criteria, applying a floor function to non-whole number sample sizes, resulting in a dataset of 3,004 images.

    Train/validation/test sets split

    The 3,004 images were split into train, validation, and test data sets. The split was done so that roughly 60, 15, and 25 percent of the 3,004 images would end up as train, validation, and test data sets respectively, while ensuring that images from a given site are not split between train/validation/test data sets. This resulted in 1,806 train images, 456 validation images, and 742 test images.

    Augmentation

    To augment the images in train and validation data sets (note that images in the test data set were not augmented), three augmentation methods were chosen that best represent transformations that could be realistically seen in Pancam images. The three augmentation methods are rotation, skew, and shear. The augmentation methods were applied with random magnitude, followed by a random horizontal flipping, to create 30 augmented images for each image. Since each transformation is followed by a square crop in order to keep the input shape consistent, we had to constrict the magnitude limits of each augmentation to avoid cropping out important features at the edges of input images. Thus, rotations were limited to 15 degrees in either direction, the 3-dimensional skew was limited to 45 degrees in any direction, and shearing was limited to 10 degrees in either direction. Note that augmentation was done only on training and validation images.

    Directory Contents

    images: contains all 70,864 images
    train-set-v1.1.0.txt: label file for the training data set
    val-set-v1.1.0.txt: label file for the validation data set
    test-set-v1.1.0.txt: label file for the testing data set

    Images with relatively short file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg) are original images, and images with long file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg_04140167-5781-49bd-a913-6d4d0a61dab1.jpg) are augmented images. The label files are formatted as "Image name, Class1, Class2, ..., ClassN".

    Reference

    [1] S.B. Cole, J.C. Aubele, B.A. Cohen, S.M. Milkovich, and S.A...
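    The site-grouped split described above (roughly 60/15/25 with no site shared between subsets) can be approximated with scikit-learn's GroupShuffleSplit, as sketched below; the metadata file and column names are assumptions.

      # Sketch: split image metadata into train/val/test without splitting any site.
      import pandas as pd
      from sklearn.model_selection import GroupShuffleSplit

      meta = pd.read_csv("pancam_image_metadata.csv")   # placeholder: filename, site_code columns

      outer = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
      trainval_idx, test_idx = next(outer.split(meta, groups=meta["site_code"]))
      trainval, test = meta.iloc[trainval_idx], meta.iloc[test_idx]

      inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)  # 0.2 of 75% is ~15%
      train_idx, val_idx = next(inner.split(trainval, groups=trainval["site_code"]))
      train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]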

  14. Data from: DCASE 2024 Task 5: Few-shot Bioacoustic Event Detection Development Set

    • data.niaid.nih.gov
    Updated Mar 31, 2024
    Cite
    Singh, Shubhr (2024). DCASE 2024 Task 5: Few-shot Bioacoustic Event Detection Development Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10829603
    Explore at:
    Dataset updated
    Mar 31, 2024
    Dataset provided by
    Gill, Lisa
    Grout, Emily
    Jensen, Frants
    Nolasco, Inês
    Pamula, Hanna
    Singh, Shubhr
    Morford, Joe
    Ghani, Burooj
    Emmerson, Michael
    Vidaña-Vila, Ester
    Kiskin, Ivan
    Liang, Jinhua
    Whitehead, Helen
    Strandburg-Peshkin, Ariana
    Stowell, Dan
    Lostanlen, Vincent
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description:

    The development set for task 5 of DCASE 2024 "Few-shot Bioacoustic Event Detection" consists of 217 audio files acquired from different bioacoustic sources. The dataset is split into training and validation sets.

    Multi-class annotations are provided for the training set with positive (POS), negative (NEG) and unknown (UNK) values for each class. UNK indicates uncertainty about a class.

    Single-class (class of interest) annotations are provided for the validation set, with events marked as positive (POS) or unknown (UNK) provided for the class of interest.

    Folder Structure:

    Development_set.zip

    |_Development_Set/

        |_Training_Set/

            |_JD/
                |_*.wav
                |_*.csv
            |_HT/
                |_*.wav
                |_*.csv
            |_BV/
                |_*.wav
                |_*.csv
            |_MT/
                |_*.wav
                |_*.csv
            |_WMW/
                |_*.wav
                |_*.csv

        |_Validation_Set/

            |_HB/
                |_*.wav
                |_*.csv
            |_PB/
                |_*.wav
                |_*.csv
            |_ME/
                |_*.wav
                |_*.csv
            |_PB24/
                |_*.wav
                |_*.csv
            |_RD/
                |_*.wav
                |_*.csv
            |_PW/
                |_*.wav
                |_*.csv
    

    Development_set_annotations.zip has the same structure but contains only the *.csv files

    Dataset statistics

    Some statistics on this dataset are as follows, split between training and validation set and their sub-folders:

    TRAINING SET
    Number of audio recordings | 174
    Total duration | 21 hours
    Total classes | 47
    Total events | 14,229

    TRAINING SET/BV
    Number of audio recordings | 5
    Total duration | 10 hours
    Total classes | 11
    Total events | 9,026
    Sampling rate | 24,000 Hz

    TRAINING SET/HT
    Number of audio recordings | 5
    Total duration | 5 hours
    Total classes | 5
    Total events | 611
    Sampling rate | 6,000 Hz

    TRAINING SET/JD
    Number of audio recordings | 1
    Total duration | 10 mins
    Total classes | 1
    Total events | 357
    Sampling rate | 22,050 Hz

    TRAINING SET/MT
    Number of audio recordings | 2
    Total duration | 1 hour and 10 mins
    Total classes | 4
    Total events | 1,294
    Sampling rate | 8,000 Hz

    TRAINING SET/WMW
    Number of audio recordings | 161
    Total duration | 4 hours and 40 mins
    Total classes | 26
    Total events | 2,941
    Sampling rate | various sampling rates

    VALIDATION SET
    Number of audio recordings | 43
    Total duration | 49 hours and 57 minutes
    Total classes | 7
    Total events | 3,504

    VALIDATION SET/HB
    Number of audio recordings | 10
    Total duration | 2 hours and 38 minutes
    Total classes | 1
    Total events | 712
    Sampling rate | 44,100 Hz

    VALIDATION SET/PB
    Number of audio recordings | 6
    Total duration | 3 hours
    Total classes | 2
    Total events | 292
    Sampling rate | 44,100 Hz

    VALIDATION SET/ME
    Number of audio recordings | 2
    Total duration | 20 minutes
    Total classes | 2
    Total events | 73
    Sampling rate | 44,100 Hz

    VALIDATION SET/PB24
    Number of audio recordings | 4
    Total duration | 2 hours
    Total classes | 2
    Total events | 350
    Sampling rate | 44,100 Hz

    VALIDATION SET/RD
    Number of audio recordings | 6
    Total duration | 18 hours
    Total classes | 1
    Total events | 1,372
    Sampling rate | 48,000 Hz

    VALIDATION SET/PW
    Number of audio recordings | 15
    Total duration | 24 hours
    Total classes | 1
    Total events | 705
    Sampling rate | 96,000 Hz

    Annotation structure

    Each line of the annotation csv represents an event in the audio file. The column descriptions are as follows:

    TRAINING SET
    Audiofilename, Starttime, Endtime, CLASS_1, CLASS_2, ...CLASS_N

    VALIDATION SET
    Audiofilename, Starttime, Endtime, Q

    Classes

    DCASE2024_task5_training_set_classes.csv and DCASE2024_task5_validation_set_classes.csv provide a table with class code correspondence to class name for all classes in the Development set. Additionally, DCASE2024_task5_validation_set_classes.csv also provides a recording names column.

    DCASE2024_task5_training_set_classes.csv
    dataset, class_code, class_name

    DCASE2024_task5_validation_set_classes.csv
    dataset, recording, class_code, class_name

    Evaluation Set

    The Evaluation set for this task will be released on 1 June 2024.

    Open Access:

    This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Contact info:

    Please send any feedback or questions to:

    Burooj Ghani - burooj.ghani@naturalis.nl | Ines Nolasco - i.dealmeidanolasco@qmul.ac.uk

    Alternatively, join us on Slack: task-fewshot-bio-sed

  15. BoolQ: Question Answering Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). BoolQ: Question Answering Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/0aa8f4c4-227b-48ab-8294-fafde5cb3afe
    Explore at:
    Available download formats
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The BoolQ dataset is a valuable resource crafted for question answering tasks. It is organised into two main splits: a validation split and a training split. The primary aim of this dataset is to facilitate research in natural language processing (NLP) and machine learning (ML), particularly in tasks involving the answering of questions based on provided text. It offers a rich collection of user-posed questions, their corresponding answers, and the passages from which these answers are derived. This enables researchers to develop and evaluate models for real-world scenarios where information needs to be retrieved or understood from textual sources.

    Columns

    • question: This column contains the specific questions posed by users. It provides insight into the information that needs to be extracted from the given passage.
    • answer: This column holds the correct answers to each corresponding question in the dataset. The objective is to build models that can accurately predict these answers. The 'answer' column includes Boolean values, with true appearing 5,874 times (62%) and false appearing 3,553 times (38%).
    • passage: This column serves as the context or background information from which questions are formulated and answers must be located.

    Distribution

    The BoolQ dataset consists of two main parts: a validation split and a training split. Both splits feature consistent data fields: question, answer, and passage. The train.csv file, for example, is part of the training data. While per-split row counts are not detailed here, the 'answer' column contains 9,427 Boolean values in total.
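
    As a quick orientation, the training split can be inspected with pandas. This is only a sketch, assuming the train.csv file mentioned above sits in the working directory and carries the three documented columns.

        import pandas as pd

        df = pd.read_csv("train.csv")

        # The three documented fields.
        print(df[["question", "passage", "answer"]].head())

        # Class balance of the Boolean 'answer' column
        # (reported above as roughly 62% true / 38% false).
        print(df["answer"].value_counts(normalize=True))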

    Usage

    This dataset is ideally suited for:

    • Question Answering Systems: Training models to identify correct answers from multiple choices, given a question and a passage.
    • Machine Reading Comprehension: Developing models that can understand and interpret written text effectively.
    • Information Retrieval: Enabling models to retrieve relevant passages or documents that contain answers to a given query or question.

    Coverage

    The sources do not specify the geographic, time range, or demographic scope of the data.

    License

    CC0

    Who Can Use It

    The BoolQ dataset is primarily intended for researchers and developers working in artificial intelligence fields such as Natural Language Processing (NLP) and Machine Learning (ML). It is particularly useful for those building or evaluating:

    • Question answering algorithms
    • Information retrieval systems
    • Machine reading comprehension models

    Dataset Name Suggestions

    • BoolQ: Question Answering Dataset
    • Text-Based Question Answering Corpus
    • NLP Question-Answer-Passage Data
    • Machine Reading Comprehension BoolQ
    • Boolean Question Answering Data

    Attributes

    Original Data Source: BoolQ - Question-Answer-Passage Consistency

  16. pinterest_dataset

    • data.mendeley.com
    Updated Oct 27, 2017
    + more versions
    Cite
    Juan Carlos Gomez (2017). pinterest_dataset [Dataset]. http://doi.org/10.17632/fs4k2zc5j5.2
    Explore at:
    Dataset updated
    Oct 27, 2017
    Authors
    Juan Carlos Gomez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset with 72,000 pins from 117 users on Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.

    This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper is published in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.

    There are nine files. text_test, text_train, and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train, and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train, and validation files for images, text, and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
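
    The on-disk formats of these files are not specified above, so the following alignment check is only a sketch under assumed formats (NumPy arrays for the image features, line-aligned plain text for the pin texts, and a plain-text integer list for the user indices); the file extensions are placeholders.

        import numpy as np

        # Assumed formats and extensions; adjust loading to the actual files.
        imag_train = np.load("imag_train.npy")                  # (n_pins, 4096) CNN features
        with open("text_train.txt", encoding="utf-8") as f:
            text_train = [line.rstrip("\n") for line in f]
        train_user = np.loadtxt("train_user.txt", dtype=int)    # user index 0..116 per pin

        # The splits are described as one-to-one aligned across text, image, and user files.
        assert imag_train.shape[0] == len(text_train) == train_user.shape[0]
        print(f"{imag_train.shape[0]} training pins from {len(set(train_user.tolist()))} users")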

    If you have questions regarding the data, write to: jc dot gomez at ugto dot mx

  17. ImageNet VIPriors subset Dataset

    • paperswithcode.com
    Updated Mar 4, 2021
    Cite
    Robert-Jan Bruintjes; Attila Lengyel; Marcos Baptista Rios; Osman Semih Kayhan; Jan van Gemert (2021). ImageNet VIPriors subset Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-vipriors-subset
    Explore at:
    Dataset updated
    Mar 4, 2021
    Authors
    Robert-Jan Bruintjes; Attila Lengyel; Marcos Baptista Rios; Osman Semih Kayhan; Jan van Gemert
    Description

    The training and validation data are subsets of the training split of ImageNet 2012. The test set is taken from the validation split of the ImageNet 2012 dataset. Each split includes 50 images per class.
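
    For readers who want a comparable subset, a 50-images-per-class sample can be drawn from an ImageNet-style directory (one folder per class) with a sketch like the one below. This is purely illustrative and is not the official VIPriors tooling; the directory path and file extension are placeholders.

        import random
        from pathlib import Path

        def sample_subset(train_dir: str, per_class: int = 50, seed: int = 0):
            """Pick up to `per_class` images from each class folder."""
            rng = random.Random(seed)
            subset = {}
            for class_dir in sorted(Path(train_dir).iterdir()):
                if class_dir.is_dir():
                    images = sorted(class_dir.glob("*.JPEG"))
                    subset[class_dir.name] = rng.sample(images, min(per_class, len(images)))
            return subset

        # subset = sample_subset("/path/to/imagenet/train")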

  18. Dataset for generating TL;DR

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Syed, Shahbaz (2020). Dataset for generating TL;DR [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1168854
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Voelske, Michael
    Stein, Benno
    Syed, Shahbaz
    Potthast, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the TL;DR challenge, containing posts from the Reddit corpus and suitable for abstractive summarization using deep learning. The format is JSON Lines: each line is a JSON object representing a post. The schema of each post is shown below:

    author: string (nullable = true)

    body: string (nullable = true)

    normalizedBody: string (nullable = true)

    content: string (nullable = true)

    content_len: long (nullable = true)

    summary: string (nullable = true)

    summary_len: long (nullable = true)

    id: string (nullable = true)

    subreddit: string (nullable = true)

    subreddit_id: string (nullable = true)

    title: string (nullable = true)

    Specifically, the content and summary fields can be directly used as inputs to a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,084,410 posts with an average length of 211 words for content and 25 words for the summary.

    Note: As this is the complete dataset for the challenge, it is up to the participants to split it into training and validation sets accordingly.
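
    A minimal way to consume the release is sketched below under the assumption that the corpus is a single JSON-lines file (the file name here is a placeholder): read one post per line, keep the content and summary fields, and carve out a validation set.

        import json
        import random

        posts = []
        with open("tldr-corpus.jsonl", encoding="utf-8") as f:   # placeholder file name
            for line in f:
                post = json.loads(line)
                posts.append({"content": post["content"], "summary": post["summary"]})

        # The split is left to participants; a simple random 95/5 split:
        random.seed(42)
        random.shuffle(posts)
        cut = int(0.95 * len(posts))
        train, valid = posts[:cut], posts[cut:]
        print(f"{len(train)} training pairs, {len(valid)} validation pairs")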

  19. MSL Curiosity Rover Images with Science and Engineering Classes

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 17, 2020
    Cite
    Kiri L. Wagstaff (2020). MSL Curiosity Rover Images with Science and Engineering Classes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892023
    Explore at:
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Steven Lu
    Kiri L. Wagstaff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.

    Data Set Description

    The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover by three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes with science and engineering interests (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.

    Directory Contents

    images - contains all 6,820 images

    class_map.csv - string-integer class mappings

    train-set-v2.1.txt - label file for the training set

    val-set-v2.1.txt - label file for the validation set

    test-set-v2.1.txt - label file for the test set

    The label files are formatted as below:

    "Image-file-name class_in_integer_representation"

    Labeling Process

    Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:

    If all three labels agree with each other, then use the label as the final label.

    If the three labels do not agree with each other, then we manually review the labels and decide the final label.

    We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.

    Classes

    There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The names of classes, string-integer mappings, distributions are shown below:

    Class name, counts (training set), counts (validation set), counts (test set), integer representation

    Arm cover, 10, 1, 4, 0

    Other rover part, 190, 11, 10, 1

    Artifact, 680, 62, 132, 2

    Nearby surface, 1554, 74, 187, 3

    Close-up rock, 1422, 50, 84, 4

    DRT, 8, 4, 6, 5

    DRT spot, 214, 1, 7, 6

    Distant landscape, 342, 14, 34, 7

    Drill hole, 252, 5, 12, 8

    Night sky, 40, 3, 4, 9

    Float, 190, 5, 1, 10

    Layers, 182, 21, 17, 11

    Light-toned veins, 42, 4, 27, 12

    Mastcam cal target, 122, 12, 29, 13

    Sand, 228, 19, 16, 14

    Sun, 182, 5, 19, 15

    Wheel, 212, 5, 5, 16

    Wheel joint, 62, 1, 5, 17

    Wheel tracks, 26, 3, 1, 18

    Image Augmentation

    Only the training set contains augmented images. 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using a horizontal flipping method, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set-v2.1.txt file to obtain the set of non-augmented images (a minimal filtering sketch follows the suffix list below).

    90 degrees clockwise rotation (file name ends with -r90.jpg)

    180 degrees clockwise rotation (file name ends with -r180.jpg)

    270 degrees clockwise rotation (file name ends with -r270.jpg)

    Horizontal flip (file name ends with -fh.jpg)

    Vertical flip (file name ends with -fv.jpg)
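
    Filtering out the augmented copies is then a matter of dropping the suffixes listed above; a minimal sketch, using the v2.1 training label file named in the directory contents:

        AUG_SUFFIXES = ("-r90.jpg", "-r180.jpg", "-r270.jpg", "-fh.jpg", "-fv.jpg")

        with open("train-set-v2.1.txt") as f:
            lines = [line.strip() for line in f if line.strip()]

        # Keep only entries whose image file name does not carry an augmentation suffix.
        originals = [line for line in lines if not line.split()[0].endswith(AUG_SUFFIXES)]
        print(f"{len(originals)} non-augmented training images out of {len(lines)}")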

    Acknowledgment

    The authors would like to thank the volunteers (as in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for the continuous support of this work.

  20. Multimodal Skin Lesion Classification Dataset

    • kaggle.com
    Updated Sep 2, 2024
    Cite
    Bekhzod Olimov (2024). Multimodal Skin Lesion Classification Dataset [Dataset]. https://www.kaggle.com/datasets/killa92/multimodal-skin-lesion-classification-dataset/versions/3
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bekhzod Olimov
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains skin lesion images with corresponding metadata features. The dataset is split into two folders, train and test. For training purposes, however, it should be divided into the three splits typically used in machine learning and deep learning work: train, validation, and test. The structure of the data is as follows:

    ROOT

    train: img_file; img_file; img_file; ...; img_file
    test: img_file; img_file; img_file; ...; img_file
    train.csv: meta data features
    test.csv: meta data features

    For the multimodal image classification task, GT labels can be found in the csv file (target column). Good luck!
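
    Since only train and test folders ship with the data, a validation split has to be carved out of the training metadata. The sketch below does a stratified 80/20 split on the target column mentioned above; other column names are unknown and left untouched.

        import pandas as pd
        from sklearn.model_selection import train_test_split

        meta = pd.read_csv("train.csv")

        # Stratify on the ground-truth "target" column so class balance is preserved.
        train_meta, val_meta = train_test_split(
            meta, test_size=0.2, stratify=meta["target"], random_state=42
        )
        print(len(train_meta), "train rows,", len(val_meta), "validation rows")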
