100+ datasets found
  1. h

    alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 12, 2023
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.

  2. Training/Validation/Test set split

    • figshare.com
    zip
    Updated Mar 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Including the split of real and null reactions for training, validation and test

  3. DUDE competition train - validation - test splits ground truth

    • zenodo.org
    json
    Updated Mar 23, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jordy Van Landeghem; Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7763635
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Mar 23, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jordy Van Landeghem; Jordy Van Landeghem
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

    V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

    DatasetDict({
      train: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 23728
      })
      val: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 6315
      })
      test: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 11402
      })
    })
    
    ++update on answer_type
    +++formatting change to answers_variants
    ++++stricter check on answer_variants & rename annotations file
    
    + blind test set (no ground truth answers provided)
    ++ removed duplicates from test set: 
    

    "92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4",

    "6ee71a16d4e4d1dbd7c1f569a92d4e08_549f2a163f8ff3e9f0293cf59fdd98bc",

    "e6f3855472231a7ca6aada2f8e85fe5a_827c03a72f2552c722f2c872fd7f74c3",

    "e3eecd7cca5de11f1d17cd94ae6a8d77_6300df64e4cf6ba0600ac81278f68de2",

    "107b4037df8127a92ee4b6ae9b5df8fb_d7a60e7a9fc0b27487ea39cd7f56f98e",

    "300cc3900080064d308983f958141232_6a7cf1aad908d58a75ab8e02ddc856f4",

    "fdd3308efacddb88d4aa6e2073f481d4_138cb868ecc804a63cc7a4502c0009b2",

    "1f7de256ff1743d329a8402ba0d132e7_95b6e8758533a9817b9f20a958e7b776",

    "4f399b8c526ffb6a2fd585a18d4ed5ec_51097231bc327c26c59a4fd8d3ff3069",

  4. Train Test and Validation Split

    • kaggle.com
    Updated Apr 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    IMT2022053
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by IMT2022053

    Released under Apache 2.0

    Contents

  5. Dataset, splits, models, and scripts for the QM descriptors prediction

    • zenodo.org
    • explore.openaire.eu
    application/gzip
    Updated Apr 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

    Below are descriptions of the available scripts:

    1. atom_bond_descriptors.sh: Trains atom/bond targets.
    2. atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.
    3. dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
    4. dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.
    5. energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
    6. energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.
    7. get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
    8. csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.

    Below is the procedure for running the ml-QM-GNN on your own dataset:

    1. Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
    2. Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
    3. Utilize csv2pkl.py to convert the data from predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).
    4. Run Chemprop to train your models using the additional predicted features supported here.
  6. P

    Squirrel (60%/20%/20% random splits) Dataset

    • paperswithcode.com
    Updated Nov 3, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2022). Squirrel (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/squirrel-60-20-20-random-splits
    Explore at:
    Dataset updated
    Nov 3, 2022
    Description

    Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

  7. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.

  8. P

    Film (60%/20%/20% random splits) Dataset

    • library.toponeai.link
    Updated Feb 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Film (60%/20%/20% random splits) Dataset [Dataset]. https://library.toponeai.link/dataset/film-60-20-20-random-splits
    Explore at:
    Dataset updated
    Feb 8, 2025
    Description

    Node classification on Film with 60%/20%/20% random splits for training/validation/test.

  9. Yeast genotype/phenotype data partitioned into train/validation/test splits...

    • zenodo.org
    zip
    Updated Apr 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zenodo (2025). Yeast genotype/phenotype data partitioned into train/validation/test splits from Ba Nguyen et al, 2022 [Dataset]. http://doi.org/10.5281/zenodo.15313069
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The raw data comes from Ba Nguyen et al, 2022, who hosted their data here. This dataset was used in an independent study in Rijal et al, 2025, who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the data ourselves as part of this data resource.

    This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd

  10. h

    MNLP_M3_mcqa_dataset

    • huggingface.co
    Updated Jun 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    andres nowak (2025). MNLP_M3_mcqa_dataset [Dataset]. https://huggingface.co/datasets/andresnowak/MNLP_M3_mcqa_dataset
    Explore at:
    Dataset updated
    Jun 10, 2025
    Authors
    andres nowak
    Description

    This dataset contains the MCQA and instruction finetuning datasets (and the test and validation splits are only used for testing not for training):

    The messages column is used by the instruction finetuning dataset The choices, question, context, and answer columns are used by the MCQA dataset

    For the MCQA dataset (of only single answer) contains a mixture of the train, validation and test splits from this datasets as to have for training and testing:

    mmlu auxiliary train we only use the… See the full description on the dataset page: https://huggingface.co/datasets/andresnowak/MNLP_M3_mcqa_dataset.

  11. f

    Dataset

    • figshare.com
    application/x-gzip
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
    Explore at:
    application/x-gzipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Moynuddin Ahmed Shibly
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an open source - publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets - train, validation, and test. For our experiments, we created two other versions of the dataset. We have applied 10-fold cross validation on the train set and created ten folds. We also created ten bags of datasets using bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using pre-trained ResNet50 model as feature extractor. On the features extracted by ResNet50 we have applied PCA and created a tabilar dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above. Those folds are also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression has been performed for speeding up the upload and download purpose and mostly for the sake of convenience. If anyone has any question about how the datasets are organized please feel free to ask me at shiblygnr@gmail.com .I will get back to you in earliest time possible.

  12. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    bin, csv +1
    Updated Apr 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    text/x-python, csv, binAvailable download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The python code for splitting a dataset into train, test and validation set.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The python code for translating the cleaned dataset.
  13. FSOCO-split

    • kaggle.com
    Updated Oct 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tommaso Fava (2024). FSOCO-split [Dataset]. https://www.kaggle.com/datasets/tommasofava/fsoco-split/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Tommaso Fava
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    FSOCO dataset split into train (80%), validation (10%), and test (10%) set. Ready for Ultralytics YOLO training.

  14. e

    Web Data Commons Training and Test Sets for Large-Scale Product Matching -...

    • b2find.eudat.eu
    Updated Nov 27, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories, computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  15. t

    Data from: Spam Mails Dataset

    • dbrepo.datalab.tuwien.ac.at
    • test.dbrepo.tuwien.ac.at
    Updated Apr 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bernal, Nicolas (2025). Spam Mails Dataset [Dataset]. http://doi.org/10.82556/bexb-5283
    Explore at:
    Dataset updated
    Apr 19, 2025
    Authors
    Bernal, Nicolas
    Time period covered
    2025
    Description

    Preprocessed data derived from the "spam-mails" dataset, containing email messages labeled as spam or ham. Each record includes a unique identifier from the original dataset and an experiment_id indicating its assignment to a specific data split (training, validation, or test) used in this experiment. The email content has been lemmatized and cleaned to remove noise such as punctuation, special characters, and stopwords, ensuring consistent input for embedding and model training. Original data source: https://www.kaggle.com/datasets/venky73/spam-mails-dataset

  16. h

    Putnam-AXIOM-for-zip-fit-splits

    • huggingface.co
    Updated Jan 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ZIP-FIT - Compression-Based Data Selection For Code (2023). Putnam-AXIOM-for-zip-fit-splits [Dataset]. https://huggingface.co/datasets/zipfit/Putnam-AXIOM-for-zip-fit-splits
    Explore at:
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    ZIP-FIT - Compression-Based Data Selection For Code
    Description

    Putnam-AXIOM Splits for ZIP-FIT

    This repository contains the train, validation, and test splits of the Putnam-AXIOM dataset specifically for use with the ZIP-FIT methodology research. The dataset is split as follows:

    train: 150 examples validation: 150 examples test: 222 examples

    These splits are derived from the original 522 Putnam problems found in the main Putnam-AXIOM repository.

      Main Repository
    

    The full dataset with original problems and variations is available… See the full description on the dataset page: https://huggingface.co/datasets/zipfit/Putnam-AXIOM-for-zip-fit-splits.

  17. o

    Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Jun 5, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    N. Ashley Henderson; K. Steven Kauwe; D. Taylor Sparks (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. http://doi.org/10.5281/zenodo.4903957
    Explore at:
    Dataset updated
    Jun 5, 2021
    Authors
    N. Ashley Henderson; K. Steven Kauwe; D. Taylor Sparks
    Description

    This benchmark data is comprised of 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits. For datasets with more than 100 values, train-val-test splits were created, either with a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in their studies. Datasets with less than 100 values had train-test splits created using the Leave-One-Out cross-validation method. For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  18. Karpathy splits for Image Captioning

    • kaggle.com
    Updated Dec 2, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shravan Kumar (2020). Karpathy splits for Image Captioning [Dataset]. https://www.kaggle.com/shtvkumar/karpathy-splits/activity
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Shravan Kumar
    Description

    The splits were created by Andrej Karpathy and is predominently useful for Image Captioning purpose. Contains captions for Flickr8k, Flickr30k and MSCOCO datasets. And the datasets has been divided into train, test and validation splits.

    Source: http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip

  19. d

    Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  20. R

    Pothole Dataset

    • universe.roboflow.com
    zip
    Updated Nov 1, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Brad Dwyer (2020). Pothole Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/pothole-voxrl/model/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 1, 2020
    Dataset authored and provided by
    Brad Dwyer
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Variables measured
    Potholes Bounding Boxes
    Description

    Pothole Dataset

    https://i.imgur.com/7Xz8d5M.gif" alt="Example Image">

    This is a collection of 665 images of roads with the potholes labeled. The dataset was created and shared by Atikur Rahman Chitholian as part of his undergraduate thesis and was originally shared on Kaggle.

    Note: The original dataset did not contain a validation set; we have re-shuffled the images into a 70/20/10 train-valid-test split.

    Usage

    This dataset could be used for automatically finding and categorizing potholes in city streets so the worst ones can be fixed faster.

    The dataset is provided in a wide variety of formats for various common machine learning models.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

  Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.

Search
Clear search
Close search
Google apps
Main menu