100+ datasets found
  1. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
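
    For illustration only, a minimal sketch (not the authors' pipeline) of the capped sampling and random splitting described above; the metadata file, its column names, and the 80/10/10 fractions are assumptions:

    import pandas as pd

    # Hypothetical metadata table: one row per audio file (file name and column names assumed).
    files = pd.read_csv("nabat_file_metadata.csv")

    # Randomly sample up to 1250 recordings per species/grid-cell combination.
    capped = (
        files.groupby(["species", "grid_cell"], group_keys=False)
             .apply(lambda g: g.sample(n=min(len(g), 1250), random_state=0))
    )

    # Randomly split into training, validation, and test (holdout) sets.
    shuffled = capped.sample(frac=1.0, random_state=0)
    n = len(shuffled)
    train = shuffled.iloc[:int(0.8 * n)]
    val = shuffled.iloc[int(0.8 * n):int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n):]  # holdout; excluded from this release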

  2. Something-Something V1 Dataset

    • paperswithcode.com
    Updated Mar 28, 2023
    Cite
    Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic (2023). Something-Something V1 Dataset [Dataset]. https://paperswithcode.com/dataset/something-something-v1
    Dataset updated
    Mar 28, 2023
    Authors
    Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic
    Description

    The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.

    ⚠️ Attention: This is the outdated V1 of the dataset. V2 is available here.

  3. PAN22 Authorship Analysis: Style Change Detection

    • explore.openaire.eu
    Updated Mar 7, 2022
    Cite
    Eva Zangerle; Maximilian Mayerl; Michael Tschuggnall; Martin Potthast; Benno Stein (2022). PAN22 Authorship Analysis: Style Change Detection [Dataset]. http://doi.org/10.5281/zenodo.6334244
    Dataset updated
    Mar 7, 2022
    Authors
    Eva Zangerle; Maximilian Mayerl; Michael Tschuggnall; Martin Potthast; Benno Stein
    Description

    This is the dataset for the Style Change Detection task of PAN 2022.

    Task
    The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: if multiple authors have written a text together, can we find evidence for this fact, i.e., do we have a means to detect variations in the writing style? Answering this question belongs to the most difficult and most interesting challenges in author identification: style change detection is the only means of detecting plagiarism in a document if no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support. Previous editions of the Style Change Detection task aimed at, e.g., detecting whether a document is single- or multi-authored (2018), the actual number of authors within a document (2019), whether there was a style change between two consecutive paragraphs (2020, 2021), and where the actual style changes were located (2021). Based on the progress made towards this goal in previous years, we again extend the set of challenges to likewise entice novices and experts. Given a document, we ask participants to solve the following three tasks:

    • [Task 1] Style Change Basic: for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors' texts on the paragraph level).
    • [Task 2] Style Change Advanced: for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document).
    • [Task 3] Style Change Real-World: for a text written by two or more authors, find all positions of writing style change, where style changes now occur not only between paragraphs but at the sentence level.

    All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.

    Data
    To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for task 1, dataset2 for task 2, and dataset3 for task 3). Each dataset is split into three parts:

    • training set: contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.
    • validation set: contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.
    • test set: contains 15% of the whole dataset; no ground truth data is given. This set is used for evaluation (see later).

    You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.

    Input Format
    The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively. For each problem instance X (i.e., each input document), two files are provided: problem-X.txt contains the actual text, where paragraphs are denoted by newlines for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by newlines). truth-problem-X.json contains the ground truth, i.e., the correct solution in JSON format. An example file is listed in the following (note that we list the keys for the three tasks here):

    { "authors": NUMBER_OF_AUTHORS, "site": SOURCE_SITE, "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3, "paragraph-authors": RESULT_ARRAY_TASK2 }

    The result for task 1 (key "changes") is represented as an array holding a binary value for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic). The result for task 3 (key "changes") is structured like the results array for task 1; however, for task 3 the changes array holds a binary value for each pair of consecutive sentences, and there may be multiple style changes in the document. An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):

    { "changes": [0,0,1,...], "paragraph-authors": [1,1,1,2,...] }

    Output Format
    To...
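
    The ground-truth files described above are plain JSON; a minimal reading sketch, assuming the folder layout implied above (paths are illustrative):

    import json
    from pathlib import Path

    problem_id = 1
    base = Path("dataset1/train")  # folder layout assumed

    # Paragraphs (or sentences, for task 3) are newline-delimited.
    text = (base / f"problem-{problem_id}.txt").read_text(encoding="utf-8")
    paragraphs = text.rstrip("\n").split("\n")

    truth = json.loads((base / f"truth-problem-{problem_id}.json").read_text(encoding="utf-8"))

    # Tasks 1/3: one binary flag per pair of consecutive paragraphs/sentences.
    changes = truth["changes"]             # e.g. [0, 0, 1, ...]
    # Task 2: one author id per paragraph, e.g. [1, 1, 1, 2, ...].
    authors = truth.get("paragraph-authors")

    print(truth["authors"], truth["site"])
    print(len(paragraphs), len(changes))   # len(changes) == len(paragraphs) - 1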

  4. Data for the manuscript "Spatially resolved uncertainties for machine...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated May 3, 2024
    Cite
    Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. http://doi.org/10.5281/zenodo.11093925
    Available download formats: application/gzip, bin, text/x-python
    Dataset updated
    May 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

    • mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

    • aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

    • data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

    • data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

    • data_h2o.tar.gz contains:

      • full_db.extxyz: The full dataset of 1.5k structures.

      • iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

      • The subfolders in the folders random and uncertain contain the training and validation sets for the random and uncertainty-based active learning loops.
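
    The .extxyz files listed above can be read with the Atomic Simulation Environment (ASE); a minimal sketch, assuming the archives are extracted and that energies are stored in the files:

    from ase.io import read

    # index=":" returns a list of Atoms objects, one per structure.
    structures = read("full_db.extxyz", index=":")
    print(len(structures))  # ~1.5k structures expected

    atoms = structures[0]
    print(atoms.get_chemical_formula())
    print(atoms.get_potential_energy())  # works if energies are stored in the file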

  5. Pubmed Journal Recommendation System dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 18, 2023
    Cite
    Jiayun Liu; Manuel Castillo Cara; Manuel Castillo Cara; Raúl García Castro; Raúl García Castro; Jiayun Liu (2023). Pubmed Journal Recommendation System dataset [Dataset]. http://doi.org/10.5281/zenodo.8386011
    Available download formats: csv
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiayun Liu; Manuel Castillo Cara; Manuel Castillo Cara; Raúl García Castro; Raúl García Castro; Jiayun Liu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for journal recommendation; includes title, abstract, keywords, and journal.

    We extracted the journals and additional information from:

    Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.

    Dataset Components:

    • data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.

    • data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.

    • data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
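
    For illustration, the per-journal 60/20/20 split described above can be approximated with a stratified split; a minimal sketch (not the authors' code), assuming a data_pubmed.csv file with a journal column:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("data_pubmed.csv")  # file name and column name assumed

    # Stratifying on the journal keeps each journal proportionally
    # represented in every partition (60/20/20).
    train, rest = train_test_split(df, test_size=0.4, stratify=df["journal"], random_state=42)
    val, test = train_test_split(rest, test_size=0.5, stratify=rest["journal"], random_state=42)
    print(len(train), len(val), len(test))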

  6. Gender Classification from an image

    • kaggle.com
    zip
    Updated Jul 6, 2021
    Cite
    Gerry (2021). Gender Classification from an image [Dataset]. https://www.kaggle.com/gpiosenka/gender-classification-from-an-image
    Available download formats: zip (180858795 bytes)
    Dataset updated
    Jul 6, 2021
    Authors
    Gerry
    Description

    Context

    I do a lot of work with image data sets. Often it is necessary to partition the images into male and female data sets. Doing this by hand can be a long and tedious task particularly on large data sets. So I decided to create a classifier that could do the task for me.

    Content

    I used the CELEBA aligned data set to provide the images. I went through and separated the images visually into 1747 female and 1747 male training images. I also created 100 male and 100 female test images and 100 male and 100 female validation images. I wanted only the face to be in the image, so I developed an image cropping function using MTCNN to crop all the images. That function is included as one of the notebooks should anyone have a need for a good face cropping function. I also created an image duplicate detector to try to eliminate any of the training images from appearing in the test or validation images. I have developed a general-purpose image classification function that works very well for most image classification tasks. It contains the option to select 1 of 7 models for use. For this application I used the MobileNet model because it is less computationally expensive and gives excellent results. On the test set, accuracy is near 100%.

    Acknowledgements

    The CELEBA aligned data set was used. This data set is very large and of good quality. To crop the images to only include the face, I developed a face cropping function using MTCNN. MTCNN is very accurate and reasonably fast; however, it is not flawless, so after cropping the images you should always visually inspect the results.
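
    A face-cropping step like the one described above can be sketched with the open-source mtcnn package; this is an illustration, not the author's included notebook:

    import cv2
    from mtcnn import MTCNN

    detector = MTCNN()
    # detect_faces expects an RGB array; OpenCV loads BGR, so convert.
    img = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

    faces = detector.detect_faces(img)
    if faces:  # MTCNN can miss faces, so always check before cropping
        x, y, w, h = faces[0]["box"]
        x, y = max(x, 0), max(y, 0)  # boxes may extend past the image edge
        crop = img[y:y + h, x:x + w]
        cv2.imwrite("face.jpg", cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))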

    Inspiration

    I developed this data set to train a classifier to distinguish the gender shown in an image. Why bother, you may ask, when you can just look at the image and tell? True, but suppose you have a data set of 50,000 images that you want to separate into male and female sets. Doing that by hand would take forever. With a trained classifier of near 100% accuracy, you can use model.predict to do the job for you.

  7. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +4more
    zip
    Updated Aug 10, 2022
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Available download formats: zip
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms


    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. (Source)

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original Fashion-MNIST splits: a train set (86% of images, 60,000) and a test set (14% of images, 10,000) only.
    • No model was trained on this version.

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Train/test split background: https://blog.roboflow.com/train-test-split/ (split illustration: https://i.imgur.com/angfheJ.png)
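
    For quick experiments, Fashion-MNIST can also be loaded directly from Keras; a minimal sketch of the original split plus an 80/20 train/validation split like the one in Version 3 (the exact boundary used by Roboflow is not specified here, so this is an approximation):

    from tensorflow import keras

    (x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
    print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)

    # 80/20 train/validation split of the original training set.
    split = int(0.8 * len(x_train))
    x_tr, x_val = x_train[:split], x_train[split:]
    y_tr, y_val = y_train[:split], y_train[split:]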

    Citation:

    @online{xiao2017/online,
      author      = {Han Xiao and Kashif Rasul and Roland Vollgraf},
      title       = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
      date        = {2017-08-28},
      year        = {2017},
      eprintclass = {cs.LG},
      eprinttype  = {arXiv},
      eprint      = {cs.LG/1708.07747},
    }
    
  8. Best regions for predicting False Negatives (FN, i.e. no prediction of...

    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Best regions for predicting False Negatives (FN, i.e. no prediction of Autism whereas diagnosed Autism). [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s009
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each row corresponds to one region and each column to one model and one combination of datasets considered (training + validation + testing set 1 (no comorbidity), or all of these sets plus testing set 2, which contains subjects with comorbidities); each cell gives the number of datasets in which the region was important for predicting TN for the model considered. (CSV)

  9. WDC LSPM Dataset

    • paperswithcode.com
    Updated May 31, 2022
    Cite
    WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Dataset updated
    May 31, 2022
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for each training set are available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.

  10. code-knowledge-eval

    • huggingface.co
    Updated Oct 26, 2024
    Cite
    code-knowledge-eval [Dataset]. https://huggingface.co/datasets/kimsan0622/code-knowledge-eval
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 26, 2024
    Authors
    san kim
    Description

    Code Knowledge Value Evaluation Dataset

    This dataset is created by evaluating the knowledge value of code sourced from the bigcode/the-stack repository. It is designed to assess the educational and knowledge potential of different code samples.

      Dataset Overview
    

    The dataset is split into training, validation, and test sets with the following number of samples:

    • Training set: 22,786 samples
    • Validation set: 4,555 samples
    • Test set: 18,232 samples

      Usage… See the full description on the dataset page: https://huggingface.co/datasets/kimsan0622/code-knowledge-eval.
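
    A minimal loading sketch with the Hugging Face datasets library (the split names train/validation/test are assumed from the overview above):

    from datasets import load_dataset

    ds = load_dataset("kimsan0622/code-knowledge-eval")
    print(ds)              # expected: train 22,786 / validation 4,555 / test 18,232
    print(ds["train"][0])  # field names are not documented in this listing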
    
  11. Multivariate time series for testing -- RacketSports dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Apr 7, 2020
    Cite
    Florian Huber; Florian Huber (2020). Multivariate time series for testing -- RacketSports dataset [Dataset]. http://doi.org/10.5281/zenodo.3742271
    Available download formats: zip
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Florian Huber; Florian Huber
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The original data was retrieved from http://www.timeseriesclassification.com/description.php?Dataset=RacketSports

    Original data description:
    The data was created by university students playing badminton or squash whilst wearing a smart watch (Sony Smart Watch 3). The watch relayed the x-y-z coordinates for both the gyroscope and accelerometer to an Android phone (One Plus 5). The phone wrote these values to an Attribute-Relation File Format (ARFF) file using an app developed by a UEA computer science masters student. The problem is to identify which sport and which stroke the players are making. The data was collected at a rate of 10 Hz over 3 seconds whilst the player played either a forehand/backhand in squash or a clear/smash in badminton. The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.

    Pre-processing
    Data processing was done as described in https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_racketsports.py. The original data was split into a train and a test set. Here the data was loaded and further divided into train, test, and validation sets. To keep it simple, we simply divided the original test part into test and validation sets. The resulting data was stored as numpy .npy files.

    The zip file contains three sets of time series data (X_train, X_test, X_valid) and the respective labels (y_train, y_test, y_valid).
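
    Loading the arrays after unzipping is a one-liner per file; a minimal sketch using the file names listed above:

    import numpy as np

    X_train, y_train = np.load("X_train.npy"), np.load("y_train.npy")
    X_valid, y_valid = np.load("X_valid.npy"), np.load("y_valid.npy")
    X_test, y_test = np.load("X_test.npy"), np.load("y_test.npy")

    print(X_train.shape, y_train.shape)  # (samples, time steps, channels) expected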

    Reference:
    http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
    (The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.)

  12. pinterest_dataset

    • data.mendeley.com
    Updated Oct 27, 2017
    Cite
    pinterest_dataset [Dataset]. https://data.mendeley.com/datasets/fs4k2zc5j5/2
    Dataset updated
    Oct 27, 2017
    Authors
    Juan Carlos Gomez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset with 72000 pins from 117 users on Pinterest. Each pin contains a short raw text and an image. The images were processed using a pretrained convolutional neural network and transformed into vectors of 4096 features.

    This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper was published in the Research in Computing Science journal as part of the LKE 2017 conference. The dataset includes the splits used in the paper.

    There are nine files. text_test, text_train, and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train, and imag_val contain the image features of each pin in the corresponding split. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train, and validation files for images, text, and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.

    If you have questions regarding the data, write to: jc dot gomez at ugto dot mx

  13. Train, validation, test data sets and confusion matrices underlying...

    • 4tu.edu.hpc.n-helix.com
    zip
    Updated Sep 7, 2023
    Cite
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
    Available download formats: zip
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Annotated test and train data sets. Both images and annotations are provided separately.


    Validation data set for Hi5, Sf9 and HEK cells.


    Confusion matrices for the determination of performance parameters

  14. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Cite
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Available download formats: zip (44486490 bytes)
    Dataset updated
    Sep 11, 2019
    Authors
    Ajay
    License

    CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners start with curated sets of data, but it's a well-known fact that in a real data science project, most of the time is spent collecting, cleaning, and organizing the data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload images of the car they want to sell, and then training a deep neural network to identify the model of a car from car images. In my search for images I found that approximately 10 percent of the car pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

    Content

    There are 4000 images of two popular Maruti Suzuki car models in India (Swift and WagonR), with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images, and a test set with 800 images. The data was randomized before being split into training, test, and validation sets.

    A starter kernel is provided for Keras with a CNN; a minimal sketch in that spirit follows below. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, like data augmentation, dropout, batch normalization, and transfer learning.
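
    A minimal Keras CNN starter in the spirit of the kernel mentioned above (directory layout, image size, and architecture are illustrative assumptions, not the author's exact code):

    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "cars/train", image_size=(128, 128), batch_size=32)  # paths assumed
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "cars/val", image_size=(128, 128), batch_size=32)

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # Swift vs. WagonR
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=5)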

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?

    2. Is the data collected for the two car models representative of all possible cars from all over the country, or is there sample bias?

    3. I would also like someone to extend the concept to build a use case so that if a user uploads an incorrect car picture, the ML model could automatically flag it, for example a user uploading the wrong model or an image that is not a car.

  15. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The Python code for splitting a dataset into train, test and validation sets (a minimal sketch of such a script follows the file list).

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The Python code for translating the cleaned dataset.
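
    As noted for splitting.py above, a plausible reconstruction of such a split script; the ratios and output file names below are assumptions, not the author's actual values:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Cleaned_Dataset.csv")

    # 70/15/15 split: first carve off the training set, then halve the rest.
    train, rest = train_test_split(df, test_size=0.3, random_state=42)
    validation, test = train_test_split(rest, test_size=0.5, random_state=42)

    train.to_csv("train.csv", index=False)
    validation.to_csv("validation.csv", index=False)
    test.to_csv("test.csv", index=False)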

  16. TempKB-Event

    • huggingface.co
    Updated Feb 26, 2025
    Cite
    TempKB-Event [Dataset]. https://huggingface.co/datasets/ESITime/TempKB-Event
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 26, 2025
    Dataset authored and provided by
    Time Travelers
    Description

    TempKB

      Overview
    

    TempKB is a comprehensive collection of knowledge graph data designed to train ML models on knowledge graph completion and reasoning tasks. This is the Event version, which contains only data where time information is present.

      Dataset Structure
    

    The dataset is organized into the following main components:
    • Train Set: 1,373,083 instances for training models.
    • Validation Set: 63,421 instances for validating model performance.
    • Test Set: … See the full description on the dataset page: https://huggingface.co/datasets/ESITime/TempKB-Event.

  17. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Choksi, Bhavin
    Schaumlöffel, Timothy
    Roig, Gemma
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation
    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation, and test sets follows the splits of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    Example output (one annotation row):

    dataset: AudioSet
    filename: train/---2_BBVHAA.mp3
    captions_visual: [a man in a black hat and glasses.]
    captions_auditory: [a man speaks and dishes clank.]
    tags: [Speech]

    Description
    The annotation file consists of the following fields:
    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files
    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  18. Synthetic nursing handover training and development data set - text files

    • data.csiro.au
    • researchdata.edu.au
    Updated Mar 21, 2017
    Cite
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen (2017). Synthetic nursing handover training and development data set - text files [Dataset]. http://doi.org/10.4225/08/58d097ee92e95
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Maricel Angel; Hanna Suominen; Liyuan Zhou; Leif Hanlen
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Dataset funded by
    NICTA (http://nicta.com.au/)
    Description

    This is one of two collection records. Please see the link below for the other collection of associated audio files.

    Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.

    This collection contains 3 sets of text documents.

    Data Set 1 for Training and Development

    The data set, released in June 2014, includes the following documents:

    • Folder initialisation: initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
    • Folder 100profiles: 100 patient profiles (DOCX)
    • Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
    • Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
    • Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard, and iv) form categories with respect to our best information extraction system (TXT in CRF++ format).

    An Independent Data Set 2

    The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.

    The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written, and 100 speech-recognized, written, free-form text documents for the Dragon vocabulary of Nursing (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.

    An Independent Data Set 3

    For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.

    Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.

    See Suominen et al (2015) in the links below for a detailed description and examples.

  19. Table1_An artificial intelligence-based bone age assessment model for Han...

    • frontiersin.figshare.com
    docx
    Updated Feb 15, 2024
    Cite
    Qixing Liu; Huogen Wang; Cidan Wangjiu; Tudan Awang; Meijie Yang; Puqiong Qiongda; Xiao Yang; Hui Pan; Fengdan Wang (2024). Table1_An artificial intelligence-based bone age assessment model for Han and Tibetan children.DOCX [Dataset]. http://doi.org/10.3389/fphys.2024.1329145.s005
    Available download formats: docx
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Frontiers
    Authors
    Qixing Liu; Huogen Wang; Cidan Wangjiu; Tudan Awang; Meijie Yang; Puqiong Qiongda; Xiao Yang; Hui Pan; Fengdan Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Manual bone age assessment (BAA) is associated with longer interpretation time and higher cost and variability, thus posing challenges in areas with restricted medical facilities, such as the high-altitude Tibetan Plateau. The application of artificial intelligence (AI) for automating BAA could facilitate resolving this issue. This study aimed to develop an AI-based BAA model for Han and Tibetan children.

    Methods: A model named “EVG-BANet” was trained using three datasets, including the Radiology Society of North America (RSNA) dataset (training set n = 12611, validation set n = 1425, and test set n = 200), the Radiological Hand Pose Estimation (RHPE) dataset (training set n = 5491, validation set n = 713, and test set n = 79), and a self-established local dataset [training set n = 825 and test set n = 351 (Han n = 216 and Tibetan n = 135)]. An open-access state-of-the-art model BoNet was used for comparison. The accuracy and generalizability of the two models were evaluated using the abovementioned three test sets and an external test set (n = 256, all were Tibetan). Mean absolute difference (MAD) and accuracy within 1 year were used as indicators. Bias was evaluated by comparing the MAD between the demographic groups.

    Results: EVG-BANet outperformed BoNet in the MAD on the RHPE test set (0.52 vs. 0.63 years, p < 0.001), the local test set (0.47 vs. 0.62 years, p < 0.001), and the external test set (0.53 vs. 0.66 years, p < 0.001) and exhibited a comparable MAD on the RSNA test set (0.34 vs. 0.35 years, p = 0.934). EVG-BANet achieved accuracy within 1 year of 97.7% on the local test set (BoNet 90%, p < 0.001) and 89.5% on the external test set (BoNet 85.5%, p = 0.066). EVG-BANet showed no bias in the local test set but exhibited a bias related to chronological age in the external test set.

    Conclusion: EVG-BANet can accurately predict the bone age (BA) for both Han children and Tibetan children living in the Tibetan Plateau with limited healthcare facilities.

  20. Data from: Product Datasets from the MWPD2020 Challenge at the ISWC2020...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) [Dataset]. http://doi.org/10.3886/E127482V1
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without training data in the training set or products which have had typos introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
