Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., the holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
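For orientation, a minimal sketch of iterating over the release as described above (one folder per four-letter species code containing WAV files); the local path nabat_training_data is a hypothetical extraction directory:
```python
from pathlib import Path

# Count recordings per species folder; folder names are four-letter species codes
# (see the code-definition table included in the release).
root = Path("nabat_training_data")  # hypothetical path to the extracted release

counts = {}
for species_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    counts[species_dir.name] = len(list(species_dir.glob("*.wav")))

for code, n in counts.items():
    print(f"{code}: {n} recordings")
```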
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
⚠️ Attention: This is the outdated V1 of the dataset. V2 is available here.
This is the dataset for the Style Change Detection task of PAN 2022.
Task
The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: if multiple authors have written a text together, can we find evidence for this fact; i.e., do we have a means to detect variations in the writing style? Answering this question is one of the most difficult and most interesting challenges in author identification: style change detection is the only means to detect plagiarism in a document if no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support. Previous editions of the Style Change Detection task aimed at, e.g., detecting whether a document is single- or multi-authored (2018), the actual number of authors within a document (2019), whether there was a style change between two consecutive paragraphs (2020, 2021), and where the actual style changes were located (2021). Based on the progress made towards this goal in previous years, we again extend the set of challenges to entice novices and experts alike. Given a document, we ask participants to solve the following three tasks:
[Task 1] Style Change Basic: for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors' texts on the paragraph level).
[Task 2] Style Change Advanced: for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document).
[Task 3] Style Change Real-World: for a text written by two or more authors, find all positions of writing style change, where style changes now not only occur between paragraphs, but at the sentence level.
All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.
Data
To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for task 1, dataset2 for task 2, and dataset3 for task 3). Each dataset is split into three parts:
training set: contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.
validation set: contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.
test set: contains 15% of the whole dataset; no ground truth data is given. This set is used for evaluation (see later).
You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.
Input Format
The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively. For each problem instance X (i.e., each input document), two files are provided: problem-X.txt contains the actual text, where paragraphs are denoted by newlines for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by newlines).
truth-problem-X.json contains the ground truth, i.e., the correct solution, in JSON format. An example file is listed in the following (note that we list keys for the three tasks here):
{
  "authors": NUMBER_OF_AUTHORS,
  "site": SOURCE_SITE,
  "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3,
  "paragraph-authors": RESULT_ARRAY_TASK2
}
The result for task 1 (key "changes") is represented as an array holding a binary value for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic). The result for task 3 (key "changes") is structured like the result array for task 1; however, for task 3, the changes array holds a binary value for each pair of consecutive sentences, and there may be multiple style changes in the document. An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):
{
  "changes": [0,0,1,...],
  "paragraph-authors": [1,1,1,2,...]
}
Output Format To...
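For illustration, a minimal sketch of reading one of these ground-truth files in Python; the filename truth-problem-1.json is a hypothetical example for problem ID 1:
```python
import json

# Hypothetical example: load the ground truth for problem instance 1 and
# inspect the keys described above.
with open("truth-problem-1.json") as f:
    truth = json.load(f)

print(truth["authors"])             # total number of authors in the document
print(truth["site"])                # site the text was drawn from (i.e., topic)
print(truth["changes"])             # tasks 1/3: binary flag per consecutive pair
print(truth["paragraph-authors"])   # task 2: author index for each paragraph
```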
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:
mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).
aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.
data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic-band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].
data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.
data_h2o.tar.gz contains:
full_db.extxyz: The full dataset of 1.5k structures.
iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation sets for the active learning cycle.
The subfolders in the folders random and uncertain contain the training and validation sets for the random and uncertainty-based active learning loops (a loading sketch follows below).
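A minimal loading sketch, assuming the .extxyz files are standard extended-XYZ files readable with ASE (the repository does not state which reader to use, so this is an assumption):
```python
from ase.io import read

# Load all structures from the full water dataset after extracting data_h2o.tar.gz.
structures = read("full_db.extxyz", index=":")   # list of ase.Atoms objects
print(len(structures))                           # expected: roughly 1.5k structures

# Initial training/validation sets for the active learning cycle.
train = read("iter00_train.extxyz", index=":")
val = read("iter00_validation.extxyz", index=":")
print(len(train), len(val))
```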
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for journal recommendation; includes title, abstract, keywords, and journal.
We extracted the journals and additional information from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
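A rough sketch of such a per-journal 60/20/20 split, assuming the data sits in a pandas DataFrame with a 'journal' column (column name taken from the list above; the exact procedure used by the authors may differ):
```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_journal(df: pd.DataFrame, seed: int = 42):
    """Split each journal's articles 60/20/20 and recombine the parts."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("journal"):
        train, rest = train_test_split(group, test_size=0.4, random_state=seed)
        val, test = train_test_split(rest, test_size=0.5, random_state=seed)
        train_parts.append(train)
        val_parts.append(val)
        test_parts.append(test)
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```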
I do a lot of work with image data sets. Often it is necessary to partition the images into male and female data sets. Doing this by hand can be a long and tedious task particularly on large data sets. So I decided to create a classifier that could do the task for me.
I used the CELEBA aligned data set to provide the images. I went through and separated the images visually into 1747 female and 1747 male training images. I also created 100 male and 100 female test images, and 100 male and 100 female validation images. I wanted only the face to be in the image, so I developed an image cropping function using MTCNN to crop all the images. That function is included as one of the notebooks should anyone have a need for a good face cropping function. I also created an image duplicate detector to try to eliminate any of the training images from appearing in the test or validation images. I have developed a general-purpose image classification function that works very well for most image classification tasks. It contains the option to select 1 of 7 models for use. For this application I used the MobileNet model because it is less computationally expensive and gives excellent results. On the test set, accuracy is near 100%.
The CELEBA aligned data set was used. This data set is very large and of good quality. To crop the images to only include the face, I developed a face cropping function using MTCNN. MTCNN is a very accurate program and is reasonably fast; however, it is not flawless, so after cropping the images you should always visually inspect the results.
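In that spirit, a minimal cropping sketch using the open-source mtcnn package; this is an illustration rather than the author's exact notebook, and the file paths are hypothetical:
```python
from mtcnn import MTCNN
from PIL import Image
import numpy as np

detector = MTCNN()

def crop_face(in_path: str, out_path: str) -> bool:
    """Detect the first face in an image and save a cropped copy."""
    img = np.asarray(Image.open(in_path).convert("RGB"))
    faces = detector.detect_faces(img)
    if not faces:                      # MTCNN is accurate but not flawless
        return False
    x, y, w, h = faces[0]["box"]       # bounding box of the most confident face
    x, y = max(x, 0), max(y, 0)
    Image.fromarray(img[y:y + h, x:x + w]).save(out_path)
    return True

# Example (hypothetical paths):
# crop_face("raw/000001.jpg", "cropped/000001.jpg")
```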
I developed this data set to train a classifier to distinguish the gender shown in an image. Why bother, you may ask, when you can just look at the image and tell? True, but let's say you have a data set of 50,000 images that you want to separate into male and female data sets. Doing that by hand would take forever. With a trained classifier at near 100% accuracy, you can use model.predict to do the job for you.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
Here's an example of how the data looks (each class takes three rows):
https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
The data provides a train set (86% of images; 60,000 images) and a test set (14% of images; 10,000 images) only. The train set was split to provide 80% of its images to the training set and 20% of its images to the validation set (a loading and splitting sketch follows the citation below).
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
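A minimal sketch of recreating the 80/20 train/validation split described above, using the Keras loader as one convenient way to obtain Fashion-MNIST; the stratified split here is an assumption, not necessarily the exact split used in this release:
```python
from tensorflow.keras.datasets import fashion_mnist
from sklearn.model_selection import train_test_split

# Load the original 60,000/10,000 train/test split.
(x_train_full, y_train_full), (x_test, y_test) = fashion_mnist.load_data()

# Carve 20% of the training images off as a validation set.
x_train, x_val, y_train, y_val = train_test_split(
    x_train_full, y_train_full, test_size=0.2, random_state=0, stratify=y_train_full
)
print(x_train.shape, x_val.shape, x_test.shape)
# (48000, 28, 28) (12000, 28, 28) (10000, 28, 28)
```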
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each row is for one region, each column is for one model and one combination of datasets considered (either the training, validation, and testing 1 sets (no comorbidity), or all of these sets plus testing set 2 (containing subjects with comorbidities)); each cell gives the number of datasets in which the region was important for predicting TN for the model considered. (CSV)
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
Code Knowledge Value Evaluation Dataset
This dataset is created by evaluating the knowledge value of code sourced from the bigcode/the-stack repository. It is designed to assess the educational and knowledge potential of different code samples.
Dataset Overview
The dataset is split into training, validation, and test sets with the following number of samples:
Training set: 22,786 samples
Validation set: 4,555 samples
Test set: 18,232 samples
Usage… See the full description on the dataset page: https://huggingface.co/datasets/kimsan0622/code-knowledge-eval.
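As a rough sketch, assuming the dataset can be loaded directly from the Hugging Face Hub under the repository id above (see the dataset page for the authoritative usage instructions and split names):
```python
from datasets import load_dataset

# Load all splits of the code knowledge value evaluation dataset.
ds = load_dataset("kimsan0622/code-knowledge-eval")
print(ds)              # expected splits: train / validation / test
print(ds["train"][0])  # inspect one sample
```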
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The original data was retrieved from http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
Original data description:
The data was created by university students playing badminton or squash whilst wearing a smart watch (Sony SmartWatch 3). The watch relayed the x-y-z coordinates for both the gyroscope and accelerometer to an Android phone (OnePlus 5). The phone wrote these values to an Attribute-Relation File Format (ARFF) file using an app developed by a UEA computer science masters student. The problem is to identify which sport and which stroke the player is making. The data was collected at a rate of 10 Hz over 3 seconds whilst the player played either a forehand/backhand in squash or a clear/smash in badminton.
The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.
Pre-processing
Data processing was done as described in: https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_racketsports.py
The original data was split into a train and a test set. Here, the data was loaded and further divided into train, validation, and test sets.
To keep it simple, we simply divided the original test part into test and validation sets.
The resulting data was stored as NumPy .npy files.
The zip file contains three sets of time series data (X_train, X_test, X_valid) and the respective labels (y_train, y_test, y_valid).
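A minimal loading sketch for the pre-processed arrays after unzipping the archive (file names as listed above; the array shapes depend on the pre-processing script):
```python
import numpy as np

# Time series data and labels for each split.
X_train = np.load("X_train.npy")
y_train = np.load("y_train.npy")
X_valid = np.load("X_valid.npy")
y_valid = np.load("y_valid.npy")
X_test = np.load("X_test.npy")
y_test = np.load("y_test.npy")

print(X_train.shape, X_valid.shape, X_test.shape)
```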
Reference:
http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
(The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with 72000 pins from 117 users in Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper was published in the Research in Computing Science journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train, and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train, and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train, and validation files for images, text, and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
https://creativecommons.org/publicdomain/zero/1.0/
Data science beginners start with a curated set of data, but it is well known that in a real data science project, most of the time is spent on collecting, cleaning, and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its image. In my search for images I found that approximately 10 percent of the car pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.
There are 4000 images of two popular cars (Swift and WagonR) in India of make Maruti Suzuki, with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images, and a test set with 800 images. The data was randomized before splitting into training, validation, and test sets.
A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization, and transfer learning.
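A rough Keras sketch combining a few of those techniques (augmentation, dropout, transfer learning); this is not the starter kernel itself, and the directory names train/ and valid/ with one subfolder per car model are assumptions about the local layout:
```python
import tensorflow as tf

preprocess = tf.keras.applications.mobilenet.preprocess_input

# Augmented training generator and plain validation generator (directory names assumed).
train_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=preprocess, horizontal_flip=True, zoom_range=0.2
).flow_from_directory("train", target_size=(224, 224), class_mode="binary")
val_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=preprocess
).flow_from_directory("valid", target_size=(224, 224), class_mode="binary")

# Transfer learning: frozen MobileNet backbone plus a small binary head.
base = tf.keras.applications.MobileNet(include_top=False, pooling="avg",
                                        input_shape=(224, 224, 3))
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=5)
```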
With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?
Is the data collected for the two car models representative of all possible cars from all over the country, or is there sample bias?
I would also like someone to extend the concept to build a use case in which, if a user uploads an incorrect car picture, the ML model automatically flags it, for example a user uploading the wrong model or an image that is not a car.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
ger_train.csv – The German training set as CSV file.
ger_validation.csv – The German validation set as CSV file.
en_test.csv – The English test set as CSV file.
en_train.csv – The English training set as CSV file.
en_validation.csv – The English validation set as CSV file.
splitting.py – The python code for splitting a dataset into train, test and validation set.
DataSetTrans_de.csv – The final German dataset as a CSV file.
DataSetTrans_en.csv – The final English dataset as a CSV file.
translation.py – The python code for translating the cleaned dataset.
TempKB
Overview
TempKB is a comprehensive collection of knowledge graph data designed to train ML models on knowledge graph completion and reasoning tasks. This is the Event version, which contains only data where the time information is present.
Dataset Structure
The dataset is organized into the following main components:
Train Set: 1,373,083 instances for training models.
Validation Set: 63,421 instances for validating model performance.
Test Set: … See the full description on the dataset page: https://huggingface.co/datasets/ESITime/TempKB-Event.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation, and test sets follows the splits of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset              AudioSet
filename             train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                 [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags classifying the sound of a file. It can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This is one of two collection records. Please see the link below for the other collection of associated audio files.
Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.
This collection contains 3 sets of text documents.
Data Set 1 for Training and Development
The data set, released in June 2014, includes the following documents:
Folder initialisation: Initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
Folder 100profiles: 100 patient profiles (DOCX)
Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard, and iv) form categories with respect to our best information extraction system (TXT in CRF++ format).
An Independent Data Set 2
The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.
The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written, and 100 speech-recognized, written, free-form text documents for the Dragon vocabulary of Nursing (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.
An Independent Data Set 3
For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.
Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.
See Suominen et al (2015) in the links below for a detailed description and examples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Manual bone age assessment (BAA) is associated with longer interpretation time and higher cost and variability, thus posing challenges in areas with restricted medical facilities, such as the high-altitude Tibetan Plateau. The application of artificial intelligence (AI) for automating BAA could facilitate resolving this issue. This study aimed to develop an AI-based BAA model for Han and Tibetan children.
Methods: A model named “EVG-BANet” was trained using three datasets, including the Radiology Society of North America (RSNA) dataset (training set n = 12611, validation set n = 1425, and test set n = 200), the Radiological Hand Pose Estimation (RHPE) dataset (training set n = 5491, validation set n = 713, and test set n = 79), and a self-established local dataset [training set n = 825 and test set n = 351 (Han n = 216 and Tibetan n = 135)]. An open-access state-of-the-art model BoNet was used for comparison. The accuracy and generalizability of the two models were evaluated using the abovementioned three test sets and an external test set (n = 256, all were Tibetan). Mean absolute difference (MAD) and accuracy within 1 year were used as indicators. Bias was evaluated by comparing the MAD between the demographic groups.
Results: EVG-BANet outperformed BoNet in the MAD on the RHPE test set (0.52 vs. 0.63 years, p < 0.001), the local test set (0.47 vs. 0.62 years, p < 0.001), and the external test set (0.53 vs. 0.66 years, p < 0.001) and exhibited a comparable MAD on the RSNA test set (0.34 vs. 0.35 years, p = 0.934). EVG-BANet achieved accuracy within 1 year of 97.7% on the local test set (BoNet 90%, p < 0.001) and 89.5% on the external test set (BoNet 85.5%, p = 0.066). EVG-BANet showed no bias in the local test set but exhibited a bias related to chronological age in the external test set.
Conclusion: EVG-BANet can accurately predict the bone age (BA) for both Han children and Tibetan children living in the Tibetan Plateau with limited healthcare facilities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without training data in the training set or products into which typos have been introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites, marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.