Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
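As a rough sketch of the sampling and splitting scheme described above (the record fields, split fractions, and seed are illustrative assumptions; this is not the actual NABat pipeline code), the per-species/grid-cell cap and random split might look like:

```python
import random
from collections import defaultdict

def cap_and_split(records, cap=1250, fractions=(0.7, 0.15, 0.15), seed=0):
    """Randomly retain at most `cap` files per species/grid-cell combination,
    then randomly split the retained files into training, validation, and
    test (holdout) sets. `records` are dicts with 'path', 'species', 'grid_cell'."""
    rng = random.Random(seed)
    by_combo = defaultdict(list)
    for rec in records:
        by_combo[(rec["species"], rec["grid_cell"])].append(rec)

    kept = []
    for files in by_combo.values():
        rng.shuffle(files)
        kept.extend(files[:cap])  # exhaust the combination or stop at the cap

    rng.shuffle(kept)
    n_train = int(fractions[0] * len(kept))
    n_val = int(fractions[1] * len(kept))
    return kept[:n_train], kept[n_train:n_train + n_val], kept[n_train + n_val:]
```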
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, the isometric stratified ensembles (ISE) approach applies recursive decorrelated variable selection and considers the cardinality ratio of the classes to balance training sets, thus avoiding bias due to possible class imbalance. To illustrate the value of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance on classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88 when only its highest-confidence isometric stratum is retained.
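A toy sketch of the 2D stratification idea (bin counts, score conventions, and the use of AUC as the per-stratum statistic are illustrative assumptions, not the published ISE implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratify_2d(votes, ad_scores, y_true, consensus_bins=3, ad_bins=3):
    """Bin ensemble predictions by degree of consensus and by applicability-domain
    level, then compute a performance statistic (here AUC) inside each stratum.
    `votes`: (n_samples, n_models) array of 0/1 votes; `ad_scores`: values in [0, 1]."""
    consensus = np.abs(votes.mean(axis=1) - 0.5) * 2  # 0 = split vote, 1 = unanimous
    c_idx = np.minimum((consensus * consensus_bins).astype(int), consensus_bins - 1)
    a_idx = np.minimum((ad_scores * ad_bins).astype(int), ad_bins - 1)

    auc_map = np.full((consensus_bins, ad_bins), np.nan)
    for i in range(consensus_bins):
        for j in range(ad_bins):
            mask = (c_idx == i) & (a_idx == j)
            if mask.sum() > 1 and len(set(y_true[mask])) == 2:
                auc_map[i, j] = roc_auc_score(y_true[mask], votes[mask].mean(axis=1))
    return auc_map
```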
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Example image: https://i.imgur.com/7Xz8d5M.gif
This is a collection of 665 images of roads with the potholes labeled. The dataset was created and shared by Atikur Rahman Chitholian as part of his undergraduate thesis and was originally shared on Kaggle.
Note: The original dataset did not contain a validation set; we have re-shuffled the images into a 70/20/10 train-valid-test split.
This dataset could be used for automatically finding and categorizing potholes in city streets so the worst ones can be fixed faster.
The dataset is provided in a wide variety of formats for various common machine learning models.
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
⚠️ Attention: This is the outdated V1 of the dataset. V2 is available here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source
Here's an example of how the data looks (each class takes three rows):
Visualized Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
The original dataset contains a train set (86% of images - 60,000 images) and a test set (14% of images - 10,000 images) only. The train set was split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
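Since Fashion-MNIST mirrors MNIST's image size and split structure, standard MNIST loaders work unchanged. A minimal torchvision sketch that also re-creates the 80/20 train/validation split described above (the seed is arbitrary):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST(root="data", train=False, download=True, transform=transform)

# 80/20 split of the 60,000 training images into train and validation subsets.
n_val = len(train_set) // 5
train_subset, val_subset = torch.utils.data.random_split(
    train_set, [len(train_set) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))
```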
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
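For reference, performance parameters are typically derived from such confusion matrices with the standard formulas below (generic definitions, not code from this data release):

```python
def performance_from_confusion(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}
```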
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Estimation of fruit quality parameters is usually based on destructive techniques which are tedious, costly and unreliable when dealing with huge amounts of fruit. Alternatively, non-destructive techniques such as image processing and spectral reflectance would be useful in rapid detection of fruit quality parameters. This research study aimed to assess the potential of image processing, spectral reflectance indices (SRIs), and machine learning models such as decision tree (DT) and random forest (RF) to qualitatively estimate characteristics of mandarin and tomato fruits at different ripening stages. Quality parameters such as chlorophyll a (Chl a), chlorophyll b (Chl b), total soluble solids (TSS), titratable acidity (TA), TSS/TA, carotenoids (car), lycopene and firmness were measured. The results showed that Red-Green-Blue (RGB) indices and newly developed SRIs demonstrated high efficiency for quantifying different fruit properties. For example, the R2 of the relationships between all RGB indices (RGBI) and measured parameters varied between 0.62 and 0.96 for mandarin and between 0.29 and 0.90 for tomato. RGBI such as the visible atmospherically resistant index (VARI) and normalized red (Rn) presented the highest R2 = 0.96 with car of mandarin fruits, while the excess red vegetation index (ExR) presented the highest R2 = 0.84 with car of tomato fruits. SRIs such as RSI 710,600 and R730,650 showed the greatest R2 values with respect to Chl a (R2 = 0.80) for mandarin fruits, while the GI had the greatest R2 with Chl a (R2 = 0.68) for tomato fruits. Combining RGB indices and SRIs with DT and RF models would be a robust strategy for estimating the eight observed variables with reasonable accuracy. Regarding mandarin fruits, in the task of predicting Chl a, the DT-2HV model delivered exceptional results, registering an R2 of 0.993 with an RMSE of 0.149 for the training set, and an R2 of 0.991 with an RMSE of 0.114 for the validation set. Likewise, for tomato fruits, the DT-5HV model demonstrated exemplary performance in Chl a prediction, achieving an R2 of 0.905 and an RMSE of 0.077 for the training dataset, and an R2 of 0.785 with an RMSE of 0.077 for the validation dataset. The overall outcomes showed that RGB indices, the newly developed SRIs, and DT and RF models based on RGBI and SRIs could be used to evaluate the measured parameters of mandarin and tomato fruits.
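As an illustration of how such RGB indices are commonly computed from image channels (the definitions follow the general vegetation-index literature and should be checked against the formulas used in the study):

```python
import numpy as np

def rgb_indices(img):
    """Common RGB index definitions (illustrative, not the study's exact formulas).
    `img` is an H x W x 3 float array with channels ordered R, G, B."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    total = R + G + B + 1e-9
    r, g = R / total, G / total                   # chromatic coordinates
    vari = (G - R) / (G + R - B + 1e-9)           # visible atmospherically resistant index
    rn = r                                        # normalized red
    exr = 1.4 * r - g                             # excess red index
    return {"VARI": vari, "Rn": rn, "ExR": exr}
```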
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. As a result, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction in model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
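A minimal sketch of the de-duplication step advocated above, using RDKit canonical SMILES as the duplicate key (the exact key and tooling used by the authors are assumptions here):

```python
from rdkit import Chem

def deduplicate_by_canonical_smiles(records):
    """Drop entries whose canonical SMILES has already been seen, so that the
    same molecule cannot land in both a training and a test fold.
    `records` is an iterable of (smiles, pathway_labels) pairs."""
    seen, unique = set(), []
    for smiles, pathway_labels in records:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip unparseable entries
        canonical = Chem.MolToSmiles(mol)
        if canonical not in seen:
            seen.add(canonical)
            unique.append((canonical, pathway_labels))
    return unique
```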
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tackles comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy, with F-scores of 0.71 for passes and 0.65 for shots, if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers. We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
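As a toy illustration of the kind of human-understandable spatial rule referred to above (the threshold, pitch coordinates, and the rule itself are invented for illustration and are not the classifiers from the paper):

```python
import math

def classify_play(ball_end_xy, goal_xy=(52.5, 0.0), shot_radius=25.0):
    """Label a play as a shot if the ball ends up within `shot_radius` metres
    of the opponent's goal, otherwise as a pass (illustrative rule only)."""
    return "shot" if math.dist(ball_end_xy, goal_xy) < shot_radius else "pass"
```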
This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings.
For more info, please refer to the official website: https://dcase.community/challenge2023/task-few-shot-bioacoustic-event-detection
Few-shot learning is a highly promising paradigm for sound event detection. It is also an extremely good fit to the needs of users in bioacoustics, in which increasingly large acoustic datasets commonly need to be labelled for events of an identified category (e.g. species or call-type), even though this category might not be known in other datasets or have any yet-known label. While satisfying user needs, this will also benchmark few-shot learning for the wider domain of sound event detection (SED).
Few-shot learning describes tasks in which an algorithm must make predictions given only a few instances of each class, contrary to the standard supervised learning paradigm. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance and noisy/busy environments. Few-shot learning is usually studied using N-way-K-shot classification, where N denotes the number of classes and K the number of examples for each class.
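A generic sketch of how N-way-K-shot episodes are typically sampled during training (not the DCASE task baseline; names and defaults are illustrative):

```python
import random

def sample_episode(examples_by_class, n_way=5, k_shot=5, n_query=5, seed=None):
    """Sample one N-way-K-shot episode: pick N classes, then K support and
    n_query query examples per class. `examples_by_class` maps label -> list of examples."""
    rng = random.Random(seed)
    classes = rng.sample(list(examples_by_class), n_way)
    support, query = [], []
    for label in classes:
        pool = rng.sample(examples_by_class[label], k_shot + n_query)
        support += [(x, label) for x in pool[:k_shot]]
        query += [(x, label) for x in pool[k_shot:]]
    return support, query
```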
Some reasons why few-shot learning has been of increasing interest:
- Scarcity of supervised data can lead to unreliable generalisations of machine learning models.
- Explicitly labeling a huge dataset can be costly both in time and resources.
- Fixed ontologies or class labels used in SED and other DCASE tasks are often a poor fit to a given user’s goal.
Development Set
The development set is pre-split into training and validation sets. The training set consists of five sub-folders, each deriving from a different source. Along with the audio files, multi-class annotations are provided for each. The validation set consists of two sub-folders, each deriving from a different source, with a single-class (class of interest) annotation file provided for each audio file.
The training set contains five different sub-folders (BV, HV, JD, MT, WMW). Statistics are given overall and for each sub-folder.
Overall Statistics:
- Number of audio recordings: 174
- Total duration: 21 hours
- Total classes (excl. UNK): 47
- Total events (excl. UNK): 14229
The BirdVox-DCASE-10h dataset (BV for short) contains five audio files from four different autonomous recording units, each lasting two hours. These autonomous recording units are all located in Tompkins County, New York, United States. Furthermore, they follow the same hardware specification: the Recording and Observing Bird Identification Node (ROBIN) developed by the Cornell Lab of Ornithology. Andrew Farnsworth, an expert ornithologist, has annotated these recordings for the presence of flight calls from migratory passerines, namely American sparrows, cardinals, thrushes, and warblers. In total, the annotator found 2,662 flight calls from 11 different species. We estimate these flight calls to have a duration of 150 milliseconds and a fundamental frequency between 2 kHz and 10 kHz.
Statistics:
- Number of audio recordings: 5
- Total duration: 10 hours
- Total classes (excl. UNK): 11
- Total events (excl. UNK): 9026
- Ratio event/duration: 0.04
- Sampling rate: 24,000 Hz
Spotted hyenas are a highly social species that live in "fission-fusion" groups where group members range alone or in smaller subgroups that split and merge over time. Hyenas use a variety of types of vocalizations to coordinate with one another over both short and long distances. Spotted hyena vocalization data were recorded on custom-developed audio tags designed by Mark Johnson and integrated into combined GPS / acoustic collars (Followit Sweden AB) by Frants Jensen and Mark Johnson. Collars were deployed on female hyenas of the Talek West hyena clan at the MSU-Mara Hyena Project (directed by Kay Holekamp) in the Masai Mara, Kenya as part of a multi-species study on communication and collective behavior. Field work was carried out by Kay Holekamp, Andrew Gersick, Frants Jensen, Ariana Strandburg-Peshkin, and Benson Pion; labeling was done by Kenna Lehmann and colleagues.
Statistics:
- Number of audio recordings: 5
- Total duration: 5 hours
- Total classes (excl. UNK): 3
- Total events (excl. UNK): 611
- Ratio events/duration: 0.05
- Sampling rate: 6000 Hz
Jackdaws are corvid songbirds which usually breed, forage and sleep in large groups, but form a pair bond with the same partner for life. They produce thousands of vocalisations per day, but many aspects of their vocal behaviour remained unexplored due to the difficulty in recording and assigning vocalisations to specific individuals, especia...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets collected and used in the research project:
O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.
A pre-print of the article is available on arXiv. The code is open-source and published on GitHub. The accompanying web page can be found here.
Overview
The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder and real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, as was done in the research article.
Toy and Real Data
The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first variant, the fluctuating delay produced by the simulated tape transport was disabled, and in the second it was enabled. The latter variants are denoted with the string WOWFLUTTER.
The real data was collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.
Directories
For training the models, a fraction of the inputs from the SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:
ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)
ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)
Silence_AKAI_*/ (noise generator, real data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)
For visualizing the model behavior, the following subsets can be used:
LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)
SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)
Directory structure
Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation and test sets from each other. Thus, a typical directory will look like the following:
[DIRECTORY_NAME]
├── Train
│   ├── input_x_.wav
│   ...
│   ├── target_x_.wav
│   ...
├── Val
│   ├── input_y_.wav
│   ...
│   ├── target_y_.wav
│   ...
└── Test
    ├── input_z_.wav
    ...
    ├── target_z_.wav
    ...
While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.
The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav, make up a pair, such that the target audio is the input audio processed with one of the used effects. In some of the cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
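A small sketch of how the input/target/trajectory files could be paired by their shared number when building a dataloader (the helper below is illustrative and not the dataloader used in the paper):

```python
import re
from pathlib import Path

def collect_pairs(subset_dir):
    """Pair input_x_.wav with target_x_.wav (and trajectory_x_.npy when present)
    inside a Train/Val/Test subdirectory."""
    subset_dir = Path(subset_dir)
    pairs = []
    for inp in sorted(subset_dir.glob("input_*_.wav")):
        idx = re.search(r"input_(\d+)_", inp.name).group(1)
        target = subset_dir / f"target_{idx}_.wav"
        trajectory = subset_dir / f"trajectory_{idx}_.npy"
        if target.exists():
            pairs.append((inp, target, trajectory if trajectory.exists() else None))
    return pairs

# Example (directory name shortened for illustration):
# train_pairs = collect_pairs("ReelToReel_Dataset_MiniPulse100_CHOWTAPE/Train")
```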
Revision History
Version 1.1.0
Added high-resolution (192kHz) dataset for configuration (SCOTCH, 3.75 IPS)
Version 1.0.0
Initial publish
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with 72,000 pins from 117 users in Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper is published in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
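As a sanity check on the one-to-one correspondence between splits, something like the following could be used once the files are loaded (the load_split helper is hypothetical, since the description does not specify the on-disk serialization format):

```python
import numpy as np

def load_split(path):
    """Hypothetical loader: replace with whatever matches the actual file format."""
    return np.loadtxt(path)

imag_train = load_split("imag_train")               # expected shape: (117 * 400, 4096)
train_users = load_split("train_user").astype(int)

assert imag_train.shape == (117 * 400, 4096)          # 400 pins per user, 4096 features
assert train_users.shape[0] == imag_train.shape[0]    # one-to-one pin/user correspondence
assert train_users.min() == 0 and train_users.max() == 116
```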
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "ElectroCom61" dataset contains 2121 annotated images of electronic components sourced from the Electronic Lab Support Room, the United International University (UIU). This dataset was specifically designed to facilitate the development and validation of machine learning models for the real-time detection of electronic components. To mimic real-world scenarios and enhance the robustness of models trained on this data, images were captured under varied lighting conditions and against diverse backgrounds. Each electronic component was photographed from multiple angles, and following collection, images were standardized through auto-orientation and resized to 640x640 pixels, introducing some degree of stretching. The dataset is organized into 61 distinct classes of commonly used electronic components. The dataset were split into training (70%), validation (20%), and test (10%) sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for journal recommendation; includes title, abstract, keywords, and journal.
We extracted the journals and additional information from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
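A sketch of how the per-journal 60/20/20 partition described above could be reproduced (the column name, seed, and tooling are assumptions; the authors' exact procedure may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_journal(df, seed=42):
    """Split each journal's articles 60/20/20 into train/val/test, so every
    journal is proportionally represented in all three subsets."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("journal"):
        train, rest = train_test_split(group, test_size=0.4, random_state=seed)
        val, test = train_test_split(rest, test_size=0.5, random_state=seed)
        train_parts.append(train)
        val_parts.append(val)
        test_parts.append(test)
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```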
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, and the wikikg dataset was created from the OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task is. The wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files.
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The wikikg dataset is supposed to be evaluated in the inference-only regime, being pre-trained solely on simple link prediction, as the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set which was used for the evaluation of participating systems consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without training data in the training set or products into which typos have been introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites, marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets for the NeurIPS 2021 accepted paper "Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction".
Datasets are pytorch files containing a dictionary with training, validation and test sets. Train, validation and test sets are custom dataset classes which inherit from the standard torch dataset class. Corresponding code can be found at https://github.com/HSG-AIML/NeurIPS_2021-Weight_Space_Learning.
Datasets 41, 42, 43 and 44 are our dataset format wrapped around the zoos from Unterthiner et al., 2020 (https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy)
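A minimal sketch of how such a file might be opened (the file name and dictionary keys are placeholders; the repository linked above defines the actual ones, and the custom dataset classes must be importable for torch.load to succeed):

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical file name and keys; see the HSG-AIML repository for the real ones.
dataset_dict = torch.load("zoo_dataset.pt")
train_set = dataset_dict["trainset"]   # custom torch Dataset subclasses
val_set = dataset_dict["valset"]
test_set = dataset_dict["testset"]

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```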
Abstract:
Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format while streams are in binary format; together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats. PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. This dataset is split into train, test and validation sets, which is useful for machine learning tasks. The third category is the same as the second category but smaller in size, for use in the development stage of different algorithms. IUST-PDFCorpus is collected from various sources including the Mozilla PDF.js open test corpus, some PDFs used as initial seeds in AFL, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
Citing IUST-PDFCorpus
If IUST-PDFCorpus is used in your work in any form please cite the relevant paper: https://arxiv.org/abs/1812.09961v2