Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
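As a rough sketch of the sampling and splitting scheme described above (the record fields, split fractions, and seed are illustrative assumptions; this is not the actual NABat pipeline code), the per-species/grid-cell cap and random split might look like:

```python
import random
from collections import defaultdict

def cap_and_split(records, cap=1250, fractions=(0.7, 0.15, 0.15), seed=0):
    """Randomly retain at most `cap` files per species/grid-cell combination,
    then randomly split the retained files into training, validation, and
    test (holdout) sets. `records` are dicts with 'path', 'species', 'grid_cell'."""
    rng = random.Random(seed)
    by_combo = defaultdict(list)
    for rec in records:
        by_combo[(rec["species"], rec["grid_cell"])].append(rec)

    kept = []
    for files in by_combo.values():
        rng.shuffle(files)
        kept.extend(files[:cap])  # exhaust the combination or stop at the cap

    rng.shuffle(kept)
    n_train = int(fractions[0] * len(kept))
    n_val = int(fractions[1] * len(kept))
    return kept[:n_train], kept[n_train:n_train + n_val], kept[n_train + n_val:]
```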
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize classifications provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and the level of applicability domain. An internal cross-validation set makes it possible to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions, with the aim of better hit ranking and selection. During training, the isometric stratified ensembles (ISE) approach applies recursive decorrelated variable selection and considers the cardinality ratio of the classes to balance training sets, thus avoiding bias due to possible class imbalance. To illustrate the value of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators, together with the complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page, were employed. Statistics obtained using this new strategy outperform previously published tools, with and without a classical applicability domain. ISE performance on classifying colloidal aggregators ranges from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88 when only its highest-confidence isometric stratum is retained.
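A toy sketch of the 2D stratification idea (bin counts, score conventions, and the use of AUC as the per-stratum statistic are illustrative assumptions, not the published ISE implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratify_2d(votes, ad_scores, y_true, consensus_bins=3, ad_bins=3):
    """Bin ensemble predictions by degree of consensus and by applicability-domain
    level, then compute a performance statistic (here AUC) inside each stratum.
    `votes`: (n_samples, n_models) array of 0/1 votes; `ad_scores`: values in [0, 1]."""
    consensus = np.abs(votes.mean(axis=1) - 0.5) * 2  # 0 = split vote, 1 = unanimous
    c_idx = np.minimum((consensus * consensus_bins).astype(int), consensus_bins - 1)
    a_idx = np.minimum((ad_scores * ad_bins).astype(int), ad_bins - 1)

    auc_map = np.full((consensus_bins, ad_bins), np.nan)
    for i in range(consensus_bins):
        for j in range(ad_bins):
            mask = (c_idx == i) & (a_idx == j)
            if mask.sum() > 1 and len(set(y_true[mask])) == 2:
                auc_map[i, j] = roc_auc_score(y_true[mask], votes[mask].mean(axis=1))
    return auc_map
```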
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Example image: https://i.imgur.com/7Xz8d5M.gif
This is a collection of 665 images of roads with the potholes labeled. The dataset was created and shared by Atikur Rahman Chitholian as part of his undergraduate thesis and was originally shared on Kaggle.
Note: The original dataset did not contain a validation set; we have re-shuffled the images into a 70/20/10 train-valid-test split.
This dataset could be used for automatically finding and categorizing potholes in city streets so the worst ones can be fixed faster.
The dataset is provided in a wide variety of formats for various common machine learning models.
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
⚠️ Attention: This is the outdated V1 of the dataset. V2 is available here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source
Here's an example of how the data looks (each class takes three rows):
Visualized Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
The original dataset contains a train set (86% of images - 60,000 images) and a test set (14% of images - 10,000 images) only. The train set was split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
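Since Fashion-MNIST mirrors MNIST's image size and split structure, standard MNIST loaders work unchanged. A minimal torchvision sketch that also re-creates the 80/20 train/validation split described above (the seed is arbitrary):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST(root="data", train=False, download=True, transform=transform)

# 80/20 split of the 60,000 training images into train and validation subsets.
n_val = len(train_set) // 5
train_subset, val_subset = torch.utils.data.random_split(
    train_set, [len(train_set) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))
```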
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
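For reference, performance parameters are typically derived from such confusion matrices with the standard formulas below (generic definitions, not code from this data release):

```python
def performance_from_confusion(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}
```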
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Estimation of fruit quality parameters is usually based on destructive techniques which are tedious, costly and unreliable when dealing with huge amounts of fruit. Alternatively, non-destructive techniques such as image processing and spectral reflectance would be useful in rapid detection of fruit quality parameters. This research study aimed to assess the potential of image processing, spectral reflectance indices (SRIs), and machine learning models such as decision tree (DT) and random forest (RF) to qualitatively estimate characteristics of mandarin and tomato fruits at different ripening stages. Quality parameters such as chlorophyll a (Chl a), chlorophyll b (Chl b), total soluble solids (TSS), titratable acidity (TA), TSS/TA, carotenoids (car), lycopene and firmness were measured. The results showed that Red-Green-Blue (RGB) indices and newly developed SRIs demonstrated high efficiency for quantifying different fruit properties. For example, the R2 of the relationships between all RGB indices (RGBI) and measured parameters varied between 0.62 and 0.96 for mandarin and between 0.29 and 0.90 for tomato. RGBI such as the visible atmospherically resistant index (VARI) and normalized red (Rn) presented the highest R2 = 0.96 with car of mandarin fruits, while the excess red vegetation index (ExR) presented the highest R2 = 0.84 with car of tomato fruits. SRIs such as RSI 710,600 and R730,650 showed the greatest R2 values with respect to Chl a (R2 = 0.80) for mandarin fruits, while the GI had the greatest R2 with Chl a (R2 = 0.68) for tomato fruits. Combining RGB indices and SRIs with DT and RF models would be a robust strategy for estimating the eight observed variables with reasonable accuracy. Regarding mandarin fruits, in the task of predicting Chl a, the DT-2HV model delivered exceptional results, registering an R2 of 0.993 with an RMSE of 0.149 for the training set, and an R2 of 0.991 with an RMSE of 0.114 for the validation set. Likewise, for tomato fruits, the DT-5HV model demonstrated exemplary performance in Chl a prediction, achieving an R2 of 0.905 and an RMSE of 0.077 for the training dataset, and an R2 of 0.785 with an RMSE of 0.077 for the validation dataset. The overall outcomes showed that RGB indices, the newly developed SRIs, and DT and RF models based on RGBI and SRIs could be used to evaluate the measured parameters of mandarin and tomato fruits.
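As an illustration of how such RGB indices are commonly computed from image channels (the definitions follow the general vegetation-index literature and should be checked against the formulas used in the study):

```python
import numpy as np

def rgb_indices(img):
    """Common RGB index definitions (illustrative, not the study's exact formulas).
    `img` is an H x W x 3 float array with channels ordered R, G, B."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    total = R + G + B + 1e-9
    r, g = R / total, G / total                   # chromatic coordinates
    vari = (G - R) / (G + R - B + 1e-9)           # visible atmospherically resistant index
    rn = r                                        # normalized red
    exr = 1.4 * r - g                             # excess red index
    return {"VARI": vari, "Rn": rn, "ExR": exr}
```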
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. As a result, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction in model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
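A minimal sketch of the de-duplication step advocated above, using RDKit canonical SMILES as the duplicate key (the exact key and tooling used by the authors are assumptions here):

```python
from rdkit import Chem

def deduplicate_by_canonical_smiles(records):
    """Drop entries whose canonical SMILES has already been seen, so that the
    same molecule cannot land in both a training and a test fold.
    `records` is an iterable of (smiles, pathway_labels) pairs."""
    seen, unique = set(), []
    for smiles, pathway_labels in records:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip unparseable entries
        canonical = Chem.MolToSmiles(mol)
        if canonical not in seen:
            seen.add(canonical)
            unique.append((canonical, pathway_labels))
    return unique
```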
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tackles comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy, with F-scores of 0.71 for passes and 0.65 for shots, if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers. We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
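As a toy illustration of the kind of human-understandable spatial rule referred to above (the threshold, pitch coordinates, and the rule itself are invented for illustration and are not the classifiers from the paper):

```python
import math

def classify_play(ball_end_xy, goal_xy=(52.5, 0.0), shot_radius=25.0):
    """Label a play as a shot if the ball ends up within `shot_radius` metres
    of the opponent's goal, otherwise as a pass (illustrative rule only)."""
    return "shot" if math.dist(ball_end_xy, goal_xy) < shot_radius else "pass"
```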
This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings.
For more info, please refer to the official website: https://dcase.community/challenge2023/task-few-shot-bioacoustic-event-detection
Few-shot learning is a highly promising paradigm for sound event detection. It is also an extremely good fit to the needs of users in bioacoustics, in which increasingly large acoustic datasets commonly need to be labelled for events of an identified category (e.g. species or call-type), even though this category might not be known in other datasets or have any yet-known label. While satisfying user needs, this will also benchmark few-shot learning for the wider domain of sound event detection (SED).
Few-shot learning describes tasks in which an algorithm must make predictions given only a few instances of each class, contrary to the standard supervised learning paradigm. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance and noisy/busy environments. Few-shot learning is usually studied using N-way-K-shot classification, where N denotes the number of classes and K the number of examples for each class.
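A generic sketch of how N-way-K-shot episodes are typically sampled during training (not the DCASE task baseline; names and defaults are illustrative):

```python
import random

def sample_episode(examples_by_class, n_way=5, k_shot=5, n_query=5, seed=None):
    """Sample one N-way-K-shot episode: pick N classes, then K support and
    n_query query examples per class. `examples_by_class` maps label -> list of examples."""
    rng = random.Random(seed)
    classes = rng.sample(list(examples_by_class), n_way)
    support, query = [], []
    for label in classes:
        pool = rng.sample(examples_by_class[label], k_shot + n_query)
        support += [(x, label) for x in pool[:k_shot]]
        query += [(x, label) for x in pool[k_shot:]]
    return support, query
```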
Some reasons why few-shot learning has been of increasing interest:
- Scarcity of supervised data can lead to unreliable generalisations of machine learning models.
- Explicitly labeling a huge dataset can be costly both in time and resources.
- Fixed ontologies or class labels used in SED and other DCASE tasks are often a poor fit to a given user’s goal.
Development Set
The development set is pre-split into training and validation sets. The training set consists of five sub-folders, each deriving from a different source. Along with the audio files, multi-class annotations are provided for each. The validation set consists of two sub-folders, each deriving from a different source, with a single-class (class of interest) annotation file provided for each audio file.
The training set contains five different sub-folders (BV, HV, JD, MT, WMW). Statistics are given overall and for each sub-folder.
Overall Statistics:
- Number of audio recordings: 174
- Total duration: 21 hours
- Total classes (excl. UNK): 47
- Total events (excl. UNK): 14229
The BirdVox-DCASE-10h dataset (BV for short) contains five audio files from four different autonomous recording units, each lasting two hours. These autonomous recording units are all located in Tompkins County, New York, United States. Furthermore, they follow the same hardware specification: the Recording and Observing Bird Identification Node (ROBIN) developed by the Cornell Lab of Ornithology. Andrew Farnsworth, an expert ornithologist, has annotated these recordings for the presence of flight calls from migratory passerines, namely American sparrows, cardinals, thrushes, and warblers. In total, the annotator found 2,662 flight calls from 11 different species. We estimate these flight calls to have a duration of 150 milliseconds and a fundamental frequency between 2 kHz and 10 kHz.
Statistics:
- Number of audio recordings: 5
- Total duration: 10 hours
- Total classes (excl. UNK): 11
- Total events (excl. UNK): 9026
- Ratio event/duration: 0.04
- Sampling rate: 24,000 Hz
Spotted hyenas are a highly social species that live in "fission-fusion" groups where group members range alone or in smaller subgroups that split and merge over time. Hyenas use a variety of types of vocalizations to coordinate with one another over both short and long distances. Spotted hyena vocalization data were recorded on custom-developed audio tags designed by Mark Johnson and integrated into combined GPS / acoustic collars (Followit Sweden AB) by Frants Jensen and Mark Johnson. Collars were deployed on female hyenas of the Talek West hyena clan at the MSU-Mara Hyena Project (directed by Kay Holekamp) in the Masai Mara, Kenya as part of a multi-species study on communication and collective behavior. Field work was carried out by Kay Holekamp, Andrew Gersick, Frants Jensen, Ariana Strandburg-Peshkin, and Benson Pion; labeling was done by Kenna Lehmann and colleagues.
Statistics:
- Number of audio recordings: 5
- Total duration: 5 hours
- Total classes (excl. UNK): 3
- Total events (excl. UNK): 611
- Ratio events/duration: 0.05
- Sampling rate: 6000 Hz
Jackdaws are corvid songbirds which usually breed, forage and sleep in large groups, but form a pair bond with the same partner for life. They produce thousands of vocalisations per day, but many aspects of their vocal behaviour remained unexplored due to the difficulty in recording and assigning vocalisations to specific individuals, especia...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets collected and used in the research project:
O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.
A pre-print of the article is available on arXiv. The code is open-source and published on GitHub. The accompanying web page can be found here.
Overview
The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder and real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, as was done in the research article.
Toy and Real Data
The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first variant, the fluctuating delay produced by the simulated tape transport was disabled, and in the second it was enabled. The latter variants are denoted with the string WOWFLUTTER.
The real data was collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.
Directories
For training the models, a fraction of the inputs from the SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:
ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)
ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)
Silence_AKAI_*/ (noise generator, real data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)
For visualizing the model behavior, the following subsets can be used:
LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)
SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)
Directory structure
Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation and test sets from each other. Thus, a typical directory will look like the following:
[DIRECTORY_NAME]
├── Train
│   ├── input_x_.wav
│   ...
│   ├── target_x_.wav
│   ...
├── Val
│   ├── input_y_.wav
│   ...
│   ├── target_y_.wav
│   ...
└── Test
    ├── input_z_.wav
    ...
    ├── target_z_.wav
    ...
While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.
The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav, make up a pair, such that the target audio is the input audio processed with one of the used effects. In some of the cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
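A small sketch of how the input/target/trajectory files could be paired by their shared number when building a dataloader (the helper below is illustrative and not the dataloader used in the paper):

```python
import re
from pathlib import Path

def collect_pairs(subset_dir):
    """Pair input_x_.wav with target_x_.wav (and trajectory_x_.npy when present)
    inside a Train/Val/Test subdirectory."""
    subset_dir = Path(subset_dir)
    pairs = []
    for inp in sorted(subset_dir.glob("input_*_.wav")):
        idx = re.search(r"input_(\d+)_", inp.name).group(1)
        target = subset_dir / f"target_{idx}_.wav"
        trajectory = subset_dir / f"trajectory_{idx}_.npy"
        if target.exists():
            pairs.append((inp, target, trajectory if trajectory.exists() else None))
    return pairs

# Example (directory name shortened for illustration):
# train_pairs = collect_pairs("ReelToReel_Dataset_MiniPulse100_CHOWTAPE/Train")
```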
Revision History
Version 1.1.0
Added high-resolution (192kHz) dataset for configuration (SCOTCH, 3.75 IPS)
Version 1.0.0
Initial publish
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of IDs for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with 72,000 pins from 117 users in Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper is published in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
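As a sanity check on the one-to-one correspondence between splits, something like the following could be used once the files are loaded (the load_split helper is hypothetical, since the description does not specify the on-disk serialization format):

```python
import numpy as np

def load_split(path):
    """Hypothetical loader: replace with whatever matches the actual file format."""
    return np.loadtxt(path)

imag_train = load_split("imag_train")               # expected shape: (117 * 400, 4096)
train_users = load_split("train_user").astype(int)

assert imag_train.shape == (117 * 400, 4096)          # 400 pins per user, 4096 features
assert train_users.shape[0] == imag_train.shape[0]    # one-to-one pin/user correspondence
assert train_users.min() == 0 and train_users.max() == 116
```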
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "ElectroCom61" dataset contains 2121 annotated images of electronic components sourced from the Electronic Lab Support Room, the United International University (UIU). This dataset was specifically designed to facilitate the development and validation of machine learning models for the real-time detection of electronic components. To mimic real-world scenarios and enhance the robustness of models trained on this data, images were captured under varied lighting conditions and against diverse backgrounds. Each electronic component was photographed from multiple angles, and following collection, images were standardized through auto-orientation and resized to 640x640 pixels, introducing some degree of stretching. The dataset is organized into 61 distinct classes of commonly used electronic components. The dataset were split into training (70%), validation (20%), and test (10%) sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for journal recommendation; includes title, abstract, keywords, and journal.
We extracted the journals and additional information from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
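A sketch of how the per-journal 60/20/20 partition described above could be reproduced (the column name, seed, and tooling are assumptions; the authors' exact procedure may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_journal(df, seed=42):
    """Split each journal's articles 60/20/20 into train/val/test, so every
    journal is proportionally represented in all three subsets."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("journal"):
        train, rest = train_test_split(group, test_size=0.4, random_state=seed)
        val, test = train_test_split(rest, test_size=0.5, random_state=seed)
        train_parts.append(train)
        val_parts.append(val)
        test_parts.append(test)
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```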
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237, and the wikikg dataset was created from the OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph, e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task is. The wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files.
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
The wikikg dataset is supposed to be evaluated in the inference-only regime, being pre-trained solely on simple link prediction, as the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set which was used for the evaluation of participating systems consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without training data in the training set or products into which typos have been introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites, marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets for the NeurIPS 2021 accepted paper "Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction".
Datasets are pytorch files containing a dictionary with training, validation and test sets. Train, validation and test sets are custom dataset classes which inherit from the standard torch dataset class. Corresponding code can be found at https://github.com/HSG-AIML/NeurIPS_2021-Weight_Space_Learning.
Datasets 41, 42, 43 and 44 are our dataset format wrapped around the zoos from Unterthiner et al., 2020 (https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy)
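A minimal sketch of how such a file might be opened (the file name and dictionary keys are placeholders; the repository linked above defines the actual ones, and the custom dataset classes must be importable for torch.load to succeed):

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical file name and keys; see the HSG-AIML repository for the real ones.
dataset_dict = torch.load("zoo_dataset.pt")
train_set = dataset_dict["trainset"]   # custom torch Dataset subclasses
val_set = dataset_dict["valset"]
test_set = dataset_dict["testset"]

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
```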
Abstract:
Self-Supervised Learning (SSL) has been shown to learn useful and information-preserving representations. Neural Networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Further, we show that the proposed learned representations outperform prior work for predicting hyper-parameters, test accuracy, and generalization gap as well as transfer to out-of-distribution settings.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in textual format while streams are in binary format; together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats. PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. This dataset is split into train, test and validation sets, which is useful for machine learning tasks. The third category is the same as the second category but smaller in size, for use in the development stage of different algorithms. IUST-PDFCorpus is collected from various sources including the Mozilla PDF.js open test corpus, some PDFs used as initial seeds in AFL, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
Citing IUST-PDFCorpus
If IUST-PDFCorpus is used in your work in any form please cite the relevant paper: https://arxiv.org/abs/1812.09961v2