Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., the holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
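For orientation, a minimal sketch of iterating over the release as described above (one folder per four-letter species code containing WAV files); the local path nabat_training_data is a hypothetical extraction directory:
```python
from pathlib import Path

# Count recordings per species folder; folder names are four-letter species codes
# (see the code-definition table included in the release).
root = Path("nabat_training_data")  # hypothetical path to the extracted release

counts = {}
for species_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    counts[species_dir.name] = len(list(species_dir.glob("*.wav")))

for code, n in counts.items():
    print(f"{code}: {n} recordings")
```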
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
⚠️ Attention: This is the outdated V1 of the dataset. V2 is available here.
This is the dataset for the Style Change Detection task of PAN 2022.
Task
The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: if multiple authors have written a text together, can we find evidence for this fact; i.e., do we have a means to detect variations in the writing style? Answering this question is one of the most difficult and most interesting challenges in author identification: style change detection is the only means to detect plagiarism in a document if no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support. Previous editions of the Style Change Detection task aimed at, e.g., detecting whether a document is single- or multi-authored (2018), the actual number of authors within a document (2019), whether there was a style change between two consecutive paragraphs (2020, 2021), and where the actual style changes were located (2021). Based on the progress made towards this goal in previous years, we again extend the set of challenges to entice novices and experts alike. Given a document, we ask participants to solve the following three tasks:
[Task 1] Style Change Basic: for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors' texts on the paragraph level).
[Task 2] Style Change Advanced: for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document).
[Task 3] Style Change Real-World: for a text written by two or more authors, find all positions of writing style change, where style changes now not only occur between paragraphs, but at the sentence level.
All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.
Data
To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for task 1, dataset2 for task 2, and dataset3 for task 3). Each dataset is split into three parts:
training set: contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.
validation set: contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.
test set: contains 15% of the whole dataset; no ground truth data is given. This set is used for evaluation (see later).
You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.
Input Format
The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively. For each problem instance X (i.e., each input document), two files are provided: problem-X.txt contains the actual text, where paragraphs are denoted by newlines for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by newlines).
truth-problem-X.json contains the ground truth, i.e., the correct solution, in JSON format. An example file is listed in the following (note that we list keys for the three tasks here):
{
  "authors": NUMBER_OF_AUTHORS,
  "site": SOURCE_SITE,
  "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3,
  "paragraph-authors": RESULT_ARRAY_TASK2
}
The result for task 1 (key "changes") is represented as an array holding a binary value for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic). The result for task 3 (key "changes") is structured like the result array for task 1; however, for task 3, the changes array holds a binary value for each pair of consecutive sentences, and there may be multiple style changes in the document. An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):
{
  "changes": [0,0,1,...],
  "paragraph-authors": [1,1,1,2,...]
}
Output Format To...
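For illustration, a minimal sketch of reading one of these ground-truth files in Python; the filename truth-problem-1.json is a hypothetical example for problem ID 1:
```python
import json

# Hypothetical example: load the ground truth for problem instance 1 and
# inspect the keys described above.
with open("truth-problem-1.json") as f:
    truth = json.load(f)

print(truth["authors"])             # total number of authors in the document
print(truth["site"])                # site the text was drawn from (i.e., topic)
print(truth["changes"])             # tasks 1/3: binary flag per consecutive pair
print(truth["paragraph-authors"])   # task 2: author index for each paragraph
```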
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:
mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).
aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.
data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic-band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].
data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.
data_h2o.tar.gz contains:
full_db.extxyz: The full dataset of 1.5k structures.
iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation sets for the active learning cycle.
The subfolders in the folders random and uncertain contain the training and validation sets for the random and uncertainty-based active learning loops (a loading sketch follows below).
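A minimal loading sketch, assuming the .extxyz files are standard extended-XYZ files readable with ASE (the repository does not state which reader to use, so this is an assumption):
```python
from ase.io import read

# Load all structures from the full water dataset after extracting data_h2o.tar.gz.
structures = read("full_db.extxyz", index=":")   # list of ase.Atoms objects
print(len(structures))                           # expected: roughly 1.5k structures

# Initial training/validation sets for the active learning cycle.
train = read("iter00_train.extxyz", index=":")
val = read("iter00_validation.extxyz", index=":")
print(len(train), len(val))
```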
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for journal recommendation; includes title, abstract, keywords, and journal.
We extracted the journals and additional information from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
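A rough sketch of such a per-journal 60/20/20 split, assuming the data sits in a pandas DataFrame with a 'journal' column (column name taken from the list above; the exact procedure used by the authors may differ):
```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_journal(df: pd.DataFrame, seed: int = 42):
    """Split each journal's articles 60/20/20 and recombine the parts."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("journal"):
        train, rest = train_test_split(group, test_size=0.4, random_state=seed)
        val, test = train_test_split(rest, test_size=0.5, random_state=seed)
        train_parts.append(train)
        val_parts.append(val)
        test_parts.append(test)
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```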
I do a lot of work with image data sets. Often it is necessary to partition the images into male and female data sets. Doing this by hand can be a long and tedious task particularly on large data sets. So I decided to create a classifier that could do the task for me.
I used the CELEBA aligned data set to provide the images. I went through and separated the images visually into 1747 female and 1747 male training images. I also created 100 male and 100 female test images, and 100 male and 100 female validation images. I wanted only the face to be in the image, so I developed an image cropping function using MTCNN to crop all the images. That function is included as one of the notebooks should anyone have a need for a good face cropping function. I also created an image duplicate detector to try to eliminate any of the training images from appearing in the test or validation images. I have developed a general-purpose image classification function that works very well for most image classification tasks. It contains the option to select 1 of 7 models for use. For this application I used the MobileNet model because it is less computationally expensive and gives excellent results. On the test set, accuracy is near 100%.
The CELEBA aligned data set was used. This data set is very large and of good quality. To crop the images to only include the face, I developed a face cropping function using MTCNN. MTCNN is a very accurate program and is reasonably fast; however, it is not flawless, so after cropping the images you should always visually inspect the results.
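In that spirit, a minimal cropping sketch using the open-source mtcnn package; this is an illustration rather than the author's exact notebook, and the file paths are hypothetical:
```python
from mtcnn import MTCNN
from PIL import Image
import numpy as np

detector = MTCNN()

def crop_face(in_path: str, out_path: str) -> bool:
    """Detect the first face in an image and save a cropped copy."""
    img = np.asarray(Image.open(in_path).convert("RGB"))
    faces = detector.detect_faces(img)
    if not faces:                      # MTCNN is accurate but not flawless
        return False
    x, y, w, h = faces[0]["box"]       # bounding box of the most confident face
    x, y = max(x, 0), max(y, 0)
    Image.fromarray(img[y:y + h, x:x + w]).save(out_path)
    return True

# Example (hypothetical paths):
# crop_face("raw/000001.jpg", "cropped/000001.jpg")
```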
I developed this data set to train a classifier to distinguish the gender shown in an image. Why bother, you may ask, when you can just look at the image and tell? True, but let's say you have a data set of 50,000 images that you want to separate into male and female data sets. Doing that by hand would take forever. With a trained classifier at near 100% accuracy, you can use model.predict to do the job for you.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
Here's an example of how the data looks (each class takes three rows):
https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
The data provides a train set (86% of images; 60,000 images) and a test set (14% of images; 10,000 images) only. The train set was split to provide 80% of its images to the training set and 20% of its images to the validation set (a loading and splitting sketch follows the citation below).
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
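A minimal sketch of recreating the 80/20 train/validation split described above, using the Keras loader as one convenient way to obtain Fashion-MNIST; the stratified split here is an assumption, not necessarily the exact split used in this release:
```python
from tensorflow.keras.datasets import fashion_mnist
from sklearn.model_selection import train_test_split

# Load the original 60,000/10,000 train/test split.
(x_train_full, y_train_full), (x_test, y_test) = fashion_mnist.load_data()

# Carve 20% of the training images off as a validation set.
x_train, x_val, y_train, y_val = train_test_split(
    x_train_full, y_train_full, test_size=0.2, random_state=0, stratify=y_train_full
)
print(x_train.shape, x_val.shape, x_test.shape)
# (48000, 28, 28) (12000, 28, 28) (10000, 28, 28)
```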
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each row is for one region, each column is for one model and one combination of datasets considered (either the training, validation, and testing 1 sets (no comorbidity), or all of these sets plus testing set 2 (containing subjects with comorbidities)); each cell gives the number of datasets in which the region was important for predicting TN for the model considered. (CSV)
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes.
In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived from shared product identifiers on the Web via weak supervision.
The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.
Code Knowledge Value Evaluation Dataset
This dataset is created by evaluating the knowledge value of code sourced from the bigcode/the-stack repository. It is designed to assess the educational and knowledge potential of different code samples.
Dataset Overview
The dataset is split into training, validation, and test sets with the following number of samples:
Training set: 22,786 samples
Validation set: 4,555 samples
Test set: 18,232 samples
Usage… See the full description on the dataset page: https://huggingface.co/datasets/kimsan0622/code-knowledge-eval.
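As a rough sketch, assuming the dataset can be loaded directly from the Hugging Face Hub under the repository id above (see the dataset page for the authoritative usage instructions and split names):
```python
from datasets import load_dataset

# Load all splits of the code knowledge value evaluation dataset.
ds = load_dataset("kimsan0622/code-knowledge-eval")
print(ds)              # expected splits: train / validation / test
print(ds["train"][0])  # inspect one sample
```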
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The original data was retrieved from http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
Original data description:
The data was created by university students playing badminton or squash whilst wearing a smart watch (Sony SmartWatch 3). The watch relayed the x-y-z coordinates for both the gyroscope and accelerometer to an Android phone (OnePlus 5). The phone wrote these values to an Attribute-Relation File Format (ARFF) file using an app developed by a UEA computer science masters student. The problem is to identify which sport and which stroke the player is making. The data was collected at a rate of 10 Hz over 3 seconds whilst the player played either a forehand/backhand in squash or a clear/smash in badminton.
The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.
Pre-processing
Data processing was done as described in: https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_racketsports.py
The original data was split into a train and a test set. Here, the data was loaded and further divided into train, validation, and test sets.
To keep it simple, we simply divided the original test part into test and validation sets.
The resulting data was stored as NumPy .npy files.
The zip file contains three sets of time series data (X_train, X_test, X_valid) and the respective labels (y_train, y_test, y_valid).
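A minimal loading sketch for the pre-processed arrays after unzipping the archive (file names as listed above; the array shapes depend on the pre-processing script):
```python
import numpy as np

# Time series data and labels for each split.
X_train = np.load("X_train.npy")
y_train = np.load("y_train.npy")
X_valid = np.load("X_valid.npy")
y_valid = np.load("y_valid.npy")
X_test = np.load("X_test.npy")
y_test = np.load("y_test.npy")

print(X_train.shape, X_valid.shape, X_test.shape)
```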
Reference:
http://www.timeseriesclassification.com/description.php?Dataset=RacketSports
(The data was collected as part of an undergraduate project by Phillip Perks in 2017/18.)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset with 72000 pins from 117 users in Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper was published in the Research in Computing Science journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train, and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train, and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train, and validation files for images, text, and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
https://creativecommons.org/publicdomain/zero/1.0/
Data science beginners start with a curated set of data, but it is well known that in a real data science project, most of the time is spent on collecting, cleaning, and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its image. In my search for images I found that approximately 10 percent of the car pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.
There are 4000 images of two popular cars (Swift and WagonR) in India of make Maruti Suzuki, with 2000 pictures belonging to each model. The data is divided into a training set with 2400 images, a validation set with 800 images, and a test set with 800 images. The data was randomized before splitting into training, validation, and test sets.
A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization, and transfer learning.
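A rough Keras sketch combining a few of those techniques (augmentation, dropout, transfer learning); this is not the starter kernel itself, and the directory names train/ and valid/ with one subfolder per car model are assumptions about the local layout:
```python
import tensorflow as tf

preprocess = tf.keras.applications.mobilenet.preprocess_input

# Augmented training generator and plain validation generator (directory names assumed).
train_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=preprocess, horizontal_flip=True, zoom_range=0.2
).flow_from_directory("train", target_size=(224, 224), class_mode="binary")
val_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=preprocess
).flow_from_directory("valid", target_size=(224, 224), class_mode="binary")

# Transfer learning: frozen MobileNet backbone plus a small binary head.
base = tf.keras.applications.MobileNet(include_top=False, pooling="avg",
                                        input_shape=(224, 224, 3))
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=5)
```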
With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?
Is the data collected for the two car models representative of all possible cars from all over the country, or is there sample bias?
I would also like someone to extend the concept to build a use case in which, if a user uploads an incorrect car picture, the ML model automatically flags it, for example a user uploading the wrong model or an image that is not a car.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
ger_train.csv – The German training set as CSV file.
ger_validation.csv – The German validation set as CSV file.
en_test.csv – The English test set as CSV file.
en_train.csv – The English training set as CSV file.
en_validation.csv – The English validation set as CSV file.
splitting.py – The python code for splitting a dataset into train, test and validation set.
DataSetTrans_de.csv – The final German dataset as a CSV file.
DataSetTrans_en.csv – The final English dataset as a CSV file.
translation.py – The python code for translating the cleaned dataset.
TempKB
Overview
TempKB is a comprehensive collection of knowledge graph data designed to train ML models on knowledge graph completion and reasoning tasks. This is the Event version, which contains only data where the time information is present.
Dataset Structure
The dataset is organized into the following main components:
Train Set: 1,373,083 instances for training models.
Validation Set: 63,421 instances for validating model performance.
Test Set: … See the full description on the dataset page: https://huggingface.co/datasets/ESITime/TempKB-Event.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation, and test sets follows the splits of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset              AudioSet
filename             train/---2_BBVHAA.mp3
captions_visual      [a man in a black hat and glasses.]
captions_auditory    [a man speaks and dishes clank.]
tags                 [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags classifying the sound of a file. It can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This is one of two collection records. Please see the link below for the other collection of associated audio files.
Both collections together comprise an open clinical dataset of three sets of 101 nursing handover records, very similar to real documents in Australian English. Each record consists of a patient profile, spoken free-form text document, written free-form text document, and written structured document.
This collection contains 3 sets of text documents.
Data Set 1 for Training and Development
The data set, released in June 2014, includes the following documents:
Folder initialisation: Initialisation details for speech recognition using Dragon Medical 11.0 (i.e., i) DOCX for the written, free-form text document that originates from the Dragon software release and ii) WMA for the spoken, free-form text document by the RN)
Folder 100profiles: 100 patient profiles (DOCX)
Folder 101writtenfreetextreports: 101 written, free-form text documents (TXT)
Folder 100x6speechrecognised: 100 speech-recognized, written, free-form text documents for six Dragon vocabularies (TXT)
Folder 101informationextraction: 101 written, structured documents for information extraction that include i) the reference standard text, ii) features used by our best system, iii) form categories with respect to the reference standard, and iv) form categories with respect to our best information extraction system (TXT in CRF++ format).
An Independent Data Set 2
The aforementioned data set was supplemented in April 2015 with an independent set that was used as a test set in the CLEFeHealth 2015 Task 1a on clinical speech recognition and can be used as a validation set in the CLEFeHealth 2016 Task 1 on handover information extraction. Hence, when using this set, please avoid its repeated use in evaluation – we do not wish to overfit to these data sets.
The set released in April 2015 consists of 100 patient profiles (DOCX), 100 written, and 100 speech-recognized, written, free-form text documents for the Dragon vocabulary of Nursing (TXT). The set released in November 2015 consists of the respective 100 written free-form text documents (TXT) and 100 written, structured documents for information extraction.
An Independent Data Set 3
For evaluation purposes, the aforementioned data sets were supplemented in April 2016 with an independent set of another 100 synthetic cases.
Lineage: Data creation included the following steps: generation of patient profiles; creation of written, free form text documents; development of a structured handover form, using this form and the written, free-form text documents to create written, structured documents; creation of spoken, free-form text documents; using a speech recognition engine with different vocabularies to convert the spoken documents to written, free-form text; and using an information extraction system to fill out the handover form from the written, free-form text documents.
See Suominen et al (2015) in the links below for a detailed description and examples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Manual bone age assessment (BAA) is associated with longer interpretation time and higher cost and variability, thus posing challenges in areas with restricted medical facilities, such as the high-altitude Tibetan Plateau. The application of artificial intelligence (AI) for automating BAA could facilitate resolving this issue. This study aimed to develop an AI-based BAA model for Han and Tibetan children.
Methods: A model named “EVG-BANet” was trained using three datasets, including the Radiology Society of North America (RSNA) dataset (training set n = 12611, validation set n = 1425, and test set n = 200), the Radiological Hand Pose Estimation (RHPE) dataset (training set n = 5491, validation set n = 713, and test set n = 79), and a self-established local dataset [training set n = 825 and test set n = 351 (Han n = 216 and Tibetan n = 135)]. An open-access state-of-the-art model BoNet was used for comparison. The accuracy and generalizability of the two models were evaluated using the abovementioned three test sets and an external test set (n = 256, all were Tibetan). Mean absolute difference (MAD) and accuracy within 1 year were used as indicators. Bias was evaluated by comparing the MAD between the demographic groups.
Results: EVG-BANet outperformed BoNet in the MAD on the RHPE test set (0.52 vs. 0.63 years, p < 0.001), the local test set (0.47 vs. 0.62 years, p < 0.001), and the external test set (0.53 vs. 0.66 years, p < 0.001) and exhibited a comparable MAD on the RSNA test set (0.34 vs. 0.35 years, p = 0.934). EVG-BANet achieved accuracy within 1 year of 97.7% on the local test set (BoNet 90%, p < 0.001) and 89.5% on the external test set (BoNet 85.5%, p = 0.066). EVG-BANet showed no bias in the local test set but exhibited a bias related to chronological age in the external test set.
Conclusion: EVG-BANet can accurately predict the bone age (BA) for both Han children and Tibetan children living in the Tibetan Plateau with limited healthcare facilities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets, as it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products without training data in the training set or products into which typos have been introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites, marking up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.