This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision). Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large-scale datasets to TREC and creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks. Similar to previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision? The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
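As a rough illustration of how the species labels are encoded in the folder layout described above, the following sketch walks a local copy of the release and builds a file-to-species index. The root path is a placeholder assumption; only the folder-per-species-code layout comes from the description.

```python
# Minimal sketch, assuming a local copy of the release laid out as one
# folder per four-letter species code containing WAV files.
from pathlib import Path

root = Path("nabat_acoustic_release")  # placeholder path to the downloaded data

records = []
for species_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    species_code = species_dir.name  # four-letter species code from the folder name
    for wav_path in species_dir.glob("*.wav"):
        records.append({"species": species_code, "path": str(wav_path)})

n_classes = len({r["species"] for r in records})
print(f"{len(records)} labelled recordings across {n_classes} classes")
```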
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Identification of risk factors for treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment-resistant depression (TRD) via partition of the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort, RIS-INT-93, as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and the RIS-INT-93 independent dataset, with areas under the receiver operating characteristic curve (AUC) of 0.70–0.78 and 0.72–0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. A model developed using the top 30 features identified with a feature selection technique (k-means clustering followed by a χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using overlapping features between STAR*D and RIS-INT-93 achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in the STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
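A minimal sketch of the general evaluation pattern described above (χ2-based selection of 30 features followed by AUC scoring on a held-out split), run on synthetic placeholder data. The k-means grouping step, the actual STAR*D features, and the models from the paper are not reproduced here.

```python
# Sketch on synthetic data: chi-squared-based selection of 30 features,
# then AUC on a held-out test split. Placeholder data, not STAR*D.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((600, 120))            # patients x candidate features (synthetic)
y = rng.integers(0, 2, size=600)      # synthetic TRD outcome labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = MinMaxScaler().fit(X_train)                 # chi2 requires non-negative inputs
selector = SelectKBest(chi2, k=30).fit(scaler.transform(X_train), y_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(selector.transform(scaler.transform(X_train)), y_train)

scores = clf.predict_proba(selector.transform(scaler.transform(X_test)))[:, 1]
print(f"held-out AUC: {roc_auc_score(y_test, scores):.2f}")
```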
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This traffic dataset contains a balanced mix of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is a secondary CSV feature dataset composed of six public traffic datasets.
Our dataset is curated based on two criteria: the first criterion is to combine widely considered public datasets which contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as the Malware Capture Facility Project datasets. The second criterion is to ensure the final dataset is balanced between encrypted malicious and legitimate network traffic.
Based on these criteria, six public datasets were selected. After data pre-processing, details of each selected public dataset and the size of the different kinds of encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size selected from each public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we can observe that encrypted malicious and legitimate traffic each contribute approximately 50% of the final composed dataset.
The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning or deep learning model training, a sample train and test split is also provided. The train and test datasets are split in a 1:4 ratio. These datasets can be used for machine learning or deep learning model training and testing with the selected features, or after further data pre-processing.
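For reference, a minimal sketch of reproducing a 1:4 split from a combined feature CSV while keeping the classes balanced. The file name, the label column, and the reading of "1:4" as train:test are all assumptions for illustration.

```python
# Sketch: stratified 1:4 (train:test) split of a combined feature CSV.
# "composed_traffic.csv" and the "label" column are assumed names.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("composed_traffic.csv")
X = df.drop(columns=["label"])
y = df["label"]  # assumed: 1 = encrypted malicious, 0 = encrypted legitimate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))          # roughly 1:4
print(y_train.mean(), y_test.mean())      # both close to 0.5 if the data is balanced
```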
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This package (see also https://madeyski.e-informatyka.pl/download/MadeyskiStradowski24Supplement.pdf) includes research artefacts (developed code and datasets from Nokia under the CC BY-NC-ND 4.0 license) required to reproduce the results presented in the paper: Lech Madeyski and Szymon Stradowski, “Predicting test failures induced by software defects: A lightweight alternative to software defect prediction and its industrial application,” Journal of Systems and Software, p. 112360, 2025. DOI: 10.1016/j.jss.2025.112360 URL: https://doi.org/10.1016/j.jss.2025.112360
Highlights from the paper:
- We propose a Lightweight Alternative to Software Defect Prediction (LA2SDP).
- The idea behind LA2SDP is to predict test failures induced by software defects.
- We use eXplainable AI to give feedback to stakeholders and initiate improvement actions.
- We validate our proposed approach in a real-world Nokia 5G test process.
- Our results show that LA2SDP is feasible in vivo using data available in Nokia 5G.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Training and test sets extracted from CollegeMsg dataset (https://snap.stanford.edu/data/CollegeMsg.html)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset with 72000 pins from 117 users on Pinterest. Each pin contains a short raw text and an image. The images were processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.
This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper was published in the Research in Computing Science journal as part of the LKE 2017 conference. The dataset includes the splits used in the paper.
There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
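A minimal sketch of how the row-aligned splits described above could be checked. The loaders and the exact on-disk format are assumptions, since they are not restated in this summary.

```python
# Sketch: load one split and verify the one-to-one alignment of text,
# image features, and user indices. Loaders/formats are assumptions.
import numpy as np

imag_train = np.loadtxt("imag_train")                 # assumed: 4096 features per row
train_users = np.loadtxt("train_user", dtype=int)     # assumed: one user index (0-116) per row
with open("text_train", encoding="utf-8") as fh:
    text_train = [line.rstrip("\n") for line in fh]   # assumed: one raw pin text per line

# The splits are described as one-to-one aligned across text, images, and users.
assert len(text_train) == imag_train.shape[0] == train_users.shape[0]
print(imag_train.shape)                       # expected: (117 users * 400 pins, 4096)
print(train_users.min(), train_users.max())   # expected: 0 and 116
```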
If you have questions regarding the data, write to: jc dot gomez at ugto dot mx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset file contains TocoDecoy datasets generated based on the targets and active ligands of LIT-PCBA.
1_property_filtered.zip:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
InductiveQE datasets
UPD 2.0: Regenerated datasets free of potential test set leakages
UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs
This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). Nine datasets (106–550) were created from FB15k-237, and the wikikg dataset was created from the OGB WikiKG 2 graph. In the datasets, all inference graphs extend training graphs and include new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph; e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task is. The wikikg split has a fixed 133% ratio.
Each dataset is a zip archive containing 17 files:
Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.
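As an illustration of how the pickled answer files mentioned above might be inspected, here is a small sketch. The folder name is a placeholder; only the file name train_answers_val.pkl comes from the description, and its internal structure is not specified here.

```python
# Sketch: open one answers file from an unzipped dataset. The folder name
# "175" is a placeholder for whichever ratio split was downloaded.
import pickle
from pathlib import Path

answers_path = Path("175") / "train_answers_val.pkl"
with answers_path.open("rb") as fh:
    train_answers_val = pickle.load(fh)

print(type(train_answers_val))
# The exact structure (e.g. a query -> answer-set mapping) is not documented
# in this summary, so inspect before assuming anything:
if hasattr(train_answers_val, "__len__"):
    print(len(train_answers_val))
```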
The wikikg dataset is meant to be evaluated in an inference-only regime, after pre-training solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.
Paper pre-print: https://arxiv.org/abs/2210.08008
The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source: https://github.com/zalandoresearch/fashion-mnist
Here's an example of how the data looks (each class takes three rows):
https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png (visualized Fashion-MNIST dataset)
The data is provided as a train (86% of images - 60,000 images) set and a test (14% of images - 10,000 images) set only. The train set was further split to provide 80% of its images to the training set and 20% of its images to the validation set; a sketch of this split follows the citation below.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
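A minimal sketch of the split described above: the 60,000-image train set divided 80/20 into training and validation, with the 10,000-image test set kept intact. The use of torchvision and the fixed seed are illustrative choices, not part of the release.

```python
# Sketch: 80/20 train/validation split of the Fashion-MNIST training set.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

full_train = datasets.FashionMNIST("data", train=True, download=True,
                                   transform=transforms.ToTensor())
test_set = datasets.FashionMNIST("data", train=False, download=True,
                                 transform=transforms.ToTensor())

train_set, val_set = random_split(full_train, [48_000, 12_000],
                                  generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set), len(test_set))  # 48000 12000 10000
```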
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This repository contains the datasets collected and used in the research project:
O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.
A pre-print of the article is available on arXiv. The code is open source and published on GitHub. An accompanying web page is also available.
Overview
The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder and real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, similarly to what was done in the research article.
Toy and Real Data
The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first variant, the fluctuating delay produced by the simulated tape transport was disabled, and in the second variant, the delay was enabled. The latter variants are denoted with the string WOWFLUTTER.
The real data was collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets at the different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.
Directories
For training the models, a fraction of the inputs from the SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:
ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)
ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)
Silence_AKAI_*/ (noise generator, real data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)
For visualizing the model behavior, the following subsets can be used:
LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)
SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)
Directory structure
Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation and test sets from each other. Thus, a typical directory will look like the following:
[DIRECTORY_NAME]
├── Train
│   ├── input_x_.wav
│   ...
│   ├── target_x_.wav
│   ...
├── Val
│   ├── input_y_.wav
│   ...
│   ├── target_y_.wav
│   ...
└── Test
    ├── input_z_.wav
    ...
    ├── target_z_.wav
    ...
While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.
The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav, make up a pair, such that the target audio is the input audio processed with one of the used effects. In some of the cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
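A small sketch of how these pairs (and the optional trajectory files) could be collected from one subset directory, following the naming scheme above. The subset path is a placeholder assumption.

```python
# Sketch: pair input_x_.wav with target_x_.wav (and trajectory_x_.npy when
# present) inside one Train directory. The subset path is a placeholder.
from pathlib import Path

subset = Path("ReelToReel_Dataset_MiniPulse100_AKAI_example") / "Train"

pairs = []
for input_path in sorted(subset.glob("input_*_.wav")):
    index = input_path.stem.split("_")[1]          # the shared number x
    target_path = subset / f"target_{index}_.wav"
    trajectory_path = subset / f"trajectory_{index}_.npy"
    pairs.append({
        "input": input_path,
        "target": target_path,
        "trajectory": trajectory_path if trajectory_path.exists() else None,
    })

print(f"{len(pairs)} input/target pairs found")
```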
Revision History
Version 1.1.0
Added high-resolution (192kHz) dataset for configuration (SCOTCH, 3.75 IPS)
Version 1.0.0
Initial publish
The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.
⚠️ Attention: This is the outdated V1 of the dataset; V2 has since been released.
https://www.nist.gov/open/license
Round 6 Test Dataset
This is the test data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform text sentiment classification on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 480 sentiment classification AI models using a small set of model architectures. The models were trained on text data drawn from product reviews. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
**Experimental data set**
The ML_dataset.csv file contains the experimental data for 50+ days of operation. This is a cleaned-up version of 10.5281/zenodo.11195748. Only full days of experiments have been kept, and the tracking mode state has been added, indicating whether the primary field was tracking or not. When reflectometry measurements were done, the tracking was stopped. These are the two changes compared to 10.5281/zenodo.11195748.
**Machine learning**
The X_train.csv, Y_train.csv, X_test.csv and Y_test.csv files are used to train the models and to test them. X files contain DNI, mass flow, inlet temperature, IAM and humidity. Y files contain the output powers. Data are normalised between 0 and 1, and zero values have been removed.
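As a hedged illustration of how the split could be consumed, the following sketch fits a generic regressor to the normalised features and scores it on the test files. The model choice is arbitrary and not taken from the original work; only the file names and column meanings come from the description above.

```python
# Sketch: train a generic regressor on the released split and score it.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X_train = pd.read_csv("X_train.csv")   # DNI, mass flow, inlet temperature, IAM, humidity
Y_train = pd.read_csv("Y_train.csv")   # output powers
X_test = pd.read_csv("X_test.csv")
Y_test = pd.read_csv("Y_test.csv")

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, Y_train)            # handles one or several output-power columns

print("R^2 on the test files:", r2_score(Y_test, model.predict(X_test)))
```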
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Each row corresponds to one region; each column corresponds to one model and one combination of datasets considered (the training + validation + testing 1 sets (no comorbidity), or all of these sets plus testing set 2 (containing subjects with comorbidities)); each cell gives the number of datasets in which the region was important for predicting TN for the model considered. (CSV)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Medley-solos-DB
=============
Version 1.1, March 2019.
Created By
--------------
Vincent Lostanlen (1), Carmine-Emanuele Cella (2), Rachel Bittner (3), Slim Essid (4).
(1): New York University
(2): UC Berkeley
(3): Spotify, Inc.
(4): Télécom ParisTech
Description
---------------
Medley-solos-DB is a cross-collection dataset for automatic musical instrument recognition in solo recordings. It consists of a training set of 3-second audio clips, which are extracted from the MedleyDB dataset of Bittner et al. (ISMIR 2014), as well as a test set of 3-second clips, which are extracted from the solosDB dataset of Essid et al. (IEEE TASLP 2009). Each of these clips contains a single instrument among a taxonomy of eight: clarinet, distorted electric guitar, female singer, flute, piano, tenor saxophone, trumpet, and violin.
The Medley-solos-DB dataset is the dataset that is used in the benchmarks of musical instrument recognition in the publications of Lostanlen and Cella (ISMIR 2016) and Andén et al. (IEEE TSP 2019).
[1] V. Lostanlen, C.E. Cella. Deep convolutional networks on the pitch spiral for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.
[2] J. Andén, V. Lostanlen, and S. Mallat. Joint time-frequency scattering. IEEE Transactions on Signal Processing, 2019, to appear.
Data Files
--------------
The Medley-solos-DB contains 21572 audio clips as WAV files, sampled at 44.1 kHz, with a single channel (mono), at a bit depth of 32. Every audio clip has a fixed duration of 2972 milliseconds, that is, 65536 discrete-time samples.
Every audio file has a name of the form:
Medley-solos-DB_SUBSET-INSTRUMENTID_UUID.wav
For example:
Medley-solos-DB_test-0_0a282672-c22c-59ff-faaa-ff9eb73fc8e6.wav
corresponds to the snippet whose universally unique identifier (UUID) is 0a282672-c22c-59ff-faaa-ff9eb73fc8e6, contains clarinet sounds (clarinet has instrument id equal to 0), and belongs to the test set.
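A small sketch of decoding this naming scheme; the regular expression is an assumption consistent with the example filename above.

```python
# Sketch: parse Medley-solos-DB_SUBSET-INSTRUMENTID_UUID.wav into its parts.
import re

FILENAME_RE = re.compile(
    r"^Medley-solos-DB_(?P<subset>[a-z]+)-(?P<instrument_id>\d)_(?P<uuid>[0-9a-f-]+)\.wav$"
)

name = "Medley-solos-DB_test-0_0a282672-c22c-59ff-faaa-ff9eb73fc8e6.wav"
m = FILENAME_RE.match(name)
print(m.group("subset"), int(m.group("instrument_id")), m.group("uuid"))
# -> test 0 0a282672-c22c-59ff-faaa-ff9eb73fc8e6
```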
Metadata Files
-------------------
The Medley-solos-DB_metadata is a CSV file containing 21572 rows (one for each audio clip) and five columns:
1. subset: either "training", "validation", or "test"
2. instrument: tag in Medley-DB taxonomy, such as "clarinet", "distorted electric guitar", etc.
3. instrument id: integer from 0 to 7. There is a one-to-one correspondence between "instrument" (string format) and "instrument id" (integer). We provide both for convenience.
4. track id: integer from 0 to 226. The track and artist names are anonymized.
5. UUID: universally unique identifier. Randomly assigned and different for every row.
The list of instrument classes is:
0. clarinet
1. distorted electric guitar
2. female singer
3. flute
4. piano
5. tenor saxophone
6. trumpet
7. violin
Please acknowledge Medley-solos-DB in academic research
---------------------------------------------------------------------------------
When Medley-solos-DB is used for academic research, we would highly appreciate it if scientific publications of works partly based on this dataset cite the following publication:
V. Lostanlen, C.E. Cella. Deep convolutional networks on the pitch spiral for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.
The creation of this dataset was supported by ERC InvariantClass grant 320959.
Conditions of Use
------------------------
Dataset created by Vincent Lostanlen, Rachel Bittner, and Slim Essid, as a derivative work of Medley-DB and solos-Db.
The Medley-solos-DB dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
https://creativecommons.org/licenses/by/4.0/
The dataset and its contents are made available on an "as is" basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the authors are not liable for, and expressly exclude all liability for, loss or damage however and whenever caused to anyone by any use of the Medley-solos-DB dataset or any part of it.
Feedback
-------------
Please help us improve Medley-solos-DB by sending your feedback to:
vincent.lostanlen@nyu.edu
In case of a problem, please include as many details as possible.
Acknowledgement
-------------------------
We thank all artists, recording engineers, curators, and annotators of both MedleyDB and solosDb.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Training and test sets extracted from the BitcoinOTC dataset (https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html). We considered only links associated with a trust value greater than 0.
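For reference, a minimal sketch of the trust filtering described above, applied to the SNAP edge list. The local filename and the column order (source, target, rating, time) are assumptions based on the SNAP release and should be double-checked against the downloaded file.

```python
# Sketch: keep only edges whose trust rating is greater than 0.
import pandas as pd

edges = pd.read_csv("soc-sign-bitcoinotc.csv", header=None,
                    names=["source", "target", "rating", "time"])
positive = edges[edges["rating"] > 0]
print(len(edges), "edges total;", len(positive), "with trust > 0")
```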
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Annotated test and train data sets. Both images and annotations are provided separately.
Validation data set for Hi5, Sf9 and HEK cells.
Confusion matrices for the determination of performance parameters
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.