100+ datasets found
  1. TREC 2022 Deep Learning test collection

    • catalog.data.gov
    • data.nist.gov
    Updated May 9, 2023
    Cite
    National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a test collection for passage and document retrieval, produced in the TREC 2022 Deep Learning track. The Deep Learning Track studies information retrieval in a large-training-data regime: the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training on click logs or on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

    Certain machine-learning methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation for developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed to provide large-scale datasets to TREC and to create a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

    As in previous years, one of the main goals of the track in 2022 is to study which methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

    The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

  2. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).

    Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
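    The capped random selection and splitting procedure described above can be sketched as follows. The 1,250-recording cap comes from the description; the file-tuple format, seed, and split fractions are illustrative assumptions, not the actual NABat proportions:

```python
import random
from collections import defaultdict

def sample_and_split(files, cap=1250, seed=0, frac=(0.8, 0.1, 0.1)):
    """Randomly select up to `cap` recordings per (species, grid cell)
    combination, then split into training / validation / test sets.

    `files` is a list of (filename, species, grid_cell) tuples.
    The cap mirrors the description; the split fractions are
    illustrative assumptions."""
    rng = random.Random(seed)
    by_combo = defaultdict(list)
    for name, species, cell in files:
        by_combo[(species, cell)].append(name)

    selected = []
    for names in by_combo.values():
        rng.shuffle(names)
        selected.extend(names[:cap])  # exhaust the combo or stop at the cap

    rng.shuffle(selected)
    n = len(selected)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    return (selected[:n_train],
            selected[n_train:n_train + n_val],
            selected[n_train + n_val:])  # test set (holdout, not released)
```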

  3. Predictive modeling of treatment resistant depression using data from STAR*D...

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Zhi Nie; Srinivasan Vairavan; Vaibhav A. Narayan; Jieping Ye; Qingqin S. Li (2023). Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study [Dataset]. http://doi.org/10.1371/journal.pone.0197268
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Zhi Nie; Srinivasan Vairavan; Vaibhav A. Narayan; Jieping Ye; Qingqin S. Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work on predictive modeling of treatment resistant depression (TRD) by partitioning the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort, RIS-INT-93, as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and the RIS-INT-93 independent dataset, with areas under the receiver operating characteristic curve (AUC) of 0.70–0.78 and 0.72–0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. The model developed using the top 30 features identified by a feature selection technique (k-means clustering followed by a χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using the overlapping features between STAR*D and RIS-INT-93 achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in the STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
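    The AUC values quoted above can be computed directly from model scores via the rank-sum (Mann-Whitney) identity; a minimal, dependency-free sketch (the example labels and scores in the test are made up, not STAR*D data):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    AUC = P(score of a random positive > score of a random negative),
    counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```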

  4. Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning...

    • data.mendeley.com
    Updated Dec 6, 2022
    + more versions
    Cite
    Zihao Wang (2022). Encrypted Traffic Feature Dataset for Machine Learning and Deep Learning based Encrypted Traffic Analysis [Dataset]. http://doi.org/10.17632/xw7r4tt54g.1
    Explore at:
    Dataset updated
    Dec 6, 2022
    Authors
    Zihao Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This traffic dataset contains a balanced mix of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is secondary CSV feature data composed from six public traffic datasets.

    Our dataset is curated based on two criteria. The first is to combine widely used public datasets that contain enough encrypted malicious or encrypted legitimate traffic in existing works, such as the Malware Capture Facility Project datasets. The second is to ensure the final dataset is balanced between encrypted malicious and legitimate network traffic.

    Based on these criteria, 6 public datasets were selected. After data pre-processing, details of each selected public dataset and the size of different encrypted traffic are shown in the "Dataset Statistic Analysis Document". The document summarizes the malicious and legitimate traffic sizes selected from each public dataset, the traffic size of each malicious traffic type, and the total size of the composed dataset. From the table, we can observe that encrypted malicious and legitimate traffic each contribute approximately 50% of the final composed dataset.

    The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning or deep learning model training, sample train and test sets are also provided, separated at a 1:4 ratio. These datasets can be used for model training and testing based on the selected features, or after further data pre-processing.
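    The balance-preserving train/test separation described above can be sketched as a stratified split. Reading the stated 1:4 ratio as train:test is an assumption (the description is ambiguous), as is the row/label format:

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="label", train_frac=0.2, seed=0):
    """Split rows into train/test while preserving the malicious/legitimate
    balance per class. train_frac=0.2 reads the stated 1:4 ratio as
    train:test; treat this as an assumption."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(round(train_frac * len(members)))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```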

  5. Supplementary Material (code and data sets from Nokia under CC BY NC ND 4.0...

    • figshare.com
    zip
    Updated Feb 4, 2025
    Cite
    Lech Madeyski; Szymon Stradowski (2025). Supplementary Material (code and data sets from Nokia under CC BY NC ND 4.0 license) for the following paper: L. Madeyski and S. Stradowski, “Predicting test failures induced by software defects: A lightweight alternative to software defect prediction and its industrial application,” Journal of Systems and Software, p. 112360, 2025. DOI: 10.1016/j.jss.2025.112360 URL: https://doi.org/10.1016/j.jss.2025.112360 [Dataset]. http://doi.org/10.6084/m9.figshare.28263290.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lech Madeyski; Szymon Stradowski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package (see also https://madeyski.e-informatyka.pl/download/MadeyskiStradowski24Supplement.pdf) includes research artefacts (developed code and datasets from Nokia under the CC BY NC ND 4.0 license) required to reproduce the results presented in the paper: Lech Madeyski and Szymon Stradowski, "Predicting test failures induced by software defects: A lightweight alternative to software defect prediction and its industrial application," Journal of Systems and Software, p. 112360, 2025. DOI: 10.1016/j.jss.2025.112360. URL: https://doi.org/10.1016/j.jss.2025.112360

    Highlights from the paper:

    • We propose a Lightweight Alternative to Software Defect Prediction (LA2SDP)
    • The idea behind LA2SDP is to predict test failures induced by software defects
    • We use eXplainable AI to give feedback to stakeholders and initiate improvement actions
    • We validate our proposed approach in a real-world Nokia 5G test process
    • Our results show that LA2SDP is feasible in vivo using data available in Nokia 5G

  6. CollegeMsg - Training and test sets

    • figshare.com
    zip
    Updated Aug 10, 2020
    Cite
    Emanuele Pio Barracchia; Gianvito Pio; Heitor Murilo Gomes; Bernhard Pfahringer; Albert Bifet; Michelangelo Ceci (2020). CollegeMsg - Training and test sets [Dataset]. http://doi.org/10.6084/m9.figshare.12780827.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 10, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Emanuele Pio Barracchia; Gianvito Pio; Heitor Murilo Gomes; Bernhard Pfahringer; Albert Bifet; Michelangelo Ceci
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training and test sets extracted from the CollegeMsg dataset (https://snap.stanford.edu/data/CollegeMsg.html).

  7. pinterest_dataset

    • data.mendeley.com
    Updated Oct 27, 2017
    + more versions
    Cite
    Juan Carlos Gomez (2017). pinterest_dataset [Dataset]. https://data.mendeley.com/datasets/fs4k2zc5j5/2
    Explore at:
    Dataset updated
    Oct 27, 2017
    Authors
    Juan Carlos Gomez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset with 72,000 pins from 117 users on Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.

    This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to identify specific users given their comments. The paper is published in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.

    There are nine files. text_test, text_train and text_val contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a one-to-one correspondence among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in each of the validation and test sets.
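    The one-to-one correspondence and per-user counts described above can be sanity-checked after loading; a small sketch (assuming one pin per line in each file, which is an assumption about the on-disk format):

```python
def check_split(text_lines, imag_lines, user_lines, pins_per_user):
    """Sanity-check one split of the Pinterest data: the text, image-feature,
    and user-index files must correspond line-for-line, user indices must lie
    in 0..116, and each user should contribute `pins_per_user` pins.
    Returns the per-user pin counts."""
    assert len(text_lines) == len(imag_lines) == len(user_lines), \
        "splits must correspond one-to-one"
    counts = {}
    for u in user_lines:
        u = int(u)
        assert 0 <= u <= 116, f"user index {u} out of range"
        counts[u] = counts.get(u, 0) + 1
    for u, c in counts.items():
        assert c == pins_per_user, f"user {u}: {c} pins, expected {pins_per_user}"
    return counts
```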

    If you have questions regarding the data, write to: jc dot gomez at ugto dot mx

  8. Data from: TocoDecoy: a new approach to design unbiased datasets for...

    • zenodo.org
    zip
    Updated Apr 22, 2022
    Cite
    Zhang; Zhang (2022). TocoDecoy: a new approach to design unbiased datasets for training and benchmarking machine-learning scoring functions [Dataset]. http://doi.org/10.5281/zenodo.5290011
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zhang; Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset file contains TocoDecoy datasets generated based on the targets and active ligands of LIT-PCBA.

    1_property_filtered.zip :

    • TD set: the ligand file name, 2D t-SNE vectors, SMILES, molecular weight (MW), Wildman-Crippen partition coefficient (log P), number of rotatable bonds (RB), number of hydrogen-bond acceptors (HBA), number of hydrogen-bond donors (HBD), number of halogens (HAL), topology similarities of decoys to the seed active ligands, active label (active or inactive) and training set label (whether it belongs to the training set or test set) of active ligands and their topologically dissimilar decoys
    • CD set: the decoy conformations with low docking scores generated by docking active ligands into protein pockets using Glide (Schrödinger).

  9. InductiveQE Datasets

    • zenodo.org
    zip
    Updated Nov 9, 2022
    Cite
    Mikhail Galkin (2022). InductiveQE Datasets [Dataset]. http://doi.org/10.5281/zenodo.7306046
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mikhail Galkin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    InductiveQE datasets

    UPD 2.0: Regenerated datasets free of potential test set leakages

    UPD 1.1: Added train_answers_val.pkl files to all freebase-derived datasets - answers of training queries on larger validation graphs

    This repository contains 10 inductive complex query answering datasets published in "Inductive Logical Query Answering in Knowledge Graphs" (NeurIPS 2022). 9 datasets (106-550) were created from FB15k-237; the wikikg dataset was created from the OGB WikiKG 2 graph. In the datasets, all inference graphs extend the training graphs and include new nodes and edges. Dataset numbers indicate the relative size of the inference graph compared to the training graph; e.g., in 175, the number of nodes in the inference graph is 175% of the number of nodes in the training graph. The higher the ratio, the more new unseen nodes appear at inference time and the more complex the task is. The Wikikg split has a fixed 133% ratio.

    Each dataset is a zip archive containing 17 files:

    • train_graph.txt (pt for wikikg) - original training graph
    • val_inference.txt (pt) - inference graph (validation split), new nodes in validation are disjoint with the test inference graph
    • val_predict.txt (pt) - missing edges in the validation inference graph to be predicted.
    • test_inference.txt (pt) - inference graph (test split), new nodes in test are disjoint with the validation inference graph
    • test_predict.txt (pt) - missing edges in the test inference graph to be predicted.
    • train/valid/test_queries.pkl - queries of the respective split, 14 query types for fb-derived datasets, 9 types for Wikikg (EPFO-only)
    • *_answers_easy.pkl - easy answers to respective queries that do not require predicting missing links but only edge traversal
    • *_answers_hard.pkl - hard answers to respective queries that DO require predicting missing links and against which the final metrics will be computed
    • train_answers_val.pkl - the extended set of answers for training queries on the bigger validation graph; most training queries have at least one new answer. This is intended as an inference-only dataset to measure faithfulness of trained models
    • train_answers_test.pkl - the extended set of answers for training queries on the bigger test graph; most training queries have at least one new answer. This is intended as an inference-only dataset to measure faithfulness of trained models
    • og_mappings.pkl - contains entity2id / relation2id dictionaries mapping local node/relation IDs from a respective dataset to the original fb15k237 / wikikg2
    • stats.txt - a small file with dataset stats
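    Working with a downloaded dataset typically means reading the *.txt graph files and the og_mappings.pkl dictionaries listed above. A minimal loading sketch; whitespace-separated integer triples per line is an assumed on-disk format, not documented here:

```python
import pickle

def load_triples(path):
    """Load (head, relation, tail) triples from one of the *.txt graph
    files. Assumes whitespace-separated integer IDs, one triple per line."""
    triples = []
    with open(path) as f:
        for line in f:
            h, r, t = line.split()
            triples.append((int(h), int(r), int(t)))
    return triples

def load_mappings(path):
    """Load og_mappings.pkl: the entity2id / relation2id dictionaries that
    map local node/relation IDs back to the original FB15k-237 / WikiKG2."""
    with open(path, "rb") as f:
        return pickle.load(f)
```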

    Overall unzipped size of all datasets combined is about 10 GB. Please refer to the paper for the sizes of graphs and the number of queries per graph.

    The Wikikg dataset is supposed to be evaluated in an inference-only regime, pre-trained solely on simple link prediction, since the number of training complex queries is not enough for such a large dataset.

    Paper pre-print: https://arxiv.org/abs/2210.08008

    The full source code of training/inference models is available at https://github.com/DeepGraphLearning/InductiveQE

  10. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +4more
    zip
    Updated Aug 10, 2022
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Authors: Han Xiao; Kashif Rasul; Roland Vollgraf

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original Fashion-MNIST splits: train (86% of images; 60,000 images) and test (14% of images; 10,000 images) sets only.
    • No model was trained on this version.

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • See https://blog.roboflow.com/train-test-split/ (train/valid/test split rebalancing illustrated at https://i.imgur.com/angfheJ.png)

    Citation:

    @online{xiao2017/online,
      author      = {Han Xiao and Kashif Rasul and Roland Vollgraf},
      title       = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
      date        = {2017-08-28},
      year        = {2017},
      eprintclass = {cs.LG},
      eprinttype  = {arXiv},
      eprint      = {cs.LG/1708.07747},
    }
    
  11. Magnetic Tape Recorder Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2023
    + more versions
    Cite
    Moliner, Eloi (2023). Magnetic Tape Recorder Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8026271
    Explore at:
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Wright, Alec
    Välimäki, Vesa
    Mikkonen, Otto
    Moliner, Eloi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the datasets collected and used in the research project:

    O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.

    A pre-print of the article is available on arXiv, and the code is open-source and published on GitHub. An accompanying web page is also available.

    Overview

    The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder, as well as real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, similarly as was done in the research article.

    Toy and Real Data

    The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first, the fluctuating delay produced by the simulated tape transport was disabled; in the second, it was enabled. The latter variant is denoted with the string WOWFLUTTER.

    The real data is collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.

    Directories

    For training the models, a fraction of the inputs from SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:

    ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)

    ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)

    Silence_AKAI_*/ (noise generator, real data)

    ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)

    ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)

    For visualizing the model behavior, the following subsets can be used:

    LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)

    SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)

    Directory structure

    Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation and test sets from each other. Thus, a typical directory will look like the following:

    [DIRECTORY_NAME]
    ├── Train
    │   ├── input_x_.wav
    │   ...
    │   ├── target_x_.wav
    │   ...
    ├── Val
    │   ├── input_y_.wav
    │   ...
    │   ├── target_y_.wav
    │   ...
    └── Test
        ├── input_z_.wav
        ...
        ├── target_z_.wav
        ...

    While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.

    The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav make up a pair, such that the target audio is the input audio processed with one of the used effects. In some of the cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
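    The input/target naming convention described above can be paired up programmatically; a small sketch (the regex and the skipping of unmatched files are assumptions based on the example names input_100_.wav and target_100_.wav):

```python
import re

def pair_files(filenames):
    """Group input_x_.wav / target_x_.wav (and optional trajectory_x_.npy)
    files by their shared index x. Files without a complete input/target
    pair are dropped."""
    pattern = re.compile(r"^(input|target|trajectory)_(\d+)_\.(wav|npy)$")
    groups = {}
    for name in filenames:
        m = pattern.match(name)
        if m:
            role, idx = m.group(1), int(m.group(2))
            groups.setdefault(idx, {})[role] = name
    # Keep only complete input/target pairs
    return {i: g for i, g in sorted(groups.items())
            if "input" in g and "target" in g}
```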

    Revision History

    Version 1.1.0

    Added high-resolution (192kHz) dataset for configuration (SCOTCH, 3.75 IPS)

    Version 1.0.0

    Initial publish

  12. Something-Something V1 Dataset

    • paperswithcode.com
    Updated Mar 28, 2023
    + more versions
    Cite
    Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic (2023). Something-Something V1 Dataset [Dataset]. https://paperswithcode.com/dataset/something-something-v1
    Explore at:
    Dataset updated
    Mar 28, 2023
    Authors
    Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic
    Description

    The 20BN-SOMETHING-SOMETHING dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop a fine-grained understanding of basic actions that occur in the physical world. It contains 108,499 videos, with 86,017 in the training set, 11,522 in the validation set and 10,960 in the test set. There are 174 labels.

    ⚠️ Attention: This is the outdated V1 of the dataset; V2 is also available.

  13. Trojan Detection Software Challenge -...

    • data.nist.gov
    • catalog.data.gov
    Updated May 14, 2021
    + more versions
    Cite
    National Institute of Standards and Technology (2021). Trojan Detection Software Challenge - nlp-sentiment-classification-apr2021-test [Dataset]. http://doi.org/10.18434/mds2-2404
    Explore at:
    Dataset updated
    May 14, 2021
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    Round 6 Test Dataset. This is the test data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform text sentiment classification on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. The dataset consists of 480 sentiment classification AI models using a small set of model architectures. The models were trained on text data drawn from product reviews. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.

  14. Training and testing dataset for Machine Learning models from experimental...

    • zenodo.org
    • explore.openaire.eu
    Updated Sep 27, 2024
    + more versions
    Cite
    Rowida Meligy; Alaric Montenon (2024). Training and testing dataset for Machine Learning models from experimental data from a Linear Fresnel Reflector in Cyprus [Dataset]. http://doi.org/10.5281/zenodo.11235652
    Explore at:
    Dataset updated
    Sep 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rowida Meligy; Alaric Montenon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 3, 2018 - Sep 23, 2019
    Description

    **Experimental data set**

    The ML_dataset.csv file contains the experimental data for 50+ days of operation. It is a cleaned version of 10.5281/zenodo.11195748: only full days of experiments have been kept, and a tracking-mode state has been added, indicating whether or not the primary field was tracking. When reflectometry measurements were done, the tracking was stopped. These are the only two changes compared to 10.5281/zenodo.11195748.

    **Machine learning**

    The X_train.csv, Y_train.csv, X_test.csv and Y_test.csv files are used to train and test the models. The X files contain DNI, mass flow, inlet temperature, IAM and humidity; the Y files contain the output powers. Data are normalised between 0 and 1, and zero values have been removed.
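    A minimal sketch of the stated preprocessing (drop zero values, then min-max scale to [0, 1]) on one column. The sample DNI values are hypothetical, and the exact procedure the authors used is an assumption inferred from the description.

```python
# Sketch: drop zero rows, then min-max scale a column to [0, 1].
# Sample values are hypothetical; the authors' exact pipeline is an assumption.
def preprocess(column: list[float]) -> list[float]:
    nonzero = [v for v in column if v != 0.0]
    lo, hi = min(nonzero), max(nonzero)
    return [(v - lo) / (hi - lo) for v in nonzero]


dni = [0.0, 200.0, 450.0, 700.0, 0.0, 950.0]  # hypothetical DNI readings
print(preprocess(dni))  # [0.0, 0.333..., 0.666..., 1.0]
```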

  15. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • zenodo.org
    zip
    Updated Aug 24, 2021
    + more versions
    Cite
    Amir M. Mir; Amir M. Mir; Evaldas Latoskinas; Georgios Gousios; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4571228
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amir M. Mir; Amir M. Mir; Evaldas Latoskinas; Georgios Gousios; Evaldas Latoskinas; Georgios Gousios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • The dataset was gathered on September 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs.
    • The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in duplicate_files.txt file.
    • All of its Python projects are processed into JSON-formatted files, which contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
    • The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in dataset_split.csv file.
    • Notable changes to each version of the dataset are documented in CHANGELOG.md.
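    Consumers of the dataset usually start by grouping source files by their split from dataset_split.csv. The two-column layout (set name, file path) sketched below is an assumption; check the actual header in the release before relying on it.

```python
# Sketch: grouping source files by split using dataset_split.csv.
# The (set, path) two-column layout is an ASSUMPTION about the file format.
import csv
import io
from collections import defaultdict

# Synthetic stand-in for the real dataset_split.csv contents.
sample = (
    "train,repo_a/module.py\n"
    "valid,repo_b/utils.py\n"
    "test,repo_c/main.py\n"
    "train,repo_a/cli.py\n"
)

splits = defaultdict(list)
for split, path in csv.reader(io.StringIO(sample)):
    splits[split].append(path)

print({k: len(v) for k, v in splits.items()})
# {'train': 2, 'valid': 1, 'test': 1}
```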
  16. Best regions for predicting False Negatives (FN, i.e. no prediction of...

    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    + more versions
    Cite
    Mélanie Garcia; Clare Kelly (2024). Best regions for predicting False Negatives (FN, i.e. no prediction of Autism whereas diagnosed Autism). [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s009
    Explore at:
    csvAvailable download formats
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each row corresponds to one region, and each column to one model and one combination of datasets (the training + validation + testing 1 sets, with no comorbidity, or all of these sets plus testing set 2, which contains subjects with comorbidities); each cell gives the number of datasets in which the region was important for predicting TN for the model considered. (CSV)

  17. Medley-solos-DB: a cross-collection dataset for musical instrument...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    csv
    Updated Jul 4, 2022
    Cite
    Vincent Lostanlen; Vincent Lostanlen; Carmine-Emanuele Cella; Rachel Bittner; Rachel Bittner; Slim Essid; Carmine-Emanuele Cella; Slim Essid (2022). Medley-solos-DB: a cross-collection dataset for musical instrument recognition [Dataset]. http://doi.org/10.5281/zenodo.2582102
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jul 4, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vincent Lostanlen; Vincent Lostanlen; Carmine-Emanuele Cella; Rachel Bittner; Rachel Bittner; Slim Essid; Carmine-Emanuele Cella; Slim Essid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Medley-solos-DB
    =============
    Version 1.1, March 2019.

    Created By
    --------------

    Vincent Lostanlen (1), Carmine-Emanuele Cella (2), Rachel Bittner (3), Slim Essid (4).

    (1): New York University
    (2): UC Berkeley
    (3): Spotify, Inc.
    (4): Télécom ParisTech


    Description
    ---------------

    Medley-solos-DB is a cross-collection dataset for automatic musical instrument recognition in solo recordings. It consists of a training set of 3-second audio clips, which are extracted from the MedleyDB dataset of Bittner et al. (ISMIR 2014), as well as a test set of 3-second clips, which are extracted from the solosDb dataset of Essid et al. (IEEE TASLP 2009). Each of these clips contains a single instrument among a taxonomy of eight: clarinet, distorted electric guitar, female singer, flute, piano, tenor saxophone, trumpet, and violin.

    The Medley-solos-DB dataset is used in the benchmarks of musical instrument recognition in the publications of Lostanlen and Cella (ISMIR 2016) and Andén et al. (IEEE TSP 2019).

    [1] V. Lostanlen, C.E. Cella. Deep convolutional networks on the pitch spiral for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.

    [2] J. Andén, V. Lostanlen, S. Mallat. Joint time-frequency scattering. IEEE Transactions on Signal Processing, 2019, to appear.


    Data Files
    --------------

    The Medley-solos-DB contains 21572 audio clips as WAV files, sampled at 44.1 kHz, with a single channel (mono), at a bit depth of 32. Every audio clip has a fixed duration of 2972 milliseconds, that is, 65536 discrete-time samples.

    Every audio file has a name of the form:

    Medley-solos-DB_SUBSET-INSTRUMENTID_UUID.wav

    For example:

    Medley-solos-DB_test-0_0a282672-c22c-59ff-faaa-ff9eb73fc8e6.wav

    corresponds to the snippet whose universally unique identifier (UUID) is 0a282672-c22c-59ff-faaa-ff9eb73fc8e6, contains clarinet sounds (clarinet has instrument id equal to 0), and belongs to the test set.
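    The naming convention above can be unpacked programmatically. A small sketch (the helper name is ours, not part of the dataset's tooling):

```python
# Sketch: parse Medley-solos-DB_SUBSET-INSTRUMENTID_UUID.wav into its fields.
# Splitting on "_" is safe because the dataset prefix and the UUID contain
# only hyphens, never underscores.
def parse_clip_name(filename: str) -> tuple[str, int, str]:
    stem = filename.removesuffix(".wav")
    _prefix, subset_and_id, uuid = stem.split("_")
    subset, instrument_id = subset_and_id.rsplit("-", 1)
    return subset, int(instrument_id), uuid


print(parse_clip_name(
    "Medley-solos-DB_test-0_0a282672-c22c-59ff-faaa-ff9eb73fc8e6.wav"))
# ('test', 0, '0a282672-c22c-59ff-faaa-ff9eb73fc8e6')
```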


    Metadata Files
    -------------------

    The Medley-solos-DB_metadata is a CSV file containing 21572 rows (one for each audio clip) and five columns:

    1. subset: either "training", "validation", or "test"

    2. instrument: tag in Medley-DB taxonomy, such as "clarinet", "distorted electric guitar", etc.

    3. instrument id: integer from 0 to 7. There is a one-to-one correspondence between "instrument" (string format) and "instrument id" (integer). We provide both for convenience.

    4. track id: integer from 0 to 226. The track and artist names are anonymized.

    5. UUID: universally unique identifier, assigned at random and different for every row.

    The list of instrument classes is:

    0. clarinet

    1. distorted electric guitar

    2. female singer

    3. flute

    4. piano

    5. tenor saxophone

    6. trumpet

    7. violin
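    The one-to-one mapping listed above is natural to carry around as a lookup table, e.g.:

```python
# The instrument-id taxonomy of Medley-solos-DB as a lookup table.
INSTRUMENTS = {
    0: "clarinet",
    1: "distorted electric guitar",
    2: "female singer",
    3: "flute",
    4: "piano",
    5: "tenor saxophone",
    6: "trumpet",
    7: "violin",
}

print(INSTRUMENTS[0])  # clarinet
```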


    Please acknowledge Medley-solos-DB in academic research
    ---------------------------------------------------------------------------------

    When Medley-solos-DB is used for academic research, we would highly appreciate it if scientific publications of works based in part on this dataset cited the following publication:

    V. Lostanlen, C.E. Cella. Deep convolutional networks on the pitch spiral for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.

    The creation of this dataset was supported by ERC InvariantClass grant 320959.


    Conditions of Use
    ------------------------

    Dataset created by Vincent Lostanlen, Rachel Bittner, and Slim Essid, as a derivative work of MedleyDB and solosDb.

    The Medley-solos-DB dataset is offered free of charge under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
    https://creativecommons.org/licenses/by/4.0/

    The dataset and its contents are made available on an "as is" basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the authors are not liable for, and expressly exclude all liability for, loss or damage however and whenever caused to anyone by any use of the Medley-solos-DB dataset or any part of it.


    Feedback
    -------------

    Please help us improve Medley-solos-DB by sending your feedback to:
    vincent.lostanlen@nyu.edu

    In case of a problem, please include as many details as possible.

    Acknowledgement
    -------------------------
    We thank all artists, recording engineers, curators, and annotators of both MedleyDB and solosDb.

  18. BitcoinOTC - Training and test sets

    • figshare.com
    txt
    Updated Aug 10, 2020
    Cite
    Emanuele Pio Barracchia; Gianvito Pio; Heitor Murilo Gomes; Bernhard Pfahringer; Albert Bifet; Michelangelo Ceci (2020). BitcoinOTC - Training and test sets [Dataset]. http://doi.org/10.6084/m9.figshare.12780776.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Aug 10, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Emanuele Pio Barracchia; Gianvito Pio; Heitor Murilo Gomes; Bernhard Pfahringer; Albert Bifet; Michelangelo Ceci
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Training and test sets extracted from the BitcoinOTC dataset (https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html). We considered only links associated with a trust value greater than 0.
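    The stated filter is easy to reproduce on the SNAP edge list, whose rows are SOURCE,TARGET,RATING,TIME. A sketch with synthetic rows standing in for the real file:

```python
# Sketch: keep only BitcoinOTC links with trust (rating) > 0.
# Rows follow the SNAP soc-sign-bitcoin-otc format: SOURCE,TARGET,RATING,TIME.
# The sample rows below are synthetic stand-ins for the real edge list.
import csv
import io

raw = (
    "6,2,4,1289241911\n"
    "6,5,2,1289241941\n"
    "1,15,-1,1289243140\n"
    "4,3,7,1289245277\n"
)

positive = [
    (int(src), int(dst), int(rating))
    for src, dst, rating, _time in csv.reader(io.StringIO(raw))
    if int(rating) > 0
]
print(len(positive))  # 3
```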

  19. Train, validation, test data sets and confusion matrices underlying...

    • 4tu.edu.hpc.n-helix.com
    zip
    Updated Sep 7, 2023
    Cite
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Annotated test and train data sets. Both images and annotations are provided separately.


    Validation data set for Hi5, Sf9 and HEK cells.


    Confusion matrices for the determination of performance parameters.

  20. A Dataset for Machine Learning Algorithm Development

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 1, 2024
    Cite
    (Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
    Explore at:
    Dataset updated
    May 1, 2024
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.

