Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Objects Without Annotations is a dataset for object detection tasks - it contains Objects annotations for 1,300 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Dataset Without Tree Instance is a dataset for instance segmentation tasks - it contains Cars, Person, Trees, Bicycles, and 5GCR annotations for 368 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a fine-tuned 600M BERT-like evaluator achieves results comparable with human annotators and GPT-4.
Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
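For convenience, a minimal sketch of loading the dataset from the Hugging Face Hub (the split name is an assumption, not documented here):

```python
from datasets import load_dataset

ds = load_dataset("LibrAI/do-not-answer", split="train")  # split name assumed
print(ds[0])
```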
https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

```python
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

dataset = load_dataset(
    "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
)

data = Dataset.from_dict({})
for i, entry in …  # truncated in the source
```

See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
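The snippet above breaks off mid-loop. A hedged completion, assuming each streamed article is embedded and appended (the field names and the 3,000-article cutoff are guesses from the dataset name, not the author's code):

```python
for i, entry in enumerate(tqdm(dataset)):
    if i >= 3000:  # assumed cutoff, inferred from the dataset name
        break
    embedding = model.encode(entry["text"])  # "text" field assumed
    data = data.add_item({**entry, "embedding": embedding.tolist()})
```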
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a curated collection of fashion product images paired with their titles and descriptions, designed for training and fine-tuning multimodal AI models. Originally derived from Param Aggarwal's "Fashion Product Images Dataset," it has undergone extensive preprocessing to improve usability and efficiency.
Preprocessing steps include:
1. Resizing all images to 256 × 256 px while preserving their original aspect ratio.
2. Streamlining the reference CSV file to retain only essential fields: image file name, display name, product description, and category.
3. Removing redundant style JSON files to minimize dataset complexity.
These optimizations have reduced the dataset size by 95%, making it lighter and faster to use without compromising data quality. This refined dataset is ideal for research and applications in multimodal AI, including tasks like product recommendation, image-text matching, and domain-specific fine-tuning.
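For illustration, a minimal sketch of the resizing step (step 1) using Pillow; the directory names are hypothetical, and padding to a square is one possible way to reach 256 × 256 px while keeping the aspect ratio:

```python
from pathlib import Path
from PIL import Image

def resize_with_padding(src: Path, dst: Path, size: int = 256) -> None:
    """Shrink so the longer side is `size` px, then pad to a square canvas."""
    img = Image.open(src).convert("RGB")
    img.thumbnail((size, size))  # preserves the original aspect ratio
    canvas = Image.new("RGB", (size, size), (255, 255, 255))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    canvas.save(dst)

out_dir = Path("resized"); out_dir.mkdir(exist_ok=True)  # hypothetical paths
for path in Path("images").glob("*.jpg"):
    resize_with_padding(path, out_dir / path.name)
```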
We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose concatenated-bAbI (catbAbI): an infinite sequence of bAbI stories. catbAbI is generated from the bAbI dataset, and during training, a random sample/story from any task is drawn without replacement and concatenated to the ongoing story. The preprocessing for catbAbI addresses several issues: it removes the supporting facts, leaves the questions embedded in the story, inserts the correct answer after the question mark, and tokenises the full sample into a single sequence of words. As such, catbAbI is designed to be trained in an autoregressive way, analogous to closed-book question answering.
catbAbI models can be trained in two different ways: language modelling mode (LM-mode) or question-answering mode (QA-mode). In LM-mode, the catbAbI models are trained like autoregressive word-level language models. In QA-mode, the catbAbI models are only trained to predict the tokens that are answers to questions—making it more similar to regular bAbI. QA-mode is simply implemented by masking out losses on non-answer predictions. In both training modes, the model performance is solely measured by its accuracy and perplexity when answering the questions.
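For illustration, a minimal sketch of QA-mode loss masking in PyTorch (shapes and names are assumptions, not the authors' code):

```python
import torch.nn.functional as F

def qa_mode_loss(logits, targets, answer_mask):
    """QA-mode: mask out losses on non-answer predictions.

    logits: (batch, seq_len, vocab); targets: (batch, seq_len);
    answer_mask: float (batch, seq_len), 1.0 where the target token answers
    a question, 0.0 elsewhere. All names and shapes are assumptions.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # (batch, seq_len)
    masked = per_token * answer_mask  # zero the loss on non-answer tokens
    return masked.sum() / answer_mask.sum().clamp(min=1)
```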
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization.

Data augmentation techniques involve applying various transformations to existing data samples to create new ones. These transformations include random rotations, translations, scaling, flips, and more. Augmentation helps increase the dataset size, introduces natural variations, and improves model performance by making the model more invariant to specific transformations.

The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name and date of birth. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
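A minimal sketch of the augmentation transformations listed above, using torchvision (the parameter values are illustrative assumptions, not properties of this dataset):

```python
from torchvision import transforms

# Random rotations, translations, scaling, and flips, as described above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
# Applying `augment` to a PIL image yields a new, transformed training sample.
```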
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

File format: R workspace file, "Simulated_Dataset.RData".

Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of "true" critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code Abstract: We provide R statistical software code ("CWVS_LMC.txt") to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ("Results_Summary.txt") to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
• "CWVS_LMC.txt": This code is delivered to the user as a .txt file containing R statistical software code. Once the "Simulated_Dataset.RData" workspace has been loaded into R, the code can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
• "Results_Summary.txt": This code is also delivered as a .txt file containing R statistical software code. Once the "CWVS_LMC.txt" code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Required R packages:
• For running "CWVS_LMC.txt": msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
• For running "Results_Summary.txt": plotrix (plotting the posterior means and credible intervals)

Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the "Simulated_Dataset.RData" workspace.
• Run the code contained in "CWVS_LMC.txt".
• Once the "CWVS_LMC.txt" code is complete, run "Results_Summary.txt".

Below is the replication procedure for the portion of the analyses using a simulated data set.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
Files:
• COMBINE_CONC_A0_2016_without_DMS_AVG.tar – annual average model-predicted concentrations without DMS chemistry
• COMBINE_CONC_A_2016_annual_AVG.tar – annual average model-predicted concentrations with DMS chemistry
• GRIDCRO2D.108NHEMI2.44L.20060101.tar – file containing latitude and longitude of model grid cells

Model: The Community Multiscale Air Quality (CMAQ) model, version 5.3, was used. It is available at https://www.epa.gov/cmaq.

This dataset is associated with the following publication: Gantt, B., K. Foley, B. Henderson, H. Pye, K. Fahey, D. Kang, R. Mathur, J. Zhao, Y. Zhang, Q. Li, A. Saiz-Lopez, and G. Sarwar. Impact of dimethylsulfide chemistry on air quality over the Northern Hemisphere. Atmospheric Environment. Elsevier Science Ltd, New York, NY, USA, 244: 117961, (2020).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the classification performance with confidence intervals (CIs) computed at 95% using bootstrapping (n=1000). “AUC” refers to the area under the receiver-operating characteristic curve. “Loc Bag” and “Loc GBP” respectively refer to the localization precision of the sparse BagNet and of Guided Backpropagation on ResNet-50 at localizing lesions from annotated images. For each dataset, the first row shows the performance of the interpretable sparse BagNet model, while the second row shows the performance of the baseline black-box ResNet-50 model. The Kaggle dataset (first row) is the internal dataset used to train and evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model. The low classification performance on the FCM-UNA and FGA-DR datasets can be explained by the relatively low quality of most images in the FCM-UNA dataset and the large intensity variation of the FGA-DR dataset (S
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Polish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Polish language, advancing the field of artificial intelligence.
Dataset Content: This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Polish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Polish speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable materials.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single words, short phrases, single sentences, and paragraphs. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Polish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, and rich_text.
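For illustration, a minimal sketch of filtering the JSON export by the annotation fields listed above (the file name and field values are hypothetical):

```python
import json

# Hypothetical file name; field values are assumptions, not documented values.
with open("polish_open_ended_qa.json", encoding="utf-8") as f:
    records = json.load(f)

# Select hard, fact-based questions using the documented annotation fields.
hard_fact_based = [
    r for r in records
    if r["complexity"] == "hard" and r["question_category"] == "fact-based"
]
print(len(hard_fact_based))
```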
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Polish are grammatically accurate, with no spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Polish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the English language, advancing the field of artificial intelligence.
Dataset Content: This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in English. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable materials.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single words, short phrases, single sentences, and paragraphs. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled English Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, and rich_text.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in English are grammatically accurate, with no spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "additional training dataset" for the DCASE 2024 Challenge Task 2.
The data consists of the normal/anomalous operating sounds of nine types of real/toy machines. Each recording is a single-channel audio that includes both a machine's operating sound and environmental noise. The duration of recordings varies from 6 to 10 seconds. The following nine types of real/toy machines are used in this task:
3DPrinter
AirCompressor
BrushlessMotor
HairDryer
HoveringDrone
RoboticArm
Scanner
ToothBrush
ToyCircuit
Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial-intelligence-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.
This task is the follow-up from DCASE 2020 Task 2 to DCASE 2023 Task 2. The task this year is to develop an ASD system that meets the following five requirements.
1. Train a model using only normal sound (unsupervised learning scenario). Because anomalies rarely occur and are highly diverse in real-world factories, it can be difficult to collect exhaustive patterns of anomalous sounds. Therefore, the system must detect unknown types of anomalous sounds that are not provided in the training data. This is the same requirement as in the previous tasks.
2. Detect anomalies regardless of domain shifts (domain generalization task). In real-world cases, the operational states of a machine or the environmental noise can change, causing domain shifts. Domain-generalization techniques can be useful for handling domain shifts that occur frequently or are hard to notice. In this task, the system is required to use domain-generalization techniques for handling these domain shifts. This requirement is the same as in DCASE 2022 Task 2 and DCASE 2023 Task 2.
3. Train a model for a completely new machine type. For a completely new machine type, hyperparameters of the trained model cannot be tuned. Therefore, the system should have the ability to train models without additional hyperparameter tuning. This requirement is the same as in DCASE 2023 Task 2.
4. Train a model using a limited number of machines from its machine type. While sounds from multiple machines of the same machine type can be used to enhance detection performance, it is often the case that only a limited number of machines are available for a machine type. In such a case, the system should be able to train models using a few machines from a machine type. This requirement is the same as in DCASE 2023 Task 2.
5. Train a model both with and without attribute information. While additional attribute information can help enhance detection performance, we cannot always obtain such information. Therefore, the system must work well both when attribute information is available and when it is not.
The last requirement is newly introduced in DCASE 2024 Task 2.
Definition
We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."
"Machine type" indicates the type of machine, which in the additional training dataset is one of nine: 3D-printer, air compressor, brushless motor, hair dryer, hovering drone, robotic arm, document scanner (scanner), toothbrush, and Toy circuit.
A section is defined as a subset of the dataset for calculating performance metrics.
The source domain is the domain under which most of the training data and some of the test data were recorded, and the target domain is a different set of domains under which some of the training data and some of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, signal-to-noise ratio, etc.
Attributes are parameters that define states of machines or types of noise. For several machine types, the attributes are hidden.
Dataset
This dataset consists of nine machine types. For each machine type, one section is provided, and the section is a complete set of training data. A set of test data corresponding to this training data will be provided on a separate Zenodo page as the "evaluation dataset" for the DCASE 2024 Challenge Task 2. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training and (ii) ten clips of normal sounds in the target domain for training. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.
File names and attribute csv files
File names and attribute csv files provide reference labels for each clip. The given reference labels for each training clip include machine type, section index, normal/anomaly information, and attributes regarding conditions other than normal/anomaly. The machine type is given by the directory name. The section index is given by the file name. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are also given by the file name. Note that for machine types that have their attribute information hidden, the attribute portion of each file name is labeled only as "noAttributes". Attribute csv files allow easy access to the attributes that cause domain shifts. In these files, the file name, the names of the parameters that cause domain shifts (domain shift parameter, dp), and the values or types of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
For machine types that have their attribute information hidden, all columns except the filename column are left blank for each row.
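For illustration, a minimal sketch of parsing an attribute csv file in the row format shown above (the file name is hypothetical):

```python
import csv

# Hypothetical file name; rows follow: filename, d1p, d1v, d2p, d2v, ...
with open("attributes_00.csv", newline="") as f:
    for row in csv.reader(f):
        filename, pairs = row[0], row[1:]
        # Pair up parameter names (dNp) with their values (dNv); for machine
        # types with hidden attributes, these columns are blank.
        attributes = dict(zip(pairs[0::2], pairs[1::2]))
        print(filename, attributes)
```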
Recording procedure
Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed a target machine sound with environmental noise, and only the noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset explaining the details of the recording procedure by the submission deadline.
Directory structure
/eval_data
Baseline system
The baseline system is available on the GitHub repository. It provides a simple entry-level approach that gives reasonable performance on the Task 2 dataset. It is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Condition of use
This dataset was created jointly by Hitachi, Ltd., NTT Corporation and STMicroelectronics and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
Contact
If there is any problem, please contact us:
Tomoya Nishida, tomoya.nishida.ax@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org
Noboru Harada, noboru@ieee.org
Daisuke Niizumi, daisuke.niizumi.dt@hco.ntt.co.jp
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
https://creativecommons.org/publicdomain/zero/1.0/
Useful for training a binary audio classification model. My model trained on this dataset achieved ~95.5% accuracy without data augmentations, using an xresnet18 architecture as implemented by the fastai and fastaudio libraries. Feel free to train a model from this specific dataset, or use my script to create your own!
Script used to generate dataset: https://github.com/bfitzgerald3132/jazz-classical-dataset/
Using this script, you can dynamically generate datasets of >30,000 items from only ~100 YouTube links. For each link, the script downloads the corresponding MP4 and splits it into a randomly shuffled set of clips of n seconds each. Dynamic generation produces an equal number of jazz and classical clips, as well as an equal number of clips from each link, to prevent bias in the dataset.
When I generated this dataset, I used the following parameters:
- NUMBER_OF_CLIPS = 1500
- LENGTH_OF_CLIP = 5000
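For illustration, a hedged sketch of the splitting step (not the generation script itself; pydub, the file names, and the millisecond unit of LENGTH_OF_CLIP are assumptions):

```python
import os
import random
from pydub import AudioSegment

LENGTH_OF_CLIP = 5000  # assumed to be in milliseconds

audio = AudioSegment.from_file("track.mp4")  # hypothetical downloaded file
clips = [audio[i:i + LENGTH_OF_CLIP]
         for i in range(0, len(audio) - LENGTH_OF_CLIP + 1, LENGTH_OF_CLIP)]
random.shuffle(clips)  # randomly shuffle the clip order, as described above

os.makedirs("clips", exist_ok=True)
for n, clip in enumerate(clips):
    clip.export(f"clips/clip_{n:04d}.wav", format="wav")
```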
Introduction
This dataset supports Ye et al. 2024, Nature Communications.
Ye, S., Filippova, A., Lauer, J. et al. SuperAnimal pretrained pose estimation models for behavioral analysis. Nat Commun 15, 5165 (2024). https://doi.org/10.1038/s41467-024-48792-2
Please cite this dataset and paper if you use this resource. Please also see Ye et al. 2024 for the full DataSheet accompanying this download, including the metadata for how to use this data if you want to compare model results on benchmark tasks. Below is just a summary. Also see the dataset licensing below.
Training Data
The models were trained jointly on the following datasets:
• AwA-Pose: Quadruped dataset; see full details at (1).
• AnimalPose: See full details at (2).
• AcinoSet: See full details at (3).
• Horse-30: Horse-30 dataset, whose benchmark task is called Horse-10; see full details at (4).
• StanfordDogs: See full details at (5, 6).
• AP-10K: See full details at (7).
• iRodent: We utilized the iNaturalist API functions for scraping observations with the taxon ID of Suborder Myomorpha (8). The functions allowed us to filter the large number of observations down to the ones with photos under the CC BY-NC creative license. The most common types of rodents among the collected observations are Muskrat (Ondatra zibethicus), Brown Rat (Rattus norvegicus), House Mouse (Mus musculus), Black Rat (Rattus rattus), Hispid Cotton Rat (Sigmodon hispidus), Meadow Vole (Microtus pennsylvanicus), Bank Vole (Clethrionomys glareolus), Deer Mouse (Peromyscus maniculatus), White-footed Mouse (Peromyscus leucopus), and Striped Field Mouse (Apodemus agrarius). We then generated segmentation masks over target animals in the data by processing the media through an algorithm we designed that uses a Mask Region-Based Convolutional Neural Network (Mask R-CNN) (9) model with a ResNet-50-FPN backbone (10), pretrained on the COCO dataset (11). The 443 processed images were then manually labeled with both pose annotations and segmentation masks. iRodent data is banked at https://zenodo.org/record/8250392.
• APT-36K: See full details at (12).

An image with a keypoint guide is included.
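For context, a minimal sketch of the mask-generation step described for iRodent, using a COCO-pretrained Mask R-CNN with a ResNet-50-FPN backbone from torchvision (an illustration, not the authors' pipeline):

```python
import torch
import torchvision

# COCO-pretrained Mask R-CNN with a ResNet-50-FPN backbone (torchvision >= 0.13).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Placeholder for a loaded RGB image scaled to [0, 1]; shape (3, H, W).
image = torch.rand(3, 480, 640)
with torch.no_grad():
    output = model([image])[0]

# Keep masks for confident detections; the threshold is an illustrative choice.
masks = output["masks"][output["scores"] > 0.5]
```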
Ethical Considerations
• No experimental data was collected for this model; all datasets used are cited above.
Caveats and Recommendations
• Please note that each dataset was labeled by separate labs and separate individuals; therefore, while we map names to a unified pose vocabulary, there will be annotator bias in keypoint placement (see Ye et al. 2024 for our Supplementary Note on annotator bias). You will also note the dataset is highly diverse across species, but collectively has more representation of domesticated animals like dogs, cats, horses, and cattle. If the performance of a model trained on this data is not as good as you need it to be, we recommend first trying video adaptation (see Ye et al. 2024), or fine-tuning the weights with your own labeling.
License
Modified MIT.
Copyright 2023-present by Mackenzie Mathis, Shaokai Ye, and contributors.
Permission is hereby granted to you (hereafter "LICENSEE") a fully-paid, non-exclusive, and non-transferable license for academic, non-commercial purposes only (hereafter “LICENSE”) to use the "DATASET" subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software:
This data or resulting software may not be used to harm any animal deliberately.
LICENSEE acknowledges that the DATASET is a research tool. THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
If this license is not appropriate for your application, please contact Prof. Mackenzie W. Mathis (mackenzie@post.harvard.edu) for a commercial use license.
Please cite Ye et al 2024 if you use this DATASET in your work.
References
(1) Prianka Banik, Lin Li, and Xishuang Dong. A novel dataset for keypoint detection of quadruped animals from images. ArXiv, abs/2108.13958, 2021.
(2) Jinkun Cao, Hongyang Tang, Haoshu Fang, Xiaoyong Shen, Cewu Lu, and Yu-Wing Tai. Cross-domain adaptation for animal pose estimation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9497–9506, 2019.
(3) Daniel Joska, Liam Clark, Naoya Muramatsu, Ricardo Jericevich, Fred Nicolls, Alexander Mathis, Mackenzie W. Mathis, and Amir Patel. AcinoSet: A 3D pose estimation dataset and baseline models for cheetahs in the wild. 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13901–13908, 2021.
(4) Alexander Mathis, Thomas Biasi, Steffen Schneider, Mert Yuksekgonul, Byron Rogers, Matthias Bethge, and Mackenzie W. Mathis. Pretraining boosts out-of-domain robustness for pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1859–1868, 2021.
(5) Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
(6) Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, and Roberto Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. In Asian Conference on Computer Vision, pages 3–19. Springer, 2018.
(7) Hang Yu, Yufei Xu, Jing Zhang, Wei Zhao, Ziyu Guan, and Dacheng Tao. AP-10K: A benchmark for animal pose estimation in the wild. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
(8) iNaturalist. GBIF Occurrence Download. https://doi.org/10.15468/dl.p7nbxt. iNaturalist, July 2020.
(9) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
(10) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection, 2016.
(11) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
(12) Yuxiang Yang, Junjie Yang, Yufei Xu, Jing Zhang, Long Lan, and Dacheng Tao. APT-36K: A large-scale benchmark for animal pose estimation and tracking. Advances in Neural Information Processing Systems, 35:17301–17313, 2022.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the internal and external validation datasets used to evaluate the models. “Origin” refers to the country where the data was collected. “Lesion” refers to the number of images in the dataset with lesion annotations. The Kaggle dataset (first row, shaded in gray) is the internal dataset used to evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Data for a Brief Report/Short Communication published in Body Image (2021). Details of the study are included below via the abstract from the manuscript.

The dataset includes online experimental data from 167 women who were recruited via social media and institutional participant pools. The experiment was completed in Qualtrics. Women viewed either neutral travel images (control), body positivity posts with an average-sized model (e.g., ~UK size 14), or body positivity posts with a larger model (e.g., UK size 18+); which images women viewed is shown in the ‘condition’ variable in the data.

The data includes the age range, height, weight, calculated BMI, and Instagram use of participants. After viewing the images, women responded to the Positive and Negative Affect Schedule (PANAS), a state version of the Body Satisfaction Scale (BSS), and reported their immediate social comparison with the images (SAC items). Women then selected a lunch for themselves from a hypothetical menu; these selections are detailed in the data, as are the total calories calculated from this and the proportion of their picks which were (provided as a percentage, and as a categorical variable [as used in the paper analyses]). Women also reported whether they were on a special diet (e.g., vegan or vegetarian), had food intolerances, when they last ate, and how hungry they were.
Women also completed trait measures of Body Appreciation (BAS-2) and social comparison (PACS-R), and were asked to comment on what they thought the experiment was about. Items and computed scales are included within the dataset.

This item includes the dataset collected for the manuscript (in SPSS and CSV formats), the variable list for the CSV file (for users working with the CSV datafile; the variable list and details are contained within the .sav file for the SPSS version), and the SPSS syntax for our analyses (.sps). Also included are the information and consent form (collected via Qualtrics) and the questions as completed by participants (both in PDF format).

Please note that the survey order in the PDF is not the same as in the datafiles; users should utilise the variable list (in either CSV or SPSS format) to identify the items in the data. The SPSS syntax can be used to replicate the analyses reported in the Results section of the paper; annotations within the syntax file guide the user through these.
A copy of SPSS Statistics is needed to open the .sav and .sps files.
Manuscript abstract:
Body Positivity (or ‘BoPo’) social media content may be beneficial for women’s mood and body image, but concerns have been raised that it may reduce motivation for healthy behaviours. This study examines differences in women’s mood, body satisfaction, and hypothetical food choices after viewing BoPo posts (featuring average or larger women) or a neutral travel control. Women (N = 167, 81.8% aged 18-29) were randomly assigned in an online experiment to one of three conditions (BoPo-average, BoPo-larger, or Travel/Control) and viewed three Instagram posts for two minutes, before reporting their mood and body satisfaction, and selecting a meal from a hypothetical menu. Women who viewed the BoPo posts featuring average-size women reported more positive mood than the control group; women who viewed posts featuring larger women did not. There were no effects of condition on negative mood or body satisfaction. Women did not make less healthy food choices than the control in either BoPo condition; women who viewed the BoPo images of larger women showed a stronger association between hunger and calories selected. These findings suggest that concerns over BoPo promoting unhealthy behaviours may be misplaced, but further research is needed regarding women’s responses to different body sizes.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The German Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the German language, advancing the field of artificial intelligence.
Dataset Content: This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in German. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native German speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable materials.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single words, short phrases, single sentences, and paragraphs. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled German Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, and rich_text.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in German are grammatically accurate, with no spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy German Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
This dataset was created by mostafa_mohamed98
The Peel River 'without development' model is a hydrological model of the Peel River system that does not represent river system infrastructure, extractions or water management rules. Note: Source software (available from ewater.org.au) is required to view and run the model(s) within. ----------------------------------- Note: If you would like to ask a question, make any suggestions, or tell us how you are using this dataset, please vis