MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
error-detection-positives
This dataset is part of the PARC (Premise-Annotated Reasoning Collection) and contains mathematical reasoning problems with error annotations. This dataset combines positive samples from multiple domains.
Domain Breakdown
gsm8k: 50 samples
math: 53 samples
metamathqa: 93 samples
orca_math: 96 samples
Features
Each example contains:
data_source: The domain/source of the problem (gsm8k, math, metamathqa, orca_math) question: The… See the full description on the dataset page: https://huggingface.co/datasets/PARC-DATASETS/error-detection-positives.
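If it helps to see these fields in use, here is a minimal sketch of loading the dataset with the datasets library; the split name is an assumption taken from common practice and is not stated on the card above.

```python
# Minimal sketch: load the dataset from the Hugging Face Hub and count samples
# per domain. The split name ("train") is an assumption; only the repository id
# and the data_source/question fields are confirmed above.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("PARC-DATASETS/error-detection-positives", split="train")
print(Counter(example["data_source"] for example in ds))
# Expected per the breakdown above: gsm8k, math, metamathqa, orca_math
```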
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence with all abbreviations expanded into full word forms.
Training dataset: 64,665 sentence pairs
Validation dataset: 7,185 sentence pairs
Testing dataset: 7,984 sentence pairs
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
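As an illustration of the text-to-text setup, the following sketch reads the CSV splits into (abbreviated, normalized) pairs; the file names and the column names source/target are hypothetical, since the description above does not name them.

```python
# Hypothetical file and column names ("train.csv", "source", "target"); the
# description only states that each row pairs an abbreviated sentence with its
# fully spelled-out rewrite.
import pandas as pd

train = pd.read_csv("train.csv")             # 64,665 sentence pairs
validation = pd.read_csv("validation.csv")   # 7,185 sentence pairs
test = pd.read_csv("test.csv")               # 7,984 sentence pairs

pairs = list(zip(train["source"], train["target"]))
print(pairs[0])  # (sentence with abbreviations, normalized sentence)
```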
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Domain Generation Dataset
This dataset contains 1,667 high-quality examples for fine-tuning language models to generate creative and relevant domain names for businesses, with built-in safety training and edge case handling.
Dataset Creation
Methodology: Hybrid approach combining Claude API generation with manual curation after encountering API reliability issues. Original Target: 2,000 examples → Final Result: 1,667 examples after deduplication and quality control.… See the full description on the dataset page: https://huggingface.co/datasets/Maikobi/domain-generation-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As a part of ESB benchmark, we provide a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the… See the full description on the dataset page: https://huggingface.co/datasets/esb/diagnostic-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark code is available at: https://github.com/Junjue-Wang/LoveDA
Highlights:
Reference:
@inproceedings{wang2021loveda,
title={Love{DA}: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation},
author={Junjue Wang and Zhuo Zheng and Ailong Ma and Xiaoyan Lu and Yanfei Zhong},
booktitle={Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
editor = {J. Vanschoren and S. Yeung},
year={2021},
volume = {1},
pages = {},
url={https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/4e732ced3463d06de0ca9a15b6153677-Paper-round2.pdf}
}
License:
The owners of the data and of the copyright on the data are RSIDEA, Wuhan University. Use of the Google Earth images must respect the "Google Earth" terms of use. All images and their associated annotations in LoveDA can be used for academic purposes only, but any commercial use is prohibited. (CC BY-NC-SA 4.0)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Python repositories mined from GitHub on January 20, 2021. It allows a cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, each containing only projects from the web domain or the scientific calculation domain, respectively; to this end, we searched for projects with dependencies on either Flask or NumPy. Furthermore, only projects with a dependency on mypy were considered, because this should ensure that at least parts of the projects have type annotations, which can later be used as ground truth. Further details about the dataset will be described in an upcoming paper; as soon as it is published, it will be linked here.
The dataset consists of two files, one for each sub-dataset. The web domain dataset contains 3,129 repositories and the scientific calculation domain dataset contains 4,783 repositories. The files have two columns: the URL of the GitHub repository and the commit hash that was used. Thus, it is possible to download the dataset using shell or Python scripts; for example, the pipeline provided by ManyTypes4Py can be used, or a minimal script like the sketch shown below.
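The sketch below is one way to do this in Python; the CSV filename and column names are hypothetical, as the description only states that each row holds a repository URL and a commit hash.

```python
# Clone each repository and pin it to the listed commit. File and column names
# ("web_domain.csv", "url", "commit_hash") are hypothetical.
import subprocess
from pathlib import Path

import pandas as pd

repos = pd.read_csv("web_domain.csv")
out_dir = Path("repositories")
out_dir.mkdir(exist_ok=True)

for url, commit in zip(repos["url"], repos["commit_hash"]):
    target = out_dir / url.rstrip("/").split("/")[-1]
    if not target.exists():
        subprocess.run(["git", "clone", url, str(target)], check=True)
    subprocess.run(["git", "-C", str(target), "checkout", commit], check=True)
```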
If repositories no longer exist or have become private, you can contact us via the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

* Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
* Data representation based on the PROHOW vocabulary: http://w3id.org/prohow#. Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
* Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

Statistics:
* 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow); from Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow); from Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs).
* 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs).
* 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links).
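Since the data is plain RDF based on the PROHOW vocabulary, a small rdflib sketch can give a feel for its structure; the serialization format passed to parse() is an assumption, and the predicate listing is purely exploratory.

```python
# Parse one of the know-how dumps and list the most frequent PROHOW predicates.
# The file name comes from the quickstart note above; the "nt" format is an
# assumption about the serialization.
from collections import Counter

from rdflib import Graph

g = Graph()
g.parse("9of11_knowhow_wikihow", format="nt")

PROHOW = "http://w3id.org/prohow#"
counts = Counter(str(p) for _, p, _ in g if str(p).startswith(PROHOW))
for predicate, count in counts.most_common(10):
    print(predicate, count)
```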
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unlike previous combined multi-domain datasets for unsupervised domain adaptation (UDA) in medical image segmentation, this multi-domain fundus image dataset contains annotations made by the same group of ophthalmologists. Hence, the annotator bias among the different datasets is mitigated, and the dataset provides a relatively fair benchmark for evaluating UDA methods in fundus image segmentation.
This dataset is based on the RIGA[1] dataset and the MESSIDOR[2] dataset. We appreciate the efforts devoted by the authors of [1] and [2].
The six duplicated cases in the RIGA dataset are filtered out according to the Errata. We also remove the duplicated cases that exist in both the RIGA dataset and the MESSIDOR dataset by hash value matching.
Domain | Dataset | Labeled Samples (Train+Test) | Unlabeled Samples
---|---|---|---
Source | BinRushed | 195 (195+0) | 0
Source | Magrabia | 95 (95+0) | 0
Target | MESSIDOR-BASE1 | 173 (138+35) | 227
Target | MESSIDOR-BASE2 | 148 (118+30) | 238
Target | MESSIDOR-BASE3 | 133 (106+27) | 252
[1] Almazroa A, Alodhayb S, Osman E, et al. Retinal fundus images for glaucoma analysis: the RIGA dataset. In: Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications. International Society for Optics and Photonics, 2018, 10579: 105790B.
[2] Decencière E, Zhang X, Cazuguel G, et al. Feedback on a publicly distributed image database: the Messidor database. Image Analysis & Stereology, 2014, 33(3): 231-234.
If you find this dataset useful for your research, please consider citing the paper as follows:
@inproceedings{hu2022domain,
title={Domain Specific Convolution and High Frequency Reconstruction based Unsupervised Domain Adaptation for Medical Image Segmentation},
author={Shishuai Hu and Zehui Liao and Yong Xia},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
year={2022},
organization={Springer}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Example is a dataset for object detection tasks - it contains Driver annotations for 25,809 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
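For reference, a hedged sketch of the usual roboflow Python workflow is shown below; the API key, workspace and project identifiers, version number, and export format are all placeholders, since none of them are given on this page.

```python
# Placeholder identifiers throughout; replace with your own Roboflow details.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
dataset = project.version(1).download("coco")  # export format is an assumption
print(dataset.location)  # local folder with images and annotations
```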
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Iqman Singh Bhatia
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies: it has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important for modeling the real world, but they also pose nontrivial challenges for research on embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
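Because the scores are plain MusicXML, they can be opened with standard symbolic-music tooling; the sketch below uses music21 on a placeholder file path, which is not something the PDMX description itself prescribes.

```python
# Parse a single MusicXML score; the path is a placeholder.
from music21 import converter

score = converter.parse("example_score.musicxml")
print(score.metadata.title, len(score.parts))
```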
Dataset Card for "medical-domain"
Dataset Summary
Medical transcription data scraped from mtsamples.com. Medical data is extremely hard to find due to HIPAA privacy regulations. This dataset offers a solution by providing sample medical transcriptions for various medical specialties.
Languages
English
Citation Information
Acknowledgements
Medical transcription data scraped from mtsamples.com… See the full description on the dataset page: https://huggingface.co/datasets/argilla/medical-domain.
https://choosealicense.com/licenses/unknown/
Dataset Card for PACS
PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images), and Sketch (3,929 images). Each domain contains seven categories (labels): Dog, Elephant, Giraffe, Guitar, Horse, House, and Person. The total number of samples is 9,991.
Dataset Details
The PACS DG dataset is created by intersecting the classes found in Caltech256 (Photo), Sketchy (Photo, Sketch)… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/pacs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. More precisely, this problem can be defined as: “Given an RDF graph 'G' and a pair of entities 'a' and 'b', represented in 'G', compute the paths in 'G' from 'a' to 'b' that best describe the connectivity between them”.

This dataset supports the evaluation of approaches that address the entity relatedness problem and contains a total of 240 ranked lists with 50 relationship paths each between entity pairs in two familiar domains, music and movies, in two subsets of DBpedia that we call DBpedia21M and DBpedia45M. Specifically, we extracted data from the following two publicly available subsets of the English DBpedia corpus to form our two knowledge bases:
1. mappingbased-objects: https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2021.03.01/mappingbased-objects_lang=en.ttl.bz2
2. infobox-properties: https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2021.03.01/infobox-properties_lang=en.ttl.bz2

DBpedia21M contains the statements in the mappingbased-objects dataset, and DBpedia45M contains the union of the statements in mappingbased-objects and infobox-properties. In both cases, we exclude statements involving literals or blank nodes.

For each dataset (DBpedia21M and DBpedia45M), the ground truth contains 120 ranked lists with 50 relationship paths each. Each list corresponds to the most relevant paths between one of the 20 entity pairs, 10 pairs from the music domain and 10 from the movie domain, found using different path search strategies. A path search strategy consists of an entity similarity measure and a path ranking measure. The ground truth was created using the following 6 strategies:
1. Jaccard Index & Predicate Frequency Inverse Triple Frequency (PF-ITF)
2. Jaccard Index & Exclusivity-based Relatedness (EBR)
3. Jaccard Index & Pointwise Mutual Information (PMI)
4. Wikipedia Link-based Measure (WLM) & PF-ITF
5. WLM & EBR
6. WLM & PMI

The filename of a file that contains the ranked list of 50 relationship paths between a pair of entities has the following format: [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt
Example 1: DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt
Example 2: DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt
The file in Example 1 contains the top-50 most relevant paths between Michael Jackson and Whitney Houston in DBpedia21M using search strategy number 2 (Jaccard Index & EBR). The file in Example 2 contains the top-50 most relevant paths between Paul Newman and Joanne Woodward in DBpedia45M using search strategy number 4 (WLM & PF-ITF).

The data is split into two files, one for each dataset, compressed in .zip format:
DBpedia21M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia21M dataset.
DBpedia45M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia45M dataset.
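A small helper for splitting the ground-truth filenames into their fields, following the [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt pattern described above (it assumes the first entity name contains no hyphen, which holds for the published examples).

```python
def parse_ground_truth_filename(name: str) -> dict:
    """Split e.g. 'DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt' into fields."""
    stem = name.removesuffix(".txt")
    dataset, pair_id, strategy_id, entities = stem.split(".", 3)
    entity1, entity2 = entities.split("-", 1)  # assumes no hyphen in entity1
    return {
        "dataset": dataset,
        "entity_pair_id": int(pair_id),
        "search_strategy_id": int(strategy_id),
        "entities": (entity1, entity2),
    }

print(parse_ground_truth_filename("DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt"))
```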
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, and particularly the sub-titles and paragraphs, is kept in these datasets.
Wines: The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon and Pinot Meunier - Chardonnay.
Movies: The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist and Lion King - The Jungle Book.
Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are Grand Theft Auto - Mafia and Burnout Paradise - Forza Horizon 3.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset encompasses a comprehensive collection of over 800,000 URLs, meticulously curated to provide a diverse representation of online domains. Within this extensive corpus, approximately 52% of the domains are identified as legitimate, reflective of established and trustworthy entities within the digital landscape. Conversely, the remaining 47% of domains are categorized as phishing domains, indicative of potential threats and malicious activities.
Structured with precision, the dataset comprises two key columns: "url" and "status". The "url" column serves as the primary identifier, housing the uniform resource locator (URL) for each respective domain. Meanwhile, the "status" column employs binary encoding, with values represented as 0 and 1. Herein lies a crucial distinction: a value of 0 designates domains flagged as phishing, signaling a potential risk to users, while a value of 1 signifies domains deemed legitimate, offering assurance and credibility. Additionally, of paramount importance is the careful balance maintained between these two categories. With an almost equal distribution of instances across phishing and legitimate domains, this dataset mitigates the risk of class imbalance, ensuring robustness and reliability in subsequent analyses and model development. This deliberate approach fosters a more equitable and representative dataset, empowering researchers and practitioners in their endeavors to understand, combat, and mitigate online threats.
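As a sketch of how the two columns can be consumed, the snippet below loads the table and fits a simple character n-gram baseline; the CSV filename is hypothetical, while the column names and label encoding follow the description above.

```python
# "urls.csv" is a hypothetical filename; "url"/"status" and the 0 = phishing,
# 1 = legitimate encoding come from the description above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("urls.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["url"], df["status"], test_size=0.2, random_state=0, stratify=df["status"]
)

# Character n-grams are a common, tokenizer-free representation for URLs.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print("accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```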
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
These datasets are used for evaluating the process mining-based goal recognition system proposed in the paper "Fast and Accurate Data-Driven Goal Recognition Using Process Mining Techniques." The datasets include a running example, an evaluation dataset for synthetic domains, and real-world business logs.
running_example.tar.bz contains the traces shown in figure 2 of the paper for learning six skill models toward six goal candidates, and the three walks shown in figure 1.a.
synthetic_domains.tar.bz2 is the dataset for evaluating the GR system in synthetic domains (IPC domains). There are two types of traces used for learning skill models: traces generated by the top-k planner and traces generated by the diverse planner. Please extract the archived domains located in topk/ and diverse/. In each domain, the sub-folder problems/ contains the dataset for learning skill models, and the sub-folder test/ contains the traces (plans) for testing the GR performance. There are five levels of observations: 10%, 30%, 50%, 70%, and 100%. For each level of observation, there are multiple problem instances; the instance ID starts from 0. A problem instance contains the synthetic domain model (PDDL files), training traces (in train/), and an observation for testing (obs.dat). The top-k and diverse planners for generating traces can be accessed here. The original PDDL models of the problem instances for the 15 IPC domains mentioned in the paper are available here.
business_logs.tar.bz is the dataset for evaluating the GR system in real-world domains. There are two types of problem instances: one with only two goal candidates (yes or no), referred to as "binary," and the other containing multiple goal candidates, termed "multiple." Please extract the archived files located in the directories binary/ and multiple/. The traces for learning the skill models can be found in XES files, and the traces (plans) for testing can be found in the directory goal*/.
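A minimal sketch of unpacking the archives and locating the test observations; the output directory and the search for obs.dat files are assumptions based on the layout described above.

```python
# Extract the bzip2-compressed archives and list a few observation files.
import tarfile
from pathlib import Path

archives = ["running_example.tar.bz", "synthetic_domains.tar.bz2", "business_logs.tar.bz"]
for archive in archives:
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall("data")

for obs in sorted(Path("data").rglob("obs.dat"))[:5]:
    print(obs)  # one observation per problem instance
```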
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
IMAD-DS is a dataset developed for multi-rate, multi-sensor anomaly detection (AD) in industrial environments that considers varying operational and environmental conditions, known as domain shifts.
Dataset Overview:
This dataset includes data from two scaled industrial machines: a robotic arm and a brushless motor.
It includes both normal and abnormal data recorded under various operating conditions to account for domain shifts; these shifts are categorized into operational and environmental shifts, described further below.
Robotic Arm: The robotic arm is a scaled version of a robotic arm used to move silicon wafers in a factory. Anomalies are created by removing bolts at the nodes of the arm, resulting in an imbalance in the machine.
Brushless Motor: The brushless motor is a scaled representation of an industrial brushless motor. Two anomalies are introduced: first, a magnet is moved closer to the motor load, causing oscillations by interacting with two symmetrical magnets on the load; second, a belt that rotates in unison with the motor shaft is tightened, creating mechanical stress.
The following domain shifts are included in the dataset:
Operational Domain Shifts: Variations caused by changes in machine conditions (e.g., load changes for the robotic arm and speed changes for the brushless motor).
Environmental Domain Shifts: Variations due to changes in background noise levels.
Combinations of operating and environmental conditions divide each machine's dataset into two subsets: the source domain and the target domain. The source domain has a large number of training examples, whereas the target domain has limited training data. This discrepancy highlights a common issue in industry, where sufficient training data is often unavailable for the target domain because machine data is collected under controlled environments that do not fully represent the deployment environments.
Data Collection and Processing:
Data is collected using the STEVAL-STWINBX1 IoT Sensor Industrial Node. The sensors used to record the dataset are the following:
· Analog Microphone (16 kHz)
· 3-axis Accelerometer (6.7 kHz)
· 3-axis Gyroscope (6.7 kHz)
Recordings are conducted in an anechoic chamber to control acoustic conditions precisely.
Data Format:
Files are already divided into train and test sets. Inside each folder, each sensor's data is stored in a separate '.parquet' file.
Sensor files related to the same segment of machine data share a unique ID. The mapping of each machine data segment to the sensor files is given in .csv files inside the train and test folders. Those .csv files also contain metadata denoting the operational and environmental conditions of a specific segment.
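A sketch of joining one sensor's recordings with the segment metadata; the file names and the shared ID column are hypothetical, since the description above only states that each sensor has its own .parquet file and that .csv files map segment IDs to operational and environmental conditions.

```python
# Hypothetical file names ("train/microphone.parquet", "train/metadata.csv")
# and ID column ("segment_id").
import pandas as pd

microphone = pd.read_parquet("train/microphone.parquet")
metadata = pd.read_csv("train/metadata.csv")

# Attach the operational/environmental conditions to each recorded segment.
segments = microphone.merge(metadata, on="segment_id", how="left")
print(segments.head())
```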
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
A special dataset that contains metadata for all the published datasets. Dataset profile fields conform to the Dublin Core standard.
Other
You can download metadata for individual datasets, via the links provided in descriptions.
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/