MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
error-detection-positives
This dataset is part of the PARC (Premise-Annotated Reasoning Collection) and contains mathematical reasoning problems with error annotations. This dataset combines positive samples from multiple domains.
Domain Breakdown
gsm8k: 50 samples
math: 53 samples
metamathqa: 93 samples
orca_math: 96 samples
Features
Each example contains:
data_source: The domain/source of the problem (gsm8k, math, metamathqa, orca_math) question: The… See the full description on the dataset page: https://huggingface.co/datasets/PARC-DATASETS/error-detection-positives.
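If it helps to see these fields in use, here is a minimal sketch of loading the dataset with the datasets library; the split name is an assumption taken from common practice and is not stated on the card above.

```python
# Minimal sketch: load the dataset from the Hugging Face Hub and count samples
# per domain. The split name ("train") is an assumption; only the repository id
# and the data_source/question fields are confirmed above.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("PARC-DATASETS/error-detection-positives", split="train")
print(Counter(example["data_source"] for example in ds))
# Expected per the breakdown above: gsm8k, math, metamathqa, orca_math
```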
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence with all abbreviations expanded into full word forms.
Training dataset: 64,665 sentence pairs
Validation dataset: 7,185 sentence pairs
Testing dataset: 7,984 sentence pairs
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
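As an illustration of the text-to-text setup, the following sketch reads the CSV splits into (abbreviated, normalized) pairs; the file names and the column names source/target are hypothetical, since the description above does not name them.

```python
# Hypothetical file and column names ("train.csv", "source", "target"); the
# description only states that each row pairs an abbreviated sentence with its
# fully spelled-out rewrite.
import pandas as pd

train = pd.read_csv("train.csv")             # 64,665 sentence pairs
validation = pd.read_csv("validation.csv")   # 7,185 sentence pairs
test = pd.read_csv("test.csv")               # 7,984 sentence pairs

pairs = list(zip(train["source"], train["target"]))
print(pairs[0])  # (sentence with abbreviations, normalized sentence)
```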
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Domain Generation Dataset
This dataset contains 1,667 high-quality examples for fine-tuning language models to generate creative and relevant domain names for businesses, with built-in safety training and edge case handling.
Dataset Creation
Methodology: Hybrid approach combining Claude API generation with manual curation after encountering API reliability issues. Original Target: 2,000 examples → Final Result: 1,667 examples after deduplication and quality control.… See the full description on the dataset page: https://huggingface.co/datasets/Maikobi/domain-generation-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As a part of ESB benchmark, we provide a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the… See the full description on the dataset page: https://huggingface.co/datasets/esb/diagnostic-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The benchmark code is available at: https://github.com/Junjue-Wang/LoveDA
Highlights:
Reference:
@inproceedings{wang2021loveda,
title={Love{DA}: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation},
author={Junjue Wang and Zhuo Zheng and Ailong Ma and Xiaoyan Lu and Yanfei Zhong},
booktitle={Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
editor = {J. Vanschoren and S. Yeung},
year={2021},
volume = {1},
pages = {},
url={https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/4e732ced3463d06de0ca9a15b6153677-Paper-round2.pdf}
}
License:
The owners of the data and of the copyright on the data are RSIDEA, Wuhan University. Use of the Google Earth images must respect the "Google Earth" terms of use. All images and their associated annotations in LoveDA can be used for academic purposes only, but any commercial use is prohibited. (CC BY-NC-SA 4.0)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Python repositories mined from GitHub on January 20, 2021. It allows a cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, each containing only projects from the web domain or the scientific calculation domain, respectively; to this end, we searched for projects with dependencies on either Flask or NumPy. Furthermore, only projects with a dependency on mypy were considered, because this should ensure that at least parts of the projects have type annotations, which can later be used as ground truth. Further details about the dataset will be described in an upcoming paper; as soon as it is published, it will be linked here.
The dataset consists of two files, one for each sub-dataset. The web domain dataset contains 3,129 repositories and the scientific calculation domain dataset contains 4,783 repositories. The files have two columns: the URL of the GitHub repository and the commit hash that was used. Thus, it is possible to download the dataset using shell or Python scripts; for example, the pipeline provided by ManyTypes4Py can be used, or a minimal script like the sketch shown below.
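The sketch below is one way to do this in Python; the CSV filename and column names are hypothetical, as the description only states that each row holds a repository URL and a commit hash.

```python
# Clone each repository and pin it to the listed commit. File and column names
# ("web_domain.csv", "url", "commit_hash") are hypothetical.
import subprocess
from pathlib import Path

import pandas as pd

repos = pd.read_csv("web_domain.csv")
out_dir = Path("repositories")
out_dir.mkdir(exist_ok=True)

for url, commit in zip(repos["url"], repos["commit_hash"]):
    target = out_dir / url.rstrip("/").split("/")[-1]
    if not target.exists():
        subprocess.run(["git", "clone", url, str(target)], check=True)
    subprocess.run(["git", "-C", str(target), "checkout", commit], check=True)
```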
If repositories no longer exist or have become private, you can contact us via the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see:
- The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
- The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

* Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
* Data representation based on the PROHOW vocabulary: http://w3id.org/prohow#. Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
* Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

Statistics:
* 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow); from Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow); from Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
* 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs).
* 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs).
* 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links).
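Since the data is plain RDF based on the PROHOW vocabulary, a small rdflib sketch can give a feel for its structure; the serialization format passed to parse() is an assumption, and the predicate listing is purely exploratory.

```python
# Parse one of the know-how dumps and list the most frequent PROHOW predicates.
# The file name comes from the quickstart note above; the "nt" format is an
# assumption about the serialization.
from collections import Counter

from rdflib import Graph

g = Graph()
g.parse("9of11_knowhow_wikihow", format="nt")

PROHOW = "http://w3id.org/prohow#"
counts = Counter(str(p) for _, p, _ in g if str(p).startswith(PROHOW))
for predicate, count in counts.most_common(10):
    print(predicate, count)
```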
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Unlike previous combined multi-domain datasets for unsupervised domain adaptation (UDA) in medical image segmentation, this multi-domain fundus image dataset contains annotations made by the same group of ophthalmologists. Hence, the annotator bias among the different datasets is mitigated, and the dataset provides a relatively fair benchmark for evaluating UDA methods in fundus image segmentation.
This dataset is based on the RIGA[1] dataset and the MESSIDOR[2] dataset. We appreciate the efforts devoted by the authors of [1] and [2].
The six duplicated cases in the RIGA dataset are filtered out according to the Errata. We also remove the duplicated cases that exist in both the RIGA dataset and the MESSIDOR dataset by hash value matching.
Domain | Dataset | Labeled Samples (Train+Test) | Unlabeled Samples
---|---|---|---
Source | BinRushed | 195 (195+0) | 0
Source | Magrabia | 95 (95+0) | 0
Target | MESSIDOR-BASE1 | 173 (138+35) | 227
Target | MESSIDOR-BASE2 | 148 (118+30) | 238
Target | MESSIDOR-BASE3 | 133 (106+27) | 252
[1] Almazroa A, Alodhayb S, Osman E, et al. Retinal fundus images for glaucoma analysis: the RIGA dataset. In: Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications. International Society for Optics and Photonics, 2018, 10579: 105790B.
[2] Decencière E, Zhang X, Cazuguel G, et al. Feedback on a publicly distributed image database: the Messidor database. Image Analysis & Stereology, 2014, 33(3): 231-234.
If you find this dataset useful for your research, please consider citing the paper as follows:
@inproceedings{hu2022domain,
title={Domain Specific Convolution and High Frequency Reconstruction based Unsupervised Domain Adaptation for Medical Image Segmentation},
author={Shishuai Hu and Zehui Liao and Yong Xia},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
year={2022},
organization={Springer}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Example is a dataset for object detection tasks - it contains Driver annotations for 25,809 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
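For reference, a hedged sketch of the usual roboflow Python workflow is shown below; the API key, workspace and project identifiers, version number, and export format are all placeholders, since none of them are given on this page.

```python
# Placeholder identifiers throughout; replace with your own Roboflow details.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
dataset = project.version(1).download("coco")  # export format is an assumption
print(dataset.location)  # local folder with images and annotations
```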
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Iqman Singh Bhatia
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies: it has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important for modeling the real world, but they also pose nontrivial challenges for research on embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first publicly available full-scale Freebase dataset that has gone through proper preparation.
Dataset Details
The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
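Because the scores are plain MusicXML, they can be opened with standard symbolic-music tooling; the sketch below uses music21 on a placeholder file path, which is not something the PDMX description itself prescribes.

```python
# Parse a single MusicXML score; the path is a placeholder.
from music21 import converter

score = converter.parse("example_score.musicxml")
print(score.metadata.title, len(score.parts))
```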
Dataset Card for "medical-domain"
Dataset Summary
Medical transcription data scraped from mtsamples.com. Medical data is extremely hard to find due to HIPAA privacy regulations. This dataset offers a solution by providing sample medical transcriptions for various medical specialties.
Languages
English
Citation Information
Acknowledgements
Medical transcription data scraped from mtsamples.com… See the full description on the dataset page: https://huggingface.co/datasets/argilla/medical-domain.
https://choosealicense.com/licenses/unknown/
Dataset Card for PACS
PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images), and Sketch (3,929 images). Each domain contains seven categories (labels): Dog, Elephant, Giraffe, Guitar, Horse, House, and Person. The total number of samples is 9,991.
Dataset Details
The PACS DG dataset is created by intersecting the classes found in Caltech256 (Photo), Sketchy (Photo, Sketch)… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/pacs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. More precisely, this problem can be defined as: “Given an RDF graph 'G' and a pair of entities 'a' and 'b', represented in 'G', compute the paths in 'G' from 'a' to 'b' that best describe the connectivity between them”.

This dataset supports the evaluation of approaches that address the entity relatedness problem and contains a total of 240 ranked lists with 50 relationship paths each between entity pairs in two familiar domains, music and movies, in two subsets of DBpedia that we call DBpedia21M and DBpedia45M. Specifically, we extracted data from the following two publicly available subsets of the English DBpedia corpus to form our two knowledge bases:
1. mappingbased-objects: https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2021.03.01/mappingbased-objects_lang=en.ttl.bz2
2. infobox-properties: https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2021.03.01/infobox-properties_lang=en.ttl.bz2

DBpedia21M contains the statements in the mappingbased-objects dataset, and DBpedia45M contains the union of the statements in mappingbased-objects and infobox-properties. In both cases, we exclude statements involving literals or blank nodes.

For each dataset (DBpedia21M and DBpedia45M), the ground truth contains 120 ranked lists with 50 relationship paths each. Each list corresponds to the most relevant paths between one of the 20 entity pairs, 10 pairs from the music domain and 10 from the movie domain, found using different path search strategies. A path search strategy consists of an entity similarity measure and a path ranking measure. The ground truth was created using the following 6 strategies:
1. Jaccard Index & Predicate Frequency Inverse Triple Frequency (PF-ITF)
2. Jaccard Index & Exclusivity-based Relatedness (EBR)
3. Jaccard Index & Pointwise Mutual Information (PMI)
4. Wikipedia Link-based Measure (WLM) & PF-ITF
5. WLM & EBR
6. WLM & PMI

The filename of a file that contains the ranked list of 50 relationship paths between a pair of entities has the following format: [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt
Example 1: DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt
Example 2: DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt
The file in Example 1 contains the top-50 most relevant paths between Michael Jackson and Whitney Houston in DBpedia21M using search strategy number 2 (Jaccard Index & EBR). The file in Example 2 contains the top-50 most relevant paths between Paul Newman and Joanne Woodward in DBpedia45M using search strategy number 4 (WLM & PF-ITF).

The data is split into two files, one for each dataset, compressed in .zip format:
DBpedia21M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia21M dataset.
DBpedia45M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia45M dataset.
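A small helper for splitting the ground-truth filenames into their fields, following the [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt pattern described above (it assumes the first entity name contains no hyphen, which holds for the published examples).

```python
def parse_ground_truth_filename(name: str) -> dict:
    """Split e.g. 'DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt' into fields."""
    stem = name.removesuffix(".txt")
    dataset, pair_id, strategy_id, entities = stem.split(".", 3)
    entity1, entity2 = entities.split("-", 1)  # assumes no hyphen in entity1
    return {
        "dataset": dataset,
        "entity_pair_id": int(pair_id),
        "search_strategy_id": int(strategy_id),
        "entities": (entity1, entity2),
    }

print(parse_ground_truth_filename("DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt"))
```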
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, and particularly the sub-titles and paragraphs, is kept in these datasets.
Wines: The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon and Pinot Meunier - Chardonnay.
Movies: The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist and Lion King - The Jungle Book.
Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are Grand Theft Auto - Mafia and Burnout Paradise - Forza Horizon 3.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset encompasses a comprehensive collection of over 800,000 URLs, meticulously curated to provide a diverse representation of online domains. Within this extensive corpus, approximately 52% of the domains are identified as legitimate, reflective of established and trustworthy entities within the digital landscape. Conversely, the remaining 47% of domains are categorized as phishing domains, indicative of potential threats and malicious activities.
Structured with precision, the dataset comprises two key columns: "url" and "status". The "url" column serves as the primary identifier, housing the uniform resource locator (URL) for each respective domain. Meanwhile, the "status" column employs binary encoding, with values represented as 0 and 1. Herein lies a crucial distinction: a value of 0 designates domains flagged as phishing, signaling a potential risk to users, while a value of 1 signifies domains deemed legitimate, offering assurance and credibility. Additionally, of paramount importance is the careful balance maintained between these two categories. With an almost equal distribution of instances across phishing and legitimate domains, this dataset mitigates the risk of class imbalance, ensuring robustness and reliability in subsequent analyses and model development. This deliberate approach fosters a more equitable and representative dataset, empowering researchers and practitioners in their endeavors to understand, combat, and mitigate online threats.
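As a sketch of how the two columns can be consumed, the snippet below loads the table and fits a simple character n-gram baseline; the CSV filename is hypothetical, while the column names and label encoding follow the description above.

```python
# "urls.csv" is a hypothetical filename; "url"/"status" and the 0 = phishing,
# 1 = legitimate encoding come from the description above.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("urls.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["url"], df["status"], test_size=0.2, random_state=0, stratify=df["status"]
)

# Character n-grams are a common, tokenizer-free representation for URLs.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print("accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```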
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
These datasets are used for evaluating the process mining-based goal recognition system proposed in the paper "Fast and Accurate Data-Driven Goal Recognition Using Process Mining Techniques." The datasets include a running example, an evaluation dataset for synthetic domains, and real-world business logs.
running_example.tar.bz contains the traces shown in figure 2 of the paper for learning six skill models toward six goal candidates, and the three walks shown in figure 1.a.
synthetic_domains.tar.bz2 is the dataset for evaluating the GR system in synthetic domains (IPC domains). There are two types of traces used for learning skill models: traces generated by the top-k planner and traces generated by the diverse planner. Please extract the archived domains located in topk/ and diverse/. In each domain, the sub-folder problems/ contains the dataset for learning skill models, and the sub-folder test/ contains the traces (plans) for testing the GR performance. There are five levels of observations: 10%, 30%, 50%, 70%, and 100%. For each level of observation, there are multiple problem instances; the instance ID starts from 0. A problem instance contains the synthetic domain model (PDDL files), training traces (in train/), and an observation for testing (obs.dat). The top-k and diverse planners for generating traces can be accessed here. The original PDDL models of the problem instances for the 15 IPC domains mentioned in the paper are available here.
business_logs.tar.bz is the dataset for evaluating the GR system in real-world domains. There are two types of problem instances: one with only two goal candidates (yes or no), referred to as "binary," and the other containing multiple goal candidates, termed "multiple." Please extract the archived files located in the directories binary/ and multiple/. The traces for learning the skill models can be found in XES files, and the traces (plans) for testing can be found in the directory goal*/.
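A minimal sketch of unpacking the archives and locating the test observations; the output directory and the search for obs.dat files are assumptions based on the layout described above.

```python
# Extract the bzip2-compressed archives and list a few observation files.
import tarfile
from pathlib import Path

archives = ["running_example.tar.bz", "synthetic_domains.tar.bz2", "business_logs.tar.bz"]
for archive in archives:
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall("data")

for obs in sorted(Path("data").rglob("obs.dat"))[:5]:
    print(obs)  # one observation per problem instance
```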
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
IMAD-DS is a dataset developed for multi-rate, multi-sensor anomaly detection (AD) in industrial environments that considers varying operational and environmental conditions, known as domain shifts.
Dataset Overview:
This dataset includes data from two scaled industrial machines: a robotic arm and a brushless motor.
It includes both normal and abnormal data recorded under various operating conditions to account for domain shifts; these shifts are categorized into operational and environmental shifts, described further below.
Robotic Arm: The robotic arm is a scaled version of a robotic arm used to move silicon wafers in a factory. Anomalies are created by removing bolts at the nodes of the arm, resulting in an imbalance in the machine.
Brushless Motor: The brushless motor is a scaled representation of an industrial brushless motor. Two anomalies are introduced: first, a magnet is moved closer to the motor load, causing oscillations by interacting with two symmetrical magnets on the load; second, a belt that rotates in unison with the motor shaft is tightened, creating mechanical stress.
The following domain shifts are included in the dataset:
Operational Domain Shifts: Variations caused by changes in machine conditions (e.g., load changes for the robotic arm and speed changes for the brushless motor).
Environmental Domain Shifts: Variations due to changes in background noise levels.
Combinations of operating and environmental conditions divide each machine's dataset into two subsets: the source domain and the target domain. The source domain has a large number of training examples, whereas the target domain has limited training data. This discrepancy highlights a common issue in industry, where sufficient training data is often unavailable for the target domain because machine data is collected under controlled environments that do not fully represent the deployment environments.
Data Collection and Processing:
Data is collected using the STEVAL-STWINBX1 IoT Sensor Industrial Node. The sensors used to record the dataset are the following:
· Analog Microphone (16 kHz)
· 3-axis Accelerometer (6.7 kHz)
· 3-axis Gyroscope (6.7 kHz)
Recordings are conducted in an anechoic chamber to control acoustic conditions precisely.
Data Format:
Files are already divided into train and test sets. Inside each folder, each sensor's data is stored in a separate '.parquet' file.
Sensor files related to the same segment of machine data share a unique ID. The mapping of each machine data segment to the sensor files is given in .csv files inside the train and test folders. Those .csv files also contain metadata denoting the operational and environmental conditions of a specific segment.
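A sketch of joining one sensor's recordings with the segment metadata; the file names and the shared ID column are hypothetical, since the description above only states that each sensor has its own .parquet file and that .csv files map segment IDs to operational and environmental conditions.

```python
# Hypothetical file names ("train/microphone.parquet", "train/metadata.csv")
# and ID column ("segment_id").
import pandas as pd

microphone = pd.read_parquet("train/microphone.parquet")
metadata = pd.read_csv("train/metadata.csv")

# Attach the operational/environmental conditions to each recorded segment.
segments = microphone.merge(metadata, on="segment_id", how="left")
print(segments.head())
```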
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
A special dataset that contains metadata for all the published datasets. Dataset profile fields conform to the Dublin Core standard.
Other
You can download metadata for individual datasets, via the links provided in descriptions.
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/