100+ datasets found
  1. error-detection-positives

    • huggingface.co
    Updated May 22, 2025
    + more versions
    Cite
    PARC (2025). error-detection-positives [Dataset]. https://huggingface.co/datasets/PARC-DATASETS/error-detection-positives
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    PARC
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    error-detection-positives

    This dataset is part of PARC (the Premise-Annotated Reasoning Collection) and contains mathematical reasoning problems with error annotations. It combines positive samples from multiple domains.

      Domain Breakdown
    

    gsm8k: 50 samples; math: 53 samples; metamathqa: 93 samples; orca_math: 96 samples

      Features
    

    Each example contains:

    data_source: The domain/source of the problem (gsm8k, math, metamathqa, orca_math); question: The… See the full description on the dataset page: https://huggingface.co/datasets/PARC-DATASETS/error-detection-positives.
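    A quick way to inspect the domain breakdown is the Hugging Face `datasets` library. A minimal sketch, in which the split name is an assumption (only the `data_source` field is documented above):

    ```python
    # Minimal sketch, not an official loader; split name "train" is assumed.
    from collections import Counter

    from datasets import load_dataset

    ds = load_dataset("PARC-DATASETS/error-detection-positives", split="train")
    # `data_source` is documented above; expect gsm8k/math/metamathqa/orca_math.
    print(Counter(example["data_source"] for example in ds))
    ```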

  2. Data from: LVMED: Dataset of Latvian text normalisation samples for the...

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Explore at:
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
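    Given the sentence-pair CSV layout described above, loading the data with pandas could look like the sketch below; the filename and the assumption of exactly two columns are hypothetical, so check the actual files:

    ```python
    # Minimal sketch: read abbreviated/normalised sentence pairs from the CSV.
    # "lvmed_train.csv" is a hypothetical filename; LVMED's real column names may differ.
    import pandas as pd

    pairs = pd.read_csv("lvmed_train.csv")
    for row in pairs.head(3).itertuples(index=False):
        print(row[0], "->", row[1])  # abbreviated sentence -> fully spelled-out sentence
    ```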

  3. domain-generation-dataset

    • huggingface.co
    Cite
    Abubakar Aliyu, domain-generation-dataset [Dataset]. https://huggingface.co/datasets/Maikobi/domain-generation-dataset
    Explore at:
    Authors
    Abubakar Aliyu
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Domain Generation Dataset

    This dataset contains 1,667 high-quality examples for fine-tuning language models to generate creative and relevant domain names for businesses, with built-in safety training and edge case handling.

      Dataset Creation
    

    Methodology: Hybrid approach combining Claude API generation with manual curation after encountering API reliability issues. Original Target: 2,000 examples → Final Result: 1,667 examples after deduplication and quality control.… See the full description on the dataset page: https://huggingface.co/datasets/Maikobi/domain-generation-dataset.

  4. diagnostic-dataset

    • huggingface.co
    Updated Apr 4, 2023
    Cite
    End-to-End Speech Benchmark (2023). diagnostic-dataset [Dataset]. https://huggingface.co/datasets/esb/diagnostic-dataset
    Explore at:
    Dataset updated
    Apr 4, 2023
    Dataset authored and provided by
    End-to-End Speech Benchmark
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As a part of ESB benchmark, we provide a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the… See the full description on the dataset page: https://huggingface.co/datasets/esb/diagnostic-dataset.
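    As a Hugging Face dataset it can be streamed with the `datasets` library. A minimal sketch, in which both the configuration name and the split name are assumptions (the dataset groups samples per ESB validation set, so consult the dataset page for the real names):

    ```python
    # Minimal sketch; "librispeech" as configuration and "clean" as split are
    # assumptions -- check https://huggingface.co/datasets/esb/diagnostic-dataset.
    from datasets import load_dataset

    diagnostic = load_dataset("esb/diagnostic-dataset", "librispeech",
                              split="clean", streaming=True)
    print(next(iter(diagnostic)))  # one audio-transcription sample
    ```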

  5. Data from: LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 17, 2024
    + more versions
    Cite
    Wang Junjue; Zheng Zhuo; Ma Ailong; Lu Xiaoyan; Zhong Yanfei (2024). LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation [Dataset]. http://doi.org/10.5281/zenodo.5706578
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Wang Junjue; Zheng Zhuo; Ma Ailong; Lu Xiaoyan; Zhong Yanfei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The benchmark code is available at: https://github.com/Junjue-Wang/LoveDA

    Highlights:

    1. 5987 high spatial resolution (0.3 m) remote sensing images from Nanjing, Changzhou, and Wuhan
    2. Focus on different geographical environments between Urban and Rural
    3. Advance both semantic segmentation and domain adaptation tasks
    4. Three considerable challenges: multi-scale objects, complex background samples, and inconsistent class distributions

    Reference:

    @inproceedings{wang2021loveda,
     title={Love{DA}: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation},
     author={Junjue Wang and Zhuo Zheng and Ailong Ma and Xiaoyan Lu and Yanfei Zhong},
     booktitle={Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
     editor = {J. Vanschoren and S. Yeung},
     year={2021},
     volume = {1},
     pages = {},
     url={https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/4e732ced3463d06de0ca9a15b6153677-Paper-round2.pdf}
    }

    License:

    The owners of the data and of the copyright on the data are RSIDEA, Wuhan University. Use of the Google Earth images must respect the "Google Earth" terms of use. All images and their associated annotations in LoveDA can be used for academic purposes only, but any commercial use is prohibited. (CC BY-NC-SA 4.0)

  6. CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 28, 2022
    Cite
    Bernd Gruner; Thomas Heinze; Clemens-Alexander Brust (2022). CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type Inference Systems [Dataset]. http://doi.org/10.5281/zenodo.5747024
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Bernd Gruner; Thomas Heinze; Clemens-Alexander Brust
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Python repositories mined from GitHub on January 20, 2021. It allows a cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, containing only projects from the web domain or the scientific calculation domain, respectively. To that end, we searched for projects with a dependency on either Flask (web) or NumPy (scientific calculation). Furthermore, only projects with a dependency on mypy were considered, because this should ensure that at least parts of the projects have type annotations; these can later be used as ground truth. Further details about the dataset will be described in an upcoming paper; as soon as it is published, it will be linked here.
    The dataset consists of two files, one for each sub-dataset. The web domain dataset contains 3,129 repositories and the scientific calculation domain dataset contains 4,783 repositories. The files have two columns: the URL of the GitHub repository and the commit hash used. Thus, it is possible to download the dataset using shell or Python scripts; for example, the pipeline provided by ManyTypes4Py can be used.
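    A download script along those lines could look like the following minimal sketch; the CSV filename is hypothetical, while the two-column layout (URL, commit hash) is as documented above:

    ```python
    # Minimal sketch: clone each repository and pin it to the recorded commit.
    # "web_domain.csv" is a hypothetical name for one of the two sub-dataset files.
    import csv
    import subprocess
    from pathlib import Path

    with open("web_domain.csv", newline="") as f:
        for url, commit in csv.reader(f):
            target = Path("repos") / Path(url).stem
            if not target.exists():
                subprocess.run(["git", "clone", url, str(target)], check=True)
            subprocess.run(["git", "checkout", commit.strip()], cwd=target, check=True)
    ```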
    If repositories do not exist anymore or are private, you can contact us via the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.

  7. The Human Know-How Dataset

    • dtechtive.com
    • find.data.gov.scot
    pdf, zip
    Updated Apr 29, 2016
    Cite
    (2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
    Explore at:
    Available download formats: pdf(0.0582 MB), zip(19.67 MB), zip(0.0298 MB), zip(9.433 MB), zip(13.06 MB), zip(0.2837 MB), zip(5.372 MB), zip(69.8 MB), zip(20.43 MB), zip(5.769 MB), zip(14.86 MB), zip(19.78 MB), zip(43.28 MB), zip(62.92 MB), zip(92.88 MB), zip(90.08 MB)
    Dataset updated
    Apr 29, 2016
    Description

    The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and their decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links).

    For more information see:
    * The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
    * The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions

    * Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
    * Data representation is based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
    * Data model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements.

    Statistics:
    * 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    * 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    * 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs).
    * 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs).
    * 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links).
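    Since the data follows the PROHOW vocabulary, a generic RDF toolkit is enough for a first look. A minimal sketch with rdflib, in which the filename and the Turtle serialization are assumptions about the distribution files:

    ```python
    # Minimal sketch: count the most frequent predicates in one dump with rdflib.
    # The filename and format="turtle" are assumptions; adjust to the actual files.
    from collections import Counter

    from rdflib import Graph

    g = Graph()
    g.parse("9of11_knowhow_wikihow", format="turtle")
    print(Counter(str(p) for _, p, _ in g).most_common(10))
    ```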

  8. RIGA+ Dataset for Unsupervised Domain Adaptation in Medical Image...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Aug 31, 2022
    Cite
    Shishuai Hu; Zehui Liao; Yong Xia (2022). RIGA+ Dataset for Unsupervised Domain Adaptation in Medical Image Segmentation [Dataset]. http://doi.org/10.5281/zenodo.6325549
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 31, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Shishuai Hu; Zehui Liao; Yong Xia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Different from previous combined multi-domain datasets for unsupervised domain adaptation (UDA) in medical image segmentation, this multi-domain fundus image dataset contains annotations made by the same group of ophthalmologists, so the annotator bias among the different sub-datasets is mitigated. This dataset can therefore provide a relatively fair benchmark for evaluating UDA methods in fundus image segmentation.

    This dataset is based on the RIGA [1] and MESSIDOR [2] datasets. We appreciate the efforts of their authors.

    The six duplicated cases in the RIGA dataset are filtered out according to the Errata. We also remove the duplicated cases that exist in both the RIGA dataset and the MESSIDOR dataset by hash value matching.

    Details of the RIGA+ dataset
    Domain | Dataset        | Labeled Samples (Train+Test) | Unlabeled Samples
    Source | BinRushed      | 195 (195+0)                  | 0
    Source | Magrabia       | 95 (95+0)                    | 0
    Target | MESSIDOR-BASE1 | 173 (138+35)                 | 227
    Target | MESSIDOR-BASE2 | 148 (118+30)                 | 238
    Target | MESSIDOR-BASE3 | 133 (106+27)                 | 252

    [1] Almazroa A, Alodhayb S, Osman E, et al. Retinal fundus images for glaucoma analysis: the RIGA dataset[C]//Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications. International Society for Optics and Photonics, 2018, 10579: 105790B.

    [2] Decencière E, Zhang X, Cazuguel G, et al. Feedback on a publicly distributed image database: the Messidor database[J]. Image Analysis & Stereology, 2014, 33(3): 231-234.

    If you find this dataset useful for your research, please consider citing the paper as follows:

    @inproceedings{hu2022domain,
     title={Domain Specific Convolution and High Frequency Reconstruction based Unsupervised Domain Adaptation for Medical Image Segmentation},
     author={Shishuai Hu and Zehui Liao and Yong Xia},
     booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
     year={2022},
     organization={Springer}
    }

  9. Example Dataset

    • universe.roboflow.com
    zip
    Updated Oct 6, 2023
    + more versions
    Cite
    BITS PILANI (2023). Example Dataset [Dataset]. https://universe.roboflow.com/bits-pilani-es4ip/example-jvsci/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 6, 2023
    Dataset authored and provided by
    BITS PILANI
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Driver Bounding Boxes
    Description

    Example

    ## Overview
    
    Example is a dataset for object detection tasks - it contains Driver annotations for 25,809 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC0 1.0 Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
  10. Data Preprocessing Dataset

    • kaggle.com
    Updated Apr 6, 2023
    Cite
    Iqman Singh Bhatia (2023). Data Preprocessing Dataset [Dataset]. https://www.kaggle.com/datasets/iqmansingh/data-preprocessing-dataset/discussion?sort=undefined
    Explore at:
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Iqman Singh Bhatia
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Iqman Singh Bhatia

    Released under CC0: Public Domain


  11. Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 29, 2023
    Cite
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li (2023). Freebase Datasets for Robust Evaluation of Knowledge Graph Link Prediction Models [Dataset]. http://doi.org/10.5281/zenodo.7909511
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Nasim Shirvani Mahdavi; Farahnaz Akrami; Mohammed Samiul Saeef; Xiao Shi; Chengkai Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Freebase is amongst the largest public cross-domain knowledge graphs. It possesses three main data modeling idiosyncrasies. It has a strong type system; its properties are purposefully represented in reverse pairs; and it uses mediator objects to represent multiary relationships. These design choices are important in modeling the real-world. But they also pose nontrivial challenges in research of embedding models for knowledge graph completion, especially when models are developed and evaluated agnostically of these idiosyncrasies. We make available several variants of the Freebase dataset by inclusion and exclusion of these data modeling idiosyncrasies. This is the first-ever publicly available full-scale Freebase dataset that has gone through proper preparation.

    Dataset Details

    The dataset consists of the four variants of Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:

    • Subject matter triples file
      • fb+/-CVT+/-REV: one folder for each variant. In each folder there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt. Subject matter triples are the triples belonging to subject matter domains, i.e., domains describing real-world facts.
        • Example of a row in train.txt, valid.txt, and test.txt:
          • 2, 192, 0
        • Example of a row in entity2id.txt:
          • /g/112yfy2xr, 2
        • Example of a row in relation2id.txt:
          • /music/album/release_type, 192
        • Explanation
          • "/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to these objects (decoded programmatically in the sketch after this list).
    • Type system file
      • freebase_endtypes: Each row maps an edge type to its required subject type and object type.
        • Example
          • 92, 47178872, 90
        • Explanation
          • "92" and "90" are the type id of the subject and object which has the relationship id "47178872".
    • Metadata files
      • object_types: Each row maps the MID of a Freebase object to a type it belongs to.
        • Example
          • /g/11b41c22g, /type/object/type, /people/person
        • Explanation
          • The entity with MID "/g/11b41c22g" has a type "/people/person"
      • object_names: Each row maps the MID of a Freebase object to its textual label.
        • Example
          • /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
        • Explanation
          • The entity with MID "/g/11b78qtr5m" has name "Viroliano Tries Jazz" in English.
      • object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
        • Example
          • /m/05v3y9r, /type/object/id, "/music/live_album/concert"
        • Explanation
          • The entity with MID "/m/05v3y9r" can be interpreted by human as a music concert live album.
      • domains_id_label: Each row maps the MID of a Freebase domain to its label.
        • Example
          • /m/05v4pmy, geology, 77
        • Explanation
          • The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
      • types_id_label: Each row maps the MID of a Freebase type to its label.
        • Example
          • /m/01xljxh, /government/political_party, 147
        • Explanation
          • The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
      • entities_id_label: Each row maps the MID of a Freebase entity to its label.
        • Example
          • /g/11b78qtr5m, Viroliano Tries Jazz, 2234
        • Explanation
          • The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
      • properties_id_label: Each row maps the MID of a Freebase property to its label.
        • Example
          • /m/010h8tp2, /comedy/comedy_group/members, 47178867
        • Explanation
          • The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has label "/comedy/comedy_group/members" and id "47178867" in our dataset.
      • uri_original2simplified and uri_simplified2original: The mapping between original URIs and simplified URIs, and between simplified URIs and original URIs, respectively.
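    To make the ID scheme concrete, here is a minimal sketch that decodes one integer triple back into MIDs and a relation label; the variant folder name and the exact separators are assumptions based on the examples above:

    ```python
    # Minimal sketch: resolve "2, 192, 0" (head, relation, tail) via the ID maps.
    # "fb-CVT-REV" is one assumed variant folder; map rows look like "/g/112yfy2xr, 2".
    def load_id_map(path):
        id_to_name = {}
        with open(path) as f:
            for line in f:
                name, idx = (part.strip() for part in line.rsplit(",", 1))
                id_to_name[int(idx)] = name
        return id_to_name

    entities = load_id_map("fb-CVT-REV/entity2id.txt")
    relations = load_id_map("fb-CVT-REV/relation2id.txt")

    with open("fb-CVT-REV/train.txt") as f:
        head, rel, tail = (int(x) for x in next(f).split(","))
    print(entities[head], relations[rel], entities[tail])
    ```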

  12. PDMX

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.

  13. Data from: medical-domain

    • huggingface.co
    • opendatalab.com
    Updated Dec 7, 2022
    + more versions
    Cite
    Argilla (2022). medical-domain [Dataset]. https://huggingface.co/datasets/argilla/medical-domain
    Explore at:
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2022
    Dataset authored and provided by
    Argilla
    Description

    Dataset Card for "medical-domain"

      Dataset Summary
    

    Medical transcription data scraped from mtsamples.com. Medical data is extremely hard to find due to HIPAA privacy regulations. This dataset offers a solution by providing sample medical transcriptions for various medical specialties.

      Languages
    

    English

      Citation Information
    

    Acknowledgements Medical transcription data scraped from mtsamples.com… See the full description on the dataset page: https://huggingface.co/datasets/argilla/medical-domain.

  14. Data from: pacs

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Flower Labs (2024). pacs [Dataset]. https://huggingface.co/datasets/flwrlabs/pacs
    Explore at:
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2024
    Dataset provided by
    Flower Labs GmbH
    Authors
    Flower Labs
    License

    Unknown: https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for PACS

    PACS is an image dataset for domain generalization. It consists of four domains, namely Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images), and Sketch (3,929 images). Each domain contains seven categories (labels): Dog, Elephant, Giraffe, Guitar, Horse, House, and Person. The total number of samples is 9,991.

      Dataset Details
    

    PACS DG dataset is created by intersecting the classes found in Caltech256 (Photo), Sketchy (Photo, Sketch)… See the full description on the dataset page: https://huggingface.co/datasets/flwrlabs/pacs.
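    A minimal sketch for loading PACS with the `datasets` library; the split name and the `domain` field name are assumptions, so check the dataset page:

    ```python
    # Minimal sketch; "train" split and "domain" field name are assumptions.
    from collections import Counter

    from datasets import load_dataset

    pacs = load_dataset("flwrlabs/pacs", split="train")
    print(Counter(example["domain"] for example in pacs))  # per-domain image counts
    ```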

  15. Ground Truth for Entity Relatedness Problem over DBpedia datasets

    • figshare.com
    zip
    Updated Aug 17, 2021
    Cite
    Javier Guillot Jiménez (2021). Ground Truth for Entity Relatedness Problem over DBpedia datasets [Dataset]. http://doi.org/10.6084/m9.figshare.15181086.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 17, 2021
    Dataset provided by
    figshare
    Authors
    Javier Guillot Jiménez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. More precisely, this problem can be defined as: "Given an RDF graph G and a pair of entities a and b, represented in G, compute the paths in G from a to b that best describe the connectivity between them."

    This dataset supports the evaluation of approaches that address the entity relatedness problem. It contains a total of 240 ranked lists with 50 relationship paths each between entity pairs in two familiar domains, music and movies, over two subsets of DBpedia that we call DBpedia21M and DBpedia45M. Specifically, we extracted data from the following two publicly available subsets of the English DBpedia corpus to form our two knowledge bases:

    1. mappingbased-objects: https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2021.03.01/mappingbased-objects_lang=en.ttl.bz2
    2. infobox-properties: https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2021.03.01/infobox-properties_lang=en.ttl.bz2

    DBpedia21M contains the statements in the mappingbased-objects dataset, and DBpedia45M contains the union of the statements in mappingbased-objects and infobox-properties. In both cases, we exclude statements involving literals or blank nodes.

    For each dataset (DBpedia21M and DBpedia45M), the ground truth contains 120 ranked lists with 50 relationship paths each. Each list corresponds to the most relevant paths between one of the 20 entity pairs, 10 from the music domain and 10 from the movie domain, found using different path search strategies.

    A path search strategy consists of an entity similarity measure and a path ranking measure. The ground truth was created using the following 6 strategies:

    1. Jaccard Index & Predicate Frequency Inverse Triple Frequency (PF-ITF)
    2. Jaccard Index & Exclusivity-based Relatedness (EBR)
    3. Jaccard Index & Pointwise Mutual Information (PMI)
    4. Wikipedia Link-based Measure (WLM) & PF-ITF
    5. WLM & EBR
    6. WLM & PMI

    The filename of a file that contains the ranked list of 50 relationship paths between a pair of entities has the following format:

    [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt

    Example 1: DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt
    Example 2: DBpedia45M.27.4.Paul_Newman-Joanne_Woodward.txt

    The file in Example 1 contains the top-50 most relevant paths between Michael Jackson and Whitney Houston in DBpedia21M using search strategy number 2 (Jaccard Index & EBR). The file in Example 2 contains the top-50 most relevant paths between Paul Newman and Joanne Woodward in DBpedia45M using search strategy number 4 (WLM & PF-ITF).

    The data is split into two files, one for each dataset, compressed in .zip format:
    * DBpedia21M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia21M dataset.
    * DBpedia45M.GT.zip: contains 180 .txt files representing the ranked lists of relationship paths between entity pairs in the DBpedia45M dataset.
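    The filename convention above is easy to parse programmatically; a minimal sketch, assuming the "-" pair separator never appears inside an entity name (names use underscores):

    ```python
    # Minimal sketch: unpack [Dataset].[EntityPairID].[SearchStrategyID].[Entity1-Entity2].txt
    name = "DBpedia21M.1.2.Michael_Jackson-Whitney_Houston.txt"
    dataset, pair_id, strategy_id, pair = name.removesuffix(".txt").split(".", 3)
    entity1, entity2 = pair.split("-", 1)  # assumes entity names use "_", not "-"
    print(dataset, pair_id, strategy_id, entity1, entity2)
    ```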

  16. Long document similarity datasets, Wikipedia excerptions for movies, video...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    • +1more
    csv
    Updated Apr 6, 2024
    + more versions
    Cite
    (2024). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7843
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 6, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, particularly the sub-titles and paragraphs, is kept in these datasets.

    Wines: The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon and Pinot Meunier - Chardonnay.

    Movies: The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movie articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are Schindler's List - The Pianist and Lion King - The Jungle Book.

    Video games: The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are: Grand Theft Auto - Mafia, Burnout Paradise - Forza Horizon 3.

  17. Phishing and Legitimate URLS

    • kaggle.com
    Updated Sep 21, 2023
    Cite
    Hari sudhan411 (2023). Phishing and Legitimate URLS [Dataset]. https://www.kaggle.com/datasets/harisudhan411/phishing-and-legitimate-urls
    Explore at:
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Hari sudhan411
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset encompasses a comprehensive collection of over 800,000 URLs, meticulously curated to provide a diverse representation of online domains. Within this extensive corpus, approximately 52% of the domains are identified as legitimate, reflective of established and trustworthy entities within the digital landscape. Conversely, the remaining 47% of domains are categorized as phishing domains, indicative of potential threats and malicious activities.

    The dataset comprises two key columns: "url" and "status". The "url" column serves as the primary identifier, housing the uniform resource locator (URL) of each domain. The "status" column employs binary encoding: a value of 0 designates domains flagged as phishing, signaling a potential risk to users, while a value of 1 signifies domains deemed legitimate. Also of paramount importance is the careful balance maintained between these two categories: with an almost equal distribution of instances across phishing and legitimate domains, the dataset mitigates the risk of class imbalance, ensuring robustness and reliability in subsequent analyses and model development. This deliberate approach yields a more equitable and representative dataset, supporting researchers and practitioners in their efforts to understand, combat, and mitigate online threats.
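    With the two-column layout described above, the class balance can be checked in a couple of lines; a minimal sketch, in which the CSV filename is hypothetical:

    ```python
    # Minimal sketch: "status" is 0 for phishing and 1 for legitimate, per the description.
    import pandas as pd

    urls = pd.read_csv("phishing_and_legitimate_urls.csv")  # hypothetical filename
    print(urls["status"].value_counts(normalize=True))  # expect a roughly even split
    ```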

  18. Process Mining-Based Goal Recognition System Evaluation Dataset

    • figshare.unimelb.edu.au
    application/bzip2
    Updated Aug 11, 2023
    Cite
    Zihang Su (2023). Process Mining-Based Goal Recognition System Evaluation Dataset [Dataset]. http://doi.org/10.26188/21749570.v4
    Explore at:
    Available download formats: application/bzip2
    Dataset updated
    Aug 11, 2023
    Dataset provided by
    The University of Melbourne
    Authors
    Zihang Su
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    These datasets are used for evaluating the process mining-based goal recognition (GR) system proposed in the paper "Fast and Accurate Data-Driven Goal Recognition Using Process Mining Techniques." The datasets include a running example, an evaluation dataset for synthetic domains, and real-world business logs.

    running_example.tar.bz contains the traces shown in Figure 2 of the paper for learning six skill models toward six goal candidates, and the three walks shown in Figure 1.a.

    synthetic_domains.tar.bz2 is the dataset for evaluating the GR system in synthetic domains (IPC domains). There are two types of traces used for learning skill models: those generated by the top-k planner and those generated by the diverse planner. Please extract the archived domains located in topk/ and diverse/. In each domain, the sub-folder problems/ contains the dataset for learning skill models, and the sub-folder test/ contains the traces (plans) for testing GR performance. There are five levels of observations: 10%, 30%, 50%, 70%, and 100%. For each level of observation, there are multiple problem instances, with instance IDs starting from 0. A problem instance contains the synthetic domain model (PDDL files), training traces (in train/), and an observation for testing (obs.dat). The top-k and diverse planners for generating traces can be accessed here. The original PDDL models of the problem instances for the 15 IPC domains mentioned in the paper are available here.

    business_logs.tar.bz is the dataset for evaluating the GR system in real-world domains. There are two types of problem instances: one with only two goal candidates (yes or no), referred to as "binary," and the other containing multiple goal candidates, termed "multiple." Please extract the archived files located in the directories binary/ and multiple/. The traces for learning the skill models can be found in XES files, and the traces (plans) for testing can be found in the directory goal*/.
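    A minimal sketch for unpacking the three archives named above, assuming they are bzip2-compressed tarballs as the extensions suggest:

    ```python
    # Minimal sketch: extract each archive into a folder named after it.
    import tarfile

    for archive in ("running_example.tar.bz",
                    "synthetic_domains.tar.bz2",
                    "business_logs.tar.bz"):
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(path=archive.split(".tar")[0])
    ```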

  19. Data from: IMAD-DS: A Dataset for Industrial Multi-Sensor Anomaly Detection...

    • zenodo.org
    bin
    Updated Aug 28, 2024
    Cite
    Filippo Augusti; Davide Albertini; Kudret Esmer; Roberto Sannino; Alberto Bernardini (2024). IMAD-DS: A Dataset for Industrial Multi-Sensor Anomaly Detection Under Domain Shift Conditions [Dataset]. http://doi.org/10.5281/zenodo.12665499
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Filippo Augusti; Davide Albertini; Kudret Esmer; Roberto Sannino; Alberto Bernardini
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    IMAD-DS is a dataset developed for multi-rate, multi-sensor anomaly detection (AD) in industrial environments that considers varying operational and environmental conditions, known as domain shifts.

    Dataset Overview:

    This dataset includes data from two scaled industrial machines: a robotic arm and a brushless motor.

    It includes both normal and abnormal data recorded under various operating conditions to account for domain shifts. The two machines and the anomalies introduced on them are described below:

    Robotic Arm: The robotic arm is a scaled version of a robotic arm used to move silicon wafers in a factory. Anomalies are created by removing bolts at the nodes of the arm, resulting in an imbalance in the machine.
    Brushless Motor: The brushless motor is a scaled representation of an industrial brushless motor. Two anomalies are introduced: first, a magnet is moved closer to the motor load, causing oscillations by interacting with two symmetrical magnets on the load; second, a belt that rotates in unison with the motor shaft is tightened, creating mechanical stress.

    The following domain shifts are included in the dataset:

    Operational Domain Shifts: Variations caused by changes in machine conditions (e.g., load changes for the robotic arm and speed changes for the brushless motor).

    Environmental Domain Shifts: Variations due to changes in background noise levels.

    Combinations of operating and environmental conditions divide each machine's dataset into two subsets: the source domain and the target domain. The source domain has a large number of training examples. The target domain, instead, has limited training data. This discrepancy highlights a common issue in the industry where sufficient training data is often unavailable for the target domain, as machine data is collected under controlled environments that do not fully represent the deployment environments.

    Data Collection and Processing:

    Data is collected using the STEVAL-STWINBX1 IoT Sensor Industrial Node. The sensors used to record the dataset are the following:

    · Analog microphone (16 kHz)

    · 3-axis accelerometer (6.7 kHz)

    · 3-axis gyroscope (6.7 kHz)

    Recordings are conducted in an anechoic chamber to control acoustic conditions precisely.

    Data Format:
    Files are already divided into train and test sets. Inside each folder, each sensor's data is stored in a separate '.parquet' file.

    Sensor files related to the same segment of machine data share a unique ID. The mapping of each machine data segment to the sensor files is given in .csv files inside the train and test folders. Those .csv files also contain metadata denoting the operational and environmental conditions of a specific segment.
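    Putting the layout above together, reading one segment could look like the sketch below; every file and column name here is an assumption, since only the general structure (per-sensor parquet files keyed by segment ID, plus .csv metadata) is documented:

    ```python
    # Minimal sketch: look up one segment ID in the train metadata and load
    # its microphone parquet file. All names below are assumed, not documented.
    import pandas as pd

    meta = pd.read_csv("train/metadata.csv")            # hypothetical metadata file
    segment_id = meta.iloc[0]["segment_id"]             # hypothetical ID column
    mic = pd.read_parquet(f"train/microphone_{segment_id}.parquet")  # hypothetical layout
    print(mic.shape)
    ```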

  20. Open Data Portal Catalogue Metadata

    • ukpowernetworks.opendatasoft.com
    csv, excel, json
    Updated Sep 2, 2025
    Cite
    (2025). Open Data Portal Catalogue Metadata [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/domain-dataset0/
    Explore at:
    Available download formats: json, excel, csv
    Dataset updated
    Sep 2, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    A special dataset that contains metadata for all the published datasets. Dataset profile fields conform to the Dublin Core standard.

    You can download metadata for individual datasets via the links provided in their descriptions.

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
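    Since the portal runs on Opendatasoft, the catalogue can also be queried over its Explore API; a minimal sketch, in which the v2.1 endpoint path is an assumption (check the portal's API documentation):

    ```python
    # Minimal sketch: fetch a few catalogue records. The endpoint path follows
    # the Opendatasoft Explore API v2.1 convention and is an assumption here.
    import requests

    resp = requests.get(
        "https://ukpowernetworks.opendatasoft.com/api/explore/v2.1/"
        "catalog/datasets/domain-dataset0/records",
        params={"limit": 5},
        timeout=30,
    )
    resp.raise_for_status()
    for record in resp.json().get("results", []):
        print(record)
    ```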
