14 datasets found
  1. wiki-balance-synthetic-qrels

    • huggingface.co
    Updated Jun 4, 2024
    Cite
    Social And Language Technology Lab (2024). wiki-balance-synthetic-qrels [Dataset]. https://huggingface.co/datasets/SALT-NLP/wiki-balance-synthetic-qrels
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Social And Language Technology Lab
    Description

    SALT-NLP/wiki-balance-synthetic-qrels dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. wiki-balance-synthetic

    • huggingface.co
    Updated Jun 4, 2024
    Cite
    Social And Language Technology Lab (2024). wiki-balance-synthetic [Dataset]. https://huggingface.co/datasets/SALT-NLP/wiki-balance-synthetic
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Social And Language Technology Lab
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    SALT-NLP/wiki-balance-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Synthetic Suicide Prevention Dataset with SDoH

    • catalog.data.gov
    • datahub.va.gov
    • +2more
    Updated Jun 2, 2025
    Cite
    Department of Veterans Affairs (2025). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://catalog.data.gov/dataset/synthetic-suicide-prevention-dataset-with-sdoh
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Description

    The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.

  4. BraVL

    • figshare.com
    zip
    Updated May 5, 2023
    Cite
    xxx xxx (2023). BraVL [Dataset]. http://doi.org/10.6084/m9.figshare.17024591.v3
    Available download formats: zip
    Dataset updated
    May 5, 2023
    Dataset provided by
    figshare
    Authors
    xxx xxx
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the official trimodal datasets for the paper "Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features" (https://ieeexplore.ieee.org/document/10089190). Code is available at https://github.com/ChangdeDu/BraVL

  5. wiki-sim

    • huggingface.co
    Cite
    Lee Miller, wiki-sim [Dataset]. https://huggingface.co/datasets/dleemiller/wiki-sim
    Authors
    Lee Miller
    License

    GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/

    Description

    Wiki Sim

    Overview

    This new semi-synthetic dataset is derived from wikimedia/wikipedia. Each row contains 1-3 reference sentences extracted from the original dataset. For each reference sentence, we use an optimized DSPy program to generate 4 similar sentences:

    • Synonym (Replace words with synonyms to maintain the same meaning.)
    • Paraphrase (Rephrase the sentence using a different structure while keeping the same idea.)
    • Conceptual Overlap (Express a related concept…

    See the full description on the dataset page: https://huggingface.co/datasets/dleemiller/wiki-sim.

  6. English Wikipedia Quality Asssessment Dataset

    • figshare.com
    application/bzip2
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Morten Warncke-Wang (2023). English Wikipedia Quality Asssessment Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1375406.v2
    Available download formats: application/bzip2
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Morten Warncke-Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy.

    The 2017 dataset is the preferred dataset to use, contains 32,460 articles, and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference, and contains 30,272 articles gathered on 2015/02/05.

    The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
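
    A 90%/10% stratified random split of the kind described above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the authors' actual procedure; the function name `stratified_split`, the toy records, and the class labels are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, test_fraction=0.1, seed=0):
    """Split items into (train, test), drawing test_fraction from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in items:
        by_class[label_of(item)].append(item)

    train, test = [], []
    for label in sorted(by_class):
        group = list(by_class[label])
        rng.shuffle(group)  # random sampling within the class
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy data: 20 records for each of six quality classes.
classes = ("FA", "GA", "B", "C", "Start", "Stub")
articles = [(f"{cls}-{i}", cls) for cls in classes for i in range(20)]

train, test = stratified_split(articles, label_of=lambda a: a[1])
print(len(train), len(test))
```

    Splitting per class keeps the class proportions of the test set equal to those of the training set, which a plain random split does not guarantee for small classes.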

  7. ds_10000000_100000000.hdf5

    • figshare.com
    hdf
    Updated Aug 27, 2021
    Cite
    Ben Murray (2021). ds_10000000_100000000.hdf5 [Dataset]. http://doi.org/10.6084/m9.figshare.16413255.v1
    Available download formats: hdf
    Dataset updated
    Aug 27, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ben Murray
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A synthetic dataset generated to demonstrate the merge performance of ExeTera (https://github.com/KCL-BMEIS/ExeTera.git). It contains two tables, one with 10 million rows and one with 100 million rows. The primary key of the smaller table is a foreign key of the larger table. Please see the wikis hosted at https://github.com/KCL-BMEIS/ExeTeraEval/wiki and https://github.com/KCL-BMEIS/ExeTera/wiki for details of how to use ExeTera to load and view this data. Please see ExeTeraEval for details of how to regenerate this and similar datasets.

  8. vi-wiki-conversational-search

    • huggingface.co
    Updated May 18, 2025
    Cite
    vi-wiki-conversational-search [Dataset]. https://huggingface.co/datasets/trientp/vi-wiki-conversational-search
    Dataset updated
    May 18, 2025
    Authors
    Thái Phát Triển
    Description

    Dataset Card for Vi-Wiki-Conversational-Search

    The ViWiki-QR dataset is a Vietnamese collection of 16.7K synthetic conversations and 250 human-annotated conversations, supporting the task of query rewriting for conversational search.

    Dataset Details

    Dataset Description

    ViWiki-QR is a Vietnamese dataset designed for the task of query rewriting in conversational search. It contains two subsets: a large-scale synthetic training set and a smaller, manually… See the full description on the dataset page: https://huggingface.co/datasets/trientp/vi-wiki-conversational-search.

  9. WikiProjects Machine Readable Dataset

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Sumit Asthana; Aaron Halfaker (2023). WikiProjects Machine Readable Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.5503819.v3
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Sumit Asthana; Aaron Halfaker
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Machine-readable format of the WikiProjects listed at https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory. The dataset is generated using the code at https://github.com/wiki-ai/drafttopic/. The dataset is modeled as a nested tree structure after the original hierarchical mappings on the WikiProjects home page and its child pages.

    • Each non-leaf entry represents a sub-category with a name and some associated information, such as the level in the page it was parsed at and the root URL of the page it was parsed from.
    • Each non-leaf node has a mandatory key "topics" which leads to further sub-categories within it.
    • Each leaf node is a WikiProject entry, with the actual WikiProject name and its active status.
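
    The nested tree layout described above can be walked with a short recursive function. The sketch below is only an illustration under assumptions: the toy tree and the key names "name", "active", and "root_url" are hypothetical, and only the "topics" key is taken from the description.

```python
def iter_wikiprojects(node, path=()):
    """Yield (path, leaf) pairs for every leaf entry below `node`.

    A child dict that has a "topics" key is treated as a sub-category;
    any other child dict is treated as a WikiProject leaf entry.
    """
    for key, child in node.items():
        if not isinstance(child, dict):
            continue  # skip scalar metadata such as a level or a URL
        if "topics" in child:
            yield from iter_wikiprojects(child["topics"], path + (key,))
        else:
            yield path + (key,), child


# Hypothetical toy tree; real entries and field names may differ.
tree = {
    "Culture": {
        "root_url": "https://example.org/culture",
        "topics": {
            "Arts": {
                "topics": {
                    "WikiProject Example A": {
                        "name": "WikiProject Example A",
                        "active": True,
                    },
                },
            },
            "WikiProject Example B": {
                "name": "WikiProject Example B",
                "active": False,
            },
        },
    },
}

leaves = list(iter_wikiprojects(tree))
for path, leaf in leaves:
    print(" > ".join(path), "| active:", leaf["active"])
```

    Carrying the path as a tuple keeps each leaf's position in the hierarchy available without mutating shared state during the recursion.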

  10. WIKI-REASONING-TR

    • huggingface.co
    Cite
    XAP Labs, WIKI-REASONING-TR [Dataset]. https://huggingface.co/datasets/xaplabs/WIKI-REASONING-TR
    Dataset authored and provided by
    XAP Labs
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Synthetic Reasoning Data Structures

    This dataset comprises synthetically generated structured data points designed for training large language models to enhance reasoning, explanation generation, and complex problem-solving capabilities. Each data point is generated based on provided background knowledge (Wiki pages) and specific instructional prompts targeting defined reasoning types.

    Dataset Details

    Dataset Description

    This dataset consists… See the full description on the dataset page: https://huggingface.co/datasets/xaplabs/WIKI-REASONING-TR.

  11. CMS Synthetic Patient Data OMOP

    • redivis.com
    application/jsonl +7
    Updated Aug 19, 2020
    Cite
    Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
    Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
    Dataset updated
    Aug 19, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Jan 1, 2008 - Dec 31, 2010
    Description

    Abstract

    This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

    Methodology

    This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, the purpose of the Common Data Model is to convert various distinctly-formatted datasets into a well-known, universal format with a set of standardized vocabularies. See the diagram below from the Observational Health Data Sciences and Informatics (OHDSI) webpage.

    [Image: Why-CDM.png, https://redivis.com/fileUploads/d1a95a4e-074a-44d1-92e5-9adfd2f4068a]

    Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHDSI OMOP site.

    Usage

    • For documentation regarding the source data format from the Centers for Medicare & Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF

    • For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page: https://github.com/OHDSI/ETL-CMS

    • For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page includes information primarily regarding the 6.0 version of the CDM and that this dataset works with the 5.2 version.

  12. synthetic-wiki

    • huggingface.co
    Updated Sep 21, 2005
    Cite
    Duc-Nhan Nguyen (2005). synthetic-wiki [Dataset]. https://huggingface.co/datasets/iambestfeed/synthetic-wiki
    Dataset updated
    Sep 21, 2005
    Authors
    Duc-Nhan Nguyen
    Description

    iambestfeed/synthetic-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. Proto

    • huggingface.co
    Updated Jun 18, 2025
    Cite
    Daniel Fox (2025). Proto [Dataset]. https://huggingface.co/datasets/FlameF0X/Proto
    Dataset updated
    Jun 18, 2025
    Authors
    Daniel Fox
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a prototype dataset. While it is not the typical synthetic dataset that I do, this is a scraped dataset from Wiki talk, GitHub, and Stack.

    Share the feedback with me

  14. pl-wiki-printsyn

    • huggingface.co
    Cite
    Patryk Niedźwiedziński, pl-wiki-printsyn [Dataset]. https://huggingface.co/datasets/pniedzwiedzinski/pl-wiki-printsyn
    Authors
    Patryk Niedźwiedziński
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    pniedzwiedzinski/pl-wiki-printsyn dataset hosted on Hugging Face and contributed by the HF Datasets community
