14 datasets found

wiki-balance-synthetic-qrels
huggingface.co
Updated Jun 4, 2024
Cite
Social And Language Technology Lab (2024). wiki-balance-synthetic-qrels [Dataset]. https://huggingface.co/datasets/SALT-NLP/wiki-balance-synthetic-qrels
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 4, 2024
Dataset authored and provided by
Social And Language Technology Lab
Description
SALT-NLP/wiki-balance-synthetic-qrels dataset hosted on Hugging Face and contributed by the HF Datasets community
wiki-balance-synthetic
huggingface.co
Updated Jun 4, 2024
Cite
Social And Language Technology Lab (2024). wiki-balance-synthetic [Dataset]. https://huggingface.co/datasets/SALT-NLP/wiki-balance-synthetic
Dataset updated
Jun 4, 2024
Dataset authored and provided by
Social And Language Technology Lab
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
SALT-NLP/wiki-balance-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community
Synthetic Suicide Prevention Dataset with SDoH
catalog.data.gov
datahub.va.gov
Updated Jun 2, 2025
Cite
Department of Veterans Affairs (2025). Synthetic Suicide Prevention Dataset with SDoH [Dataset]. https://catalog.data.gov/dataset/synthetic-suicide-prevention-dataset-with-sdoh
Dataset updated
Jun 2, 2025
Dataset provided by
United States Department of Veterans Affairs http://va.gov/
Description
The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.
BraVL
figshare.com
zip
Updated May 5, 2023
Cite
xxx xxx (2023). BraVL [Dataset]. http://doi.org/10.6084/m9.figshare.17024591.v3
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.17024591.v3
Dataset updated
May 5, 2023
Dataset provided by
figshare
Authors
xxx xxx
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the official trimodal dataset for the paper "Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features" (https://ieeexplore.ieee.org/document/10089190). Code is available at https://github.com/ChangdeDu/BraVL
wiki-sim
huggingface.co
Cite
Lee Miller, wiki-sim [Dataset]. https://huggingface.co/datasets/dleemiller/wiki-sim
Explore at:
Croissant
Authors
Lee Miller
License
https://choosealicense.com/licenses/gfdl/
Description
Wiki Sim

Overview

This new semi-synthetic dataset is derived from wikimedia/wikipedia. Each row contains 1-3 reference sentences extracted from the original dataset. For each reference sentence, we use an optimized DSPy program to generate 4 similar sentences:

Synonym (Replace words with synonyms to maintain the same meaning.)
Paraphrase (Rephrase the sentence using a different structure while keeping the same idea.)
Conceptual Overlap (Express a related concept…
See the full description on the dataset page: https://huggingface.co/datasets/dleemiller/wiki-sim.
English Wikipedia Quality Asssessment Dataset
figshare.com
application/bzip2
Updated May 31, 2023
Cite
Morten Warncke-Wang (2023). English Wikipedia Quality Asssessment Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1375406.v2
Available download formats: application/bzip2
Unique identifier
https://doi.org/10.6084/m9.figshare.1375406.v2
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Morten Warncke-Wang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy.

The 2017 dataset is the preferred dataset to use, contains 32,460 articles, and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference, and contains 30,272 articles gathered on 2015/02/05.

The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
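The 90%/10% stratified random split mentioned in this description can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `stratified_split`, the `(article, quality_class)` tuple layout, the toy class labels, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.1, seed=0):
    """Split (item, label) pairs so every label contributes the same test share."""
    # Group items by their label (here: a hypothetical quality class).
    by_label = defaultdict(list)
    for item, label in records:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend((item, label) for item in items[:n_test])
        train.extend((item, label) for item in items[n_test:])
    return train, test


# Hypothetical toy data: 20 articles in each of three quality classes.
records = [(f"article-{c}-{i}", c) for c in ("FA", "GA", "Stub") for i in range(20)]
train, test = stratified_split(records)
```

Sampling within each class separately keeps the class proportions of the test set close to those of the full data, which a plain random split does not guarantee for small classes.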
ds_10000000_100000000.hdf5
figshare.com
hdf
Updated Aug 27, 2021
Cite
Ben Murray (2021). ds_10000000_100000000.hdf5 [Dataset]. http://doi.org/10.6084/m9.figshare.16413255.v1
Available download formats: hdf
Unique identifier
https://doi.org/10.6084/m9.figshare.16413255.v1
Dataset updated
Aug 27, 2021
Dataset provided by
Figshare http://figshare.com/
Authors
Ben Murray
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A synthetic dataset generated to demonstrate the merge performance of ExeTera (https://github.com/KCL-BMEIS/ExeTera.git). It contains two tables, one with 10 million rows and one with 100 million rows. The primary key of the smaller table is a foreign key of the large table.

Please see the wikis hosted at https://github.com/KCL-BMEIS/ExeTeraEval/wiki and https://github.com/KCL-BMEIS/ExeTera/wiki for details of how to use ExeTera to load and view this data. Please see ExeTeraEval for details of how to regenerate this and similar datasets.
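The kind of merge this dataset is built to benchmark, joining a large table to a smaller one through a foreign key, can be illustrated with a small hash join in plain Python. This sketch does not use ExeTera's API; the function name `merge_on_key` and the column names `id`, `small_id`, `name`, and `value` are assumptions chosen for the example.

```python
def merge_on_key(small_rows, large_rows, pk="id", fk="small_id"):
    """Inner-join large_rows to small_rows where large[fk] == small[pk]."""
    # Index the smaller table by its primary key for O(1) lookups.
    index = {row[pk]: row for row in small_rows}

    merged = []
    for row in large_rows:
        parent = index.get(row[fk])
        if parent is not None:  # skip rows whose foreign key has no match
            merged.append({**parent, **row})
    return merged


# Hypothetical toy tables standing in for the 10M-row and 100M-row tables.
small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
large = [
    {"small_id": 1, "value": 0.5},
    {"small_id": 1, "value": 1.5},
    {"small_id": 2, "value": 2.5},
    {"small_id": 9, "value": 9.9},  # no matching primary key
]
merged = merge_on_key(small, large)
```

Building the index on the smaller table keeps memory use proportional to the small side, which is the usual choice when one table is ten times larger than the other.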
vi-wiki-conversational-search
huggingface.co
Updated May 18, 2025
Cite
vi-wiki-conversational-search [Dataset]. https://huggingface.co/datasets/trientp/vi-wiki-conversational-search
Dataset updated
May 18, 2025
Authors
Thái Phát Triển
Description
Dataset Card for Vi-Wiki-Conversational-Search

The ViWiki-QR dataset is a Vietnamese collection of 16.7K synthetic conversations and 250 human-annotated conversations, supporting the task of query rewriting for conversational search.

Dataset Details

Dataset Description

ViWiki-QR is a Vietnamese dataset designed for the task of query rewriting in conversational search. It contains two subsets: a large-scale synthetic training set and a smaller, manually… See the full description on the dataset page: https://huggingface.co/datasets/trientp/vi-wiki-conversational-search.
WikiProjects Machine Readable Dataset
figshare.com
txt
Updated May 31, 2023
Cite
Sumit Asthana; Aaron Halfaker (2023). WikiProjects Machine Readable Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.5503819.v3
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5503819.v3
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Sumit Asthana; Aaron Halfaker
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Machine readable format of WikiProjects listed at https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
The dataset is generated using the code at https://github.com/wiki-ai/drafttopic/
The dataset is modeled as a nested tree structure after the original hierarchical mappings on the WikiProjects home page and its child pages.
* Each non-leaf entry represents a sub-category with a name and some associated information, such as the level in the page it was parsed at and the root URL of the page it was parsed from.
* Each non-leaf node has a mandatory key "topics" which leads to further sub-categories within it.
* Each leaf node is a WikiProject entry, with the actual WikiProject name and its active status.
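A tree of this shape can be walked recursively by treating any node with a "topics" key as a sub-category and anything else as a WikiProject leaf. The sketch below is only an illustration: the "topics" key comes from the description, while the field names `name` and `active`, the assumption that "topics" maps names to child nodes, and the toy tree itself are hypothetical and may not match the actual file.

```python
def iter_wikiprojects(node, path=()):
    """Yield (path, leaf) for every leaf WikiProject entry below node."""
    if "topics" in node:
        # Non-leaf: descend into each sub-category, extending the path.
        for key, child in node["topics"].items():
            yield from iter_wikiprojects(child, path + (key,))
    else:
        # Leaf: a WikiProject entry.
        yield path, node


# Hypothetical toy tree; field names other than "topics" are assumed.
tree = {
    "name": "Culture",
    "topics": {
        "Arts": {
            "name": "Arts",
            "topics": {
                "WikiProject Music": {"name": "WikiProject Music", "active": True},
            },
        },
        "WikiProject Food": {"name": "WikiProject Food", "active": False},
    },
}

leaves = list(iter_wikiprojects(tree))
active = [leaf["name"] for _, leaf in leaves if leaf["active"]]
```

Carrying the path alongside each leaf preserves the sub-category hierarchy, so a flat list of projects can still be grouped by topic afterwards.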
WIKI-REASONING-TR
huggingface.co
Cite
XAP Labs, WIKI-REASONING-TR [Dataset]. https://huggingface.co/datasets/xaplabs/WIKI-REASONING-TR
Dataset authored and provided by
XAP Labs
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for Synthetic Reasoning Data Structures

This dataset comprises synthetically generated structured data points designed for training large language models to enhance reasoning, explanation generation, and complex problem-solving capabilities. Each data point is generated based on provided background knowledge (Wiki pages) and specific instructional prompts targeting defined reasoning types.

Dataset Details

Dataset Description

This dataset consists… See the full description on the dataset page: https://huggingface.co/datasets/xaplabs/WIKI-REASONING-TR.
CMS Synthetic Patient Data OMOP
redivis.com
application/jsonl +7
Updated Aug 19, 2020
Cite
Redivis Demo Organization (2020). CMS Synthetic Patient Data OMOP [Dataset]. https://redivis.com/datasets/ye2v-6skh7wdr7
Available download formats: sas, avro, parquet, stata, application/jsonl, arrow, csv, spss
Dataset updated
Aug 19, 2020
Dataset provided by
Redivis Inc.
Authors
Redivis Demo Organization
Time period covered
Jan 1, 2008 - Dec 31, 2010
Description
Abstract

This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.

Methodology

This dataset follows the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). The purpose of the Common Data Model is to convert various distinctly formatted datasets into a well-known, universal format with a set of standardized vocabularies, as shown in the diagram below from the Observational Health Data Sciences and Informatics (OHDSI) webpage.

[Figure: Why-CDM.png, https://redivis.com/fileUploads/d1a95a4e-074a-44d1-92e5-9adfd2f4068a]

Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHDSI OMOP site.

Usage

* For documentation regarding the source data format from the Centers for Medicare and Medicaid Services (CMS), refer to the CMS Synthetic Public Use File (https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF).
* For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page (https://github.com/OHDSI/ETL-CMS).
* For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page (https://github.com/OHDSI/CommonDataModel/wiki). All variable labels and descriptions as well as table descriptions come from this Wiki page. Note that this GitHub page primarily covers version 6.0 of the CDM, whereas this dataset uses version 5.2.
synthetic-wiki
huggingface.co
Updated Sep 21, 2005
Cite
Duc-Nhan Nguyen (2005). synthetic-wiki [Dataset]. https://huggingface.co/datasets/iambestfeed/synthetic-wiki
Dataset updated
Sep 21, 2005
Authors
Duc-Nhan Nguyen
Description
iambestfeed/synthetic-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community
Proto
huggingface.co
Updated Jun 18, 2025
Cite
Daniel Fox (2025). Proto [Dataset]. https://huggingface.co/datasets/FlameF0X/Proto
Dataset updated
Jun 18, 2025
Authors
Daniel Fox
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
This is a prototype dataset. While it is not the typical synthetic dataset that I do, this is a scraped dataset from Wiki talk, GitHub and Stack.

Share the feedback with me
pl-wiki-printsyn
huggingface.co
Cite
Patryk Niedźwiedziński, pl-wiki-printsyn [Dataset]. https://huggingface.co/datasets/pniedzwiedzinski/pl-wiki-printsyn
Explore at:
Croissant
Authors
Patryk Niedźwiedziński
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
pniedzwiedzinski/pl-wiki-printsyn dataset hosted on Hugging Face and contributed by the HF Datasets community