SALT-NLP/wiki-balance-synthetic-qrels dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SALT-NLP/wiki-balance-synthetic dataset hosted on Hugging Face and contributed by the HF Datasets community
The included dataset contains 10,000 synthetic Veteran patient records generated by Synthea. The scope of the data includes over 500 clinical concepts across 90 disease modules, as well as additional social determinants of health (SDoH) data elements that are not traditionally tracked in electronic health records. Each synthetic patient conceptually represents one Veteran in the existing US population; each Veteran has a name, sociodemographic profile, a series of documented clinical encounters and diagnoses, as well as associated cost and payer data. To learn more about Synthea, please visit the Synthea wiki at https://github.com/synthetichealth/synthea/wiki. To find a description of how this dataset is organized by data type, please visit the Synthea CSV File Data Dictionary at https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary.
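As a minimal sketch of how the CSV tables fit together, the snippet below joins patients to their encounters with pandas. The file and column names (patients.csv, encounters.csv, Id, PATIENT) follow the Synthea CSV File Data Dictionary linked above and should be verified against your copy of the data:

```python
import pandas as pd

# Load two of the Synthea CSV tables (names per the Synthea CSV File
# Data Dictionary; verify against your download).
patients = pd.read_csv("patients.csv")      # one row per synthetic Veteran
encounters = pd.read_csv("encounters.csv")  # one row per clinical encounter

# Join encounters to their patients: in the Synthea CSV layout,
# encounters.PATIENT is a foreign key referencing patients.Id.
merged = encounters.merge(patients, left_on="PATIENT", right_on="Id",
                          suffixes=("_enc", "_pat"))

# Example: distribution of documented encounters per patient.
print(merged.groupby("PATIENT").size().describe())
```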
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the official trimodal datasets for the paper "Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features" (https://ieeexplore.ieee.org/document/10089190). Code is available at https://github.com/ChangdeDu/BraVL
GNU Free Documentation License (GFDL)https://choosealicense.com/licenses/gfdl/
Wiki Sim
Overview
This new semi-synthetic dataset is derived from wikimedia/wikipedia. Each row contains 1-3 reference sentences extracted from the original dataset. For each reference sentence, we use an optimized DSPy program to generate 4 similar sentences:
* Synonym: replace words with synonyms to maintain the same meaning.
* Paraphrase: rephrase the sentence using a different structure while keeping the same idea.
* Conceptual Overlap: express a related concept…

See the full description on the dataset page: https://huggingface.co/datasets/dleemiller/wiki-sim.
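A minimal sketch of loading the dataset from the Hugging Face Hub follows; the split and column names are not spelled out above, so the printouts below are the way to confirm them:

```python
from datasets import load_dataset

# Pull wiki-sim from the Hugging Face Hub. Split names and column names
# are assumptions to verify against the printouts below.
ds = load_dataset("dleemiller/wiki-sim")
print(ds)                     # available splits and row counts

first_split = next(iter(ds))  # pick whichever split is listed first
print(ds[first_split][0])     # one row: reference sentence + generated variants
```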
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of articles and their associated quality assessment ratings from the English Wikipedia. Each dataset is self-contained, as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy. The 2017 dataset is the preferred dataset to use; it contains 32,460 articles and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference and contains 30,272 articles gathered on 2015/02/05. The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used to train the machine learner in the wikiclass Python library, also linked below.
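For illustration only (the split shipped with the datasets is the one to use), the 90%/10% stratified split described above can be reproduced with scikit-learn; the toy texts and labels here are stand-ins for the real (wiki markup, quality class) pairs:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for (article markup, quality class) pairs; the real
# datasets use the classes FA, GA, B, C, Start, Stub.
texts = [f"article {i}" for i in range(100)]
labels = (["FA", "GA", "B", "C", "Start", "Stub"] * 17)[:100]

# 90%/10% split that preserves class proportions in both partitions,
# mirroring the stratified random sampling strategy above.
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.10, stratify=labels, random_state=0)
print(len(train_x), len(test_x))  # 90 10
```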
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A synthetic dataset generated to demonstrate the merge performance of ExeTera (https://github.com/KCL-BMEIS/ExeTera.git). It contains two tables, one with 10 million rows and one with 100 million rows. The primary key of the smaller table is a foreign key of the large tablePlease see the wikis hosted at https://github.com/KCL-BMEIS/ExeTeraEval/wiki and https://github.com/KCL-BMEIS/ExeTera/wiki for details of how to use ExeTera to load and view this data. Please see ExeTeraEval for details of how to regenerate this and similar datasets.
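The table layout is easiest to see in miniature. The sketch below builds a scaled-down analogue (10 vs. 100 rows instead of 10M vs. 100M) and performs the key-based merge in pandas purely to show the primary-key/foreign-key relationship; ExeTera's own merge API is covered in the wikis linked above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small table: "id" is the primary key.
small = pd.DataFrame({"id": np.arange(10), "value": rng.random(10)})

# Large table: "fk" is a foreign key into small.id.
large = pd.DataFrame({"fk": rng.integers(0, 10, size=100),
                      "payload": rng.random(100)})

# The merge the benchmark exercises at 10M/100M scale.
merged = large.merge(small, left_on="fk", right_on="id", how="left")
print(merged.head())
```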
Dataset Card for Vi-Wiki-Conversational-Search
The ViWiki-QR dataset is a Vietnamese collection of 16.7K synthetic conversations and 250 human-annotated conversations, supporting the task of query rewriting for conversational search.
Dataset Details
Dataset Description
ViWiki-QR is a Vietnamese dataset designed for the task of query rewriting in conversational search. It contains two subsets: a large-scale synthetic training set and a smaller, manually… See the full description on the dataset page: https://huggingface.co/datasets/trientp/vi-wiki-conversational-search.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Machine-readable format of the WikiProjects listed at https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory. The dataset is generated using the code at https://github.com/wiki-ai/drafttopic/. The dataset is modeled as a nested tree structure, after the original hierarchical mappings on the WikiProjects home page and its child pages.
* Each non-leaf entry represents a sub-category, with a name and some associated information such as the level in the page it was parsed at and the root URL of the page it was parsed from.
* Each non-leaf node has a mandatory key "topics" which leads to further sub-categories within it.
* Each leaf node is a WikiProject entry, with the actual WikiProject name and its active status.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Synthetic Reasoning Data Structures
This dataset comprises synthetically generated structured data points designed for training large language models to enhance reasoning, explanation generation, and complex problem-solving capabilities. Each data point is generated based on provided background knowledge (Wiki pages) and specific instructional prompts targeting defined reasoning types.
Dataset Details
Dataset Description
This dataset consists… See the full description on the dataset page: https://huggingface.co/datasets/xaplabs/WIKI-REASONING-TR.
This is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010.
This dataset takes on the format of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). As shown in the diagram below, taken from the Observational Health Data Sciences and Informatics (OHDSI) webpage, the purpose of the Common Data Model is to convert various distinctly formatted datasets into a well-known, universal format with a set of standardized vocabularies.
[Figure: Why-CDM.png — OHDSI diagram illustrating why a Common Data Model is used]
Such universal data models ultimately enable researchers to streamline the analysis of observational medical data. For more information regarding the OMOP CDM, refer to the OHDSI OMOP site.
* For documentation regarding the source data format from the Centers for Medicare & Medicaid Services (CMS), refer to the CMS Synthetic Public Use File: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.
* For information regarding the conversion of the CMS data file to the OMOP CDM v5.2, refer to this OHDSI GitHub page: https://github.com/OHDSI/ETL-CMS.
* For information regarding each of the 24 tables in this dataset, including more detailed variable metadata, see the OHDSI CDM GitHub Wiki page: https://github.com/OHDSI/CommonDataModel/wiki. All variable labels and descriptions, as well as table descriptions, come from this Wiki page. Note that this GitHub page primarily covers version 6.0 of the CDM, while this dataset uses version 5.2.
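Since the dataset is described above as accessed via BigQuery, a query sketch might look like the following. The dataset path bigquery-public-data.cms_synthetic_patient_data_omop is an assumption to verify against your BigQuery console, the person table is a standard OMOP CDM table, and running this requires Google Cloud credentials:

```python
from google.cloud import bigquery

# Assumed BigQuery public dataset path; confirm before use.
TABLE = "bigquery-public-data.cms_synthetic_patient_data_omop.person"

client = bigquery.Client()
query = f"""
    SELECT gender_concept_id, COUNT(*) AS n
    FROM `{TABLE}`
    GROUP BY gender_concept_id
    ORDER BY n DESC
"""

# Count synthetic patients by OMOP gender concept.
for row in client.query(query).result():
    print(row.gender_concept_id, row.n)
```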
iambestfeed/synthetic-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This is a prototype dataset. While it is not a typical synthetic dataset, it was scraped from Wiki Talk, GitHub, and Stack.
Please share your feedback with me.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
pniedzwiedzinski/pl-wiki-printsyn dataset hosted on Hugging Face and contributed by the HF Datasets community