Other: https://choosealicense.com/licenses/other/
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
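As a minimal sketch (assuming a local copy of the Transformers run_summarization.py example; the mapping entry itself is quoted from the description above):

from datasets import load_dataset

# Inside run_summarization.py, extend the existing dict, e.g.:
# summarization_name_mapping = {
#     ...,
#     "ccdv/pubmed-summarization": ("article", "abstract"),
# }

# Quick check that the columns named in the mapping exist
# (trust_remote_code may be required with recent versions of datasets).
ds = load_dataset("ccdv/pubmed-summarization", split="train")
print(ds[0]["article"][:200])
print(ds[0]["abstract"][:200])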
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
The PubMed Corpus in MedRAG
This HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).
News
(02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.
Dataset Details
Dataset Descriptions
PubMed is the most widely used literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.
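A minimal loading sketch, assuming a single train split and that streaming is supported for a corpus of this size (field names beyond "id" and "PMID" are not listed here, so the snippet just inspects them):

from datasets import load_dataset

# Stream the snippets rather than downloading all ~23.9M records at once.
corpus = load_dataset("MedRAG/pubmed", split="train", streaming=True)
first = next(iter(corpus))
print(first.keys())  # e.g. "id", "PMID", plus the snippet text fields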
Other: https://choosealicense.com/licenses/other/
20M Vietnamese PubMed biomedical abstracts translated by the state-of-the-art English-Vietnamese Translation project. The data has been used as an unlabeled dataset for pretraining a Vietnamese Biomedical-domain Transformer model.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
News
[2025/02/18]: We added the original captions of PubMedVision in PubMedVision_Original_Caption.json, as well as the Chinese version of PubMedVision in PubMedVision_Chinese.json.
[2024/07/01]: We added annotations for 'body_part' and 'modality' of images, utilizing the HuatuoGPT-Vision-7B model.
PubMedVision
PubMedVision is a large-scale medical VQA dataset. We extracted high-quality image-text pairs from PubMed and used GPT-4V to reformat them to enhance their quality.… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/PubMedVision.
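A small sketch for fetching one of the JSON files named in the news entry above (the filenames come from that entry; treating each file as a single JSON array is an assumption):

from huggingface_hub import hf_hub_download
import json

# Download PubMedVision_Original_Caption.json from the dataset repository.
path = hf_hub_download(
    repo_id="FreedomIntelligence/PubMedVision",
    filename="PubMedVision_Original_Caption.json",
    repo_type="dataset",
)
with open(path, encoding="utf-8") as f:
    records = json.load(f)  # assumed to be a list of caption records
print(len(records))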
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for [Dataset Name]
Dataset Summary
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.
Supported Tasks and Leaderboards
The official leaderboard is available at: https://pubmedqa.github.io/. 500 questions in the pqa_labeled are used as the test set. They can be found at… See the full description on the dataset page: https://huggingface.co/datasets/qiaojin/PubMedQA.
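A minimal loading sketch for the pqa_labeled subset mentioned above (the split name and the question/final_decision field names are assumptions based on common usage of this dataset):

from datasets import load_dataset

pqa = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")
example = pqa[0]
print(example["question"])
print(example["final_decision"])  # expected to be yes / no / maybe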
The scientific papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have two features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scientific_papers', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
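If you prefer the Hugging Face datasets library over tensorflow_datasets, a roughly equivalent sketch (assuming the armanc/scientific_papers mirror on the Hub, which exposes "arxiv" and "pubmed" configurations) looks like this:

from datasets import load_dataset

# trust_remote_code may be required for script-based datasets in recent versions.
ds = load_dataset("armanc/scientific_papers", "pubmed", split="train", trust_remote_code=True)
print(ds[0]["abstract"][:200])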
nampdn-ai/mini-pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Summary
A daily-updated dataset of PubMed abstracts, collected via PubMed's API and published on Hugging Face Datasets. Each snapshot is versioned by date (e.g., 2025-03-28) so users can track historical changes or use a consistent snapshot for reproducibility.
Updated daily
Each version tagged by date
Abstract-only dataset (no full text)
Dataset Structure
Column | Type | Description
pmid | string | Unique PubMed identifier
abstract | string | Abstract text…
See the full description on the dataset page: https://huggingface.co/datasets/uiyunkim-hub/pubmed-abstract.
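A minimal sketch for pinning a dated snapshot (the example date comes from the description above; exposing snapshots as Git revision tags is an assumption about how the repository is versioned):

from datasets import load_dataset

# Pin a specific daily snapshot for reproducibility.
snapshot = load_dataset(
    "uiyunkim-hub/pubmed-abstract",
    revision="2025-03-28",  # dated version tag, as described above
    split="train",
)
print(snapshot[0]["pmid"], snapshot[0]["abstract"][:120])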
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
nianlong/long-doc-extractive-summarization-truncated-pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
Dataset Card for PubMed
Dataset Summary
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
English
Dataset Structure
Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.
Unknown: https://choosealicense.com/licenses/unknown/
This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.
For more details, see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/
The original dataset can be downloaded from: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/NCBI_corpus.zip
This dataset has been converted to CoNLL format for NER using the following tool: https://github.com/spyysalo/standoff2conll
Note: there is a duplicate document (PMID 8528200) in the original data, and the duplicate is recreated in the converted data.
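A small sketch of reading the converted output, assuming the usual CoNLL NER layout of one token and its tag per whitespace-separated line, with blank lines between sentences (the filename is illustrative):

def read_conll(path):
    # Yield (tokens, tags) for each sentence in a CoNLL-style file.
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:
        yield tokens, tags

for sent_tokens, sent_tags in read_conll("ncbi_disease_train.conll"):  # illustrative path
    print(sent_tokens[:5], sent_tags[:5])
    break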
This PUBMED dataset was created based on The PubMed Abstract GitHub site and uploaded to Hugging Face.
Reproducing the PUBMED dataset
$ git clone https://github.com/thoppe/The-Pile-PubMed.git
$ cd The-Pile-PubMed/
$ python P0_download_listing.py
$ python P1_download_baseline.py
$ python P2_parse.py
$ python P3_build_final_LM_dataset.py
Load Dataset
from datasets import load_dataset
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline", split="train")
… See the full description on the dataset page: https://huggingface.co/datasets/hwang2006/PUBMED_title_abstracts_2020_baseline.
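For a corpus of this size, streaming avoids a full local download; a minimal variant of the call above:

from datasets import load_dataset

# Iterate over records lazily instead of materializing the whole corpus.
pubmed_streamed = load_dataset(
    "hwang2006/PUBMED_title_abstracts_2020_baseline",
    split="train",
    streaming=True,
)
print(next(iter(pubmed_streamed)))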
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card
This dataset consists of abstracts from heart-related papers collected from PubMed. It can be used for pre-training a language model specialized in cardiology. The dataset was collected through the PubMed API, based on the names of heart-related journals and a glossary of cardiology terms.
Dataset
Data Sources
Pubmed: PubMed is a database that provides abstracts of research papers related to life sciences, biomedical fields, health psychology… See the full description on the dataset page: https://huggingface.co/datasets/InMedData/Cardio_v1.
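As described above, the abstracts were collected through the PubMed API. A minimal sketch of the kind of query this involves, using NCBI E-utilities (the search term is an illustrative cardiology glossary entry, not the authors' actual query):

import requests

# Search PubMed via NCBI E-utilities for cardiology-related article IDs.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
resp = requests.get(
    f"{BASE}/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": "myocardial infarction[Title/Abstract]",  # illustrative term
        "retmax": 20,
        "retmode": "json",
    },
    timeout=30,
)
pmids = resp.json()["esearchresult"]["idlist"]
print(pmids)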
sharad36/pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community
PDFPages/pubmed-page-images dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
Long-COVID related articles have been manually collected by information specialists. Please find further information here.
Size
 | Training | Development | Test | Total
Positive Examples | 215 | 76 | 70 | 345
Negative Examples | 199 | 62 | 68 | 345
Total | 414 | 238 | 138 | 690
Citation
@article{10.1093/database/baac048,
  author = {Langnickel, Lisa and Darms, Johannes and Heldt, Katharina and Ducks, Denise and Fluck, Juliane},
  title = "{Continuous development…
See the full description on the dataset page: https://huggingface.co/datasets/llangnickel/long-covid-classification-data.
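A minimal loading sketch (the repository id is taken from the dataset page above; the split name is an assumption):

from datasets import load_dataset

lc = load_dataset("llangnickel/long-covid-classification-data", split="train")
print(lc[0])  # inspect the text and label fields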
gayanin/pubmed-abstracts-dist-noised-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
🎉 NEW DROP 🎉 PubMed Guidelines
We just added 1627 clinical guidelines found in PubMed and PubMed Central to the dataset on December 23rd, 2023. Merry Christmas!
Clinical Guidelines
The Clinical Guidelines corpus is a new dataset of 47K clinical practice guidelines from 17 high-quality online medical sources. This dataset serves as a crucial component of the original training corpus of the Meditron Large Language Model (LLM). We publicly release a subset of 37K articles… See the full description on the dataset page: https://huggingface.co/datasets/epfl-llm/guidelines.
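A minimal loading sketch for the publicly released subset (the repository id comes from the dataset page above; the split name is an assumption):

from datasets import load_dataset

guidelines = load_dataset("epfl-llm/guidelines", split="train")
print(guidelines[0].keys())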
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
The PMC Open Access Subset includes more than 3.4 million journal articles and preprints that are made available under license terms that allow reuse.
Not all articles in PMC are available for text mining and other reuse; many have copyright protection. However, articles in the PMC Open Access Subset are made available under Creative Commons or similar licenses that generally allow more liberal redistribution and reuse than a traditional copyrighted work.
The PMC Open Access Subset is one part of the PMC Article Datasets.