Other: https://choosealicense.com/licenses/other/
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
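As a minimal sketch (assuming a local copy of the Transformers run_summarization.py example; the mapping entry itself is quoted from the description above):

from datasets import load_dataset

# Inside run_summarization.py, extend the existing dict, e.g.:
# summarization_name_mapping = {
#     ...,
#     "ccdv/pubmed-summarization": ("article", "abstract"),
# }

# Quick check that the columns named in the mapping exist
# (trust_remote_code may be required with recent versions of datasets).
ds = load_dataset("ccdv/pubmed-summarization", split="train")
print(ds[0]["article"][:200])
print(ds[0]["abstract"][:200])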
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
The PubMed Corpus in MedRAG
This HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).
News
(02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.
Dataset Details
Dataset Descriptions
PubMed is the most widely used literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.
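A minimal loading sketch, assuming a single train split and that streaming is supported for a corpus of this size (field names beyond "id" and "PMID" are not listed here, so the snippet just inspects them):

from datasets import load_dataset

# Stream the snippets rather than downloading all ~23.9M records at once.
corpus = load_dataset("MedRAG/pubmed", split="train", streaming=True)
first = next(iter(corpus))
print(first.keys())  # e.g. "id", "PMID", plus the snippet text fields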
Other: https://choosealicense.com/licenses/other/
20M Vietnamese PubMed biomedical abstracts translated by the state-of-the-art English-Vietnamese Translation project. The data has been used as an unlabeled dataset for pretraining a Vietnamese Biomedical-domain Transformer model.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
News
[2025/02/18]: We added the original captions of PubMedVision in PubMedVision_Original_Caption.json, as well as the Chinese version of PubMedVision in PubMedVision_Chinese.json.
[2024/07/01]: We added annotations for 'body_part' and 'modality' of images, utilizing the HuatuoGPT-Vision-7B model.
PubMedVision
PubMedVision is a large-scale medical VQA dataset. We extracted high-quality image-text pairs from PubMed and used GPT-4V to reformat them to enhance their quality.… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/PubMedVision.
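A small sketch for fetching one of the JSON files named in the news entry above (the filenames come from that entry; treating each file as a single JSON array is an assumption):

from huggingface_hub import hf_hub_download
import json

# Download PubMedVision_Original_Caption.json from the dataset repository.
path = hf_hub_download(
    repo_id="FreedomIntelligence/PubMedVision",
    filename="PubMedVision_Original_Caption.json",
    repo_type="dataset",
)
with open(path, encoding="utf-8") as f:
    records = json.load(f)  # assumed to be a list of caption records
print(len(records))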
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for [Dataset Name]
Dataset Summary
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.
Supported Tasks and Leaderboards
The official leaderboard is available at: https://pubmedqa.github.io/. 500 questions in the pqa_labeled are used as the test set. They can be found at… See the full description on the dataset page: https://huggingface.co/datasets/qiaojin/PubMedQA.
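A minimal loading sketch for the pqa_labeled subset mentioned above (the split name and the question/final_decision field names are assumptions based on common usage of this dataset):

from datasets import load_dataset

pqa = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")
example = pqa[0]
print(example["question"])
print(example["final_decision"])  # expected to be yes / no / maybe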
The scientific papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have two features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scientific_papers', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
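If you prefer the Hugging Face datasets library over tensorflow_datasets, a roughly equivalent sketch (assuming the armanc/scientific_papers mirror on the Hub, which exposes "arxiv" and "pubmed" configurations) looks like this:

from datasets import load_dataset

# trust_remote_code may be required for script-based datasets in recent versions.
ds = load_dataset("armanc/scientific_papers", "pubmed", split="train", trust_remote_code=True)
print(ds[0]["abstract"][:200])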
nampdn-ai/mini-pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Summary
A daily-updated dataset of PubMed abstracts, collected via PubMed's API and published on Hugging Face Datasets. Each snapshot is versioned by date (e.g., 2025-03-28) so users can track historical changes or use a consistent snapshot for reproducibility.
Updated daily
Each version tagged by date
Abstract-only dataset (no full text)
Dataset Structure
Column | Type | Description
pmid | string | Unique PubMed identifier
abstract | string | Abstract text…
See the full description on the dataset page: https://huggingface.co/datasets/uiyunkim-hub/pubmed-abstract.
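A minimal sketch for pinning a dated snapshot (the example date comes from the description above; exposing snapshots as Git revision tags is an assumption about how the repository is versioned):

from datasets import load_dataset

# Pin a specific daily snapshot for reproducibility.
snapshot = load_dataset(
    "uiyunkim-hub/pubmed-abstract",
    revision="2025-03-28",  # dated version tag, as described above
    split="train",
)
print(snapshot[0]["pmid"], snapshot[0]["abstract"][:120])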
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
nianlong/long-doc-extractive-summarization-truncated-pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
Dataset Card for PubMed
Dataset Summary
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
English
Dataset Structure
Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.
Unknown: https://choosealicense.com/licenses/unknown/
This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.
For more details, see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/
The original dataset can be downloaded from: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/NCBI_corpus.zip
This dataset has been converted to CoNLL format for NER using the following tool: https://github.com/spyysalo/standoff2conll
Note: there is a duplicate document (PMID 8528200) in the original data, and the duplicate is recreated in the converted data.
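A small sketch of reading the converted output, assuming the usual CoNLL NER layout of one token and its tag per whitespace-separated line, with blank lines between sentences (the filename is illustrative):

def read_conll(path):
    # Yield (tokens, tags) for each sentence in a CoNLL-style file.
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            tags.append(parts[-1])
    if tokens:
        yield tokens, tags

for sent_tokens, sent_tags in read_conll("ncbi_disease_train.conll"):  # illustrative path
    print(sent_tokens[:5], sent_tags[:5])
    break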
This PUBMED dataset was created based on The PubMed Abstract GitHub site and uploaded to Hugging Face.
Reproducing the PUBMED dataset
$ git clone https://github.com/thoppe/The-Pile-PubMed.git
$ cd The-Pile-PubMed/
$ python P0_download_listing.py
$ python P1_download_baseline.py
$ python P2_parse.py
$ python P3_build_final_LM_dataset.py
Load Dataset
from datasets import load_dataset
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline", split="train")
… See the full description on the dataset page: https://huggingface.co/datasets/hwang2006/PUBMED_title_abstracts_2020_baseline.
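For a corpus of this size, streaming avoids a full local download; a minimal variant of the call above:

from datasets import load_dataset

# Iterate over records lazily instead of materializing the whole corpus.
pubmed_streamed = load_dataset(
    "hwang2006/PUBMED_title_abstracts_2020_baseline",
    split="train",
    streaming=True,
)
print(next(iter(pubmed_streamed)))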
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card
This dataset consists of abstracts from heart-related papers collected from PubMed. It can be used for pre-training a language model specialized in cardiology. The dataset was collected through the PubMed API, based on the names of heart-related journals and a glossary of cardiology terms.
Dataset
Data Sources
Pubmed: PubMed is a database that provides abstracts of research papers related to life sciences, biomedical fields, health psychology… See the full description on the dataset page: https://huggingface.co/datasets/InMedData/Cardio_v1.
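As described above, the abstracts were collected through the PubMed API. A minimal sketch of the kind of query this involves, using NCBI E-utilities (the search term is an illustrative cardiology glossary entry, not the authors' actual query):

import requests

# Search PubMed via NCBI E-utilities for cardiology-related article IDs.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
resp = requests.get(
    f"{BASE}/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": "myocardial infarction[Title/Abstract]",  # illustrative term
        "retmax": 20,
        "retmode": "json",
    },
    timeout=30,
)
pmids = resp.json()["esearchresult"]["idlist"]
print(pmids)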
sharad36/pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community
PDFPages/pubmed-page-images dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
Long-COVID related articles have been manually collected by information specialists. Please find further information here.
Size
 | Training | Development | Test | Total
Positive Examples | 215 | 76 | 70 | 345
Negative Examples | 199 | 62 | 68 | 345
Total | 414 | 238 | 138 | 690
Citation
@article{10.1093/database/baac048,
  author = {Langnickel, Lisa and Darms, Johannes and Heldt, Katharina and Ducks, Denise and Fluck, Juliane},
  title = "{Continuous development…
See the full description on the dataset page: https://huggingface.co/datasets/llangnickel/long-covid-classification-data.
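A minimal loading sketch (the repository id is taken from the dataset page above; the split name is an assumption):

from datasets import load_dataset

lc = load_dataset("llangnickel/long-covid-classification-data", split="train")
print(lc[0])  # inspect the text and label fields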
gayanin/pubmed-abstracts-dist-noised-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
🎉 NEW DROP 🎉 PubMed Guidelines
We just added 1627 clinical guidelines found in PubMed and PubMed Central to the dataset on December 23rd, 2023. Merry Christmas!
Clinical Guidelines
The Clinical Guidelines corpus is a new dataset of 47K clinical practice guidelines from 17 high-quality online medical sources. This dataset serves as a crucial component of the original training corpus of the Meditron Large Language Model (LLM). We publicly release a subset of 37K articles… See the full description on the dataset page: https://huggingface.co/datasets/epfl-llm/guidelines.
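A minimal loading sketch for the publicly released subset (the repository id comes from the dataset page above; the split name is an assumption):

from datasets import load_dataset

guidelines = load_dataset("epfl-llm/guidelines", split="train")
print(guidelines[0].keys())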
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
The PMC Open Access Subset includes more than 3.4 million journal articles and preprints that are made available under license terms that allow reuse.
Not all articles in PMC are available for text mining and other reuse; many have copyright protection. However, articles in the PMC Open Access Subset are made available under Creative Commons or similar licenses that generally allow more liberal redistribution and reuse than a traditional copyrighted work.
The PMC Open Access Subset is one part of the PMC Article Datasets.