100+ datasets found
  1. pubmed

    • huggingface.co
    Updated Dec 15, 2023
    Cite
    NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed
    Dataset updated
    Dec 15, 2023
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

    https://choosealicense.com/licenses/other/

    Description

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
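
The baseline-plus-daily-updates model described above can be sketched as follows. This is only an illustration: it assumes citations are keyed by PMID and that each update is a hypothetical (action, pmid, record) tuple. The real NLM files are XML and are not parsed here.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical update record: (action, pmid, citation). Not NLM's actual XML schema.
Update = Tuple[str, str, Optional[dict]]


def apply_updates(baseline: Dict[str, dict], updates: Iterable[Update]) -> Dict[str, dict]:
    """Return a copy of the baseline with new, revised and deleted citations applied in order."""
    citations = dict(baseline)  # leave the baseline itself untouched
    for action, pmid, record in updates:
        if action == "delete":
            citations.pop(pmid, None)  # ignore deletions of unknown PMIDs
        elif action in ("new", "revised"):
            citations[pmid] = record  # insert or overwrite
        else:
            raise ValueError(f"unknown action: {action}")
    return citations


# Made-up records for illustration only.
baseline = {"1": {"title": "A"}, "2": {"title": "B"}}
updates = [
    ("revised", "1", {"title": "A (revised)"}),
    ("delete", "2", None),
    ("new", "3", {"title": "C"}),
]
merged = apply_updates(baseline, updates)
```

Applying update files in release order matters in this sketch, since a later revision or deletion overrides an earlier record for the same PMID.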

  2. pubmed-summarization

    • huggingface.co
    • opendatalab.com
    Updated Dec 1, 2021
    Cite
    ccdv (2021). pubmed-summarization [Dataset]. https://huggingface.co/datasets/ccdv/pubmed-summarization
    Dataset updated
    Dec 1, 2021
    Authors
    ccdv
    Description

    PubMed dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
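
The mapping line and the join behaviour mentioned above can be sketched like this. It assumes summarization_name_mapping is a plain dict from dataset name to a (text column, summary column) pair, as the quoted line implies. The token list is made up for illustration.

```python
# Illustrative stand-in for the mapping variable named in the description.
summarization_name_mapping = {}
summarization_name_mapping["ccdv/pubmed-summarization"] = ("article", "abstract")

# Look up which columns hold the input text and the target summary.
text_column, summary_column = summarization_name_mapping["ccdv/pubmed-summarization"]

# Made-up pre-tokenized text, joined with single spaces as the description states.
tokens = ["long", "documents", "need", "summaries", "."]
article = " ".join(tokens)
```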

      Data Fields
    

    id: paper id
    article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.

  3. pubmed

    • huggingface.co
    Updated Feb 26, 2024
    Cite
    MedRAG (2024). pubmed [Dataset]. https://huggingface.co/datasets/MedRAG/pubmed
    Dataset updated
    Feb 26, 2024
    Authors
    MedRAG
    Description

    The PubMed Corpus in MedRAG

    This HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).

      News
    

    (02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.

      Dataset Details
    
    
    
    
    
      Dataset Descriptions
    

    PubMed is the most widely used literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.

  4. vi_pubmed

    • huggingface.co
    Updated Jun 20, 2024
    Cite
    SEACrowd (2024). vi_pubmed [Dataset]. https://huggingface.co/datasets/SEACrowd/vi_pubmed
    Dataset updated
    Jun 20, 2024
    Dataset authored and provided by
    SEACrowd
    License

    https://choosealicense.com/licenses/other/

    Description

    20M Vietnamese PubMed biomedical abstracts translated by the state-of-the-art English-Vietnamese Translation project. The data has been used as an unlabeled dataset for pretraining a Vietnamese biomedical-domain Transformer model.

  5. PubMedVision

    • huggingface.co
    Updated Jun 29, 2024
    Cite
    FreedomAI (2024). PubMedVision [Dataset]. https://huggingface.co/datasets/FreedomIntelligence/PubMedVision
    Dataset updated
    Jun 29, 2024
    Dataset authored and provided by
    FreedomAI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    News

    [2025/02/18]: We add the original captions of PubMedVision in PubMedVision_Original_Caption.json, as well as the Chinese version of PubMedVision in PubMedVision_Chinese.json.
    [2024/07/01]: We add annotations for 'body_part' and 'modality' of images, utilizing the HuatuoGPT-Vision-7B model.

      PubMedVision
    

    PubMedVision is a large-scale medical VQA dataset. We extracted high-quality image-text pairs from PubMed and used GPT-4V to reformat them to enhance their quality.… See the full description on the dataset page: https://huggingface.co/datasets/FreedomIntelligence/PubMedVision.

  6. PubMedQA

    • huggingface.co
    • paperswithcode.com
    • +1 more
    Updated Jan 7, 2024
    Cite
    Qiao Jin (2024). PubMedQA [Dataset]. https://huggingface.co/datasets/qiaojin/PubMedQA
    Dataset updated
    Jan 7, 2024
    Authors
    Qiao Jin
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.

      Supported Tasks and Leaderboards
    

    The official leaderboard is available at: https://pubmedqa.github.io/. 500 questions in the pqa_labeled are used as the test set. They can be found at… See the full description on the dataset page: https://huggingface.co/datasets/qiaojin/PubMedQA.
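
A minimal sketch of scoring yes/no/maybe answers by plain accuracy is given below. The function name and the example labels are hypothetical, and the official leaderboard may report other metrics as well.

```python
from typing import Sequence

VALID_LABELS = {"yes", "no", "maybe"}


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of questions whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    for label in list(gold) + list(predicted):
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label}")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up labels for illustration only.
gold = ["yes", "no", "maybe", "yes"]
predicted = ["yes", "no", "yes", "yes"]
score = accuracy(gold, predicted)  # 3 of 4 correct
```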

  7. scientific_papers

    • tensorflow.org
    • huggingface.co
    Updated Dec 23, 2022
    Cite
    (2022). scientific_papers [Dataset]. https://www.tensorflow.org/datasets/catalog/scientific_papers
    Dataset updated
    Dec 23, 2022
    Description

    The scientific_papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

    Both "arxiv" and "pubmed" have three features:

    • article: the body of the document, paragraphs separated by "\n".
    • abstract: the abstract of the document, paragraphs separated by "\n".
    • section_names: titles of sections, separated by "\n".

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('scientific_papers', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.
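
Once an example is decoded to plain strings, the three features can be split back into lists. The sketch below assumes the separator is the newline character. The helper name and the example record are made up for illustration.

```python
from typing import List


def split_field(value: str) -> List[str]:
    """Split a newline-separated feature into its non-empty parts."""
    return [part for part in value.split("\n") if part.strip()]


# Made-up record with the three feature names from the description.
example = {
    "article": "Intro paragraph.\nMethods paragraph.",
    "abstract": "One-paragraph abstract.",
    "section_names": "Introduction\nMethods",
}

paragraphs = split_field(example["article"])
sections = split_field(example["section_names"])
```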

  8. mini-pubmed

    • huggingface.co
    Updated Jul 9, 2024
    Cite
    Nam Pham (2024). mini-pubmed [Dataset]. https://huggingface.co/datasets/nampdn-ai/mini-pubmed
    Dataset updated
    Jul 9, 2024
    Authors
    Nam Pham
    Description

    nampdn-ai/mini-pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. pubmed-abstract

    • huggingface.co
    Cite
    Uiyun Kim, pubmed-abstract [Dataset]. https://huggingface.co/datasets/uiyunkim-hub/pubmed-abstract
    Authors
    Uiyun Kim
    Description

    Dataset Summary

    A daily-updated dataset of PubMed abstracts, collected via PubMed’s API and published on Hugging Face Datasets. Each snapshot is versioned by date (e.g., 2025-03-28) so users can track historical changes or use a consistent snapshot for reproducibility.

    • Updated daily
    • Each version tagged by date
    • Abstract-only dataset (no full text)

      Dataset Structure
    

    Column    Type    Description
    pmid      string  Unique PubMed identifier
    abstract  string  Abstract text…

    See the full description on the dataset page: https://huggingface.co/datasets/uiyunkim-hub/pubmed-abstract.

  10. long-doc-extractive-summarization-truncated-pubmed

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    Nianlong Gu (2023). long-doc-extractive-summarization-truncated-pubmed [Dataset]. https://huggingface.co/datasets/nianlong/long-doc-extractive-summarization-truncated-pubmed
    Dataset updated
    Nov 1, 2023
    Authors
    Nianlong Gu
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    nianlong/long-doc-extractive-summarization-truncated-pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. vi_pubmed

    • huggingface.co
    Updated Mar 14, 2023
    Cite
    Long Phan (2023). vi_pubmed [Dataset]. https://huggingface.co/datasets/justinphan3110/vi_pubmed
    Dataset updated
    Mar 14, 2023
    Authors
    Long Phan
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for PubMed

      Dataset Summary
    

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    English

      Dataset Structure
    

    Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.

  12. ncbi_disease

    • huggingface.co
    Updated Sep 2, 2023
    Cite
    NLM/DIR BioNLP Group (2023). ncbi_disease [Dataset]. https://huggingface.co/datasets/ncbi/ncbi_disease
    Dataset updated
    Sep 2, 2023
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.

    For more details, see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/

    The original dataset can be downloaded from: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/NCBI_corpus.zip

    This dataset has been converted to CoNLL format for NER using the following tool: https://github.com/spyysalo/standoff2conll

    Note: there is a duplicate document (PMID 8528200) in the original data, and the duplicate is recreated in the converted data.
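
For readers unfamiliar with CoNLL-style NER files, a minimal reader might look like the sketch below. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences. The tag names and the sample text are illustrative and may not match the converted files exactly.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def parse_conll(text: str) -> List[Sentence]:
    """Group (token, tag) pairs into sentences separated by blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column: token, last column: tag
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout.
sample = (
    "Familial\tB-Disease\n"
    "adenomatous\tI-Disease\n"
    "polyposis\tI-Disease\n"
    "\n"
    "No\tO\n"
    "mutation\tO\n"
)
sentences = parse_conll(sample)
```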

  13. PUBMED_title_abstracts_2020_baseline

    • huggingface.co
    Cite
    Soonwook Hwang, PUBMED_title_abstracts_2020_baseline [Dataset]. https://huggingface.co/datasets/hwang2006/PUBMED_title_abstracts_2020_baseline
    Authors
    Soonwook Hwang
    Description

    This PUBMED dataset was created based on the PubMed Abstract GitHub site and uploaded to Hugging Face.

    The PUBMED dataset reproduced

    $ git clone https://github.com/thoppe/The-Pile-PubMed.git
    $ cd The-Pile-PubMed/
    $ python P0_download_listing.py
    $ python P1_download_baseline.py
    $ python P2_parse.py
    $ python P3_build_final_LM_dataset.py

    Load Dataset

    from datasets import load_dataset
    pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline", split="train")

    … See the full description on the dataset page: https://huggingface.co/datasets/hwang2006/PUBMED_title_abstracts_2020_baseline.

  14. Cardio_v1

    • huggingface.co
    Updated Mar 12, 2024
    Cite
    INMED DATA (2024). Cardio_v1 [Dataset]. https://huggingface.co/datasets/InMedData/Cardio_v1
    Dataset updated
    Mar 12, 2024
    Authors
    INMED DATA
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Dataset Card

    This dataset consists of abstracts from heart-related papers collected from PubMed. It can be used for pre-training a language model specialized in cardiology. The dataset was collected through the PubMed API, based on the names of heart-related journals and a glossary of cardiology terms.

      Dataset
    
    
    
    
    
      Data Sources
    

    Pubmed: PubMed is a database that provides abstracts of research papers related to life sciences, biomedical fields, health psychology… See the full description on the dataset page: https://huggingface.co/datasets/InMedData/Cardio_v1.

  15. pubmed

    • huggingface.co
    Updated Jun 25, 2023
    Cite
    sharadmishra (2023). pubmed [Dataset]. https://huggingface.co/datasets/sharad36/pubmed
    Dataset updated
    Jun 25, 2023
    Authors
    sharadmishra
    Description

    sharad36/pubmed dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. pubmed-page-images

    • huggingface.co
    Updated Mar 6, 2023
    Cite
    PDFPages (2023). pubmed-page-images [Dataset]. https://huggingface.co/datasets/PDFPages/pubmed-page-images
    Dataset updated
    Mar 6, 2023
    Dataset authored and provided by
    PDFPages
    Description

    PDFPages/pubmed-page-images dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. long-covid-classification-data

    • huggingface.co
    Updated Jul 1, 2022
    Cite
    Lisa Langnickel (2022). long-covid-classification-data [Dataset]. https://huggingface.co/datasets/llangnickel/long-covid-classification-data
    Dataset updated
    Jul 1, 2022
    Authors
    Lisa Langnickel
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Data Description

    Long-COVID related articles have been manually collected by information specialists. Please find further information here.

      Size
    

                       Training  Development  Test  Total
    Positive Examples       215           76    70    345
    Negative Examples       199           62    68    345
    Total                   414          238   138    690

      Citation
    

    @article{10.1093/database/baac048,author = {Langnickel, Lisa and Darms, Johannes and Heldt, Katharina and Ducks, Denise and Fluck, Juliane},title = "{Continuous development… See the full description on the dataset page: https://huggingface.co/datasets/llangnickel/long-covid-classification-data.

  18. pubmed-abstracts-dist-noised-v2

    • huggingface.co
    Updated Mar 21, 2024
    Cite
    Gayani Nanayakkara (2024). pubmed-abstracts-dist-noised-v2 [Dataset]. https://huggingface.co/datasets/gayanin/pubmed-abstracts-dist-noised-v2
    Dataset updated
    Mar 21, 2024
    Authors
    Gayani Nanayakkara
    Description

    gayanin/pubmed-abstracts-dist-noised-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. guidelines

    • huggingface.co
    Cite
    EPFL LLM Team, guidelines [Dataset]. https://huggingface.co/datasets/epfl-llm/guidelines
    Dataset authored and provided by
    EPFL LLM Team
    License

    https://choosealicense.com/licenses/other/

    Description

    🎉 NEW DROP 🎉 PubMed Guidelines

    We just added 1627 clinical guidelines found in PubMed and PubMed Central to the dataset on December 23rd, 2023. Merry Christmas!

      Clinical Guidelines
    

    The Clinical Guidelines corpus is a new dataset of 47K clinical practice guidelines from 17 high-quality online medical sources. This dataset serves as a crucial component of the original training corpus of the Meditron Large Language Model (LLM). We publicly release a subset of 37K articles… See the full description on the dataset page: https://huggingface.co/datasets/epfl-llm/guidelines.

  20. open_access

    • huggingface.co
    Updated Jan 7, 2022
    Cite
    PubMed Central (2022). open_access [Dataset]. https://huggingface.co/datasets/pmc/open_access
    Dataset updated
    Jan 7, 2022
    Dataset authored and provided by
    PubMed Central (http://www.pubmedcentral.nih.gov/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The PMC Open Access Subset includes more than 3.4 million journal articles and preprints that are made available under license terms that allow reuse.

    Not all articles in PMC are available for text mining and other reuse; many have copyright protection. However, articles in the PMC Open Access Subset are made available under Creative Commons or similar licenses that generally allow more liberal redistribution and reuse than a traditional copyrighted work.

    The PMC Open Access Subset is one part of the PMC Article Datasets.

Cite
NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed

ncbi/pubmed
3 scholarly articles cite this dataset (View in Google Scholar)