40 datasets found
  1. pubmed

    • huggingface.co
    Updated Dec 15, 2023
    + more versions
    Cite
    NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed
    Explore at:
    Dataset updated
    Dec 15, 2023
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

    https://choosealicense.com/licenses/other/

    Description

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
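A baseline file, once decompressed, can be parsed with any streaming or tree XML parser. A minimal sketch using only the standard library (the XML snippet is a toy stand-in for a real citation record):

```python
import xml.etree.ElementTree as ET

# Toy stand-in for one record from a baseline file (e.g. pubmed24n0001.xml.gz).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <ArticleTitle>Example citation record</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def iter_citations(xml_text):
    """Yield (pmid, title) pairs from a PubMed-style XML document."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")
        title = citation.findtext("Article/ArticleTitle")
        yield pmid, title

records = list(iter_citations(SAMPLE))
```

For the real multi-gigabyte files, `ET.iterparse` over the decompressed stream avoids holding the whole tree in memory.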

  2. pubmed

    • huggingface.co
    Updated Oct 9, 2023
    Cite
    Aarush (2023). pubmed [Dataset]. https://huggingface.co/datasets/chungimungi/pubmed
    Explore at:
    Dataset updated
    Oct 9, 2023
    Authors
    Aarush
    Description

    PubMed dataset in raw XML.

      Dataset Summary
    

    Once a year, NLM produces a baseline set of PubMed citation records in XML format for download; the baseline file is a complete snapshot of PubMed data. When using this data in a local database, the best practice is to overwrite your local data each year with the baseline data.

      Dataset Structure
    

    XML

      Source Data
    

    https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

  3. PubMed Article Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). PubMed Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset
    Explore at:
    zip (686033678 bytes). Available download formats
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PubMed Summarization Dataset

    By ccdv (From Huggingface) [source]

    About this dataset

    The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.

    In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.

    Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.

    Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.

    How to use the dataset

    Dataset Structure:

    • article: The full text of a scientific article from the PubMed database (Text).
    • abstract: A summary of the main findings and conclusions of the article (Text).

    Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:

    • Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.

    • Validation Purposes: The validation.csv file serves as a held-out set for fine-tuning your models or comparing different approaches during development.

    • Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.

    Tips for Utilizing the Dataset Effectively:

    • Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.

    • Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.

    • Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

    • Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
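For evaluation, a maintained ROUGE implementation (for example the rouge-score package) is the safe choice; the metric's core idea can be sketched as plain unigram recall:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Unigram recall: fraction of reference tokens covered by the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[t], cand[t]) for t in ref)
    return overlap / max(sum(ref.values()), 1)

# Toy abstract/summary pair for illustration.
score = rouge1_recall("statins reduce atrial fibrillation",
                      "statins reduce fibrillation risk")
```

Real ROUGE also reports precision, F1, bigram (ROUGE-2), and longest-common-subsequence (ROUGE-L) variants, which this sketch omits.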

    Research Ideas

    • Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv | Column name | Description ...

  4. PubMed Central (PMC)

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +3more
    Updated Jun 25, 2025
    Cite
    National Library of Medicine (2025). PubMed Central (PMC) [Dataset]. https://catalog.data.gov/dataset/pubmed-central-pmc
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    National Library of Medicine
    Description

    PubMed Central (PMC) is a free, digital archive of full text biomedical and life sciences journal literature.

  5. Pubmed Knowledge Graph Dataset

    • kaggle.com
    zip
    Updated Jan 7, 2022
    Cite
    Krishna Kumar S (2022). Pubmed Knowledge Graph Dataset [Dataset]. https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset
    Explore at:
    zip (10883016548 bytes). Available download formats
    Dataset updated
    Jan 7, 2022
    Authors
    Krishna Kumar S
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    PubMed Knowledge Graph Datasets http://er.tacc.utexas.edu/datasets/ped

    Content

    Dataset Name: PKG2020S4 (1781-Dec. 2020), Version 4. This version updates the previous PKG with the PubMed 2021 baseline files and PubMed daily update files (up to Jan. 4, 2021), and adds extracted bio-entities, author disambiguation results, extended author information, Scimago journal information, and WOS citations, which contain reference relations between each PMID and its referenced PMIDs as extracted from Web of Science.

    Database Features: 1-PKG2020S4 (1781-Dec. 2020) Features.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/1-PKG2020S4%20(1781-Dec.%202020)%20Features.pdf) Database Description: 2-PKG2020S4 (1781-Dec. 2020) Database Description.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/2-PKG2020S4%20(1781-Dec.%202020)%20Database%20Description.pdf)

    Acknowledgements

    http://er.tacc.utexas.edu/datasets/ped


  6. MEDLINE/PubMed Citations

    • catalog.data.gov
    • healthdata.gov
    • +3more
    Updated Jun 19, 2025
    + more versions
    Cite
    National Library of Medicine (2025). MEDLINE/PubMed Citations [Dataset]. https://catalog.data.gov/dataset/medline-pubmed-citations-d2ed0
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health–both globally and personally. The PubMed database contains citations and abstracts of biomedical literature. It does not include full text journal articles; however, links to the full text are often present when available from other sources, such as the publisher's website or PubMed Central (PMC). See the PubMed User Guide for more information. https://pubmed.ncbi.nlm.nih.gov/help/

  7. Parsed Open Citations and PubMed Data

    • databank.illinois.edu
    Updated Feb 16, 2024
    Cite
    Hossein Mohasel Arjomandi; Dmitriy Korobskiy; George Chacko (2024). Parsed Open Citations and PubMed Data [Dataset]. http://doi.org/10.13012/B2IDB-5216575_V1
    Explore at:
    Dataset updated
    Feb 16, 2024
    Authors
    Hossein Mohasel Arjomandi; Dmitriy Korobskiy; George Chacko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains five files. (i) open_citations_jan2024_pub_ids.csv.gz, open_citations_jan2024_iid_el.csv.gz, open_citations_jan2024_el.csv.gz, and open_citation_jan2024_pubs.csv.gz represent a conversion of Open Citations to an edge list using integer ids assigned by us. The integer ids can be mapped to omids, pmids, and dois using the open_citation_jan2024_pubs.csv and open_citations_jan2024_pub_ids.csv files. The network consists of 121,052,490 nodes and 1,962,840,983 edges. Code for generating these data can be found at https://github.com/chackoge/ERNIE_Plus/tree/master/OpenCitations. (ii) The fifth file, baseline2024.csv.gz, provides metadata for PubMed papers. A 2024 version of PubMed was downloaded using Entrez and parsed into a table restricted to records that contain a pmid, a doi, a title, and an abstract. A value of 1 in a column indicates that the information exists in the metadata; a zero indicates otherwise. Code for generating these data: https://github.com/illinois-or-research-analytics/pubmed_etl. If you use these data or code in your work, please cite https://doi.org/10.13012/B2IDB-5216575_V1.
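The integer-id-to-pmid join described above can be sketched with the standard csv module; the column names in the toy CSV below are assumptions for illustration, not the file's actual header:

```python
import csv
import io

# Hypothetical mapping rows: the real pubs file maps the assigned integer ids
# to omid/pmid/doi. Column names here are assumed, not the file's real header.
PUBS_CSV = "int_id,pmid\n1,11111111\n2,22222222\n3,33333333\n"
EDGES = [(1, 2), (1, 3)]  # citing -> cited, as integer node ids

# Build the lookup table, then translate the edge list to PMID space.
id_to_pmid = {int(row["int_id"]): row["pmid"]
              for row in csv.DictReader(io.StringIO(PUBS_CSV))}
pmid_edges = [(id_to_pmid[a], id_to_pmid[b]) for a, b in EDGES]
```

At full scale (1.96 billion edges) the same join would be done in chunks or in a database rather than with an in-memory dict of Python strings.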

  8. pubmed

    • huggingface.co
    Updated Feb 26, 2024
    + more versions
    Cite
    MedRAG (2024). pubmed [Dataset]. https://huggingface.co/datasets/MedRAG/pubmed
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Feb 26, 2024
    Authors
    MedRAG
    Description

    The PubMed Corpus in MedRAG

    This HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).

      News
    

    (02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.

      Dataset Descriptions
    

    PubMed is the most widely used literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.
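MedRAG itself pairs these snippets with trained retrievers, but the retrieve-then-read loop can be illustrated with a crude lexical scorer over toy snippets:

```python
def score(query, snippet):
    """Crude lexical relevance: count of snippet tokens that appear in the query."""
    q = set(query.lower().split())
    return sum(1 for t in snippet.lower().split() if t in q)

# Fabricated snippet corpus keyed by a PMID-style id.
snippets = {
    "PMID:1": "statins and atrial fibrillation after surgery",
    "PMID:2": "influenza vaccination in older adults",
}

query = "atrial fibrillation statins"
best = max(snippets, key=lambda k: score(query, snippets[k]))
```

In a real RAG pipeline the top-scoring snippets would be concatenated into the prompt of a generator model; dense (embedding-based) retrieval replaces this word-overlap score.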

  9. Conceptual novelty scores for PubMed articles

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Feb 1, 2024
    Cite
    Shubhanshu Mishra; Vetle I. Torvik (2024). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
    Explore at:
    Dataset updated
    Feb 1, 2024
    Authors
    Shubhanshu Mishra; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Conceptual novelty analysis data based on PubMed Medical Subject Headings. Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018.

    Introduction

    This dataset was created as part of the publication: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra. It contains the final data generated in our experiments, based on the MEDLINE 2015 baseline and the 2015 MeSH tree. The dataset is distributed as the following tab-separated text files:

    • PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns:
      - PMID: PubMed ID
      - Year: year of publication
      - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
      - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
      - PairTimeNovelty: time novelty score of the paper based on pairs of concepts (see paper)
      - PairVolumeNovelty: volume novelty score of the paper based on pairs of concepts (see paper)
    • mesh_scores.tsv - Temporal profiles for each MeSH term across all years. The file contains 1,102,831 rows and 5 columns:
      - MeshTerm: name of the MeSH term
      - Year: year
      - AbsVal: total publications with that MeSH term in the given year
      - TimeNovelty: age (in years since first publication) of the MeSH term in the given year
      - VolumeNovelty: age (in number of papers since first publication) of the MeSH term in the given year
    • meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term pair across all years:
      - Mesh1: name of the first MeSH term (alphabetically sorted)
      - Mesh2: name of the second MeSH term (alphabetically sorted)
      - Year: year
      - AbsVal: total publications with that MeSH pair in the given year
      - TimeNovelty: age (in years since first publication) of the MeSH pair in the given year
      - VolumeNovelty: age (in number of papers since first publication) of the MeSH pair in the given year
    • README.txt

    Dataset creation

    This dataset was constructed from the following sources:

    • MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    • MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
    • Source code: https://github.com/napsternxg/Novelty

    Note: the dataset is based on a snapshot of PubMed (which includes MEDLINE and PubMed-not-MEDLINE records) taken in the first week of October 2016. Additional related updates can be found at the Torvik Research Group.

    Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    License

    Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle I. Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty.
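The novelty scores file is plain tab-separated text, so the standard csv module suffices to read it; a sketch over fabricated rows with the six documented columns:

```python
import csv
import io

# Two fabricated rows mimicking PubMed2015_NoveltyData.tsv's six columns.
TSV = ("PMID\tYear\tTimeNovelty\tVolumeNovelty\tPairTimeNovelty\tPairVolumeNovelty\n"
       "100\t2010\t3.5\t120\t5.0\t200\n"
       "101\t2010\t1.0\t40\t2.0\t80\n")

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
# Example aggregate: mean time novelty across the rows.
mean_time_novelty = sum(float(r["TimeNovelty"]) for r in rows) / len(rows)
```

For the full 22.3M-row file, iterating over the reader lazily (or using a chunked dataframe reader) avoids materializing everything in memory.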

  10. PubMed Central Open Access Subset (PMC OA)

    • healthdata.gov
    • datadiscovery.nlm.nih.gov
    • +3more
    csv, xlsx, xml
    Updated Sep 1, 2021
    + more versions
    Cite
    datadiscovery.nlm.nih.gov (2021). PubMed Central Open Access Subset (PMC OA) [Dataset]. https://healthdata.gov/widgets/3vwy-a2x4?mobile_redirect=true
    Explore at:
    xml, csv, xlsx. Available download formats
    Dataset updated
    Sep 1, 2021
    Dataset provided by
    datadiscovery.nlm.nih.gov
    Description

    Not all articles in PMC are available for text mining and other reuse; many are under copyright protection. Articles in the PMC Open Access Subset, however, are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.

  11. pubmed_qa

    • huggingface.co
    Updated Mar 3, 2023
    Cite
    BigScience Biomedical Datasets (2023). pubmed_qa [Dataset]. https://huggingface.co/datasets/bigbio/pubmed_qa
    Explore at:
    Dataset updated
    Mar 3, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    PubMedQA is a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research biomedical questions with yes/no/maybe using the corresponding abstracts. PubMedQA has 1k expert-annotated (PQA-L), 61.2k unlabeled (PQA-U) and 211.3k artificially generated QA instances (PQA-A).

    Each PubMedQA instance is composed of: (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding PubMed abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion.

    PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions.

    The PubMedQA dataset comprises three subsets: (1) PubMedQA Labeled (PQA-L): 1k manually annotated yes/no/maybe QA instances collected from PubMed articles. (2) PubMedQA Artificial (PQA-A): 211.3k PubMed articles with questions automatically generated from statement titles and yes/no answer labels generated by a simple heuristic. (3) PubMedQA Unlabeled (PQA-U): 61.2k context-question pairs collected from PubMed articles.
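As an illustration of the instance layout described above (field names follow the dataset description; the example values are fabricated), together with the plain accuracy computation normally used for the yes/no/maybe task:

```python
# One toy PQA-L-style instance: question, context (abstract minus conclusion),
# long answer (the conclusion), and the final yes/no/maybe decision.
instance = {
    "question": "Do preoperative statins reduce atrial fibrillation?",
    "context": "We compared statin users and non-users before surgery ...",
    "long_answer": "Statin use was associated with reduced incidence.",
    "final_decision": "yes",
}

def accuracy(gold, pred):
    """Fraction of predictions matching the gold yes/no/maybe labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

acc = accuracy(["yes", "no", "maybe"], ["yes", "no", "no"])
```

Because the label set is three-way and skewed toward "yes" in PQA-L, macro-F1 is often reported alongside accuracy.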

  12. PubMedQA

    • huggingface.co
    • opendatalab.com
    Updated Mar 1, 2024
    Cite
    Qiao Jin (2024). PubMedQA [Dataset]. https://huggingface.co/datasets/qiaojin/PubMedQA
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 1, 2024
    Authors
    Qiao Jin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.

      Supported Tasks and Leaderboards
    

    The official leaderboard is available at: https://pubmedqa.github.io/. 500 questions in the pqa_labeled are used as the test set. They can be found at… See the full description on the dataset page: https://huggingface.co/datasets/qiaojin/PubMedQA.

  13. pubmed_qa

    • kaggle.com
    zip
    Updated Jul 29, 2024
    Cite
    Trung Đức Nguyễn (2024). pubmed_qa [Dataset]. https://www.kaggle.com/datasets/trungcnguyn/pubmed-qa
    Explore at:
    zip (206308999 bytes). Available download formats
    Dataset updated
    Jul 29, 2024
    Authors
    Trung Đức Nguyễn
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Trung Đức Nguyễn

    Released under MIT


  14. A New Hybrid Citation-Text Model for Scientific Document Clustering

    • figshare.com
    xml
    Updated Jan 18, 2018
    Cite
    Yangbing Xu; Shuai Zhang; Wenyu Zhang; Dejian Yu (2018). A New Hybrid Citation-Text Model for Scientific Document Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.5797977.v1
    Explore at:
    xml. Available download formats
    Dataset updated
    Jan 18, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yangbing Xu; Shuai Zhang; Wenyu Zhang; Dejian Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the raw data of the manuscript "A New Hybrid Citation-Text Model for Scientific Document Clustering." The data contain two parts: one is the set of files in XML format, downloaded from the PMC database; the other is a TXT file containing a list of PMIDs of the documents used in the manuscript.

  15. Data from the paper "The landscape of biomedical research"

    • zenodo.org
    bin, zip
    Updated Apr 19, 2024
    Cite
    Rita González-Márquez; Luca Schmidt; Benjamin M. Schmidt; Philipp Berens; Dmitry Kobak (2024). Data from the paper "The landscape of biomedical research" [Dataset]. http://doi.org/10.5281/zenodo.10992086
    Explore at:
    bin, zip. Available download formats
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rita González-Márquez; Luca Schmidt; Benjamin M. Schmidt; Philipp Berens; Dmitry Kobak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 18, 2024
    Description

    Data from the paper "The landscape of biomedical research".

    The paper used the PubMed 2020 baseline (download date: 26.01.2021, not available anymore) supplemented with additional files from the 2021 baseline (download date: 27.04.2022, not available anymore), both originally obtained from https://www.nlm.nih.gov/databases/download/pubmed_medline.html, courtesy of the U.S. National Library of Medicine. This data can be found in v2 of this repository (https://zenodo.org/records/7849020).

    In the latest version of this repository we provide the PubMed 2024 baseline (download date: 06.02.2024) including all papers until the end of 2023, which is not the main data we analyzed in the paper but an updated version including newer articles. The paper contains two supplementary figures (S9 and S10) with the updated embedding.

    The latest version provided here includes the following files:

    pubmed_landscape_data_2024.zip, which includes:

    - from the PubMed database: article title, journal, PMID, and publication year.

    - produced by us: t-SNE embedding X and Y coordinates, label, color, whether the paper is retracted or not (combining PubMed and Retraction Watch information), and affiliation country (from the first affiliation of the first author).

    pubmed_landscape_abstracts_2024.zip, which includes:

    - from the PubMed database: PMID, and paper abstracts.

    PubMedBERT_embeddings_float16_2024.npy, which includes:

    - produced by us: PubMedBERT embeddings of the paper abstracts (numpy.ndarray of shape 23,389,083x768).
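The full embedding matrix is large (23,389,083 × 768 float16 values, roughly 36 GB), so memory-mapping is the practical way to work with it. A sketch with a small stand-in array:

```python
import os
import tempfile

import numpy as np

# Small stand-in for PubMedBERT_embeddings_float16_2024.npy; the real array
# is ~23.4M x 768 float16, so mmap avoids loading ~36 GB into RAM at once.
path = os.path.join(tempfile.mkdtemp(), "emb.npy")
np.save(path, np.zeros((10, 768), dtype=np.float16))

emb = np.load(path, mmap_mode="r")            # lazy, reads pages from disk
first = np.asarray(emb[0], dtype=np.float32)  # upcast one row for computation
```

Slicing a memmap only touches the pages backing that slice, so row-wise scans (e.g. nearest-neighbor queries in batches) stay within modest memory budgets.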

  16. pubmed-rct20k

    • huggingface.co
    • opendatalab.com
    Updated Apr 22, 2023
    Cite
    Arman Cohan (2023). pubmed-rct20k [Dataset]. https://huggingface.co/datasets/armanc/pubmed-rct20k
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 22, 2023
    Authors
    Arman Cohan
    Description

    The small 20K version of the Pubmed-RCT dataset by Dernoncourt et al. (2017).

    @article{dernoncourt2017pubmed,
      title={Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts},
      author={Dernoncourt, Franck and Lee, Ji Young},
      journal={arXiv preprint arXiv:1710.06071},
      year={2017}
    }

    Note: This is the cleaned up version by Jin and Szolovits (2018). @article{jin2018hierarchical, title={Hierarchical neural networks for sequential sentence classification in… See the full description on the dataset page: https://huggingface.co/datasets/armanc/pubmed-rct20k.
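The HF version exposes parsed fields directly, but the original Pubmed-RCT text files can be parsed with a few lines; the layout assumed here is a `###<PMID>` header line followed by `LABEL<tab>sentence` lines per abstract:

```python
# Fabricated input in the assumed Pubmed-RCT text layout.
RAW = """###1234
OBJECTIVE\tTo test a drug .
METHODS\tWe ran a trial .
RESULTS\tIt worked .
"""

def parse_rct(text):
    """Return {pmid: [(label, sentence), ...]} from RCT-style text."""
    abstracts, current = {}, None
    for line in text.splitlines():
        if line.startswith("###"):
            current = line[3:]
            abstracts[current] = []
        elif line.strip():
            label, sentence = line.split("\t", 1)
            abstracts[current].append((label, sentence))
    return abstracts

parsed = parse_rct(RAW)
```

Each (label, sentence) pair is one training example for sequential sentence classification; the per-abstract grouping preserves the sentence order the task depends on.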

  17. SJR and PubMed Indexed Medical Journals in 10 Medical Specialties

    • search.dataone.org
    Updated Nov 12, 2023
    Cite
    Kim, Eungi (2023). SJR and PubMed Indexed Medical Journals in 10 Medical Specialties [Dataset]. http://doi.org/10.7910/DVN/2HRPBF
    Explore at:
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Kim, Eungi
    Description

    This file contains a list of journals used to assess the publication productivity of the top 10 countries across medical specialties. For the 10 medical specialties, the journal categories of the 2020 Scientific Journal Rankings (SJR) were used. These journals are indexed in both SJR and PubMed. Three types of journal lists are included: a) ALL dataset, b) 30H dataset, and c) 30P dataset. For the 10 medical specialties, the ALL dataset contains all journals, the 30H dataset contains the 30 journals with the highest h-index scores, and the 30P dataset contains the 30 journals with the highest number of published articles. For these journals, the actual bibliographic records can be downloaded from the NIH website (http://nlm.nih.gov/databases/download/pubmed_medline.html).

  18. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    pdf, zip
    Updated Jul 11, 2024
    Cite
    Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    zip, pdf. Available download formats
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sheeba Samuel; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We used the code for reproducibility of Jupyter notebooks from the study by Pimentel et al. (2019) and adapted code from ReproduceMeGit. We provide code for collecting publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query: "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
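The search runs through Biopython's Entrez wrapper in practice; the underlying E-utilities esearch request can be sketched offline by building the URL (the retmax value here is an arbitrary illustration):

```python
from urllib.parse import urlencode

# NCBI's standard E-utilities endpoint for esearch.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query used in the study; retmax is an arbitrary page size for this sketch.
params = {
    "db": "pmc",
    "term": "(ipynb OR jupyter OR ipython) AND github",
    "retmax": 100,
}
url = BASE + "?" + urlencode(params)
```

Issuing a GET against this URL returns an XML result set of PMC ids, which efetch/esummary calls then expand into full article records; NCBI asks clients to throttle requests and supply an API key for higher rate limits.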

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here you will find notebooks used for the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology folder is stored in the analyses folder for further analysis; the path can be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) examines data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) analyzes data associated with publications in PubMed Central, i.e., plots involving data about articles, journals, publication dates, or research fields. The resulting figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.
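    The two analysis notebook sets can be told apart by filename; a small helper sketching this (the patterns mirror the naming convention above, the function name is our own):

```python
import re

# Filename patterns for the two notebook sets in the analyses folder.
NOTEBOOK_SETS = {
    "repositories_and_notebooks": re.compile(r"^N\d+.*\.ipynb$"),
    "pubmed_central": re.compile(r"^PMC\d+.*\.ipynb$"),
}

def classify_notebook(filename: str) -> str:
    """Return which analysis set a notebook belongs to, or 'other'."""
    for label, pattern in NOTEBOOK_SETS.items():
        if pattern.match(filename):
            return label
    return "other"
```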

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata extracted from PubMed Central in XML format, containing the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate the conda environment; we used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate the conda environment; we used raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch JupyterLab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.

    References:

  19. MedRedQA

    • data.csiro.au
    • researchdata.edu.au
    Updated May 1, 2024
    Vincent Nguyen; Sarvnaz Karimi; Maciek Rybinski; Zhenchang Xing (2024). MedRedQA [Dataset]. http://doi.org/10.25919/yn7x-9148
    Explore at:
    Dataset updated
    May 1, 2024
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Vincent Nguyen; Sarvnaz Karimi; Maciek Rybinski; Zhenchang Xing
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Jul 10, 2013 - Apr 2, 2022
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Australian National University
    Description

    A large non-factoid English consumer Question Answering (QA) dataset containing 51,000 pairs of consumer questions and their corresponding expert answers. This dataset is useful for benchmarking or training systems on more difficult real-world questions and responses, which may contain spelling or formatting errors, or lexical gaps between consumer and expert vocabularies.

    By downloading this dataset, you agree to have obtained ethics approval from your institution. Lineage: We collected data from posts and comments on the subreddit /r/askdocs published between July 10, 2013 and April 2, 2022, totalling 600,000 submissions (original posts) and 1,700,000 comments (replies). We generated question-answer pairs by taking the highest-scoring answer from a verified medical expert to each Reddit question. Questions containing only images were removed, and all links and author names were stripped.
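    The pair-generation step amounts to selecting, per question, the top-scoring reply from a verified expert; a minimal sketch (field names here are invented for illustration, not the Reddit API's):

```python
def best_expert_answer(comments):
    """Return the highest-scoring comment flagged as expert-verified, or None."""
    expert_replies = [c for c in comments if c.get("verified_expert")]
    return max(expert_replies, key=lambda c: c["score"], default=None)

# Toy example: the unverified reply has the highest score but is skipped.
replies = [
    {"verified_expert": True, "score": 12, "body": "See a physician if..."},
    {"verified_expert": False, "score": 40, "body": "I had the same thing!"},
    {"verified_expert": True, "score": 3, "body": "Likely benign, but..."},
]
best = best_expert_answer(replies)
```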

    We provide two separate datasets in this collection, with the following schemas. MedRedQA: Reddit medical question-and-answer pairs from /r/askdocs. CSV format. i. Body: the poster's question ii. Title: the title of the post iii. Response: the filtered answer from a verified physician's comment iv. Occupation: the occupation indicated for verification status v. PMCIDs: any PMCIDs found in the post

    MedRedQA+PubMed: PubMed-enriched subset of MedRedQA. JSON format. i. Question: the user's original question (equivalent to the Body field in MedRedQA) ii. Document: the abstract of the PubMed document for that post, if one exists and contains an abstract. Note: the answer does not necessarily reference this document, but at least one verified physician among the responses mentioned it. iii. Response: the filtered response (equivalent to the Response field in MedRedQA)
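    Putting the JSON schema together, a single MedRedQA+PubMed record has roughly the following shape (all values here are invented placeholders, not real dataset content):

```python
import json

# Hypothetical record illustrating the three fields described above.
record = {
    "Question": "Is a mildly elevated ALT something to worry about?",
    "Document": "Abstract of the PubMed article mentioned by a verified physician.",
    "Response": "Mildly elevated ALT can have many causes; discuss with your GP.",
}

# Records serialize to and from JSON without loss.
serialized = json.dumps(record)
restored = json.loads(serialized)
```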

  20. d

    Data from: Medical Subject Headings (MeSH)

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Jul 17, 2025
    + more versions
    National Library of Medicine (2025). Medical Subject Headings (MeSH) [Dataset]. https://catalog.data.gov/dataset/medical-subject-headings-mesh-812b8
    Explore at:
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    National Library of Medicine
    Description

    Medical Subject Headings (MeSH) is a hierarchically organized terminology for indexing and cataloging biomedical information. It is used for the indexing of PubMed and other NLM databases. Please see the Terms and Conditions for more information regarding the use and re-use of MeSH. NLM produces MeSH in XML, ASCII, MARC 21, and RDF formats. Updates to the data files are made according to the following schedule:
    • MeSH XML: Descriptor and Qualifier files updated annually; Supplemental Concept Records (SCR) updated daily (Monday - Friday)
    • MeSH ASCII: Descriptor and Qualifier files updated annually; SCR updated daily (Monday - Friday)
    • MeSH MARC21: all files posted monthly
    • MeSH RDF: all files posted daily (Monday - Friday)

