40 datasets found

h
pubmed
huggingface.co
Updated Dec 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed
Explore at:
Dataset updated
Dec 15, 2023
Dataset authored and provided by
NLM/DIR BioNLP Group
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
h
pubmed
huggingface.co
Updated Oct 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aarush (2023). pubmed [Dataset]. https://huggingface.co/datasets/chungimungi/pubmed
Explore at:
Dataset updated
Oct 9, 2023
Authors
Aarush
Description
PubMed dataset in raw XML.

Dataset Summary

Once a year, NLM produces a baseline set of PubMed citation records in XML format for download; the baseline file is a complete snapshot of PubMed data. When using this data in a local database, the best practice is to overwrite your local data each year with the baseline data.

Dataset Structure

XML

Source Data

https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
PubMed Article Summarization Dataset
kaggle.com
zip
Updated Dec 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). PubMed Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset
Explore at:
zip(686033678 bytes)Available download formats
Dataset updated
Dec 5, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
PubMed Article Summarization Dataset

PubMed Summarization Dataset

By ccdv (From Huggingface) [source]

About this dataset

The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.

In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.

Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.

Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.

How to use the dataset

Introduction:

Dataset Structure:

article: The full text of a scientific article from the PubMed database (Text).

abstract: A summary of the main findings and conclusions of the article (Text).

Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:

Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.

Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.

Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.

Tips for Utilizing the Dataset Effectively:

Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.

Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.

Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.

Conclusion:

Research Ideas

Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv | Column name | Description ...
d
PubMed Central (PMC)
catalog.data.gov
datadiscovery.nlm.nih.gov
+3more
Updated Jun 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Library of Medicine (2025). PubMed Central (PMC) [Dataset]. https://catalog.data.gov/dataset/pubmed-central-pmc
Explore at:
Dataset updated
Jun 25, 2025
Dataset provided by
National Library of Medicine
Description
PubMed Central (PMC) is a free, digital archive of full text biomedical and life sciences journal literature.
Pubmed Knowledge Graph Dataset
kaggle.com
zip
Updated Jan 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Krishna Kumar S (2022). Pubmed Knowledge Graph Dataset [Dataset]. https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset
Explore at:
zip(10883016548 bytes)Available download formats
Dataset updated
Jan 7, 2022
Authors
Krishna Kumar S
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

PubMed Knowledge Graph Datasets http://er.tacc.utexas.edu/datasets/ped

Content

Dataset Name : PKG2020S4 (1781-Dec. 2020), Version 4 The new version PKG, PKG2020S4 (1781-Dec. 2020), updated the previous PKG version with PubMed 2021 baseline files, PubMed daily updates files (up to Jan. 4th 2021), and extracted bio-entities, author disambiguation results, extended author information, Scimago that containing journal information, and WOS citations which contains reference relations between PMID and reference PMID and extracted from WOS.

Database Features: 1-PKG2020S4 (1781-Dec. 2020) Features.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/1-PKG2020S4%20(1781-Dec.%202020)%20Features.pdf) Database Description: 2-PKG2020S4 (1781-Dec. 2020) Database Description.pdf (https://web.corral.tacc.utexas.edu/dive_datasets/PKG2020S4/PKG2020S4_MySQL/2-PKG2020S4%20(1781-Dec.%202020)%20Database%20Description.pdf)

Acknowledgements

http://er.tacc.utexas.edu/datasets/ped

Inspiration

http://er.tacc.utexas.edu/datasets/ped
d
MEDLINE/PubMed Citations
catalog.data.gov
healthdata.gov
+3more
Updated Jun 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Library of Medicine (2025). MEDLINE/PubMed Citations [Dataset]. https://catalog.data.gov/dataset/medline-pubmed-citations-d2ed0
Explore at:
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health–both globally and personally. The PubMed database contains citations and abstracts of biomedical literature. It does not include full text journal articles; however, links to the full text are often present when available from other sources, such as the publisher's website or PubMed Central (PMC). See the PubMed User Guide for more information. https://pubmed.ncbi.nlm.nih.gov/help/
I
Parsed Open Citations and PubMed Data
databank.illinois.edu
Updated Feb 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hossein Mohasel Arjomandi; Dmitriy Korobskiy; George Chacko (2024). Parsed Open Citations and PubMed Data [Dataset]. http://doi.org/10.13012/B2IDB-5216575_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-5216575_V1
Dataset updated
Feb 16, 2024
Authors
Hossein Mohasel Arjomandi; Dmitriy Korobskiy; George Chacko
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains five files. (i) open_citations_jan2024_pub_ids.csv.gz, open_citations_jan2024_iid_el.csv.gz, open_citations_jan2024_el.csv.gz, and open_citation_jan2024_pubs.csv.gz represent a conversion of Open Citations to an edge list using integer ids assigned by us. The integer ids can be mapped to omids, pmids, and dois using the open_citation_jan2024_pubs.csv and open_citations_jan2024_pub_ids.scv files. The network consists of 121,052,490 nodes and 1,962,840,983 edges. Code for generating these data can be found https://github.com/chackoge/ERNIE_Plus/tree/master/OpenCitations. (ii) The fifth file, baseline2024.csv.gz, provides information about the metadata of PubMed papers. A 2024 version of PubMed was downloaded using Entrez and parsed into a table restricted to records that contain a pmid, a doi, and has a title and an abstract. A value of 1 in columns indicates that the information exists in metadata and a zero indicates otherwise. Code for generating this data: https://github.com/illinois-or-research-analytics/pubmed_etl. If you use these data or code in your work, please cite https://doi.org/10.13012/B2IDB-5216575_V1.
h
pubmed
huggingface.co
Updated Feb 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MedRAG (2024). pubmed [Dataset]. https://huggingface.co/datasets/MedRAG/pubmed
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 26, 2024
Authors
MedRAG
Description
The PubMed Corpus in MedRAG

This HF dataset contains the snippets from the PubMed corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).

News

(02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.

Dataset Details Dataset Descriptions

PubMed is the most widely used literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/pubmed.
I
Conceptual novelty scores for PubMed articles
databank.illinois.edu
aws-databank-alb.library.illinois.edu
Updated Feb 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shubhanshu Mishra; Vetle I. Torvik (2024). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-5060298_V1
Dataset updated
Feb 1, 2024
Authors
Shubhanshu Mishra; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
U.S. National Science Foundation (NSF)
Description
Conceptual novelty analysis data based on PubMed Medical Subject Headings ---------------------------------------------------------------------- Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018 ## Introduction This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015. The dataset is distributed in the form of the following tab separated text files: * PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follow: - PMID: PubMed ID - Year: year of publication - TimeNovelty: time novelty score of the paper based on individual concepts (see paper) - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper) - PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper) - PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper) * mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follow: - MeshTerm: Name of the MeSH term - Year: year - AbsVal: Total publications with that MeSH term in the given year - TimeNovelty: age (in years since first publication) of MeSH term in the given year - VolumeNovelty: : age (in number of papers since first publication) of MeSH term in the given year * meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term for all years - Mesh1: Name of the first MeSH term (alphabetically sorted) - Mesh2: Name of the second MeSH term (alphabetically sorted) - Year: year - AbsVal: Total publications with that MeSH pair in the given year - TimeNovelty: age (in years since first publication) of MeSH pair in the given year - VolumeNovelty: : age (in number of papers since first publication) of MeSH pair in the given year * README.txt file ## Dataset creation This dataset was constructed using multiple datasets described in the following locations: * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html * MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/ * Source code provided at: https://github.com/napsternxg/Novelty Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions: Additional data related updates can be found at: Torvik Research Group ## Acknowledgments This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742 . The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## License Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
PubMed Central Open Access Subset (PMC OA)
healthdata.gov
datadiscovery.nlm.nih.gov
+3more
csv, xlsx, xml
Updated Sep 1, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
datadiscovery.nlm.nih.gov (2021). PubMed Central Open Access Subset (PMC OA) [Dataset]. https://healthdata.gov/widgets/3vwy-a2x4?mobile_redirect=true
Explore at:
xml, csv, xlsxAvailable download formats
Dataset updated
Sep 1, 2021
Dataset provided by
datadiscovery.nlm.nih.gov
Description
Not all articles in PMC are available for text mining and other reuse, many have copyright protection, however articles in the PMC Open Access Subset are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.
h
pubmed_qa
huggingface.co
Updated Mar 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BigScience Biomedical Datasets (2023). pubmed_qa [Dataset]. https://huggingface.co/datasets/bigbio/pubmed_qa
Explore at:
Dataset updated
Mar 3, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
PubMedQA is a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research biomedical questions with yes/no/maybe using the corresponding abstracts. PubMedQA has 1k expert-annotated (PQA-L), 61.2k unlabeled (PQA-U) and 211.3k artificially generated QA instances (PQA-A).

Each PubMedQA instance is composed of: (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding PubMed abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion.

PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions.

PubMedQA datasets comprise of 3 different subsets: (1) PubMedQA Labeled (PQA-L): A labeled PubMedQA subset comprises of 1k manually annotated yes/no/maybe QA data collected from PubMed articles. (2) PubMedQA Artificial (PQA-A): An artificially labelled PubMedQA subset comprises of 211.3k PubMed articles with automatically generated questions from the statement titles and yes/no answer labels generated using a simple heuristic. (3) PubMedQA Unlabeled (PQA-U): An unlabeled PubMedQA subset comprises of 61.2k context-question pairs data collected from PubMed articles.
h
PubMedQA
huggingface.co
opendatalab.com
Updated Mar 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qiao Jin (2024). PubMedQA [Dataset]. https://huggingface.co/datasets/qiaojin/PubMedQA
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 1, 2024
Authors
Qiao Jin
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for [Dataset Name]

Dataset Summary

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.

Supported Tasks and Leaderboards

The official leaderboard is available at: https://pubmedqa.github.io/. 500 questions in the pqa_labeled are used as the test set. They can be found at… See the full description on the dataset page: https://huggingface.co/datasets/qiaojin/PubMedQA.
pubmed_qa
kaggle.com
zip
Updated Jul 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Trung Đức Nguyễn (2024). pubmed_qa [Dataset]. https://www.kaggle.com/datasets/trungcnguyn/pubmed-qa
Explore at:
zip(206308999 bytes)Available download formats
Dataset updated
Jul 29, 2024
Authors
Trung Đức Nguyễn
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Trung Đức Nguyễn

Released under MIT

Contents
A New Hybrid Citation-Text Model for Scientific Document Clustering
figshare.com
xml
Updated Jan 18, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yangbing Xu; Shuai Zhang; wenyu zhang; dejian yu (2018). A New Hybrid Citation-Text Model for Scientific Document Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.5797977.v1
Explore at:
xmlAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5797977.v1
Dataset updated
Jan 18, 2018
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Yangbing Xu; Shuai Zhang; wenyu zhang; dejian yu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
this is the raw data of the manuscript "a new hybrid citation-text model for scientific document clustering." The data contain two parts. One is the files in xml format, which are download from PMC database. The other is the file in txt format that contains a list of pmid of documents used in the manuscript.
Data from the paper "The landscape of biomedical research"
zenodo.org
bin, zip
Updated Apr 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rita González-Márquez; Luca Schmidt; Benjamin M. Schmidt; Philipp Berens; Dmitry Kobak; Rita González-Márquez; Luca Schmidt; Benjamin M. Schmidt; Philipp Berens; Dmitry Kobak (2024). Data from the paper "The landscape of biomedical research" [Dataset]. http://doi.org/10.5281/zenodo.10992086
Explore at:
bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10992086
Dataset updated
Apr 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Rita González-Márquez; Luca Schmidt; Benjamin M. Schmidt; Philipp Berens; Dmitry Kobak; Rita González-Márquez; Luca Schmidt; Benjamin M. Schmidt; Philipp Berens; Dmitry Kobak
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 18, 2024
Description
Data from the paper "The landscape of biomedical research".

The paper used the PubMed 2020 baseline (download date: 26.01.2021, not available anymore) supplemented with additional files from the 2021 baseline (download date: 27.04.2022, not available anymore), both originally obtained from https://www.nlm.nih.gov/databases/download/pubmed_medline.html, courtesy of the U.S. National Library of Medicine. This data can be found in v2 of this repository (https://zenodo.org/records/7849020).

In the latest version of this repository we provide the PubMed 2024 baseline (download date: 06.02.2024) including all papers until the end of 2023, which is not the main data we analyzed in the paper but an updated version including newer articles. The paper contains two supplementary figures (S9 and S10) with the updated embedding.

The latest version provided here includes the following files:

pubmed_landscape_data_2024.zip, which includes:

- from the PubMed database: article title, journal, PMID, and publication year.

- produced by us: t-SNE embedding X and Y coordinates, label, color, whether the paper is retracted or not (combining PubMed and Retraction Watch information), and affiliation country (from the first affiliation of the first author).

pubmed_landscape_abstracts_2024.zip, which includes:

- from the PubMed database: PMID, and paper abstracts.

PubMedBERT_embeddings_float16_2024.npy, which includes:

- produced by us: PubMedBERT embeddings of the paper abstracts (numpy.ndarray of shape 23,389,083x768).
h
pubmed-rct20k
huggingface.co
opendatalab.com
Updated Apr 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arman Cohan (2023). pubmed-rct20k [Dataset]. https://huggingface.co/datasets/armanc/pubmed-rct20k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 22, 2023
Authors
Arman Cohan
Description
The small 20K version of the Pubmed-RCT dataset by Dernoncourt et al (2017). @article{dernoncourt2017pubmed, title={Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts}, author={Dernoncourt, Franck and Lee, Ji Young}, journal={arXiv preprint arXiv:1710.06071}, year={2017} }

Note: This is the cleaned up version by Jin and Szolovits (2018). @article{jin2018hierarchical, title={Hierarchical neural networks for sequential sentence classification in… See the full description on the dataset page: https://huggingface.co/datasets/armanc/pubmed-rct20k.
d
SJR and PubMed Indexed Medical Journals in 10 Medical Specialties
search.dataone.org
Updated Nov 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kim, Eungi (2023). SJR and PubMed Indexed Medical Journals in 10 Medical Specialties [Dataset]. http://doi.org/10.7910/DVN/2HRPBF
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/2HRPBF
Dataset updated
Nov 12, 2023
Dataset provided by
Harvard Dataverse
Authors
Kim, Eungi
Description
This file contains a list of journals used to assess publication productivity of the top 10 countries across medical specialties. For the 10 medical specialties, the journal category of the 2020 Scientific Journal Rankings (SJR) was used. These journals are listed in both and PubMed. Three types of journal lists are included: a) ALL dataset, b) 30H dataset, and c) 30P dataset. For the 10 medical specialties, the ALL dataset contains all journals, the 30H dataset contains 30 journals with the highest h-index scores, and the 30P dataset contains 30 journals with the highest number of published articles. For these journals, the actual bibliographic records could be downloaded from the NIH website (http://nlm.nih.gov/databases/download/pubmed_medline.html).
Dataset of a Study of Computational reproducibility of Jupyter notebooks...
zenodo.org
pdf, zip
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8226725
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. Our focus lies in evaluating the extent of reproducibility of Jupyter notebooks derived from GitHub repositories linked to publications present in the biomedical literature repository, PubMed Central. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata information of the journals, publications, the Github repositories mentioned in the publications and the notebooks present in the Github repositories.

Data Collection and Analysis

We use the code for reproducibility of Jupyter notebooks from the study done by Pimentel et al., 2019 and adapted the code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.

Our approach involves searching PMC using the esearch function for Jupyter notebooks using the query: ``(ipynb OR jupyter OR ipython) AND github''. We meticulously retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.

All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

Our reproducibility pipeline was started on 27 March 2023.

Repository Structure

Our repository is organized into two main folders:

archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.

analyses: Here, you will find notebooks instrumental in the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaelogy folder is stored in the analyses folder for further analysis. The path can however be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) is focused on examining data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) is for analyzing data associated with publications in PubMed Central, i.e.\ for plots involving data about articles, journals, publication dates or research fields. The resultant figures from the these notebooks are stored in the 'outputs' folder.

MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

Accessing Data and Resources:

All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158

For the latest results and re-run data, refer to this link.

The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.

The metadata in xml format extracted from PubMed Central which contains the information about the articles and journal can be accessed in pmc.xml file.

System Requirements:

Centos 7 (Documentation: https://www.centos.org/)

Conda 4.9.4 (Installation Guide: https://docs.anaconda.com/anaconda/install/linux/)

Python 3.7.6 (Download Link: https://www.python.org/downloads/)

GitHub account (Get Started: https://github.com/, Requires GitHub Username and Token)

gcc 7.3.0 (Installation Guide: https://gcc.gnu.org/install/)

lbzip2 (Command: `conda install -c conda-forge lbzip2')

Running the pipeline:

Clone the computational-reproducibility-pmc repository using Git:
git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git

Navigate to the computational-reproducibility-pmc directory:
cd computational-reproducibility-pmc/computational-reproducibility-pmc

Configure environment variables in the config.py file:
GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")

Other environment variables can also be set in the config.py file.
BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.

To set up conda environments for each python versions, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
source conda-setup.sh

Change to the archaeology directory
cd archaeology

Activate conda environment. We used py36 to run the pipeline.
conda activate py36

Execute the main pipeline script (r0_main.py):
python r0_main.py

Running the analysis:

Navigate to the analysis directory.
cd analyses

Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
conda activate raw38

Install the required packages using the requirements.txt file.
pip install -r requirements.txt

Launch Jupyterlab
jupyter lab

Refer to the Index.ipynb notebook for the execution order and guidance.

References:

Sheeba Samuel, Daniel Mietchen. (2024). Computational reproducibility of Jupyter notebooks from biomedical publications, https://doi.org/10.1093/gigascience/giad113, GigaScience

Sheeba Samuel, Daniel Mietchen. (2022). Computational reproducibility of Jupyter notebooks from biomedical publications, https://arxiv.org/pdf/2209.04308.pdf, CoRR abs/2209.04308

Sheeba Samuel, & Daniel Mietchen. (2022). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6802158
MedRedQA
data.csiro.au
researchdata.edu.au
Updated May 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vincent Nguyen; Sarvnaz Karimi; Maciek Rybinski; Zhenchang Xing (2024). MedRedQA [Dataset]. http://doi.org/10.25919/yn7x-9148
Explore at:
Unique identifier
https://doi.org/10.25919/yn7x-9148
Dataset updated
May 1, 2024
Dataset provided by
CSIROhttp://www.csiro.au/
Authors
Vincent Nguyen; Sarvnaz Karimi; Maciek Rybinski; Zhenchang Xing
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Time period covered
Jul 10, 2013 - Apr 2, 2022
Dataset funded by
CSIROhttp://www.csiro.au/
Australian National University
Description
A large non-factoid English consumer Question Answering (QA) dataset containing 51,000 pairs of consumer questions and their corresponding expert answers. This dataset is useful for bench-marking or training systems on more difficult real-world questions and responses which may contain spelling or formatting errors, or lexical gaps between consumer and expert vocabularies.

By downloading this dataset, you agree to have obtained ethics approval from your institution. Lineage: We collected data from posts and comments to subreddit /r/askdocs, published between July 10, 2013, and April 2, 2022, totalling 600,000 submissions (original posts) and 1,700,000 comments (replies). We generated question-answer pairs by taking the highest scoring answer from a verified medical expert to a Reddit question. Questions with only images are removed, all links are removed and authors are removed.

We provide two separate datasets in this collection and provide the following schemas. MedRedQA - Reddit Medical Question and Answer pairs from /r/askdocs. CSV format. i. the poster's question (Body) ii. Title of the post iii. The filtered answer from a verified physician comment (Response) iv. Occupation indicated for verification status v. Any PMCIDs found in the post

MedRedQA+PubMed - PubMed Enriched subset of MedRedQA. JSON format. i. Question. The user's original question. The is equivalent to the Body field in MedRedQA ii. Document: The abstract of the PubMed document (if it exists and contains an abstract) for that particular post. Note: it does not necessarily mean the answer references this document. But at least one other verified physician in the responses has mentioned that particular document. iii. The filtered response. This is equivalent to the Response field in MedRedQA.
d
Data from: Medical Subject Headings (MeSH)
catalog.data.gov
data.virginia.gov
+2more
Updated Jul 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Library of Medicine (2025). Medical Subject Headings (MeSH) [Dataset]. https://catalog.data.gov/dataset/medical-subject-headings-mesh-812b8
Explore at:
Dataset updated
Jul 17, 2025
Dataset provided by
National Library of Medicine
Description
Medical Subject Headings (MeSH) is a hierarchically-organized terminology for indexing and cataloging of biomedical information. It is used for the indexing of PubMed and other NLM databases. Please see the Terms and Conditions for more information regarding the use and re-use of MeSH. NLM produces Medical Subject Headings XML, ASCII, MARC 21 and RDF formats. Updates to the data files are made according to the following schedule: MeSH XML MeSH Descriptor files updated annually MeSH Qualifier files updated annually MeSH Supplemental Concept Records (SCR) updated daily (Monday - Friday) MeSH ASCII MeSH Descriptor files updated annually MeSH Qualifier files updated annually MeSH Supplemental Concept Records (SCR) updated daily (Monday - Friday) MeSH MARC21 All files posted monthly MeSH RDF All files posted daily (Monday - Friday)

Facebook

Twitter

Click to copy link

Link copied

Cite

NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed

pubmed

PubMed

ncbi/pubmed

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Dec 15, 2023

Dataset authored and provided by

NLM/DIR BioNLP Group

License

https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

Description

NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

Clear search

Close search

Google apps

Main menu

pubmed

pubmed

PubMed Article Summarization Dataset

PubMed Article Summarization Dataset

PubMed Summarization Dataset

About this dataset

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

PubMed Central (PMC)

Pubmed Knowledge Graph Dataset

Context

Content

Acknowledgements

Inspiration

MEDLINE/PubMed Citations

Parsed Open Citations and PubMed Data

pubmed

Conceptual novelty scores for PubMed articles

PubMed Central Open Access Subset (PMC OA)

pubmed_qa

PubMedQA

pubmed_qa

Dataset

Contents

A New Hybrid Citation-Text Model for Scientific Document Clustering

Data from the paper "The landscape of biomedical research"

pubmed-rct20k

SJR and PubMed Indexed Medical Journals in 10 Medical Specialties

Dataset of a Study of Computational reproducibility of Jupyter notebooks...

MedRedQA

Data from: Medical Subject Headings (MeSH)

pubmedSee More Versions

PubMed

ncbi/pubmed

pubmed