56 datasets found

pubmed25
huggingface.co
Updated Apr 26, 2025
Cite
Hà Huy Hoàng (2025). pubmed25 [Dataset]. https://huggingface.co/datasets/HoangHa/pubmed25
Explore at:
Dataset updated
Apr 26, 2025
Authors
Hà Huy Hoàng
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information. This version is modified to extract the full text from structured abstracts.
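The relationship between the annual baseline and the daily update files can be pictured with a minimal sketch. It assumes records are held in a dictionary keyed by PMID and that each update file has already been parsed into a list of new or revised records plus a list of deleted PMIDs; the function name `apply_update` and the record layout are hypothetical and are not part of any NLM tooling.

```python
from typing import Dict, Iterable


def apply_update(
    baseline: Dict[int, dict],
    new_or_revised: Iterable[dict],
    deleted_pmids: Iterable[int],
) -> Dict[int, dict]:
    """Return a copy of `baseline` with one (hypothetical) update file applied."""
    merged = dict(baseline)
    for record in new_or_revised:
        # A new citation is added; a revised citation replaces the old one.
        merged[record["pmid"]] = record
    for pmid in deleted_pmids:
        # Assumption: deletions are applied after additions within one file.
        merged.pop(pmid, None)
    return merged


baseline = {1: {"pmid": 1, "title": "A"}, 2: {"pmid": 2, "title": "B"}}
updated = apply_update(
    baseline,
    new_or_revised=[{"pmid": 2, "title": "B (revised)"}, {"pmid": 3, "title": "C"}],
    deleted_pmids=[1],
)
print(sorted(updated))
```

Update files would need to be applied in release order for the result to match the current state of PubMed.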
Pubmed Baseline 2021-12-12
academictorrents.com
bittorrent
Updated May 22, 2022
Cite
National Institutes of Health and National Library of Medicine (2022). Pubmed Baseline 2021-12-12 [Dataset]. https://academictorrents.com/details/cc12294ab7bf1f730738a6bff89052ad8156d8d8
Explore at:
Available download formats: bittorrent (37381448042)
Dataset updated
May 22, 2022
Dataset authored and provided by
National Institutes of Health and National Library of Medicine
License
https://academictorrents.com/nolicensespecified
Description
Just the baseline files, no update files. md5 sums included and checked before upload. From :

The PubMed Baseline Repository and Daily Update files
Last Updated December 13, 2022
All questions should be directed to: National Center for Biotechnology Information, info@ncbi.nlm.nih.gov

This document describes the PubMed Database available on the NCBI FTP site under the and directories. PubMed comprises more than 31 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publish
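The md5 check mentioned in the description above could look roughly like the following sketch. It uses only Python's standard `hashlib`; the helper names `md5_of_stream` and `matches` are hypothetical, and the in-memory sample stands in for a downloaded baseline file and its published checksum.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Compute the hex md5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches(stream, expected_hex):
    """Compare a stream's digest with a published checksum string."""
    return md5_of_stream(stream) == expected_hex.strip().lower()


# Stand-in for a real file opened with open(path, "rb").
sample = io.BytesIO(b"hello")
print(matches(sample, "5d41402abc4b2a76b9719d911017c592"))
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte archives.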
Conceptual novelty scores for PubMed articles
databank.illinois.edu
Updated Apr 27, 2018
Cite
Shubhanshu Mishra; Vetle I. Torvik (2018). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-5060298_V1
Dataset updated
Apr 27, 2018
Authors
Shubhanshu Mishra; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
U.S. National Science Foundation (NSF)
Description
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------
Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

## Introduction
This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015. The dataset is distributed in the form of the following tab separated text files:

* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
  - PMID: PubMed ID
  - Year: year of publication
  - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
  - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
  - PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper)
  - PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
  - MeshTerm: Name of the MeSH term
  - Year: year
  - AbsVal: Total publications with that MeSH term in the given year
  - TimeNovelty: age (in years since first publication) of MeSH term in the given year
  - VolumeNovelty: age (in number of papers since first publication) of MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term for all years
  - Mesh1: Name of the first MeSH term (alphabetically sorted)
  - Mesh2: Name of the second MeSH term (alphabetically sorted)
  - Year: year
  - AbsVal: Total publications with that MeSH pair in the given year
  - TimeNovelty: age (in years since first publication) of MeSH pair in the given year
  - VolumeNovelty: age (in number of papers since first publication) of MeSH pair in the given year
* README.txt file

## Dataset creation
This dataset was constructed using multiple datasets described in the following locations:

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
* Source code provided at: https://github.com/napsternxg/Novelty

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLM's data Terms and Conditions: Additional data related updates can be found at: Torvik Research Group

## Acknowledgments
This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License
Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
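As an illustration of the six-column layout described for PubMed2015_NoveltyData.tsv, the sketch below parses tab-separated rows with Python's standard `csv` module. It assumes the file carries a header row with exactly the column names listed above and that every score is numeric; neither is confirmed by the description, and the sample values are invented.

```python
import csv
import io

COLUMNS = [
    "PMID", "Year", "TimeNovelty", "VolumeNovelty",
    "PairTimeNovelty", "PairVolumeNovelty",
]


def read_novelty_rows(handle):
    """Yield one dict per row, converting IDs to int and scores to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {
            "PMID": int(row["PMID"]),
            "Year": int(row["Year"]),
            **{name: float(row[name]) for name in COLUMNS[2:]},
        }


# Invented sample rows; a real run would pass open("PubMed2015_NoveltyData.tsv").
sample = io.StringIO(
    "PMID\tYear\tTimeNovelty\tVolumeNovelty\tPairTimeNovelty\tPairVolumeNovelty\n"
    "101\t1990\t3\t250\t0\t1\n"
    "102\t1991\t12\t9000\t4\t37\n"
)
rows = list(read_novelty_rows(sample))
print(rows[0])
```

Because the generator yields rows one at a time, the same function can stream a file with tens of millions of rows without loading it whole.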
MEDLINE/PubMed Baseline Statistics: Min/Max Report
datadiscovery.nlm.nih.gov
data.virginia.gov
+3more
xlsx, xml
Updated Mar 22, 2023
Cite
(2023). MEDLINE/PubMed Baseline Statistics: Min/Max Report [Dataset]. https://datadiscovery.nlm.nih.gov/d/xtvu-8j6w
Explore at:
Available download formats: xml, xlsx
Dataset updated
Mar 22, 2023
Description
A file containing all Min/Max Baseline Reports for 2005-2023 in their original format is available in the Attachments section below. A second file includes a separate set of reports, made available from 2002-2017, that did not include OLDMEDLINE records.

MEDLINE/PubMed annual statistical reports are based upon the data elements in the baseline versions of MEDLINE®/PubMed. For each year covered, the reports include: total citations containing each element; total occurrences of each element; minimum/average/maximum occurrences of each element in a record; minimum/average/maximum length of a single element occurrence; average record size; and other statistical data describing the content and size of the elements.
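The per-element counts described above can be sketched on toy XML records. This is only an illustration of the kind of statistic reported, not NLM's procedure: the element names are invented, and the minimum and average are taken over records that contain the element at least once, which is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_stats(records_xml):
    """Per element name: records containing it, total, and min/avg/max per record."""
    per_record_counts = defaultdict(list)
    for xml_text in records_xml:
        counts = defaultdict(int)
        # iter() walks the root element and all of its descendants.
        for element in ET.fromstring(xml_text).iter():
            counts[element.tag] += 1
        for tag, n in counts.items():
            per_record_counts[tag].append(n)
    return {
        tag: {
            "records": len(counts),
            "total": sum(counts),
            "min": min(counts),
            "avg": sum(counts) / len(counts),
            "max": max(counts),
        }
        for tag, counts in per_record_counts.items()
    }


# Toy records with invented element names.
records = [
    "<Citation><Author>A</Author><Author>B</Author><Title>T1</Title></Citation>",
    "<Citation><Author>C</Author><Title>T2</Title></Citation>",
]
stats = element_stats(records)
print(stats["Author"])
```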
Hype - PubMed dataset
databank.illinois.edu
Updated Mar 14, 2025
Cite
Apratim Mishra; Jana Diesner; Vetle I. Torvik (2025). Hype - PubMed dataset [Dataset]. http://doi.org/10.13012/B2IDB-0651259_V3
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-0651259_V3
Dataset updated
Mar 14, 2025
Authors
Apratim Mishra; Jana Diesner; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Hype - PubMed dataset
Prepared by Apratim Mishra

This dataset captures 'Hype' within biomedical abstracts sourced from PubMed. The selection chosen is 'journal articles' written in English, published between 1975 and 2019, totaling ~5.2 million. The classification relies on the presence of specific candidate 'hype words' and their abstract location. Therefore, each article (PMID) might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences. The candidate hype words are 35 in count: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.

This is version 3 of the dataset. Added new file - WSD_hype.tsv

File 1: hype_dataset_final.tsv
Primary dataset. It has the following columns:
1. PMID: represents unique article ID in PubMed
2. Year: Year of publication
3. Hype_word: Candidate hype word, such as 'novel.'
4. Sentence: Sentence in abstract containing the hype word.
5. Hype_percentile: Abstract relative position of hype word.
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
7. Introduction: The 'I' component of the hype word based on IMRaD
8. Methods: The 'M' component of the hype word based on IMRaD
9. Results: The 'R' component of the hype word based on IMRaD
10. Discussion: The 'D' component of the hype word based on IMRaD

File 2: hype_removed_phrases_final.tsv
Secondary dataset with same columns as File 1. Hype in the primary dataset is based on excluding certain phrases that are rarely hype. The phrases that were removed are included in File 2 and modeled separately. Removed phrases:
1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values
5. Essential: medium, features, properties, opportunities, oil
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, questions, challenge, problems, problem, remains
10. Remarkable: properties
11. Definite: radiotherapy, surgery

File 3: WSD_hype.tsv
Includes hype-based disambiguation for candidate words targeted for WSD (Word sense disambiguation)
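To make the idea of candidate hype words and removed phrases concrete, here is a rough sketch that is not the authors' classifier. It assumes a removed phrase means a hype word immediately followed by one of the listed words, and it defines relative position as the 1-based sentence index divided by the number of sentences; both definitions are guesses made for illustration, and only a small subset of the word lists is included.

```python
import re

# Small subsets of the lists in the description, for illustration only.
HYPE_WORDS = {"major", "novel", "critical", "robust"}
REMOVED_PHRASES = {
    "major": {"histocompatibility", "surgery"},
    "novel": {"assay", "mutation"},
    "robust": {"regression"},
}


def find_hype(sentences):
    """Return (hype_word, sentence_index, relative_position) for each kept match."""
    hits = []
    for index, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token not in HYPE_WORDS:
                continue
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            # Assumed rule: skip the match when the next word forms a removed phrase.
            if following in REMOVED_PHRASES.get(token, set()):
                continue
            hits.append((token, index, (index + 1) / len(sentences)))
    return hits


abstract = [
    "We describe a novel assay for the major histocompatibility complex.",
    "The method gave robust results.",
    "This is a major advance.",
]
print(find_hype(abstract))
```

In this toy abstract the first sentence yields no matches because both candidate words fall inside removed phrases, which mirrors why an article can appear zero, one, or several times in the primary file.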
PubMed article IDs
kaggle.com
Updated Oct 16, 2023
Cite
Aarush Sinha (2023). PubMed article IDs [Dataset]. https://www.kaggle.com/datasets/chungimungi/pmidfinal/code
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Oct 16, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aarush Sinha
Description
A dataset of PMIDs for PubMed articles sourced from the OA Subset. For each article it contains the article's PMID and the PMIDs of the articles it cites. The data was gathered with a parsing script (PubMed parser on GitHub) that uses the PMIDs to iterate through the website and collect the required information.
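A mapping from each PMID to the PMIDs it cites can be inverted to answer "who cites this article". The sketch below assumes the data has already been loaded into a plain dictionary; the function name `build_cited_by` and the sample PMIDs are hypothetical.

```python
from collections import defaultdict


def build_cited_by(citations):
    """Invert {pmid: [cited pmids]} into {cited pmid: sorted list of citing pmids}."""
    cited_by = defaultdict(set)
    for citing, cited_list in citations.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return {pmid: sorted(citers) for pmid, citers in cited_by.items()}


# Invented PMIDs for illustration.
citations = {100: [1, 2], 200: [2], 300: []}
cited_by = build_cited_by(citations)
print(cited_by)
```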
pubmed_clean
huggingface.co
Updated Dec 23, 2024
Cite
Amish (2024). pubmed_clean [Dataset]. https://huggingface.co/datasets/darknight054/pubmed_clean
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Dec 23, 2024
Authors
Amish
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
A cleaned dataset built from the commercially available PubMed files. The script used for cleaning will be updated soon.
Diversity - PubMed Dataset
aws-databank-alb.library.illinois.edu
databank.illinois.edu
Updated Oct 11, 2024
Cite
Apratim Mishra; Haejin Lee; Sullam Jeoung; Vetle Torvik; Jana Diesner (2024). Diversity - PubMed Dataset [Dataset]. http://doi.org/10.13012/B2IDB-5259667_V3
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-5259667_V3
Dataset updated
Oct 11, 2024
Authors
Apratim Mishra; Haejin Lee; Sullam Jeoung; Vetle Torvik; Jana Diesner
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Diversity - PubMed dataset
Contact: Apratim Mishra (Oct, 2024)

This dataset presents article-level (pmid) and author-level (auid) diversity data for PubMed articles. The chosen selection includes articles retrieved from Authority 2018 [1], 907 024 papers, and 1 316 838 authors, and is an expanded dataset of V1. The sample of articles consists of the top 40 journals in the dataset, limited to 2-12 authors, published between 1991 and 2014, of article type "journal type" and written in English. Files are 'gzip' compressed and tab-separated, and V3 includes the correct author count for the included papers (pmids) and updated results with no NaNs.

File1: auids_plos_3.csv.gz (Important columns defined, 5 in total)
• AUID: a unique ID for each author
• Genni: gender prediction
• Ethnea: ethnicity prediction

File2: pmids_plos_3.csv.gz (Important columns defined)
• pmid: unique paper
• auid: all unique auids (author-name unique identification)
• year: Year of paper publication
• no_authors: Author count
• journal: Journal name
• years: first year of publication for every author
• Country-temporal: Country of affiliation for every author
• h_index: Journal h-index
• TimeNovelty: Paper Time novelty [2]
• nih_funded: Binary variable indicating funding for any author
• prior_cit_mean: Mean of all authors' prior citation rate
• Insti_impact: All unique institutions' citation rate
• mesh_vals: Top MeSH values for every author of that paper
• relative_citation_ratio: RCR

The 'Readme' includes a description for all columns.

[1] Torvik, Vetle; Smalheiser, Neil (2021): Author-ity 2018 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2273402_V1
[2] Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
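Since the files are described as gzip-compressed and tab-separated, they can be read with Python's standard `gzip` and `csv` modules. The sketch below assumes a header row is present and uses an in-memory sample with three of the documented column names; the values are invented and the helper name `read_gzip_tsv` is hypothetical.

```python
import csv
import gzip
import io


def read_gzip_tsv(raw_bytes):
    """Decompress gzip bytes and parse them as tab-separated rows with a header."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Invented sample; a real run would use gzip.open("pmids_plos_3.csv.gz", "rt").
sample = gzip.compress(b"pmid\tyear\tno_authors\n111\t1999\t3\n222\t2005\t7\n")
rows = read_gzip_tsv(sample)
print(rows[0])
```

Note that despite the `.csv.gz` extension, the description says the separator is a tab, so the delimiter is set explicitly.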
Self-citation analysis data based on PubMed Central subset (2002-2005)
databank.illinois.edu
Updated Apr 27, 2018
Cite
Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik (2018). Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-9665377_V1
Dataset updated
Apr 27, 2018
Authors
Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
U.S. National Science Foundation (NSF)
Description
Self-citation analysis data based on PubMed Central subset (2002-2005) ---------------------------------------------------------------------- Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018 ## Introduction This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab separated text files: * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data * COLUMNS_DESC.txt file - Descriptions of all columns * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection. * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments. * README.txt file ## Dataset creation Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. 
These datasets are listed below. If you wish to use any of those datasets please make sure you cite both the dataset as well as the paper introducing the dataset. * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html * Citation data from PubMed Central (original paper includes additional citations from Web of Science) * Author-ity 2009 dataset: - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1 - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304 - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105 * Genni 2.0 + Ethnea for identifying author gender and ethnicity: - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1 - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720 - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. 
http://hdl.handle.net/2142/88927 * MapAffil for identifying article country of affiliation: - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1 - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik * IMPLICIT journal similarity: - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1 * Novelty dataset for identify article level novelty: - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1 - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra - Code: https://github.com/napsternxg/Novelty * Expertise dataset for identifying author expertise on articles: * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions Additional data related updates can be found at Torvik Research Group ## Acknowledgments This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. 
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## License Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
Trialstreamer data
explore.openaire.eu
zenodo.org
Updated Apr 26, 2020
Cite
Iain Marshall; Benjamin Nye; Joël Kuiper; Rachel Marshall; Frank Soboczenski; Ani Nenkova; Anna Noel Storr; James Thomas; Byron Wallace (2020). Trialstreamer data [Dataset]. http://doi.org/10.5281/zenodo.6637538
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6637538
Dataset updated
Apr 26, 2020
Authors
Iain Marshall; Benjamin Nye; Joël Kuiper; Rachel Marshall; Frank Soboczenski; Ani Nenkova; Anna Noel Storr; James Thomas; Byron Wallace
Description
Trialstreamer annotated collection of RCTs. This repository contains baseline files (large), and subsequent updates (daily for PubMed, weekly for ICTRP).
Dataset of a Study of Computational reproducibility of Jupyter notebooks...
zenodo.org
explore.openaire.eu
pdf, zip
Updated Jul 11, 2024
Cite
Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
Explore at:
Available download formats: zip, pdf
Unique identifier
https://doi.org/10.5281/zenodo.8226725
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sheeba Samuel; Daniel Mietchen
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata of the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

Data Collection and Analysis

We used the code for reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019 and adapted the code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.

Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve the data in XML format, capturing essential details about journals and articles. By scanning the entire article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and pipfile. Using the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
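The search and link-extraction steps above can be sketched with the standard library alone. The study itself used Biopython; this sketch instead builds the request URL for the public NCBI E-utilities `esearch` endpoint by hand and does not send it. The regular expression for GitHub links and the helper names are assumptions made for illustration, not the study's code.

```python
import re
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = "(ipynb OR jupyter OR ipython) AND github"


def esearch_url(term, retmax=100):
    """Build (but do not send) an esearch request against the PMC database."""
    return ESEARCH_URL + "?" + urlencode({"db": "pmc", "term": term, "retmax": retmax})


# Assumed pattern: owner/repository right after the github.com host.
GITHUB_LINK = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def github_links(text):
    """Return the distinct GitHub repository links found in free text."""
    # A trailing full stop is usually sentence punctuation, not part of the URL.
    return sorted({match.rstrip(".") for match in GITHUB_LINK.findall(text)})


url = esearch_url(QUERY)
links = github_links(
    "Code is at https://github.com/fusion-jena/computational-reproducibility-pmc. "
    "See also http://github.com/napsternxg/Novelty"
)
print(url)
print(links)
```

A real run would also need to page through results and respect NCBI's usage guidelines, which this sketch leaves out.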

All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
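Storing extracted metadata in SQLite and querying it can be sketched as follows. The two tables and their columns are hypothetical stand-ins; the actual schema of db.sqlite is not given in this description and will differ.

```python
import sqlite3

# In-memory database; the table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE article (id INTEGER PRIMARY KEY, pmc_id TEXT, title TEXT);
    CREATE TABLE repository (
        id INTEGER PRIMARY KEY,
        article_id INTEGER REFERENCES article(id),
        url TEXT
    );
    """
)
conn.execute("INSERT INTO article VALUES (1, 'PMC0000001', 'Example article')")
conn.executemany(
    "INSERT INTO repository (article_id, url) VALUES (?, ?)",
    [(1, "https://github.com/example/a"), (1, "https://github.com/example/b")],
)

# Count the repositories linked from each article.
repo_counts = conn.execute(
    """
    SELECT a.pmc_id, COUNT(r.id)
    FROM article a LEFT JOIN repository r ON r.article_id = a.id
    GROUP BY a.id
    """
).fetchall()
print(repo_counts)
```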

Our reproducibility pipeline was started on 27 March 2023.

Repository Structure

Our repository is organized into two main folders:

archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.

analyses: Here, you will find the notebooks used for the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology folder is stored in the analyses folder for further analysis. The path can however be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) is focused on examining data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) is for analyzing data associated with publications in PubMed Central, i.e. for plots involving data about articles, journals, publication dates or research fields. The resultant figures from these notebooks are stored in the 'outputs' folder.

MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

Accessing Data and Resources:

All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158

For the latest results and re-run data, refer to this link.

The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.

The metadata in xml format extracted from PubMed Central which contains the information about the articles and journal can be accessed in pmc.xml file.

System Requirements:

Centos 7 (Documentation: https://www.centos.org/)

Conda 4.9.4 (Installation Guide: https://docs.anaconda.com/anaconda/install/linux/)

Python 3.7.6 (Download Link: https://www.python.org/downloads/)

GitHub account (Get Started: https://github.com/, Requires GitHub Username and Token)

gcc 7.3.0 (Installation Guide: https://gcc.gnu.org/install/)

lbzip2 (Command: `conda install -c conda-forge lbzip2`)

Running the pipeline:

Clone the computational-reproducibility-pmc repository using Git:
git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git

Navigate to the computational-reproducibility-pmc directory:
cd computational-reproducibility-pmc/computational-reproducibility-pmc

Configure environment variables in the config.py file:
GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")

Other environment variables can also be set in the config.py file.
BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.

To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
source conda-setup.sh

Change to the archaeology directory
cd archaeology

Activate conda environment. We used py36 to run the pipeline.
conda activate py36

Execute the main pipeline script (r0_main.py):
python r0_main.py
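The config.py step in the sequence above reads each setting from an environment variable and falls back to a default. A minimal sketch of that pattern, using three of the variable names shown, is given below; the `load_config` wrapper and its dictionary argument are hypothetical additions that make the behaviour easy to try without touching the real environment.

```python
import os
from pathlib import Path


def load_config(environ=os.environ):
    """Read settings from environment variables, falling back to defaults."""
    return {
        "GITHUB_USERNAME": environ.get("JUP_GITHUB_USERNAME", ""),
        "BASE_DIR": Path(environ.get("JUP_BASE_DIR", "./")).expanduser(),
        "DB_CONNECTION": environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite"),
    }


# Passing a plain dict simulates exporting JUP_DB_CONNECTION in the shell.
config = load_config({"JUP_DB_CONNECTION": "sqlite:////tmp/db.sqlite"})
print(config["DB_CONNECTION"])
```

Keeping the token out of config.py and exporting it in the shell instead avoids committing credentials to a repository.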

Running the analysis:

Navigate to the analyses directory.
cd analyses

Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
conda activate raw38

Install the required packages using the requirements.txt file.
pip install -r requirements.txt

Launch Jupyterlab
jupyter lab

Refer to the Index.ipynb notebook for the execution order and guidance.

References:

Sheeba Samuel, Daniel Mietchen. (2024). Computational reproducibility of Jupyter notebooks from biomedical publications, https://doi.org/10.1093/gigascience/giad113, GigaScience

Sheeba Samuel, Daniel Mietchen. (2022). Computational reproducibility of Jupyter notebooks from biomedical publications, https://arxiv.org/pdf/2209.04308.pdf, CoRR abs/2209.04308

Sheeba Samuel, & Daniel Mietchen. (2022). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6802158
MEDLINE/PubMed Baseline Statistics: Misc Report
catalog.data.gov
data.virginia.gov
+2more
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). MEDLINE/PubMed Baseline Statistics: Misc Report [Dataset]. https://catalog.data.gov/dataset/2023-medline-pubmed-baseline-misc-report
Explore at:
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
A file containing all Misc Baseline Reports for 2018-2023 in their original format is available in the Attachments section below. MEDLINE/PubMed annual statistical reports are based upon the data elements in the baseline versions of MEDLINE®/PubMed. For each year covered, the reports include: total citations containing each element; total occurrences of each element; minimum/average/maximum occurrences of each element in a record; minimum/average/maximum length of a single element occurrence; average record size; and other statistical data describing the content and size of the elements.
vi_pubmed
huggingface.co
Updated Mar 14, 2023
Cite
Long Phan (2023). vi_pubmed [Dataset]. https://huggingface.co/datasets/justinphan3110/vi_pubmed
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Mar 14, 2023
Authors
Long Phan
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for PubMed

Dataset Summary

NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.
PostgreSQL query to select the ten journals with the highest number of...
plos.figshare.com
xls
Updated May 31, 2023
Cite
Kersten Döring; Björn A. Grüning; Kiran K. Telukunta; Philippe Thomas; Stefan Günther (2023). PostgreSQL query to select the ten journals with the highest number of publications containing the MeSH term “Leukemia” [20] on the complete PubMed data set. [Dataset]. http://doi.org/10.1371/journal.pone.0163794.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0163794.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Kersten Döring; Björn A. Grüning; Kiran K. Telukunta; Philippe Thomas; Stefan Günther
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PostgreSQL query to select the ten journals with the highest number of publications containing the MeSH term “Leukemia” [20] on the complete PubMed data set.
PMDB: a relational database for PubMed
zenodo.org
tar
Updated Jun 14, 2025
Cite
Jacob Hughey; Joshua Schoenbachler (2025). PMDB: a relational database for PubMed [Dataset]. http://doi.org/10.5281/zenodo.15658234
Explore at:
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.15658234
Dataset updated
Jun 14, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jacob Hughey; Joshua Schoenbachler
Description
The files constitute a compressed dump of PMDB, which was created in PostgreSQL 14 using the pmparser R package. Once you have a Postgres server running, you can set up the database as follows:

1. Untar the file containing the database dump, which will create a folder. Substitute

tar xvf

2. Restore the database onto your Postgres server. Below is one way. Replace <...> as appropriate, substituting

createdb -h

MEDLINE/PubMed data are courtesy of the U.S. National Library of Medicine. See NLM's Terms and Conditions.
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and...
databank.illinois.edu
Updated Aug 10, 2020
Cite
Vetle I. Torvik (2020). MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide [Dataset]. http://doi.org/10.13012/B2IDB-4354331_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-4354331_V1
Dataset updated
Aug 10, 2020
Authors
Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
U.S. National Science Foundation (NSF)
Description
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik 2018-04-05. The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters), and should be about 3.5GB uncompressed.

• How was the dataset created? The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLM's data Terms and Conditions.
• Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only. However, MapAffil 2016 covers some PubMed records lacking affiliations that were harvested elsewhere, from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g. 5833220).
• Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and html) so they may differ (sometimes a lot; see PMID 27487542) from PubMed records.
• All affiliation strings were processed using the MapAffil procedure, to identify and disambiguate the most specific place-name, as described in: Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p
• Look for Fig. 4 in the following article for coverage statistics over time: Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Tropical medicine and health. 2017 Dec;45(1):33. Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
• The code and back-end data are periodically updated and made available for query by PMID at Torvik Research Group.
• What is the format of the dataset? The dataset contains 37,406,692 rows. Each row (line) in the file has a unique PMID and author position (e.g., 10786286_3 is the third author name on PMID 10786286), and the following thirteen columns, tab-delimited. All columns are ASCII, except city which contains Latin-1.

1. PMID: positive non-zero integer; int(10) unsigned
2. au_order: positive non-zero integer; smallint(4)
3. lastname: varchar(80)
4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed
5. year of publication
6. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
7. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
8. state: Australia, Canada and USA (which includes territories like PR, GU, AS, and post-codes like AE and AA)
9. country
10. journal
11. lat: at most 3 decimals (only available when city is not a country or state)
12. lon: at most 3 decimals (only available when city is not a country or state)
13. fips: varchar(5); for USA only retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
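As a rough illustration of the thirteen-column layout described above, the following Python sketch parses tab-delimited MapAffil rows into dictionaries. It rests on several assumptions that the description does not confirm: that the file has no header row, that missing values are empty strings, and that the column identifiers in `COLUMNS` are acceptable shorthand (they are chosen here for illustration). The sample row is made up and is not taken from the dataset.

```python
import csv
import io

# Column names follow the thirteen-column list above; the identifiers are illustrative.
COLUMNS = [
    "pmid", "au_order", "lastname", "firstname", "year", "type", "city",
    "state", "country", "journal", "lat", "lon", "fips",
]


def parse_row(fields):
    """Turn one list of 13 string fields into a dict with a few typed values."""
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["pmid"] = int(row["pmid"])
    row["au_order"] = int(row["au_order"])
    # Unresolved place-name ambiguities are concatenated with '|'.
    row["city_candidates"] = row["city"].split("|") if row["city"] else []
    # Assumption: lat/lon are empty strings when the city is only a state or country.
    row["lat"] = float(row["lat"]) if row["lat"] else None
    row["lon"] = float(row["lon"]) if row["lon"] else None
    return row


def read_mapaffil(stream):
    """Yield parsed rows from a text stream of tab-delimited MapAffil lines."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield parse_row(fields)


# Made-up sample row for illustration only.
SAMPLE = (
    "10786286\t3\tDoe\tJane\t2000\tEDU\tUrbana, IL, USA\tIL\tUSA\t"
    "Example J\t40.11\t-88.207\t17019\n"
)
rows = list(read_mapaffil(io.StringIO(SAMPLE)))
```

For the real file, the stream would be opened with `open(path, encoding="latin-1", newline="")`, since the description states that the file is Latin-1 encoded. Reading row by row keeps memory use low for a file of about 3.5GB.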
uCite: The union of nine large-scale public PubMed citation datasets with...
databank.illinois.edu
Updated Apr 4, 2025
Cite
Liri Fang; Malik Oyewale Salami; Griffin M. Weber; Vetle I. Torvik (2025). uCite: The union of nine large-scale public PubMed citation datasets with reliability labels [Dataset]. http://doi.org/10.13012/B2IDB-6818660_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-6818660_V1
Dataset updated
Apr 4, 2025
Authors
Liri Fang; Malik Oyewale Salami; Griffin M. Weber; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset, uCite, is the union of nine large-scale open-access PubMed citation datasets, separated by reliability. There are 20 files, including the reliable and unreliable citation PMID pairs, non-PMID identifiers to PMID mapping (for DOIs, Lens, MAG, and Semantic Scholar), original PMID pairs from the nine resources, some metadata for PMIDs, duplicate PMIDs, some redirected PMID pairs, and PMC OA Patci citation matching results. A short description of each data file is listed below. A detailed description can be found in the README.txt.

DATASET DESCRIPTION

PPUB.tsv.gz - tsv format file containing the reliable citation pairs of uCite.

PUNR.tsv.gz - tsv format file containing the unreliable citation pairs of uCite.

DOI2PMID.tsv.gz - tsv format file containing results mapping DOI to PMID.

LEN2PMID.tsv.gz - tsv format file containing results mapping LensID pairs to PMID pairs.

MAG2PMIDsorted.tsv.gz - tsv format file containing results mapping MAG ID to PMID.

SEM2PMID.tsv.gz - tsv format file containing results mapping Semantic Scholar ID to PMID.

JVNPYA.tsv.gz - tsv format file containing metadata of papers with PMID, journal name, volume, issue, pages, publication year, and first author's last name.

TiLTyAlJVNY.tsv.gz - tsv format file containing metadata of papers.

PMC-OA-patci.tsv.gz - tsv format file containing PubMed Central Open Access subset reference strings extracted by \cite{} processed by Patci.

REDIRECTS.gz - txt file containing unreliable PMID pairs mapped to reliable PMID pairs.

REMAP - file containing pairs of duplicate PubMed records (lhs PMID mapped to rhs PMID).

ami_pair.tsv.gz - tsv format file containing all citation pairs from Aminer (2015 version).

dim_pair.tsv.gz - tsv format file containing all citation pairs from Dimensions.

ice_pair.tsv.gz - tsv format file containing all citation pairs from iCite (April 2019 version, version 1).

len_pair.tsv.gz - tsv format file containing all citation pairs from Lens.org (harvested through Oct 2021).

mag_pair.tsv.gz - tsv format file containing all citation pairs from Microsoft Academic Graph (2015 version).

oci_pair.tsv.gz - tsv format file containing all citation pairs from Open Citations (Nov. 2021 dump, csv version).

pat_pair.tsv.gz - tsv format file containing all citation pairs from Patci (i.e., from "PMC-OA-patci.tsv.gz").

pmc_pair.tsv.gz - tsv format file containing all citation pairs from PubMed Central (harvest through Dec 2018 via e-Utilities).

sem_pair.tsv.gz - tsv format file containing all citation pairs from Semantic Scholar (2019 version).

COLUMN DESCRIPTION

FILENAME: PPUB.tsv.gz, PUNR.tsv.gz
(1) fromPMID - PubMed ID of the citing paper.
(2) toPMID - PubMed ID of the cited paper.
(3) sources - citation sources, in which the citation pairs are identified.
(4) fromYEAR - Publication year of the citing paper.
(5) toYEAR - Publication year of the cited paper.

FILENAME: DOI2PMID.tsv.gz
(1) DOI - Semantic Scholar ID of paper records.
(2) PMID - PubMed ID of paper records.
(3) PMID2 - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.

FILENAME: SEMID2PMID.tsv.gz
(1) SemID - Semantic Scholar ID of paper records.
(2) PMID - PubMed ID of paper records.
(3) DOI - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.

FILENAME: JVNPYA.tsv.gz - Each row refers to a publication record.
(1) PMID - PubMed ID.
(2) journal - Journal name.
(3) volume - Journal volume.
(4) issue - Journal issue.
(5) pages - The first page and last page (without leading digits) number of the publication separated by '-'.
(6) year - Publication year.
(7) lastname - Last name of the first author.

FILENAME: TiLTyAlJVNY.tsv.gz
(1) PMID - PubMed ID.
(2) title_tokenized - Paper title after tokenization.
(3) languages - Language that paper is written in.
(4) pub_types - Types of the publication.
(5) length(authors) - String length of author names.
(6) journal - Journal name.
(7) volume - Journal volume.
(8) issue - Journal issue.
(9) year - Publication year of print (not necessarily epub).

FILENAME: PMC-OA-patci.tsv.gz
(1) pmcid - PubMed Central identifier.
(2) pos -
(3) fromPMID - PubMed ID of the citing paper.
(4) toPMID - PubMed ID of the cited paper.
(5) SRC - citation sources, in which the citation pairs are identified.
(6) MatchDB - PubMed, ADS, DBLP.
(7) Probability - Matching probability predicted by Patci.
(8) toPMID2 - PubMed ID of the cited paper, extracted from OA xml file
(9) SRC2 - citation sources, in which the citation pairs are identified.
(10) intxt_id -
(11) jounal - First character of the journal name.
(12) same_ref_string - Y if patci and xml reference string match, otherwise N.
(13) DIFF -
(14) bestSRC - Citation sources, in which the citation pairs are identified.
(15) Match - Matching strings annotated by Patci.

FILENAME: REDIRECTS.gz
Each row in Redirectis.txt is a string sequence in one of the following formats.
- "REDIRECTED FROM: source PMID_i PMID_j -> PMID_i' PMID_j "
- "REDIRECTED TO: source PMID_i PMID_j -> PMID_i PMID_j' "
Note: source is the name of the source(s) that PMID_i and PMID_j come from.

FILENAME: REMAP
Each row is remapping unreliable PMID pairs mapped to reliable PMID pairs. The format of each row is "$REMAP{PMID_i} = PMID_j".

FILENAME: ami_pair.tsv.gz, dim_pair.tsv.gz, ice_pair.tsv.gz, len_pair.tsv.gz, mag_pair.tsv.gz, oci_pair.tsv.gz, pat_pair.tsv.gz, pmc_pair.tsv.gz, sem_pair.tsv.gz
(1) fromPMID - PubMed ID of the citing paper.
(2) toPMID - PubMed ID of the cited paper.
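To make the pair-file and REMAP formats above more concrete, here is a minimal Python sketch that reads (fromPMID, toPMID) rows and parses REMAP lines of the form "$REMAP{PMID_i} = PMID_j". Applying the remapping to the citation pairs is only one possible use and is an assumption here, not a documented step of the uCite pipeline. The function names, the tolerance for a header row, and the sample values are all hypothetical.

```python
import csv
import io
import re

# Matches lines such as "$REMAP{333} = 222"; trailing characters are tolerated.
REMAP_LINE = re.compile(r"^\$REMAP\{(\d+)\}\s*=\s*(\d+)")


def parse_remap(lines):
    """Build a dict mapping the left-hand PMID to the right-hand PMID."""
    remap = {}
    for line in lines:
        match = REMAP_LINE.match(line.strip())
        if match:
            remap[int(match.group(1))] = int(match.group(2))
    return remap


def read_pairs(stream):
    """Yield (fromPMID, toPMID) from tab-separated rows, using the first two columns."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        # Skip blank lines and any non-numeric row (e.g. a header, if one exists).
        if len(record) < 2 or not (record[0].isdigit() and record[1].isdigit()):
            continue
        yield int(record[0]), int(record[1])


def canonical_pairs(pairs, remap):
    """Replace duplicate PMIDs by their mapped PMID and drop repeated pairs."""
    return {(remap.get(src, src), remap.get(dst, dst)) for src, dst in pairs}


# Made-up sample data for illustration only.
PAIRS_TEXT = "fromPMID\ttoPMID\n111\t222\n111\t333\n"
REMAP_TEXT = ["$REMAP{333} = 222"]

remap = parse_remap(REMAP_TEXT)
pairs = canonical_pairs(read_pairs(io.StringIO(PAIRS_TEXT)), remap)
```

The real `.tsv.gz` files can be streamed with `gzip.open(path, "rt", newline="")` from the standard library in place of `io.StringIO`. Because `read_pairs` only uses the first two columns, it also accepts the five-column PPUB.tsv.gz and PUNR.tsv.gz layout.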
Datasets for OntoClue Project
zenodo.org
tsv, zip
Updated Feb 13, 2025
Cite
Rohitha Ravinder; Lukas Geist; Dietrich Rebholz-Schuhmann; Leyla Jael Castro (2025). Datasets for OntoClue Project [Dataset]. http://doi.org/10.5281/zenodo.14801641
Explore at:
Available download formats: tsv, zip
Unique identifier
https://doi.org/10.5281/zenodo.14801641
Dataset updated
Feb 13, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Rohitha Ravinder; Lukas Geist; Dietrich Rebholz-Schuhmann; Leyla Jael Castro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This release contains the datasets and files associated with the OntoClue project, which investigates various text embedding techniques for assessing document-to-document similarity in biomedical literature. The project primarily utilizes the RELISH Corpus [1], a comprehensive dataset curated by experts that includes relevance annotations for document pairs based on their similarity. This release includes datasets for establishing ground truth, as well as retrieved titles and abstracts for all PMIDs in the RELISH database. The files also contain preprocessed tokens for use in text embedding neural network models, as well as annotated tokens based on the MeSH (Medical Subject Headings) [2] vocabulary.

Data Structure and Files

missing_pmids.tsv: List of PMIDs for which titles and abstracts could not be retrieved

relevance_matrix.tsv: Ground truth dataset file derived from the RELISH JSON file containing 189,634 document pairs, with three columns: PMID1 (reference article), PMID2 (assessed article), and relevance (relevance score between the two documents). Consists of 68,479 completely relevant pairs, 65,406 partially relevant pairs and 55,749 irrelevant pairs.

relish_documents.tsv: Contains retrieved RELISH documents, including PMID, title and abstract (163,189 articles)

relish_bert_input_text.zip: Preprocessed titles and abstracts for use with BERT-based models

relish_preprocessed_normal_tokens.zip: Document text preprocessed for use with all embeddings approaches

relish_normal_split_datasets.zip: Preprocessed document text split into training, validation and test datasets

relish_xml_files.zip: RELISH articles retrieved as XML files

relish_annotated_xml_files.zip: Annotated XML files of RELISH articles (163,189 articles)

relish_preprocessed_annotated_tokens.zip: Document text preprocessed for use with all embeddings approaches, with annotations

relish_annotated_split_datasets.zip: Preprocessed and annotated document text split into training, validation and test datasets

relish_ground_truth_split_datasets.zip: Ground truth dataset split into training, validation and test datasets
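The three-column layout of relevance_matrix.tsv listed above lends itself to a quick sanity check of the class counts. The sketch below tallies the relevance column with the standard library. It assumes the file has a header row and that relevance is stored as a small set of labels; the listing does not state the actual encoding, so the values 0, 1 and 2 in the sample are hypothetical, as is the function name.

```python
import csv
import io
from collections import Counter


def count_relevance(stream, has_header=True):
    """Count rows per relevance label in a PMID1 / PMID2 / relevance TSV stream."""
    reader = csv.reader(stream, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the (assumed) header row
    return Counter(row[2] for row in reader if len(row) >= 3)


# Made-up sample rows; the label values are an assumption.
SAMPLE = (
    "PMID1\tPMID2\trelevance\n"
    "100\t200\t2\n"
    "100\t300\t1\n"
    "100\t400\t0\n"
    "100\t500\t2\n"
)
counts = count_relevance(io.StringIO(SAMPLE))
total = sum(counts.values())
```

Run against the full file, the three counts would be expected to match the figures quoted above, which add up as 68,479 + 65,406 + 55,749 = 189,634 document pairs.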

Data Collection

The RELISH dataset v1 was downloaded from the corresponding FigShare record [3] on January 24th, 2022. The dataset, in JSON format, contains PubMed IDs (PMIDs) along with relevance assessments for document pairs. Using the BioC API, we retrieved XML files containing the PMID, title, and abstract for each unique entry in the RELISH JSON file. Any PMIDs that failed to retrieve, or lacked titles and abstracts, were recorded as missing. In total, approximately 163,189 XML files were successfully retrieved. These XML files were also converted into a TSV file with three columns: PMID, title, and abstract. The text from the titles and abstracts was further preprocessed for use in various approaches.
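The XML-to-TSV conversion described above could look roughly like the following sketch, which is not the project's actual code. It assumes the usual BioC XML layout, in which each `document` has an `id` and a list of `passage` elements whose `infon key="type"` is `title` or `abstract`. It also assumes that a record counts as missing when either the title or the abstract is empty, which is one reading of the text. The sample XML and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def bioc_to_rows(xml_text):
    """Extract (pmid, title, abstract) tuples from a BioC XML collection string."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("document"):
        pmid = (doc.findtext("id") or "").strip()
        parts = {"title": [], "abstract": []}
        for passage in doc.iter("passage"):
            kind = None
            for infon in passage.findall("infon"):
                if infon.get("key") == "type":
                    kind = (infon.text or "").strip()
            text = (passage.findtext("text") or "").strip()
            if kind in parts and text:
                parts[kind].append(text)
        rows.append((pmid, " ".join(parts["title"]), " ".join(parts["abstract"])))
    return rows


def split_missing(rows):
    """Separate complete rows from PMIDs lacking a title or an abstract (assumed rule)."""
    found = [row for row in rows if row[1] and row[2]]
    missing = [row[0] for row in rows if not (row[1] and row[2])]
    return found, missing


# Made-up sample collection for illustration only.
SAMPLE = """<collection>
  <document><id>111</id>
    <passage><infon key="type">title</infon><text>A title</text></passage>
    <passage><infon key="type">abstract</infon><text>An abstract.</text></passage>
  </document>
  <document><id>222</id>
    <passage><infon key="type">title</infon><text>Title only</text></passage>
  </document>
</collection>"""

found, missing = split_missing(bioc_to_rows(SAMPLE))
tsv_lines = ["\t".join(row) for row in found]
```

Here `found` would feed a file like relish_documents.tsv and `missing` a file like missing_pmids.tsv. Real titles and abstracts may contain tabs or newlines, so a production version would need to normalize whitespace before writing TSV.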

References

[1] Peter Brown, RELISH Consortium, Yaoqi Zhou, Large expert-curated database for benchmarking document similarity detection in biomedical literature search, Database, Volume 2019, 2019, baz085, https://doi.org/10.1093/database/baz085

[2] Lipscomb C. E. (2000). Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265–266.

[3] Brown, Peter (2019). RELISH_v1. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7722905.v1
Dataset for "Continued use of retracted papers: Temporal trends in citations...
databank.illinois.edu
Updated Apr 6, 2021
+ more versions
Cite
Tzu-Kun Hsiao; Jodi Schneider (2021). Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine" [Dataset]. http://doi.org/10.13012/B2IDB-8255619_V2
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-8255619_V2
Dataset updated
Apr 6, 2021
Authors
Tzu-Kun Hsiao; Jodi Schneider
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset funded by
Alfred P. Sloan Foundation
U.S. National Institutes of Health (NIH)
Description
This dataset includes five files. Descriptions of the files are given as follows:

FILENAME: PubMed_retracted_publication_full_v3.tsv
- Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT]).
- Except for the information in the "cited_by" column, all the data is from PubMed.
- PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.

ROW EXPLANATIONS
- Each row is a retracted paper. There are 7,813 retracted papers.

COLUMN HEADER EXPLANATIONS
1) PMID - PubMed ID
2) Title - Paper title
3) Authors - Author names
4) Citation - Bibliographic information of the paper
5) First Author - First author's name
6) Journal/Book - Publication name
7) Publication Year
8) Create Date - The date the record was added to the PubMed database
9) PMCID - PubMed Central ID (if applicable, otherwise blank)
10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
11) DOI - Digital object identifier (if applicable, otherwise blank)
12) retracted_in - Information of retraction notice (given by PubMed)
13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
14) cited_by - PMIDs of the citing papers. (if applicable, otherwise blank) Data collected from iCite.
15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)

FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv
- This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
- This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
- Citation contexts that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.

ROW EXPLANATIONS
- Each row is a citation context associated with one retracted paper that's cited.
- In the manuscript, we count each citation context once, even if it cites multiple retracted papers.

COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) year - Publication year of the citing paper
4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
7) total_sentences - Total number of sentences in a given location
8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
10) citation - The citation context
11) progression - Position of a citation context by centile within the citing paper.
12) retracted_yr - Retraction year of the retracted paper
13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.

FILENAME: 724_knowingly_post_retraction_cit.csv (updated)
- The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
- Two citation contexts from retraction notices have been excluded from analyses.

ROW EXPLANATIONS
- Each row is a citation context.

COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) pub_type - Publication type collected from the metadata in the PMCOA XML files.
4) pub_type2 - Specific article types. Please see the manuscript for explanations.
5) year - Publication year of the citing paper
6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
9) citation - The citation context
10) retracted_yr - Retraction year of the retracted paper
11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
12) longer_context - An extended version of the citation context. (if applicable, otherwise blank) Manually pulled from the full-texts in the process of annotation.

FILENAME: Annotation manual.pdf
- The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.tsv.

FILENAME: retraction_notice_PMID.csv (new file added for this version)
- A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT]).
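Two rules stated in the file descriptions above are simple enough to sketch in Python: the post_retraction flag (a citation made after the calendar year of retraction) and the two exclusion conditions for citing PMIDs. This is an illustrative reading of those rules, not the authors' code. The function names and sample PMIDs are hypothetical, and how the "cited_by" column is delimited in the file is not stated, so the functions take already-parsed values.

```python
def is_post_retraction(citing_year, retracted_yr):
    """Return 1 if the citation was made after the calendar year of retraction, else 0.

    Returns None when the retraction year is unknown (blank in the file).
    """
    if retracted_yr is None:
        return None
    return 1 if citing_year > retracted_yr else 0


def filter_citing_pmids(retracted_pmid, cited_by, notice_pmids):
    """Drop citing PMIDs that are retraction notices or equal to the retracted paper."""
    return [
        pmid for pmid in cited_by
        if pmid != retracted_pmid and pmid not in notice_pmids
    ]


# Made-up values for illustration only.
same_year = is_post_retraction(citing_year=2015, retracted_yr=2015)
later_year = is_post_retraction(citing_year=2016, retracted_yr=2015)
kept = filter_citing_pmids(
    retracted_pmid=1000,
    cited_by=[1000, 2000, 3000],
    notice_pmids={3000},
)
```

Note that a citation published in the same calendar year as the retraction is not counted as post-retraction under this definition, because the comparison is strict.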
Distribution of trial registry numbers within full-text PubMed Central -...
search.dataone.org
datadryad.org
Updated Feb 5, 2025
Cite
Arthur Holt; Neil Smalheiser; Ang Troy (2025). Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links [Dataset]. http://doi.org/10.5061/dryad.dbrv15fb1
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.dbrv15fb1
Dataset updated
Feb 5, 2025
Dataset provided by
Dryad Digital Repository
Authors
Arthur Holt; Neil Smalheiser; Ang Troy
Description
Linking registered clinical trials with their published results continues to be a challenge. A variety of natural language processing (NLP)-based and machine learning-based models have been developed to assist users in identifying these connections. Articles from the PubMed Central full-text collection were scanned for mentions of ClinicalTrials.gov and international clinical trial registry identifiers. We analyzed the distribution of trial registry numbers within sections of the articles and characterized their publication type indexing and other metrics. Three supporting files are included herein: a pdf containing supplementary figures pertaining to the distribution of registry numbers found within the full text of articles, a csv dataset providing the registry numbers discovered and the corresponding XML path location within the document, and an example Python script to locate registry identifiers within an XML article document. It should be noted that the purpose of this study is to...

These datasets and files are the results of scanning 6,901,686 XML documents within the PubMed Central Open Access article datasets available at: https://ftp.ncbi.nlm.nih.gov/pub/pmc/ Each registry identifier match is represented by a row in the xmlScanOutput.csv file, along with PubMed identifiers, file information, XML path information, and several computed columns including a validation that an NCT number exists within ClinicalTrials.gov, a generalized article section, and publication types from multiple indexing sources. Summaries within the Distribution_of_Trial_Registry_Numbers_Additional_File.pdf were generated by counting distinct PMID values within the csv file across various groups.

Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links

https://doi.org/10.5061/dryad.dbrv15fb1

This data set contains a table with every combination of publication ID, registry number, XML path, and section of the publication discovered in the Full-Text scanning of PubMed Central articles.
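The idea of recording a registry number together with its XML path can be sketched as follows. This is not the example script shipped with the dataset, and the explicit matching criteria used by the authors are in their supplementary PDF. The sketch only covers ClinicalTrials.gov identifiers, using the well-known pattern of "NCT" followed by eight digits; other registries are out of scope. The function name, the path notation, and the sample XML with its identifiers are made up.

```python
import re
import xml.etree.ElementTree as ET

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_registry_ids(xml_text):
    """Return (identifier, xml_path) tuples for every NCT number in the document."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        for nct in NCT_PATTERN.findall(elem.text or ""):
            hits.append((nct, here))
        for child in elem:
            walk(child, here)
            # Text that follows a child element belongs to the enclosing element.
            for nct in NCT_PATTERN.findall(child.tail or ""):
                hits.append((nct, here))

    walk(root, "")
    return hits


# Made-up article fragment and identifiers for illustration only.
SAMPLE = (
    "<article><front><abstract><p>Registered as NCT01234567.</p></abstract></front>"
    "<body><sec><p>See <italic>trial</italic> NCT07654321 for details.</p></sec></body>"
    "</article>"
)
hits = find_registry_ids(SAMPLE)
```

Tracking the tail text separately matters for mixed content: in the second paragraph of the sample the identifier appears after an inline `italic` element, yet it is still attributed to the enclosing `p`. Mapping a path such as `/article/front/abstract/p` to a generalized article section would be a further step that this sketch leaves out.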

Description of the data and file structure

Distribution_of_Trial_Registry_Numbers_Additional_File.pdf

This document contains charts and summaries of the trial registry numbers found from the XML document scanning process. The explicit criteria for locating registry identifiers and designating article sections are provided in this document and may be useful for further research and refinement.

Distribution_of_Trial_Registry_Numbers_ScanOutput.zip

This zip archive contains a comma-separated file named "xmlScanOutput.csv" that contains all rows of registry numbers and art...
