56 datasets found
  1. pubmed25

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Cite
    Hà Huy Hoàng (2025). pubmed25 [Dataset]. https://huggingface.co/datasets/HoangHa/pubmed25
    Explore at:
    Dataset updated
    Apr 26, 2025
    Authors
    Hà Huy Hoàng
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information. This version is modified to extract the full text from structured abstracts.
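    The relationship between the annual baseline and the daily update files can be sketched as follows. This is a minimal illustration, assuming each citation is keyed by its PMID and that update files are applied in release order; the `apply_update` helper and the dict-based records are hypothetical and do not reflect NLM's XML schema or this dataset's loader.

```python
from typing import Dict, Iterable

Record = Dict[str, str]  # hypothetical: a citation reduced to a few text fields


def apply_update(
    citations: Dict[int, Record],
    new_or_revised: Dict[int, Record],
    deleted: Iterable[int],
) -> Dict[int, Record]:
    """Return a copy of `citations` with one update file applied."""
    merged = dict(citations)
    # New PMIDs are added; revised PMIDs overwrite the older record.
    merged.update(new_or_revised)
    # Deletions are applied last (an assumption about ordering within one file).
    for pmid in deleted:
        merged.pop(pmid, None)
    return merged


baseline = {1: {"title": "A"}, 2: {"title": "B"}}
current = apply_update(
    baseline,
    new_or_revised={2: {"title": "B (revised)"}, 3: {"title": "C"}},
    deleted=[1],
)
```

    Returning a new mapping leaves the baseline untouched, so the same baseline can be replayed against a different sequence of update files.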

  2. Pubmed Baseline 2021-12-12

    • academictorrents.com
    bittorrent
    Updated May 22, 2022
    Cite
    National Institutes of Health and National Library of Medicine (2022). Pubmed Baseline 2021-12-12 [Dataset]. https://academictorrents.com/details/cc12294ab7bf1f730738a6bff89052ad8156d8d8
    Explore at:
    Available download formats: bittorrent (37381448042)
    Dataset updated
    May 22, 2022
    Dataset authored and provided by
    National Institutes of Health and National Library of Medicine
    License

    https://academictorrents.com/nolicensespecified

    Description

    Just the baseline files, no update files. MD5 sums included and checked before upload.

    From :

    The PubMed Baseline Repository and Daily Update files
    Last Updated December 13, 2022

    All questions should be directed to:
    National Center for Biotechnology Information
    info@ncbi.nlm.nih.gov

    This document describes the PubMed Database available on the NCBI FTP site under the and directories. PubMed comprises more than 31 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publish

  3. Conceptual novelty scores for PubMed articles

    • databank.illinois.edu
    Updated Apr 27, 2018
    Cite
    Shubhanshu Mishra; Vetle I. Torvik (2018). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
    Explore at:
    Dataset updated
    Apr 27, 2018
    Authors
    Shubhanshu Mishra; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Conceptual novelty analysis data based on PubMed Medical Subject Headings
    ----------------------------------------------------------------------
    Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

    ## Introduction
    This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015. The dataset is distributed in the form of the following tab separated text files:

    * PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
      - PMID: PubMed ID
      - Year: year of publication
      - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
      - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
      - PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper)
      - PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper)
    * mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
      - MeshTerm: Name of the MeSH term
      - Year: year
      - AbsVal: Total publications with that MeSH term in the given year
      - TimeNovelty: age (in years since first publication) of MeSH term in the given year
      - VolumeNovelty: age (in number of papers since first publication) of MeSH term in the given year
    * meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term pair for all years
      - Mesh1: Name of the first MeSH term (alphabetically sorted)
      - Mesh2: Name of the second MeSH term (alphabetically sorted)
      - Year: year
      - AbsVal: Total publications with that MeSH pair in the given year
      - TimeNovelty: age (in years since first publication) of MeSH pair in the given year
      - VolumeNovelty: age (in number of papers since first publication) of MeSH pair in the given year
    * README.txt file

    ## Dataset creation
    This dataset was constructed using multiple datasets described in the following locations:

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
    * Source code provided at: https://github.com/napsternxg/Novelty

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions: Additional data related updates can be found at: Torvik Research Group

    ## Acknowledgments
    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License
    Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
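    Because PubMed2015_NoveltyData.tsv has over 22 million rows, it may be easier to stream it than to load it whole. The sketch below is one way to do that with the standard library. It assumes the file starts with a header row carrying the six column names listed above and that the scores parse as numbers; neither is confirmed by the description. The sample rows and the `recent_pair_pmids` helper are made up for illustration.

```python
import csv
import io

# Hypothetical two-row excerpt; the values are invented for illustration.
SAMPLE = (
    "PMID\tYear\tTimeNovelty\tVolumeNovelty\tPairTimeNovelty\tPairVolumeNovelty\n"
    "100\t1990\t5\t120\t0\t1\n"
    "200\t2001\t12\t3400\t3\t45\n"
)


def recent_pair_pmids(handle, max_pair_age):
    """Yield PMIDs whose PairTimeNovelty is at most `max_pair_age`."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:  # one row at a time, so memory use stays flat
        if float(row["PairTimeNovelty"]) <= max_pair_age:
            yield int(row["PMID"])


pmids = list(recent_pair_pmids(io.StringIO(SAMPLE), max_pair_age=1))
```

    For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle.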

  4. MEDLINE/PubMed Baseline Statistics: Min/Max Report

    • datadiscovery.nlm.nih.gov
    • data.virginia.gov
    • +3 more
    xlsx, xml
    Updated Mar 22, 2023
    + more versions
    Cite
    (2023). MEDLINE/PubMed Baseline Statistics: Min/Max Report [Dataset]. https://datadiscovery.nlm.nih.gov/d/xtvu-8j6w
    Explore at:
    Available download formats: xml, xlsx
    Dataset updated
    Mar 22, 2023
    Description

    A file containing all Min/Max Baseline Reports for 2005-2023 in their original format is available in the Attachments section below. A second file includes a separate set of reports, made available from 2002-2017, that did not include OLDMEDLINE records.

    MEDLINE/PubMed annual statistical reports, based upon the data elements in the baseline versions of MEDLINE®/PubMed, are available. For each year covered the reports include: total citations containing each element; total occurrences of each element; minimum/average/maximum occurrences of each element in a record; minimum/average/maximum length of a single element occurrence; average record size; and other statistical data describing the content and size of the elements.
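    The per-element figures described above can be illustrated with a small sketch. It rests on several assumptions that the report description does not confirm: only direct children of a record are counted, and minimum/average/maximum are taken over the records that contain the element at least once. The XML sample is a hypothetical simplification, not the MEDLINE/PubMed schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily simplified records; real MEDLINE/PubMed XML is richer.
SAMPLE = """
<Set>
  <Record><Author>A</Author><Author>B</Author><Title>T1</Title></Record>
  <Record><Author>C</Author><Title>T2</Title></Record>
  <Record><Title>T3</Title></Record>
</Set>
"""


def element_stats(xml_text):
    """Summarise how often each element occurs per record."""
    per_record = defaultdict(list)  # element name -> one count per record containing it
    for record in ET.fromstring(xml_text.strip()):
        counts = defaultdict(int)
        for child in record:
            counts[child.tag] += 1
        for tag, n in counts.items():
            per_record[tag].append(n)
    return {
        tag: {
            "records": len(ns),        # total citations containing the element
            "occurrences": sum(ns),    # total occurrences of the element
            "min": min(ns),
            "avg": sum(ns) / len(ns),
            "max": max(ns),
        }
        for tag, ns in per_record.items()
    }


stats = element_stats(SAMPLE)
```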

  5. Hype - PubMed dataset

    • databank.illinois.edu
    Updated Mar 14, 2025
    + more versions
    Cite
    Apratim Mishra; Jana Diesner; Vetle I. Torvik (2025). Hype - PubMed dataset [Dataset]. http://doi.org/10.13012/B2IDB-0651259_V3
    Explore at:
    Dataset updated
    Mar 14, 2025
    Authors
    Apratim Mishra; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hype - PubMed dataset
    Prepared by Apratim Mishra

    This dataset captures ‘Hype’ within biomedical abstracts sourced from PubMed. The selection chosen is ‘journal articles’ written in English, published between 1975 and 2019, totaling ~5.2 million. The classification relies on the presence of specific candidate ‘hype words’ and their abstract location. Therefore, each article (PMID) might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences. The candidate hype words are 35 in count: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.

    This is version 3 of the dataset. Added new file - WSD_hype.tsv

    File 1: hype_dataset_final.tsv
    Primary dataset. It has the following columns:
    1. PMID: represents unique article ID in PubMed
    2. Year: Year of publication
    3. Hype_word: Candidate hype word, such as ‘novel.’
    4. Sentence: Sentence in abstract containing the hype word.
    5. Hype_percentile: Abstract relative position of hype word.
    6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
    7. Introduction: The ‘I’ component of the hype word based on IMRaD
    8. Methods: The ‘M’ component of the hype word based on IMRaD
    9. Results: The ‘R’ component of the hype word based on IMRaD
    10. Discussion: The ‘D’ component of the hype word based on IMRaD

    File 2: hype_removed_phrases_final.tsv
    Secondary dataset with same columns as File 1. Hype in the primary dataset is based on excluding certain phrases that are rarely hype. The phrases that were removed are included in File 2 and modeled separately.
    Removed phrases:
    1. Major: histocompatibility, component, protein, metabolite, complex, surgery
    2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
    3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
    4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values
    5. Essential: medium, features, properties, opportunities, oil
    6. Unique: model, amino
    7. Robust: regression
    8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
    9. Outstanding: questions, issues, question, questions, challenge, problems, problem, remains
    10. Remarkable: properties
    11. Definite: radiotherapy, surgery

    File 3: WSD_hype.tsv
    Includes hype-based disambiguation for candidate words targeted for WSD (Word sense disambiguation)
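    The split between the primary dataset and the removed phrases can be illustrated with a short sketch. It assumes that a "removed phrase" means a candidate hype word immediately followed by one of the listed terms; the description does not spell out the matching rule, so this is a guess at the idea and not the authors' procedure. Only a subset of the words and phrases is included, and `find_hype` is a hypothetical helper.

```python
import re

# Subset of the removed phrases listed above, keyed by lower-cased hype word.
REMOVED = {
    "major": {"histocompatibility", "component", "protein", "metabolite", "complex", "surgery"},
    "robust": {"regression"},
    "vital": {"capacity", "signs", "organs", "status", "structures",
              "staining", "rates", "cells", "information"},
}
HYPE_WORDS = {"major", "novel", "robust", "vital"}  # subset of the 35 candidates


def find_hype(sentence):
    """Return (hype_word, is_removed_phrase) pairs for one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in HYPE_WORDS:
            # Assumed rule: the phrase is the hype word plus the next token.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            hits.append((tok, nxt in REMOVED.get(tok, set())))
    return hits


hits = find_hype("A novel and robust regression model for major surgery outcomes.")
```

    Under this assumed rule, hits flagged `True` would belong in File 2 and the rest in File 1.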

  6. PubMed article IDs

    • kaggle.com
    Updated Oct 16, 2023
    Cite
    Aarush Sinha (2023). PubMed article IDs [Dataset]. https://www.kaggle.com/datasets/chungimungi/pmidfinal/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aarush Sinha
    Description

    A dataset that contains the PMIDs of PubMed articles sourced from their OA Subset. It contains the PMID of each article and the PMIDs of the articles it cites. This was done by creating a parsing script (PubMed parser on GitHub) that uses the PMIDs to iterate through the website and gather the required information.

  7. pubmed_clean

    • huggingface.co
    Updated Dec 23, 2024
    Cite
    Amish (2024). pubmed_clean [Dataset]. https://huggingface.co/datasets/darknight054/pubmed_clean
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 23, 2024
    Authors
    Amish
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A cleaned dataset of the commercially available PubMed files. The script used for cleaning will be added soon.

  8. Diversity - PubMed Dataset

    • aws-databank-alb.library.illinois.edu
    • databank.illinois.edu
    Updated Oct 11, 2024
    Cite
    Apratim Mishra; Haejin Lee; Sullam Jeoung; Vetle Torvik; Jana Diesner (2024). Diversity - PubMed Dataset [Dataset]. http://doi.org/10.13012/B2IDB-5259667_V3
    Explore at:
    Dataset updated
    Oct 11, 2024
    Authors
    Apratim Mishra; Haejin Lee; Sullam Jeoung; Vetle Torvik; Jana Diesner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diversity - PubMed dataset
    Contact: Apratim Mishra (Oct, 2024)

    This dataset presents article-level (pmid) and author-level (auid) diversity data for PubMed articles. The chosen selection includes articles retrieved from Author-ity 2018 [1], 907 024 papers, and 1 316 838 authors, and is an expanded dataset of V1. The sample of articles consists of the top 40 journals in the dataset, limited to 2-12 authors published between 1991 – 2014, which are article type "journal type" written in English. Files are 'gzip' compressed and separated by tab space, and V3 includes the correct author count for the included papers (pmids) and updated results with no NaNs.

    ################################################
    File1: auids_plos_3.csv.gz (Important columns defined, 5 in total)
    • AUID: a unique ID for each author
    • Genni: gender prediction
    • Ethnea: ethnicity prediction

    #################################################
    File2: pmids_plos_3.csv.gz (Important columns defined)
    • pmid: unique paper
    • auid: all unique auids (author-name unique identification)
    • year: Year of paper publication
    • no_authors: Author count
    • journal: Journal name
    • years: first year of publication for every author
    • Country-temporal: Country of affiliation for every author
    • h_index: Journal h-index
    • TimeNovelty: Paper Time novelty [2]
    • nih_funded: Binary variable indicating funding for any author
    • prior_cit_mean: Mean of all authors’ prior citation rate
    • Insti_impact: All unique institutions’ citation rate
    • mesh_vals: Top MeSH values for every author of that paper
    • relative_citation_ratio: RCR

    The ‘Readme’ includes a description for all columns.

    [1] Torvik, Vetle; Smalheiser, Neil (2021): Author-ity 2018 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2273402_V1
    [2] Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1

  9. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    Updated Apr 27, 2018
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik (2018). Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Dataset updated
    Apr 27, 2018
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)
    ----------------------------------------------------------------------
    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

    ## Introduction
    This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt file - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
    * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
    * README.txt file

    ## Dataset creation
    Our experiments relied on data from multiple sources including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be utilized to get similar results to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets please make sure you cite both the dataset as well as the paper introducing the dataset.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles:
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions Additional data related updates can be found at Torvik Research Group

    ## Acknowledgments
    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License
    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.

  10. Trialstreamer data

    • explore.openaire.eu
    • zenodo.org
    Updated Apr 26, 2020
    Cite
    Iain Marshall; Benjamin Nye; Joël Kuiper; Rachel Marshall; Frank Soboczenski; Ani Nenkova; Anna Noel Storr; James Thomas; Byron Wallace (2020). Trialstreamer data [Dataset]. http://doi.org/10.5281/zenodo.6637538
    Explore at:
    Dataset updated
    Apr 26, 2020
    Authors
    Iain Marshall; Benjamin Nye; Joël Kuiper; Rachel Marshall; Frank Soboczenski; Ani Nenkova; Anna Noel Storr; James Thomas; Byron Wallace
    Description

    Trialstreamer annotated collection of RCTs. This repository contains baseline files (large), and subsequent updates (daily for PubMed, weekly for ICTRP).

  11. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    • explore.openaire.eu
    pdf, zip
    Updated Jul 11, 2024
    + more versions
    Cite
    Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sheeba Samuel; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata of the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We use the code for assessing the reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019, together with code adapted from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC using the esearch function for Jupyter notebooks using the query: "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
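    The search and link-extraction steps can be sketched without Biopython or network access. The block below builds the E-utilities esearch URL for the query quoted above and pulls owner/repository pairs out of free text. It is an illustration only: the study's own code uses Biopython, and the `esearch_url` and `github_repos` helpers, the `retmax` default, and the regular expression are assumptions made for this sketch.

```python
import re
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = "(ipynb OR jupyter OR ipython) AND github"


def esearch_url(term, retmax=100):
    """Build an E-utilities esearch URL against the PMC database."""
    return ESEARCH + "?" + urlencode({"db": "pmc", "term": term, "retmax": retmax})


# Hypothetical pattern: matches github.com/<owner>/<repo> and stops at the next slash.
GITHUB_LINK = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)")


def github_repos(text):
    """Return distinct owner/repo slugs mentioned in `text`, in order of appearance."""
    seen = []
    for owner, repo in GITHUB_LINK.findall(text):
        slug = f"{owner}/{repo.rstrip('.')}"  # drop a sentence-ending period
        if slug not in seen:
            seen.append(slug)
    return seen


url = esearch_url(QUERY)
repos = github_repos(
    "Code is at https://github.com/fusion-jena/computational-reproducibility-pmc. "
    "See also https://github.com/fusion-jena/computational-reproducibility-pmc/issues"
)
```

    Fetching `url` would return the matching PMC identifiers; the sketch stops short of the request so that it runs offline.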

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders and a workflow file:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find notebooks instrumental in the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology folder is stored in the analyses folder for further analysis. The path can however be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) is focused on examining data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) is for analyzing data associated with publications in PubMed Central, i.e. for plots involving data about articles, journals, publication dates or research fields. The resultant figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py
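    The config.py lines above read each setting from an environment variable and fall back to a default. The sketch below shows that lookup order in isolation. The `load_settings` helper is hypothetical and is not part of the repository; it takes the environment as a plain mapping so the behaviour can be checked without touching the real one. Note that the token is read from `JUP_GITHUB_PASSWORD`, as in the lines above.

```python
import os
from pathlib import Path


def load_settings(environ=os.environ):
    """Resolve settings from `environ` first, then from the defaults shown above."""
    return {
        "GITHUB_USERNAME": environ.get("JUP_GITHUB_USERNAME", "add your github username here"),
        "GITHUB_TOKEN": environ.get("JUP_GITHUB_PASSWORD", "add your github token here"),
        "BASE_DIR": Path(environ.get("JUP_BASE_DIR", "./")).expanduser(),
        "DB_CONNECTION": environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite"),
    }


defaults = load_settings({})
overridden = load_settings({"JUP_DB_CONNECTION": "sqlite:////tmp/example.sqlite"})
```

    In practice this means exporting, for example, `JUP_DB_CONNECTION` in the shell before running the pipeline should take precedence over the value written in config.py.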

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch JupyterLab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.

    References:

  12. MEDLINE/PubMed Baseline Statistics: Misc Report

    • catalog.data.gov
    • data.virginia.gov
    • +2 more
    Updated Jun 19, 2025
    + more versions
    Cite
    National Library of Medicine (2025). MEDLINE/PubMed Baseline Statistics: Misc Report [Dataset]. https://catalog.data.gov/dataset/2023-medline-pubmed-baseline-misc-report
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    A file containing all Misc Baseline Reports for 2018-2023 in their original format is available in the Attachments section below. MEDLINE/PubMed annual statistical reports, based upon the data elements in the baseline versions of MEDLINE®/PubMed, are available. For each year covered the reports include: total citations containing each element; total occurrences of each element; minimum/average/maximum occurrences of each element in a record; minimum/average/maximum length of a single element occurrence; average record size; and other statistical data describing the content and size of the elements.

  13. vi_pubmed

    • huggingface.co
    Updated Mar 14, 2023
    Cite
    Long Phan (2023). vi_pubmed [Dataset]. https://huggingface.co/datasets/justinphan3110/vi_pubmed
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2023
    Authors
    Long Phan
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for PubMed

      Dataset Summary
    

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    English

      Dataset Structure
    

    Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.

  14. PostgreSQL query to select the ten journals with the highest number of...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Kersten Döring; Björn A. Grüning; Kiran K. Telukunta; Philippe Thomas; Stefan Günther (2023). PostgreSQL query to select the ten journals with the highest number of publications containing the MeSH term “Leukemia” [20] on the complete PubMed data set. [Dataset]. http://doi.org/10.1371/journal.pone.0163794.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Kersten Döring; Björn A. Grüning; Kiran K. Telukunta; Philippe Thomas; Stefan Günther
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PostgreSQL query to select the ten journals with the highest number of publications containing the MeSH term “Leukemia” [20] on the complete PubMed data set.
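The query itself is not reproduced in this listing. As a rough illustration of the kind of aggregation the description names, here is a minimal sketch that runs against an in-memory SQLite database rather than PostgreSQL. The tables `citation` and `mesh_heading`, their columns, and the toy rows are all hypothetical and are not the schema used by the authors.

```python
import sqlite3

# Hypothetical schema: one row per article, plus one row per (article, MeSH descriptor).
TOP_JOURNALS_SQL = """
SELECT c.journal, COUNT(DISTINCT c.pmid) AS n_publications
FROM citation AS c
JOIN mesh_heading AS m ON m.pmid = c.pmid
WHERE m.descriptor = ?
GROUP BY c.journal
ORDER BY n_publications DESC, c.journal ASC
LIMIT 10
"""


def top_journals(conn, descriptor="Leukemia"):
    """Return up to ten (journal, count) rows for articles carrying the descriptor."""
    return conn.execute(TOP_JOURNALS_SQL, (descriptor,)).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE citation (pmid INTEGER PRIMARY KEY, journal TEXT);
        CREATE TABLE mesh_heading (pmid INTEGER, descriptor TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO citation VALUES (?, ?)",
        [(1, "Journal A"), (2, "Journal A"), (3, "Journal B"), (4, "Journal C")],
    )
    conn.executemany(
        "INSERT INTO mesh_heading VALUES (?, ?)",
        [(1, "Leukemia"), (2, "Leukemia"), (3, "Leukemia"), (4, "Asthma"), (1, "Humans")],
    )
    return conn
```

Calling `top_journals(demo_connection())` ranks the toy journals by the number of distinct PMIDs tagged with the descriptor; ties are broken alphabetically so the ordering is deterministic.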

  15. PMDB: a relational database for PubMed

    • zenodo.org
    tar
    Updated Jun 14, 2025
    Cite
    Jacob Hughey; Joshua Schoenbachler (2025). PMDB: a relational database for PubMed [Dataset]. http://doi.org/10.5281/zenodo.15658234
    Explore at:
    Available download formats: tar
    Dataset updated
    Jun 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jacob Hughey; Joshua Schoenbachler
    Description

    The files constitute a compressed dump of PMDB, which was created in PostgreSQL 14 using the pmparser R package. Once you have a Postgres server running, you can set up the database as follows:

    1. Untar the file containing the database dump, which will create a folder. Substitute

    tar xvf 

    2. Restore the database onto your Postgres server. Below is one way. Replace <...> as appropriate, substituting

    createdb -h 

    MEDLINE/PubMed data are courtesy of the U.S. National Library of Medicine. See NLM's Terms and Conditions.

  16. I

    MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and...

    • databank.illinois.edu
    Updated Aug 10, 2020
    Cite
    Vetle I. Torvik (2020). MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide [Dataset]. http://doi.org/10.13012/B2IDB-4354331_V1
    Explore at:
    Dataset updated
    Aug 10, 2020
    Authors
    Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description
    MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik 2018-04-05

    The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters), and should be about 3.5GB uncompressed.

    • How was the dataset created? The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLM's data Terms and Conditions.
    • Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only. However, MapAffil 2016 covers some PubMed records lacking affiliations that were harvested elsewhere, from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g. 5833220).
    • Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and html) so they may differ (sometimes a lot; see PMID 27487542) from PubMed records.
    • All affiliation strings were processed using the MapAffil procedure, to identify and disambiguate the most specific place-name, as described in: Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p
    • Look for Fig. 4 in the following article for coverage statistics over time: Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Tropical medicine and health. 2017 Dec;45(1):33. Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
    • The code and back-end data is periodically updated and made available for query by PMID at Torvik Research Group.
    • What is the format of the dataset? The dataset contains 37,406,692 rows. Each row (line) in the file has a unique PMID and author position (e.g., 10786286_3 is the third author name on PMID 10786286), and the following thirteen columns, tab-delimited. All columns are ASCII, except city which contains Latin-1.

    1. PMID: positive non-zero integer; int(10) unsigned
    2. au_order: positive non-zero integer; smallint(4)
    3. lastname: varchar(80)
    4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed
    5. year of publication
    6. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
    7. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
    8. state: Australia, Canada and USA (which includes territories like PR, GU, AS, and post-codes like AE and AA)
    9. country
    10. journal
    11. lat: at most 3 decimals (only available when city is not a country or state)
    12. lon: at most 3 decimals (only available when city is not a country or state)
    13. fips: varchar(5); for USA only retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
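The thirteen-column layout described for MapAffil can be read with the standard library alone. The sketch below is one possible reader, not the author's tooling. It assumes the file has no header line, that unavailable lat/lon values appear as empty fields, and that the Python key names (derived from the column list) are acceptable stand-ins. None of these is confirmed by the description.

```python
import csv

# Key names follow the thirteen-column list in the description; the exact
# Python identifiers are our own choice.
COLUMNS = [
    "pmid", "au_order", "lastname", "firstname", "year", "type",
    "city", "state", "country", "journal", "lat", "lon", "fips",
]


def read_mapaffil(lines):
    """Yield one dict per author-affiliation row from an iterable of text lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip malformed lines rather than guess at their layout
        row = dict(zip(COLUMNS, fields))
        row["pmid"] = int(row["pmid"])
        row["au_order"] = int(row["au_order"])
        # Assumption: lat/lon are empty when the city is only a state or country.
        row["lat"] = float(row["lat"]) if row["lat"] else None
        row["lon"] = float(row["lon"]) if row["lon"] else None
        # Unresolved place-name ambiguities are concatenated with '|'.
        row["city_candidates"] = row["city"].split("|")
        yield row
```

To read the real file, pass an open file object, for example `open(path, encoding="latin-1", newline="")`, where `path` is wherever the download was saved. Because the reader is a generator, the roughly 3.5GB file is never held in memory at once.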

  17. I

    uCite: The union of nine large-scale public PubMed citation datasets with...

    • databank.illinois.edu
    Updated Apr 4, 2025
    Cite
    Liri Fang; Malik Oyewale Salami; Griffin M. Weber; Vetle I. Torvik (2025). uCite: The union of nine large-scale public PubMed citation datasets with reliability labels [Dataset]. http://doi.org/10.13012/B2IDB-6818660_V1
    Explore at:
    Dataset updated
    Apr 4, 2025
    Authors
    Liri Fang; Malik Oyewale Salami; Griffin M. Weber; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset, uCite, is the union of nine large-scale open-access PubMed citation datasets, separated by reliability. There are 20 files, including the reliable and unreliable citation PMID pairs, non-PMID identifiers to PMID mapping (for DOIs, Lens, MAG, and Semantic Scholar), original PMID pairs from the nine resources, some metadata for PMIDs, duplicate PMIDs, some redirected PMID pairs, and PMC OA Patci citation matching results. The short description of each data file is listed as follows. A detailed description can be found in the README.txt.

    DATASET DESCRIPTION

    1. PPUB.tsv.gz - tsv format file containing reliable citation pairs uCite.
    2. PUNR.tsv.gz - tsv format file containing unreliable citation pairs uCite.
    3. DOI2PMID.tsv.gz - tsv format file containing results mapping DOI to PMID.
    4. LEN2PMID.tsv.gz - tsv format file containing results mapping LensID pairs to PMID pairs.
    5. MAG2PMIDsorted.tsv.gz - tsv format file containing results mapping MAG ID to PMID.
    6. SEM2PMID.tsv.gz - tsv format file containing results mapping Semantic Scholar ID to PMID.
    7. JVNPYA.tsv.gz - tsv format file containing metadata of papers with PMID, journal name, volume, issue, pages, publication year, and first author's last name.
    8. TiLTyAlJVNY.tsv.gz - tsv format file containing metadata of papers.
    9. PMC-OA-patci.tsv.gz - tsv format file containing PubMed Central Open Access subset reference strings extracted by \cite{} processed by Patci.
    10. REDIRECTS.gz - txt file containing unreliable PMID pairs mapped to reliable PMID pairs.
    11. REMAP - file containing pairs of duplicate PubMed records (lhs PMID mapped to rhs PMID).
    12. ami_pair.tsv.gz - tsv format file containing all citation pairs from Aminer (2015 version).
    13. dim_pair.tsv.gz - tsv format file containing all citation pairs from Dimensions.
    14. ice_pair.tsv.gz - tsv format file containing all citation pairs from iCite (April 2019 version, version 1).
    15. len_pair.tsv.gz - tsv format file containing all citation pairs from Lens.org (harvested through Oct 2021).
    16. mag_pair.tsv.gz - tsv format file containing all citation pairs from Microsoft Academic Graph (2015 version).
    17. oci_pair.tsv.gz - tsv format file containing all citation pairs from Open Citations (Nov. 2021 dump, csv version).
    18. pat_pair.tsv.gz - tsv format file containing all citation pairs from Patci (i.e., from "PMC-OA-patci.tsv.gz").
    19. pmc_pair.tsv.gz - tsv format file containing all citation pairs from PubMed Central (harvest through Dec 2018 via e-Utilities).
    20. sem_pair.tsv.gz - tsv format file containing all citation pairs from Semantic Scholar (2019 version).
    COLUMN DESCRIPTION

    FILENAME : PPUB.tsv.gz, PUNR.tsv.gz
    (1) fromPMID - PubMed ID of the citing paper.
    (2) toPMID - PubMed ID of the cited paper.
    (3) sources - citation sources, in which the citation pairs are identified.
    (4) fromYEAR - Publication year of the citing paper.
    (5) toYEAR - Publication year of the cited paper.

    FILENAME : DOI2PMID.tsv.gz
    (1) DOI - Semantic Scholar ID of paper records.
    (2) PMID - PubMed ID of paper records.
    (3) PMID2 - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.

    FILENAME : SEMID2PMID.tsv.gz
    (1) SemID - Semantic Scholar ID of paper records.
    (2) PMID - PubMed ID of paper records.
    (3) DOI - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.

    FILENAME : JVNPYA.tsv.gz - Each row refers to a publication record.
    (1) PMID - PubMed ID.
    (2) journal - Journal name.
    (3) volume - Journal volume.
    (4) issue - Journal issue.
    (5) pages - The first page and last page (without leading digits) number of the publication separated by '-'.
    (6) year - Publication year.
    (7) lastname - Last name of the first author.

    FILENAME : TiLTyAlJVNY.tsv.gz
    (1) PMID - PubMed ID.
    (2) title_tokenized - Paper title after tokenization.
    (3) languages - Language that paper is written in.
    (4) pub_types - Types of the publication.
    (5) length(authors) - String length of author names.
    (6) journal - Journal name.
    (7) volume - Journal volume.
    (8) issue - Journal issue.
    (9) year - Publication year of print (not necessarily epub).

    FILENAME : PMC-OA-patci.tsv.gz
    (1) pmcid - PubMed Central identifier.
    (2) pos -
    (3) fromPMID - PubMed ID of the citing paper.
    (4) toPMID - PubMed ID of the cited paper.
    (5) SRC - citation sources, in which the citation pairs are identified.
    (6) MatchDB - PubMed, ADS, DBLP.
    (7) Probability - Matching probability predicted by Patci.
    (8) toPMID2 - PubMed ID of the cited paper, extracted from OA xml file.
    (9) SRC2 - citation sources, in which the citation pairs are identified.
    (10) intxt_id -
    (11) jounal - First character of the journal name.
    (12) same_ref_string - Y if patci and xml reference string match, otherwise N.
    (13) DIFF -
    (14) bestSRC - Citation sources, in which the citation pairs are identified.
    (15) Match - Matching strings annotated by Patci.

    FILENAME : REDIRECTS.gz
    Each row in Redirectis.txt is a string in one of the following formats:
    - "REDIRECTED FROM: source PMID_i PMID_j -> PMID_i' PMID_j "
    - "REDIRECTED TO: source PMID_i PMID_j -> PMID_i PMID_j' "
    Note: source is the name of the source(s) that PMID_i and PMID_j come from.

    FILENAME : REMAP
    Each row is a remapping from unreliable PMID pairs to reliable PMID pairs. The format of each row is "$REMAP{PMID_i} = PMID_j".

    FILENAME : ami_pair.tsv.gz, dim_pair.tsv.gz, ice_pair.tsv.gz, len_pair.tsv.gz, mag_pair.tsv.gz, oci_pair.tsv.gz, pat_pair.tsv.gz, pmc_pair.tsv.gz, sem_pair.tsv.gz
    (1) fromPMID - PubMed ID of the citing paper.
    (2) toPMID - PubMed ID of the cited paper.
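As an illustration of the column layouts above, the following sketch reads the five-column pair files (PPUB.tsv.gz, PUNR.tsv.gz) and parses REMAP rows of the form "$REMAP{PMID_i} = PMID_j". The function names are hypothetical. Two points are assumptions: that PMIDs are plain digit strings, and that a header row, if present, can be recognised by a non-numeric first field. The README.txt mentioned in the description is the authority on both.

```python
import csv
import gzip
import re

PAIR_COLUMNS = ["fromPMID", "toPMID", "sources", "fromYEAR", "toYEAR"]

# Matches rows such as "$REMAP{123} = 456"; anything after the second number is ignored.
REMAP_LINE = re.compile(r"^\$REMAP\{(\d+)\}\s*=\s*(\d+)")


def open_tsv_gz(path):
    """Open a gzip-compressed TSV file as a text stream."""
    return gzip.open(path, "rt", encoding="utf-8", newline="")


def read_pairs(lines):
    """Yield citation pairs from an iterable of tab-delimited text lines."""
    for fields in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) != len(PAIR_COLUMNS):
            continue  # malformed row
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # skips a header line, if the file has one
        row = dict(zip(PAIR_COLUMNS, fields))
        row["fromPMID"] = int(row["fromPMID"])
        row["toPMID"] = int(row["toPMID"])
        yield row  # sources and years are left as raw strings


def parse_remap(lines):
    """Build a {duplicate PMID: target PMID} dict from REMAP rows."""
    mapping = {}
    for line in lines:
        match = REMAP_LINE.match(line.strip())
        if match:
            mapping[int(match.group(1))] = int(match.group(2))
    return mapping
```

A typical use would be `with open_tsv_gz(path) as fh: for pair in read_pairs(fh): ...`, with `path` pointing at a downloaded copy of one of the pair files.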

  18. Datasets for OntoClue Project

    • zenodo.org
    tsv, zip
    Updated Feb 13, 2025
    Cite
    Rohitha Ravinder; Lukas Geist; Dietrich Rebholz-Schuhmann; Leyla Jael Castro (2025). Datasets for OntoClue Project [Dataset]. http://doi.org/10.5281/zenodo.14801641
    Explore at:
    Available download formats: tsv, zip
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rohitha Ravinder; Lukas Geist; Dietrich Rebholz-Schuhmann; Leyla Jael Castro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This release contains the datasets and files associated with the OntoClue project, which investigates various text embedding techniques for assessing document-to-document similarity in biomedical literature. The project primarily utilizes the RELISH Corpus [1], a comprehensive dataset curated by experts that includes relevance annotations for document pairs based on their similarity. This release includes datasets for establishing ground truth, as well as retrieved titles and abstracts for all PMIDs in the RELISH database. The files also contain preprocessed tokens for use in text embedding neural network models, as well as annotated tokens based on the MeSH (Medical Subject Headings) [2] vocabulary.

    Data Structure and Files

    1. missing_pmids.tsv: List of PMIDs for which titles and abstracts could not be retrieved
    2. relevance_matrix.tsv: Ground truth dataset file derived from the RELISH JSON file containing 189,634 document pairs, with three columns: PMID1 (reference article), PMID2 (assessed article), and relevance (relevance score between the two documents). Consists of 68,479 completely relevant pairs, 65,406 partially relevant pairs and 55,749 irrelevant pairs.
    3. relish_documents.tsv: Contains retrieved RELISH documents, including PMID, title and abstract (163,189 articles)
    4. relish_bert_input_text.zip: Preprocessed titles and abstracts for use with BERT-based models
    5. relish_preprocessed_normal_tokens.zip: Document text preprocessed for use with all embeddings approaches
    6. relish_normal_split_datasets.zip: Preprocessed document text split into training, validation and test datasets
    7. relish_xml_files.zip: RELISH articles retrieved as XML files
    8. relish_annotated_xml_files.zip: Annotated XML files of RELISH articles (163,189 articles)
    9. relish_preprocessed_annotated_tokens.zip: Document text preprocessed for use with all embeddings approaches, with annotations
    10. relish_annotated_split_datasets.zip: Preprocessed and annotated document text split into training, validation and test datasets
    11. relish_ground_truth_split_datasets.zip: Ground truth dataset split into training, validation and test datasets
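The three-column layout of relevance_matrix.tsv (PMID1, PMID2, relevance) lends itself to a quick sanity check of the class sizes quoted in the file list. This is a minimal sketch. Whether the file carries a header row and how the relevance scores are encoded are not stated in the description, so the `has_header` flag is an assumption and the scores are counted as raw strings.

```python
import csv
from collections import Counter


def relevance_counts(lines, has_header=True):
    """Count document pairs per relevance value in a three-column TSV."""
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: first line names the columns
    return Counter(row[2] for row in reader if len(row) == 3)


# The three class sizes quoted in the description add up to the stated total.
assert 68_479 + 65_406 + 55_749 == 189_634
```

Passing an open file object for the downloaded relevance_matrix.tsv should give three counts that can be compared against the figures in the description.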

    Data Collection

    The RELISH dataset v1 was downloaded from the corresponding FigShare record [3] on January 24th, 2022. The dataset, in JSON format, contains PubMed IDs (PMIDs) along with relevance assessments for document pairs. Using the BioC API, we retrieved XML files containing the PMID, title, and abstract for each unique entry in the RELISH JSON file. Any PMIDs that failed to retrieve, or lacked titles and abstracts, were recorded as missing. In total, 163,189 XML files were successfully retrieved. These XML files were also converted into a TSV file with three columns: PMID, title, and abstract. The text from the titles and abstracts was further preprocessed for use in various approaches.

    References

    [1] Peter Brown, RELISH Consortium , Yaoqi Zhou, Large expert-curated database for benchmarking document similarity detection in biomedical literature search, Database, Volume 2019, 2019, baz085, https://doi.org/10.1093/database/baz085

    [2] Lipscomb C. E. (2000). Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265–266.

    [3] Brown, Peter (2019). RELISH_v1. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7722905.v1

  19. I

    Dataset for "Continued use of retracted papers: Temporal trends in citations...

    • databank.illinois.edu
    Updated Apr 6, 2021
    Cite
    Tzu-Kun Hsiao; Jodi Schneider (2021). Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine" [Dataset]. http://doi.org/10.13012/B2IDB-8255619_V2
    Explore at:
    Dataset updated
    Apr 6, 2021
    Authors
    Tzu-Kun Hsiao; Jodi Schneider
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Alfred P. Sloan Foundation
    U.S. National Institutes of Health (NIH)
    Description

    This dataset includes five files. Descriptions of the files are given as follows:

    FILENAME: PubMed_retracted_publication_full_v3.tsv
    - Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT]).
    - Except for the information in the "cited_by" column, all the data is from PubMed.
    - PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses:
      [1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
      [2] Citing paper and the cited retracted paper have the same PMID.

    ROW EXPLANATIONS
    - Each row is a retracted paper. There are 7,813 retracted papers.

    COLUMN HEADER EXPLANATIONS
    1) PMID - PubMed ID
    2) Title - Paper title
    3) Authors - Author names
    4) Citation - Bibliographic information of the paper
    5) First Author - First author's name
    6) Journal/Book - Publication name
    7) Publication Year
    8) Create Date - The date the record was added to the PubMed database
    9) PMCID - PubMed Central ID (if applicable, otherwise blank)
    10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
    11) DOI - Digital object identifier (if applicable, otherwise blank)
    12) retracted_in - Information of retraction notice (given by PubMed)
    13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
    14) cited_by - PMIDs of the citing papers (if applicable, otherwise blank). Data collected from iCite.
    15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)

    FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv
    - This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
    - This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
    - Citation contexts that meet either of the two conditions below have been excluded from analyses:
      [1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
      [2] Citing paper and the cited retracted paper have the same PMID.

    ROW EXPLANATIONS
    - Each row is a citation context associated with one retracted paper that's cited.
    - In the manuscript, we count each citation context once, even if it cites multiple retracted papers.

    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) year - Publication year of the citing paper
    4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
    5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
    6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
    7) total_sentences - Total number of sentences in a given location
    8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
    9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
    10) citation - The citation context
    11) progression - Position of a citation context by centile within the citing paper.
    12) retracted_yr - Retraction year of the retracted paper
    13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.

    FILENAME: 724_knowingly_post_retraction_cit.csv (updated)
    - The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
    - Two citation contexts from retraction notices have been excluded from analyses.

    ROW EXPLANATIONS
    - Each row is a citation context.

    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) pub_type - Publication type collected from the metadata in the PMCOA XML files.
    4) pub_type2 - Specific article types. Please see the manuscript for explanations.
    5) year - Publication year of the citing paper
    6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
    7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
    8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
    9) citation - The citation context
    10) retracted_yr - Retraction year of the retracted paper
    11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
    12) longer_context - An extended version of the citation context (if applicable, otherwise blank). Manually pulled from the full-texts in the process of annotation.

    FILENAME: Annotation manual.pdf
    - The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.tsv.

    FILENAME: retraction_notice_PMID.csv (new file added for this version)
    - A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT]).
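Two rules in the description are precise enough to write down: the exclusion of citations that come from retraction notices or that are self-referential by PMID, and the definition of a post-retraction citation as one made after the calendar year of retraction. The sketch below encodes both. The function names are hypothetical, and returning `None` for a missing retraction year is our own choice, since the description does not say how blank years were handled.

```python
def keep_citation(citing_pmid, cited_pmid, retraction_notice_pmids):
    """Apply the two exclusion conditions listed in the description."""
    if citing_pmid in retraction_notice_pmids:
        return False  # [1] the citing paper is itself a retraction notice
    if citing_pmid == cited_pmid:
        return False  # [2] citing and cited paper share the same PMID
    return True


def post_retraction(citing_year, retracted_yr):
    """Return 1 for a citation made after the calendar year of retraction, else 0."""
    if retracted_yr is None:
        return None  # assumption: no flag when the retraction year is blank
    return 1 if citing_year > retracted_yr else 0
```

Under this reading, a citation published in the same calendar year as the retraction is not counted as post-retraction, which matches the wording "after the calendar year of retraction".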

  20. d

    Distribution of trial registry numbers within full-text PubMed Central -...

    • search.dataone.org
    • datadryad.org
    Updated Feb 5, 2025
    Cite
    Arthur Holt; Neil Smalheiser; Ang Troy (2025). Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links [Dataset]. http://doi.org/10.5061/dryad.dbrv15fb1
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Arthur Holt; Neil Smalheiser; Ang Troy
    Description

    Linking registered clinical trials with their published results continues to be a challenge. A variety of natural language processing (NLP)-based and machine learning-based models have been developed to assist users in identifying these connections. Articles from the PubMed Central full-text collection were scanned for mentions of ClinicalTrials.gov and international clinical trial registry identifiers. We analyzed the distribution of trial registry numbers within sections of the articles and characterized their publication type indexing and other metrics. Three supporting files are included herein: a pdf containing supplementary figures pertaining to the distribution of registry numbers found within the full text of articles, a csv dataset providing the registry numbers discovered and the corresponding XML path location within the document, and an example Python script to locate registry identifiers within an XML article document. It should be noted that the purpose of this study is to...

    These datasets and files are the results of scanning 6,901,686 XML documents within the PubMed Central Open Access article datasets available at: https://ftp.ncbi.nlm.nih.gov/pub/pmc/ Each registry identifier match is represented by a row in the xmlScanOutput.csv file, along with PubMed identifiers, file information, XML path information, and several computed columns including a validation that an NCT number exists within ClinicalTrials.gov, a generalized article section, and publication types from multiple indexing sources. Summaries within the Distribution_of_Trial_Registry_Numbers_Additional_File.pdf were generated by counting distinct PMID values within the csv file across various groups.

    Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links

    https://doi.org/10.5061/dryad.dbrv15fb1

    This data set contains a table with every combination of publication ID, registry number, XML path, and section of the publication discovered in the Full-Text scanning of PubMed Central articles.
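To make the idea of pairing a registry number with its XML path concrete, here is a minimal sketch that walks an XML document and reports every ClinicalTrials.gov identifier (the letters NCT followed by eight digits) together with the element path where it occurs. It is not the example script shipped with the dataset. The function name, the slash-joined path format, and the restriction to NCT numbers are assumptions made for illustration, and other registries' identifier formats are not covered.

```python
import re
import xml.etree.ElementTree as ET

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_nct_ids(xml_text):
    """Return (identifier, element path) tuples in document order."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, parent_path):
        here = f"{parent_path}/{elem.tag}"
        # Text directly inside this element: its leading text plus the tail
        # that follows each child (the tail belongs to the parent's content).
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        for chunk in chunks:
            for identifier in NCT_PATTERN.findall(chunk):
                hits.append((identifier, here))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return hits
```

Mapping the reported paths onto generalized article sections, as the dataset does, would need the explicit criteria given in the additional-file PDF, and this sketch does not attempt that step.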

    Description of the data and file structure

    Distribution_of_Trial_Registry_Numbers_Additional_File.pdf

    This document contains charts and summaries of the trial registry numbers found from the XML document scanning process. The explicit criteria for locating registry identifiers and designating article sections are provided in this document and may be useful for further research and refinement.

    Distribution_of_Trial_Registry_Numbers_ScanOutput.zip

    This zip archive contains a comma-separated file named "xmlScanOutput.csv" that contains all rows of registry numbers and art...

Cite
Hà Huy Hoàng (2025). pubmed25 [Dataset]. https://huggingface.co/datasets/HoangHa/pubmed25

pubmed25

HoangHa/pubmed25

Explore at:
20 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 26, 2025
Authors
Hà Huy Hoàng
License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information. This version is modified to extract the full text from structured abstracts.
