100+ datasets found
  1. Article data for citation analysis

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    Zihan Wang (2024). Article data for citation analysis [Dataset]. https://www.kaggle.com/datasets/thuwangzh/article-data
    Explore at:
    zip (344,109 bytes). Available download formats
    Dataset updated
    Sep 1, 2024
    Authors
    Zihan Wang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Our data is sourced from Web of Science, an academic information retrieval platform. In the field of citation analysis, a recurring criticism revolves around "field-dependent factors," which highlight that citation practices vary across different scientific disciplines. To enhance the credibility of our results, we focus exclusively on a single discipline, specifically Statistics & Probability, for citation analysis. Additionally, we limit our data to articles published between 2009 and 2018, as articles published within the last five years often have very few citations, which could skew the results. Moreover, there were few articles in the Statistics & Probability category before 2009. To minimize result variance, we selected articles contributed by scholars from Tsinghua University and Peking University, the two most influential universities in China, ensuring a baseline quality for the articles. In total, we exported detailed information on 566 articles from Web of Science (WoS).

  2. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)

    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

    ## Introduction

    This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed as the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle (2nd) authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficients and result files in NumPy format used for plotting purposes; v4.reviewer contains models for analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics directly. However, we do provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our own datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of them, please make sure you cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018). Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018). Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018). MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018). Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018). Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information on obtaining PubMed/MEDLINE, and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
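    The training files ship their column names in a separate header file (Training_data_2002_2005_pmc_pair_txt.header.txt). Below is a minimal sketch of stitching header and data back together with the standard library; the column names and rows are invented stand-ins, since the real files are roughly a gigabyte each:

```python
# Sketch only: reading a tab-separated data file whose header lives in a
# separate file, as in this dataset. Column names and rows are invented
# stand-ins, not the dataset's actual schema.
import csv
import io

header_txt = "author_id\tpmid\tis_self_citation\n"  # stand-in for the .header.txt file
data_txt = "A1\t12345\t1\nA2\t67890\t0\n"           # stand-in for a training file

columns = header_txt.strip().split("\t")
reader = csv.DictReader(io.StringIO(data_txt), fieldnames=columns, delimiter="\t")
rows = list(reader)

print(rows[0]["is_self_citation"])  # -> 1
```

For the real files, replace the in-memory strings with `open(...)` handles; `csv.DictReader` streams rows, so the 1.2G files need not be loaded at once.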

  3. The location of the citation: changing practices in how publications cite original data in the Dryad Digital Repository

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Oct 31, 2016
    Cite
    Christine Mayo; Todd J. Vision; Elizabeth A. Hull (2016). The location of the citation: changing practices in how publications cite original data in the Dryad Digital Repository [Dataset]. http://doi.org/10.5061/dryad.8q931
    Explore at:
    zip. Available download formats
    Dataset updated
    Oct 31, 2016
    Dataset provided by
    Dryad
    Authors
    Christine Mayo; Todd J. Vision; Elizabeth A. Hull
    Time period covered
    Jan 19, 2016
    Description

    Citation location results (citation_locations.txt): Details of the articles analyzed to determine the location of original data identifiers within the article. Only includes results for articles that were Open Access and full-text indexed in EPMC.

    Citation location script (citation_locations.py): This Python script imports lists of Dryad metadata and uses it to search Europe PMC and to locate and classify data references appearing in the literature.
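    The description mentions a Python script that searches Europe PMC for Dryad data references. As a hedged sketch of the general approach (not the authors' actual citation_locations.py), one can build a full-text search URL against Europe PMC's public REST API and apply a coarse location classification; the DOI and the classification rules below are invented for illustration:

```python
# Hedged sketch: querying Europe PMC's REST search endpoint for articles that
# mention a Dryad data DOI, then coarsely classifying where the reference
# appears. Not the authors' script; the DOI and rules are illustrative.
from urllib.parse import urlencode

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_epmc_query(dryad_doi: str, page_size: int = 25) -> str:
    """Build a full-text phrase-search URL for one Dryad data DOI."""
    params = {
        "query": f'"{dryad_doi}"',  # quoted phrase search
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EPMC_SEARCH}?{urlencode(params)}"

def classify_location(section: str) -> str:
    """Coarse, invented classification of the article section holding a data reference."""
    section = section.lower()
    if "reference" in section:
        return "reference list"
    if "method" in section or "data" in section:
        return "methods/data availability"
    return "body text"

url = build_epmc_query("10.5061/dryad.example")  # hypothetical DOI
print(url)
print(classify_location("Data Availability"))
```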

  4. Data Citation Corpus Data File

    • zenodo.org
    zip
    Updated Oct 14, 2024
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773
    Explore at:
    zip. Available download formats
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    DataCite (https://www.datacite.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g. 2024-08-23-data-citation-corpus-01-v2.0.json.

    The data citations in the file originate from DataCite Event Data and a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication

    The data file includes the following fields:

    id: Internal identifier for the citation (required)

    created: Date of item's incorporation into the corpus (required)

    updated: Date of item's most recent update in the corpus (required)

    repository: Repository where cited data is stored (optional)

    publisher: Publisher for the article citing the data (optional)

    journal: Journal for the article citing the data (optional)

    title: Title of cited data (optional)

    publication: DOI of the article where data is cited (required)

    dataset: DOI or accession number of cited data (required)

    publishedDate: Date when the citing article was published (optional)

    source: Source where the citation was harvested (required)

    subjects: Subject information for cited data (optional)

    affiliations: Affiliation information for the creator of cited data (optional)

    funders: Funding information for cited data (optional)

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
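    Putting the field list above to work, here is a small sketch that validates the required fields of a single citation record; every value is invented for demonstration, not drawn from the corpus:

```python
# Illustrative sketch: checking one Data Citation Corpus record (JSON format)
# for the required fields named in the field list above. All values are
# invented placeholders.
import json

record_json = json.dumps({
    "id": "00000000-0000-0000-0000-000000000000",
    "created": "2024-08-23",
    "updated": "2024-08-23",
    "repository": "Example Repository",               # optional field
    "publication": "10.1234/example-article",         # DOI of the citing article
    "dataset": "10.5678/example-dataset",             # DOI or accession number of cited data
    "source": "datacite",
})

record = json.loads(record_json)

# Required fields per the list above:
required = ["id", "created", "updated", "publication", "dataset", "source"]
missing = [f for f in required if f not in record]

print(missing)            # -> []
print(record["dataset"])  # -> 10.5678/example-dataset
```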

    The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    • Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024

    Remove citation records deemed out of scope for the corpus:

    • 273,567 records from DataCite Event Data with non-citation relationship types

    • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

    • 44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication

    • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

    • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)

    Metadata enhancements:

    • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

    • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

    Data structure updates to improve usability and eliminate redundancies:

    • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

    • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

    • Remove relationTypeId fields as these are specific to Event Data only

    Full details of the above changes, including scripts used to perform the above tasks, are available on GitHub.

    While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


    Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

  5. Raw and aggregated data for the study introduced in the article "An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic metadata errors"

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Aug 12, 2021
    Cite
    Alves dos Santos, Erika; Peroni, Silvio; Mucheroni, Marcos Luiz (2021). Raw and aggregated data for the study introduced in the article "An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic metadata errors" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5185363
    Explore at:
    Dataset updated
    Aug 12, 2021
    Dataset provided by
    School of Communication and Arts (ECA), Department of Information & Culture (CBD), University of São Paulo (USP), São Paulo, Brazil
    Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
    Authors
    Alves dos Santos, Erika; Peroni, Silvio; Mucheroni, Marcos Luiz
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains all the raw data and aggregated data subject of the study introduced in the article "An analysis of citing and referencing habits across all scholarly disciplines: approaches and trends in bibliographic metadata errors". The study is based on the bibliographic and citation data contained in 729 articles published in 147 journals in 27 subject areas. The articles contained a total amount of 34,140 bibliographic references and 55,100 mentions and quotations overall.

    The dataset is composed of a series of files:

    the files "subject_area_.csv" contain the raw data of the articles published in the journals of all the disciplines considered in the study;

    the file "article_data_summary.csv" contains the aggregated data created considering the raw data in the previous files, which have been used to create all the tables and figures in the article;

    the file "starred_metadata_set.csv" contains information about the most used subset of bibliographic metadata;

    the file "journals_selection.csv" contains information about all the journals selected for the study.

  6. Data from: Diversity in citations to a single study: Supplementary data set for citation context network analysis

    • data.niaid.nih.gov
    Updated Aug 25, 2021
    Cite
    Leng, Rhodri Ivor (2021). Diversity in citations to a single study: Supplementary data set for citation context network analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5244799
    Explore at:
    Dataset updated
    Aug 25, 2021
    Dataset provided by
    University of Edinburgh
    Authors
    Leng, Rhodri Ivor
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This document describes the data set used for all analyses in 'Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited' accepted for publication in Quantitative Science Studies [1].

    Data Collection

    The data collection procedure has been fully described [1]. Concisely, the data set contains bibliometric data collected from Web of Science Core Collection via the University of Edinburgh’s Library subscription concerning all papers that cited a cohort study, Paul et al. [2], in the period before 1985. This includes a full list of citing papers, and the citations between these papers. Additionally, it includes textual passages (citation contexts) from 343 citing papers, which were manually recovered from the full-text documents accessible via the University of Edinburgh’s Library subscription. These data have been cleaned, converted into network-readable datasets, and coded into particular classifications reflecting content, which are described fully in the supplied code book and within the manuscript [1].

    Data description

    All relevant data can be found in the attached file 'Supplementary_material_Leng_QSS_2021.xlsx', which contains the following five workbooks:

    “Overview” includes a list of the content of the workbooks.

    “Code Book” contains the coding rules and definitions used for the classification of findings and paper titles.

    “Node attribute list” includes a workbook containing all node attributes for the citation network, which includes Paul et al. [2] and its citing papers as of 1984. Highlighted in yellow at the bottom of this workbook are two papers that were discarded due to duplication; remove these if analysing this dataset in a network analysis. The columns refer to:

    Id, the node identifier

    Label, the formal citation of the paper to which data within this row corresponds. Citation is in the following format: last name of first author, year of publication, journal of publication, volume number, start page, and DOI (if available).

    Title, the paper title for the paper in question.

    Publication_year, the year of publication.

    Document_type, the document type (e.g. review, article)

    WoS_ID, the paper’s unique Web of Science accession number.

    Citation_context, a column specifying whether citation context data is available from that paper

    Explanans, the title explanans terms for that paper;

    Explanandum, the explanandum terms for that paper.

    Combined_Title_Classification, the combined terms used for fig 2 of the published manuscript.

    Serum_cholesterol_(SC), a column identifying papers that cited the serum cholesterol findings.

    Blood_Pressure_(BP), a column identifying papers that cited the blood pressure findings.

    Coffee_(C), a column identifying papers that cited the coffee findings.

    Diet_(D), a column identifying papers that cited the dietary findings.

    Smoking_(S), a column identifying papers that cited the smoking findings.

    Alcohol_(A), a column identifying papers that cited the alcohol findings.

    Physical_Activity_(PA), a column identifying papers that cited the physical activity findings.

    Body_Fatness (BF), a column identifying papers that cited the body fatness findings.

    Indegree, the number of within network citations to that paper, calculated for the network shown in Fig 4 of the manuscript.

    Outdegree, the number of within network references of that paper as calculated for the network in Fig 4.

    Main_component, a column specifying whether a node is contained in the largest weakly connect component as shown in Fig 4 of the manuscript.

    Cluster, provides the cluster membership number as discussed within the manuscript (Fig 5).

    “Edge list” includes a workbook including the edges for the network. The columns refer to:

    Source, contains the node identifier of the citing paper.

    Target, contains the node identifier of the cited paper.

    “Citation context classification” includes a workbook containing the WoS accession number for the paper analysed, and any finding category discussed in that paper established via context analysis (see the code book for definitions). The columns refer to:

    Id, the node identifier

    Finding_Class, the findings discussed from Paul et al. within the body of the citing paper.

    “Citation context data” includes a workbook containing the WoS accession number for papers in which citation context data was available, the citation context passages, the reference number or format of Paul et al. within the citing paper, and the finding categories discussed in those contexts (see code book for definitions). The columns refer to:

    Id, the node identifier

    Citation_context, the passage copied from the full text of the citing paper containing discussion of the findings of Paul et al.

    Reference_in_citing_article, the reference number or format of Paul et al. within the citing paper.

    Finding_class, the findings discussed from Paul et al. within the body of the citing paper.
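    The Indegree and Outdegree attributes described earlier can be recomputed directly from the “Edge list” workbook (Source = citing paper, Target = cited paper). A minimal sketch with invented node identifiers:

```python
# Sketch: recomputing within-network indegree/outdegree from (Source, Target)
# edge pairs, matching the Indegree and Outdegree node attributes described
# above. Node identifiers here are invented.
from collections import Counter

edges = [("n2", "n1"), ("n3", "n1"), ("n3", "n2")]  # (Source, Target) pairs

outdegree = Counter(src for src, _ in edges)  # within-network references made
indegree = Counter(tgt for _, tgt in edges)   # within-network citations received

print(indegree["n1"])   # -> 2
print(outdegree["n3"])  # -> 2
```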

    Software recommended for analysis

    For the analyses performed within the manuscript, Gephi version 0.9.2 was used [3], and both the edge and node lists are in a format that is easily read into this software. The Sci2 tool was used to parse data initially [4].

    Notes

    1. Leng, R. I. (Forthcoming). Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited. Quantitative Science Studies.

    2. Paul, O., Lepper, M. H., Phelan, W. H., Dupertuis, G. W., Macmillan, A., McKean, H., et al. (1963). A longitudinal study of coronary heart disease. Circulation, 28, 20-31. https://doi.org/10.1161/01.cir.28.1.20

    3. Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.

    4. Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. https://sci2.cns.iu.edu

  7. iCite Database Snapshot 2022-10

    • nih.figshare.com
    bin
    Updated Jun 4, 2023
    + more versions
    Cite
    iCite; B. Ian Hutchins; George Santangelo; Ehsanul Haque (2023). iCite Database Snapshot 2022-10 [Dataset]. http://doi.org/10.35092/yhjc21502470.v1
    Explore at:
    bin. Available download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    iCite; B. Ian Hutchins; George Santangelo; Ehsanul Haque
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many citations per year. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 citations or more, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more, and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    x_coord: X coordinate of the article on the Triangle of Biomedicine

    y_coord: Y Coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.
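    As the field definitions above state, citations_per_year is the numerator and expected_citations_per_year the denominator of the Relative Citation Ratio, so any row of the snapshot can be sanity-checked with a one-line ratio. The values below are invented:

```python
# RCR = citations_per_year / expected_citations_per_year, per the field
# definitions above. The numbers are invented for illustration.
citations_per_year = 6.0
expected_citations_per_year = 3.0

rcr = citations_per_year / expected_citations_per_year
print(rcr)  # -> 2.0 (twice the median NIH-funded paper in its field and year)
```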

    Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.

    Comments and questions can be addressed to iCite@mail.nih.gov

  8. October 2023 data-update for "Updated science-wide author databases of standardized citation indicators"

    • elsevier.digitalcommonsdata.com
    Updated Oct 4, 2023
    + more versions
    Cite
    John P.A. Ioannidis (2023). October 2023 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.6
    Explore at:
    Dataset updated
    Oct 4, 2023
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single recent year impact. Metrics with and without self-citations and ratio of citations to citing papers are given. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2022 and single recent year data pertain to citations received during calendar year 2022. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (6) is based on the October 1, 2023 snapshot from Scopus, updated to end of citation year 2022. This work uses Scopus data provided by Elsevier through ICSR Lab (https://www.elsevier.com/icsr/icsrlab). Calculations were performed using all Scopus author profiles as of October 1, 2023. If an author is not on the list it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.

    PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, please read the 3 associated PLoS Biology papers that explain the development, validation and use of these metrics and databases. (https://doi.org/10.1371/journal.pbio.1002501, https://doi.org/10.1371/journal.pbio.3000384 and https://doi.org/10.1371/journal.pbio.3000918).

    Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a

  9. Using social media to promote academic research: Identifying the benefits of...

    • plos.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Samara Klar; Yanna Krupnikov; John Barry Ryan; Kathleen Searles; Yotam Shmargad (2023). Using social media to promote academic research: Identifying the benefits of twitter for sharing academic work [Dataset]. http://doi.org/10.1371/journal.pone.0229446
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Samara Klar; Yanna Krupnikov; John Barry Ryan; Kathleen Searles; Yotam Shmargad
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To disseminate research, scholars once relied on university media services or journal press releases, but today any academic can turn to Twitter to share their published work with a broader audience. The possibility that scholars can push their research out, rather than hope that it is pulled in, holds the potential for scholars to draw wide attention to their research. In this manuscript, we examine whether there are systematic differences in the types of scholars who most benefit from this push model. Specifically, we investigate the extent to which there are gender differences in the dissemination of research via Twitter. We carry out our analyses by tracking tweet patterns for articles published in six journals across two fields (political science and communication), and we pair this Twitter data with demographic and educational data about the authors of the published articles, as well as article citation rates. We find considerable evidence that, overall, article citations are positively correlated with tweets about the article, and we find little evidence to suggest that author gender affects the transmission of research in this new media.

  10. Dissemination of novel biostatistics methods: Impact of programming code...

    • plos.figshare.com
    doc
    Updated May 31, 2023
    Cite
    Amy E. Wahlquist; Lutfiyya N. Muhammad; Teri Lynn Herbert; Viswanathan Ramakrishnan; Paul J. Nietert (2023). Dissemination of novel biostatistics methods: Impact of programming code availability and other characteristics on article citations [Dataset]. http://doi.org/10.1371/journal.pone.0201590
    Explore at:
    docAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Amy E. Wahlquist; Lutfiyya N. Muhammad; Teri Lynn Herbert; Viswanathan Ramakrishnan; Paul J. Nietert
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: As statisticians develop new methodological approaches, there are many factors that influence whether others will utilize their work. This paper is a bibliometric study that identifies and quantifies associations between characteristics of new biostatistics methods and their citation counts. Of primary interest was the association between numbers of citations and whether software code was available to the reader. Methods: Statistics journal articles published in 2010 from 35 statistical journals were reviewed by two biostatisticians. Generalized linear mixed models were used to determine which characteristics (author, article, and journal) were independently associated with citation counts (as of April 1, 2017) in other peer-reviewed articles. Results: Of 722 articles reviewed, 428 were classified as new biostatistics methods. In a multivariable model, for articles that were not freely accessible on the journal’s website, having code available appeared to offer no boost to the number of citations (adjusted rate ratio = 0.96, 95% CI = 0.74 to 1.24, p = 0.74); however, for articles that were freely accessible on the journal’s website, having code available was associated with a 2-fold increase in the number of citations (adjusted rate ratio = 2.01, 95% CI = 1.30 to 3.10, p = 0.002). Higher citation rates were also associated with higher numbers of references, longer articles, SCImago Journal Rank indicator (SJR), and total numbers of publications among authors, with the strongest impact on citation rates coming from SJR (rate ratio = 1.21 for a 1-unit increase in SJR; 95% CI = 1.11 to 1.32). Conclusion: These analyses offer new insight into factors associated with citation rates of articles on new biostatistical methods. Making computer code available to readers is a goal worth striving for that may enhance biostatistics knowledge translation.

  11. Data Citation Corpus Data File

    • data.niaid.nih.gov
    Updated Feb 20, 2025
    Cite
    DataCite; Make Data Count (2025). Data Citation Corpus Data File [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11196858
    Explore at:
    Dataset updated
    Feb 20, 2025
    Authors
    DataCite; Make Data Count
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data file for the third release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,322,388 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2025-02-01-data-citation-corpus-01-v3.0.json.

    The data citations in the file originate from the following sources:

    DataCite Event Data

    A project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles

    Data citations identified by Aligning Science Across Parkinson’s (ASAP)

    Each data citation record comprises:

    A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    Metadata for the cited dataset and for the citing publication

    The data file includes the following fields:

    Field | Description | Required?
    id | Internal identifier for the citation | Yes
    created | Date of item's incorporation into the corpus | Yes
    updated | Date of item's most recent update in corpus | Yes
    repository | Repository where cited data is stored | No
    publisher | Publisher for the article citing the data | No
    journal | Journal for the article citing the data | No
    title | Title of cited data | No
    publication | DOI of article where data is cited | Yes
    dataset | DOI or accession number of cited data | Yes
    publishedDate | Date when citing article was published | No
    source | Source where citation was harvested | Yes
    subjects | Subject information for cited data | No
    affiliations | Affiliation information for creator of cited data | No
    funders | Funding information for cited data | No
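Given the required/optional split in the table above, a record can be sanity-checked for its required keys before use; a minimal sketch (the field names come from the table, but the sample record itself is invented for illustration):

```python
import json

# Required fields per the corpus field table above.
REQUIRED = {"id", "created", "updated", "publication", "dataset", "source"}

# A hypothetical record shaped like the documented fields (values are made up).
record_json = """
{"id": "example-citation-1",
 "created": "2025-02-01",
 "updated": "2025-02-01",
 "publication": "https://doi.org/10.1371/journal.pone.0000001",
 "dataset": "https://doi.org/10.5061/dryad.example",
 "source": "datacite",
 "repository": null,
 "journal": "PLOS ONE"}
"""

record = json.loads(record_json)
missing = REQUIRED - record.keys()
print(sorted(missing))  # → []  (empty list: all required fields are present)
```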

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

    Notes on v3.0:

    The third release of the Data Citation Corpus data file reflects a few changes made to add new citations (including those from a new data source, ASAP), to update and enhance citation metadata, and to improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    Add 65,524 new data citations created in DataCite Event Data between August 2024 and December 2024

    Add ASAP citations:

    Add 750 new data citations provided by Aligning Science Across Parkinson’s (ASAP), identified through processes to evaluate compliance with ASAP’s open science practices, which involve a partnership with DataSeer and internal curation (described here).

    Citations with provenance from ASAP are identified as “asap” in the source field

    Metadata enhancements:

    Reconcile and normalize organization names for affiliations and funders in a subset of records with the Research Organization Registry (ROR)

    Add ror_name and ror_id subfields for affiliations and funders in JSON files. Unreconciled affiliation and funder strings are identified with values of null

    Add new columns affiliationsROR and fundersROR in CSV files. Unreconciled affiliation and funder strings are identified with values of NONE NONE (this is to ensure consistency in number and order of values in cases where some strings have been reconciled and others have not)

    Normalize DOI formats for articles and papers as full URLs

    Additional details about the above changes, including scripts used to perform the above tasks, are available in GitHub.
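Because unreconciled strings are written as NONE NONE precisely to keep the number and order of values aligned, the raw and ROR-reconciled columns can be paired positionally; a sketch assuming semicolon-delimited multi-value CSV cells (the delimiter and the sample values are assumptions for illustration, not from the corpus documentation):

```python
# Hypothetical multi-value cells from the CSV's affiliations / affiliationsROR
# columns; the corpus marks unreconciled entries with the placeholder "NONE NONE".
affiliations = "University A;Some Unmatched Lab".split(";")
affiliations_ror = "University A https://ror.org/012345678;NONE NONE".split(";")

reconciled = []
for name, ror in zip(affiliations, affiliations_ror):
    # Keep the raw string; map the placeholder to None so unmatched
    # affiliations are easy to filter out downstream.
    reconciled.append((name, None if ror == "NONE NONE" else ror))

print(reconciled[1])  # → ('Some Unmatched Lab', None)
```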

    Additional enhancements to the corpus are ongoing and will be addressed in the course of subsequent releases. Users are invited to submit feedback via GitHub. For general questions, email info@makedatacount.org.

  12. Dataset for "Continued use of retracted papers: Temporal trends in citations...

    • databank.illinois.edu
    Updated Jun 14, 2024
    + more versions
    Cite
    Tzu-Kun Hsiao; Jodi Schneider (2024). Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine" [Dataset]. http://doi.org/10.13012/B2IDB-8255619_V2
    Explore at:
    Dataset updated
    Jun 14, 2024
    Authors
    Tzu-Kun Hsiao; Jodi Schneider
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Alfred P. Sloan Foundation
    U.S. National Institutes of Health (NIH)
    Description

    This dataset includes five files. Descriptions of the files are given as follows:

    FILENAME: PubMed_retracted_publication_full_v3.tsv
    - Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT]).
    - Except for the information in the "cited_by" column, all the data is from PubMed.
    - PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses: [1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file). [2] Citing paper and the cited retracted paper have the same PMID.
    ROW EXPLANATIONS
    - Each row is a retracted paper. There are 7,813 retracted papers.
    COLUMN HEADER EXPLANATIONS
    1) PMID - PubMed ID
    2) Title - Paper title
    3) Authors - Author names
    4) Citation - Bibliographic information of the paper
    5) First Author - First author's name
    6) Journal/Book - Publication name
    7) Publication Year
    8) Create Date - The date the record was added to the PubMed database
    9) PMCID - PubMed Central ID (if applicable, otherwise blank)
    10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
    11) DOI - Digital object identifier (if applicable, otherwise blank)
    12) retracted_in - Information of retraction notice (given by PubMed)
    13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
    14) cited_by - PMIDs of the citing papers (if applicable, otherwise blank). Data collected from iCite.
    15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)

    FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv
    - This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
    - This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
    - Citation contexts that meet either of the two conditions below have been excluded from analyses: [1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file). [2] Citing paper and the cited retracted paper have the same PMID.
    ROW EXPLANATIONS
    - Each row is a citation context associated with one retracted paper that's cited.
    - In the manuscript, we count each citation context once, even if it cites multiple retracted papers.
    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) year - Publication year of the citing paper
    4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
    5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
    6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
    7) total_sentences - Total number of sentences in a given location
    8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
    9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
    10) citation - The citation context
    11) progression - Position of a citation context by centile within the citing paper.
    12) retracted_yr - Retraction year of the retracted paper
    13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.

    FILENAME: 724_knowingly_post_retraction_cit.csv (updated)
    - The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
    - Two citation contexts from retraction notices have been excluded from analyses.
    ROW EXPLANATIONS
    - Each row is a citation context.
    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) pub_type - Publication type collected from the metadata in the PMCOA XML files.
    4) pub_type2 - Specific article types. Please see the manuscript for explanations.
    5) year - Publication year of the citing paper
    6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
    7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
    8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
    9) citation - The citation context
    10) retracted_yr - Retraction year of the retracted paper
    11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
    12) longer_context - An extended version of the citation context (if applicable, otherwise blank). Manually pulled from the full-texts in the process of annotation.

    FILENAME: Annotation manual.pdf
    - The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.tsv.

    FILENAME: retraction_notice_PMID.csv (new file added for this version)
    - A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT]).
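With the post_retraction flag documented above, tallying post-retraction citations from the citation-context file is a one-line filter; a sketch over an invented two-row sample that uses the documented column names (values are made up):

```python
import csv
import io

# Invented sample with the documented columns "year", "retracted_yr",
# and "post_retraction" (1 = citation made after the retraction year).
sample_tsv = (
    "pmid\tyear\tretracted_yr\tpost_retraction\n"
    "111\t2015\t2012\t1\n"
    "222\t2010\t2012\t0\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
post = [r for r in rows if r["post_retraction"] == "1"]
print(len(post), len(rows))  # → 1 2
```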

  13. A study of the impact of data sharing on article citations using journal...

    • plos.figshare.com
    • dataverse.harvard.edu
    • +1more
    docx
    Updated Jun 1, 2023
    Cite
    Garret Christensen; Allan Dafoe; Edward Miguel; Don A. Moore; Andrew K. Rose (2023). A study of the impact of data sharing on article citations using journal policies as a natural experiment [Dataset]. http://doi.org/10.1371/journal.pone.0225883
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Garret Christensen; Allan Dafoe; Edward Miguel; Don A. Moore; Andrew K. Rose
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study estimates the effect of data sharing on the citations of academic articles, using journal policies as a natural experiment. We begin by examining 17 high-impact journals that have adopted the requirement that data from published articles be publicly posted. We match these 17 journals to 13 journals without policy changes and find that empirical articles published just before their change in editorial policy have citation rates with no statistically significant difference from those published shortly after the shift. We then ask whether this null result stems from poor compliance with data sharing policies, and use the data sharing policy changes as instrumental variables to examine more closely two leading journals in economics and political science with relatively strong enforcement of new data policies. We find that articles that make their data available receive 97 additional citations (standard error of 34). We conclude that: a) authors who share data may be rewarded eventually with additional scholarly citations, and b) data-posting policies alone do not increase the impact of articles published in a journal unless those policies are enforced.

  14. Data from: OpCitance: Citation contexts identified from the PubMed Central...

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Feb 15, 2023
    Cite
    Tzu-Kun Hsiao; Vetle Torvik (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles [Dataset]. http://doi.org/10.13012/B2IDB-4353270_V1
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Tzu-Kun Hsiao; Vetle Torvik
    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    Sentences and citation contexts identified from the PubMed Central open access articles
    ----------------------------------------------------------------------
    The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019. The dataset is created as described in: Hsiao, T.-K., & Torvik, V. I. (manuscript). OpCitance: Citation contexts identified from the PubMed Central open access articles.

    Files:
    • A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
    • B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
    • C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
    • D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
    • E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
    • F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
    • G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
    • H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
    • I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
    • J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
    • K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
    • L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
    • M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
    • N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
    • O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
    • P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
    • P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
    • Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
    • R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
    • S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
    • T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
    • UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
    • W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
    • XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.

    Each row in the file is a sentence/citation context and contains the following columns:
    • pmcid: PMCID of the article
    • pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
    • location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
    • IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
    • sentence_id: The ID of the citation context/sentence in the article component
    • total_sentences: The number of sentences in the article component.
    • intxt_id: The ID of the citation.
    • intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
    • intxt_pmid_source: The sources where the intxt_pmid can be identified. Xml represents that the PMID is only identified from the XML file; xml,pmc represents that the PMID is not only from the XML file, but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
    • intxt_mark: The citation marker associated with the inline citation.
    • best_id: The best source link ID (e.g., PMID) of the citation.
    • best_source: The sources that confirm the best ID.
    • best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
    • citation: A citation context. If no citation is found in a sentence, the value is the sentence.
    • progression: Text progression of the citation context/sentence.

    Supplementary Files
    • PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column. Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
    • pmcid: PMCID of the citing article.
    • pos: The citation's position in the reference list.
    • fromPMID: PMID of the citing article.
    • toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
    • SRC: The sources that confirm the toPMID.
    • MatchDB: The origin bibliographic database of the toPMID.
    • Probability: The match probability of the toPMID.
    • toPMID2: PMID of the citation (as tagged in the XML file).
    • SRC2: The sources that confirm the toPMID2.
    • intxt_id: The ID of the citation.
    • journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
    • same_ref_string: Whether the citation string appears in the reference list more than once.
    • DIFF: The comparison result between the toPMID column and the toPMID2 column.
    • bestID: The best source link ID (e.g., PMID) of the citation.
    • bestSRC: The sources that confirm the best ID.
    • Match: Matching result produced by Patci.
    [1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
    • Supplementary_File_1.zip – This file contains the code for generating the dataset.
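As a quick check on the counts above, citation contexts make up roughly a tenth of all sentences in the dataset:

```python
# Counts taken from the dataset description above.
total_sentences = 720_649_608
citation_contexts = 75_848_689

share = citation_contexts / total_sentences
print(f"{share:.1%}")  # → 10.5%
```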

  15. iCite Database Snapshot 2023-08

    • nih.figshare.com
    bin
    Updated Sep 12, 2023
    + more versions
    Cite
    iCite; B. Ian Hutchins; George Santangelo (2023). iCite Database Snapshot 2023-08 [Dataset]. http://doi.org/10.35092/yhjc24100917.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    iCite; B. Ian Hutchins; George Santangelo
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year than the median NIH funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many citations per year. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published the previous year if they have received 5 or more citations, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
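    These three fractions are computed over only the article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories, so they sum to 1 whenever any such terms exist. A minimal sketch (names are illustrative):

```python
def mesh_fractions(n_human: int, n_animal: int, n_molecular: int):
    """Return (human, animal, molecular_cellular) fractions over the terms
    in these three categories; other MeSH terms are ignored."""
    total = n_human + n_animal + n_molecular
    if total == 0:
        return None  # article has no terms in the three categories
    return (n_human / total, n_animal / total, n_molecular / total)
```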

    x_coord: X coordinate of the article on the Triangle of Biomedicine

    y_coord: Y Coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.

    Large CSV files are zipped using zip format version 4.5, which is newer than what the default unzip command-line utility in some common Linux distributions supports. These files can be unzipped with tools that support version 4.5 or later, such as 7zip.
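    As an alternative to 7zip, Python's zipfile module also supports the ZIP64 extensions introduced in zip format version 4.5, so a short script can extract these archives. A sketch; the archive path is illustrative:

```python
import zipfile

def extract_archive(archive_path: str, dest_dir: str = "."):
    """Extract a zip archive (including ZIP64 archives that older `unzip`
    builds reject) and return the list of member names."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```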

    Comments and questions can be addressed to iCite@mail.nih.gov

  16. Data from: Second-generation citation context analysis (2010-2019) to...

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Sep 2, 2020
    Jodi Schneider; Di Ye; Alison Hill (2020). Second-generation citation context analysis (2010-2019) to retracted paper Matsuyama 2005 [Dataset]. http://doi.org/10.13012/B2IDB-3331845_V2
    Explore at:
    Dataset updated
    Sep 2, 2020
    Authors
    Jodi Schneider; Di Ye; Alison Hill
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Alfred P. Sloan Foundation
    Description

    Citation context annotation. This dataset is the second version (V2) and part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. (2020) "Continued post-retraction citation of a fraudulent clinical trial report, eleven years after it was retracted for falsifying data". Scientometrics. In press, DOI: 10.1007/s11192-020-03631-1. Publications were selected by examining all citations to the retracted paper Matsuyama 2005 and selecting the 35 citing papers, published 2010 to 2019, which do not mention the retraction but which mention the methods or results of the retracted paper (called "specific" in Ye, Di; Hill, Alison; Whitehorn (Fulton), Ashley; Schneider, Jodi (2020): Citation context annotation for new and newly found citations (2006-2019) to retracted paper Matsuyama 2005. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8150563_V1). The annotated citations are second-generation citations to the retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) https://doi.org/10.1016/S0012-3692(08)60339-6).

    OVERALL DATA for VERSION 2 (V2)

    FILES/FILE FORMATS
    Same data in two formats (both same as in V1):
    - 2010-2019 SG to specific not mentioned FG.csv - Unicode CSV (preservation format only)
    - 2010-2019 SG to specific not mentioned FG.xlsx - Excel workbook (preferred format)
    Additional files in V2:
    - 2G-possible-misinformation-analyzed.csv - Unicode CSV (preservation format only)
    - 2G-possible-misinformation-analyzed.xlsx - Excel workbook (preferred format)

    ABBREVIATIONS
    - 2G - Refers to the second generation of citations to Matsuyama
    - FG - Refers to the direct citation of Matsuyama (the one the second-generation item cites)

    COLUMN HEADER EXPLANATIONS
    File name: 2G-possible-misinformation-analyzed. Other column headers in this file have the same meaning as explained in V1. The following are additional header explanations:
    - Quote Number - The order of the quote (citation context citing the first-generation article given in "FG in bibliography") in the second-generation article (given in "2G article")
    - Quote - The text of the quote (citation context citing the first-generation article given in "FG in bibliography") in the second-generation article (given in "2G article")
    - Translated Quote - English translation of "Quote", automatic translation from Google Scholar
    - Seriousness/Risk - Our assessment of the risk of misinformation and its seriousness
    - 2G topic - Our assessment of the topic of the second-generation article (given in "2G article")
    - 2G section - The section of the citing article (the second-generation article given in "2G article") in which the cited article (the first-generation article given in "FG in bibliography") was found
    - FG in bib type - The type of article (e.g., review article), referring to the cited article (the first-generation article given in "FG in bibliography")
    - FG in bib topic - Our assessment of the topic of the cited article (the first-generation article given in "FG in bibliography")
    - FG in bib section - The section of the cited article (the first-generation article given in "FG in bibliography") in which the Matsuyama retracted paper was cited

  17. Methodology data of "A qualitative and quantitative citation analysis toward...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 8, 2022
    Ivan Heibi; Ivan Heibi; Silvio Peroni; Silvio Peroni (2022). Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study" [Dataset]. http://doi.org/10.5281/zenodo.4323221
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 8, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Ivan Heibi; Ivan Heibi; Silvio Peroni; Silvio Peroni
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This document contains the datasets and visualizations generated by applying the methodology defined in our work "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step (i.e. “Data gathering”) builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step (i.e. "Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.

    Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.

    Data gathering

    The data generated by this step are stored in "data/":

    1. "cits_features.csv": a dataset containing all the entities (rows in the CSV) which have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), the title ("title"), the venue identifier ("source_id"), the title of the venue ("source_title"), yes/no value in case the entity is retracted as well ("retracted"), the subject area ("area"), the subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value to denote whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention").
      Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).
    2. "cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citations context ("intext_citation.context") for each citing entity identified using the DOI value ("doi").
      Note: the data keep their original license (the one provided by their publisher). This dataset is provided in order to favor the reproducibility of the results obtained in our work.

    Topic modeling
    We ran a topic modeling analysis on the gathered textual features (i.e. abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling was done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two folders: "abstracts/" for the abstracts and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files:

    1. "mitao_workflows/": the workflows of MITAO. These are JSON files that can be reloaded in MITAO to reproduce the results by following the same workflows.

    2. "corpus_and_dictionary/": it contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

    3. "coherence/coherence.csv": the coherence scores of several topic models, trained with the number of topics ranging from 1 to 40.

    4. "datasets_and_views/": the datasets and visualizations generated using MITAO.
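    A coherence sweep like the one in item 3 supports model selection: pick the topic count whose coherence score is highest. A minimal sketch, assuming the CSV has num_topics and coherence columns (the header names are our assumption, not documented here):

```python
import csv
from io import StringIO

def best_topic_count(coherence_csv_text: str) -> int:
    """Return the number of topics with the highest coherence score from a
    CSV shaped like coherence/coherence.csv (column names assumed)."""
    rows = csv.DictReader(StringIO(coherence_csv_text))
    best = max(rows, key=lambda r: float(r["coherence"]))
    return int(best["num_topics"])
```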

    References

    1. Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0
    2. Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

    3. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3

  18. PLOS ONE publication and citation data

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +2more
    zip
    Updated May 15, 2023
    Alexander Petersen (2023). PLOS ONE publication and citation data [Dataset]. http://doi.org/10.6071/M39W8V
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 15, 2023
    Dataset provided by
    University of California, Merced
    Authors
    Alexander Petersen
    License

    CC0 1.0 Universal (CC0-1.0): https://spdx.org/licenses/CC0-1.0.html

    Description

    Merged PLOS ONE and Web of Science data compiled in .dta files produced by STATA13. Included is a Do-file for reproducing the regression model estimates reported in the pre-print (Tables I and II) and published version (Table 1). Each observation (.dta line) corresponds to a given PLOS ONE article, with various article-level and editor-level characteristics used as explanatory and control variables. This summary provides a brief description of each variable and its source.

    If you use this data, please cite: A. M. Petersen. Megajournal mismanagement: Manuscript decision bias and anomalous editor activity at PLOS ONE. Journal of Informetrics 13, 100974 (2019). DOI: 10.1016/j.joi.2019.100974

    Methods

    We gathered the citation information for all PLOS ONE articles, indexed by A, from the Web of Science (WOS) Core Collection. From this data we obtained a master list of the unique digital object identifiers, DOI_A, and the number of citations, c_A, at the time of the data download (census) date

    (a) For the pre-print this corresponds to December 3, 2016;

    (b) and for the final published article this corresponds to February 25, 2019.

    We then used each DOI_A to access the corresponding online XML version of each article at PLOS ONE by visiting the unique web address “http://journals.plos.org/plosone/article?id=” + “DOI_A”. After parsing the full-text XML (primarily the author byline data and reference list), we merged the PLOS ONE publication information and WOS citation data by matching on DOI_A.
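    The gathering-and-merge steps above can be sketched as follows. This is a minimal illustration with plain dicts; the record keys ("citations", "n_authors") are our own, not the authors' actual schema:

```python
def plos_xml_url(doi: str) -> str:
    # Each article's full-text XML page is addressed by its DOI.
    return "http://journals.plos.org/plosone/article?id=" + doi

def merge_on_doi(wos_records, plos_records):
    """Inner-join two lists of article dicts on the 'doi' key, mirroring the
    DOI matching described above (other field names are illustrative)."""
    plos_by_doi = {r["doi"]: r for r in plos_records}
    return [{**w, **plos_by_doi[w["doi"]]}
            for w in wos_records if w["doi"] in plos_by_doi]
```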

    allofplos: PLOS has since made all full-text XML data freely available: https://www.plos.org/text-and-data-mining ; this option was not available at the time of our data collection.

  19. Predicting the number of citations by article characteristics and tweets.

    • figshare.com
    • plos.figshare.com
    xls
    Updated May 31, 2023
    Samara Klar; Yanna Krupnikov; John Barry Ryan; Kathleen Searles; Yotam Shmargad (2023). Predicting the number of citations by article characteristics and tweets. [Dataset]. http://doi.org/10.1371/journal.pone.0229446.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Samara Klar; Yanna Krupnikov; John Barry Ryan; Kathleen Searles; Yotam Shmargad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Predicting the number of citations by article characteristics and tweets.

  20. Impact of NIH Public Access Policy on Citation Rates - Data from Study

    • indigo.uic.edu
    • figshare.com
    txt
    Updated Nov 23, 2019
    Sandra L. De Groote (2019). Impact of NIH Public Access Policy on Citation Rates - Data from Study [Dataset]. https://indigo.uic.edu/articles/dataset/Impact_of_NIH_Public_Access_Policy_on_Citation_Rates_-_Data_from_Study/10961135
    Explore at:
    txtAvailable download formats
    Dataset updated
    Nov 23, 2019
    Dataset provided by
    University of Illinois Chicago
    Authors
    Sandra L. De Groote
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    A list of journals across several subject areas was developed, from which article citation data were collected. Citation information and cited-reference counts were obtained for all articles published in these journals in 2006 and 2009.
