63 datasets found
  1. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005). Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018.

    ## Introduction

    This dataset was created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed as the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data files
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in numpy format, used for plotting; v4.reviewer contains models for the analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data directly from Clarivate Analytics. However, we do provide a similar, open dataset based on citations from PubMed Central, which can be used to obtain results comparable to those reported in our analysis. Furthermore, we have freely shared the datasets which, combined with the citation data from Clarivate Analytics, can be used to re-create the dataset used in our experiments. These datasets are listed below. If you use any of them, please cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. See NLM's Terms and Conditions for information on obtaining PubMed/MEDLINE data. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
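
    As a quick-start illustration for the training files listed under Introduction above: a minimal Python sketch, assuming the header file holds a single tab-separated row of column names (see COLUMNS_DESC.txt for field definitions); chunked reading is used because the data files are around 1.2G each.

    ```python
    import pandas as pd

    # Column names ship in a separate header file, one tab-separated row.
    with open("Training_data_2002_2005_pmc_pair_txt.header.txt") as f:
        columns = f.read().strip().split("\t")

    # Stream the first-author file in chunks to keep memory bounded.
    reader = pd.read_csv(
        "Training_data_2002_2005_pmc_pair_First.txt",
        sep="\t",
        names=columns,
        header=None,
        chunksize=100_000,
    )

    for chunk in reader:
        # Replace this with real processing, e.g. per-author aggregation.
        print(chunk.shape)
        break  # drop this line to stream the whole file
    ```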

  2. Data from: Citation network of the knowledge co-production literature....

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    Updated Dec 8, 2021
    Cite
    Justyna Bandola-Gill; Megan Arthur; Rhodri Ivor Leng (2021). Citation network of the knowledge co-production literature. Supplementary data. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5762450
    Dataset updated
    Dec 8, 2021
    Dataset provided by
    University of Edinburgh
    Authors
    Justyna Bandola-Gill; Megan Arthur; Rhodri Ivor Leng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data description

    This data note describes the final citation network dataset analysed in the manuscript "What is co-production? Conceptualising and understanding co-production of knowledge and policy across different theoretical perspectives" [1].

    The data collection strategy used to construct this dataset is described in the associated manuscript [1]. These data were originally downloaded from the Web of Science (WoS) Core Collection, via the library subscription of the University of Edinburgh, using a systematic search methodology that sought to capture literature relevant to ‘knowledge co-production’. The dataset consists of 1,893 unique document reference strings (nodes) interlinked by 9,759 citation links (edges). The network dataset describes a directed citation network composed of papers relevant to 'knowledge co-production', and is split into two files: (i) ‘KnowCo_node_attribute_list.csv’ contains attributes of the 1,893 documents (nodes); and (ii) ‘KnowCo_edge_list.csv’ records the citation links (edges) between pairs of documents.

    1. ‘KnowCo_node_attribute_list.csv’ consists of attributes of the 1,893 nodes (documents) of the citation network. Due to the approach used to collect the data, there are two types of node: (i) 525 nodes representing documents retrieved from WoS via the systematic search strategy, which have full attribute data including their reference lists; and (ii) 1,368 nodes representing documents that were cited >2 times by our 525 fully retrieved papers (see the manuscript for a full description [1]). The columns refer to:

    Id, the unique identifier. Fully retrieved documents are identified via a unique identifier that begins with ‘f’ followed by an integer (e.g. f1, f2, etc.). Non-retrieved documents are identified via a unique identifier beginning with ‘n’ followed by an integer (e.g. n1, n2, etc.).

    Label, contains the unique reference string of the document for which the attribute data in that row corresponds. Reference strings contain the last name of the first author, publication year, journal, volume, start page, and DOI (if available).

    authors, all author names. These are in the order that these names appear in the authorship list of the corresponding document. These data are only available for fully retrieved documents.

    title, document title. These data are only available for fully retrieved documents.

    journal, journal of publication. These data are only available for fully retrieved documents. For those interested in journal data for the remaining papers, this can be extracted from the reference string in the ‘Label’ column.

    year, year of publication. These data are available for all nodes.

    type, document type (e.g. article, review). Available only for fully retrieved documents.

    wos_total_citations, total citation count as recorded by Web of Science Core Collection as of May 2020. Available only for fully retrieved documents.

    wos_id, Web of Science accession number. Available only for fully retrieved documents; for non-retrieved documents, ‘CitedReference’ fills the cell.

    cluster, provides the cluster membership number as discussed within the manuscript, established via modularity maximisation using the Leiden algorithm (resolution 0.8; Q=0.53; 5 clusters). Available for all nodes.

    indegree, total count of within network citations to a given document. Due to the composition of the network, this figure tells us the total number of citations from 525 fully retrieved documents to each of the 1,893 documents within the network. Available for all nodes.

    outdegree, total count of within network references from a given document. Due to the composition of the network, only fully retrieved documents can have a value >0 because only these documents have their associated reference list data. Available for all nodes.

    2. ‘KnowCo_edge_list.csv’ is an edge list containing the 9,759 citation links between the 1,893 documents. The columns refer to:

    Source, the citing document’s unique identifier.

    Target, the cited document’s unique identifier.
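
    To illustrate how the two files fit together: a minimal sketch, assuming pandas and networkx are available, that rebuilds the directed network and recomputes the indegree and outdegree columns described above.

    ```python
    import pandas as pd
    import networkx as nx

    nodes = pd.read_csv("KnowCo_node_attribute_list.csv")
    edges = pd.read_csv("KnowCo_edge_list.csv")

    # Edges run from citing (Source) to cited (Target) documents.
    g = nx.DiGraph()
    g.add_nodes_from(nodes["Id"])
    g.add_edges_from(edges[["Source", "Target"]].itertuples(index=False, name=None))

    # Indegree counts within-network citations received; outdegree counts
    # references made (non-zero only for fully retrieved documents).
    nodes["indegree_check"] = nodes["Id"].map(dict(g.in_degree()))
    nodes["outdegree_check"] = nodes["Id"].map(dict(g.out_degree()))
    print(nodes[["Id", "indegree", "indegree_check", "outdegree", "outdegree_check"]].head())
    ```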

    Notes

    [1] Bandola-Gill, J., Arthur, M., & Leng, R. I. (Under review). What is co-production? Conceptualising and understanding co-production of knowledge and policy across different theoretical perspectives. Evidence & Policy

  3. iCite Database Snapshot 2022-10

    • nih.figshare.com
    Updated Jun 4, 2023
    Cite
    iCite; B. Ian Hutchins; George Santangelo; Ehsanul Haque (2023). iCite Database Snapshot 2022-10 [Dataset]. http://doi.org/10.35092/yhjc21502470.v1
    Available download formats: bin
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    iCite; B. Ian Hutchins; George Santangelo; Ehsanul Haque
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 or more citations, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more, and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    x_coord: X coordinate of the article on the Triangle of Biomedicine

    y_coord: Y Coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.

    Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.
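
    Once unzipped, the snapshot can be explored directly. Below is a minimal sketch, in which the metadata file name ("icite_metadata.csv") and the delimiter inside cited_by (whitespace-separated PMIDs) are assumptions to check against the actual download.

    ```python
    import pandas as pd

    icite = pd.read_csv(
        "icite_metadata.csv",  # assumed name of the zipped CSV's contents
        usecols=["pmid", "year", "relative_citation_ratio", "cited_by"],
    )

    # Papers with a computed RCR; RCR > 2.0 means more than twice the
    # citations per year of the median NIH-funded paper in its field/year.
    scored = icite.dropna(subset=["relative_citation_ratio"])
    high_influence = scored[scored["relative_citation_ratio"] > 2.0]

    # Expand cited_by (assumed whitespace-separated PMIDs) into an edge
    # list of citing -> cited pairs.
    edges = (
        icite.dropna(subset=["cited_by"])
             .assign(citing=lambda df: df["cited_by"].str.split())
             .explode("citing")[["citing", "pmid"]]
    )
    print(len(high_influence), len(edges))
    ```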

    Comments and questions can be addressed to iCite@mail.nih.gov

  4. Author survey data about bibliometrics and altmetrics for open access...

    • researchdata.se
    • data.europa.eu
    Updated Jun 5, 2019
    Cite
    Sofie Wennström; Gabor Schubert; Jeroen Sondervan; Graham Stone (2019). Author survey data about bibliometrics and altmetrics for open access monographs – including data about online usage and citations of academic books from Stockholm University Press [Dataset]. http://doi.org/10.17045/STHLMUNI.8051717
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Stockholm University
    Authors
    Sofie Wennström; Gabor Schubert; Jeroen Sondervan; Graham Stone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes a file with results from a survey sent to authors of open access monographs. The survey was available during March–April 2019, and the results are analysed in a paper entitled 'The significant difference in impact – an exploratory study about the meaning and value of metrics for open access monographs', presented at the 2019 Elpub conference, June 2–4, Marseille, France. Version 2 of the dataset has been updated with the slides presented at the conference and a link to the full paper published in the French open archive HAL.

    The respondents were asked to comment on assumptions about bibliometrics and altmetrics currently in practice, and to reflect on the meaning of such data in relation to their experiences as authors of books published in a digital format under an open license (i.e. a Creative Commons license). The survey questionnaire is included as a separate text document. The dataset also includes measures of the usage of open access books published by Stockholm University Press, including information about online usage, mentions in social media, and citations. These data were collected from the publisher's platform and the Altmetric.com database; citation data were collected from Dimensions, Google Scholar, Web of Science and CrossRef. The data were collected in February 2019, except for the figures from the OAPEN Library database, which were collected in November 2018. The paper, including the analysis of these data, is to be published in the Elpub Digital Library. The tables included in the dataset may vary slightly from those in the published paper, due to space constraints in the published version.

  5. Dataset for "Continued use of retracted papers: Temporal trends in citations...

    • databank.illinois.edu
    Updated Jun 14, 2024
    Cite
    Tzu-Kun Hsiao; Jodi Schneider (2024). Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine" [Dataset]. http://doi.org/10.13012/B2IDB-8255619_V2
    Dataset updated
    Jun 14, 2024
    Authors
    Tzu-Kun Hsiao; Jodi Schneider
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Alfred P. Sloan Foundation
    U.S. National Institutes of Health (NIH)
    Description

    This dataset includes five files. Descriptions of the files are given as follows:

    FILENAME: PubMed_retracted_publication_full_v3.tsv
    - Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT]).
    - Except for the information in the "cited_by" column, all the data is from PubMed.
    - PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses: [1] the PMID of the citing paper belongs to a retraction notice (i.e., those in the "retraction_notice_PMID.csv" file); [2] the citing paper and the cited retracted paper have the same PMID.
    ROW EXPLANATIONS
    - Each row is a retracted paper. There are 7,813 retracted papers.
    COLUMN HEADER EXPLANATIONS
    1) PMID - PubMed ID
    2) Title - Paper title
    3) Authors - Author names
    4) Citation - Bibliographic information of the paper
    5) First Author - First author's name
    6) Journal/Book - Publication name
    7) Publication Year
    8) Create Date - The date the record was added to the PubMed database
    9) PMCID - PubMed Central ID (if applicable, otherwise blank)
    10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
    11) DOI - Digital object identifier (if applicable, otherwise blank)
    12) retracted_in - Information on the retraction notice (given by PubMed)
    13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
    14) cited_by - PMIDs of the citing papers (if applicable, otherwise blank). Data collected from iCite.
    15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)

    FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv
    - This file contains the citation contexts (i.e., citing sentences) in which the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
    - This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
    - Citation contexts that meet either of the two conditions below have been excluded from analyses: [1] the PMID of the citing paper belongs to a retraction notice (i.e., those in the "retraction_notice_PMID.csv" file); [2] the citing paper and the cited retracted paper have the same PMID.
    ROW EXPLANATIONS
    - Each row is a citation context associated with one retracted paper that is cited.
    - In the manuscript, we count each citation context once, even if it cites multiple retracted papers.
    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) year - Publication year of the citing paper
    4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
    5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
    6) sentence_id - The ID of the citation context in a given location (see column 4). The first sentence in the location gets ID 1, and subsequent sentences are numbered consecutively.
    7) total_sentences - Total number of sentences in a given location
    8) intxt_id - Identifier of the cited (retracted) paper
    9) intxt_pmid - PubMed ID of the cited (retracted) paper
    10) citation - The citation context
    11) progression - Position of the citation context by centile within the citing paper
    12) retracted_yr - Retraction year of the retracted paper
    13) post_retraction - 0 = not a post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.

    FILENAME: 724_knowingly_post_retraction_cit.csv (updated)
    - The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
    - Two citation contexts from retraction notices have been excluded from analyses.
    ROW EXPLANATIONS
    - Each row is a citation context.
    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) pub_type - Publication type collected from the metadata in the PMCOA XML files
    4) pub_type2 - Specific article types; please see the manuscript for explanations
    5) year - Publication year of the citing paper
    6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
    7) intxt_id - Identifier of the cited (retracted) paper
    8) intxt_pmid - PubMed ID of the cited (retracted) paper
    9) citation - The citation context
    10) retracted_yr - Retraction year of the retracted paper
    11) cit_purpose - Purpose of citing the retracted paper, from human annotation; please see the manuscript for further information
    12) longer_context - An extended version of the citation context (if applicable, otherwise blank), manually pulled from the full texts during annotation

    FILENAME: Annotation manual.pdf
    - The manual for annotating the citation purposes in column 11 of 724_knowingly_post_retraction_cit.csv.

    FILENAME: retraction_notice_PMID.csv (new file added for this version)
    - A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT]).
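
    As a usage illustration: a minimal sketch that loads the citation-context file and checks the post_retraction flag against the dataset's own definition (a citation made after the calendar year of retraction); column names are taken from the description above.

    ```python
    import pandas as pd

    ctx = pd.read_csv(
        "PubMed_retracted_publication_CitCntxt_withYR_v3.tsv", sep="\t"
    )

    # Share of citation contexts that are post-retraction (precomputed flag).
    print(ctx["post_retraction"].value_counts(normalize=True))

    # The flag should agree with comparing the citing paper's year to the
    # retraction year, per the definition of column 13.
    valid = ctx.dropna(subset=["retracted_yr"])
    recomputed = (valid["year"] > valid["retracted_yr"]).astype(int)
    print((recomputed == valid["post_retraction"]).mean())
    ```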

  6. October 2023 data-update for "Updated science-wide author databases of...

    • elsevier.digitalcommonsdata.com
    Updated Oct 4, 2023
    Cite
    John P.A. Ioannidis (2023). October 2023 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.6
    Dataset updated
    Oct 4, 2023
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Data are shown separately for career-long impact and for single recent year impact. Metrics with and without self-citations, and the ratio of citations to citing papers, are given. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2022, and single recent year data pertain to citations received during calendar year 2022. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (6) is based on the October 1, 2023 snapshot from Scopus, updated to the end of citation year 2022. This work uses Scopus data provided by Elsevier through ICSR Lab (https://www.elsevier.com/icsr/icsrlab). Calculations were performed using all Scopus author profiles as of October 1, 2023. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.

    PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, please read the 3 associated PLoS Biology papers that explain the development, validation and use of these metrics and databases. (https://doi.org/10.1371/journal.pbio.1002501, https://doi.org/10.1371/journal.pbio.3000384 and https://doi.org/10.1371/journal.pbio.3000918).

    Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a

  7. iCite Database Snapshot 2023-08

    • nih.figshare.com
    Updated Sep 12, 2023
    Cite
    iCite; B. Ian Hutchins; George Santangelo (2023). iCite Database Snapshot 2023-08 [Dataset]. http://doi.org/10.35092/yhjc24100917.v1
    Available download formats: bin
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    iCite; B. Ian Hutchins; George Santangelo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:

    Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.

    Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles

    Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection

    Definitions for individual data fields:

    pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine

    doi: Digital Object Identifier, if available

    year: Year the article was published

    title: Title of the article

    authors: List of author names

    journal: Journal name (ISO abbreviation)

    is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article

    relative_citation_ratio: Relative Citation Ratio (RCR)--OPA's metric of scientific influence. Field-adjusted, time-adjusted and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.

    provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not as stable as they are for older articles. Provisional RCRs are provided for papers published in the previous year if they have received 5 or more citations, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 citations or more, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 citations or more, and all papers from 2018 receive provisional RCRs.

    citation_count: Number of unique articles that have cited this one

    citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.

    field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.

    expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.

    nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.

    human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)

    x_coord: X coordinate of the article on the Triangle of Biomedicine

    y_coord: Y Coordinate of the article on the Triangle of Biomedicine

    is_clinical: Flag indicating that this paper meets the definition of a clinical article.

    cited_by_clin: PMIDs of clinical articles that this article has been cited by.

    apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.

    cited_by: PMIDs of articles that have cited this one.

    references: PMIDs of articles in this article's reference list.

    Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.

    Comments and questions can be addressed to iCite@mail.nih.gov

  8. Data from: Dataset for 'A Matter of Culture? Conceptualising and...

    • zenodo.org
    Updated Oct 22, 2024
    Cite
    Rhodri Leng; Justyna Bandola-Gill; Katherine Smith; Valerie Pattyn; Niklas Andersen (2024). Dataset for 'A Matter of Culture? Conceptualising and Investigating 'Evidence Cultures' within Research on Evidence-Informed Policymaking' [Dataset]. http://doi.org/10.5281/zenodo.13972074
    Available download formats: csv
    Dataset updated
    Oct 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rhodri Leng; Justyna Bandola-Gill; Katherine Smith; Valerie Pattyn; Niklas Andersen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 22, 2024
    Description

    Introduction
    This document describes the data collection and datasets used in the manuscript "A Matter of Culture? Conceptualising and Investigating ‘Evidence Cultures’ within Research on Evidence-Informed Policymaking" [1].

    Data Collection

    To construct the citation network analysed in the manuscript, we first designed a series of queries to capture a large sample of literature exploring the relationship between evidence, policy, and culture from various perspectives. Our team of domain experts developed the following queries based on terms common in the literature. These queries search for the terms included in the titles, abstracts, and associated keywords of WoS-indexed records (i.e. ‘TS=’). While they are separated below for ease of reading, they were combined into a single query via the OR operator in our search. Our search was conducted on the Web of Science (WoS) Core Collection through the University of Edinburgh Library subscription on 29/11/2023, returning a total of 2,089 records.

    TS = ((“cultures of evidence” OR “culture of evidence” OR “culture of knowledge” OR “cultures of knowledge” OR “research culture” OR “research cultures” OR “culture of research” OR “cultures of research” OR “epistemic culture” OR “epistemic cultures” OR “epistemic community” OR “epistemic communities” OR “epistemic infrastructure” OR “evaluation culture” OR “evaluation cultures” OR “culture of evaluation” OR “cultures of evaluation” OR “thought style” OR “thought styles” OR “thought collective” OR “thought collectives” OR “knowledge regime” OR “knowledge regimes” OR “knowledge system” OR “knowledge systems” OR “civic epistemology” OR “civic epistemologies”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “policy decision” OR “policy decisions” OR “political decision” OR “political decisions” OR “political decision making”))

    OR

    TS = ((“culture” OR “cultures”) AND ((“evidence-based” OR “evidence-informed” OR “evidence-led” OR “science-based” OR “science-informed” OR “science-led” OR “research-based” OR “research-informed” OR “evidence use” OR “evidence user” OR “evidence utilisation” OR “evidence utilization” OR “research use” OR “researcher user” OR “research utilisation” OR “research utilization” OR “research in” OR “evidence in” OR “science in”) NEAR/1 (“policymaking” OR “policy making” OR “policy maker” OR “policy makers”)))

    OR

    TS = ((“culture” OR “cultures”) AND (“scientific advice” OR “technical advice” OR “scientific expertise” OR “technical expertise” OR “expert advice”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “political decision” OR “political decisions” OR “political decision making”))

    OR

    TS = ((“culture” OR “cultures”) AND (“post-normal science” OR “trans-science” OR “transdisciplinary” OR “transdisiplinarity” OR “science-policy interface” OR “policy sciences” OR “sociology of knowledge” OR “sociology of science” OR “knowledge transfer” OR “knowledge translation” OR “knowledge broker” OR “implementation science” OR “risk society”) AND (“policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers”))

    Citation Network Construction

    All bibliographic metadata on these 2,089 records were downloaded in five batches in plain text and then merged in R. We then parsed these data into network-readable files. All unique reference strings are given unique node IDs. A node attribute list (‘CE_Node’) links identifying information for each document with its node ID, including authors, title, year of publication, journal, WoS ID, and WoS citations. An edge list (‘CE_Edge’) records all citations from these documents to their bibliographies, with edges going from a citing document to the cited document, using the relevant node IDs. These data were then cleaned by (a) matching DOIs for reference strings that differ but point to the same paper, and (b) manually merging obvious duplicates caused by referencing errors.

    Our initial dataset consisted of 2,089 retrieved documents and 123,772 unretrieved cited documents (i.e. documents that were cited within the publications we retrieved but which were not one of these 2,089 documents). These documents were connected by 157,229 citation links, but ~87% of the documents in the network were cited just once. To focus on relevant literature, we filtered the network to include only documents with at least three citation or reference links. We further refined the dataset by focusing on the main connected component, resulting in 6,650 nodes and 29,198 edges. It is this dataset that we publish here, and it is this network that underpins Figure 1, Table 1, and the qualitative examination of documents (see manuscript for further details).

    Our final network dataset contains 1,819 of the documents in our original query (~87% of the original retrieved records), and 4,831 documents not retrieved via our Web of Science search but cited by at least three of the retrieved documents. We then clustered this network by modularity maximization via the Leiden algorithm [2], detecting 14 clusters with Q=0.59. Citations to documents within the same cluster constitute ~77% of all citations in the network.
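
    For readers who want to reproduce this step: a minimal sketch of modularity-based Leiden clustering on the published edge list, assuming the python-igraph and leidenalg packages (the authors' exact tooling and settings may differ; citation direction is ignored for modularity).

    ```python
    import csv
    import igraph as ig
    import leidenalg as la

    # Read the edge list: Source (citing) -> Target (cited) node IDs.
    with open("CE_Edge.csv") as f:
        edges = [(row["Source"], row["Target"]) for row in csv.DictReader(f)]

    g = ig.Graph.TupleList(edges, directed=True)

    # Leiden with the modularity objective on the undirected projection.
    partition = la.find_partition(g.as_undirected(), la.ModularityVertexPartition)
    print(len(partition), partition.quality())  # cluster count, modularity Q
    ```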

    Citation Network Dataset Description

    We include two network datasets: (i) ‘CE_Node.csv’, which contains the 1,819 retrieved documents and the 4,831 unretrieved referenced documents, for a total of 6,650 documents (nodes); and (ii) ‘CE_Edge.csv’, which records the citations (edges) between the documents (nodes), a total of 29,198 citation links. These files can be used to construct a network with many different tools, but we have formatted them for use in Gephi 0.10 [3].

    ‘CE_Node.csv’ is a comma-separated values file that contains two types of nodes:

    i. Retrieved documents – these are documents captured by our query. These include full bibliographic metadata and reference lists.

    ii. Non-retrieved documents – these are documents referenced by our retrieved documents but were not retrieved via our query. These only have data contained within their reference string (i.e. first author, journal or book title, year of publication, and possibly DOI).

    The columns in the .csv refer to:

    - Id, the node ID

    - Label, the reference string of the document

    - DOI, the DOI for the document, if available

    - WOS_ID, WoS accession number

    - Authors, named authors

    - Title, title of document

    - Document_type, variable indicating whether a document is an article, review, etc.

    - Journal_book_title, journal of publication or title of book

    - Publication year, year of publication.

    - WOS_times_cited, total Core Collection citations as of 29/11/2023

    - Indegree, number of within network citations to a given document

    - Cluster, provides the cluster membership number as discussed in the manuscript (Figure 1)

    ‘CE_Edge.csv’ is a comma-separated values file that contains edges (citation links) between nodes (documents) (n=29,198). The columns refer to:

    - Source, node ID of the citing document

    - Target, node ID of the cited document

    Cluster Analysis

    We qualitatively analyse a set of publications from seven of the largest clusters in our manuscript.

  9. Open Data engages Citation and Reuse: A Follow-up Study on Enhanced...

    • ssh.datastations.nl
    Updated Sep 13, 2018
    Cite
    D. Farace (2018). Open Data engages Citation and Reuse: A Follow-up Study on Enhanced Publication [Dataset]. http://doi.org/10.17026/DANS-ZY8-FCJW
    Available download formats: tsv(2536), zip(18285), ods(5448), pdf(210095), pdf(657799), pdf(73894)
    Dataset updated
    Sep 13, 2018
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    D. Farace
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In 2011, GreyNet embarked on an Enhanced Publications Project (EPP) in order to link its collection of full-text conference papers with accompanying research data. The initial phase of the study dealt with the design and implementation of an online questionnaire among authors published in the International Conference Series on Grey Literature. From 2012 onwards, subsequent phases of the project dealt with the acquisition, submission, indexing, and archiving of GreyNet's collection of published datasets, now housed in the DANS EASY data archive.

    In 2017, GreyNet's Enhanced Publications Project was further broadened to include a Data Papers Project. Here, emphasis focused on describing the data rather than analyzing it. As such, the data paper signals data sharing and thereby promotes both data citation and the potential reuse of research data, in line with the FAIR Guiding Principles for scientific data management and stewardship.

    The results from the Data Papers Project presented last year at GL19 conclude where this study commences. Here, we seek to demonstrate the reuse of survey data collected in 2011, combined with survey data newly collected via an online questionnaire. The survey population will be drawn from GreyNet's author base, and a selection of questions from the 2011 survey will be joined by newly formulated questions in constructing the questionnaire. Furthermore, GreyNet, relying upon available use and usage statistics compiled from various sources, will seek to provide evidence of data citation and referencing.

    The results of this study are expected to demonstrate an increased willingness among GreyNet authors to share their research data, due in part to GreyNet's program of enhanced publication embedded in its workflow over the past six years. The study will provide an example of the reuse and further comparison of survey data results, which can be incorporated into GreyNet's program of training and instruction. However, statistics on data citation and referencing are less likely to provide indicative results.

    Date: Survey of 2018

  10. Research Papers Dataset

    • kaggle.com
    Updated May 8, 2023
    Cite
    NECHBA MOHAMMED (2023). Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/nechbamohammed/research-papers-dataset
    Available download formats: zip (619131172 bytes)
    Dataset updated
    May 8, 2023
    Authors
    NECHBA MOHAMMED
    Description

    This dataset (Version 10) contains a collection of research papers along with various attributes and metadata. It is a comprehensive and diverse dataset that can be used for a wide range of research and analysis tasks. The dataset encompasses papers from different fields of study, including computer science, mathematics, physics, and more.

    Fields in the dataset:
    - id: A unique identifier for each paper.
    - title: The title of the research paper.
    - authors: The list of authors of the paper.
    - venue: The journal or venue where the paper was published.
    - year: The year the paper was published.
    - n_citation: The number of citations received by the paper.
    - references: A list of paper IDs cited by this paper.
    - abstract: The abstract of the paper.

    Example: - "id": "013ea675-bb58-42f8-a423-f5534546b2b1", - "title": "Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors", - "authors": ["Leon A. Sakkal", "Kyle Z. Rajkowski", "Roger S. Armen"], - "venue": "Journal of Computational Chemistry", - "year": 2017, - "n_citation": 0, - "references": ["4f4f200c-0764-4fef-9718-b8bccf303dba", "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"], - "abstract": "This paper studies ..."

    Cite: https://www.aminer.cn/citation

  11. Career promotions, research publications, Open Access dataset

    • ordo.open.ac.uk
    Updated Feb 28, 2022
    Cite
    Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali (2022). Career promotions, research publications, Open Access dataset [Dataset]. http://doi.org/10.21954/ou.rd.19228785.v1
    Available download formats: zip
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    The Open University
    Authors
    Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a compilation of processed data on citations and references for research papers, including author, institution and open access information, for a selected sample of academics analysed using Microsoft Academic Graph (MAG) data and CORE. The data for this dataset was collected between December 2019 and January 2020.

    Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one csv file per country and per question (36 files in total). More details about the creation of this dataset are available in the public ON-MERRIT D3.1 deliverable report.

    The dataset combines two different data sources: one part is a dataset created by analysing promotion policies across the target countries, while the second part is a set of data points for understanding publishing behaviour. To facilitate the analysis, the dataset is organised in the following seven folders:

    PRT
    The file "PRT_policies.csv" contains the related information as extracted from promotion, review and tenure (PRT) policies.

    Q1: What % of papers coming from a university are Open Access?
    - Dataset Name format: oa_status_countryname_papers.csv
    - Dataset Contents: Open Access (OA) status of all papers of all the universities listed in the Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if there is at least one OA link available. OA links are collected using the CORE Discovery API.
    - Important considerations about this dataset:
      - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to.
      - The service we used to recognise whether a paper is OA, CORE Discovery, does not contain entries for all paper IDs in MAG. This implies that some of the records in the extracted dataset will have neither a true nor a false value for the is_OA field.
      - Only those records marked as true for the is_OA field can be said to be OA. Others, with a false or missing value for the is_OA field, are of unknown status (i.e. not necessarily closed access).

    Q2: How are papers, published by the selected universities, distributed across the three scientific disciplines of our choice?
    - Dataset Name format: fsid_countryname_papers.csv
    - Dataset Contents: For the given country, all papers for all the universities listed in THEWUR, with the field of study they belong to.
    - Important considerations about this dataset:
      - MAG can associate a paper with multiple values of fieldofstudyid. If a paper belongs to more than one of our fieldofstudyid values, separate records were created for the paper with each of those values.
      - MAG assigns each fieldofstudyid to a paper with a score. We preserve only those records whose score is more than 0.5 for any fieldofstudyid.
      - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.

    Q3: What is the gender distribution in authorship of papers published by the universities?
    - Dataset Name format: author_gender_countryname_papers.csv
    - Dataset Contents: All papers with their author names for all the universities listed in THEWUR.
    - Important considerations about this dataset:
      - When there are multiple collaborators (authors) on the same paper, only the records for collaborators from within the selected universities are preserved.
      - An external script was executed to determine the gender of the authors.

    Q4: Distribution of staff seniority (= number of years from their first publication until the last publication) in the given university.
    - Dataset Name format: author_ids_countryname_papers.csv
    - Dataset Contents: For a given country, all papers for authors, with their publication year, for all the universities listed in THEWUR.
    - Important considerations about this dataset:
      - When there are multiple collaborators (authors) on the same paper, only the records for collaborators from within the selected universities are preserved.
      - Calculating staff seniority can be achieved in various ways. The most straightforward option is to calculate it as academic_age = MAX(year) - MIN(year) for each authorid, as sketched below.

    Q5: Citation counts (incoming) for OA vs Non-OA papers published by the university.
    - Dataset Name format: cc_oa_countryname_papers.csv
    - Dataset Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each of those papers, the count of incoming citations available in MAG.
    - Important considerations about this dataset:
      - CORE Discovery was used to establish the OA status of papers.
      - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to.
      - Only those records marked as true for the is_OA field can be said to be OA. Others, with a false or missing value for the is_OA field, are of unknown status (i.e. not necessarily closed access).

    Q6: Count of OA vs Non-OA references (outgoing) for all papers published by universities.
    - Dataset Name format: rc_oa_countryname_papers.csv
    - Dataset Contents: Counts of all OA and unknown-status papers referenced by all papers published by all the universities listed in THEWUR.
    - Important considerations about this dataset:
      - CORE Discovery was used to establish the OA status of the papers being referenced.
      - Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors may belong to. Papers with authorship from multiple universities are counted once towards each of the universities concerned.

    Additional files:
    - fieldsofstudy_mag.csv: a dump of the fieldsofstudy table of MAG, mapping each id to its actual field of study name.
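
    Following the Q4 note above, a minimal sketch of the suggested seniority calculation (academic_age = MAX(year) - MIN(year) per author), assuming the per-country file follows the stated naming pattern and exposes authorid and year columns.

    ```python
    import pandas as pd

    # One file per country; the Austrian file name is an example instance
    # of the author_ids_countryname_papers.csv naming pattern.
    papers = pd.read_csv("author_ids_austria_papers.csv")

    # Seniority as the span between first and last publication year.
    years = papers.groupby("authorid")["year"].agg(["min", "max"])
    academic_age = years["max"] - years["min"]
    print(academic_age.describe())
    ```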

  12. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Explore at:
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    University of Tartu
    University of the Aegean
    University of Zagreb
    Gdańsk University of Technology
    Authors
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb). It is made public both to act as supplementary data for the paper "Towards High-Value Datasets determination for data-driven development: a systematic literature review" (the pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.

    The protocol is intended for the systematic literature review on the topic of high-value datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what these studies have found to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching the digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the results to papers in which these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 articles were found to be unique and were further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.
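
    For reference, the query above, rendered in Scopus advanced-search syntax, would look roughly as follows; the exact field codes the authors used are not stated in the text, so this rendering is an assumption (WoS and DGRL use their own query syntax):

        # Assumed Scopus rendering of the search described above; the
        # TITLE-ABS-KEY field code is an assumption, not taken from the paper.
        SCOPUS_QUERY = (
            'TITLE-ABS-KEY(("open data" OR "open government data") '
            'AND ("high-value data*" OR "high value data*"))'
        )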

    To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure: each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results of the search over the three indexing databases, i.e. before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - the relevance level of the article for this study {high / medium / low}

    Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

    Quality- and relevance-related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work, etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

    Format of the file: .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  13. My_Library

    • kaggle.com
    zip
    Updated Jan 20, 2025
    Cite
    Oscar Yáñez Feijóo (2025). My_Library [Dataset]. https://www.kaggle.com/datasets/oscaryezfeijo/my-library/suggestions
    Explore at:
    zip (95894 bytes)
    Available download formats
    Dataset updated
    Jan 20, 2025
    Authors
    Oscar Yáñez Feijóo
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    My_Library: Personal Library Management Database

    Overview:

    This dataset contains information about the books in my personal library, including details about the books, genres, authors, and more. It serves as a digital catalog for organizing and managing my collection of books.

    Content:

    The dataset consists of multiple tables that store various aspects of book information:

    1. Books Table:

      • ISBN (Text): Unique identifier for each book.
      • Title (Text): Title of the book.
      • Author (Text): Author(s) of the book.
      • Genre (Text): Genre or genres associated with the book.
      • Editor (Text): The editor or publishing house of the book.
      • Publication Year (Integer): Year in which the book was published.
      • Edition (Text): Edition of the book.
      • List Price USD (Real): List price of the book in USD.
      • Rating (Real): User or critic rating of the book.
      • Citation (Text): Citation format for referencing the book.
      • Number of Pages (Integer): Total number of pages in the book.
      • Language (Text): Language in which the book is written.
      • Printing Location (Text): Location where the book was printed.
    2. Genres Table:

      • Genre ID (Integer): Unique identifier for each genre.
      • Genre Name (Text): Name of the genre.
    3. Authors Table:

      • Author ID (Integer): Unique identifier for each author.
      • Author Name (Text): Name of the author.
    4. Book Genre Table (Associative table):

      • ISBN (Text): Foreign key referencing the Books table.
      • Genre ID (Integer): Foreign key referencing the Genres table.
    5. Book Author Table (Associative table):

      • ISBN (Text): Foreign key referencing the Books table.
      • Author ID (Integer): Foreign key referencing the Authors table.

    Use Cases:

    • Cataloging and organizing a personal library.
    • Tracking book details such as titles, authors, genres, and ratings.
    • Managing information about book editions, publication years, and list prices.
    • Analyzing reading preferences and book collection statistics.

    Relationships:

    Book-Genre Relationship: This relationship associates books with their corresponding genres. It connects the Books entity with the Genres entity. It allows a book to belong to multiple genres, as indicated by the many-to-many relationship.

    Book-Author Relationship: This relationship associates books with their authors. It connects the Books entity with the Authors entity. It allows a book to have multiple authors, as indicated by the many-to-many relationship.
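
    The schema described above maps directly onto relational DDL. Below is a minimal SQLite sketch of the five tables and the two associative links; the column names are assumptions derived from the field list above, and only a subset of the Books columns is shown.

        import sqlite3

        # Minimal sketch of the schema described above; names are assumptions.
        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE books (
            isbn             TEXT PRIMARY KEY,
            title            TEXT,
            publication_year INTEGER,
            list_price_usd   REAL,
            rating           REAL
        );
        CREATE TABLE genres (genre_id INTEGER PRIMARY KEY, genre_name TEXT);
        CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author_name TEXT);
        -- Associative tables implementing the two many-to-many relationships.
        CREATE TABLE book_genre (
            isbn     TEXT    REFERENCES books(isbn),
            genre_id INTEGER REFERENCES genres(genre_id),
            PRIMARY KEY (isbn, genre_id)
        );
        CREATE TABLE book_author (
            isbn      TEXT    REFERENCES books(isbn),
            author_id INTEGER REFERENCES authors(author_id),
            PRIMARY KEY (isbn, author_id)
        );
        """)

        # Example query: each book with all of its authors on one row.
        for title, names in conn.execute("""
            SELECT b.title, GROUP_CONCAT(a.author_name, '; ')
            FROM books b
            JOIN book_author ba ON ba.isbn = b.isbn
            JOIN authors a ON a.author_id = ba.author_id
            GROUP BY b.isbn
        """):
            print(title, "-", names)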

    Acknowledgments:

    This dataset was created and uploaded by me, Oscar Yanez-Feijoo, as part of a personal library management project. Feel free to explore and use this dataset for research, analysis, or personal library management purposes.

  14. Boreal forest floor greenhouse gas emissions across a wildfire-disturbed chronosequence

    • catalogue.ceh.ac.uk
    • hosted-metadata.bgs.ac.uk
    zip
    Updated Mar 21, 2017
    Cite
    K.E. Mason; S. Oakley; L.E. Street; M. Arróniz-Crespo; D.L. Jones; T.H. DeLuca; N.J. Ostle (2017). Boreal forest floor greenhouse gas emissions across a wildfire-disturbed chronosequence [Dataset]. http://doi.org/10.5285/f08a5de5-55d4-44de-90e1-94bb4f814054
    Explore at:
    zip
    Available download formats
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    NERC EDS Environmental Information Data Centre
    Authors
    K.E. Mason; S. Oakley; L.E. Street; M. Arróniz-Crespo; D.L. Jones; T.H. DeLuca; N.J. Ostle
    License

    https://eidc.ac.uk/licences/ogl/plain

    Time period covered
    Jun 1, 2012 - Jul 31, 2014
    Area covered
    Description

    Data collected during field experiments to investigate the effect of wildfires on greenhouse gas emissions across forests of differing ages in Sweden. Data comprise greenhouse gas measurements, soil horizon depth, bulk density, loss on ignition, pH, soil phosphorus, carbon and nitrogen, plant species cover, air temperature, soil temperature, soil moisture and surface leaf moisture. Greenhouse gas measurements were taken in the field. Soil cores were collected, then processed and analysed in the laboratory. Plant species cover was estimated visually, and final plant biomass was collected, dried, and weighed. Weather stations were used to collect temperature and moisture data. Data were collected from six Swedish boreal forest sites forming a chronosequence ranging in age from approximately 45 to 365 years since the last wildfire. This part of the project used soil and plant property data, climatic data, and age since wildfire to investigate controls on forest floor greenhouse gas fluxes. The project ran from 2012 to 2015.

  15. Quotes Dataset

    • kaggle.com
    zip
    Updated Apr 1, 2025
    Cite
    Ankita20chaudhary (2025). Quotes Dataset [Dataset]. https://www.kaggle.com/datasets/ankita20chaudhary/quotes-dataset
    Explore at:
    zip (7082 bytes)
    Available download formats
    Dataset updated
    Apr 1, 2025
    Authors
    Ankita20chaudhary
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains a collection of quotes, their respective authors, and associated tags extracted from the Quotes to Scrape website. The dataset provides insights into various inspirational and thought-provoking quotes, along with metadata about the author and the themes or categories each quote belongs to.

    The dataset consists of quotes from multiple pages of the website, with a total of 100 quotes across 10 pages. Each record includes the quote text, the name of the author, and a list of tags that describe the quote's theme.

    Columns: The dataset includes the following columns:

    1. Quote: The actual text of the quote.
    2. Author: The name of the author who said or wrote the quote.
    3. Tags: A list of tags related to the quote (e.g., inspiration, life, love, etc.).

    Dataset Size:
    - Number of Records: 100 quotes (10 pages × 10 quotes per page)
    - File Format: CSV
    - File Size: Small (~20 KB)

    Potential Uses:
    - Text Analysis: use the quotes for natural language processing (NLP) tasks such as sentiment analysis, topic modeling, or text classification.
    - Author Studies: analyze patterns in quotes written by different authors or study specific themes that authors frequently explore.
    - Tagging & Categorization: develop machine learning models that can categorize quotes into various themes based on their content.

    Credits:
    - The quotes are sourced from Quotes to Scrape, a simple website designed for practicing web scraping techniques.
    - Data was collected using Python's BeautifulSoup and Requests libraries (a sketch follows below).
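
    As an illustration of how such a collection could be reproduced, here is a minimal sketch using the Requests and BeautifulSoup libraries named above; the CSS selectors reflect the markup of quotes.toscrape.com but are assumptions here, and the original collection script may differ.

        import requests
        from bs4 import BeautifulSoup

        # Minimal sketch of the scrape described above; the selectors
        # (div.quote, span.text, small.author, a.tag) are assumptions.
        rows = []
        for page in range(1, 11):  # 10 pages x 10 quotes per page
            html = requests.get(f"https://quotes.toscrape.com/page/{page}/").text
            soup = BeautifulSoup(html, "html.parser")
            for q in soup.select("div.quote"):
                rows.append({
                    "Quote": q.select_one("span.text").get_text(strip=True),
                    "Author": q.select_one("small.author").get_text(strip=True),
                    "Tags": [t.get_text() for t in q.select("a.tag")],
                })

        print(len(rows), "quotes collected")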

  16. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus), August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

    Getting Started
    This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains the title, the list of authors, the list of categories, the list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. Each document in the corpus contains the following parts:
    1. Authors: the list of authors of the paper
    2. Title: the title of the paper
    3. Abstract: the abstract of the paper
    4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
    5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

    We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824. All documents in LSC have a nonempty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.

    Data Processing
    This section describes the steps taken for the LSC to be collected, cleaned and made available to researchers. Processing the data consists of six main steps:

    Step 1: Downloading the data online. This is the step of collecting the dataset online, done manually by exporting documents as tab-delimited files. All downloaded documents are available online.

    Step 2: Importing the dataset into R. This is the process of converting the collection to RData format for processing. The LSC was collected as TXT files; all documents were extracted into R.

    Step 3: Cleaning the data of documents with an empty abstract or without a category. Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, inaccurate documents were detected and removed: all documents with empty abstracts and all documents without categories.

    Step 4: Identification and correction of concatenated words in abstracts. Traditionally, abstracts are written as an executive summary in one paragraph of continuous writing, known as an 'unstructured abstract'. However, medicine-related publications in particular use 'structured abstracts', which are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section. As a result, some structured abstracts in the LSC require an additional correction step to split such concatenated words; for instance, we observe words such as ConclusionHigher and ConclusionsRT in the corpus. The detection and identification of concatenated words cannot be fully automated: human intervention is needed to identify the possible section headings. We note that we only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of the research areas. Such words were identified by sampling medicine-related publications. The section headings identified in structured abstracts are listed in List 1.

    List 1. Headings of sections identified in structured abstracts: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy.

    All words containing the headings in List 1 were detected in the entire corpus and split into two words; for instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. (A sketch of this correction appears after the references below.)

    Step 5: Extracting (sub-setting) the data based on the lengths of abstracts. After the correction of concatenated words was completed, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words, but word limits vary from journal to journal; for instance, the Journal of Vascular Surgery recommends that 'Clinical and basic research studies must include a structured abstract of 400 words or less' [7]. In LSC, the length of abstracts varies from 1 to 3805 words. We decided to limit the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents containing fewer than 30 or more than 500 words in their abstracts were removed.

    Step 6: Saving the dataset in CSV format. The corrected and extracted documents were saved in 36 CSV files, whose structure is described in the following section.

    The Structure of Fields in CSV Files
    In the CSV files, the information is organised with one record per line; the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields. To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication manual. American Psychological Association, Washington, DC, 1983.
    [7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
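
    To illustrate Steps 4 and 5, here is a minimal Python sketch of the heading-splitting correction and the length filter. The heading list is a simplified rendering of List 1, and the splitting rule is an assumption: the original correction involved human review and was not fully automated, and str.split() only approximates the Microsoft Word counting rule.

        import re

        # Simplified variants of the section headings in List 1 above.
        HEADINGS = [
            "Background", "Methods", "Method", "Design", "Theoretical",
            "Measurements", "Measurement", "Location", "Aims", "Aim",
            "Methodology", "Process", "Abstract", "Population", "Approach",
            "Objectives", "Objective", "Purposes", "Purpose", "Subjects",
            "Introduction", "Implications", "Implication", "Patients",
            "Procedures", "Procedure", "Hypothesis", "Measures", "Settings",
            "Setting", "Limitations", "Limitation", "Discussion",
            "Conclusions", "Conclusion", "Results", "Result", "Findings",
            "Finding", "Materials", "Material", "Rationale",
        ]

        def split_concatenated_headings(text: str) -> str:
            # Insert a space when a known heading is glued to a following
            # capitalised word, e.g. "ConclusionHigher" -> "Conclusion Higher".
            for h in sorted(HEADINGS, key=len, reverse=True):
                text = re.sub(rf"\b{h}(?=[A-Z])", h + " ", text)
            return text

        def keep_by_length(abstract: str, lo: int = 30, hi: int = 500) -> bool:
            # Step 5: keep only abstracts of typical length (30 to 500 words).
            return lo <= len(abstract.split()) <= hi

        print(split_concatenated_headings("ConclusionHigher doses were safe."))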

  17. Basic and other measurements of radiation at Neumayer Station in 2013, reference list of 12 datasets

    • search.datacite.org
    • doi.pangaea.de
    Updated 2014
    Cite
    Gert König-Langlo (2014). Basic and other measurements of radiation at Neumayer Station in 2013, reference list of 12 datasets [Dataset]. http://doi.org/10.1594/pangaea.150002
    Explore at:
    Dataset updated
    2014
    Dataset provided by
    DataCite https://www.datacite.org/
    PANGAEA - Data Publisher for Earth & Environmental Science
    Authors
    Gert König-Langlo
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Description

    The data collection contains 12 links to basic and other measurements of radiation at Neumayer Station from the Baseline Surface Radiation Network (BSRN). It covers all available measurements from the time period between 2013-01 and 2013-12. Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (Amelie.Driemel@awi.de) for an account to download these datasets.

  18. Data from: Bibliographic dataset characterizing studies that use online biodiversity databases

    • zenodo.org
    • portalcientifico.unav.edu
    bin, csv
    Updated Jan 24, 2020
    Cite
    Joan E. Ball-Damerow; Laura Brenskelle; Narayani Barve; Raphael LaFrance; Pamela S. Soltis; Petra Sierwald; Rüdiger Bieler; Arturo Ariño; Robert Guralnick (2020). Bibliographic dataset characterizing studies that use online biodiversity databases [Dataset]. http://doi.org/10.5281/zenodo.2589439
    Explore at:
    csv, bin
    Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Joan E. Ball-Damerow; Laura Brenskelle; Narayani Barve; Raphael LaFrance; Pamela S. Soltis; Petra Sierwald; Rüdiger Bieler; Arturo Ariño; Robert Guralnick
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes bibliographic information for 501 papers that were published from 2010 to April 2017 (the time of the search) and use online biodiversity databases for research purposes. Our overarching goal in this study is to determine how research uses of biodiversity data developed during a time of unprecedented growth of online data resources. We also determine the uses with the highest number of citations, how online occurrence data are linked to other data types, and whether and how data quality is addressed. Specifically, we address the following questions:

    1.) What primary biodiversity databases have been cited in published research, and which databases have been cited most often?

    2.) Is the biodiversity research community citing databases appropriately, and are the cited databases currently accessible online?

    3.) What are the most common uses, general taxa addressed, and data linkages, and how have they changed over time?

    4.) What uses have the highest impact, as measured through the mean number of citations per year?

    5.) Are certain uses applied more often for plants/invertebrates/vertebrates?

    6.) Are links to specific data types associated more often with particular uses?

    7.) How often are major data quality issues addressed?

    8.) What data quality issues tend to be addressed for the top uses?

    Relevant papers for this analysis include those that use online and openly accessible primary occurrence records, or those that add data to an online database. Google Scholar (GS) provides full-text indexing, which was important to identify data sources that often appear buried in the methods section of a paper. Our search was therefore restricted to GS. All authors discussed and agreed upon representative search terms, which were relatively broad to capture a variety of databases hosting primary occurrence records. The terms included: "species occurrence" database (8,800 results), "natural history collection" database (634 results), herbarium database (16,500 results), "biodiversity database" (3,350 results), "primary biodiversity data" database (483 results), "museum collection" database (4,480 results), "digital accessible information" database (10 results), and "digital accessible knowledge" database (52 results); note that quotation marks are used in the search terms where exact phrases are required. We downloaded all records returned by each search (or the first 500 if there were more) into a Zotero reference management database. About one third of the 2500 downloaded papers were relevant. Three of the authors with specialized knowledge of the field characterized relevant papers using a standardized tagging protocol based on a series of key topics of interest. We developed a list of potential tags and descriptions for each topic, including: database(s) used, database accessibility, scale of study, region of study, taxa addressed, research use of data, other data types linked to species occurrence data, data quality issues addressed, authors, institutions, and funding sources. Each tagged paper was thoroughly checked by a second tagger.

    The final dataset of tagged papers allows us to quantify the general areas of research made possible by the expansion of online species occurrence databases, and trends over time. Analyses of these data will be published in a separate quantitative review.

  19. Dataset of books in the The Oxford reference collection series

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books in the The Oxford reference collection series [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=j0-book_series&fop0=%3D&fval0=The+Oxford+reference+collection&j=1&j0=book_series
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 4 rows and is filtered where the book series is The Oxford reference collection. It features 9 columns including author, publication date, language, and book publisher.

  20. Data for Post-retraction citation: A review of scholarly research on the spread of retracted science

    • databank.illinois.edu
    Updated Jul 14, 2023
    Cite
    Jodi Schneider; Susmita Das; Jacqueline Léveillé; Randi Proescholdt (2023). Data for Post-retraction citation: A review of scholarly research on the spread of retracted science [Dataset]. http://doi.org/10.13012/B2IDB-3254797_V1
    Explore at:
    Dataset updated
    Jul 14, 2023
    Authors
    Jodi Schneider; Susmita Das; Jacqueline Léveillé; Randi Proescholdt
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Alfred P. Sloan Foundation
    The David F. Linowes Fellows Program
    Description

    Data for Post-retraction citation: A review of scholarly research on the spread of retracted science
    Schneider, Jodi; Das, Susmita; Léveillé, Jacqueline; Proescholdt, Randi
    Contact: Jodi Schneider jodi@illinois.edu & jschneider@pobox.com

    ********** OVERVIEW **********

    This dataset provides further analysis for an ongoing literature review about post-retraction citation. This ongoing work extends a poster presented as: Jodi Schneider, Jacqueline Léveillé, Randi Proescholdt, Susmita Das, and The RISRS Team. Characterization of Publications on Post-Retraction Citation of Retracted Articles. Presented at the Ninth International Congress on Peer Review and Scientific Publication, September 8-10, 2022, hybrid in Chicago. https://hdl.handle.net/2142/114477 (now also in https://peerreviewcongress.org/abstract/characterization-of-publications-on-post-retraction-citation-of-retracted-articles/ )

    Items as of the poster version are listed in the bibliography 92-PRC-items.pdf. Note that following the poster, we made several changes to the dataset (see changes-since-PRC-poster.txt). For both the poster dataset and the current dataset, 5 items have 2 categories (see 5-items-have-2-categories.txt).

    Articles were selected from the Empirical Retraction Lit bibliography (https://infoqualitylab.org/projects/risrs2020/bibliography/ and https://doi.org/10.5281/zenodo.5498474 ). The current dataset includes 92 items: 91 items were selected from the 386 total items in Empirical Retraction Lit bibliography version v.2.15.0 (July 2021), and 1 item was added because it is the final-form publication of a grouping of 2 items from the bibliography: Yang (2022) Do retraction practices work effectively? Evidence from citations of psychological retracted articles, http://doi.org/10.1177/01655515221097623

    Items were classified into 7 topics; 2 of the 7 topics have been analyzed to date.

    ********************** OVERVIEW OF ANALYSIS **********************

    DATA ANALYZED (2 of the 7 topics have been analyzed to date):
    - field-based case studies (n = 20)
    - author-focused case studies of 1 or several authors with many retracted publications (n = 15)

    FUTURE DATA TO BE ANALYZED, NOT YET COVERED (5 of the 7 topics have not yet been analyzed as of this release):
    - database-focused analyses (n = 33)
    - paper-focused case studies of 1 to 125 selected papers (n = 15)
    - studies of retracted publications cited in review literature (n = 8)
    - geographic case studies (n = 4)
    - studies selecting retracted publications by method (n = 2)

    ************** FILE LISTING **************

    BIBLIOGRAPHY:
    92-PRC-items.pdf

    TEXT FILES:
    README.txt
    5-items-have-2-categories.txt
    changes-since-PRC-poster.txt

    CODEBOOKS:
    Codebook for authors.docx
    Codebook for authors.pdf
    Codebook for field.docx
    Codebook for field.pdf
    Codebook for KEY.docx
    Codebook for KEY.pdf

    SPREADSHEETS:
    field.csv
    field.xlsx
    multipleauthors.csv
    multipleauthors.xlsx
    multipleauthors-not-named.csv
    multipleauthors-not-named.xlsx
    singleauthors.csv
    singleauthors.xlsx

    *************************** DESCRIPTION OF FILE TYPES ***************************

    BIBLIOGRAPHY (92-PRC-items.pdf) presents the items as of the poster version. This has minor differences from the current dataset; consult changes-since-PRC-poster.txt for details on the differences.

    TEXT FILES provide notes for additional context. These files end in .txt.

    CODEBOOKS describe the data we collected. The same data is provided in both Word (.docx) and PDF format. There is one general codebook that is referred to in the other codebooks: Codebook for KEY lists the fields assigned (e.g., for a journal or conference). Note that this is distinct from the overall analysis of fields in the Empirical Retraction Lit bibliography; for that analysis see Proescholdt, Randi (2021): RISRS Retraction Review - Field Variation Data. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2070560_V1 The other codebooks document the specific information we entered in each column of a spreadsheet.

    SPREADSHEETS present the data collected. The same data is provided in both Excel (.xlsx) and CSV format. Each data row describes a publication or item (e.g., thesis, poster, preprint). For explanations of the column headers, see the associated codebook.

    ***************************** DETAILS ON THE SPREADSHEETS *****************************

    Field-based case studies
    CODEBOOK: Codebook for field (refers to: Codebook for KEY)
    DATA SHEET: field
    - NUMBER OF DATA ROWS: 20 (each data row describes a publication/item)
    - NUMBER OF PUBLICATION GROUPINGS: 17
    - GROUPED PUBLICATIONS: Rubbo (2019) - 2 items, Yang (2022) - 3 items

    Author-focused case studies of 1 or several authors with many retracted publications
    CODEBOOK: Codebook for authors (refers to: Codebook for KEY)
    DATA SHEET 1: singleauthors
    - NUMBER OF DATA ROWS: 9
    - NUMBER OF PUBLICATION GROUPINGS: 9
    DATA SHEET 2: multipleauthors
    - NUMBER OF DATA ROWS: 5
    - NUMBER OF PUBLICATION GROUPINGS: 5
    DATA SHEET 3: multipleauthors-not-named
    - NUMBER OF DATA ROWS: 1
    - NUMBER OF PUBLICATION GROUPINGS: 1

    ********************************* CRediT http://credit.niso.org *********************************

    Susmita Das: Conceptualization, Data curation, Investigation, Methodology
    Jacqueline Léveillé: Data curation, Investigation
    Randi Proescholdt: Conceptualization, Data curation, Investigation, Methodology
    Jodi Schneider: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Supervision
