https://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains all the citation data (in N-Triples format) included in the OpenCitations Index, released on March 24, 2025. Each citation in the dataset, defined as an individual of the class cito:Citation, includes the following information:
[citation IRI] the Open Citation Identifier (OCI) for the citation, defined in the final part of the URL identifying the citation (https://w3id.org/oc/index/ci/[OCI]);
[property "cito:hasCitingEntity"] the citing entity identified by its OMID URL (https://opencitations.net/meta/[OMID]);
[property "cito:hasCitedEntity"] the cited entity identified by its OMID URL (https://opencitations.net/meta/[OMID]);
[property "cito:hasCitationCreationDate"] the creation date of the citation (i.e. the publication date of the citing entity);
[property "cito:hasCitationTimeSpan"] the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity);
[type "cito:JournalSelfCitation"] it records whether the citation is a journal self-citation (i.e. the citing and the cited entities are published in the same journal);
[type "cito:AuthorSelfCitation"] it records whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).
Note: the information for each citation is sourced from OpenCitations Meta (https://opencitations.net/meta), a database that stores and delivers bibliographic metadata for all bibliographic resources included in the OpenCitations Indexes. The data provided in this dump is therefore based on the state of OpenCitations Meta at the time this collection was generated.
This version of the dataset contains 2,155,497,918 citations. The size of the zipped archive is 80.6 GB, while the size of the unzipped N-Triples files is 1.9 TB.
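As a rough illustration of the structure described above, the following sketch extracts the OCI and the citing and cited entities from N-Triples lines. It assumes that the cito: prefix expands to http://purl.org/spar/cito/ and that each triple sits on one line; the sample triples and their identifiers are invented for the example and should be checked against the actual dump.

```python
import re
from collections import defaultdict

CITO = "http://purl.org/spar/cito/"  # assumed expansion of the cito: prefix
CI_PREFIX = "https://w3id.org/oc/index/ci/"  # citation IRI prefix from the description

# Matches a triple whose subject, predicate and object are all IRIs.
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')


def parse_citations(lines):
    """Group citing/cited entities by OCI from an iterable of N-Triples lines."""
    citations = defaultdict(dict)
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if not match:
            continue  # literal objects (dates, durations) and blank lines are skipped
        subject, predicate, obj = match.groups()
        if not subject.startswith(CI_PREFIX):
            continue
        oci = subject[len(CI_PREFIX):]
        if predicate == CITO + "hasCitingEntity":
            citations[oci]["citing"] = obj
        elif predicate == CITO + "hasCitedEntity":
            citations[oci]["cited"] = obj
    return dict(citations)


# Hypothetical sample triples; the identifiers are made up.
sample = [
    '<https://w3id.org/oc/index/ci/061-062> <http://purl.org/spar/cito/hasCitingEntity> <https://opencitations.net/meta/br/061> .',
    '<https://w3id.org/oc/index/ci/061-062> <http://purl.org/spar/cito/hasCitedEntity> <https://opencitations.net/meta/br/062> .',
    '<https://w3id.org/oc/index/ci/061-062> <http://purl.org/spar/cito/hasCitationCreationDate> "2020-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .',
]
result = parse_citations(sample)
```

Because the function accepts any iterable of lines, it can be fed an open file handle and processed as a stream, which matters given the 1.9 TB unzipped size.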
https://spdx.org/licenses/CC0-1.0.html
Objectives: To analyse the total number of newspaper articles citing the four leading general medical journals and to describe national citation patterns. Design: Quantitative content analysis. Setting/sample: Full text of 22 general newspapers in 14 countries over the period 2008-2015, collected from LexisNexis. The 14 countries were categorized into four regions: US, UK, Western World (EU countries other than the UK, plus Australia, New Zealand and Canada) and Rest of the World (other countries). Main outcome measure: Press citations of four medical journals (two American: NEJM and JAMA; and two British: The Lancet and The BMJ) in 22 newspapers. Results: British and American newspapers cited at least one of the four analysed medical journals about three times a week in 2008-2015 (weekly mean 3.2 and 2.7 citations respectively); the newspapers from other Western countries did so about once a week (weekly mean 1.1), and those from the Rest of the World cited them about once a month (monthly mean 1.1). The New York Times cited them more often than any other newspaper (weekly mean 4.7). The analysis showed the existence of three national citation patterns in the daily press: American newspapers cited mostly American journals (70.0% of citations), British newspapers cited mostly British journals (86.5%), and the rest of the analysed press cited more British journals than American ones. The Lancet was the most cited journal in the press of almost all Western countries outside the US and the UK. Multivariate correspondence analysis confirmed the national patterns and showed that over 85% of the citation data variability is retained in a single new variable: the national dimension. Conclusion: British and American newspapers cite the four analysed medical journals most often, showing a domestic preference for their respective national journals; non-British and non-American newspapers show a common international citation pattern.
This dataset gathers the most crucial SEO statistics for the year, providing an overview of the dominant trends and best practices in the field of search engine optimization. Aimed at digital marketing professionals, site owners, and SEO analysts, this collection of information serves as a guide to navigate the evolving SEO landscape with confidence and accuracy.
Mode of Data Production:
The statistics have been carefully selected and compiled from a variety of credible and recognized sources in the SEO industry, including research reports, web traffic data analytics, and consumer and marketing professional surveys. Each statistic was checked for reliability and relevance to current trends.
Categories Included:
User search behaviour: Statistics on the evolution of search modes, including voice and mobile search.
Mobile Optimisation: Data on the importance of site optimization for mobile devices.
Importance of Backlinks: Insights on the role of backlinks in SEO ranking and the need to prioritize quality.
Content quality: Statistics highlighting the importance of relevant and engaging content for SEO.
Search engine algorithms: Information on the impact of algorithm updates on SEO strategies.
Usefulness of the Data: This dataset is designed to help users quickly understand current SEO dynamics and apply that knowledge in optimizing their digital marketing strategies. It provides a solid foundation for benchmarking, strategic planning, and informed decision-making in the field of SEO.
Update and Accessibility: To ensure relevance and timeliness, the dataset will be regularly updated with new information and emerging trends in the SEO world.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides the supporting materials for a paper entitled "Characterizing Data Reusability through A Data Citation Framework- A Case Study on Earth System Science Data". The dataset is mainly used to support the different analyses within this manuscript, particularly on citation demographics, citation intensity, citation aging, and citation networks.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Price quote data (for locally collected data only) and consumption segment indices that underpin consumer price inflation statistics, giving users access to the detailed data that are used in the construction of the UK’s inflation figures. The data are being made available for research purposes only and are not an accredited official statistic. From October 2024, private school fees and part-time education classes have been included in the consumption segment indices file. For more information on the introduction of consumption segments, please see the Consumer Prices Indices Technical Manual, 2019. Note that this dataset was previously called the consumer price inflation item indices and price quotes dataset.
A list of all uniform citations from the Louisville Metro Police Department. The CSV file is updated daily and includes case number, date, location, division, beat, offender demographics, statutes and charges, and UCR codes.
INCIDENT_NUMBER or CASE_NUMBER links these data sets together:
Crime Data
Uniform Citation Data
Firearm intake
LMPD hate crimes
Assaulted Officers
CITATION_CONTROL_NUMBER links these data sets together:
Uniform Citation Data
LMPD Stops Data
Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.
AGENCY_DESC - the name of the department that issued the citation
CASE_NUMBER - the number associated with the incident or used as a reference to store items in our evidence rooms; it can be used to connect this dataset to INCIDENT_NUMBER in the following other datasets:
1. Crime Data
2. Firearms intake
3. LMPD hate crimes
4. Assaulted Officers
NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example: in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
CITATION_YEAR - the year the citation was issued
CITATION_CONTROL_NUMBER - links this dataset to the LMPD Stops Data
CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
CITATION_DATE - the date the citation was issued
CITATION_LOCATION - the location the citation was issued
DIVISION - the LMPD division in which the citation was issued
BEAT - the LMPD beat in which the citation was issued
PERSONS_SEX - the gender of the person who received the citation
PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
PERSONS_AGE - the age of the person who received the citation
PERSONS_HOME_CITY - the city in which the person who received the citation lives
PERSONS_HOME_STATE - the state in which the person who received the citation lives
PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
VIOLATION_CODE - multiple alpha/numeric code assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
STATUTE - multiple alpha/numeric code representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
CHARGE_DESC - the description of the type of charge for the citation
UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/
UCR_DESC - the description of the UCR_CODE. For more details visit https://ucr.fbi.gov/
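The formatting difference between CASE_NUMBER and INCIDENT_NUMBER noted above can be bridged with a small helper before joining the datasets. This is a minimal sketch that assumes every case number follows the 2-2-6 digit grouping of the single example given (8018013155 and 80-18-013155); the function names are hypothetical, and values that do not fit the pattern are rejected rather than guessed at.

```python
def case_to_incident_number(case_number: str) -> str:
    """Format a 10-digit CASE_NUMBER as an INCIDENT_NUMBER (NN-NN-NNNNNN).

    Assumes the 2-2-6 grouping shown in the dataset description.
    """
    digits = case_number.strip()
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f"expected 10 digits, got {case_number!r}")
    return f"{digits[:2]}-{digits[2:4]}-{digits[4:]}"


def incident_to_case_number(incident_number: str) -> str:
    """Strip the dashes from an INCIDENT_NUMBER to obtain the CASE_NUMBER form."""
    return incident_number.strip().replace("-", "")


incident = case_to_incident_number("8018013155")
case = incident_to_case_number("80-18-013155")
```

Keeping the values as strings avoids losing leading zeros, which would happen if the column were read as an integer.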
Note: Due to a system migration, this data will cease to update on March 14th, 2023. The current projection is to restart the updates within 30 days of the system migration, on or around April 13th, 2023. The dataset description and field definitions are the same as for the Uniform Citation Data described above.
Data file for the first release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for our study on the coverage of software engineering articles in open citation databases:
The table Data Citation Corpus is part of the dataset Data Citation Corpus Data File, available at https://redivis.com/datasets/am5t-e9jvcn6s5. It contains 5256114 rows across 14 variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GENERAL INFORMATION
Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation
Date of data collection: January to March 2022
Collection instrument: SurveyMonkey
Funding: Alfred P. Sloan Foundation
SHARING/ACCESS INFORMATION
Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license
Links to publications that cite or use the data:
Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437
Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266
DATA & FILE OVERVIEW
File List
Additional related data collected that was not included in the current data package: Open ended questions asked to respondents
METHODOLOGICAL INFORMATION
Description of methods used for collection/generation of data:
The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.
We received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).
Methods for processing the data:
Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.
Instrument- or software-specific information needed to interpret the data:
The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format in CSV. The Codebook is required to interpret the values.
DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata
Number of variables: 95
Number of cases/rows: 2,492
Missing data codes: 999 Not asked
Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
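When working with the coded CSV version, the missing-data code listed above (999, "Not asked") needs to be treated as missing and not as a numeric answer. The sketch below shows one way to do that with the standard library; the column names and sample rows are hypothetical, since the real variable names are documented only in the Codebook.

```python
import csv
import io

MISSING_CODE = "999"  # "Not asked", per the data-specific information above


def load_rows(handle):
    """Read coded CSV rows, replacing the missing-data code with None."""
    reader = csv.DictReader(handle)
    return [
        {name: (None if value == MISSING_CODE else value) for name, value in row.items()}
        for row in reader
    ]


# Hypothetical sample; the real file has 95 variables and 2,492 rows.
sample_csv = io.StringIO("respondent_id,q1,q2\n1,3,999\n2,999,1\n")
rows = load_rows(sample_csv)
```

This replacement is applied to every column, so an identifier column that legitimately contains 999 would need to be excluded from it.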
https://spdx.org/licenses/CC0-1.0.html
We propose a stochastic generative model to represent a directed graph constructed by citations among academic papers, where nodes represent papers with discrete publication times and directed edges represent citations. Like existing models, the proposed model assumes that a citation between two papers occurs with a probability based on the type of the citing paper, the importance of the cited paper, and the difference between their publication times. We take the out-degree of a citing paper as its type, because, for example, a survey paper cites many papers. We approximate the importance of a cited paper by its in-degree. In our model, we adopt three functions: a logistic function for the number of papers published in discrete time, an inverse Gaussian probability distribution function to express the aging effect based on the difference between publication times, and an exponential distribution (or a generalized Pareto distribution) for the out-degree distribution. We consider our model to be a more reasonable and appropriate stochastic model than existing models, and it can perform complete simulations without using the original data. In this paper, we first use the Web of Science database to examine the features used in our model. Using the proposed model, we generate simulated graphs and demonstrate that they are similar to the original data with respect to the in- and out-degree distributions and node triangle participation. In addition, we analyze two other citation networks derived from physics papers in the arXiv database and verify the effectiveness of the model.
Methods: We focus on a subset of the Web of Science (WoS), WoS-Stat, which is a citation network that comprises the citations between papers published in journals whose subject is associated with “Statistics and Probability.” We construct a citation network utilizing a paper identifier (ID), publication year, and reference list (list of paper IDs) for 36 years, from 1981 to 2016. WoS-Stat consists of 179,483 papers and 1,106,622 citations.
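For readers unfamiliar with the three functions named in the abstract, the sketch below writes them out in their textbook forms. The parameter names (capacity, rate, midpoint, mu, lam) are illustrative assumptions; the abstract does not give the authors' parametrization or fitted values, nor how the three pieces are combined into a citation probability.

```python
import math


def logistic(t, capacity, rate, midpoint):
    """Textbook logistic curve, e.g. for the number of papers published at time t."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))


def inverse_gaussian_pdf(x, mu, lam):
    """Inverse Gaussian density with mean mu and shape lam, for x > 0.

    Here x would be the gap between publication times (the aging effect).
    """
    if x <= 0:
        return 0.0
    return math.sqrt(lam / (2.0 * math.pi * x ** 3)) * math.exp(
        -lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x)
    )


def exponential_pdf(k, rate):
    """Exponential density, one of the two options named for the out-degree."""
    if k < 0:
        return 0.0
    return rate * math.exp(-rate * k)


# Illustrative values only; none of these parameters come from the paper.
half_capacity = logistic(2000, capacity=10000.0, rate=0.1, midpoint=2000)
aging_at_one = inverse_gaussian_pdf(1.0, mu=1.0, lam=1.0)
degree_at_zero = exponential_pdf(0.0, rate=2.0)
```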
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the 3 datasets behind the graphs generated in the "State of Open Data 2024 Special Report: Bridging policy and practice in data sharing". The datasets include counts and percentages for papers that link to datasets, filtered by Country, Funder and Affiliation. The datasets were generated by combining the DataCite Data Citation Corpus (https://corpus.datacite.org/dashboard) with Dimensions (https://www.dimensions.ai/) in Google BigQuery.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical information of data set D2.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Infra-Annual Labor Statistics: Employment Total: From 15 to 64 Years for United States (LFEM64TTUSQ647S) from Q1 1970 to Q1 2025 about 15 to 64 years, employment, and USA.
https://louisville-metro-opendata-lojic.hub.arcgis.com/pages/terms-of-use-and-license
Note: Due to a system migration, this data will cease to update on March 14th, 2023. At this time we are updating this dataset manually once per month as resources allow. For real time crime data please utilize communitycrimemap.com. The dataset description and field definitions are the same as for the Uniform Citation Data described above.
https://fred.stlouisfed.org/legal/#copyright-citation-required
Graph and download economic data for Nominal Statistical Discrepancy for Great Britain (NSDGDPSAXDCGBQ) from Q1 1995 to Q1 2025 about residual and United Kingdom.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Citation context annotation. This dataset is a second version (V2) and part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. (2020) "Continued post-retraction citation of a fraudulent clinical trial report, eleven years after it was retracted for falsifying data". Scientometrics. In press, DOI: 10.1007/s11192-020-03631-1 Publications were selected by examining all citations to the retracted paper Matsuyama 2005, and selecting the 35 citing papers, published 2010 to 2019, which do not mention the retraction, but which mention the methods or results of the retracted paper (called "specific" in Ye, Di; Hill, Alison; Whitehorn (Fulton), Ashley; Schneider, Jodi (2020): Citation context annotation for new and newly found citations (2006-2019) to retracted paper Matsuyama 2005. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8150563_V1 ). The annotated citations are second-generation citations to the retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) https://doi.org/10.1016/S0012-3692(08)60339-6). 
OVERALL DATA for VERSION 2 (V2)
FILES/FILE FORMATS
Same data in two formats:
2010-2019 SG to specific not mentioned FG.csv - Unicode CSV (preservation format only) - same as in V1
2010-2019 SG to specific not mentioned FG.xlsx - Excel workbook (preferred format) - same as in V1
Additional files in V2:
2G-possible-misinformation-analyzed.csv - Unicode CSV (preservation format only)
2G-possible-misinformation-analyzed.xlsx - Excel workbook (preferred format)
ABBREVIATIONS:
2G - Refers to the second-generation of Matsuyama
FG - Refers to the direct citation of Matsuyama (the one the second-generation item cites)
COLUMN HEADER EXPLANATIONS
File name: 2G-possible-misinformation-analyzed. Other column headers in this file have the same meaning as explained in V1. The following are additional header explanations:
Quote Number - The order of the quote (citation context citing the first generation article given in "FG in bibliography") in the second generation article (given in "2G article")
Quote - The text of the quote (citation context citing the first generation article given in "FG in bibliography") in the second generation article (given in "2G article")
Translated Quote - English translation of "Quote", automatic translation from Google Scholar
Seriousness/Risk - Our assessment of the risk of misinformation and its seriousness
2G topic - Our assessment of the topic of the cited article (the second generation article given in "2G article")
2G section - The section of the citing article (the second generation article given in "2G article") in which the cited article (the first generation article given in "FG in bibliography") was found
FG in bib type - The type of article (e.g., review article), referring to the cited article (the first generation article given in "FG in bibliography")
FG in bib topic - Our assessment of the topic of the cited article (the first generation article given in "FG in bibliography")
FG in bib section - The section of the cited article (the first generation article given in "FG in bibliography") in which the Matsuyama retracted paper was cited
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This document describes the data collection and datasets used in the manuscript "A Matter of Culture? Conceptualising and Investigating ‘Evidence Cultures’ within Research on Evidence-Informed Policymaking" [1].
Data Collection
To construct the citation network analysed in the manuscript, we first designed a series of queries to capture a large sample of literature exploring the relationship between evidence, policy, and culture from various perspectives. Our team of domain experts developed the following queries based on terms common in the literature. These queries search for the terms included in the titles, abstracts, and associated keywords of WoS indexed records (i.e. ‘TS=’). While these are separated below for ease of reading, they were combined into a single query via the OR operator in our search. Our search was conducted on the Web of Science’s (WoS) Core Collection through the University of Edinburgh Library subscription on 29/11/2023, returning a total of 2,089 records.
TS = ((“cultures of evidence” OR “culture of evidence” OR “culture of knowledge” OR “cultures of knowledge” OR “research culture” OR “research cultures” OR “culture of research” OR “cultures of research” OR “epistemic culture” OR “epistemic cultures” OR “epistemic community” OR “epistemic communities” OR “epistemic infrastructure” OR “evaluation culture” OR “evaluation cultures” OR “culture of evaluation” OR “cultures of evaluation” OR “thought style” OR “thought styles” OR “thought collective” OR “thought collectives” OR “knowledge regime” OR “knowledge regimes” OR “knowledge system” OR “knowledge systems” OR “civic epistemology” OR “civic epistemologies”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “policy decision” OR “policy decisions” OR “political decision” OR “political decisions” OR “political decision making”))
OR
TS = ((“culture” OR “cultures”) AND ((“evidence-based” OR “evidence-informed” OR “evidence-led” OR “science-based” OR “science-informed” OR “science-led” OR “research-based” OR “research-informed” OR “evidence use” OR “evidence user” OR “evidence utilisation” OR “evidence utilization” OR “research use” OR “researcher user” OR “research utilisation” OR “research utilization” OR “research in” OR “evidence in” OR “science in”) NEAR/1 (“policymaking” OR “policy making” OR “policy maker” OR “policy makers”)))
OR
TS = ((“culture” OR “cultures”) AND (“scientific advice” OR “technical advice” OR “scientific expertise” OR “technical expertise” OR “expert advice”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “political decision” OR “political decisions” OR “political decision making”))
OR
TS = ((“culture” OR “cultures”) AND (“post-normal science” OR “trans-science” OR “transdisciplinary” OR “transdisiplinarity” OR “science-policy interface” OR “policy sciences” OR “sociology of knowledge” OR “sociology of science” OR “knowledge transfer” OR “knowledge translation” OR “knowledge broker” OR “implementation science” OR “risk society”) AND (“policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers”))
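As an illustration of how the blocks above can be joined into one search string, the following is a minimal sketch rather than the script used for the study. The two sub-queries are shortened, hypothetical stand-ins for the full blocks, and `build_query` is a helper name introduced here. It also assumes that the search interface expects straight double quotes instead of the typographic quotes shown in this document.

```python
# Shortened, hypothetical stand-ins for the topic-search blocks listed above.
sub_queries = [
    "TS = ((“culture of evidence” OR “epistemic culture”) AND (“policy” OR “policies”))",
    "TS = ((“culture” OR “cultures”) AND (“expert advice”) AND (“policy”))",
]


def build_query(blocks):
    """Join topic-search blocks with OR after normalising the quote marks."""
    cleaned = []
    for block in blocks:
        # Assumption: typographic quotes must become straight quotes.
        block = block.replace("“", '"').replace("”", '"')
        cleaned.append(block.strip())
    return " OR ".join(cleaned)


combined = build_query(sub_queries)
print(combined)
```

The same function can be applied to the four full blocks to obtain a single string for pasting into an advanced search field.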
Citation Network Construction
All bibliographic metadata on these 2,089 records were downloaded in five batches in plain text and then merged in R. We then parsed these data into network-readable files. Each unique reference string was given a unique node ID. A node-attribute list (‘CE_Node’) links the identifying information of each document with its node ID, including authors, title, year of publication, journal, WoS ID, and WoS citations. An edge list (‘CE_Edge’) records all citations from these documents to the items in their bibliographies, with edges going from the citing document to the cited document, using the relevant node IDs. These data were then cleaned by (a) matching DOIs for reference strings that differ but point to the same paper, and (b) manually merging obvious duplicates caused by referencing errors.
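The parsing step can be sketched in a few lines. The original processing was done in R, so the Python below is only an illustrative equivalent. The record layout (`ref` and `cited` keys), the example reference strings, and the function name `build_network` are all hypothetical, and the sketch leaves out the DOI matching and manual de-duplication.

```python
# Hypothetical, simplified records: each retrieved document carries its own
# reference string plus the reference strings found in its bibliography.
records = [
    {"ref": "Smith A, 2010, J POLICY",
     "cited": ["Jones B, 2005, SCI STUD", "Lee C, 2001, RES POLICY"]},
    {"ref": "Jones B, 2005, SCI STUD",
     "cited": ["Lee C, 2001, RES POLICY"]},
]


def build_network(records):
    """Return (node_ids, edges) with one node ID per unique reference string."""
    node_ids = {}  # reference string -> integer node ID
    edges = []     # (citing node ID, cited node ID)

    def node_id(ref):
        # A reference string gets the next free ID the first time it is seen.
        if ref not in node_ids:
            node_ids[ref] = len(node_ids)
        return node_ids[ref]

    for record in records:
        source = node_id(record["ref"])
        for cited_ref in record["cited"]:
            edges.append((source, node_id(cited_ref)))
    return node_ids, edges


node_ids, edges = build_network(records)
print(node_ids)
print(edges)
```

Because IDs are keyed on the reference string, two strings that differ but point to the same paper would receive separate IDs here, which is why the cleaning steps (a) and (b) are needed afterwards.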
Our initial dataset consisted of 2,089 retrieved documents and 123,772 unretrieved cited documents (i.e. documents that were cited within the publications we retrieved but which were not one of these 2,089 documents). These documents were connected by 157,229 citation links, but ~87% of the documents in the network were cited just once. To focus on relevant literature, we filtered the network to include only documents with at least three citation or reference links. We further refined the dataset by focusing on the main connected component, resulting in 6,650 nodes and 29,198 edges. It is this dataset that we publish here, and it is this network that underpins Figure 1, Table 1, and the qualitative examination of documents (see manuscript for further details).
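The two filtering steps can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code. It assumes that "links" means citations received plus references made, that the threshold is applied in a single pass, and that connectivity ignores edge direction. The function name `filter_network` and the example edges are hypothetical.

```python
from collections import defaultdict, deque


def filter_network(edges, min_links=3):
    """Keep nodes with >= min_links links, then the largest connected component."""
    # Total links per node: citations received plus references made.
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, d in degree.items() if d >= min_links}
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]

    # Undirected adjacency, used only to find connected components.
    neighbours = defaultdict(set)
    for s, t in kept_edges:
        neighbours[s].add(t)
        neighbours[t].add(s)

    seen, largest = set(), set()
    for start in keep:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(largest):
            largest = component

    final_edges = [(s, t) for s, t in kept_edges if s in largest and t in largest]
    return largest, final_edges


# Hypothetical example: nodes 1-4 cite each other densely, node 5 has a single
# link, and node 6 has three links but only to nodes that are filtered out.
example_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
                 (5, 1), (6, 7), (6, 8), (6, 9)]
main_nodes, main_edges = filter_network(example_edges)
print(sorted(main_nodes), len(main_edges))
```

Applying the threshold repeatedly until no node falls below it would give a different and possibly smaller network, so the single-pass reading is an assumption of this sketch.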
Our final network dataset contains 1,819 of the documents in our original query (~87% of the original retrieved records), and 4,831 documents not retrieved via our Web of Science search but cited by at least three of the retrieved documents. We then clustered this network by modularity maximization via the Leiden algorithm [2], detecting 14 clusters with Q=0.59. Citations to documents within the same cluster constitute ~77% of all citations in the network.
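The share of within-cluster citations can be recomputed from an edge list and a cluster assignment, as in the minimal sketch below. The function name `within_cluster_share` and the toy data are hypothetical. The clustering itself (the Leiden algorithm) is not reproduced here.

```python
def within_cluster_share(edges, cluster_of):
    """Fraction of citation links whose two endpoints share a cluster."""
    same = sum(1 for source, target in edges
               if cluster_of[source] == cluster_of[target])
    return same / len(edges)


# Hypothetical toy network: four documents in two clusters.
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
toy_clusters = {1: 0, 2: 0, 3: 1, 4: 1}
share = within_cluster_share(toy_edges, toy_clusters)
print(share)
```

With the published files, `cluster_of` would map each node ID to the value in the Cluster column of ‘CE_Node.csv’.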
Citation Network Dataset Description
We include two network datasets: (i) ‘CE_Node.csv’, which contains 1,819 retrieved documents and 4,831 unretrieved referenced documents, for a total of 6,650 documents (nodes); and (ii) ‘CE_Edge.csv’, which records the citations (edges) between the documents (nodes), for a total of 29,198 citation links. These files can be used to construct a network with many different tools, but we have formatted them for use in Gephi 0.10 [3].
‘CE_Node.csv’ is a comma-separated values file that contains two types of nodes:
i. Retrieved documents – these are documents captured by our query. These include full bibliographic metadata and reference lists.
ii. Non-retrieved documents – these are documents referenced by our retrieved documents but not retrieved via our query. They only have the data contained within their reference string (i.e. first author, journal or book title, year of publication, and possibly DOI).
The columns in the .csv refer to:
- Id, the node ID
- Label, the reference string of the document
- DOI, the DOI for the document, if available
- WOS_ID, WoS accession number
- Authors, named authors
- Title, title of document
- Document_type, variable indicating whether a document is an article, review, etc.
- Journal_book_title, journal of publication or title of book
- Publication year, year of publication
- WOS_times_cited, total Core Collection citations as of 29/11/2023
- Indegree, number of within network citations to a given document
- Cluster, provides the cluster membership number as discussed in the manuscript (Figure 1)
‘CE_Edge.csv’ is a comma-separated values file that contains edges (citation links) between nodes (documents) (n=29,198). The columns refer to:
- Source, node ID of the citing document
- Target, node ID of the cited document
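For readers who prefer a script to Gephi, the sketch below reads both files with Python's standard `csv` module and checks the Indegree column against the edge list. The in-memory strings are tiny hypothetical stand-ins that use only a subset of the columns described above. To use the real files, the `io.StringIO` objects can be replaced with `open("CE_Node.csv", newline="", encoding="utf-8")` and the equivalent for ‘CE_Edge.csv’, where the UTF-8 encoding is an assumption.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature versions of CE_Node.csv and CE_Edge.csv.
node_csv = io.StringIO(
    "Id,Label,Indegree,Cluster\n"
    "0,Doc A,0,1\n"
    "1,Doc B,2,1\n"
    "2,Doc C,1,2\n"
)
edge_csv = io.StringIO(
    "Source,Target\n"
    "0,1\n"
    "2,1\n"
    "0,2\n"
)


def load_network(node_file, edge_file):
    """Read the node table into a dict keyed by Id and the edges into a list."""
    nodes = {row["Id"]: row for row in csv.DictReader(node_file)}
    edges = [(row["Source"], row["Target"]) for row in csv.DictReader(edge_file)]
    return nodes, edges


nodes, edges = load_network(node_csv, edge_csv)

# Recompute within-network citations received and compare with the Indegree column.
indegree = Counter(target for _, target in edges)
mismatches = [node_id for node_id, row in nodes.items()
              if int(row["Indegree"]) != indegree[node_id]]
print(len(nodes), len(edges), mismatches)
```

Node IDs are kept as strings, exactly as read from the files, so that Source and Target values match the Id column without any conversion.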
Cluster Analysis
In the manuscript, we qualitatively analyse a set of publications from seven of the largest clusters.