Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data pertaining to the publication "Can journal guidelines improve the reporting of antibody validation?". The project investigates the quality of antibody validation information provided in 120 biomedical publications and whether the introduction of journal validation guidelines improved the quality of this information. The data cover 60 publications before the introduction of guidelines and 60 after, half of which are from journals with guidelines. The quality of antibody validation information was coded by one author ("Antibody validation information data set.xlsx"), with a sample checked for interrater reliability by another ("Interrater reliability data set.xlsx"). Effects of the introduction of journal guidelines were tested statistically with a pseudo-experimental design. (Code for the statistical package R is provided.) The data package also includes a detailed explanation of how coding was performed ("Coding protocol.docx") and an explanation of these files and data labels ("Data dictionary.docx").
Background: Physicians reading the medical literature attempt to determine whether research studies are valid. However, articles with negative results may not provide sufficient information to allow physicians to properly assess validity.
Methods: We analyzed all original research articles with negative results published in 1997 in the weekly journals BMJ, JAMA, Lancet, and New England Journal of Medicine, as well as those published in the 1997 and 1998 issues of the bimonthly Annals of Internal Medicine (N = 234). Our primary objective was to quantify the proportion of studies with negative results that comment on power and present confidence intervals. Secondary outcomes were to quantify the proportion of these studies with a specified effect size and a defined primary outcome. Stratified analyses by study design were also performed.
Results: Only 30% of the articles with negative results commented on power. The reporting of power (range: 15–52%) and confidence intervals (range: 55–81%) varied significantly among journals. Observational studies of etiology/risk factors addressed power less frequently (15%, 95% CI, 8–21%) than did clinical trials (56%, 95% CI, 46–67%, p < 0.001). While 87% of articles with power calculations specified an effect size the authors sought to detect, a minority gave a rationale for the effect size. Only half of the studies with negative results clearly defined a primary outcome.
Conclusion: Prominent medical journals often provide insufficient information to assess the validity of studies with negative results.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Validation data for the Astro scientific publication clustering benchmark dataset
This is the dataset used in the publication Donner, P. "Validation of the Astro dataset clustering solutions with external data", Scientometrics, DOI 10.1007/s11192-020-03780-3
Certain data included herein are derived from Clarivate Web of Science. © Copyright Clarivate 2020. All rights reserved.
Published with permission from Clarivate.
The original Astro dataset is not contained in this data. It can be obtained from http://topic-challenge.info/ and requires permission from Clarivate Analytics for use.
This dataset collection consists of four files. Each file contains an independent dataset that relates to the Astro dataset via Web of Science (WoS) record identifiers. These identifiers are called UTs. All files are tabular data in CSV format. In each, at least one column contains UT data. This should be used to link to the Astro dataset or other WoS data. The datasets are discussed in detail in the journal publication.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Journal policy on research data and code availability is an important part of the ongoing shift toward publishing reproducible computational science. This article extends the literature by studying journal data sharing policies by year (for both 2011 and 2012) for a referent set of 170 journals. We make a further contribution by evaluating code sharing policies, supplemental materials policies, and open access status for these 170 journals for each of 2011 and 2012. We build a predictive model of open data and code policy adoption as a function of impact factor and publisher, and find that higher-impact journals are more likely to have open data and code policies, and that journals published by scientific societies are more likely to have them than those from commercial publishers. We also find that open data policies tend to precede open code policies, and we find no relationship between open data and code policies and either supplemental materials policies or open access journal status. Of the journals in this study, 38% had a data policy, 22% had a code policy, and 66% had a supplemental materials policy as of June 2012. This reflects a striking one-year increase of 16% in the number of data policies, a 30% increase in code policies, and a 7% increase in the number of supplemental materials policies. We introduce a new dataset to the community that categorizes data and code sharing, supplemental materials, and open access policies in 2011 and 2012 for these 170 journals.
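As a rough illustration of the kind of predictive model described above (policy adoption as a function of impact factor and publisher), the following sketch fits a logistic regression with statsmodels. It is not the authors' code; the file name and column names ("has_data_policy", "impact_factor", "publisher_type") are hypothetical placeholders for whatever the released dataset actually uses.

```python
# Minimal sketch (not the authors' analysis): logistic regression of open data
# policy adoption on impact factor and publisher type.
import pandas as pd
import statsmodels.formula.api as smf

journals = pd.read_csv("journal_policies_2012.csv")  # hypothetical file name

model = smf.logit(
    "has_data_policy ~ impact_factor + C(publisher_type)",  # hypothetical columns
    data=journals,
).fit()
print(model.summary())
```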
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of ResQu Index scores on selected articles by two authors.
In this study, we present a systematic review of these interdisciplinarity measures and explore their inherent relations. We examine these measures in relation to the Web of Science journal subject categories.
The dataset consists of two Excel files, "Interdisciplinarity of 224 WoS SCs" and "Research areas".
The dataset was originally published in DiVA and moved to SND in 2024.
CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
scicite is a dataset of labeled scholarly citations extracted from scientific articles in fields such as computer science, biomedicine, and ecology. Each entry is laid out in columns including the citation string, section name, label, isKeyCitation, label2, and more, along with the citation source and the start/end indices of the citation in the text, so pertinent information about each citation can be located quickly.
This dataset consists of three CSV files, each containing different elements related to scholarly citations gathered from scientific articles: train.csv, test.csv and validation.csv. These can be used in a variety of ways in order to gain insight into the research process and improve its accuracy and efficiency.
Extracting useful information from citations: The labels attached to each citation section can help in extracting specific information about the sources cited or any other data included for research purposes. Additionally, isKeyCitation indicates whether the cited source is a key citation that researchers or practitioners could examine in greater detail.
Identifying relationships between citations: scicite's sectionName column identifies the part of the paper (for example, the introduction or abstract) in which a citation appears, which helps reveal potential relationships between those sections and the references they contain, and thus the connections scholars have made in previous work.
Improving accuracy in data gathering: With the string, citeStart, and citeEnd columns available along with source labels, one can easily identify whether certain references are repeated multiple times, and double-check accuracy through the start/end values associated with them.
Validation purposes: Finally, the dataset can also be used to validate documents written by scholars for peer review, where similar sections found in earlier, unrelated documents can serve as reference points that should match, signalling correctness on the original author's part.
- Developing a search engine to quickly find citations relevant to specific topics and research areas.
- Creating algorithms that can predict key citations and streamline the research process by automatically including only the most important references in a paper.
- Designing AI systems that can accurately classify, analyze and summarize different scholarly works based on the citation frequency, source type & label assigned to them
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:------------|
| string | The string of text associated with the citation. (String) |
| sectionName | The name of the section the citation is found in. (String) |
| label | The label associated with the citation. (String) |
| isKeyCitation | A boolean value indicating whether the citation is a key citation. (Boolean) |
| label2 | The second label associated with the citation. (String) |
| citeEnd | The end index of the citation in the text. (Integer) |
| citeStart | The start index of the citation in the text. (Integer) |
| source | The source of the citation. (String) |

...
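A minimal loading sketch for the columns listed above, assuming the CSV has been downloaded locally and that citeStart/citeEnd index into the string column (an assumption, not stated in the table):

```python
# Sketch: read the validation split and inspect the columns described above.
import pandas as pd

df = pd.read_csv("validation.csv")  # path to wherever the file was downloaded

# Key citations only, grouped by the section they appear in.
key = df[df["isKeyCitation"] == True]
print(key.groupby("sectionName")["label"].value_counts())

# Recover the cited span via the start/end indices (assumed to index `string`).
row = df.iloc[0]
if pd.notna(row["citeStart"]) and pd.notna(row["citeEnd"]):
    print(row["string"][int(row["citeStart"]):int(row["citeEnd"])])
```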
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed publication for this dataset has been presented in the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset.
This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.
The claims have been obtained from online fact-checking sources, existing datasets, and research challenges. The dataset combines sources with different foci, enabling a comprehensive approach that spans different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims), and applications (information retrieval, veracity evaluation).
The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).
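For illustration only, the sketch below shows the BM25 candidate-retrieval stage of such a de-duplication step using the rank_bm25 package; the actual pipeline described above additionally rescored candidates with MonoT5 and BERTScore and applied version-specific similarity thresholds, which are not reproduced here. The toy claims are invented.

```python
# Sketch (not the authors' pipeline): BM25 retrieval of near-duplicate claim
# candidates; MonoT5 re-ranking and thresholding are omitted.
from rank_bm25 import BM25Okapi

claims = [
    "Vitamin C cures COVID-19.",
    "Vitamin C can cure COVID-19 infections.",
    "Masks reduce the spread of respiratory viruses.",
]
tokenized = [c.lower().split() for c in claims]
bm25 = BM25Okapi(tokenized)

for i, query in enumerate(tokenized):
    scores = bm25.get_scores(query)
    # Rank the other claims as duplicate candidates for claim i; in the real
    # pipeline these would be rescored with MonoT5 and dropped above a
    # version-specific similarity threshold.
    ranked = sorted(((s, j) for j, s in enumerate(scores) if j != i), reverse=True)
    best_score, best_j = ranked[0]
    print(f"{claims[i]!r} -> closest: {claims[best_j]!r} (BM25 {best_score:.2f})")
```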
The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using spaCy.
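A small sketch of this kind of entity tagging: spaCy's English transformer pipeline (en_core_web_trf) wraps a RoBERTa-base model trained on OntoNotes 5.0, which matches the description above, though the exact model and version used for the dataset may differ. The example sentence is invented.

```python
# Sketch: named entity detection with spaCy's RoBERTa-based pipeline.
# Requires: pip install spacy[transformers] && python -m spacy download en_core_web_trf
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("The WHO said on Tuesday that Spain had reported 10,000 new cases.")

for ent in doc.ents:
    # spaCy uses OntoNotes labels such as ORG, GPE, DATE, CARDINAL.
    print(ent.text, ent.label_)
```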
The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
TREC Health Misinformation track https://trec-health-misinfo.github.io/
TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).
The entries in the dataset contain the following information (a minimal loading sketch in Python follows this list):
Claim. Text of the claim.
Claim label. The labels are: True and False.
Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
Original information source. Information about which general information source was used to obtain the claim.
Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
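A minimal sketch of loading the entries described above, assuming they are distributed as a CSV; the file name and column names used here are hypothetical placeholders, not the actual field names of the release.

```python
# Sketch: load the claims and inspect label and type distributions.
import pandas as pd

claims = pd.read_csv("covid_claims_large.csv")  # hypothetical file name

# Label distribution; the LARGE version should give roughly 3,333 True / 1,810 False.
print(claims["claim_label"].value_counts())  # hypothetical column name

# Claims flagged with numerical content, excluding multimodal ones.
subset = claims[
    claims["claim_type"].str.contains("Numerical", na=False)
    & ~claims["claim_type"].str.contains("Multimodal", na=False)
]
print(len(subset))
```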
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).
References
Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y.. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022 https://arxiv.org/abs/2205.02596
Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. "Is this document relevant? ... Probably": A survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.
Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.
Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. Trec-covid: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code belonging to the manuscript:
Tracking transformative agreements through open metadata: method and validation using Dutch Research Council NWO funded papers
Abstract
Transformative agreements have become an important strategy in the transition to open access, with almost 1,200 such agreements registered by 2025. Despite their prevalence, these agreements suffer from important transparency limitations, most notably regarding article-level metadata indicating which articles are covered by these agreements. Typically, this data is available to libraries but not openly shared, making it difficult to study the impact of these agreements. In this paper, we present a novel, open, replicable method for analyzing transformative agreements using open metadata, specifically the Journal Checker tool provided by cOAlition S and OpenAlex. To demonstrate its potential, we apply our approach to a subset of publications funded by the Dutch Research Council (NWO) and its health research counterpart ZonMw. In addition, the results of this open method are compared with the actual publisher data reported to the Dutch university library consortium UKB. This validation shows that the open method accurately identified 89% of the publications covered by transformative agreements, while the 11% false positives shed an interesting light on the limitations of this method. In the absence of hard, openly available article-level data on transformative agreements, we provide researchers and institutions with a powerful tool to critically track and evaluate the impact of these agreements.
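As a rough sketch of the open-metadata lookup this abstract describes, the snippet below fetches a work from the OpenAlex API by DOI and inspects its open-access metadata. It is not the authors' code: the DOI is a placeholder, the cOAlition S Journal Checker Tool step is omitted, and the specific fields inspected are illustrative rather than the ones the paper relies on.

```python
# Sketch: look up a publication in OpenAlex and read open-access metadata.
import requests

doi = "10.1371/journal.pone.0000000"  # placeholder DOI
resp = requests.get(f"https://api.openalex.org/works/https://doi.org/{doi}", timeout=30)

if resp.ok:
    work = resp.json()
    print(work.get("open_access", {}).get("oa_status"))
    loc = work.get("primary_location") or {}
    print((loc.get("source") or {}).get("display_name"))
```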
This dataset contains the following files:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Modeling count and binary data collected in hierarchical designs has increased the use of Generalized Linear Mixed Models (GLMMs) in medicine. This article presents a systematic review of the application and quality of results and information reported from GLMMs in the field of clinical medicine.
Methods: A search using the Web of Science database was performed for original articles published in medical journals from 2000 to 2012. The search strategy included the topics "generalized linear mixed models", "hierarchical generalized linear models", and "multilevel generalized linear model", and the research domain was refined to science technology. Papers reporting methodological considerations without application, and those not involving clinical medicine or not written in English, were excluded.
Results: A total of 443 articles were detected, with an increase over time in the number of articles. In total, 108 articles fit the inclusion criteria. Of these, 54.6% were declared to be longitudinal studies, whereas 58.3% and 26.9% were defined as repeated measurements and multilevel designs, respectively. Twenty-two articles belonged to environmental and occupational public health, 10 articles to clinical neurology, 8 to oncology, and 7 to infectious diseases and pediatrics. The distribution of the response variable was reported in 88% of the articles, predominantly Binomial (n = 64) or Poisson (n = 22). Much of the useful information about the GLMMs was not reported. Variance estimates of random effects were described in only 8 articles (9.2%). The model validation, the method of covariate selection, and the method of goodness of fit were reported in only 8.0%, 36.8% and 14.9% of the articles, respectively.
Conclusions: During recent years, the use of GLMMs in the medical literature has increased to take into account the correlation of data when modeling qualitative data or counts. According to the current recommendations, the quality of reporting has room for improvement regarding the characteristics of the analysis, estimation method, validation, and selection of the model.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The article describes the results of an online survey on open science (OS) carried out among researchers affiliated with Spanish universities and research centres, focused on open access to scientific publications, the publication process, the management of research data, and open review of articles. The main objective was to identify researchers' perceptions and habits with regard to practices closely linked to open science; the added scientific value is that it offers an in-depth picture of researchers as one of the main actors on whom the transformation to and implementation of open science will fall. It focuses on the different aspects of OS (open access, open data, the publication process and open review) in order to identify habits and perceptions and thus make implementation of the OS movement possible. The survey was carried out among researchers who had published in the years 2020–2021, according to data obtained from WoS. It was emailed to a total of 8,188 researchers and obtained 666 responses, of which 554 were complete, the rest being forms with some questions unanswered. The main results showed that open access still requires the diffusion of practices and of the services provided by the institution, as well as training (by the library or an equivalent service) and institutional support from the competent authorities (vice-rectors or equivalent) in specific aspects such as data management. In the case of data, around 50% of respondents stated they had stored data in a repository; of all the options, the most frequently chosen was an institutional repository, followed by a disciplinary repository. Among the main reasons given were transparency, visibility of the data, and the ability to validate results. For those who stated they had never stored data, the most frequent reasons were privacy and confidentiality, the lack of a mandated data policy, or not knowing how to do it. In terms of open peer review, participants mentioned a certain reticence toward opening up evaluations, due to potential conflicts of interest that may arise or because lower-quality content might be accepted in order to avoid conflicts. In addition, the hierarchical relationship between senior and junior researchers might affect reviews. The main conclusions indicate that persuasion about OA is still needed; that APCs are an economic barrier rather than the main criterion for journal selection; that OPR practices may seem innovative and emerging; that scientific and evaluation policies seem to have a clear effect on researchers' behaviour; and that researchers share research data more because they have been persuaded to than out of obligation. Researchers do question the pathways and difficulties that may arise on a day-to-day basis and seem aware that change is under way, in which academic evaluation, policies related to open science, its implementation, and researchers' habits may all change. In this sense, more and better support is needed from institutions and faculty support services.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Frequency of publication and open publication over the 2020–2021 period.
International, curated, digital repository that makes the data underlying scientific publications discoverable, freely reusable, and citable, particularly data for which no specialized repository exists. It provides the infrastructure for, and promotes the re-use of, data underlying the scholarly literature, and is governed by a nonprofit membership organization. Membership is open to any stakeholder organization, including but not limited to journals, scientific societies, publishers, research institutions, libraries, and funding organizations. Most data are associated with peer-reviewed articles, although data associated with non-peer-reviewed publications from reputable academic sources, such as dissertations, are also accepted. The repository is used to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies. The UC system is a member organization of the Dryad general-subject data repository.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NR: not reported; MCMC: Markov chain Monte Carlo; GEE: generalized estimating equation; DIC: deviance information criterion; AIC: Akaike information criterion; BIC: Bayesian information criterion; df: degrees of freedom. Characteristics of the specification, validation and construction of the model for the reviewed articles.
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Health Promotion and Chronic Disease Prevention in Canada: Research, Policy and Practice (the HPCDP Journal) is the monthly, online scientific journal of the Health Promotion and Chronic Disease Prevention Branch of the Public Health Agency of Canada. The journal publishes articles on disease prevention, health promotion and health equity in the areas of chronic diseases, injuries and life course health. Content includes research from fields such as public/community health, epidemiology, biostatistics, the behavioural and social sciences, and health services or economics.
Abstract: Research data, research results, and publications of the PhD thesis entitled 'Validation Framework for RDF-based Constraint Languages', submitted to the Department of Economics and Management at the Karlsruhe Institute of Technology (KIT).

Technical remarks:

PhD Thesis
Title: Validation Framework for RDF-based Constraint Languages
Author: Thomas Hartmann
Examination Date: 08.07.2016
University: Karlsruhe Institute of Technology (KIT)
Chair: Institute of Applied Informatics and Formal Description Methods
Department: Department of Economics and Management
1. Advisor: Prof. Dr. York Sure-Vetter, Karlsruhe Institute of Technology
2. Advisor: Prof. Dr. Kai Eckert, Stuttgart Media University

PhD thesis download: http://dx.doi.org/10.5445/IR/1000056458
Publications (complete set): publications
Research data, research results, and publications (KIT research data repository): http://dx.doi.org/10.5445/BWDD/11
RDF validation requirements database: http://purl.org/net/rdf-validation
Validation environment demo: http://purl.org/net/rdfval-demo; source code: software/rdf-validator

Chapter 2: Foundations for RDF Validation
XML validation: chapter/chapter-2/xml-validation

Chapter 3: Vocabularies for Representing Research Data and Related Metadata
RDF vocabularies commonly used to represent different types of research data and related metadata: chapter/chapter-3/common-vocabularies
Complete running example in RDF: chapter/chapter-3/common-vocabularies/running-example

Chapter 4: RDFication of XML Enabling to Use RDF Validation Technologies
Evaluation results: chapter/chapter-4/evaluation

Chapter 6: Consistent Validation across RDF-based Constraint Languages
Constraint languages implementations: chapter/chapter-6/constraint-languages-implementations

Chapter 7: Validation Framework for RDF-based Constraint Languages
Formal specification, HTML documentation, and UML class diagram of the RDF Constraints Vocabulary (RDF-CV): chapter/chapter-7/rdf-constraints-vocabulary
Generic SPIN mappings for constraint types: chapter/chapter-7/generic-SPIN-mappings/RDF-CV-2-SPIN.ttl

Chapter 8: The Role of Reasoning for RDF Validation
Implementations for all constraint types expressible by OWL 2 QL, OWL 2 DL, and DSP, as well as for major constraint types representable by ReSh and ShEx: chapter/chapter-8/constraint-types-implementations
Implementation of reasoning capabilities for all reasoning constraint types for which OWL 2 QL and OWL 2 DL reasoning may be performed: chapter/chapter-8/reasoning-constraint-types-implementations/OWL2-Reasoning-2-SPIN.ttl
Validation and reasoning implementations of constraint types: chapter/chapter-8/constraint-types-implementations

Chapter 9: Evaluating the Usability of Constraint Types for Assessing RDF Data Quality
Implementations of all 115 constraints: chapter/chapter-9/constraints
Evaluation results for each QB data set grouped by SPARQL endpoint: chapter/chapter-9/evaluation/data-sets/QB
Vocabulary implementations: chapter/chapter-9/vocabularies/implementations

Appendix: http://dx.doi.org/10.5445/IR/1000054062
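As a toy illustration of the general idea behind SPARQL-based RDF validation (the approach underlying the SPIN mappings listed above), the sketch below expresses a simple cardinality constraint as a SPARQL query whose results are the violating resources. It is not taken from the thesis materials; the data and constraint are invented.

```python
# Sketch: check "every foaf:Person has exactly one foaf:name" with rdflib.
from rdflib import Graph

data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .
ex:alice a foaf:Person ; foaf:name "Alice" .
ex:bob   a foaf:Person .
"""
g = Graph()
g.parse(data=data, format="turtle")

violations = g.query("""
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person (COUNT(?name) AS ?names)
WHERE {
  ?person a foaf:Person .
  OPTIONAL { ?person foaf:name ?name }
}
GROUP BY ?person
HAVING (COUNT(?name) != 1)
""")
for person, names in violations:
    print(f"constraint violated by {person} ({names} names)")
```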
This deposit includes the data used for the research presented in 'Validation of SMAP L2 passive-only soil moisture products using in situ measurements collected in Twente, The Netherlands', to be submitted (d.d. September 5th, 2019) to the open access scientific journal Hydrology and Earth System Sciences (HESS). The data consist of:
• In situ soil moisture and temperature measured from January 2015 till December 2018 at a soil depth of 5 cm by a network of twenty stations in the Twente region. The network of soil moisture stations is operated by the University of Twente.
• Root zone soil moisture simulations for the Netherlands at 250 m resolution performed by the LHM (Landelijk Hydrologisch Model) maintained by Deltares.
• NASA SMAP L2 passive-only soil moisture estimates (version R164020) for three reference pixels with different coverages of the Twente network.
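A minimal sketch of the kind of comparison such a validation involves, matching SMAP overpasses to in situ observations and computing bias, RMSE, and unbiased RMSE. It is not the authors' analysis: file names, column names, and the 30-minute matching window are hypothetical.

```python
# Sketch: compare satellite and in situ volumetric soil moisture (m3/m3).
import numpy as np
import pandas as pd

insitu = pd.read_csv("twente_insitu_5cm.csv", index_col=0, parse_dates=True)["sm"]
smap = pd.read_csv("smap_l2p_pixel.csv", index_col=0, parse_dates=True)["sm"]

# Match each SMAP estimate to the nearest in situ observation within 30 minutes.
pairs = pd.merge_asof(
    smap.sort_index().to_frame("smap"),
    insitu.sort_index().to_frame("insitu"),
    left_index=True, right_index=True,
    direction="nearest", tolerance=pd.Timedelta("30min"),
).dropna()

bias = (pairs["smap"] - pairs["insitu"]).mean()
rmse = np.sqrt(((pairs["smap"] - pairs["insitu"]) ** 2).mean())
ubrmse = np.sqrt(rmse**2 - bias**2)  # unbiased RMSE, a common SMAP validation metric
print(f"bias={bias:.3f}  RMSE={rmse:.3f}  ubRMSE={ubrmse:.3f}")
```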
The Dryad Digital Repository is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad provides a general-purpose home for a wide diversity of data types and welcomes data submissions related to published or accepted scholarly publications. Dryad's objectives are to serve as a repository for tables, spreadsheets, and all other kinds of data that do not have another discipline-specific repository, and to enable scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, perform synthetic studies, and utilize data for educational purposes. Dryad is governed by a nonprofit membership organization. Membership is open to any stakeholder organization, including but not limited to journals, scientific societies, publishers, research institutions, libraries, and funding organizations. Publishers are encouraged to facilitate data archiving by coordinating the submission of manuscripts with submission of data to Dryad. Dryad originated from an initiative among a group of leading journals and scientific societies in evolutionary biology and ecology to adopt a joint data archiving policy (JDAP) for their publications, and from the recognition that easy-to-use, sustainable, community-governed data infrastructure was needed to support such a policy.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on what papers and concepts researchers find relevant to map domain specific research output to the 17 Sustainable Development Goals (SDGs).
Sustainable Development Goals are the 17 global challenges set by the United Nations. Within each of the goals, specific targets and indicators are defined to monitor progress toward reaching those goals by 2030. In an effort to capture how research is contributing to moving the needle on those challenges, we earlier created an initial classification model that enables quick identification of which research output is related to which SDG. (This Aurora SDG dashboard is the initial outcome as proof of practice.)
In order to validate our current classification model (on soundness/precision and completeness/recall), and to receive input for improvement, a survey was conducted to capture expert knowledge from senior researchers in their research domains related to the SDGs. The survey was open to the world, but mainly distributed to researchers from the Aurora Universities Network. It was open from October 2019 till January 2020 and captured data from 244 respondents in Europe and North America.
Seventeen surveys were created from a single template, with the content made specific to each SDG. The content of each survey, such as a random set of publications, was ingested from a data provisioning server that had collected research output metadata for each SDG in an earlier stage. It took a respondent on average 1 hour to complete the survey. The survey data can be used for validating the current SDG classification model and optimizing future models for mapping research output to the SDGs.
The survey contains the following questions (see inside dataset for exact wording):
In the dataset root you'll find the following folders and files:
In /04-processed-data/, each SDG sub-folder contains the following files (a short loading sketch follows the list):
- SDG-survey-questions.doc: This file contains the survey questions.
- SDG-survey-respondents-per-sdg.csv: Basic information about the survey and responses.
- SDG-survey-city-heatmap.csv: Origin of the respondents per SDG survey.
- SDG-survey-suggested-publications.txt: Formatted list of research papers researchers have uploaded or listed that they want to see in the result set for this SDG.
- SDG-survey-suggested-publications-with-eid-match.csv: Same as above, only matched with an EID. EIDs are matched by Elsevier's internal fuzzy matching algorithm. Only papers matched with high confidence are shown with an EID, referring to a record in Scopus.
- SDG-survey-selected-publications-accepted.csv: Based on our previous result set of papers, researchers were presented with random samples and selected the papers they believe represent this SDG (TRUE = accepted).
- SDG-survey-selected-publications-rejected.csv: Based on our previous result set of papers, researchers were presented with random samples and selected the papers they believe do not represent this SDG (FALSE = rejected).
- SDG-survey-selected-keywords.csv: Based on our previous result set of papers, we presented researchers with the keywords in the metadata of those papers, and they selected the keywords they believe represent this SDG.
- SDG-survey-unselected-keywords.csv: As with "selected-keywords", this is the list of keywords that respondents have not selected to represent this SDG.
- SDG-survey-suggested-keywords.csv: List of keywords researchers suggest using to find papers related to this SDG.
- SDG-survey-glossaries.csv: List of glossaries, containing keywords, that researchers suggest using to find papers related to this SDG.
- SDG-survey-selected-journals.csv: Based on our previous result set of papers, we presented researchers with the journals in the metadata of those papers, and they selected the journals they believe represent this SDG.
- SDG-survey-unselected-journals.csv: As with "selected-journals", this is the list of journals that respondents have not selected to represent this SDG.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The open data screening datasets contain both Open Data statements automatically detected (TRUE) by ODDPub and their manual validation using the Numbat extraction tool. Furthermore, extraction forms for both screenings – 2020 and 2021 – are included. The manually processed dataset used to calculate the inter-rater reliability of the manual validation can also be found here.
(i) Data from articles published in 2020 (file ‘charite_open_data_2020.csv’) were collected using a slightly different sequence of questions in the extraction workflow than for articles published in 2021 (file ‘charite_open_data_2021.csv’). Both datasets were cleaned of any personal data or internal comments; thus, they do not contain the default columns which, in the raw export from Numbat, contained commentaries regarding different questions. In another regard, too, these files do not represent raw outputs of the Numbat extraction tool but a processed version: articles validated by more than two raters were first reconciled in Numbat, resulting in one final decision (output of extractions after reconciliation). Then, from the output of extractions before reconciliation, the articles validated by only one rater (and thus not part of the inter-rater reliability calculation) were selected and joined with the already reconciled dataset.
The actual decision about the openness of a validated dataset can be analysed in various ways:
The original extraction form contains an option ‘unsure_open_data’ besides ‘open_data’/’no_open_data’; this option was resolved either during reconciliation between multiple raters or, in case of doubt, by case-related consultation with a second rater, and is not included here.
(ii) The inter-rater reliability calculation was made on 100 randomly selected articles for 2 raters. A third rater screened a 20-article sample, which is part of the 100-article sample. The tables provided here include both article-level and dataset-level data.
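For illustration, an inter-rater agreement of this kind can be summarised with Cohen's kappa; the sketch below is not the study's own computation, and the file and column names are hypothetical placeholders.

```python
# Sketch: Cohen's kappa between two raters' open-data decisions on a shared sample.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

irr = pd.read_csv("interrater_sample_2021.csv")  # hypothetical file name
kappa = cohen_kappa_score(irr["rater1_open_data"], irr["rater2_open_data"])
print(f"Cohen's kappa = {kappa:.2f}")
```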
(iii) The Numbat extraction forms used for the screenings in 2020 and 2021 are included in two formats: JSON and Markdown.
(iv) The ‘data_dictionary_open_data.csv’ table documents all variables of each data file contained here.