United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and as a baseline for future studies of ag research data.

Purpose
As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
- establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS, and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
- compare how much data is in institutional vs. domain-specific vs. federal platforms
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

Approach
The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analyzed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

Search methods
We first compiled a list of known domain-specific USDA / ARS datasets and databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories.
Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines; repositories containing a mix of data, journals, books, and other types of records were tested to determine whether they could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled by combining search results from ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals, based on the 2012 and 2016 review of where USDA employees publish their research, ranked by number of articles, and include 2015/2016 Impact Factor, author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

Evaluation
We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any search term that comprised at least 1% and at least 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results (see the sketch after the file list below). We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

Results
A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large and contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned for at least one ag search term and had that result comprise at least 5% of the total collection.
- Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

See the included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:
Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
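The repository evaluation described above reduces to a few simple counts and ratios per repository. The sketch below shows that bookkeeping in Python; the term list and the count values are placeholders for illustration only, not the actual search results recorded in general_repos_1.csv.

```python
# Hypothetical sketch of the repository-evaluation bookkeeping described above.
# The search counts would come from running each ag search term against a
# repository's own search interface; the numbers here are placeholders.

def evaluate_repository(total_datasets: int, term_counts: dict[str, int]) -> dict:
    """Flag search terms against the 1% / 5% share and >100 / >500 count thresholds."""
    report = {}
    for term, hits in term_counts.items():
        share = hits / total_datasets if total_datasets else 0.0
        report[term] = {
            "hits": hits,
            "share_of_collection": round(share, 4),
            "at_least_1_percent": share >= 0.01,
            "at_least_5_percent": share >= 0.05,
            "over_100_results": hits > 100,
            "over_500_results": hits > 500,
        }
    return report

# Placeholder usage: counts are illustrative only.
print(evaluate_repository(50_000, {"agriculture": 2_700, "crop": 650, "soil": 90}))
```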
The NSF Public Access Repository contains an initial collection of journal publications: the final accepted version of the peer-reviewed manuscript or the version of record. To do this, NSF draws upon services provided by the publisher community, including the Clearinghouse of Open Research for the United States, CrossRef, and the International Standard Serial Number (ISSN). When you click a Digital Object Identifier (DOI) link, you will be taken to an external site maintained by the publisher. Some full-text articles may not be available without a charge during the embargo (administrative interval). Some links on this page may take you to non-federal websites, whose policies may differ from those of this website.
Overview
This directory was developed to provide discovery information for anyone looking for publicly accessible repositories that house geological materials in the U.S. and Canada. In addition, this resource is intended to be a tool to facilitate a community of practice. The need for the directory was identified during planning for and follow-up from a drill core repository webinar series in Spring 2020 for public repository curators and staff in the U.S. and Canada, hosted by the Minnesota Geological Survey and the Minnesota Department of Natural Resources. Additional supporting sponsors included the U.S. Geological Survey National Geological and Geophysical Data Preservation Program and the Association of American State Geologists Data Preservation Committee. The 10-part webinar series provided overviews of state, provincial, territorial, and national repositories that house drill core, other geoscience materials, and data. When the series concluded, a small working group of the participants continued to meet to facilitate the development and production of a directory of repositories that maintain publicly accessible geological materials throughout the U.S. and Canada. The group used previous directory efforts, described in the next section, Summary of Historical Repository Directory Compilation Efforts, as guides for content during development. The working group prepared and compiled responses from a call for repository information and characterization. This directory is planned to be a living resource for the geoscience community, with updates every other year to accommodate changes. The updates will be facilitated through versioned updates of this data release.

Summary of Historical Repository Directory Compilation Efforts
1957 – Sample and Core Repositories of the United States, Alaska, and Canada. Published by AAPG, Committee on Preservation of Samples and Cores; 13 members from industry, academia, and government.
1977 – Well-Sample and Core Repositories of the United States and Canada, C.K. Fisher; M.P. Krupa, USGS Open-File Report 77-567. USGS wanted to update the original index. Includes a map showing core repositories by “State”, “University”, “Commercial”, and “Federal”. Also includes a “Brief Statement of Requirements for the Preservation of Subsurface Material and Data” and referral to state regulations for details on preserved materials.
1984 – Nonprofit Sample and Core Repositories Open to the Public in the United States – USGS Circular 942. James Schmoker, Thomas Michalski, Patricia Worl. The survey was conducted by a questionnaire mailed to repository curators. Information on additions, corrections, and deletions to the earlier (1957, 1977) directories from state geologists, each state office of the Water Resources Division of the U.S. Geological Survey, additional government agencies, and colleagues was also used.
1997 – The National Directory of Geoscience Data Repositories, edited by Nicholas H. Claudy – American Geological Institute. To prepare the directory, questionnaires were mailed to state geologists, more than 60 geological societies, private-sector data centers selected from oil and gas directories, and to the membership committee of the American Association of Petroleum Geologists, one of AGI's member societies. The directory contains 124 repository listings, organized alphabetically by state.
2002 – National Research Council, 2002. Geoscience Data and Collections: National Resources in Peril.
Washington, D.C.: The National Academies Press.
2005 – The National Geological and Geophysical Data Preservation Program (NGGDPP) of the United States Geological Survey (USGS) was established by the Energy Policy Act of 2005, and reauthorized in the Consolidated Appropriations Act, 2021, “to preserve and expose the Nation’s geoscience collections (samples, logs, maps, data) to promote their discovery and use for research and resource development”. The Program provides “technical and financial assistance to state geological surveys and U.S. Department of the Interior (DOI) bureaus” to archive “geological, geophysical, and engineering data, maps, photographs, samples, and other physical specimens”. Metadata records describing the preserved assets are cataloged in the National Digital Catalog (NDC).

References
American Association of Petroleum Geologists, 1957, Sample and core repositories of the United States, Alaska, and Canada: American Association of Petroleum Geologists, Committee on Preservation of Samples and Cores, 29 p.
American Association of Petroleum Geologists, 2018, US Geological Sample and Data Repositories: American Association of Petroleum Geologists, Preservation of Geoscience Data Committee, unpublished. (Contact: AAPG Preservation of Geoscience Data Committee)
American Geological Institute, 1997, National Geoscience Data Repository System, Phase II. Final report, January 30, 1995--January 28, 1997. United States. https://doi.org/10.2172/598388
American Geological Institute, 1997, National Directory of Geoscience Data Repositories, Claudy, N. H. (ed.), 91 p.
Claudy, N., and Stevens, D., 1997, AGI publishes first edition of national directory of geoscience data repositories: American Geological Institute Spotlight, https://www.agiweb.org/news/datarep2.html
Consolidated Appropriations Act, 2021 (Public Law 116-260, Sec. 7002)
Davidson, E. D., Jr., 1981, A look at core and sample libraries: Bureau of Economic Geology, The University of Texas at Austin, 4 p. and Appendix.
Deep Carbon Observatory (DCO) Data Portal, Scientific Collections, https://info.deepcarbon.net/vivo/scientific-collections; Keyword Search: sample repository, https://info.deepcarbon.net/vivo/scientific-collections?source=%7B%22query%22%3A%7B%22query_string%22%3A%7B%22query%22%3A%22sample%20repository%20%22%2C%22default_operator%22%3A%22OR%22%7D%7D%2C%22sort%22%3A%5B%7B%22_score%22%3A%7B%22order%22%3A%22asc%22%7D%7D%5D%2C%22from%22%3A0%2C%22size%22%3A200%7D: Accessed September 29, 2021.
Fisher, C. K., and Krupa, M. P., 1977, Well-sample and core repositories of the United States and Canada: U.S. Geological Survey Open-File Report 77-567, 73 p. https://doi.org/10.3133/ofr77567
Fogwill, W. D., 1985, Drill Core Collection and Storage Systems in Canada: Manitoba Energy & Mines. https://www.ngsc-cptgs.com/files/PGJSpecialReport_1985_V03b.pdf
Goff, S., and Heiken, G., eds., 1982, Workshop on core and sample curation for the National Continental Scientific Drilling Program: Los Alamos National Laboratory, May 5-6, 1981, LA-9308-C, 31 p. https://www.osti.gov/servlets/purl/5235532
Lonsdale, J. T., 1953, On the preservation of well samples and cores: Oklahoma City Geological Society Shale Shaker, v. 3, no. 7, p. 4.
National Geological and Geophysical Data Preservation Program. https://www.usgs.gov/core-science-systems/national-geological-and-geophysical-data-preservation-program
National Research Council, 2002, Geoscience Data and Collections: National Resources in Peril. Washington, DC: The National Academies Press, 107 p. https://doi.org/10.17226/10348
Pow, J. R., 1969, Core and sample storage in western Canada: Bulletin of Canadian Petroleum Geology, v. 17, no. 4, p. 362-369. https://doi.org/10.35767/gscpgbull.17.4.362
Ramdeen, S., 2015, Preservation challenges for geological data at state geological surveys: GeoResJ, v. 6, p. 213-220. https://doi.org/10.1016/j.grj.2015.04.002
Schmoker, J. W., Michalski, T. C., and Worl, P. B., 1984, Nonprofit sample and core repositories of the United States: U.S. Geological Survey Circular 942. https://doi.org/10.3133/cir942
Schmoker, J. W., Michalski, T. C., and Worl, P. B., 1984, Addresses, telephone numbers, and brief descriptions of publicly available, nonprofit sample and core repositories of the United States: U.S. Geological Survey Open-File Report 84-333, 13 p. (Superseded by USGS Circular 942.) https://doi.org/10.3133/ofr84333
The Energy Policy Act of 2005 (Public Law 109-58, Sec. 351)
The National Digital Catalog (NDC). https://www.usgs.gov/core-science-systems/national-geological-and-geophysical-data-preservation-program/national-digital
U.S. Bureau of Mines, 1978, CORES Operations Manual: Bureau of Mines Core Repository System: U.S. Bureau of Mines Information Circular IC 8784, 118 p. https://digital.library.unt.edu/ark:/67531/metadc170848/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains article metadata and information about Open Science Indicators for approximately 139,000 research articles published in PLOS journals from 1 January 2018 to 30 March 2025, and a set of approximately 28,000 comparator articles published in non-PLOS journals. This is the tenth release of this dataset, which will be updated with new versions on an annual basis.

This version of the Open Science Indicators dataset shares the indicators seen in the previous versions as well as fully operationalised protocols and study registration indicators, which were previously only shared in preliminary forms. The v10 dataset focuses on detection of five Open Science practices by analysing the XML of published research articles:
- Sharing of research data, in particular data shared in data repositories
- Sharing of code
- Posting of preprints
- Sharing of protocols
- Sharing of study registrations
The dataset provides data and code generation and sharing rates, and the location of shared data and code (whether in Supporting Information or in an online repository). It also provides preprint, protocol and study registration sharing rates, as well as details of the shared output, such as publication date, URL/DOI/Registration Identifier and platform used. Additional data fields are also provided for each article analysed. This release has been run using an updated preprint detection method (see OSI-Methods-Statement_v10_Jul25.pdf for details). Further information on the methods used to collect and analyse the data can be found in Documentation. Further information on the principles and requirements for developing Open Science Indicators is available in https://doi.org/10.6084/m9.figshare.21640889.

Data folders/files
Data Files folder
This folder contains the main OSI dataset files, PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv, which contain:
- descriptive metadata, e.g. article title, publication date, author countries, taken from the article .xml files
- additional information around the Open Science Indicators derived algorithmically
The OSI-Summary-statistics_v10_Jul25.xlsx file contains the summary data for both PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv.

Documentation folder
This folder contains documentation related to the main data files. The file OSI-Methods-Statement_v10_Jul25.pdf describes the methods underlying the data collection and analysis. OSI-Column-Descriptions_v10_Jul25.pdf describes the fields used in PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv. OSI-Repository-List_v1_Dec22.xlsx lists the repositories, and their characteristics, used to identify specific repositories in the PLOS-Dataset_v10_Jul25.csv and Comparator-Dataset_v10_Jul25.csv repository fields. The folder also contains documentation originally shared alongside the preliminary versions of the protocols and study registration indicators in order to give fuller details of their detection methods.

Contact details for further information:
Iain Hrynaszkiewicz, Director, Open Research Solutions, PLOS, ihrynaszkiewicz@plos.org / plos@plos.org
Lauren Cadwallader, Open Research Manager, PLOS, lcadwallader@plos.org / plos@plos.org

Acknowledgements:
Thanks to Allegra Pearce, Tim Vines, Asura Enkhbayar, Scott Kerr and Parth Sarin of DataSeer for contributing to data acquisition and supporting information.
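As an illustration of how a release like this can be analysed, the pandas sketch below loads the main CSV and computes a simple data-sharing rate by year. The column names used here are placeholders assumed for illustration; the authoritative field names are defined in OSI-Column-Descriptions_v10_Jul25.pdf and should be checked before running anything like this.

```python
# Hypothetical sketch: compute a simple data-sharing rate from the OSI release.
# Column names are placeholders; consult OSI-Column-Descriptions_v10_Jul25.pdf
# for the actual field names.
import pandas as pd

osi = pd.read_csv("PLOS-Dataset_v10_Jul25.csv", low_memory=False)

# Assumed placeholder columns: a publication date and a yes/no data-sharing flag.
osi["pub_year"] = pd.to_datetime(osi["publication_date"], errors="coerce").dt.year
shared = osi["data_shared_in_repository"].astype(str).str.lower().eq("yes")

rate_by_year = shared.groupby(osi["pub_year"]).mean().round(3)
print(rate_by_year)
```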
This file collection is part of the ORD Landscape and Cost Analysis Project (DOI: 10.5281/zenodo.2643460), a study jointly commissioned by the SNSF and swissuniversities in 2018.
Please cite this data collection as: von der Heyde, M. (2019). Data from the International Open Data Repository Survey. Retrieved from https://doi.org/10.5281/zenodo.2643493
Further information is given in the corresponding data paper: von der Heyde, M. (2019). International Open Data Repository Survey: Description of collection, collected data, and analysis methods [Data paper]. Retrieved from https://doi.org/10.5281/zenodo.2643450
Contact
Swiss National Science Foundation (SNSF)
Open Research Data Group
E-mail: ord@snf.ch
swissuniversities
Program "Scientific Information"
Gabi Schneider
E-mail: isci@swissuniversities.ch
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides guidance materials and templates to help you prepare your research datasets for deposit in the U of G Research Data Repositories. Please refer to the U of G Research Data Repositories LibGuide for detailed information about the U of G Research Data Repositories, including additional resources for preparing datasets for deposit. The library offers a self-deposit with curation service. The deposit workflow is as follows:
1. Create your repository account. If you are a first-time depositor, complete the U of G Research Data Repositories Dataset Deposit Intake Form. Activate your Data Repositories account by logging in with your U of G central login account. Once your account is created, contact us to set up your dataset creator access to your home department’s collection in the Data Repositories. Note: If you already have a Data Repositories account and dataset creator access, you can log in and begin a new deposit to your home department’s collection right away.
2. Prepare your dataset. Assemble your dataset following the Dataset Deposit Guidelines. Use the README file template to capture data documentation.
3. Create a draft dataset record. Log in to the Data Repositories and create a draft dataset record following the instructions in the Dataset Submission Guide. Submit your draft dataset for review.
4. Dataset review. Data Repositories staff will review (also referred to as curate) your dataset for alignment with the Dataset Deposit Guidelines using a standard curation workflow. The curator will collaborate with you to enhance the dataset.
5. Public release. Once ready, the dataset curator will make the dataset publicly available in the Data Repositories, with appropriate file access controls.
Support: If you have any questions about preparing and depositing your dataset, please make a Publishing and Author Support Request.
The journals’ author guidelines and/or editorial policies were examined to determine whether they take a stance on the availability of the data underlying submitted articles. The mere stated possibility of providing supplementary material along with the submitted article was not considered a research data policy in the present study. Furthermore, source code and algorithms were excluded from the scope of the paper, and thus policies related to them are not included in the analysis of the present article.
For the selection of journals within the field of neurosciences, Clarivate Analytics’ InCites Journal Citation Reports database was searched using the categories of neurosciences and neuroimaging. From the results, the journals with the 40 highest Impact Factor indicators (for the year 2017) were extracted for scrutiny of research data policies. Respectively, the selection of journals within the field of physics was created by performing a similar search with the categories of physics, applied; physics, atomic, molecular & chemical; physics, condensed matter; physics, fluids & plasmas; physics, mathematical; physics, multidisciplinary; physics, nuclear; and physics, particles & fields. From the results, the journals with the 40 highest Impact Factor indicators were again extracted for scrutiny. Similarly, the 40 journals representing the field of operations research were extracted by using the search category of operations research and management.
Journal-specific data policies were sought from journal websites providing journal-specific author guidelines or editorial policies. Within the present study, the examination of journal data policies was done in May 2019. The primary data source was journal-specific author guidelines. If journal guidelines explicitly linked to the publisher’s general policy with regard to research data, these were used in the analyses of the present article. If a journal-specific research data policy, or lack thereof, was inconsistent with the publisher’s general policies, the journal-specific policies and guidelines were prioritized and used in the present article’s data. If a journal’s author guidelines were not openly available online due to, e.g., accepting submissions on an invite-only basis, the journal was not included in the data of the present article. Journals that exclusively publish review articles were also excluded and replaced with the journal having the next highest Impact Factor indicator, so that each set representing the three fields of science consisted of 40 journals. The final data thus consisted of 120 journals in total.
‘Public deposition’ refers to a scenario where the researcher deposits data in a public repository and thus hands administration of the data to the receiving repository. ‘Scientific sharing’ refers to a scenario where the researcher administers his or her data locally and provides it to interested readers on request. Note that none of the journals examined in the present article required that all data types underlying a submitted work be deposited in a public data repository. However, some journals required public deposition of data of specific types. Within the journal research data policies examined in the present article, these data types are well represented by the Springer Nature policy on “Availability of data, materials, code and protocols” (Springer Nature, 2018), that is: DNA and RNA data; protein sequences and DNA and RNA sequencing data; genetic polymorphisms data; linked phenotype and genotype data; gene expression microarray data; proteomics data; macromolecular structures; and crystallographic data for small molecules. Furthermore, the registration of clinical trials in a public repository was also considered a data type in this study. The term ‘specific data types’ used in the custom coding framework of the present study thus refers to both life sciences data and public registration of clinical trials. These data types have community-endorsed public repositories where deposition was most often mandated within the journals’ research data policies.
The term ‘location’ refers to whether the journal’s data policy provides suggestions or requirements for the repositories or services used to share the data underlying submitted works. A mere general reference to ‘public repositories’ was not considered a location suggestion; only references to individual repositories and services were. The category of ‘immediate release of data’ examines whether the journal’s research data policy addresses the timing of publication of the data underlying submitted works. Note that even though a journal may only encourage public deposition of the data, the editorial processes could be set up so that they lead to publication of either the research data or its metadata in conjunction with publication of the submitted work.
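To make the coding framework concrete, the snippet below sketches one hypothetical way a single journal's policy could be recorded using the categories discussed above (public deposition vs. scientific sharing, specific data types, location suggestions, immediate release). The field names and the example values are illustrative only, not the study's actual coding sheet.

```python
# Hypothetical coding record for one journal's research data policy.
# Field names and example values are illustrative; they mirror the categories
# described in the text, not the study's actual coding sheet.
from dataclasses import dataclass, field

@dataclass
class JournalDataPolicy:
    journal: str
    discipline: str                       # "neurosciences", "physics", "operations research"
    has_data_policy: bool                 # any stance beyond allowing supplementary material
    public_deposition: str                # "required", "encouraged", or "not addressed"
    scientific_sharing: str               # "required", "encouraged", or "not addressed"
    specific_data_types: list[str] = field(default_factory=list)   # e.g. "gene expression microarray data"
    location_suggestions: list[str] = field(default_factory=list)  # named repositories/services only
    immediate_release_addressed: bool = False

example = JournalDataPolicy(
    journal="Example Journal of Neuroscience",   # hypothetical journal
    discipline="neurosciences",
    has_data_policy=True,
    public_deposition="encouraged",
    scientific_sharing="required",
    specific_data_types=["clinical trial registration"],
    location_suggestions=["ClinicalTrials.gov"],
    immediate_release_addressed=False,
)
print(example)
```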
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research data sharing has become an expected component of scientific research and scholarly publishing practice over the last few decades, due in part to requirements for federally funded research. As part of a larger effort to better understand the workflows and costs of public access to research data, this project conducted a high-level analysis of where academic research data is most frequently shared. To do this, we leveraged the DataCite and Crossref application programming interfaces (APIs) in search of Publisher field elements demonstrating which data repositories were utilized by researchers from six academic research institutions between 2012 and 2022. In addition, we also ran a preliminary analysis of the quality of the metadata associated with these published datasets, comparing the extent to which information was missing from metadata fields deemed important for public access to research data. Results show that the top 10 publishers accounted for 89.0% to 99.8% of the datasets connected with the institutions in our study. Known data repositories, including institutional data repositories hosted by those institutions, were initially missing from our sample due to varying metadata standards and practices. We conclude that the metadata quality landscape for published research datasets is uneven; key information, such as author affiliation, is often incomplete or missing from source data repositories and aggregators. To enhance the findability, accessibility, interoperability, and reusability (FAIRness) of research data, we provide a set of concrete recommendations that repositories and data authors can take to improve scholarly metadata associated with shared datasets.
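A minimal sketch of the kind of API query described above is shown below, assuming the public DataCite REST API. The affiliation string, the query field, and the attributes inspected are placeholders; the actual project may have used different query parameters and additional Crossref calls.

```python
# Minimal sketch: query the public DataCite REST API for dataset records whose
# creator affiliation mentions a given institution, then inspect the publisher
# field that indicates the repository of record. Illustrative only.
import requests

def datacite_datasets(affiliation: str, size: int = 100):
    """Return dataset records whose creator affiliation mentions `affiliation`."""
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={
            "query": f'creators.affiliation.name:"{affiliation}"',
            "resource-type-id": "dataset",
            "page[size]": size,
        },
        timeout=30,
    )
    resp.raise_for_status()
    rows = []
    for record in resp.json().get("data", []):
        attrs = record.get("attributes", {})
        rows.append({
            "doi": attrs.get("doi"),
            "publisher": attrs.get("publisher"),   # the repository/publisher of record
            "creators": len(attrs.get("creators", [])),
        })
    return rows

for row in datacite_datasets("University of Michigan"):  # placeholder institution
    print(row["publisher"], row["doi"])
```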
A handy decision tree to help you identify the right kind of repository for your research data, using just three questions to guide you through this process. This guide also shares top-level guidance on metadata, versioning, and software, and suggests resources for further reading on the subject.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file collection is part of the ORD Landscape and Cost Analysis Project (DOI: 10.5281/zenodo.2643460), a study jointly commissioned by the SNSF and swissuniversities in 2018.
Please cite this data collection as: von der Heyde, M. (2019). Data and tools of the landscape and cost analysis of data repositories currently used by the Swiss research community. Retrieved from https://doi.org/10.5281/zenodo.2643495
Connected data papers are: von der Heyde, M. (2019). Open Data Landscape: Repository Usage of the Swiss Research Community: Description of collection, collected data, and analysis methods [Data paper]. Retrieved from https://doi.org/10.5281/zenodo.2643430 von der Heyde, M. (2019). International Open Data Repository Survey: Description of collection, collected data, and analysis methods [Data paper]. Retrieved from https://doi.org/10.5281/zenodo.2643450
Connected data sets are: von der Heyde, M. (2019). Data from the Swiss Open Data Repository Landscape survey. Retrieved from https://doi.org/10.5281/zenodo.2643487 von der Heyde, M. (2019). Data from the International Open Data Repository Survey. Retrieved from https://doi.org/10.5281/zenodo.2643493
Contact
Swiss National Science Foundation (SNSF)
Open Research Data Group
E-mail: ord@snf.ch
swissuniversities
Program "Scientific Information"
Gabi Schneider
E-Mail: isci@swissuniversities.ch
Image Data Resource (IDR) is an online, public data repository that seeks to store, integrate and serve image datasets from published scientific studies. We have collected and are continuing to receive existing and newly created “reference image” datasets that are valuable resources for a broad community of users, either because they will be frequently accessed and cited or because they can serve as a basis for re-analysis and the development of new computational tools.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS, and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We conducted an analysis to confirm our observation that only a very small percentage of public research data is hosted in institutional data repositories, while the vast majority is published in open domain-specific and generalist data repositories.
For this analysis, we selected 11 institutions, many of which have been our evaluation partners. For each institution, we counted the number of datasets published in its Institutional Data Repository (IDR) and tracked the number of public research datasets hosted in external data repositories via the Data Monitor API. External tracking was based on a corpus of more than 14 million data records checked against the institutional SciVal ID. One institution did not have an IDR.
We found that 10 out of 11 institutions had most of their public research data hosted outside of their institution, where by research data we mean not only datasets but a broader notion that includes, for example, software.
We will be happy to expand it by adding more institutions upon request.
Note: This is version 2 of the earlier published dataset. The number of datasets published and tracked in the Monash Institutional Data Repository has been updated based on the information provided by the Monash Library. The number of datasets in the NTU Institutional Data Repository now includes datasets only. Dataverses were excluded to avoid double counting.
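Once the two counts per institution are known, the comparison described above is simple bookkeeping. The sketch below shows one hypothetical way to compute the external share; the institution names and counts are placeholders rather than the study's figures, and the Data Monitor API call that would supply the external counts is not reproduced here.

```python
# Hypothetical sketch of the IDR-vs-external comparison described above.
# Counts are placeholders; in the study the external counts came from the
# Data Monitor API and the IDR counts from each institution's repository.

counts = {
    # institution: (datasets_in_idr, datasets_in_external_repositories)
    "Institution A": (120, 2400),
    "Institution B": (45, 900),
    "Institution C": (0, 310),   # e.g. an institution without an IDR
}

for name, (idr, external) in counts.items():
    total = idr + external
    external_share = external / total if total else 0.0
    mostly_external = external_share > 0.5
    print(f"{name}: {external_share:.1%} external (mostly external: {mostly_external})")
```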
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations th...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Comparative review of open access data repositories collected to inform product development for the Dataverse Project at the Harvard Institute for Quantitative Social Science. More information about the scope, purpose and development of this review is at https://dataverse.org/blog/comparative-review-various-data-repositories.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of the proteomics datasets used in this study.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:
- metadata.zip: The dataset metadata and analysis results as CSV files.
- scripts-and-logs.zip: Scripts and logs of the dataset creation.
- LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
- README.md: This document.
- redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

Metadata
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

repositories.csv:
- ID (integer): GitHub repository ID
- url (string): GitHub repository URL
- downloaded (boolean): Whether cloning the repository succeeded
- name (string): Repository name
- description (string): Repository description
- licenses (string, list of strings): Repository licenses
- redistributable (boolean): Whether the repository's licenses permit redistribution
- created (string, date & time): Time of the repository's creation
- updated (string, date & time): Time of the last update to the repository
- pushed (string, date & time): Time of the last push to the repository
- fork (boolean): Whether the repository is a fork
- forks (integer): Number of forks
- archive (boolean): Whether the repository is archived
- programs (string, list of strings): Project file path of each IaC program in the repository

programs.csv:
- ID (string): Project file path of the IaC program
- repository (integer): GitHub repository ID of the repository containing the IaC program
- directory (string): Path of the directory containing the IaC program's project file
- solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
- language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
- name (string): IaC program name
- description (string): IaC program description
- runtime (string): Runtime string of the IaC program
- testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- tests (string, list of strings): File paths of IaC program's tests

testing-files.csv:
- file (string): Testing file path
- language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
- techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
- program (string): Project file path of the testing file's IaC program

Dataset Creation
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

Searching Repositories
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. GitHub access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).

Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

Limitations
The script uses the GitHub code search API and inherits its limitations:
- Only forks with more stars than the parent repository are included.
- Only the repositories' default branches are considered.
- Only files smaller than 384 KB are searchable.
- Only repositories with fewer than 500,000 files are considered.
- Only repositories that have had activity or have been returned in search results in the last year are considered.
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

Downloading Repositories
download-repositories.py downloads all repositories in the CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
1. Names of the repositories CSV files generated through search-repositories.py, separated by commas.
2. Output directory to download the repositories to.
3. Name of the CSV output file.

The script only downloads a shallow recursive copy of the HEAD of each repository, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
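For orientation, the snippet below sketches the kind of GitHub code-search query such a script might issue for Pulumi project files. It is a simplified, assumption-based illustration rather than the actual search-repositories.py, and it omits the pagination, rate-limit handling, and retry logic the real script needs.

```python
# Simplified illustration of a GitHub code search for Pulumi project files.
# This is NOT the dataset's search-repositories.py; it omits pagination,
# rate-limit handling, and the retries mentioned in the description.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # a personal access token is required for code search

query = "filename:Pulumi.yaml size:0..384000"  # project-file qualifier, size in bytes (approx. 384 KB limit)
resp = requests.get(
    "https://api.github.com/search/code",
    headers={
        "Authorization": f"token {GITHUB_TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    params={"q": query, "per_page": 100},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json().get("items", []):
    repo = item["repository"]
    print(repo["id"], repo["full_name"], item["path"])
```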
http://data.europa.eu/eli/dec/2011/833/oj
Open access (OA) can be defined as the practice of providing on-line access to scientific information that is free of charge to the user and that is re-usable. A distinction is usually made between OA to scientific peer-reviewed publications and OA to research data. In Horizon 2020, open access to peer-reviewed scientific publications (primarily articles) is mandatory; however, researchers can choose the open access route most appropriate to them.
For open access publishing (gold open access), researchers can publish in open access journals, or in journals that sell subscriptions and also offer the possibility of making individual articles openly accessible (hybrid journals). In that case, publishers often charge an article processing charge (APC). These costs are eligible for reimbursement during the duration of the Horizon 2020 grant. For APCs incurred after the end of the grant agreement, a mechanism for reimbursing some of these costs is being piloted and implemented through the OpenAIRE project. Note that in case of gold open access publishing, a copy must also be deposited in an open access repository.
For self-archiving (green open access), researchers deposit the final peer-reviewed manuscript in a repository of their choice. In this case, they must ensure open access to the publication within six months of publication (12 months in case of the social sciences and humanities).
This page provides an overview of the state of play as regards the uptake of open access to scientific publications in Horizon 2020 from 2014 to 2017, updating information from 2016.
Two datasets have been used for the analysis presented in this note: one dataset from the EU funded OpenAIRE project for FP7 and H2020 and one dataset from CORDA for H2020, which also provides supplementary information on article processing charges and embargo periods. The datasets are from September and August 2017 respectively.
The OpenAIRE sample includes primarily peer-reviewed scientific articles but also some other forms of publications such as conference papers, book chapters and reports or pre-prints. It is based on information obtained from Open Access repositories, pre-print servers, OA journals and project reports and contains some underreporting since OpenAIRE has difficulties tracking hybrid publications and publications in repositories which are not OpenAIRE compliant. The CORDA sample contains only peer-reviewed scientific articles and is based on project self-reporting. The figures in this note measure open access in a broad sense and not the compliance with the specifics of article 29.2. of the Model Grant Agreement.
The 2017 analysis of open access during the entirety of Horizon 2020 so far shows an overall open access rate of 63.2% from OpenAIRE data (+2.4% compared with the sample from 2016). Internal project reporting through SYGMA shows a total of 80.6% open access for Horizon 2020 scientific peer-reviewed articles and 75% for all peer-reviewed publications (including also conference proceedings, book chapters, monographs and the like); however, since this data is based on beneficiary self-reporting, it may contain some over-reporting.
According to the OpenAIRE sample 75% of publications are green open access and 25% gold open access. Internal figures are similar although they show a slightly higher amount of gold OA with a split of 70% green and 30% gold.
For gold OA, internal project reporting suggests that an average of 1500 € is spent per article (median: 1200 €), an increase from the average of 1006 € in the previous sample. A more detailed analysis reveals that 27% of articles have a price tag of between 1000 and 1999 €. It is also important to note that 26% of all publications are in gold OA but without any APC charges. Very high APCs of 4000 € or more concern only a tiny fraction of Horizon 2020 publications (3%).
The average embargo period of green OA publications is 10 months, a decrease of 1 month from the 2016 sample. 40% of articles have an embargo period of 11-12 months, followed by 575 articles (33%) with no embargo period at all. 302 articles (17%) have an embargo period of 12.1-24 months and 162 articles (9%) of 0.1 to 6 months. Finally, 12 articles (1%) have an embargo period longer than 36 months.
This 2017 analysis thus broadly confirms the earlier findings from summer 2016, but is based on a larger and more robust sample. In the 2017 sample, overall open access rates have gone up in all the datasets and cohorts. The distribution between gold and green open access remains similar to the 2016 dataset; for gold OA, average APCs have increased, while for green OA, embargo periods have slightly decreased.
Please consult the background note for a more detailed analysis. Note also that these files only refer to open access to publications. Information on open access to research data is made available on the open data portal on a diffe
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Jira is an issue tracking system that supports software companies (among other types of companies) with managing their projects, community, and processes. This dataset is a collection of public Jira repositories downloaded from the internet using the Jira API V2. We collected data from 16 public Jira repositories containing 1822 projects and 2.7 million issues. Included in this data are historical records of 32 million changes, 8 million comments, and 1 million issue links that connect the issues in complex ways. This artefact repository contains the data as a MongoDB dump, the scripts used to download the data, the scripts used to interpret the data, and qualitative work conducted to make the data more approachable.
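As a starting point for working with the MongoDB dump, the sketch below shows how one might count issues per project once the dump has been restored with mongorestore. The database name, collection name, and field layout used here are assumptions for illustration; the actual schema is documented in the artefact repository and should be checked first.

```python
# Hypothetical sketch: count issues per project after restoring the MongoDB
# dump with mongorestore. Database, collection, and field names are
# assumptions -- consult the artefact repository's documentation for the
# actual schema before running anything like this.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["jira_repos"]          # assumed database name
issues = db["issues"]              # assumed collection of issue documents

# Assumed field layout: each issue document has something like
# {"fields": {"project": {"key": "ABC"}, ...}, ...}
pipeline = [
    {"$group": {"_id": "$fields.project.key", "issue_count": {"$sum": 1}}},
    {"$sort": {"issue_count": -1}},
    {"$limit": 10},
]
for row in issues.aggregate(pipeline):
    print(row["_id"], row["issue_count"])
```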
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 20 publishers of datasets and software code by affiliation.