100+ datasets found
  1. Data from: Inventory of online public databases and repositories holding agricultural data in 2017

    • catalog.data.gov
    • s.cnmilf.com
    • + 2 more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve both as a current landscape analysis and as a baseline for future studies of ag research data.

    Purpose

    As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:
    • establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals;
    • compare how much data is in institutional vs. domain-specific vs. federal platforms;
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data;
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data.

    Approach

    The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods

    We first compiled a list of known domain-specific USDA / ARS datasets and databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of "agricultural data" / "ag data" / "scientific data" + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories, using variations of "agriculture", "ag data" and "university" to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

    Evaluation

    We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results (a small sketch of this threshold screening follows the resource list at the end of this entry). We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

    Results

    A summary of the major findings from our data review:
    • Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
    • There are few general repositories that are both large and contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
    • Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher, regardless of funding or affiliation.

    See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:
    • Resource Title: Journals. File Name: Journals.csv
    • Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
    • Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
    • Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
    • Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
    • Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
    • Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
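    As a rough illustration of the threshold screening described under Evaluation above, the following Python sketch applies the 1%/5% share and >100/>500 result cut-offs; the repository names and counts are placeholders, not values from this inventory.

    ```python
    # Hypothetical counts; the inventory files contain the real values.
    repositories = {
        "ExampleRepoA": {"total_records": 50_000, "term_hits": {"agriculture": 2_700, "crop": 450}},
        "ExampleRepoB": {"total_records": 1_200, "term_hits": {"agriculture": 15, "soil": 70}},
    }

    for name, info in repositories.items():
        total = info["total_records"]
        for term, hits in info["term_hits"].items():
            share = hits / total
            flags = []
            if share >= 0.05:
                flags.append(">=5% of collection")
            elif share >= 0.01:
                flags.append(">=1% of collection")
            if hits > 500:
                flags.append(">500 results")
            elif hits > 100:
                flags.append(">100 results")
            print(f"{name} / '{term}': {hits} hits ({share:.1%});", "; ".join(flags) or "below thresholds")
    ```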

  2. Data from: A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics

    • zenodo.org
    • data.niaid.nih.gov
    • + 1 more
    zip
    Updated May 13, 2022
    Cite
    Laura Isigkeit; Apirat Chaikuad; Daniel Merk (2022). A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics [Dataset]. http://doi.org/10.5281/zenodo.6320761
    Dataset updated
    May 13, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Laura Isigkeit; Apirat Chaikuad; Daniel Merk
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Information

    The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1,144,803 compounds with 10,915,362 bioactivities on 5,613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence obtained by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, making the dataset easy to use generically in multiple applications such as chemogenomics and data-driven drug design.

    The consensus dataset provides increased target coverage and contains a higher number of molecules compared to the source databases which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks with flags for divergent data from different sources may help data selection and further accurate curation.

    Structure and content of the dataset

    Dataset structure

    Columns: ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C | Mean PC | Mean B | Mean I | Mean PD | Activity check annotation | Ligand names | Canonical SMILES (one column per source database) | Structure check | Source

    The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.

    Except for the canonical SMILES columns, all columns are of datatype 'string'. The datatype of the canonical SMILES columns is the SMILES format. We recommend the File Reader node for using the dataset in KNIME: with this node the data types of the columns can be adjusted exactly, and only this node can read the compressed format.
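    Outside of KNIME, the CSV export can also be loaded directly; here is a minimal pandas sketch, reading every column as a string to mirror the 'string' datatype noted above (the file name is a placeholder for the downloaded CSV):

    ```python
    import pandas as pd

    # Placeholder file name; use the CSV (or compressed CSV) downloaded from this record.
    df = pd.read_csv("consensus_compound_bioactivity.csv", dtype=str, low_memory=False)

    # The canonical SMILES columns stay as plain strings here; a cheminformatics
    # toolkit such as RDKit could parse them into molecule objects if needed.
    print(df.shape)
    print(df.columns.tolist()[:10])
    ```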

    Column content:

    • ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
    • Target: biological target of the molecule expressed as the HGNC gene symbol
    • Activity type: for example, pIC50
    • Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
    • Unit: unit of bioactivity measurement
    • Mean columns of the databases: mean of bioactivity values, or activity comments, denoted with the frequency of their occurrence in the source database; e.g., Mean C = 7.5 (15) means that this value for the compound-target pair occurs 15 times in the ChEMBL database (see the parsing sketch after this list)
    • Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
      • no comment: bioactivity values are within one log unit;
      • check activity data: bioactivity values are not within one log unit;
      • only one data point: only one value was available, no comparison and no range calculated;
      • no activity value: no precise numeric activity value was available;
      • no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
    • Ligand names: all unique names contained in the five source databases are listed
    • Canonical SMILES columns: Molecular structure of the compound from each database
    • Structure check: denotes matching or differing compound structures across the source databases
      • match: molecule structures are the same between different sources;
      • no match: the structures differ;
      • 1 source: no structure comparison is possible, because the molecule comes from only one source database.
    • Source: the source databases from which the data come
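    As referenced in the Mean columns item above, here is a small sketch of how the "value (count)" convention could be unpacked; the exact cell formatting is an assumption based on the Mean C = 7.5 (15) example, not a documented specification.

    ```python
    import re

    # Assumed cell format "value (count)"; activity comments without a numeric
    # value are returned as text together with their count.
    MEAN_PATTERN = re.compile(r"^\s*(?P<value>.+?)\s*\((?P<count>\d+)\)\s*$")

    def parse_mean_cell(cell):
        match = MEAN_PATTERN.match(cell)
        if not match:
            return cell, None              # no frequency annotation present
        value_text, count = match.group("value"), int(match.group("count"))
        try:
            return float(value_text), count    # numeric bioactivity mean
        except ValueError:
            return value_text, count           # activity comment such as "inactive"

    print(parse_mean_cell("7.5 (15)"))       # -> (7.5, 15)
    print(parse_mean_cell("inactive (3)"))   # -> ('inactive', 3)
    ```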

  3. Data from: Analysis of Commercial and Public Bioactivity Databases

    • acs.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Pekka Tiikkainen; Lutz Franke (2023). Analysis of Commercial and Public Bioactivity Databases [Dataset]. http://doi.org/10.1021/ci2003126.s003
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Pekka Tiikkainen; Lutz Franke
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Activity data for small molecules are invaluable in chemoinformatics. Various bioactivity databases exist containing detailed information of target proteins and quantitative binding data for small molecules extracted from journals and patents. In the current work, we have merged several public and commercial bioactivity databases into one bioactivity metabase. The molecular presentation, target information, and activity data of the vendor databases were standardized. The main motivation of the work was to create a single relational database which allows fast and simple data retrieval by in-house scientists. Second, we wanted to know the amount of overlap between databases by commercial and public vendors to see whether the former contain data complementing the latter. Third, we quantified the degree of inconsistency between data sources by comparing data points derived from the same scientific article cited by more than one vendor. We found that each data source contains unique data which is due to different scientific articles cited by the vendors. When comparing data derived from the same article we found that inconsistencies between the vendors are common. In conclusion, using databases of different vendors is still useful since the data overlap is not complete. It should be noted that this can be partially explained by the inconsistencies and errors in the source data.

  4. Retrospective survival analysis - differences between two versions of a dataset

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 1, 2017
    Cite
    Felix, Francisco H C (2017). Retrospective survival analysis - differences between two versions of a dataset [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001818583
    Dataset updated
    May 1, 2017
    Authors
    Felix, Francisco H C
    Description

    Calculations shown here use data from patients diagnosed with DIPG between 2000 and 2013, with follow-up until 2014. The follow-up time of the patients in this database is a bit longer than in the original data used to design the VALKYRIE project, hence there are some numerical differences. This post illustrates how to present clinical research data to an audience in a transparent and fully reproducible way. By including individual patient data (de-identified) as well as the script used to perform the statistical analysis, it is an example of the possibilities of the open lab notebook and open science paradigm. When the prospective trial data are collected, they will likewise be published in the same format, becoming permanently available for analysis and criticism by interested third parties. I discussed the inspiration for this approach in a personal blog post. (excerpt)
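    The analysis script itself is part of the dataset and is not reproduced here; purely as a generic illustration of a retrospective survival analysis of this kind, here is a minimal Kaplan-Meier sketch using the lifelines library, with a placeholder file name and hypothetical column names:

    ```python
    import pandas as pd
    from lifelines import KaplanMeierFitter

    # Hypothetical schema; the de-identified patient file in this dataset defines its own columns.
    cohort = pd.read_csv("dipg_cohort.csv")

    kmf = KaplanMeierFitter()
    kmf.fit(durations=cohort["followup_months"], event_observed=cohort["death_observed"])

    print("Median survival:", kmf.median_survival_time_)
    kmf.plot_survival_function()   # Kaplan-Meier curve for the whole cohort
    ```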

  5. Data from: Database for Forensic Anthropology in the United States, 1962-1991

    • catalog.data.gov
    • datasets.ai
    • + 1 more
    Updated Nov 14, 2025
    Cite
    National Institute of Justice (2025). Database for Forensic Anthropology in the United States, 1962-1991 [Dataset]. https://catalog.data.gov/dataset/database-for-forensic-anthropology-in-the-united-states-1962-1991-486d3
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Area covered
    United States
    Description

    This project was undertaken to establish a computerized skeletal database composed of recent forensic cases to represent the present ethnic diversity and demographic structure of the United States population. The intent was to accumulate a forensic skeletal sample large and diverse enough to reflect different socioeconomic groups of the general population from different geographical regions of the country, in order to enable researchers to revise the standards being used for forensic skeletal identification. The database is composed of eight data files, comprising four categories.

    The primary "biographical" or "identification" files (Part 1, Demographic Data, and Part 2, Geographic and Death Data) comprise the first category of information and pertain to the positive identification of each of the 1,514 data records in the database. Information in Part 1 includes sex, ethnic group affiliation, birth date, age at death, height (living and cadaver), and weight (living and cadaver). Variables in Part 2 pertain to the nature of the remains, means and sources of identification, city and state/country born, occupation, date missing/last seen, date of discovery, date of death, time since death, cause of death, manner of death, deposit/exposure of body, area found, city, county, and state/country found, handedness, and blood type.

    The Medical History File (Part 3) represents the second category of information and contains data on the documented medical history of the individual. Variables in Part 3 include general comments on medical history as well as comments on congenital malformations, dental notes, bone lesions, perimortem trauma, and other comments.

    The third category consists of an inventory file (Part 4, Skeletal Inventory Data) in which data pertaining to the specific contents of the database are maintained. This includes the inventory of skeletal material by element and side (left and right), indicating the condition of the bone as either partial or complete. The variables in Part 4 provide a skeletal inventory of the cranium, mandible, dentition, and postcranial elements and identify each element as complete, fragmentary, or absent. If absent, four categories record why it is missing.

    The last part of the database is composed of three skeletal data files, covering quantitative observations of age-related changes in the skeleton (Part 5), cranial measurements (Part 6), and postcranial measurements (Part 7). Variables in Part 5 provide assessments of epiphyseal closure and cranial suture closure (left and right), rib end changes (left and right), Todd pubic symphysis, Suchey-Brooks pubic symphysis, McKern & Stewart Phases I, II, and III, Gilbert & McKern Phases I, II, and III, auricular surface, and dorsal pubic pitting (all for left and right). Variables in Part 6 include cranial measurements (length, breadth, height) and mandibular measurements (height, thickness, diameter, breadth, length, and angle) of various skeletal elements. Part 7 provides postcranial measurements (length, diameter, breadth, circumference, and left and right, where appropriate) of the clavicle, scapula, humerus, radius, ulna, sacrum, innominate, femur, tibia, fibula, and calcaneus. A small file of noted problems for a few cases is also included (Part 8).
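    The part files are distributed with their own codebook, so the sketch below is only an illustration of combining the "biographical" files with the skeletal inventory; the file names and the shared case identifier are assumptions, not the actual variable names.

    ```python
    import pandas as pd

    # Hypothetical exports of Part 1 (Demographic Data) and Part 4 (Skeletal Inventory Data).
    demographics = pd.read_csv("part1_demographic.csv")
    inventory = pd.read_csv("part4_skeletal_inventory.csv")

    # Assumed shared identifier linking the eight part files.
    cases = demographics.merge(inventory, on="case_id", how="inner")
    print(len(cases), "records with both demographic and skeletal inventory data")
    ```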

  6. Trade & Environment Database - Dataset - CE data hub

    • datahub.digicirc.eu
    Updated Jan 25, 2022
    Cite
    (2022). Trade & Environment Database - Dataset - CE data hub [Dataset]. https://datahub.digicirc.eu/dataset/trade-environment-database
    Dataset updated
    Jan 25, 2022
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    The TREND database identifies nearly 300 different types of environmental provisions in 730 trade agreements. Area covered: the nations of the world. The TRade and ENvironment Database (TREND) is a unique dataset which tracks more than 300 different environmental provisions, relying on the full texts of about 630 preferential trade agreements (PTAs) signed since 1945. Besides the main text, annexes, protocols, side agreements, and side letters have been included as integral parts of the PTA.

  7. Dataset used for "A Recommender System of Buggy App Checkers for App Store Moderators"

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jun 28, 2021
    Cite
    Maria Gomez; Romain Rouvoy; Martin Monperrus; Lionel Seinturier (2021). Dataset used for "A Recommender System of Buggy App Checkers for App Store Moderators" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5034291
    Dataset updated
    Jun 28, 2021
    Dataset provided by
    University of Lille / Inria
    Authors
    Maria Gomez; Romain Rouvoy; Martin Monperrus; Lionel Seinturier
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for the paper "A Recommender System of Buggy App Checkers for App Store Moderators", published at the International Conference on Mobile Software Engineering and Systems (MOBILESoft) in 2015.

    Dataset Collection

    We built a dataset that consists of a random sample of Android app metadata and user reviews available on the Google Play Store in January and March 2014. Since the Google Play Store is continuously evolving (adding, removing and/or updating apps), we updated the dataset twice. Dataset D1 contains the apps available in the Google Play Store in January 2014. We then created a new snapshot (D2) of the Google Play Store in March 2014.

    The apps belong to the 27 different categories defined by Google (at the time of writing the paper) and the 4 predefined subcategories (free, paid, new_free, and new_paid). For each category-subcategory pair (e.g. tools-free, tools-paid, sports-new_free, etc.), we collected a maximum of 500 samples, resulting in a median number of 1,978 apps per category.

    For each app, we retrieved the following metadata: name, package, creator, version code, version name, number of downloads, size, upload date, star rating, star counting, and the set of permission requests.

    In addition, for each app, we collected up to a maximum of the latest 500 reviews posted by users in the Google Play Store. For each review, we retrieved its metadata: title, description, device, and version of the app. None of these fields were mandatory, thus several reviews lack some of these details. From all the reviews attached to an app, we only considered the reviews associated with the latest version of the app, i.e., we discarded unversioned and old-versioned reviews, resulting in a corpus of 1,402,717 reviews (January 2014).

    Dataset Stats

    Some stats about the datasets:

    • D1 (Jan. 2014) contains 38,781 apps requesting 7,826 different permissions, and 1,402,717 user reviews.

    • D2 (Mar. 2014) contains 46,644 apps and 9,319 different permission requests, and 1,361,319 user reviews.

    Additional stats about the datasets are available here.

    Dataset Description

    To store the dataset, we created a graph database with Neo4j. This dataset therefore consists of a graph describing the apps as nodes and edges. We chose a graph database because the graph visualization helps to identify connections among data (e.g., clusters of apps sharing similar sets of permission requests).

    In particular, our dataset graph contains six types of nodes:

    • APP nodes containing metadata of each app
    • PERMISSION nodes describing permission types
    • CATEGORY nodes describing app categories
    • SUBCATEGORY nodes describing app subcategories
    • USER_REVIEW nodes storing user reviews
    • TOPIC nodes containing topics mined from user reviews (using LDA)

    Furthermore, there are five types of relationships between APP nodes and each of the remaining nodes:

    • USES_PERMISSION relationships between APP and PERMISSION nodes
    • HAS_REVIEW between APP and USER_REVIEW nodes
    • HAS_TOPIC between USER_REVIEW and TOPIC nodes
    • BELONGS_TO_CATEGORY between APP and CATEGORY nodes
    • BELONGS_TO_SUBCATEGORY between APP and SUBCATEGORY nodes

    Dataset Files Info

    Neo4j 2.0 Databases

    • googlePlayDB1-Jan2014_neo4j_2_0.rar
    • googlePlayDB2-Mar2014_neo4j_2_0.rar

    We provide two Neo4j databases containing the 2 snapshots of the Google Play Store (January and March 2014). These are the original databases created for the paper. The databases were created with Neo4j 2.0, specifically with 'Neo4j 2.0.0-M06 Community Edition' (the latest version available at the time of implementing the paper in 2014).

    Neo4j 3.5 Databases

    • googlePlayDB1-Jan2014_neo4j_3_5_28.rar
    • googlePlayDB2-Mar2014_neo4j_3_5_28.rar

    Neo4j 2.0 is now deprecated and is no longer available for download from the official Neo4j Download Center, so we have migrated the original databases (Neo4j 2.0) to Neo4j 3.5.28. These databases can be opened with 'Neo4j Community Edition 3.5.28', which can be downloaded from the official Neo4j Download page.

      In order to open the databases with more recent versions of Neo4j, they must first be migrated to the corresponding version. Instructions about the migration process can be found in the Neo4j Migration Guide.

      The first time a database is connected to, Neo4j may request credentials. The username and password are: neo4j/neo4j
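      A minimal sketch of querying one of the snapshots from Python with the neo4j driver, assuming a locally running instance loaded with one of the databases and the neo4j/neo4j credentials mentioned above; the 'name' property on PERMISSION nodes is a guess, since node properties are not listed here.

      ```python
      from neo4j import GraphDatabase

      # Default local bolt URI and the credentials noted above (both assumptions).
      driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

      # APP, PERMISSION and USES_PERMISSION come from the schema described above;
      # the 'name' property on PERMISSION nodes is assumed.
      QUERY = """
      MATCH (a:APP)-[:USES_PERMISSION]->(p:PERMISSION)
      RETURN p.name AS permission, count(a) AS app_count
      ORDER BY app_count DESC
      LIMIT 10
      """

      with driver.session() as session:
          for record in session.run(QUERY):
              print(record["permission"], record["app_count"])

      driver.close()
      ```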
    
  8. October 2023 data-update for "Updated science-wide author databases of standardized citation indicators"

    • elsevier.digitalcommonsdata.com
    Updated Oct 4, 2023
    Cite
    John P.A. Ioannidis (2023). October 2023 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.6
    Dataset updated
    Oct 4, 2023
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions and a composite indicator (c-score). Data are shown separately for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2022 and single recent year data pertain to citations received during calendar year 2022. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (6) is based on the October 1, 2023 snapshot from Scopus, updated to the end of citation year 2022. This work uses Scopus data provided by Elsevier through ICSR Lab (https://www.elsevier.com/icsr/icsrlab). Calculations were performed using all Scopus author profiles as of October 1, 2023. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.

    PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, please read the 3 associated PLoS Biology papers that explain the development, validation and use of these metrics and databases. (https://doi.org/10.1371/journal.pbio.1002501, https://doi.org/10.1371/journal.pbio.3000384 and https://doi.org/10.1371/journal.pbio.3000918).

    Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a

  9. Data from: A large EEG database with users' profile information for motor imagery Brain-Computer Interface research

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jan 8, 2023
    Cite
    Zenodo (2023). A large EEG database with users' profile information for motor imagery Brain-Computer Interface research [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7554429?locale=en
    Dataset updated
    Jan 8, 2023
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context: We share a large database containing electroencephalographic signals from 87 human participants, with more than 20,800 trials in total, representing about 70 hours of recording. It was collected during brain-computer interface (BCI) experiments and organized into 3 datasets (A, B, and C) that were all recorded following the same protocol: right- and left-hand motor imagery (MI) tasks during a single-day session. It includes the performance of the associated BCI users, detailed information about the users' demographics, personality and cognitive profile, and the experimental instructions and codes (executed in the open-source platform OpenViBE). Such a database could prove useful for various studies, including but not limited to: 1) studying the relationships between BCI users' profiles and their BCI performances, 2) studying how EEG signal properties vary for different users' profiles and MI tasks, 3) using the large number of participants to design cross-user BCI machine learning algorithms, or 4) incorporating users' profile information into the design of EEG signal classification algorithms.

    Sixty participants (Dataset A) performed the first experiment, designed to investigate the impact of experimenters' and users' gender on MI-BCI user training outcomes, i.e., users' performance and experience (Pillette et al.). Twenty-one participants (Dataset B) performed the second one, designed to examine the relationship between users' online performance (i.e., classification accuracy) and the characteristics of the chosen user-specific Most Discriminant Frequency Band (MDFB) (Benaroch et al.). The only difference between the two experiments lies in the algorithm used to select the MDFB. Dataset C contains 6 additional participants who completed one of the two experiments described above. Physiological signals were measured using a g.USBAmp (g.tec, Austria), sampled at 512 Hz, and processed online using OpenViBE 2.1.0 (Dataset A) and OpenViBE 2.2.0 (Dataset B). For Dataset C, participants C83 and C85 were collected with OpenViBE 2.1.0 and the remaining 4 participants with OpenViBE 2.2.0. Experiments were recorded at Inria Bordeaux Sud-Ouest, France.

    Duration: Each participant's folder contains approximately 48 minutes of EEG recording, i.e., six 7-minute runs and a 6-minute baseline.

    Documents:
    • Instructions: checklist read by experimenters during the experiments.
    • Questionnaires: the Mental Rotation test used, and the translation of 4 questionnaires, notably the Demographic and Social information, the Pre- and Post-session questionnaires, and the Index of Learning Styles (English and French versions).
    • Performance: the online OpenViBE BCI classification performances obtained by each participant are provided for each run, as well as answers to all questionnaires.
    • Scenarios/scripts: set of OpenViBE scenarios used to perform each of the steps of the MI-BCI protocol, e.g., acquire training data, calibrate the classifier or run the online MI-BCI.

    Database (raw signals): Dataset A: N=60 participants; Dataset B: N=21 participants; Dataset C: N=6 participants.

  10. Global Power Plant Database - Datasets - Data | World Resources Institute

    • old-datasets.wri.org
    Updated Jun 3, 2021
    Cite
    wri.org (2021). Global Power Plant Database - Datasets - Data | World Resources Institute [Dataset]. https://old-datasets.wri.org/dataset/globalpowerplantdatabase
    Dataset updated
    Jun 3, 2021
    Dataset provided by
    World Resources Institute (https://www.wri.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 35,000 power plants from 167 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available. The methodology for the dataset creation is given in the World Resources Institute publication "A Global Database of Power Plants". Data updates may occur without associated updates to this manuscript. The database can be visualized on Resource Watch together with hundreds of other datasets. The database is available for immediate download and use through the WRI Open Data Portal. Associated code for the creation of the dataset can be found on GitHub. The bleeding-edge version of the database (which may contain substantial differences from the release you are viewing) is available on GitHub as well. To be informed of important database releases in the future, please sign up for our newsletter.
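    A small pandas sketch of the kind of analysis the database supports; the file and column names follow the published CSV release (e.g. primary_fuel, capacity_mw) but should be treated as assumptions and checked against the data dictionary that ships with the download.

    ```python
    import pandas as pd

    # Placeholder path to the CSV release of the Global Power Plant Database.
    plants = pd.read_csv("global_power_plant_database.csv")

    # Total installed capacity (MW) by fuel type, largest first.
    capacity_by_fuel = (
        plants.groupby("primary_fuel")["capacity_mw"]
        .sum()
        .sort_values(ascending=False)
    )
    print(capacity_by_fuel.head(10))
    ```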

  11. VIUS Public Use File Data

    • catalog.data.gov
    • data.virginia.gov
    • + 1 more
    Updated Feb 17, 2024
    Cite
    Bureau of Transportation Statistics (2024). VIUS Public Use File Data [Dataset]. https://catalog.data.gov/dataset/vius-public-use-file-data
    Dataset updated
    Feb 17, 2024
    Dataset provided by
    Bureau of Transportation Statistics
    Description

    This is a recoded version of the 2021 VIUS Public Use File. The original is found at https://www2.census.gov/programs-surveys/vius/datasets/2021/ and should be used as the most up-to-date version. This version will be updated as needed and was extracted 2/2/2024. The main difference between the files is that the data dictionary has been incorporated into the database to minimize cross-checking.

  12. A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China

    • dataverse.harvard.edu
    Updated Jan 19, 2021
    Cite
    Lianfa, Li; Jiajie, Wu (2021). A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China [Dataset]. http://doi.org/10.7910/DVN/RNSWRH
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Lianfa, Li; Jiajie, Wu
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2015 - Dec 31, 2018
    Area covered
    China
    Description

    We share a complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth (MAIAC AOD) product (https://lpdaac.usgs.gov/products/mcd19a2v006/), with a similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large image of AOD covering the entire area of mainland China. Due to cloud conditions and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage of each AOD image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product. We used the sophisticated method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or a time index, etc. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32) and an average test R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation using ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, which further illustrates the reliability of the method.

    This database contains four datasets:
    • Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
    • The table "CHN_AOD_INFO.xlsx" describes the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum AOD that we predicted.
    • The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx" describes the statistics of performance metrics in the interpolation of the high-resolution meteorological dataset.
    • The table "Evaluation_Using_AERONET_AOD.xlsx" shows the evaluation results against AERONET, including R^2, RMSE, and the monitoring information used in this study.
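    As a generic illustration of the evaluation metrics quoted above (R^2 and RMSE between predicted and observed AOD), here is a short numpy sketch; the arrays are placeholders, not values from this dataset.

    ```python
    import numpy as np

    # Placeholder predicted/observed AOD values.
    predicted = np.array([0.35, 0.42, 0.55, 0.61, 0.48])
    observed = np.array([0.33, 0.40, 0.58, 0.65, 0.45])

    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot

    print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
    ```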

  13. Low Probability of Intercept Radar database 2025

    • kaggle.com
    zip (5,032,261,682 bytes)
    Updated Jan 8, 2025
    Cite
    Edgard Braz Alves (2025). Low Probability of Intercept Radar database 2025 [Dataset]. https://www.kaggle.com/datasets/edgardbrazalves/tfi-50x50x3-shuffled-from-01-to-05-files
    Dataset updated
    Jan 8, 2025
    Authors
    Edgard Braz Alves
    License

    Community Data License Agreement - Sharing 1.0 (https://cdla.io/sharing-1-0/)

    Description

    Low Probability of Intercept Radar database 2025

    Low Probability of Intercept Radar database 2025 is a dataset composed of 403,000 Time-Frequency Images (TFI) for each of the Time-Frequency Analysis (TFA) methods used: the Short-Time Fourier Transform (STFT), the Smoothed Pseudo Wigner-Ville Distribution (SPWVD) and the Choi-Williams Distribution (CWD). This dataset was generated with simulations in MATLAB for classification purposes and to provide a wide radar dataset for further research on this matter.

    This database was created to compare the results of its application with related works; therefore, the cross-validation process, the LPI radar signal instance base and the data partitioning used were generated following the same procedures and execution conditions as those adopted by the related works in their experiments.

    The results gathered with this dataset can be seen in the following articles: SBBD-2024, SBPO-2024 and SIGE-2024.

    A presentation of the SIGE-2024 article can be accessed here: SIGE-2024 Presentation (video), https://www.youtube.com/watch?v=HSlARxBVyi0

    Dataset explanation Index

    1. Types of signals generated
    2. Methodology
    3. Related Works
    4. Articles made with this dataset

    Type of signals generated

    We have generated a set of 13 signals commonly used in radar applications:

    1- Linear Frequency Modulation (LFM)

    2- Unmodulated (Rectangular)

    3- Frequency Modulation with Costas code

    4- Phase Modulation with Barker code

    5- Phase modulation with Frank code

    6- Phase modulation with P1 code

    7- Phase modulation with P2 code

    8- Phase Modulation with P3 code

    9- Phase modulation with P4 code

    10- Phase modulation with T1 code

    11- Phase modulation with T2 code

    12- Phase modulation with T3 code

    13- Phase modulation with T4 code

    Methodology

    The methodology comprises the following steps:

    • Creation of the LPI Radar Signal Instances Base, generated by adding Additive White Gaussian Noise (AWGN) and simulated channel loss to the noise-free LPI radar signal;
    • Signal Pre-processing, generating the STFT-TFI, SPWVD-TFI and CWD-TFI bases by pre-processing the signals using the STFT-TFA, SPWVD-TFA and CWD-TFA techniques, respectively; and
    • Dataset Division into training, validation and testing sets.

    These steps are summarized in the flowchart below.

    [Flowchart image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22872466%2F099cda70be92e7e204c988e95cedf53d%2FKaggle%20Fluxchart.png?generation=1737380988661833&alt=media]

    Creation of LPI Radar Signal Instances Base

    To create the Base of LPI Radar Signal Instances, we modeled the receiver of a radar system. We considered that the complex sample of an intercepted LPI radar signal is perturbed by Additive White Gaussian Noise (AWGN) and channel loss, as indicated by the equation:

    [AWGN equation image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22872466%2F55c09c1b329fc964c5bd98f63c4d6cbc%2FKaggle%20equation.png?generation=1737381692399396&alt=media]

    In this equation: x(k) represents the signal generated during the Signal Generation step, which is noise-free; h(k) corresponds to the channel interference resulting from the Channel Loss Generation step; n(k) characterizes the noise introduced during the AWGN Generation step; k denotes the sample index for each T_s (sampling period), considering a sampling frequency f_s.

    It is important to emphasize that this data creation mechanism is identical to that used in the related works, as we utilized the source code for generating LPI radar signals provided by their authors after corresponding via email. Additionally, we created the database using the same intrapulse modulations and parameter ranges specified by these related works. Consequently, we generated the 13 different types of LPI radar signal modulations and introduced noise during the AWGN Generation step, varying the Signal-to-Noise Ratio (SNR) from -20 dB to +10 dB in 1.0 dB increments. In the same way, channel loss interference was modeled using Rayleigh fading during the Channel Loss Generation step.
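    As an illustration of the instance-generation recipe described above (not the authors' MATLAB code), here is a short Python sketch that builds one LFM pulse, applies a flat Rayleigh channel gain and AWGN at a chosen SNR, and computes an STFT time-frequency image; all parameter values are placeholders.

    ```python
    import numpy as np
    from scipy import signal

    fs = 100e6                    # sampling frequency (Hz), assumed
    duration = 10e-6              # pulse length (s), assumed
    t = np.arange(0, duration, 1 / fs)
    x = np.exp(1j * np.pi * (20e6 / duration) * t ** 2)   # complex LFM chirp, 0-20 MHz sweep

    rng = np.random.default_rng(0)
    h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # Rayleigh fading gain
    snr_db = -6
    signal_power = np.mean(np.abs(h * x) ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)
    n = np.sqrt(noise_power / 2) * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))

    y = h * x + n                 # received sample: channel loss plus AWGN

    f, tt, Zxx = signal.stft(y, fs=fs, nperseg=128, return_onesided=False)
    tfi = np.abs(Zxx)             # magnitude TFI, which could then be resized to a fixed image size
    print(tfi.shape)
    ```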

    The different signal instances constituting the LPI Signal Base were created by randomly varying the specific parameters for each intrapulse modulation, following the specifications outlined in the related works and detailed in the figure below.

    [Image, specific parameters for each intrapulse modulation: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22872466%2Fe8da1c27a7584c003dc138960841856b%2Fimg_instncias.jpg?generation=1737382815492443&alt=media]

    **Generating the STFT-TFI, SPWVD-TFI and...

  14. Data from: Grass-Cast Database - Data on aboveground net primary productivity (ANPP), climate data, NDVI, and cattle weight gain for Western U.S. rangelands

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • + 1 more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Grass-Cast Database - Data on aboveground net primary productivity (ANPP), climate data, NDVI, and cattle weight gain for Western U.S. rangelands [Dataset]. https://catalog.data.gov/dataset/grass-cast-database-data-on-aboveground-net-primary-productivity-anpp-climate-data-ndvi-an-ac7cd
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Area covered
    United States
    Description

    Grass-Cast: Experimental Grassland Productivity Forecast for the Great Plains

    Grass-Cast uses almost 40 years of historical data on weather and vegetation growth in order to project grassland productivity in the Western U.S. More details on the projection model and method can be found at https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecs2.3280.

    Every spring, ranchers in the drought‐prone U.S. Great Plains face the same difficult challenge—trying to estimate how much forage will be available for livestock to graze during the upcoming summer grazing season. To reduce this uncertainty in predicting forage availability, we developed an innovative new grassland productivity forecast system, named Grass‐Cast, to provide science‐informed estimates of growing season aboveground net primary production (ANPP). Grass‐Cast uses over 30 yr of historical data including weather and the satellite‐derived normalized difference vegetation index (NDVI)—combined with ecosystem modeling and seasonal precipitation forecasts—to predict if rangelands in individual counties are likely to produce below‐normal, near‐normal, or above‐normal amounts of grass biomass (lbs/ac). Grass‐Cast also provides a view of rangeland productivity in the broader region, to assist in larger‐scale decision‐making—such as where forage resources for grazing might be more plentiful if a rancher's own region is at risk of drought. Grass‐Cast is updated approximately every two weeks from April through July. Each Grass‐Cast forecast provides three scenarios of ANPP for the upcoming growing season based on different precipitation outlooks. Near real‐time 8‐d NDVI can be used to supplement Grass‐Cast in predicting cumulative growing season NDVI and ANPP starting in mid‐April for the Southern Great Plains and mid‐May to early June for the Central and Northern Great Plains.

    Here, we present the scientific basis and methods for Grass‐Cast along with the county‐level production forecasts from 2017 and 2018 for ten states in the U.S. Great Plains. The correlation between early growing season forecasts and the end‐of‐growing season ANPP estimate is >50% by late May or early June. In a retrospective evaluation, we compared Grass‐Cast end‐of‐growing season ANPP results to an independent dataset and found that the two agreed 69% of the time over a 20‐yr period. Although some predictive tools exist for forecasting upcoming growing season conditions, none predict actual productivity for the entire Great Plains. The Grass‐Cast system could be adapted to predict grassland ANPP outside of the Great Plains or to predict perennial biofuel grass production. This new experimental grassland forecast is the result of a collaboration between Colorado State University, the U.S. Department of Agriculture (USDA), the National Drought Mitigation Center, and the University of Arizona. Funding for this project was provided by the USDA Natural Resources Conservation Service (NRCS), the USDA Agricultural Research Service (ARS), and the National Drought Mitigation Center. Watch for updates on the Grass-Cast website or on Twitter (@PeckAgEc). Project contact: Dannele Peck, Director of the USDA Northern Plains Climate Hub, at dannele.peck@ars.usda.gov or 970-744-9043.

    Resources in this dataset:
    • Resource Title: Cattle weight gain. File Name: Cattle_weight_gains.xlsx. Description: Cattle weight gain data for the Grass-Cast Database.
    • Resource Title: NDVI. File Name: NDVI.xlsx. Description: Annual NDVI growing season values for Grass-Cast sites. See the readme for more information and NDVI_raw for the raw values.
    • Resource Title: NDVI_raw. File Name: NDVI_raw.xlsx. Description: Raw bimonthly NDVI values for Grass-Cast sites.
    • Resource Title: ANPP. File Name: ANPP.xlsx. Description: Dataset for annual aboveground net primary productivity (ANPP). The Excel sheet is broken into two tabs: 1) 'readme' describing the data, and 2) 'ANPP' with the actual data.
    • Resource Title: Grass-Cast_sitelist. File Name: Grass-Cast_sitelist.xlsx. Description: A list of the sites/studies that are currently incorporated into the database, as well as metadata and contact info associated with the data sets. Includes a 'readme' tab and a 'sitelist' tab.
    • Resource Title: Grass-Cast_AgDataCommons_overview. File Name: Grass-Cast_AgDataCommons_download.html. Description: HTML document that shows database overview information. This document provides a glimpse of the data tables available within the data resource as well as the respective metadata tables. The R script (R Markdown, .Rmd format) that generates the html file, and that can be used to upload the Grass-Cast associated Ag Data Commons data files, can be downloaded in the 'Grass-Cast R script' zip folder. The Grass-Cast files still need to be locally downloaded before use, but we are looking to automate the download.
    • Resource Title: Grass-Cast R script. File Name: R_access_script.zip. Description: R script (in R Markdown [Rmd] format) for uploading and looking at Grass-Cast data.

  15. APNEA HRV DATASET

    • data.mendeley.com
    • zenodo.org
    Updated Mar 8, 2018
    Cite
    Gabriel Juliá-Serdá (2018). APNEA HRV DATASET [Dataset]. http://doi.org/10.17632/vv6wdpbrsh.1
    Dataset updated
    Mar 8, 2018
    Authors
    Gabriel Juliá-Serdá
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HuGCDN2014 database was provided by the sleep unit of the Dr. Negrín University Hospital (Canary Islands, Spain). It is made up of 77 single-lead ECG recordings, digitized at 200 Hz. The labeling process was performed by an expert based on simultaneous polysomnography, indicating the presence or absence of apnea in each minute. The database is divided into two groups: 1) CONTROL: forty healthy subjects with an AHI lower than 5 (30 men and 10 women); 2) APNEA: thirty-seven OSA patients with an AHI higher than 25 (30 men and 7 women). The learning set (L) consists of the first 20 recordings of control subjects and the first 18 OSA patients. The rest belong to the test set (T). The single-lead ECG signal is divided into 5-minute frames that are shifted in time in increments of 1 minute. The scoring for each segment is assigned to the minute located in the middle position. The RR interval series is constructed as the sequence of time differences between successive heartbeats.
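    A brief sketch of the framing described above: RR intervals as differences between successive heartbeats, then 5-minute frames shifted in 1-minute increments. The R-peak times are synthetic placeholders; obtaining them from the 200 Hz ECG requires a QRS detector, which is not shown.

    ```python
    import numpy as np

    fs = 200                                                    # sampling frequency (Hz) of the recordings
    r_peak_samples = np.cumsum(np.full(600, int(0.8 * fs)))    # fake beats every 0.8 s
    r_peak_times = r_peak_samples / fs                          # seconds

    rr_intervals = np.diff(r_peak_times)                        # successive heartbeat differences (s)

    frame_len, step = 300.0, 60.0                               # 5-minute frames, 1-minute shift
    frames = []
    start = 0.0
    while start + frame_len <= r_peak_times[-1]:
        mask = (r_peak_times[1:] >= start) & (r_peak_times[1:] < start + frame_len)
        frames.append(rr_intervals[mask])
        start += step

    print(len(frames), "frames; mean RR of first frame:", round(frames[0].mean(), 3))
    ```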

  16. NIST Computational Chemistry Comparison and Benchmark Database - SRD 101

    • catalog.data.gov
    • s.cnmilf.com
    • + 2 more
    Updated Sep 30, 2025
    Cite
    National Institute of Standards and Technology (2025). NIST Computational Chemistry Comparison and Benchmark Database - SRD 101 [Dataset]. https://catalog.data.gov/dataset/nist-computational-chemistry-comparison-and-benchmark-database-srd-101
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The NIST Computational Chemistry Comparison and Benchmark Database is a collection of experimental and ab initio thermochemical properties for a selected set of gas-phase molecules. The goals are to provide a benchmark set of experimental data for the evaluation of ab initio computational methods and allow the comparison between different ab initio computational methods for the prediction of gas-phase thermochemical properties. The data files linked to this record are a subset of the experimental data present in the CCCBDB.

  17. Data from: IchnoDB: structure and importance of an ichnology database

    • tandf.figshare.com
    mdb
    Updated Jun 5, 2023
    Cite
    Dean M. Meek; Bruce M. Eglington; Luis A. Buatois; M. Gabriela Mángano (2023). IchnoDB: structure and importance of an ichnology database [Dataset]. http://doi.org/10.6084/m9.figshare.12848993.v1
    Explore at:
    Available download formats: mdb
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Dean M. Meek; Bruce M. Eglington; Luis A. Buatois; M. Gabriela Mángano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The design of a relational database for ichnological data is presented to illustrate and address deficiencies in present-day palaeontological databases. Currently, palaeontology databases apply concepts and terminology derived from the study of body fossils to trace fossil records. We suggest that fundamental differences between body and trace fossils make this practice inappropriate. These differences stem from the fact that trace fossils represent the behaviour of the tracemaker, not the phylogenetic affinities of an organism. This database, referred to as IchnoDB, has been tested by the authors throughout the design process to ensure that the alterations to current palaeontology databases recommended herein are functional. In describing the design and logic that underpin an ichnology database, it is our desire to see established palaeontological databases incorporate ichnology-specific fields into their structure. This would support and encourage future research involving the use of large ichnological datasets.

  18. MERIT-SWORD: Bidirectional Translations Between MERIT-Basins and the SWOT River Database (SWORD)

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Jan 16, 2025
    Cite
    Wade, Jeffrey; David, Cédric H.; Altenau, Elizabeth; Collins, Elyssa; Oubanas, Hind; Coss, Stephen; Cerbelaud, Arnaud; Tom, Manu; Durand, Michael; Pavelsky, Tamlin (2025). MERIT-SWORD: Bidirectional Translations Between MERIT-Basins and the SWOT River Database (SWORD) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13152825
    Explore at:
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    University of North Carolina at Chapel Hill
    Jet Propulsion Laboratory, California Institute of Technology
    The Ohio State University
    INRAE, UMR G-eau
    Authors
    Wade, Jeffrey; David, Cédric H.; Altenau, Elizabeth; Collins, Elyssa; Oubanas, Hind; Coss, Stephen; Cerbelaud, Arnaud; Tom, Manu; Durand, Michael; Pavelsky, Tamlin
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Corresponding peer-reviewed publication

    This dataset corresponds to all input and output files that were used in the study reported in:

    Wade, J., David, C.H., Collins, E.L., Altenau, E.H., Coss, S., Cerbelaud, A., Tom, M., Durand, M., Pavelsky T.M. (In Review), Bidirectional Translations Between Observational and Topography-based Hydrographic Datasets: MERIT-Basins and the SWOT River Database (SWORD).

    When making use of any of the files in this dataset, please cite both the aforementioned article and the dataset herein.

    Summary

    The MERIT-SWORD data product reconciles critical differences between the SWOT River Database (SWORD; Altenau et al., 2021), the hydrography dataset used to aggregate observations from the Surface Water and Ocean Topography (SWOT) Mission, and MERIT-Basins (MB; Lin et al., 2019; Yang et al., 2021), an elevation-derived vector hydrography dataset commonly used by global river routing models (Collins et al., 2024). The SWORD and MERIT-Basins river networks differ considerably in their representation of the location and extent of global river reaches, complicating potential synergistic data transfer between SWOT observations and existing hydrologic models.

    MERIT-SWORD aims to:

    Generate bidirectional, one-to-many links (i.e. translations) between river reaches in SWORD and MERIT-Basins (ms_translate files).

    Provide a reach-specific evaluation of the quality of translations (ms_diagnostic files).

    Data sources

    The following sources were used to produce files in this dataset:

    MERIT-Basins (version 1.0) derived from MERIT-Hydro (version 0.7) available under a CC BY-NC-SA 4.0 license. https://www.reachhydro.org/home/params/merit-basins

    SWOT River Database (SWORD) (version 16) available under a CC BY 4.0 license. https://zenodo.org/records/10013982. DOI: 10.5281/zenodo.10013982

    Mean Discharge Runoff and Storage (MeanDRS) dataset (version v0.4) available under a CC BY-NC-SA 4.0 license. https://zenodo.org/records/10013744. DOI: 10.5281/zenodo.10013744; 10.1038/s41561-024-01421-5

    Software

    The software used to produce the files in this dataset is available at https://github.com/jswade/merit-sword.

    Primary Data Products

    The following files represent the primary data products of the MERIT-SWORD dataset. Each file class generally has 61 files, corresponding to the 61 global hydrologic regions (region ii). For typical use of this dataset, download the three zip folders listed below. The ms_translate.zip and ms_diagnostic.zip NetCDF files are best suited for scripting applications, while the ms_translate_shp.zip shapefiles are best suited for GIS applications.

    The MERIT-SWORD translation tables (.nc) establish links between corresponding river reaches in MERIT-Basins and SWORD in both directions. The mb_to_sword translations relate the COMID values of all MERIT-Basins reaches in region ii (as defined by MERIT-Basins) to corresponding SWORD reach_id values, which are ranked by their degree of overlap and stored in columns sword_1 – sword_40. The partial intersecting lengths (m) of SWORD reaches within related MERIT-Basins unit catchments are stored in columns part_len_1 – part_len_40 and can be used to weight data transfers from more than one SWORD reach. The sword_to_mb translations relate the reach_id values of all SWORD reaches in region ii (as defined by SWORD) to corresponding MERIT-Basins COMID values, which are ranked by their degree of overlap and stored in columns mb_1 – mb_40. The partial intersecting lengths (m) of SWORD reaches within related MERIT-Basins unit catchments are again stored in columns part_len_1 – part_len_40.

    ms_translate.zip

    mb_to_sword: mb_to_sword_pfaf_ii_translate.nc

    sword_to_mb: sword_to_mb_pfaf_ii_translate.nc
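    A minimal sketch of reading one translation table and turning the part_len_* overlap lengths into weights; the internal variable names mirror the column names described above but are assumptions not verified against the actual NetCDF files, and the region code is a placeholder.

    ```python
    # Minimal sketch (assumed internal layout) for one mb_to_sword translation table.
    import numpy as np
    import xarray as xr

    df = xr.open_dataset("mb_to_sword_pfaf_11_translate.nc").to_dataframe().reset_index()

    # Overlap lengths (m) for each ranked SWORD match, ordered numerically.
    len_cols = sorted((c for c in df.columns if c.startswith("part_len_")),
                      key=lambda c: int(c.split("_")[-1]))
    lengths = df[len_cols].to_numpy(dtype=float)
    totals = lengths.sum(axis=1, keepdims=True)

    # Normalise overlap lengths into weights per MERIT-Basins reach, so data from
    # several SWORD reaches can be blended proportionally to their overlap.
    weights = np.divide(lengths, totals, out=np.zeros_like(lengths), where=totals > 0)
    ```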

    The MERIT-SWORD diagnostic tables (.nc) contain evaluations of the quality of translations between MERIT-Basins and SWORD reaches, stored in column flag. The mb_to_sword diagnostic files contain integer quality flags for each MERIT-Basins reach translation in region ii. The sword_to_mb diagnostic files contain integer quality flags for each SWORD reach translation in region ii. The quality flags are as follows:

    0 = Valid translation.

    1 = Translated reaches are not topologically connected to each other.

    2 = Reach does not have a corresponding reach in the other dataset (absent translation).

    21 = Reach does not have a corresponding reach in the other dataset due to flow accumulation mismatches.

    22 = Reach does not have a corresponding reach in the other dataset because it is located in what the other dataset defines as the ocean.

    ms_diagnostic.zip

    mb_to_sword: mb_to_sword_pfaf_ii_diagnostic.nc

    sword_to_mb: sword_to_mb_pfaf_ii_diagnostic.nc
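    A minimal sketch of screening translations by quality flag before any data transfer; the variable name 'flag' follows the description above but is not verified against the actual files, and the region code is a placeholder.

    ```python
    # Minimal sketch: keep only reaches whose translation is flagged as valid.
    import xarray as xr

    diag = xr.open_dataset("mb_to_sword_pfaf_11_diagnostic.nc").to_dataframe().reset_index()
    valid = diag[diag["flag"] == 0]                 # flag 0 = valid translation
    absent = diag[diag["flag"].isin([2, 21, 22])]   # no counterpart in the other dataset
    print(f"{len(valid)} valid and {len(absent)} absent translations in this region")
    ```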

    For GIS applications, the translation and diagnostic tables are also available in shapefile format, joined to their respective MERIT-Basins and SWORD river vector shapefiles. The MERIT-Basins and SWORD shapefiles retain their original attribute tables, in addition to the added translation and diagnostic columns.

    ms_translate_shp.zip

    mb: riv_pfaf_ii_MERIT_Hydro_v07_Basins_v01_translate.shp

    sword: jj_sword_reaches_hbii_v16_translate.shp

    Example Applications Data Products

    The following files are example use cases of transferring data between MERIT-Basins and SWORD. They are not required for typical use of the MERIT-SWORD dataset.

    The MeanDRS-to-SWORD application example files demonstrate how the MERIT-SWORD translation tables can be used to transfer discharge simulations along MERIT-Basins reaches (i.e. MeanDRS; https://zenodo.org/records/8264511) to corresponding SWORD reaches in region ii and continent xx. MeanDRS discharge simulations (m3 s-1) are transferred to SWORD reaches based on a weighted average translation of corresponding reaches and stored in the column meanDRS_Q.

    app_meandrs_to_sword.zip: xx_sword_reaches_hbii_v16_meandrs.shp
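    A minimal sketch of the length-weighted transfer described above, using synthetic numbers rather than the actual MeanDRS or MERIT-SWORD files; the column names follow the descriptions in this record. The reverse direction (e.g. transferring SWORD widths onto MERIT-Basins reaches, as in the next example) would follow the same weighting pattern.

    ```python
    # Synthetic illustration of a length-weighted transfer from MERIT-Basins to SWORD.
    import numpy as np
    import pandas as pd

    # Hypothetical translation rows: each SWORD reach maps to up to two
    # MERIT-Basins reaches (mb_1, mb_2) with partial overlap lengths (m).
    trans = pd.DataFrame({
        "reach_id": [7100100011, 7100100021],
        "mb_1": [101, 103], "part_len_1": [8000.0, 5000.0],
        "mb_2": [102, 0],   "part_len_2": [2000.0, 0.0],   # 0 = no further match
    })
    # Hypothetical MeanDRS-style discharge (m3 s-1) keyed by MERIT-Basins COMID.
    q_mb = pd.Series({101: 120.0, 102: 40.0, 103: 15.0})

    def weighted_q(row):
        """Length-weighted average discharge over the matched MERIT-Basins reaches."""
        ids = [row["mb_1"], row["mb_2"]]
        lens = np.array([row["part_len_1"], row["part_len_2"]])
        qs = np.array([q_mb.get(i, np.nan) for i in ids])
        ok = (lens > 0) & ~np.isnan(qs)
        return np.average(qs[ok], weights=lens[ok]) if ok.any() else np.nan

    trans["meanDRS_Q"] = trans.apply(weighted_q, axis=1)
    print(trans[["reach_id", "meanDRS_Q"]])
    ```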

    The SWORD-to-MERIT-Basins application example files demonstrate how the MERIT-SWORD translation tables can be used to transfer variables of interest (in this case, river width) from SWORD reaches to corresponding MERIT-Basins reaches in region ii. SWORD width estimates (m) are transferred to MERIT-Basins reaches based on a weighted average translation of corresponding reaches and stored in the column sword_wid.

    app_sword_to_mb.zip: riv_pfaf_ii_MERIT_Hydro_v07_Basins_v01_sword.shp

    Intermediate Data Products

    The following files are intermediates used in generating the primary data. They are not required for typical use of the MERIT-SWORD dataset.

    The MERIT-SWORD river trace files represent our first approximation of MERIT-Basins reaches that correspond to SWORD reaches in region ii, prior to the manual removal of mistakenly included reaches. The river trace files are only used to generate the final river network files and are not used elsewhere in the dataset.

    ms_riv_trace.zip: meritsword_pfaf_ii_trace.shp

    The MERIT-SWORD river network shapefiles contain the MERIT-Basins reaches that in aggregate best correspond to the location and extent of the SWORD river network for each of the Pfafstetter level 2 regions as defined by SWORD v16 (i.e. the 61 values of ii). The MERIT-SWORD river networks serve as an intermediary data product to enable reliable translations.

    ms_riv_network.zip: meritsword_pfaf_ii_network.shp

    The MERIT-SWORD transpose files are used to confirm that the translation tables in one direction can be recreated in their entirety using only data from the translation tables in the other direction, ensuring lossless data transfer. These files are exact copies of the files contained in ms_translate.zip.

    ms_transpose.zip

    mb_transposed: mb_to_sword_pfaf_ii_transpose.nc

    sword_transposed: sword_to_mb_pfaf_ii_transpose.nc

    The MERIT-SWORD translation catchment files contain the MERIT-Basins unit catchments corresponding to each reach used in generating the mb_to_sword and sword_to_mb translations for each region ii. The files are used internally during the translation process and not required for typical dataset use.

    ms_translate_cat.zip

    mb_to_sword: mb_to_sword_pfaf_ii_translate_cat.nc

    sword_to_mb: sword_to_mb_pfaf_ii_translate_cat.nc

    The hydrologic regions as defined by MERIT-Basins and SWORD are not identical and overlap in many cases, complicating translations. The region overlap files provide bidirectional mappings between region identifiers in both datasets. The files are used in most dataset scripts to determine the regional files from each dataset that need to be loaded.

    ms_region_overlap.zip: mb_to_sword_reg_overlap.csv, sword_to_mb_reg_overlap.csv
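    A minimal sketch of using the region overlap table to decide which regional files to load; the column names are assumptions, so check the CSV header for the real ones.

    ```python
    # Minimal sketch: look up which MERIT-Basins regions overlap a given SWORD region.
    import pandas as pd

    overlap = pd.read_csv("sword_to_mb_reg_overlap.csv")
    sword_region = 71                                  # hypothetical region code
    mb_regions = overlap.loc[overlap["sword_region"] == sword_region, "mb_region"].unique()
    print(mb_regions)  # MERIT-Basins regional files to load for this SWORD region
    ```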

    The MERIT-SWORD river edit files contain ~3,500 MERIT-Basins river reaches that were mistakenly included during river network generation and do not correspond to any SWORD reaches. These reaches are removed from the river trace files to generate the final MERIT-SWORD river network data product.

    ms_riv_edit.zip: meritsword_edits.csv

    Near the antimeridian, MERIT-Basins and SWORD shapefiles differ in their longitude convention. Additionally, the SWORD dataset lacks a shapefile for region 54, which does not have any SWORD reaches. The SWORD edit files contain copies of SWORD files, altered to match the longitude convention of MERIT-Basins and including a dummy shapefile for region 54.

    sword_edit.zip: xx_sword_reaches_hbii_v16.shp
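    An illustrative sketch of reconciling longitude conventions near the antimeridian; which convention each dataset uses is not stated in this record, so the direction of the conversion shown here is an assumption.

    ```python
    # Illustrative only: map longitudes from a [0, 360) convention to (-180, 180].
    import numpy as np

    def wrap_to_180(lon):
        lon = np.asarray(lon, dtype=float)
        return np.where(lon > 180.0, lon - 360.0, lon)

    print(wrap_to_180([10.0, 179.5, 185.0, 359.0]))  # -> [ 10.  179.5 -175.  -1. ]
    ```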

    Known bugs in this dataset or the associated manuscript

    No bugs have been identified at this time.

    References

    Altenau, E. H., Pavelsky, T. M., Durand, M. T., Yang, X., Frasson, R. P. de M., & Bendezu, L. (2021). The Surface Water and Ocean Topography (SWOT) Mission River Database (SWORD): A Global River Network for Satellite Data Products. Water Resources Research, 57(7), e2021WR030054. https://doi.org/10.1029/2021WR030054

    Collins, E. L., David, C. H., Riggs, R., Allen, G. H., Pavelsky, T. M., Lin, P., Pan, M., Yamazaki,

  19. TMBD_Movies_20000_top_rated

    • kaggle.com
    zip
    Updated Jan 12, 2023
    Cite
    Priyanka (2023). TMBD_Movies_20000_top_rated [Dataset]. https://www.kaggle.com/datasets/priyankasantramgaik/tmbd-movies-top-rated
    Explore at:
    Available download formats: zip (3,076,231 bytes)
    Dataset updated
    Jan 12, 2023
    Authors
    Priyanka
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data is in CSV format and contains historical movie data. The Movie Database (TMDb) is a popular, user-editable database for movies; this dataset was taken from the TMDb API.

    Difference between IMDb and TMDb: TMDb is an initialism that closely resembles that of its bigger counterpart, IMDb. Both are massive indexes of movie and television information, but The Movie Database differs from the Internet Movie Database in one key aspect: TMDb is completely powered by its community. 😊

    Dataset columns: id, original_language, original_title, overview, popularity, release_date, title, vote_average, vote_count.
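    A minimal sketch of loading this dataset with pandas; the local file name is hypothetical, and the columns follow the list above.

    ```python
    # Minimal sketch: rank the TMDb records by rating using the columns listed above.
    import pandas as pd

    movies = pd.read_csv("tmdb_top_rated.csv")   # hypothetical local file name
    top = movies.sort_values(["vote_average", "vote_count"], ascending=False)
    print(top[["title", "release_date", "vote_average", "vote_count"]].head(10))
    ```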

  20. Global rRNA Universal Metabarcoding of Plankton database (GRUMP)

    • simonscmap.com
    • zenodo.org
    Updated Jan 30, 2025
    Cite
    University of Southern California (2025). Global rRNA Universal Metabarcoding of Plankton database (GRUMP) [Dataset]. https://simonscmap.com/catalog/datasets/GRUMP
    Explore at:
    Dataset updated
    Jan 30, 2025
    Dataset authored and provided by
    University of Southern California
    Time period covered
    Aug 22, 2003 - Sep 17, 2020
    Area covered
    Measurement technique
    High Throughput Sequencer, CTD, fluorometer, Autoanalyzer, Uncategorized
    Description

    We introduce the Global rRNA Universal Metabarcoding Plankton database (McNichol and Williams et al., 2025), which consists of 1194 samples covering extensive latitudinal and longitudinal transects, including depth profiles, in all major ocean basins from 2003 to 2020. Unfractionated (>0.2 µm) seawater DNA samples were amplified using the 515Y/926R universal 3-domain rRNA primers, quantifying the relative abundance of amplicon sequence variants (ASVs) from Bacteria, Archaea, and Eukaryotes with one denominator. Thus, the ratio of any organism (or group) to any other in a sample is directly comparable to the ratio in any other sample within the dataset, irrespective of gene copy number differences. This obviates a problem in prior global studies that used size fractionation and different primers for prokaryotes and eukaryotes, precluding comparisons between abundances across size fractions or domains.
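    The 'one denominator' point above can be illustrated with a toy relative-abundance table (synthetic numbers, not GRUMP data): because every ASV in a sample is divided by the same total, the ratio of any two taxa is unchanged by the normalisation and is directly comparable across samples.

    ```python
    # Synthetic illustration of single-denominator relative abundances.
    import pandas as pd

    counts = pd.DataFrame({"sample_A": [120, 30, 850], "sample_B": [60, 15, 425]},
                          index=["ASV_bacteria", "ASV_archaea", "ASV_eukaryote"])
    rel = counts / counts.sum(axis=0)                 # one denominator per sample
    ratio = rel.loc["ASV_bacteria"] / rel.loc["ASV_archaea"]
    print(ratio)  # identical (4.0) in both samples -> comparable across the dataset
    ```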

    Sample Collection Samples were collected by multiple collaborations, which used slightly different sample collection techniques. These collection techniques will be outlined by individual cruise.

    For the Atlantic Meridional Transects (AMT 19 and AMT 20), 5–10 L of whole seawater was collected from the sea surface using a Niskin bottle and was then filtered onto 0.22 µm Sterivex Durapore filters (Millipore Sigma, Burlington, MA, USA). Samples were collected by Stephanie Sargeant and Andy Rees, Plymouth Marine Laboratory (PML) as part of the Atlantic Meridional Transect (AMT) (Rees, Smyth and Brotas, 2024) research cruises 19 (2009) and 20 (2010) onboard UK research vessel RRS James Cook (JC039 and JC053 – (Rees, 2010a, 2010b). Sterivex filters were capped and stored in RNAlater® (ThermoFisher) at -80 °C until analysis.

    For samples taken in the FRAM Strait (Wietz et al., 2021), whole seawater was collected using Remote Access Samplers (RAS; McLane) on seafloor moorings F4-S-1, HG-IV-S-1, Fevi-34, and EGC-5. Moorings were operated within the FRAM / HAUSGARTEN Observatory covering the West Spitsbergen Current, central Fram Strait, and East Greenland Current as well as the Marginal Ice Zone. RAS performed continuous, autonomous sampling from July 2016 – August 2017 at programmed intervals (weekly to monthly). Nominal deployment depths were 30 m (F4, HG-IV), 67 m (Fevi), and 80 m (EGC); however, vertical movements in the water column resulted in variable actual sampling depths, ranging from 25 to 150 m. Per sampling event, two lots of 500 mL of whole seawater were pumped into bags containing mercuric chloride for fixation. After RAS recovery, the two samples per sampling event were pooled, and approximately 700 mL of pooled water was filtered onto 0.22 µm Sterivex cartridges. Filtered samples were stored at -20°C until DNA extraction. For MOSAiC, whole seawater was collected from the upper water column via a rosette sampler equipped with Niskin bottles, deployed through a hole in the sea ice next to the RV Polarstern. Where possible, duplicate samples (two Niskins per depth) were collected during the up-casts near the surface (~5 m), at 10 m, at the chlorophyll maximum (~20–40 m), at 50 m, and at 100 m. From these Niskins, 1–4 litres was filtered onto Sterivex filters (0.22 µm pore size) using a peristaltic pump in a temperature-controlled lab at 1°C in the dark, using only red light. The number of Sterivex filters used per sampling event varied between two during polar night and 3–4 during polar day, depending on the biomass found in the samples. Sterivex filters were stored at -80°C until further processing took place in the laboratory.

    GEOTRACES cruises (Anderson et al., 2014), including transects GA02, GA03, GA10, and GP13, collected whole seawater using a Niskin bottle, filtering 100 mL of whole seawater between the surface and 5601 m onto 0.2 µm 25 mm polycarbonate filters. After filtration, 3 mL of sterile preservation solution (10 mM Tris, pH 8.0; 100 mM EDTA; 0.5 M NaCl) was added, and samples were stored in cryovials at -80°C until DNA extraction.

    During the 2017 and 2019 SCOPE (Simons Collaboration on Ocean Processes and Ecology) Gradients cruises, 0.7–4 L of whole seawater was collected at sea using the ship's underway system, whose intake is approximately 7 m below the surface, as well as the rosette sampler for depths between 15 and 125 m, by Mary R. Gradoville, Brittany Stewart, and Esther Wing (Zehr lab) (Gradoville et al., 2020). This water was filtered onto 0.22 µm 25 mm Supor membrane filters (Pall Corporation, New York) and stored at -80°C until DNA extraction.

    The collection of Southern Ocean transects includes 1) the IND-2017 dataset, collected during the Totten Glacier-Sabrina Coast voyage in 2017 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2017_V01, 2) the Kerguelen-Axis Marine Science program (K-AXIS) in 2016 on the Australian Antarctic Division RV Aurora Australis 2015/16 voyage 3, 3) the Global Ocean Ship-based Hydrographic Investigations Program (GO-SHIP) P15S cruise in 2016 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2016_V03, and 4) the Heard Earth-Ocean-Biosphere Interactions (HEOBI) voyage in 2016 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2016_V01. For these cruises, 2 L of whole seawater was filtered onto 0.22 µm Sterivex-GP polyethersulfone membrane filters (Millipore). This water was collected from the ship's underway system during IND-2017 by Amaranta Focardi (Paulsen Lab, Macquarie University), from between 5 and 4625 m during K-AXIS by Bruce Deagle and Lawrence Clarke (Australian Antarctic Division), from between 5 and 6015 m during GO-SHIP P15S by Eric J. Raes, Swan LS Sow and Gabriela Paniagua Cabarrus (Environmental Genomics Team, CSIRO Environment), Nicole Hellessey (University of Tasmania) and Bernhard Tschitschko (University of New South Wales), and from between 7 and 3579 m during HEOBI by Thomas Trull (CSIRO Environment). After filtration, samples were stored at -80°C until analysis.

    As part of GO-SHIP, there were several additional transects (i.e., I08S, I09N, P16 S/N), including some that also traversed into the Southern Ocean (i.e., I08S, P16S) or Arctic Ocean (P16N). For I08S and I09N, 2 L of whole seawater was filtered onto 0.22 µm 25 mm filters (Supor® hydrophilic polyethersulfone membrane) by Norm Nelson (I08S) and Elisa Halewood (I09N), UCSB, as part of the U.S. Global Ocean Ship-based Hydrographic Investigations Program aboard the R/V Roger Revelle during the cruises in 2007. Sucrose lysis buffer was added to filters, which were then stored at -80°C until DNA extraction. For P16N and P16S, samples were collected at various depths by Elisa Halewood and Meredith Meyers (Carlson Lab, UCSB) onto 0.22 µm 25 mm Supor filters during two latitudinal transects of the Pacific Ocean in 2005 and 2006 as part of the GO-SHIP repeat hydrography program (then known as CLIVAR). Samples were stored as partially extracted lysates in sucrose lysis buffer at -80°C until DNA extraction.

    Finally, for samples from the Production Observations Through Another Trans-Latitudinal Oceanic Expedition (POTATOE) cruise, 20 L of whole seawater was collected from the sea surface between 1-2 m and filtered onto 0.22 µm Sterivex® filters during a “ship of opportunity” cruise on the RVIB Nathaniel B Palmer in 2003 (Baldwin et al., 2005). Sterivex filters were stored dry at -80°C until DNA extraction.

    All datasets had corresponding environmental data. We included date, time, latitude, longitude, depth, temperature, salinity, and oxygen for all transects, and nutrient data where available. However, some cruises have other environmental data, which can be found at the British Oceanographic Data Centre https://www.bodc.ac.uk/ for both AMT cruises, at the CSIRO National Collections and Marine Infrastructure Data Trawler https://www.cmar.csiro.au/data/trawler/survey_details.cfm?survey=IN2016_V01 for IND-2017 and HEOBI, at the CLIVAR and Carbon Hydrographic Data Office https://cchdo.ucsd.edu/ for GO-SHIP P15S, P16N and P16S, at the Australian Antarctic Division Data Centre https://data.aad.gov.au/aadc/voyages/ for the K-AXIS cruise, at https://doi.org/10.6075/J0CCHLY9 for the I08S and I09N cruises, at the MGDS (Marine Geoscience Data System: https://www.marine-geo.org) for POTATOE, at https://scope.soest.hawaii.edu/data/gradients/documents/ for both SCOPE-Gradients cruises, and at PANGAEA https://www.pangaea.de/ for FRAM Strait and MOSAiC. Finally, we also used satellite data to estimate the euphotic zone depth, where photosynthetically available radiation (PAR) is 1% of its surface value (Lee et al., 2007; Kirk, 2010). We approximated the euphotic zone depth using the light attenuation at 490 nm (Kd490) product and the relationship Z_eu(1%) = 4.6 / Kd490. We also used the Longhurst-Province-Finder script https://github.com/thechisholmlab/Longhurst-Province-Finder to assign each sample to the Longhurst Province in which it was sampled, another useful column for subsetting the data and investigating specific regions of the ocean.
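    A minimal sketch of the euphotic-zone approximation stated above, Z_eu(1%) = 4.6 / Kd(490); the example Kd(490) value is illustrative only.

    ```python
    # Euphotic zone depth from the satellite Kd(490) attenuation product.
    def euphotic_depth_m(kd_490):
        """Depth (m) at which PAR falls to 1% of its surface value: 4.6 / Kd(490)."""
        return 4.6 / kd_490

    print(euphotic_depth_m(0.05))  # ~92 m for an illustrative clear-water Kd(490)
    ```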

    DNA Extraction For AMT cruises, DNA was isolated using the Qiagen AllPrep DNA/RNA Mini kit (Hilden, Germany) with modifications to be compatible with RNAlater® and to disrupt cell membranes (Varaljay et al., 2015). Briefly, the filter was removed from the Sterivex housing and immersed in RLT+ buffer that had been amended with 10 µl 1N NaOH per 1ml buffer, followed by a 2 minute agitation in a Mini-Beadbeater-96 (Biospec Inc., Bartlesville, OK, USA) with 0.1- and 0.5 mm sterile glass beads

Cite
Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81

Data from: Inventory of online public databases and repositories holding agricultural data in 2017

Related Article
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service: https://www.ars.usda.gov/
Description

United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data. Purpose As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to establish where agricultural researchers in the United States-- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals compare how much data is in institutional vs. domain-specific vs. federal platforms determine which repositories are recommended by top journals that require or recommend the publication of supporting data ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data Approach The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered. Search methods We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. 
Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each the top 50 journals. Evaluation We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind. Results A summary of the major findings from our data review: Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors. There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection. 
Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation. See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:
Resource Title: Journals. File Name: Journals.csv
Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
