100+ datasets found
  1. Data Citation Corpus Data File

    • zenodo.org
    zip
    Updated Oct 14, 2024
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773
    Available download formats: zip
    Dataset provided by
    DataCite
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2024-08-23-data-citation-corpus-01-v2.0.json.
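The naming convention described above can be parsed mechanically. The following is a minimal sketch, assuming every batch file follows the pattern of the single example name given (date, fixed `data-citation-corpus` stem, batch number, version, extension). The names `BatchFile` and `parse_batch_filename` are hypothetical and not part of the corpus tooling.

```python
import re
from datetime import date
from typing import NamedTuple


class BatchFile(NamedTuple):
    published: date   # publication date embedded in the file name
    batch: int        # batch number (01, 02, ...)
    version: str      # corpus version, e.g. "2.0"
    extension: str    # "json" or "csv"


# Pattern inferred from the one example name in the description (an assumption).
_NAME = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})-data-citation-corpus-(\d+)-v(\d+(?:\.\d+)*)\.(json|csv)$"
)


def parse_batch_filename(name: str) -> BatchFile:
    """Split a corpus batch file name into date, batch number and version."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised batch file name: {name!r}")
    year, month, day, batch, version, ext = match.groups()
    return BatchFile(date(int(year), int(month), int(day)), int(batch), version, ext)
```

Sorting parsed names by `batch` gives a stable order in which to process the files.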

    The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication

    The data file includes the following fields:

    Field         | Description                                       | Required?
    id            | Internal identifier for the citation              | Yes
    created       | Date of item's incorporation into the corpus      | Yes
    updated       | Date of item's most recent update in corpus       | Yes
    repository    | Repository where cited data is stored             | No
    publisher     | Publisher for the article citing the data         | No
    journal       | Journal for the article citing the data           | No
    title         | Title of cited data                               | No
    publication   | DOI of article where data is cited                | Yes
    dataset       | DOI or accession number of cited data             | Yes
    publishedDate | Date when citing article was published            | No
    source        | Source where citation was harvested               | Yes
    subjects      | Subject information for cited data                | No
    affiliations  | Affiliation information for creator of cited data | No
    funders       | Funding information for cited data                | No
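The Required? column of the field list can be turned into a simple completeness check. This is a minimal sketch, assuming each record has been decoded into a dictionary keyed by the field names listed above. The helper name `missing_required` is hypothetical, and treating empty strings and empty containers as missing is a choice made here rather than a rule stated by the corpus.

```python
from typing import Any, Mapping

# Fields marked "Yes" in the Required? column of the field list.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")


def missing_required(record: Mapping[str, Any]) -> list[str]:
    """Return the required field names that are absent or empty in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) in (None, "", [], {})
    ]
```

An empty result means the record carries every required field.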

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

    The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    • Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 and 30 June 2024

    Remove citation records deemed out of scope for the corpus:

    • 273,567 records from DataCite Event Data with non-citation relationship types

    • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

    • 44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication

    • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

    • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
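Two of the removal rules above (identical subj_id and obj_id, and duplicates resolved in favour of the most recent updated date) can be sketched as follows. This is not the project's own script, which is published on GitHub. It assumes v1.1 records are dictionaries using the field names quoted above and that `updated` is an ISO 8601 string. The function name `clean` is hypothetical, and the inverted subj_id/obj_id case is not handled.

```python
from typing import Any, Iterable

# Fields whose combined values define a duplicate, as listed above.
DEDUP_KEY = (
    "obj_id", "subj_id", "repository_id", "publisher_id",
    "journal_id", "accession_number", "source_id",
)


def clean(records: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop self-citations, then keep the most recently updated duplicate."""
    newest: dict[tuple, dict[str, Any]] = {}
    for rec in records:
        # A record whose subject and object are the same item is invalid.
        if rec.get("subj_id") == rec.get("obj_id"):
            continue
        key = tuple(rec.get(field) for field in DEDUP_KEY)
        kept = newest.get(key)
        # Assumes `updated` is an ISO 8601 string, so string order is date order.
        if kept is None or rec["updated"] > kept["updated"]:
            newest[key] = rec
    return list(newest.values())
```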

    Metadata enhancements:

    • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

    • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

    Data structure updates to improve usability and eliminate redundancies:

    • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

    • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

    • Remove relationTypeId fields as these are specific to Event Data only

    Full details of the above changes, including scripts used to perform the above tasks, are available in GitHub.

    While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


    Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

  2. Data from: The location of the citation: changing practices in how...

    • zenodo.org
    • data.niaid.nih.gov
    • +2 more
    text/x-python, txt
    Updated May 29, 2022
    Cite
    Christine Mayo; Todd J. Vision; Elizabeth A. Hull (2022). Data from: The location of the citation: changing practices in how publications cite original data in the Dryad Digital Repository [Dataset]. http://doi.org/10.5061/dryad.8q931
    Available download formats: text/x-python, txt
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christine Mayo; Todd J. Vision; Elizabeth A. Hull
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    While stakeholders in scholarly communication generally agree on the importance of data citation, there is no consensus on where those citations should be placed within the publication – particularly when the publication is citing original data. Recently, CrossRef and the Digital Curation Center (DCC) have recommended as a best practice that original data citations appear in the works cited sections of the article. In some fields, such as the life sciences, this contrasts with the common practice of only listing data identifier(s) within the article body (intratextually). We inquired whether data citation practice has been changing in light of the guidance from CrossRef and the DCC. We examined data citation practices from 2011 to 2014 in a corpus of 1,125 articles associated with original data in the Dryad Digital Repository. The percentage of articles that include no reference to the original data has declined each year, from 31% in 2011 to 15% in 2014. The percentage of articles that include data identifiers intratextually has grown from 69% to 83%, while the percentage that cite data in the works cited section has grown from 5% to 8%. If the proportions continue to grow at the current rate of 19-20% annually, the proportion of articles with data citations in the works cited section will not exceed 90% until 2030.

  3. October 2023 data-update for "Updated science-wide author databases of...

    • elsevier.digitalcommonsdata.com
    Updated Oct 4, 2023
    Cite
    John P.A. Ioannidis (2023). October 2023 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.6
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single recent year impact. Metrics with and without self-citations and ratio of citations to citing papers are given. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2022 and single recent year data pertain to citations received during calendar year 2022. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (6) is based on the October 1, 2023 snapshot from Scopus, updated to end of citation year 2022. This work uses Scopus data provided by Elsevier through ICSR Lab (https://www.elsevier.com/icsr/icsrlab). Calculations were performed using all Scopus author profiles as of October 1, 2023. If an author is not on the list it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.
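Of the indicators named above, the h-index has a compact standard definition that is easy to illustrate. The sketch below implements that textbook definition only. It is not the database's own calculation, and it does not cover the hm-index or the composite c-score, whose formulas are given in the associated papers rather than in this description.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h papers have h or more citations each."""
    h = 0
    # Walk papers from most to least cited; rank is the 1-based position.
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

For example, an author whose papers have 10, 8, 5, 4 and 3 citations has four papers with at least four citations each, so `h_index([10, 8, 5, 4, 3])` returns 4.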

    PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, please read the 3 associated PLoS Biology papers that explain the development, validation and use of these metrics and databases. (https://doi.org/10.1371/journal.pbio.1002501, https://doi.org/10.1371/journal.pbio.3000384 and https://doi.org/10.1371/journal.pbio.3000918).

    Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a

  4. JHU CSSE COVID-19 Data

    • kaggle.com
    zip
    Updated Oct 10, 2023
    Cite
    Anthony (2023). JHU CSSE COVID-19 Data [Dataset]. https://www.kaggle.com/datasets/anthonyylee/jhu-csse-covid-19-data
    Available download formats: zip (377698378 bytes)
    Authors
    Anthony
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Full dataset from Johns Hopkins University (JHU) Center for Systems Science and Engineering (CSSE) GitHub repository.

    This is the full and complete dataset linked from JHU CSSE GitHub repository. The intent of this dataset is to provide access to the full dataset on the platform in contrast to the various other subsets.

    Since the original GitHub repository has been archived, there are no planned updates to this dataset.

    Citation

    Please cite according to the specification in the GitHub repository README.

    Source

    COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University

    Reference

    Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1

  5. Listing of data repositories that embed schema.org metadata in dataset...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    csv
    Updated Jan 24, 2020
    Cite
    Martin Fenner; Merce Crosas; Gustavo Durand; Sarala Wimalaratne; Florian Gräf; Richard Hallett; Manuel Bernal Llinares; Uwe Schindler; Tim Clark (2020). Listing of data repositories that embed schema.org metadata in dataset landing pages [Dataset]. http://doi.org/10.5281/zenodo.1262598
    Available download formats: csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Fenner; Merce Crosas; Gustavo Durand; Sarala Wimalaratne; Florian Gräf; Richard Hallett; Manuel Bernal Llinares; Uwe Schindler; Tim Clark
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine-readable metadata available from landing pages for datasets facilitate data citation by enabling easy integration with reference managers and other tools used in a data citation workflow. Embedding these metadata using the schema.org standard with JSON-LD is emerging as the community standard. This dataset is a listing of data repositories that have implemented this approach or are in the process of doing so.
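To illustrate what consuming such embedded metadata can look like, the sketch below pulls schema.org JSON-LD out of a landing page using only the Python standard library. It assumes the metadata sits in a `<script type="application/ld+json">` element and that the top-level object has `@type` equal to `Dataset`. Pages that wrap objects in `@graph` or give `@type` as a list are not handled. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed bodies of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._chunks: list[str] = []
        self.documents: list[object] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_datasets(html: str) -> list[dict]:
    """Return every top-level JSON-LD object whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        doc for doc in parser.documents
        if isinstance(doc, dict) and doc.get("@type") == "Dataset"
    ]
```

A reference manager or harvester could then read properties such as `name` or `identifier` from each returned dictionary.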

    This is the first version of this dataset and was generated via community consultation. We expect to update this dataset, as an increasing number of data repositories adopt this approach, and we hope to see this information added to registries of data repositories such as re3data and FAIRsharing.

    In addition to the listing of data repositories we provide information on the schema.org properties supported by these data repositories, focussing on the required and recommended properties from the "Data Citation Roadmap for Scholarly Data Repositories".

  6. Data from: Data reuse and the open data citation advantage

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +3 more
    bin, csv, txt
    Updated May 28, 2022
    Cite
    Heather A. Piwowar; Todd J. Vision (2022). Data from: Data reuse and the open data citation advantage [Dataset]. http://doi.org/10.5061/dryad.781pv
    Available download formats: bin, csv, txt
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Heather A. Piwowar; Todd J. Vision
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

    Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

    Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

  7. Air-LUSI - Lunar Spectral Irradiance Data Repository: 2022 Data

    • nist.gov
    • data.nist.gov
    Updated Jan 23, 2025
    Cite
    National Institute of Standards and Technology (2025). Air-LUSI - Lunar Spectral Irradiance Data Repository: 2022 Data [Dataset]. http://doi.org/10.18434/mds2-3397
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    In March 2022, the Air-LUSI instrument measured the lunar spectral irradiance on four nights from NASA's high-altitude ER-2 aircraft. The data set includes data from: 1) characterization and calibration of the Air-LUSI instrument and the transfer standards used to calibrate the instrument in NetCDF format, 2) the geolocated lunar irradiance data acquired by the instrument in NetCDF format, and 3) usage examples hosted at https://github.com/usnistgov/air-lusi along with copies of the above. If you prefer not to follow the Python workflow given at the GitHub site, a variety of tools for viewing and manipulating NetCDF files are linked to here: https://www.unidata.ucar.edu/software/netcdf/software.html

  8. Data from development and evaluation of SASCA-s: Scalable Agent-based...

    • databank.illinois.edu
    Updated Aug 16, 2025
    Cite
    Minhyuk Park; João AC Lamy; Esther CC Rodrigues; Felipe Mariano Ferreira; The-Anh Vu-Le; Tandy Warnow; George Chacko (2025). Data from development and evaluation of SASCA-s: Scalable Agent-based Simulator for Citation Analysis with simulation [Dataset]. http://doi.org/10.13012/B2IDB-3926377_V1
    Authors
    Minhyuk Park; João AC Lamy; Esther CC Rodrigues; Felipe Mariano Ferreira; The-Anh Vu-Le; Tandy Warnow; George Chacko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    Illinois:Insper Partnership
    Description

    The data within consist of compressed output files in the form of edgelists (.edgelist.gz) and nodelists (.aux.parquet) from large citation network simulations using an agent-based model. The code and instructions are available at: https://github.com/illinois-or-research-analytics/SASCA. In addition, we provide a distribution of citation frequencies drawn from a random sample of PubMed journal articles (pooled_50k_pubmed_unique.csv) and a table of recencies, that is, the frequency with which citations are made to the previous year, the year before that, and so on (recency_probs_percent_stahl_filled.csv). A manuscript describing the SASCA-s simulator has been submitted for review and will be referenced in a future version of this data repository if it is accepted. The prefixes sj and er refer, respectively, to the real-world network and the Erdős-Rényi random graph that were used to initiate simulations. These 'seed' networks are available from the GitHub site referenced above.
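A gzip-compressed edgelist of this kind can be streamed without unpacking it to disk. The sketch below counts incoming citations per node. The layout of the .edgelist.gz files is not documented in this description, so the code assumes one edge per line, two node identifiers separated by a comma or whitespace, with the citing node first, and no header row. The function names are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def citation_counts(lines: Iterable[str]) -> Counter:
    """Count incoming citations per node from 'citing,cited' edge lines."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.replace(",", " ").split()
        if len(parts) < 2 or line.startswith("#"):
            continue  # skip blanks, comments and malformed rows
        _citing, cited = parts[0], parts[1]
        counts[cited] += 1
    return counts


def read_edgelist(path: str) -> Counter:
    """Stream a gzip-compressed edgelist and return per-node citation counts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return citation_counts(handle)
```

If the files turn out to carry a header row or the opposite column order, the unpacking line is the only place that needs to change.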

  9. Subset of Data Citation Corpus version 4

    • kaggle.com
    zip
    Updated Aug 14, 2025
    Cite
    Roderic D. M. Page (2025). Subset of Data Citation Corpus version 4 [Dataset]. https://www.kaggle.com/datasets/rdmpage/subset-of-data-citation-corpus-version-4
    Available download formats: zip (59591902 bytes)
    Authors
    Roderic D. M. Page
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a subset of version 4.0 of the Data Citation Corpus. It contains article_ids as cleaned DOIs, dataset ids (e.g., accession numbers, DOIs) and the name of the repository of the data (e.g., Dryad, European Nucleotide Archive). It was extracted from the file 2025-07-27-data-citation-corpus-01-v4.0.json which is one of 11 JSONL files in the corpus.
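An extraction of this kind can be sketched as a single pass over a JSONL file. The description does not say how the DOIs were cleaned, so the sketch assumes cleaning means stripping a resolver or `doi:` prefix and lowercasing (DOIs are case-insensitive). It also assumes the records use the `publication`, `dataset` and `repository` field names documented for v2.0 of the corpus, which may differ in v4.0. The function names are hypothetical.

```python
import json
import re
from typing import Iterable, Iterator

# Leading resolver URL or "doi:" scheme to strip (assumed cleaning rule).
_RESOLVER = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def clean_doi(value: str) -> str:
    """Strip a resolver or 'doi:' prefix and lowercase the DOI."""
    return _RESOLVER.sub("", value.strip()).lower()


def iter_subset(lines: Iterable[str]) -> Iterator[tuple[str, str, object]]:
    """Yield (article DOI, dataset id, repository) from JSONL lines."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield (
            clean_doi(record["publication"]),
            record["dataset"],          # left as-is: may be an accession number
            record.get("repository"),   # optional field
        )
```

Passing an open file handle as `lines` keeps memory use flat, since records are decoded one line at a time.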

  10. Reference Guide - Treatments Found in the PTSD Repository

    • catalog.data.gov
    • ptsd-va.data.socrata.com
    Updated Jun 2, 2025
    Cite
    National Center for PTSD (2025). Reference Guide - Treatments Found in the PTSD Repository [Dataset]. https://catalog.data.gov/dataset/reference-guide-treatments-found-in-the-ptsd-repository
    Dataset provided by
    National Center for PTSD
    Description

    This document contains brief descriptions of many of the treatments found in the PTSD Repository, organized by treatment category. Note: The download is a .zip file which contains the PDF Reference Guide.

  11. Toward a Reproducible Research Data Repository

    • data.depositar.io
    mp4, pdf
    Updated Jan 26, 2024
    Cite
    depositar (2024). Toward a Reproducible Research Data Repository [Dataset]. https://data.depositar.io/dataset/reproducible-research-data-repository
    Available download formats: pdf (212638), pdf (2586248), pdf (627064), mp4 (22141307)
    Dataset provided by
    depositar
    Description

    Collected in this dataset are the slideset and abstract for a presentation on Toward a Reproducible Research Data Repository by the depositar team at International Symposium on Data Science 2023 (DSWS 2023), hosted by the Science Council of Japan in Tokyo on December 13-15, 2023. The conference was organized by the Joint Support-Center for Data Science Research (DS), Research Organization of Information and Systems (ROIS) and the Committee of International Collaborations on Data Science, Science Council of Japan. The conference programme is also included as a reference.

    Title

    Toward a Reproducible Research Data Repository

    Author(s)

    Cheng-Jen Lee, Chia-Hsun Ally Wang, Ming-Syuan Ho, and Tyng-Ruey Chuang

    Affiliation of presenter

    Institute of Information Science, Academia Sinica, Taiwan

    Summary of Abstract

    The depositar (https://data.depositar.io/) is a research data repository at Academia Sinica (Taiwan) open to researchers worldwide for the deposit, discovery, and reuse of datasets. The depositar software itself is open source and builds on top of CKAN. CKAN, an open source project initiated by the Open Knowledge Foundation and sustained by an active user community, is a leading data management system for building data hubs and portals. In addition to CKAN's out-of-the-box features such as JSON data API and in-browser preview of uploaded data, we have added several features to the depositar, including sourcing from Wikidata as dataset keywords, a citation snippet for datasets, in-browser Shapefile preview, and a persistent identifier system based on ARK (Archival Resource Keys). At the same time, the depositar team faces an increasing demand for interactive computing (e.g. Jupyter Notebook), which facilitates not just data analysis but also the replication and demonstration of scientific studies. Recently, we have provided a JupyterHub service (a multi-tenancy JupyterLab) to some of the depositar's users. However, it still requires users to first download the data files (or copy the URLs of the files) from the depositar, then upload the data files (or paste the URLs) to the Jupyter notebooks for analysis. Furthermore, a JupyterHub deployed on a single server is limited by its processing power, which may lower the service level to the users. To address the above issues, we are integrating BinderHub into the depositar. BinderHub (https://binderhub.readthedocs.io/) is a Kubernetes-based service that allows users to create interactive computing environments from code repositories. Once the integration is completed, users will be able to launch Jupyter Notebooks to perform data analysis and visualization without leaving the depositar by clicking the BinderHub buttons on the datasets. In this presentation, we will first make a brief introduction to the depositar and BinderHub along with their relationship, then we will share our experiences in incorporating interactive computation in a data repository. We shall also evaluate the possibility of integrating the depositar with other automation frameworks (e.g. the Snakemake workflow management system) in order to enable users to reproduce data analysis.
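The download-then-upload step described in the abstract starts from the file URLs that CKAN's JSON API returns. The sketch below builds a request URL for the CKAN Action API `package_show` call and extracts resource URLs from a decoded response. It assumes the depositar exposes the standard Action API at CKAN's default path, which the abstract implies but does not state, and it performs no network access itself. The helper names are hypothetical.

```python
from urllib.parse import urlencode, urljoin


def package_show_url(site: str, dataset_id: str) -> str:
    """Build the CKAN Action API URL that returns one dataset's metadata."""
    base = site if site.endswith("/") else site + "/"
    return urljoin(base, "api/3/action/package_show") + "?" + urlencode({"id": dataset_id})


def resource_urls(response: dict) -> list[tuple[str, str]]:
    """Extract (format, url) pairs from a decoded package_show response."""
    # CKAN wraps every Action API reply in {"success": ..., "result": ...}.
    if not response.get("success"):
        raise ValueError("CKAN reported failure")
    return [
        (res.get("format", ""), res.get("url", ""))
        for res in response["result"].get("resources", [])
    ]
```

The URLs returned by `resource_urls` are what a user would otherwise copy by hand into a notebook.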

    Keywords

    BinderHub, CKAN, Data Repositories, Interactive Computing, Reproducible Research

  12. Data from the International Open Data Repository Survey

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1 more
    Updated May 25, 2022
    Cite
    von der Heyde, Markus (2022). Data from the International Open Data Repository Survey [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_2643492
    Dataset provided by
    vdH-IT
    Authors
    von der Heyde, Markus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file collection is part of the ORD Landscape and Cost Analysis Project (DOI: 10.5281/zenodo.2643460), a study jointly commissioned by the SNSF and swissuniversities in 2018.

    Please cite this data collection as: von der Heyde, M. (2019). Data from the International Open Data Repository Survey. Retrieved from https://doi.org/10.5281/zenodo.2643493

    Further information is given in the corresponding data paper: von der Heyde, M. (2019). International Open Data Repository Survey: Description of collection, collected data, and analysis methods [Data paper]. Retrieved from https://doi.org/10.5281/zenodo.2643450

    Contact

    Swiss National Science Foundation (SNSF)

    Open Research Data Group

    E-mail: ord@snf.ch

    swissuniversities

    Program "Scientific Information"

    Gabi Schneider

    E-Mail: isci@swissuniversities.ch

  13. Global News Index and Extracted Features Repository

    • databank.illinois.edu
    Updated Jun 15, 2021
    Cite
    (2021). Global News Index and Extracted Features Repository [Dataset]. http://doi.org/10.13012/B2IDB-5649852_V1
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence. Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful for users as they refine their queries.

    Additional Resources:

    • Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, you can fill out the Archer User Information Form.

    • Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the Archer User Feedback Form.

    • The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this form to subscribe to Archer Users Group.

    Citation Guidelines:

    1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation: Cline Center for Advanced Social Research. 2020. Global News Index and Extracted Features Repository [codebook]. Champaign, IL: University of Illinois. doi:10.13012/B2IDB-5649852_V1

    2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access): Cline Center for Advanced Social Research. 2020. Global News Index and Extracted Features Repository [database]. Champaign, IL: University of Illinois. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V1

  14. Mendeley Data

    • scicrunch.org
    Updated Dec 11, 2022
    (2022). Mendeley Data [Dataset]. http://doi.org/10.25504/FAIRsharing.3epmpp
    Dataset updated
    Dec 11, 2022
    Description

    Cloud-based data repository for storing, publishing and accessing scientific data. Mendeley Data creates a permanent location and issues Force 11 compliant citations for uploaded data.

  15. citeseer

    • networkrepository.com
    csv
    Updated Sep 6, 2014
    + more versions
    Network Data Repository (2014). citeseer [Dataset]. https://networkrepository.com/ca-citeseer.php
    Available download formats: csv
    Dataset updated
    Sep 6, 2014
    Dataset authored and provided by
    Network Data Repository
    License

    https://networkrepository.com/policy.php

    Description
    • Citation network extracted from the CiteSeer digital library. Nodes are publications and the directed edges denote citations.
  16. Data Axle Reference Solutions U.S. Business Location Data

    • archive.data.jhu.edu
    • databases.library.jhu.edu
    Updated May 26, 2022
    Data Axle Reference Solutions (2022). Data Axle Reference Solutions U.S. Business Location Data [Dataset]. http://doi.org/10.7281/T1/P69KYX
    Dataset updated
    May 26, 2022
    Dataset provided by
    Johns Hopkins Research Data Repository
    Authors
    Data Axle Reference Solutions
    License

    https://archive.data.jhu.edu/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.7281/T1/P69KYX

    Time period covered
    2017 - 2020
    Area covered
    United States
    Description

    Data Axle Reference Solutions, formerly ReferenceUSA, contains two directories: residential and business. Records include over 12 million U.S. businesses and 120 million U.S. residents. Businesses include private, public, and non-profit organizations, regardless of employee size or sales. This dataset of Data Axle’s business database provides 52 attributes about tens of millions of businesses across the United States for almost every business from the Fortune 500 down to mom-and-pop shops and work-from-home freelancers. The Data Axle business database is available in its entirety for the years 2017 to 2020. The data can be downloaded as a single comma-separated values (.csv) file for each year of interest. Each file is approximately 5 GB in size after decompressing the .zip archive. Loading the .csv into memory requires a minimum of 32 GB of RAM. To make the Data Axle data accessible on low-memory systems, the .csv file for each year has been split into subsets by US Census-defined geographic regions, as well as by the more granular geographic divisions. The file census-regions-divisions.csv identifies the states and territories that belong to each region (5 regions plus territories) and division (9 divisions plus territories).

  17. A2.2a Digital repositories data citation practices. Supplementary material

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Aug 4, 2023
    Bettella, Cristiana; Apostolico, Mauro; Cappellato, Linda; Carrer, Yuri; Felicetti, Achille; Turetta, Giulio (2023). A2.2a Digital repositories data citation practices. Supplementary material [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_8188805
    Dataset updated
    Aug 4, 2023
    Dataset provided by
    Polo Universitario Città di Prato
    University of Padua
    Authors
    Bettella, Cristiana; Apostolico, Mauro; Cappellato, Linda; Carrer, Yuri; Felicetti, Achille; Turetta, Giulio
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data to complement the quantitative analysis of data citation practices in digital repositories based on metadata records from the re3data.org repositories registry.

    Data was retrieved using re3data.org API on 23-02-2023 and 06-03-2023 and processed using the OpenRefine software.

    Part of "A FAIR-enabling citation model for Cultural Heritage Objects" project activities.

  18. Nordic44 - 2015 Powerflow Data: An Open Data Repository of an Equivalent...

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Luigi Vanfretti; Svein H. Olsen; V. S. Narasimham Arava; Giuseppe Laera; Ali Bidadfar; Tin Rabuzin; Sigurd H. Jakobsen; Jan Lavenius; Maxime Baudette; Francisco J. Gómez Lopez (2020). Nordic44 - 2015 Powerflow Data: An Open Data Repository of an Equivalent Nordic Grid Model Matched to Historical Electricity Market Data for 2015 [Dataset]. http://doi.org/10.5281/zenodo.162907
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luigi Vanfretti; Svein H. Olsen; V. S. Narasimham Arava; Giuseppe Laera; Ali Bidadfar; Tin Rabuzin; Sigurd H. Jakobsen; Jan Lavenius; Maxime Baudette; Francisco J. Gómez Lopez
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository is used to provide documentation related to the model and data development process, provide source (raw) data for the model in different forms (i.e. Modelica, CIM 14, and PSS/E) for an equivalent Nordic grid model that has been matched to historical power flow data.

    The repository is documented in the paper below, see [Ref00].

    Using this model, data or related software = cite our publications!

    We are happy to contribute this dataset. If you use any of the data or software provided, we would appreciate it if you cite the following publications as follows:

    A) Cite that "the raw and processed data files corresponding to the model are available as an open data set and documented in [Ref00]."

    B) Cite the first appearance of the model, i.e. "the model is first presented in [Ref01]."

    [Ref00] L. Vanfretti, S. H. Olsen, V. S. Narasimham Arava, G. Laera, A. Bidadfar, T. Rabuzin, S. H. Jakobsen, J. Lavenius, and M. Baudette, "An Open Data Repository and a Data Processing Software Toolset of an Equivalent Nordic Grid Model Matched to Historical Electricity Market Data," submitted for publication, Data in Brief, 2016.

    [Ref01] L. Vanfretti, T. Rabuzin, M. Baudette, M. Murad, iTesla Power Systems Library (iPSL): A Modelica library for phasor time-domain simulations, SoftwareX, Available online 18 May 2016, ISSN 2352-7110, http://dx.doi.org/10.1016/j.softx.2016.05.001.

    Acknowledgment:

    This model was originally developed in the context of the FP7 iTesla project, and further extended within the ITEA3 openCPS project.

    Structure of the repository:

    01_PSSE_Resources:

    1. Models :

      • A folder with PSS/E files of the base case

      • A folder with a 7zip archive containing files of the original N44 system that has been modified to have the PSS/E base case

    2. Snapshots :

      • N44_2015xxxx are folders named according to the day they refer to (for example N44_20150401 refers to the 1st of April 2015). In each folder there are Excel files (Consumption_xx.xlsx, Exchange_xx.xlsx, Production_xx.xlsx) with data downloaded from Nord Pool website, an Excel file (PSSE_in_out.xlsx) summarizing the results from the Python script Nordic44.py in the folder 04_Python_Resources, PSS/E snapshots for each hour before solving the power flow (hx_before_PF.raw) and after solving the power flow (hx_after_PF.raw)

      • N44_BC.sav is the PSS/E solved base case used by the Python script Nordic44.py (put the reference)

    02_CIM14_Snapshots:

    • N44_2015xxxx are folders named according to the day they refer to (e.g. N44_20150401 refers to the 1st of April 2015). In each folder there are CIM files for each hour (N44_hx_EQ.xml, N44_hx_SV.xml, N44_hx_TP.xml)

    • N44_noOL_RDFIDMAP.xml is the file with IDs mapping of those cases (N44_hx_noOL_EQ.xml, N44_hx_noOL_SV.xml, N44_hx_noOL_TP.xml) with fixed overloading problems.

    • N44_RDFIDMAP_2015-1.xml and N44_RDFIDMAP_2015-2.xml are the files with IDs mapping of the remaining snapshots from 2015

    03_Modelica:

    1. iTesla_Platform

      • iPSL folder contains the version of the library which can be used to simulate snapshots generated from the iTesla Platform

      • Modelica_snapshots Modelica models generated from the snapshots by iTesla Platform

    2. SmarTSLab

      • OpenIPSL folder contains the version of the forked iPSL library which can be used to simulate the manually generated Modelica model of N44 with the record structures corresponding to the snapshots

      • Snapshots folder contains Modelica records automatically generated from the PSS/E records

      • N44_Base_Case.mo is the handmade N44 model with the loaded record of the power flow results from the PSS/E base case. It can be used to load other PF results from the folder 03_Modelica/Snapshots

  19. INGENIOUS - Great Basin Regional Dataset Compilation

    • gdr.openei.org
    • data.openei.org
    • +3more
    archive, website
    Updated Jun 30, 2022
    + more versions
    Bridget Ayling; James Faulds; Anieri Morales Rivera; Richard Koehler; Cornelis Kreemer; Elijah Mlawsky; Mark Coolbaugh; Rachel Micander; Craig dePolo; Kurt Kraal; Nicole Wagoner; Drew Siler; Jacob DeAngelo; Jonathan Glen; Jared Peacock; Joseph Batir; Emilie Gentry; Claudio Berti; Zach Lifton; Alexandra Clark; Stefan Kirby; Christian Hardwick; Emily Kleber (2022). INGENIOUS - Great Basin Regional Dataset Compilation [Dataset]. http://doi.org/10.15121/1881483
    Available download formats: archive, website
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Renewable Power Office. Geothermal Technologies Program (EE-4G)
    Geothermal Data Repository
    GBCGE, NBMG, UNR
    Authors
    Bridget Ayling; James Faulds; Anieri Morales Rivera; Richard Koehler; Cornelis Kreemer; Elijah Mlawsky; Mark Coolbaugh; Rachel Micander; Craig dePolo; Kurt Kraal; Nicole Wagoner; Drew Siler; Jacob DeAngelo; Jonathan Glen; Jared Peacock; Joseph Batir; Emilie Gentry; Claudio Berti; Zach Lifton; Alexandra Clark; Stefan Kirby; Christian Hardwick; Emily Kleber
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Great Basin
    Description

    This is the regional dataset compilation for the INnovative Geothermal Exploration through Novel Investigations Of Undiscovered Systems (INGENIOUS) project. The primary goal of this project is to accelerate discoveries of new, commercially viable hidden geothermal systems while reducing the exploration and development risks for all geothermal resources. These datasets will be used in INGENIOUS as input features for predicting geothermal favorability throughout the Great Basin study area.

    Datasets consist of shapefiles, geotiffs, tabular spreadsheets, and metadata that describe: 2-meter temperature probe surveys, quaternary faults and volcanic features, geodetic shear and dilation models, heat flow, magnetotellurics (conductance), magnetics, gravity, paleogeothermal features (such as sinter and tufa deposits), seismicity, spring and well temperatures, spring and well aqueous geochemistry analyses, thermal conductivity, and fault slip and dilation tendency.

    For additional project information, see the INGENIOUS project site linked in the submission.

    Terms of use: These datasets are provided "as is", and the contributors assume no responsibility for any errors or omissions. The user assumes the entire risk associated with their use of these data and bears all responsibility in determining whether these data are fit for their intended use. These datasets may be redistributed with attribution (see citation information below). Please refer to the license information on this page for full licensing terms and conditions.

  20. Barriers to Data and Software Sharing: The Reference Manager Gap

    • esip.figshare.com
    pdf
    Updated Jan 17, 2024
    Natalie Raia; Kristina Vrouwenvelder (2024). Barriers to Data and Software Sharing: The Reference Manager Gap [Dataset]. http://doi.org/10.6084/m9.figshare.25012655.v1
    Available download formats: pdf
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    ESIP
    Authors
    Natalie Raia; Kristina Vrouwenvelder
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AGU and other publishers have developed guidance for authors to support best practices for data and software sharing and citation, but there remains a significant gap in knowledge and implementation of these practices amongst scientists during the publication process. Reference managers are an important tool to facilitate uptake of data and software citation, but this infrastructure is not yet adequately developed for these applications. In this poster, we compare and contrast dataset and software citation capabilities amongst major reference managers and numerous data repositories to begin a conversation about technical improvements to this critical component of FAIR data infrastructure. This poster was presented during the 2024 January Earth Science Information Partners (ESIP) Meeting held virtually (Jan. 23-26, 2024).

DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773

Data Citation Corpus Data File

Available download formats: zip
Dataset updated
Oct 14, 2024
Dataset provided by
DataCite
License

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g. 2024-08-23-data-citation-corpus-01-v2.0.json.
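
The naming scheme above can be unpacked programmatically when sorting or selecting batches. The following is a minimal sketch, assuming every batch file follows the pattern of the single example given and that the CSV files use the same scheme with a .csv extension; the function name parse_batch_filename is hypothetical and not part of the corpus tooling.

```python
import re

# Pattern inferred from the one example file name in the description.
# Assumption: CSV batches follow the same scheme with a ".csv" extension.
BATCH_NAME = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-data-citation-corpus-"
    r"(?P<batch>\d+)-v(?P<version>[\d.]+)\.(?P<ext>json|csv)$"
)


def parse_batch_filename(name):
    """Return (publication_date, batch_number, version, extension), or None."""
    match = BATCH_NAME.match(name)
    if match is None:
        return None
    return (
        match.group("date"),
        int(match.group("batch")),
        match.group("version"),
        match.group("ext"),
    )


print(parse_batch_filename("2024-08-23-data-citation-corpus-01-v2.0.json"))
```

File names that do not match the pattern return None, so unrelated files in the same archive can be skipped without raising an error.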

The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

Each data citation record comprises:

  • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

  • Metadata for the cited dataset and for the citing publication

The data file includes the following fields:

Field          Description                                        Required?
id             Internal identifier for the citation               Yes
created        Date of item's incorporation into the corpus       Yes
updated        Date of item's most recent update in corpus        Yes
repository     Repository where cited data is stored              No
publisher      Publisher for the article citing the data          No
journal        Journal for the article citing the data            No
title          Title of cited data                                No
publication    DOI of article where data is cited                 Yes
dataset        DOI or accession number of cited data              Yes
publishedDate  Date when citing article was published             No
source         Source where citation was harvested                Yes
subjects       Subject information for cited data                 No
affiliations   Affiliation information for creator of cited data  No
funders        Funding information for cited data                 No

Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
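
The required/optional split in the field list lends itself to a simple sanity check when reading records. The sketch below assumes each record has already been parsed into a Python dict keyed by the field names listed above; the helper name missing_required and the sample values are illustrative only and do not come from the corpus.

```python
# Fields marked "Yes" in the Required? column of the field list.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")


def missing_required(record):
    """Return the required field names that are absent or empty in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) in (None, "", [])
    ]


# Hypothetical record with made-up values, used only to exercise the check.
sample = {
    "id": "example-id",
    "created": "2024-01-01",
    "updated": "2024-06-30",
    "publication": "10.1234/example.article",
    "dataset": "10.5678/example.dataset",
    "source": "datacite",
}

print(missing_required(sample))
print(missing_required({"id": "example-id"}))
```

Optional fields such as repository or funders are deliberately ignored here, since the field list marks them as not required.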

The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

Add and update Event Data citations:

  • Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 and 30 June 2024

Remove citation records deemed out of scope for the corpus:

  • 273,567 records from DataCite Event Data with non-citation relationship types

  • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

  • 44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication

  • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

  • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
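
The duplicate rule in the last item above (same values across the seven listed fields, keep the record with the most recent updated date) can be sketched as follows. This is an illustration only, not the project's actual script, which the description says is available in GitHub. It assumes records are dicts using the v1.1 field names quoted in the list, and that updated is a string in one consistent ISO 8601 format so that string comparison matches chronological order.

```python
# v1.1 field names quoted in the duplicate-removal rule.
KEY_FIELDS = (
    "obj_id", "subj_id", "repository_id", "publisher_id",
    "journal_id", "accession_number", "source_id",
)


def deduplicate(records):
    """Keep one record per key, preferring the most recent 'updated' value."""
    latest = {}
    for record in records:
        key = tuple(record.get(field) for field in KEY_FIELDS)
        kept = latest.get(key)
        # Assumption: ISO 8601 strings in the same format compare chronologically.
        if kept is None or record["updated"] > kept["updated"]:
            latest[key] = record
    return list(latest.values())


# Made-up records used only to exercise the function.
base = {
    "obj_id": "pub-1", "subj_id": "data-1", "repository_id": "repo-1",
    "publisher_id": "pubr-1", "journal_id": "jrnl-1",
    "accession_number": None, "source_id": "czi", "updated": "2023-01-01",
}
records = [base, dict(base, updated="2024-01-01"), dict(base, obj_id="pub-2")]

print(len(deduplicate(records)))
```

A dict keyed by the tuple of compared fields keeps the pass linear in the number of records, which matters at the scale of several million rows.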

Metadata enhancements:

  • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

  • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

Data structure updates to improve usability and eliminate redundancies:

  • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

  • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

  • Remove relationTypeId fields as these are specific to Event Data only

Full details of the above changes, including scripts used to perform the above tasks, are available in GitHub.

While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
