License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, e.g., 2024-08-23-data-citation-corpus-01-v2.0.json.
The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record comprises:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
| Field | Description | Required? |
| --- | --- | --- |
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| publication | DOI of article where data is cited | Yes |
| dataset | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
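As a quick orientation to the structure above, the following Python sketch loads one batch file and tallies citations per repository. It assumes the batch (named as in the example file name above) contains a JSON array of citation records with the listed fields; the actual top-level structure and nesting should be verified against the file itself.

```python
import json
from collections import Counter

# Sketch: load one batch of the corpus and tally citations per repository.
# Assumes the batch file contains a JSON array of citation records with the
# fields listed in the table above; verify against the released file.
with open("2024-08-23-data-citation-corpus-01-v2.0.json", encoding="utf-8") as f:
    records = json.load(f)

# 'repository' may be a plain string or a nested object; adjust as needed.
by_repository = Counter(str(rec.get("repository") or "unknown") for rec in records)
for repo, n in by_repository.most_common(10):
    print(f"{repo}: {n}")
```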
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 and 30 June 2024
Remove citation records deemed out of scope for the corpus:
273,567 records from DataCite Event Data with non-citation relationship types
28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
44,117 invalid citations where the subj_id value was the same as the obj_id value, or where subj_id and obj_id were inverted, indicating a citation from a dataset to a publication
473,792 citations to invalid accession numbers from CZI data present in v1.1, resulting from false positives in the algorithm used to identify mentions
4,110,019 duplicate records from CZI data present in v1.1 where metadata is identical for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (in each case, the record with the most recent updated date was retained; see the deduplication sketch after this list)
Metadata enhancements:
Apply Field of Science subject terms to citation records originating from CZI, based on the disciplinary area of the data repository
Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)
Data structure updates to improve usability and eliminate redundancies:
Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
Remove accessionNumber and doi elements to eliminate redundancy with subj_id
Remove relationTypeId fields as these are specific to Event Data only
Full details of the above changes, including the scripts used to perform them, are available on GitHub.
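For illustration, here is a rough pandas sketch of the deduplication rule described above, assuming the citation records have been flattened into a table whose column names match the internal field names listed in the change note; the actual column names and file layout may differ from the released files.

```python
import pandas as pd

# Sketch of the v2 deduplication rule: among records sharing the same values
# for the listed metadata columns, keep only the one with the latest
# `updated` date. Column names follow the prose above and are assumptions.
dedup_keys = ["obj_id", "subj_id", "repository_id", "publisher_id",
              "journal_id", "accession_number", "source_id"]

citations = pd.read_csv("citations.csv")  # hypothetical flattened export
citations["updated"] = pd.to_datetime(citations["updated"])

deduplicated = (
    citations.sort_values("updated")
             .drop_duplicates(subset=dedup_keys, keep="last")
)
print(len(citations) - len(deduplicated), "duplicates removed")
```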
While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
While stakeholders in scholarly communication generally agree on the importance of data citation, there is no consensus on where those citations should be placed within the publication – particularly when the publication is citing original data. Recently, CrossRef and the Digital Curation Centre (DCC) have recommended as a best practice that original data citations appear in the works cited section of the article. In some fields, such as the life sciences, this contrasts with the common practice of only listing data identifier(s) within the article body (intratextually). We inquired whether data citation practice has been changing in light of the guidance from CrossRef and the DCC. We examined data citation practices from 2011 to 2014 in a corpus of 1,125 articles associated with original data in the Dryad Digital Repository. The percentage of articles that include no reference to the original data has declined each year, from 31% in 2011 to 15% in 2014. The percentage of articles that include data identifiers intratextually has grown from 69% to 83%, while the percentage that cites data in the works cited section has grown from 5% to 8%. If the proportions continue to grow at the current rate of 19-20% annually, the proportion of articles with data citations in the works cited section will not exceed 90% until 2030.
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Data are shown separately for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2022, and single recent year data pertain to citations received during calendar year 2022. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (6) is based on the October 1, 2023 snapshot from Scopus, updated to end of citation year 2022. This work uses Scopus data provided by Elsevier through ICSR Lab (https://www.elsevier.com/icsr/icsrlab). Calculations were performed using all Scopus author profiles as of October 1, 2023. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.
PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases.
The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, please read the 3 associated PLoS Biology papers that explain the development, validation and use of these metrics and databases. (https://doi.org/10.1371/journal.pbio.1002501, https://doi.org/10.1371/journal.pbio.3000384 and https://doi.org/10.1371/journal.pbio.3000918).
Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Full dataset from Johns Hopkins University (JHU) Center for Systems Science and Engineering (CSSE) GitHub repository.
This is the full and complete dataset linked from the JHU CSSE GitHub repository. The intent of this dataset is to provide access to the full dataset on the platform, in contrast to the various other subsets.
Since the original GitHub repository has been archived, there are no planned updates to this dataset.
For all citations, please cite according to the specification in the GitHub repository README.
Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine-readable metadata available from landing pages for datasets facilitate data citation by enabling easy integration with reference managers and other tools used in a data citation workflow. Embedding these metadata using the schema.org standard with JSON-LD is emerging as the community standard. This dataset is a listing of data repositories that have implemented this approach or are in the process of doing so.
This is the first version of this dataset and was generated via community consultation. We expect to update this dataset, as an increasing number of data repositories adopt this approach, and we hope to see this information added to registries of data repositories such as re3data and FAIRsharing.
In addition to the listing of data repositories, we provide information on the schema.org properties supported by these data repositories, focusing on the required and recommended properties from the "Data Citation Roadmap for Scholarly Data Repositories".
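As a hedged illustration of the approach these repositories take, the Python sketch below builds a minimal schema.org Dataset description and wraps it in the JSON-LD script tag that would be embedded in a landing page. The property selection is an approximation of commonly required and recommended properties; the authoritative list is given in the roadmap itself, and all identifiers and URLs below are placeholders.

```python
import json

# Minimal illustration of schema.org Dataset metadata embedded as JSON-LD in
# a landing page. Property selection is an assumption; consult the roadmap
# for the exact required/recommended set.
dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset title",
    "identifier": "https://doi.org/10.1234/example",        # hypothetical DOI
    "url": "https://repository.example.org/dataset/42",     # hypothetical landing page
    "publisher": {"@type": "Organization", "name": "Example Repository"},
    "datePublished": "2019-01-01",
}

html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_jsonld, indent=2)
    + "\n</script>"
)
print(html_snippet)
```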
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.
Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.
Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
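As a purely illustrative sketch (not the authors' analysis or data), the following Python snippet shows the general shape of such a multivariate model: regressing log-transformed citation counts on a data-availability indicator while controlling for a few covariates, using synthetic data. Variable names and the covariate set are placeholders; the real study used many more covariates and 10,555 papers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy illustration only: synthetic data standing in for the study's corpus.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "data_available": rng.integers(0, 2, n),
    "impact_factor": rng.gamma(2.0, 2.0, n),
    "n_authors": rng.integers(1, 15, n),
    "pub_year": rng.integers(2001, 2010, n),
})
df["citations"] = rng.poisson(
    np.exp(0.5 + 0.09 * df["data_available"] + 0.2 * np.log(df["impact_factor"] + 1))
)

# OLS on log citations with a data-availability indicator and covariates.
model = smf.ols(
    "np.log1p(citations) ~ data_available + np.log(impact_factor + 1)"
    " + n_authors + C(pub_year)",
    data=df,
).fit()
print(model.params["data_available"])  # rough analogue of a 'citation benefit'
```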
License: https://www.nist.gov/open/license
In March 2022, the Air-LUSI instrument measured the lunar spectral irradiance on four nights from NASA's high-altitude ER-2 aircraft. The data set includes data from: 1) characterization and calibration of the Air-LUSI instrument and the transfer standards used to calibrate the instrument in NetCDF format, 2) the geolocated lunar irradiance data acquired by the instrument in NetCDF format, and 3) usage examples hosted at https://github.com/usnistgov/air-lusi along with copies of the above. If you prefer not to follow the Python workflow given at the GitHub site, a variety of tools for viewing and manipulating NetCDF files are linked here: https://www.unidata.ucar.edu/software/netcdf/software.html
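As one example of working with the NetCDF files outside the GitHub workflow, the sketch below opens a file with xarray and lists its variables and units. The file name is a placeholder; variable and attribute names should be taken from the files' own metadata.

```python
import xarray as xr

# Sketch: open one of the Air-LUSI NetCDF files with xarray (one of many
# NetCDF-capable tools). The file name below is a placeholder.
ds = xr.open_dataset("air-lusi_lunar_irradiance_example.nc")

print(ds)        # dimensions, coordinates, variables, attributes
print(ds.attrs)  # global metadata (instrument, calibration notes, ...)
for name, var in ds.data_vars.items():
    print(name, var.dims, var.attrs.get("units", "unknown units"))
```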
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data within consist of compressed output files in the form of edgelists (.edgelist.gz) and nodelists (.aux.parquet) from large citation network simulations using an agent-based model. The code and instructions are available at: https://github.com/illinois-or-research-analytics/SASCA. In addition, we provide a distribution of citation frequencies drawn from a random sample of PubMed journal articles (pooled_50k_pubmed_unique.csv) and a table of recencies: the frequency with which citations are made to the previous year, the year before that, and so on (recency_probs_percent_stahl_filled.csv). A manuscript describing the SASCA-s simulator has been submitted for review and will be referenced in a future version of this data repository if it is accepted. The prefixes sj and er refer to the real-world network and the Erdős-Rényi random graph, respectively, that were used to initiate the simulations. These 'seed' networks are available from the GitHub site referenced above.
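As a hedged sketch of how these outputs might be loaded, the Python snippet below reads one edgelist/nodelist pair and the two CSV tables. The run-specific file names are placeholders, and the assumption that the edge lists are directed should be checked against the repository documentation.

```python
import networkx as nx
import pandas as pd

# Sketch: load one simulated citation network and the accompanying tables.
# File names follow the patterns described above; whether the edge lists are
# directed (citing -> cited) is an assumption to verify against the docs.
G = nx.read_edgelist("sj_run01.edgelist.gz", create_using=nx.DiGraph())  # hypothetical run name
nodes = pd.read_parquet("sj_run01.aux.parquet")                          # node attributes

citation_freq = pd.read_csv("pooled_50k_pubmed_unique.csv")
recency = pd.read_csv("recency_probs_percent_stahl_filled.csv")

print(G.number_of_nodes(), G.number_of_edges())
print(nodes.head())
```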
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This is a subset of version 4.0 of the Data Citation Corpus. It contains article_ids (cleaned DOIs), dataset IDs (e.g., accession numbers, DOIs), and the name of the repository holding the data (e.g., Dryad, European Nucleotide Archive). It was extracted from the file 2025-07-27-data-citation-corpus-01-v4.0.json, which is one of 11 JSONL files in the corpus.
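The sketch below illustrates how a subset like this could be derived from one of the corpus JSONL files: read one record per line and keep the article DOI, dataset identifier, and repository name. The field names are assumed from the v2 field list documented above and may differ in v4.0.

```python
import json
import pandas as pd

# Sketch: build an (article, dataset, repository) table from a corpus JSONL
# file. Field names follow the v2 documentation and are assumptions for v4.0.
rows = []
with open("2025-07-27-data-citation-corpus-01-v4.0.json", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        rows.append({
            "article_id": rec.get("publication"),
            "dataset_id": rec.get("dataset"),
            "repository": rec.get("repository"),
        })

subset = pd.DataFrame(rows)
print(subset["repository"].value_counts().head())
```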
This document contains brief descriptions of many of the treatments found in the PTSD Repository, organized by treatment category. Note: The download is a .zip file which contains the PDF Reference Guide.
Collected in this dataset are the slideset and abstract for a presentation on Toward a Reproducible Research Data Repository by the depositar team at International Symposium on Data Science 2023 (DSWS 2023), hosted by the Science Council of Japan in Tokyo on December 13-15, 2023. The conference was organized by the Joint Support-Center for Data Science Research (DS), Research Organization of Information and Systems (ROIS) and the Committee of International Collaborations on Data Science, Science Council of Japan. The conference programme is also included as a reference.
Toward a Reproducible Research Data Repository
Cheng-Jen Lee, Chia-Hsun Ally Wang, Ming-Syuan Ho, and Tyng-Ruey Chuang
Institute of Information Science, Academia Sinica, Taiwan
The depositar (https://data.depositar.io/) is a research data repository at Academia Sinica (Taiwan) open to researchers worldwide for the deposit, discovery, and reuse of datasets. The depositar software itself is open source and builds on top of CKAN. CKAN, an open source project initiated by the Open Knowledge Foundation and sustained by an active user community, is a leading data management system for building data hubs and portals. In addition to CKAN's out-of-the-box features such as the JSON data API and in-browser preview of uploaded data, we have added several features to the depositar, including sourcing from Wikidata for dataset keywords, a citation snippet for datasets, in-browser Shapefile preview, and a persistent identifier system based on ARK (Archival Resource Keys). At the same time, the depositar team faces an increasing demand for interactive computing (e.g. Jupyter Notebook), which facilitates not just data analysis but also the replication and demonstration of scientific studies. Recently, we have provided a JupyterHub service (a multi-tenancy JupyterLab) to some of the depositar's users. However, it still requires users to first download the data files (or copy the URLs of the files) from the depositar, then upload the data files (or paste the URLs) to the Jupyter notebooks for analysis. Furthermore, a JupyterHub deployed on a single server is limited by its processing power, which may lower the service level to the users. To address the above issues, we are integrating BinderHub into the depositar. BinderHub (https://binderhub.readthedocs.io/) is a Kubernetes-based service that allows users to create interactive computing environments from code repositories. Once the integration is completed, users will be able to launch Jupyter Notebooks to perform data analysis and visualization without leaving the depositar by clicking the BinderHub buttons on the datasets. In this presentation, we will first make a brief introduction to the depositar and BinderHub along with their relationship, then we will share our experiences in incorporating interactive computation in a data repository. We shall also evaluate the possibility of integrating the depositar with other automation frameworks (e.g. the Snakemake workflow management system) in order to enable users to reproduce data analysis.
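As a small illustration of the CKAN data API mentioned above, the sketch below queries the depositar's standard CKAN package_search endpoint for datasets matching a keyword; the search term is arbitrary, and endpoint behaviour should be confirmed against the depositar's API documentation.

```python
import requests

# Sketch: query the depositar's CKAN Action API for datasets matching a keyword.
# package_search is a standard CKAN endpoint; the query term is arbitrary.
resp = requests.get(
    "https://data.depositar.io/api/3/action/package_search",
    params={"q": "biodiversity", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["result"]

print("matches:", result["count"])
for pkg in result["results"]:
    print(pkg["name"], "-", pkg.get("title", ""))
```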
BinderHub, CKAN, Data Repositories, Interactive Computing, Reproducible Research
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file collection is part of the ORD Landscape and Cost Analysis Project (DOI: 10.5281/zenodo.2643460), a study jointly commissioned by the SNSF and swissuniversities in 2018.
Please cite this data collection as: von der Heyde, M. (2019). Data from the International Open Data Repository Survey. Retrieved from https://doi.org/10.5281/zenodo.2643493
Further information is given in the corresponding data paper: von der Heyde, M. (2019). International Open Data Repository Survey: Description of collection, collected data, and analysis methods [Data paper]. Retrieved from https://doi.org/10.5281/zenodo.2643450
Contact
Swiss National Science Foundation (SNSF)
Open Research Data Group
E-mail: ord@snf.ch
swissuniversities
Program "Scientific Information"
Gabi Schneider
E-Mail: isci@swissuniversities.ch
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence. Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful for users as they refine their queries.
Additional Resources:
- Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, you can fill out the Archer User Information Form.
- Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the Archer User Feedback Form.
- The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this form to subscribe to the Archer Users Group.
Citation Guidelines:
1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation: Cline Center for Advanced Social Research. 2020. Global News Index and Extracted Features Repository [codebook]. Champaign, IL: University of Illinois. doi:10.13012/B2IDB-5649852_V1
2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access): Cline Center for Advanced Social Research. 2020. Global News Index and Extracted Features Repository [database]. Champaign, IL: University of Illinois. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V1
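For readers unfamiliar with the raw query option, the snippet below shows what a Lucene/Solr-style query might look like; the field names used here are purely hypothetical and must be replaced with the fields defined in the GNI codebook.

```python
# Illustration only of Lucene/Solr 'raw query' syntax of the kind Archer exposes.
# The field names (text, country, publication_date) are hypothetical.
raw_query = (
    'text:("food security" AND drought) '
    'AND country:"Ethiopia" '
    'AND publication_date:[2019-01-01T00:00:00Z TO 2019-12-31T23:59:59Z]'
)
print(raw_query)
```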
Cloud-based data repository for storing, publishing, and accessing scientific data. Mendeley Data creates a permanent location and issues FORCE11-compliant citations for uploaded data.
License: https://networkrepository.com/policy.php
License: https://archive.data.jhu.edu/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.7281/T1/P69KYX
Data Axle Reference Solutions, formerly ReferenceUSA, contains two directories: residential and business. Records include over 12 million U.S. businesses and 120 million U.S. residents. Businesses include private, public, and non-profit organizations, regardless of employee size or sales. This dataset of Data Axle's business database provides 52 attributes about tens of millions of businesses across the United States, covering almost every business from the Fortune 500 down to mom-and-pop shops and work-from-home freelancers. The Data Axle business database is available in its entirety for the years 2017 to 2020. The data can be downloaded in a single comma-separated values (.csv) file for each year of interest. This file is approximately 5 GB in size after de-compressing the .zip archive. Loading the .csv in memory requires a minimum of 32 GB of RAM. To access the Data Axle data on low-memory systems, the .csv file for each year has been split into subsets by US Census-defined geographic regions, as well as the more granular geographic divisions. The file census-regions-divisions.csv identifies the states and territories that belong to each region (5 regions plus territories) and divisions (9 divisions plus territories).
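As a hedged sketch of working with a full-year file on a low-memory system, the snippet below reads the ~5 GB CSV in chunks with pandas and keeps only rows matching a filter. The file name and the filter column are placeholders for whatever the actual extract provides.

```python
import pandas as pd

# Sketch: process one year's large business file without loading it all into
# memory by reading it in chunks. "STATE" and the file name are placeholders.
matches = []
for chunk in pd.read_csv("data_axle_business_2019.csv", chunksize=250_000, dtype=str):
    matches.append(chunk[chunk["STATE"] == "MI"])  # hypothetical column/value

michigan_businesses = pd.concat(matches, ignore_index=True)
print(len(michigan_businesses))
```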
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data to complement the quantitative analysis of data citation practices in digital repositories based on metadata records from the re3data.org repositories registry.
Data were retrieved using the re3data.org API on 23-02-2023 and 06-03-2023 and processed using the OpenRefine software.
Part of "A FAIR-enabling citation model for Cultural Heritage Objects" project activities.
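For context, a minimal sketch of retrieving repository records from the re3data API (which returns XML) might look like the following; the endpoint and element names reflect the public v1 API as commonly documented and should be checked against the current API reference.

```python
import requests
import xml.etree.ElementTree as ET

# Sketch: list registered repositories from the re3data API (XML responses).
# Endpoint and element names are assumptions to verify against the API docs.
resp = requests.get("https://www.re3data.org/api/v1/repositories", timeout=60)
resp.raise_for_status()

root = ET.fromstring(resp.content)
repositories = [
    {"id": repo.findtext("id"), "name": repo.findtext("name")}
    for repo in root.findall("repository")
]
print(len(repositories), "repositories listed")
```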
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository provides documentation of the model and data development process, along with source (raw) data for the model in different forms (i.e., Modelica, CIM 14, and PSS/E) for an equivalent Nordic grid model that has been matched to historical power flow data.
The repository is documented in the paper below, see [Ref00].
If you use this model, data, or related software, please cite our publications!
We are happy to contribute this dataset; if you use any of the data or software provided, we would appreciate it if you cite the following publications:
A) Cite that "the raw and processed data files corresponding to the model are available as an open data set and documented in [Ref00]."
B) Cite the first appearance of the model, i.e., "the model is first presented in [Ref01]."
[Ref00] L. Vanfretti, S.H. Olsen, V. S. Narasimham Arava, G. Laera, A. Bibadafar, T. Rabuzin, H. Jackobsen, J. Lavenius, and M. Baudette, "An Open Data Repository and a Data Processing Software Toolset of an Equivalent Nordic Grid Model Matched to Historical Electricity Market Data," submitted for publication, Data in Brief, 2016.
[Ref01] L. Vanfretti, T. Rabuzin, M. Baudette, M. Murad, iTesla Power Systems Library (iPSL): A Modelica library for phasor time-domain simulations, SoftwareX, Available online 18 May 2016, ISSN 2352-7110, http://dx.doi.org/10.1016/j.softx.2016.05.001.
Acknowledgment:
This model was originally developed in the context of the FP7 iTesla project, and further extended within the ITEA3 openCPS project.
Structure of the repository:
01_PSSE_Resources:
Models:
A folder with PSS/E files of the base case
A folder with a 7zip archive containing files of the original N44 system that has been modified to have the PSS/E base case
Snapshots:
N44_2015xxxx are folders named according to the day they refer to (for example N44_20150401 refers to the 1st of April 2015). In each folder there are Excel files (Consumption_xx.xlsx, Exchange_xx.xlsx, Production_xx.xlsx) with data downloaded from Nord Pool website, an Excel file (PSSE_in_out.xlsx) summarizing the results from the Python script Nordic44.py in the folder 04_Python_Resources, PSS/E snapshots for each hour before solving the power flow (hx_before_PF.raw) and after solving the power flow (hx_after_PF.raw)
N44_BC.sav is the solved PSS/E base case used by the Python script Nordic44.py.
02_CIM14_Snapshots:
N44_2015xxxx are folders named according to the day they refer to (e.g. N44_20150401 refers to the 1st of April 2015). In each folder there are CIM files for each hour (N44_hx_EQ.xml, N44_hx_SV.xml, N44_hx_TP.xml)
N44_noOL_RDFIDMAP.xml is the ID-mapping file for the cases (N44_hx_noOL_EQ.xml, N44_hx_noOL_SV.xml, N44_hx_noOL_TP.xml) with fixed overloading problems.
N44_RDFIDMAP_2015-1.xml and N44_RDFIDMAP_2015-2.xml are the ID-mapping files for the remaining snapshots from 2015.
03_Modelica:
iTesla_Platform
iPSL folder contains the version of the library which can be used to simulate snapshots generated from the iTesla Platform
Modelica_snapshots contains Modelica models generated from the snapshots by the iTesla Platform
SmarTSLab
OpenIPSL folder contains the version of the forked iPSL library which can be used to simulate the manually generated Modelica model of N44 with the record structures corresponding to the snapshots
Snapshots folder contains Modelica records automatically generated from the PSS/E records
N44_Base_Case.mo is the hand-built N44 model with the loaded record of the power flow results from the PSS/E base case. It can be used to load other power flow results from the folder 03_Modelica/Snapshots
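As a hedged example of navigating this layout programmatically, the Python sketch below walks the daily snapshot folders and concatenates the Nord Pool production spreadsheets; folder and file name patterns are taken from the description above and may need adjusting to the actual repository layout.

```python
from pathlib import Path
import pandas as pd

# Sketch: collect the Nord Pool production spreadsheets from the daily
# snapshot folders (N44_2015xxxx) into one table. Paths follow the
# description above and should be adjusted to the real repository.
snapshot_root = Path("01_PSSE_Resources/Snapshots")

frames = []
for day_dir in sorted(snapshot_root.glob("N44_2015*")):
    for xlsx in sorted(day_dir.glob("Production_*.xlsx")):
        df = pd.read_excel(xlsx)
        df["snapshot_day"] = day_dir.name
        df["source_file"] = xlsx.name
        frames.append(df)

production = pd.concat(frames, ignore_index=True)
print(production.shape)
```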
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the regional dataset compilation for the INnovative Geothermal Exploration through Novel Investigations Of Undiscovered Systems (INGENIOUS) project. The primary goal of this project is to accelerate discoveries of new, commercially viable hidden geothermal systems while reducing the exploration and development risks for all geothermal resources. These datasets will be used in INGENIOUS as input features for predicting geothermal favorability throughout the Great Basin study area.
Datasets consist of shapefiles, geotiffs, tabular spreadsheets, and metadata that describe: 2-meter temperature probe surveys, quaternary faults and volcanic features, geodetic shear and dilation models, heat flow, magnetotellurics (conductance), magnetics, gravity, paleogeothermal features (such as sinter and tufa deposits), seismicity, spring and well temperatures, spring and well aqueous geochemistry analyses, thermal conductivity, and fault slip and dilation tendency.
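As a brief, hedged illustration of reading two of these dataset types in Python, the sketch below opens a shapefile with geopandas and a GeoTIFF with rasterio; the file names are placeholders for the actual INGENIOUS files.

```python
import geopandas as gpd
import rasterio

# Sketch: inspect two of the dataset types described above. Paths are
# placeholders; actual layer names come with the INGENIOUS submission files.
faults = gpd.read_file("quaternary_faults.shp")  # hypothetical shapefile
print(faults.crs, len(faults), list(faults.columns)[:5])

with rasterio.open("heat_flow.tif") as src:      # hypothetical geotiff
    print(src.crs, src.res, src.read(1).shape)
```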
For additional project information, see the INGENIOUS project site linked in the submission.
Terms of use: These datasets are provided "as is", and the contributors assume no responsibility for any errors or omissions. The user assumes the entire risk associated with their use of these data and bears all responsibility in determining whether these data are fit for their intended use. These datasets may be redistributed with attribution (see citation information below). Please refer to the license information on this page for full licensing terms and conditions.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AGU and other publishers have developed guidance for authors to support best practices for data and software sharing and citation, but there remains a significant gap in knowledge and implementation of these practices amongst scientists during the publication process. Reference managers are an important tool to facilitate uptake of data and software citation, but this infrastructure is not yet adequately developed for these applications. In this poster, we compare and contrast dataset and software citation capabilities amongst major reference managers and numerous data repositories to begin a conversation about technical improvements to this critical component of FAIR data infrastructure. This poster was presented during the 2024 January Earth Science Information Partners (ESIP) Meeting held virtually (Jan. 23-26, 2024).