  1. Softcite Dataset: A dataset of software mentions in research publications

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 17, 2021
    Cite
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon (2021). Softcite Dataset: A dataset of software mentions in research publications [Dataset]. http://doi.org/10.5281/zenodo.4445202
    Available download formats: zip
    Dataset updated
    Jan 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon
    Description

    The Softcite dataset is a gold-standard dataset of software mentions in research publications, a free resource primarily for software entity recognition in scholarly text. This is the first release of this dataset.

    What's in the dataset

    To support software entity recognition at scale, and ultimately to make research software more visible so that software contributions to scholarly research receive due credit, a team of trained annotators from the Howison Lab at the University of Texas at Austin annotated 4,093 software mentions in 4,971 open access research publications in biomedicine (from the PubMed Central Open Access collection) and economics (from Unpaywall open access services). The released dataset is a single TEI/XML corpus file. It contains the annotated software mentions, each with its publisher, version, and access URL where the text mentions them, together with the publications annotated as containing no software mentions.

    To understand the schema of the Softcite corpus, its design considerations, and its provenance, please refer to our paper included in this release (preprint version).
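
    As a rough illustration of working with a TEI/XML corpus of this kind, the sketch below counts annotations by type and counts documents with no software mention. The element and attribute names it relies on (`<rs type="software">`, `<rs type="version">`, `<rs type="publisher">`, one `<TEI>` element per publication) and the inline sample are assumptions made for illustration, not a statement of the actual schema, which is described in the paper included in the release.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace; assumed to be the default namespace of the corpus file.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Tiny hypothetical stand-in for softcite_corpus-full.tei.xml.
# The structure below is an assumption, not the documented schema.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <text><body>
      <p>Analyses used <rs type="software">SPSS</rs> version
         <rs type="version">22</rs> (<rs type="publisher">IBM</rs>).</p>
    </body></text>
  </TEI>
  <TEI>
    <text><body/></text>
  </TEI>
</teiCorpus>"""


def count_annotations(root):
    """Count <rs> annotations, grouped by their 'type' attribute."""
    return Counter(rs.get("type") for rs in root.iter(TEI_NS + "rs"))


def documents_without_software(root):
    """Count <TEI> documents that contain no software annotation."""
    total = 0
    for doc in root.iter(TEI_NS + "TEI"):
        has_software = any(
            rs.get("type") == "software" for rs in doc.iter(TEI_NS + "rs")
        )
        if not has_software:
            total += 1
    return total


root = ET.fromstring(SAMPLE)
counts = count_annotations(root)
print(counts["software"], counts["version"], counts["publisher"])
print(documents_without_software(root))
```

    To try this on the released file, one would replace `ET.fromstring(SAMPLE)` with `ET.parse("softcite_corpus-full.tei.xml").getroot()` and adjust the element names after checking them against the schema in the paper.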

    Use scenarios

    The Softcite dataset is released to encourage researchers and stakeholders to make research software more visible in science, especially in academic databases and information retrieval systems, and to facilitate interoperability and collaboration among related efforts in software entity recognition and in building utilities for software information retrieval. The dataset can also be useful to researchers investigating software use in academic research.

    Current release content

    The softcite-dataset v1.0 release includes:

    • The Softcite dataset corpus file: softcite_corpus-full.tei.xml
    • Our paper describing the design considerations and creation process of the dataset, "Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications": Softcite_Dataset_Description_RC.pdf. (This is a preprint version of our forthcoming publication in the Journal of the Association for Information Science and Technology.)

    The Softcite dataset is licensed under a Creative Commons Attribution 4.0 International License.

    If you have questions, please start a discussion or open an issue in the howisonlab/softcite-dataset GitHub repository.

