2 datasets found

Data from: Citation data of arXiv eprints and the associated...
data.niaid.nih.gov
zenodo.org
Updated Jan 7, 2024
Cite
Keisuke Okamura (2024). Citation data of arXiv eprints and the associated quantitatively-and-temporally normalised impact metrics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5803961
Dataset provided by
Hitoshi Koshiba
Keisuke Okamura
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data collection

This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus the data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. all electronic material that has been posted on arXiv.

The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020. The metadata included each eprint’s title, author, abstract, subject category and arXiv ID (arXiv’s original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, and contain the citation information in and out of the arXiv eprints and their published versions (if applicable). Whether an eprint has been published in a journal or by other means is assumed to be inferable, albeit indirectly, from the status of its digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre citations before and c_pub citations after it was assigned a DOI, counted up to the data retrieval date (7th February 2020), then the citation count of this eprint is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as the key variable for merging the two datasets.
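The merge on the arXiv ID described above can be illustrated with a minimal sketch. The records, the field names other than the arXiv ID, the inner-join behaviour and the helper name merge_on_arxiv_id are all assumptions made for illustration; the source does not say how unmatched eprints were handled.

```python
# Hypothetical records standing in for the two sources; only the shared
# arXiv ID key is taken from the description above.
arxiv_meta = [
    {"arxiv_id": "0000.00001", "title": "Example A"},
    {"arxiv_id": "0000.00002", "title": "Example B"},
]
s2_citations = [
    {"arxiv_id": "0000.00001", "c_pre": 3, "c_pub": 5},
    {"arxiv_id": "0000.00003", "c_pre": 1, "c_pub": 0},
]


def merge_on_arxiv_id(meta, citations):
    """Inner-join two lists of records on the shared 'arxiv_id' key."""
    by_id = {row["arxiv_id"]: row for row in citations}
    merged = []
    for row in meta:
        match = by_id.get(row["arxiv_id"])
        if match is None:
            continue  # assumption: eprints without citation data are dropped
        combined = {**row, **match}
        # Recorded citation count under the stated assumption: c_pre + c_pub.
        combined["total"] = match["c_pre"] + match["c_pub"]
        merged.append(combined)
    return merged


merged = merge_on_arxiv_id(arxiv_meta, s2_citations)
print(merged)
```

Only the eprint present in both hypothetical sources survives the join, and its total is the sum of the two partial counts.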

The classification of research disciplines is based on that described on the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Eprints tagged with multiple arXiv disciplines were counted independently for each discipline. Because of this overlap, the current dataset contains a cumulative total of 2,011,216 eprints.
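The gap between the unique count and the cumulative count follows directly from counting multi-tagged eprints once per discipline. A small sketch with made-up eprints (the IDs and tag assignments are hypothetical) shows the arithmetic:

```python
from collections import Counter

# Hypothetical eprints, each tagged with one or more of the six disciplines.
eprints = {
    "0000.00001": ["hep", "math"],
    "0000.00002": ["astro-ph"],
    "0000.00003": ["comp-sci", "math", "oth-phys"],
}

# Each eprint contributes one count to every discipline it is tagged with.
per_discipline = Counter(tag for tags in eprints.values() for tag in tags)

unique_total = len(eprints)                      # eprints counted once
cumulative_total = sum(per_discipline.values())  # one row per (eprint, discipline)

print(unique_total, cumulative_total, dict(per_discipline))
```

In the real dataset the same effect accounts for the difference between 1,589,006 unique eprints and the cumulative 2,011,216.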

Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.

Description of columns (variables)

arxiv_id : arXiv ID

category : Research discipline

pre_year : Year of posting v1 on arXiv

pub_year : Year of DOI acquisition

c_tot : No. of citations acquired during 1991–2019

c_pre : No. of citations acquired before and including the year of DOI acquisition

c_pub : No. of citations acquired after the year of DOI acquisition

c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)

gamma : The quantitatively-and-temporally normalised citation index

gamma_star : The quantitatively-and-temporally standardised citation index

Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.

Data files

A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
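As a rough sketch of how the CSV file might be sanity-checked, the snippet below sums the yearly citation columns and compares the result with c_tot. It assumes the file header uses the column names listed above, that the yearly columns are named like c_1991, and that counts are stored as integers; none of this is confirmed here. The inline SAMPLE (truncated to three years, with invented values) stands in for the real file, and the function name is illustrative.

```python
import csv
import io
import re

# Assumed naming of the yearly columns: 'c_' followed by a four-digit year.
YEAR_COLUMN = re.compile(r"^c_(\d{4})$")

# Invented sample rows; replace io.StringIO(SAMPLE) with
# open("arXiv_impact.csv", newline="") to read the real file.
SAMPLE = """arxiv_id,category,pre_year,pub_year,c_tot,c_2017,c_2018,c_2019
0000.00001,math,2017,2018,6,1,2,3
0000.00002,hep,2018,,4,0,1,2
"""


def yearly_sum_mismatches(handle):
    """Return arXiv IDs whose yearly counts do not add up to c_tot."""
    mismatches = []
    for row in csv.DictReader(handle):
        yearly = sum(
            int(value or 0)
            for name, value in row.items()
            if YEAR_COLUMN.match(name)
        )
        if yearly != int(row["c_tot"]):
            mismatches.append(row["arxiv_id"])
    return mismatches


mismatches = yearly_sum_mismatches(io.StringIO(SAMPLE))
print(mismatches)
```

The second sample row is deliberately inconsistent so that the check has something to report.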
scicite
tensorflow.org
opendatalab.com
Updated Dec 23, 2022
Cite
(2022). scicite [Dataset]. https://www.tensorflow.org/datasets/catalog/scicite
Description
This is a dataset for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with a context key. Example:

{
  'string': 'In chacma baboons, male-infant relationships can be linked to both formation of friendships and paternity success [30,31].',
  'sectionName': 'Introduction',
  'label': 'background',
  'citingPaperId': '7a6b2d4b405439',
  'citedPaperId': '9d1abadc55b5e0',
  ...
}

You may obtain the full information about the paper using the provided paper ids with the Semantic Scholar API (https://api.semanticscholar.org/).

The labels are: Method, Background, Result
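Records shaped like the example above can be tallied by intent label with a few lines of standard-library Python. This is only a sketch: the records below are invented, the lowercase label spelling follows the example record, and tally_labels is a hypothetical helper rather than part of the dataset tooling.

```python
from collections import Counter

# Invented records that mimic the keys of the example object above.
records = [
    {"string": "Hypothetical sentence one.", "sectionName": "Introduction", "label": "background"},
    {"string": "Hypothetical sentence two.", "sectionName": "Methods", "label": "method"},
    {"string": "Hypothetical sentence three.", "sectionName": "Introduction", "label": "background"},
]

# The three intent labels named in the description, lowercased.
EXPECTED_LABELS = {"method", "background", "result"}


def tally_labels(rows):
    """Count records per citation intent, rejecting unknown labels."""
    counts = Counter(row["label"].lower() for row in rows)
    unexpected = set(counts) - EXPECTED_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return counts


label_counts = tally_labels(records)
print(label_counts)
```

A tally like this is a quick way to check class balance before training a classifier.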

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('scicite', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.