100+ datasets found
  1. Citation Graph

    • kaggle.com
    zip
    Updated Jun 30, 2020
    Cite
    Caselaw Access Project (2020). Citation Graph [Dataset]. https://www.kaggle.com/datasets/harvardlil/citation-graph
    Explore at:
    zip (306,688,738 bytes)
    Dataset updated
    Jun 30, 2020
    Authors
    Caselaw Access Project
    Description

    Context

    The Caselaw Access Project makes 40 million pages of U.S. caselaw freely available online from the collections of Harvard Law School Library.

    The CAP citation graph shows the connections between cases in the Caselaw Access Project dataset. You can use the citation graph to answer questions like "what is the most influential case?" and "what jurisdictions cite most often to this jurisdiction?".

    Learn More: https://case.law/download/citation_graph/

    Access Limits: https://case.law/api/#limits

    Content

    This dataset includes citations and metadata for the CAP citation graph in CSV format.
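    The "most influential case" question above reduces to an in-degree count over the citation CSV. A minimal sketch with the standard library follows; the column names `citing_id` and `cited_id` are assumptions for illustration and the three rows are invented, so check the header of the actual download before adapting it.

```python
import csv
import io
from collections import Counter

# Hedged sketch: assumes each CSV row pairs a citing case ID with a cited
# case ID. The real CAP export's column names may differ -- inspect the header.
sample = io.StringIO(
    "citing_id,cited_id\n"
    "1,2\n"
    "3,2\n"
    "3,1\n"
)

in_degree = Counter()
for row in csv.DictReader(sample):
    in_degree[row["cited_id"]] += 1  # one incoming citation per row

# The case with the most incoming citations is the "most influential"
# under this simple metric.
most_influential, n_citations = in_degree.most_common(1)[0]
print(most_influential, n_citations)  # case "2" is cited twice
```

    For the full graph, the same loop runs over the downloaded CSV file instead of the in-memory sample.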

    Acknowledgements

    The Caselaw Access Project is by the Library Innovation Lab at Harvard Law School Library.

    Inspiration

    People are using CAP data to create research, applications, and more. We're sharing examples in our gallery.

    Cite Grid is the first visualization we've created based on data from our citation graph.

    Have something to share? We're excited to hear about it.

  2. Data for "Open Access impact on citations: a case study"

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Bordignon, Frédérique; Andro, Mathieu (2020). Data for "Open Access impact on citations: a case study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_60293
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Direction de la Documentation, Ecole des Ponts ParisTech, Champs-sur-Marne, France
    DIST, INRA, Versailles, France
    Authors
    Bordignon, Frédérique; Andro, Mathieu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a list of 347 papers published in 2010 and retrieved from the Web of Science, Scopus and Google Scholar. For each paper, the number of citations and the citation date(s) have been collected. If the full text is available online, the date of "liberation" and the URL of the file have been retrieved as well. The objective was to assess the impact of open access on citation rate, and more particularly its impact before and after full-text "liberation".

  3. Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine"

    • databank.illinois.edu
    Updated Jun 14, 2024
    + more versions
    Cite
    Tzu-Kun Hsiao; Jodi Schneider (2024). Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine" [Dataset]. http://doi.org/10.13012/B2IDB-8255619_V2
    Explore at:
    Dataset updated
    Jun 14, 2024
    Authors
    Tzu-Kun Hsiao; Jodi Schneider
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Alfred P. Sloan Foundation
    U.S. National Institutes of Health (NIH)
    Description

    This dataset includes five files. Descriptions of the files are given as follows:

    FILENAME: PubMed_retracted_publication_full_v3.tsv
    - Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT]).
    - Except for the information in the "cited_by" column, all the data is from PubMed.
    - PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses: [1] PMIDs of the citing papers are from retraction notices (i.e., those in the "retraction_notice_PMID.csv" file). [2] The citing paper and the cited retracted paper have the same PMID.
    ROW EXPLANATIONS
    - Each row is a retracted paper. There are 7,813 retracted papers.
    COLUMN HEADER EXPLANATIONS
    1) PMID - PubMed ID
    2) Title - Paper title
    3) Authors - Author names
    4) Citation - Bibliographic information of the paper
    5) First Author - First author's name
    6) Journal/Book - Publication name
    7) Publication Year
    8) Create Date - The date the record was added to the PubMed database
    9) PMCID - PubMed Central ID (if applicable, otherwise blank)
    10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
    11) DOI - Digital object identifier (if applicable, otherwise blank)
    12) retracted_in - Information on the retraction notice (given by PubMed)
    13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
    14) cited_by - PMIDs of the citing papers (if applicable, otherwise blank). Data collected from iCite.
    15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)

    FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv
    - This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
    - This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
    - Citation contexts that meet either of the two conditions below have been excluded from analyses: [1] PMIDs of the citing papers are from retraction notices (i.e., those in the "retraction_notice_PMID.csv" file). [2] The citing paper and the cited retracted paper have the same PMID.
    ROW EXPLANATIONS
    - Each row is a citation context associated with one retracted paper that's cited.
    - In the manuscript, we count each citation context once, even if it cites multiple retracted papers.
    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) year - Publication year of the citing paper
    4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
    5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
    6) sentence_id - The ID of the citation context in a given location (see column 4). The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
    7) total_sentences - Total number of sentences in a given location
    8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
    9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
    10) citation - The citation context
    11) progression - Position of a citation context by centile within the citing paper
    12) retracted_yr - Retraction year of the retracted paper
    13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.

    FILENAME: 724_knowingly_post_retraction_cit.csv (updated)
    - The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
    - Two citation contexts from retraction notices have been excluded from analyses.
    ROW EXPLANATIONS
    - Each row is a citation context.
    COLUMN HEADER EXPLANATIONS
    1) pmcid - PubMed Central ID of the citing paper
    2) pmid - PubMed ID of the citing paper
    3) pub_type - Publication type collected from the metadata in the PMCOA XML files
    4) pub_type2 - Specific article types. Please see the manuscript for explanations.
    5) year - Publication year of the citing paper
    6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
    7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
    8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
    9) citation - The citation context
    10) retracted_yr - Retraction year of the retracted paper
    11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
    12) longer_context - An extended version of the citation context (if applicable, otherwise blank). Manually pulled from the full texts in the process of annotation.

    FILENAME: Annotation manual.pdf
    - The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.tsv.

    FILENAME: retraction_notice_PMID.csv (new file added for this version)
    - A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT]).
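    A minimal sketch of working with the citation-context file using its documented post_retraction flag; the three sample rows below are invented, and only a subset of the documented columns is shown.

```python
import csv
import io

# Hedged sketch against the documented columns of
# PubMed_retracted_publication_CitCntxt_withYR_v3.tsv (tab-delimited).
# The rows are made up for illustration; the real file has 13 columns.
sample = io.StringIO(
    "pmid\tyear\tretracted_yr\tpost_retraction\n"
    "111\t2015\t2012\t1\n"
    "222\t2010\t2012\t0\n"
    "333\t2014\t2012\t1\n"
)

rows = list(csv.DictReader(sample, delimiter="\t"))

# post_retraction == "1" marks citations made after the retraction year.
post = [r for r in rows if r["post_retraction"] == "1"]
print(len(post), "of", len(rows), "contexts are post-retraction")
```

    Replacing the in-memory sample with `open(path, newline="")` over the real TSV gives the per-year post-retraction tallies analyzed in the paper.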

  4. Citation Trends for "Supporting Data and Services Access in Digital Government Environments"

    • shibatadb.com
    Updated Oct 4, 2025
    Cite
    Yubetsu (2025). Citation Trends for "Supporting Data and Services Access in Digital Government Environments" [Dataset]. https://www.shibatadb.com/article/JWopxJ3C
    Explore at:
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2004
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Supporting Data and Services Access in Digital Government Environments".

  5. Data from: OpCitance: Citation contexts identified from the PubMed Central open access articles

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Feb 15, 2023
    Cite
    Tzu-Kun Hsiao; Vetle Torvik (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles [Dataset]. http://doi.org/10.13012/B2IDB-4353270_V1
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Tzu-Kun Hsiao; Vetle Torvik
    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    Sentences and citation contexts identified from the PubMed Central open access articles

    The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019. The dataset is created as described in: Hsiao, T. K., & Torvik, V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.

    Files (one per journal-title initial; each contains the sentences and citation contexts identified from articles published in journals whose titles start with the indicated letter or letters):
    • A_journal_IntxtCit.tsv
    • B_journal_IntxtCit.tsv
    • C_journal_IntxtCit.tsv
    • D_journal_IntxtCit.tsv
    • E_journal_IntxtCit.tsv
    • F_journal_IntxtCit.tsv
    • G_journal_IntxtCit.tsv
    • H_journal_IntxtCit.tsv
    • I_journal_IntxtCit.tsv
    • J_journal_IntxtCit.tsv
    • K_journal_IntxtCit.tsv
    • L_journal_IntxtCit.tsv
    • M_journal_IntxtCit.tsv
    • N_journal_IntxtCit.tsv
    • O_journal_IntxtCit.tsv
    • P_p1_journal_IntxtCit.tsv (P, part 1)
    • P_p2_journal_IntxtCit.tsv (P, part 2)
    • Q_journal_IntxtCit.tsv
    • R_journal_IntxtCit.tsv
    • S_journal_IntxtCit.tsv
    • T_journal_IntxtCit.tsv
    • UV_journal_IntxtCit.tsv (U or V)
    • W_journal_IntxtCit.tsv
    • XYZ_journal_IntxtCit.tsv (X, Y or Z)

    Each row in a file is a sentence/citation context and contains the following columns:
    • pmcid: PMCID of the article.
    • pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
    • location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
    • IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
    • sentence_id: The ID of the citation context/sentence in the article component.
    • total_sentences: The number of sentences in the article component.
    • intxt_id: The ID of the citation.
    • intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
    • intxt_pmid_source: The sources where the intxt_pmid can be identified. "xml" means the PMID is only identified from the XML file; "xml,pmc" means the PMID is not only from the XML file but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
    • intxt_mark: The citation marker associated with the inline citation.
    • best_id: The best source link ID (e.g., PMID) of the citation.
    • best_source: The sources that confirm the best ID.
    • best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
    • citation: A citation context. If no citation is found in a sentence, the value is the sentence.
    • progression: Text progression of the citation context/sentence.

    Supplementary Files
    • PMC-OA-patci.tsv.gz – Contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs, which are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column. Each row in this file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
      • pmcid: PMCID of the citing article.
      • pos: The citation's position in the reference list.
      • fromPMID: PMID of the citing article.
      • toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
      • SRC: The sources that confirm the toPMID.
      • MatchDB: The origin bibliographic database of the toPMID.
      • Probability: The match probability of the toPMID.
      • toPMID2: PMID of the citation (as tagged in the XML file).
      • SRC2: The sources that confirm the toPMID2.
      • intxt_id: The ID of the citation.
      • journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
      • same_ref_string: Whether the citation string appears in the reference list more than once.
      • DIFF: The comparison result between the toPMID column and the toPMID2 column.
      • bestID: The best source link ID (e.g., PMID) of the citation.
      • bestSRC: The sources that confirm the best ID.
      • Match: Matching result produced by Patci.
    • Supplementary_File_1.zip – Contains the code for generating the dataset.

    [1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
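    A sketch of separating citation contexts from plain sentences in one *_journal_IntxtCit.tsv shard. The sample rows are invented, and the assumption that plain sentences carry "-" in the intxt_id field follows the column notes above but should be verified against the actual files.

```python
import csv
import io

# Hedged sketch over a toy shard; only a few of the documented columns
# are shown. Assumption: sentences with no inline citation have "-" in
# intxt_id, per the column descriptions.
sample = io.StringIO(
    "pmcid\tIMRaD\tintxt_id\tintxt_pmid\tcitation\n"
    "PMC1\tI\tc1\t100\tPrior work showed X [1].\n"
    "PMC1\tM\t-\t-\tWe incubated samples overnight.\n"
    "PMC1\tD\tc2\t-\tThis agrees with [2].\n"
)

rows = list(csv.DictReader(sample, delimiter="\t"))

# Keep only rows that carry a citation ID, i.e., true citation contexts.
contexts = [r for r in rows if r["intxt_id"] != "-"]
print(len(contexts), "citation contexts out of", len(rows), "sentences")
```

    The same filter over all 24 shards would reproduce the dataset's headline split of 75,848,689 citation contexts among 720,649,608 sentences.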

  6. August 2024 data-update for "Updated science-wide author databases of standardized citation indicators"

    • elsevier.digitalcommonsdata.com
    Updated Sep 16, 2024
    + more versions
    Cite
    John P.A. Ioannidis (2024). August 2024 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.7
    Explore at:
    Dataset updated
    Sep 16, 2024
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added in the most recent iteration.

    Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023 and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to end of citation year 2023. This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list; it does not mean that the author does not do good work.

    PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably via the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file of FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a

  7. Open Access In Africa: Scopus Citation Data

    • search.datacite.org
    • data.niaid.nih.gov
    Updated Jun 23, 2017
    Cite
    Dasapta Erwin Irawan; OB Ojemeni (2017). Open Access In Africa: Scopus Citation Data [Dataset]. http://doi.org/10.5281/zenodo.817600
    Explore at:
    Dataset updated
    Jun 23, 2017
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Zenodo (http://zenodo.org/)
    Authors
    Dasapta Erwin Irawan; OB Ojemeni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The following citation dataset was retrieved from Scopus on June 24, 2017 (3am, Western Indonesian time).

    It consists of 3 sets of data based on our searches. Each search was saved both in 'csv' and 'bib':

    OA_Africa_inTitle.xxx: "Open Access" AND Africa IN TITLE
    OA_Africa_inTitle_inAbstract_inKeywords.xxx: "Open Access" AND Africa IN TITLE, IN ABSTRACT, IN KEYWORDS
    OAmovement_Africa_inTitle_inAbstract_inKeywords.xxx: "Open Access movement" AND Africa IN TITLE, IN ABSTRACT, IN KEYWORDS
    

    Access to Scopus was provided by the Central Library of Institut Teknologi Bandung (Indonesia).

  8. Impact of NIH Public Access Policy on Citation Rates - Data from Study

    • figshare.com
    • indigo.uic.edu
    txt
    Updated Nov 23, 2019
    Cite
    Sandra L. De Groote (2019). Impact of NIH Public Access Policy on Citation Rates - Data from Study [Dataset]. https://figshare.com/articles/dataset/Impact_of_NIH_Public_Access_Policy_on_Citation_Rates_-_Data_from_Study/10961135
    Explore at:
    txt
    Dataset updated
    Nov 23, 2019
    Dataset provided by
    University of Illinois Chicago
    Authors
    Sandra L. De Groote
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    A list of journals across several subject areas was developed from which to collect article citation data. Citation information and cited reference counts of all the articles published in 2006 and 2009 for these journals were obtained.

  9. PLOS ONE publication and citation data

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +2more
    zip
    Updated May 15, 2023
    Cite
    Alexander Petersen (2023). PLOS ONE publication and citation data [Dataset]. http://doi.org/10.6071/M39W8V
    Explore at:
    zip
    Dataset updated
    May 15, 2023
    Dataset provided by
    University of California, Merced
    Authors
    Alexander Petersen
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Merged PLOS ONE and Web of Science data compiled in .dta files produced by STATA13. Included is a Do-file for reproducing the regression model estimates reported in the pre-print (Tables I and II) and published version (Table 1). Each observation (.dta line) corresponds to a given PLOS ONE article, with various article-level and editor-level characteristics used as explanatory and control variables. This summary provides a brief description of each variable and its source.

    If you use this data, please cite: A. M. Petersen. Megajournal mismanagement: Manuscript decision bias and anomalous editor activity at PLOS ONE. Journal of Informetrics 13, 100974 (2019). DOI: 10.1016/j.joi.2019.100974

    Methods: We gathered the citation information for all PLOS ONE articles, indexed by A, from the Web of Science (WOS) Core Collection. From this data we obtained a master list of the unique digital object identifier, DOI_A, and the number of citations, c_A, at the time of the data download (census) date:

    (a) For the pre-print this corresponds to December 3, 2016;

    (b) and for the final published article this corresponds to February 25, 2019.

    We then used each DOI_A to access the corresponding online XML version of each article at PLOS ONE by visiting the unique web address "http://journals.plos.org/plosone/article?id=" + "DOI_A". After parsing the full-text XML (primarily the author byline data and reference list), we merged the PLOS ONE publication information and WOS citation data by matching on DOI_A.
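    The URL construction step described above can be sketched as a one-line helper; the DOI in the usage line is a made-up placeholder, not a real PLOS ONE article.

```python
# Sketch of the per-article URL scheme quoted in the methods:
# base address + DOI, used to fetch each article's XML/HTML page.
BASE = "http://journals.plos.org/plosone/article?id="

def article_url(doi: str) -> str:
    """Build the per-article URL for a given PLOS ONE DOI."""
    return BASE + doi

# Placeholder DOI for illustration only.
url = article_url("10.1371/journal.pone.0000000")
print(url)
```

    In the study's pipeline, the page at each such URL was parsed for byline and reference-list data before merging with the WOS citation counts.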

    allofplos: PLOS has since made all full-text XML data freely available (https://www.plos.org/text-and-data-mining); this option was not available at the time of our data collection.

  10. Data from: unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata

    • zenodo.org
    Updated Apr 17, 2024
    + more versions
    Cite
    Tarek Saier; Tarek Saier; Michael Färber; Michael Färber (2024). unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata [Dataset]. http://doi.org/10.5281/zenodo.3385851
    Explore at:
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tarek Saier; Tarek Saier; Michael Färber; Michael Färber
    Description


    unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.

    The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files.

    Typical use cases are

    • Citation recommendation
    • Citation context analysis
    • Bibliographic analyses
    • Reference string parsing

    Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615

    Access


    To download the whole data set send an access request and note the following:

    Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹

    ¹ For information on papers' licenses use arXiv's bulk metadata access.

    The code used for generating the data set is publicly available.

    Usage examples for our data set are provided on GitHub.

    Citing

    This initial version of unarXive is described in the following journal article.

    Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
    [link to an author copy]

    The updated version is described in the following conference paper.

    Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
    [link to an author copy]

  11. Data from: Bibliometric-Enhanced arXiv: A Data Set for Paper-Based and Citation-Based Tasks

    • zenodo.org
    • search.datacite.org
    Updated Apr 17, 2024
    Cite
    Tarek Saier; Michael Färber; Michael Färber; Tarek Saier (2024). Bibliometric-Enhanced arXiv: A Data Set for Paper-Based and Citation-Based Tasks [Dataset]. http://doi.org/10.5281/zenodo.2553523
    Explore at:
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tarek Saier; Michael Färber; Michael Färber; Tarek Saier
    Description


    unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.

    The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files.

    Typical use cases are

    • Citation recommendation
    • Citation context analysis
    • Bibliographic analyses
    • Reference string parsing

    Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615

    Access


    To download the whole data set send an access request and note the following:

    Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹

    ¹ For information on papers' licenses use arXiv's bulk metadata access.

    The code used for generating the data set is publicly available.

    Usage examples for our data set are provided on GitHub.

    Citing

    This initial version of unarXive is described in the following journal article.

    Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
    [link to an author copy]

    The updated version is described in the following conference paper.

    Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
    [link to an author copy]

  12. Citation and access data, and journal impact factors for co-published EQUATOR reporting guidelines

    • figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Daniel Shanahan (2023). Citation and access data, and journal impact factors for co-published EQUATOR reporting guidelines [Dataset]. http://doi.org/10.6084/m9.figshare.3156211.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Daniel Shanahan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the full citation details and DOIs for 85 co-published reporting guidelines, together with the citation counts, number of article accesses and journal impact factor for each article and journal. This represents a total of nine research reporting statements, published across 58 journals in biomedicine.

  13. Data from: Data reuse and the open data citation advantage

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    zip
    Updated Oct 1, 2013
    Heather A. Piwowar; Todd J. Vision (2013). Data reuse and the open data citation advantage [Dataset]. http://doi.org/10.5061/dryad.781pv
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 1, 2013
    Dataset provided by
    National Evolutionary Synthesis Center
    Authors
    Heather A. Piwowar; Todd J. Vision
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

    Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

    Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
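
    The study design described above, regressing citation counts on an open-data flag while controlling for covariates, can be illustrated on synthetic data. This is a minimal sketch assuming a simple log-linear model with one made-up covariate; it is not the authors' data or code.

```python
import numpy as np

# Synthetic illustration of the study design (not the authors' data or code):
# regress log citations on an open-data flag while controlling for a covariate.
rng = np.random.default_rng(42)
n = 5000
impact_factor = rng.normal(3.0, 1.0, n)          # stand-in covariate
open_data = rng.integers(0, 2, n).astype(float)  # 1 = data publicly available
true_benefit = 0.09                              # the ~9% effect reported above
log_cites = (0.5 * impact_factor
             + np.log1p(true_benefit) * open_data
             + rng.normal(0.0, 0.3, n))

# Ordinary least squares with an intercept, the flag, and the covariate
X = np.column_stack([np.ones(n), open_data, impact_factor])
beta, *_ = np.linalg.lstsq(X, log_cites, rcond=None)
estimated_benefit = np.exp(beta[1]) - 1.0        # back to a multiplicative boost
print(f"estimated open-data citation benefit: {estimated_benefit:.1%}")
```

    The actual analysis used many more covariates (journal impact factor, author history, country, topic), but the shape of the estimate is the same: the coefficient on the open-data flag, exponentiated, gives the percentage citation benefit.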

  14. Access to Grey Content: An Analysis of Grey Literature based on Citation and...

    • ssh.datastations.nl
    mdb, pdf, tsv, txt +1
    Updated Jan 1, 2006
    Dr. J. Farace; J. Frantzen; Dr. J. (INIST-CNRS) Schöpfel; C. (INIST-CNRS) Stock; Dr. A.K. (UvA) Boekhorst; Dr. J. Farace; J. Frantzen; Dr. J. (INIST-CNRS) Schöpfel; C. (INIST-CNRS) Stock; Dr. A.K. (UvA) Boekhorst (2006). Access to Grey Content: An Analysis of Grey Literature based on Citation and Survey Data, A Follow-up Study [Dataset]. http://doi.org/10.17026/DANS-XFQ-MDFG
    Explore at:
    mdb(18948096), zip(24444), pdf(125470), txt(468), tsv(70473), tsv(354353), tsv(276247), tsv(41)Available download formats
    Dataset updated
    Jan 1, 2006
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    Dr. J. Farace; J. Frantzen; Dr. J. (INIST-CNRS) Schöpfel; C. (INIST-CNRS) Stock; Dr. A.K. (UvA) Boekhorst; Dr. J. Farace; J. Frantzen; Dr. J. (INIST-CNRS) Schöpfel; C. (INIST-CNRS) Stock; Dr. A.K. (UvA) Boekhorst
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Grey literature, an area of interest to special librarians and information professionals, can be traced back a half-century. However, grey literature as a specialized field in information studies is less than a decade old. At GL'97 in Luxembourg, grey literature was redefined "as information produced on all levels of government, academics, business and industry in electronic and print formats not controlled by commercial publishers (i.e. where publishing is not the primary activity of the producing body)". The subject area was broadened and the need for continuing research and instruction pursued. The results of an online survey carried out in 2004, compared with survey results a decade prior, indicate two changes: (1) a move to more specialization in the field of grey literature and (2) a move to more balance in activities related to research and teaching as compared with the processing and distribution of grey literature. It is not that the activities of processing and distribution are today of less concern, but technological advances and the Internet may have made them less labour intensive. The burden that grey literature posed to human resources and budgets appears to have been reduced enough that the benefits of its content are being discovered. This discovery of a wealth of knowledge and information is the onset of further research and instruction in the field of grey literature.

    This research project is a follow-up, or second part, of a citation study. The first part was carried out last year and the results were presented in a conference paper at GL6 in New York. Citation analysis is a relatively objective quantitative method and must be carefully implemented (Moed, 2002). Thus, in an effort to expand the results of our initial analysis beyond the realm of the GL Conference Series, an author survey will also be implemented in this follow-up study. The empirical data gathered from the online questionnaire will be compared with the updated data from the Citation Database, to which the citations in the GL6 Conference Proceedings will have been added. Comparative data from the comprehensive citation database (estimated 1650 records) and the data from the online author survey would then allow for a clearer demonstration of the impact of this research, since only part of the impact of research is covered by citation analysis alone (Thelwall, 2002). This research will allow for tracking the life of a conference paper as well as the application and use of its content within and outside the grey circuit. A further gain would be a better profile of the GL authors, who are the source of GreyNet's knowledge and information base. This in turn could lead to the subsequent development of services that are in line with the needs of authors and researchers in the field of grey literature, for example a citation style for grey literature, where special analysis of hyperlinked citations would provide an opportunity to address the problem of the disparity of web-based grey literature in the context of open archives.

  15. Global News Index and Extracted Features Repository

    • databank.illinois.edu
    Updated Jun 15, 2021
    (2021). Global News Index and Extracted Features Repository [Dataset]. http://doi.org/10.13012/B2IDB-5649852_V1
    Explore at:
    Dataset updated
    Jun 15, 2021
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence. Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful as they refine their queries.

    Additional Resources:
    • Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, you can fill out the Archer User Information Form.
    • Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the Archer User Feedback Form.
    • The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this form to subscribe to the Archer Users Group.

    Citation Guidelines:
    1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation: Cline Center for Advanced Social Research. 2020. Global News Index and Extracted Features Repository [codebook]. Champaign, IL: University of Illinois. doi:10.13012/B2IDB-5649852_V1
    2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access): Cline Center for Advanced Social Research. 2020. Global News Index and Extracted Features Repository [database]. Champaign, IL: University of Illinois. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V1
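
    A raw Solr query of the kind Archer exposes can be sketched as below. The endpoint, core name, and field names (text, sentiment, publication_date) are hypothetical, since Archer's actual schema is not documented here; only the Lucene/Solr query syntax itself is standard.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and field names; Archer's real Solr schema
# is account-holder-only, so this only illustrates the query syntax.
params = {
    "q": 'text:"supply chain" AND sentiment:[0.5 TO 1.0]',   # keyword + range clause
    "fq": "publication_date:[2020-01-01T00:00:00Z TO 2020-12-31T23:59:59Z]",
    "fl": "id,title,publication_date",                       # fields to return
    "rows": 10,
    "wt": "json",
}
query_url = "https://solr.example.org/solr/gni/select?" + urlencode(params)
print(query_url)
```

    The `fq` filter query is cached independently of the main query, which is the idiomatic way to constrain a date window in Solr.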

  16. Replication Data for: Mapping the landscape of geospatial data citations

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 18, 2024
    Leahey, Amber; Genzinger, Peter (2024). Replication Data for: Mapping the landscape of geospatial data citations [Dataset]. http://doi.org/10.5683/SP2/JDLRJP
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Borealis
    Authors
    Leahey, Amber; Genzinger, Peter
    Time period covered
    Jan 1, 2015 - Jan 1, 2018
    Description

    This data supports the paper entitled "Mapping the landscape of geospatial data citations". The dataset covers geospatial data-intensive research papers published between 2015-2018 retrieved using Scopus. The articles' citations were assessed for data citation occurrences and coded using a data citation classification. Data were enhanced and linked to subject coverage and journal policy status information using Excel & SPSS. For more information about how the data were created and coded please review the 'Methodology' section of the paper. More information is provided below, including supplemental documentation and related publications.

    Abstract (paper): Data citations, similar to article and other research citations, are important references to research data that underlie published research results. In support of open science directives, these citations must adhere to specific conventions in terms of consistency of both placement within an article, and the actual availability or access to research data. To better understand the level to which geospatial research data are currently cited, we undertook a study to analyse the rate of data citation within a set of data-intensive geospatial research articles. After analysing 1717 scholarly articles published between 2015 and 2018, we found that very few, or 78 (5%), meaningfully cited primary or secondary geospatial data sources in the cited references section of the article. Even fewer researchers, only 25 or 1.5%, were found to have cited data using a DOI. Given the relatively low data citation rate, a focus on contributing factors, including barriers to citing geospatial data, is needed. And while open sharing requirements for geospatial data may change over time, driving data citation as a result, understanding benchmarks for data citation for monitoring purposes is useful.

  17. Career promotions, research publications, Open Access dataset

    • ordo.open.ac.uk
    zip
    Updated Feb 28, 2022
    Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali (2022). Career promotions, research publications, Open Access dataset [Dataset]. http://doi.org/10.21954/ou.rd.19228785.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    The Open University
    Authors
    Matteo Cancellieri; Nancy Pontika; David Pride; Petr Knoth; Hannah Metzler; Antonia Correia; Helene Brinken; Bikash Gyawali
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a compilation of processed data on citations and references for research papers, including author, institution and open access information for a selected sample of academics, analysed using Microsoft Academic Graph (MAG) data and CORE. The data for this dataset was collected during December 2019 to January 2020. Six countries (Austria, Brazil, Germany, India, Portugal, United Kingdom and United States) were the focus of the six questions which make up this dataset. There is one CSV file per country and per question (36 files in total). More details about the creation of this dataset are available in the public ON-MERRIT D3.1 deliverable report.

    The dataset combines two data sources: one part is a dataset created by analysing promotion policies across the target countries, while the second part is a set of data points used to understand publishing behaviour. To facilitate the analysis the dataset is organised in the following seven folders:

    PRT: The file "PRT_policies.csv" contains the information extracted from promotion, review and tenure (PRT) policies.

    Q1: What % of papers coming from a university are Open Access?
    • Dataset name format: oa_status_countryname_papers.csv
    • Contents: Open Access (OA) status of all papers of all the universities listed in the Times Higher Education World University Rankings (THEWUR) for the given country. A paper is marked OA if at least one OA link is available. OA links are collected using the CORE Discovery API.
    • Considerations: Papers with multiple authorship are preserved only once towards each of the distinct institutions their authors belong to. CORE Discovery does not contain entries for all paper IDs in MAG, so some records will have neither a true nor a false value for the is_OA field. Only records marked true for is_OA can be said to be OA; the others are of unknown status (i.e. not necessarily closed access).

    Q2: How are papers published by the selected universities distributed across the three scientific disciplines of our choice?
    • Dataset name format: fsid_countryname_papers.csv
    • Contents: For the given country, all papers for all the universities listed in THEWUR, with the field of study they belong to.
    • Considerations: MAG can associate a paper with multiple fieldofstudyid values; if a paper belongs to more than one, a separate record was created for each. MAG assigns each fieldofstudyid a score; only records with a score above 0.5 are preserved. Papers with multiple authorship are counted once towards each distinct institution, and papers with authorship from multiple universities are counted once towards each university concerned.

    Q3: What is the gender distribution in authorship of papers published by the universities?
    • Dataset name format: author_gender_countryname_papers.csv
    • Contents: All papers with their author names for all the universities listed in THEWUR.
    • Considerations: When a paper has multiple collaborators, only the records for collaborators from within the selected universities are preserved. An external script was executed to determine the gender of the authors; the script is available here.

    Q4: Distribution of staff seniority (the number of years from first publication to last publication) in the given university.
    • Dataset name format: author_ids_countryname_papers.csv
    • Contents: For a given country, all papers for authors with their publication year for all the universities listed in THEWUR.
    • Considerations: When a paper has multiple collaborators, only the records for collaborators from within the selected universities are preserved. Staff seniority can be calculated in various ways; the most straightforward is academic_age = MAX(year) - MIN(year) for each authorid.

    Q5: Citation counts (incoming) for OA vs non-OA papers published by the university.
    • Dataset name format: cc_oa_countryname_papers.csv
    • Contents: OA status and OA links for all papers of all the universities listed in THEWUR and, for each paper, the count of incoming citations available in MAG.
    • Considerations: CORE Discovery was used to establish the OA status of papers. Papers with multiple authorship are preserved only once towards each distinct institution. Only records marked true for is_OA can be said to be OA; the others are of unknown status (i.e. not necessarily closed access).

    Q6: Count of OA vs non-OA references (outgoing) for all papers published by universities.
    • Dataset name format: rc_oa_countryname_papers.csv
    • Contents: Counts of all OA and unknown-status papers referenced by all papers published by all the universities listed in THEWUR.
    • Considerations: CORE Discovery was used to establish the OA status of the referenced papers. Papers with multiple authorship are counted once towards each distinct institution, and papers with authorship from multiple universities are counted once towards each university concerned.

    Additional files:
    • fieldsofstudy_mag.csv: a dump of the fieldsofstudy table of MAG, mapping each id to its actual field of study name.
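
    The academic_age definition from Q4 reduces to a small group-by, sketched below on made-up rows (the author IDs and years are invented; only the column layout follows the description above).

```python
from collections import defaultdict

# Invented (author_id, publication_year) rows, shaped like the
# author_ids_<country>_papers.csv files described above.
rows = [
    ("a1", 2005), ("a1", 2012), ("a1", 2019),
    ("a2", 2018), ("a2", 2018),
    ("a3", 2010),
]

years = defaultdict(list)
for author_id, year in rows:
    years[author_id].append(year)

# academic_age = MAX(year) - MIN(year) per author, as defined in Q4
academic_age = {a: max(ys) - min(ys) for a, ys in years.items()}
print(academic_age)  # {'a1': 14, 'a2': 0, 'a3': 0}
```

    An author with a single paper, or several papers in one year, gets an academic age of zero under this definition.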

  18. Data from: U.S. Geological Survey Data Citation Analysis, 2016-2022

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 20, 2025
    U.S. Geological Survey (2025). U.S. Geological Survey Data Citation Analysis, 2016-2022 [Dataset]. https://catalog.data.gov/dataset/u-s-geological-survey-data-citation-analysis-2016-2022
    Explore at:
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Description

    In 2022, publication and data linkages were evaluated using two methods in an effort to understand how a data citation workflow has been implemented by the U.S. Geological Survey (USGS) since the 2016 USGS instructional memorandum, Public Access to Results of Federally Funded Research at the U.S. Geological Survey: Scholarly Publications and Digital Data (USGS OSQI, 2016), went into effect, requiring USGS data be assigned a DOI, be accompanied by a citation, and be referenced from the associated publication (USGS OSQI, 2017). This data release includes data and publication structural metadata results retrieved from the USGS DOI Tool and Crossref APIs and Jupyter notebooks used to process and analyze the results.

  19. Author survey data about bibliometrics and altmetrics for open access...

    • researchdata.se
    • demo.researchdata.se
    • +1more
    Updated Jun 5, 2019
    Sofie Wennström; Gabor Schubert; Jeroen Sondervan; Graham Stone (2019). Author survey data about bibliometrics and altmetrics for open access monographs – including data about online usage and citations of academic books from Stockholm University Press [Dataset]. http://doi.org/10.17045/STHLMUNI.8051717
    Explore at:
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Stockholm University
    Authors
    Sofie Wennström; Gabor Schubert; Jeroen Sondervan; Graham Stone
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes a file with results from a survey sent to authors of open access monographs. The survey was available during March–April 2019 and the results are analysed in a paper presented at the 2019 Elpub conference on Jun 2–4 in Marseille, France entitled 'The significant difference in impact – an exploratory study about the meaning and value of metrics for open access monographs'. Version 2 of the dataset has been updated with the slides presented at the conference and the link to the full paper published in the French open archive HAL.

    The respondents of the survey were asked to comment on assumptions about bibliometrics and altmetrics currently in practice, and to think about the meaning of such data in relation to their experiences as authors of books published in a digital format and with an open license (i.e. a Creative Commons license). The survey questionnaire is included as a separate text document. The dataset also includes measures of the usage of open access books published by Stockholm University Press, including information about online usage, mentions in social media and citations. This data is collected from the publisher's platform and the Altmetric.com database, and citation data was collected from Dimensions, Google Scholar, Web of Science and CrossRef. The data was collected in February 2019, except for the figures from the OAPEN Library database, which were collected in November 2018. The paper, including the analysis of these data, is to be published in the Elpub Digital Library. The tables included in the dataset may vary slightly from those in the published paper, due to space constraints in the published version.

  20. OA vs NOA Citation Ratios

    • zenodo.org
    zip
    Updated Jul 6, 2025
    Stefan Fröse; Stefan Fröse (2025). OA vs NOA Citation Ratios [Dataset]. http://doi.org/10.5281/zenodo.15820075
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 6, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Stefan Fröse; Stefan Fröse
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Automated monthly plot of citation ratios using OpenAlex data. This includes a PDF visualization and supporting CSV generated from OpenAlex (CC0) data.
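
    An OA vs non-OA citation ratio of this kind can be computed from OpenAlex works, which expose a cited_by_count field and an open_access.is_oa filter. The sketch below shows one plausible reduction; the query parameters and the count values are illustrative assumptions, not the record's actual pipeline or real API responses.

```python
from urllib.parse import urlencode

# open_access.is_oa and cited_by_count are real OpenAlex concepts;
# the exact aggregation used by this record is an assumption here.
base = "https://api.openalex.org/works"
oa_url = base + "?" + urlencode(
    {"filter": "open_access.is_oa:true,publication_year:2020", "per-page": 200})

# Illustrative per-work cited_by_count values (not real API responses):
oa_counts = [12, 0, 7, 31, 5]    # citation counts of sampled OA works
noa_counts = [4, 0, 2, 9, 1]     # citation counts of sampled non-OA works

oa_mean = sum(oa_counts) / len(oa_counts)
noa_mean = sum(noa_counts) / len(noa_counts)
ratio = oa_mean / noa_mean       # >1 means OA works are cited more on average
print(f"OA / non-OA mean-citation ratio: {ratio:.2f}")
```

    Repeating this computation each month over fresh OpenAlex snapshots yields a time series of ratios like the one plotted in the record's PDF.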
