6 datasets found

arXiv Public Datasets
github.com
Updated Sep 7, 2024
Cite
(2024). arXiv Public Datasets [Dataset]. https://github.com/mitanshu7/arxiv_public_datasets
Dataset updated
Sep 7, 2024
License
https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE
Description
The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to:
* Article metadata -- title, authors string, category, doi, abstract, submitter
* PDFs -- all PDFs available through arXiv bulk download
* Plain text -- PDFs converted to UTF-8 encoded plain text
* Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text)
* Author string parsing -- convert metadata author strings into standardized list of name, affiliations
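As a rough illustration of how an intra-arXiv citation graph can be derived from plain text, the sketch below scans each paper's text for arXiv-style identifiers and keeps only those that belong to other papers in the collection. This is not the repository's implementation: the regular expression, the function name `extract_citations`, and the sample texts are all assumptions made for the example.

```python
import re

# Assumed identifier shapes: new-style "YYMM.NNNN(N)" and old-style
# "archive/YYMMNNN" (optionally "archive.XX/YYMMNNN"); a trailing "vN" is ignored.
ARXIV_ID = re.compile(
    r"\b(\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?\b"
)


def extract_citations(texts):
    """Map each paper ID to the sorted IDs of other known papers found in its text."""
    known = set(texts)
    graph = {}
    for paper_id, text in texts.items():
        cited = {match.group(1) for match in ARXIV_ID.finditer(text)}
        # Keep only intra-collection edges and drop self-references.
        graph[paper_id] = sorted((cited & known) - {paper_id})
    return graph


# Hypothetical plain-text snippets keyed by arXiv ID.
texts = {
    "1706.03762": "We build on hep-th/9711200 and arXiv:1412.6980v9.",
    "1412.6980": "Unrelated to arXiv:9999.99999.",
    "hep-th/9711200": "Self reference hep-th/9711200 only.",
}
graph = extract_citations(texts)
print(graph)
```

Restricting edges to IDs that are themselves keys of `texts` mirrors the "between arXiv IDs only" constraint in the description above.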
Data from: arXiv Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Cornell University (2025). arXiv Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/7548853
Unique identifier
https://doi.org/10.34740/kaggle/dsv/7548853
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Cornell University
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
About ArXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

The release of this dataset was featured further in a Kaggle blog post here.

See here for more information.

ArXiv On Kaggle

Metadata

This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in JSON format. This file contains an entry for each paper, containing:
- id: ArXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: Digital Object Identifier (https://www.doi.org)
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
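A minimal sketch of reading such a metadata file follows. It assumes the file stores one JSON object per line, that `categories` is a space-separated string, and that `versions` is a list of objects; none of this is stated in the description above. The two sample records and the helper name `iter_records` are invented for illustration.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample standing in for the metadata file; with a real file,
# pass an open file handle instead of this StringIO object.
sample = io.StringIO(
    '{"id": "0704.0001", "title": "Example title", "categories": "hep-ph", '
    '"versions": [{"version": "v1"}, {"version": "v2"}]}\n'
    '{"id": "0704.0002", "title": "Another title", "categories": "math.CO cs.CG", '
    '"versions": [{"version": "v1"}]}\n'
)

records = list(iter_records(sample))
# (id, list of categories, number of versions) for each paper.
summary = [(r["id"], r["categories"].split(), len(r["versions"])) for r in records]
print(summary)
```

Reading line by line keeps memory use flat, which may matter if the metadata file is large.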

You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}: Direct link to download the PDF
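The two link templates above can be filled in programmatically. The helper name `paper_urls` and the sample ID are hypothetical; the URL patterns are the ones listed above.

```python
def paper_urls(arxiv_id):
    """Build the abstract-page and PDF URLs for an arXiv ID."""
    return {
        "abs": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }


# Hypothetical ID used only to show the shape of the result.
urls = paper_urls("0704.0001")
print(urls["abs"])
print(urls["pdf"])
```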

Bulk access

The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

You can use, for example, gsutil to download the data to your local machine.

```
# List files:
gsutil ls gs://arxiv-dataset/arxiv/

# Download pdfs from March 2020:
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

# Download all the source files
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
```
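The March 2020 example uses the directory name 2003, which suggests a two-digit year followed by a two-digit month (YYMM). Assuming that convention holds for other months, which the description does not confirm, the bucket path for a given month could be built as sketched below. The function names are hypothetical, and the command is only assembled as an argument list here, not executed.

```python
def pdf_month_prefix(year, month):
    """Return the assumed bucket path holding PDFs for a given year and month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumption: directories are named YYMM, as in the "2003" example.
    return f"gs://arxiv-dataset/arxiv/arxiv/pdf/{year % 100:02d}{month:02d}/"


def gsutil_copy_args(year, month, destination):
    """Assemble (but do not run) a recursive gsutil copy for one month."""
    return ["gsutil", "cp", "-r", pdf_month_prefix(year, month), destination]


prefix = pdf_month_prefix(2020, 3)
args = gsutil_copy_args(2020, 3, "./a_local_directory/")
print(prefix)
print(" ".join(args))
```

The argument list could be handed to `subprocess.run` on a machine where gsutil is installed.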

Update Frequency

We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

License

Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

Acknowledgements

The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
Data from: unarXive: A Large Scholarly Data Set with Publications'...
zenodo.org
Updated Apr 17, 2024
Cite
Tarek Saier; Michael Färber (2024). unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata [Dataset]. http://doi.org/10.5281/zenodo.3385851
Unique identifier
https://doi.org/10.5281/zenodo.3385851
Dataset updated
Apr 17, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Tarek Saier; Michael Färber
Description
unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.

The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files.

Typical use cases are
- Citation recommendation
- Citation context analysis
- Bibliographic analyses
- Reference string parsing

Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615

Access

Download sample

To download the whole data set send an access request and note the following:

Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹

¹ For information on papers' licenses use arXiv's bulk metadata access.

The code used for generating the data set is publicly available.

Usage examples for our data set are provided here on GitHub.

Citing

This initial version of unarXive is described in the following journal article.

Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
[link to an author copy]

The updated version is described in the following conference paper.

Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
[link to an author copy]
arxiv-abstracts-2021
huggingface.co
Cite
Giancarlo, arxiv-abstracts-2021 [Dataset]. https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021
Authors
Giancarlo
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Card for arxiv-abstracts-2021

Dataset Summary

A dataset of metadata including title and abstract for all arXiv articles up to the end of 2021 (~2 million papers). Possible applications include trend analysis, paper recommender engines, category prediction, knowledge graph construction and semantic search interfaces. In contrast to arxiv_dataset, this dataset doesn't include papers submitted to arXiv after 2021 and it doesn't require any external download.… See the full description on the dataset page: https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021.
unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured...
data.niaid.nih.gov
zenodo.org
Updated Nov 3, 2023
Cite
Saier, Tarek (2023). unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (full) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7752753
Dataset updated
Nov 3, 2023
Dataset provided by
Färber, Michael
Krause, Johan
Saier, Tarek
Description
unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network. The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files. Typical uses are
- Training of ML models (citation recommendation, summarization, LLMs)
- Citation context analysis
- Bibliographic analyses

Access

Download sample

To download the whole data set send an access request and note the following:

Note: this Zenodo record is the "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹ Alternatively you can use the unarXive open subset.

¹ For information on papers' licenses use arXiv's bulk metadata access.

The code for generating the data set is publicly available.
Replication Data for: "Images of the arXiv: reconfiguring large scientific...
dataverse.harvard.edu
Updated Mar 3, 2021
Cite
Kynan Tan; Anna Munster; Adrian Mackenzie (2021). Replication Data for: "Images of the arXiv: reconfiguring large scientific image datasets" [Dataset]. http://doi.org/10.7910/DVN/EAAG94
Unique identifier
https://doi.org/10.7910/DVN/EAAG94
Dataset updated
Mar 3, 2021
Dataset provided by
Harvard Dataverse
Authors
Kynan Tan; Anna Munster; Adrian Mackenzie
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Code and data for interacting with the article and image metadata from papers stored in the arXiv repository. At the time of writing the related publication, this involved investigating ~1.5 million articles, ~10 million images, and 2.1 TB of data downloaded from arXiv. This dataset upload contains instructions and code to download the bulk source data, extract into a folder hierarchy, create an SQLite database, and then run various queries, sample images, and generate plots using this data. The full SQLite database is provided containing article metadata, image metadata, and figure captions. Also contained here are data statistics and image credits for images that appear in the related publication.
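To give a feel for querying article and image metadata held in SQLite, the sketch below builds a tiny in-memory database and counts images per article. The table and column names (`articles`, `images`, `article_id`, `caption`) and all rows are hypothetical; the schema of the published database is not described here and may differ.

```python
import sqlite3

# In-memory stand-in; the real database file and its schema are not described above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE images (
        id INTEGER PRIMARY KEY,
        article_id TEXT REFERENCES articles(id),
        caption TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    [("a1", "First article"), ("a2", "Second article")],
)
conn.executemany(
    "INSERT INTO images (article_id, caption) VALUES (?, ?)",
    [("a1", "Figure 1"), ("a1", "Figure 2"), ("a2", "Figure 1")],
)

# Count images per article; LEFT JOIN keeps articles that have no images.
rows = conn.execute(
    """
    SELECT a.id, COUNT(i.id)
    FROM articles AS a
    LEFT JOIN images AS i ON i.article_id = a.id
    GROUP BY a.id
    ORDER BY a.id
    """
).fetchall()
print(rows)
conn.close()
```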
arXiv Public Datasets
470 scholarly articles cite this dataset (View in Google Scholar)