https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE
The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to:
* Article metadata -- title, authors string, category, doi, abstract, submitter
* PDFs -- all PDFs available through arXiv bulk download
* Plain text -- PDFs converted to UTF-8 encoded plain text
* Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text)
* Author string parsing -- convert metadata author strings into standardized list of name, affiliations
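To make the "citation graph generated from plain text" idea concrete, here is a minimal sketch of how intra-arXiv edges could be found by scanning plain text for arXiv identifiers. It is not the repository's implementation: the regular expression, the function names `cited_ids` and `citation_edges`, and the restriction to new-style IDs (such as `2003.01234`) are all assumptions made for illustration.

```python
import re

# Assumption: only new-style arXiv IDs (YYMM.NNNN or YYMM.NNNNN, optional vN suffix)
# are matched; old-style IDs such as hep-th/9901001 are not handled in this sketch.
ARXIV_ID = re.compile(r"(?<![\d.])(\d{4}\.\d{4,5})(?:v\d+)?(?!\d)")

def cited_ids(text, own_id=None):
    """Return the sorted arXiv IDs mentioned in text, excluding own_id."""
    found = {m.group(1) for m in ARXIV_ID.finditer(text)}
    found.discard(own_id)
    return sorted(found)

def citation_edges(fulltexts):
    """Turn {arxiv_id: plain_text} into a list of (citing, cited) pairs."""
    edges = []
    for citing, text in fulltexts.items():
        for cited in cited_ids(text, own_id=citing):
            edges.append((citing, cited))
    return edges

# Hypothetical texts, made up for illustration.
example = {
    "2003.01234": "We extend arXiv:1501.00001v2 and 2003.01234.",
    "1501.00001": "No references.",
}
print(citation_edges(example))
```

Version suffixes are dropped so that every version of a paper maps to the same node, which matches the description of a graph "between arXiv IDs only".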
https://creativecommons.org/publicdomain/zero/1.0/
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets. Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was also featured in a Kaggle blog post.
This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the JSON format. This file contains an entry for each paper, containing:
- id: ArXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: Digital Object Identifier (https://www.doi.org)
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
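As a rough illustration of working with these fields, the sketch below reads records one at a time. It assumes the metadata file stores one JSON object per line and that `categories` is a space-separated string; neither detail is stated above, so check the file you downloaded. The helper name `iter_records` and the sample record are hypothetical.

```python
import json

def iter_records(lines):
    """Yield one dict per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical record, made up for illustration; it uses only keys listed above.
sample = [
    '{"id": "0000.00000", "title": "An example title", '
    '"categories": "cs.LG stat.ML", "abstract": "..."}'
]

for rec in iter_records(sample):
    # Assumption: categories is a space-separated string of tags.
    print(rec["id"], rec["title"], rec["categories"].split())
```

Because `iter_records` accepts any iterable of lines, an open file handle can be passed in directly, which avoids loading the whole metadata file into memory.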
You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}: Direct link to download the PDF
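The two link patterns can be filled in from a record's `id` field. This is a small sketch; the function name `paper_links` is an assumption for illustration, and the ID in the usage line is a placeholder.

```python
ABS_URL = "https://arxiv.org/abs/{id}"
PDF_URL = "https://arxiv.org/pdf/{id}"

def paper_links(arxiv_id):
    """Build the abstract-page and PDF links for one arXiv ID."""
    arxiv_id = arxiv_id.strip()
    return {
        "abs": ABS_URL.format(id=arxiv_id),
        "pdf": PDF_URL.format(id=arxiv_id),
    }

# Placeholder ID, for illustration only.
print(paper_links("2003.01234"))
```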
The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).
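You can use, for example, gsutil to download the data to your local machine.
```
# List the files in the bucket
gsutil ls gs://arxiv-dataset/arxiv/
# Download the PDFs in the 2003 directory (-r copies a directory recursively)
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/
# Download everything under arxiv/
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
```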
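If you want to fetch a specific month, the directory name can be computed rather than typed by hand. The sketch below assumes that the PDF directories follow arXiv's YYMM naming, so that `2003` in the example path means March 2020; the function name `pdf_prefix` and its default `bucket` argument are illustrative, taken from the example path rather than from any documented API.

```python
def pdf_prefix(year, month, bucket="gs://arxiv-dataset/arxiv/arxiv/pdf"):
    """Return the bucket prefix for one month, assuming YYMM directory names."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"{bucket}/{year % 100:02d}{month:02d}/"

# Print a gsutil command for March 2020 instead of running it.
print(f"gsutil cp -r {pdf_prefix(2020, 3)} ./a_local_directory/")
```

Printing the command instead of executing it keeps the sketch free of side effects; you can paste the result into a shell once the path looks right.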
We're automatically updating the metadata as well as the GCS bucket on a weekly basis.
Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.
The original data is maintained by ArXiv. Huge thanks to the team for building and maintaining this dataset.
We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data. Thanks to Matt Bierbaum for providing this tool.
unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.
The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files.
Typical use cases are
Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615
To download the whole data set send an access request and note the following:
Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹
¹ For information on papers' licenses use arXiv's bulk metadata access.
The code used for generating the data set is publicly available.
Usage examples for our data set are provided on GitHub.
This initial version of unarXive is described in the following journal article.
Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
[link to an author copy]
The updated version is described in the following conference paper.
Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
[link to an author copy]
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for arxiv-abstracts-2021
Dataset Summary
A dataset of metadata including title and abstract for all arXiv articles up to the end of 2021 (~2 million papers). Possible applications include trend analysis, paper recommender engines, category prediction, knowledge graph construction and semantic search interfaces. In contrast to arxiv_dataset, this dataset doesn't include papers submitted to arXiv after 2021 and it doesn't require any external download.… See the full description on the dataset page: https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021.
Description
unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network. The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files. Typical uses are
- Training of ML models (citation recommendation, summarization, LLMs)
- Citation context analysis
- Bibliographic analyses
Access
To download the whole data set send an access request and note the following:
Note: this Zenodo record is the "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹ Alternatively you can use the unarXive open subset.
¹ For information on papers' licenses use arXiv's bulk metadata access.
The code for generating the data set is publicly available.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Code and data for interacting with the article and image metadata from papers stored in the arXiv repository. At the time of writing the related publication, this involved investigating ~1.5 million articles, ~10 million images, and 2.1 TB of data downloaded from arXiv. This dataset upload contains instructions and code to download the bulk source data, extract into a folder hierarchy, create an SQLite database, and then run various queries, sample images, and generate plots using this data. The full SQLite database is provided containing article metadata, image metadata, and figure captions. Also contained here are data statistics and image credits for images that appear in the related publication.