6 datasets found
  1. g

    arXiv Public Datasets

    • github.com
    Updated Sep 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). arXiv Public Datasets [Dataset]. https://github.com/mitanshu7/arxiv_public_datasets
    Explore at:
    Dataset updated
    Sep 7, 2024
    License

    https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSEhttps://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE

    Description

    The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to: * Article metadata -- title, authors string, category, doi, abstract, submitter * PDFs -- all PDFs available through arXiv bulk download * Plain text -- PDFs converted to UTF-8 encoded plain text * Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text) * Author string parsing -- convert metadata author strings into standardized list of name, affiliations

  2. Data from: arXiv Dataset

    • kaggle.com
    Updated Jul 5, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cornell University (2025). arXiv Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/7548853
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Cornell University
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.

    https://storage.googleapis.com/kaggle-public-downloads/arXiv.JPG" alt="">

    See here for more information.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing: - id: ArXiv ID (can be used to access the paper, see below) - submitter: Who submitted the paper - authors: Authors of the paper - title: Title of the paper - comments: Additional info, such as number of pages and figures - journal-ref: Information about the journal the paper was published in - doi: https://www.doi.org - abstract: The abstract of the paper - categories: Categories / tags in the ArXiv system - versions: A version history

    You can access each paper directly on ArXiv using these links: - https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links - https://arxiv.org/pdf/{id}: Direct link to download the PDF

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use for example gsutil to download the data to your local machine. ```

    List files:

    gsutil cp gs://arxiv-dataset/arxiv/

    Download pdfs from March 2020:

    gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

    Download all the source files

    gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/ ```

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.

  3. Data from: unarXive: A Large Scholarly Data Set with Publications'...

    • zenodo.org
    Updated Apr 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tarek Saier; Tarek Saier; Michael Färber; Michael Färber (2024). unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata [Dataset]. http://doi.org/10.5281/zenodo.3385851
    Explore at:
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Tarek Saier; Tarek Saier; Michael Färber; Michael Färber
    Description

    Description

    unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.

    The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files.

    Typical use cases are

    • Citation recommendation
    • Citation context analysis
    • Bibliographic analyses
    • Reference string parsing

    Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615

    Access

    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┓
    D O W N L O A D S A M P L E  ┃
    ┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

    To download the whole data set send an access request and note the following:

    Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹

    ¹ For information on papers' licenses use arXiv's bulk metadata access.

    The code used for generating the data set is publicly available.

    Usage examples for our data set are provided at here on GitHub.

    Citing

    This initial version of unarXive is described in the following journal article.

    Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
    [link to an author copy]

    The updated version is described in the following conference paper.

    Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
    [link to an author copy]

  4. h

    arxiv-abstracts-2021

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Giancarlo, arxiv-abstracts-2021 [Dataset]. https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Giancarlo
    License

    https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for arxiv-abstracts-2021

      Dataset Summary
    

    A dataset of metadata including title and abstract for all arXiv articles up to the end of 2021 (~2 million papers). Possible applications include trend analysis, paper recommender engines, category prediction, knowledge graph construction and semantic search interfaces. In contrast to arxiv_dataset, this dataset doesn't include papers submitted to arXiv after 2021 and it doesn't require any external download.… See the full description on the dataset page: https://huggingface.co/datasets/gfissore/arxiv-abstracts-2021.

  5. Z

    unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 3, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Saier, Tarek (2023). unarXive: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network (full) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7752753
    Explore at:
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Färber, Michael
    Krause, Johan
    Saier, Tarek
    Description

    Description unarXive is a scholarly data set containing publications' structured full-text, annotated in-text citations, linked non-text content (mathematical notation, figure/table captions) and a citation network. The data is generated from all LaTeX sources on arXiv and therefore of higher quality than data generated from PDF files. Typical uses are

    Training of ML models (citation recommendation, summarization, LLMs) Citation context analysis Bibliographic analyses Access ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┓┃ D O W N L O A D S A M P L E  ┃┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛ To download the whole data set send an access request and note the following:

    Note: this Zenodo record is the "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the paper's licensing terms.¹Alternatively you can use the unarXive open subset. ¹ For information on papers' licenses use arXiv's bulk metadata access. The code for generating the data set is publicly available.

  6. H

    Replication Data for: "Images of the arXiv: reconfiguring large scientific...

    • dataverse.harvard.edu
    Updated Mar 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kynan Tan; Anna Munster; Adrian Mackenzie (2021). Replication Data for: "Images of the arXiv: reconfiguring large scientific image datasets" [Dataset]. http://doi.org/10.7910/DVN/EAAG94
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 3, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Kynan Tan; Anna Munster; Adrian Mackenzie
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Code and data for interacting with the article and image metadata from papers stored in the arXiv repository. At the time of writing the related publication, this involved investigating ~1.5 million articles, ~10 million images, and 2.1 TB of data downloaded from arXiv. This dataset upload contains instructions and code to download the bulk source data, extract into a folder hierarchy, create an SQLite database, and then run various queries, sample images, and generate plots using this data. The full SQLite database is provided containing article metadata, image metadata, and figure captions. Also contained here are data statistics and image credits for images that appear in the related publication.

  7. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
(2024). arXiv Public Datasets [Dataset]. https://github.com/mitanshu7/arxiv_public_datasets

arXiv Public Datasets

Related Article
Explore at:
470 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Sep 7, 2024
License

https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSEhttps://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE

Description

The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to: * Article metadata -- title, authors string, category, doi, abstract, submitter * PDFs -- all PDFs available through arXiv bulk download * Plain text -- PDFs converted to UTF-8 encoded plain text * Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text) * Author string parsing -- convert metadata author strings into standardized list of name, affiliations

Search
Clear search
Close search
Google apps
Main menu