3 datasets found
  1. h

    arXiv-Dataset

    • huggingface.co
    Updated Jun 30, 2015
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ankush Singal (2015). arXiv-Dataset [Dataset]. https://huggingface.co/datasets/Andyrasika/arXiv-Dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 30, 2015
    Authors
    Ankush Singal
    Description

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth. In these times of unique global challenges, efficient extraction of insights from data is… See the full description on the dataset page: https://huggingface.co/datasets/Andyrasika/arXiv-Dataset.

  2. Data from: arXiv Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cornell University (2025). arXiv Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/7548853
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Cornell University
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.

    https://storage.googleapis.com/kaggle-public-downloads/arXiv.JPG" alt="">

    See here for more information.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing: - id: ArXiv ID (can be used to access the paper, see below) - submitter: Who submitted the paper - authors: Authors of the paper - title: Title of the paper - comments: Additional info, such as number of pages and figures - journal-ref: Information about the journal the paper was published in - doi: https://www.doi.org - abstract: The abstract of the paper - categories: Categories / tags in the ArXiv system - versions: A version history

    You can access each paper directly on ArXiv using these links: - https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links - https://arxiv.org/pdf/{id}: Direct link to download the PDF

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use for example gsutil to download the data to your local machine. ```

    List files:

    gsutil cp gs://arxiv-dataset/arxiv/

    Download pdfs from March 2020:

    gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

    Download all the source files

    gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/ ```

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.

  3. arXivMeta

    • kaggle.com
    Updated Aug 7, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    itsshavar (2020). arXivMeta [Dataset]. https://www.kaggle.com/shishu1421/arxivmeta/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 7, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    itsshavar
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    ArXiv is a collaboratively funded, community-supported resource founded by **Paul Ginsparg **in 1991 and maintained and operated by Cornell University.

    ArXiv On Kaggle Metadata This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

    id: ArXiv ID (can be used to access the paper, see below) submitter: Who submitted the paper authors: Authors of the paper title: Title of the paper comments: Additional info, such as number of pages and figures journal-ref: Information about the journal the paper was published in doi: https://www.doi.org abstract: The abstract of the paper categories: Categories / tags in the ArXiv system versions: A version history

  4. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Ankush Singal (2015). arXiv-Dataset [Dataset]. https://huggingface.co/datasets/Andyrasika/arXiv-Dataset

arXiv-Dataset

Andyrasika/arXiv-Dataset

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 30, 2015
Authors
Ankush Singal
Description

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth. In these times of unique global challenges, efficient extraction of insights from data is… See the full description on the dataset page: https://huggingface.co/datasets/Andyrasika/arXiv-Dataset.

Search
Clear search
Close search
Google apps
Main menu