100+ datasets found
  1. ML-ArXiv-Papers

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2022
    + more versions
    Cite
    Connor Shorten (2022). ML-ArXiv-Papers [Dataset]. https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers
    Dataset updated
    Jun 29, 2022
    Authors
    Connor Shorten
    License

    https://choosealicense.com/licenses/afl-3.0/

    Description

    This dataset contains the subset of ArXiv papers with the "cs.LG" tag, which indicates that the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.
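
The category filtering described above can be sketched in a few lines of Python. This is only an illustration, not the maintainer's actual pipeline: the sample records are made up, and the code assumes that each metadata record stores its tags in a "categories" field as a space-separated string.

```python
import json

# Hypothetical sample records, one JSON object per line.
# Assumption: "categories" is a space-separated string of arXiv tags.
SAMPLE_LINES = [
    '{"id": "0001.0001", "title": "A learning paper", "categories": "cs.LG stat.ML"}',
    '{"id": "0001.0002", "title": "A physics paper", "categories": "hep-th"}',
    '{"id": "0001.0003", "title": "A vision paper", "categories": "cs.CV cs.LG"}',
]


def has_category(record, tag):
    """Return True if `tag` is one of the record's whitespace-separated categories."""
    return tag in record.get("categories", "").split()


def filter_by_category(lines, tag="cs.LG"):
    """Parse JSON lines and keep only the records that carry `tag`."""
    records = (json.loads(line) for line in lines if line.strip())
    return [record for record in records if has_category(record, tag)]


ml_papers = filter_by_category(SAMPLE_LINES)
print([paper["id"] for paper in ml_papers])
```

Splitting the string before testing membership avoids false matches on tags that merely contain "cs.LG" as a substring.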

  2. arXiv-10 Dataset

    • paperswithcode.com
    Updated Nov 11, 2024
    Cite
    Ashkan Farhangi; Ning Sui; Nan Hua; Haiyan Bai; Arthur Huang; Zhishan Guo (2024). arXiv-10 Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-10
    Dataset updated
    Nov 11, 2024
    Authors
    Ashkan Farhangi; Ning Sui; Nan Hua; Haiyan Bai; Arthur Huang; Zhishan Guo
    Description

    Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.

    • Direct link: Download

    • Citation: @inproceedings{farhangi2022protoformer, title={Protoformer: Embedding Prototypes for Transformers}, author={Farhangi, Ashkan and Sui, Ning and Hua, Nan and Bai, Haiyan and Huang, Arthur and Guo, Zhishan}, booktitle={Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16--19, 2022, Proceedings, Part I}, pages={447--458}, year={2022} }

  3. arxiv-summarization

    • huggingface.co
    Updated Dec 19, 2021
    Cite
    ccdv (2021). arxiv-summarization [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-summarization
    Dataset updated
    Dec 19, 2021
    Authors
    ccdv
    Description

    Arxiv dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")

      Data Fields
    

    id: paper id article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
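
As a rough illustration of the mapping entry quoted above, the sketch below uses a stand-in dictionary with the same name and shape; the real variable lives inside run_summarization.py. The helpers columns_for and detokenize are hypothetical names introduced only for this example.

```python
# Stand-in for the mapping described above (dataset name -> (text column, summary column)).
# The real variable is defined inside run_summarization.py; this dict only mirrors its shape.
summarization_name_mapping = {
    "ccdv/arxiv-summarization": ("article", "abstract"),
}


def columns_for(dataset_name, mapping=summarization_name_mapping):
    """Return (text_column, summary_column) for a dataset, or None if it is unmapped."""
    return mapping.get(dataset_name)


def detokenize(tokens):
    """Join pre-tokenized text with single spaces, as the description says the dataset does."""
    return " ".join(tokens)


text_column, summary_column = columns_for("ccdv/arxiv-summarization")
print(text_column, summary_column)
print(detokenize(["long", "documents", "are", "hard"]))
```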

  4. Data from: arXiv Dataset

    • paperswithcode.com
    Updated Jul 27, 2020
    + more versions
    Cite
    (2020). arXiv Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-1
    Dataset updated
    Jul 27, 2020
    Description

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more.

    We hope to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.

  5. ai-arxiv-chunked

    • huggingface.co
    Updated Jan 30, 2025
    + more versions
    Cite
    ai-arxiv-chunked [Dataset]. https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked
    Dataset updated
    Jan 30, 2025
    Authors
    James Briggs
    Description

    jamescalam/ai-arxiv-chunked dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. arxiv-kaggle

    • zenodo.org
    json
    Updated Jul 7, 2025
    Cite
    Brian Maltzan; Brian Maltzan (2025). arxiv-kaggle [Dataset]. http://doi.org/10.5281/zenodo.15808027
    Available download formats: json
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brian Maltzan; Brian Maltzan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About Dataset

    This is version 239. The following is a blurb taken from the Kaggle website where this dataset originates:

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.

  7. arXiv Summarization Dataset Dataset

    • paperswithcode.com
    + more versions
    Cite
    Arman Cohan; Franck Dernoncourt; Doo Soon Kim; Trung Bui; Seokhwan Kim; Walter Chang; Nazli Goharian, arXiv Summarization Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-summarization-dataset
    Authors
    Arman Cohan; Franck Dernoncourt; Doo Soon Kim; Trung Bui; Seokhwan Kim; Walter Chang; Nazli Goharian
    Description

    This is a dataset for evaluating summarisation methods for research papers.

  8. Data from: arXiv Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Cornell University (2025). arXiv Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/7548853
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Cornell University
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.


    See here for more information.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

    • id: ArXiv ID (can be used to access the paper, see below)
    • submitter: Who submitted the paper
    • authors: Authors of the paper
    • title: Title of the paper
    • comments: Additional info, such as number of pages and figures
    • journal-ref: Information about the journal the paper was published in
    • doi: https://www.doi.org
    • abstract: The abstract of the paper
    • categories: Categories / tags in the ArXiv system
    • versions: A version history

    You can access each paper directly on ArXiv using these links:

    • https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
    • https://arxiv.org/pdf/{id}: Direct link to download the PDF
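
The two link patterns above can be filled in from the id field of a metadata entry. The following is a minimal sketch: the sample record and the helper name paper_links are hypothetical, and the code assumes each entry can be parsed as a standalone JSON object.

```python
import json

# URL patterns quoted in the dataset description.
ABS_URL = "https://arxiv.org/abs/{id}"
PDF_URL = "https://arxiv.org/pdf/{id}"


def paper_links(record):
    """Build the abstract-page and PDF links for one metadata record."""
    paper_id = record["id"]
    return {
        "abs": ABS_URL.format(id=paper_id),
        "pdf": PDF_URL.format(id=paper_id),
    }


# Hypothetical entry for illustration; real entries carry more fields.
sample_line = '{"id": "0704.0001", "title": "Example title", "categories": "hep-ph"}'
record = json.loads(sample_line)
links = paper_links(record)
print(links["abs"])
print(links["pdf"])
```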

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use, for example, gsutil to download the data to your local machine.

    ```
    # List files:
    gsutil ls gs://arxiv-dataset/arxiv/

    # Download PDFs from March 2020 (-r copies the directory recursively):
    gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

    # Download all the source files:
    gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
    ```

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.

  9. scientific_papers

    • tensorflow.org
    • huggingface.co
    Updated Dec 23, 2022
    + more versions
    Cite
    (2022). scientific_papers [Dataset]. https://www.tensorflow.org/datasets/catalog/scientific_papers
    Dataset updated
    Dec 23, 2022
    Description

    Scientific papers datasets contain two sets of long and structured documents. The datasets are obtained from ArXiv and PubMed OpenAccess repositories.

    Both "arxiv" and "pubmed" have three features:

    • article: the body of the document, paragraphs separated by "\n".
    • abstract: the abstract of the document, paragraphs separated by "\n".
    • section_names: titles of sections, separated by "\n".

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('scientific_papers', split='train')
    for ex in ds.take(4):
        print(ex)


    See the guide for more information on tensorflow_datasets.

  10. arXiv Public Datasets

    • github.com
    Updated Sep 7, 2024
    Cite
    arXiv Public Datasets [Dataset]. https://github.com/mitanshu7/arxiv_public_datasets
    Dataset updated
    Sep 7, 2024
    License

    https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE

    Description

    The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to:

    • Article metadata: title, authors string, category, doi, abstract, submitter
    • PDFs: all PDFs available through arXiv bulk download
    • Plain text: PDFs converted to UTF-8 encoded plain text
    • Citation graph: intra-arXiv citation graph between arXiv IDs only (generated from plain text)
    • Author string parsing: convert metadata author strings into standardized list of name, affiliations

  11. Arxiv HEP-TH citation graph Dataset

    • paperswithcode.com
    + more versions
    Cite
    Arxiv HEP-TH citation graph Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv
    Description

    Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
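
The edge convention above (a directed edge from i to j when paper i cites paper j) can be sketched as follows. The edge list is a made-up toy example, not data from the dataset, and build_graph is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical toy edges in (citing, cited) order: paper i cites paper j.
edges = [(1, 2), (1, 3), (2, 3), (4, 3)]


def build_graph(edge_list):
    """Return two adjacency maps: papers each paper cites, and papers citing each paper."""
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for citing, cited in edge_list:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


cites, cited_by = build_graph(edges)

# In-degree is the number of citations a paper receives from inside the dataset.
in_degree = {paper: len(sources) for paper, sources in cited_by.items()}
print(in_degree)
```

Because citations to or from papers outside the dataset are not recorded, in-degrees computed this way count only citations within the dataset.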

  12. Arxiv Academic Paper Dataset Dataset

    • paperswithcode.com
    Updated May 9, 2018
    + more versions
    Cite
    Pengcheng Yang; Xu Sun; Wei Li; Shuming Ma (2018). Arxiv Academic Paper Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-academic-paper-dataset
    Dataset updated
    May 9, 2018
    Authors
    Pengcheng Yang; Xu Sun; Wei Li; Shuming Ma
    Description

    A dataset to enable automatic academic paper rating.

  13. ArxivPapers Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 23, 2021
    Cite
    Marcin Kardas; Piotr Czapla; Pontus Stenetorp; Sebastian Ruder; Sebastian Riedel; Ross Taylor; Robert Stojnic (2021). ArxivPapers Dataset [Dataset]. https://paperswithcode.com/dataset/arxivpapers
    Dataset updated
    Feb 23, 2021
    Authors
    Marcin Kardas; Piotr Czapla; Pontus Stenetorp; Sebastian Ruder; Sebastian Riedel; Ross Taylor; Robert Stojnic
    Description

    The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.

    Due to the papers' licenses, the dataset is published as metadata and an open-source pipeline that can be used to obtain and convert the papers.

  14. Data from: Citation data of arXiv eprints and the associated...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 7, 2024
    Cite
    Hitoshi Koshiba (2024). Citation data of arXiv eprints and the associated quantitatively-and-temporally normalised impact metrics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5803961
    Dataset updated
    Jan 7, 2024
    Dataset provided by
    Hitoshi Koshiba
    Keisuke Okamura
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data collection

    This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus the data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. every electronic material that has been posted on arXiv.

    The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020, where the metadata included data of the eprint’s title, author, abstract, subject category and the arXiv ID (the arXiv’s original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, containing the citation information in and out of the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or other means is assumed to be inferrable, albeit indirectly, from the status of the digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre and c_pub citations until the data retrieval date (7th February 2020) before and after it is assigned a DOI, respectively, then the citation count of this eprint is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as a key variable to merge the two datasets.

    The classification of research disciplines is based on that described in the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six disciplines: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Those eprints tagged to multiple arXiv disciplines were counted independently for each discipline. Due to this overlapping feature, the current dataset contains a cumulative total of 2,011,216 eprints.

    Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.

    Description of columns (variables)

    arxiv_id : arXiv ID

    category : Research discipline

    pre_year : Year of posting v1 on arXiv

    pub_year : Year of DOI acquisition

    c_tot : No. of citations acquired during 1991–2019

    c_pre : No. of citations acquired before and including the year of DOI acquisition

    c_pub : No. of citations acquired after the year of DOI acquisition

    c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)

    gamma : The quantitatively-and-temporally normalised citation index

    gamma_star : The quantitatively-and-temporally standardised citation index

    Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.

    Data files

    A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
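
A quick consistency check on the columns described above might look like the sketch below. The rows are invented for illustration and show only three of the c_yyyy columns. The check assumes that c_tot equals the sum of the yearly c_yyyy counts, which follows from the column definitions but is not stated explicitly in the description.

```python
import csv
import io

# Hypothetical rows; the real file has one c_yyyy column per year from 1991 to 2019.
SAMPLE_CSV = """arxiv_id,category,pre_year,pub_year,c_tot,c_2017,c_2018,c_2019
1701.00001,comp-sci,2017,2018,12,2,4,6
1801.00002,math,2018,2019,3,0,1,2
"""


def yearly_citations(row):
    """Collect the c_yyyy columns of one row as a {year: count} dictionary."""
    counts = {}
    for key, value in row.items():
        suffix = key[2:]
        # Only columns named c_<digits> are yearly counts; c_tot, c_pre, c_pub are skipped.
        if key.startswith("c_") and suffix.isdigit():
            counts[int(suffix)] = int(value)
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    per_year = yearly_citations(row)
    # Assumption: c_tot should match the sum over all yearly columns.
    print(row["arxiv_id"], sum(per_year.values()) == int(row["c_tot"]))
```

To run this against the real data, replace the in-memory sample with an open handle on arXiv_impact.csv.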

  15. arxiv-classification

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    ccdv (2025). arxiv-classification [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-classification
    Dataset updated
    Mar 26, 2025
    Authors
    ccdv
    Description

    Arxiv Classification: a classification of Arxiv Papers (11 classes). This dataset is intended for long context classification (documents have all > 4k tokens). Copied from "Long Document Classification From Local Word Glimpses via Recurrent Attention Learning" @ARTICLE{8675939, author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao}, journal={IEEE Access}, title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, year={2019}… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-classification.

  16. arXiv Categories Dataset

    • paperswithcode.com
    Updated Oct 10, 2024
    Cite
    Tim Schopf; Alexander Blatzheim; Nektarios Machner; Florian Matthes (2024). arXiv Categories Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-categories
    Dataset updated
    Oct 10, 2024
    Authors
    Tim Schopf; Alexander Blatzheim; Nektarios Machner; Florian Matthes
    Description

    This is a dataset of scientific documents derived from arXiv. It comprises 203,961 titles and abstracts categorized into 130 different classes from the arXiv category taxonomy. Each document (title+abstract) is categorized into one or more distinct classes. It is split into train (163,168), validation (20,396), and test (20,397) sets.
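
The split sizes quoted above can be checked with a few lines of arithmetic. This sketch only uses the numbers given in the description and confirms that the three splits add up to the stated total, in roughly an 80/10/10 proportion.

```python
# Split sizes as quoted in the dataset description.
splits = {"train": 163_168, "validation": 20_396, "test": 20_397}
stated_total = 203_961

total = sum(splits.values())
fractions = {name: size / total for name, size in splits.items()}

print("splits sum to stated total:", total == stated_total)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```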

  17. arXiv_dataset

    • huggingface.co
    • figshare.com
    Updated Mar 31, 2022
    Cite
    CCR (2022). arXiv_dataset [Dataset]. https://huggingface.co/datasets/CCRss/arXiv_dataset
    Dataset updated
    Mar 31, 2022
    Authors
    CCR
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    ArXiv Dataset

      Overview
    

    This dataset is a comprehensive collection of metadata from the ArXiv repository, a widely-recognized open-access archive offering access to scholarly articles in various fields of science. It covers a broad range of subjects from physics and computer science to mathematics, statistics, electrical engineering, quantitative biology, and economics. The dataset hosted here is derived from the original ArXiv dataset available on Kaggle, which includes… See the full description on the dataset page: https://huggingface.co/datasets/CCRss/arXiv_dataset.

  18. arXivMeta

    • kaggle.com
    Updated Aug 7, 2020
    Cite
    itsshavar (2020). arXivMeta [Dataset]. https://www.kaggle.com/shishu1421/arxivmeta/discussion
    Dataset updated
    Aug 7, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    itsshavar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

    • id: ArXiv ID (can be used to access the paper, see below)
    • submitter: Who submitted the paper
    • authors: Authors of the paper
    • title: Title of the paper
    • comments: Additional info, such as number of pages and figures
    • journal-ref: Information about the journal the paper was published in
    • doi: https://www.doi.org
    • abstract: The abstract of the paper
    • categories: Categories / tags in the ArXiv system
    • versions: A version history

  19. Data from: Arxiv

    • ieee-dataport.org
    Updated Nov 13, 2024
    Cite
    Feihu Che (2024). Arxiv [Dataset]. https://ieee-dataport.org/documents/arxiv
    Dataset updated
    Nov 13, 2024
    Authors
    Feihu Che
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ground truth labels

  20. arXiv-200 Dataset

    • paperswithcode.com
    Updated Dec 1, 2021
    Cite
    Nianlong Gu; Yingqiang Gao; Richard H. R. Hahnloser (2021). arXiv-200 Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-200
    Dataset updated
    Dec 1, 2021
    Authors
    Nianlong Gu; Yingqiang Gao; Richard H. R. Hahnloser
    Description

    A newly proposed dataset for local citation recommendation, consisting of 3.2 million local citation sentences along with the title and the abstract of both the citing and the cited papers. Around 1.66 million papers' titles and abstracts are available in the database.

ML-ArXiv-Papers

CShorten/ML-ArXiv-Papers

8 scholarly articles cite this dataset (View in Google Scholar)
