100+ datasets found

ML-ArXiv-Papers
huggingface.co
opendatalab.com
Updated Jun 29, 2022
Cite
Connor Shorten (2022). ML-ArXiv-Papers [Dataset]. https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2022
Authors
Connor Shorten
License
https://choosealicense.com/licenses/afl-3.0/
Description
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.
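The category filtering described above can be sketched as follows. This is a minimal illustration, not the maintainer's actual pipeline: it assumes the Kaggle metadata is stored as one JSON object per line and that `categories` is a space-separated string of tags. The function name and the sample records are hypothetical.

```python
import json
from typing import Iterable, Iterator


def filter_by_category(lines: Iterable[str], tag: str = "cs.LG") -> Iterator[dict]:
    """Yield metadata records whose `categories` field contains `tag`.

    Assumes one JSON object per line and space-separated category tags.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if tag in record.get("categories", "").split():
            yield record


# Hypothetical sample records, not taken from the real dataset.
sample = [
    '{"id": "0001.00001", "title": "A", "categories": "cs.LG stat.ML"}',
    '{"id": "0001.00002", "title": "B", "categories": "hep-th"}',
]
ml_ids = [record["id"] for record in filter_by_category(sample)]
print(ml_ids)
```

Splitting on whitespace before the membership test avoids matching a tag that merely appears as a substring of another tag.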
arXiv-10 Dataset
paperswithcode.com
Updated Nov 11, 2024
Cite
Ashkan Farhangi; Ning Sui; Nan Hua; Haiyan Bai; Arthur Huang; Zhishan Guo (2024). arXiv-10 Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-10
Dataset updated
Nov 11, 2024
Authors
Ashkan Farhangi; Ning Sui; Nan Hua; Haiyan Bai; Arthur Huang; Zhishan Guo
Description
Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.

• Direct link: Download

• Citation: @inproceedings{farhangi2022protoformer, title={Protoformer: Embedding Prototypes for Transformers}, author={Farhangi, Ashkan and Sui, Ning and Hua, Nan and Bai, Haiyan and Huang, Arthur and Guo, Zhishan}, booktitle={Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16--19, 2022, Proceedings, Part I}, pages={447--458}, year={2022} }
arxiv-summarization
huggingface.co
Updated Dec 19, 2021
Cite
ccdv (2021). arxiv-summarization [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-summarization
Dataset updated
Dec 19, 2021
Authors
ccdv
Description
Arxiv dataset for summarization

Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" between paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")

Data Fields

id: paper id article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
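As a rough illustration of the two points in this description, the sketch below rebuilds a document string from pre-tokenized paragraphs and shows the mapping entry as a plain dictionary. It assumes paragraphs are joined with a newline. The helper name and the toy tokens are hypothetical, and the dictionary only mimics the shape of the `summarization_name_mapping` variable rather than reproducing the Transformers script.

```python
from typing import List


def join_pretokenized(paragraphs: List[List[str]]) -> str:
    """Join tokens with spaces and paragraphs with newlines (assumed separator)."""
    return "\n".join(" ".join(tokens) for tokens in paragraphs)


# Shape of the mapping entry quoted in the description: dataset name -> (text column, summary column).
summarization_name_mapping = {
    "ccdv/arxiv-summarization": ("article", "abstract"),
}

# Toy pre-tokenized input, purely illustrative.
doc = join_pretokenized([["we", "study", "x", "."], ["results", "follow", "."]])
text_column, summary_column = summarization_name_mapping["ccdv/arxiv-summarization"]
print(repr(doc), text_column, summary_column)
```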
Data from: arXiv Dataset
paperswithcode.com
Updated Jul 27, 2020
Cite
(2020). arXiv Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-1
Dataset updated
Jul 27, 2020
Description
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more.

We hope to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.
ai-arxiv-chunked
huggingface.co
Updated Jan 30, 2025
Cite
ai-arxiv-chunked [Dataset]. https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked
Dataset updated
Jan 30, 2025
Authors
James Briggs
Description
jamescalam/ai-arxiv-chunked dataset hosted on Hugging Face and contributed by the HF Datasets community
arxiv-kaggle
zenodo.org
json
Updated Jul 7, 2025
Cite
Brian Maltzan; Brian Maltzan (2025). arxiv-kaggle [Dataset]. http://doi.org/10.5281/zenodo.15808027
Available download formats: json
Unique identifier
https://doi.org/10.5281/zenodo.15808027
Dataset updated
Jul 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Brian Maltzan; Brian Maltzan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

About Dataset

This is version 239. The following is a blurb taken from the Kaggle website where this dataset originates:

About ArXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

The release of this dataset was featured further in a Kaggle blog post here.
arXiv Summarization Dataset Dataset
paperswithcode.com
Cite
Arman Cohan; Franck Dernoncourt; Doo Soon Kim; Trung Bui; Seokhwan Kim; Walter Chang; Nazli Goharian, arXiv Summarization Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-summarization-dataset
Authors
Arman Cohan; Franck Dernoncourt; Doo Soon Kim; Trung Bui; Seokhwan Kim; Walter Chang; Nazli Goharian
Description
This is a dataset for evaluating summarisation methods for research papers.
Data from: arXiv Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Cornell University (2025). arXiv Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/7548853
Unique identifier
https://doi.org/10.34740/kaggle/dsv/7548853
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Cornell University
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
About ArXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

The release of this dataset was featured further in a Kaggle blog post here.

ArXiv On Kaggle

Metadata

This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

- id: ArXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: https://www.doi.org
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history

You can access each paper directly on ArXiv using these links:

- https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}: Direct link to download the PDF
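The two link templates can be filled in from the `id` field of a metadata entry. The following is a small sketch; the function name is hypothetical and the example ID is used only for illustration.

```python
def arxiv_links(arxiv_id: str) -> dict:
    """Build the abstract-page and PDF links for an arXiv ID."""
    return {
        "abs": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }


links = arxiv_links("0704.0001")  # illustrative ID
print(links["abs"])
print(links["pdf"])
```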

Bulk access

The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

You can use for example gsutil to download the data to your local machine.

```
# List files:
gsutil ls gs://arxiv-dataset/arxiv/

# Download pdfs from March 2020:
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

# Download all the source files
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
```
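For scripted downloads, the bucket path for a given month can be assembled programmatically. The sketch below only builds the command and does not run it. It assumes the PDF folders are named by two-digit year and month (inferred from the `2003` example for March 2020) and that the prefix is copied recursively with `-r`. The helper names are hypothetical.

```python
import shlex
from typing import List

BUCKET_PDF_PREFIX = "gs://arxiv-dataset/arxiv/arxiv/pdf"


def pdf_prefix(year: int, month: int) -> str:
    """Return the bucket prefix for one month, assuming YYMM folder names."""
    yymm = f"{year % 100:02d}{month:02d}"
    return f"{BUCKET_PDF_PREFIX}/{yymm}/"


def download_command(year: int, month: int, dest: str = "./a_local_directory/") -> List[str]:
    """Build (but do not execute) a gsutil command that copies one month of PDFs."""
    return ["gsutil", "cp", "-r", pdf_prefix(year, month), dest]


cmd = download_command(2020, 3)
print(shlex.join(cmd))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.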

Update Frequency

We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

License

Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

Acknowledgements

The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
scientific_papers
tensorflow.org
huggingface.co
Updated Dec 23, 2022
Cite
(2022). scientific_papers [Dataset]. https://www.tensorflow.org/datasets/catalog/scientific_papers
Dataset updated
Dec 23, 2022
Description
The scientific_papers dataset contains two sets of long and structured documents. The datasets are obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

article: the body of the document, paragraphs separated by "\n".

abstract: the abstract of the document, paragraphs separated by "\n".

section_names: titles of sections, separated by "\n".

To use this dataset:

import tensorflow_datasets as tfds

# Load the training split.
ds = tfds.load('scientific_papers', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.
arXiv Public Datasets
github.com
Updated Sep 7, 2024
Cite
arXiv Public Datasets [Dataset]. https://github.com/mitanshu7/arxiv_public_datasets
Dataset updated
Sep 7, 2024
License
https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE
Description
The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to:

* Article metadata: title, authors string, category, doi, abstract, submitter
* PDFs: all PDFs available through arXiv bulk download
* Plain text: PDFs converted to UTF-8 encoded plain text
* Citation graph: intra-arXiv citation graph between arXiv IDs only (generated from plain text)
* Author string parsing: convert metadata author strings into standardized list of name, affiliations
Arxiv HEP-TH citation graph Dataset
paperswithcode.com
Cite
Arxiv HEP-TH citation graph Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv
Description
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
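The edge convention (a directed edge from i to j when paper i cites paper j) can be illustrated with a small adjacency structure. This is a sketch on toy edges, not the real 352,807-edge file, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_citation_graph(
    edges: Iterable[Tuple[int, int]],
) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Return (cites, cited_by) maps for directed edges (i, j) meaning i cites j."""
    cites: Dict[int, Set[int]] = defaultdict(set)
    cited_by: Dict[int, Set[int]] = defaultdict(set)
    for i, j in edges:
        cites[i].add(j)      # outgoing edge: i cites j
        cited_by[j].add(i)   # incoming edge: j is cited by i
    return cites, cited_by


# Toy edges for illustration only.
edges = [(1, 2), (1, 3), (2, 3)]
cites, cited_by = build_citation_graph(edges)
# In-degree is the number of citations a paper receives from within the dataset.
in_degree = {paper: len(sources) for paper, sources in cited_by.items()}
print(in_degree)
```

Because citations to or from papers outside the dataset are not recorded, these in-degrees undercount a paper's total citations.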
Arxiv Academic Paper Dataset Dataset
paperswithcode.com
Updated May 9, 2018
Cite
Pengcheng Yang; Xu Sun; Wei Li; Shuming Ma (2018). Arxiv Academic Paper Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-academic-paper-dataset
Dataset updated
May 9, 2018
Authors
Pengcheng Yang; Xu Sun; Wei Li; Shuming Ma
Description
A dataset to enable automatic academic paper rating.
ArxivPapers Dataset
paperswithcode.com
opendatalab.com
Updated Feb 23, 2021
Cite
Marcin Kardas; Piotr Czapla; Pontus Stenetorp; Sebastian Ruder; Sebastian Riedel; Ross Taylor; Robert Stojnic (2021). ArxivPapers Dataset [Dataset]. https://paperswithcode.com/dataset/arxivpapers
Dataset updated
Feb 23, 2021
Authors
Marcin Kardas; Piotr Czapla; Pontus Stenetorp; Sebastian Ruder; Sebastian Riedel; Ross Taylor; Robert Stojnic
Description
The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (those for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.

Due to the papers' licenses, the dataset is published as metadata and an open-source pipeline that can be used to obtain and convert the papers.
Data from: Citation data of arXiv eprints and the associated...
data.niaid.nih.gov
zenodo.org
Updated Jan 7, 2024
Cite
Hitoshi Koshiba (2024). Citation data of arXiv eprints and the associated quantitatively-and-temporally normalised impact metrics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5803961
Dataset updated
Jan 7, 2024
Dataset provided by
Hitoshi Koshiba
Keisuke Okamura
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data collection

This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus the data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. every electronic material that has been posted on arXiv.

The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020, where the metadata included data of the eprint’s title, author, abstract, subject category and the arXiv ID (the arXiv’s original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, containing the citation information in and out of the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or other means is assumed to be inferrable, albeit indirectly, from the status of the digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received c_pre and c_pub citations until the data retrieval date (7th February 2020) before and after it is assigned a DOI, respectively, then the citation count of this eprint is recorded in the Semantic Scholar dataset as c_pre + c_pub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as a key variable to merge the two datasets.

The classification of research disciplines is based on that described in the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six disciplines: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Those eprints tagged to multiple arXiv disciplines were counted independently for each discipline. Due to this overlapping feature, the current dataset contains a cumulative total of 2,011,216 eprints.

Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.

Description of columns (variables)

arxiv_id : arXiv ID

category : Research discipline

pre_year : Year of posting v1 on arXiv

pub_year : Year of DOI acquisition

c_tot : No. of citations acquired during 1991–2019

c_pre : No. of citations acquired before and including the year of DOI acquisition

c_pub : No. of citations acquired after the year of DOI acquisition

c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)

gamma : The quantitatively-and-temporally normalised citation index

gamma_star : The quantitatively-and-temporally standardised citation index

Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.

Data files

A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
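A quick consistency check over the columns described above might look like the following sketch. It reads a tiny in-memory sample rather than the real 'arXiv_impact.csv', and it assumes, without confirmation from the dataset documentation, that c_tot equals both c_pre + c_pub and the sum of the yearly c_yyyy columns. The sample row and helper name are hypothetical.

```python
import csv
import io


def yearly_total(row: dict, first: int = 1991, last: int = 2019) -> int:
    """Sum the c_yyyy columns present in a row (missing or empty cells count as 0)."""
    return sum(
        int(row[f"c_{year}"] or 0)
        for year in range(first, last + 1)
        if f"c_{year}" in row
    )


# Hypothetical sample with only two of the yearly columns, for illustration.
sample_csv = io.StringIO(
    "arxiv_id,category,c_tot,c_pre,c_pub,c_2018,c_2019\n"
    "1801.00001,comp-sci,5,2,3,2,3\n"
)
rows = list(csv.DictReader(sample_csv))

# Assumed identities: c_tot == c_pre + c_pub == sum of yearly counts.
checks = [
    int(r["c_tot"]) == int(r["c_pre"]) + int(r["c_pub"]) == yearly_total(r)
    for r in rows
]
print(checks)
```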
arxiv-classification
huggingface.co
Updated Mar 26, 2025
Cite
ccdv (2025). arxiv-classification [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-classification
Dataset updated
Mar 26, 2025
Authors
ccdv
Description
Arxiv Classification: a classification of Arxiv Papers (11 classes). This dataset is intended for long context classification (all documents have > 4k tokens). Copied from "Long Document Classification From Local Word Glimpses via Recurrent Attention Learning" @ARTICLE{8675939, author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao}, journal={IEEE Access}, title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, year={2019}… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-classification.
arXiv Categories Dataset
paperswithcode.com
Updated Oct 10, 2024
Cite
Tim Schopf; Alexander Blatzheim; Nektarios Machner; Florian Matthes (2024). arXiv Categories Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-categories
Dataset updated
Oct 10, 2024
Authors
Tim Schopf; Alexander Blatzheim; Nektarios Machner; Florian Matthes
Description
This is a dataset of scientific documents derived from arXiv. It comprises 203,961 titles and abstracts categorized into 130 different classes from the arXiv category taxonomy. Each document (title+abstract) is categorized into one or more distinct classes. It is split into train (163,168), validation (20,396), and test (20,397) sets.
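The split sizes quoted above can be checked against the stated total with a few lines of arithmetic. This sketch uses only the numbers in the description; the variable names are illustrative.

```python
# Split sizes as stated in the description.
splits = {"train": 163_168, "validation": 20_396, "test": 20_397}

total = sum(splits.values())
# Share of each split, rounded to three decimals.
fractions = {name: round(size / total, 3) for name, size in splits.items()}

print(total)
print(fractions)
```

The three splits sum to the stated 203,961 documents and correspond to roughly an 80/10/10 partition.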
arXiv_dataset
huggingface.co
figshare.com
Updated Mar 31, 2022
Cite
CCR (2022). arXiv_dataset [Dataset]. https://huggingface.co/datasets/CCRss/arXiv_dataset
Dataset updated
Mar 31, 2022
Authors
CCR
License
https://choosealicense.com/licenses/cc0-1.0/
Description
ArXiv Dataset

Overview

This dataset is a comprehensive collection of metadata from the ArXiv repository, a widely-recognized open-access archive offering access to scholarly articles in various fields of science. It covers a broad range of subjects from physics and computer science to mathematics, statistics, electrical engineering, quantitative biology, and economics. The dataset hosted here is derived from the original ArXiv dataset available on Kaggle, which includes… See the full description on the dataset page: https://huggingface.co/datasets/CCRss/arXiv_dataset.
arXivMeta
kaggle.com
Updated Aug 7, 2020
Cite
itsshavar (2020). arXivMeta [Dataset]. https://www.kaggle.com/shishu1421/arxivmeta/discussion
Dataset updated
Aug 7, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
itsshavar
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
About ArXiv

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

ArXiv On Kaggle

Metadata

This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

- id: ArXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: https://www.doi.org
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
Data from: Arxiv
ieee-dataport.org
Updated Nov 13, 2024
Cite
Feihu Che (2024). Arxiv [Dataset]. https://ieee-dataport.org/documents/arxiv
Dataset updated
Nov 13, 2024
Authors
Feihu Che
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ground truth labels
arXiv-200 Dataset
paperswithcode.com
Updated Dec 1, 2021
Cite
Nianlong Gu; Yingqiang Gao; Richard H. R. Hahnloser (2021). arXiv-200 Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-200
Dataset updated
Dec 1, 2021
Authors
Nianlong Gu; Yingqiang Gao; Richard H. R. Hahnloser
Description
A newly proposed dataset for local citation recommendation, consisting of 3.2 million local citation sentences along with the title and the abstract of both the citing and the cited papers. Around 1.66 million papers' titles and abstracts are available in the database.

ML-ArXiv-Papers (CShorten/ML-ArXiv-Papers): 8 scholarly articles cite this dataset.