https://choosealicense.com/licenses/afl-3.0/https://choosealicense.com/licenses/afl-3.0/
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.
Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.
• Direct link: Download
• Citation: @inproceedings{farhangi2022protoformer, title={Protoformer: Embedding Prototypes for Transformers}, author={Farhangi, Ashkan and Sui, Ning and Hua, Nan and Bai, Haiyan and Huang, Arthur and Guo, Zhishan}, booktitle={Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16--19, 2022, Proceedings, Part I}, pages={447--458}, year={2022} }
Arxiv dataset for summarization
Dataset for summarization of long documents.Adapted from this repo.Note that original data are pre-tokenized so this dataset returns " ".join(text) and add " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")
Data Fields
id: paper id article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more.
We hope to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.
jamescalam/ai-arxiv-chunked dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is version 239. The following is a blurb taken from the Kaggle website where this dataset originates:
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was featured further in a Kaggle blog post here.
This is a dataset for evaluating summarisation methods for research papers.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was featured further in a Kaggle blog post here.
https://storage.googleapis.com/kaggle-public-downloads/arXiv.JPG" alt="">
See here for more information.
This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json
format. This file contains an entry for each paper, containing:
- id
: ArXiv ID (can be used to access the paper, see below)
- submitter
: Who submitted the paper
- authors
: Authors of the paper
- title
: Title of the paper
- comments
: Additional info, such as number of pages and figures
- journal-ref
: Information about the journal the paper was published in
- doi
: https://www.doi.org
- abstract
: The abstract of the paper
- categories
: Categories / tags in the ArXiv system
- versions
: A version history
You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}
: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}
: Direct link to download the PDF
The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset
or through Google API (json documentation and xml documentation).
You can use for example gsutil to download the data to your local machine. ```
gsutil cp gs://arxiv-dataset/arxiv/
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/ ```
We're automatically updating the metadata as well as the GCS bucket on a weekly basis.
Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.
The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.
We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
Scientific papers datasets contains two sets of long and structured documents. The datasets are obtained from ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have two features:
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('scientific_papers', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSEhttps://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE
The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to: * Article metadata -- title, authors string, category, doi, abstract, submitter * PDFs -- all PDFs available through arXiv bulk download * Plain text -- PDFs converted to UTF-8 encoded plain text * Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text) * Author string parsing -- convert metadata author strings into standardized list of name, affiliations
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).
A dataset to enable automatic academic paper rating.
The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007–2020. The dataset includes around 94K papers (for which LaTeX source code is available) in a structured form in which paper is split into a title, abstract, sections, paragraphs and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.
Due to the papers license the dataset is published as a metadata and open-source pipeline that can be used to obtain and convert the papers.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data collection
This dataset contains information on the eprints posted on arXiv from its launch in 1991 until the end of 2019 (1,589,006 unique eprints), plus the data on their citations and the associated impact metrics. Here, eprints include preprints, conference proceedings, book chapters, data sets and commentary, i.e. every electronic material that has been posted on arXiv.
The content and metadata of the arXiv eprints were retrieved from the arXiv API (https://arxiv.org/help/api/) as of 21st January 2020, where the metadata included data of the eprint’s title, author, abstract, subject category and the arXiv ID (the arXiv’s original eprint identifier). In addition, the associated citation data were derived from the Semantic Scholar API (https://api.semanticscholar.org/) from 24th January 2020 to 7th February 2020, containing the citation information in and out of the arXiv eprints and their published versions (if applicable). Here, whether an eprint has been published in a journal or other means is assumed to be inferrable, albeit indirectly, from the status of the digital object identifier (DOI) assignment. It is also assumed that if an arXiv eprint received cpre and cpub citations until the data retrieval date (7th February 2020) before and after it is assigned a DOI, respectively, then the citation count of this eprint is recorded in the Semantic Scholar dataset as cpre + cpub. Both the arXiv API and the Semantic Scholar datasets contained the arXiv ID as metadata, which served as a key variable to merge the two datasets.
The classification of research disciplines is based on that described in the arXiv.org website (https://arxiv.org/help/stats/2020_by_area/). There, the arXiv subject categories are aggregated into several disciplines, of which we restrict our attention to the following six disciplines: Astrophysics (‘astro-ph’), Computer Science (‘comp-sci’), Condensed Matter Physics (‘cond-mat’), High Energy Physics (‘hep’), Mathematics (‘math’) and Other Physics (‘oth-phys’), which collectively accounted for 98% of all the eprints. Those eprints tagged to multiple arXiv disciplines were counted independently for each discipline. Due to this overlapping feature, the current dataset contains a cumulative total of 2,011,216 eprints.
Some general statistics and visualisations per research discipline are provided in the original article (Okamura, to appear), where the validity and limitations associated with the dataset are also discussed.
Description of columns (variables)
arxiv_id : arXiv ID
category : Research discipline
pre_year : Year of posting v1 on arXiv
pub_year : Year of DOI acquisition
c_tot : No. of citations acquired during 1991–2019
c_pre : No. of citations acquired before and including the year of DOI acquisition
c_pub : No. of citations acquired after the year of DOI acquisition
c_yyyy (yyyy = 1991, …, 2019) : No. of citations acquired in the year yyyy (with ‘yyyy’ running from 1991 to 2019)
gamma : The quantitatively-and-temporally normalised citation index
gamma_star : The quantitatively-and-temporally standardised citation index
Note: The definition of the quantitatively-and-temporally normalised citation index (γ; ‘gamma’) and that of the standardised citation index (γ*; ‘gamma_star’) are provided in the original article (Okamura, to appear). Both indices can be used to compare the citational impact of papers/eprints published in different research disciplines at different times.
Data files
A comma-separated values file (‘arXiv_impact.csv’) and a Stata file (‘arXiv_impact.dta’) are provided, both containing the same information.
Arxiv Classification: a classification of Arxiv Papers (11 classes). This dataset is intended for long context classification (documents have all > 4k tokens). Copied from "Long Document Classification From Local Word Glimpses via Recurrent Attention Learning" @ARTICLE{8675939, author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao}, journal={IEEE Access}, title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, year={2019}… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-classification.
This is a dataset of scientific documents derived from arXiv. It comprises 203,961 titles and abstracts categorized into 130 different classes from the arXiv category taxonomy. Each document (title+abstract) is categorized into one or more distinct classes. It is split into train (163,168), validation (20,396), and test (20,397) sets.
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
ArXiv Dataset
Overview
This dataset is a comprehensive collection of metadata from the ArXiv repository, a widely-recognized open-access archive offering access to scholarly articles in various fields of science. It covers a broad range of subjects from physics and computer science to mathematics, statistics, electrical engineering, quantitative biology, and economics. The dataset hosted here is derived from the original ArXiv dataset available on Kaggle, which includes… See the full description on the dataset page: https://huggingface.co/datasets/CCRss/arXiv_dataset.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
About ArXiv For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
ArXiv is a collaboratively funded, community-supported resource founded by **Paul Ginsparg **in 1991 and maintained and operated by Cornell University.
ArXiv On Kaggle Metadata This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:
id: ArXiv ID (can be used to access the paper, see below) submitter: Who submitted the paper authors: Authors of the paper title: Title of the paper comments: Additional info, such as number of pages and figures journal-ref: Information about the journal the paper was published in doi: https://www.doi.org abstract: The abstract of the paper categories: Categories / tags in the ArXiv system versions: A version history
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ground truth labels
A newly proposed dataset for local citation recommendation, consisting of 3.2 million local citation sentences along with the title and the abstract of both the citing and the cited papers. Around 1.66 million papers' titles and abstracts are available in the database.
https://choosealicense.com/licenses/afl-3.0/https://choosealicense.com/licenses/afl-3.0/
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.