2 datasets found

dataclysm-arxiv
huggingface.co
Updated Jan 24, 2024
Cite
S2 (2024). dataclysm-arxiv [Dataset]. https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv
Dataset authored and provided by
S2
License
https://choosealicense.com/licenses/cc0-1.0/
Description
DATACLYSM PATCH 0.0.2: ARXIV

USE THE NOTEBOOK TO GET STARTED!

https://github.com/somewheresystems/dataclysm

somewheresystems/dataclysm-wikipedia-titles

This dataset comprises 3,360,984 English-language arXiv papers from the Cornell/arXiv dataset, with two new columns added: title-embeddings and abstract-embeddings. These additional columns were generated using the bge-small-en-v1.5 embeddings model. The dataset was sourced from the Cornell/arXiv GCP… See the full description on the dataset page: https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv.
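
As a rough illustration of what the embedding columns make possible, the sketch below ranks papers by cosine similarity between a query vector and each row's abstract-embeddings value. It is not taken from the dataset's notebook. The `title` field, the three-dimensional toy vectors, and the helper names `cosine_similarity` and `top_k` are assumptions made for the example; real vectors would come from the embedding model and be much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, column="abstract-embeddings", k=2):
    """Return the k (score, title) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, row[column]), row["title"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows standing in for dataset records (hypothetical values).
rows = [
    {"title": "Paper A", "abstract-embeddings": [1.0, 0.0, 0.0]},
    {"title": "Paper B", "abstract-embeddings": [0.0, 1.0, 0.0]},
    {"title": "Paper C", "abstract-embeddings": [0.7, 0.7, 0.0]},
]

results = top_k([1.0, 0.1, 0.0], rows)
for score, title in results:
    print(f"{title}: {score:.3f}")
```

A linear scan like this is only practical for small samples. At the scale of millions of rows, an approximate nearest-neighbour index would likely be a better fit.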
arXiv Paper Abstracts
opendatabay.com
Updated Jun 23, 2025
Cite
Datasimple (2025). arXiv Paper Abstracts [Dataset]. https://www.opendatabay.com/data/dataset/b1fe3b22-0ace-4bb5-b400-818fbf063adf
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Education & Learning Analytics
Description
Context: Paper submission systems (CMT, OpenReview, etc.) require users to upload paper titles and abstracts and then specify the subject areas their papers best belong to. Wouldn't it be nice if such systems suggested viable subject areas for each submitted paper?

This dataset would allow developers to build baseline models that might benefit this use case. Data analysts might also enjoy analyzing the intricacies of different papers and how well their abstracts correlate to their noted categories. Additionally, we hope that the dataset will serve as a decent benchmark for building useful text classification systems.

Content: The dataset collection process is documented in a notebook. Please use the latest version of the data to run your experiments. An accompanying blog post on keras.io, "Large-scale multi-label text classification", discusses the motivation behind this dataset and builds a simple baseline model.
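
Because a paper can belong to several subject areas at once, a multi-label classifier needs each paper's categories encoded as a multi-hot target vector. The sketch below shows one minimal way to do that in plain Python. The example titles, the category labels, and the helper names `build_vocab` and `multi_hot` are hypothetical and are not taken from the dataset or the blog post.

```python
def build_vocab(label_lists):
    """Collect every distinct label and return them in sorted order."""
    return sorted({label for labels in label_lists for label in labels})


def multi_hot(labels, vocab):
    """Encode a list of labels as a 0/1 vector aligned with vocab."""
    index = {label: i for i, label in enumerate(vocab)}
    vector = [0] * len(vocab)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Hypothetical (title, categories) pairs standing in for dataset rows.
examples = [
    ("A survey of vision transformers", ["cs.CV", "cs.LG"]),
    ("Bayesian optimisation revisited", ["stat.ML", "cs.LG"]),
]

vocab = build_vocab(labels for _, labels in examples)
targets = [multi_hot(labels, vocab) for _, labels in examples]

for (title, _), target in zip(examples, targets):
    print(title, target)
```

Each position in a target vector corresponds to one entry of `vocab`, so a model with one sigmoid output per label can be trained against these vectors directly.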

Acknowledgements: Thanks to Lukas Schwab (author of arxiv.py) for helping us build our initial data collection utilities. Thanks to Robert Bradshaw for his input on the Apache Beam pipeline. Thanks to the ML-GDE program for providing GCP credits that allowed us to run the Beam pipeline at scale on Dataflow.

Original Data Source: arXiv Paper Abstracts