https://choosealicense.com/licenses/cc0-1.0/
DATACLYSM PATCH 0.0.2: ARXIV
USE THE NOTEBOOK TO GET STARTED!
https://github.com/somewheresystems/dataclysm
somewheresystems/dataclysm-wikipedia-titles
This dataset comprises 3,360,984 English-language arXiv papers from the Cornell/arXiv dataset, with two new columns added: title-embeddings and abstract-embeddings. These additional columns were generated using the bge-small-en-v1.5 embeddings model. The dataset was sourced from the Cornell/arXiv GCP… See the full description on the dataset page: https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv.
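One common use for embedding columns like title-embeddings is nearest-neighbour lookup by cosine similarity. The following is a minimal sketch under stated assumptions, not the dataset authors' method: the rows are hypothetical toy records with 3-dimensional vectors (bge-small-en-v1.5 produces 384-dimensional vectors), the `title` field name is assumed, and `cosine_similarity` and `top_k` are illustrative helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2, column="title-embeddings"):
    """Return the k (score, title) pairs whose embedding is most similar to the query."""
    scored = [(cosine_similarity(query, row[column]), row["title"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy rows; real rows would hold 384-dimensional vectors.
rows = [
    {"title": "Paper A", "title-embeddings": [1.0, 0.0, 0.0]},
    {"title": "Paper B", "title-embeddings": [0.0, 1.0, 0.0]},
    {"title": "Paper C", "title-embeddings": [0.9, 0.1, 0.0]},
]

# Hypothetical query vector; in practice it would come from the same embedding model.
query = [1.0, 0.0, 0.0]
results = top_k(query, rows, k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

A linear scan like this is only practical for small samples. With millions of rows, an approximate nearest-neighbour index would likely be a better fit.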
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Context
Paper submission systems (CMT, OpenReview, etc.) require users to upload a paper's title and abstract and then specify the subject areas the paper best belongs to. Wouldn't it be nice if such systems could suggest viable subject areas for each submitted paper?
This dataset would allow developers to build baseline models that might benefit this use case. Data analysts might also enjoy analyzing the intricacies of different papers and how well their abstracts correlate to their noted categories. Additionally, we hope that the dataset will serve as a decent benchmark for building useful text classification systems.
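Because a paper can belong to several subject areas at once, the task is multi-label classification, and the targets are usually encoded as multi-hot vectors. The sketch below illustrates that encoding only; the abstracts are hypothetical placeholders, the category codes are ordinary arXiv identifiers chosen for illustration, and `build_vocabulary` and `multi_hot` are illustrative helpers rather than part of the dataset or the accompanying notebook.

```python
def build_vocabulary(label_lists):
    """Collect every distinct label and return them in a stable, sorted order."""
    return sorted({label for labels in label_lists for label in labels})


def multi_hot(labels, vocabulary):
    """Encode a list of labels as a 0/1 vector aligned with the vocabulary."""
    index = {label: i for i, label in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for label in labels:
        if label in index:  # labels outside the vocabulary are ignored
            vector[index[label]] = 1
    return vector


# Hypothetical (abstract, categories) pairs for illustration only.
papers = [
    ("Hypothetical abstract about neural networks", ["cs.LG", "stat.ML"]),
    ("Hypothetical abstract about image segmentation", ["cs.CV"]),
    ("Hypothetical abstract about vision transformers", ["cs.CV", "cs.LG"]),
]

vocabulary = build_vocabulary(labels for _, labels in papers)
targets = [multi_hot(labels, vocabulary) for _, labels in papers]

print(vocabulary)
for (abstract, _), target in zip(papers, targets):
    print(target, abstract)
```

A baseline model would then predict one independent probability per vocabulary entry, typically with a sigmoid output per label, so that several labels can be active for the same paper.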
Content
The dataset collection process is documented in an accompanying notebook. Please use the latest version of the data to run your experiments. An accompanying blog post on keras.io, "Large-scale multi-label text classification", discusses the motivation behind this dataset and walks through building a simple baseline model.
Acknowledgements
Thanks to Lukas Schwab (author of arxiv.py) for helping us build our initial data collection utilities. Thanks to Robert Bradshaw for his inputs on the Apache Beam pipeline. Thanks to the ML-GDE program for providing GCP credits that allowed us to run the Beam pipeline at scale on Dataflow.
Original Data Source: arXiv Paper Abstracts