2 datasets found
  1.

    dataclysm-arxiv

    • huggingface.co
    Updated Jan 24, 2024
    Cite
    S2 (2024). dataclysm-arxiv [Dataset]. https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv
    Dataset updated
    Jan 24, 2024
    Dataset authored and provided by
    S2
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    DATACLYSM PATCH 0.0.2: ARXIV

      USE THE NOTEBOOK TO GET STARTED!
    

    https://github.com/somewheresystems/dataclysm

      somewheresystems/dataclysm-wikipedia-titles
    

    This dataset comprises 3,360,984 English-language arXiv papers from the Cornell/arXiv dataset, with two new columns added: title-embeddings and abstract-embeddings. These additional columns were generated using the bge-small-en-v1.5 embeddings model. The dataset was sourced from the Cornell/arXiv GCP… See the full description on the dataset page: https://huggingface.co/datasets/somewheresystems/dataclysm-arxiv.
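    As a rough illustration of what the two embedding columns make possible, the sketch below ranks a few rows by cosine similarity to a query vector. It is a minimal example under stated assumptions, not the dataset authors' notebook: the rows are made-up 3-dimensional toy vectors, the `title` field name and the `top_k` helper are hypothetical, and only the `title-embeddings` column name comes from the description above. In practice the query vector would be produced by the same embeddings model as the stored columns.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, rows, column="title-embeddings", k=2):
    """Return the k (score, title) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, row[column]), row["title"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy rows standing in for dataset records (values are invented for illustration).
rows = [
    {"title": "Paper A", "title-embeddings": [1.0, 0.0, 0.0]},
    {"title": "Paper B", "title-embeddings": [0.0, 1.0, 0.0]},
    {"title": "Paper C", "title-embeddings": [0.7, 0.7, 0.0]},
]

query = [0.9, 0.1, 0.0]  # hypothetical query embedding
results = top_k(query, rows)
for score, title in results:
    print(f"{title}: {score:.3f}")
```

    The same ranking could be run against the abstract-embeddings column by passing `column="abstract-embeddings"`, assuming the rows carry that field.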

  2.

    arXiv Paper Abstracts

    • opendatabay.com
    Updated Jun 23, 2025
    Cite
    Datasimple (2025). arXiv Paper Abstracts [Dataset]. https://www.opendatabay.com/data/dataset/b1fe3b22-0ace-4bb5-b400-818fbf063adf
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    Context: Paper submission systems (CMT, OpenReview, etc.) require users to upload paper titles and abstracts and then specify the subject areas their papers best belong to. Wouldn't it be nice if such systems suggested viable subject areas for each submission?

    This dataset would allow developers to build baseline models that might benefit this use case. Data analysts might also enjoy analyzing the intricacies of different papers and how well their abstracts correlate to their noted categories. Additionally, we hope that the dataset will serve as a decent benchmark for building useful text classification systems.
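    A first step toward such a baseline is turning each paper's list of subject areas into a fixed-length target vector, since one paper can belong to several areas at once. The sketch below shows one way to do that with multi-hot encoding. It is only an illustration under assumptions: the `terms` field name, the sample category codes, and both helper functions are hypothetical and not taken from the dataset's notebook.

```python
def build_label_index(label_lists):
    """Map every distinct label to a column position, in sorted order."""
    distinct = sorted({label for labels in label_lists for label in labels})
    return {label: position for position, label in enumerate(distinct)}

def multi_hot(labels, index):
    """Encode one paper's labels as a 0/1 vector with a 1 per assigned label."""
    vector = [0] * len(index)
    for label in labels:
        vector[index[label]] = 1
    return vector

# Invented sample records; a real run would read titles, abstracts, and labels from the data files.
papers = [
    {"title": "Paper A", "terms": ["cs.CV", "cs.LG"]},
    {"title": "Paper B", "terms": ["cs.LG", "stat.ML"]},
    {"title": "Paper C", "terms": ["cs.CL"]},
]

index = build_label_index(paper["terms"] for paper in papers)
targets = [multi_hot(paper["terms"], index) for paper in papers]

for paper, target in zip(papers, targets):
    print(paper["title"], target)
```

    Each target vector can then be paired with features derived from the abstract and fed to any classifier that supports one independent yes/no output per label.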

    Content: The dataset collection process is available in this notebook. Please use the latest version of the data to run your experiments. An accompanying blog post on keras.io discusses the motivation behind this dataset, building a simple baseline model, etc.: Large-scale multi-label text classification.

    Acknowledgements: Thanks to Lukas Schwab (author of arxiv.py) for helping us build our initial data collection utilities. Thanks to Robert Bradshaw for his input on the Apache Beam pipeline. Thanks to the ML-GDE program for providing GCP credits that allowed us to run the Beam pipeline at scale on Dataflow.

    Original Data Source: arXiv Paper Abstracts

