2 datasets found
  1. dclm-dedup

    • huggingface.co
    Cite
    Zyphra, dclm-dedup [Dataset]. https://huggingface.co/datasets/Zyphra/dclm-dedup
    Dataset authored and provided by
    Zyphra
    License

    https://choosealicense.com/licenses/cc/

    Description

    DCLM-Deduped

    DCLM is a recently released, high-quality dataset that uses model-based quality filtering to filter a large subset of Common Crawl for similarity to OpenHermes and other instruction-tuning datasets. For reference, see the DCLM paper. The original authors of DCLM did not release a fully deduplicated version of their dataset, claiming that full deduplication did not improve performance. The released version was partially deduplicated in shards. Nevertheless, when performing… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/dclm-dedup.
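
The difference between deduplicating in shards and deduplicating fully can be illustrated with a small sketch. This is not the method used by the DCLM authors or by Zyphra: exact matching on a SHA-256 hash is an assumption made for brevity, and the function names and toy data are hypothetical.

```python
import hashlib


def dedup(docs):
    """Keep the first occurrence of each distinct document.

    Assumption: duplicates are exact matches, detected by SHA-256 hash.
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def dedup_within_shards(shards):
    """Deduplicate each shard on its own; cross-shard duplicates survive."""
    return [dedup(shard) for shard in shards]


def dedup_globally(shards):
    """Deduplicate across all shards at once."""
    return dedup([doc for shard in shards for doc in shard])


# Hypothetical toy data: the document "a" appears in both shards.
shards = [["a", "b", "a"], ["a", "c"]]
per_shard = dedup_within_shards(shards)
full = dedup_globally(shards)
```

In this toy case the per-shard pass still leaves one copy of "a" in each shard, while the global pass keeps a single copy overall.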

  2. dclm_20b

    • huggingface.co
    Cite
    MoCa, dclm_20b [Dataset]. https://huggingface.co/datasets/moca-embed/dclm_20b
    Dataset authored and provided by
    MoCa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DCLM used in MoCa Pre-training

    🏠 Homepage | 💻 Code | 🤖 MoCa-Qwen25VL-7B | 🤖 MoCa-Qwen25VL-3B | 📚 Datasets | 📄 Paper

    Introduction

    This is a text pre-training dataset used in the modality-aware continual pre-training of MoCa models. It is adapted from DCLM and randomly downsampled to ~20B tokens. The dataset consists of text examples. The text field is a string containing the text, while the images field is intentionally left blank because no images are available.
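
As a rough illustration of random downsampling to a token budget, the sketch below shuffles documents and keeps them until the budget is reached. It is not the MoCa authors' procedure: the function name, the whitespace token count, and the empty list used for the blank images field are all assumptions; only the field names text and images come from the description above.

```python
import random


def downsample_to_budget(docs, token_budget, seed=0):
    """Shuffle documents and keep them until the token budget would be exceeded."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)

    kept = []
    total = 0
    for doc in shuffled:
        # Assumption: whitespace-separated words stand in for real tokenizer output.
        n_tokens = len(doc.split())
        if total + n_tokens > token_budget:
            break
        # Hypothetical record layout: images is an empty list to mark "no image".
        kept.append({"text": doc, "images": []})
        total += n_tokens
    return kept, total


# Hypothetical toy corpus with a budget of 4 "tokens".
docs = ["one two three", "four five", "six"]
records, used = downsample_to_budget(docs, token_budget=4)
```

A real pipeline would count tokens with the model's tokenizer and would stream the corpus rather than hold it in memory.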

      Citation… See the full description on the dataset page: https://huggingface.co/datasets/moca-embed/dclm_20b.
    

Zyphra/dclm-dedup

3 scholarly articles cite this dataset (View in Google Scholar)
