https://choosealicense.com/licenses/cc/
DCLM-Deduped
DCLM is a recently released, high-quality dataset that uses model-based quality filtering to select a large subset of Common Crawl for similarity to OpenHermes and other instruction-tuning datasets. For reference, see the DCLM paper. The original authors of DCLM did not release a fully deduplicated version of their dataset, claiming that full deduplication did not improve performance. The released version was partially deduplicated in shards. Nevertheless, when performing… See the full description on the dataset page: https://huggingface.co/datasets/Zyphra/dclm-dedup.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DCLM used in MoCa Pre-training
🏠 Homepage | 💻 Code | 🤖 MoCa-Qwen25VL-7B | 🤖 MoCa-Qwen25VL-3B | 📚 Datasets | 📄 Paper
Introduction
This is a text pre-training dataset used in the modality-aware continual pre-training of MoCa models. It is adapted from DCLM and randomly downsampled to ~20B tokens. The dataset consists of text-only examples: the text field is a string containing the document text, while the images field is intentionally left blank because no images are available.
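As an illustration of the record layout described above, here is a minimal sketch that checks whether an example is text-only. The field names text and images come from the description; how a blank images field is actually encoded (None, an empty string, or an empty list) is an assumption, so the hypothetical helper below accepts all three.

```python
from typing import Any, Mapping


def is_blank(value: Any) -> bool:
    """Treat None, empty strings, and empty lists/tuples as 'left blank'.

    Assumption: the dataset card does not say how a blank images field
    is stored, so several plausible encodings are accepted here.
    """
    return value is None or value == "" or value == [] or value == ()


def is_text_only_example(example: Mapping[str, Any]) -> bool:
    """Return True if the record has a string text field and no images."""
    text = example.get("text")
    return isinstance(text, str) and is_blank(example.get("images"))


# Hypothetical record shaped like the description, not real dataset content.
sample = {"text": "Common Crawl is a public web archive.", "images": None}
print(is_text_only_example(sample))
```

A check like this can be useful when mixing this corpus with image-text data in one pipeline, since it lets the loader route text-only records without inspecting image payloads.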
Citation… See the full description on the dataset page: https://huggingface.co/datasets/moca-embed/dclm_20b.