Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
govdocs1: source PDF files
[!NOTE] Converted versions of other document types (word, txt, etc) are available in this repo
This is ~220,000 open-access PDF documents (about 6.6M pages) from the dataset govdocs1. It wants to be OCR'd.
Uploaded as tar file pieces of ~10 GiB each due to size/file count limits, with an index.csv covering details. 5,000 randomly sampled PDFs are available unarchived in sample/. Hugging Face supports previewing these in-browser, for example this one… See the full description on the dataset page: https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MORPH Video Dataset
This repository contains compressed video files for the MORPH dataset.
Contents
The videos are split into multiple zip files due to size limitations. Each zip file contains a portion of the dataset's videos.
Usage
To use these videos:
Download the zip files
Extract them to your local machine
Process the videos as needed for your application
File Structure
videos_1.zip videos_2.zip ...
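The extraction step can be sketched in Python as follows. This is a minimal sketch, not part of the dataset's own tooling: the function name `extract_all` and the glob pattern `videos_*.zip` are assumptions based on the file names listed above, and the internal layout of each archive is not described in the card.

```python
import zipfile
from pathlib import Path


def extract_all(archive_dir: str, out_dir: str, pattern: str = "videos_*.zip") -> list[str]:
    """Extract every archive in archive_dir that matches pattern into out_dir.

    Returns the sorted names of all extracted archive members.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Process archives in a stable (lexicographic) order.
    for archive in sorted(Path(archive_dir).glob(pattern)):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out)
            extracted.extend(zf.namelist())
    return sorted(extracted)
```

Calling `extract_all("downloads", "morph_videos")` would unpack every matching zip under the hypothetical `downloads` directory into `morph_videos`.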
https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Huggingface, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
Bare-Makeup Synthesis Dataset (BMS)
This dataset contains makeup images only. The corresponding bare-skin images can be obtained from the FFHQ dataset using the same filenames.
📂 Dataset Details
Total Images: 319,516 (makeup images)
Resolution: 512x512
Format: PNG (packed in ZIP)
License: CC BY 4.0
📥 Download & Reconstruction
Since Hugging Face limits individual file sizes to 50GB, this dataset is split into multiple parts.
🔹 Step 1: Download all… See the full description on the dataset page: https://huggingface.co/datasets/lulululululululululu/Bare-Makeup-Synthesis-Dataset.
CodeParrot 🦜 Dataset Cleaned
What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.
Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
Deduplication
    Remove exact matches
Filtering
    Average line length < 100
    Maximum line length < 1000
    Alphanumeric characters fraction > 0.25
    Remove auto-generated files (keyword search)
For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
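The deduplication and filtering thresholds listed above can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the maintainers' actual pipeline: `dedup_exact` and `passes_filters` are hypothetical names, exact matching is done on the raw text (the card does not say whether whitespace is normalized first), and the auto-generated-file check is omitted because the keywords are not given.

```python
import hashlib


def dedup_exact(files: list[str]) -> list[str]:
    """Keep the first occurrence of each file, comparing by SHA-256 of the raw text."""
    seen = set()
    unique = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def passes_filters(source: str) -> bool:
    """Apply the three numeric thresholds from the list above to one file."""
    lines = source.splitlines()
    if not lines:
        return False  # empty files carry no signal; dropping them is an assumption
    lengths = [len(line) for line in lines]
    mean_len = sum(lengths) / len(lengths)
    max_len = max(lengths)
    alnum_frac = sum(ch.isalnum() for ch in source) / len(source)
    return mean_len < 100 and max_len < 1000 and alnum_frac > 0.25
```

A typical use would be `[f for f in dedup_exact(files) if passes_filters(f)]`.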
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
DAS: Data Acquisition System
Enable embodied intelligence data acquisition to be as simple and natural as shooting a video.
📋 Contents
📦 How to Use the Dataset
📚 Dataset Structure
License
Contact
📦 How to Use the Dataset
Due to Hugging Face's file size limitation of 50GB per file, the dataset has been split into smaller parts.
📚 Dataset Structure
Purpose: Each HDF5 file corresponds to a single episode and encapsulates both observational data and… See the full description on the dataset page: https://huggingface.co/datasets/genrobot2025/DAS-Sample-Data.
File Restoration and Extraction Guide
File Structure
Root directory: Contains Part 1 split files
part2/ directory: Contains Part 2 split files
Instructions
Step 1: File Restoration
Due to size limitations, the original file has been split. To restore the complete file:
    cat images_1024.part_* > images_1024.tar
Step 2: Extraction
To extract the contents:
    tar -xvf images_1024.tar
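On systems without `cat`, the restoration step can be approximated in Python. This is a sketch under assumptions: `join_parts` is a hypothetical helper, and it relies on the part suffixes sorting lexicographically in the intended order (as suffixes like `aa`, `ab` do); unpadded numeric suffixes would need a different sort key.

```python
import shutil
from pathlib import Path


def join_parts(directory: str, prefix: str, output: str) -> list[str]:
    """Concatenate all files in directory whose names start with prefix into output.

    Returns the part names in the order they were written.
    """
    parts = sorted(Path(directory).glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    with open(output, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never fully loaded in memory.
                shutil.copyfileobj(src, dst)
    return [p.name for p in parts]
```

For example, `join_parts(".", "images_1024.part_", "images_1024.tar")` would produce the same file as the `cat` command, after which the archive can be extracted as described in Step 2.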
Important Notes
For Part 1 images: Execute… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/AnyInstruct-resolution-1024.
Binary Logs (Split Upload)
This dataset contains a large zip file split into multiple parts due to size limits.
📁 File List
test_sample_multi4.zip.part_aa
test_sample_multi4.zip.part_ab
📦 How to Use
Download all parts, then merge them locally:
    cat test_sample_multi4.zip.part_aa test_sample_multi4.zip.part_ab > test_sample_multi4.zip
    unzip test_sample_multi4.zip
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 63
Avg answer length (tokens): 83.9 (median 80, min 51, max 145)
Schema errors: 0 (should be 0)
File size: 0.04 MB
SHA256 (data.jsonl): 4ddee3f0499f22ef37b39db1b8138e53d2cd36663202ad0b5767777792831bec
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-test-2chunk1-20250919_155537.
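The "Schema errors" and "SHA256" fields reported for these Subnet 96 files can be re-checked locally with a sketch like the one below. The exact schema rules used by the dataset author are not stated, so the check here (exactly one user turn followed by one assistant turn, both with string content) is an assumption inferred from the format line, and `is_valid_record` and `check_file` are hypothetical names.

```python
import hashlib
import json


def is_valid_record(record) -> bool:
    """Assumed schema: a dict whose 'conversations' is [user turn, assistant turn]."""
    if not isinstance(record, dict):
        return False
    turns = record.get("conversations")
    if not isinstance(turns, list) or len(turns) != 2:
        return False
    for turn, role in zip(turns, ("user", "assistant")):
        if not isinstance(turn, dict):
            return False
        if turn.get("role") != role or not isinstance(turn.get("content"), str):
            return False
    return True


def check_file(path: str, expected_sha256=None) -> dict:
    """Count records and schema errors, and hash the file in the same pass."""
    digest = hashlib.sha256()
    records = errors = 0
    with open(path, "rb") as fh:
        for raw in fh:
            digest.update(raw)  # hashing line by line equals hashing the whole file
            if not raw.strip():
                continue  # ignore blank lines
            records += 1
            try:
                record = json.loads(raw)
            except ValueError:  # covers JSON syntax and text decoding errors
                errors += 1
                continue
            if not is_valid_record(record):
                errors += 1
    actual = digest.hexdigest()
    ok = expected_sha256 is None or actual == expected_sha256.lower()
    return {"records": records, "schema_errors": errors, "sha256": actual, "sha256_ok": ok}
```

Passing the published checksum as `expected_sha256` makes `sha256_ok` report whether the downloaded `data.jsonl` matches the card.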
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 43
Avg answer length (tokens): 101.5 (median 99, min 70, max 147)
Schema errors: 0 (should be 0)
File size: 0.03 MB
SHA256 (data.jsonl): bbc1611cb16f022eb5b06da63d444cc846e8071a0327ac1f12fa7c109d11943a
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-ds-20chunk1-20250919_165946.
https://choosealicense.com/licenses/unlicense/
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Note that in order to limit the required storage for preparing this dataset, the audio
is stored in the .wav format and is not converted to a float32 array. To convert the audio
file to a float32 array, please make use of the .map() function as follows:
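import soundfile as sf

def map_to_array(batch):
    # Decode the .wav file as float32 samples (sf.read defaults to float64).
    speech_array, _ = sf.read(batch["file"], dtype="float32")
    batch["speech"] = speech_array
    return batch

# Apply to every example and drop the original path column.
dataset = dataset.map(map_to_array, remove_columns=["file"])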
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 2
Avg answer length (tokens): 31 (median 31.0, min 26, max 36)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 9f17429fd49de1bf267dc6dbdbe983c9bcd2750748f15765cff445bdb8b28a49
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-med-general-2chunk1-20250919_190407.
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 5000
Avg answer length (tokens): 131.4 (median 128.0, min 80, max 242)
Schema errors: 0 (should be 0)
File size: 5.13 MB
SHA256 (data.jsonl): b10725196a1a4491cf97c0c31d9087df89cc3510f7e99c0ace7260c0d5e72e32
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/flock96-dataset.
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 2
Avg answer length (tokens): 24 (median 24.0, min 11, max 37)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 136e1892f1197a827331e2668b8004e8d55ecd91550f8494639c28463a27b55d
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-it-troubleshooting-2chunk1-20250919_174208.
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 1709
Avg answer length (tokens): 90.3 (median 87, min 9, max 197)
Schema errors: 0 (should be 0)
File size: 1.22 MB
SHA256 (data.jsonl): 09be7de802ced5d7d93c09590a7cbf47cb911872deefc15dde4c41105e9fd908
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-auto-dataset.
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 2
Avg answer length (tokens): 31 (median 31.0, min 23, max 39)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): c2f0650dd8ef778f4ce3ea4be52347b573539c17b60cb6bbf0a84a34ea45fbfb
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-history-geoculture-2chunk1-20250919_190101.
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 2
Avg answer length (tokens): 21 (median 21.0, min 15, max 27)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): c9f037ebdefac561e443017e8c36cb50fa38ff5f6f41a7d6ad31617813d4748c
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-howto-practical-2chunk1-20250919_185757.
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 2
Avg answer length (tokens): 35 (median 35.0, min 26, max 44)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 3d134bc47c74ba1f958620f97a9ed58f30fa62c14311e565ac699bde4aa5f089
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-science-explainers-2chunk1-20250919_193451.
The data file is based on a copy of the Hugging Face data file buruzaemon/amazon_reviews_multi, which is itself a copy of the original dataset defunct-datasets/amazon_reviews_multi. The dataset was published by the community Open Data on AWS.
In our modification, we removed unnecessary columns and thus anonymized the data file, and at the same time we added columns describing the lengths of the strings of the single columns, see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:
The original *.jsonl data format has been changed to the more modern *.parquet format; see Apache Arrow.
The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.
This dataset is comprehensive; derived datasets for the tutorial can be found here:
"We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.
Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source
Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus
The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.
id: record id
stars: An int between 1-5 indicating the number of stars.
review_body: The text body of the review.
review_title: The text title of the review.
language: The string identifier of the review language.
product_category: String representation of the product's category.
lenght_review_body: text length of review_body
lenght_review_title: text length of review_title
lenght_product_category: text length of product_category
This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source
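The three added length columns can be reproduced with a small sketch. It assumes that "text length" means the number of characters as counted by Python's `len` (the card does not say whether characters, bytes, or tokens are counted), and `add_length_columns` is a hypothetical helper name. The column names keep the dataset's own spelling, `lenght_`.

```python
def add_length_columns(record: dict) -> dict:
    """Return a copy of record with lenght_* columns for the three text fields."""
    out = dict(record)
    for column in ("review_body", "review_title", "product_category"):
        # Column names follow the dataset's spelling ("lenght"), not "length".
        out[f"lenght_{column}"] = len(record[column])
    return out
```

A function of this shape could be passed to `dataset.map` from the datasets library to recompute the columns for every record.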
The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform to the Amazon Community Guidelines. source
Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...
Subnet 96 — Clean Q/A Dataset
Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}
Total pairs: 2
Avg answer length (tokens): 16 (median 16.0, min 13, max 19)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 957587716fc204eafff5f6f37b2d390d8eac9431431bd4923696830d42607ee2
Language: English Intended for: Bittensor Subnet 96 validators Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-coding-python-2chunk1-20250919_172914.