32 datasets found
  1. govdocs1-pdf-source

    • huggingface.co
    Cite
    BEEspoke Data, govdocs1-pdf-source [Dataset]. https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source
    Explore at:
    Dataset authored and provided by
    BEEspoke Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    govdocs1: source PDF files

    [!NOTE] Converted versions of other document types (word, txt, etc) are available in this repo

    This repository holds ~220,000 open-access PDF documents (about 6.6M pages) from the govdocs1 dataset. It wants to be OCR'd.

    The files are uploaded as tar pieces of ~10 GiB each due to size/file count limits, with an index.csv covering details. 5,000 randomly sampled PDFs are available unarchived in sample/. Hugging Face supports previewing these in-browser, for example this one… See the full description on the dataset page: https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source.

  2. MORPH-dataset

    • huggingface.co
    Updated Apr 24, 2025
    Cite
    Chima (2025). MORPH-dataset [Dataset]. https://huggingface.co/datasets/ChimaAI/MORPH-dataset
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    Chima
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MORPH Video Dataset

    This repository contains compressed video files for the MORPH dataset.

      Contents
    

    The videos are split into multiple zip files due to size limitations. Each zip file contains a portion of the dataset's videos.

      Usage
    

    To use these videos:

    • Download the zip files
    • Extract them to your local machine
    • Process the videos as needed for your application

      File Structure
    

    • videos_1.zip
    • videos_2.zip
    • ...
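
The extraction step in the usage list above can be sketched in Python. This is a minimal sketch, assuming the downloaded archives sit in one local directory and follow the `videos_*.zip` naming shown in the file structure; the `extract_video_zips` helper and the directory names are hypothetical and not part of the dataset card.

```python
import zipfile
from pathlib import Path


def extract_video_zips(zip_dir: str, out_dir: str) -> list[str]:
    """Extract every videos_*.zip in zip_dir into out_dir.

    Returns the member names of all extracted archives, in archive-name order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    members = []
    for zip_path in sorted(Path(zip_dir).glob("videos_*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(out)
            members.extend(archive.namelist())
    return members
```

A call such as `extract_video_zips("downloads", "morph_videos")` would leave the videos under `morph_videos/`, ready for whatever processing the application needs.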

  3. TinyStories

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated May 22, 2023
    + more versions
    Cite
    Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 22, 2023
    Authors
    Ronen Eldan
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    A dataset of short stories, synthetically generated by GPT-3.5 and GPT-4, that use only a small vocabulary. It is described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Hugging Face, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.

  4. Bare-Makeup-Synthesis-Dataset

    • huggingface.co
    Updated Mar 4, 2025
    Cite
    Qianwen Lu (2025). Bare-Makeup-Synthesis-Dataset [Dataset]. https://huggingface.co/datasets/lulululululululululu/Bare-Makeup-Synthesis-Dataset
    Explore at:
    Dataset updated
    Mar 4, 2025
    Authors
    Qianwen Lu
    Description

    Bare-Makeup Synthesis Dataset (BMS)

    This dataset contains makeup images only. The corresponding bare-skin images can be obtained from the FFHQ dataset using the same filenames.

      📂 Dataset Details
    

    • Total Images: 319,516 (makeup images)
    • Resolution: 512x512
    • Format: PNG (packed in ZIP)
    • License: CC BY 4.0

      📥 Download & Reconstruction
    

    Since Hugging Face limits individual file sizes to 50GB, this dataset is split into multiple parts.

      🔹 Step 1: Download all… See the full description on the dataset page: https://huggingface.co/datasets/lulululululululululu/Bare-Makeup-Synthesis-Dataset.
    
  5. codeparrot-clean

    • huggingface.co
    Updated Dec 7, 2021
    Cite
    CodeParrot (2021). codeparrot-clean [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-clean
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 7, 2021
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    CodeParrot 🦜 Dataset Cleaned

      What is it?
    

    A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

      Processing
    

    The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

    Deduplication
    • Remove exact matches

    Filtering
    • Average line length < 100
    • Maximum line length < 1000
    • Alpha numeric characters fraction > 0.25
    • Remove auto-generated files (keyword search)

    For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
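
The deduplication and filtering steps listed above can be illustrated with a short sketch. The thresholds come from the list; how each quantity is measured (characters per line, alphanumeric fraction over the whole file) is an assumption, the function names are hypothetical, and the keyword search for auto-generated files is left out because its keywords are not given here.

```python
def keep_python_file(source: str,
                     max_avg_line_len: float = 100,
                     max_line_len: int = 1000,
                     min_alnum_frac: float = 0.25) -> bool:
    """Return True if a file passes the three numeric filters from the list."""
    lines = source.splitlines()
    if not lines:
        return False  # empty files carry no usable code
    line_lengths = [len(line) for line in lines]
    avg_len = sum(line_lengths) / len(line_lengths)
    # Fraction of alphanumeric characters over the whole file (assumed definition).
    alnum_frac = sum(ch.isalnum() for ch in source) / len(source)
    return (avg_len < max_avg_line_len
            and max(line_lengths) < max_line_len
            and alnum_frac > min_alnum_frac)


def deduplicate_exact(files: list[str]) -> list[str]:
    """Drop exact duplicate file contents, keeping the first occurrence."""
    seen = set()
    unique = []
    for text in files:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique
```

Applying `deduplicate_exact` first and then keeping only files for which `keep_python_file` returns True mirrors the order of the steps in the list.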

  6. DAS-Sample-Data

    • huggingface.co
    Updated Dec 15, 2024
    Cite
    genrobot.ai (2024). DAS-Sample-Data [Dataset]. https://huggingface.co/datasets/genrobot2025/DAS-Sample-Data
    Explore at:
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    genrobot.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DAS: Data Acquisition System

    Make embodied-intelligence data acquisition as simple and natural as shooting a video.

      📋 Contents
    

    • 📦 How to Use the Dataset
    • 📚 Dataset Structure
    • License
    • Contact

      📦 How to Use the Dataset
    

    Due to Hugging Face's file size limitation of 50GB per file, the dataset has been split into smaller parts.

      📚 Dataset Structure
    

    Purpose: Each HDF5 file corresponds to a single episode and encapsulates both observational data and… See the full description on the dataset page: https://huggingface.co/datasets/genrobot2025/DAS-Sample-Data.

  7. AnyInstruct-resolution-1024

    • huggingface.co
    Updated Aug 11, 2024
    Cite
    OpenMOSS (2024). AnyInstruct-resolution-1024 [Dataset]. https://huggingface.co/datasets/OpenMOSS-Team/AnyInstruct-resolution-1024
    Explore at:
    Dataset updated
    Aug 11, 2024
    Dataset authored and provided by
    OpenMOSS
    Description

    File Restoration and Extraction Guide

      File Structure
    

    • Root directory: Contains Part 1 split files
    • part2/ directory: Contains Part 2 split files

      Instructions
    
    
    
    
    
      Step 1: File Restoration
    

    Due to size limitations, the original file has been split. To restore the complete file:

      cat images_1024.part_* > images_1024.tar

      Step 2: Extraction
    

    To extract the contents:

      tar -xvf images_1024.tar
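
Where `cat` and `tar` are not available, the two steps above could be approximated in Python. This is a sketch under assumptions: the parts are joined in sorted name order, and the `restore_and_extract` helper with its parameters is hypothetical rather than something the dataset card provides.

```python
import glob
import shutil
import tarfile
from pathlib import Path


def restore_and_extract(part_pattern: str, tar_path: str, out_dir: str) -> list[str]:
    """Join split parts into tar_path, then extract the archive into out_dir.

    Returns the part files that were joined, in the order they were written.
    """
    parts = sorted(glob.glob(part_pattern))  # join parts in name order
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern!r}")
    # Step 1: byte-wise concatenation, the equivalent of `cat parts > tar_path`.
    with open(tar_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
    # Step 2: extraction, the equivalent of `tar -xvf tar_path`.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(out_dir)
    return parts
```

For the file names used above, the call would look like `restore_and_extract("images_1024.part_*", "images_1024.tar", "images_1024")`, where the output directory name is only an example.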

      Important Notes
    

    For Part 1 images: Execute… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/AnyInstruct-resolution-1024.

  8. test_sample

    • huggingface.co
    Updated Aug 11, 2024
    Cite
    ENFU NAN (2024). test_sample [Dataset]. https://huggingface.co/datasets/nanenfu/test_sample
    Explore at:
    Dataset updated
    Aug 11, 2024
    Authors
    ENFU NAN
    Description

    Binary Logs (Split Upload)

    This dataset contains a large zip file split into multiple parts due to size limits.

      📁 File List
    

    • test_sample_multi4.zip.part_aa
    • test_sample_multi4.zip.part_ab

      📦 How to Use
    

    Download all parts, then merge them locally:

      cat test_sample_multi4.zip.part_aa test_sample_multi4.zip.part_ab > test_sample_multi4.zip
      unzip test_sample_multi4.zip

    license: mit

  9. sn96g-test-2chunk1-20250919_155537

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-test-2chunk1-20250919_155537 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-test-2chunk1-20250919_155537
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 63
    • Avg answer length (tokens): 83.9 (median 80, min 51, max 145)
    • Schema errors: 0 (should be 0)
    • File size: 0.04 MB
    • SHA256 (data.jsonl): 4ddee3f0499f22ef37b39db1b8138e53d2cd36663202ad0b5767777792831bec

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-test-2chunk1-20250919_155537.
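
The schema-error count and SHA256 reported for data.jsonl could be reproduced with a check along the following lines. This is a sketch, not the validators' actual tooling: it assumes each record holds exactly one user turn followed by one assistant turn, as in the format example, and the function names are hypothetical.

```python
import hashlib
import json


def validate_record(line: str) -> bool:
    """Return True if one JSONL line matches the format shown in the listing."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or "system" not in record:
        return False
    conversations = record.get("conversations")
    # Assumption: exactly one user turn followed by one assistant turn.
    if not isinstance(conversations, list) or len(conversations) != 2:
        return False
    for turn, role in zip(conversations, ["user", "assistant"]):
        if not isinstance(turn, dict) or turn.get("role") != role:
            return False
        if not isinstance(turn.get("content"), str):
            return False
    return True


def summarize_jsonl(data: bytes) -> dict:
    """Count records and schema errors, and hash the raw file bytes."""
    lines = [ln for ln in data.decode("utf-8").splitlines() if ln.strip()]
    errors = sum(not validate_record(ln) for ln in lines)
    return {
        "total_records": len(lines),
        "schema_errors": errors,
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Reading the file in binary mode and passing the bytes to `summarize_jsonl` gives a digest that can be compared against the SHA256 value published in the listing.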

  10. sn96g-ds-20chunk1-20250919_165946

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-ds-20chunk1-20250919_165946 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-ds-20chunk1-20250919_165946
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 43
    • Avg answer length (tokens): 101.5 (median 99, min 70, max 147)
    • Schema errors: 0 (should be 0)
    • File size: 0.03 MB
    • SHA256 (data.jsonl): bbc1611cb16f022eb5b06da63d444cc846e8071a0327ac1f12fa7c109d11943a

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-ds-20chunk1-20250919_165946.

  11. lj_speech

    • huggingface.co
    • tensorflow.org
    Updated May 17, 2024
    Cite
    Keith Ito (2024). lj_speech [Dataset]. https://huggingface.co/datasets/keithito/lj_speech
    Explore at:
    Dataset updated
    May 17, 2024
    Authors
    Keith Ito
    License

    https://choosealicense.com/licenses/unlicense/

    Description

    This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

    Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .wav format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

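    import soundfile as sf

    # `dataset` is assumed to be an already-loaded lj_speech split with a "file" column.
    def map_to_array(batch):
      # Read the .wav file as float32 samples (soundfile defaults to float64).
      speech_array, _ = sf.read(batch["file"], dtype="float32")
      batch["speech"] = speech_array
      return batch

    dataset = dataset.map(map_to_array, remove_columns=["file"])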
    
  12. sn96g-med-general-2chunk1-20250919_190407

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-med-general-2chunk1-20250919_190407 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-med-general-2chunk1-20250919_190407
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 2
    • Avg answer length (tokens): 31 (median 31.0, min 26, max 36)
    • Schema errors: 0 (should be 0)
    • File size: 0.00 MB
    • SHA256 (data.jsonl): 9f17429fd49de1bf267dc6dbdbe983c9bcd2750748f15765cff445bdb8b28a49

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-med-general-2chunk1-20250919_190407.

  13. flock96-dataset

    • huggingface.co
    Cite
    dagliana, flock96-dataset [Dataset]. https://huggingface.co/datasets/raniero/flock96-dataset
    Explore at:
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 5000
    • Avg answer length (tokens): 131.4 (median 128.0, min 80, max 242)
    • Schema errors: 0 (should be 0)
    • File size: 5.13 MB
    • SHA256 (data.jsonl): b10725196a1a4491cf97c0c31d9087df89cc3510f7e99c0ace7260c0d5e72e32

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/flock96-dataset.

  14. sn96g-it-troubleshooting-2chunk1-20250919_174208

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-it-troubleshooting-2chunk1-20250919_174208 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-it-troubleshooting-2chunk1-20250919_174208
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 2
    • Avg answer length (tokens): 24 (median 24.0, min 11, max 37)
    • Schema errors: 0 (should be 0)
    • File size: 0.00 MB
    • SHA256 (data.jsonl): 136e1892f1197a827331e2668b8004e8d55ecd91550f8494639c28463a27b55d

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-it-troubleshooting-2chunk1-20250919_174208.

  15. sn96g-auto-dataset

    • huggingface.co
    Cite
    dagliana, sn96g-auto-dataset [Dataset]. https://huggingface.co/datasets/raniero/sn96g-auto-dataset
    Explore at:
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 1709
    • Avg answer length (tokens): 90.3 (median 87, min 9, max 197)
    • Schema errors: 0 (should be 0)
    • File size: 1.22 MB
    • SHA256 (data.jsonl): 09be7de802ced5d7d93c09590a7cbf47cb911872deefc15dde4c41105e9fd908

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-auto-dataset.

  16. sn96g-history-geoculture-2chunk1-20250919_190101

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-history-geoculture-2chunk1-20250919_190101 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-history-geoculture-2chunk1-20250919_190101
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 2
    • Avg answer length (tokens): 31 (median 31.0, min 23, max 39)
    • Schema errors: 0 (should be 0)
    • File size: 0.00 MB
    • SHA256 (data.jsonl): c2f0650dd8ef778f4ce3ea4be52347b573539c17b60cb6bbf0a84a34ea45fbfb

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-history-geoculture-2chunk1-20250919_190101.

  17. sn96g-howto-practical-2chunk1-20250919_185757

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-howto-practical-2chunk1-20250919_185757 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-howto-practical-2chunk1-20250919_185757
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 2
    • Avg answer length (tokens): 21 (median 21.0, min 15, max 27)
    • Schema errors: 0 (should be 0)
    • File size: 0.00 MB
    • SHA256 (data.jsonl): c9f037ebdefac561e443017e8c36cb50fa38ff5f6f41a7d6ad31617813d4748c

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-howto-practical-2chunk1-20250919_185757.

  18. sn96g-science-explainers-2chunk1-20250919_193451

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-science-explainers-2chunk1-20250919_193451 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-science-explainers-2chunk1-20250919_193451
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 2
    • Avg answer length (tokens): 35 (median 35.0, min 26, max 44)
    • Schema errors: 0 (should be 0)
    • File size: 0.00 MB
    • SHA256 (data.jsonl): 3d134bc47c74ba1f958620f97a9ed58f30fa62c14311e565ac699bde4aa5f089

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-science-explainers-2chunk1-20250919_193451.

  19. edit_amazon_reviews_multi

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Radim Közl (2025). edit_amazon_reviews_multi [Dataset]. https://www.kaggle.com/datasets/radimkzl/edit-amazon-reviews-multi
    Explore at:
    Available download formats: zip (167232183 bytes)
    Dataset updated
    Aug 21, 2025
    Authors
    Radim Közl
    Description

    Dataset Summary

    The data file is based on a copy of the Hugging Face data file buruzaemon/amazon_reviews_multi, which is itself a copy of the original data file defunct-datasets/amazon_reviews_multi. The dataset was published by the community Open Data on AWS.

    In our modification, we removed unnecessary columns, thereby anonymizing the data file, and added columns describing the string lengths of the individual columns; see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:

    • train: 95% (199500)
    • validation: 2.5% (5250)
    • test: 2.5% (5250)
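
The counts in the list above follow from applying the percentages to a total of 210,000 records, which is simply the sum of the three listed counts. A small arithmetic sketch (the `split_sizes` helper is hypothetical):

```python
def split_sizes(total: int, fractions: dict[str, float]) -> dict[str, int]:
    """Turn split fractions into record counts, rounding to the nearest integer."""
    return {name: round(total * frac) for name, frac in fractions.items()}


sizes = split_sizes(210_000, {"train": 0.95, "validation": 0.025, "test": 0.025})
print(sizes)  # {'train': 199500, 'validation': 5250, 'test': 5250}
```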

    The original *.jsonl data format has been changed to the more modern *.parquet format; see Apache Arrow.

    The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.

    This dataset is comprehensive; derived datasets for the tutorial can be found here:

    Description of the original dataset - Hugging Face Datasets

    "We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.

    For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.

    Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source

    Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus

    Languages

    The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.

    Dataset Structure

    • id: record id
    • stars: An int between 1-5 indicating the number of stars.
    • review_body: The text body of the review.
    • review_title: The text title of the review.
    • language: The string identifier of the review language.
    • product_category: String representation of the product's category.
    • lenght_review_body: text length of review_body
    • lenght_review_title: text length of review_title
    • lenght_product_category: text length of product_category
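
The three length columns could be derived from a record as sketched below. It is an assumption that "text length" means the character count returned by `len()`; the `add_length_columns` helper is hypothetical, and the `lenght_` spelling is kept only because it matches the column names in the list above.

```python
def add_length_columns(record: dict) -> dict:
    """Return a copy of the record with the three length columns added."""
    out = dict(record)
    for column in ("review_body", "review_title", "product_category"):
        # Column prefix spelled "lenght_" to match the dataset's own names.
        out[f"lenght_{column}"] = len(record[column])
    return out
```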

    Social Impact of Dataset

    This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source

    Discussion of Biases of the original dataset

    The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform to the Amazon Community Guidelines. source

    Licensing Information

    Licensing of the original dataset

    Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...

  20. sn96g-coding-python-2chunk1-20250919_172914

    • huggingface.co
    Updated Sep 19, 2025
    + more versions
    Cite
    dagliana (2025). sn96g-coding-python-2chunk1-20250919_172914 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-coding-python-2chunk1-20250919_172914
    Explore at:
    Dataset updated
    Sep 19, 2025
    Authors
    dagliana
    Description

    Subnet 96 — Clean Q/A Dataset

    Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

    • Total pairs: 2
    • Avg answer length (tokens): 16 (median 16.0, min 13, max 19)
    • Schema errors: 0 (should be 0)
    • File size: 0.00 MB
    • SHA256 (data.jsonl): 957587716fc204eafff5f6f37b2d390d8eac9431431bd4923696830d42607ee2

    • Language: English
    • Intended for: Bittensor Subnet 96 validators
    • Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-coding-python-2chunk1-20250919_172914.
