32 datasets found

govdocs1-pdf-source
huggingface.co
Cite
BEEspoke Data, govdocs1-pdf-source [Dataset]. https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source
Explore at:
Dataset authored and provided by
BEEspoke Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
govdocs1: source PDF files

[!NOTE] Converted versions of other document types (word, txt, etc) are available in this repo

This is ~220,000 open-access PDF documents (about 6.6M pages) from the dataset govdocs1. It wants to be OCR'd.

Uploaded as tar file pieces of ~10 GiB each due to size/file count limits, with an index.csv covering details. 5,000 randomly sampled PDFs are available unarchived in sample/. Hugging Face supports previewing these in-browser, for example this one… See the full description on the dataset page: https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source.
MORPH-dataset
huggingface.co
Updated Apr 24, 2025
Cite
Chima (2025). MORPH-dataset [Dataset]. https://huggingface.co/datasets/ChimaAI/MORPH-dataset
Explore at:
Dataset updated
Apr 24, 2025
Dataset authored and provided by
Chima
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
MORPH Video Dataset

This repository contains compressed video files for the MORPH dataset.

Contents

The videos are split into multiple zip files due to size limitations. Each zip file contains a portion of the dataset's videos.

Usage

To use these videos:

1. Download the zip files
2. Extract them to your local machine
3. Process the videos as needed for your application

File Structure

videos_1.zip
videos_2.zip
...
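
The extraction step above can be sketched in Python. This is a minimal sketch, assuming the archives have already been downloaded into one local directory and follow the `videos_N.zip` naming shown above; the function name `extract_video_archives` and both directory arguments are hypothetical and not part of the dataset.

```python
import zipfile
from pathlib import Path


def extract_video_archives(archive_dir, output_dir):
    """Extract every videos_*.zip found in archive_dir into output_dir.

    Returns the sorted list of archive file names that were extracted.
    """
    archive_dir = Path(archive_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    # Each archive is independent, so a simple sorted order is enough.
    for archive in sorted(archive_dir.glob("videos_*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(output_dir)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_video_archives("downloads", "videos")` would unpack all matching archives into `videos/`; how the videos are processed afterwards depends on the application.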
TinyStories
huggingface.co
opendatalab.com
Updated May 22, 2023
Cite
Ronen Eldan (2023). TinyStories [Dataset]. https://huggingface.co/datasets/roneneldan/TinyStories
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 22, 2023
Authors
Ronen Eldan
License
https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Dataset containing synthetically generated (by GPT-3.5 and GPT-4) short stories that only use a small vocabulary. Described in the following paper: https://arxiv.org/abs/2305.07759. The models referred to in the paper were trained on TinyStories-train.txt (the file tinystories-valid.txt can be used for validation loss). These models can be found on Hugging Face, at roneneldan/TinyStories-1M/3M/8M/28M/33M/1Layer-21M. Additional resources: tinystories_all_data.tar.gz - contains a superset of… See the full description on the dataset page: https://huggingface.co/datasets/roneneldan/TinyStories.
Bare-Makeup-Synthesis-Dataset
huggingface.co
Updated Mar 4, 2025
Cite
Qianwen Lu (2025). Bare-Makeup-Synthesis-Dataset [Dataset]. https://huggingface.co/datasets/lulululululululululu/Bare-Makeup-Synthesis-Dataset
Explore at:
Dataset updated
Mar 4, 2025
Authors
Qianwen Lu
Description
Bare-Makeup Synthesis Dataset (BMS)

This dataset contains makeup images only. The corresponding bare-skin images can be obtained from the FFHQ dataset using the same filenames.

📂 Dataset Details

Total Images: 319,516 (makeup images)
Resolution: 512x512
Format: PNG (packed in ZIP)
License: CC BY 4.0

📥 Download & Reconstruction

Since Hugging Face limits individual file sizes to 50GB, this dataset is split into multiple parts.

🔹 Step 1: Download all… See the full description on the dataset page: https://huggingface.co/datasets/lulululululululululu/Bare-Makeup-Synthesis-Dataset.
codeparrot-clean
huggingface.co
Updated Dec 7, 2021
Cite
CodeParrot (2021). codeparrot-clean [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-clean
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 7, 2021
Dataset provided by
Good Engineering, Inc
Authors
CodeParrot
Description
CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

Deduplication
- Remove exact matches

Filtering
- Average line length < 100
- Maximum line length < 1000
- Alpha numeric characters fraction > 0.25
- Remove auto-generated files (keyword search)
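
The cleaning steps listed above can be illustrated with a short Python sketch. It is not the CodeParrot implementation: it assumes line lengths are counted in characters, that the alphanumeric fraction is taken over all characters in the file, and that exact matches are detected by hashing the full file text. The keyword list `AUTOGEN_KEYWORDS` is hypothetical, since the description only says "keyword search".

```python
import hashlib

# Hypothetical keywords; the source does not list the ones actually used.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def passes_filters(source):
    """Return True if a file's text satisfies the thresholds listed above."""
    lines = source.splitlines()
    if not lines:
        return False
    line_lengths = [len(line) for line in lines]
    avg_len = sum(line_lengths) / len(line_lengths)
    max_len = max(line_lengths)
    alnum_fraction = sum(ch.isalnum() for ch in source) / len(source)
    lowered = source.lower()
    autogenerated = any(keyword in lowered for keyword in AUTOGEN_KEYWORDS)
    return (
        avg_len < 100
        and max_len < 1000
        and alnum_fraction > 0.25
        and not autogenerated
    )


def clean(files):
    """Drop exact duplicates, then apply the filters; input order is kept."""
    seen = set()
    kept = []
    for source in files:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact match of a file already processed
        seen.add(digest)
        if passes_filters(source):
            kept.append(source)
    return kept
```

Hashing the content keeps memory use bounded by the number of distinct files rather than their total size, which matters for a corpus of this kind.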

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
DAS-Sample-Data
huggingface.co
Updated Dec 15, 2024
Cite
genrobot.ai (2024). DAS-Sample-Data [Dataset]. https://huggingface.co/datasets/genrobot2025/DAS-Sample-Data
Explore at:
Dataset updated
Dec 15, 2024
Dataset provided by
genrobot.ai
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
DAS: Data Acquisition System

Enable embodied intelligence data acquisition to be as simple and natural as shooting a video.

📋 Contents

📦 How to Use the Dataset
📚 Dataset Structure
License
Contact

📦 How to Use the Dataset

Due to Hugging Face's file size limitation of 50GB per file, the dataset has been split into smaller parts.

📚 Dataset Structure

Purpose: Each HDF5 file corresponds to a single episode and encapsulates both observational data and… See the full description on the dataset page: https://huggingface.co/datasets/genrobot2025/DAS-Sample-Data.
AnyInstruct-resolution-1024
huggingface.co
Updated Aug 11, 2024
Cite
OpenMOSS (2024). AnyInstruct-resolution-1024 [Dataset]. https://huggingface.co/datasets/OpenMOSS-Team/AnyInstruct-resolution-1024
Explore at:
Dataset updated
Aug 11, 2024
Dataset authored and provided by
OpenMOSS
Description
File Restoration and Extraction Guide

File Structure

Root directory: Contains Part 1 split files
part2/ directory: Contains Part 2 split files

Instructions

Step 1: File Restoration

Due to size limitations, the original file has been split. To restore the complete file:

    cat images_1024.part_* > images_1024.tar

Step 2: Extraction

To extract the contents:

    tar -xvf images_1024.tar
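
Where `cat` and `tar` are not available, the same two steps can be sketched in Python. This is a minimal sketch under stated assumptions: the parts sit in one local directory, and sorting their names lexically reproduces the order the shell glob would use. The function name `restore_and_extract` and the default `output_dir` are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def restore_and_extract(parts_dir, prefix="images_1024.part_",
                        tar_name="images_1024.tar", output_dir="extracted"):
    """Join the split parts into one tar file, then extract it.

    Returns the part file names in the order they were concatenated.
    """
    parts_dir = Path(parts_dir)
    tar_path = parts_dir / tar_name
    parts = sorted(parts_dir.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {parts_dir}")

    # Step 1: concatenate the parts byte for byte, streaming each one.
    with open(tar_path, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)

    # Step 2: extract the restored archive.
    with tarfile.open(tar_path) as tar:
        tar.extractall(Path(output_dir))
    return [part.name for part in parts]
```

Streaming with `shutil.copyfileobj` avoids loading a whole part into memory, which is the main design concern for archives of this size.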

Important Notes

For Part 1 images: Execute… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/AnyInstruct-resolution-1024.
test_sample
huggingface.co
Updated Aug 11, 2024
Cite
ENFU NAN (2024). test_sample [Dataset]. https://huggingface.co/datasets/nanenfu/test_sample
Explore at:
Dataset updated
Aug 11, 2024
Authors
ENFU NAN
Description
Binary Logs (Split Upload)

This dataset contains a large zip file split into multiple parts due to size limits.

📁 File List

test_sample_multi4.zip.part_aa
test_sample_multi4.zip.part_ab

📦 How to Use

Download all parts, then merge them locally:

    cat test_sample_multi4.zip.part_aa test_sample_multi4.zip.part_ab > test_sample_multi4.zip
    unzip test_sample_multi4.zip

license: mit
sn96g-test-2chunk1-20250919_155537
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-test-2chunk1-20250919_155537 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-test-2chunk1-20250919_155537
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 63
Avg answer length (tokens): 83.9 (median 80, min 51, max 145)
Schema errors: 0 (should be 0)
File size: 0.04 MB
SHA256 (data.jsonl): 4ddee3f0499f22ef37b39db1b8138e53d2cd36663202ad0b5767777792831bec

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-test-2chunk1-20250919_155537.
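
The schema-error count and the SHA256 checksum reported above can be recomputed with a short Python sketch. It is not the validator used by the dataset author: it assumes `system` may be null or a string (the source only shows null), that each turn has a `role` of `user` or `assistant`, and that `content` is a string. The function names are hypothetical.

```python
import hashlib
import json


def validate_record(record):
    """Return True if one parsed line matches the schema shown above."""
    if not isinstance(record, dict):
        return False
    if "system" not in record or "conversations" not in record:
        return False
    if record["system"] is not None and not isinstance(record["system"], str):
        return False
    turns = record["conversations"]
    if not isinstance(turns, list) or not turns:
        return False
    for turn in turns:
        if not isinstance(turn, dict):
            return False
        if turn.get("role") not in ("user", "assistant"):
            return False
        if not isinstance(turn.get("content"), str):
            return False
    return True


def check_jsonl(raw_bytes):
    """Return (sha256_hex, total_records, schema_errors) for JSONL bytes."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    total = 0
    errors = 0
    for line in raw_bytes.decode("utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        try:
            ok = validate_record(json.loads(line))
        except json.JSONDecodeError:
            ok = False
        if not ok:
            errors += 1
    return digest, total, errors
```

Passing the bytes of a downloaded `data.jsonl` to `check_jsonl` gives a digest that can be compared with the SHA256 value listed on the dataset card.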
sn96g-ds-20chunk1-20250919_165946
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-ds-20chunk1-20250919_165946 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-ds-20chunk1-20250919_165946
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 43
Avg answer length (tokens): 101.5 (median 99, min 70, max 147)
Schema errors: 0 (should be 0)
File size: 0.03 MB
SHA256 (data.jsonl): bbc1611cb16f022eb5b06da63d444cc846e8071a0327ac1f12fa7c109d11943a

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-ds-20chunk1-20250919_165946.
lj_speech
huggingface.co
tensorflow.org
Updated May 17, 2024
Cite
Keith Ito (2024). lj_speech [Dataset]. https://huggingface.co/datasets/keithito/lj_speech
Explore at:
Dataset updated
May 17, 2024
Authors
Keith Ito
License
https://choosealicense.com/licenses/unlicense/
Description
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .wav format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

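    import soundfile as sf

    def map_to_array(batch):
        # sf.read returns float64 by default, so request float32 explicitly
        speech_array, _ = sf.read(batch["file"], dtype="float32")
        batch["speech"] = speech_array
        return batch

    # `dataset` is the already-loaded lj_speech dataset
    dataset = dataset.map(map_to_array, remove_columns=["file"])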
sn96g-med-general-2chunk1-20250919_190407
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-med-general-2chunk1-20250919_190407 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-med-general-2chunk1-20250919_190407
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 2
Avg answer length (tokens): 31 (median 31.0, min 26, max 36)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 9f17429fd49de1bf267dc6dbdbe983c9bcd2750748f15765cff445bdb8b28a49

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-med-general-2chunk1-20250919_190407.
flock96-dataset
huggingface.co
Cite
dagliana, flock96-dataset [Dataset]. https://huggingface.co/datasets/raniero/flock96-dataset
Explore at:
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 5000
Avg answer length (tokens): 131.4 (median 128.0, min 80, max 242)
Schema errors: 0 (should be 0)
File size: 5.13 MB
SHA256 (data.jsonl): b10725196a1a4491cf97c0c31d9087df89cc3510f7e99c0ace7260c0d5e72e32

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/flock96-dataset.
sn96g-it-troubleshooting-2chunk1-20250919_174208
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-it-troubleshooting-2chunk1-20250919_174208 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-it-troubleshooting-2chunk1-20250919_174208
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 2
Avg answer length (tokens): 24 (median 24.0, min 11, max 37)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 136e1892f1197a827331e2668b8004e8d55ecd91550f8494639c28463a27b55d

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-it-troubleshooting-2chunk1-20250919_174208.
sn96g-auto-dataset
huggingface.co
Cite
dagliana, sn96g-auto-dataset [Dataset]. https://huggingface.co/datasets/raniero/sn96g-auto-dataset
Explore at:
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 1709
Avg answer length (tokens): 90.3 (median 87, min 9, max 197)
Schema errors: 0 (should be 0)
File size: 1.22 MB
SHA256 (data.jsonl): 09be7de802ced5d7d93c09590a7cbf47cb911872deefc15dde4c41105e9fd908

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-auto-dataset.
sn96g-history-geoculture-2chunk1-20250919_190101
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-history-geoculture-2chunk1-20250919_190101 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-history-geoculture-2chunk1-20250919_190101
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 2
Avg answer length (tokens): 31 (median 31.0, min 23, max 39)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): c2f0650dd8ef778f4ce3ea4be52347b573539c17b60cb6bbf0a84a34ea45fbfb

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-history-geoculture-2chunk1-20250919_190101.
sn96g-howto-practical-2chunk1-20250919_185757
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-howto-practical-2chunk1-20250919_185757 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-howto-practical-2chunk1-20250919_185757
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 2
Avg answer length (tokens): 21 (median 21.0, min 15, max 27)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): c9f037ebdefac561e443017e8c36cb50fa38ff5f6f41a7d6ad31617813d4748c

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-howto-practical-2chunk1-20250919_185757.
sn96g-science-explainers-2chunk1-20250919_193451
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-science-explainers-2chunk1-20250919_193451 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-science-explainers-2chunk1-20250919_193451
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 2
Avg answer length (tokens): 35 (median 35.0, min 26, max 44)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 3d134bc47c74ba1f958620f97a9ed58f30fa62c14311e565ac699bde4aa5f089

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-science-explainers-2chunk1-20250919_193451.
edit_amazon_reviews_multi
kaggle.com
zip
Updated Aug 21, 2025
Cite
Radim Közl (2025). edit_amazon_reviews_multi [Dataset]. https://www.kaggle.com/datasets/radimkzl/edit-amazon-reviews-multi
Explore at:
Available download formats: zip (167232183 bytes)
Dataset updated
Aug 21, 2025
Authors
Radim Közl
Description
Dataset Summary

The data file is based on a copy of the Hugging Face data file: buruzaemon/amazon_reviews_multi. That file is itself a copy of the original dataset defunct-datasets/amazon_reviews_multi. The dataset was published by the community Open Data on AWS.

In our modification, we removed unnecessary columns, thereby anonymizing the data file, and added columns describing the string lengths of individual columns; see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:

train: 95% (199500)

validation: 2.5% (5250)

test: 2.5% (5250)
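
The arithmetic behind this re-partition can be checked with a small Python sketch. The total of 210,000 used in the usage note is simply the sum of the three counts above, not a figure stated in the source, and shuffling with a fixed seed is an assumption: the description does not say how records were assigned to splits. Both function names are hypothetical.

```python
import random


def split_sizes(total, train_frac=0.95, val_frac=0.025):
    """Return (train, validation, test) sizes; test takes the remainder."""
    train = round(total * train_frac)
    val = round(total * val_frac)
    test = total - train - val
    return train, val, test


def split_indices(total, seed=0):
    """Shuffle record indices and cut them into the three partitions."""
    train, val, _ = split_sizes(total)
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    return (
        indices[:train],
        indices[train:train + val],
        indices[train + val:],
    )
```

With these definitions, `split_sizes(210000)` yields the three counts listed above. Giving the test split the remainder guarantees that the sizes always add up to the total, even when rounding is involved.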

The original *.jsonl data format has been changed to the more modern *.parquet format; see Apache Arrow.

The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.

This is the comprehensive dataset; derived datasets for the tutorial can be found here:

KRadim/edit_amazon_reviews_multi_en

KRadim/edit_amazon_reviews_multi_es

Description of the original dataset - Hugging Face Datasets

"We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.

For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.

Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source

Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus

Languages

The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.

Dataset Structure

id: record id

stars: An int between 1-5 indicating the number of stars.

review_body: The text body of the review.

review_title: The text title of the review.

language: The string identifier of the review language.

product_category: String representation of the product's category.

lenght_review_body: text length of review_body

lenght_review_title: text length of review_title

lenght_product_category: text length of product_category
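
As an illustration of how the three length columns relate to the text columns, here is a minimal Python sketch. It assumes "text length" means the number of characters as returned by `len()`, which the source does not specify; the column names, including the `lenght_` spelling, are taken from the list above, and the function name is hypothetical.

```python
def add_length_columns(record):
    """Return a copy of a review record with the three length columns added."""
    out = dict(record)
    for column in ("review_body", "review_title", "product_category"):
        # The "lenght_" prefix matches the column names in the dataset.
        out["lenght_" + column] = len(record[column])
    return out
```

For example, a record whose `review_title` is `"Great"` would receive `lenght_review_title` equal to 5 under this assumption.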

Social Impact of Dataset

This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source

Discussion of Biases of origin dataset

The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform to the Amazon Community Guidelines. source

Licensing Information

Licensing of origin dataset

Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...
sn96g-coding-python-2chunk1-20250919_172914
huggingface.co
Updated Sep 19, 2025
Cite
dagliana (2025). sn96g-coding-python-2chunk1-20250919_172914 [Dataset]. https://huggingface.co/datasets/raniero/sn96g-coding-python-2chunk1-20250919_172914
Explore at:
Dataset updated
Sep 19, 2025
Authors
dagliana
Description
Subnet 96 — Clean Q/A Dataset

Format: JSONL, one JSON object per line: {"system": null, "conversations":[{"role":"user","content":"..."}, {"role":"assistant","content":"..."}]}

Total pairs: 2
Avg answer length (tokens): 16 (median 16.0, min 13, max 19)
Schema errors: 0 (should be 0)
File size: 0.00 MB
SHA256 (data.jsonl): 957587716fc204eafff5f6f37b2d390d8eac9431431bd4923696830d42607ee2

Language: English
Intended for: Bittensor Subnet 96 validators
Generation: local LLaMA (GPU) +… See the full description on the dataset page: https://huggingface.co/datasets/raniero/sn96g-coding-python-2chunk1-20250919_172914.
