2 datasets found

d
Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model |...
datarade.ai
Updated Feb 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2025). Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model | SFT Data | Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-fine-tuning-text-data-2-millions-f-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Feb 12, 2025
Dataset authored and provided by
Nexdata
Area covered
Japan, Ecuador, United States of America, Spain, Chile, South Africa, Egypt, Austria, Indonesia, Tunisia
Description
Overview Volume: 2 Millions Data use: Instruction-Following Evaluation for LLM
Data content: A variety of complex prompt instructions, between 50 and 400 words, with no fewer than 3 constraints in each prompt Production method: All prompt are manually written to satisfy the diversity of coverage
Language: English, Korean, French, German, Spanish, Russian, Italian, Dutch, Polish, Portuguese, Japanese, Indonesian, Vietnamese

About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go data supports instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
h
SIFT-50M
huggingface.co
Updated Jun 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amazon AGI (2025). SIFT-50M [Dataset]. https://huggingface.co/datasets/amazon-agi/SIFT-50M
Explore at:
Dataset updated
Jun 1, 2025
Dataset authored and provided by
Amazon AGI
License
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Dataset Card for SIFT-50M

SIFT-50M (Speech Instruction Fine-Tuning) is a 50-million-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). It is built from publicly available speech corpora containing a total of 14K hours of speech and leverages LLMs and off-the-shelf expert models. The dataset spans five languages, covering diverse aspects of speech understanding and controllable speech generation instructions. SIFT-50M… See the full description on the dataset page: https://huggingface.co/datasets/amazon-agi/SIFT-50M.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model | SFT Data | Large Language Model(LLM) Data

Explore at:

.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats

Dataset updated

Feb 12, 2025

Dataset authored and provided by

Nexdata

Area covered

Japan, Ecuador, United States of America, Spain, Chile, South Africa, Egypt, Austria, Indonesia, Tunisia

Description

Overview Volume: 2 Millions Data use: Instruction-Following Evaluation for LLM
Data content: A variety of complex prompt instructions, between 50 and 400 words, with no fewer than 3 constraints in each prompt Production method: All prompt are manually written to satisfy the diversity of coverage
Language: English, Korean, French, German, Spanish, Russian, Italian, Dutch, Polish, Portuguese, Japanese, Indonesian, Vietnamese
About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go data supports instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade

Clear search

Close search

Google apps

Main menu

Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model |...

SIFT-50M

Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model | SFT Data | Large Language Model(LLM) Data