Overview
Volume: 2 million
Data use: Instruction-following evaluation for LLMs
Data content: A variety of complex prompt instructions, each between 50 and 400 words long and containing at least 3 constraints
Production method: All prompts are manually written to ensure diverse coverage
Languages: English, Korean, French, German, Spanish, Russian, Italian, Dutch, Polish, Portuguese, Japanese, Indonesian, Vietnamese
About Nexdata
Nexdata owns off-the-shelf, PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and can quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
https://choosealicense.com/licenses/cdla-sharing-1.0/
Dataset Card for SIFT-50M
SIFT-50M (Speech Instruction Fine-Tuning) is a 50-million-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). It is built from publicly available speech corpora containing a total of 14K hours of speech and leverages LLMs and off-the-shelf expert models. The dataset spans five languages, covering diverse aspects of speech understanding and controllable speech generation instructions. SIFT-50M… See the full description on the dataset page: https://huggingface.co/datasets/amazon-agi/SIFT-50M.