2 datasets found
  1. Fine-Tuning Text Data | 2 Millions | User Generated Text | Foundation Model |...

    • datarade.ai
    Updated Feb 12, 2025
    Cite
    Nexdata (2025). Fine-Tuning Text Data | 2 Millions | User Generated Text |Foundation Model | SFT Data | Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-fine-tuning-text-data-2-millions-f-nexdata
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Feb 12, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Japan, Ecuador, United States of America, Spain, Chile, South Africa, Egypt, Austria, Indonesia, Tunisia
    Description
    1. Overview
      Volume: 2 million
      Data use: Instruction-following evaluation for LLMs
      Data content: A variety of complex prompt instructions, each between 50 and 400 words and containing at least 3 constraints
      Production method: All prompts are manually written to ensure diverse coverage
      Languages: English, Korean, French, German, Spanish, Russian, Italian, Dutch, Polish, Portuguese, Japanese, Indonesian, Vietnamese

    2. About Nexdata
      Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and can quickly improve the accuracy of AI models. For more details, visit https://www.nexdata.ai/datasets/llm?source=Datarade

  2. SIFT-50M

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Amazon AGI (2025). SIFT-50M [Dataset]. https://huggingface.co/datasets/amazon-agi/SIFT-50M
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    Amazon AGI
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Dataset Card for SIFT-50M

    SIFT-50M (Speech Instruction Fine-Tuning) is a 50-million-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). It is built from publicly available speech corpora containing a total of 14K hours of speech and leverages LLMs and off-the-shelf expert models. The dataset spans five languages, covering diverse aspects of speech understanding and controllable speech generation instructions. SIFT-50M… See the full description on the dataset page: https://huggingface.co/datasets/amazon-agi/SIFT-50M.

