22 datasets found
  1. orca-agentinstruct-1M-v1

    • huggingface.co
    Updated Nov 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Microsoft (2024). orca-agentinstruct-1M-v1 [Dataset]. https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 4, 2024
    Dataset authored and provided by
    Microsofthttp://microsoft.com/
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    Dataset Card

    This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.

  2. orca-math-word-problems-200k

    • huggingface.co
    Updated Mar 4, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    agicorp (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Agicorp
    Authors
    agicorp
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card

    This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.

      Dataset Sources
    

    Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

      Direct Use
    

    This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k.

  3. t

    ORCAS-I

    • researchdata.tuwien.ac.at
    tsv
    Updated Jun 25, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wojciech Kusa; Wojciech Kusa; Daria Alexander; Daria Alexander; Arjen P. de Vries; Arjen P. de Vries (2024). ORCAS-I [Dataset]. http://doi.org/10.48436/pp7xz-n9a06
    Explore at:
    tsvAvailable download formats
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Wojciech Kusa; Wojciech Kusa; Daria Alexander; Daria Alexander; Arjen P. de Vries; Arjen P. de Vries
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ORCAS-I is an annotated version of ORCAS dataset (Craswell et al., 2020) annotated with user intents using weak supervision. It allows you to train your algorithm on various types of user intents. Those intents are initially taken from Broder's (2002) classification: informational, navigational and transactional. We also refined this classification and added two subcategories inside the informational category: factual and instrumental. If the intent did not get any label inside the informational category it was classified as abstain.

    ORCAS-I consists of the following files:

    a) ORCAS-I-18M.tsv

    A complete ORCAS data set which contains 18 million unique query-urls pairs.

    dataset size: 18,823,602
    unique queries: 10,405,339
    unique URLs: 1,422,029
    unique domains: 241,199

    b) ORCAS-I-2M.tsv

    A 2M subset of ORCAS-I-18M.tsv that we used for our experiments with different machine learning algorithms.

    dataset size: 2,000,000
    unique queries: 1,796,652
    unique URLs: 618,679
    unique domains: 126,001


    Both ORCAS-I-18M and ORCAS-I-2M contain the following columns:

    1. qid: the id of the query
    2. query: the text of the query
    3. url: the url that the user clicked
    4. did: the document from TREC deep learning track that the url leads to
    5. level_1: first level of annotation which has three top level categories: informational, navigational and transactional
    6. level_2: second level of annotation (only classifies according to factual and instrumental categories, so all the other intents in the column are classified as abstain)
    7. label: final intent label. Provides the annotation for informational, navigational and transactional categories and also for factual, instrumental and abstain subcategories
    8. data_split: either 'train' or 'validation' that corresponds to split used during the original experiments

    You can train your classifier either on the 3 top level categories (column 'level_1') or on the full taxonomy (column 'label').

    c) ORCAS-I-gold.tsv

    This is a test file that contains 1000 randomly selected queries from the full dataset (they are excluded from the 2M sample). These queries were manually annotated by two IR specialists.

    dataset size: 1,000
    unique queries: 1,000
    unique URLs: 995
    unique domains: 700

    ORCAS-I-gold contains the following columns:

    1. qid: the id of the query
    2. query: the text of the query
    3. url: the url that the user clicked
    4. did: the document from TREC deep learning track that the url leads to
    5. label_manual - manually annotated intent
    6. data_split: always equal to 'test'

  4. h

    microsoft-orca-agentinstruct-1M-v1_sample100

    • huggingface.co
    Updated Jul 1, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Pipelines Mock (2025). microsoft-orca-agentinstruct-1M-v1_sample100 [Dataset]. https://huggingface.co/datasets/data-pipelines-mock/microsoft-orca-agentinstruct-1M-v1_sample100
    Explore at:
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Data Pipelines Mock
    Description

    data-pipelines-mock/microsoft-orca-agentinstruct-1M-v1_sample100 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. h

    microsoft_Orca-2-13b-details

    • huggingface.co
    Updated Jul 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Open LLM Leaderboard (2025). microsoft_Orca-2-13b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-13b-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of microsoft/Orca-2-13b

    Dataset automatically created during the evaluation run of model microsoft/Orca-2-13b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-13b-details.

  6. O

    ORCAS

    • opendatalab.com
    zip
    Updated Sep 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Microsoft (2022). ORCAS [Dataset]. https://opendatalab.com/ORCAS
    Explore at:
    zip(11268007470 bytes)Available download formats
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    Microsoft
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.

  7. h

    microsoft_Orca-2-7b-details

    • huggingface.co
    Updated Jul 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Open LLM Leaderboard (2025). microsoft_Orca-2-7b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of microsoft/Orca-2-7b

    Dataset automatically created during the evaluation run of model microsoft/Orca-2-7b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details.

  8. d

    Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to...

    • dataone.org
    • search.dataone.org
    Updated Oct 3, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Craig Matkin (2019). Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to 2016, Gulf Watch Alaska Pelagic Component [Dataset]. http://doi.org/10.24431/rw1k32h
    Explore at:
    Dataset updated
    Oct 3, 2019
    Dataset provided by
    Research Workspace
    Authors
    Craig Matkin
    Time period covered
    Jul 4, 2012 - Jul 28, 2016
    Area covered
    Description

    These data are part of the Gulf Watch Alaska (GWA), Pelagic Component of the Exxon Valdez Oil Spill Trustee Council, project numbers 12120114-M, 13120114-M, 14120114-M, 15120114-M and 16120114-M. Gulf Watch Alaska is the long-term ecosystem monitoring program of the Exxon Valdez Oil Spill Trustee Council for the marine ecosystem affected by the 1989 oil spill. The project is a continuation of annual monitoring of AB pod and the AT1 population killer whales in Prince William Sound-Kenai Fjords. These groups of whales suffered significant losses at the time of the oil spill and have not recovered at projected rates. Monitoring of all the major pods and their current movements, range, feeding habits, and contaminant levels will help determine their vulnerability to future perturbations, including oil spills. This dataset is a database containing information from the killer whale surveys conducted from 2001 to 2016 in Prince William Sound and the Gulf of Alaska. The native file format is a Microsoft Office Access 2007 database (12.0 6735.5000), components of which have been separated and stored in Orcadatabase_CSV_tables.zip as .csv files to ensure that the information contained within the Access database file is openly accessible to data customers. Details of killer whale surveys, and subsequent encounters are stored in the file. Stored information includes the date, time, observers, behavioral observations, samples taken, location, pods present, number of whales present, name of survey vessel, and other pertinent information.

  9. d

    Potential new species of pseudaliid lung nematode (Metastrongyloidea) from...

    • dataone.org
    • search.dataone.org
    • +2more
    Updated Jul 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Joy Ometere Boyi (2025). Potential new species of pseudaliid lung nematode (Metastrongyloidea) from two stranded neonatal orcas (Orcinus orca) characterised by ITS-2 and COI sequences [Dataset]. http://doi.org/10.5061/dryad.v15dv421f
    Explore at:
    Dataset updated
    Jul 16, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Joy Ometere Boyi
    Time period covered
    Jan 1, 2023
    Description

    Knowledge about parasite species of orcas, their prevalence and impact on the health status is scarce. Only two records of lungworm infections in orca exist from male neonatal orcas stranded in Germany and Norway. The nematodes were identified as Halocercus sp. (Pseudaliidae), which have been described in the respiratory tract of multiple odontocete species, but morphological identification to species level remained impossible due to the fragile structure and ambiguous morphological features. Pseudaliid nematodes (Metastrongyloidea) are specific to the respiratory tract of toothed whales and are hypothesized to have become almost extinct in terrestrial mammals. Severe lungworm infections can cause secondary bacterial infections and bronchopneumonia and are a common cause of mortality in odontocetes. DNA isolations and subsequent sequencing of the rDNA ITS-2 and mtDNA COI revealed nucleotide differences between previously described Halocercus species from common dolphin (H. delphini) and..., Sanger dideoxy sequencing., The data files can be opened with Microsoft Word or Notepad.

  10. orca-math-word-problems-200k

    • huggingface.co
    Updated Mar 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/orca-math-word-problems-200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2024
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    Description

    Dataset Card for Orca Math Word Problems 200k

    This is a formatted version of microsoft/orca-math-word-problems-200k to store the conversations in the same format as the OpenAI SDK.

  11. h

    orca-agentinstruct-shuffle_scored

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenDataArena, orca-agentinstruct-shuffle_scored [Dataset]. https://huggingface.co/datasets/OpenDataArena/orca-agentinstruct-shuffle_scored
    Explore at:
    Authors
    OpenDataArena
    Description

    Orca-agentinstruct-shuffle_scored- with OpenDataArena Scores

    This dataset is a scored version of the original microsoft/orca-agentinstruct-1M-v1 dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular data… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/orca-agentinstruct-shuffle_scored.

  12. S

    Small Language Model Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jan 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Small Language Model Report [Dataset]. https://www.datainsightsmarket.com/reports/small-language-model-1498279
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Small Language Model market is projected to grow from $6,430 million in 2025 to $37,780 million by 2033, at a CAGR of 17.8%. Growing adoption of AI, machine learning (ML), and natural language processing (NLP) technologies is driving the market. Additionally, increasing demand for virtual assistants, chatbots, and content generation tools is further fueling the growth. The market is segmented into application, type, region, and company. Based on application, the market is divided into artificial intelligence training, chatbots and virtual assistants, content generation, language translation, code development, medical diagnosis and treatment, education, and others. Based on type, the market is classified into below 5 billion parameters and above 5 billion parameters. Geographically, the market is segmented into North America, South America, Europe, Middle East & Africa, and Asia Pacific. Key players in the market include Llama 2 (Meta AI), Phi2 (Microsoft), Orca (Microsoft), Stable Beluga 7B (Meta AI), X Gen (Salesforce AI), Qwen (Alibaba), Alpaca 7B (Meta), MPT (Mosaic ML), Falcon 7B (Technology Innovation Institute (TII) from the UAE), and Zephyr (Hugging Face).

  13. h

    orca-math-word-problems-200k_scored

    • huggingface.co
    Updated Mar 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenDataArena (2024). orca-math-word-problems-200k_scored [Dataset]. https://huggingface.co/datasets/OpenDataArena/orca-math-word-problems-200k_scored
    Explore at:
    Dataset updated
    Mar 4, 2024
    Authors
    OpenDataArena
    Description

    Orca-math-word-problems-200k_scored- with OpenDataArena Scores

    This dataset is a scored version of the original microsoft/orca-math-word-problems-200k dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/orca-math-word-problems-200k_scored.

  14. h

    orca-math-word-problems-200k-askllm-v1

    • huggingface.co
    Updated Aug 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Team Kuma (2024). orca-math-word-problems-200k-askllm-v1 [Dataset]. https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 20, 2024
    Dataset authored and provided by
    Team Kuma
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    orca-math-word-problems-200k-askllm-v1

    データセット microsoft/orca-math-word-problems-200k に対して、 Ask-LLM 手法でスコア付けしたデータセットです。 元データセットのカラムに加え askllm_score というカラムが追加されており、ここに Ask-LLM のスコアが格納されています。 Ask-LLM でスコア付けに使用した LLM は Rakuten/RakutenAI-7B-instruct で、プロンプトは以下の通りです。 ### {data} ###

    Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of… See the full description on the dataset page: https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1.

  15. h

    orca-math-word-problems-193k-korean

    • huggingface.co
    Updated Mar 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jisoo Kim (2024). orca-math-word-problems-193k-korean [Dataset]. https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 27, 2024
    Authors
    Jisoo Kim
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    원본 데이터셋: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k 번역 모델: Seagull-13b-translation 후처리

    번역 repetition 오류 제거 LaTeX 오류 체크(전부는 아닐 수 있음. /(/) -> /(/ 같은 오류 등...)

      Citation
    

    @misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL}… See the full description on the dataset page: https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean.

  16. h

    orca-math-word-problems-200k-turkmen

    • huggingface.co
    Updated Jul 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bahtiyar Mamedov (2024). orca-math-word-problems-200k-turkmen [Dataset]. https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 6, 2024
    Authors
    Bahtiyar Mamedov
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Turkmen Orca Math Word Problems 200k Dataset

      Overview
    

    This dataset is a Turkmen translation of the original microsoft/orca-math-word-problems-200k dataset. The Orca Math Word Problems dataset contains 200,000 high-quality math word problems and their solutions. This Turkmen version aims to extend the accessibility of math problem-solving datasets to the Turkmen language community.

      Dataset Details
    

    Original Dataset: microsoft/orca-math-word-problems-200k… See the full description on the dataset page: https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen.

  17. h

    Math-Qwen3-14B-vi

    • huggingface.co
    Updated May 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jeongjaeyong (2025). Math-Qwen3-14B-vi [Dataset]. https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi
    Explore at:
    Dataset updated
    May 2, 2025
    Authors
    jeongjaeyong
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Development Process

    question dataset from 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated We used Qwen/Qwen3-14B to evaluate the appropriateness of those candidates.

      License
    

    Qwen/Qwen3-14B : https://choosealicense.com/licenses/apache-2.0/ 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated : https://huggingface.co/datasets/5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated

      Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi.
    
  18. h

    tangled-llama-pints-1.5b-v0.1-dataset

    • huggingface.co
    Updated Sep 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TangledLabs (2024). tangled-llama-pints-1.5b-v0.1-dataset [Dataset]. https://huggingface.co/datasets/tangledlabs/tangled-llama-pints-1.5b-v0.1-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    TangledLabs
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    tangled-llama-pints-1.5b-v0.1-dataset

    Combined dataset as single JSONL from following datasets:

    laurentiubp/systemchat-sharegpt Open-Orca/slimorca-deduped-cleaned-corrected Crystalcareai/openhermes_200k_unfiltered Locutusque/function-calling-chatml m-a-p/CodeFeedback-Filtered-Instruction microsoft/orca-math-word-problems-200k

  19. h

    prompt-voice-v1.5

    • huggingface.co
    Updated Jul 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Menlo Research (2024). prompt-voice-v1.5 [Dataset]. https://huggingface.co/datasets/Menlo/prompt-voice-v1.5
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    Menlo Research
    Description

    Dataset Overview

    This dataset contains nearly 2.35M English speech instruction to text answer samples, using the combination of:

    Intel/orca_dpo_pairs routellm/gpt4_dataset nomic-ai/gpt4all-j-prompt-generations microsoft/orca-math-word-problems-200k allenai/WildChat-1M Open-Orca/oo-gpt4-200k Magpie-Align/Magpie-Pro-300K-Filtered qiaojin/PubMedQA Undi95/Capybara-ShareGPT HannahRoseKirk/prism-alignment BAAI/Infinity-Instruct

      Usage
    

    from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/Menlo/prompt-voice-v1.5.

  20. h

    omni-math

    • huggingface.co
    Updated Aug 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Omni (2025). omni-math [Dataset]. https://huggingface.co/datasets/omniomni/omni-math
    Explore at:
    Dataset updated
    Aug 23, 2025
    Dataset authored and provided by
    Omni
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GitHub
    Website
    Paper (Coming Soon)

      Dataset Details
    

    This dataset is a combination of math Question-Answer datasets spanning various difficulties and concepts. This dataset contains only chat-based data.

      Sources
    

    This dataset was sourced from the following open-sourced datasets: Math

    meta-math/MetaMathQA microsoft/orca-math-word-problems-200k openai/gsm8k

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Microsoft (2024). orca-agentinstruct-1M-v1 [Dataset]. https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
Organization logo

orca-agentinstruct-1M-v1

microsoft/orca-agentinstruct-1M-v1

Explore at:
4 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 4, 2024
Dataset authored and provided by
Microsofthttp://microsoft.com/
License

https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/

Description

Dataset Card

This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.

Search
Clear search
Close search
Google apps
Main menu