14 datasets found
  1. orca-agentinstruct-1M-v1

    • huggingface.co
    Updated Nov 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Microsoft (2024). orca-agentinstruct-1M-v1 [Dataset]. https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 4, 2024
    Dataset authored and provided by
    Microsofthttp://microsoft.com/
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    Dataset Card

    This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.

  2. orca-math-word-problems-200k

    • huggingface.co
    Updated Mar 4, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    agicorp (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Agicorp
    Authors
    agicorp
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card

    This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.

      Dataset Sources
    

    Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

      Direct Use
    

    This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k.

  3. h

    microsoft_Orca-2-7b-details

    • huggingface.co
    Updated Mar 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Open LLM Leaderboard (2025). microsoft_Orca-2-7b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details
    Explore at:
    Dataset updated
    Mar 10, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of microsoft/Orca-2-7b

    Dataset automatically created during the evaluation run of model microsoft/Orca-2-7b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details.

  4. d

    Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to...

    • search.dataone.org
    • search-demo.dataone.org
    • +2more
    Updated Oct 3, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Craig Matkin; Kris Holderied; Molly McCammon; Katrina Hoffman; Graeme Ellis; Eva Saulitis (2019). Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to 2016, Gulf Watch Alaska Pelagic Component [Dataset]. http://doi.org/10.24431/rw1k1r
    Explore at:
    Dataset updated
    Oct 3, 2019
    Dataset provided by
    Research Workspace
    Authors
    Craig Matkin; Kris Holderied; Molly McCammon; Katrina Hoffman; Graeme Ellis; Eva Saulitis
    Time period covered
    Jul 4, 2012 - Jul 28, 2016
    Area covered
    Description

    These data are part of the Gulf Watch Alaska (GWA), Pelagic Component of the Exxon Valdez Oil Spill Trustee Council, project numbers 12120114-M, 13120114-M, 14120114-M, 15120114-M and 16120114-M. Gulf Watch Alaska is the long-term ecosystem monitoring program of the Exxon Valdez Oil Spill Trustee Council for the marine ecosystem affected by the 1989 oil spill. The project is a continuation of annual monitoring of AB pod and the AT1 population killer whales in Prince William Sound-Kenai Fjords. These groups of whales suffered significant losses at the time of the oil spill and have not recovered at projected rates. Monitoring of all the major pods and their current movements, range, feeding habits, and contaminant levels will help determine their vulnerability to future perturbations, including oil spills. This dataset is a database containing information from the killer whale surveys conducted from 2001 to 2016 in Prince William Sound and the Gulf of Alaska. The native file format is a Microsoft Office Access 2007 database (12.0 6735.5000), components of which have been separated and stored in Orcadatabase_CSV_tables.zip as .csv files to ensure that the information contained within the Access database file is openly accessible to data customers. Details of killer whale surveys, and subsequent encounters are stored in the file. Stored information includes the date, time, observers, behavioral observations, samples taken, location, pods present, number of whales present, name of survey vessel, and other pertinent information.

  5. orca-math-word-problems-200k

    • huggingface.co
    Updated Mar 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/orca-math-word-problems-200k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2024
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    Description

    Dataset Card for Orca Math Word Problems 200k

    This is a formatted version of microsoft/orca-math-word-problems-200k to store the conversations in the same format as the OpenAI SDK.

  6. d

    Potential new species of pseudaliid lung nematode (Metastrongyloidea) from...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Apr 24, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Joy Ometere Boyi (2023). Potential new species of pseudaliid lung nematode (Metastrongyloidea) from two stranded neonatal orcas (Orcinus orca) characterised by ITS-2 and COI sequences [Dataset]. http://doi.org/10.5061/dryad.v15dv421f
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 24, 2023
    Dataset provided by
    Dryad
    Authors
    Joy Ometere Boyi
    Time period covered
    2023
    Description

    Sanger dideoxy sequencing.

  7. h

    math-orca-arch

    • huggingface.co
    Updated Mar 4, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Farouk (2024). math-orca-arch [Dataset]. https://huggingface.co/datasets/pharaouk/math-orca-arch
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2024
    Authors
    Farouk
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card

    This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.

      Dataset Sources
    

    Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

      Direct Use
    

    This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/math-orca-arch.

  8. h

    orca-math-word-problems-193k-korean

    • huggingface.co
    Updated Mar 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jisoo Kim (2024). orca-math-word-problems-193k-korean [Dataset]. https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 27, 2024
    Authors
    Jisoo Kim
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    원본 데이터셋: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k 번역 모델: Seagull-13b-translation 후처리

    번역 repetition 오류 제거 LaTeX 오류 체크(전부는 아닐 수 있음. /(/) -> /(/ 같은 오류 등...)

      Citation
    

    @misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL}… See the full description on the dataset page: https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean.

  9. h

    orca-math-word-problems-200k-askllm-v1

    • huggingface.co
    Updated Aug 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Team Kuma (2024). orca-math-word-problems-200k-askllm-v1 [Dataset]. https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 20, 2024
    Dataset authored and provided by
    Team Kuma
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    orca-math-word-problems-200k-askllm-v1

    データセット microsoft/orca-math-word-problems-200k に対して、 Ask-LLM 手法でスコア付けしたデータセットです。 元データセットのカラムに加え askllm_score というカラムが追加されており、ここに Ask-LLM のスコアが格納されています。 Ask-LLM でスコア付けに使用した LLM は Rakuten/RakutenAI-7B-instruct で、プロンプトは以下の通りです。 ### {data} ###

    Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of… See the full description on the dataset page: https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1.

  10. h

    orca-math-word-problems-200k-turkmen

    • huggingface.co
    Updated Jul 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bahtiyar Mamedov (2024). orca-math-word-problems-200k-turkmen [Dataset]. https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 6, 2024
    Authors
    Bahtiyar Mamedov
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Turkmen Orca Math Word Problems 200k Dataset

      Overview
    

    This dataset is a Turkmen translation of the original microsoft/orca-math-word-problems-200k dataset. The Orca Math Word Problems dataset contains 200,000 high-quality math word problems and their solutions. This Turkmen version aims to extend the accessibility of math problem-solving datasets to the Turkmen language community.

      Dataset Details
    

    Original Dataset: microsoft/orca-math-word-problems-200k… See the full description on the dataset page: https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen.

  11. h

    Math-Qwen3-14B-vi

    • huggingface.co
    Updated May 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jeongjaeyong (2025). Math-Qwen3-14B-vi [Dataset]. https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi
    Explore at:
    Dataset updated
    May 2, 2025
    Authors
    jeongjaeyong
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Development Process

    question dataset from 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated We used Qwen/Qwen3-14B to evaluate the appropriateness of those candidates.

      License
    

    Qwen/Qwen3-14B : https://choosealicense.com/licenses/apache-2.0/ 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated : https://huggingface.co/datasets/5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated

      Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi.
    
  12. h

    tangled-llama-pints-1.5b-v0.2-dataset

    • huggingface.co
    Updated Sep 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TangledLabs (2024). tangled-llama-pints-1.5b-v0.2-dataset [Dataset]. https://huggingface.co/datasets/tangledlabs/tangled-llama-pints-1.5b-v0.2-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    TangledLabs
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    tangled-llama-pints-1.5b-v0.2-dataset

    Combined dataset as single JSONL from following datasets:

    laurentiubp/systemchat-sharegpt Open-Orca/slimorca-deduped-cleaned-corrected Crystalcareai/openhermes_200k_unfiltered Locutusque/function-calling-chatml m-a-p/CodeFeedback-Filtered-Instruction microsoft/orca-math-word-problems-200k meta-math/MetaMathQA mlabonne/FineTome-100k arcee-ai/agent-data

  13. h

    prompt-voice-v1.5

    • huggingface.co
    Updated Jul 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Menlo Research (2024). prompt-voice-v1.5 [Dataset]. https://huggingface.co/datasets/Menlo/prompt-voice-v1.5
    Explore at:
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    Menlo Research
    Description

    Dataset Overview

    This dataset contains nearly 2.35M English speech instruction to text answer samples, using the combination of:

    Intel/orca_dpo_pairs routellm/gpt4_dataset nomic-ai/gpt4all-j-prompt-generations microsoft/orca-math-word-problems-200k allenai/WildChat-1M Open-Orca/oo-gpt4-200k Magpie-Align/Magpie-Pro-300K-Filtered qiaojin/PubMedQA Undi95/Capybara-ShareGPT HannahRoseKirk/prism-alignment BAAI/Infinity-Instruct

      Usage
    

    from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/Menlo/prompt-voice-v1.5.

  14. h

    Raiden-DeepSeek-R1-PREVIEW

    • huggingface.co
    Updated Feb 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    t.d.a.g. (2025). Raiden-DeepSeek-R1-PREVIEW [Dataset]. https://huggingface.co/datasets/sequelbox/Raiden-DeepSeek-R1-PREVIEW
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 4, 2025
    Authors
    t.d.a.g.
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a preview of the full Raiden-Deepseek-R1 creative and analytical reasoning dataset, containing the first ~6k rows. Get the full dataset here! This dataset uses synthetic data generated by deepseek-ai/DeepSeek-R1. The initial release of Raiden uses 'creative_content' and 'analytical_reasoning' prompts from microsoft/orca-agentinstruct-1M-v1. Dataset has not been reviewed for format or accuracy. All responses are synthetic and provided without editing. Use as you will.

  15. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Microsoft (2024). orca-agentinstruct-1M-v1 [Dataset]. https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
Organization logo

orca-agentinstruct-1M-v1

microsoft/orca-agentinstruct-1M-v1

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 4, 2024
Dataset authored and provided by
Microsofthttp://microsoft.com/
License

https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/

Description

Dataset Card

This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.

Search
Clear search
Close search
Google apps
Main menu