CDLA Permissive 2.0 https://choosealicense.com/licenses/cdla-permissive-2.0/
Dataset Card
This dataset is a fully synthetic set of instruction pairs in which both the prompts and the responses were generated with the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by AgentInstruct, using only raw text content publicly available on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This dataset contains ~200K grade school math word problems. All answers in this dataset were generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.
Dataset Sources
Repository: microsoft/orca-math-word-problems-200k
Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math
Direct Use
This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k.
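For orientation, a minimal loading sketch (not part of the dataset card) follows; the question and answer column names are assumed from the dataset's published schema and should be checked against the dataset viewer.

# Minimal sketch: load the problems with the Hugging Face datasets library.
from datasets import load_dataset

ds = load_dataset("microsoft/orca-math-word-problems-200k", split="train")
print(ds[0]["question"])  # column names assumed; verify in the dataset viewer
print(ds[0]["answer"])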
Dataset Card for Evaluation run of microsoft/Orca-2-7b
Dataset automatically created during the evaluation run of model microsoft/Orca-2-7b. The dataset is composed of 44 configuration(s), each corresponding to one of the evaluated tasks. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split always points to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details.
These data are part of the Gulf Watch Alaska (GWA), Pelagic Component of the Exxon Valdez Oil Spill Trustee Council, project numbers 12120114-M, 13120114-M, 14120114-M, 15120114-M and 16120114-M. Gulf Watch Alaska is the long-term ecosystem monitoring program of the Exxon Valdez Oil Spill Trustee Council for the marine ecosystem affected by the 1989 oil spill. The project is a continuation of annual monitoring of AB pod and the AT1 population of killer whales in Prince William Sound-Kenai Fjords. These groups of whales suffered significant losses at the time of the oil spill and have not recovered at projected rates. Monitoring of all the major pods and their current movements, range, feeding habits, and contaminant levels will help determine their vulnerability to future perturbations, including oil spills. This dataset is a database containing information from the killer whale surveys conducted from 2001 to 2016 in Prince William Sound and the Gulf of Alaska. The native file format is a Microsoft Office Access 2007 database (12.0 6735.5000), components of which have been separated and stored in Orcadatabase_CSV_tables.zip as .csv files to ensure that the information contained within the Access database file is openly accessible to data customers. Details of killer whale surveys and subsequent encounters are stored in the file. Stored information includes the date, time, observers, behavioral observations, samples taken, location, pods present, number of whales present, name of survey vessel, and other pertinent information.
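The exported tables can be read without Access; the sketch below is a hedged example of opening one table from Orcadatabase_CSV_tables.zip with pandas, where the file name Encounters.csv is hypothetical and the actual table names should be taken from the archive listing.

# Hedged sketch: inspect and read the exported .csv tables with pandas.
import zipfile
import pandas as pd

with zipfile.ZipFile("Orcadatabase_CSV_tables.zip") as zf:
    print(zf.namelist())                  # discover the exported table names
    with zf.open("Encounters.csv") as f:  # hypothetical table name
        encounters = pd.read_csv(f)

print(encounters.head())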
Dataset Card for Orca Math Word Problems 200k
This is a formatted version of microsoft/orca-math-word-problems-200k that stores the conversations in the same format as the OpenAI SDK.
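As an illustration of that target format, a single record would look roughly like the sketch below; the messages field name is assumed from the OpenAI chat-completions convention, and the text is placeholder content rather than an actual row from the dataset.

# Illustrative sketch of one OpenAI-style conversation record (placeholder text).
example = {
    "messages": [
        {"role": "user", "content": "A grade school math word problem ..."},
        {"role": "assistant", "content": "A step-by-step solution ..."},
    ]
}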
Sanger dideoxy sequencing.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This dataset contains ~200K grade school math word problems. All answers in this dataset were generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.
Dataset Sources
Repository: microsoft/orca-math-word-problems-200k
Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math
Direct Use
This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/pharaouk/math-orca-arch.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Original dataset: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k Translation model: Seagull-13b-translation Post-processing
Removed translation repetition errors; checked for LaTeX errors (possibly not exhaustive, e.g., errors such as /(/) -> /(/ and the like...)
Citation
@misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL}… See the full description on the dataset page: https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
orca-math-word-problems-200k-askllm-v1
This dataset scores microsoft/orca-math-word-problems-200k with the Ask-LLM method. In addition to the columns of the original dataset, an askllm_score column has been added, which stores the Ask-LLM score. The LLM used for Ask-LLM scoring is Rakuten/RakutenAI-7B-instruct, and the prompt is as follows. ### {data} ###
Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of… See the full description on the dataset page: https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1.
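As a rough illustration of that scoring recipe, the hedged sketch below wraps a datapoint in ### markers, asks the scoring question, and uses the probability the model assigns to an affirmative first token as the score; the exact prompt wording and token handling used for this dataset may differ.

# Hedged sketch of Ask-LLM-style scoring; not the authors' exact script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Rakuten/RakutenAI-7B-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def askllm_score(data: str) -> float:
    prompt = (
        f"### {data} ###\n"
        "Does the previous paragraph demarcated within ### and ### contain "
        "informative signal for pre-training a large-language model? Answer yes or no:"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    # Sum the probability mass of plausible "yes" first tokens (assumed heuristic).
    yes_ids = {tok("yes", add_special_tokens=False).input_ids[0],
               tok(" yes", add_special_tokens=False).input_ids[0]}
    return float(probs[list(yes_ids)].sum())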
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Turkmen Orca Math Word Problems 200k Dataset
Overview
This dataset is a Turkmen translation of the original microsoft/orca-math-word-problems-200k dataset. The Orca Math Word Problems dataset contains 200,000 high-quality math word problems and their solutions. This Turkmen version aims to extend the accessibility of math problem-solving datasets to the Turkmen language community.
Dataset Details
Original Dataset: microsoft/orca-math-word-problems-200k… See the full description on the dataset page: https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen.
https://choosealicense.com/licenses/other/
Development Process
Questions were taken from 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated. We used Qwen/Qwen3-14B to evaluate the appropriateness of those candidates.
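A hedged sketch of such an appropriateness check is given below; the prompt wording and any acceptance criterion are assumptions, not the authors' published pipeline.

# Hedged sketch: ask Qwen/Qwen3-14B whether a translated question looks usable.
from transformers import pipeline

judge = pipeline("text-generation", model="Qwen/Qwen3-14B", device_map="auto")

def judge_question(question_vi: str) -> str:
    messages = [{
        "role": "user",
        "content": ("Is the following Vietnamese math word problem well-formed and "
                    "solvable? Answer yes or no, with a one-sentence reason.\n\n" + question_vi),
    }]
    out = judge(messages, max_new_tokens=512)
    return out[0]["generated_text"][-1]["content"]  # the model's reply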
License
Qwen/Qwen3-14B: https://choosealicense.com/licenses/apache-2.0/
5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated: https://huggingface.co/datasets/5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated
Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tangled-llama-pints-1.5b-v0.2-dataset
Combined dataset, as a single JSONL file, built from the following datasets; a hedged merge sketch follows the list:
laurentiubp/systemchat-sharegpt
Open-Orca/slimorca-deduped-cleaned-corrected
Crystalcareai/openhermes_200k_unfiltered
Locutusque/function-calling-chatml
m-a-p/CodeFeedback-Filtered-Instruction
microsoft/orca-math-word-problems-200k
meta-math/MetaMathQA
mlabonne/FineTome-100k
arcee-ai/agent-data
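The sketch below streams a couple of the listed sources and appends every row, tagged with its origin, to one JSONL file; column harmonisation across the sources is dataset-specific and is not shown, so treat this as an assumed outline of the combination step rather than the actual build script.

# Hedged sketch: merge several Hugging Face datasets into a single JSONL file.
import json
from datasets import load_dataset

sources = [
    "microsoft/orca-math-word-problems-200k",
    "meta-math/MetaMathQA",
]

with open("combined.jsonl", "w", encoding="utf-8") as out:
    for name in sources:
        for row in load_dataset(name, split="train"):
            out.write(json.dumps({"source": name, **row}, ensure_ascii=False) + "\n")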
Dataset Overview
This dataset contains nearly 2.35M English speech-instruction-to-text-answer samples, built from a combination of:
Intel/orca_dpo_pairs
routellm/gpt4_dataset
nomic-ai/gpt4all-j-prompt-generations
microsoft/orca-math-word-problems-200k
allenai/WildChat-1M
Open-Orca/oo-gpt4-200k
Magpie-Align/Magpie-Pro-300K-Filtered
qiaojin/PubMedQA
Undi95/Capybara-ShareGPT
HannahRoseKirk/prism-alignment
BAAI/Infinity-Instruct
Usage
from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/Menlo/prompt-voice-v1.5.
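The usage snippet above is truncated on this page; a minimal, assumed continuation is:

# Assumed continuation of the truncated usage example (split name unverified).
from datasets import load_dataset

ds = load_dataset("Menlo/prompt-voice-v1.5", split="train")
print(ds)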
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a preview of the full Raiden-Deepseek-R1 creative and analytical reasoning dataset, containing the first ~6k rows. Get the full dataset here! This dataset uses synthetic data generated by deepseek-ai/DeepSeek-R1. The initial release of Raiden uses 'creative_content' and 'analytical_reasoning' prompts from microsoft/orca-agentinstruct-1M-v1. Dataset has not been reviewed for format or accuracy. All responses are synthetic and provided without editing. Use as you will.