https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/
Dataset Card
This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.
Dataset Sources
Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math
Direct Use
This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ORCAS-I is an annotated version of ORCAS dataset (Craswell et al., 2020) annotated with user intents using weak supervision. It allows you to train your algorithm on various types of user intents. Those intents are initially taken from Broder's (2002) classification: informational, navigational and transactional. We also refined this classification and added two subcategories inside the informational category: factual and instrumental. If the intent did not get any label inside the informational category it was classified as abstain.
ORCAS-I consists of the following files:
A complete ORCAS data set which contains 18 million unique query-urls pairs.
dataset size: 18,823,602
unique queries: 10,405,339
unique URLs: 1,422,029
unique domains: 241,199
A 2M subset of ORCAS-I-18M.tsv that we used for our experiments with different machine learning algorithms.
dataset size: 2,000,000
unique queries: 1,796,652
unique URLs: 618,679
unique domains: 126,001
Both ORCAS-I-18M and ORCAS-I-2M contain the following columns:
You can train your classifier either on the 3 top level categories (column 'level_1') or on the full taxonomy (column 'label').
This is a test file that contains 1000 randomly selected queries from the full dataset (they are excluded from the 2M sample). These queries were manually annotated by two IR specialists.
dataset size: 1,000
unique queries: 1,000
unique URLs: 995
unique domains: 700
ORCAS-I-gold contains the following columns:
data-pipelines-mock/microsoft-orca-agentinstruct-1M-v1_sample100 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Evaluation run of microsoft/Orca-2-13b
Dataset automatically created during the evaluation run of model microsoft/Orca-2-13b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-13b-details.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
Dataset Card for Evaluation run of microsoft/Orca-2-7b
Dataset automatically created during the evaluation run of model microsoft/Orca-2-7b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details.
These data are part of the Gulf Watch Alaska (GWA), Pelagic Component of the Exxon Valdez Oil Spill Trustee Council, project numbers 12120114-M, 13120114-M, 14120114-M, 15120114-M and 16120114-M. Gulf Watch Alaska is the long-term ecosystem monitoring program of the Exxon Valdez Oil Spill Trustee Council for the marine ecosystem affected by the 1989 oil spill. The project is a continuation of annual monitoring of AB pod and the AT1 population killer whales in Prince William Sound-Kenai Fjords. These groups of whales suffered significant losses at the time of the oil spill and have not recovered at projected rates. Monitoring of all the major pods and their current movements, range, feeding habits, and contaminant levels will help determine their vulnerability to future perturbations, including oil spills. This dataset is a database containing information from the killer whale surveys conducted from 2001 to 2016 in Prince William Sound and the Gulf of Alaska. The native file format is a Microsoft Office Access 2007 database (12.0 6735.5000), components of which have been separated and stored in Orcadatabase_CSV_tables.zip as .csv files to ensure that the information contained within the Access database file is openly accessible to data customers. Details of killer whale surveys, and subsequent encounters are stored in the file. Stored information includes the date, time, observers, behavioral observations, samples taken, location, pods present, number of whales present, name of survey vessel, and other pertinent information.
Knowledge about parasite species of orcas, their prevalence and impact on the health status is scarce. Only two records of lungworm infections in orca exist from male neonatal orcas stranded in Germany and Norway. The nematodes were identified as Halocercus sp. (Pseudaliidae), which have been described in the respiratory tract of multiple odontocete species, but morphological identification to species level remained impossible due to the fragile structure and ambiguous morphological features. Pseudaliid nematodes (Metastrongyloidea) are specific to the respiratory tract of toothed whales and are hypothesized to have become almost extinct in terrestrial mammals. Severe lungworm infections can cause secondary bacterial infections and bronchopneumonia and are a common cause of mortality in odontocetes. DNA isolations and subsequent sequencing of the rDNA ITS-2 and mtDNA COI revealed nucleotide differences between previously described Halocercus species from common dolphin (H. delphini) and..., Sanger dideoxy sequencing., The data files can be opened with Microsoft Word or Notepad.
Dataset Card for Orca Math Word Problems 200k
This is a formatted version of microsoft/orca-math-word-problems-200k to store the conversations in the same format as the OpenAI SDK.
Orca-agentinstruct-shuffle_scored- with OpenDataArena Scores
This dataset is a scored version of the original microsoft/orca-agentinstruct-1M-v1 dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular data… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/orca-agentinstruct-shuffle_scored.
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The Small Language Model market is projected to grow from $6,430 million in 2025 to $37,780 million by 2033, at a CAGR of 17.8%. Growing adoption of AI, machine learning (ML), and natural language processing (NLP) technologies is driving the market. Additionally, increasing demand for virtual assistants, chatbots, and content generation tools is further fueling the growth. The market is segmented into application, type, region, and company. Based on application, the market is divided into artificial intelligence training, chatbots and virtual assistants, content generation, language translation, code development, medical diagnosis and treatment, education, and others. Based on type, the market is classified into below 5 billion parameters and above 5 billion parameters. Geographically, the market is segmented into North America, South America, Europe, Middle East & Africa, and Asia Pacific. Key players in the market include Llama 2 (Meta AI), Phi2 (Microsoft), Orca (Microsoft), Stable Beluga 7B (Meta AI), X Gen (Salesforce AI), Qwen (Alibaba), Alpaca 7B (Meta), MPT (Mosaic ML), Falcon 7B (Technology Innovation Institute (TII) from the UAE), and Zephyr (Hugging Face).
Orca-math-word-problems-200k_scored- with OpenDataArena Scores
This dataset is a scored version of the original microsoft/orca-math-word-problems-200k dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/orca-math-word-problems-200k_scored.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
orca-math-word-problems-200k-askllm-v1
データセット microsoft/orca-math-word-problems-200k に対して、 Ask-LLM 手法でスコア付けしたデータセットです。 元データセットのカラムに加え askllm_score というカラムが追加されており、ここに Ask-LLM のスコアが格納されています。 Ask-LLM でスコア付けに使用した LLM は Rakuten/RakutenAI-7B-instruct で、プロンプトは以下の通りです。 ### {data} ###
Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of… See the full description on the dataset page: https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
원본 데이터셋: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k 번역 모델: Seagull-13b-translation 후처리
번역 repetition 오류 제거 LaTeX 오류 체크(전부는 아닐 수 있음. /(/) -> /(/ 같은 오류 등...)
Citation
@misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL}… See the full description on the dataset page: https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Turkmen Orca Math Word Problems 200k Dataset
Overview
This dataset is a Turkmen translation of the original microsoft/orca-math-word-problems-200k dataset. The Orca Math Word Problems dataset contains 200,000 high-quality math word problems and their solutions. This Turkmen version aims to extend the accessibility of math problem-solving datasets to the Turkmen language community.
Dataset Details
Original Dataset: microsoft/orca-math-word-problems-200k… See the full description on the dataset page: https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen.
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Development Process
question dataset from 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated We used Qwen/Qwen3-14B to evaluate the appropriateness of those candidates.
License
Qwen/Qwen3-14B : https://choosealicense.com/licenses/apache-2.0/ 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated : https://huggingface.co/datasets/5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated
Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tangled-llama-pints-1.5b-v0.1-dataset
Combined dataset as single JSONL from following datasets:
laurentiubp/systemchat-sharegpt Open-Orca/slimorca-deduped-cleaned-corrected Crystalcareai/openhermes_200k_unfiltered Locutusque/function-calling-chatml m-a-p/CodeFeedback-Filtered-Instruction microsoft/orca-math-word-problems-200k
Dataset Overview
This dataset contains nearly 2.35M English speech instruction to text answer samples, using the combination of:
Intel/orca_dpo_pairs routellm/gpt4_dataset nomic-ai/gpt4all-j-prompt-generations microsoft/orca-math-word-problems-200k allenai/WildChat-1M Open-Orca/oo-gpt4-200k Magpie-Align/Magpie-Pro-300K-Filtered qiaojin/PubMedQA Undi95/Capybara-ShareGPT HannahRoseKirk/prism-alignment BAAI/Infinity-Instruct
Usage
from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/Menlo/prompt-voice-v1.5.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
GitHub
Website
Paper (Coming Soon)
Dataset Details
This dataset is a combination of math Question-Answer datasets spanning various difficulties and concepts. This dataset contains only chat-based data.
Sources
This dataset was sourced from the following open-sourced datasets: Math
meta-math/MetaMathQA microsoft/orca-math-word-problems-200k openai/gsm8k
https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/
Dataset Card
This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.