22 datasets found

orca-agentinstruct-1M-v1
huggingface.co
Updated Nov 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Microsoft (2024). orca-agentinstruct-1M-v1 [Dataset]. https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 4, 2024
Dataset authored and provided by
Microsofthttp://microsoft.com/
License
https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
Dataset Card

This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.
orca-math-word-problems-200k
huggingface.co
Updated Mar 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
agicorp (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 4, 2024
Dataset provided by
Agicorp
Authors
agicorp
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card

This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to Orca-Math: Unlocking the potential of SLMs in Grade School Math for details about the dataset construction.

Dataset Sources

Repository: microsoft/orca-math-word-problems-200k Paper: Orca-Math: Unlocking the potential of SLMs in Grade School Math

Direct Use

This dataset has been designed to… See the full description on the dataset page: https://huggingface.co/datasets/agicorp/orca-math-word-problems-200k.
t
ORCAS-I
researchdata.tuwien.ac.at
tsv
Updated Jun 25, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wojciech Kusa; Wojciech Kusa; Daria Alexander; Daria Alexander; Arjen P. de Vries; Arjen P. de Vries (2024). ORCAS-I [Dataset]. http://doi.org/10.48436/pp7xz-n9a06
Explore at:
tsvAvailable download formats
Unique identifier
https://doi.org/10.48436/pp7xz-n9a06
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Wojciech Kusa; Wojciech Kusa; Daria Alexander; Daria Alexander; Arjen P. de Vries; Arjen P. de Vries
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ORCAS-I is an annotated version of ORCAS dataset (Craswell et al., 2020) annotated with user intents using weak supervision. It allows you to train your algorithm on various types of user intents. Those intents are initially taken from Broder's (2002) classification: informational, navigational and transactional. We also refined this classification and added two subcategories inside the informational category: factual and instrumental. If the intent did not get any label inside the informational category it was classified as abstain.

ORCAS-I consists of the following files:
a) ORCAS-I-18M.tsv
A complete ORCAS data set which contains 18 million unique query-urls pairs.
dataset size: 18,823,602
unique queries: 10,405,339
unique URLs: 1,422,029
unique domains: 241,199

b) ORCAS-I-2M.tsv
A 2M subset of ORCAS-I-18M.tsv that we used for our experiments with different machine learning algorithms.
dataset size: 2,000,000
unique queries: 1,796,652
unique URLs: 618,679
unique domains: 126,001

Both ORCAS-I-18M and ORCAS-I-2M contain the following columns:
qid: the id of the query
query: the text of the query
url: the url that the user clicked
did: the document from TREC deep learning track that the url leads to
level_1: first level of annotation which has three top level categories: informational, navigational and transactional
level_2: second level of annotation (only classifies according to factual and instrumental categories, so all the other intents in the column are classified as abstain)
label: final intent label. Provides the annotation for informational, navigational and transactional categories and also for factual, instrumental and abstain subcategories
data_split: either 'train' or 'validation' that corresponds to split used during the original experiments
You can train your classifier either on the 3 top level categories (column 'level_1') or on the full taxonomy (column 'label').

c) ORCAS-I-gold.tsv
This is a test file that contains 1000 randomly selected queries from the full dataset (they are excluded from the 2M sample). These queries were manually annotated by two IR specialists.
dataset size: 1,000
unique queries: 1,000
unique URLs: 995
unique domains: 700
ORCAS-I-gold contains the following columns:
qid: the id of the query
query: the text of the query
url: the url that the user clicked
did: the document from TREC deep learning track that the url leads to
label_manual - manually annotated intent
data_split: always equal to 'test'
h
microsoft-orca-agentinstruct-1M-v1_sample100
huggingface.co
Updated Jul 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Pipelines Mock (2025). microsoft-orca-agentinstruct-1M-v1_sample100 [Dataset]. https://huggingface.co/datasets/data-pipelines-mock/microsoft-orca-agentinstruct-1M-v1_sample100
Explore at:
Dataset updated
Jul 1, 2025
Dataset authored and provided by
Data Pipelines Mock
Description
data-pipelines-mock/microsoft-orca-agentinstruct-1M-v1_sample100 dataset hosted on Hugging Face and contributed by the HF Datasets community
h
microsoft_Orca-2-13b-details
huggingface.co
Updated Jul 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Open LLM Leaderboard (2025). microsoft_Orca-2-13b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-13b-details
Explore at:
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Open LLM Leaderboard
Description
Dataset Card for Evaluation run of microsoft/Orca-2-13b

Dataset automatically created during the evaluation run of model microsoft/Orca-2-13b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-13b-details.
O
ORCAS
opendatalab.com
zip
Updated Sep 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Microsoft (2022). ORCAS [Dataset]. https://opendatalab.com/ORCAS
Explore at:
zip(11268007470 bytes)Available download formats
Dataset updated
Sep 22, 2022
Dataset provided by
Microsoft
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
h
microsoft_Orca-2-7b-details
huggingface.co
Updated Jul 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Open LLM Leaderboard (2025). microsoft_Orca-2-7b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details
Explore at:
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Open LLM Leaderboard
Description
Dataset Card for Evaluation run of microsoft/Orca-2-7b

Dataset automatically created during the evaluation run of model microsoft/Orca-2-7b The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated task. The dataset has been created from 2 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run.The "train" split is always pointing to the latest results. An additional… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/microsoft_Orca-2-7b-details.
d
Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to...
dataone.org
search.dataone.org
Updated Oct 3, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Craig Matkin (2019). Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to 2016, Gulf Watch Alaska Pelagic Component [Dataset]. http://doi.org/10.24431/rw1k32h
Explore at:
Unique identifier
https://doi.org/10.24431/rw1k32h
Dataset updated
Oct 3, 2019
Dataset provided by
Research Workspace
Authors
Craig Matkin
Time period covered
Jul 4, 2012 - Jul 28, 2016
Area covered

Description
These data are part of the Gulf Watch Alaska (GWA), Pelagic Component of the Exxon Valdez Oil Spill Trustee Council, project numbers 12120114-M, 13120114-M, 14120114-M, 15120114-M and 16120114-M. Gulf Watch Alaska is the long-term ecosystem monitoring program of the Exxon Valdez Oil Spill Trustee Council for the marine ecosystem affected by the 1989 oil spill. The project is a continuation of annual monitoring of AB pod and the AT1 population killer whales in Prince William Sound-Kenai Fjords. These groups of whales suffered significant losses at the time of the oil spill and have not recovered at projected rates. Monitoring of all the major pods and their current movements, range, feeding habits, and contaminant levels will help determine their vulnerability to future perturbations, including oil spills. This dataset is a database containing information from the killer whale surveys conducted from 2001 to 2016 in Prince William Sound and the Gulf of Alaska. The native file format is a Microsoft Office Access 2007 database (12.0 6735.5000), components of which have been separated and stored in Orcadatabase_CSV_tables.zip as .csv files to ensure that the information contained within the Access database file is openly accessible to data customers. Details of killer whale surveys, and subsequent encounters are stored in the file. Stored information includes the date, time, observers, behavioral observations, samples taken, location, pods present, number of whales present, name of survey vessel, and other pertinent information.
d
Potential new species of pseudaliid lung nematode (Metastrongyloidea) from...
dataone.org
search.dataone.org
+2more
Updated Jul 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joy Ometere Boyi (2025). Potential new species of pseudaliid lung nematode (Metastrongyloidea) from two stranded neonatal orcas (Orcinus orca) characterised by ITS-2 and COI sequences [Dataset]. http://doi.org/10.5061/dryad.v15dv421f
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.v15dv421f
Dataset updated
Jul 16, 2025
Dataset provided by
Dryad Digital Repository
Authors
Joy Ometere Boyi
Time period covered
Jan 1, 2023
Description
Knowledge about parasite species of orcas, their prevalence and impact on the health status is scarce. Only two records of lungworm infections in orca exist from male neonatal orcas stranded in Germany and Norway. The nematodes were identified as Halocercus sp. (Pseudaliidae), which have been described in the respiratory tract of multiple odontocete species, but morphological identification to species level remained impossible due to the fragile structure and ambiguous morphological features. Pseudaliid nematodes (Metastrongyloidea) are specific to the respiratory tract of toothed whales and are hypothesized to have become almost extinct in terrestrial mammals. Severe lungworm infections can cause secondary bacterial infections and bronchopneumonia and are a common cause of mortality in odontocetes. DNA isolations and subsequent sequencing of the rDNA ITS-2 and mtDNA COI revealed nucleotide differences between previously describedÂ Halocercus species from common dolphin (H. delphini) and..., Sanger dideoxy sequencing., The data files can be opened with Microsoft Word or Notepad.
orca-math-word-problems-200k
huggingface.co
Updated Mar 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hugging Face H4 (2024). orca-math-word-problems-200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/orca-math-word-problems-200k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 12, 2024
Dataset provided by
Hugging Facehttps://huggingface.co/
Authors
Hugging Face H4
Description
Dataset Card for Orca Math Word Problems 200k

This is a formatted version of microsoft/orca-math-word-problems-200k to store the conversations in the same format as the OpenAI SDK.
h
orca-agentinstruct-shuffle_scored
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenDataArena, orca-agentinstruct-shuffle_scored [Dataset]. https://huggingface.co/datasets/OpenDataArena/orca-agentinstruct-shuffle_scored
Explore at:
Authors
OpenDataArena
Description
Orca-agentinstruct-shuffle_scored- with OpenDataArena Scores

This dataset is a scored version of the original microsoft/orca-agentinstruct-1M-v1 dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular data… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/orca-agentinstruct-shuffle_scored.
S
Small Language Model Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jan 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Insights Market (2025). Small Language Model Report [Dataset]. https://www.datainsightsmarket.com/reports/small-language-model-1498279
Explore at:
pdf, ppt, docAvailable download formats
Dataset updated
Jan 15, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Small Language Model market is projected to grow from $6,430 million in 2025 to $37,780 million by 2033, at a CAGR of 17.8%. Growing adoption of AI, machine learning (ML), and natural language processing (NLP) technologies is driving the market. Additionally, increasing demand for virtual assistants, chatbots, and content generation tools is further fueling the growth. The market is segmented into application, type, region, and company. Based on application, the market is divided into artificial intelligence training, chatbots and virtual assistants, content generation, language translation, code development, medical diagnosis and treatment, education, and others. Based on type, the market is classified into below 5 billion parameters and above 5 billion parameters. Geographically, the market is segmented into North America, South America, Europe, Middle East & Africa, and Asia Pacific. Key players in the market include Llama 2 (Meta AI), Phi2 (Microsoft), Orca (Microsoft), Stable Beluga 7B (Meta AI), X Gen (Salesforce AI), Qwen (Alibaba), Alpaca 7B (Meta), MPT (Mosaic ML), Falcon 7B (Technology Innovation Institute (TII) from the UAE), and Zephyr (Hugging Face).
h
orca-math-word-problems-200k_scored
huggingface.co
Updated Mar 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenDataArena (2024). orca-math-word-problems-200k_scored [Dataset]. https://huggingface.co/datasets/OpenDataArena/orca-math-word-problems-200k_scored
Explore at:
Dataset updated
Mar 4, 2024
Authors
OpenDataArena
Description
Orca-math-word-problems-200k_scored- with OpenDataArena Scores

This dataset is a scored version of the original microsoft/orca-math-word-problems-200k dataset. The scoring was performed using the OpenDataArena-Tool, a comprehensive suite of automated evaluation methods for assessing instruction-following datasets. This version of the dataset includes rich, multi-dimensional scores for both the instructions (questions) and the instruction-response pairs, allowing for highly granular… See the full description on the dataset page: https://huggingface.co/datasets/OpenDataArena/orca-math-word-problems-200k_scored.
h
orca-math-word-problems-200k-askllm-v1
huggingface.co
Updated Aug 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Team Kuma (2024). orca-math-word-problems-200k-askllm-v1 [Dataset]. https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 20, 2024
Dataset authored and provided by
Team Kuma
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
orca-math-word-problems-200k-askllm-v1

データセット microsoft/orca-math-word-problems-200k に対して、 Ask-LLM 手法でスコア付けしたデータセットです。元データセットのカラムに加え askllm_score というカラムが追加されており、ここに Ask-LLM のスコアが格納されています。 Ask-LLM でスコア付けに使用した LLM は Rakuten/RakutenAI-7B-instruct で、プロンプトは以下の通りです。 ### {data} ###

Does the previous paragraph demarcated within ### and ### contain informative signal for pre-training a large-language model? An informative datapoint should be well-formatted, contain some usable knowledge of… See the full description on the dataset page: https://huggingface.co/datasets/geniacllm/orca-math-word-problems-200k-askllm-v1.
h
orca-math-word-problems-193k-korean
huggingface.co
Updated Mar 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jisoo Kim (2024). orca-math-word-problems-193k-korean [Dataset]. https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 27, 2024
Authors
Jisoo Kim
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
원본 데이터셋: https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k 번역 모델: Seagull-13b-translation 후처리

번역 repetition 오류 제거 LaTeX 오류 체크(전부는 아닐 수 있음. /(/) -> /(/ 같은 오류 등...)

Citation

@misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL}… See the full description on the dataset page: https://huggingface.co/datasets/kuotient/orca-math-word-problems-193k-korean.
h
orca-math-word-problems-200k-turkmen
huggingface.co
Updated Jul 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bahtiyar Mamedov (2024). orca-math-word-problems-200k-turkmen [Dataset]. https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 6, 2024
Authors
Bahtiyar Mamedov
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Turkmen Orca Math Word Problems 200k Dataset

Overview

This dataset is a Turkmen translation of the original microsoft/orca-math-word-problems-200k dataset. The Orca Math Word Problems dataset contains 200,000 high-quality math word problems and their solutions. This Turkmen version aims to extend the accessibility of math problem-solving datasets to the Turkmen language community.

Dataset Details

Original Dataset: microsoft/orca-math-word-problems-200k… See the full description on the dataset page: https://huggingface.co/datasets/mamed0v/orca-math-word-problems-200k-turkmen.
h
Math-Qwen3-14B-vi
huggingface.co
Updated May 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
jeongjaeyong (2025). Math-Qwen3-14B-vi [Dataset]. https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi
Explore at:
Dataset updated
May 2, 2025
Authors
jeongjaeyong
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Development Process

question dataset from 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated We used Qwen/Qwen3-14B to evaluate the appropriateness of those candidates.

License

Qwen/Qwen3-14B : https://choosealicense.com/licenses/apache-2.0/ 5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated : https://huggingface.co/datasets/5CD-AI/Vietnamese-microsoft-orca-math-word-problems-200k-gg-translated

Acknowledgement… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/Math-Qwen3-14B-vi.
h
tangled-llama-pints-1.5b-v0.1-dataset
huggingface.co
Updated Sep 18, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TangledLabs (2024). tangled-llama-pints-1.5b-v0.1-dataset [Dataset]. https://huggingface.co/datasets/tangledlabs/tangled-llama-pints-1.5b-v0.1-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 18, 2024
Dataset authored and provided by
TangledLabs
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
tangled-llama-pints-1.5b-v0.1-dataset

Combined dataset as single JSONL from following datasets:

laurentiubp/systemchat-sharegpt Open-Orca/slimorca-deduped-cleaned-corrected Crystalcareai/openhermes_200k_unfiltered Locutusque/function-calling-chatml m-a-p/CodeFeedback-Filtered-Instruction microsoft/orca-math-word-problems-200k
h
prompt-voice-v1.5
huggingface.co
Updated Jul 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Menlo Research (2024). prompt-voice-v1.5 [Dataset]. https://huggingface.co/datasets/Menlo/prompt-voice-v1.5
Explore at:
Dataset updated
Jul 22, 2024
Dataset authored and provided by
Menlo Research
Description
Dataset Overview

This dataset contains nearly 2.35M English speech instruction to text answer samples, using the combination of:

Intel/orca_dpo_pairs routellm/gpt4_dataset nomic-ai/gpt4all-j-prompt-generations microsoft/orca-math-word-problems-200k allenai/WildChat-1M Open-Orca/oo-gpt4-200k Magpie-Align/Magpie-Pro-300K-Filtered qiaojin/PubMedQA Undi95/Capybara-ShareGPT HannahRoseKirk/prism-alignment BAAI/Infinity-Instruct

Usage

from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/Menlo/prompt-voice-v1.5.
h
omni-math
huggingface.co
Updated Aug 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Omni (2025). omni-math [Dataset]. https://huggingface.co/datasets/omniomni/omni-math
Explore at:
Dataset updated
Aug 23, 2025
Dataset authored and provided by
Omni
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
GitHub
Website
Paper (Coming Soon)

Dataset Details

This dataset is a combination of math Question-Answer datasets spanning various difficulties and concepts. This dataset contains only chat-based data.

Sources

This dataset was sourced from the following open-sourced datasets: Math

meta-math/MetaMathQA microsoft/orca-math-word-problems-200k openai/gsm8k

Facebook

Twitter

Click to copy link

Link copied

Cite

Microsoft (2024). orca-agentinstruct-1M-v1 [Dataset]. https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1

orca-agentinstruct-1M-v1

microsoft/orca-agentinstruct-1M-v1

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Nov 4, 2024

Dataset authored and provided by

Microsofthttp://microsoft.com/

License

https://choosealicense.com/licenses/cdla-permissive-2.0/https://choosealicense.com/licenses/cdla-permissive-2.0/

Description

Dataset Card

This dataset is a fully synthetic set of instruction pairs where both the prompts and the responses have been synthetically generated, using the AgentInstruct framework. AgentInstruct is an extensible agentic framework for synthetic data generation. This dataset contains ~1 million instruction pairs generated by the AgentInstruct, using only raw text content publicly avialble on the Web as seeds. The data covers different capabilities, such as text editing, creative… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1.

Clear search

Close search

Google apps

Main menu

orca-agentinstruct-1M-v1

orca-math-word-problems-200k

ORCAS-I

a) ORCAS-I-18M.tsv

b) ORCAS-I-2M.tsv

c) ORCAS-I-gold.tsv

microsoft-orca-agentinstruct-1M-v1_sample100

microsoft_Orca-2-13b-details

ORCAS

microsoft_Orca-2-7b-details

Database of Southern Alaska Killer Whale Surveys and Encounters, 2001 to...

Potential new species of pseudaliid lung nematode (Metastrongyloidea) from...

orca-math-word-problems-200k

orca-agentinstruct-shuffle_scored

Small Language Model Report

orca-math-word-problems-200k_scored

orca-math-word-problems-200k-askllm-v1

orca-math-word-problems-193k-korean

orca-math-word-problems-200k-turkmen

Math-Qwen3-14B-vi

tangled-llama-pints-1.5b-v0.1-dataset

prompt-voice-v1.5

omni-math

orca-agentinstruct-1M-v1

microsoft/orca-agentinstruct-1M-v1