Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A classics data set for use with mistral-7b-v0.1
This dataset was used to fine-tune the Mistral 7B base model. It contains 1,640 Q/A pairs on Greek & Roman history. The dataset was generated via Mixtral-8x7B Instruct v0.1, run over 512-token chunks of vols. 2 & 3 of Will Durant's 13-volume Story of Civilization (The Life of Greece and Caesar and Christ). Training data was formatted with [INST] and [/INST] delimiting instructions: {"text": "Q: "Why did many Greeks come to resent Rome's… See the full description on the dataset page: https://huggingface.co/datasets/wmmarcellino/mistral-7b-v0.1-GreeceRome-v0.1.
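As a rough illustration of that record format, here is a minimal sketch; the exact prompt wording ("Q:"/"A:") and the example pair below are assumptions for illustration, not taken from the dataset itself:

```python
import json

def format_pair(question: str, answer: str) -> dict:
    # Mistral instruction format: the prompt sits between [INST] and
    # [/INST]; the answer follows as the completion target.
    return {"text": f"[INST] Q: {question} [/INST] A: {answer}"}

# Hypothetical pair; the real pairs were generated by Mixtral-8x7B
# Instruct v0.1 over 512-token chunks of Durant's text.
print(json.dumps(format_pair(
    "Why did many Greeks come to resent Rome's rule?",
    "Roman taxation and political interference bred lasting resentment.",
)))
```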
🎯 DART-Math
Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
📝 Paper@arXiv | 🤗 Datasets&Models@HF | 🐱 Code@GitHub
🐦 Thread@X(Twitter) | 🐶 Chinese Blog@Zhihu | 📊 Leaderboard@PapersWithCode | 📑 BibTeX
Datasets: DART-Math
DART-Math datasets are state-of-the-art, data-efficient open-source instruction-tuning datasets for mathematical reasoning.
DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query sets of the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, the opposite of vanilla rejection sampling.
Performance produced by DART-Math-Hard is usually, but not necessarily, slightly better (~1% in absolute terms) than that of DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.
Comparison between Mathematical Instruction Tuning Datasets
Most previous datasets were constructed with ChatGPT, and many of them are not open-source, especially the best-performing ones.
Math SFT Dataset | # of Samples | MATH | GSM8K | College | Synthesis Agent(s) | Open-Source |
---|---|---|---|---|---|---|
WizardMath | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
MetaMathQA | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | ✓ |
MMIQC | 2294k | 37.4 | 75.4 | 28.5 | GPT-4+GPT-3.5+Human | ✓ |
Orca-Math | 200k | -- | -- | -- | GPT-4 | ✓ |
Xwin-Math-V1.1 | 1440k | 45.5 | 84.9 | 27.6 | GPT-4 | ✗ |
KPMath-Plus | 1576k | 46.8 | 82.1 | -- | GPT-4 | ✗ |
MathScaleQA | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
DART-Math-Uniform | 591k | 43.5 | 82.6 | 26.9 | DeepSeekMath-7B-RL | ✓ |
DART-Math-Hard | 585k | 45.5 | 81.1 | 29.4 | DeepSeekMath-7B-RL | ✓ |
MATH and GSM8K are in-domain, while College (Math) is out-of-domain. Performance figures here are for models fine-tuned from Mistral-7B, except Xwin-Math-V1.1, which is based on Llama2-7B. Bold/italic marks the best/second-best score here.
Dataset Construction: DARS - Difficulty-Aware Rejection Sampling
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.
Motivated by this observation, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries. Specifically, we introduce two strategies (see the code sketch below) to increase the number of correct responses for difficult queries:
1) Uniform: sample responses for each query until every query accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2) Prop2Diff: continue sampling responses until the number of correct responses for each query is proportional to its difficulty score, so that the most challenging queries receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works demonstrating that difficult samples can be more effective for enhancing model capabilities (Sorscher et al., 2022; Liu et al., 2024b).
See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.
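The following is a minimal sketch of the two strategies as described above, not the actual pipeline: the sampler, verifier, difficulty scores, and the `max_tries` cap are stand-ins (the real system samples from DeepSeekMath-7B-RL and verifies answers against references):

```python
def dars(queries, difficulty, sample_fn, verify_fn, k_u=None, k_p=None, max_tries=1000):
    """Collect verified-correct responses per query.

    DARS-Uniform:   pass k_u  -- every query targets k_u correct responses.
    DARS-Prop2Diff: pass k_p  -- targets scale with difficulty, so the
                    hardest query gets k_p and easier queries get fewer.
    """
    max_d = max(difficulty[q] for q in queries)
    dataset = []
    for q in queries:
        if k_u is not None:                      # DARS-Uniform
            target = k_u
        else:                                    # DARS-Prop2Diff
            target = max(1, round(k_p * difficulty[q] / max_d))
        correct, tries = [], 0
        while len(correct) < target and tries < max_tries:
            tries += 1
            response = sample_fn(q)              # e.g. DeepSeekMath-7B-RL
            if verify_fn(q, response):           # keep only correct answers
                correct.append(response)
        dataset.extend((q, r) for r in correct)
    return dataset
```

Note how the only difference between the two strategies is the per-query target: a constant for Uniform, and a value proportional to the difficulty score for Prop2Diff.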
Citation If you find our data, model or code useful for your work, please kindly cite our paper:
```latex
@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}
```
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by gathering human-authored corpora from several public health sites and generating additional data via three different LLMs: GPT-4o, Mistral-7B, and Llama-3.1. We included texts in English, Spanish, German, and French from the biomedical domain. The current version gathers 50% AI-generated and 50% human-written texts.
The following are the data we used:
The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge FernĂĄndez-GarcĂa, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content".
JSON files: These are separated into TRAIN and TEST. Each file has a list of hashes, one per text, and each hash contains the following fields:
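The field list is truncated in this excerpt, so the following is only a minimal loading sketch; the file name and field names are hypothetical placeholders:

```python
import json

# File name is an assumption; the dataset ships TRAIN and TEST JSON
# files keyed by a per-text hash.
with open("train.json", encoding="utf-8") as f:
    train = json.load(f)

for text_hash, record in train.items():
    # Field names are hypothetical stand-ins for the truncated list
    # above, e.g. the text itself and an AI/human label.
    text = record.get("text")
    label = record.get("label")
```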
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
🐋 The OpenOrca Dataset! 🐋
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
Official Models
Mistral-7B-OpenOrca
Our latest model, the first 7B to score better overall than all… See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MULTITuDEv3 is a benchmark dataset for multilingual machine-generated text detection, originally described in an EMNLP 2023 conference paper. It consisted of 7992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66089 texts generated by 8 large language models (using the headlines of the news articles) (see MULTITuDEv1). The creation process and scripts for replication/extension are located in a GitHub repository. The dataset was further extended in MULTITuDEv2 with texts obfuscated by 10 authorship obfuscation methods, described in an EMNLP 2024 Findings conference paper. This version covers 21 languages (instead of the original 11) with mostly equal coverage in the training set, and was introduced in an ACL 2025 conference paper for out-of-domain evaluation of detectors trained on social-media texts.
If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.
The dataset has the following fields:
Splits:
Binary labels:
Multiclass labels:
Languages:
Language | train | test |
---|---|---|
Arabic | 7975 | 2392 |
Bulgarian | 7954 | 2386 |
Catalan | 2894 | 2389 |
Chinese | 7926 | 2383 |
Croatian | 7951 | 2384 |
Czech | 7962 | 2389 |
Dutch | 7958 | 2386 |
English | 7954 | 2384 |
German | 7951 | 2388 |
Greek | 7944 | 2384 |
Hungarian | 7964 | 2385 |
Irish | 2333 | 2381 |
Polish | 7946 | 2383 |
Portuguese | 7956 | 2388 |
Romanian | 7949 | 2386 |
Russian | 7945 | 2382 |
Scottish Gaelic | 7899 | 2377 |
Slovak | 7946 | 2385 |
Slovenian | 7947 | 2386 |
Spanish | 7947 | 2387 |
Ukrainian | 7939 | 2385 |
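As a hedged sketch of how one might reproduce the tally above, assuming the records have been loaded into a DataFrame; the file name and the column names `language` and `split` are assumptions, not taken from the dataset card:

```python
import pandas as pd

# File and column names are assumptions for illustration.
df = pd.read_json("multitude_v3.json")

# Tally records per language and split; the result should mirror the
# per-language train/test table above.
counts = df.groupby(["language", "split"]).size().unstack(fill_value=0)
print(counts)
```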