5 datasets found
  1. mistral-7b-v0.1-GreeceRome-v0.1

    • huggingface.co
    Updated Feb 3, 2024
    Cite
    William Marcellino (2024). mistral-7b-v0.1-GreeceRome-v0.1 [Dataset]. https://huggingface.co/datasets/wmmarcellino/mistral-7b-v0.1-GreeceRome-v0.1
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2024
    Authors
    William Marcellino
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    A classics data set for use with mistral-7b-v0.1

    This dataset was used to fine-tune the Mistral 7B base model. It contains 1,640 Q/A pairs on Greek and Roman history. The dataset was generated with Mixtral-8x7B Instruct v0.1, run over 512-token chunks of vols. 2 and 3 of Will Durant's 13-volume Story of Civilization (The Life of Greece and Caesar and Christ). Training data was formatted with [INST] and [/INST] delimiting instructions: {"text": "Q: "Why did many Greeks come to resent Rome's
 See the full description on the dataset page: https://huggingface.co/datasets/wmmarcellino/mistral-7b-v0.1-GreeceRome-v0.1.
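
    The card shows the instruct formatting only in truncated form. A minimal sketch of how such Q/A pairs could be wrapped in Mistral's [INST]/[/INST] delimiters (the build_example helper and the sample answer are illustrative assumptions, not taken from the dataset):

        import json

        def build_example(question: str, answer: str) -> dict:
            # Wrap a Q/A pair in [INST] ... [/INST] delimiters, matching the
            # {"text": ...} schema shown in the description above. The exact
            # spacing/quoting inside the string is an assumption.
            return {"text": f"[INST]Q: {question}[/INST] A: {answer}"}

        # Illustrative pair; the real 1,640 pairs were generated by
        # Mixtral-8x7B Instruct v0.1 over chunks of Durant's volumes.
        example = build_example(
            "Why did many Greeks come to resent Rome's rule?",
            "Roman taxation and political interference bred resentment.",
        )
        print(json.dumps(example))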

  2. DART-Math-Uniform Dataset

    • paperswithcode.com
    Updated Jun 17, 2024
    + more versions
    Cite
    (2024). DART-Math-Uniform Dataset [Dataset]. https://paperswithcode.com/dataset/dart-math-uniform
    Dataset updated
    Jun 17, 2024
    Description

    🎯 DART-Math

    Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

    📝 Paper@arXiv | đŸ€— Datasets&Models@HF | đŸ± Code@GitHub

    🐩 Thread@X(Twitter) | đŸ¶ Chinese Blog@Zhihu | 📊 Leaderboard@PapersWithCode | 📑 BibTeX

    Datasets: DART-Math

    DART-Math datasets are state-of-the-art, data-efficient open-source instruction-tuning datasets for mathematical reasoning.

    DART-Math-Hard contains ~585k mathematical QA pair samples constructed by applying DARS-Prop2Diff to the query set of the MATH and GSM8K training sets, and achieves SOTA on many challenging mathematical reasoning benchmarks. It introduces a deliberate bias towards hard queries, opposite to vanilla rejection sampling.

    Performance produced by DART-Math-Hard is usually, but not necessarily, slightly better (~1% absolute) than that of DART-Math-Uniform, which contains ~591k samples constructed by applying DARS-Uniform.

    Comparison between Mathematical Instruction Tuning Datasets

    Most previous datasets are constructed with ChatGPT, and many are not open-source, especially the best-performing ones.

    Math SFT Dataset   | # of Samples | MATH | GSM8K | College | Synthesis Agent(s)  | Open-Source
    WizardMath         | 96k          | 32.3 | 80.4  | 23.1    | GPT-4               | ✗
    MetaMathQA         | 395k         | 29.8 | 76.5  | 19.3    | GPT-3.5             | ✓
    MMIQC              | 2294k        | 37.4 | 75.4  | 28.5    | GPT-4+GPT-3.5+Human | ✓
    Orca-Math          | 200k         | --   | --    | --      | GPT-4               | ✓
    Xwin-Math-V1.1     | 1440k       | 45.5 | 84.9  | 27.6    | GPT-4               | ✗
    KPMath-Plus        | 1576k       | 46.8 | 82.1  | --      | GPT-4               | ✗
    MathScaleQA        | 2021k       | 35.2 | 74.8  | 21.8    | GPT-3.5+Human       | ✗
    DART-Math-Uniform  | 591k        | 43.5 | 82.6  | 26.9    | DeepSeekMath-7B-RL  | ✓
    DART-Math-Hard     | 585k        | 45.5 | 81.1  | 29.4    | DeepSeekMath-7B-RL  | ✓

    MATH and GSM8K are in-domain, while College(Math) is out-of-domain. The performance figures here are for models fine-tuned from Mistral-7B, except Xwin-Math-V1.1, which is based on Llama2-7B. In the original table, bold/italic marks the best/second-best score.

    Dataset Construction: DARS (Difficulty-Aware Rejection Sampling)

    Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries.

    Motivated by this observation, we propose Difficulty-Aware Rejection Sampling (DARS) to collect more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:

    1) Uniform: sample responses for each query until it accumulates $k_u$ correct responses, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset.

    2) Prop2Diff: continue sampling responses until the number of correct responses for each query is proportional to its difficulty score, with the most challenging queries receiving $k_p$ responses, where $k_p$ is a hyperparameter. This introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works showing that difficult samples can be more effective for enhancing model capabilities (Sorscher et al., 2022; Liu et al., 2024b).
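
    As a rough sketch of the two strategies under the stated definitions (sample_response and is_correct stand in for the actual generator and verifier; this is not the authors' code):

        from typing import Callable, Dict, List

        def dars(queries: List[str],
                 difficulty: Dict[str, float],           # difficulty score per query, scaled to [0, 1]
                 sample_response: Callable[[str], str],  # placeholder: LLM sampler
                 is_correct: Callable[[str, str], bool], # placeholder: answer verifier
                 k_u: int = 4,                           # Uniform: target correct responses per query
                 k_p: int = 8,                           # Prop2Diff: responses for the hardest query
                 strategy: str = "uniform",
                 max_tries: int = 1000) -> Dict[str, List[str]]:
            """Collect correct responses per query until its target is met."""
            collected: Dict[str, List[str]] = {q: [] for q in queries}
            for q in queries:
                # Uniform: the same target k_u for every query.
                # Prop2Diff: target proportional to difficulty; hardest gets k_p.
                target = k_u if strategy == "uniform" else max(1, round(k_p * difficulty[q]))
                tries = 0
                while len(collected[q]) < target and tries < max_tries:
                    response = sample_response(q)
                    tries += 1
                    if is_correct(q, response):
                        collected[q].append(response)
            return collected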

    See Figure 1 (Right) for examples of DART-Math-Uniform by DARS-Uniform and DART-Math-Hard by DARS-Prop2Diff.

    Citation

    If you find our data, model or code useful for your work, please kindly cite our paper:

    @article{tong2024dartmath,
        title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
        author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
        year={2024},
        eprint={2407.13690},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2407.13690},
    }

  3. MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content

    • researchdata.tuwien.ac.at
    Updated Jun 23, 2025
    Cite
    Patrick Styll; Leonardo Campillos-Llanos; Jorge Fernåndez-García; Isabel Segura-Bédmar (2025). MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content [Dataset]. http://doi.org/10.20350/digitalcsic/17276
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    TU Wien
    Authors
    Patrick Styll; Leonardo Campillos-Llanos; Jorge Fernåndez-García; Isabel Segura-Bédmar
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Jun 11, 2025
    Description

    Dataset for MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content

    This dataset was created by gathering human-authored corpora from several public health sites and generating additional data with three different LLMs: GPT-4o, Mistral-7B and Llama-3.1. We included biomedical-domain texts in English, Spanish, German and French. The current version comprises 50% AI-generated and 50% human-written texts.

    The following are the data we used:

    • Cochrane Library: This is a database of meta-analyses and systematic reviews of updated results of clinical studies. We used abstracts of systematic reviews in all four languages.
    • European Clinical Trials (EUCT): The EU's public register of clinical trial information. We downloaded data from clinical trial protocols and eligibility criteria, restricted to data published from January 2025 to date. The goal was to gather data that might not have been used to train the LLMs in our experiments.
    • European Medicines Agency (EMA): This agency supervises and evaluates pharmaceutical products in the European Union (EU). We downloaded parallel data from public assessment reports (EPARs) for 12 new medicinal products, again restricted to data published from January 2025 to date, so that they were unlikely to have been used to train the LLMs in our experiments.
    • European Food Safety Authority (EFSA): This website provides a comprehensive range of data about food consumption and chemical/biological monitoring. We chose only the topics we deemed necessary for our goals, 51 topics in total. Processing: we manually split articles with a word count above 1,350 and manually verified their correctness and alignment across all languages.
    • European Vaccination Information Portal (EVIP): This portal provides up-to-date information on vaccines and vaccination. The factsheets are available in all four languages and consist of 20 texts each.
    • Immunize: Immunize.org (formerly the Immunization Action Coalition) is a U.S.-based organization dedicated to providing comprehensive immunization resources for healthcare professionals and the public. Its Vaccine Information Statements (VISs) have been translated into several languages, though not every language has every VIS. They are distributed as PDFs: 25 each in Spanish, French and English, but only 21 in German. Only PDFs available in all languages were used.
    • Migration und Gesundheit - German Ministry of Health (BFG): This portal provides multilingual health information tailored for migrants and refugees. Gesundheit fĂŒr alle ("Health for All") is a PDF guide to the German healthcare system, available in Spanish, English and German. Processing: two topics shorter than 100 words were each merged with the following topic to preserve context.
    • Orphadata (INSERM): a comprehensive knowledge base about rare diseases and orphan drugs, in reusable, high-quality formats, released in 12 official EU languages. We gathered definitions, signs and symptoms, and phenotypes for 4,389 rare diseases in English, German, Spanish and French. Processing: since each definition has roughly the same size and format, we grouped five definitions together to make the text per topic longer.
    • PubMed (National Library of Medicine): we downloaded abstracts available in English, Spanish, French and German.
    • Wikipedia: a free, web-based, collaborative multilingual encyclopedia; we selected (bio)medical content available in English, German, Spanish and French. To ensure the texts were not automatically generated, we only used articles dating from before the release of ChatGPT, i.e. before 30 November 2022. Processing: some data cleaning was necessary; we also removed all topics with fewer than 5 words, and split those with more than 9 sentences into equally long parts. From these split-up files, we kept only parts containing a minimum of 100 words, and took only contents or topics that exist in all four languages (a sketch of this filtering appears after this list).
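
    A hedged sketch of the length-based filtering and splitting described for Wikipedia above (the sentence segmentation on '. ' and the two-way split are simplifying assumptions; the card does not specify the exact procedure):

        def split_and_filter(topics: dict, min_words: int = 100, max_sentences: int = 9) -> dict:
            # Drop near-empty topics, split overly long ones, and keep only
            # parts that still meet the minimum word count.
            kept = {}
            for title, text in topics.items():
                if len(text.split()) < 5:            # remove topics with < 5 words
                    continue
                sentences = [s for s in text.split(". ") if s]
                if len(sentences) > max_sentences:   # split into two equally long parts
                    mid = len(sentences) // 2
                    parts = [". ".join(sentences[:mid]), ". ".join(sentences[mid:])]
                else:
                    parts = [text]
                parts = [p for p in parts if len(p.split()) >= min_words]
                if parts:
                    kept[title] = parts
            return kept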

    Description of methods used for collection/generation of data

    The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge FernĂĄndez-GarcĂ­a, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content".

    Methods for processing the data

    • Web-scraping of data from HTML content and PDF files available on the source websites.
    • Postprocessing and cleaning of data (e.g., removal of redundant white spaces or line breaks), and homogenization of text length.
    • Generation of corresponding contents by means of generative AI, using three large language models: GPT-4o, Mistral-7B and Llama-3.1.
    • Formatting of contents into JSON.
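
    A minimal illustration of the whitespace-cleanup step named above (the regex is an assumption, not the authors' exact pipeline):

        import re

        def clean_text(raw: str) -> str:
            # Collapse redundant white space and line breaks into single
            # spaces, one of the postprocessing steps listed above.
            return re.sub(r"\s+", " ", raw).strip()

        print(clean_text("Cochrane  abstract\n\nwith   redundant spacing."))
        # -> "Cochrane abstract with redundant spacing."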

    Files

    JSON files: these are separated into TRAIN and TEST. Each file has a hash for each text, and each hash contains the following fields (an example record appears after this list):

    • text: the textual content.
    • data_source: the source repository of the text.
    • filename: the name of the original file from which the data were sourced.
    • source: label indicating if it is a human-written text (HUMAN) or the LLM used to generate the text ("gpt4o", "mistral" or "llama").
    • language: The language code of the text: German ("de"), English ("en"), Spanish ("es") or French ("fr").
    • target: a binary label to code if the text is written by humans ("0") or AI ("1").
    • ratio: The proportion of the text that was created with AI: "0.5" for AI-generated texts, and "null" for human texts.
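
    Following the field list above, a record might look like the sketch below (the values and the TRAIN file name are illustrative assumptions):

        import json

        # Invented record matching the documented schema.
        record = {
            "text": "Systematic reviews assess the evidence for an intervention...",
            "data_source": "Cochrane Library",
            "filename": "cochrane_abstract_0001.txt",   # hypothetical name
            "source": "gpt4o",   # or "HUMAN", "mistral", "llama"
            "language": "en",    # "de", "en", "es" or "fr"
            "target": "1",       # "0" human-written, "1" AI-generated
            "ratio": "0.5",      # "null" for human texts
        }

        # Loading sketch; the file name "TRAIN.json" is assumed.
        with open("TRAIN.json", encoding="utf-8") as f:
            train = json.load(f)            # dict keyed by per-text hashes
        ai_texts = [v for v in train.values() if v["target"] == "1"]
        print(len(ai_texts), "AI-generated texts")
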
  4. OpenOrca

    • huggingface.co
    • opendatalab.com
    Updated Jun 29, 2023
    + more versions
    Cite
    OpenOrca (2023). OpenOrca [Dataset]. https://huggingface.co/datasets/Open-Orca/OpenOrca
    Explore at: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 29, 2023
    Dataset authored and provided by
    OpenOrca
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🐋 The OpenOrca Dataset! 🐋

    We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

      Official Models

      Mistral-7B-OpenOrca

    Our latest model, the first 7B to score better overall than all 
 See the full description on the dataset page: https://huggingface.co/datasets/Open-Orca/OpenOrca.

  5. MULTITuDEv3

    • zenodo.org
    Updated May 26, 2025
    Cite
    Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba (2025). MULTITuDEv3 [Dataset]. http://doi.org/10.5281/zenodo.15519413
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MULTITuDEv3 is a benchmark dataset for multilingual machine-generated text detection, originally described in an EMNLP 2023 conference paper. It initially consisted of 7,992 human-written news texts in 11 languages, subsampled from MassiveSumm, accompanied by 66,089 texts generated by 8 large language models prompted with the headlines of the news articles (see MULTITuDEv1). The creation process and scripts for replication/extension are located in a GitHub repository. The dataset was further extended in MULTITuDEv2 with texts obfuscated by 10 authorship obfuscation methods, described in an EMNLP 2024 Findings conference paper. This version covers 21 languages (instead of the original 11), with mostly equal coverage in the training set, and was introduced in an ACL 2025 conference paper for out-of-domain evaluation of detectors trained on social-media texts.

    If you use this dataset in any publication, project, tool or in any other form, please cite the paper.

    Fields

    The dataset has the following fields (a brief usage sketch follows this list):

    • 'text' - a text sample,
    • 'label' - 0 for human-written text, 1 for machine-generated text,
    • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
    • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
    • 'language' - the ISO 639-1 language code identifying the language of the given text,
    • 'length' - word count of the given text,
    • 'source' - a string identifying the source dataset / news medium of the given text
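
    A quick sketch of slicing the dataset by these fields (the file name and JSON layout are assumptions; the Zenodo record may ship the data in a different container):

        import json

        with open("multitude_v3.json", encoding="utf-8") as f:   # assumed file name
            rows = json.load(f)   # assumed: a list of dicts with the fields above

        # English test-split samples generated by Mistral-7B-Instruct-v0.2:
        subset = [r for r in rows
                  if r["split"] == "test"
                  and r["language"] == "en"
                  and r["multi_label"] == "Mistral-7B-Instruct-v0.2"]
        print(len(subset), "samples")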

    Statistics (the number of samples)

    Splits:

    • train - 156240
    • test - 50090
    • train+test - 206330

    Binary labels:

    • 0 - 25945
    • 1 - 180385

    Multiclass labels:

    • human - 25945
    • aya-101 - 25948
    • Mistral-7B-Instruct-v0.2 - 25937
    • gpt-3.5-turbo-0125 - 25935
    • v5-Eagle-7B-HF - 25892
    • vicuna-13b - 25876
    • opt-iml-max-30b - 25568
    • Llama-2-70b-chat-hf - 25229

    Languages:

      Language        | train | test
      Arabic          | 7975  | 2392
      Bulgarian       | 7954  | 2386
      Catalan         | 2894  | 2389
      Chinese         | 7926  | 2383
      Croatian        | 7951  | 2384
      Czech           | 7962  | 2389
      Dutch           | 7958  | 2386
      English         | 7954  | 2384
      German          | 7951  | 2388
      Greek           | 7944  | 2384
      Hungarian       | 7964  | 2385
      Irish           | 2333  | 2381
      Polish          | 7946  | 2383
      Portuguese      | 7956  | 2388
      Romanian        | 7949  | 2386
      Russian         | 7945  | 2382
      Scottish Gaelic | 7899  | 2377
      Slovak          | 7946  | 2385
      Slovenian       | 7947  | 2386
      Spanish         | 7947  | 2387
      Ukrainian       | 7939  | 2385
