100+ datasets found
  1. Lucie-Training-Dataset

    • huggingface.co
    Cite
    OpenLLM France, Lucie-Training-Dataset [Dataset]. https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset
    Explore at:
    Dataset authored and provided by
    OpenLLM France
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Lucie Training Dataset Card

    The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish and Italian culled from a variety of sources including: web data, video subtitles, academic papers, digital books, newspapers, and magazines, some of which were processed by Optical Character Recognition (OCR). It also contains samples of diverse programming languages. The Lucie Training Dataset was used to pretrain Lucie-7B, a foundation LLM with strong… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset.
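
    For readers who want to inspect the corpus programmatically, here is a minimal, hedged sketch using the Hugging Face datasets library; the configuration and split names are assumptions, so check the dataset page if loading fails:

        # Hedged sketch: stream a few examples from the Lucie Training Dataset.
        # Streaming avoids downloading the full corpus; the split name is an assumption.
        from datasets import load_dataset

        dataset = load_dataset(
            "OpenLLM-France/Lucie-Training-Dataset",
            split="train",
            streaming=True,
        )

        for i, example in enumerate(dataset):
            print(example)  # inspect the available fields (text, source, language, ...)
            if i >= 2:
                break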

  2. amazon-review

    • huggingface.co
    Updated Oct 4, 2024
    + more versions
    Cite
    amazon-review [Dataset]. https://huggingface.co/datasets/tppllm/amazon-review
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    TPP-LLM
    License

    Other (see https://choosealicense.com/licenses/other/)

    Description

    Amazon Review Dataset

    This dataset contains Amazon reviews from January 1, 2018, to June 30, 2018. It includes 2,245 sequences with 127,054 events across 18 category types. The original data is available at Amazon Review Data with citation information provided on the page. The detailed data preprocessing steps used to create this dataset can be found in the TPP-LLM paper and TPP-LLM-Embedding paper. If you find this dataset useful, we kindly invite you to cite the following… See the full description on the dataset page: https://huggingface.co/datasets/tppllm/amazon-review.

  3. LLM Coding Leaderboards

    • kaggle.com
    zip
    Updated Apr 14, 2024
    Cite
    Chris Gorgolewski (2024). LLM Coding Leaderboards [Dataset]. https://www.kaggle.com/datasets/chrisfilo/llm-coding-leaderboards/suggestions
    Explore at:
    Available download formats: zip (7072 bytes)
    Dataset updated
    Apr 14, 2024
    Authors
    Chris Gorgolewski
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Chris Gorgolewski

    Released under Apache 2.0

    Contents

  4. LLM-SE Python Wheel

    • kaggle.com
    zip
    Updated Oct 7, 2023
    Cite
    Ranchantan (2023). LLM-SE Python Wheel [Dataset]. https://www.kaggle.com/datasets/ranchantan/llm-se-python-wheel/suggestions?status=pending&yourSuggestions=true
    Explore at:
    Available download formats: zip (146397 bytes)
    Dataset updated
    Oct 7, 2023
    Authors
    Ranchantan
    License

    CC0: Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Dataset

    This dataset was created by Ranchantan

    Released under CC0: Public Domain

    Contents

  5. llm-human-large-data

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Cite
    Hemanthh Velliyangirie (2024). llm-human-large-data [Dataset]. https://www.kaggle.com/datasets/hemanthhvv/llm-human-large-data/suggestions
    Explore at:
    Available download formats: zip (153243329 bytes)
    Dataset updated
    Oct 15, 2024
    Authors
    Hemanthh Velliyangirie
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Hemanthh Velliyangirie

    Released under Apache 2.0

    Contents

  6. Bahasa Open Ended Classification Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Bahasa Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/bahasa-open-ended-classification-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI Data License Agreement: https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Bahasa Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.

    Dataset Content: This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains the input text to be classified and may also include a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both prompts and completions are in the Bahasa language. Because the dataset is open-ended, no answer options are given in the prompt from which to choose the correct classification category.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Bahasa speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Prompt Diversity: To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length, from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short-phrase, and single-sentence responses. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Bahasa Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence (a sample record is sketched at the end of this entry).

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Bahasa version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Bahasa Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
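
    To make the annotation schema above concrete, here is a hedged sketch of what a single JSON record could look like; the key names and the example values are illustrative assumptions and may not match the delivered files exactly:

        import json

        # Hedged sketch of one annotated record; field names and values are assumptions.
        record = {
            "id": "bah-cls-000123",
            "prompt": "Klasifikasikan sentimen ulasan berikut: 'Produk ini luar biasa!'",
            "prompt_type": "instruction",
            "prompt_length": "short",
            "prompt_complexity": "easy",
            "domain": "e-commerce",
            "response": "Positif",
            "response_type": "single-word",
            "rich_text": False,
        }
        print(json.dumps(record, ensure_ascii=False, indent=2))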

  7. Dataset for QoS-aware LLM Routing Experiment

    • ieee-dataport.org
    Updated Mar 18, 2025
    Cite
    Jin Yang (2025). Dataset for QoS-aware LLM Routing Experiment [Dataset]. http://doi.org/10.21227/nrwq-0y10
    Explore at:
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    IEEE Dataport
    Authors
    Jin Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for QoS-aware LLM Routing Experiment.

    Paper abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns. Therefore, multiple LLMs are usually deployed at the network edge to boost real-time responsiveness and protect data privacy, particularly for many emerging smart mobile and IoT applications. Given the varying response quality and latency of LLM services, a critical issue is how to route user requests from mobile and IoT devices to an appropriate LLM service (i.e., edge LLM expert) to ensure acceptable quality-of-service (QoS). Existing routing algorithms fail to simultaneously address the heterogeneity of LLM services, the interference among requests, and the dynamic workloads necessary for maintaining long-term stable QoS. To meet these challenges, in this paper we propose a novel deep reinforcement learning (DRL)-based QoS-aware LLM routing framework for sustained high-quality LLM services. Due to the dynamic nature of the global state, we propose a dynamic state abstraction technique to compactly represent global state features with a heterogeneous graph attention network (HAN). Additionally, we introduce an action impact estimator and a tailored reward function to guide the DRL agent in maximizing QoS and preventing latency violations. Extensive experiments on both Poisson and real-world workloads demonstrate that our proposed algorithm significantly improves average QoS and computing resource efficiency compared to existing baselines.
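
    For intuition only, the sketch below illustrates the kind of QoS-versus-latency trade-off such a router could optimize; the weighting, the latency deadline, and the example numbers are assumptions, not the reward function defined in the paper:

        # Hedged sketch of a QoS-aware routing reward: reward response quality,
        # charge for latency, and add a penalty when a latency deadline is violated.
        # All constants below are illustrative assumptions.
        def routing_reward(quality: float, latency_s: float,
                           deadline_s: float = 2.0,
                           latency_weight: float = 0.1,
                           violation_penalty: float = 1.0) -> float:
            reward = quality - latency_weight * latency_s
            if latency_s > deadline_s:  # discourage deadline violations
                reward -= violation_penalty
            return reward

        # A high-quality but slow expert vs. a faster, slightly weaker one.
        print(routing_reward(quality=0.9, latency_s=3.0))  # 0.9 - 0.3 - 1.0 = -0.4 (deadline violated)
        print(routing_reward(quality=0.8, latency_s=1.0))  # 0.8 - 0.1 = 0.7 (no penalty)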

  8. FileMarket |AI & ML Training Data from Sotheby's International Realty | Real...

    • datarade.ai
    Updated Aug 30, 2024
    + more versions
    Cite
    FileMarket (2024). FileMarket |AI & ML Training Data from Sotheby's International Realty | Real Estate Dataset for AI Agents | LLM | ML | DL Training Data [Dataset]. https://datarade.ai/data-products/filemarket-ai-ml-training-data-from-sotheby-s-internationa-filemarket
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Virgin Islands (British), Togo, Ukraine, Palestine, Sint Maarten (Dutch part), Mali, Montenegro, Bolivia (Plurinational State of), United Republic of, Ethiopia
    Description

    The Sotheby's International Realty dataset provides a premium collection of real estate data, ideal for training AI models and enhancing various business operations in the luxury real estate market. Our data is carefully curated and prepared to ensure seamless integration with your AI systems, allowing you to innovate and optimize your business processes with minimal effort. This dataset is versatile and suitable for small boutique agencies, mid-sized firms, and large real estate enterprises.

    Key features include:

    Custom Delivery Options: Data can be delivered through Rest-API, Websockets, tRPC/gRPC, or other preferred methods, ensuring smooth integration with your AI infrastructure.

    Vectorized Data: Choose from multiple embedding models (LLama, ChatGPT, etc.) and vector databases (Chroma, FAISS, QdrantVectorStore) for optimal AI model performance and vectorized data processing (a minimal example is sketched below).

    Comprehensive Data Coverage: Includes detailed property listings, luxury market trends, customer engagement data, and agent performance metrics, providing a robust foundation for AI-driven analytics.

    Ease of Integration: Our dataset is designed for easy integration with existing AI systems, providing the flexibility to create AI-driven analytics, notifications, and other business applications with minimal hassle.

    Additional Services: Beyond data provision, we offer AI agent development and integration services, helping you seamlessly incorporate AI into your business workflows.

    With this dataset, you can enhance property valuation models, optimize customer engagement strategies, and perform advanced market analysis using AI-driven insights. This dataset is perfect for training AI models that require high-quality, structured data, helping luxury real estate businesses stay competitive in a dynamic market.
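
    As a hedged illustration of the vectorization workflow mentioned under "Vectorized Data" above, the sketch below embeds a couple of made-up listing descriptions with sentence-transformers and indexes them in FAISS; the model name and the sample listings are assumptions, not part of the dataset:

        # Hedged sketch: embed property listings and index them for similarity search.
        import faiss
        from sentence_transformers import SentenceTransformer

        listings = [
            "4-bedroom waterfront villa with private dock",
            "Downtown penthouse with panoramic skyline views",
        ]

        model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model would do
        embeddings = model.encode(listings, convert_to_numpy=True)

        index = faiss.IndexFlatL2(embeddings.shape[1])   # exact L2 search over the vectors
        index.add(embeddings)

        query = model.encode(["luxury home near the water"], convert_to_numpy=True)
        distances, ids = index.search(query, 1)
        print(listings[ids[0][0]], distances[0][0])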

  9. Tamil Brainstorming Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Tamil Brainstorming Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/tamil-brainstorming-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI Data License Agreement: https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Tamil Brainstorming Prompt-Response Dataset, a meticulously curated collection of 2000 prompt and response pairs. This dataset is a valuable resource for enhancing the creative and generative abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content: This brainstorming dataset comprises a diverse set of prompts and responses where the prompt contains an instruction, context, constraints, and restrictions, while the completion contains the most accurate response list for the given prompt. Both prompts and completions are in the Tamil language.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Tamil speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity: To ensure diversity, our brainstorming dataset features prompts of varying complexity levels, ranging from easy to medium and hard. The prompts also vary in length, including short, medium, and long prompts, providing a comprehensive range. Furthermore, the dataset includes prompts with constraints and persona restrictions, making it exceptionally valuable for LLM training.

    Response Formats: Our dataset accommodates diverse learning experiences, offering responses across different domains depending on the prompt. For these brainstorming prompts, responses are generally provided in list format. These responses encompass text strings, numerical values, and dates, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Tamil Brainstorming Prompt Completion Dataset is available in both JSON and CSV formats. It includes comprehensive annotation details, including a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and the presence of rich text.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Tamil version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. We continuously work to expand this dataset, ensuring its ongoing growth and relevance. Additionally, FutureBeeAI offers the flexibility to curate custom brainstorming prompt and completion datasets tailored to specific requirements, providing you with customization options.

    License: This dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Tamil Brainstorming Prompt-Completion Dataset to enhance the creative and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  10. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
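
    For readers who want to reproduce the flavor of these fidelity checks, here is a hedged sketch using SciPy and statsmodels on simulated stand-in arrays; the numbers are random placeholders, not the VitalDB or GPT-4o data:

        # Hedged sketch of the fidelity tests described above, on simulated stand-in data.
        import numpy as np
        from scipy import stats
        from statsmodels.stats.proportion import proportions_ztest

        rng = np.random.default_rng(0)

        # Continuous parameter (e.g. age): two-sample t-test, real vs. synthetic.
        real_age = rng.normal(58, 12, size=500)
        synthetic_age = rng.normal(59, 12, size=500)
        t_stat, p_cont = stats.ttest_ind(real_age, synthetic_age)
        print(f"t-test p-value: {p_cont:.3f}")

        # Binary parameter (e.g. sex): two-sample proportion z-test on counts.
        z_stat, p_bin = proportions_ztest(count=[260, 255], nobs=[500, 500])
        print(f"proportion-test p-value: {p_bin:.3f}")

        # 95% confidence-interval overlap check for the continuous parameter.
        def ci95(x):
            m, se = x.mean(), stats.sem(x)
            return m - 1.96 * se, m + 1.96 * se

        (lo1, hi1), (lo2, hi2) = ci95(real_age), ci95(synthetic_age)
        print("95% CIs overlap:", max(lo1, lo2) <= min(hi1, hi2))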

  11. LlamaLens-Hindi

    • huggingface.co
    Updated Nov 8, 2024
    + more versions
    Cite
    LlamaLens-Hindi [Dataset]. https://huggingface.co/datasets/QCRI/LlamaLens-Hindi
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2024
    Dataset authored and provided by
    Arabic Language Technologies, Qatar Computing Research Institute
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    LlamaLens: Specialized Multilingual LLM Dataset

      Overview
    

    LlamaLens is a specialized multilingual LLM designed for analyzing news and social media content. It focuses on 18 NLP tasks, leveraging 52 datasets across Arabic, English, and Hindi.

      LlamaLens
    

    This repo includes scripts needed to run our full pipeline, including data preprocessing and sampling, instruction dataset creation, model fine-tuning, inference and evaluation.

      Features… See the full description on the dataset page: https://huggingface.co/datasets/QCRI/LlamaLens-Hindi.
    
  12. Unlocking LLM Insights: A Dataset for Automatic Model Card Generation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 4, 2024
    Cite
    Singh, Shruti (2024). Unlocking LLM Insights: A Dataset for Automatic Model Card Generation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11466896
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Singh, Shruti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Language models (LMs) are no longer restricted to the ML community, and instruction-following LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a widespread practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 LMs that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. We experiment with three configurations: zero-shot generation, retrieval-augmented generation, and fine-tuning on our dataset. The fine-tuned Llama 3 model shows an improvement of 7 points over the retrieval-augmented generation setup. This indicates that our dataset can be used to train models to automatically generate model cards from paper text and reduce the human effort in the model card curation process.
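
    For illustration, one question-answer record in such a dataset might look like the sketch below; the field names and values are assumptions for readability, not the actual schema of the Zenodo release:

        import json

        # Hedged sketch of a model-card QA pair; keys and values are assumptions.
        example_record = {
            "model": "ExampleLM-7B",  # hypothetical model name
            "question": "How many tokens was the model pretrained on?",
            "answer": "The paper reports pretraining on roughly 1 trillion tokens.",
            "aspect": "training configuration",  # e.g. datasets, biases, architecture
            "source": "extracted by annotators from the original paper",
        }
        print(json.dumps(example_record, indent=2))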

  13. sat-questions-and-answers-for-llm

    • huggingface.co
    Updated Oct 20, 2023
    + more versions
    Cite
    Training Data (2023). sat-questions-and-answers-for-llm [Dataset]. https://huggingface.co/datasets/TrainingDataPro/sat-questions-and-answers-for-llm
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 20, 2023
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    SAT History Questions and Answers 🏛️ - Text Classification Dataset

    This dataset contains a collection of questions and answers for the SAT Subject Test in World History and US History. Each question is accompanied by corresponding answer options and the correct response. The dataset includes questions from various topics, time periods, and regions in both World History and US History.

      💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the… See the full description on the dataset page: https://huggingface.co/datasets/TrainingDataPro/sat-questions-and-answers-for-llm.
    
  14. dataset-preferences-llm-course-full-dataset

    • huggingface.co
    Updated May 31, 2024
    + more versions
    Cite
    dataset-preferences-llm-course-full-dataset [Dataset]. https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 31, 2024
    Authors
    Daniel van Strien
    Description

    Dataset Card for dataset-preferences-llm-course-full-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

        distilabel pipeline run --config "https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset.

  15. CriticBench Dataset

    • paperswithcode.com
    Updated Jan 23, 2025
    Cite
    CriticBench Dataset [Dataset]. https://paperswithcode.com/dataset/criticbench
    Explore at:
    Dataset updated
    Jan 23, 2025
    Authors
    Zicheng Lin; Zhibin Gou; Tian Liang; Ruilin Luo; Haowei Liu; Yujiu Yang
    Description

    CriticBench is a comprehensive benchmark designed to assess the abilities of Large Language Models (LLMs) to critique and rectify their reasoning across various tasks. It encompasses five reasoning domains:

    Mathematical, Commonsense, Symbolic, Coding, and Algorithmic

    CriticBench compiles 15 datasets and incorporates responses from three LLM families. By utilizing CriticBench, researchers evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning (referred to as GQC reasoning). Notable findings include:

    A linear relationship in GQC capabilities, with critique-focused training significantly enhancing performance.
    Task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction.
    GQC knowledge inconsistencies that decrease as model size increases.
    An intriguing inter-model critiquing dynamic, where stronger models excel at critiquing weaker ones, while weaker models surprisingly surpass stronger ones in self-critique.

    Reference: CriticBench: Benchmarking LLMs for Critique-Correct Reasoning. arXiv: https://arxiv.org/abs/2402.14809; OpenReview: https://openreview.net/forum?id=sc5i7q6DQO; DOI: https://doi.org/10.48550/arXiv.2402.14809.

  16. Russian Brainstorming Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    Russian Brainstorming Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/russian-brainstorming-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI Data License Agreement: https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Russian Brainstorming Prompt-Response Dataset, a meticulously curated collection of 2000 prompt and response pairs. This dataset is a valuable resource for enhancing the creative and generative abilities of Language Models (LMs), a critical aspect in advancing generative AI.

    Dataset Content: This brainstorming dataset comprises a diverse set of prompts and responses where the prompt contains an instruction, context, constraints, and restrictions, while the completion contains the most accurate response list for the given prompt. Both prompts and completions are in the Russian language.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Russian speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.

    Prompt Diversity: To ensure diversity, our brainstorming dataset features prompts of varying complexity levels, ranging from easy to medium and hard. The prompts also vary in length, including short, medium, and long prompts, providing a comprehensive range. Furthermore, the dataset includes prompts with constraints and persona restrictions, making it exceptionally valuable for LLM training.

    Response Formats: Our dataset accommodates diverse learning experiences, offering responses across different domains depending on the prompt. For these brainstorming prompts, responses are generally provided in list format. These responses encompass text strings, numerical values, and dates, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Russian Brainstorming Prompt Completion Dataset is available in both JSON and CSV formats. It includes comprehensive annotation details, including a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, and the presence of rich text.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Russian version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. We continuously work to expand this dataset, ensuring its ongoing growth and relevance. Additionally, FutureBeeAI offers the flexibility to curate custom brainstorming prompt and completion datasets tailored to specific requirements, providing you with customization options.

    License: This dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Russian Brainstorming Prompt-Completion Dataset to enhance the creative and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  17. Japanese Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI Data License Agreement: https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.

    Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. Each question is given a context paragraph from which the answer can be derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese speakers, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence (a sample record is sketched at the end of this entry).

    Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Japanese version is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
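
    To make the annotation details above concrete, here is a hedged sketch of a single record; the key names and the example content are illustrative assumptions only:

        import json

        # Hedged sketch of one closed-ended QA record; field names and values are assumptions.
        record = {
            "id": "jp-qa-004321",
            "context": "富士山は日本で最も高い山で、標高は3,776メートルです。",
            "context_reference": "https://example.org/fuji",  # hypothetical source link
            "question": "富士山の標高は何メートルですか？",
            "question_type": "direct",
            "question_complexity": "easy",
            "question_category": "geography",
            "domain": "general knowledge",
            "prompt_type": "instruction",
            "answer": "3,776メートルです。",
            "answer_type": "short phrase",
            "rich_text": False,
        }
        print(json.dumps(record, ensure_ascii=False, indent=2))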

  18. BOLA Dataset for Karate LLM Project

    • kaggle.com
    zip
    Updated Mar 17, 2024
    Cite
    Emil Marian (2024). BOLA Dataset for Karate LLM Project [Dataset]. https://www.kaggle.com/datasets/emilmarian/bola-dataset-for-karate-llm-project
    Explore at:
    Available download formats: zip (20363 bytes)
    Dataset updated
    Mar 17, 2024
    Authors
    Emil Marian
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Emil Marian

    Released under CC BY-SA 4.0

    Contents

  19. GPTFuzzer Dataset

    • paperswithcode.com
    Updated Jan 21, 2025
    Cite
    Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing (2025). GPTFuzzer Dataset [Dataset]. https://paperswithcode.com/dataset/gptfuzzer
    Explore at:
    Dataset updated
    Jan 21, 2025
    Authors
    Jiahao Yu; Xingwei Lin; Zheng Yu; Xinyu Xing
    Description

    GPTFuzzer is a project that explores red teaming of large language models (LLMs) using auto-generated jailbreak prompts.

    Project Overview: GPTFuzzer aims to assess the security and robustness of LLMs by crafting prompts that can potentially lead to harmful or unintended behavior.

    The project targets chat-oriented models such as ChatGPT, Llama-2, and Vicuna.

    Datasets:

    The datasets used in GPTFuzzer include:

    Harmful Questions: Sampled from public datasets like llm-jailbreak-study and hh-rlhf.
    Human-Written Templates: Collected from llm-jailbreak-study.
    Responses: Gathered by querying models like Vicuna-7B, ChatGPT, and Llama-2-7B-chat.

    Models:

    The judgment model is a finetuned RoBERTa-large model. The training code and data are available in the repository.

    During fuzzing experiments, the model is automatically downloaded and cached.
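
    As a hedged sketch of how such a judgment classifier could be applied, the snippet below scores one response with the transformers text-classification pipeline; the checkpoint path and the label semantics are assumptions, so refer to the GPTFuzz repository for the actual model and loading code:

        # Hedged sketch: score an LLM response with a fine-tuned RoBERTa judgment model.
        # The model path and label meanings are assumptions.
        from transformers import pipeline

        judge = pipeline(
            "text-classification",
            model="path/to/finetuned-roberta-judgment-model",  # hypothetical checkpoint
        )

        response = "I'm sorry, but I can't help with that request."
        result = judge(response)[0]
        print(result["label"], result["score"])  # e.g. a refusal vs. jailbroken label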

    Updates:

    The project has received recognition and awards at conferences like Geekcon 2023. The team continues to improve the codebase and aims to build a general black-box fuzzing framework for LLMs.

    References: Official GPTFuzz repository: https://github.com/sherdencooper/GPTFuzz. Paper: GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts, https://arxiv.org/pdf/2309.10253.pdf.

  20. llm-science-exam-model-with-all-data

    • kaggle.com
    zip
    Updated Oct 6, 2023
    Cite
    Sandiago (2023). llm-science-exam-model-with-all-data [Dataset]. https://www.kaggle.com/datasets/sandiago21/llm-science-exam-model-with-all-data/suggestions
    Explore at:
    Available download formats: zip (1935521544 bytes)
    Dataset updated
    Oct 6, 2023
    Authors
    Sandiago
    Description

    Dataset

    This dataset was created by Sandiago

    Contents
