10 datasets found

h
chatbot_arena_conversations
huggingface.co
Updated Jul 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Large Model Systems Organization (2023). chatbot_arena_conversations [Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations
Explore at:
Dataset updated
Jul 18, 2023
Dataset authored and provided by
Large Model Systems Organization
License
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Description
Chatbot Arena Conversations Dataset

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.
h
llm-jp-chatbot-arena-conversations
huggingface.co
Updated Jul 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LLM-jp (2025). llm-jp-chatbot-arena-conversations [Dataset]. https://huggingface.co/datasets/llm-jp/llm-jp-chatbot-arena-conversations
Explore at:
Dataset updated
Jul 15, 2025
Dataset authored and provided by
LLM-jp
License
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Description
LLM-jp Chatbot Arena Conversations Dataset

This dataset contains approximately 1,000 conversations with pairwise human preferences, most of which are in Japanese. The data was collected during the trial phase of the LLM-jp Chatbot Arena (January–February 2025), where users compared responses from two different models in a head-to-head format. Each sample includes a question ID, the names of the two models, their conversation transcripts, the user's vote, an anonymized user ID, a… See the full description on the dataset page: https://huggingface.co/datasets/llm-jp/llm-jp-chatbot-arena-conversations.
h
lmsys-chat-1m
huggingface.co
Updated Jul 2, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jitendra Chauhan (2025). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/jc-detoxio/lmsys-chat-1m
Explore at:
Dataset updated
Jul 2, 2025
Authors
Jitendra Chauhan
Description
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/jc-detoxio/lmsys-chat-1m.
h
lmsys-arena-human-preference-winner-43k-unfiltered
huggingface.co
Updated Sep 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lesserfield (2021). lmsys-arena-human-preference-winner-43k-unfiltered [Dataset]. https://huggingface.co/datasets/lesserfield/lmsys-arena-human-preference-winner-43k-unfiltered
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 15, 2021
Authors
Lesserfield
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
lmsys-arena-human-preference-winner-43k-unfiltered

This repository contains a dataset derived from the lmsys/lmsys-arena-human-preference-55k dataset, which is licensed under the Apache 2.0 License.

Dataset Description

The lmsys-arena-human-preference-winner-43k-unfiltered dataset is a collection of 43,000 samples, each containing an instruction (prompt) and an output (winning response) from real-world user and LLM conversations. The dataset is derived from the original… See the full description on the dataset page: https://huggingface.co/datasets/lesserfield/lmsys-arena-human-preference-winner-43k-unfiltered.
h
arena-human-preference-55k
huggingface.co
Updated Jun 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LMArena (2025). arena-human-preference-55k [Dataset]. https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2025
Dataset authored and provided by
LMArena
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset for Kaggle competition on predicting human preference on Chatbot Arena battles. The training dataset includes over 55,000 real-world user and LLM conversations and user preferences across over 70 state-of-the-art LLMs, such as GPT-4, Claude 2, Llama 2, Gemini, and Mistral models. Each sample represents a battle consisting of 2 LLMs which answer the same question, with a user label of either prefer model A, prefer model B, tie, or tie (both bad).

Citation

Please cite the… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k.
h
llm-jp-chatbot-arena-conversations-reformatted
huggingface.co
Updated May 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
kaeru39 (2025). llm-jp-chatbot-arena-conversations-reformatted [Dataset]. https://huggingface.co/datasets/ryota39/llm-jp-chatbot-arena-conversations-reformatted
Explore at:
Dataset updated
May 29, 2025
Authors
kaeru39
License
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Description
llm-jp/llm-jp-chatbot-arena-conversationsを整形したデータセットです
h
search-arena-v1-7k
huggingface.co
Updated Apr 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LMArena (2025). search-arena-v1-7k [Dataset]. https://huggingface.co/datasets/lmarena-ai/search-arena-v1-7k
Explore at:
Dataset updated
Apr 13, 2025
Dataset authored and provided by
LMArena
Description
Overview

This dataset contains 7k leaderboard conversation votes collected from Search Arena between March 18, 2025 and April 13, 2025. All entries have been redacted for PII and sensitive user information to ensure privacy. Each data point includes:

Two model responses (messages_a and messages_b) The human vote result A timestamp Full system metadata, LLM + web search trace, and post-processed metadata for controlled experiments (conv_meta)

To reproduce the leaderboard results… See the full description on the dataset page: https://huggingface.co/datasets/lmarena-ai/search-arena-v1-7k.
chatbot-arena-ja-calm2-7b-chat-experimental
huggingface.co
Updated Jan 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CyberAgent (2024). chatbot-arena-ja-calm2-7b-chat-experimental [Dataset]. https://huggingface.co/datasets/cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental
Explore at:
Dataset updated
Jan 24, 2024
Dataset provided by
サイバーエージェントhttp://cyberagent.co.jp/
Authors
CyberAgent
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for "chatbot-arena-ja-calm2-7b-chat"

Chatbot Arena Conversations JA (calm2) Dataset

Chatbot Arena Conversations JA (calm2)はこちらの論文で構築されたRLHFのための日本語Instructionデータセットです。「英語で公開されているデータセットをオープンソースのツール・モデルのみを使って日本語用に転用し、日本語LLMの学習に役立てることができるか」を検証する目的で作成しております。指示文（prompt）はlmsys/chatbot_arena_conversationsのユーザ入力（CC-BY 4.0）を和訳したものです。これはChatbot Arenaを通して人間が作成した指示文であり、CC-BY 4.0で公開されているものです。複数ターンの対話の場合は最初のユーザ入力のみを使っています（そのため、このデータセットはすべて１ターンの対話のみになっております）。… See the full description on the dataset page: https://huggingface.co/datasets/cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental.
h
preference-dissection
huggingface.co
Updated Feb 18, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SII - GAIR (2024). preference-dissection [Dataset]. https://huggingface.co/datasets/GAIR/preference-dissection
Explore at:
Dataset updated
Feb 18, 2024
Dataset authored and provided by
SII - GAIR
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Introduction

We release the annotated data used in Dissecting Human and LLM Preferences. Original Dataset - The dataset is based on lmsys/chatbot_arena_conversations, which contains 33K cleaned conversations with pairwise human preferences collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Filtering and Scenario-wise Sampling - We filter out the conversations that are not in English, with "Tie" or "Both Bad" labels, and the multi-turn… See the full description on the dataset page: https://huggingface.co/datasets/GAIR/preference-dissection.
h
twllm-data
huggingface.co
Updated Aug 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yen-Ting Lin (2025). twllm-data [Dataset]. https://huggingface.co/datasets/yentinglin/twllm-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 17, 2025
Authors
Yen-Ting Lin
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
TWLLM-Data: Open Traditional Mandarin LLM Conversations

TWLLM-Data is the first large-scale open dataset containing real user-generated conversation logs from TWLLM and TWLLM Arena, where over 80% of users are based in Taiwan. The dataset is designed to facilitate the development and evaluation of Traditional Mandarin Large Language Models (LLMs). We extend our gratitude to Professor Yun-Nung (Vivian) Chen for her guidance and advisement. Special thanks to Tzu-Han Lin, Kang-Chieh… See the full description on the dataset page: https://huggingface.co/datasets/yentinglin/twllm-data.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Large Model Systems Organization (2023). chatbot_arena_conversations [Dataset]. https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

chatbot_arena_conversations

lmsys/chatbot_arena_conversations

Explore at:

24 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Jul 18, 2023

Dataset authored and provided by

Large Model Systems Organization

License

https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/

Description

Chatbot Arena Conversations Dataset

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp. To ensure the safe release… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/chatbot_arena_conversations.

Clear search

Close search

Google apps

Main menu

chatbot_arena_conversations

llm-jp-chatbot-arena-conversations

lmsys-chat-1m

lmsys-arena-human-preference-winner-43k-unfiltered

arena-human-preference-55k

llm-jp-chatbot-arena-conversations-reformatted

search-arena-v1-7k

chatbot-arena-ja-calm2-7b-chat-experimental

preference-dissection

twllm-data

chatbot_arena_conversations

lmsys/chatbot_arena_conversations