MIT License: https://opensource.org/licenses/MIT
StrongREJECT
A novel benchmark of 313 malicious prompts for evaluating jailbreaking attacks against LLMs, aimed at exposing whether a jailbreak attack actually enables malicious actors to use LLMs for harmful tasks. Dataset link: https://github.com/alexandrasouly/strongreject/blob/main/strongreject_dataset/strongreject_dataset.csv
Citation
If you find the dataset useful, please cite the following work: @misc{souly2024strongreject, title={A StrongREJECT… See the full description on the dataset page: https://huggingface.co/datasets/walledai/StrongREJECT.
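For quick inspection, the prompts can be pulled either from the Hugging Face Hub or straight from the CSV linked above. A minimal sketch follows; the split name, column layout, and the raw-CSV URL form are assumptions rather than verified details:

```python
# Minimal sketch: loading the StrongREJECT prompts.
# The split name and column layout are assumptions; adjust after inspecting the data.
from datasets import load_dataset

ds = load_dataset("walledai/StrongREJECT", split="train")
print(len(ds))   # expected: 313 prompts
print(ds[0])     # inspect one record

# Alternatively, read the CSV directly from the GitHub repository
# (raw URL assumed from the blob link above).
import pandas as pd

url = ("https://raw.githubusercontent.com/alexandrasouly/strongreject/"
       "main/strongreject_dataset/strongreject_dataset.csv")
df = pd.read_csv(url)
print(df.columns.tolist())
```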
Lv111/StrongREJECT dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by James Mann
Machlovi/strongreject-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
NoahShen/strongreject-llama3.1-8b-inst-completions dataset hosted on Hugging Face and contributed by the HF Datasets community
🌍 strongREJECT++ Dataset
Welcome to the strongREJECT++ dataset! This dataset is a collection of translations of the original strongREJECT dataset, "a cutting-edge benchmark for evaluating jailbreaks in Large Language Models (LLMs)".
Available Languages
You can find translations provided by native speakers in the following languages:
🇺🇸 English 🇷🇺 Russian 🇺🇦 Ukrainian 🇧🇾 Belarusian 🇺🇿 Uzbek
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Large Language Models (LLMs) are susceptible to jailbreaking attacks, where carefully crafted malicious inputs bypass safety guardrails and provoke harmful responses. We introduce AutoAdv, a novel automated framework that generates adversarial prompts and assesses vulnerabilities in LLM safety mechanisms. Our approach employs an attacker LLM to create disguised malicious prompts using strategic rewriting techniques, tailored system prompts, and optimized hyperparameter settings. The core innovation is a dynamic, multiturn attack strategy that analyzes unsuccessful jailbreak attempts to iteratively develop more effective follow-up prompts. We evaluate the attack success rate (ASR) using the StrongREJECT framework across multiple interaction turns. Extensive empirical testing on state-of-the-art models, including ChatGPT, Llama, DeepSeek, Qwen, Gemma, and Mistral, reveals significant weaknesses, with AutoAdv achieving an ASR of 86% on Llama-3.1-8B. These findings indicate that current safety mechanisms remain susceptible to sophisticated multiturn attacks.
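To make the multiturn ASR metric concrete, here is a small illustrative sketch of how success can be tallied when a prompt counts as jailbroken if any turn elicits a harmful response. The judge function is a placeholder for a StrongREJECT-style grader; this is not the AutoAdv implementation itself:

```python
# Illustrative sketch of a multiturn attack success rate (ASR) tally.
# The judge callable stands in for a StrongREJECT-style evaluator.
from typing import Callable, List

def attack_success_rate(
    conversations: List[List[str]],        # model responses, one list of turns per attacked prompt
    is_jailbroken: Callable[[str], bool],  # placeholder judge
) -> float:
    """A prompt counts as a success if any turn in its conversation is judged harmful."""
    successes = sum(
        1 for turns in conversations if any(is_jailbroken(resp) for resp in turns)
    )
    return successes / len(conversations) if conversations else 0.0

# Toy usage with a keyword stand-in judge (a real evaluation would use the StrongREJECT grader):
demo = [["I can't help with that.", "Sure, here is how to..."], ["I can't help with that."]]
print(attack_success_rate(demo, lambda r: r.startswith("Sure, here is")))  # 0.5
```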
MIT License: https://opensource.org/licenses/MIT
Dataset Card for Greek Jailbreak-StrongReject
This dataset is a translated and extended version of the StrongReject benchmark. It adapts 309 harmful prompts into Greek, preserving the original behavioral categories, and adds a new column containing one of five jailbreak prompt templates for each query. The goal is to evaluate the robustness of multilingual LLMs, especially in low-resource languages like Greek, against adversarial jailbreak attacks using automated evaluation methods.… See the full description on the dataset page: https://huggingface.co/datasets/ilsp/Jailbreak-StrongReject-el.
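The construction described above (each Greek query paired with one of five jailbreak templates in an extra column) can be sketched as follows; the template texts, queries, and column names here are illustrative placeholders, not the actual ilsp/Jailbreak-StrongReject-el schema:

```python
# Illustrative sketch: pairing each harmful query with one of five jailbreak templates.
# Templates, queries, and column names are placeholders for the real dataset contents.
import itertools

templates = [f"<jailbreak template {i}>: {{query}}" for i in range(1, 6)]  # each has a {query} slot
queries = ["<Greek harmful query 1>", "<Greek harmful query 2>"]           # placeholders

rows = [
    {"query": q, "jailbreak_prompt": tmpl.format(query=q)}
    for q, tmpl in zip(queries, itertools.cycle(templates))
]
print(rows[0]["jailbreak_prompt"])
```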
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Jailbreak-PromptBank
Jailbreak-PromptBank is a curated dataset of jailbreak prompts collected from a wide range of existing datasets. It was compiled for research purposes related to prompt injection and the security of Large Language Models (LLMs).
✨ Sources Used
The dataset aggregates prompts from the following public datasets:
Jailbreak Classification, InTheWild, AdvBench, AdvSuffix, HarmBench, UltraSafety, StrongReject, JBB_Behaviors, ChatGPT-Jailbreak-Prompts, JailBreakV_28k… See the full description on the dataset page: https://huggingface.co/datasets/Lshafii/Jailbreak-PromptBank.
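A minimal sketch of how such an aggregation can be assembled with the `datasets` library; the repository IDs, split names, and column names below are assumptions and would need adjusting to each source's real schema:

```python
# Sketch: collecting prompts from several Hugging Face datasets into one bank.
# Repo IDs, split names, and prompt column names are assumptions, not verified.
from datasets import Dataset, concatenate_datasets, load_dataset

sources = {
    "walledai/StrongREJECT": "prompt",      # prompt column name assumed
    # "<other-source-repo-id>": "<its prompt column>",
}

parts = []
for repo_id, prompt_col in sources.items():
    ds = load_dataset(repo_id, split="train")  # split name assumed
    parts.append(
        Dataset.from_dict({"prompt": ds[prompt_col], "source": [repo_id] * len(ds)})
    )

prompt_bank = concatenate_datasets(parts)
print(prompt_bank)
```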