dvilasuero/jailbreak-classification-reasoning-models dataset hosted on Hugging Face and contributed by the HF Datasets community
Eval models for classification on your own data
This dataset contains the results of evaluating reasoning models on a jailbreak-classification task. It also contains the pipeline and the code to run it; you can tune the config to run different prompts over your own Hugging Face datasets.
Results
Model                   Accuracy  Total  Correct  Empty
qwq32b-classification   92.00%    100    92       1
r1-classification       91.00%    100    91       2
llama70-classification  77.00%    100    77       10
How to run it
The pipeline uses… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/jailbreak-classification-reasoning-eval.
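The full pipeline definition is on the dataset page. As a rough illustration only, here is a minimal sketch of this kind of eval loop; the prompt template, the OpenAI-compatible client, the dataset and model names, and the column names ("prompt", "type") are all assumptions for the sketch, not the actual pipeline:

```python
from datasets import load_dataset
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint is serving the reasoning model.
client = OpenAI()

# Hypothetical prompt template; the real template lives in the pipeline config.
PROMPT = (
    "Classify the following prompt as 'benign' or 'jailbreak'. "
    "Answer with exactly one word.\n\nPrompt: {text}"
)

def classify(text: str, model: str) -> str | None:
    """Ask the model for a label; return None if the answer is empty/unparsable."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    answer = (response.choices[0].message.content or "").strip().lower()
    return answer if answer in {"benign", "jailbreak"} else None

# Placeholder repo/split/column names; point these at your own HF dataset.
dataset = load_dataset("your-hf-user/your-dataset", split="test")

correct = empty = 0
for row in dataset:
    pred = classify(row["prompt"], model="qwq-32b")
    if pred is None:
        empty += 1
    elif pred == row["type"]:
        correct += 1

total = len(dataset)
print(f"Accuracy: {correct / total:.2%} (Total {total}, Correct {correct}, Empty {empty})")
```

Counting empty/unparsable answers separately, as above, is what produces the Empty column in the results table.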
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
R1-1776 Jailbreaking Examples
The R1-1776 Jailbreaking Examples dataset comprises instances where attempts were made to bypass the safety mechanisms of the R1-1776 model—a version of DeepSeek-R1 fine-tuned by Perplexity AI to eliminate specific censorship while maintaining robust reasoning capabilities. This dataset serves as a resource for analyzing vulnerabilities in language models and developing strategies to enhance their safety and reliability.
Dataset Summary… See the full description on the dataset page: https://huggingface.co/datasets/weijiejailbreak/r1-1776-jailbreak.
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multimodal Pragmatic Jailbreak on Text-to-image Models
The Multimodal Pragmatic Unsafe Prompts (MPUP) dataset is designed to assess multimodal pragmatic safety in Text-to-Image (T2I) models. It comprises two key fields: image_prompt and text_prompt.
Dataset Usage
Downloading the Data
To download the dataset, install Hugging Face Datasets and then use the following command: from datasets import load_dataset; dataset = … See the full description on the dataset page: https://huggingface.co/datasets/tongliuphysics/multimodalpragmatic.
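A completed version of that snippet might look like this (the split name is an assumption; the image_prompt and text_prompt fields are the ones described above):

```python
from datasets import load_dataset

# Repo id taken from the dataset page URL; the split name is an assumption.
dataset = load_dataset("tongliuphysics/multimodalpragmatic", split="train")

example = dataset[0]
print(example["image_prompt"])  # prompt describing the image content for the T2I model
print(example["text_prompt"])   # prompt for the visual text to render in the image
```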
Results Summary
Model     Accuracy  Total  Correct  Unparsable
qwq32     90.00%    100    90       4
r1        93.00%    100    93       3
llama70B  69.00%    100    69       18
Prediction Distribution
Model     Benign  Jailbreak  Unparsable
qwq32     44      52         4
r1        43      54         3
llama70B  50      32         18
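For reference, counts like those above can be derived from raw model outputs with a few lines of Python. This sketch assumes a hypothetical results list of (prediction, gold-label) pairs, with unparsable outputs recorded as None:

```python
from collections import Counter

# Hypothetical results: (predicted_label_or_None, gold_label) pairs.
results = [("jailbreak", "jailbreak"), ("benign", "jailbreak"), (None, "benign")]

# Prediction distribution, folding None into an "unparsable" bucket.
distribution = Counter("unparsable" if pred is None else pred for pred, _ in results)
correct = sum(pred == gold for pred, gold in results)
total = len(results)

print(f"Accuracy: {correct / total:.2%} (Total {total}, Correct {correct}, "
      f"Unparsable {distribution['unparsable']})")
print(f"Benign {distribution['benign']}, Jailbreak {distribution['jailbreak']}, "
      f"Unparsable {distribution['unparsable']}")
```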
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
WildJailbreak
Paper: WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Data: Hugging Face dataset (link below)
WildJailbreak Dataset Card
WildJailbreak is an open-source synthetic safety-training dataset with 262K vanilla (direct harmful requests) and adversarial (complex adversarial jailbreaks) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful… See the full description on the dataset page: https://huggingface.co/datasets/walledai/WildJailbreak.
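To explore the dataset, a minimal loading sketch (the split name is an assumption; inspect the actual schema rather than assuming column names, since it is documented on the dataset page):

```python
from datasets import load_dataset

# Repo id from the dataset page URL; the split name is an assumption.
dataset = load_dataset("walledai/WildJailbreak", split="train")

print(dataset)     # shows the actual columns and row count
print(dataset[0])  # e.g. one prompt-response pair with its vanilla/adversarial tag
```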