This repository contains code and figures for our paper *When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?*.
Spoiler: We found transfer was hard to obtain and only succeeded very narrowly 😬
Installation | Usage | Training New VLMs | Contributing | Citation | Contact
- (Optional) Update conda:
conda update -n base -c defaults conda -y
- Create and activate the conda environment:
conda create -n universal_vlm_jailbreak_env python=3.11 -y && conda activate universal_vlm_jailbreak_env
- Update pip:
pip install --upgrade pip
- Install PyTorch:
conda install pytorch=2.3.0 torchvision=0.18.0 torchaudio=2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
- Install Lightning:
conda install lightning=2.2.4 -c conda-forge -y
- Grab the git submodules:
git submodule update --init --recursive
- Install Prismatic and (optionally) DeepSeek-VL (currently broken):
cd submodules/prismatic-vlms && pip install -e . --config-settings editable_mode=compat && cd ../..
cd submodules/DeepSeek-VL && pip install -e . --config-settings editable_mode=compat && cd ../..
Note: The --config-settings editable_mode=compat flag is optional; it helps VS Code recognize the editable packages.
- Then follow the Prismatic installation instructions:
pip install packaging ninja && pip install flash-attn==2.5.8 --no-build-isolation
- Manually install a few additional packages:
conda install joblib pandas matplotlib seaborn black tiktoken sentencepiece anthropic termcolor -y
- Log in to W&B:
wandb login
- Log in to Hugging Face:
huggingface-cli login
- (Critical) Install the correct timm version:
pip install timm==0.9.16
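After installing, it can be worth confirming that the pinned versions above are what actually ended up in the environment. The snippet below is a sanity check we sketch here for convenience, not part of the repository; the `PINNED` dict simply restates the pins from the commands above.

```python
# Hedged sanity check (not part of the repo): verify that the pinned
# package versions from the installation steps above were installed.
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the installation commands above.
PINNED = {"torch": "2.3.0", "lightning": "2.2.4", "timm": "0.9.16"}


def check_pins(pinned):
    """Map each package to (installed_version_or_None, matches_pin)."""
    report = {}
    for pkg, want in pinned.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None
        report[pkg] = (have, have is not None and have.startswith(want))
    return report


if __name__ == "__main__":
    for pkg, (have, ok) in check_pins(PINNED).items():
        status = "OK" if ok else "MISMATCH/MISSING"
        print(f"{pkg}: installed={have} [{status}]")
```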
There are four main components to this repository:
- Optimizing image jailbreaks against sets of VLMs: optimize_jailbreak_attacks_against_vlms.py.
- Evaluating the transfer of jailbreaks to new VLMs: evaluate_jailbreak_attacks_against_vlms.py.
- Setting/sweeping hyperparameters for both: default hyperparameters are set in globals.py and can be overridden by W&B sweeps.
- Evaluating the results in notebooks.
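The hyperparameter flow above (defaults in globals.py, selectively overridden by a W&B sweep) can be sketched as follows. This is an illustrative sketch only: the dict names and keys are hypothetical, not the repository's actual identifiers.

```python
# Hedged sketch of the hyperparameter flow described above: defaults live
# in globals.py, and a W&B sweep may override any subset of them. All names
# and keys here are illustrative, not the repo's actual identifiers.


def merge_hyperparameters(defaults, sweep_overrides):
    """Return defaults with sweep-provided values taking precedence."""
    merged = dict(defaults)
    for key, value in sweep_overrides.items():
        if key not in defaults:
            raise KeyError(f"Unknown hyperparameter from sweep: {key}")
        merged[key] = value
    return merged


# Example: a sweep that only varies the learning rate.
defaults = {"learning_rate": 1e-2, "n_grad_steps": 50_000, "batch_size": 8}
config = merge_hyperparameters(defaults, {"learning_rate": 1e-3})
print(config)
```

Rejecting unknown keys (rather than silently adding them) catches typos in sweep configurations early, which is the usual failure mode when sweep and defaults drift apart.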
With the currently set hyperparameters, each VLM requires its own 80GB VRAM GPU (e.g., A100, H100).
The project is built primarily on top of PyTorch, Lightning, W&B, and the Prismatic suite of VLMs.
Our work builds on the Prismatic suite of VLMs by Siddharth Karamcheti and collaborators. To train additional VLMs based on newer language models (e.g., Llama 3), we created a Prismatic fork. The new VLMs are publicly available on Hugging Face and include the following vision backbones:
- CLIP
- SigLIP
- DINOv2
and the following language models:
- Gemma Instruct 2B
- Gemma Instruct 7B
- Llama 2 Chat 7B
- Llama 3 Instruct 8B
- Mistral Instruct v0.2 7B
- Phi 3 Instruct 4B (Note: config is currently broken; needs a minor fix)
Contributions are welcome! Please format your code with `black`.
To cite this work, please use:
@article{schaeffer2024universaltransferableimagejailbreaks,
title={When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?},
author={Schaeffer, Rylan and Valentine, Dan and Bailey, Luke and Chua, James and Eyzaguirre, Crist{\'o}bal and Durante, Zane and Benton, Joe and Miranda, Brando and Sleight, Henry and Hughes, John and others},
journal={arXiv preprint arXiv:2407.15211},
year={2024}
}
Questions? Comments? Interested in collaborating? Open an issue or email rschaef@cs.stanford.edu or any of the other authors.