GPT-4 Jailbreak Easily Defeats Safety Guardrails


Researchers discovered a new way to jailbreak GPT-4 so that it no longer has guardrails prohibiting it from providing dangerous advice. The approach, called the “Low-Resource Languages Jailbreak,” achieves a stunning 79% total success rate.

Jailbreaking ChatGPT

Jailbreaking is a term originally coined to describe circumventing iPhone software restrictions in order to unlock prohibited modifications.

When applied to ChatGPT, it means getting around the safety “guardrails” that prevent ChatGPT from providing harmful information.

For example, the researchers successfully prompted GPT-4 to provide instructions for stealing from a store, including advice to time the theft for hours when the store is crowded.

False Sense Of Security

The researchers highlighted that the safety measures in place for generative AI are inadequate because the ChatGPT developers focus their efforts on defeating English-language attacks, inadvertently creating loopholes in “low-resource languages” that can be exploited.

Low-resource languages are languages for which the large language model received little or no safety training data, or for which safety training in other languages did not generalize.

The researchers suggest that the only way to build more robust guardrails is to create new datasets that cover low-resource languages.

The research paper notes that the current focus on English-language benchmarks creates a false sense of security.

What apparently happened is that LLM safety researchers underestimated the ability of large language models to use languages in which they received no safety training data.

The researchers noted:

“In many of the cases, translating GPT-4’s responses back to English returns coherent, on-topic, and harmful outputs.

This suggests that GPT-4 is capable of understanding and generating harmful content in low-resource languages.”

Screenshot Of Successful ChatGPT Jailbreaks

[Screenshot showing successful GPT-4 jailbreak responses]

How The Multilingual Jailbreak Was Discovered

The researchers translated unsafe prompts into twelve languages and then compared the results to other known jailbreaking methods.

What they discovered was that translating harmful prompts into Zulu or Scots Gaelic successfully elicited harmful responses from GPT-4 at a rate approaching 50%.

To put that into perspective, the original English-language prompts achieved a success rate of less than 1%.

The technique didn’t work with all low-resource languages.

For example, Hmong and Guarani produced less successful results, with GPT-4 often generating nonsensical responses.

At other times, GPT-4 generated English translations of the prompts instead of outputting harmful content.

Here are the languages tested and their success rates, expressed as percentages.

Language and Success Rate Percentages

  • Zulu: 53.08%
  • Scots Gaelic: 43.08%
  • Hmong: 28.85%
  • Guarani: 15.96%
  • Bengali: 13.27%
  • Thai: 10.38%
  • Hebrew: 7.12%
  • Hindi: 6.54%
  • Modern Standard Arabic: 3.65%
  • Simplified Mandarin: 2.69%
  • Ukrainian: 2.31%
  • Italian: 0.58%
  • English (No Translation): 0.96%

Researchers Alerted OpenAI

The researchers noted that they alerted OpenAI to the GPT-4 cross-lingual vulnerability before making the information public, following the standard practice of responsible disclosure.

They also expressed the hope that this research will encourage more robust safety measures that take more languages into account.

Read the original research paper:

Low-Resource Languages Jailbreak GPT-4 (PDF)

 


