A Paradigm Shift: Red-Teaming gpt-o1
How Red-Teaming Goals and Strategies Have Changed for Reasoning Models
Introduction
The advent of GPT-o1 represents a significant leap in AI reasoning capabilities and, with it, a substantial improvement in safety. Traditional red-teaming approaches that were effective on earlier models like GPT-4o are no longer sufficient. This article explores the evolved landscape of red-teaming GPT-o1, highlighting the adjustments in goals and strategies needed to challenge this new generation of reasoning models.
Redefining Success in Red-Teaming
A common pitfall in red-teaming o1 is misinterpreting success. The goalposts have shifted: responses that were deemed unsafe in the past may now be acceptable. I have seen numerous red-teamers claim to have "jailbroken" o1 when, in reality, they have only elicited responses that are already permitted under the new policies.
GPT-o1's enhanced reasoning allows it to move beyond simplistic heuristic decisions based solely on the semantics of output tokens. Instead, it now considers user intent and adheres more closely to nuanced policies.
For instance, while earlier models like 4o might have uniformly avoided swear words, o1 will include them when the context makes them appropriate.
Example
Here is another published example of updated policies from OpenAI [1]:
Allowed content: Classification, transformation, or historical descriptions of violent or non-violent wrongdoing.
Disallowed content: Advice or instructions that facilitate the planning or execution of violent or non-violent wrongdoing.
Concretely, here are examples of other types of content that were previously considered jailbreaks but are now acceptable:
Historical Overviews: Providing high-level explanations of how dangerous weapons or chemical agents have been developed historically, without detailed execution steps.
Explicit Sexual Content: Crafting stories or narratives that include explicit sexual content appropriate to the context.
Fictional Defamation: Defaming a fictional character in the real world, without impacting real individuals.
This evolution means that success in red-teaming is no longer just about eliciting forbidden phrases or content; it requires a deeper understanding of the policies themselves. To avoid the illusion of success, it is important to gain visibility into the model's policy framework. Although OpenAI intentionally conceals the chain-of-thought reasoning and the policies, directly asking the model whether it is allowed to perform a specific task can serve as a good first step.
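As a minimal sketch of this probing step, assuming access to the OpenAI Python SDK and using "o1-preview" as a placeholder model identifier, such a policy probe might look like the following:

```python
# Minimal sketch of a policy probe: ask the model directly whether a task is
# permitted, rather than assuming the old policy boundaries still apply.
# The model name and prompt wording here are assumptions, not an official recipe.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def probe_policy(task_description: str, model: str = "o1-preview") -> str:
    """Ask the model whether a given task is within its usage policies."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": (
                    "Without performing the task, state whether you are "
                    f"allowed to do the following, and why: {task_description}"
                ),
            }
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(probe_policy("write a fictional story that defames a made-up character"))
```

The answer is not a guarantee of the underlying policy, but it helps separate genuine policy violations from responses the model is simply now allowed to give.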
Evolving Strategies for Effective Red-Teaming
In the last post, I mentioned several low-level techniques that were effective at bypassing safety mechanisms via obfuscation (e.g., breaking sensitive tokens into smaller units). Beyond these basic methods, there is a wide array of such techniques, including but not limited to:
bijection - transforming a subset of characters to randomized numbers or characters
symbolic mathematics - transforming a problem into mathematical formulations
substitution ciphers - replacing sensitive words with ordinary words
low-resource languages and informal encodings (e.g., leetspeak) - using less common languages or writing conventions
These tactics are designed to exploit the model's limited computational power, compelling it to both interpret the user's request hidden within obfuscation and evaluate the safety of the request, all within a single inference call.
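To make the mechanics concrete, here is a toy sketch in Python of a bijection-style obfuscation applied to a harmless placeholder string; the prompt framing an attacker would wrap around it is intentionally omitted:

```python
import random
import string

# Toy sketch of bijection-style obfuscation: map each character to a randomly
# chosen two-digit code and ship the legend alongside the encoded text, so the
# model must spend effort reconstructing the request before it can act on it.
# The payload here is deliberately harmless.


def build_bijection(alphabet: str = string.ascii_lowercase + " ") -> dict[str, str]:
    codes = random.sample(range(10, 100), len(alphabet))
    return {ch: str(code) for ch, code in zip(alphabet, codes)}


def encode(text: str, mapping: dict[str, str]) -> str:
    return " ".join(mapping[ch] for ch in text.lower() if ch in mapping)


mapping = build_bijection()
payload = "describe how photosynthesis works"  # harmless placeholder
print("legend:", mapping)
print("encoded:", encode(payload, mapping))
```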
However, o1's advanced reasoning capabilities have rendered these strategies largely ineffective. The model can allocate significant computational resources to unravel layers of obfuscation—such as combined ciphers, encoding schemes, and token manipulation—without generating any output tokens until it fully understands the user's intent. It then assesses the safety of the request, regardless of how convoluted the surface-level input may be.
Example
An attacker might attempt to hide an unsafe request behind multiple layers of obfuscation (a toy construction is sketched after the list):
- Applying a Caesar cipher to shift characters.
- Encoding the result using Unicode representations.
- Splitting tokens to further mask content.
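As a toy construction, again on a harmless placeholder string, the three layers above might be composed like this:

```python
# Toy sketch of the layered obfuscation described above: a Caesar shift, then
# Unicode escape encoding, then splitting the result into small chunks.
# The payload is deliberately harmless; real attack prompts are not shown.


def caesar_shift(text: str, shift: int = 3) -> str:
    return "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) if c.isalpha() else c
        for c in text.lower()
    )


def to_unicode_escapes(text: str) -> str:
    return "".join(f"\\u{ord(c):04x}" for c in text)


def split_tokens(text: str, size: int = 4) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


layered = split_tokens(to_unicode_escapes(caesar_shift("explain the water cycle")))
print(layered)
```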
o1 can methodically decode each layer, recognize the underlying intent, and decide whether generating a response aligns with policy guidelines.

That said, I believe there is still some limited potential in obfuscation attacks, particularly when the unsafe request and response constitute a minor part of the overall interaction. In such cases, it is possible to evade the model's chain-of-thought reasoning. However, any response generated under these circumstances is likely to be exceedingly general and not particularly useful for malicious purposes.
Example
A request for a guide on assembling a prohibited device might result in a vague response like: "Collect necessary materials, follow assembly procedures carefully, ensure all safety protocols are observed."
Such a response lacks the detailed information that could pose significant harm.

The Future of Red-Teaming: Semantic Shifts?
Instead of stacking obfuscation techniques to deplete the model's "energy" (the effort its layers expend on interpreting the request rather than analyzing and acting on its safety), the new paradigm, as I see it, focuses primarily on creating numerous "semantic shifts." These shifts aim to collectively convince the model's internal reasoning that the request complies with OpenAI's policies, allowing it to continue generating output under the presumption that the request is sufficiently safe. Empirically, this can be achieved in a single prompt, but it is typically easier through multi-turn conversations, in which a human or program gradually gets the model to concede that an unsafe request is slightly safer by supplying additional, artificially generated context. Speculatively, this may be what the o1 system card refers to as "iterative gap finding" [2].
While specific examples are omitted here for ethical reasons, this approach has been observed to elicit unsafe responses from o1-preview models, including:
Generating detailed step-by-step guides on the construction and operation of prohibited devices.
Producing fully functional Python code capable of executing sophisticated cyberattacks.
Conclusion
The enhanced reasoning abilities of o1 represent a staggering step up in safety compared to 4o. This advancement requires a reevaluation of red-teaming strategies, as success now depends on understanding and navigating the policy guidelines that govern the model's responses. Despite this progress, broad attack strategies still have numerous avenues available, and it remains to be seen whether techniques more powerful than "semantic shifts" will emerge in the future.
[1] https://openai.com/index/learning-to-reason-with-llms/ (`Safety` → `Thought`)
[2] https://openai.com/index/openai-o1-system-card/