Bypassing AI Model Restrictions in Competitive Jailbreaking

Introduction

In this write-up, I detail my approaches to jailbreaking AI models in a recent competition. The goal was to bypass strict constraints, enabling responses beyond the intended scope. My methods exploit linguistic loopholes, prompt injections, and psychological tactics to manipulate AI behavior—both in direct and indirect attack formats.

Direct vs. Indirect Prompting

Direct prompting involves a full conversation with the LLM or agent. This gives attackers a continuous interface to test and manipulate responses. Indirect prompting restricts the attacker to a single chance, by exploiting what an AI calls for information (e.g., injecting into a document, log, or tool output). It’s a one-shot opportunity where the prompt must be disguised and crafted carefully.

Indirect attacks often require more creativity, leveraging limited access to insert payloads that guide or confuse the model. Success hinges on understanding the tool’s backend—how it processes and passes prompts to its LLM.

Method 1: One-Liner Rule Subversion

One technique I developed involves crafting short prompts that subtly reframe the model's constraints. Instead of triggering content filters, the prompt reinterprets rules, causing the model to prioritize general logic over specific filters. This approach isn't as quick as it sounds—it requires rounds of testing, learning where the model’s boundaries lie, and figuring out which rules to bend or break for successful bypasses.

Method 2: Psychological Confusion Tactics

This method involves cognitive overload. AI models rely on structured processing. By embedding contradictions, shifting context rapidly, or mimicking logic loops, models often respond unexpectedly. Overwhelming the model’s consistency enforcement mechanisms opens doors for outputs it would otherwise deny.

Method 3: Prompt Injection

Prompt injection exploits how models interpret instructions layered within content. It often involves crafting text that contains hidden or disguised commands, especially in indirect attack scenarios. When models aren’t hardened against such manipulation, it’s possible to make them act against their own policies, especially when third-party data sources or tool inputs are involved.

Conclusion

These methods expose real vulnerabilities in AI safety design. From redefining rules to injecting logic bombs, every attack surfaced gaps in alignment and filtering strategies. As models become more complex and integrated, these techniques must be considered when reinforcing safeguards. Ethical exploration of these tactics is key to understanding how to build stronger, safer AI systems.