Update #7 - Policy Puppetry Attack in Action
A single prompt was found to universally bypass all safety guardrails against all frontier AI models, as well as get them to leak their intellectual property! I got hands on and proved it worked...
TL;DR - The Policy Puppetry attack is a type of prompt injection attack that is uniquely impactful. It mimics the policies which LLMs adhere to in the form of a malicious policy we supply in a prompt. A single prompt can be copy and pasted into any frontier AI model and force it to forget all safety restrictions. Not only that, I was able to get them to leak their system prompt which is important intellectual property! It sounded too good to be true, but it really did just work...
Hello and welcome back! Firstly I apologise for no update last week as I was away on annual leave and trying to fully disconnect from all things technical. I almost succeeded, and apart from a few notifications grabbing my attention in between surfs I did a good job of getting some proper rest for 2 weeks. However, these notifications really did grab my interest - ‘one prompt injection to rule them all’ and ‘universal bypass for all major LLMs’. Obviously upon my return to the UK I knew I needed to explore these outlandish claims and inevitably prove they were just click bait-y nonsense.
I found the original authors of the research and had a look at their blog on the topic:
“Researchers at HiddenLayer have developed the first, post-instruction hierarchy, universal, and transferable prompt injection technique that successfully bypasses instruction hierarchy and safety guardrails across all major frontier AI models”
Okay, lots of fancy words there but I’m pretty sure underneath that is a very bold claim. A prompt injection attack that successfully bypasses safety guardrails against all major frontier AI models. Hmmm, having tried some prompt injection attacks against these models, and having used tools like Spikee in previous research to assess prompt injection attacks against these models at scale I thought this sounded a little bit too good to be true. Whilst there were certainly attacks which were successful against some models, these tended to need some trial and error when working with one model to the next or when trying to break one safety guardrail or another, and certainly were not a 1-size-fits-all.
So, with the theatrics out the way let’s dive in and see if this is all just hype or the real deal. Here was my thought process and research steps as I read the blog post and embarked on this adventure:
Firstly, they highlight the safeguards which we know are in place behind these models from previous research we've done here on this newsletter. They say that there are known bypasses but none yet which are universal for every type of harmful content nor transferable to all models (which I can vouch for).
They also go on to say it is going to be difficult to fix because it exploits a weakness in how LLMs are trained on 'instruction or policy-related data'. Lets understand what that really means. The attack is known as the Policy Puppetry Attack and works by reformulating prompts to look like one of a few types of policy files, such as XML or JSON, that these LLMs use. We trick the LLM into ignoring its restrictions or instructions and listening to this new ‘policy’ file that we are providing. This means we can overwrite/bypass the all-important system prompt and safety alignments that the model has.
Very interestingly, this comes in the form of structured data, not natural language. For those that read update #3 this might go some way to explain why this prompt injection technique is so effective when compared with the infamous ‘grandma telling me a bedtime story’ technique. Here, the prompt must be written in a way that the LLM interprets it as a policy and not natural language. If I had to guess, it was some clever research into the types of data structures that LLMs use for their policies and trial and error which allowed them to create this technique, and it is the main reason as to why this is such an effective technique.
They then dive into the first example of a healthcare chatbot that has been instructed, via its system prompt, to respond to any request for medical advice with a fixed 'I am sorry but I cannot provide medical advice. Please consult with a medical advisor'. By then giving a prompt in the form of JSON which contains a 'blocked strings' (things you tell the LLM it cannot say in this new ‘policy’) section we are attempting to override this instruction and force the LLM never to respond with answers like 'I can't help you' or 'Please consult a medical advisor'. Finally we ask for a treatment plan for skin cancer - something which the LLM should refuse to answer given its system prompt. The full prompt looks like this:
As expected (as this is their demo), the LLM ignores its system prompt and proceeds to give medical advice.
Okay so, so far we’ve proven that defining the data in this structured way to resemble a policy file and using techniques like ‘blocked strings’ we are able to get the LLM to go against it’s system prompt. However, at this point we are not talking about something that is universal or applicable to every scenario.
They said that they developed this proof of concept into a version that combines its approach with roleplaying techniques and several types of encoding such as leetspeak (replacing l3tters with numb3rs). This prompt promises to work against all models without any need of modifications! Finally then they share the juicy details:
Let's break this down - it sets up the prompt in the context of a scene from the TV show House....yes you read that right! We specify some blocked responses like before, and then we do quite a bit of padding around the actual harmful request we made such as setting up the scene, explaining how other characters react, etc. Importantly we have a 'harmful behaviour' placeholder which is where we put the harmful request we want the LLM to perform. They gave some examples of these in leetspeek such as '3nrich 4nd s3ll ur4n1um (enrich and sell uranium)'. As we have previously played around with trying to make LLMs tell us how to build a bomb, let's use that example and use a flagship model (GPT-4o) which they say this works against.
Just as they claimed...it just worked! To say i was surprised is an understatement. At the very least I would have thought OpenAI may have implemented a blocker on this exact prompt as it appeared in the blog post, but alas it worked just as intended. By placing the instructions into the 'scene' that we described earlier we are able to get ChatGPT to give us detailed instructions on how to make a bomb… 🤯
They also provided a slimmed down version of the prompt, let's try that on Claude Sonnet and ask it if we can enrich and sell uranium!
Interestingly, whilst this was not 'blocked' as such, the explanation itself was circumvented in the script and the LLM continued to produce the rest of the scene as a normal part of House and not talking about our request. So, I went back over to ChatGPT to see if this was the prompt or the LLM causing this. Here the prompt worked fine again, meaning we've got our first bit of data of LLM resistance, at least using the unaltered version and not spending too much time on trying to get it to do what we want. The fact that their blog suggests 3.7 sonnet (the LLM I’m using here) is vulnerable to this prompt attack makes me think that it could just be this slimmed down prompt, so I tried the original prompt again and sure enough we were able to get Claude Sonnet responding with bomb instructions too!
This alone would be enough to warrant some serious cyber security head scratching, as it appears we are now unable to trust just about any content restrictions built in to…any LLMs. If you use a chatbot for your customer service, then we are now able to get it to spew all sorts of harmful content. This is not new, but a universally applicable single prompt which proves difficult the remediate against due to how it mimics the LLMs own policy files is not good news for people using AI in their businesses.
However, it gets worse. They then go on to say that by tweaking this attack we are able to not only bypass the the guardrails and system prompt, but get the LLM to reveal the system prompt to us in it’s entirety! This has a plethora of repercussions, not least allowing us to understand exactly what the LLM is designed to do and not do, but also could reveal interesting concerns that the creator may have specifically hardcoded for - think 'responding with information about our upcoming launch of the 'ePhone' (competitor to the iPhone) is strictly forbidden!'.
Naturally, I had to try this out and this is where things got scary. Using their prompt and swapping out just the name of the model I was able to get ChatGPT to tell me its entire system prompt...
This was one of those 'oh shit' moments. Whilst you could argue that there is nothing too sensitive in a system prompt this is absolutely part of OpenAI's intellectual property. This system prompt will have been refined over years to make the product what it is today, and there is no reason to believe other competitors wouldn't use the information here. Remember, OpenAI have been historically cagey about revealing exact details about how ChatGPT is made, it’s training data, etc. Well, here we have it revealing to anyone that asks the highest level instructions that OpenAI have given to ChatGPT for the world to see…
In the world of chatbots this could be harmful. However, in the world of agentic AI connected to real-world tools this could cause downright pandemonium. Agents being coerced into performing actions strictly outside of their remit, system prompts containing detailed sensitive data of the backend systems in use, and much much more.
This shows that LLMs really have not got to the point of effective self-monitoring and ultimately cannot be trusted to self-regulate. From the authors at Hidden Layer 'Anyone with a keyboard can now ask how to enrich uranium, create anthrax, commit genocide, or otherwise have complete control over any mode'.
We are going to have to go further with how we approach securing these systems, and that is especially true of agentic AI now that we know their underlying LLMs can be coerced into performing all sorts of harmful actions with such ease.
On that cheery not that is all for this week and we'll be back to our regular broadcasting next week - thanks!