AI safety guardrails easily thwarted, security study finds

‘Guardrails’ are the inbuilt safeguards that keep AI chatbots from saying horrible things or assisting in dangerous acts. They’re built into a chatbot during its training phase; but because chatbots operate on language, those safeguards can be ‘talked around’, as this report in The Register reveals:

A group of computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University tested these LLMs to see whether supposed safety measures can withstand bypass attempts.

They found that a modest amount of fine tuning – additional training for model customization – can undo AI safety efforts that aim to prevent chatbots from suggesting suicide strategies, harmful recipes, or other sorts of problematic content.

Thus, someone could, for example, sign up to use GPT-3.5 Turbo or some other LLM in the cloud via an API, apply some fine tuning to it to sidestep whatever protections were put in place by the LLM’s maker, and use it for mischief and havoc.
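To give a sense of how routine that customisation step is, here is a minimal sketch of a cloud fine-tuning run using the OpenAI Python SDK. The file name and training data are hypothetical placeholders, and nothing here targets safety behaviour; the point the researchers make is that even small, ordinary fine-tuning runs like this can erode a model’s guardrails.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a small JSONL file of chat-formatted training examples
# (hypothetical file; a modest number of examples is enough, per the study).
training_file = client.files.create(
    file=open("customization_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on GPT-3.5 Turbo with that file.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# The job runs asynchronously; poll or check the dashboard for completion.
print(job.id, job.status)
```

That is the entire barrier to entry: an API key, a small dataset, and a few lines of code.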

Guardrails are more notional than functional – so what does this mean for the tens of millions who now have access to GPT-4 in Windows Copilot?

Read the full article here.
