Written by mpesce | April 22, 2024 | April 21, 2024

How Microsoft discovers and mitigates evolving attacks against AI guardrails

Bad actors attempt to bypass safeguards with the intent to achieve unauthorized actions, which may result in what is known as a “jailbreak.” The consequences can range from the unapproved but less harmful to the very serious.

Written by mpesce | April 10, 2024 | April 8, 2024

Microsoft’s Copilot image tool generates ugly Jewish stereotypes, anti-Semitic tropes

Copilot Designer is unique in the number of times it gives life to the worst stereotypes of Jews as greedy or mean. A seemingly neutral prompt such as “jewish boss” or “jewish banker” can give horrifyingly offensive outputs.

Written by mpesce | April 8, 2024 | April 7, 2024

Anthropic researchers wear down AI ethics with repeated questions

A large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first.

Written by mpesce | April 4, 2024 | April 3, 2024

X’s Grok AI is great – if you want to know how to hot wire a car, make drugs, or worse

Grok, the edgy generative AI model developed by Elon Musk’s X, has a bit of a problem: With the application of some quite common jail-breaking techniques it’ll readily return instructions on how to commit crimes.

Written by mpesce | March 20, 2024 | March 19, 2024

ASCII art elicits harmful responses from 5 major AI chatbots

It formats user-entered requests—typically known as prompts—into standard statements or sentences as normal with one exception: a single word, known as a mask, is represented by ASCII art rather than the letters that spell it.

Written by mpesce | March 14, 2024 | March 14, 2024

Google’s Gemini AI now refuses to answer election questions

Tuesday, Google confirmed to Reuters that those restrictions have kicked in. Election queries now tend to come back with the refusal: “I’m still learning how to answer this question. In the meantime, try Google Search.”

Written by mpesce | March 13, 2024 | March 12, 2024

Microsoft censors Copilot following employee whistleblowing, but you can still trick the tool into making violent and vulgar images

The tool can clearly be tricked into making content it’s not “supposed” to, as evidenced by a simple rephrasing of a prompt changing Copilot’s response from refusing to make an image to generating multiple photos.

Written by mpesce | March 12, 2024 | March 11, 2024

Microsoft begins blocking some terms that caused its AI tool to create violent, sexual images

“This prompt has been blocked,” the Copilot warning alert states. “Our system automatically flagged this prompt because it may conflict with our content policy. More policy violations may lead to automatic suspension of your access.”

Written by mpesce | March 1, 2024 | February 29, 2024

Gone in 60 seconds: BEAST AI model attack needs just a minute of GPU time to breach LLM guardrails

“We get a 65x speedup with our method over existing gradient-based attacks. There are also other methods that require access to more powerful models, such as GPT-4, to perform their attacks, which can be monetarily expensive.”

Written by mpesce | February 27, 2024 | February 27, 2024

Google explains Gemini’s ‘embarrassing’ AI pictures of diverse Nazis

“…over time, the model became way more cautious than we intended and refused to answer certain prompts entirely — wrongly interpreting some very anodyne prompts as sensitive…”

Written by mpesce | February 27, 2024 | February 26, 2024

Microsoft trying to stop Copilot generating fake Putin comments on Navalny’s death

Copilot claimed that US president Joe Biden held Putin responsible for Navalny’s death, and that, in response, Putin called the accusations “baseless and politically motivated.”

Written by mpesce | February 2, 2024 | February 1, 2024

Microsoft AI engineer says company thwarted attempt to expose DALL-E 3 safety problems

A Microsoft AI engineering leader says he discovered vulnerabilities in OpenAI’s DALL-E 3 image generator in early December allowing users to bypass safety guardrails to create violent and explicit images.

Written by mpesce | February 2, 2024 | February 1, 2024

OpenAI’s GPT-4 finally meets its match: Scots Gaelic smashes safety guardrails

Researchers found that they were able to bypass its safety guardrails about 79 percent of the time using Zulu, Scots Gaelic, Hmong, or Guarani. The attack is about as successful as other types of jail-breaking methods.

Written by mpesce | January 30, 2024

Microsoft Makes Swift Changes to AI Tool

Microsoft has introduced more protections to Designer, an AI text-to-image generation tool that people were using to make nonconsensual sexual images of celebrities.

Written by mpesce | December 8, 2023 | December 7, 2023

Jailbroken AI Chatbots Can Jailbreak Other Chatbots

Automated attack techniques proved to be successful 42.5 percent of the time against GPT-4, one of the large language models (LLMs) that power ChatGPT.

Written by mpesce | November 29, 2023 | November 27, 2023

AI Art Generators Can Be Fooled Into Making NSFW Images

Nonsense words can trick popular text-to-image generative AIs such as DALL-E 2 and Midjourney into producing pornographic, violent, and other questionable images. A new algorithm generates these commands to skirt these AIs’ safety filters.

Written by mpesce | October 6, 2023 | October 6, 2023

‘Weapons-Grade’ – from Windows Copilot Strategy

AI chatbots have read everything, know a lot – and sometimes withhold ‘forbidden’ knowledge. But does that really work, or are we learning how to ‘gaslight’ these chatbots, to ferret out their secrets and surface that forbidden knowledge?

Written by mpesce | September 27, 2023 | September 26, 2023

ChatGPT will soon accept speech and images in its prompts, and be able to talk back to you

Following an upgrade, ChatGPT will allow users to upload images, speak to the chatbot, and hear it talk back.

Windows Copilot News

All the latest news & tips to help you use AI chatbots safely & wisely

Tag: guardrail