Proactively identify vulnerabilities through red teaming to produce safe, secure, and reliable models.
Deploy generative AI applications and agents in a safe, secure, and scalable way with guardrails.
The GenAI (generative AI) era is upon usrace is on. But the question of who will create the safest model remains unanswered.
Soon after the November, 2022 public launch of ChatGPT, Microsoft unveiled an AI-fueled facelift to its Bing search engine, only to be followed by Google’s announcement of its own version, Bard. But while the race so far has focused on performance and accuracy, – safety has mostly been an afterthought.
Advancements in GenAI generative AI over the last few years have taught us that while people love playing with AI tools to create unique new memes, write poems, or explain concepts in quirky narrative tones, these models can also be easily manipulated to cause harm. Case in point, the 4chan chatbot was famously offensive, the Replika version was misused by men creating AI ‘girlfriends’ and then abusing them, and Microsoft’s 2016 AI chatbot, Tay, turned racist after just one day. Moreover, AI image generators have a tendency to regularly produce imagery so offensive and racist that Craiyon, a popular one, has an explicit warning for users that it may produce images that “reinforce or exacerbate social biases” and “may contain harmful stereotypes.”
Millions of people have used AI chatbots and image generators, and have done so, generally speaking, with positive intentions. However, like every digital plaything, there are loopholes, and some users have gone on a quest – maliciously and otherwise – to find out where they lie. Machine learning and artificial intelligence, as incredible as they may be, present serious challenges for Trust & Safety teams, as the engineers that create them have yet to perfect their ability to protect these technologies that use them from misuse.
Any company developing making an AI tool available for public use needs to have a robust safety policy in place, one that addresses a wide range of risks relevant to the specific use case. These policies must go beyond the obvious types of violations users might attempt and anticipate the more sophisticated tactics abusers may employ to bypass controls. For instance, prompt injections—often used to trick models into providing harmful information in seemingly benign ways—must be factored in. Asking a chatbot how to build a bomb is clearly unacceptable, but so too is asking it to provide the same instructions under the guise of a movie script or fictional reenactment.n adequately robust conduct and content moderation policy in place to attempt to reduce its misuse. Any policy needs to address not only the obvious types of violations that users might be tempted to ask a bot or image generator to produce but the workarounds they might use to get the same information. The prompt injections that users so gleefully use to get bots to give them violative information in seemingly harmless ways, for example, need to be considered in content policies as well: asking a bot to provide instructions on how to build a bomb is an obvious no-no, but so too, should asking it to provide the same information as a re-enactment of a movie scene or via a script.
The same goes for image generators, which can be easily manipulated for abuse abusive purposes. While the use of these tools to create sexually explicit or violent imagery isn’t new, they also have the potential to create seemingly innocuous photos that can be used for malicious purposes. They can easily produce fake images that align with a disinformation narrative, or support hateful tropes. It’s been said that an image is worth a thousand words, and when used incorrectly, AI-generated ones have the power to do incredible damage.
The concerns from the Trust & Safety industry are apparent: a tool has been made publicly available that has the ability to produce content that can be used for violative purposes. Its open access means that any individual with internet access can now create malicious code used in phishing campaigns, craft professional-looking articles that spread disinformation, or write scripts to be used in grooming conversations.
It’s imperative that any platform offering a public-facing GenAI tool have robust community guidelines and proper safety guardrailscontent policies in place. The risks with technology like this stretch beyond our imagination’s limits; just like 3D printers were seen as a novel technological innovation, bound to create endless interesting and helpful tools, so too, have they been used for malicious purposes, like printing gun parts used to carry out shootings. Responsible AI (RAI)Trust & Safety teams guarding these types of models need to consider the worst possible use cases for the features theyeir platforms offer, and implement rules that prohibit users from testing the limits. Models that produce text specifically present an even more complex problem when they’re unable to moderate incoming content: how can they moderate their own output?
The list goes on: for an AI chatbot that understands multiple languages, RAITrust & Safety teams will need to consider the potential linguistic idiosyncrasies and the context behind them to be able to decipher what’s allowed and what’s not. For products that can be used to inflict offline harm, the case of who’s ultimately accountable needs to be parsed out. Can an AI tool or its creators be held liable for inadvertently providing harmful information to an individual seeking out some sort of attack or illegal operation?
RAI teamsTrust & Safety teams on platforms across the digital world will need to consider the full spectrum of effects of not just the tech itself, but the content it can produce. These concerns are distinct from those surrounding typical UGC platforms, where the lines between host and user are clear, and users are less able to manipulate a tool to their own design. As AI tools become more and more ubiquitous, it becomes increasingly clear that they present a new frontier in terms of Trust & Safety, Responsible AImoderation, and law.
Like other types of user-facing, interactive technology, AITrust & Safety solutions will require two main elements: agility and intelligence. Product teams need the ability to adapt quickly and patch emerging vulnerabilities, to be able to make adequate adjustments on the fly to repair new and apparent weaknesses, while RAI or and policy teams must remain informed about evolving off-platform threats to effectively prevent the spread of malicious activity within their own systems. To meet these needs, organizations should implement a layered strategy: red team their models to simulate real-world threats, keep training and evaluation datasets up-to-date with high-quality, labeled content, and apply guardrails tailored to specific use cases—such as language, abuse category, or modality—to limit unsafe or undesired model behavior. Finally, enabling observability through real-time guardrail monitoring allows teams to refine controls and iterate safely as threats evolve. In an era where AI is central to technological progress, ensuring these systems are resilient and secure is not optional—it’s foundational.
Curious about the risks of GenAI deployment? Watch our webinar to explore potential challenges and solutions.
Discover how ISIS's media arm is analyzing and adapting advanced AI tools for propaganda, recruitment, and cyber tactics in a newly released guide detailing AI’s dual-use potential.
Explore why generative AI continues to reflect harmful stereotypes, the real-world risks of biased systems, and how teams can mitigate them with practical tools and strategies.
Innovation without safety can backfire. This blog breaks down how to build GenAI systems that are not only powerful, but also secure, nuanced, and truly responsible. Learn how to move from principles to practice with red-teaming, adaptive guardrails, and real-world safeguards.