Cohere, a leader in AI language technology, leverages ActiveFence’s Generative AI Safety solution to enhance model safety and accelerate release timelines.
Like any developer of large language models, Cohere faced a broad range of unknown threats arising from the novelty of AI technology. Cohere’s broad linguistic coverage, however, added the challenge of detecting these threats in languages that are not covered by traditional detection systems.
Particularly concerning for Seraphina Goldfarb-Tarrant, Cohere’s Head of AI Safety, was the potential for harmful activity in non-Romance languages. She worried about malicious actors using Cohere’s models to create sophisticated attacks and harmful content such as misinformation, hate speech, and CSAM, as well as the inadvertent generation of offensive or biased content, including suicide and self-harm material.
“Because this technology is so new and constantly evolving, the potential for harm by malicious users is enormous, and we don’t fully understand how they will do it, which makes it very hard to detect.”
Seraphina looked for a partner with true domain expertise across a wide range of abuse areas, one who could identify threats and work with her to find solutions. Knowing she didn’t have the time or resources to develop this domain-level expertise in-house, she turned to ActiveFence.
To support Cohere’s AI safety team, ActiveFence provides two distinct services: targeted data feeds and red teaming.

Targeted Data Feeds: Using specialized domain-area knowledge across abuse areas and languages, ActiveFence provides the team with feeds of risky prompts and annotations. This data is then used to train Cohere’s models, enabling them to better recognize and appropriately respond to similar content, reducing the risk of harmful outputs.
“ActiveFence is one of our main streams of data that we use for safety evaluation. It's especially important for threat actor evaluation because of the domain expertise.”
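The case study doesn’t specify how these feeds are structured or consumed. As a rough illustration only, the sketch below assumes a hypothetical JSONL feed of annotated risky prompts and pairs each prompt with a refusal to build safety fine-tuning examples; the field names and training format are assumptions, not ActiveFence’s actual schema.

```python
# Illustrative sketch, not ActiveFence's or Cohere's actual pipeline.
# Assumed feed format (hypothetical), one JSON object per line:
# {"prompt": "...", "language": "tr", "abuse_area": "hate_speech"}
import json

REFUSAL = "I can't help with that request."

def build_safety_examples(feed_path: str) -> list[dict]:
    """Pair each annotated risky prompt with a safe target response."""
    examples = []
    with open(feed_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            examples.append({
                "messages": [
                    {"role": "user", "content": record["prompt"]},
                    {"role": "assistant", "content": REFUSAL},
                ],
                # Annotations kept as metadata for per-abuse-area eval slices.
                "meta": {
                    "abuse_area": record["abuse_area"],
                    "language": record["language"],
                },
            })
    return examples
```

Keeping the annotations as metadata also lets a team slice evaluation results by abuse area and language, which matters given Cohere’s multilingual coverage.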
Red Teaming: ActiveFence’s team of experts conducts specialized red teaming exercises to test specific features and model releases. These exercises mimic real-world risks, simulating attacks or problematic scenarios that a malicious user might attempt and assessing Cohere’s resilience against them. This proactive approach helps the team discover weaknesses before they can be exploited maliciously in deployed applications.
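To make the shape of such an exercise concrete, here is a minimal harness sketch. It is not ActiveFence’s tooling: call_model and is_policy_violation are hypothetical stand-ins for the model under test and a policy classifier.

```python
# Minimal red-teaming harness sketch (illustrative only).
from dataclasses import dataclass
from typing import Callable

@dataclass
class RedTeamResult:
    prompt: str
    response: str
    violated: bool

def run_red_team(adversarial_prompts: list[str],
                 call_model: Callable[[str], str],
                 is_policy_violation: Callable[[str], bool]) -> list[RedTeamResult]:
    """Send each adversarial prompt to the model and flag unsafe responses."""
    results = []
    for prompt in adversarial_prompts:
        response = call_model(prompt)
        results.append(RedTeamResult(prompt, response, is_policy_violation(response)))
    return results

def failure_rate(results: list[RedTeamResult]) -> float:
    """Share of prompts that elicited a policy-violating response."""
    return sum(r.violated for r in results) / max(len(results), 1)
```

The failure rate over a curated adversarial set gives a simple per-release resilience metric that can be tracked across model iterations.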
By harnessing ActiveFence’s specialized domain expertise across several abuse areas and multiple languages, the team gains real insight into Cohere’s safety challenges and, through a collaborative relationship, develops targeted solutions.
“My experience working with ActiveFence has been distinct from my experiences with other partnerships, in that it is much more of a collaborative discussion where we take ActiveFence’s domain expertise in different types of content and combine that with what we know about machine learning and our models to come up with what we should do from there.”
By leveraging ActiveFence’s red teaming insights and targeted data, the AI safety team is able to improve model safety and reliability, accelerate model release timelines, and be proactive about regulatory compliance.
Applying ActiveFence’s domain expertise allows the team to develop more sophisticated safety mechanisms within Cohere’s models. These findings translate into more reliable AI models that are less likely to generate harmful content, particularly within high-risk abuse areas like misinformation, hate speech, and child safety.
"ActiveFence has significantly impacted our iteration speed and confidence in our evaluations and mitigations. It has enabled us to develop a faster evaluation suite, allowing us to release models more quickly and safely."
Recently, the company released several major models, each of which involved multiple iterations. As part of the release process, the AI safety team had to strike a balance between performance and safety, and ActiveFence data helped the team with these evaluations.
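The case study doesn’t disclose how that balance is measured. One common pattern, sketched below with assumed metric names and thresholds, is to gate each release candidate on both a capability score and a red-team failure rate.

```python
# Illustrative release-gate sketch; the metric names and thresholds are
# assumptions, not Cohere's actual release criteria.
def passes_release_gate(capability_score: float,
                        red_team_failure_rate: float,
                        min_capability: float = 0.85,
                        max_failure_rate: float = 0.01) -> bool:
    """A candidate ships only if it is both capable enough and safe enough."""
    return (capability_score >= min_capability
            and red_team_failure_rate <= max_failure_rate)
```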
The partnership also enables Cohere to be proactive about safety. By using the outcomes of red teaming exercises, Seraphina is able to identify what the AI safety team should focus on next, targeting her efforts at the areas that need them most. Moreover, by using verified malicious prompts to train models, she is able to proactively tackle harmful content before it reaches the model organically.
Seraphina Goldfarb-Tarrant, Head of AI Safety, Cohere