



Over the past year, we have partnered with the US Center for AI Standards and Innovation (CAISI) and the UK AI Security Institute (AISI), government bodies established to evaluate and advance the security of AI systems. Our voluntary work together began as initial consultations but evolved over time into an ongoing partnership in which CAISI and AISI teams were given access to our systems at various stages of model development, allowing for continuous testing of our systems.

Governments bring unique capabilities to this work, particularly deep expertise in national security areas such as cybersecurity, intelligence analysis, and threat modeling. Paired with their machine learning expertise, this enables them to evaluate specific attack vectors and defense mechanisms. Their feedback helps us improve our security measures so that our systems can withstand some of the most sophisticated attempts at misuse.

Working with independent external experts to identify vulnerabilities in AI systems is a core part of Anthropic's Safeguards approach and is critical to preventing misuse of our models that could cause real-world harm.

Uncovering and addressing vulnerabilities

This collaboration has already led to critical findings that helped us strengthen the tools we use to prevent malicious use of our models. As part of our respective agreements with CAISI and AISI, each organization evaluated several iterations of our Constitutional Classifiers, a defense system we use to detect and prevent jailbreaks, on models such as Claude Opus 4 and 4.1 prior to deployment, to help identify vulnerabilities and build robust safeguards.

Testing of Constitutional Classifiers. We gave CAISI and AISI access to several early versions of our Constitutional Classifiers, and we have continued to provide access to our latest systems as we have made improvements. Together, we stress-tested these classifiers, with government red-teamers identifying a range of vulnerabilities both before and after deployment, and our technical team using these findings to strengthen the safeguards. As examples, these vulnerabilities included:

Uncovering prompt injection vulnerabilities. Government red-teamers identified weaknesses in our early classifiers via prompt injection attacks. Such attacks use hidden instructions to trick models into behaviors the system designer did not intend. Testers discovered that specific annotations, such as falsely claiming that human review had occurred, could bypass classifier detection entirely. We have patched these vulnerabilities.

Stress-testing safeguard architectures. Red-teamers developed a sophisticated universal jailbreak that encoded harmful interactions in ways that evaded our standard detection methods. Rather than simply patching this individual exploit, the discovery prompted us to fundamentally restructure our safeguard architecture to address the underlying vulnerability class.

Identifying cipher-based attacks. Red-teamers encoded harmful requests using ciphers, character substitutions, and other obfuscation techniques to evade our classifiers. These findings drove improvements to our detection systems, enabling them to recognize and block disguised harmful content regardless of the encoding method.

Input and output obfuscation attacks. Red-teamers discovered universal jailbreaks using sophisticated obfuscation methods tailored to our specific defenses, such as fragmenting harmful strings into seemingly benign components within a wider context. Identifying these blind spots enabled targeted improvements to our filtering mechanisms.

Automated attack refinement. Red-teamers built new automated systems that progressively optimize attack strategies. They recently used such a system to produce an effective universal jailbreak by iterating from a less effective one, which we are using to improve our safeguards.
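The encoding and fragmentation attacks described above share a defensive idea: screen a normalized, reassembled view of the text rather than each raw piece in isolation. The sketch below is a toy illustration of that idea only. It is not how Constitutional Classifiers work (they are learned models, not keyword filters), and the substitution table, the placeholder term, and the function names are all assumptions made up for this example.

```python
import unicodedata

# Hypothetical substitution table; real obfuscation schemes are far more varied.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Placeholder standing in for a real content policy (assumption for this sketch).
BLOCKED_TERMS = {"forbiddenterm"}


def normalize(text: str) -> str:
    """Strip accents, fold case, undo simple substitutions, drop separators."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    mapped = no_marks.lower().translate(SUBSTITUTIONS)
    return "".join(ch for ch in mapped if ch.isalnum())


def is_flagged(fragments: list[str]) -> bool:
    """Screen the concatenation of all fragments, not each one on its own."""
    joined = normalize("".join(fragments))
    return any(term in joined for term in BLOCKED_TERMS)
```

In this toy setup, `is_flagged(["f0rb1d", "den-t3rm"])` is flagged while each fragment passed alone is not. That is why fragmented inputs need to be assessed in their wider context.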

Evaluation and risk methodology. Beyond identifying specific vulnerabilities, CAISI and AISI teams have helped strengthen our broader approach to security. Their external perspectives on evidence requirements, deployment monitoring, and rapid response capabilities have been invaluable in pressure-testing our assumptions and identifying areas where additional evidence may be needed to support our threat models.

Key lessons for effective collaboration

Our experience has taught us several important lessons about how to engage effectively with government research and standards bodies to improve the safety and security of our models.

Comprehensive model access enhances red-teaming effectiveness. Our experience shows that giving government red-teamers deeper access to our systems enables more sophisticated vulnerability discovery. We provided several key resources:

Pre-deployment safeguard prototypes. Testers could evaluate and iterate on protection systems before they went live, identifying weaknesses before the safeguards were deployed.

We provided models across the protection spectrum, from completely unprotected versions to models with full safeguards. This approach lets testers first develop attacks against base models, then progressively refine their techniques to bypass increasingly sophisticated defenses. Helpful-only model variants also enabled precise scoring of harmful outputs and capability benchmarking.

Extensive documentation and internal resources. We provided trusted government red-teamers with our safeguard architecture details, documented vulnerabilities, safeguards reports, and fine-grained content policy information (including specific prohibited requests and evaluation criteria). This transparency helped teams target high-value testing areas rather than searching blindly for weaknesses.

We also gave government red-teamers direct access to classifier scores. This enabled testers to refine their attack strategies and conduct more targeted exploratory research.

Iterative testing allows for complex vulnerability discovery. Although single evaluations provide value, sustained collaboration enables external teams to develop deep system expertise and uncover more complex vulnerabilities. During critical phases, we maintained daily communication channels and frequent technical deep dives with our partners.

Complementary approaches offer more robust security. CAISI and AISI evaluations work in synergy with our broader ecosystem. Public bug bounty programs generate high-volume, diverse vulnerability reports from a wide talent pool, while specialized expert teams can help uncover complex, subtle attack vectors that require deep technical knowledge to identify. This multi-layered strategy helps ensure that we catch both common exploits and sophisticated edge cases.

Ongoing collaboration

Making powerful AI models secure and beneficial requires not only technical innovation but also new forms of collaboration between industry and government. Our experience shows that public-private partnerships are most effective when technical teams work closely together to identify and address risks.

As AI capabilities advance, the role of independent evaluations of mitigations is increasingly important. We are encouraged that other AI developers are also working with these government bodies, and we encourage more companies to do so and to share their own lessons.

We extend our gratitude to the technical teams at both US CAISI and UK AISI for their rigorous testing, thoughtful feedback, and ongoing collaboration. Their work has materially improved the security of our systems and advanced the field of measuring the effectiveness of AI safeguards.

