Date: 2024-05-22



Last year, the researchers began experimenting with a small model that uses only a single layer of neurons (advanced LLMs have dozens of layers). They hoped that the simplest possible setting would let them discover patterns that indicate features. They ran countless experiments without success. “I tried various things, but nothing worked. It looked like a bunch of random junk,” says Tom Henighan, a member of Anthropic's technical staff. Then a run named “Johnny” (each experiment was assigned a random name) began associating neural patterns with the concepts that appeared in its output.

Chris saw that and thought, “What the fuck.” Henighan was surprised too. “This is wonderful,” he says. “I looked at it and thought, ‘Oh, wow, wait, this is going to work?’”

Suddenly, the researchers could identify the features that groups of neurons were encoding. They could peer into the black box. Henighan says they identified the first five features they looked at. One group of neurons represented Russian text. Another was related to mathematical functions in the Python programming language. And so on.

Once they had shown that features in a small model could be identified, the researchers began the harder task of deciphering an actual, full-sized LLM. They used Claude Sonnet, the medium-strength version of Anthropic's three current models. That worked too. One feature that stuck with them was related to the Golden Gate Bridge. They mapped a set of neurons that, when fired together, indicated that Claude was “thinking” about the giant structure connecting San Francisco and Marin County. Additionally, when similar sets of neurons fired, they evoked subjects adjacent to the Golden Gate Bridge: Alcatraz Island, California Governor Gavin Newsom, and the Hitchcock film Vertigo, which was set in San Francisco. All told, the team identified millions of features, a sort of Rosetta Stone for deciphering Claude's neural net. Many of the features were safety-related, such as “approaching someone with some ulterior motive,” “discussion of biological warfare,” and “villainous plots to take over the world.”

The Anthropic team then took the next step and used that information to see if they could change Claude's behavior. They began manipulating the neural network to enhance or diminish specific concepts, a kind of AI brain surgery that could potentially make LLMs safer and increase their capabilities in selected areas. “Suppose we have this feature board. When you turn on the model, one of them lights up and you say, ‘Oh, I see you're thinking about the Golden Gate Bridge,’” says Shan Carter, an Anthropic scientist on the team. “So now we thought, what if we put a little dial on all of these? And what happens when you turn that dial?”

So far, the answer to that question seems to be that turning the dial the right amount is crucial. By suppressing certain features, the model can produce safer computer programs and reduce bias, according to Anthropic. For example, the team found several features that represent risky practices, including unsafe computer code, deceptive emails, and instructions for creating dangerous products.
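
For readers who want a concrete picture of the “dial,” here is a toy sketch in Python. It is not Anthropic's code, and the article does not describe the actual implementation. It assumes, purely for illustration, that a feature corresponds to a direction in the model's activation space and that turning the dial means adding or subtracting a multiple of that direction. The names `feature_strength`, `turn_dial`, and `bridge_direction`, and all the numbers, are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def feature_strength(activations, direction):
    """How strongly the (hypothetical) feature is present:
    the projection of the activations onto the feature direction."""
    return dot(activations, normalize(direction))


def turn_dial(activations, direction, amount):
    """Shift the activations along the feature direction.
    Positive amounts amplify the feature, negative amounts suppress it."""
    unit = normalize(direction)
    return [a + amount * d for a, d in zip(activations, unit)]


# Made-up activations and a made-up "Golden Gate Bridge" direction.
activations = [0.5, -1.0, 2.0, 0.0]
bridge_direction = [1.0, 0.0, 1.0, 0.0]

before = feature_strength(activations, bridge_direction)

# Turn the dial up by 3 units.
boosted = turn_dial(activations, bridge_direction, 3.0)

# Turn the dial down just enough to cancel the feature.
suppressed = turn_dial(activations, bridge_direction, -before)

print("before:    ", round(before, 3))
print("boosted:   ", round(feature_strength(boosted, bridge_direction), 3))
print("suppressed:", round(feature_strength(suppressed, bridge_direction), 3))
```

In this toy version, the dial changes only the part of the activations that lies along the chosen direction and leaves everything else untouched, which is one simple way to picture why the amount matters: the same mechanism can erase a concept or push it far past its natural level.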

