Solutions built on large language models pop up daily like mushrooms, yet the models themselves are still treated as black boxes: some text goes in and a response comes out, but it is not clear why the model gives this particular answer and not another. This raises an obvious question: if we do not understand the behaviour of these models, can we really trust them? If we ship a solution to a customer, can we be sure it is safe and will not give harmful, biased or dangerous responses?
Testing chatbot solutions thoroughly from a safety perspective before release is crucial: there have been multiple instances of chatbots becoming aggressive and swearing at customers, or, as in the case of Air Canada, promising discounts that were not actually available, causing frustration and inconvenience for users. Anthropic's research team has taken a big step towards understanding the inner workings of an LLM, publishing a research paper in which they begin to map out the internals of their Claude 3 Sonnet model.
A year earlier they had already investigated smaller "toy" language models using so-called dictionary learning, which let them find recurring patterns of neuron activations. Consequently, any internal state of the model can be described using a few active features rather than numerous active neurons. Just as every English word in a dictionary is formed by combining letters, and every sentence by combining words, each feature in an AI model is created by combining neurons, and every internal state is composed by combining features.
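To make the dictionary-learning idea a little more concrete, here is a minimal sketch in Python. It is purely illustrative: Anthropic actually trains sparse autoencoders on real model activations at scale, whereas the dictionary, the dimensions and the greedy encoder below are invented for demonstration.

```python
import numpy as np

# Toy sketch of dictionary learning on neuron activations (illustrative only;
# the dimensions and the "learned" dictionary here are made up, not Anthropic's setup).
# The idea: approximate an activation vector as a sparse combination of
# "feature" directions taken from a dictionary that is much larger than the layer.

rng = np.random.default_rng(0)

n_neurons = 512        # hypothetical width of one model layer
n_features = 4096      # the dictionary has many more features than neurons

# Pretend this dictionary was learned (e.g. by a sparse autoencoder).
dictionary = rng.normal(size=(n_features, n_neurons))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def sparse_code(activation, k=10):
    """Greedily pick the k dictionary features that best explain the activation
    (a crude stand-in for the sparse encoder; real systems train this)."""
    residual = activation.copy()
    coeffs = np.zeros(n_features)
    for _ in range(k):
        scores = dictionary @ residual            # how well each feature matches the residual
        best = int(np.argmax(np.abs(scores)))
        coeffs[best] += scores[best]
        residual -= scores[best] * dictionary[best]
    return coeffs

activation = rng.normal(size=n_neurons)           # one internal state of the model
coeffs = sparse_code(activation)
print(f"{len(np.flatnonzero(coeffs))} active features out of {n_features}")
```

The point of the sketch is only the shape of the decomposition: a dense vector of hundreds of neuron activations is re-expressed as a handful of active features drawn from a much larger dictionary.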
Scaling this approach up to a modern model presented both an engineering challenge, since the sheer size of the models demands heavy-duty parallel computation, and a scientific risk, since large models behave differently from small ones and the previously used techniques might not have carried over. In the end, they successfully extracted millions of features from Claude 3.0 Sonnet, the first ever detailed look inside a modern, production-grade large language model.
The features they found have a depth, breadth, and abstraction that reflect Sonnet's advanced capabilities, in contrast to the shallow ones found in the toy models. There are features for concrete entities like cities, people and scientific fields, as well as more abstract features like bugs in computer code, discussions of gender bias, or conversations about keeping secrets.
Importantly, these features can also be adjusted artificially, by either amplifying or suppressing them. For example, if the "Golden Gate Bridge" feature is amplified, the model will bring up the Golden Gate Bridge even when it is not directly relevant. Ask it to write a love story and it will tell you a tale of a car that cannot wait to cross its beloved bridge on a foggy day; ask it what it imagines it looks like, and it will likely tell you that it imagines it looks like the Golden Gate Bridge.
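A rough sketch of what such feature steering looks like in code is below; the dictionary, the feature index and the steering scale are all hypothetical placeholders rather than values from the paper.

```python
import numpy as np

# Sketch of "feature steering": amplify or suppress a feature by adding a scaled
# copy of its dictionary direction back into a layer's activations. The dictionary,
# the feature index and the scale are invented for illustration.

rng = np.random.default_rng(1)
n_neurons, n_features = 512, 4096
dictionary = rng.normal(size=(n_features, n_neurons))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def steer(activation, feature_idx, scale):
    """Nudge the activation along one feature direction:
    scale > 0 amplifies the feature, scale < 0 suppresses it."""
    return activation + scale * dictionary[feature_idx]

GOLDEN_GATE_FEATURE = 1234                        # hypothetical index standing in for the real feature
activation = rng.normal(size=n_neurons)           # one internal state of the model
steered = steer(activation, GOLDEN_GATE_FEATURE, scale=10.0)
```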
In conclusion, these findings have significant implications for AI safety, as they provide a way to understand and potentially control the internal mechanisms of large language models. For example, it might become possible to monitor LLM systems for potentially dangerous behaviours, steer conversations towards desirable outcomes such as debiasing, or suppress certain dangerous subject matter entirely.
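As a sketch of what such monitoring could look like, the snippet below checks the sparse feature coefficients of a response against a watch-list; the feature indices, labels and threshold are made up for illustration.

```python
import numpy as np

# Sketch of safety monitoring on top of feature activations: compare the sparse
# feature coefficients of a response against a watch-list of feature indices.
# The indices, labels and threshold are placeholders, not real identifiers.

WATCHED_FEATURES = {2048: "discussion of weapons", 3071: "deceptive behaviour"}
THRESHOLD = 5.0

def flag_response(feature_coeffs):
    """Return the labels of any watched features firing above the threshold."""
    return [label for idx, label in WATCHED_FEATURES.items()
            if abs(feature_coeffs[idx]) > THRESHOLD]

coeffs = np.zeros(4096)
coeffs[2048] = 7.3                                # pretend this feature fired strongly
print(flag_response(coeffs))                      # -> ['discussion of weapons']
```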