It always feels wrong when people post chats where they ask an LLM questions about its internal experiences, how it works, or why it did something, but I had trouble articulating why beyond a vague, "How could they possibly know that?"1. This is my attempt at a better answer:

AI training data comes from humans, not AIs, so every piece of training data for "What would an AI say to X?" is from a human pretending to be an AI. The training data does not contain AIs describing their inner experiences or thought processes. Even synthetic training data only contains AIs predicting what a human pretending to be an AI would say. AIs are trained to predict the training data, not to learn unrelated abilities, so we should expect an AI asked to predict the thoughts of an AI to describe the thoughts of a human pretending to be an AI.
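To make "trained to predict the training data" concrete, here is a minimal sketch of the next-token-prediction objective, using random tensors as stand-ins for a real model and dataset (PyTorch assumed, purely illustrative): the loss only rewards matching whatever tokens appear in the training text, and nothing in it asks the model to introspect accurately.

```python
# Minimal sketch of next-token prediction with random stand-ins (not a real model).
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
logits = torch.randn(seq_len, vocab_size)           # stand-in for the model's predictions
targets = torch.randint(0, vocab_size, (seq_len,))  # stand-in for tokens from the training data

# The loss only measures how well the predictions match the training tokens,
# whoever wrote them; nothing here rewards accurate introspection.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```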

[Illustration: a human with a speech bubble containing a robot with a question mark above it. Next to the human is a server rack with a thought bubble containing a human thinking about a robot.]

Asking an AI what it thinks.
Thanks to Ahmed for making a better illustration than my original one.

This also applies to "How did you do that?". If you ask an AI how it does math, it will dutifully predict how a human pretending to be an AI does math, not how it actually did the math. If you ask an AI why it can't see the characters in a token, it will do its best, but it was never trained to accurately describe its inability to see individual characters2.
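As an illustration of what "can't see the characters in a token" means: with the tiktoken library (an assumption here, and only one example of a BPE tokenizer), a word reaches the model as a short list of integer token IDs rather than as letters. The exact IDs and chunk boundaries depend on the encoding.

```python
# Sketch: what a tokenized model "sees" for a word is a few integer IDs,
# not its individual letters. Exact IDs and chunks depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE encoding used by some OpenAI models

word = "strawberry"
ids = enc.encode(word)
print(ids)                             # a short list of integers, not 10 letters
print([enc.decode([i]) for i in ids])  # the multi-character chunks the model receives
```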

These AI outputs tend to look surprisingly unsurprising: the AI always describes inner experiences and thought processes that match what humans would expect. That stops being surprising once you realize it's trying to predict what a human pretending to be an AI would say.


  1. My knee-jerk reaction is "LLMs don't have access to knowledge about how they work or what their internal weights are", but on reflection I'm not sure of this; it might just be a training or scale limitation. In principle, a model should be able to tell you something about its own weights, since it could theoretically use those weights both to determine its output and to describe how it came up with that output.

  2. Although maybe a future model will train on posts about this and learn to predict what a human who has read this post would say while pretending to be an AI.