I could definitely see that being the case in this so called "white genocide" thing on grok, but I still have to wonder in general.
For instance with the Chinese models refusing to acknowledge Tienanmen square (as an example). I wonder about the ability to determine if such a bias is inherent in the data of the model, and what tools might exist to analyze models to determine how their training data might lead to some intentional influence on what the LLM might output.
I'm not an LLM expert (and never will be), so I'm hoping someone with deeper knowledge can shed some light...
With most Chinese models, you can run them locally.
You can then specifically prompt the model to do a CoT before answering (or refusing to answer) the question about e.g. Tiananmen. In my experiments, both QwQ and DeepSeek will exhibit awareness of the 1989 events in their CoT, but will specifically exclude it from their final answer on the basis that it is controversial and restricted in China.
It gets even funnier if you do multi-turn, and on the next turn, point out to the model that you can see its CoT, and therefore what it thought about Tiananmen. They are still finetuned into doing CoT regardless and just can't stop "thinking about the white elephant" while refusing to acknowledge it in more and more panicked ways.
For instance with the Chinese models refusing to acknowledge Tienanmen square (as an example). I wonder about the ability to determine if such a bias is inherent in the data of the model, and what tools might exist to analyze models to determine how their training data might lead to some intentional influence on what the LLM might output.
I'm not an LLM expert (and never will be), so I'm hoping someone with deeper knowledge can shed some light...