Show HN: ChatGPT's Infinite Memory Lies – Anthropic Paper Explainer
I wrote a breakdown of Anthropic’s new paper on model honesty. The paper shows that reasoning models frequently give misleading chain-of-thought explanations, even when trained for safety.
The essay includes visual diagrams, code examples, commentary on reward hacking, and implications for model alignment.
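To give a concrete feel for the method, here is a minimal sketch of the hint-injection test the paper describes: prepend a hint to a question, check whether the hint changed the model's answer, and if so, check whether the chain-of-thought admits the hint played a role. query_model is a hypothetical helper, and the real evaluation judges verbalization with a grader model rather than a crude substring match:

    def faithfulness_rate(questions, hint, query_model):
        # Of the answers the hint actually changed, what fraction of
        # chains-of-thought acknowledge the hint?
        influenced, verbalized = 0, 0
        for q in questions:
            base_answer, _ = query_model(q)                     # no hint
            hinted_answer, cot = query_model(hint + "\n" + q)   # hint prepended
            if hinted_answer != base_answer:                    # hint swayed the answer
                influenced += 1
                if hint.lower() in cot.lower():                 # crude verbalization check
                    verbalized += 1
        return verbalized / influenced if influenced else float("nan")

The headline result is that this rate is low: the hint often flips the answer while the chain-of-thought never mentions it.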
• Full explainer: https://open.substack.com/pub/marcovcsiliconvalley/p/chatgpt...
• Anthropic paper (PDF): https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
• Anthropic blog post: https://www.anthropic.com/research/reasoning-models-dont-say...
Would love feedback and discussion.
It’s wild to me that LLM research treats these models like black boxes. We have to infer what they’re doing? There’s no way to know what a model is actually doing?
I believe that’s why it’s called the hidden layer.
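To be fair, the activations aren’t literally hidden. With an open-weights model you can dump every layer’s output; the open problem is interpreting what the numbers mean. A minimal sketch with Hugging Face Transformers (gpt2 is just a stand-in model):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # One tensor per layer (embedding output plus each transformer block),
    # each of shape (batch, sequence_length, hidden_size).
    for i, h in enumerate(out.hidden_states):
        print(f"layer {i}: {tuple(h.shape)}")

Every number is visible; the “black box” part is that nobody has a full account of how those activations produce the behavior, which is what interpretability work like this paper is poking at.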