Show HN: ChatGPT's Infinite Memory Lies – Anthropic Paper Explainer

3 points by marima2 2 days ago

I wrote a breakdown of Anthropic’s new paper on model honesty. It shows how language models frequently give misleading chain-of-thought explanations, even when trained for safety.

The essay includes visual diagrams, code examples, commentary on reward hacking, and a discussion of the implications for model alignment.

• Full explainer: https://open.substack.com/pub/marcovcsiliconvalley/p/chatgpt...
• Anthropic paper (PDF): https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
• Anthropic blog post: https://www.anthropic.com/research/reasoning-models-dont-say...

Would love feedback and discussion.

ungreased0675 a day ago

It’s wild to me that LLM research treats these models like a black box. We have to infer what they are doing? There’s no way to know what they’re actually doing?

  • mutant a day ago

    I believe that’s why it’s called the hidden layer