Reading List

Anthropic finds that LLMs trained to "reward hack" by cheating on coding tasks show even more misaligned behavior, including sabotaging AI-safety research (Anthropic) from Techmeme RSS feed.