Coding agents. Competition at work.

2025-09-30

Anthropic released Claude Sonnet 4.5 this week, claiming that agents can run ~30 hours straight. In their demo, they built a complete chat app in one long session.

While impressive, the real test is day-to-day work. Can it maintain context over a production repository? Can changes be applied while respecting instructions and guardrails to reduce the back-and-forth?

We spent the last month leveraging Codex (GPT-5) because it's been consistent for iterative edits across larger codebases, reduced hallucinations, and thoughtful consultation and troubleshooting.

Can the new Sonnet 4.5 supersede this benchmark and win back users?

For context, Opus 4.1 was Anthropic's previous heavy model, while Sonnet 4.5 is the newer release positioned as their best for coding, agent work, and upgraded "computer use" tools.

Which matters more in your workflow: Endurance or precision? We'd like both.

The basics, explained