Introducing Claude Sonnet 4.5

ANTHROPIC.COM

Benchmarks aside, Claude Sonnet seems to have been the model of choice for most software developers for the last few months. However, that started to change a couple of months back with the release of GPT-5, which demonstrated a significant step forward in coding capability, especially on complex tasks.

Anthropic have now released Sonnet 4.5, once again reasserting their lead on the coding benchmarks, topping the popular SWE-Bench, which evaluates model performance on “real world” programming tasks. They also demonstrate leading performance on other benchmarks including maths, finance and “computer use”.

[Figure: Claude Sonnet performance]

For AI coding, benchmarks aren’t everything: they cannot capture the full breadth of tasks that we want these tools to solve, or the overall user experience. However, Sonnet 4.5 has been well received by early adopters. Simon Willison considers Anthropic’s claim that this is the “best coding model in the world” to be quite justified. Others have noted that GPT-5 still leads in “deep reasoning” on tough, long-context problems.

The team behind Devin, a market-leading AI coding agent, have switched to Sonnet 4.5, revealing an interesting detail: Sonnet is aware of the length limitations of its own context window and actively manages them. However, this can lead to “context anxiety” - a new entry into our growing list of terminology!

Another interesting detail in the release announcement is their claim that Sonnet 4.5 ran for ~30 hours autonomously to build a Slack-like app (~11k LOC). This far exceeds prior run-length reports for competing models, and looks like a new metric that will be the subject of intense competition.

How Claude Code is built

PRAGMATICENGINEER.COM

On a related note, in this post Gergely interviewed a couple of Anthropic engineers to find out a little more about the back-story of Claude Code.

It grew from a simple terminal prototype that could interact with the filesystem, autonomously exploring codebases in order to answer questions. This quickly transitioned into an internal tool that was widely adopted across the business.

The tool architecture is minimal: the model handles most of the logic (UI, file traversal, tool use), and the client layer stays lightweight. This is a common pattern with LLM-powered applications.

One significant challenge is that the tool intentionally runs locally, not in a sandbox or VM. A lot of thought has gone into creating a multi-tiered (project / user / company) permissions system, which ultimately keeps a human in the loop: permissions are granted interactively by the user before changes are made.
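To make the tiered permissions idea concrete, here is a minimal sketch in Python. It is not Claude Code's actual implementation: the tier names come from the post, but the rule format, the `is_allowed` function and the "any deny wins, otherwise ask the user" precedence are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Tier names taken from the post; how they combine is an assumption.
TIERS = ("company", "user", "project")


def is_allowed(
    action: str,
    rules: Dict[str, Dict[str, str]],
    ask_user: Callable[[str], bool],
) -> bool:
    """Decide whether `action` may run (hypothetical helper).

    `rules` maps a tier name to {action: "allow" | "deny"}.
    Assumed precedence: a "deny" in any tier wins, then an "allow"
    in any tier, and if no tier has an opinion the human is asked.
    """
    decisions = [rules.get(tier, {}).get(action) for tier in TIERS]
    if "deny" in decisions:
        return False
    if "allow" in decisions:
        return True
    # Human-in-the-loop fallback: nothing was pre-approved.
    return bool(ask_user(action))


if __name__ == "__main__":
    rules = {
        "company": {"delete files": "deny"},
        "project": {"edit files": "allow"},
    }
    prompt = lambda action: input(f"Allow '{action}'? [y/N] ").lower() == "y"
    print(is_allowed("edit files", rules, prompt))
```

The point of the sketch is the fallback: anything not explicitly covered by a tier ends up as an interactive prompt rather than a silent default.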

Some interesting insights, and it is no great surprise that Anthropic are dog-fooding their own tools. They are also reporting a high level of success (fast releases, lots of AI-generated code). Given that Claude Code is relatively minimal, I wouldn’t get too carried away with the metrics they quote!

The RAG Obituary: Killed by Agents, Buried by Context Windows

NICOLASBUSTAMANTE.COM

While the title of this article is a bit clickbait (“X is dead!!!”), it does tell an interesting story, and is a good illustration of how fast-moving this technology is.

Retrieval-Augmented Generation (or RAG) is a technique that emerged a few years back as a way to manage the small context windows of the leading LLMs of the time, which were only able to encode one or two pages of text. Since then, context window sizes have increased dramatically, with models able to process hundreds of pages, which the author claims has made RAG redundant.

This isn’t entirely true. While LLMs can process very large documents, RAG can still be useful when finding information across multiple documents, e.g. enterprise-wide document search.

However, once again echoing the Claude-related posts above, this is where Agentic AI systems provide an alternative. Rather than using RAG to create a ‘map’ of your documents as vector embeddings, an Agentic AI system can crawl your internal document store, much as a human would, in order to find information.
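For readers who haven't seen RAG up close, the retrieval step can be sketched in a few lines of Python. This is a toy illustration only: the `embed` function below uses simple word counts as a stand-in for a real learned embedding model, and the function names and sample documents are made up for the example.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    # Rank every chunk against the query and keep the top k.
    # In a RAG pipeline these chunks are then pasted into the prompt.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    docs = [
        "holiday policy allows 25 days leave",
        "expense claims need a receipt",
        "the office dog is called Rex",
    ]
    print(retrieve("how many days holiday leave", docs, k=1))
```

An agentic system skips the pre-built index entirely and instead issues searches and opens files step by step, deciding what to read next based on what it has found so far.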

I don’t think RAG is dead, but its usefulness has become more niche.

Announcing fossabot: AI Agent for Strategic Dependency Updates

FOSSA.COM

Software dependencies have become increasingly complex, with a trend towards having a high number of small dependencies. This makes managing dependency updates quite challenging.

Tools like dependabot, which are supposed to help manage this by bringing upstream changes into your repository, sound like a good idea but in practice result in a lot of noise and work for the maintainers.

This new agent looks like a promising tool to alleviate this burden. The agent performs an analysis of the upstream change and the impact it has on your project. This context-specific information should help determine whether it is worth spending the effort on the update.
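As a rough illustration of what "impact on your project" could mean, the Python sketch below checks whether a project actually uses any of the names changed in an upstream release. This is not how fossabot works (the post doesn't describe its internals); the `leftpad` package, the function names and the list of changed names are all hypothetical.

```python
import ast
from typing import Iterable, Set


def used_names(source: str, package: str) -> Set[str]:
    """Names the given source code uses from `package` (hypothetical helper)."""
    tree = ast.parse(source)
    aliases: Set[str] = set()
    used: Set[str] = set()
    # First pass: find how the package is imported.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == package:
                    aliases.add(alias.asname or alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module == package:
            used.update(alias.name for alias in node.names)
    # Second pass: find attribute access on the imported module, e.g. lp.trim
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            used.add(node.attr)
    return used


def affected(source: str, package: str, changed: Iterable[str]) -> Set[str]:
    # Only the changed names that the project really touches matter.
    return used_names(source, package) & set(changed)


if __name__ == "__main__":
    project_code = "import leftpad as lp\nfrom leftpad import pad\nlp.trim('x')\n"
    print(affected(project_code, "leftpad", {"trim", "center"}))
```

If the intersection is empty, the update is probably low risk for this project; if not, it tells the maintainer exactly where to look.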

Ultimately this could help reduce supply chain attacks, which often rely on the complacency that this noise creates.