Code w/ Claude

CLAUDE.COM

A couple of days ago Anthropic held its Code w/ Claude event in San Francisco, with future events planned for London and Tokyo. Anthropic used this event for a business update (an 80-fold growth in recurring revenues!) and various product launches. Notably, there were no significant model launches or updates, reflecting the recent move towards investment in the harness that surrounds the model - which is where capability gains are being made. Anthropic are certainly a leader here: Claude Code was the first agentic harness from a model lab.


There were some updates to Claude Code announced, including dreaming and outcomes for managed agents (agents that execute in remote environments rather than on the developer's machine). The ‘dreaming’ feature is the one I am most interested in:

“Dreaming is a scheduled process that reviews your agent sessions and memory stores, extracts patterns, and curates memories so your agents improve over time.”

A few months ago, while experimenting with agentic loops, I discovered that you can create a self-improving agent by asking it to reflect on its thought process and add instructions that help optimise future runs. Great to see this becoming a first-class feature of the product.
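For illustration, here's roughly the shape of that loop - a minimal sketch rather than Anthropic's implementation, where `run_agent` is a stand-in for whatever harness you use and the memory-file format is my own invention:

```python
from pathlib import Path

MEMORY_FILE = Path("AGENT_MEMORY.md")  # hypothetical: curated instructions injected into every run

def run_agent(prompt: str) -> str:
    """Stand-in for your agent harness (Claude Code, a raw API loop, etc.)."""
    raise NotImplementedError

def self_improving_run(task: str) -> str:
    memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

    # 1. Execute the task with the accumulated instructions prepended.
    transcript = run_agent(f"{memory}\n\nTask: {task}")

    # 2. Ask the agent to reflect on its own transcript and distil
    #    durable lessons that would improve future runs.
    reflection = run_agent(
        "Review the session transcript below. Extract any general lessons "
        "(mistakes to avoid, approaches that worked) as concise instructions "
        "for future runs. Return only the instructions.\n\n" + transcript
    )

    # 3. Append the lessons to the memory file - the curation step that
    #    'dreaming' now runs for you, on a schedule.
    with MEMORY_FILE.open("a") as f:
        f.write("\n" + reflection)

    return transcript
```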

On that note, a number of the features that Anthropic announced are ideas that people have been exploring within open source harnesses (Gas Town and the like). This further reinforces my view that these harnesses are good for inspiration, but there is little value in fully adopting them: they will become redundant in just a few months as the model labs subsume their ideas.

Other news from the event: a deal with SpaceX gives Anthropic access to their datacentres, immediately addressing some of their recent capacity issues; and Boris (the creator of Claude Code) doesn't like the term ‘vibe coding’ and is open to suggestions!

Event: Software Engineering for the AI Age

BEYONDTHEHYPE.EVENTS

And on the subject of events, I'm organising a gathering in London on 30th June. It's a mini-conference-style event where we'll be discussing the impact this technology is having on people and their skills, on tools and the SDLC, and on how all of this is shaping team topologies.

If you like this newsletter content, I’m sure you’ll enjoy the event! Sign up if you’re interested.

Podcast: Building Pi, and what makes self-modifying software so fascinating

PRAGMATICENGINEER.COM

This is a really enjoyable podcast: a conversation between Gergely (The Pragmatic Engineer), Mario Zechner, the creator of Pi, and Armin Ronacher, the creator of Flask (and a long-time user of Pi), where they chat about the broad impact that AI is having on the software industry. I would consider Mario and Armin to be cautious yet optimistic in their views, and as a result they have a lot of wisdom to share. I found myself nodding along to their conversation and making quite a few notes.

A key theme of their conversation was that while agents increase output, they also increase complexity (if left unchecked), partly because agents “don't feel pain”. Not physical pain of course, but the pain developers experience when a codebase gets out of hand - the growing tech-debt burden. An experienced engineer makes decisions that try to minimise long-term pain, or at least strike the right balance between shipping features and future pain.

They also note that an important skill any senior engineer learns is when to say “no” - again, something no AI agent will do. This touches on the related topic of learning: agents don't learn, or at least don't acquire knowledge and experience in the same way that a human being does.

I really liked this conversation; it was balanced and measured, and a good antidote to the current “move faster with armies of agents” mindset.

Agent Skills

ADDYOSMANI.COM

Agent Skills is another framework for steering AI agents to create better quality software. As the opening line puts it, “the default behaviour of any AI coding agent is to take the shortest path to ‘done’”. Very true.

The post starts with one of the best descriptions of what skills are that I have read, highlighting the difference between skills and reference documents, which are easily confused.

“Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.”

The post also describes the principles and expertise that Addy has ‘baked into’ the ~20 skills within this framework. They are “saturated with practices from Software Engineering at Google and Google's public engineering culture.”
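To make that distinction concrete, here is how I picture a skill differing from a reference document - a named workflow of gates the harness forces the agent through. The structure below is purely my own illustration in Python, not the framework's actual file format:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """A skill as a workflow, not a reference document: ordered
    gates the agent must pass before declaring a task done."""
    name: str
    trigger: str        # when the harness should activate this skill
    phases: list[str]   # the gates, in order

code_review = Skill(
    name="code-review",
    trigger="before any change is declared 'done'",
    phases=[
        "Re-read the diff as if you did not write it.",
        "Run the existing tests; add coverage for the new behaviour.",
        "Consider the failure modes this class of change usually causes.",
        "Only then report the task complete.",
    ],
)

def apply_skill(skill: Skill, task: str) -> str:
    """Inject the skill's phases into the prompt, so the shortest
    path to 'done' now runs through every gate."""
    steps = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(skill.phases))
    return f"Task: {task}\n\nBefore finishing ({skill.trigger}):\n{steps}"
```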

It all sounds really good, but … does it actually work? And if it works on your codebase, will it work on mine? 🤷

That's one of the fundamental problems with all of these prompting frameworks: it is so hard to prove that they actually work. Furthermore, they are quite fragile; as the capability of foundation / frontier models shifts underneath them, they will grow old and redundant incredibly fast.

However, while I’m not sure I’d bother using the ‘Agent Skills’ framework itself, this blog post provides some absolutely top-notch advice about what you should be considering when creating your own harnesses.

Vibe coding and agentic engineering are getting closer than I’d like

SIMONWILLISON.NET

With vibe coding, if you follow the strictest definition, you don't look at the code at all - you prompt the model entirely based on your experience as a user of the system or app you are developing. With agentic engineering, by contrast, you lean heavily on AI but still review the code and exercise some architectural or design oversight.

In practice, however, while Simon considers himself an agentic engineer, his growing trust in the models means he is giving them a lot more latitude and is ultimately reviewing less of the code. This has caused him to reflect on how you effectively evaluate software (other than via code review), on where the new bottlenecks are, and on his future career.

ProgramBench: Can Language Models Rebuild Programs From Scratch?

ARXIV.ORG

We’re increasingly relying on AI agents for more substantial software tasks, and as a result, early benchmarks such as SWE-Bench (which challenges agents to fix issues from GitHub) feel insufficient. This benchmark tests whether an AI agent can clone entire applications, a much more challenging task!

This benchmark takes an app, then uses an AI agent to generate behavioural tests, ensuring sufficient code coverage. These behavioural tests are then used to support a ‘clean room’ re-implementation of the app by another agent. The benchmark comprises ~200 apps (including FFmpeg, SQLite), with the results indicating that none of the models under test (Opus, GPT 5.4, …) were able to fully clone the original application. They also note that the models tend to create monolithic solutions of questionable quality.
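To make the setup concrete, here is the shape of the pipeline as I read it - the decomposition into two agent callables is my own framing for illustration, not code from the paper:

```python
from typing import Callable

COVERAGE_THRESHOLD = 0.9  # illustrative; the paper sets its own bar

def program_bench(
    original_repo: str,
    test_writer: Callable[[str, list[str]], list[str]],  # agent 1: writes tests
    builder: Callable[[list[str]], str],                 # agent 2: clean-room clone
    coverage: Callable[[str, list[str]], float],
    pass_rate: Callable[[str, list[str]], float],
) -> float:
    # Phase 1: one agent explores the original app and writes
    # behavioural tests, iterating until coverage is sufficient.
    tests: list[str] = []
    while coverage(original_repo, tests) < COVERAGE_THRESHOLD:
        tests += test_writer(original_repo, tests)

    # Phase 2: a second agent re-implements the app 'clean room',
    # seeing only the behavioural tests, never the original code.
    clone_repo = builder(tests)

    # Score: how much of the original behaviour the clone reproduces.
    return pass_rate(clone_repo, tests)
```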

I think this is an interesting experiment, but I wouldn't read too much into it - and the results are hardly surprising. If you were going to use an AI agent to clone an application, you'd invest time in creating a suitable (and app-specific) harness, and you'd almost certainly look at the code and guide the design / architecture - although, if you were feeling ambitious, that could be encoded into your harness too.