Twelve Ways to Be Wrong About AI-Assisted Coding

THIRD-BIT.COM

As the cost of AI goes up, managers are going to be under pressure to demonstrate that it is delivering genuine value. So how do you demonstrate that the thousands of dollars your team is spending on Claude Code each month is worth it?

Tricky.

You could count the lines of code being written (by AI), although we all know that it writes code extremely fast but doesn’t necessarily write the correct code. You could ask the dev team if they are feeling more productive, but people are terrible at self-reflection. You could measure adoption rate, commits, or token burn. There are so many options.

This blog post cites various studies in order to tell you that all of these methods are wrong!

In all honesty, this is nothing new. We all know that measuring velocity (and quality) is hard. However, it hasn’t stopped people reporting these simplistic and very much flawed metrics as a measure of their success with AI.

So how do you measure ROI? This blog post doesn’t tell you the answer, but it does point out the pitfalls, which is a start.

My 2 cents on the topic, very briefly: you need to measure value delivered, not just some detail-level metric. Value delivered is working software in end users’ hands. It is an acceleration of value delivery that you need to demonstrate.
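
As a rough illustration of what measuring an acceleration in value delivery could look like, the sketch below compares the median lead time (idea logged to working software in users' hands) before and after a team adopts an AI tool. The `Feature` record, its field names, and the sample dates are all hypothetical. Real data would come from your own issue tracker and release log, and a fair comparison would also need to account for feature size and quality.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import median


@dataclass
class Feature:
    """Hypothetical record of one feature that reached end users."""
    name: str
    idea_logged: date  # when the idea was first recorded
    released: date     # when working software reached users


def median_lead_time_days(features: list[Feature]) -> float:
    """Median number of days from idea to release."""
    return median([(f.released - f.idea_logged).days for f in features])


def acceleration(before: list[Feature], after: list[Feature]) -> float:
    """Ratio of old to new median lead time; above 1.0 means faster delivery."""
    return median_lead_time_days(before) / median_lead_time_days(after)


def make_feature(name: str, start: date, lead_days: int) -> Feature:
    """Helper for building made-up sample data."""
    return Feature(name, start, start + timedelta(days=lead_days))


# Made-up sample data, for illustration only.
before_ai = [
    make_feature("export", date(2024, 1, 1), 30),
    make_feature("search", date(2024, 2, 1), 40),
    make_feature("billing", date(2024, 3, 1), 50),
]
after_ai = [
    make_feature("reports", date(2025, 1, 1), 10),
    make_feature("alerts", date(2025, 2, 1), 20),
    make_feature("sso", date(2025, 3, 1), 30),
]

print(f"Delivery is {acceleration(before_ai, after_ai):.1f}x faster")
```

The median is used rather than the mean so that one unusually slow feature does not dominate the result. This still says nothing about whether the features delivered were the right ones, which is the harder part of the question.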

DeepSWE

DATACURVE.AI

Benchmarks are an important part of the overall AI ecosystem, allowing you to compare models across a wide range of capabilities, including issue resolution (SWEBench) and terminal usage (TerminalBench). You can even measure a model’s overall economic value (GDPval). However, most of the programming-related benchmarks have their flaws: they often lack complexity and use code that is within the model’s training dataset, and model creators can ‘benchmark max’ (optimise directly for the benchmark) as part of their training.

DeepSWE is a new benchmark designed to measure how well AI coding agents perform on realistic, long-horizon software engineering tasks. It uses original tasks written from scratch, reducing the risk that models are simply recalling solutions from training data. The tasks span 91 active open-source repositories across TypeScript, Go, Python, JavaScript and Rust, and are intentionally framed with short, natural prompts that mirror how developers actually ask agents for help.

This benchmark aims to evaluate the full breadth of capabilities that make a model useful. It tests whether agents can explore unfamiliar codebases, infer the right implementation approach, make multi-file changes, preserve existing behaviour, and satisfy behavioural verifiers.

The headline finding is that frontier models separate much more clearly on these harder, more realistic tasks. Community sentiment seems to indicate that this clearer separation is a better reflection of real-world experience.

The Eternal Sloptember

GITHUB.IO

“I’m calling it now, the adoption of AI agents into software development will be one of the most costly mistakes in the field’s history.”

Ouch!

The opening to this blog post makes it sound like an anti-AI rant and I probably wouldn’t have read further. However, the author is George Hotz, who is quite a notable figure, with a pretty long Wikipedia page.

George was initially a skeptic, but gave AI coding a try for a full 6 months and always felt he could do it faster manually. However, George is clearly a highly skilled and experienced programmer. He is probably not the best subject for evaluating the industry-wide impact of this technology.

He proceeds to write:

“Is it a software engineer? Not close to the bar at any company I have worked at”

I’d agree with that assessment: it isn’t a software engineer. It is an incredibly powerful tool that can superpower a software engineer, but it isn’t a replacement for the human being.

“Agents will end up hurting large organizations more than high performing individuals or small orgs.”

George proceeds to describe how large organisations, which have slow feedback loops, will fail to spot the ‘slop’ that AI creates in their blind pursuit of speed. And again, I don’t disagree.

However, I disagree with the underlying sentiment that pursuing this technology is a massive mistake for the industry as a whole. Used wisely, it has a very positive impact on your average developer (and there is nothing wrong with being average). However, it does have some significant limitations that will bite you if you are not careful. That doesn’t mean it lacks value. Far from it. When the car was introduced well over a hundred years ago, I am sure it took people quite a while to learn how to control a vehicle that allowed them to move at a far greater speed.

Is this sustainable?

JAMIEHURST.CO.UK

In this post Jamie shares his experiences from the last few years, which have seen his job in developer experience change radically.

One of the biggest changes is the shortened time from idea to creation, with month-long processes involving mock-ups and slide decks being replaced by POCs that are created almost instantly.

However, it isn’t all upside:

“The cost of building has collapsed, but the cost of aligning organisationally has not.”

He notes that engineers who have adapted their way of working, and are much more productive, have greater organisational influence. But they also get overloaded, as demands on their time increase.

He makes an interesting reflection that while he can now switch language quickly, or keep up with language or framework advances, this has also led to a compromise:

“the ability to hold strong opinions across multiple domains, has narrowed”

More critically, he reflected that “thinking” time has gone, and this is the root of the sustainability challenge.

Some interesting reflections on our evolving trade.