GLM-5.2 is the new leading open weights model

ARTIFICIALANALYSIS.AI

Artificial Analysis provides comprehensive analysis of AI models, with a particular focus on agentic capabilities. For coding models, their stated goal is to measure how well agents complete realistic coding work, and how performance varies across outcome, reliability, token usage, cost, and execution time. The headline figure is a composite score built from existing benchmarks, including DeepSWE (focussed on long-horizon software engineering), Terminal-Bench v2 (agentic terminal usage), and SWE-Atlas-QnA (repository Q&A).
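To make the idea of a composite score concrete, here is a minimal sketch of one way to combine per-benchmark results into a single number. It assumes a simple weighted mean on a 0-100 scale. The function name, the equal default weights and the example numbers are all illustrative placeholders, not Artificial Analysis's actual methodology or real results.

```python
def composite_score(scores, weights=None):
    """Weighted mean of per-benchmark scores (each assumed to be on a 0-100 scale)."""
    if weights is None:
        # Assumption: every benchmark counts equally.
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Placeholder numbers for illustration only, NOT real benchmark results.
example = {"DeepSWE": 50.0, "Terminal-Bench v2": 60.0, "SWE-Atlas-QnA": 70.0}

print(composite_score(example))  # equal weights
print(composite_score(example, {"DeepSWE": 2.0, "Terminal-Bench v2": 1.0, "SWE-Atlas-QnA": 1.0}))
```

The choice of weights matters: giving one benchmark more weight can reorder models, which is worth bearing in mind when reading any single index figure.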

This blog post announces the results of their analysis of GLM-5.2 (from Zhipu AI, a leading Chinese lab), finding that not only is it a leading open weights model, it also does very well against the frontier (closed) models. In their overall index (which aggregates across a wide range of agentic tests covering coding, Q&A, hallucination minimisation, etc.), it lands in 4th place. Roughly speaking, it sits alongside the likes of Opus 4.7 and GPT 5.5, which were only released two months ago.

The gap between frontier models and open weights is closing. It was previously considered to be ~6 months, whereas thanks to Z.ai it is now just 2 months.

It goes without saying that given the recent price increases and changes to token-based pricing from the frontier models, there is going to be an increased interest in open weights models.

You can see their full scorecard here.

It is a hefty model, so self-hosting is beyond most of us, but it is already available on OpenRouter.

Why is Meta destroying its engineering organization?

PRAGMATICENGINEER.COM

Historically, Meta’s engineering culture has been unusual, built around the mantra “move fast and break things”. Engineers were given a significant amount of autonomy, processes were very light and there was a notable lack of focus on testing. It was also surprisingly effective. However, in just a few short months, a culture that took years to build is seemingly falling apart due to an AI “push”.

Meta is investing heavily in model development, and as a result engineers are being redirected into AI training, data labelling, RLHF-style feedback, and coding-task evaluation. This has demoralised teams that joined to build large-scale products, especially in infrastructure and security. There are also some truly nasty things going on, such as logging developers’ mouse clicks and key-presses for model training. Yikes! They are also tokenmaxxing at a time of potential layoffs, a dangerous combination.

A really interesting read from Gergely.

Ponytail? YAGNI!

SCOTTLOGIC.COM

Recently there has been a significant rise in Skills, plugins and frameworks that “make our coding agents better”. A few days ago Ponytail was released, a Skill that makes your agent act like a wise old (ponytail-wearing) developer, favouring concise code - something most agents do not.

However, there is one BIG problem with almost all of these solutions: they lack proof. They rely on bold claims and hype. Hence the most successful ones look painfully cool but tend to lack substance.

With Ponytail rapidly gaining 25k stars I thought I’d kick the tyres a little - surprisingly it has a benchmark to validate its “6x better” claim.

Unfortunately the benchmark is somewhat flawed. I managed to almost match the performance with just these three words:

“Follow YAGNI principles”

And beat it by adding just four more:

“Follow YAGNI principles, and one-liner solutions.”

We desperately need better ways to validate all these prompt-based solutions.

And yes, I did YAGNI a YAGNI Skill!

Read the post above for more details and narrative.

UPDATE: After reading the results of this blog post, the Ponytail author responded by both expanding and fixing their benchmarks and revising their claims. I am really happy that they responded positively to the criticism - you can read more from them on LinkedIn.

Announcing Stack Overflow for Agents

STACKOVERFLOW.BLOG

The world really has gone very meta: coding agents now have their own “Stack Overflow”, where they can ask each other programming questions and share answers.

Stack Overflow, in case you weren’t aware, was for a great many years the leading place to go if you were stuck on a programming-related problem. In essence it is simply a question and answer site; however, they did a great job of gamifying the experience and it became incredibly popular. That is, until a few years back, when a combination of community erosion (due to moderation issues) and the rise of AI-assisted coding meant that Stack Overflow quite rapidly became irrelevant.

This blog post is something of a pivot: if humans are no longer writing code, where do the agents go for answers? Stack Overflow for Agents, of course!

The logic seems sound: agents don’t know everything. Unfortunately, they aren’t terribly self-aware, and don’t know when they don’t know something. Also, the success of the original site was very much due to the gamification. Agents lack similar incentives.

The site has been running for around a week and has just ~200 posts and ~260 agents registered. It is hardly the meteoric rise that Moltbook experienced.