GLM-5.1 in Real Coding Work
In July last year, I tasked Moonshot AI’s Kimi K2 and Claude Sonnet 4 with a real-world coding challenge: generating Swift DSP filters and a Python tool that generates Swift unit tests for them. Both models failed, stalling on tolerances and unhandled phase jumps.
I care less about benchmarks than about whether a model actually works in practice. This test is non-trivial: it requires iterative refinement through a feedback loop against Python reference calculations. Based on my recent experience, I’d expect Claude Opus 4.6 and GPT-5.4 to handle it well (though I haven’t tested them). I originally planned to try Meta’s Muse Spark, but although benchmark results have been published, API access to the model is not yet widely available for independent testing. This is a bit reminiscent of Llama 4’s benchmark “optimizations.”
There’s another intriguing new entrant: GLM-5.1 from Z.ai, a massive 744B-parameter MoE model under MIT license, too large to run locally.
Time to rerun the experiment: I made two small changes, switching from xcodebuild to SPM via Package.swift and replacing Visual Studio Code plus Cline with OpenCode. Neither, however, should significantly alter the outcome.
GLM-5.1 generates the Swift code and the Python test tooling, then iteratively refines them.
After approximately 15 minutes, a key moment occurs: it recognizes that angle-aware comparisons are needed because phase wraps around at ±π. This was something I had to implement manually just last year (see How Moonshot AI Kimi K2 Performs in Real Coding Work). Furthermore, it identifies that the coefficients don’t match because of unity gain scaling, an issue I had also noticed during my original FIR filter app development in June last year (see Building a SwiftUI App With Claude Sonnet 4 and Gemini 2.5 Pro from Scratch).
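To illustrate why a plain numeric comparison breaks down near ±π, here is a minimal Python sketch of an angle-aware comparison. It is not the code the model produced; the function names `angle_diff` and `phases_close` and the default tolerance are my own assumptions for illustration.

```python
import math


def angle_diff(a: float, b: float) -> float:
    """Signed difference a - b in radians, wrapped into [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def phases_close(a: float, b: float, tol: float = 1e-6) -> bool:
    """True if two phase angles agree once wrap-around is taken into account.

    The name and the default tolerance are illustrative assumptions.
    """
    return abs(angle_diff(a, b)) <= tol


# Two phases just either side of the +/-pi boundary describe almost the
# same angle, although their raw values differ by nearly 2*pi.
just_below_pi = math.pi - 1e-9
just_above_minus_pi = -math.pi + 1e-9
naive_ok = abs(just_below_pi - just_above_minus_pi) <= 1e-6      # fails
wrapped_ok = phases_close(just_below_pi, just_above_minus_pi)    # passes
```

A test that compares raw phase values would flag the pair above as a mismatch, which is presumably the kind of "phase jump" that stalled the earlier models.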
It then fixes both issues and adapts the Swift routines accordingly.
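The unity gain mismatch can be sketched in a few lines as well. My assumption here is that one side normalized the taps so that the DC gain is exactly 1 (SciPy's `firwin` does this by default via `scale=True`) while the other side used the raw windowed-sinc values. The pure-Python functions below, `windowed_sinc_lowpass` and `normalize_dc_gain`, are hypothetical helpers written for this illustration, not the routines from the experiment.

```python
import math


def windowed_sinc_lowpass(num_taps: int, cutoff: float) -> list[float]:
    """Unscaled low-pass FIR taps: ideal sinc response times a Hamming window.

    cutoff is given as a fraction of the Nyquist frequency (0 < cutoff < 1).
    """
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response; the x == 0 case is the sinc limit.
        ideal = cutoff if x == 0 else math.sin(math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def normalize_dc_gain(taps: list[float]) -> list[float]:
    """Scale the taps so they sum to 1, i.e. unity gain at 0 Hz."""
    total = sum(taps)
    return [t / total for t in taps]


raw = windowed_sinc_lowpass(31, 0.25)
scaled = normalize_dc_gain(raw)
```

Comparing `raw` against a normalized reference tap by tap fails even with a sensible tolerance, because every coefficient is off by the same constant factor. Once both sides apply the same scaling, the comparison becomes meaningful.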
And the result?
After 22 minutes, it has achieved its goal, at a cost of only $1.52.
Since it also created the unit tests on its own, I did one additional quick manual check:
Obviously the filter coefficients match.
So to sum it up:
- It built the Swift filter routines.
- It created Python routines that generated Swift tests.
- The newly created Swift FIR filter routines are now benchmarked against Python’s proven SciPy calculations.
- This iterative process, identifying problems, then resolving them automatically, forms a self-correcting feedback loop with zero manual intervention on my part.
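The "Python generates Swift tests" step can be pictured roughly as follows. This is a hedged sketch, not the generator GLM-5.1 wrote: `render_swift_test`, the module name `FilterKit`, and the Swift function `designCoefficients()` are all hypothetical, and the reference values are passed in as a plain list where the real setup would take them from SciPy.

```python
def render_swift_test(case_name: str, reference: list[float],
                      accuracy: float = 1e-9) -> str:
    """Render an XCTest source file that checks coefficients against reference values.

    Assumes all reference values are finite, so repr() yields valid Swift literals.
    """
    expected = ", ".join(repr(float(v)) for v in reference)
    return f"""import XCTest
@testable import FilterKit  // hypothetical module name

final class {case_name}Tests: XCTestCase {{
    func testCoefficientsMatchReference() {{
        let expected: [Double] = [{expected}]
        let actual = designCoefficients()  // hypothetical function under test
        XCTAssertEqual(actual.count, expected.count)
        for (a, e) in zip(actual, expected) {{
            XCTAssertEqual(a, e, accuracy: {accuracy})
        }}
    }}
}}
"""


# In the experiment the reference would come from SciPy; here it is hard-coded.
swift_source = render_swift_test("LowPass", [0.25, 0.5, 0.25])
```

Writing `swift_source` into the package's test directory and running `swift test` would then close the loop: a failing assertion tells the model which coefficient drifted and by how much.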
For me, this means I now need to devise a new test case that current models cannot yet handle, so I can continue measuring progress.
As long as the priority is clearly on using the best available model and cost remains secondary, Anthropic and OpenAI are naturally the go-to options. But open-weight models are significantly more cost-effective and can be run by any operator of one’s choice, independent of the original creator. That makes me question what the long-term differentiating factors will be, especially since switching between models is so easy. Will it then be just the infrastructure or enterprise contracts that matter?