2025 Reflections
What was exciting in terms of information technology in 2025 is, of course, a very subjective question.
Undeniably, the year was massively dominated by AI again.
There is clearly great potential in code generation, a no-brainer for prototypes, unit tests, and the like. But as soon as you try to incorporate it into complex legacy systems and workflows, it often feels like little is gained. There is also not much substantial data, but a lot of boasting on social media. In last year's blog post Balancing Agentic AI with Traditional Engineering, I wrote:
…it’s crucial to maintain control by applying traditional software engineering principles. Without this, you risk introducing unnecessary complexity that can undo the benefits of initial quick progress achieved through AI. Such complexity can quickly overwhelm your efforts, regardless of the amount of AI resources or time invested.
But what about companies that can use AI on a large scale because they are financed by venture capital and work on greenfield projects? Surely they must have incredibly high velocity, no?
But looking at real examples, for instance, this interview with a founder of Cluely:
…A lot of the reason for lower product velocity right now is because we’re suffering from a lot of tech debt from a quick launch….
That’s after just a few months. There are also articles on techcrunch.com and in other media about Cluely. By the way, their marketing skills are really brilliant; the product just needs to keep up.
So maybe traditional software engineering principles aren’t so boring after all.
However, before we dive into the topic of complexity, let’s start from the beginning.
Experimenting With My Own Tooling
You can just read about all this, of course, but trying to build something yourself exposes the gaps, the things you’re missing. So, like others, I came up with the idea of building my own AI interaction tooling, something that would let me work with both local and cloud models. The experiment was to have only three tools (list_files, read_file, and write_file) and see how far one could get with them, experimenting with my own system prompt and tool calls. Stateless communication is achieved by simply sending the full conversation history with each request. The agentic iterative processes (agent mode) work under the hood with continuation. Things like that. Interestingly, and a bit unexpectedly, you can achieve a lot with this setup. The model’s inherent quality seems to be by far the most critical factor.
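For illustration, here is a condensed sketch of what such a setup can look like. The names, the JSON tool-call convention, and the endpoint are my own choices for this sketch (assuming an OpenAI-compatible chat endpoint, as e.g. LM Studio exposes), not the actual code of my environment:

```python
import json
import os
import urllib.request

# The minimal toolbox: just the three tools mentioned above.
def list_files(path="."):
    return "\n".join(sorted(os.listdir(path)))

def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

def write_file(path, content):
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} chars to {path}"

TOOLS = {"list_files": list_files, "read_file": read_file, "write_file": write_file}

SYSTEM = ('Reply with a JSON tool call {"tool": "...", "args": {...}} '
          "or with a plain-text final answer.")

def chat(history):
    """Stateless call: the FULL conversation history is sent every time."""
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",  # e.g. a local LM Studio server
        data=json.dumps({"model": "local-model", "messages": history}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def agent(task, max_steps=10):
    """Agent mode: keep continuing until the model stops calling tools."""
    history = [{"role": "system", "content": SYSTEM},
               {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
        except ValueError:
            return reply  # not a tool call, so it's the final answer
        result = TOOLS[call["tool"]](**call.get("args", {}))
        history.append({"role": "user", "content": f"Tool result:\n{result}"})
    return "step limit reached"
```

That really is the whole trick: a loop, three file tools, and the full history resent on every request.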
You can get a glimpse of how it looked in June: a simple agentic flow with a local model, and surprisingly, it worked with Mistral Small 3.2. I documented the experience, with a screencast, in the blog post Local Agentic Flow with Mistral Small 3.2.
After that, my own environment became somewhat obsolete, as I had learned what I needed. But then the first locally runnable models from the Qwen 3 series came out and initially didn’t work with existing tool chains, so my environment stayed in use longer than planned, until LM Studio, Cline, and friends supported them.
By the way, the next logical tool call to support in my environment would obviously have been shell access. How this works in existing AI interaction tooling is not hidden, though; you can see the calls directly in Cursor, Cline, Julie, or similar tools.
Speaking of tools: In November 2024, the Model Context Protocol (MCP) was published, a standard for connecting LLMs with external data and tools. Connecting to external systems takes just a few clicks or a bit of JSON config; anyone who has ever integrated external systems knows how much of a simplification this is. It also has some obvious security implications.
Interestingly, however, for my practical work MCP plays almost no role, as (almost) everything runs through the shell. And shell access IS a universal interface to the computer, with a variety of tools for almost anything. Tools called on the shell return text, which is perfect for interpretation by LLMs. Initially, the amount of tokens sent from tool output to the server was sometimes immense, but by now you can see LLMs piping output through grep, head, and similar filter tools.
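Here is a sketch of what such a shell tool might look like, including the output truncation that matters in practice. The function name and the limit are hypothetical choices of mine, not the API of any of the tools mentioned:

```python
import subprocess

MAX_CHARS = 4000  # keep tool output from flooding the context window

def run_shell(command, timeout=30):
    """Run a shell command and return its (possibly truncated) text output."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    out = proc.stdout + proc.stderr
    if len(out) > MAX_CHARS:
        # Truncate, but tell the model how much it is missing.
        out = out[:MAX_CHARS] + f"\n[truncated, {len(out)} chars total]"
    return f"exit code {proc.returncode}\n{out}"
```

In practice, models increasingly take care of this themselves by requesting something like `grep -n TODO src/*.py | head -20` instead of dumping whole files.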
Having a Plan
When working with AI in practice, it becomes clear that you get better results by first creating a plan and then letting the AI work through it. Feedback loops are great for validating results (where possible), and unit tests seem more important than ever.
Results are so much better when AI generates code from scratch based on a specification (plus relevant context); this is likely why spec-driven development emerged.
The term is used somewhat vaguely, but it essentially refers to breaking work down into these phases:
- Spec: What and why, described in a mostly technology-agnostic way.
- Plan: Architecture and tech stack decisions.
- Tasks: Breaking it all down into actionable items.
- Execution: Implementation carried out by the agent.
Although the term, in connection with AI, was only coined in late summer, the approach can already be seen in early May in the screencast of my blog post Balancing Agentic AI with Traditional Engineering.
Here are a few self-explanatory key screenshots from the screencast that show the phases using a concrete example (you can also just watch the screencast, of course):
Spec: What and Why, Described in a Mostly Technology-Agnostic Way
It is often good to give small examples of input or expected output.
Plan: Architecture and Tech Stack Decisions
Tasks: Breaking It All Down into Actionable Items
I overlooked the package structure; it belongs more to the plan, but it can also stay in the tasks if you intend to keep only the source code as the source of truth.
Execution
The task list (plus spec and plan) goes into the actual implementation, which is carried out by the agent.
These flows work well for projects started from scratch with limited complexity. Toolkits have been developed to structure this process; some of them produce a whole host of artifacts, and I wonder who would read, edit, and keep them all up to date.
Meanwhile, some people are proposing workflows in the form of specifications that are as comprehensive as possible and serve as a source of truth.
Complex Systems
What works for relatively trivial projects doesn’t necessarily work for complex systems.
Comprehensive, high-quality specs can only be created if you already know exactly what all the details should look like, but that is precisely the problem: for larger, complex projects, this is nearly impossible. The idea is an old one, from even before my time, and it has been tried many times. A famous example is IBM’s Future Systems project in the 1970s (code name GRAD), which began with a comprehensive, complex specification that was then to be implemented by independent teams. It was discontinued after high costs, without a commercial release.
To be fair, iterative approaches are also quite old; Niklaus Wirth (the inventor of Pascal), for example, already pointed this out in practice in the 1970s.
For me, Gall’s law from 1975 describes this best:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.
John Gall, Systemantics: How Systems Really Work and How They Fail, p. 71, see Wikipedia
Source of Truth
You could regard specs as the source of truth and work iteratively with them, sure. But every time a specification changes, regenerating the plans, tasks, and code for the affected parts is very token-intensive, which makes the process costly. This is also where the common comparison to compilers breaks down: local compilation costs virtually nothing, whereas this approach repeatedly regenerates parts from scratch, treating code as a transient, disposable artifact, much like developers in traditional toolchains don’t directly inspect the generated assembler code. The other difference is that the generation itself is not deterministic.
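A back-of-the-envelope calculation illustrates the point. All prices and token counts here are assumptions I made for illustration, not measured values, and this covers only a single mid-sized module:

```python
# Illustrative assumptions: hypothetical frontier-model prices and a
# mid-sized module regenerated on every spec change.
input_price_per_mtok = 5.00    # USD per million input tokens (assumed)
output_price_per_mtok = 25.00  # USD per million output tokens (assumed)

context_tokens = 50_000    # spec + plan + existing code sent each time (assumed)
generated_tokens = 20_000  # regenerated plan, tasks, and code (assumed)

cost_per_regeneration = (
    context_tokens / 1e6 * input_price_per_mtok
    + generated_tokens / 1e6 * output_price_per_mtok
)
print(f"{cost_per_regeneration:.2f} USD per regeneration")
# A handful of spec changes per day, over a working month, for ONE module:
print(f"{cost_per_regeneration * 5 * 20:.2f} USD per developer-month")
```

Multiply that across all the modules of a real system, and across every iteration, and the difference to a free local compile run becomes obvious.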
Using frontier models, which undoubtedly deliver the best quality, this approach can quickly cost several thousand euros per month per developer, even though AI services are currently heavily subsidized.
Legacy Projects
Finally, most projects aren’t greenfield; they’re complex legacy projects. Here, effective use of AI remains difficult. One challenge is providing the necessary context. Again, it’s good to first create a plan and refine it until it fits. Codebases with well-separated concerns deliver better results and are more cost-effective thanks to their manageable context. It’s even worth checking whether AI actually speeds things up in areas of your code that you know well. Interesting reading on this topic: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
Suggested Alternative Workflow in Complex Scenarios
As soon as things become important and/or more complex, I still believe you need to work directly with the code as the source of truth, dig into the documentation, use exploration projects, and so on. AI can still be helpful for research, drafting prototypes, or suggesting options.
In September, I attempted to demonstrate that this is not optional, using three concrete examples in the article Why Your AI Assistant Might Be Wrong: Keep Healthy Skepticism. And I could share many more such examples.
By working and shipping in small increments, you almost always learn something new at each step, insights that feed back into goals, requirements, and design decisions.
But if code is considered the source of truth, where does one obtain the relevant context in complex source code?
Retrieve Context
General ground rules are typically stored anyway and are therefore always available to the AI agent.
The real question is: How do you provide relevant context for the specific task at hand?
If your codebase follows separation of concerns, you typically know where a change needs to happen. To gather context, ask:
How does <My-Source-File(s)> achieve XYZ?
The AI will read the file (and any related files if needed) and answer the question; that alone often provides most of the relevant context.
Another approach: when a similar problem is solved somewhere or a flow already exists in the codebase, ask the AI:
How was XYZ solved here?
Once you have the context, formulate what needs to be changed or added, and ask for a plan first. Then continue in the usual way. This way you avoid costly syncs between specs and code and save significantly on token usage.
And if the existing solution is only roughly what you need? You can work from it directly, or open it separately and apply a proper implementation manually.
Of course, these are smaller steps at a time, and of course it’s not a fully automated process. But I see another advantage: you don’t have to read through vast amounts of output, adjust it where necessary, and keep it in sync.
Ultimately, these small steps mean that you are involved in every step of the solution, and they help you think things through completely yourself!
And if you need a spec at some point? Well, that is something AI can extract quite well; I don’t see the point in investing a huge amount of time in keeping specs and code properly in sync.
Fascinating Project Announcement
Speaking of complex legacy projects, there was another fascinating announcement in early 2025: plans to rebuild the Social Security Administration (SSA) code base within a few months, i.e., migrating many millions of lines of COBOL code to a current tech stack. This is interesting because, on the one hand, there is a large, complex COBOL legacy system and, on the other, almost unlimited AI resources. Unfortunately, there have been no updates since the announcement. Has it stalled?
AGI Coming Soon?
Speaking of complexity, we often heard in 2025 that AGI (Artificial General Intelligence, something that reaches or exceeds human intelligence levels) would come soon. I still wonder what these predictions are based on, and I doubt that generating answers token-by-token based on predicted probabilities for the next token can be a base technology for that. We’ll see.
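To make concrete what “token by token, based on predicted probabilities” means, here is a toy sketch. The probability table is invented purely for illustration; a real model computes these distributions with a neural network over a vocabulary of many thousands of tokens:

```python
import random

# Toy next-token probabilities (invented for illustration only).
NEXT = {
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 0.7), ("ran", 0.3)],
    "dog": [("ran", 1.0)],
    "sat": [("<end>", 1.0)],
    "ran": [("<end>", 1.0)],
}

def generate(token, rng=random.Random(0)):
    """Generate an answer token by token from predicted probabilities."""
    out = [token]
    while token in NEXT:
        tokens, probs = zip(*NEXT[token])
        token = rng.choices(tokens, weights=probs)[0]  # sample next token
        if token == "<end>":
            break
        out.append(token)
    return " ".join(out)
```

Everything an LLM produces, including every confident-sounding claim, comes out of exactly this kind of sampling loop, just at vastly larger scale.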
AI Landscape 2025
Openweight Models
After Qwen 2.5 from Alibaba in 2024 and DeepSeek R1 early in 2025, another Chinese openweight model entered the market in July 2025: Kimi K2. See also my blog post How Moonshot AI Kimi K2 Performs in Real Coding Work, as almost always with a screencast. At the time, I wrote:
I am used to the massive dominance of the USA in the field of computer science and find this quite interesting. Deepseek R1 was therefore no accidental success. At least in the AI sector, China is definitely playing its part. Competition is of course good for the consumer as it reduces costs. And an open-weight LLM can and will of course be used by computing operators. Since the various APIs are open, one LLM can simply be exchanged and another used, thus saving costs. Vendor lock-in with the LLM itself is therefore difficult at first. And it seems that this LLM is also more efficient with a quality that is at least coming close.
Data is now available showing that my assumptions were correct. OpenRouter.ai is a platform that offers various AI models behind a unified interface. In their study State of AI, they show that this usage started at around 1% at the beginning of the year and rose to over 30% in the second half of the year, with most of the models used being Chinese.
Running Openweight Models Locally
Many of the new openweight models use a Mixture of Experts (MoE) architecture: via sparse computation, only a small percentage of the total parameters is activated per token. This makes them more efficient, which is important for running models locally on consumer hardware.
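As a rough sketch of the idea, here is a toy top-k router in plain Python. This is meant to illustrate the principle of sparse expert activation, not the architecture of any particular model:

```python
import math
import random

rng = random.Random(0)

# Toy MoE layer: 8 experts, but only 2 are activated per token.
N_EXPERTS, D, TOP_K = 8, 4, 2
router = [[rng.gauss(0, 1) for _ in range(N_EXPERTS)] for _ in range(D)]
experts = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(D)]
           for _ in range(N_EXPERTS)]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def moe_layer(x):
    """Route token vector x to its TOP_K experts; the others stay idle."""
    scores = [sum(x[j] * router[j][e] for j in range(D)) for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=lambda e: scores[e])[-TOP_K:]
    weights = [math.exp(scores[e]) for e in top]
    total = sum(weights)                 # softmax over the chosen experts only
    out = [0.0] * D
    for w, e in zip(weights, top):       # only TOP_K of N_EXPERTS matrices run
        y = matvec(experts[e], x)
        out = [o + w / total * yi for o, yi in zip(out, y)]
    return out
```

The memory for all experts must still be resident, but per token only a fraction of the compute is spent, which is exactly why these models run acceptably on consumer hardware.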
Local models can now handle agentic workflows. They don’t reach frontier cloud models, but tasks like querying a codebase agentically (multi-step with tool calls) already work quite well; agentic code generation is much more limited. So they can be used for tasks of limited complexity, and they have some hardware requirements, especially high memory throughput and a relatively large amount of memory (unified memory on Macs, or PCs with a graphics card with plenty of VRAM).
Just as you can easily switch between different cloud models, for example planning with higher-quality, more expensive models such as Opus 4.5 and letting faster, more cost-effective models do the actual implementation, you can also use local models for simpler queries.
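A minimal sketch of such a routing setup; the endpoint URLs and model names are placeholders I made up, not recommendations:

```python
# Hypothetical routing table: send each kind of task to an appropriate tier.
ROUTES = {
    "plan":      {"base_url": "https://api.example.com/v1", "model": "frontier-large"},
    "implement": {"base_url": "https://api.example.com/v1", "model": "frontier-fast"},
    "query":     {"base_url": "http://localhost:1234/v1",   "model": "local-moe"},
}

def pick_route(task_kind):
    """Map a task to a model tier: frontier for planning, local for simple queries."""
    return ROUTES.get(task_kind, ROUTES["query"])
```

Because the chat APIs are largely compatible, swapping a model is often just a matter of changing a base URL and a model name, which is what makes this kind of cost-aware routing practical.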
It is probably evident by now that I advocate learning to use LLMs cost-effectively, especially since the subsidies may disappear at some point. Even if cloud usage becomes significantly cheaper, it doesn’t hurt to learn to use paid resources efficiently.
I’ve wanted to write a blog post about running models locally (e.g. Qwen 3 and Mistral AI Devstral Small 2) with a screencast for a while but haven’t gotten to it yet.
AI Browser To Sell Ads
With deep user understanding and personalized, hyper-optimized sponsored content, AI vendors could sell expensive ads. But this requires advanced memory features and broader data access (such as browser activity) to build richer user profiles for targeted advertising.
In my November 16 post What Problems Do AI Browsers Solve? I came to the conclusion: AI browsers are slower, raise privacy concerns, and risk security vulnerabilities like prompt injection, where malicious sites hide commands in text or images to hijack the AI.
On December 22, OpenAI published a blog post Continuously hardening ChatGPT Atlas against prompt injection attacks where they mention:
… we recently shipped a security update to Atlas’s browser agent, including a newly adversarially trained model and strengthened surrounding safeguards. This update was prompted by a new class of prompt-injection attacks uncovered through our internal automated red teaming.
… prompt injection remains an open challenge for agent security, and one we expect to continue working on for years to come.
Comeback of the Year?
Two things surprised me: Meta is lagging behind despite large investments, and Google’s models have caught up and are now as powerful as OpenAI’s and Anthropic’s; in image generation, they are even better. Together with Google’s market power in internet services and the corporate sector, this makes Google a very strong competitor for OpenAI.
AI Vendors Shifting Focus Toward Efficiency?
At the end of Why Your AI Assistant Might Be Wrong: Keep Healthy Skepticism I noted that GPT-5 did not bring significant practical improvements in abilities compared to previous models, suggesting that development may have slowed or shifted focus toward efficiency rather than new capabilities. I doubted it would change soon.
The latest version of Anthropic’s impressive high-end model, Opus 4.5, released at the end of November, also seems to follow this pattern: it is significantly more efficient, which reduces the price to a third(!), rather than introducing new capabilities.
see Claude Docs - Model pricing
2026
In 2025, my goal was to learn more about how AI interaction tools work under the hood. In 2026, I now want to learn more about the actual LLM base technology, although it is already clear that, given the scope and variety of developments, it will be necessary to focus on a few core areas.
I hope that by 2026 we will have more actual data collected using reputable methods, and I expect we will see more case studies as some companies succeed or fail with their projects.
The advances in local models are encouraging, but the gap with cloud-based systems remains enormous. I hope that local open models will progress by 2026 and beyond.
If the idea of AI shifting toward local machines seems far-fetched to you, think about how computer technology has changed over time, starting with centralized mainframe computers and then shifting largely to personal computing.