AI Agents: Why the Hype Feels Wrong to an Old Programmer
I’ve been a programmer forever, and have been using AI tools for a while now. The internet told me the next big thing is ‘agents’, so I figured I’d give it a spin.
For review, an agent (in the 2024-2025 sense of the word)1 is a language model that can call external tools to do something other than just chat. So, maybe you ask a language model to order you a pizza2, and it calls some tools that expose APIs, or maybe even uses a ‘voice call’ tool that lets it make an actual phone call with a human, and now something has happened in the physical world.
This is pretty cool, for sure. So I figured I’d set one up (an agent, not a pizza).
The effort/value combo
This is the first spot where I had to learn to shift my brain.
We’re used to “setting something up” being the hard part. Not just difficult and time-consuming, but also the part that adds the most value. Sure, anyone can “set up” a website with WordPress, but if they want it done right (many of us think) they have to go to a programmer. We understand all the moving parts and know how to put them together in the right way to optimize the whole system. We even build our own subsystems and adapters and plug them all together to create elegant software systems (we hope).
I naively assumed this was the hard part about agents as well. So I cobbled something together that would call a local LLM, listen for tool calls, perform the actions they described, return the results, and so on. I defined a few tools as well, to get my hands dirty.
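For reference, the skeleton looked roughly like this. This is a minimal sketch, assuming a local server that speaks the OpenAI-compatible chat API; the base URL, model name and tool are placeholders:

```python
import json
from openai import OpenAI  # any OpenAI-compatible client pointed at a local server

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_wikipedia",
        "description": "Search Wikipedia and return the top result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def search_wikipedia(query: str) -> str:
    ...  # the actual tool; a sketch of it follows in the next section

def run_tool(name: str, args: dict) -> str:
    if name == "search_wikipedia":
        return search_wikipedia(**args)
    return f"Unknown tool: {name}"

def agent_loop(user_message: str, model: str = "local-model") -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS
        ).choices[0].message
        messages.append(reply)
        if not reply.tool_calls:       # no tool requested: the model is done
            return reply.content
        for call in reply.tool_calls:  # execute each tool call, feed results back
            result = run_tool(call.function.name,
                              json.loads(call.function.arguments))
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})
```

That’s more or less the whole ‘setup’.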
I assumed the cleaner, production version of this would have more input sanitization, graceful failovers, etc. (my old brain could see where all those parts ought to go) but I figured the ‘core’ thing had been set up. “Expert mode” for this would be adding the above trimmings, right?
I ran it, and the results were… disappointing, to say the least.
The naive solution
I started with a ‘Wikipedia search’ tool, which (as the name suggests) searched Wikipedia. This worked fine for simple questions like “What year was the Eiffel Tower built?”, but it failed on questions requiring multiple reasoning steps. I asked “What is the deadliest animal in Sweden?” and the model3 just kept querying Wikipedia with different query strings in the hopes of finding a single page with that exact answer on it.
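For concreteness, the tool itself was about as thin as it sounds: a small wrapper around the public MediaWiki search API (the five-result cutoff and the snippet formatting are arbitrary choices of mine):

```python
import requests

def search_wikipedia(query: str) -> str:
    """Return titles and snippets for the top Wikipedia search hits."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": query, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json()["query"]["search"]
    # Snippets come back with HTML highlight markup; fine for an LLM to read.
    return "\n".join(f"{h['title']}: {h['snippet']}" for h in hits[:5])
```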
I did a little research and found the solution is essentially to just tell it not to do that.4 Seriously.
Seriously?!
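To be concrete, the fix amounted to a couple of sentences in the system prompt, something along these lines (my paraphrase, not the exact wording):

```python
SYSTEM_PROMPT = """You are a research assistant with a Wikipedia search tool.
If no single page can answer the question directly, do NOT keep rephrasing
the same search. Break the question into sub-questions, answer each one
with its own search, and combine the results yourself."""
```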
“Vibe” debugging
One might argue this is exactly what we do when debugging: we see what instructions the computer misunderstands and adjust our instructions accordingly. (There’s the famous “Peanut Butter & Jelly” challenge5 that illustrates this to non-programmers.)
But this is different. In ‘traditional’ programming, you can break inputs down into input classes, see how any such input travels through the system and what logic happens where, and write tests that are comprehensive. You can even simulate parts of the system mentally to see where it might break, without running any input at all.
But you can’t really do that here… you can’t just say “what will it do for any given human utterance?” and test for all cases, nor can you imagine how it would handle a specific query and trace the path of execution in your mind.
You can put in software-level guardrails, such as requiring user confirmation from outside the LLM before a tool is executed, or automatically pausing after a certain number of tool executions, but this somewhat defeats the advertised goal of agents: they’re supposed to “just do” the thing you asked for without extra work, steps, or even supervision.6
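Those guardrails, at least, are ordinary software. A sketch of both, wrapped around the run_tool dispatcher from earlier (the names and the budget of ten are mine):

```python
MAX_TOOL_CALLS = 10  # hard stop, enforced outside the model

class ToolBudgetExceeded(Exception):
    pass

def guarded_run_tool(name: str, args: dict, calls_so_far: int) -> str:
    # Automatic pause after too many tool executions.
    if calls_so_far >= MAX_TOOL_CALLS:
        raise ToolBudgetExceeded(f"Paused after {calls_so_far} tool calls.")
    # Human confirmation from outside the LLM before anything runs.
    answer = input(f"Agent wants to run {name}({args}). Allow? [y/N] ")
    if answer.strip().lower() != "y":
        return "Tool call denied by the user."
    return run_tool(name, args)
```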
Agent “programming”
So what does this leave? Strategies. Either the baseline LLM underneath has to have been explicitly trained on different strategies for handling different requests (many are nowadays), or the strategies have to be explicitly defined in the system prompt, or both.
There can, however, be layers of pre- and post-processing that happen here to find the right strategy or modify the query, using either LLMs or traditional processing.
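As a sketch of what that pre-processing can look like: a crude router that picks a strategy prompt before the agent loop ever runs (the categories, keywords and prompts are all placeholders; a small classifier LLM could replace the keyword matching):

```python
STRATEGY_PROMPTS = {
    "lookup":    "Answer with a single search and quote the source page.",
    "multi_hop": "Decompose into sub-questions; run one search per sub-question.",
    "opinion":   "Do not search; answer from general knowledge and say so.",
}

def pick_strategy(user_message: str) -> str:
    """Crude keyword routing; traditional processing in the dumbest sense."""
    q = user_message.lower()
    if any(word in q for word in ("deadliest", "most", "best", "compare")):
        return "multi_hop"
    if q.startswith(("should", "is it", "do you think")):
        return "opinion"
    return "lookup"

def preprocess(user_message: str) -> list[dict]:
    """Turn a raw query into the message list the agent loop starts from."""
    strategy = STRATEGY_PROMPTS[pick_strategy(user_message)]
    return [{"role": "system", "content": strategy},
            {"role": "user", "content": user_message}]
```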
But the “real” programming, the hard part that creates value, is in the creation and selection of the strategies, not in the plumbing that gets the systems communicating with each other properly.
This hurts
This hurts my professional bones in many ways.
Things happen for no reason
The grounding in lower layers is gone. When debugging software, you could always fall back on what you know is happening ‘behind the scenes’. A network connection isn’t working? Well, it has to be doing a system call to open a socket somewhere, see if that works. Yes? Go up the stack. You have enough of a mental map to quickly steer and diagnose the situation, and your mental map improves with time so you can navigate the system faster later.
With an agent, you do (usually) get its thinking tokens, but they’re not necessarily faithful to what is actually happening behind the scenes7, and a slight change in prompt can yield a completely different thought chain.
Things change for no reason
Strategies here are also necessarily model-specific. For those old enough to remember, this is like quirks mode for certain browsers, except in this analogy a new browser with totally different quirks comes out every week.
Basically, the ‘learned’ knowledge you have is that this particular model tends to interpret these queries a certain way. The next model may, or it may not. It’s really up to the upstream provider (even for a local LLM) what they choose to emphasize in training.
You could freeze the model, but then you risk fossilizing unknown unwanted behaviors as well. You could also try to keep a data set that forces the model to retain the same ‘quirks’, or force some rigidity in how it responds to specific queries, but there’s a chance that such fine-tuning would interfere in other unforeseen ways8.
Still, it might not be problematic if the task is relatively simple, but you just can’t really say for sure.
As a programmer…
I think my overall takeaway is that agents should be used for prototyping and as a ‘fallback’ but you should do everything you can to move your system away from them once you find stable use-cases.
For instance, if you’re a pizza place and you set up an agent to take an order, the agent should really just be used to format and clean up the human input, feeding very specific tools inside a very controlled workflow.
If you’re building an agent from scratch to, say, go online, find the best flight deals and book for you, then yes, you should start by prototyping with an agent, since there are zillions of different sites and formats that would be a pain to keep up with in a ‘traditional’ way.
But then, once you more or less understand what the workflow should be, you should start adding structure to it, and use the agent only for (essentially) cleaning up raw input at the entry points of your workflow and formatting the raw output on the way back.
In other words, use strategies for prototyping and then quickly codify the ones that work, but don’t buy into the hype and build agents on top of agents on top of agents because every single layer you add adds more instability and madness.
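Concretely, that end state might look like this: the model’s only job is translating messy human text into a typed request, and boring deterministic code owns everything after that (a sketch; the order schema and the ask_llm helper are made up):

```python
import json
from dataclasses import dataclass

@dataclass
class PizzaOrder:
    size: str            # "small" | "medium" | "large"
    toppings: list[str]
    delivery: bool

EXTRACTION_PROMPT = """Extract the pizza order from the message below. Reply
with ONLY a JSON object: {"size": ..., "toppings": [...], "delivery": ...}"""

def ask_llm(prompt: str) -> str:
    ...  # the same chat-completion call as in the earlier loop, minus the tools

def parse_order(raw_message: str) -> PizzaOrder:
    # The LLM's one job: messy text in, structured data out.
    data = json.loads(ask_llm(f"{EXTRACTION_PROMPT}\n\n{raw_message}"))
    order = PizzaOrder(**data)
    # From here on, plain testable code owns the workflow.
    if order.size not in ("small", "medium", "large"):
        raise ValueError(f"Unknown size: {order.size}")
    return order
```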
The most important part: remember that the “hard” part isn’t getting it production-ready and then containerizing that and orchestrating the containers etc. The hard part is getting it to reasonably interpret user input into a sensible and reproducible execution strategy, which you only learn by either narrowing your use-cases or poring over tons of input and learning your specific model’s quirks.
Safe coding!
Footnotes
1. Yes, it has meant different things in the past, in different branches of computer science. Yes, this frustrates me as well. But for the sake of not devolving into a rant, I’m just gonna pretend I have amnesia, that I’ve never seen this word in my life and it carried no meaning to me before. I imagine that’s what everyone else is doing.
3. The specific model does matter; in this case it was Nemotron 70B Instruct at 4-bit quantization.
5. You ask someone to give instructions on how to make a Peanut Butter and Jelly sandwich, and then you maliciously comply with those instructions. On some level it illustrates the core difficulty of programming. It also makes for some funny videos.
6. Don’t do this.