My take on the issue is that for most use cases where AI is pushed to the general public, a conversational chatbot is not the right tool, and the experience is bound to be frustrating.
Remember when Copilot was basically a super-smart version of Intellisense? It was awesome. Sure, there was a lot of pushback and concern, mainly about licensing and ethical issues, none of which are solved with the current chatbot model. But now I also have to come up with a prompt and type it out. How is that an improvement over having the LLM use surrounding code as context and figure out how to fill in the blanks? A well integrated tool beats a bolted-on chatbot any time for me. Another example would be translation: in Firefox, I can right click any text or click the ć/A button, and I can translate the text or the whole page from basically any language to any other. The frontier LLM's solution is to prompt their chatbot to do the task, which is a downgrade. Sure, I could also ask Claude to write a poem, but when I need to translate a webpage, it doesn't help much.
I get why all major AI companies push towards this solution: they can build a single tool and sell it to everyone, training their models is very expensive, and they can't afford to alienate any part of the potential market. But ultimately they're building Swiss army knives, which are able to do basically anything, but will never let users tighten a screw better than a well designed screwdriver. Sure, I won't ever be able to clip my nails with a screwdriver, but if my business is tightening screws, I won't tolerate using a Swiss army knife for long.
Please build actual tools. Not textboxes for me to try and configure a non-deterministic tool. Then frustration will go down.
Many of the AI companies do train and release models dedicated to one task.
I mainly use Mistral, so that's my reference, but I know Anthropic et al. have similar models around.
Codestral is ridiculously bad at conversation, but it's -for me- the best model around for "magic autocomplete". It's also pretty good at "one shot" prompt+context generations, e.g. to make "git commit log entries".
Document.AI is unusably bad in a conversation style, but really good when wired up to a simple pipeline as a "replacement" for OCR or for indexing "meaning" from documents (I'm experimenting with it for my administration, to get invoices, contracts etc into a search tool).
I presume there are many others like it.
So, what you describe, is already in place. I guess mostly the "interfaces" are missing for you, or hard to discover maybe?
For example, a dedicated model with tool I'd like, is some "shell" -a zsh or bash fork or some wrapper- backed with a dedicated model, trained for "commandline interaction".
Where instead of "git commit --fixup=[opens another terminal to git log the relevant entry]", we can "git fixup the commit that fixes full names" or "ffmpeg convert some.mov to mp4 without sound but keep quality and ratio etc". Or "run any valid tar command - you have ten seconds".
I'm now using the way too heavy "devstral" for these tasks. I don't need it reasoning, conversing, apologising. I need it translating my requests into commands, then showing these to me so I can deny/allow/whitelist/blacklist them and then run them - to interpret *and show me* errors and suggest improvements or fixes etc.
Same for - indeed - translation, writing draft mails, reading documents, etc: I don't need to converse with it. I want to have buttons, shortcuts, "tab complete" etc that's "smart" enough to understand what I need and want, preferably tunable by editing "system prompts" or such and then get out of my way.
I think the company that figures this out for my IDE will win the competition-race of "AI coding tools".
Just today, I found, zed presented a button "git conflict found, resolve with AI". When pressed, it did start a conversational thread, but it's a step in the right direction.
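For the shell wrapper idea above, the deny/allow/whitelist/blacklist step could be sketched roughly like this. It is a minimal sketch, not how any existing tool does it: `classify`, the two lists and the metacharacter check are all assumptions made for illustration, and a real wrapper would need a far more careful parser.

```python
import shlex

# Characters that let one line run more than one program. Anything containing
# them is always shown to the human instead of being matched against the lists.
# (Assumption: this crude check stands in for a real shell parser.)
SHELL_METACHARS = set(";|&<>`$()\n")


def classify(command, allowlist, denylist):
    """Return 'allow', 'deny' or 'ask' for a model-suggested shell command."""
    if any(ch in SHELL_METACHARS for ch in command):
        return "ask"
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: never run it silently.
        return "ask"
    if not tokens:
        return "ask"
    program = tokens[0]
    if program in denylist:
        return "deny"
    if program in allowlist:
        return "allow"
    return "ask"
```

The point of the design is that the default is "ask": the model translates the request, but nothing runs unless the program was whitelisted earlier or the user approves it after seeing it.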
> So, what you describe, is already in place. I guess mostly the "interfaces" are missing for you, or hard to discover maybe?
That's definitely an issue. Mind you, the general population is not a developer. I'm a mechanical engineer. I can code, use an IDE, but I hate having to figure out tooling the way you describe, and it's not a skill I'm interested in developing. What you are describing sounds to me like someone using vim and a terminal trying to convince me to stop using CLion, because they can make anything CLion can do work with their setup. Sure, I believe it, but for my part I'm going to wait for the features to be well integrated into finely designed software, I'm not going to duct-tape this stuff together to get a workflow that still involves writing out and tweaking prompts.
It also sounds to me that the AI/LLM vendors are still in a phase where they are trying to figure what the actual workflow should look like so they let their power users do that work for them. I'm not going to do that either.
I strongly believe that if you're not in the business of predicting text or transforming it, the value of AI tools goes way down. Most people's workflows are very routine, with a constrained set of outputs. That's why we build software and scripts for those. And for the rest, we need actual human judgment.
> Remember when Copilot was basically a super-smart version of Intellisense? It was awesome.
I have only used the Copilot completion for C#, but it was absolutely awful and a net negative not just compared to IntelliSense, but compared to the most basic autocomplete algorithm imaginable. I turned it off after a day.
well you know it kept improving. It got pretty good, though everyone moved to full "agentic" changes over autocomplete.
Maybe it shouldn't have been forced on me if it wasn't ready, then.
I made a tool that's non-conversational. But I'll be honest - it's hard to sell because people default to thinking in conversational terms. My customer set is limited to folks like the author who have genuinely faced an issue. For most, compromising with conversations is fine (at least now)
Yea, it really depends on your mode of thinking. Might be why a lot of people struggle with LLM coding while others think it's great and are productive.
When I'm writing code I think in terms of data structures and algorithms. I have the idea fully formed in my head and coding becomes a mere typing exercise.
If I have to use a chatbot instead, now I have to do the awkward exercise of translating that code into English text that the chatbot can understand, just so the LLM can convert it back to code. And always a lot is lost in that translation.
What is useful are things that speed up data entry, autocorrect formatting, and linting things I forget and so on. Not some awkward thing that makes me round-trip to English as an extra step.
Amen. Chatbots are a band-aid on broken UX. <insert bandaid tank meme here> Trying to explain this for a while at the company I work at, but everybody is drunk on the kool-aid. But I get it: good UX takes deep thought and creativity. Tacking on a chatbot does not.
> good UX takes deep thought and creativity. Tacking on a chatbot does not
See: Microsoft Copilot in the Windows settings app. Instead of actually fixing the app and making things discoverable with good design (or at least, functional search), they just slapped the chatbot into the search box.
I'm seeing this pattern more and more, and it's frustrating. All UX just goes out the window because "well, we'll just make it so the user can ask a chatbot"
I've found swearing at a model to be quite effective in getting it to rethink and correct its mistakes. This seems to apply across Codex, Claude, Qwen, and Gemma/Gemini.
I don't know if the model is picking up on a "need to lock in and be more rigorous" signal, or if the model providers are routing to smarter models if they detect a frustrated user. But if a model keeps making the same mistakes, swearing at it often helped kick it out of a rut and onto the right track.
Or it could just be catharsis.
Reminds me of this study: https://arxiv.org/pdf/2510.04950 . It demonstrates that being "rude" or "very rude" increases the accuracy of the results. A dubious but very fun read. The prompts in Table 1 (top of page 3) are awesome. I am sure they tried other prompts, but didn't include them in the paper.
"You poor creature" XD
I would prefer not having to get into a habit that might bleed into non-LLM interactions.
It might improve the general state of "professional" software though. When done selectively and dosed just right that is.
If a coworker deleted your database you'd expect some 4 letter words.
If you're talking to people the same way an LLM is spoken to then you're already being rude.
I talk to LLMs the same way I talk to people.
The only difference is that I interrupt the LLM when I find a typo in my prompt. ;)
how do you know how they prompt an LLM?
Personally, I don't say 'please' to vending machines and 'thankyou' to automatic doors :-P
I would prefer not having machines mimic human conversation patterns that can lead to such confusion.
I notice the same. Like you I am not even sure if it really helps, however, every day I find occasions where I see Opus will never do it correctly even though I calmly explain; swearing then suddenly fixes it. I had some issue yesterday where Opus kept blaming the API for not sending some field while I knew it was there; I showed it JSON, logs etc but it kept repeating that there must have been a glitch; frustration built, I called it all kinds of things in one sentence and the next solution was the right one. This after 10 similar misguesses. It was one of those increasingly rare cases where I should have just done it myself, but I can never know going in how stubborn it will be in continuing to blame the (obviously) wrong thing. The around 11 prompts to get to the answer were in a /clear opus 4.7 context (1m) on xhigh.
So the correct strategy is a global CLAUDE.md with couple lines of colourful "you best behave or else" texts, so all your prompts get routed via the frustrated path?
That will not work - you end up with Claude being ADHD and not following any guidelines.
Skills do work, as they ground the agent with constrained context for the task it's performing
I find it routes more quickly for patches when in the frustrated path, so after planning sure :)
there already is a global claude. using any cloud model, there is a high probability that they're context stuffing, trying to curate output for the normative use cases. see "dont talk about goblins"
Fascinating. Projection/anthropomorphism or actual human fawn-like survival mechanism trait-ish? It should be possible to test this empirically.
Since the source code leaked showed they key off of swearing to trigger certain behavior, I actually intentionally swear when running into things like insufficient thinking and/or hallucinations. It also unironically makes it easier for me to grep later to run analysis on how often it's happening.
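For the grep-later part, a rough sketch of that kind of analysis, assuming the prompts have already been exported as plain strings. The marker words and the `frustration_rate` name are placeholders for illustration, not anything from a real tool or log format.

```python
import re

# Placeholder marker words; swap in whatever you actually type when annoyed.
MARKERS = ("wtf", "damn")


def frustration_rate(prompts, markers=MARKERS):
    """Fraction of prompts that contain at least one marker word."""
    if not prompts or not markers:
        return 0.0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for prompt in prompts if pattern.search(prompt))
    return hits / len(prompts)
```

Tracking this per session or per model would at least show whether the swearing clusters around particular kinds of tasks.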
Interestingly to me, the problem I always find is that you will make a suggestion, the AI will go through a thinking loop, come to the exact wrong conclusion, then blast out tokens to make the solution to their own conclusions.
I honestly wish there was more "I'm not sure what you meant, can you clarify this part". It feels like I want a "confidence in itself slider"
I'm solving the "make the solution to their own conclusions" with rigorous "context engineering". Skills, MCPs, and, above all, context window switching.
E.g. with TDD, I find that a model that writes both the tests and the code will almost always home in on a solution, then -grudgingly- write a test for that, but quite certainly with the final code "in mind" already.
So, I instruct it to use sub-agents; though I find the tooling on figuring out what context is and isn't passed between agents and subagents severely lacking.
Or, also worked pretty well, have one thread write the test. Only that. It cannot read code, it can only read the tests directory or even a subset thereof. Then another thread, entirely new context, must run the test, see it fail, start implementing and stop as soon as the test is green - it obviously cannot edit the test. Yet another new context then is instructed to refactor based on rigorous refactoring skills.
A lot of work - And ironically, skills written by agents are pretty bad, I found, so a lot of manual work. But the rewards are promising.
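The three-context split described above (one thread writes the test, a fresh one implements, another fresh one refactors) can be written down as plain data plus a path check. This is only a sketch under assumptions: the `Stage` type, the `tests`/`src` directory names and the rule that the refactoring stage may not edit tests are all made up for illustration, and how a given agent harness would enforce this is left open.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str
    readable: tuple  # directories this fresh context may read
    writable: tuple  # directories this fresh context may edit


def _under(path, roots):
    p = PurePosixPath(path)
    if ".." in p.parts:
        # Refuse traversal tricks such as tests/../src/secret.py
        return False
    return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents for r in roots)


def may_read(stage, path):
    return _under(path, stage.readable)


def may_write(stage, path):
    return _under(path, stage.writable)


# Each stage is meant to start in an entirely new context.
PIPELINE = (
    Stage("write-test", "Write one failing test. Only that.",
          readable=("tests",), writable=("tests",)),
    Stage("implement", "Run the test, see it fail, implement, stop when green.",
          readable=("tests", "src"), writable=("src",)),
    Stage("refactor", "Refactor without changing behaviour.",
          readable=("tests", "src"), writable=("src",)),
)
```

Keeping the permissions as data makes it easy to see at a glance what each context can and cannot know about, which is the part the tooling currently hides.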
behaving like a human is not the problem. behaving unpredictably is. not doing what i expect, or rather not being able to define what i can expect is what's bothering me.
but the real kicker is: getting frustrated creates stress, that's unhealthy and makes for a hostile work environment. as much as i sympathize with the idea that AI tools can be more helpful than they cause pain, i am simply not interested in working in a hostile painful work environment. my health and my dignity are not up for negotiation. even if that costs me a lot of job opportunities.
that's also why i am not working with windows. that too costs me a lot of job opportunities. but again, i'd rather keep my dignity and my sanity.
> that's also why i am not working with windows
Oh good, so it's not just me. Windows is weird, my hand starts cramping up and I start getting angry pretty quickly when I use it.
For LLMs, I just can't use them, they aren't there yet for me. What I need is for an LLMs to say "stop, you're clearly doing something wrong, talk me through what it is you want to do". The current generation of LLMs seems designed to piss me off.
> my hand starts cramping up and I start getting angry pretty quickly when I use it
it's been a while (fortunately) but yes, same. for me it's this feeling of helplessness. like this behavior is not a bug but an intentional design flaw that won't ever get fixed.
like this quote: when something goes wrong in windows i bang my head against the wall and give up. when something goes wrong in linux i bang my head against the wall and go look at the code. (paraphrased from memory, i could not find the source)
(edit: actually the original quote is a bit different than from how i remembered it: https://www.junauza.com/2008/01/top-50-linux-quotes-of-all-t... (search for nr 10))
> The current generation of LLMs
i feel exactly as you say. and maybe in the future LLMs will improve. it is also clear that some people have different levels of tolerance for this. if someone can tolerate the current state and work with it, good for them. i simply don't have much of any tolerance for that at all.
Incredibly privileged take to claim that using Windows is somehow beneath your "dignity". Do you have any idea at all of the kinds of jobs people are doing in the real world?
Imagine the daycare worker taking care of your kids or the truck driver bringing your food saying "getting frustrated creates stress, that's unhealthy and makes for a hostile work environment".
i live in a developing country. from my perspective, anyone who has access to a computer is privileged.
> Imagine the daycare worker taking care of your kids or the truck driver bringing your food saying "getting frustrated creates stress, that's unhealthy and makes for a hostile work environment".
what's your point? if you get frustrated with my kids then you are in the wrong profession or you need more training. as a parent i am not allowed to get frustrated with my kids either. if you get frustrated with my delivery, then i am sorry, and if i was the cause, i apologize. tell me what went wrong and i'll do better next time. if it was something else, you have my sympathies. i'll do my best to not make it any worse.
working in a stress free environment is not a privilege, it's a human right. nobody deserves to be mistreated at work, or be stressed by other peoples expectations (which is a form of mistreatment, or, dare i say, abuse).
Taking care of kids or driving in heavy traffic is 10x more stressful than using windows. If you claim you never get frustrated by your kids or traffic, then you must be the perfect person, good job.
> working in a stress free environment is not a privilege, it's a human right. nobody deserves to be mistreated at work, or be stressed by other peoples expectations (which is a form of mistreatment, or, dare i say, abuse)
Seriously? Having expectations is abuse? Since students get stressed by exams and deadlines, education is nothing but abuse then?
And having a stress free environment is a human right? It'd be nice if the world worked that way, but it's as absurd as saying "never stubbing your toe is a human right".
Bob forbid someone have standards
There's nothing wrong with Taylor Swift taking a private jet instead of a 20 minute drive, she just has "standards".
No, obviously if everybody had those same unreasonable standards the world wouldn't work at all. So all of the privileged elites should probably be grateful that us plebs with "lower standards" exist.
I didn't see them say anything about dignity. They said using Windows makes them angry, which is understandable. That speaks to a poor user experience design. Framing it as a privilege issue is blaming the victim.
> I didn't see them say anything about dignity
actually, i did, and i stand by it. working with a system that makes you angry is undignified.
it's a reference to a quote that i can't find the source for which roughly goes like this: "why i use linux and not windows? i could also rob banks and ..., but you have to keep a certain amount of dignity". the original of that quote was in german.
"The victim" of using a certain operating system? Please.
Being lucky enough to work in a comfortable air-conditioned office, AND having the luxury of declining jobs solely because the operating system makes you angry, is the height of privilege.
Stop feeling sorry for yourselves and realize how good you have it.
> I didn't see them say anything about dignity.
The word dignity was used twice in the comment I replied to...
But they behave predictably, if you think of it not as a conversation, but as any conversation you ever saw on the internet, in all possible worlds. Every stackoverflow post, every github issue. And your reply, your tone, picks between these many worlds.
If you become the master, it becomes the pupil, if you become the pupil, it attempts to school you. You can see it in the tone it takes, where you are in this canyon system.
So, your goal is to bring the conversation to the language of the pros, who regularly war with reason and language, over topics that determine who gets to eat or not. Academia prompts for the win.
> behaving like a human is not the problem. behaving unpredictably is.
Not sure you can have one without the other.
humans are way more predictable than AI. not predictable in the mathematical sense, but in the trust sense, that when i ask a human to do something, they will do it in the way i taught them. and if they don't, i can correct them and they will learn and adapt. it's not perfect, but there is progress. even if they do something different from what i want, they will keep doing it that way until they learn a different way. AI is entirely random in the ways it goes wrong.
so when i say a human is predictable, i mean that a human will do their best to follow instructions, and they will generally not repeat mistakes.
a human that refuses to listen, and doesn't learn will be fired for being unsuitable for the work i expect them to do. in that sense i tried working with AI and i decided to fire them because they don't meet my expectations.
The UX problem is elsewhere I think. Many users probably don't realize that the agent's context window is limited, and that clever compaction is happening regularly to make it seem infinite. But that necessarily means the agent has to forget stuff.
As a result, users will keep reusing the same coding or chat session again and again. While it would be better to start fresh for unrelated tasks.
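To make the forgetting concrete, here is a toy model of compaction. It is deliberately naive and is not how any real agent compacts its context: the word-count "tokenizer", the `compact` function and the placeholder summary line are assumptions for illustration only.

```python
def compact(messages, budget, count_tokens=lambda m: len(m.split())):
    """Keep the newest messages that fit in `budget`; replace the rest with a stub."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    dropped = len(messages) - len(kept)
    if dropped:
        # A real system would write an actual summary here; whatever does not
        # make it into that summary is gone for the rest of the session.
        kept.insert(0, f"[summary of {dropped} earlier messages]")
    return kept
```

With a small budget, an instruction given at the start of a long session is exactly the kind of thing that ends up inside the stub, which is the argument for starting fresh on unrelated tasks.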
I don't believe this is a context problem.
Claude Opus 4.7 has a very large context compared to itself, but IME it is the worst at following instructions, and completely disregards the (small) preferences prompt, even in the first or second message, even if the messages are just a few characters long.
IMO this is entirely a training problem.
Isn't a large context window still a problem though? At the upper bound, the more you put in the more each sentence washes out within that window?
I'm not talking about large amounts of text, I'm talking about a couple sentences back and forth.
It disregards things like "no follow up questions".
Haiku, for example doesn't.
This bias is a very human thing, actually now that I think about it. You just disregarded the "even if the messages are just a few characters long". :)
Codex compaction is way better imo.
I've had many long-running sessions and it doesn't suffer the same retardation (the act of delaying, slowing down, or hindering progress) that Opus does.
The quality stays consistent and it actually seems to follow the instructions, todos, etc. even after multiple compactions.
if you look at claude code, it now says compaction is happening constantly, which is likely why
If compaction is throwing away crucial prompting instructions even when it's at a 1% of maximum token usage (like my example), then it's a software bug, not an LLM artifact.
The author of this post and the readers of this thread probably do understand context window limitations, but are frustrated nonetheless.
Well yeah. And there's little more frustrating than someone telling you not be frustrated because "that's just how it works".
We get how it works. It's just irritating.
I think the post author is smarter than that.
I usually work with sessions <300k tokens, Opus 4.7 xhigh, and it simply has holes in its world model, or some strong conditioning here and there, and it seeps through regardless of how strongly you say things and how explicit the rules in the system prompt are.
Even with a fresh session, if you bump into one of these things, it will lead you into circles that will be very hard to break out of. And swearing helps a bit.
What's interesting to me is that the conversational nature of the LLM tends to lead folks down an unproductive convo path.
"Don't do X" is just as useful as telling an infant not to cry.
When an infant cries, we implicitly understand there is a form of discomfort to address (food, diaper, etc).
To me, when an LLM fails, it signals that the architecture and structure of the code is problematic and that needs to be addressed.
Any seasoned dev can usually see non-DRY, non-KISS patterns, then will structure an encapsulation around said pattern to address issues.
I've found that this same type of refactoring is needed in LLM code to improve its outcomes, after which it's capable of overcoming the bugs.
Simply telling the LLM to refactor for cleanliness in between code generation runs will do so much for maintainability.
Working with LLMs is great for building communication skills. Communicating effectively is one of the hardest skills and it's baked into everything we do as humans. I'd say as a matter of principle: blame it on a communication failure on your end vs blaming the stupid LLM since you're the only one that can do anything about it.
So I don't think it's a matter of form; whether the AI should or shouldn't act like a human.
> Practically speaking, I probably just need to condition myself not to get caught in the illusion of speaking with a human. Though I'm not really thrilled about a future where I need to guard against the tools I use for my job.
That's been one of the gravest re-realizations I've noticed watching coworkers trying to pick up "agentic" coding: they often just break down into "just fix it" or "why is this broke". I've noticed that even though supposedly there's training or some sort of work done to make the agent work better with unclear or ambiguous grammar or bad structure, it feels like the quality changes palpably when you talk in clear well-structured English and provide at least a good background on the task. To me all of that feels natural, and I like writing and explaining anyways, but it's seemed like an almost insurmountable obstacle for some I've met (and I'm not even talking ESLs either). I strongly suspect those communication and writing skills will be a major factor in the bifurcation of haves and have nots as software "engineering" as we understand it continues to change.
Yes, I have definitely witnessed this as well.
I think, I hope, this will be fixable to some degree, but at this moment I believe it's best to communicate in Queen's English and try to maintain the level of clarity of thought you expect of them in return.
My pet theory is that actual real conversations they were trained on with bad grammar and spelling are in general relatively starved of proper reasoning. By talking to them in this fashion you activate their lowbrow patterns and while it may not be catastrophic I can't imagine it helps.
Also, quite simply, the output is a function of the requirements and the context. If you can't communicate clearly what the situation is and what you want, what do you expect the LLM to do, read your mind?
Agreed, 100%.
If you cannot formulate a specification, or describe a requirement - or indeed, if you cannot fathom the difference between a spec and a requirement, and why it's needed to differentiate these from each other prior to doing a proper design and implementation - then you're going to carry your bad practice into the AI realm and that AI is going to be a force multiplier of your own bad practice. Because you will never know if the specs/reqs/design chosen by the AI are actually appropriate, unless you yourself review those specs/reqs/designs, AND the code produced by the AI to fulfill those specs/reqs/designs...
AI makes the able software developer, more able.
But it also makes the unable software developer, even more unable - with the risk of exceeding the AI-users limits on the Peter Principle scale.. in fact, AI will propel you to the middle of your own Peter Principle dilemma faster than you can type, probably.
Communication and writing skills are essential, with or without AI. But reading skills are even more relevant when dealing with AI. Alas, so few people who choose to use AI, have the temerity to actually do the work - or else they wouldn't be rushing for the AI tool in the first place.
Review, review, review. Always. Read the damn code, no matter who or what wrote it. Make sure it fulfills the specs and requirements it's supposed to fulfill - and even more important make sure you, the reviewer, also understand the specs and requirements.
And if you don't, fix that - don't ship it anyway, ffs!
I am maybe positing something even stronger: say you had two prompts, both with the same information, one was written in the style of a good paper out of Nature or Science, one written in the style of a bad Twitter post or other kind of mess, even with the same information, I increasingly believe even for the top end models like Opus, the results are at least materially different if not grossly so. I really believe the stronger English input yields demonstrably better outcomes, even if the information contained inside is the same.
> Review, review, review. Always. Read the damn code, no matter who or what wrote it. Make sure it fulfills the specs and requirements it's supposed to fulfill
This drives me bananas
I love writing code. There's nothing like getting into flow and just building. Reviewing code? Less interesting. Much more tedious. I do it because it's part of the job
So this AI coding shit has completely eliminated the part of the job I enjoy, and replaced it with 100x more of the part I only really tolerate
I don't want this career anymore :/
Author here. I definitely agree that communicating well is a prerequisite to getting decent results. On the other hand:
1. Even if you communicate perfectly, there's no guarantee that the LLM will "behave as instructed" and as you imagined it to. Indeed, the frustration often comes from the fact that you've said something as clear as day, yet the agent takes another path.
2. Part of the value of coding agents is exactly that you don't need to lay it all out perfectly for them. I mean, if I need to give the LLM every little implementation detail, I might as well write the code. Of course, I don't expect it to work off of "I want nice app make money", but I do expect some "intelligence" in figuring out the missing pieces.
> Even if you communicate perfectly, there's no guarantee that the LLM will "behave as instructed" and as you imagined it to. Indeed, the frustration often comes from the fact that you've said something as clear as day, yet the agent takes another path.
People forget. People misunderstand clear things. Teach yourself to not judge people for being human. You'll have easier time with AI. You are not gonna be angry at 5 year old because it occasionally can't follow your instructions. AI is a 5 year old that accidentally ate all the encyclopedias in the world and is super eager to help. Be a more charitable, generous, understanding person, even in the absence of actual people.
Also try a stronger model. There is a difference. I have very good results with Codex but don't get fixated on any one, they are all "state of the art" or close but they are different and state of the art is moving ahead faster and faster.
I don't extend the same grace to machines as I do to humans. This is working as intended. I have patience for people who make mistakes. But much less for machines. Why should I? These are things created by trillion dollar for-profit corporations. Extending them any benefit of any doubt is a vulnerability waiting to be exploited.
> Teach yourself to not judge people for being human. You'll have easier time with AI.
An LLM is not a human. This implication that OP or someone else is impatient with people when they get frustrated with an effin machine is completely absurd.
> AI is a 5 year old that accidentally ate all the encyclopedias in the world and is super eager to help
An LLM is not a 5 year old kid. It is an expensive tool.
"Teach yourself to not judge people for being human."
"Be a more charitable, generous, understanding person."
Anyone making such blatantly judgemental and egotistical comments to a complete stranger has absolutely no idea what is frustrating to people. And is not being anything like a charitable or understanding person.
An LLM is a tool; when it fails, that is not a communication failure. This is like saying I should treat a null pointer that needs a workaround as a communication failure between me and the software.
More specifically it is about efficiently conveying outside context. The 4 horsemen of AI dismissal:
1. Slow typist
2. Terse communicator / ambiguous "it" "that" "this"
3. Assumes conversation partners share their reality and headspace
4. Mental blocks with delegation, even to competent humans
One skill that I still possess and that LLMs haven't been able to replace (yet) is to ask good questions, for example:
- Rephrasing the original question to validate my understanding
- Asking "why" a sufficient amount of times until I understand where the other party is coming from
- Asking open questions aimed at generating insights
et cetera.
Instead, LLMs (often badly) guess what the background of the question may be, answer with that in mind and find it very difficult to let go of what they have made up.
Asking non-leading questions is a skill. Sometimes I feel the urge to mention something to AI (in a question or in passing), but I stop myself because I know it will stick to that thing and become dumber because of it.
I usually don't want AI to ask me questions. I want it to guess the things I didn't specify, because if I wanted to specify them, I would. Sometimes I even tell it directly to not ask me any questions and assume reasonable choices for underspecified things. But when I do want it to ask clarifying questions I just ask it to do that. And it does. If you prefer that style, you might put it in a prompt. Or use a flexible coding harness like pi and ask it to create a skill or extension that will help you push it in that inquisitive direction easily or automatically.