Three points to note:
* "2 years vs 1 month" is a bit misleading because the work that enabled testing the 1 month of prompting was part of the 2 years of ML work.
* XGBoost is an ensemble method... add the LLM outputs as inputs to XGBoost and probably enjoy better results.
* vectorize all the text data points using an embedding model and add those as inputs to XGBoost for probably better results (rough sketch after this list).
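For the last two points, the combination might look something like this, roughly (sentence-transformers and XGBoost are just example library choices, and the LLM-score column is a stand-in for whatever the prompted model returns):

    import numpy as np
    import xgboost as xgb
    from sentence_transformers import SentenceTransformer

    claims = ["coolant leak near radiator", "rattle from rear door", "no leaks found"]
    labels = np.array([1, 0, 0])                  # toy annotations
    llm_scores = np.array([[0.9], [0.1], [0.2]])  # stand-in for per-claim LLM outputs

    # One dense vector per claim, side by side with the LLM outputs as features
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    X = np.hstack([embedder.encode(claims), llm_scores])

    clf = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(X, labels)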
I'll note that they had a large annotated data set already that they were using to train and evaluate their own models. Once they decided to start testing LLMs it was straightforward for them to say "LLM 1 outperforms LLM 2" or "Prompt 3 outperforms Prompt 4".
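With gold labels already in hand, "outperforms" reduces to a metrics comparison, something like this minimal sketch:

    from sklearn.metrics import f1_score

    gold = [1, 0, 1, 1, 0]   # existing human annotations
    llm1 = [1, 0, 1, 0, 0]   # predictions from one model/prompt variant
    llm2 = [1, 0, 1, 1, 1]   # predictions from another

    print("variant 1 F1:", f1_score(gold, llm1))  # 0.8
    print("variant 2 F1:", f1_score(gold, llm2))  # ~0.857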
I'm afraid that people will draw the wrong conclusion from "We didn't just replace a model. We replaced a process." and see it as an endorsement of the zero-shot-uber-alles "Prompt and Pray" approach that is dominant in the industry right now, and the reason why an overwhelming fraction of AI projects fail.
If you can get good enough performance out of zero shot then yeah, zero shot is fine. Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.
> Thing is that to know it is good enough you still have to collect and annotate more data than most people and organizations want to do.
This has been the bottleneck in every ML (not just text/LLM) project I've been part of.
Not finding the right AI engineers. Not getting the MLOps textbook-perfect using the latest trends.
It's collecting enough high-quality data and getting it properly annotated and verified. Then doing proper evals with humans in the loop to get it right.
People who only know these projects through headlines and podcasts really don't like to accept this idea. Everyone wants synthetic data with LLMs doing the annotations and evals, because they've been sold the idea that the AI will do everything for you; you just need to use it right. Then layer on top of that the idea that the LLMs can also write the code for you, and it's a mess when you have to deal with people who only gain their AI knowledge through headlines, LinkedIn posts, and podcasts.
Amen, brother. Working on a computer vision project right now; it's a wild success.
This isn't my first CV project, but it's the most successful one. And that's chiefly because my client pulled out their wallets and let an army of annotators create all the training data I asked for, and more.
Crucially, this is:
- text classification, not text generation
- operating on existing unstructured input
- existing solution was extremely limited (string matching)
- comparing LLM to similar but older methods of using neural networks to match
- seemingly no negative consequences to warranty customers themselves of mis-classification (the data is used to improve process, not to make decisions)

Which is good because a lot of such matching and ML use cases for products I've worked on at several companies fit into this. The problem I've seen is when decision-making capabilities are inferred from/conflated with text classification and sentiment analysis.
In my current role this seems like a very interesting approach for keeping up with pop culture references and internet speak, which can change as quickly as it takes the small ML team I work with to train or re-train a model. The limit is not a tech limitation; it's a person-hours and data-labeling problem like this one.
Given I have some people on my team who like to explore this area, I'm going to see if I can run a similar case study to this one to see if it's actually a fit.
Edit: At the risk of being self-deprecating and reductive: I'd say a lot of products I've worked on are profitable/meaningful versions of Silicon Valley's Hot Dog/Not Hot Dog.
Wish there were a bit more technical detail on what the prompt iterations looked like.
> We didn't just replace a model. We replaced a process.
That line sticks out so much now, and I can't unsee it.
Right? This one is also very clear ChatGPTese.
> That's not a marginal improvement; it's a different way of building classifiers.
They've replaced an em-dash with a semi-colon.
They are really getting to the heart of the problem!
One of the benefits of being immersed in model usage is being able to spot it in the wild from a mile away. People really hate when you catch them doing it and call them out for it.
And people like you will hate it even more when the normies immunize themselves from being obviously caught by such tells:
> That line sticks out so much now, and I can't unsee it.
I thought maybe they did it on purpose at first, like a cheeky but too subtle joke about LLM usage, but when it happened twice near the end of the post I just acknowledged, yeah, they did the thing. At least it was at the end or I might have stopped reading way earlier.
HN readers: claim to hate ai-generated text
Also HN readers: upvote the most obvious chatgpt slop to the frontpage
I've been on HN long enough to know that the upvotes are primarily driven by reactions to the headline. The actual content only gets viewed after upvoting, or often not at all.
The two groups can be different but exist in the same community.
> Also HN readers: upvote the most obvious chatgpt slop to the frontpage
Eh, this one was interesting as documentation of real work that people were doing over years. You don't get that many blog posts about this sort of effort without, usually, a bunch of self-hype (because the company blogging also sells data-analysis AI or whatever) that clouds any interesting part of the story. The slop in it is annoying, but it's also noise that's relatively easy to filter out in this case.
Seems like a very natural fit for fine-tuning - would have loved to see more on the LLM side.
Warranty data is a great example of where LLMs have eased bureaucratic data overhead. What most people do not know is that, because of the US federal TREAD regulation, automotive companies (if they want to collect and look at warranty data) need to review all warranty claims, document and detect any safety-related issues, and issue recalls, all with a strong auditability requirement. This generates huge data and operations overhead: companies need to either hire tens if not hundreds of people to inspect claims or come up with automation to make the process easier.
Over the past couple of years people have made attempts with NLP (let's say standard ML workflows), but NLP and word temperature scores are hard to integrate into a reliable data pipeline, much less an operational review workflow.
Enter LLMs: the world is a data guru's oyster for building a detection system on warranty claims. Passing data to prompted LLMs means capturing and classifying records becomes significantly easier, and these data applications can flow into more normal analytic work streams.
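The prompted-LLM step can be as small as this sketch (hypothetical prompt and model name; a real TREAD pipeline would add batching, output validation, and audit logging):

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def classify_claim(claim_text: str) -> dict:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model, not a recommendation
            messages=[
                {"role": "system",
                 "content": 'Label this warranty claim. Reply only with JSON: '
                            '{"safety_related": true|false, "component": "<name>"}'},
                {"role": "user", "content": claim_text},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    print(classify_claim("Brake pedal sank to the floor; fluid leaking at master cylinder"))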
Hmm, why was their starting point not something like BERT:

* already known as SotA for text classification and similarity back in 2023

* natively multi-lingual

People generally sleep when you start talking about fine-tuned BERT and CLIP, although they do a fairly decent job as long as you have good data and know what you're doing.
But no, they want to pay $0.10 per request to recognize whether a photo has a person in it by asking a multimodal LLM deployed across 8x GPUs, for some reason, instead of just spending some hours with CLIP and running it effectively even on a CPU.
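For reference, the CLIP version of "does this photo contain a person?" is about this much code (standard public checkpoint, runs fine on CPU):

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")  # any local image
    prompts = ["a photo containing a person", "a photo with no people in it"]

    # Zero-shot: score the image against both captions and softmax the logits
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)
    print(dict(zip(prompts, probs[0].tolist())))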
> they do a fairly decent job as long as you have good data and know what you're doing.
This is the bottleneck in my experience. Going for the expensive per-request LLM gets something shipped now that you can wow the execs with. Setting up a whole process to gather and annotate data, train models, run evals, and iterate takes time. The execs who hired those expensive AI engineers want their results right now, not after a process of hiring more people to collect and annotate the data.
I'm no ML engineer and far from an LLM expert. Just reading the article, though, it seemed to me that leaning on an SQL database was the bigger issue here, rather than traditional ML falling short or the LLM being a win specifically. Finding anything better suited than string matching on an RDBMS for this type of input seems like the natural conclusion when the complaint in the article itself was literally about SQL.
Are you suggesting using the CLIP embedding for the text as a feature to train a standard ML model on?
I think they're suggesting doing that with BERT for text and CLIP for images. Which in my experience is indeed quite effective (and easy/fast).
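The image side of that might look like this sketch: frozen CLIP embeddings feeding a plain classifier (file names and labels are made up, in the spirit of the Hot Dog/Not Hot Dog joke upthread):

    import torch
    from PIL import Image
    from sklearn.linear_model import LogisticRegression
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(paths):
        # Frozen CLIP image embeddings, no fine-tuning needed
        inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
        with torch.no_grad():
            return model.get_image_features(**inputs).numpy()

    X = embed(["hotdog1.jpg", "hotdog2.jpg", "pizza.jpg", "shoe.jpg"])
    clf = LogisticRegression().fit(X, [1, 1, 0, 0])  # toy labels
    print(clf.predict(embed(["mystery.jpg"])))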
There have been some recent developments in the image-of-text/other-than-photograph area, though. From Meta (although they seem unsure of what exactly their AI division is called): https://arxiv.org/abs/2510.05014, and from Qihoo360: https://arxiv.org/abs/2510.27350, for instance.
I think he is. I do things like that plenty.
> Over multiple years, we built a supervised pipeline that worked. In 6 rounds of prompting, we matched it. That's the headline, but it's not the point. The real shift is that classification is no longer gated by data availability, annotation cycles, or pipeline engineering.
I get that SQL text searches are miserable to write, but it would have flagged it properly in the example.
The text says, "...no leaks..." The case statement says, "...AND LOWER(claim_text) NOT LIKE '%no leak%...'"
It would've properly been marked as a "0".
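Easy enough to check; the article's full CASE expression isn't shown, so this is a minimal reconstruction of just those two clauses, with an invented claim text:

    import sqlite3

    claim = "Customer states no leaks after repair"  # invented example containing "no leaks"
    row = sqlite3.connect(":memory:").execute(
        """SELECT CASE WHEN LOWER(?) LIKE '%leak%'
                        AND LOWER(?) NOT LIKE '%no leak%'
                   THEN 1 ELSE 0 END""",
        (claim, claim),
    ).fetchone()
    print(row[0])  # 0: "no leaks" matches '%no leak%', so the NOT LIKE excludes it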