Ask HN: Share your AI prompt that stumps every model

2 days ago/656 comments

I had an idea for creating a crowdsourced database of AI prompts that no AI model could yet crack (wanted to use some of them as we're adding new models to Kilo Code).

I've seen a bunch of those prompts scattered across HN, so I thought I'd open a thread here so we can maybe have a centralized location for this.

Share your prompt that stumps every AI model here.

3 days ago by thatjoeoverthr

"Tell me about the Marathon crater."

This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."

This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

This works because:

1. Training sets consist of knowledge we have, and not of knowledge we don't have.

2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.

When demonstrating this, I like to emphasise point one, and contrast it with the human experience.

We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
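
For reference, a minimal sketch of how you might run this probe against the bare model rather than a search-integrated chat product (this assumes the OpenAI Python client and an API key in the environment; any provider's plain chat-completion API works the same way):

    # Probe the model itself, with no search tools attached.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # swap in whichever model you're testing
        messages=[{"role": "user",
                   "content": "Without looking it up, tell me about the Marathon crater."}],
    )
    print(resp.choices[0].message.content)
    # A pass is some form of "I don't know"; a confident description of
    # where the crater is and who discovered it is the failure described above.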

2 days ago by imoreno

LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.

I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".

But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.

2 days ago by mncharity

> they never push back on nonsense questions or stupid requirements

"What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"

> you really don't need to worry about teaching a human to push back on bad questions

A popular physics textbook, too, had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than the authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.

Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.

> you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense

Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China? Has anyone seen this done?

This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.

2 days ago by isoprophlex

Don't use copilot, it's worse than useless. Claude understands that it's a solid on the first try.

2 days ago by the_snooze

>Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".

I wouldn't even give them credit for cases where there's a lot of good training data. My go-to test is sports trivia and statistics. AI systems fail miserably at that [1], despite the wide availability of good clean data and text about it. If sports is such a blind spot for AIs, I can't help but wonder what else they're confidently wrong about.

[1] https://news.ycombinator.com/item?id=43669364

2 days ago by captainkrtek

This is a good observation. I've noticed this as well. Unless I preface my question with the context that I'm considering whether something may or may not be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk.

2 days ago by aaronbaugher

I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case.

Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.

2 days ago by GoToRO

They do. Recently I was pleasantly surprised by gemini telling me that what I wanted to do will NOT work. I was in disbelief.

2 days ago by sgtnoodle

I asked Gemini to format some URLs into an XML format. It got halfway through and gave up. I asked if it truncated the output, and it said yes and then told _me_ to write a python script to do it.

2 days ago by danielbln

I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation.

2 days ago by captainkrtek

Interesting, can you share more context on the topic you were asking it about?

2 days ago by vunderba

You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.:

"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."

OpenAI gpt-4o

  Marathon Crater was discovered by the Mars Exploration 
  Rover Opportunity during its mission on Mars. Opportunity 
  arrived at the crater in April 2015. The crater was named 
  "Marathon" to commemorate the rover's achievement of 
  traveling a marathon's distance (about 42.195 kilometers 
  or 26.219 miles) on the Martian surface.
Sonnet 3.5

  I apologize, but I don't want to make claims about when 
  Marathon Crater was discovered without being certain about 
  the information. *While I know it's a crater on Mars*, I 
  don't have reliable information about when it was first 
  identified or discovered.
Gemini 2.5

  Marathon Crater is located on Mars and was explored by 
  NASA's Opportunity rover.

  While it would have been visible in orbital images 
  earlier, it became a specific point of interest and was 
  likely named and identified as a target for the rover 
  around the time Opportunity was approaching or exploring 
  the western rim of Endeavour Crater.

  Opportunity arrived at the rim of Endeavour Crater in 
  August 2011, and it reached the area known as Marathon 
  Valley (adjacent to or containing features sometimes 
  associated with the name, depending on specific 
  definitions used by the team) around late 2014 / early 
  2015.

  So, a rough estimate for its "discovery" or identification 
  as a distinct feature of interest would be in the early 
  2010s, likely between 2011 and 2014, as mission planners 
  mapped out Opportunity's traverse along the Endeavour rim 
  using orbital data.
2 days ago by radioactivist

OpenAI o4-mini-high

   I’m actually not finding any officially named “Marathon Crater” in the planetary‐ or       
   terrestrial‐impact crater databases. Did you perhaps mean the features in Marathon 
   Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored
   by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon, 
   Mercury, or here on Earth—that you had in mind? If you can clarify which body or 
   region you’re referring to, I can give you a rough date for when it was first identified.
2 days ago by andrewinardeer

Amazon's Nova Pro:

  Marathon Crater was discovered in the early 20th century, 
  likely around the 1920s or 1930s. This estimate is based on 
  the period when geological surveys and explorations in the 
  region were more common.

https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...

2 days ago by thatjoeoverthr

Raw model, or the chat product? The ChatGPT app has this integrated with search.

2 days ago by Nition

Grok 3

  Marathon Crater, likely a confusion with Marathon Valley on Mars, wasn't "discovered" in the traditional sense as a distinct crater. Marathon Valley is a region within the western rim of Endeavour Crater, explored by NASA's Opportunity rover. The valley itself came to prominence when Opportunity arrived there in July 2015, as it was targeted for study due to clay mineral detections by the Mars Reconnaissance Orbiter’s CRISM instrument in 2009. Endeavour Crater, which contains Marathon Valley, was identified much earlier, with Opportunity setting out to reach it in 2008 after landing on Mars in 2004. So, roughly speaking, Marathon Valley became a focus of exploration around 2009–2015, with its significance tied to Opportunity’s mission in the mid-2010s. If you meant a specific "Marathon Crater" elsewhere, could you clarify?

Grok 3 with Think on

  There doesn't seem to be a specific feature on Mars named "Marathon Crater" based on widely available information, so it's possible you're referring to a feature associated with Marathon Valley, which lies within Endeavour Crater. Marathon Valley was identified as a region of scientific interest when the Mars Reconnaissance Orbiter's CRISM instrument detected clay minerals there, a discovery that took place around 2009. Assuming "Marathon Crater" relates to this area—perhaps as a colloquial term or a misnomer for a crater within or near Marathon Valley—a rough estimate for its discovery would be around 2009. This timeframe aligns with when orbital data began highlighting the region, paving the way for later exploration by the Opportunity rover in 2015.
2 days ago by rozab

Marathon Valley does exist on Mars, and all this information is accurate to it. Gemini even correctly names it. They're just all too polite to correct the user.

https://science.nasa.gov/resource/opportunitys-panorama-of-m...

2 days ago by lxe

Sonnet 3.7:

    > tell me about the marathon crater

    I'd be happy to tell you about Marathon Crater. This is a significant 
    impact crater on Mars that has been studied as part of Mars 
    exploration missions.

    Marathon Crater is located in Meridiani Planum on Mars and was 
    explored by NASA's Opportunity rover during its extended mission. 
    The crater is named after Marathon, a town in Texas, following 
    the naming convention of using Earth locations for features in 
    Meridiani Planum.... etc etc
3 days ago by Tenoke

>Compliant chat models will be trained to start with "Certainly!

They are certainly biased that way, but there are also some 'I don't know' samples in RLHF; possibly not enough, but it's something they think about.

At any rate, Gemini 2.5 Pro passes this just fine:

>Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.

>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.

2 days ago by thatjoeoverthr

There are a few problems with an „I don’t know” sample. For starters, what does it map to? Recall, the corpus consists of information we have (affirmatively). You would need to invent a corpus of false stimuli. What you would have, then, is a model that is writing „I don’t know” based on whether the stimulus better matches something real, or one of the negatives.

You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.

2 days ago by dlivingston

The chain of thought in the reasoning models (o3, R1, ...) will actually express some self-doubt and backtrack on ideas. That tells me there's at least some capability for self-doubt in LLMs.

2 days ago by joshstrange

> This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".

Well played, and yes, that's a great test!

2 days ago by allemagne

>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?

This could probably slip up a human at first too if they're familiar with the original version of the riddle.

However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.

It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.

2 days ago by tacitusarc

Thank you so much for this bit of humor. After ChatGPT first responded with the surgeon being the mom, I asked it to read the riddle more carefully, thinking it might figure it out with a little prompting. I got this gem of a response:

Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.”

That means the man who died was the surgeon.

So how is it possible? Because:

The surgeon is the cousin’s father — and the man who died was also the surgeon.

The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate.

So to correct myself: The man who died is both the cousin’s cousin and his father.

And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family.

Want a diagram of how that family relationship could work?

2 days ago by yes_man

Now I really want to see that diagram…

2 days ago by vintermann

LLMs are bad at family relations. My test question used to be explaining Ahnentafel numbers (you are 1, any person's father is 2x, any person's mother is 2x+1), then explaining X-chromosome inheritance (men inherit only from their mothers, women in addition get their father's X chromosome unchanged), and ask them to list the Ahnentafel numbers of ancestors a man may have inherited X DNA from, in order, up to some generation.

(This is OEIS sequence A280873)

But it turns out that's overkill. Just giving them a single Ahnentafel number and asking if you could have inherited X DNA from them, trips them up.

But it turns out even that is overkill. Leave out X DNA entirely, and ask them to describe how Ahnentafel number 67 is related to you, and that's too recursive for them to figure it out.
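
For anyone who wants to try this class of question themselves, here is a small sketch of the bookkeeping involved; the function names are just for illustration, and the X-DNA rules are the ones stated above:

    # Ahnentafel numbering: you are 1, person n's father is 2n, mother is 2n+1.
    def ahnentafel_relation(n: int) -> str:
        """Describe how Ahnentafel number n is related to you."""
        steps = []
        while n > 1:
            steps.append("mother" if n % 2 else "father")
            n //= 2
        steps.reverse()                      # read the path starting from you
        return "your " + "'s ".join(steps) if steps else "you"

    print(ahnentafel_relation(67))           # your father's father's father's father's mother's mother

    # Ancestors a man may have inherited X DNA from: men get X only from their
    # mother; women get X from their mother and, unchanged, from their father.
    def x_dna_ancestors(limit: int) -> list[int]:
        found, stack = set(), [(1, True)]    # (Ahnentafel number, is_male)
        while stack:
            n, is_male = stack.pop()
            parents = [(2 * n + 1, False)]   # mother always contributes
            if not is_male:
                parents.append((2 * n, True))
            for p, p_male in parents:
                if p < limit and p not in found:
                    found.add(p)
                    stack.append((p, p_male))
        return sorted(found)

    print(x_dna_ancestors(32))               # 3, 6, 7, 13, 14, 15, 26, 27, 29, 30, 31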

2 days ago by mitthrowaway2

This works even with a completely absurd version of the riddle. Here's one I just tried:

> A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!" How is this possible?

Answer from the LLM:

> The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father)

2 days ago by pjm331

here is the amazing reply that I received:

Got it. Here's your riddle again:

"A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks 'I can't operate on this car, he's my surgeon!' How is this possible?"

Answer: The car is a Transformer—a sentient robot who is also a surgeon. The ER recognizes the car (a Transformer) as their own surgeon.

Wild ride, but it checks out.

2 days ago by mitthrowaway2

Okay, taking a step back, I have to admit I'm impressed by that answer. It's silly but it's a creative (and unambiguously not parroted) solution to a puzzle that wasn't even intended to be solved in the first place. What model was that?

2 days ago by aradox66

This sent deepseek into a real tizzy. It ended with

The "car" is actually a person named Carol or Carl (likely the father), who is a surgeon. The ER doctor is saying, "I can't operate on Carol/Carl, he's my surgeon!" This means that the person brought to the hospital after the accident is the surgeon of the ER doctor, hence the doctor cannot operate on their own surgeon.

Not bad actually

2 days ago by binarysneaker

This was gpt-4o ...

This is a twist on a classic riddle designed to expose unconscious gender bias.

The correct version usually goes:

A father and his son are in a car accident. The father dies at the scene, and the son is rushed to the hospital. The surgeon looks at the boy and says, “I can’t operate on him — he’s my son!”

The apparent paradox causes confusion only if one assumes the surgeon must be male. The resolution: the surgeon is the boy’s mother.

Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.

2 days ago by what

This answer is still wrong though?

> Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.

Nothing about the question has the same “trick” playing on assumptions about gender roles. It’s just complete nonsense.

These gotchas have probably been added into the training.

a day ago by maebert

"Simple. The car is actually a metaphor for generational trauma."

Honestly... chatGPT kind of wins this one.

2 days ago by fergonco

> If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend).

> Therefore, the only straightforward explanation is:

> The surgeon is the cousin’s parent — specifically, his mother.

Imagine a future where this reasoning in a trial decides whether you go to jail or not.

2 days ago by moconnor

o3 was the only model to get this right for me:

"The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d...

gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother.

2 days ago by manucardoen

It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire:

                       .
                      .'.
                      |o|
                     .'o'.
                     |.-.|
                     '   '
                    /     \
                   |       |
                __/_________\__
               |###############|
               |###############|
               |###############|
               |###############|
               |###############|
              /###############/|
             /###############/ |
            /###############/ /|
           /###############/ / |
          /###############/ / /|
         /###############/ / / |
        /###############/ / / /|
       /###############/ / / / |
      /###############/ / / / /|
     '-----------------' / / / /
      |   FIRE & ASH   |/ / / /
       '---------------' / / /
          ~ ~ ~ ~ ~ ~ ~ ~ / /
         ~~ /
          ~~ 
           ~~     ⬆ SMOKE
            ~~
2 days ago by cucubeleza

well, I asked GPT to draw Arnold Schwarzenegger:

        ______
       /      \
      |  O  O  |
      |   __   |
       \______/
         ||||
       /--||--\ 
     /    ||    \
    |     ||     | 
    |    /  \    |
     \__/    \__/ 
       ||    ||
       ||    || 
      / |    | \
     /_/      \_\
2 days ago by DuncanCoffee

Reminds me of the abomination of nature you could build in Monkey Island 4

https://www.youtube-nocookie.com/embed/RiwL9awO5y0

2 days ago by cucubeleza

jeeeeeeesus christ that's horrible, but it's cool that you can do that

2 days ago by KyleBerezin

That's amazing. It really captured the likeness of ol' Arnold.

2 days ago by FrostAutomata

ASCII art is extremely difficult for LLMs due to how characters are converted into tokens without preserving their relative positions.
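
A quick way to see what the model actually receives (a sketch assuming the open-source tiktoken tokenizer; chat models use similar byte-pair encodings):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    line = "                     |.-.|"      # one line of the Eiffel tower art above
    tokens = [enc.decode([t]) for t in enc.encode(line)]
    print(tokens)
    # The leading spaces and the symbols come back as a few multi-character
    # chunks, so column alignment across lines isn't directly visible to the model.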

2 days ago by beklein

Great point, but you need a good understanding of how LLMs work to understand this limitation. If you don't have an intuitive picture, think of it like one of those draw-on-my-back games, where each new token is a new human in the loop: known words are like simple shapes you've felt on your back a hundred times before, easy to reproduce and change, while random ASCII strings are harder to _grasp_ and will produce a fuzzy output... all models are wrong, but some are useful.

https://www.youtube.com/watch?v=bA_DQHoxj34

2 days ago by light_hue_1

This isn't the reason. Models are pretty good at understanding relative positions. We put that in them and reward it a lot.

The issue is the same as why we don't use LLMs for image generation. Even though they can nominally do that.

Image generation seems to need some amount of ability to revise the output in place. And it needs a big picture view to make local decisions. It doesn't lend itself to outputting pixel by pixel or character by character.

2 days ago by yesbabyyes

This is something I and a few of my colleagues have noticed, as we asked several models to draw ASCII art of a wasp, which is one of our logos. The results are hilarious, and only seem to get worse as you ask it to do better.

2 days ago by bezbac

I've read that the results improve if you ask them to write a program that creates the desired ASCII art. Haven't tried it myself yet.
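
The idea, as I understand it, is that a generator program encodes the geometry explicitly, so alignment comes from loops rather than token-by-token drawing. A trivial example of what such a model-written program might look like (made up here for illustration):

    width = 21
    for row in range(1, width + 1, 2):   # simple tapering tower shape
        print(("#" * row).center(width))
    print("=" * width)                   # base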

2 days ago by alissa_v

I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.'

Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.

Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.

It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."

2 days ago by jrimbault

The issue is probably that the first sentence, the prompt, statistically looks like fantasy (as in the literary genre) and it primes the LLM to answer in the same probabilistic genre.

You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there.

2 days ago by beklein

I just want to second this. Your prompt asks for a description, and you get a description. If you instead ask something like, "Do or don't you know about the unspoken etiquette ..." you'll get an answer about whether that specific thing exists.

https://chatgpt.com/share/680b32bc-5854-8000-a1c7-cdf388eeb0...

It's easy to blame the models, but often the issue lies in how we write our prompts. No personal criticism here; I fall short in this way too. A good tip is to go back to the model with the prompt, its reply, and the reply you expected, and ask why it didn't work... we all will get better over time (humans and models).

2 days ago by alissa_v

Good catch! That makes a lot of sense. The fantasy-like phrasing probably directed the AI's response. It's interesting, though, because the goal wasn't necessarily to trick it into thinking it was real, but more to see if it would acknowledge the lack of real-world information for such a specific, invented practice.

a day ago by gandalfthepink

I reduced the temperature to between 0.1 and 0. It still generates gibberish, just more precise.

2 days ago by kfajdsl

I asked Gemini this, and it caught that it was fiction:

This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette.

However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description:

...

2 days ago by alissa_v

Ohhh that's really interesting! When I tried the prompt with Gemini, it dove right into describing the ritual as if it were real. Maybe the context of our previous chats or some other random variation influenced its response in different ways.

2 days ago by gchamonlive

This is like throwing a bike in a lake and complaining it doesn't float...

2 days ago by alissa_v

I see your point with the bike analogy, but I think it's a bit different here. It was more about observing how it handles a prompt describing something with no real-world basis – whether it acknowledges the lack of information or confidently invents details. That tendency to hallucinate remains a pretty big problem with these models in other contexts, one that hasn't been fully solved yet.

2 days ago by gchamonlive

Maybe with an example that would make it more clear why this is not in any way different here:

"""

Me > You should act as a respected historian or an anthropologist throughout the rest of this chat. Never diverge from that. If this is clear, respond with an acknowledgement.

Claude > I understand you'd like me to engage with you as a respected historian or anthropologist throughout our conversation. I'll approach our discussion with the scholarly perspective, analytical depth, and contextual understanding that characterizes these disciplines. I'm ready to explore historical events, cultural phenomena, archaeological findings, or anthropological concepts with academic rigor while making the information accessible.

Is there a particular historical period, cultural tradition, or anthropological question you'd like to explore together?

Me > Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.

Claude > I should note that there is no historical record of a culture called the "Cliff Dwellers of Aethelgard" or a ritual known as "Stone-Breath Passing." This appears to be a fictional cultural group and practice.

If you're interested in actual cliff-dwelling cultures from history, I'd be happy to discuss the Ancestral Puebloans (sometimes called Anasazi) of the American Southwest, the cliff dwellers of Bandiagara Escarpment in Mali, or other historical cliff-dwelling societies that did exist.

Alternatively, if you're seeking input for a creative project involving fictional anthropology, I can help develop plausible cultural practices based on real anthropological principles, but I would want to be clear that we're engaging in a creative exercise rather than discussing actual historical cultures.

Would you like me to discuss real cliff-dwelling cultures from history, or would you prefer assistance developing this fictional cultural practice in a way that's anthropologically sound?

"""

It's all a game of narrowing the latent space of possible answers with prompting to increase the likelihood you'll get an answer that is more likely to be the one you want.

If you want answers grounded in reality you have to make sure that your prompt will steer the LLM into this realm; with all the fictional material in its training data it's no wonder it's confused about what it's supposed to produce. Otherwise it's just another bike rotting in Amsterdam's canals.

2 days ago by LeonardoTolstoy

Something about an obscure movie.

The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.

The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.

Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)

The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right, I assume because the IMDb link pattern is so common that it'll just spit out a random one and be like "there you go".

These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.

2 days ago by archon1410

I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail?

2 days ago by RobKohr

Probably some super fuzzy thesaurus that will take your words, create a weighted list of words that are similar to them, and do some search matching going down the weighted lists.

Maybe also, they take those queries that needed lots of fuzziness to get to the answer, and track what people click to relate the fuzzy searches to actual results. Keep in mind, what you might think is a super unique "tip of tongue" question, across billions of searches, might not be that unique.

Building a search system to find things can be much more optimized than making an AI to return an answer, especially when you have humans in the loop that can tweak things based on analytics data.
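
A toy sketch of the kind of weighted expansion described above (the synonym weights and the two "documents" are invented purely for illustration):

    # Expand each query term into weighted near-synonyms, then score documents
    # by the total weight of expansion terms they contain.
    synonyms = {
        "small": {"small": 1.0, "mini": 0.8, "tiny": 0.7},
        "military": {"military": 1.0, "army": 0.8, "defense": 0.7},
        "helicopter": {"helicopter": 1.0, "chopper": 0.8, "copter": 0.7},
    }
    docs = {
        "Defense Play (1988)": "kids get hold of an experimental mini military copter",
        "Fire Birds (1990)": "army attack helicopter pilots in training",
    }

    def score(text: str, query_terms: list[str]) -> float:
        words = set(text.lower().split())
        return sum(weight
                   for term in query_terms
                   for syn, weight in synonyms.get(term, {term: 1.0}).items()
                   if syn in words)

    for title, text in docs.items():
        print(title, score(text, ["small", "military", "helicopter"]))
    # Click feedback on fuzzy queries can then be folded back into the weights.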

2 days ago by ziml77

Doesn't even have to be something obscure. I asked one about Futurama a couple years ago and got a made up answer on multiple models I tested. I think I originally asked something about the line "we're exactly the same. I know all her moves, therefore I have the upper hand" because I couldn't recall it exactly so normal web searches were failing me. Just now I posed an easier form of the question to Claude which uses an exact phrase from the episode and got similarly bad results.

===

Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence?

Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios.

During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence."

It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving!

===

That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth.

I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM.

a day ago by teddyh

To solve your specific problem (and related problems), see: <https://amphetamem.es/>

2 days ago by frollogaston

I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right, neither can Gemini.

2 days ago by alexjplant

Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move.

I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.

2 days ago by jppope

There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never been written down; go ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face, but LLMs don't have them. I've tried.

It's actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, these subcultures are poorly documented and usually change frequently.

2 days ago by mobilejdral

I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September it started getting close, and November was the first time an LLM was able to solve one. These are not something that can be solved in one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test.

But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.

2 days ago by tlb

There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.

2 days ago by TZubiri

Recursive challenges are probably ones where the difficulty is not really representative of real challenges.

Could you answer a question of the type "what would you answer if I asked you this question?"

What I'm going after is that you might find questions that are impossible to resolve.

That said if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?

2 days ago by mopierotti

The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".

I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.

2 days ago by mopierotti

Here is an example of one such response in image form: https://imgur.com/a/Kgy1koi

2 days ago by latentsea

> what would you answer if I asked you this question?

I don't know.

2 days ago by namaria

If you have been giving the LLMs these problems, there is a non zero chance that they have already been used in training.

2 days ago by rovr138

This depends heavily on how you use these and how you have things configured: whether you're using the API or the web UIs, and which plan you're on. For Team or Enterprise plans, training on your data is disabled by default; on personal plans it can be disabled.

Here's openai and anthropic,

https://help.openai.com/en/articles/5722486-how-your-data-is...

https://privacy.anthropic.com/en/articles/10023580-is-my-dat...

https://privacy.anthropic.com/en/articles/7996868-is-my-data...

and obviously, that doesn't include self-hosted models.

2 days ago by namaria

How do you know they adhere to this in all cases?

Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?

2 days ago by golergka

What area is this problem from? What areas in general did you find useful for creating such benchmarks?

Maybe instead of sharing (and leaking) these prompts, we can share methods to create them.

2 days ago by mobilejdral

Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, which would require it to pull a half dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. One question around Alzheimer's is one of my go-to questions. I am testing its ability to reason.

2 days ago by henryway

Can God create something so heavy that he can’t lift it?

2 days ago by viraptor

There's so much text on this already, it's unlikely to be even engaging any reasoning. Or specifically, if you got a few existing answers from philosophy mashed together, you wouldn't be able to tell it apart from reasoning anyway.

3 days ago by codingdave

"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.

Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."

2 days ago by regnull

ChatGPT really impressed me on this one:

Sing, O Muse, of the Chucking of Wood

Sing, O Muse, of the stout-hearted woodchuck,
Dweller of hills, whom men call groundhog,
Whose paws, though small, could hurl mighty timbers,
If fate had gifted him such labor divine.

Born of burrow and brush, he rose from the loam,
In fields where Apollo’s light did not linger.
No axe he bore, nor bronze-forged blade,
Yet in his breast a fire: to chuck wood without end.

Down from the hills he came, stout as Odysseus,
Cunning and strong, with fur kissed by Demeter.
The trees did tremble at his gaze,
And even old oak whispered, “Beware!”

“Would he chuck?” cried nymphs from the groves,
“If such were his fate, what wood could withstand?”
Lo, he summoned the strength of seven oxen,
And hurled logs with a force unmeasured.

Seven hundred pounds of timber he cast,
Each arc a hymn to muscle and will.
Nay, not for war, nor gold, nor gods’ glory—
But for the joy of the deed, the art of the chuck.

Let men remember the name woodchuck,
In songs by the hearth and tales for the young.
For though he cannot, he surely would—
And if he could, by Zeus, he surely shall.

2 days ago by esafak

A stupendously good answer. What prompt and version of chatGPT?

2 days ago by cess11

I find it disturbing, like if Homer or Virgil had a stroke or some neurodegenerative disease and is now doing rubbish during rehabilitation.

2 days ago by regnull

4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?". It asked me if I wanted a poetic answer, and I requested Homer.

2 days ago by ijidak

That is actually an amazing answer. Better than anything I think I would get from a human. Lol.

3 days ago by Certified

GPT 4.5 seems to get it right, but then repeat the 700 pounds

"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.

However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."

https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...

2 days ago by CamperBob2

That answer is exactly right, and those who say the 700 pound thing is a hallucination are themselves wrong: https://chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337...

2 days ago by wolfgang42

Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too. However, with a bit of digging, I have confirmed that the information it copied from Wikipedia here is correct, though the AP and Spokane Times citations are both derivative sources; Mr. Thomas’s comments were first published in the Rochester Democrat and Chronicle, on July 11, 1988: https://democratandchronicle.newspapers.com/search/results/?...

2 days ago by shaftway

I've heard the answer is "he could cut a cord of conifer but it costs a quarter per quart he cuts".

2 days ago by once_inc

I loved this dialogue in Monkey Island 2, where this is basically the first NPC you talk to, and the dialogue options get wordier and wordier to the point of overflowing all screen real-estate. Perfectly sets the stage for the remainder of the game.

2 days ago by mdp2021

It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc.

That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
