What John Carmack is exploring is pretty revealing. Train models to play 2D video games to a superhuman level, then ask them to play a level they have not seen before, or another 2D video game they have not seen before. The transfer is negative. So, by my definition, no intelligence has been developed, only expertise in a narrow set of tasks.
It's apparently much easier to scare the masses with visions of ASI than to build a general intelligence that can pick up a new 2D video game faster than a human being.
Seeing comments here saying "this problem is already solved", "he is just bad at this", etc. feels bad. He has devoted a long time to this problem by now. He is trying to solve it to advance the field. And needless to say, he is a legend in computer engineering or w/e you call it.
Anyone saying "he just sucks" or "this was solved before" should be required to point to the "solution" and maybe explain how it works.
IMO the problem with current models is that they don't learn categorically, like: lions are animals, animals are alive; goats are animals, so goats are alive too. And if lions have some property like breathing and goats also have it, it is likely that other similar things have the same property.
Or when playing a game, a human can come up with a strategy like: I'll level this ability and lean on it at the start, then I'll level this other ability that takes more time to ramp up while using the first one, then switch to this play style once I have the new ability ready. This might be formulated entirely from theoretical ideas about the game, and modified as the player gets more experience.
With current AI models, as far as I understand, the model sees the whole game as one optimization problem and tries to find something at random that makes it win more. This is not as scalable as combining theory and experience the way humans do. For example, a human is innately capable of understanding that there is a concept of the early game, and that gains made in the early game can compound into a large lead. This is pattern matching as well, but at a higher level.
Theory makes learning more scalable compared to just trying everything and seeing what works.
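To make the "lions are animals, animals are alive" point above concrete, here's a toy sketch (the category graph and property names are made up for illustration) of the kind of inheritance-style inference being described - explicitly symbolic, which is not how current models represent knowledge:

```python
# Toy taxonomy: properties attach to categories and are inherited by members.
IS_A = {"lion": "animal", "goat": "animal", "animal": "living_thing"}
PROPERTIES = {"living_thing": {"is_alive"}, "animal": {"breathes"}}

def ancestors(thing):
    """Walk up the IS_A chain: lion -> animal -> living_thing."""
    while thing in IS_A:
        thing = IS_A[thing]
        yield thing

def properties_of(thing):
    """A thing inherits every property of every category above it."""
    props = set(PROPERTIES.get(thing, set()))
    for cat in ancestors(thing):
        props |= PROPERTIES.get(cat, set())
    return props

# One observation about 'lion' generalizes to 'goat' via the shared category,
# instead of having to be re-learned from scratch for every animal.
print(properties_of("lion"))  # {'breathes', 'is_alive'}
print(properties_of("goat"))  # {'breathes', 'is_alive'}
```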
He is not using appropriate models for this conclusion, nor is he using state-of-the-art models in this research; moreover, he doesn't have an expensive foundation model to build on for 2D games. It's just a fun project.
A serious attempt at video/vision would involve some probabilistic latent space that can be noised in ways that make sense for games in general. I think Veo 3 proves that AI can generalize 2D and even 3D games; generating a video under prompt constraints is basically playing a game. I think you could prompt Veo 3 to play any game for a few seconds and it would generally make sense even though it is not fine-tuned.
Veo 3's world model is still pretty limited. That becomes obvious very fast once you prompt out-of-distribution video content (i.e. stuff that you are unlikely to find on YouTube). It's extremely good at creating photorealistic surfaces and lighting. It even has some reasonably solid understanding of fluid dynamics for simulating water. But for complex human behaviour (in particular certain motions) it simply lacks the training data. Although that's not really a fault of the model, and I'm pretty sure there will be a way to overcome this as well. Maybe some kind of physics-based simulation as supplemental training data.
Is any model currently known to succeed in the scenario where Carmack's "inappropriate" models failed?
No monolithic models, but using hybrid approaches we've been able to beat humans for some time now.
What you're thinking of is much more like the Genie model from DeepMind [0]. That one is like Veo, but interactive (though not publicly available).
[0] https://deepmind.google/discover/blog/genie-2-a-large-scale-...
> I think Veo 3 proves that AI can generalize 2D and even 3D games; generating a video under prompt constraints is basically playing a game.
In the same way that keeping a dream journal is basically doing investigative journalism, or talking to yourself is equivalent to making new friends, maybe.
The difference is that while they may both produce similar, "plausible" output, one does so as a result of processes that exist in relation to an external reality.
Indeed, it's nothing but function fitting.
I wonder if this is a case of overfitting from allowing the model to grow too large, and if you might cajole it into learning more generic heuristics by putting some constraints on it.
It sounds like the "best" AI without constraints would just be something like a replay of a record speedrun rather than a smaller set of heuristics for getting through a game, though the latter is clearly much more important with unseen content.
I'd say with confidence: we're living in the early days. AI has made jaw-dropping progress in two major domains: language and vision. With large language models (LLMs) like GPT-4 and Claude, and vision models like CLIP and DALL·E, we've seen machines that can generate poetry, write code, describe photos, and even hold eerily humanlike conversations.
But as impressive as this is, it's easy to lose sight of the bigger picture: we've only scratched the surface of what artificial intelligence could be, because we've only scaled two modalities: text and images.
That's like saying we've modeled human intelligence by mastering reading and eyesight, while ignoring touch, taste, smell, motion, memory, emotion, and everything else that makes our cognition rich, embodied, and contextual.
Human intelligence is multimodal. We make sense of the world through:
Touch (the texture of a surface, the feedback of pressure, the warmth of skin); Smell and taste (deeply tied to memory, danger, pleasure, and even creativity); Proprioception (the sense of where your body is in space - how you move and balance); Emotional and internal states (hunger, pain, comfort, fear, motivation).
None of these are captured by current LLMs or vision transformers. Not even close. And yet, our cognitive lives depend on them.
Language and vision are just the beginning - the parts we were able to digitize first - not necessarily the most central to intelligence.
The real frontier of AI lies in the messy, rich, sensory world where people live. We'll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.
> Language and vision are just the beginning - the parts we were able to digitize first - not necessarily the most central to intelligence.
I respectfully disagree. Touch gives pretty cool skills, but language, video and audio are all that are needed for all online interactions. We use touch for typing and pointing, but that is only because we don't have a more efficient and effective interface.
Now I'm not saying that all other senses are uninteresting. Integrating touch, extensive proprioception, and olfaction is going to unlock a lot of 'real world' behavior, but your comment was specifically about intelligence.
Compare humans to apes and other animals and the thing that sets us apart is definitely not in the 'remaining' senses, but firmly in the realm of audio, video and language.
> Language and vision are just the beginning - the parts we were able to digitize first - not necessarily the most central to intelligence.
I probably made a mistake when I asserted that -- should have thought it over. Vision is evolutionarily older and more "primitive", while language is uniquely human [or maybe, more broadly, primate, cetacean, cephalopod, avian...], symbolic, and abstract -- arguably a different order of cognition altogether. But I maintain that each and every sense is important as far as human cognition -- and its replication -- is concerned.
People who lack one of those senses, or even two of them, tend to do just fine.
Organic adaptation and persistence of memory are, I would say, the two major advancements that need to happen.
Human neural networks are dynamic: they change and rearrange, grow and sever. An LLM is fixed and relies on context; if you give it the right answer, it won't "learn" that it is the correct answer unless it is fed back into the system and retrained over months. And what if it's only the right answer for a limited period of time?
To build an intelligent machine, it must be able to train itself in real time and remember.
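As a minimal illustration of the contrast, here's a sketch in PyTorch (a toy model and a made-up data stream, nothing LLM-scale) of what "training itself in real time" looks like mechanically: the weights change after every observation instead of staying frozen between month-long training runs.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a tiny regression model learning from a live stream.
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def observe():
    """Stand-in for a stream of (input, correct answer) pairs arriving over time."""
    x = torch.randn(1, 4)
    y = x.sum(dim=1, keepdim=True)  # the 'right answer' the environment provides
    return x, y

# Online learning loop: every new example immediately changes the weights,
# unlike a deployed LLM whose weights stay fixed between training runs.
for step in range(1000):
    x, y = observe()
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```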
Yes, and: forget.
> Language and vision are just the beginning...
Based on the architectures we have, they may also be the ending. There's been a lot of news in the past couple of years about LLMs, but have there been any breakthroughs making headlines anywhere else in AI?
> There's been a lot of news in the past couple of years about LLMs, but have there been any breakthroughs making headlines anywhere else in AI?
Yeah, lots of stuff tied to robotics, for instance; this overlaps with vision, but the advances go beyond vision.
Audio has seen quite a bit. And I imagine there is stuff happening in niche areas that just isn't as publicly interesting as language, vision/imagery, audio, and robotics.
Two Nobel prizes in chemistry: https://www.nature.com/articles/s41746-024-01345-9
Sure. In physics, math, chemistry, biology. To name a few.
> The real frontier of AI lies in the messy, rich, sensory world where people live. We'll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.
Like Doctor Who said: Daleks aren't brains in a machine, they are the machine!
The same is true for humans. We really are the whole body; we're not just driving it around.
There are many people who developed mentally while paralyzed and who literally drive their bodies around via motorized wheelchair. I don't think there's any evidence that a brain couldn't exist or develop in a jar, given only the inputs modern AI now has (text, video, audio).
> any evidence that a brain couldn't exist or develop in a jar
The brain could. Of course it could. It's just a signals processing machine.
But would it be missing anything we consider core to the way humans think? Would it struggle with parts of cognition?
For example: experiments were done with cats growing up in environments with vertical lines only. They were then put in a normal room and had a hard time understanding flat surfaces.
https://computervisionblog.wordpress.com/2013/06/01/cats-and...
Sometimes we get confused by the difference between technological and scientific progress. When science makes progress it unlocks new S-curves that advance at an incredible pace until you get into the diminishing-returns region. People complain of slowing progress, but it was always slow; you just didn't notice that nothing new was happening during the exponential take-off of the S-curve, just furious optimization.
As far back as 2017 I copped a lot of flak for suggesting that the coming automation revolution will be great at copying office workers and artists but won't come close to replacing the whole human race. A lot of the time Moore's law got thrown back in my face. But that's how this works: we unlock something new, we exploit it as far as possible, the shine wears off, and we deal with the aftermath.
Fully agree.
And at the same time I have noticed that people don't understand the difference between an S-curve and an exponential function. They can look almost identical at certain intervals.
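A quick toy illustration of that point (arbitrary numbers, just to show the shape of the problem):

```python
import math

def exponential(t):
    return math.exp(t)

def logistic(t, ceiling=100.0):
    # Scaled so that it tracks exp(t) closely while still far below the ceiling.
    return ceiling / (1.0 + (ceiling - 1.0) * math.exp(-t))

for t in range(0, 9):
    e, s = exponential(t), logistic(t)
    print(f"t={t}: exp={e:8.1f}  s-curve={s:8.1f}")

# Early on the two curves are nearly indistinguishable; past the inflection
# point the S-curve flattens toward 100 while the exponential keeps exploding.
```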
The crypto mind cannot comprehend
You're being awfully generous to describe basic hype as "technological progress."
If you work with model architectures and read papers, how could you not know there is a flood of new ideas? Only a few yield interesting results, though.
I kind of wonder if libraries like PyTorch have hurt experimental development. There are so many basic concepts no one thinks about anymore because they just use the out-of-the-box solutions. And maybe those solutions are great and those parts are "solved", but I am not sure. How many models are using someone else's tokenizer, or someone else's strapped-on vision model, just to check a box in the model card?
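For what it's worth, here's how trivial that kind of reuse is in practice - a sketch using the Hugging Face transformers library, with "gpt2" standing in for whichever pretrained tokenizer gets pulled in:

```python
from transformers import AutoTokenizer

# Reusing GPT-2's tokenizer wholesale: its vocabulary, merge rules, and all the
# design decisions baked into it come along for free, whether or not they suit
# the new model or its training data.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer("Frameworks make this step invisible.")["input_ids"]
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```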
That's been the very normal way of the human world.
When the foundation layer at a given moment doesn't yield an ROI on intellectual exploration - say, because you can overcompensate with VC-funded raw compute and make more progress elsewhere - few(er) will go there.
But inevitably, as other domains reach diminishing returns, bright minds will look around for where significant gains for their effort can be found.
And so will the next generation of PyTorch or foundational technologies evolve.
The people who don't think about such things probably wouldn't develop experimentally sans PyTorch either.
Yeah, and even then, it's been ~2-3 years since the last major architectural improvement, major enough for a lot of people to actually hear about it and use it daily. I think some people lose perspective on how short a time frame 3 years is.
But yes, there's a ton of interesting and useful stuff (beyond datasets and data related improvements) going on right now, and I'm not even talking about LLMs. I don't do anything related to LLM and even then I still see tons of new stuff popping up regularly.
It's the opposite.
Frameworks like PyTorch are really flexible. You can implement any architecture, and if that's not enough, you can learn CUDA.
Keras is the opposite; it's probably like you describe things.
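To that point, here's a minimal sketch of a from-scratch module in PyTorch (an arbitrary toy block, not any particular published architecture): you write the forward pass as ordinary Python and autograd handles the rest.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Toy custom layer: a linear transform modulated by a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        # Any Python control flow or tensor math can go here; PyTorch records
        # the operations dynamically and backpropagates through them.
        return torch.sigmoid(self.gate(x)) * torch.tanh(self.transform(x)) + x

block = GatedBlock(16)
out = block(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 16])
```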
To be fair, if you imagine a system that successfully reproduced human intelligence, then 'changing datasets' would probably be a fair summary of what it would take to have different models. After all, our own memories, training, education, background, etc are a very large component of our own problem solving abilities.
I will respectfully disagree. All "new" ideas come from old ideas. AI is a tool to access old ideas with speed and with new perspectives that haven't been available up until now.
Innovation is in the cracks: recognition of holes, intersections, tangents, etc. in old ideas. It has been said that innovation is done on the shoulders of giants.
So AI can be an express elevator up to an army of giants' shoulders? It all depends on how you use the tools.
Access old ideas? Yes. With new perspectives? Not necessarily. An LLM may be able to assist in interpreting data with new perspectives but in practice they're still fairly bad at greenfield work.
As with most things, the truth lies somewhere in the middle. LLMs can be helpful as a way of accelerating certain kinds and certain aspects of research but not others.
> Access old ideas? Yes. With new perspectives?
I wonder if we can mine patent databases for old ideas that never worked out in the past, but now are more useful. Perhaps due to modern machining or newer materials or just new applications of the idea.
Imagine a human who had read every book/publication in every field of knowledge that mankind has ever produced AND couldn't come up with anything entirely new. Hard to imagine.
My hypothesis about the mismatch is centered around "read" - I think that when you wrote it, and when others similarly think about that scenario, the surprise comes because our version of "read" is the implied "read and internalized", or at bare minimum "read for comprehension", but as best I can tell the LLM's version is "encoded tokens into vector space" and not "encoded into a semantic graph".
I welcome the hair-splittery that is sure to follow about what it means to "understand" anything.
It is possible that such a human wouldn't come up with anything new, even if they could.
The article is discussing working on AI innovation vs. focusing on getting more and better data. And while there have been key breakthroughs in new ideas, one of the best ways to increase the performance of these systems is getting more and better data - and many people think data is the primary avenue to improvement.
It reminds me of an AI talk a few decades ago, about how the cycle goes: more data -> more layers -> repeat...
Anyways, I'm not sure how your comment relates to these two avenues of improvement.
> I will respectfully disagree. All "new" ideas come from old ideas.
The insight into the structure of the benzene ring famously came in a dream: the structure hadn't been seen before, but was imagined as a snake biting its own tail.
And as we all know, it came in a dream to a complete novice in chemistry with zero knowledge of any old ideas in chemistry: https://en.wikipedia.org/wiki/August_Kekul%C3%A9
--- start quote ---
The empirical formula for benzene had been long known, but its highly unsaturated structure was a challenge to determine. Archibald Scott Couper in 1858 and Joseph Loschmidt in 1861 suggested possible structures that contained multiple double bonds or multiple rings, but the study of aromatic compounds was in its earliest years, and too little evidence was then available to help chemists decide on any particular structure.
More evidence was available by 1865, especially regarding the relationships of aromatic isomers.
[ Kekule claimed to have had the dream in 1865 ]
--- end quote ---
The dream claim came from Kekulé himself 25 years after his proposal - a proposal he had to modify 10 years after making it.
Man I can't wait for this '''''AI''''' stuff to blow over. The back and forth gets a bit exhausting.
What about actively obtained data - models seeking data, rather than being fed. Human babies put things in their mouths, they try to stand and fall over. They "do stuff" to learn what works. Right now we're just telling models what works.
What about simulation: models can make 3D objects so why not give them a physics simulator? We have amazing high fidelity (and low cost!) game engines that would be a great building block.
What about rumination: behind every Cursor rule, for example, is a whole story of why a user added it. Why not take the rule, ask a reasoning model to hypothesize about why that rule was created, and add that rumination (along with the rule) to the training data? Providing opportunities to reflect on the choices made by their users might deepen any insights, squeezing more juice out of the data.
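On the "do stuff and see what works" and simulation ideas above, the plumbing already exists in standard RL tooling. Here's a minimal sketch using the Gymnasium API, with a random policy standing in for whatever model is doing the exploring and CartPole as a placeholder physics environment:

```python
import gymnasium as gym

# A physics-backed environment the agent can poke at and observe.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

experience = []  # actively gathered (observation, action, result) data
for _ in range(200):
    action = env.action_space.sample()  # stand-in for the model's choice
    next_obs, reward, terminated, truncated, info = env.step(action)
    experience.append((obs, action, reward, next_obs))
    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()

env.close()
print(f"collected {len(experience)} transitions by acting, not by being fed data")
```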
Simulation and embodied AI (putting the AI in a robotic arm or a car so it can try stuff and gather information about the results) are very actively being explored.
What about at inference time, i.e. in response to a query?
We let models write code and run it. Which gives them a high chance of getting arithmetic right.
Solving the "crossing the river" problem by letting the model create and run a simulation would give a pretty high chance of getting it right.
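For a concrete sense of what that could look like, here's roughly the throwaway simulation a model could write and run for the classic wolf/goat/cabbage version of the puzzle - just a breadth-first search over states, nothing model-specific:

```python
from collections import deque

# Classic wolf / goat / cabbage river crossing, solved by brute-force search.
ITEMS = {"wolf", "goat", "cabbage"}
UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]  # pairs that can't be left alone

def safe(bank):
    return not any(pair <= bank for pair in UNSAFE)

def solve():
    start = (frozenset(ITEMS), "left")   # everything, including the farmer, starts on the left
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if (left, farmer) == goal:
            return path
        here = left if farmer == "left" else ITEMS - left
        for cargo in list(here) + [None]:          # ferry one item across, or cross alone
            new_left = set(left)
            if cargo:
                (new_left.discard if farmer == "left" else new_left.add)(cargo)
            new_farmer = "right" if farmer == "left" else "left"
            unattended = new_left if new_farmer == "right" else ITEMS - new_left
            state = (frozenset(new_left), new_farmer)
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [(cargo or "nothing", new_farmer)]))

print(solve())  # a shortest sequence of (cargo carried, bank the farmer ends up on) moves
```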
That would be reinforcement learning. The juice is quite hard to squeeze.
Agreed for most cases.
Each Cursor rule is a byproduct of tons of work and probably contains lots that can be unpacked. Any research on that?