I have no affiliation with them but here's what I think happened:
1. They claim the official model is based on Qwen 397B. It's likely they didn't disclose Nex Pro at all because Nex itself is based on the same base model (not saying they shouldn't).
2. The improvement would come from merging the weights PLUS on-policy distillation. The confusion is that the uploaded model didn't have the distillation at all.
3. It's important to note they didn't advertise the model beyond posting it on Reddit 2 days ago. It went viral organically, over the weekend, and during Brazil's World Cup debut (Brazilians will understand). Of course the mayor of Rio took the opportunity to capitalize on the free coverage, but that wasn't done in conjunction with the researchers.
4. I don't see why they would disclose Qwen 397B as base and mention the SwiReasoning paper but not mention Nex if all they did was to merge both models.
5. In any case, what they are claiming is easily verifiable once (if) they upload the right model.
Regarding #2
I'm honestly impressed that this even happened at all. "Rio de Janeiro's homegrown LLM" is probably the last headline I ever expected to read on HN.
Worth reminding everyone that Lua was also created in Rio, though admittedly at PUC rather than by the government.
Rio has a strong engineering talent pool, along with many other major capitals in Brazil.
Brazil does have talent. Mauro Carvalho Chehab is a Linux kernel maintainer. Elixir was created by José Valim, a Brazilian. I have also created my own programming language.
What Brazil doesn't have is a history of properly rewarding talent, which often causes it to migrate elsewhere. So it's definitely surprising when any sort of technological development happens in Brazil: it implies someone who stayed managed to get something done, most likely for much less than what that something is actually worth, while also being crushed by extremely high taxes that essentially double the cost of computer hardware.
Yes! That "prefeitura do Rio" huggingface URL is definitely shocking to read to this Brazilian as well (I'm assuming you and parent also are from your usernames).
> 2. The improvement would come from merging the weights PLUS on-policy distillation. The confusion is that the uploaded model didn't have the distillation at all.
They merged the base model with another lab's fine-tuned model. The improvements could have come from getting some of the fine-tuned weights from the other model.
If they really had a better performing model that they "accidentally" forgot to upload, they could have uploaded the correct file by now.
What do you mean World Cup debut? Haven't they won 5?
They meant their first, opening game of this current World Cup tournament
> Every weight tensor in Rio is, to thousands of standard deviations, the same 0.6/0.4 blend of Nex and Qwen, across all 60 layers and every component of the network. Other finetunes cannot be explained as interpolations.
I find it amazing how robust the current deep learning models are. A simple linear combination of every weight did not degrade the performance of the model, but enhanced it.
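For anyone wondering what "a simple linear combination of every weight" looks like in practice, here is a minimal sketch. It assumes both checkpoints expose the same tensor names and shapes (true for a base model and its own finetune), and it flattens tensors to plain lists of floats so the example runs without any ML library. The `blend` helper and the toy tensors are hypothetical; only the 0.6/0.4 ratio comes from the quoted issue.

```python
def blend(state_a, state_b, weight_a=0.6):
    """Return weight_a * A + (1 - weight_a) * B for every tensor name."""
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must share the same tensor names")
    merged = {}
    for name, tensor_a in state_a.items():
        tensor_b = state_b[name]
        if len(tensor_a) != len(tensor_b):
            raise ValueError(f"shape mismatch for {name}")
        # Element-wise weighted average of the two tensors.
        merged[name] = [
            weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(tensor_a, tensor_b)
        ]
    return merged


# Toy stand-ins for a finetune and its base checkpoint (made-up numbers).
finetune = {"layer0.weight": [1.0, 2.0, 3.0], "layer0.bias": [0.5]}
base = {"layer0.weight": [0.0, 2.0, 5.0], "layer0.bias": [-0.5]}

merged = blend(finetune, base, weight_a=0.6)
print(merged)
```

Note there is no training step anywhere in this: it is arithmetic on two existing checkpoints, which is why it is cheap compared with post-training.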
> A simple linear combination of every weight did not degrade the performance of the model, but enhanced it.
Enhanced it on a couple benchmarks, supposedly.
The game is to turn knobs until you get a benchmark run that shows an improvement, then ship it. There are a lot of fine tunes and chimera models on HuggingFace that are supposedly better at some specific test, but when you use them for anything else they're usually worse.
This happens with a lot of the models that are modified to remove censorship. They succeed in getting the model to emit previously censored outputs, but the overall output quality decreases.
They seem to have deleted most of the README now, but the archived version has benchmarks.
https://web.archive.org/web/20260614082641/https://huggingfa...
And the Nex benchmarks for comparison
https://huggingface.co/nex-agi/Nex-N2-Pro
Rio seems to be about halfway between Qwen 3.5 and Nex, as you'd expect?
I don't think your last point is correct. Ablation, when done correctly, seems to increase the quality and typically also the performance too.
That is something often claimed by heretics. My experience couldn't diverge more, however. All heretic (and abliterix) models I've tried are worse than the original. It's not immediately obvious if all you do is ask 2-3 questions and marvel at how it didn't refuse, but try using them for real over longer 8k+ contexts and it falls apart real fast.
They're more prone to getting stuck in loops, becoming unresponsive, and hallucinating more (presumably because of the reduced desire to not answer).
I've tried all the popular heretic peddlers, but if you have one that you can vouch for maybe I've simply missed it.
Abliteration is a brute force technique that removes or silences parts of the model. It reduces performance because the abliterated elements aren't perfectly isolated to censorship, so other aspects suffer.
Many of the "uncensored" model providers also do some fine-tuning on the models. Some of them target better benchmarks or other measures, but outside of the benchmarks and metrics they're fine-tuned for, they are generally noticeably worse than the original model.
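To make the "not perfectly isolated" point concrete, here is a rough sketch of one formulation of abliteration as I understand it: pick a direction `r` in the layer's output space that is assumed to correlate with refusals, then project it out of a weight matrix, i.e. `W' = W - r r^T W`. The matrix, the direction, and the `ablate_direction` helper are all made up for illustration; real tools estimate the direction from activations and apply this across many layers.

```python
import math


def ablate_direction(weight, direction):
    """Project `direction` out of the output space of `weight`: W - r r^T W."""
    norm = math.sqrt(sum(x * x for x in direction))
    r = [x / norm for x in direction]  # unit vector
    rows, cols = len(weight), len(weight[0])
    # proj[j] is the component of column j that lies along r.
    proj = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [
        [weight[i][j] - r[i] * proj[j] for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical 3x2 weight matrix and a hypothetical "refusal" direction.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
direction = [1.0, 1.0, 0.0]

ablated = ablate_direction(W, direction)
print(ablated)
```

Every row with a nonzero component along `r` gets changed, whether or not what it encodes has anything to do with refusals. That is the collateral damage described above.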
> game is to turn knobs until you get a benchmark run that shows an improvement, then ship it
I.e., reinforcement learning against a weak reward function: the benchmark is insufficiently complex and not sufficiently representative of the real world.
The "game", i.e. the decision tree, can be modeled as a multi-armed bandit problem: deploying finite resources (compute) toward exploitation or exploration.
The main issue is that each training or fine-tuning run is very expensive, so the number of chances at the slot machine, so to speak, is pretty limited today.
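As a toy illustration of the bandit framing, here is a minimal epsilon-greedy sketch. Each "arm" stands in for a hypothetical training recipe, each pull for one expensive run, and the reward for a noisy benchmark score. The means, noise level, and the `epsilon_greedy` helper are invented for the example and do not come from anything in the thread.

```python
import random


def epsilon_greedy(true_means, pulls, epsilon=0.1, seed=0):
    """Spend `pulls` runs across arms, mostly exploiting the best estimate so far."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    for _ in range(pulls):
        if rng.random() < epsilon or 0 in counts:
            # Explore: try a random recipe (also used until every arm is tried once).
            arm = rng.randrange(len(true_means))
        else:
            # Exploit: rerun the recipe with the best average score so far.
            arm = max(range(len(true_means)), key=lambda i: estimates[i])
        reward = rng.gauss(true_means[arm], 1.0)  # noisy benchmark result
        counts[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy([0.1, 0.5, 0.9], pulls=200)
print(counts, estimates)
```

With only a handful of pulls the estimates stay noisy, which is the point about limited chances at the slot machine: a lucky benchmark run is easy to mistake for a real improvement.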
This works because Nex itself is a finetune of Qwen3.5 (https://huggingface.co/nex-agi/Nex-N2-Pro). It's merging Qwen3.5 with a Qwen3.5 finetune.
I don't believe this would work on two LLMs that have different pretraining. Even if it did, you would need two LLMs that have the exact same internal activation shapes, dimensions, expert counts, and token vocabulary. Realistically, it would never happen outside of finetunes or academic experiments.
not this exact thing, no, because the functional circuits dont appear in the same places across models. but if you find where they are you can do something like branch between some of the middle functional circuits between models and it kinda just works, or even do one after the other. you cant just like swap any two layers cause a bunch of em bend hyperbolic curvature to do hierarchical stuff deep in the poincare ball and the geometries get all bonkers, but before and after they do that things are relatively flat, and the geometries are more or less transferrable up to rigid rotation if they're each trained on large enough data.
Correct. We used to think that because NN optimization is non-convex there are all these local minima. Now we know that once you get past the very early parts of training from random init, the loss surface is fairly smooth, and not really convex, but close enough in a bunch of ways - linear combinations of trained models are pretty much always valid combinations. You can think of fine tunings as deltas on the original model which can be summed together successfully. I think this paper first showed that to me: https://arxiv.org/pdf/1802.10026 which was 8 years ago now.
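The "fine-tunings as deltas" view can be sketched in a few lines. This assumes every finetune shares the same base checkpoint and tensor layout, and uses plain lists of floats instead of real tensors. The helper names `delta` and `apply_deltas` and the toy numbers are hypothetical.

```python
def delta(finetuned, base):
    """Per-tensor difference between a finetune and its base."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_deltas(base, deltas, scales):
    """Add scaled deltas back onto the base weights."""
    out = {k: list(v) for k, v in base.items()}
    for d, s in zip(deltas, scales):
        for k, vec in d.items():
            out[k] = [o + s * x for o, x in zip(out[k], vec)]
    return out


# Toy base model and two finetunes of it (made-up numbers).
base = {"w": [1.0, 1.0]}
finetune_a = {"w": [1.5, 1.0]}
finetune_b = {"w": [1.0, 0.0]}

delta_a = delta(finetune_a, base)
delta_b = delta(finetune_b, base)

# Summing both deltas onto the base combines the two finetunes.
combined = apply_deltas(base, [delta_a, delta_b], [1.0, 1.0])

# A single delta scaled by 0.6 is the same thing as a 0.6/0.4 blend
# of the finetune and the base: base + 0.6 * (ft - base).
blended = apply_deltas(base, [delta_a], [0.6])
print(combined, blended)
```

Whether the summed model is actually any good still has to be checked empirically. The arithmetic only guarantees that the combination is well defined.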
It's a well-known idea[1], although it's still surprising that something so simple even works.
This team could have stopped here and still had something interesting (albeit not novel) to show. But the hype cycle was too tempting.
What I find fascinating is the idea that there might be a set of "secret" tweaks that when applied to those weights (or even smaller models) could result in an intelligence simulation that could vastly surpass even something like Fable.
The municipality of Rio de Janeiro (via its IT company IplanRIO) released Rio-3.5-Open-397B, presented as a homegrown Qwen3.5 fine-tune that beats comparable open models on benchmarks. The linked issue argues it's actually a weighted merge of ~60% Nex-N2 Pro + ~40% Qwen3.5-397B-A17B - Nex-N2 having been released about a week earlier.
I didn't know model merging like that was possible. (Obviously possible from a pure software standpoint but I'm surprised it's effective)
As another poster above linked, it's been shown to be effective since 2022: https://arxiv.org/abs/2203.05482
It works because Nex N2 is also a derivative of the original base Qwen model. If it were two completely unrelated models it wouldn't work.
So the problem isn't in the missing attribution to Qwen, but with the fact that they didn't mention Nex-N2 Pro, right?
The problem is that they claimed to have made a big achievement with their home grown post training, and they expected to receive a lot of praise for it.
Then researchers looked at the weights and there is no post training at all.
They are now attributing both models they merged, but their excuse for the lack of post training is to claim they accidentally uploaded the wrong files.
I'd believe they accidentally uploaded the wrong files if they uploaded the correct ones. To state that they accidentally uploaded something else and then not upload the correct version means they probably do not have anything and either hope people forget about this or they are scrambling to have something that is at least close to their original claim.
Oh no, someone is profiting off of their work without proper attribution!?!?
This is an open weights model based on other open weights models.
The dispute is that they released it with claims about having done some post training that improved the outputs. It was discovered that the model was not post trained like they claimed.
The HF page now says it's a merge of models, which wasn't there before. They're trying to claim they accidentally uploaded the wrong model to HF and that they'll upload the real one soon.
Basically, they thought they could splice two open weights models together and claim their team had accomplished some amazing post training, but they weren't smart enough to realize that other researchers would discover that there wasn't any post training.
Thanks for the factual clarification. This is so important when everyone already has their trigger finger on politics. Not meaning that politics are irrelevant here, see sister comment by jobim.
But it's impossible to form a nuanced opinion when political association has a higher priority than the facts; which, again, don't look flattering for the implementers.
How do they just splice two models together?
The Nex N2 model they merged is based on Qwen 3.5, so you can swap pieces of one into the other. They found a combination of the two that did well on some benchmarks and shipped it.
In the early days of Llama there were a lot of experiments like this. There were even some interesting combinations of models where they stacked layers of different models together or even added more layers with interesting results.
But announcing that you spliced two models together isn't very impressive in 2026, so they announced that they had done their own post training and outdid the big labs. They thought nobody would look close enough to notice.
Out of curiosity, how was it discovered? You would have to look for it to find this linear combination.
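I can't speak to how the issue author actually did it, but one plausible way to check is sketched below. For each tensor, fit the single coefficient `alpha` that best explains the suspect weights as `alpha * A + (1 - alpha) * B`, then look at how much is left unexplained. If `alpha` is about the same for every tensor and the residual is near zero, a plain interpolation fits. The `fit_blend` helper and the toy vectors are hypothetical.

```python
import math


def fit_blend(target, a, b):
    """Least-squares alpha for target ~ alpha*a + (1-alpha)*b, plus relative residual."""
    d = [x - y for x, y in zip(a, b)]        # a - b
    t = [x - y for x, y in zip(target, b)]   # target - b
    denom = sum(x * x for x in d)
    if denom == 0.0:
        raise ValueError("a and b are identical; alpha is undefined")
    alpha = sum(x * y for x, y in zip(t, d)) / denom
    resid = math.sqrt(sum((ti - alpha * di) ** 2 for ti, di in zip(t, d)))
    norm = math.sqrt(sum(x * x for x in t))
    rel_resid = resid / norm if norm else 0.0
    return alpha, rel_resid


# Toy tensors standing in for one layer of each candidate parent model.
a = [1.0, 2.0, 3.0, -1.0]
b = [0.0, 2.0, 5.0, 1.0]

# A tensor that really is a 0.6/0.4 blend, and one that is unrelated.
suspect = [0.6 * x + 0.4 * y for x, y in zip(a, b)]
unrelated = [3.0, -1.0, 0.5, 2.0]

alpha_s, resid_s = fit_blend(suspect, a, b)
alpha_u, resid_u = fit_blend(unrelated, a, b)
print(alpha_s, resid_s, alpha_u, resid_u)
```

A finetune that involved real training would be expected to leave a residual that the interpolation cannot absorb, which matches the quoted claim that other finetunes cannot be explained as interpolations.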
How do you feel about the government or government contractors saying they did a bunch of work when they did nothing instead?
Attribution isn't the relevant part. Lying about your lab's capabilities is.
That's also something all the AI companies have been doing.
Lying about model capability is right now the lingua franca of the cloud AI business model, almost; they yes-and each other's lies because they are in a position of needing to generate interest, including going as far as needing to trigger regulatory capture.
(It's not news to anyone who has worked in sales-led businesses that salespeople are prone to believing the claims of other salespeople, I guess).
They're using public money to "train" this.
Sounds like the whole AI movement.
It seems to me like the lies are both for the same reason. To capture attention and profits that are not deserved.
But the whole game is lying and stealing isn't it?
Can someone please explain or link to some information about how models are merged? Is this genuinely merging weights mathematically or some kind of distillation (presumably not if they've done zero training as the post suggests)?
This is a good starting point: https://huggingface.co/docs/peft/developer_guides/model_merg...
But yes, in general, merging refers to techniques that directly blend the weights of different models mathematically. It had a big moment of popularity ~2 years ago, with many so-called "Frankenmodels" popping up on leaderboards.
I tend to think of merging as belonging to the same general umbrella as things like "abliteration", or other techniques that surgically modify the weights of a model without a traditional training/tuning loop. Maxime Labonne is a great person to follow if you're interested in this general area.
I'm honestly surprised that they even had the inclination to attempt creating a model. I guess it's bullish that a municipal IT department had the guts to try this?
Merges and fine tunes are within reach of individuals with some money to burn, so I'm sure a muni can do it.
I like the [dead] comment theory that they proposed a huge LLM training budget to the government, kept most of the money, and released a cheap merge to justify the grift.
This would be so very brazilian of them.
Source: am Huelander.
It's kinda weird to claim extraordinary results in such case though, as that brings a lot of eyes to it.
Nothing weird. The mayor wanted something to brag about. That's Rio, my friend.
Ah that makes sense
That's essentially Brazil's standard operating procedure. Wouldn't be surprising if that turned out to be the case.
Still, I'm actually impressed that this even happened at all. "Rio de Janeiro's homegrown LLM" is the last headline I expected to read on HN.
They really missed out by not calling it Neuromancer.