Norway's 2 petabytes of Huawei flash storage and LLM training

2 weeks ago/226 comments/blocksandfiles.com

I'm a Norwegian, and I use the national library almost every day for searching through texts. They have truly one of the best working user interfaces (and functionality) for searching through the massive amounts of text.

18 days ago by vidarh

It's really fantastic. I just wish there were fewer restrictions on the content that is accessible.

(a lot is only accessible from Norwegian IP addresses, so it's one of the main reasons I maintain a VPN as I'm Norwegian but live in the UK; a second set is only available from the IP addresses of libraries or research institutions - still huge amounts that are generally available, though)

18 days ago by TrackerFF

My biggest gripe with it is the restrictions, indeed.

When searching through the closed newspapers, you have to apply for access manually, which gives you 8 hours of access. Great. Only that the access is seemingly manually granted - so if you apply at 16:05 on a Friday, chances are you won't get any access until 9-10 the next Monday.

With that said, I do understand why it is like that. If people could apply via API, and get instant access, they would probably just stop buying newspaper subscriptions.

17 days ago by vidarh

I actually didn't realise you could apply. I always just went back and ignored the closed ones without reading closely enough apparently. Thanks for making me aware - there are a few that are relevant to me for genealogy reasons that I've not looked at because of this.

18 days ago by mettamage

Silly question but can a non-Norwegian also access it? Willing to pick up some Norwegian along the way ;-)

18 days ago by vidarh

You can access quite a bit directly. Check out nb.no (or https://www.nb.no/en/ for an English version of the page, but of course most of the works are in Norwegian)

There is an escalating series of restrictions, basically:

* Available for everyone.

* Available from a Norwegian IP -> just requires a VPN.

* Available from Norwegian libraries

* Available under "special conditions". This would mean from a participating research institution or university, or similar.

Pretty much everything that is out of copyright falls in the first category. The second and third categories have a bunch of copyrighted material where the copyright holders have granted limited usage rights. A bunch of newspaper archive material that is still under copyright (but sadly not the biggest ones) is available from Norwegian IPs, for example.

18 days ago by Telaneo

If you have access to a Norwegian IP, then yes.

18 days ago by throwaway85825

The lack of a universal search engine is very frustrating. Why can't I search within TV subtitles?

18 days ago by vintermann

Well... You realize how used you are to the basic stemming and spelling flexibility which every search engine has had since Altavista.
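
For anyone wondering what "basic stemming and spelling flexibility" amounts to, here is a minimal sketch. The suffix list is a made-up toy for illustration, not a real Norwegian stemmer (a proper one would handle far more cases), and `toy_stem` and `flexible_lookup` are hypothetical names. The fuzzy matching uses Python's standard `difflib`.

```python
import difflib

# Hypothetical suffix list for illustration only; not a real Norwegian stemmer.
TOY_SUFFIXES = ("ene", "en", "et", "er", "e")  # ordered longest-first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in TOY_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def flexible_lookup(query: str, index_terms: list[str]) -> list[str]:
    """Return index terms whose stem equals, or is close to, the query's stem."""
    stems: dict[str, list[str]] = {}
    for term in index_terms:
        stems.setdefault(toy_stem(term), []).append(term)
    # Fuzzy-match the query stem against known stems to absorb small typos.
    close = difflib.get_close_matches(toy_stem(query), list(stems), n=3, cutoff=0.8)
    return [term for stem in close for term in stems[stem]]


terms = ["avisen", "aviser", "skipet", "bibliotek"]
print(flexible_lookup("avis", terms))       # finds inflected forms of the same word
print(flexible_lookup("bibliotke", terms))  # tolerates a transposed letter
```

Without something like this, a search for "avis" only hits the exact string, which is the kind of rigidity you stop noticing until it is gone.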

18 days ago by KeplerBoy

How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."

I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.

18 days ago by WatchDog

If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state of the art models.

18 days ago by black_puppydog

I task GPT/Claude with researching stuff that pertains to very specific cultural or legal aspects in French politics, on a daily basis. Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter the language I myself speak to them (German or English depending on my mood) their web searches need to be done in French to return reasonable results. I have to remind them every time lest they come back with "uh, didn't find anything relevant, here take some hallucinations instead."

So, given the Anglo-centrism of current models, my confidence in American providers giving any shits about non-American users/use-cases is pretty low. And lower the smaller the language community is.

18 days ago by KaiserPro

I've noticed that it also imposes American moral judgements on certain things, even though it reasons (sometimes) in the native language.

I was trying to work out how and when to use swear words, and the relative power index of them. It translated English swear words into the target language, then lectured me on not using them.

It took a bunch of prodding for it to actually think as the target language to then get the (mostly) correct response.

18 days ago by hombre_fatal

Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.

If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.

This is how you use the tool correctly.

There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without it, of course it wasn’t; LLMs suck.

18 days ago by andai

If you ask in French, it searches in French, right?

I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).

So then I have to ask it "can you repeat that in English please."

I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it but it's still there.

18 days ago by bakugo

> their web searches need to be done in French to return reasonable results.

I wonder how much of this is also just the search engine's region setting.

It's a big problem I regularly have with Google. I almost always want English language, US-centric results, so I have my region set to the US. But occasionally I want results relevant to my actual country, and even searching in my native language usually yields much worse results than just opening an incognito tab and letting it default to my real location.

18 days ago by a2128

What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you, would be a fool's move

18 days ago by SOLAR_FIELDS

It's also a bit funny because Norway definitely has enough money to hire a team of Anthropic's best to go out there and train them a model that does whatever they want. They probably have enough money to fund their own Anthropic competitor.

18 days ago by embedding-shape

Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high quality transcriptions and translations of the stories currently described only in Norwegian into English, and made it all public. I guess there still would be a worry that it'd be counted as "less important" compared to other history, news and culture about other countries.

18 days ago by pjc50

Oddly enough, my wife was recently involved in a project to translate historical crime novels from Norwegian; since all the available late 20th century Scandinavian crime novels have already been translated and turned into popular TV series, the plan was to go further back. Into the 1930s. The first cut was done with LLMs, but encountered the problem that (a) Norwegian itself has changed noticeably since then, in both major dialects, and (b) the machine translation deteriorated on large sections, resulting in entirely missing paragraphs and pages in a few places. Not to mention the usual translation issues (what police role does lensman map to?) and localisation (to what extent should the casual antisemitism be left in or removed?)

Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.

Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.

( https://www.amazon.co.uk/Iron-Chariot-Nordic-Crime-Library/d... )

18 days ago by makeitdouble

> high quality transcriptions and translations of the stories currently described only in Norwegian into English

You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.

18 days ago by vintermann

Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM though (though I won't at all be surprised if someone sues them anyway).

18 days ago by electroglyph

absolutely. somebody online was wanting an LLM with Georgian language support, and that's exactly what i suggested: start digitizing Georgian text.

18 days ago by amarant

Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.

18 days ago by schubidubiduba

Of course they speak Swedish. But often, they do not reason in Swedish and do not search in Swedish. Swedish makes up a tiny fraction of training data, while the vast majority is English, from the US. Which means the answers will always have a bias towards US culture, even if you ask in Swedish and the LLM answers in Swedish.

18 days ago by NorwegianDude

While Google does a good job with language support in their models, GPT-5.5 can't write proper Norwegian. It's even making up words that do not exist.

18 days ago by mistrial9

different models have been very different in this way.. almost ten years ago the French made a very large effort to capture languages.. the release notes I read at the time IIR had quite a few languages from South Asia / India, and in Africa. The language that was prominently missing was German IIR. I cannot say for the 2025-2026 models since so much has happened.. but models are not equal.

18 days ago by vintermann

Does that include local distilled models? Because it didn't last time I checked for Norwegian.

18 days ago by vintermann

Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.

Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.

18 days ago by intronic

Yep in the article it says ..the National Library .. has the single largest digital collection of Norwegian books, newspapers, web pages .. it is entitled to receive copies of every published book and broadcasted content. Its legal deposit mandate in this area extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage .. an agreement with Norwegian newspapers permitted LLM training on copyrighted content.

Husnes said: ”No private company has this.”

So yeah they seem to have proprietary data...

18 days ago by pastage

> proprietary data

It is just copyrighted data, that is harder to get a hold of. All the copies are available to anyone to use if they just read it. Copyright makes other uses complicated. I wonder if the whole Creative commons debate was a mistake, you can never fix copyright in a digital world.

18 days ago by orbital-decay

Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.

18 days ago by Barrin92

>Current-best models are pretty fluent at major languages and cultures

strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.

Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of a HR department person and a Mormon missionary

18 days ago by bblb

I'm Finnish and dear god I hate the default overly friendly tones of LLMs. Always the first thing to tune in the system prompt.

You're a machine, stop anthropomorphizing yourself and pretending to be my best friend, and just give me the damn answer and nothing else. :D

18 days ago by varjag

Set the personality to 'Robot', it makes the interactions so much more tolerable.

18 days ago by solenoid0937

> The Olivia system is an HPE Cray Supercomputing EX system, with 448 GPUs and 64,512 CPU cores.

Training a sovereign LLM with this meager hardware as opposed to a LoRA on some open source model seems like a huge mistake and a potential red flag.

There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.

Which begs the question, whose money are they wasting - and why?
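
For readers unfamiliar with why a LoRA is the cheap option here: instead of updating a full weight matrix, LoRA trains two small low-rank factors whose product is added to it. A rough sketch of the trainable-parameter arithmetic for one layer follows. The layer size and rank are hypothetical numbers picked for illustration, and this says nothing about what Olivia can or cannot train.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_update_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a rank-r update W + B @ A,
    where B is d_out x r and A is r x d_in."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16
full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, rank)
print(full, lora, full // lora)  # 16777216 131072 128
```

So for this one made-up layer the LoRA trains 128x fewer parameters, which is the gap being pointed at when comparing a finetune to training from scratch.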

18 days ago by vslira

It may not be useful to anyone outside, but it's possible that one of the goals is institutional learning (that is, embedding the knowledge in how to build LLMs in an organization).

Even though it's nominally the national library behind this, they were probably chosen (as per the article) because they legally own and can use all NO material for this end. I'd guess researchers from related entities like unis will be involved in the process.

18 days ago by speedgoose

They successfully have made PoC finetunes before, so the next step is training fully fledged LLMs.

I don’t think they aim for anything worthwhile. The finetunes were incredibly broken. I’m guessing it’s more about having the method to do it. I’m not convinced it’s super useful but I’m not one to decide who gets to do what with the research funds.

One finetune I tried did make fun of humans expressing their feelings in the chat. Often.

One other finetune did hallucinate that it was a doctor and my baby had terrible diseases, every time I just wrote "hei" (with a generic neutral system prompt that likely triggered this behaviour though).

I think Olivia is big enough for what it’s used for. In my opinion it’s better to stay up to date and not waste too much money on hardware at the moment.

18 days ago by Schlagbohrer

The article's slides mention how much of an engineering challenge it is just for them to clean their data and create new hardware and software flows to use the data for training. So perhaps it is a big learning exercise to build up institutional / national knowledge of LLM creation.

18 days ago by manquer

> this meager hardware

> they wasting - and why?

i18n language models are not an area frontier labs are focusing a ton of resources on? (certainly not in Norwegian)

The corpus of content in Norwegian may not require very large clusters. Even if it does, this is the best that the library could do, and it would certainly be more than anyone else is investing in Norwegian models.

SOTA models do not have access to the quality of content that the national library does? The article mentions licensing with newspapers specifically, and the library has access to its own content archive.

English and Norwegian are not closely related language families, perhaps LoRA is not best approach?

I am curious if there is published research on how well localization works with LoRA depending on how far off the target language grammar/vocabulary is from English.

Projects like this typically have more than one objective: not only building a SOTA project, but also building/training foundational local talent, similar to universities launching satellites.

18 days ago by vidarh

> English and Norwegian are not closely related language families, perhaps LoRA is not best approach?

Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.

E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.

But more than that, the frontier models both a) know Norwegian quite well, b) certainly know German and Dutch well, and there's a continuum of language transfer around the North Sea, especially when accounting for sounds rather than modern orthography (e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.

Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.

18 days ago by everforward

This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.

18 days ago by hedgehog

That's enough resources to build on something like the Olmo 3 recipe but with a mix prioritizing their own data and post-training for their own tasks. If they build their own embedding model, index everything in the library, and train their model to query that data while answering historical, cultural, legal, and strategic questions from their perspective... Pretty interesting and likely useful. They won't beat Anthropic at dumping out React code but also there's no real reason to duplicate that.
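
The "embed, index, then query while answering" loop described above can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the records are hypothetical placeholders rather than real catalogue entries, and `build_index` and `retrieve` are names invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Precompute one vector per document."""
    return [(doc, embed(doc)) for doc in docs]


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Hypothetical placeholder records, not real catalogue entries.
docs = [
    "newspaper review of a novel published in 1911",
    "radio broadcast about coastal fishing",
    "rules for legal deposit of printed books",
]
index = build_index(docs)
context = retrieve("novel from 1911", index)
# The retrieved text would be handed to the model alongside the question.
prompt = f"Answer using this source: {context[0]}"
print(prompt)
```

The interesting work is in everything this skips: a real embedding model, OCR cleanup, and training the model to issue good queries in the first place.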

18 days ago by timmg

I wonder if instead (or in parallel), Norway should build a set of training data and share it (for free) with all the model builders.

Seems like making the frontier models know Norwegian and their culture is a better (or additional!) way to reach the end they are going for here.

18 days ago by vidarh

The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.

E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.

What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).

While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.

I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.

But even just making the out-of-copyright data in their collections available would be a great start.

18 days ago by e12e

Odd, I'd imagine Wikisource (in many/all languages) would be part of training data for all LLMs with SOTA ambition?

https://no.wikisource.org/wiki/De_knyttede_n%C3%A6ver

18 days ago by vidarh

You'd think so. It seems like there are a lot of odd gaps like that.

I also have a favourite English language PhD thesis I ask every new model about that they still struggle to find even though there's a Wikipedia article about it that links a blog post I wrote about it.

Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.

18 days ago by calgoo

Why should they share all this data with the greedy American corporations that are stealing everyone's data for their own profit? Much better to keep the legal agreement with the national institutions and possibly develop something actually useful to their own country.

18 days ago by konschubert

You are contradicting yourself. If you're hoarding the data for yourself you're not going to develop something useful. Sharing the data means that it will be integrated into the big LLMs, which will be useful "for their own country".

18 days ago by rafram

> Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket) discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.

I am not overly confident that Marius Husnes knows what he’s talking about here.

18 days ago by fnordpiglet

He’s right though, although it’s not entirely about the training corpus. It’s about the tokenizer that tokenizes substrings more efficiently based on a necessary bias towards a target language. English oriented LLMs are more powerful for English than other languages because the token space is more parsimonious in English language. Try any online Anthropic tokenizer that calls their API with common English words (typically one or fewer tokens) and Norwegian words - you’ll often see 2-4 tokens instead, sometimes more. Some languages like Thai are at a huge disadvantage. Likewise often the corpus selection also is heavily skewed towards the target language simply because more energy is applied to sourcing written works in that language. There will also be semantic biases in the vector space due to cross influence between semantically similar embeddings between languages that create a baseline different from the cultural one. Finally fine tuning greatly impacts cultural expression in the LLM. None of these are trivial effects.

There are a lot of efforts to create LLMs for dying languages and others that use cross cultural models to boost, but if your language is well literate, there’s a good reason to build a heritage LLM specific to your language and culture. Expecting OpenAI or Anthropic to prioritize your language over their target audience when a tradeoff is to be made is absurd.

18 days ago by YetAnotherNick

Did you even try to verify your claims? I tested it on a few translations of Wikipedia articles using [1] and it takes 15-20% more tokens for Norwegian.

English performs the best because there is more data in English and high quality sources are either only in English or there is a good translation in English.

[1]: https://platform.openai.com/tokenizer

18 days ago by tecleandor

In tests I've done with NO and FI texts, for the same number of characters, the GPT5 tokenizer gives me around 2x the tokens of EN. With the older tokenizers it's more like 2x or even 3x.

18 days ago by numpad0

Tokenizer efficiency varying by language, by as much as 15x, is very well known and established

  https://www.google.com/search?q=tokenizer+efficiency+by+language
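
Since the figures in this subthread disagree, the simplest check is to count tokens for parallel sentences yourself. The sketch below shows the mechanism only: `greedy_tokenize` is a toy longest-match tokenizer and `TOY_VOCAB` is a hypothetical vocabulary deliberately skewed toward English. It is not OpenAI's or Anthropic's tokenizer and its ratio proves nothing about real models. To measure a real one, pass that tokenizer's encode function to `token_ratio` instead.

```python
from typing import Callable, Sequence


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization with single-character fallback."""
    max_len = max(len(v) for v in vocab)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


def token_ratio(
    tokenize: Callable[[str], Sequence], baseline: str, other: str
) -> float:
    """How many times more tokens `other` needs than `baseline`."""
    return len(tokenize(other)) / len(tokenize(baseline))


# Hypothetical vocabulary skewed toward English, for illustration only.
TOY_VOCAB = {"the", " ship", " sails", " today", "sk", "ip", " se", "dag"}


def toy(text: str) -> list[str]:
    return greedy_tokenize(text, TOY_VOCAB)


english = "the ship sails today"
norwegian = "skipet seiler i dag"
print(len(toy(english)), len(toy(norwegian)))
print(token_ratio(toy, english, norwegian))
```

The English sentence maps onto whole-word tokens while the Norwegian one falls back to fragments and single characters, which is the effect being argued about. How large it is for a given model is an empirical question, not something this toy answers.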

18 days ago by chvid

When I am chatting with ChatGPT - it is fairly obvious that it is American - its native language, its style, its attitude is American - even if we chat in Danish.

Just as we cannot rely on Netflix and HBO to produce Scandinavian TV-shows even though they might do at the moment, we need to make our own stuff in this area too.

And over time, the technology to do this will become cheap and readily available for us to do so.

18 days ago by amunozo

I chat to it in English instead of my native Spanish not only because of performance, but because I cannot stand the unnatural style it has in Spanish.

18 days ago by anal_reactor

> And over time, the technology to do this will become cheap and readily available for us to do so.

But then the English models will be even better and you'll be back to square one. My guess is that things are going to become more and more American. If you assume that "culture" is a resource like "microchips", then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume. This is why when you turn on the main radio station of a random country, you're so likely to hit American music.

18 days ago by ikr678

'Only one country should export culture, for economic efficiency' is the kind of take that the Norwegians (and everyone else) would like to protect themselves from.

18 days ago by pjc50

> then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume

And, for exactly the same reasons as Europeans need to have sovereign compute to protect against economic imperialism, it is also essential to maintain local culture in order to avoid the great replacement of everything with Americanisms.

Yes, it requires pushing against the economics. But you have to do that if you believe that culture has any value per se at all.

18 days ago by wasmitnetzen

> If you assume that "culture" is a resource like "microchips"

I do not. American culture exports American values, which are not universal. Simplest examples being the attitudes towards violence and nudity, which are very different in Europe, and vary within Europe as well.

16 days ago by fastasucan

>makes sense to have one country specialize in producing it, and the rest just consume

Oh ok then, just Finnish culture for you then.

18 days ago by isawczuk

Poland has its own LLM called Bielik. It's not only better at preserving Polish-sounding wording, it's also better at writing government documents. Why better? They did an arena and statistically it's just better.

18 days ago by KaiserPro

could you provide evidence to suggest he is wrong?

It seems like you've made an assertion but not provided evidence. Why is it not a disadvantage to only have English LLMs?

Can you get the nuance of Norwegian history/culture with present models?

18 days ago by 6510

What is called culture here will increasingly be propaganda. It reminds me of people cheering Twitter as a replacement for RSS or using Facebook to communicate with your customers rather than email. You won't know which will be the winning company, don't know who might control it in the future, and we can't predict what it will cost. It doesn't take much to be very annoying.

18 days ago by seanvk

The Welsh language getting LLM training with Nemotron

https://www.bangor.ac.uk/news/2025-09-15-reaching-across-the...
