Building a Norwegian language model

ChatGPT’s ability to generate human-like language is impressive. But it is now also important to build dedicated models for smaller languages like Norwegian.

By John Einar Sandvand

“We need a Norwegian language model, built mainly on Norwegian text,” says Schibsted’s Chief Data & Technology Officer, Sven Størmer Thaulow.

Sven is currently the chair of the Norwegian Research Center for AI Innovation (NorwAI) at NTNU (the Norwegian University of Science and Technology) in Trondheim. Schibsted is one of several industrial partners of NorwAI and contributes both competence and data.

One of the big projects at NorwAI is to build a generative language model for the Norwegian language. The work has been ongoing for more than two years, and the first version was launched last summer. Schibsted has contributed thousands of articles for the model to be trained on. This is a non-commercial research project at present, and Schibsted will be among the first to test how well it performs compared to the big American models.

Why is a Norwegian language model needed when ChatGPT works quite well in all the Scandinavian languages?

Sven shares three main reasons:

Better in Norwegian: A model trained primarily on Norwegian-language content is likely to perform better in Norwegian. For comparison, we estimate that only 0.1% of the content ChatGPT was trained on was in Norwegian.

Control over our own infrastructure: Large language models are becoming part of our digital infrastructure. But we see that artificial intelligence is already turning into a global industrial political race. It is not obvious that the technology will be democratised. Therefore, we must develop our own large language models that serve our societies well and can be the basis for innovation and the development of new services for our population.

Consistent with Norwegian culture: We need language models that reflect the value sets of our Nordic societies rather than being dominated by American perspectives. Language models can easily be biased, for instance, because of the content they are trained on or how they have been adjusted. By training our own models there is a greater chance that the output will better reflect our culture and values.

Building a large language model is an enormous effort. It requires vast amounts of text, specialised competence, and enormous computing power. Early this year, Sven invited all media companies in Norway to contribute content to the work of building the model.

“We need content that is representative of the full Norwegian society, from news articles, simple chats, government documents, court verdicts – to even the most beautiful novel,” Sven says.

He adds that a successful Norwegian language model can be a shared resource that will create much value – for Norwegian society at large, for companies like Schibsted and others, as well as for individuals.


John Einar Sandvand
Senior Communications Manager, Schibsted
Years in Schibsted: 30
My favourite song the last decade: Save Your Tears – The Weeknd