In 2019, it was announced that the United Nations Educational, Scientific and Cultural Organization (UNESCO) had declared Luxembourgish an endangered language. Luxembourg had a population of 630,000 in 2021, almost half of whom are expatriates, yet the share of Luxembourgish speakers still hovers at nearly 80%. In this truly multicultural society, where the average Luxembourger speaks four languages, SnT and BGL BNP Paribas have collaborated to create a chatbot for the four main languages of Luxembourg. The specific challenge of the project was to develop a brand-new Luxembourgish language model, something the artificial intelligence research community had not yet covered.
In collaboration with BGL BNP Paribas and supported by the Fonds National de la Recherche (FNR), SnT researchers from the Trustworthy Software (TruX) research group have developed a new Luxembourgish language model that will enable a chatbot to ‘understand’ Luxembourgish. Machines cannot technically attribute meaning to words, but they can derive meaning from the lists of numbers assigned to those words. “The model we developed with the BGL BNP Paribas team – called LuxemBERT – was created based on around 12 million sentences. It appears confident in its understanding and can be used to complete multiple retail banking services,” said Prof. Tegawendé Bissyandé, head of the TruX research group. The team also consists of Principal Investigator Prof. Jacques Klein, Doctoral Researcher Cedric Lothritz, and Research Associate Bertrand Lebichot.
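To make the idea of "lists of numbers" concrete, here is a minimal sketch in Python. It is not LuxemBERT: the three-dimensional vectors below are invented for illustration (real models learn vectors with hundreds of dimensions), and the word choices and English glosses are assumptions. It only shows how a machine can compare words once each one is represented as a vector.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# A trained language model would learn these numbers from text.
toy_vectors = {
    "Kont": [0.9, 0.1, 0.0],       # assumed gloss: "account"
    "Spuerkont": [0.8, 0.2, 0.1],  # assumed gloss: "savings account"
    "Wieder": [0.0, 0.1, 0.9],     # assumed gloss: "weather"
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words used in similar contexts end up with similar vectors,
# so their cosine similarity is higher.
banking_pair = cosine_similarity(toy_vectors["Kont"], toy_vectors["Spuerkont"])
unrelated_pair = cosine_similarity(toy_vectors["Kont"], toy_vectors["Wieder"])
print(banking_pair > unrelated_pair)
```

With these made-up numbers, the two banking terms score as more similar to each other than either does to the weather term, which is the kind of signal a chatbot can use in place of human-style understanding.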
Chatbots, the automated help tools visible on so many websites nowadays, are built using natural language processing (NLP), a field of artificial intelligence that processes written text or speech and responds accordingly. For languages like English, developing a chatbot is relatively straightforward, because English has some of the largest volumes of electronic text available as training data. “The problem we face in a project like this is that Luxembourgish is a ‘low-resource’ language, so few resources are available from which we can pull data,” Dr. Lebichot explained. “NLP systems need text from all kinds of documents – from books to social media content – in order to train their understanding of a language. In fact, the original BERT, on which LuxemBERT was based, was a language model that targeted the English language and had the advantage of using the entire catalogue of Wikipedia entries to train the model. For a language like Luxembourgish, these resources simply don’t exist and that was a major obstacle,” he continued.
Because Luxembourgish is closely related to German, the team combined the available Luxembourgish data, including text from websites like RTL, with German data to build a sufficiently large training set. The German data was partially translated into Luxembourgish in order to make it a closer match. “Once we obtained our data, it was time to train the model. For this task, a large amount of computing power is needed, for which we had the opportunity to use the University of Luxembourg’s HPC facility,” said Prof. Bissyandé.
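One simple way to picture a partial translation is word-level substitution, sketched below in Python. This is an assumption made for illustration, not a description of the team's actual pipeline: the tiny dictionary `DE_TO_LB` and the function `partially_translate` are hypothetical, and a real system would need a much larger vocabulary and care with ambiguous words.

```python
import re

# Hypothetical toy dictionary (German -> Luxembourgish), for illustration only.
DE_TO_LB = {
    "und": "an",
    "ist": "ass",
    "nicht": "net",
    "ich": "ech",
    "das": "dat",
}


def partially_translate(sentence, dictionary):
    """Replace only the words found in the dictionary; leave all others untouched."""

    def swap(match):
        word = match.group(0)
        replacement = dictionary.get(word.lower())
        if replacement is None:
            return word  # unknown word: keep the German original
        # Preserve a leading capital letter, e.g. at the start of a sentence.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"\w+", swap, sentence)


german = "Das ist nicht mein Konto und ich warte."
print(partially_translate(german, DE_TO_LB))
# prints: Dat ass net mein Konto an ech warte.
```

The output is a mix of both languages: words in the dictionary are swapped, and everything else stays German. That mirrors the idea of moving German text closer to Luxembourgish without translating it fully.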
The High-Performance Computing Centre (HPC) is a powerful facility that has been an integral part of research at the University of Luxembourg for many years. Compared to the HPC, the computers we use on a day-to-day basis have a very limited amount of computing power. For this reason, many research projects at SnT use the facility to process data over the course of a few hours, days, or weeks, as opposed to the years or centuries that conventional computers might take for the same task.
The project with BGL BNP Paribas is expected to continue until December 2024.