Young Researchers

Dr. Cedric Lothritz on a Luxembourgish Language Model

Young researchers shape our future. Bringing their innovative ideas into our projects, they contribute not only to the excellence of SnT research, but also to our impact in society. They take our research to the next generation.

In this edition of the series, we feature Dr. Cedric Lothritz and his research on the development of a Luxembourgish language model “LuxemBERT” together with BGL BNP Paribas. Addressing the challenges of the financial sector, as SnT is doing in this project, is also the objective of the National Centre of Excellence for Financial Technologies (NCER-FT).

Dr. Cedric Lothritz, postdoctoral researcher at the Trustworthy Software research group (TruX), gave us some insights into his research, reflected on how his project will shape the future, and shared his future plans with us.

Cedric, what are you working on in your research?

Since joining SnT in 2019, I have worked on natural language processing, initially in the FinTech field before broadening my scope to include multilingualism and the Luxembourgish language. In our project, we worked with BGL BNP Paribas to develop a Luxembourgish language model called “LuxemBERT”. BGL BNP Paribas uses this language model to create a Luxembourgish chatbot. Our solution demonstrates how language models can be created for other low-resource languages throughout the world.

What is the motivation of the project?

We wanted to create a Luxembourgish language model for use in the financial sector. So what exactly is a language model?

Computers do not process words the way we humans do, but they are very good at processing numbers. The purpose of language models is to bridge that gap between words and numbers, so that computers can understand the meaning of words. For this, language models try to learn how to translate words into lists of numbers – or vectors – and give them some sort of meaning. This is especially important for words that appear in similar contexts. For example, Rome and Paris are both European capitals and popular tourist destinations. Ideally, a language model would translate those words into similar vectors, but they should be different enough so that we can distinguish between them.
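To make the idea of "similar but distinguishable" vectors concrete, here is a minimal sketch in Python. The three-number vectors are hand-picked for illustration and do not come from LuxemBERT or any trained model; the word "banana" and the function name `cosine_similarity` are likewise assumptions added for the example. Cosine similarity is one common way to measure how close two word vectors are.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen by hand for illustration only (not real model output).
VECTORS = {
    "Rome": [0.9, 0.8, 0.1],
    "Paris": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

rome_paris = cosine_similarity(VECTORS["Rome"], VECTORS["Paris"])
rome_banana = cosine_similarity(VECTORS["Rome"], VECTORS["banana"])

# Two capitals should score higher than a capital and an unrelated word.
print(f"Rome vs Paris:  {rome_paris:.3f}")
print(f"Rome vs banana: {rome_banana:.3f}")
```

In this toy setup, Rome and Paris score close to, but not exactly, 1, which mirrors the point above: the vectors are similar yet still distinct.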

One important factor to consider: powerful language models need a huge amount of training data. This is not a big issue for widely spoken languages such as English, German, or French. However, for Luxembourgish, a low-resource language, this was a huge challenge.

What is the solution in the project?

For the training data in Luxembourgish, we only found about 800 megabytes of data, which amounts to 6 million sentences. This is a small data set compared to those used for other language models, such as the German language model “GottBERT”, which was trained on 145 gigabytes of data. For this reason, we had to create more data. We did that by translating data from a closely related language. Specifically, we used a German data set and translated selected words into Luxembourgish. This gave us pseudo-Luxembourgish sentences that are still very close to an actual Luxembourgish translation. Through this process, we created a lot of new data and used it to train our Luxembourgish language model.
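The word-substitution idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the five-entry lexicon, the example sentence, and the function name `pseudo_translate` are all made up for this sketch, and the interview does not say which words were selected or how.

```python
import re

# Tiny illustrative German -> Luxembourgish lexicon. This is an assumption
# for the sketch; the word list actually used in the project is not given.
LEXICON = {
    "ich": "ech",
    "und": "an",
    "nicht": "net",
    "ist": "ass",
    "auch": "och",
}


def pseudo_translate(sentence, lexicon=LEXICON):
    """Replace only the words found in the lexicon; leave the rest in German."""

    def swap(match):
        word = match.group(0)
        target = lexicon.get(word.lower())
        if target is None:
            return word  # word not in the lexicon: keep the German original
        # Preserve a leading capital letter, e.g. at the start of a sentence.
        return target.capitalize() if word[0].isupper() else target

    return re.sub(r"\w+", swap, sentence)


german = "Das ist nicht schwer und ich verstehe es auch."
print(pseudo_translate(german))
# prints: Das ass net schwer an ech verstehe es och.
```

The output mixes Luxembourgish function words with untouched German words, which is what makes it "pseudo-Luxembourgish" rather than a full translation.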

How does this project shape the future?

Today, you are already using language models in everyday life. For example, tools such as search engines, grammar checkers, or chatbots like ChatGPT are powered by language models. Our partner BGL BNP Paribas will use our solution to create a chatbot that understands Luxembourgish. This will make web banking more inclusive and more accessible to people in Luxembourg who do not feel comfortable doing web banking in a foreign language. In general, if a company delivers customer services, a chatbot built on “LuxemBERT” can be the right solution for them to deal with Luxembourgish clients.

What inspired you to work in research at SnT?

After experiencing some setbacks in my personal life shortly after finishing my master’s degree, I told myself not to become a cog in a machine, and to do something unique in my life that would make me feel like I had contributed to society. I figured that the most appropriate path to achieve this would be through a Ph.D. I came across an offer for a Ph.D. programme at SnT that read like the abstract of my master’s thesis and felt like this would be the perfect position for me. While there have been many bumps in the road, doing the Ph.D. turned out to be one of the best decisions of my life!

What are your future plans?

Working on Luxembourgish natural language processing (NLP) and contributing to the preservation of the Luxembourgish language is what I am most passionate about. As NLP has become more and more widespread, in particular due to the release of large language models and tools like ChatGPT, demand for models that are tailored to our language has increased, too. Ideally, I would like to stay in research, working on projects that strengthen the presence of Luxembourgish in the digital space.

About Cedric: Cedric Lothritz obtained his Ph.D. in natural language processing (NLP) from the University of Luxembourg in 2023. His work focused on NLP in the FinTech domain and for the Luxembourgish language. His main research interests lie in language modeling and techniques to develop models for low-resource languages. He has been working at SnT since 2019 and has been part of the TruX research group since its inception.

This article was originally published on 6 September 2023. 
