Efficient training
Alan Akbik, professor of machine learning at Humboldt-Universität zu Berlin, is working on smarter approaches to language models
ChatGPT requires vast amounts of data and costs a lot of money. Alan Akbik, professor of machine learning at Humboldt-Universität zu Berlin, is exploring smarter solutions. No AI was used to generate this essay on the key role of language models in artificial intelligence.
When Akbik moved to Humboldt-Universität zu Berlin in January 2020, just before the first lockdown, his position was effectively a “one-man chair,” he recalls. Before that, the computer scientist had spent several years in industry, working in machine-learning research at Zalando and, earlier, at IBM in California.
Much has changed in the six years since he took up the professorship—not only in his field but also within the chair itself, which Akbik built from the ground up and which now employs nearly twenty researchers. The breakthrough of ChatGPT transformed the landscape. Even for experts, the rapid development of language models came as a surprise. In the past, people around him had thought the field eccentric, he remembers. Language and computers simply did not seem to belong together. “But since ChatGPT has entered our day-to-day lives, I don’t have to explain to anyone why I work in this field.”
His field of expertise is natural language processing (NLP) and machine learning. His research focuses on the role language models play as a key technology in artificial intelligence, how these so-called large language models (LLMs) are constructed—and especially how they might be trained more efficiently.
“A language model like ChatGPT is essentially a prediction model designed to generate plausible text,” he says. Such systems are trained on enormous quantities of text and learn to predict, within sequences of words (so-called tokens), which token is most likely to come next. A sentence like “I went to the zoo and especially liked the X” should therefore be completed plausibly, for example, with the token “giraffe”. In essence, that is all the machine learns. “That’s the only task it is trained for.”
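The prediction task Akbik describes can be illustrated with a deliberately tiny sketch. This is not how ChatGPT works internally (real LLMs use neural networks over billions of tokens); it is a toy bigram model, invented for this example, that shows the same idea of learning which token most often comes next:

```python
from collections import Counter, defaultdict

# Toy illustration: a bigram "language model" whose only skill,
# like an LLM's, is predicting the most likely next token.
corpus = (
    "i went to the zoo and especially liked the giraffe . "
    "i went to the park and liked the trees ."
).split()

# Count which token follows which in the training text.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the continuation seen most often during training."""
    return follows[token].most_common(1)[0][0]

print(predict_next("liked"))  # -> "the" ("liked the" occurs twice above)
```

Scaling this idea up, from counting word pairs to a neural network trained on a large fraction of the web, is what makes the systems so capable and so expensive.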
But there is a problem: this process requires truly massive datasets, which makes training extremely expensive. Models such as ChatGPT depend on vast computing capacities that consume huge amounts of electricity. “At the moment, only very large companies with significant resources and funding can build systems like this,” Akbik says.
This is exactly where his research begins. The computer scientist is investigating whether comparable language models could be developed more intelligently and efficiently, by training them just as effectively but with far less data. “We want to improve efficiency so that universities can also start training high-quality NLP models,” he says.
Soon he plans to present a new German-language model that his own team developed. It is called Boldt, named after Humboldt without the “hum”. “We used a web crawl of German-language data that is widely used in research,” Akbik explains. To reduce data volume, his team developed a method that filters web pages, assessing whether the German text they contain is suitable for training a language model. “For the automatic evaluation we defined several criteria, such as coherence and factual content.” Using this approach, the dataset shrank from around 400 billion tokens to roughly 28 billion. This is still enormous, but more manageable. The results make him optimistic: “In standardised tests, Boldt performs very well.”
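A quality filter of this kind can be sketched with simple heuristics. The function below is a hypothetical illustration of the general approach: the criteria and thresholds are invented for this example and are not the team's actual method, which evaluates richer signals such as coherence and factual content:

```python
# Hypothetical sketch of a web-page quality filter; criteria names
# and thresholds are illustrative assumptions, not the Boldt method.
def looks_trainable(text: str, min_words: int = 20) -> bool:
    words = text.split()
    if len(words) < min_words:      # too short to be coherent prose
        return False
    alpha = sum(w.isalpha() for w in words) / len(words)
    if alpha < 0.7:                 # mostly markup, numbers, or noise
        return False
    sentences = [s for s in text.split(".") if s.strip()]
    return len(sentences) >= 2      # at least some sentence structure

pages = [
    "Impressum | Login | AGB | 404",
    "Die Giraffe ist ein Säugetier, das in den Savannen Afrikas lebt. "
    "Sie ernährt sich überwiegend von Blättern und kann bis zu sechs "
    "Meter hoch werden. Ihr langer Hals erlaubt es ihr, Nahrung zu "
    "erreichen, die anderen Tieren verborgen bleibt.",
]
kept = [p for p in pages if looks_trainable(p)]  # keeps only the prose page
```

Applied at web-crawl scale, even crude filters like this discard the bulk of the raw data, which is how a corpus can shrink by an order of magnitude while keeping its most trainable text.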
Alongside language models, Akbik and his team are also working on information extraction—the targeted retrieval of knowledge from large text collections. For instance, how often does the term “coronavirus vaccine” appear in negative or positive contexts? For this purpose, Akbik developed the open-source Flair NLP framework, which is now used in thousands of projects worldwide. Such sentiment or opinion analyses can be applied across many disciplines. “The technology we have developed is freely available as open-source software,” Akbik says, “so that similar applications can be used in other research areas or institutions.”
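The context-counting question above can be sketched in a few lines. The word lists and window size below are invented for this illustration; a real pipeline, for instance one built on the Flair framework, would use a trained sentiment classifier rather than keyword matching:

```python
import re

# Illustrative sketch of counting positive/negative contexts around a
# search term. POSITIVE/NEGATIVE word lists and the window size are
# assumptions made for this example, not part of any real pipeline.
POSITIVE = {"effective", "safe", "successful"}
NEGATIVE = {"dangerous", "failed", "risky"}

def context_sentiment(texts, term, window=5):
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == term:
                # Look at the words surrounding each occurrence.
                ctx = set(tokens[max(0, i - window): i + window + 1])
                if ctx & POSITIVE:
                    counts["positive"] += 1
                elif ctx & NEGATIVE:
                    counts["negative"] += 1
                else:
                    counts["neutral"] += 1
    return counts

docs = [
    "The coronavirus vaccine proved safe and effective in trials.",
    "Critics called the coronavirus vaccine rollout risky.",
]
print(context_sentiment(docs, "vaccine"))  # one positive, one negative context
```

Run over thousands of documents, this kind of count turns raw text collections into the opinion statistics the article describes.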
For him, conducting AI research at a university offers major advantages. Within his own chair he can pursue longer-term questions more freely than in industry. Universities also ask different questions from companies, for example, about fairness and ethics. For this reason, training its own language models for independent research is particularly important to him. “Otherwise we would simply stand on the sidelines and watch what happens. As a university, we want to help shape the field ourselves.”
Heike Gläser for Adlershof Journal
