OpenEuroLLM: First year progress and next steps

First results from a project developing next-generation open-source language models to advance European AI capabilities.

One year has elapsed since the start of the OpenEuroLLM project. This ambitious project, carried out by a consortium of 20 leading European research institutions, companies, and EuroHPC centres, coordinated by Jan Hajič (Charles University, Czechia) and co-led by AMD Silo AI, has taken the first steps toward developing next-generation open-source language models to advance European AI capabilities.

The project's main goal requires extensive research, access to high-performance computing resources, and strategic collaboration with other prominent European initiatives. During its inaugural year, the project has achieved significant milestones in advancing regional AI sovereignty through targeted efforts in digital infrastructure development, data practices, model development, and evaluation tools.

“Creating an open source multilingual LLM in the public space and within a large consortium is a challenging task. I am proud that thanks to the expertise, enthusiasm, commitment and hard work of especially the core partners the project has achieved its first-year goals. However, significant challenges, especially in securing more compute for creating the final models, still remain,” says Jan Hajič, Charles University.

Infrastructure

OpenEuroLLM is developing the digital infrastructure needed to lower the barriers to AI product development in Europe. This includes infrastructure for conducting large-scale distributed training, for running model evaluations seamlessly across different European clusters, and for building robust software stacks for experiments. In the first year of the project, these were essential steps to avoid dependence on a single cluster and to make the most of the current configurations of European HPC systems.

Data

In collaboration with Open-Sci, reference models for dataset selection and scaling trends have been developed. These reference models provide baselines for any method trained on the same open reference datasets, making it easier to relate a new training procedure to existing, working baselines.
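Scaling trends of this kind are often summarised by fitting a power law to reference-model results, so that a new method can be compared against the fitted baseline curve at any scale. The sketch below is illustrative only: the numbers are made up, the simplified form loss ≈ a·N^(−b) is an assumption, and it is not the project's actual methodology.

```python
import numpy as np

# Hypothetical (model size in parameters, eval loss) pairs from a
# family of reference models; illustrative numbers, not project data.
sizes = np.array([1e8, 3e8, 1e9, 3e9])
losses = np.array([3.10, 2.85, 2.62, 2.41])

# Fit loss ~ a * N^(-b) via linear regression in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
a, b = np.exp(intercept), -slope

def predicted_loss(n):
    """Baseline loss predicted by the fitted scaling trend."""
    return a * n ** (-b)
```

A new training procedure trained on the same reference data can then be judged by whether its loss at a given model size falls below the fitted baseline curve.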

MixtureVitae, another significant open web-scale pretraining dataset, has been developed together with LAION, Ontocord, and Open-Sci. It is the first permissively licensed dataset to match or outperform strong non-permissive datasets such as FineWeb-Edu or DCLM, and it is particularly strong on reasoning problems related to mathematics and code.

Together with EuroLLM, the project has tackled the lack of data that most European languages face. Because current data collection cannot adequately address this scarcity, which limits the proper representation of many languages in multilingual models, the first comprehensive multilingual synthetic pre-training dataset has been created.

In parallel, the project has established the basis of the OpenEuroLLM catalogue of LLM training data: a uniform, collectively curated, and well-documented collection of candidate LLM training datasets. Datasets in the catalogue have been made publicly available (read-only) on multiple EuroHPC systems such as LUMI, Leonardo and MareNostrum to avoid duplicated effort and redundant storage.

Models and Evaluation

In collaboration with HPLT, 2B/100B reference models for various languages have been released. These transparent and easily reproducible reference models enable cross-lingual comparison, inspection of monolingual performance, and a better understanding of popular evaluation tasks across languages.

In addition, a range of 2B/4TT models have been trained to study multilingual data mixes and determine the optimal proportion of each language within a training dataset for producing high-performing multilingual LLMs.
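One common technique for setting such language proportions (a general approach from the multilingual-LLM literature, not necessarily the one used in the project) is temperature-based sampling, which upweights low-resource languages relative to their raw corpus share. A minimal sketch with hypothetical token counts:

```python
import numpy as np

# Hypothetical token counts per language in billions; illustrative only.
token_counts = {"en": 2000, "de": 300, "fr": 250, "cs": 40, "mt": 2}

def sampling_mix(counts, temperature=0.3):
    """Temperature-based sampling: p_i proportional to f_i**T, where f_i
    is the language's share of the corpus. T < 1 flattens the
    distribution, upweighting low-resource languages; T = 1 reproduces
    the raw corpus proportions."""
    langs = list(counts)
    freqs = np.array([counts[l] for l in langs], dtype=float)
    freqs /= freqs.sum()
    probs = freqs ** temperature
    probs /= probs.sum()
    return dict(zip(langs, probs))

mix = sampling_mix(token_counts, temperature=0.3)
```

With T = 0.3, a very low-resource language like Maltese receives a far larger share of training samples than its raw share of the corpus, at the cost of slightly downweighting English.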

The results of both the 2B/100B and the 2B/4TT models inform future decisions as model sizes are scaled up.

Looking Ahead

As the project enters its second year, transparency, openness, and community collaboration remain its guiding values, and the work continues with high ambitions.

OpenEuroLLM has succeeded in securing access to EuroHPC strategic compute resources, guaranteeing a substantial amount of compute on four major EuroHPC supercomputers for the remainder of the project. Additional compute resources will, however, be required to complement the strategic allocations.

The project aims to release an 8B model by next summer, followed by a larger model using the compute secured through the strategic allocation. Additionally, new iterations of the Poro model family will be released.

The Tübingen Contribution

The Tübingen AI Center's contribution to OpenEuroLLM focuses on training and evaluating a highly multilingual family of foundation models. Beyond the scientific goals, the project prioritizes creating a strong community around foundation models to ensure their accessibility and widespread adoption. The Tübingen AI Center leads these community-building efforts, bringing together various stakeholders to support the project's open-source mission. This entails coordinating strategic advice from an international board of AI experts to ensure high-level alignment with existing communities and initiatives as well as EU policies. The project will also strengthen ties with businesses, small enterprises, and high-performance computing (HPC) networks. The goal is to build a lasting commitment to the development and use of open-source AI models, ensuring that key stakeholders remain engaged both during and after the project.
