Kazakhstan to Launch First Kazakh Large Language Model in December

ASTANA — The Institute of Smart Systems and Artificial Intelligence (ISSAI) is set to launch the first Kazakh large language model (LLM) on Dec. 16, the thirty-third anniversary of Kazakhstan’s independence, as reported during a July 18 briefing at Nazarbayev University (NU).

According to the university’s press service, ISSAI began collecting data in March and is currently training the model using a cloud computing platform with a small number of NVIDIA H100 nodes.

Professor Atakan Varol, the founder and head of ISSAI, highlighted that the project involves students from NU and other universities such as Astana IT University, Bolashak scholarship graduates, and locals.

“At the end of this project, we will create KazLLM, but the most important achievement will be the creation of a workforce capable of producing cutting-edge generative AI tools and products. In this specific technology, we are not far behind other countries. After completing KazLLM and its models, we will be 18 months behind them. Integrating voice will reduce this gap to 12 months, creating language vision models will place us at the cutting edge, and we will do what those other countries do. The important thing is that we are doing this for the people of Kazakhstan in the Kazakh language,” he said.

The project draws its data from a range of sources, including Wikipedia articles, news outlets, government websites, and open datasets such as Common Crawl. Over the past five years, ISSAI has developed numerous natural language processing datasets specifically for the Kazakh language. The project is also intended to address national and information security concerns, since reliance on foreign products can lead to data leakage and the presentation of distorted information.

Madina Abdrakhmanova, Deputy Director for External Relations and Lead Data Scientist, added that the model’s training corpus will comprise at least 100 billion tokens in Kazakh, Russian, English, and Turkish, with each language represented by 25 billion tokens.

“We now have more than 30 billion tokens. A token is a unit of data valuation, a word or part of a word. Twenty-six billion tokens were created using the Tilmash translator to translate data from English into Kazakh. Our model can now output literate Kazakh. In addition, we will create an interactive interface for users, similar to what OpenAI has done,” she noted.

To encourage widespread adoption, ISSAI plans to offer a subscription service for general users and a specialized application programming interface (API) for advanced users. The API will allow the models to be integrated into various products, including websites, smartphone apps, software code, and PC applications. The platform will support interaction with the models, reinforcement learning from human feedback, and tuning for optimal performance in different scenarios.

