Training of large language model Mistral on Slovak language data

Authors

  • Peter Bednár
  • Marek Dobeš
  • Radovan Garabík

DOI:

https://doi.org/10.2478/jazcas-2025-0037

Keywords:

large language models, Mistral, computational linguistics, Slovak language, Natural Language Processing, model fine-tuning

Abstract

This study explores the preparation and training of the Mistral 7B large language model on Slovak language data. Although commercial models offer robust Slovak language support, open-source models often fall short due to the limited amount of Slovak training data. To address this, we chose to fine-tune the existing Mistral 7B model on Slovak text data, both to create a publicly accessible model and to observe the development of Slovak linguistic structures within the neural network. Our approach involved significant decisions regarding the model's purpose, training strategy, and the inclusion of Slovak-specific data. We opted for fine-tuning rather than training a new model from scratch, leveraging existing resources to improve efficiency and effectiveness. This project underscores the cultural significance of language models as digital artifacts that preserve and advance linguistic heritage.
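The abstract does not specify the fine-tuning stack. As illustration only, the sketch below shows how continued causal-LM training of Mistral 7B on Slovak text might look with the Hugging Face transformers, peft, and datasets libraries; the corpus file name, LoRA settings, and all hyperparameters are placeholder assumptions, not the authors' actual configuration.

# A minimal sketch of LoRA fine-tuning of Mistral 7B on Slovak text.
# Assumptions (not from the paper): Hugging Face transformers/peft/datasets,
# a plain-text corpus in "slovak_corpus.txt", and illustrative hyperparameters.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach low-rank adapters so only a small fraction of weights is trained,
# one common way to make 7B-scale fine-tuning feasible on limited hardware.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Tokenize the raw Slovak text for next-token-prediction training.
dataset = load_dataset("text", data_files={"train": "slovak_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-7b-sk",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()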

Published

19-12-2025

How to Cite

Bednár, P., Dobeš, M., & Garabík, R. (2025). Training of large language model Mistral on Slovak language data. Jazykovedný časopis, 76(2), 433-451. https://doi.org/10.2478/jazcas-2025-0037