Distillation for Adapting Language Models to the Russian Language
DOI: https://doi.org/10.14529/jsfi250401

Keywords: large language model, distillation, fine-tuning, model adaptation, Russian language

Abstract
Adapting large language models (LLMs) to morphologically rich languages like Russian presents a major challenge, as multilingual models often exhibit limited transfer due to predominantly English-centric pre-training. This study investigates knowledge distillation (KD) as a more effective alternative to supervised fine-tuning (SFT) for the final calibration stage of language adaptation. We introduce an efficient offline top-K distillation approach that transfers knowledge from a 32B Russian-adapted teacher model to a 4B student model through tokenizer alignment and direct logit transfer. Experimental results demonstrate that KD consistently surpasses SFT, achieving up to a 4.22% performance improvement, with top-100 distillation yielding the highest gains (3.27% on average), albeit with increased memory consumption (62 GB vs. 7 GB for top-10). Moreover, the advantages of KD are most pronounced for student models with lower adaptive capacity (i.e., smaller LoRA α values). These findings underscore the efficacy of KD as a practical and scalable approach for language adaptation, while emphasizing the necessity of balancing performance improvements against computational efficiency.
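The abstract describes the approach only at a high level; the sketch below is one way an offline top-K distillation loss of this kind can be computed, assuming teacher top-K logits and token ids have been pre-extracted and that tokenizer alignment has already mapped them into the student vocabulary. The function name, tensor shapes, and temperature handling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def topk_distillation_loss(student_logits, teacher_topk_values, teacher_topk_indices,
                           temperature=1.0):
    """Hypothetical sketch of an offline top-K knowledge distillation loss.

    student_logits:       (batch, seq_len, vocab) logits from the student model
    teacher_topk_values:  (batch, seq_len, K) pre-stored top-K teacher logits
    teacher_topk_indices: (batch, seq_len, K) vocabulary ids of those logits,
                          assumed valid in the student vocabulary after
                          tokenizer alignment
    """
    # Teacher distribution restricted to its K highest-scoring tokens.
    teacher_probs = F.softmax(teacher_topk_values / temperature, dim=-1)

    # Student log-probabilities over the full vocabulary, then gathered
    # at the teacher's top-K token ids (direct logit transfer).
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    student_topk_log_probs = torch.gather(student_log_probs, -1, teacher_topk_indices)

    # Cross-entropy between the truncated teacher distribution and the
    # student's log-probabilities at the same token ids, averaged over positions.
    loss = -(teacher_probs * student_topk_log_probs).sum(dim=-1).mean()
    return loss * temperature ** 2


if __name__ == "__main__":
    # Illustrative shapes only: batch of 2, sequence length 8, K = 10.
    vocab_size = 50_000
    student_logits = torch.randn(2, 8, vocab_size)
    teacher_vals = torch.randn(2, 8, 10)
    teacher_ids = torch.randint(0, vocab_size, (2, 8, 10))
    print(topk_distillation_loss(student_logits, teacher_vals, teacher_ids))
```

Storing only the K highest teacher logits per position is what drives the memory trade-off reported in the abstract: roughly 7 GB for top-10 versus 62 GB for top-100 in the authors' setup, with top-100 yielding the larger quality gains.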
License
Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 3.0 License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.