Semantic Analysis of Contexts of Different Lengths Using the MTLD Lexical Diversity Metric

O. M. Yurchenko; Yu. Yu. Povesma

doi:10.30837/bi.2026.1(104).02

Author(s)

O. M. Yurchenko, NTU "KhPI", Ukraine, https://orcid.org/0000-0002-6074-0241
Yu. Yu. Povesma, NTU "KhPI", Ukraine, https://orcid.org/0009-0008-9283-0517

DOI:

https://doi.org/10.30837/bi.2026.1(104).02

Keywords:

SEMANTICS, SEMANTIC TEXT ANALYSIS, CLUSTERING, TOPIC MODELING, THEMATIC ANALYSIS, BERTopic, MEASURE OF TEXTUAL LEXICAL DIVERSITY (MTLD)

Abstract

The semantic unfolding of a sentence's meaning into a text proceeds mainly through the preservation and variation of lexical elements that form a shared semantic field of the significant concepts in contexts of different sizes. Using short texts (task prompts) and long texts (student essay responses) from the EFCAMDAT corpus, this study applies the measure of textual lexical diversity (MTLD) to examine lexical methods of meaning representation that can correctly reproduce semantic information in contexts of different lengths. The study aims to help overcome the shortage of training data for large language models (LLMs) and to support professional integration and intercultural collaboration.
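For readers unfamiliar with MTLD, the following is a minimal sketch of how the metric is commonly computed, following the general procedure described by McCarthy and Jarvis (2010). It is not the authors' implementation. The regex tokenizer, the function names, and the choice to return infinity for a text with no repeated words are assumptions made for illustration. The 0.72 type-token ratio threshold is the conventional default, and implementations differ on whether the comparison is strict.

```python
import re

# Conventional default threshold; treated here as an assumption, not the study's setting.
TTR_THRESHOLD = 0.72


def tokenize(text):
    """Lowercase word tokenizer (illustrative assumption, not the study's preprocessing)."""
    return re.findall(r"[a-z']+", text.lower())


def mtld_one_direction(tokens, threshold=TTR_THRESHOLD):
    """Scan tokens once, counting 'factors', and return tokens per factor."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        ttr = len(types) / count
        if ttr <= threshold:
            # The running type-token ratio reached the threshold: close a full factor.
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Leftover segment contributes a partial factor proportional to its TTR drop.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    if factors == 0:
        # No word was repeated, so the score is undefined; infinity is an assumed convention.
        return float("inf")
    return len(tokens) / factors


def mtld(text, threshold=TTR_THRESHOLD):
    """Average of the forward and backward passes over the token sequence."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    forward = mtld_one_direction(tokens, threshold)
    backward = mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2.0


if __name__ == "__main__":
    short_prompt = "Write about your favourite holiday and explain why you like it."
    print(round(mtld(short_prompt), 2))
```

On very short inputs, such as a one-sentence task prompt, the score rests almost entirely on the partial factor, so values for short and long contexts may not be directly comparable without further care.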

