What is a text corpus?

In linguistics, a corpus (or text corpus) is a large, structured collection of authentic, representative texts that is stored and processed electronically in machine-readable, annotated form. Linguistic corpora are used for language research, statistical analysis, language recognition, machine translation, and foreign-language learning.
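To make "structured, annotated, machine-readable" concrete, here is a minimal Python sketch of a toy corpus and the kind of statistical query it supports. The record layout, the field names (`id`, `genre`, `tokens`), the English sample tokens, and the part-of-speech tags are all invented for illustration; they do not describe the format of any real corpus.

```python
from collections import Counter

# Hypothetical miniature corpus: each document carries metadata (id, genre)
# and a list of (word, part-of-speech tag) pairs. All values are made up.
corpus = [
    {
        "id": "doc1",
        "genre": "news",
        "tokens": [("The", "DET"), ("corpus", "NOUN"), ("grows", "VERB")],
    },
    {
        "id": "doc2",
        "genre": "fiction",
        "tokens": [("the", "DET"), ("old", "ADJ"), ("corpus", "NOUN")],
    },
]


def word_frequencies(docs):
    """Count how often each lowercased word form occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(word.lower() for word, _tag in doc["tokens"])
    return counts


def tag_frequencies(docs):
    """Count how often each annotation tag occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tag for _word, tag in doc["tokens"])
    return counts


if __name__ == "__main__":
    print(word_frequencies(corpus).most_common(2))
    print(tag_frequencies(corpus).most_common(2))
```

Because the annotation is stored alongside each word, the same data can answer questions about vocabulary (word counts) and about grammar (tag counts) without re-reading the raw text.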

National Corpus of the Crimean Tatar Language

The National Corpus of the Crimean Tatar Language (NCCTL) is a one-of-a-kind electronic collection of Crimean Tatar texts covering various genres and historical eras. The NCCTL is intended as a tool for comprehensive linguistic research and for building automated systems such as language recognition, machine translation, and information retrieval. Building the corpus is a significant step in developing the Crimean Tatar language, which UNESCO currently classifies as severely endangered.

The Ministry of Reintegration initiated the National Corpus of the Crimean Tatar Language project as part of implementing the Crimean Tatar Language Development Strategy for 2022-2032. The project is carried out by the NGO QIRI'M Young with the support of the Swiss-Ukrainian EGAP Program (implemented by the Eastern Europe Foundation) and Taras Shevchenko National University of Kyiv.

Vision

This project is essential to preserving and developing the Crimean Tatar language. The NCCTL database makes it possible to create new electronic dictionaries and software for proofreading and machine translation of Crimean Tatar texts. Such tools will help popularize the language in everyday life as well as in the scientific and literary spheres. In addition, the linguistic base of the NCCTL will expand the presence of the Crimean Tatar language on international technical and educational platforms.