
Follow the link to work with the National Corpus of the Crimean Tatar Language (NCCTL) on the Sketch Engine platform.

Many tools are available to the NCCTL user to work with the corpus text database and obtain the necessary analytics. Here you will find information about the key functions of NCCTL, the possibilities they open up, and instructions for working on the Sketch Engine platform.

N-grams

The N-gram tool produces frequency lists of sequences of tokens. N-grams are also called multi-word expressions (or MWEs) or lexical bundles. This function can help you explore the collocations found in the NCCTL texts. Creating a list of n-grams takes one or two seconds. Using additional filters can slightly slow down the list-generation process.

The Corpus user has a choice of two search mechanisms for creating N-grams:

  • basic,
  • advanced.

The basic mechanism lets you set only one parameter: the length of the N-gram, that is, the number of words in each sequence. The basic search also automatically excludes low-frequency N-grams (those that occur very rarely in the Corpus).

The advanced N-gram generation mechanism provides a larger set of options. In addition to the length of the N-gram, you can choose the attribute on which the list is generated, with word and lemma (the initial form of the word) being the most common. The result can be limited by the minimum and maximum frequency of the N-grams in the Corpus. Additional criteria such as "starts with," "ends with," and "contains" are also available to filter N-grams by the characters (letters) they start with, end with, or contain.
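The options described above can be illustrated with a minimal Python sketch. It is not how Sketch Engine is implemented; the function name `ngram_frequencies`, its parameters, and the toy token list are assumptions made for illustration only.

```python
from collections import Counter

def ngram_frequencies(tokens, n, min_freq=1, max_freq=None,
                      starts_with=None, ends_with=None, contains=None):
    """Count n-grams of length n, then apply frequency and character filters.

    All parameter names are hypothetical; they only mirror the options
    described in the text (length, min/max frequency, starts/ends/contains).
    """
    counts = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    result = []
    for ngram, freq in counts.most_common():
        if freq < min_freq or (max_freq is not None and freq > max_freq):
            continue
        if starts_with and not ngram.startswith(starts_with):
            continue
        if ends_with and not ngram.endswith(ends_with):
            continue
        if contains and contains not in ngram:
            continue
        result.append((ngram, freq))
    return result

# Toy data, not taken from the NCCTL.
tokens = "a b a b c a b".split()
frequent_bigrams = ngram_frequencies(tokens, 2, min_freq=2)
```

Setting `min_freq` above 1 plays the role of the automatic exclusion of low-frequency N-grams in the basic mechanism.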

Concordance

Concordance is the most powerful tool in the Corpus, with various search options. This feature allows you to find words, phrases, tags, documents, text types, or corpus structures and displays the results in the context of their use as a concordance. The concordance can be sorted, filtered, counted, and further processed to obtain the desired result. View options allow you to display additional information such as lemmas (initial word forms), tags (codes for word characteristics), and other attributes.

To create a concordance using the basic mechanism, enter a word or phrase and click the "Search" button. Next, you can adjust one of the viewing options, namely the results display mode:

  • KWIC - outputs a concordance with the search word in the center and some context on the left and right.
  • Sentence - shows whole sentences containing the searched word. Long sentences are not cut but are displayed in several lines.
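To make the KWIC layout concrete, here is a small Python sketch that builds keyword-in-context lines from a list of tokens. The helper names `kwic` and `format_kwic`, the `window` parameter, and the case-insensitive matching are assumptions for illustration, not part of the NCCTL or Sketch Engine interface.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left context so the keyword forms a central column."""
    width = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{width}}  {kw}  {right}" for left, kw, right in lines]

# Toy data, not taken from the NCCTL.
tokens = "the cat sat on the mat near the door".split()
for line in format_kwic(kwic(tokens, "the", window=2)):
    print(line)
```

A sentence view would instead return each whole sentence that contains the hit, without cutting the context to a fixed window.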

To create a concordance using the advanced mechanism, the user must select the "Advanced" tab, specify the necessary settings and click the "Search" button.

The table below shows available settings and tools for working with the concordance.

Title | Description
Result details | Additional data on the obtained results.
Filter results | Filters the obtained results.
View options | Changes how the results are displayed. You can enable features such as "Number rows," "Show counter," etc.
Download results | Downloads the search results of the NCCTL concordance. The following file formats are available for download: CSV, XLSX, XML, etc.
Change criteria | Changes the criteria for the obtained results (for example, the search query).
Get a random sample | When working with a large concordance, random sampling reduces the number of concordance lines while maintaining the sample's representativeness.
Shuffle lines | Puts the concordance lines in random order.
Sort | Sorts concordance lines alphabetically by the KWIC or by a token to the left or right of the KWIC.
Frequency | Lists the different words, lemmas (initial forms of words), tags (codes of morphological characteristics), and other attributes found at the specified position in the concordance and calculates their frequency.
Collocations | Scans the specified range to the right and left of the KWIC and calculates a selection of statistical characteristics to recognize collocations.
Distribution of hits in the corpus | A diagram showing the parts of the Corpus where the KWIC was found.
KWIC / sentence view | Switches the results display mode. KWIC displays a concordance with the search word in the center and some context to the left and right. Sentence shows complete sentences containing the search word; long sentences are not trimmed but displayed on multiple lines.
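As an illustration of what the Collocations tool does, the following Python sketch scans a window around each occurrence of a node word and scores every neighbor with logDice, computed as 14 + log2(2·f(x,y) / (f(x) + f(y))). The text above does not say which statistics NCCTL displays, so the choice of logDice, the function name `collocations`, and the `window` parameter are assumptions.

```python
import math
from collections import Counter

def collocations(tokens, node, window=2):
    """Score the words found within `window` tokens of `node` using logDice."""
    freq = Counter(tokens)   # f(x): frequency of every token in the corpus
    cooc = Counter()         # f(x, y): co-occurrences with the node word
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[tokens[j]] += 1
    scores = {
        word: 14 + math.log2(2 * f_xy / (freq[node] + freq[word]))
        for word, f_xy in cooc.items()
    }
    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data, not taken from the NCCTL.
tokens = "strong tea and strong coffee and weak tea".split()
ranked = collocations(tokens, "strong", window=1)
```

A word that appears almost only next to the node scores higher than a word that is frequent everywhere, which is what makes the measure useful for spotting collocations.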

GDEX

GDEX stands for “Good Dictionary EXamples”. It is a system for evaluation of sentences with respect to their suitability to serve as dictionary examples or good examples for teaching purposes. It is one of the concordance tools that allows you to generate a selection of the most illustrative cases of word use based on the Corpus materials.

This feature automatically recognizes sentences that are easy to understand and illustrative enough to serve as Good Dictionary EXamples or sentences suitable for teaching. The GDEX tool will be helpful for compilers of explanatory dictionaries, textbooks, and other educational materials.

Wordlist

With this feature, the Corpus users can create frequency lists of words and their initial forms in minutes, explore which words are used more often in the language, find rare words, and more.

The wordlist tool works at the level of tokens. The default options create a list of words, because units of text that are not words are automatically excluded. The list can also be limited by frequency of use by setting a minimum and a maximum value.
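A frequency list of this kind can be sketched in a few lines of Python. In this sketch a "word" is assumed to be any token containing at least one letter, and tokens are lowercased before counting; both choices, like the function name `wordlist`, are assumptions and may differ from what the platform does.

```python
from collections import Counter

def wordlist(tokens, min_freq=1, max_freq=None):
    """Build a frequency list of words, skipping tokens with no letters."""
    # Assumption: punctuation and pure numbers count as non-words.
    words = [t.lower() for t in tokens if any(ch.isalpha() for ch in t)]
    counts = Counter(words)
    return [
        (word, freq)
        for word, freq in counts.most_common()
        if freq >= min_freq and (max_freq is None or freq <= max_freq)
    ]

# Toy data, not taken from the NCCTL.
tokens = ["Tea", ",", "tea", "42", "coffee", "."]
print(wordlist(tokens))
```

Passing `max_freq=1` would keep only the words that occur once, which is one way to look for rare words.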

The user has a choice of two mechanisms for creating a wordlist:

  • basic,
  • advanced.

Keywords

Keywords are individual tokens that occur more frequently in the focus corpus than in the base (reference) corpus. Any lexeme can claim to be a keyword if it is used more often in the focus corpus than in the base corpus. The result will include primarily nouns and adjectives since the frequencies of other parts of speech are generally the same in all texts.

The keyword extraction tool provides NCCTL users with the opportunity to:

  • extract single-word and multiword units that are typical of the corpus/document/text or that define its content or topic

It detects medium- and high-frequency words that are used more often than in general language.

  • compare two corpora/documents/texts by identifying what is unique in the first corpus compared to the second one

Comparing two texts by hand is quite difficult. Even with short texts, statistical comparison can reveal phenomena that would be missed by manual comparison. 

Keywords can be used to compare two corpora or subcorpora. The result will show what is characteristic of the focus (sub)corpus compared to the reference (sub)corpus.

This tool allows you to work with many parameters for detailed research. For example, you can focus on rare or common words, include words containing numbers, exclude specific tokens from the search results, etc.
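The idea of comparing a focus corpus with a reference corpus can be sketched as follows. The score used here is the ratio of frequencies per million tokens with a smoothing constant added to both sides; treating this as the score behind the NCCTL keyword list is an assumption, as are the function name `keyness` and the `smoothing` parameter.

```python
from collections import Counter

def keyness(focus_tokens, ref_tokens, smoothing=1.0):
    """Rank focus-corpus tokens by (fpm_focus + s) / (fpm_ref + s).

    Both token lists must be non-empty. A small `smoothing` value favors
    rare words; a large value favors common words.
    """
    focus = Counter(focus_tokens)
    ref = Counter(ref_tokens)
    focus_size = len(focus_tokens)
    ref_size = len(ref_tokens)
    scores = {}
    for word, count in focus.items():
        fpm_focus = count * 1_000_000 / focus_size    # frequency per million
        fpm_ref = ref[word] * 1_000_000 / ref_size    # 0 if absent from ref
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data, not taken from the NCCTL.
focus = "corpus corpus text the".split()
reference = "the the text text".split()
ranked = keyness(focus, reference)
```

Words that appear at similar rates in both corpora score close to 1, so only words that are unusually frequent in the focus corpus rise to the top.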

The NCCTL user has a choice of two mechanisms for creating a list of keywords:

  • basic,
  • advanced.

Text types analysis

This tool shows a breakdown by metadata and provides users with statistics about the texts included in the NCCTL. For example, you can see how many documents, tokens, or words there are in the Corpus in texts downloaded from each website, written by each author, or published in each period.
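A breakdown of this kind amounts to grouping documents by one metadata attribute and summing their sizes. The sketch below assumes each document is a dictionary with hypothetical `meta` and `tokens` keys; the attribute names and the data are invented for illustration and do not describe how the NCCTL stores its texts.

```python
from collections import defaultdict

def text_type_breakdown(documents, attribute):
    """Count documents and tokens for each value of a metadata attribute."""
    totals = defaultdict(lambda: {"documents": 0, "tokens": 0})
    for doc in documents:
        # Documents lacking the attribute are grouped under "(unknown)".
        value = doc["meta"].get(attribute, "(unknown)")
        totals[value]["documents"] += 1
        totals[value]["tokens"] += len(doc["tokens"])
    return dict(totals)

# Toy data, not taken from the NCCTL.
documents = [
    {"meta": {"author": "A"}, "tokens": ["x", "y"]},
    {"meta": {"author": "B"}, "tokens": ["x"]},
    {"meta": {"author": "A"}, "tokens": ["z"]},
]
by_author = text_type_breakdown(documents, "author")
```

The same function can group by any other attribute, such as a source website or a publication period, as long as the documents carry that metadata.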

The number of values in the chart can be controlled in settings and the pie chart can be downloaded.

Video instructions for analyzing text types.