MODEL.TS files


Used by:
Keyman Developer.
Description:
A .MODEL.TS file is a lexical model definition source file. This holds all the code used by a Keyman lexical model, in plain text.
Details:
A .MODEL.TS file is written in the TypeScript language. Keyman Developer compiles this lexical model source file, which can also reference a wordlist (.TSV) file, into a lexical model (.MODEL.JS) file.

Reference

This is a small TypeScript source code file that tells the compiler where to find the word list file. It also gives us the option to tell the compiler a little more about our language’s spelling system, or orthography.

The model definition template

Keyman Developer will provide you with a model definition similar to the following.

/*
  sencoten 1.0 generated from template.

  This is a minimal lexical model source that uses a tab delimited wordlist.
  See documentation online at https://help.keyman.com/developer/ for
  additional parameters.
*/

const source: LexicalModelSource = {
  format: 'trie-1.0',
  wordBreaker: {
    use: 'default',
  },
  sources: ['wordlist.tsv'],
};
export default source;

Let's step through this file, line-by-line.

On the first line of code (after the opening comment), we're declaring the source code of a new lexical model.

const source: LexicalModelSource = {

On the second line, we're saying the lexical model will use the trie-1.0 format. The trie format creates a lexical model from one or more word lists; the trie structures the lexical model such that it can search through thousands of words very quickly to make predictions.

  format: 'trie-1.0',

On lines 3–5, we're specifying the word breaking algorithm that we want to use. Keyman supplies a default algorithm that conforms to the rules expected for many Latin-script languages.

  wordBreaker: {
    use: 'default',
  },

On the sixth line, we're telling the trie where to find our wordlist.

  sources: ['wordlist.tsv'],

The seventh line marks the end of the lexical model source object. If we specify any customizations, they must be declared above this line:

};

The eighth line is necessary to allow external applications to read the lexical model source code.

export default source;

Customizing our lexical model

The template, as described in the previous section, is a good starting point, and may be all you need for your language. However, most languages require a few customizations. The trie model supports the following customizations:

word breaking
How to determine when words start and end in the writing system.
search term to key
How and when to ignore accents and lettercase.

Word breaking

The trie family of lexical models needs to know what a word is in running text. In languages using the Latin script, such as English, French, and SENĆOŦEN, finding words is easy: words are separated by spaces or punctuation. The actual rules for where to find words can get quite tricky to describe, but Keyman implements the Unicode Standard Annex #29 §4.1 Default Word Boundary Specification, which works well for most languages. If the default doesn't quite work for your language, you can tweak it.
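As a rough sketch of such a tweak, the model definition below keeps the default word breaker but asks it to join words at a hyphen, using the joinWordsAt option listed in the version history for Keyman 14. The two interfaces are hypothetical stand-ins declared only so the sketch type-checks on its own; the assumption that joinWordsAt takes a list of strings, and the choice of the hyphen, are illustrative and should be checked against the Keyman Developer reference.

```typescript
// Hypothetical stand-ins for the types that Keyman Developer supplies,
// declared here only so that this sketch type-checks on its own.
interface WordBreakerSpecSketch {
  use: 'default';
  joinWordsAt?: string[]; // assumed shape: a list of joining characters
}

interface LexicalModelSourceSketch {
  format: 'trie-1.0';
  wordBreaker: WordBreakerSpecSketch;
  sources: string[];
}

const source: LexicalModelSourceSketch = {
  format: 'trie-1.0',
  wordBreaker: {
    use: 'default',
    // Assumption: treat a hyphen as part of a word rather than as a
    // boundary, so that a hyphenated compound stays together as one word.
    joinWordsAt: ['-'],
  },
  sources: ['wordlist.tsv'],
};
// In a real .model.ts file, this would be followed by: export default source;
```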

In languages written in other scripts, however, especially East and Southeast Asian scripts like Chinese, Japanese, Khmer, Lao, and Thai, there are no obvious breaks between words. For these languages, there must be special rules for determining when words start and stop. This is what a word breaking function is responsible for. It is a little bit of code that looks at some text to determine where the words are.
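To make the idea of a word breaking function concrete, here is a minimal standalone sketch. It assumes the text marks word boundaries with whitespace or with U+200B ZERO WIDTH SPACE. The WordSpanSketch interface and the breakWordsSketch name are hypothetical; the span type and function signature that Keyman actually expects may differ, so treat this as an illustration of the technique rather than a drop-in word breaker.

```typescript
// Hypothetical shape for one word found in the text; the span type that
// Keyman actually expects may differ.
interface WordSpanSketch {
  text: string;  // the word itself
  start: number; // index of the first code unit of the word
  end: number;   // index one past the last code unit of the word
}

// Splits text into words at whitespace and at U+200B ZERO WIDTH SPACE.
function breakWordsSketch(text: string): WordSpanSketch[] {
  const spans: WordSpanSketch[] = [];
  // Match each maximal run of characters that are neither whitespace
  // nor U+200B (which \s does not cover in JavaScript).
  const wordPattern = /[^\s\u200b]+/g;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    spans.push({
      text: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return spans;
}
```

A function like this only helps when writers actually type a separator between words. Where text carries no separators at all, the word breaking function needs some other source of knowledge, such as a dictionary, to decide where words begin and end.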

Search term to key

To look up words quickly, the trie model creates a search key: it takes the latest word (as determined by the word breaker) and converts it into a “regular” form. The purpose of this “regular” form is to make searching for a word work, regardless of things such as accents, diacritics, lettercase, and minor spelling variations. The “regular” form is called the key. Typically, the key is always in lowercase, and lacks all accents and diacritics. For example, the key of “naïve” is “naive” and the key of “Canada” is “canada”.

The form of the word that is stored is “regularized” through the use of a key function, which you can define in TypeScript code.

The key function takes a string, the raw search term, and returns a string, being the “regular” key. As an example, consider the default key function; that is, the key function that is used if you do not specify one:

searchTermToKey: function (term) {
  // Use this pattern to remove common diacritical marks (accents).
  // See: https://www.compart.com/en/unicode/block/U+0300
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Lowercase each letter in the string INDIVIDUALLY.
  // Why individually? Some languages have context-sensitive lowercasing
  // rules (e.g., Greek), which we would like to avoid.
  // So we convert the string into an array of code points (Array.from(term)),
  // convert each individual code point to lowercase (.map(c => c.toLowerCase())),
  // and join the pieces back together again (.join(''))
  let lowercasedTerm = Array.from(term).map(c => c.toLowerCase()).join('');

  // Once it's lowercased, we convert it to NFKD normalization form
  // This does many things, such as:
  //
  //  - separating characters from their accents/diacritics
  //      e.g., "ï" -> "i" + "¨" (U+0308)
  //  - converting lookalike characters to a canonical ("regular") form
  //      e.g., ";" -> ";" (yes, those are two completely different characters -- U+037E and U+003B!)
  //  - converting "compatible" characters to their canonical ("regular") form
  //      e.g., "𝔥𝔢𝔩𝔩𝔬" -> "hello"
  let normalizedTerm = lowercasedTerm.normalize('NFKD');

  // Now, using the pattern defined above, replace each accent and diacritic with the
  // empty string. This effectively removes all accents and diacritics!
  //
  // e.g.,  "i" + "¨" (U+0308) -> "i"
  let termWithoutDiacritics = normalizedTerm.replace(COMBINING_DIACRITICAL_MARKS, '');

  // The resultant key is lowercased, and has no accents or diacritics.
  return termWithoutDiacritics;
},

This should be sufficient for most Latin-based writing systems. However, there are cases, such as with SENĆOŦEN, where some characters do not decompose into a base letter and a diacritic. In this case, it is necessary to write your own key function.
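As a sketch of what such a custom key function might look like, the code below applies the same steps as the default (lowercase, NFKD, strip combining marks) and then replaces a few letters that NFKD leaves untouched. The names searchTermToKeySketch and EXTRA_FOLDS_SKETCH are hypothetical, and the table entries are illustrative only: which letters to fold, and whether folding them is appropriate at all, is a decision for the people who write the language.

```typescript
// Hypothetical table of lowercase letters that NFKD does not decompose,
// mapped to the base letter to use in the key. Entries are illustrative.
const EXTRA_FOLDS_SKETCH = new Map<string, string>([
  ['\u0167', 't'], // ŧ LATIN SMALL LETTER T WITH STROKE
  ['\u023c', 'c'], // ȼ LATIN SMALL LETTER C WITH STROKE
]);

function searchTermToKeySketch(term: string): string {
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Same steps as the default key function: lowercase each code point,
  // decompose with NFKD, then strip the combining diacritical marks.
  const folded = Array.from(term)
    .map(c => c.toLowerCase())
    .join('')
    .normalize('NFKD')
    .replace(COMBINING_DIACRITICAL_MARKS, '');

  // Extra step: replace each remaining letter that has a table entry.
  return Array.from(folded)
    .map(c => EXTRA_FOLDS_SKETCH.get(c) || c)
    .join('');
}
```

With this table, searchTermToKeySketch('SENĆOŦEN') returns 'sencoten': Ć loses its accent through NFKD as usual, while Ŧ is handled by the table. In a model definition, the body of such a function would go in the searchTermToKey property.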

Version History

Keyman 12
Added: the .model.ts file type.
Keyman 13
No changes.
Keyman 14
Added: an alternative syntax for specifying word breakers: wordBreaker: { 'use': ... }.
Added: specify which characters should be used to join with word breakers: wordBreaker: { 'joinWordsAt': ... }.