Step 4: Editing a model definition file

We have exported our wordlist to wordlist.tsv. We now need to tell the lexical model compiler how to turn this raw word list into a lexical model that is quick to use on a smartphone.

To do this, we must create a model definition file.

This is a small TypeScript source code file that tells the compiler where to find the word list file. It also gives us the option to tell the compiler a little bit more about our language’s spelling system or orthography.

The model definition template

Keyman Developer will provide you with a model definition similar to the following. If you want to create the file yourself, copy-paste the following template, and save it as model.ts. Place this file in the same folder as wordlist.tsv.

/*
  sencoten 1.0 generated from template.

  This is a minimal lexical model source that uses a tab delimited wordlist.
  See documentation online at https://help.keyman.com/developer/ for
  additional parameters.
*/

const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
};
export default source;

Let's step through this file, line-by-line.

On the first line of code (after the opening comment), we're declaring the source code of a new lexical model.

const source: LexicalModelSource = {

On the second line, we're saying the lexical model will use the trie-1.0 format. The trie format creates a lexical model from one or more word lists; the trie structures the lexical model such that it can search through thousands of words very quickly when making predictions.

  format: 'trie-1.0',
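To give a feel for why a trie makes lookups by prefix fast, here is a minimal sketch of the data structure. It is not Keyman's implementation; the names TrieNode, SketchTrie, insert, and complete are made up for this illustration.

```typescript
// One node per character. A node marks the end of a word when isWord is true.
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

// Hypothetical, simplified trie: stores words and lists completions of a prefix.
class SketchTrie {
  private root = new TrieNode();

  insert(word: string): void {
    let node = this.root;
    for (const ch of Array.from(word)) {
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }

  // Returns every stored word that starts with the given prefix.
  complete(prefix: string): string[] {
    let node = this.root;
    for (const ch of Array.from(prefix)) {
      const next = node.children.get(ch);
      if (!next) {
        return [];
      }
      node = next;
    }
    const results: string[] = [];
    const walk = (n: TrieNode, soFar: string): void => {
      if (n.isWord) {
        results.push(soFar);
      }
      n.children.forEach((child, ch) => walk(child, soFar + ch));
    };
    walk(node, prefix);
    return results;
  }
}
```

Finding the node for a prefix takes one step per typed character, no matter how many words are stored, which is why this kind of structure suits predictive text.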

On the third line, we're telling the trie where to find our wordlist.

  sources: ['wordlist.tsv'],

The fourth line closes the lexical model source definition. If we specify any customizations, they must be declared above this line:

};

The fifth line is necessary to allow external applications to read the lexical model source code.

export default source;

Customizing our lexical model

The template, as described in the previous section, is a good starting point, and may be all you need for your language. However, most languages require a few customizations. The trie model supports the following customizations:

word breaking
How to determine when words start and end in the writing system.
search term to key
How and when to ignore accents and lettercase.
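As a rough sketch of where these customizations would sit, the template could be extended as follows. The property name wordBreaker and its 'default' value are assumptions in this sketch, and SketchModelSource is a local stand-in for the LexicalModelSource type so that the snippet compiles by itself; check the Keyman Developer reference for the exact names and types.

```typescript
// Local stand-in for the type the lexical model compiler provides.
// Hypothetical: declared here only so the sketch is self-contained.
interface SketchModelSource {
  format: 'trie-1.0';
  sources: string[];
  // Assumed property: either a named built-in breaker or a custom function.
  wordBreaker?: 'default' | ((text: string) => unknown[]);
  searchTermToKey?: (term: string) => string;
}

const sketchSource: SketchModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  // Customizations are declared inside the braces, after the sources.
  wordBreaker: 'default',
  searchTermToKey: function (term: string): string {
    // Decompose, lowercase, then strip combining diacritical marks.
    return term.normalize('NFD').toLowerCase().replace(/[\u0300-\u036f]/g, '');
  },
};
```

In a real model.ts the object would be typed as LexicalModelSource and the file would still end with the export default line from the template.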

Word breaking

The trie family of lexical models needs to know what a word is in running text. In languages using the Latin script, such as English, French, and SENĆOŦEN, finding words is easy: words are separated by spaces or punctuation. The actual rules for where to find words can get quite tricky to describe, but Keyman implements the Unicode Standard Annex #29 §4.1 Default Word Boundary Specification, which works well for most languages.

However, in languages written in other scripts, especially East and Southeast Asian scripts such as Chinese, Japanese, Khmer, Lao, and Thai, there is no obvious break between words. For these languages, there must be special rules for determining when words start and stop. This is what a word breaking function is responsible for. It is a little bit of code that looks at some text to determine where the words are.
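A word breaking function can be sketched as follows. This is only an illustration under two assumptions: that the text marks word boundaries with U+200B ZERO WIDTH SPACE (as some Khmer and Thai text does), and that a word breaker returns a list of spans with the fields shown. The names WordSpan and zwspWordBreaker are hypothetical; the real function signature is defined by the Keyman Developer reference.

```typescript
// Shape of one word found in the text. The field names are an assumption
// made for this sketch.
interface WordSpan {
  start: number;  // index of the first code unit of the word
  end: number;    // index just past the last code unit of the word
  length: number; // end - start
  text: string;   // the word itself
}

// Hypothetical word breaker: treats U+200B ZERO WIDTH SPACE and ordinary
// whitespace as the only word separators.
function zwspWordBreaker(text: string): WordSpan[] {
  const spans: WordSpan[] = [];
  // Match each run of characters that is neither ZWSP nor whitespace.
  const wordPattern = /[^\u200b\s]+/g;
  let match: RegExpExecArray | null = wordPattern.exec(text);
  while (match !== null) {
    const word = match[0];
    const start = match.index;
    spans.push({
      start: start,
      end: start + word.length,
      length: word.length,
      text: word,
    });
    match = wordPattern.exec(text);
  }
  return spans;
}
```

Text that carries no separators at all needs a more elaborate approach, such as dictionary-based segmentation, which is beyond this sketch.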

Search term to key

To look up words quickly, the trie model creates a search key: it takes the latest word (as determined by the word breaker) and converts it into a “regular” form. The purpose of this “regular” form is to make searching for a word work regardless of accents, diacritics, lettercase, and minor spelling variations. The “regular” form is called the key. Typically, the key is in lowercase and lacks all accents and diacritics. For example, the key form of “naïve” is “naive” and the key form of “Canada” is “canada”.

The form of the word that is stored is “regularized” through the use of a key function, which you can define in TypeScript code.

The key function takes a string, the raw search term, and returns a string, being the “regular” key. As an example, consider the default key function; that is, the key function that is used if you do not specify one:

searchTermToKey: function (term) {
  // Use this pattern to remove common diacritical marks.
  // See: https://www.compart.com/en/unicode/block/U+0300
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Converts to Unicode Normalization form D.
  // This means that MOST accents and diacritics have been "decomposed" and
  // are stored as separate characters. We can then remove these separate
  // characters!
  //
  // e.g., Å → A + ˚
  let normalizedTerm = term.normalize('NFD');

  // Now, make it lowercase.
  //
  // e.g.,  A + ˚ → a + ˚
  let lowercasedTerm = normalizedTerm.toLowerCase();

  // Now, using the pattern above replace each accent and diacritic with the
  // empty string. This effectively removes all accents and diacritics!
  //
  // e.g.,  a + ˚ → a
  let termWithoutDiacritics = lowercasedTerm.replace(COMBINING_DIACRITICAL_MARKS, '');

  // The resultant key is lowercased, and has no accents or diacritics.
  return termWithoutDiacritics;
},

This should be sufficient for most Latin-based writing systems. However, there are cases, such as with SENĆOŦEN, where some characters do not decompose into a base letter and a diacritic. In this case, it is necessary to write your own key function.
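A custom key function for such a case might look like the sketch below. It runs the same steps as the default function and then replaces letters with strokes, which normalization leaves intact. The function name, the table name, and the choice of base letter for each stroked letter are assumptions made for illustration, not an official SENĆOŦEN mapping.

```typescript
// Hypothetical replacements for lowercase letters with strokes, which NFD
// does not decompose. The choice of base letters is an assumption.
const STROKED_TO_BASE: { [letter: string]: string } = {
  '\u0167': 't', // ŧ
  '\u2c65': 'a', // ⱥ
  '\u023c': 'c', // ȼ
  '\u019a': 'l', // ƚ
  '\u2c66': 't', // ⱦ
  '\ua741': 'k', // ꝁ
};

function customSearchTermToKey(term: string): string {
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Same steps as the default: decompose, lowercase, strip combining marks.
  const withoutDiacritics = term
    .normalize('NFD')
    .toLowerCase()
    .replace(COMBINING_DIACRITICAL_MARKS, '');

  // Replace each remaining stroked letter with its assumed base letter.
  return withoutDiacritics.replace(
    /[\u0167\u2c65\u023c\u019a\u2c66\ua741]/g,
    (letter) => STROKED_TO_BASE[letter] || letter
  );
}
```

In model.ts, a function like this would be assigned to the searchTermToKey property in place of the default.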

Once customization is done

We may still want to make some tweaks, but first we need to actually build and test our lexical model. This will be discussed in the next step.

Step 5: Compile the lexical model