Keyboard Support

Contact and Search

Keyman.com Homepage

Header bottom

Keyman.com

Other versions
Version 18.0Version 17.0 (current version)Version 16.0Version 15.0Version 14.0Version 13.0

On this page

Search term to key

To look up words quickly, the trie model creates a search key that takes the latest word (as determined by the word breaker and converts it into an internal form. The purpose of this internal form is to make searching for a word work, as expected, regardless of things such as accents, diacritics, letter case, and minor spelling variations. The internal form is called the key. Typically, the key is always in lowercase, and lacks all accents and diacritics. For example, the key for “naïve" is naive and the key for “Canada” is canada.

The form of the word that is stored is “regularized” through the use of a key function, which you can define in TypeScript code.

Note: this function runs both on every word when the wordlist is compiled and on the input, whenever a suggestion is requested. This way, whatever a user types is matched to something stored in the lexical model, without the user having to type things in a specific way.

The key function takes a string which is the raw search term, and returns a new string, being the “regularized” key. As an example, consider the default key function; that is, the key function that is used if you do not specify one:

searchTermToKey: function (term: string): string {
  // Use this pattern to remove common diacritical marks.
  // See: https://www.compart.com/en/unicode/block/U+0300
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Converts to Unicode Normalization form D.
  // This means that MOST accents and diacritics have been "decomposed" and
  // are stored as separate characters. We can then remove these separate
  // characters!
  //
  // e.g., Å → A + ˚
  let normalizedTerm = term.normalize('NFD');

  // Now, make it lowercase.
  //
  // e.g.,  A + ˚ → a + ˚
  let lowercasedTerm = normalizedTerm.toLowerCase();

  // Now, using the pattern above replace each accent and diacritic with the
  // empty string. This effectively removes all accents and diacritics!
  //
  // e.g.,  a + ˚ → a
  let termWithoutDiacritics = lowercasedTerm.replace(COMBINING_DIACRITICAL_MARKS, '')

  // The resultant key is lowercased, and has no accents or diacritics.
  return termWithoutDiacritics;
},

This should be sufficient for most Latin-based writing systems. However, there are cases, such as with SENĆOŦEN, where some characters do not decompose into a base letter and a diacritic. In this case, it is necessary to write your own key function.

Use in your model definition file

To use this in your model definition file, provide a function as the searchTermToKey property of the lexical model source:

const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  searchTermToKey: function (wordform: string): string {
    // Your searchTermToKey function goes here!
    let key = wordform.toLowerCase();
    return key;
  },
  // other customizations go here:
};

export default source;

Suggested customizations

  • For all writing systems, normalize into NFC or NFD form using wordform = wordform.normalize('NFC').
  • For Latin-based scripts, lowercase the word, and remove diacritics.
  • For scripts that use the U+200C zero-width joiner (ZwJ) and/or the U+200D zero-width non-joiner (ZWNJ) (e.g., Brahamic scripts), remove the ZWJ or ZWNJ from the end of the input with wordform = wordform.replace(/[\u200C\u200D]+$/

Return to “Advanced Lexical Model Topics”