Keyboard Support

Contact and Search

Keyman.com Homepage

Header bottom

Keyman.com

Other versions
Version 18.0Version 17.0 (current version)Version 16.0Version 15.0Version 14.0Version 13.0

On this page

Word breaker

The trie family of lexical models needs to know what a word is in running text. In languages using the Latin script—like, English, French, and SENĆOŦEN—finding words is easy. Words are separated by spaces or punctuation. The actual rules for where to find words can get quite tricky to describe, but Keyman implements the Unicode Standard Annex #29 §4.1 Default Word Boundary Specification which works well for most languages.

However, in languages written in other scripts—especially East Asian scripts like Chinese, Japanese, Khmer, Lao, and Thai—there are is no obvious break in between words. For these languages, there must be special rules for determining when words start and stop. This is what a word breaker function is responsible for. It is a little bit of code that looks at some text to determine where the words are.

The word breaker function can be specified in the model definition file as follows:

const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  // CUSTOMIZE THIS:
  wordBreaker: function(text: string): Span[] {
    // Return zero or more **spans** of text:
    return [];
  },
  // other customizations go here:
};

export default source;

The function must return zero or more Span objects. The spans, representing an indivisible span of text, must be in ascending order of their start point, and they must be non-overlapping.

A Span object

A span is an indivisible piece of a sentence. This is typically a word, but it can also be a series of spaces, an emoji, or a series of punctuation characters. A span that looks like a word is treated like a word in the trie-1.0 model.

A span has the following properties:

{
  start: number;
  end: number;
  length: number;
  text: string;
}

The start and end properties are indices into the original string at which the span begins, and the index at which the next span begins.

length is end - start.

text is the actual text of the string contained within the span.

Example for English

Here is a full example of word breaker function that returns an array of spans in an ASCII (English) string. Note: this is just an example—please use the default word breaker for English text!

const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  // EXAMPLE BEGINS HERE:
  wordBreaker: function(text: string): Span[] {
    // A span derived from a JavaScript RegExp match array:
    class RegExpDerivedSpan implements Span {
      readonly text: string;
      readonly start: number;

      constructor(text: string, start: number) {
        this.text = text;
        this.start = start;
      }

      get length(): number {
        return this.text.length;
      }

      get end(): number {
        return this.start + this.text.length;
      }
    }

    let matchWord = /[A-Za-z0-9']+/g;
    let words: Span[] = [];
    let match: RegExpExecArray;
    while ((match = matchWord.exec(phrase)) !== null) {
      words.push(new RegExpDerivedSpan(match[0], match.index));
    }

    return words;
  },
  // other customizations go here:
};

export default source;

See also

The TypeScript definition of WordBreakingFunction and Span


Return to “Advanced Lexical Model Topics”