Step 2: Creating a lexical model project

Create the new project

Start Keyman Developer. On the “Welcome” screen, click on New Project.... The “New Project” dialog will appear. Select “Wordlist Lexical Model” and press OK.

The “New Project” dialog, with “Wordlist Lexical Model” selected.

Provide required information

The New Lexical Model dialog box.

The “New Wordlist Lexical Model Project” dialog will appear.

To make sharing your lexical model easier, a project needs the following information:

Author Name
This is either your full name or the organization you're creating a model for. In this example, I am creating a lexical model on behalf of my organization, the National Research Council Canada, so I write that as the author name.
Model Name
We recommend the name of the language, dialect, or community that this model is intended for. The name must be written using only Latin letters and Arabic numerals. In this example, we're creating a language model for SENĆOŦEN, so we use the model name Sencoten.
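
The model name rule can be checked mechanically. The following is a minimal sketch, assuming that "Latin letters" means the unaccented letters A to Z; the function name `is_plain_model_name` is hypothetical, and Keyman Developer's own validation may be more or less strict.

```python
import re

# Assumption: only unaccented A-Z/a-z and 0-9 are accepted.
# Keyman Developer's real check may differ.
_MODEL_NAME = re.compile(r"[A-Za-z0-9]+")


def is_plain_model_name(name: str) -> bool:
    """Return True if name consists only of ASCII letters and digits."""
    return _MODEL_NAME.fullmatch(name) is not None


print(is_plain_model_name("Sencoten"))   # True
print(is_plain_model_name("SENĆOŦEN"))   # False: Ć and Ŧ are not plain A-Z
```

This is why the example uses Sencoten rather than SENĆOŦEN as the model name.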

Provide auxiliary information

The following information is also required, but most users will use default values.

Copyright
Who owns the rights to this model and its data? Typically, you can use the automatically generated default value: © 2024 Your Full Name or Your Organization.
Version
If this is the first time you've created a lexical model for your language, leave the version as 1.0. Otherwise, the version must be a string of the form major.minor: the major revision number, a period, and the minor revision number.
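
As an illustration of the version rule, here is a minimal sketch that accepts only strings of the form major.minor. The helper name `parse_version` is hypothetical, and the sketch assumes both parts are plain non-negative integers; it is not Keyman's own validation code.

```python
import re

# Assumption: exactly two numeric parts separated by one period.
_VERSION = re.compile(r"([0-9]+)\.([0-9]+)")


def parse_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' string into a (major, minor) pair of ints."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"expected major.minor, got {version!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_version("1.0"))  # (1, 0)
```

Under these assumptions, a string such as "1" or "v1.0" raises `ValueError`.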

Determine your language's BCP 47 language tag

Keyman needs to know how to link your model to the appropriate keyboard layout, so that they can both work together. To do this, Keyman utilizes BCP 47 language tags.

To add a language tag, click the Add button to bring up the “Select BCP 47 Tag” dialog box.

The “Select BCP 47 Tag” dialog box for SENĆOŦEN.

The language subtag

The only required option is the Language subtag, which is an ISO 639-1 or ISO 639-3 code.

ISO 639-1 codes are two letters long; ISO 639-3 codes are three letters long. First, try to find your language on the list of two-letter ISO 639-1 codes. This Wikipedia page lists all of the two-letter codes.

If you can't find a two-letter code, you'll need to find the closest three-letter code. You can use Glottolog to search for your language, and it will give you an appropriate code. In this example, I searched Glottolog for “Saanich” (the name of the First Nations that speak SENĆOŦEN) and found str as the code for all Straits Salish languages.

The next two fields are optional, but they allow you to be more specific about your language.
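
A quick shape check can catch a language name typed where a code belongs. This is only a sketch with a hypothetical function name, `looks_like_language_subtag`; it tests that the value is two or three lowercase letters and does not confirm that the code is actually assigned to a language.

```python
import re


def looks_like_language_subtag(code: str) -> bool:
    """Return True for two-letter (ISO 639-1 style) or three-letter
    (ISO 639-3 style) lowercase codes. Shape only, no registry lookup."""
    return re.fullmatch(r"[a-z]{2,3}", code) is not None


print(looks_like_language_subtag("str"))      # True
print(looks_like_language_subtag("Saanich"))  # False: a name, not a code
```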

The script subtag

The Script subtag allows you to specify the writing system used in your language model. If your language only uses one writing system, leave this blank.

Otherwise, in cases where a language can be written in many different writing systems, you can use this field to choose the ISO 15924 script tag that your lexical model produces.

For example, Plains Cree can be written either in standard Roman orthography, a Latin-derived script, or in syllabics, which is part of the Canadian Aboriginal syllabics family of writing systems. If I wrote a model that produced syllabics, I would choose Cans, as that is the ISO 15924 tag for Canadian Aboriginal syllabics.

The region subtag

The Region subtag allows you to specify the region your language or dialect is spoken in. If your language is only spoken in one region, leave this blank.

Some languages, however, vary between regions and countries. In our example, SENĆOŦEN describes the language that covers the entire W̱SÁNEĆ region, so this field may be left blank.

Widely spoken languages, such as English, Spanish, and French, have quite different vocabulary and even different grammatical rules from region to region and country to country. For example, the variety of Spanish spoken in Spain regularly uses words that are uncommon or even vulgar in Mexico and elsewhere in Latin America. Additionally, a region may have vocabulary that doesn't exist in the other regions where the language is spoken.

If I were creating a lexical model specific to one country, I would use the ISO 3166-1 alpha-2 country code for the region subtag. For example, ES for Spain or MX for Mexico.

If I were creating a lexical model just for Latin American Spanish (a group of countries), I would need to specify Latin America's UN M49 region code, which is 419. My lexical model would not suggest words that are common in Spain but vulgar in Latin America; however, it would predict words like pupusas and chuchitos, which are uncommon in both Spain and Mexico.
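
To show how the three subtags fit together, here is a minimal sketch that joins them into a single BCP 47 tag in the order language, script, region. The function `build_language_tag` is hypothetical and does no validation; the code crk for Plains Cree is an assumption that does not come from this page.

```python
from typing import Optional


def build_language_tag(language: str,
                       script: Optional[str] = None,
                       region: Optional[str] = None) -> str:
    """Join subtags as language[-Script][-REGION], skipping blank ones."""
    parts = [language.lower()]
    if script:
        parts.append(script.title())   # conventional casing, e.g. Cans
    if region:
        parts.append(region.upper())   # e.g. MX; numeric codes are unchanged
    return "-".join(parts)


print(build_language_tag("str"))                  # str
print(build_language_tag("es", region="MX"))      # es-MX
print(build_language_tag("es", region="419"))     # es-419
print(build_language_tag("crk", script="Cans"))   # crk-Cans (crk is assumed)
```

Leaving the optional fields blank, as recommended for SENĆOŦEN, yields just the language subtag.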

Once you are finished adding the primary language, click OK to return to the New Lexical Model Project dialog.

The Model ID

Keyman uses a model ID to sort and organize lexical models. If you choose to share your model publicly, the model ID is vital for both people and Keyman to identify and use your lexical model!

Keyman automatically generates a model ID for you, given all the information already filled out. If you're satisfied with the generated model ID, you can skip to the next step.

In this example, my generated model ID is national_research_council_canada.str.sencoten, derived from my organization name, the primary language tag, and my model name. However, I find the “author ID” part of the generated model ID excessively long. I changed the author ID to nrc, and the model ID automatically changed to the much more manageable nrc.str.sencoten.
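
The pattern in this example, author ID, then language tag, then model name, can be imitated in a few lines. This is a sketch that reproduces the two IDs shown above; `slug` and `suggest_model_id` are hypothetical helpers, and Keyman Developer's actual generation rules may differ, especially for names with unusual characters.

```python
import re


def slug(text: str) -> str:
    """Lowercase text and turn runs of other characters into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def suggest_model_id(author: str, language_tag: str, model_name: str) -> str:
    """Assemble author.language_tag.model_name, mimicking the IDs above."""
    return ".".join([slug(author), language_tag.lower(), slug(model_name)])


print(suggest_model_id("National Research Council Canada", "str", "Sencoten"))
# national_research_council_canada.str.sencoten
print(suggest_model_id("nrc", "str", "Sencoten"))
# nrc.str.sencoten
```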

Double-check the information

Verify that all of the information is correct. Once all of the required information has been filled in and verified, click OK to create the project.

Once we have created the project, we can begin to prepare the data!
