Language-Specific Configurations
To achieve global search functionality, Algolia needs to know the language of both your data and your end users.
Knowing this enables the engine to apply important word-based processing techniques, such as:
- removing common (stop) words like “the” and “a”
- making singulars and plurals equivalent
- detecting word roots
- separating or combining compound words.
This page goes over all of these techniques. First, however, you need to tell Algolia which languages are being used.
Setting the Language of the Search
Algolia doesn't try to detect the language of your index or of your users. For dictionary-based settings, such as typo tolerance, stop words, and plurals, you therefore need to tell the engine which languages these settings should rely on. If you don't, the engine falls back to its default behavior of using all dictionaries, which can lead to anomalies such as French spelling rules being applied to English. To make language-based settings behave precisely and unambiguously, override the default by specifying the language of your data and of your end users.
You can do this individually for each setting, or more globally with one system-wide setting.
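For example, here's a minimal sketch using the JavaScript API client (v4-style calls; the application ID, API key, and index name are placeholders) that declares English as the language of both the index and the queries:

```js
// Minimal sketch: declare the language of the data and of the queries once,
// so dictionary-based features rely on the English dictionaries only.
// Credentials and index name are placeholders.
const algoliasearch = require('algoliasearch');

const client = algoliasearch('YourApplicationID', 'YourAdminAPIKey');
const index = client.initIndex('products');

index
  .setSettings({
    indexLanguages: ['en'], // language of the data in your records
    queryLanguages: ['en'], // language of your end users' queries
  })
  .then(() => console.log('Language settings applied'));
```

The later sketches on this page reuse this index handle rather than repeating the setup.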
Removing stop words
To separate the key terms of a search from its common words, such as “the”, “on”, and “it”, the engine can be set up to ignore these common words. Stripping a search of them helps the engine focus on the essentials of what people are looking for: nouns and adjectives.
Removing stop words is dictionary-based. We parse several sources (Wiktionary and ranks.nl) to build a list of words commonly used as stop words, not only in English but in around 50 available languages.
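As a sketch (reusing the index handle created above), you could restrict stop-word removal to English by passing a language code to the removeStopWords setting instead of a plain true, which keeps other languages' dictionaries out of the picture:

```js
// Sketch: strip English stop words ("the", "a", "on", ...) from queries.
// `index` is the handle created in the earlier sketch.
index.setSettings({
  queryLanguages: ['en'],
  removeStopWords: ['en'], // a list of language codes is safer than `true`
});
```

With this in place, a query like “the black dress” is handled essentially as “black dress”.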
Ignoring plurals (and other alternative forms)
Ignoring plurals, if enabled, tells the engine to consider the plural and singular forms of a word as equivalent.
In English, this is as easy as ignoring the “s” (“cars” = “car”), but what about “es”, or “feet” = “foot”?
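For example (a sketch, again reusing the index handle from the first snippet), you can enable this behavior for English only through the ignorePlurals setting:

```js
// Sketch: treat singular and plural forms as equivalent, for English only.
index.setSettings({
  queryLanguages: ['en'],
  ignorePlurals: ['en'], // e.g. "car" and "cars" now match each other
});
```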
To ensure completeness, and to support multiple languages, we rely heavily on Wiktionary templates,
which allow Wiktionary contributors to declare alternative forms of a word. For example, the template {en-noun|s}
would show up like this on the “car” page of Wiktionary:
car (plural cars)
By using templates found inside the Wiktionary data, we are able to build our dictionary of alternative forms. Note that almost every language has its own template syntax, and many languages have multiple templates.
Wiktionary templates also support other alternative forms:
- German declension, where a German noun changes form depending not only on its gender and number, but also on its case, that is, the role it plays in a sentence (nominative, accusative, dative, or genitive).
A German noun can therefore have numerous endings: -er, -e, -es, -e (nominative); -en, -e, -es, -e (accusative); -em, -er, -em, -en (dative); -es, -er, -es, -er (genitive).
- Dutch diminutives, where a Dutch noun changes its ending to express smallness, countability, and similar nuances. For example, a huisje is a small huis (house), and a colaatje is a glass of cola.
Splitting compound words
Compound words refer to noun phrases (or nominal groups) which combine, without spaces, a number of words to form a single entity or idea.
For example, “Vaðlaheiðarvegavinnuverkfærageymsluskúraútidyralyklakippuhringur” is a combination of Icelandic words meaning “the key ring of the key chain of the outer door to the storage tool shed of the road workers on the Vaðlaheiði plateau”. Very precise, and probably useful when you need a key ring of the key chain, etc.
A simpler example is the German word “Hundehütte”, which means “dog house”.
The goal of decompounding is to index and search the individual words “Hund” and “Hütte” (in English, “dog” and “house”) separately, thus improving the chance of a match.
For example, imagine that a user searches for “Hütte für große Hunde” (in English, “house for big dogs”), but in your records, you only have the term “Hundehütte”. Without decompounding, Algolia wouldn’t be able to match these records: the query and the records can only match if the compound word “Hundehütte” is also indexed in its split form.
This setting supports six languages: Dutch (nl), German (de), Finnish (fi), Danish (da), Swedish (sv), and Norwegian Bokmål (no). Compound words are automatically split within:
- all queries where queryLanguages contains one of the six supported languages,
- and all attributes configured in decompoundedAttributes.
Splitting compound words doesn’t alter the records sent to Algolia. Compound words aren’t replaced by their segmented version, but indexed in both formats: as the full word and as its atoms.
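As an illustration (a sketch reusing the same index handle; the name attribute is hypothetical), you could enable decompounding for a German catalog like this:

```js
// Sketch: split German compound words found in the hypothetical `name` attribute,
// and tell the engine that queries are written in German.
index.setSettings({
  queryLanguages: ['de'],
  decompoundedAttributes: {
    de: ['name'],
  },
});
```

A record whose name contains “Hundehütte” is then indexed both as the full compound and as its atoms, so the query “Hütte für große Hunde” can match it.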
Word segmentation
In some logographic languages, the words of a query or sentence aren’t separated by spaces as they are in languages using the Latin alphabet; the reader distinguishes each word from the context. Since Algolia’s relevance matches words in the query with words in the records, it needs to identify which sets of characters represent words in a given query. To do so, a word segmentation step takes place.
For example, “長い赤いドレス” in Japanese means “long red dress”. When receiving this query, Algolia segments it into its component words: “長い” (long), “赤い” (red), and “ドレス” (dress). The same segmentation happens on the records, ensuring a good match and good relevance for queries in Japanese.
Algolia performs segmentation in three languages: Japanese (ja), Chinese (zh), and Korean (ko). To ensure this segmentation applies, the queryLanguages and indexLanguages parameters need to be set with the relevant language code.
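For instance (a sketch reusing the index handle from the first snippet), a Japanese index could be configured and queried like this:

```js
// Sketch: enable word segmentation for a Japanese index.
index.setSettings({
  indexLanguages: ['ja'],
  queryLanguages: ['ja'],
});

// The query below is then segmented into 長い / 赤い / ドレス before matching
// against the (equally segmented) records.
index.search('長い赤いドレス').then(({ hits }) => console.log(hits));
```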
Japanese transliteration and type-ahead
The Japanese language has a complex writing system that uses three different scripts: Kanji, Hiragana, and Katakana. When typing a query in Japanese, users first type its pronunciation in Hiragana and then, if relevant, convert it to Katakana or Kanji.
To ensure Algolia starts returning relevant results as soon as users start typing, Algolia indexes Japanese words in both their original form and in Hiragana. Japanese users thus start seeing search results from the first typed characters, and not just when the query is complete.
Transliteration is only available in Japanese (ja). To apply it, make sure to set the indexLanguages setting to ja. You can limit transliteration to certain attributes, or turn it off completely, with the attributesToTransliterate setting.
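As a sketch (reusing the index handle from the first snippet; the title attribute is hypothetical), you could enable transliteration for a single attribute:

```js
// Sketch: index Japanese words in Hiragana as well, but only for the
// hypothetical `title` attribute; other attributes keep their original form only.
index.setSettings({
  indexLanguages: ['ja'],
  queryLanguages: ['ja'],
  attributesToTransliterate: ['title'],
});
```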
You can use this feature together with Query Suggestions to ensure Japanese users start seeing suggestions from the first keystrokes. To do so, make sure to set the indexLanguages setting on the Query Suggestions index.