Default Token Filters

    Use a token filter to filter a tokenizer’s results and get better search result matches.

    The Search Service’s token filters work with tokenizers to filter search input tokens. Tokens can come from the content of your Search index or a Search query.

    For more information about token filters, see Token Filters.
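Token filters are applied as part of an analyzer, in the order they are listed. As a minimal sketch of how this looks in practice (the analyzer name en_custom and the choice of tokenizer and filters are illustrative, not a prescribed configuration), a custom analyzer in the analysis section of a Search index definition could chain a lowercasing filter, an English stop word filter, and an English stemmer:

```json
{
  "analysis": {
    "analyzers": {
      "en_custom": {
        "type": "custom",
        "tokenizer": "unicode",
        "token_filters": [
          "to_lower",
          "stop_en",
          "stemmer_en_snowball"
        ]
      }
    }
  }
}
```

Because the filters run in sequence, placing to_lower before stop_en and the stemmer means that stop word matching and stemming both operate on lowercased tokens.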

    The following token filters are available:


    apostrophe

    Removes all characters after an apostrophe (') from tokenizer results. Also removes the apostrophe. For example, the token Couchbase’s becomes Couchbase.

    camelCase

    Splits text in camelCase inside tokenizer results into separate tokens.

    For example, the token filter splits the token camelCaseText into camel, Case, and Text.

    cjk_bigram

    Converts Chinese, Japanese, and Korean tokenizer results into bigrams, or groups of two consecutive words.

    cjk_width

Converts full-width ASCII variants in Chinese, Japanese, and Korean tokenizer results into their equivalent Latin characters, and half-width katakana characters into their equivalent kana characters.

    elision_ca

    Removes all characters before an apostrophe from Catalan language tokenizer results. Also removes the apostrophe.

    elision_fr

    Removes all characters before an apostrophe from French language tokenizer results. Also removes the apostrophe.

    For example, the token filter converts the token l’avion to avion.

    elision_ga

    Removes all characters before an apostrophe from Gaelic language tokenizer results. Also removes the apostrophe.

    elision_it

    Removes all characters before an apostrophe from Italian language tokenizer results. Also removes the apostrophe.

    hr_suffix_transformation_filter

    Replaces suffixes in Croatian tokenizer results with normalized suffixes.

    lemmatizer_he

Lemmatizes Hebrew tokenizer results, reducing similar forms of a word to a common lemma. Also corrects spelling mistakes.

    mark_he

Marks Hebrew, non-Hebrew, and numeric tokens in tokenizer results.

    niqqud_he

    Forces niqqud-less spelling for Hebrew text in tokenizer results.

    normalize_ar

    Uses Unicode Normalization to normalize Arabic characters in tokens.

    normalize_ckb

    Uses Unicode Normalization to normalize Kurdish characters in tokens.

    normalize_de

    Uses Unicode Normalization to normalize German characters in tokens.

    normalize_fa

    Uses Unicode Normalization to normalize Persian characters in tokens.

    normalize_hi

    Uses Unicode Normalization to normalize Hindi characters in tokens.

    normalize_in

    Uses Unicode Normalization to normalize Indonesian characters in tokens.

    possessive_en

Checks the second-to-last character in English-language tokenizer results for an apostrophe. If it finds an apostrophe, the token filter removes the last two characters from the token.

    reverse

Reverses the characters in each token from tokenizer results. For example, the token filter converts the token acrobat to taborca.

    stemmer_ar

Checks Arabic tokenizer results for suffixes and prefixes. If it finds any, the token filter removes them to leave the root word.

    stemmer_ckb

    Checks Kurdish tokenizer results for prefixes. If it finds a prefix, the token filter removes it to leave the root word.

    stemmer_da_snowball

    Uses the Snowball string processing language to convert Danish language tokenizer results into word stems.

    stemmer_de_light

    Uses light stemming to convert German language tokenizer results into word stems.

    Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem.

    Light stemming only removes frequently used prefixes and suffixes, and doesn’t produce the root of a word to preserve semantics.

    stemmer_de_snowball

    Uses the Snowball string processing language to convert German language tokenizer results into word stems.

    stemmer_en_snowball

    Uses the Snowball string processing language to convert English language tokenizer results into word stems.

    stemmer_es_light

    Uses light stemming to convert Spanish language tokenizer results into word stems.

    Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem.

    Light stemming only removes frequently used prefixes and suffixes, and doesn’t produce the root of a word to preserve semantics.

    stemmer_es_snowball

    Uses the Snowball string processing language to convert Castilian Spanish language tokenizer results into word stems.

    stemmer_fi_snowball

    Uses the Snowball string processing language to convert Finnish language tokenizer results into word stems.

    stemmer_fr_light

    Uses light stemming to convert French language tokenizer results into word stems.

    Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem.

    Light stemming only removes frequently used prefixes and suffixes, and doesn’t produce the root of a word to preserve semantics.

    stemmer_fr_min

Uses minimal stemming to convert French language tokenizer results into word stems.

Minimal stemming only removes the last character of a word or replaces some suffixes. For example, the stemmer_fr_min token filter removes x, s, r, e, and é characters from the end of words and replaces the aux suffix with al, so chevaux becomes cheval.

    stemmer_fr_snowball

    Uses the Snowball string processing language to convert French language tokenizer results into word stems.

    stemmer_hi

    Uses a lightweight stemmer for Hindi to remove suffixes from tokenizer results.

    stemmer_hr

    Uses an open source stemming rule set to find the root word in Croatian language tokenizer results.

    stemmer_hu_snowball

    Uses the Snowball string processing language to convert Hungarian language tokenizer results into word stems.

    stemmer_it_light

    Uses light stemming to convert Italian language tokenizer results into word stems.

    Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem.

    Light stemming only removes frequently used prefixes and suffixes, and doesn’t produce the root of a word to preserve semantics.

    stemmer_it_snowball

    Uses the Snowball string processing language to convert Italian language tokenizer results into word stems.

    stemmer_nl_snowball

    Uses the Snowball string processing language to convert Dutch language tokenizer results into word stems.

    stemmer_no_snowball

    Uses the Snowball string processing language to convert Norwegian language tokenizer results into word stems.

    stemmer_porter

Transforms tokenizer results with the Porter stemming algorithm. For more information, see the official Porter Stemming Algorithm documentation.

    stemmer_pt_light

    Uses light stemming to convert Portuguese language tokenizer results into word stems.

    Regular stemming can affect the semantic meaning of words, as several words with different meanings might have the same root stem.

    Light stemming only removes frequently used prefixes and suffixes, and doesn’t produce the root of a word to preserve semantics.

    stemmer_ro_snowball

    Uses the Snowball string processing language to convert Romanian language tokenizer results into word stems.

    stemmer_ru_snowball

    Uses the Snowball string processing language to convert Russian language tokenizer results into word stems.

    stemmer_sv_snowball

    Uses the Snowball string processing language to convert Swedish language tokenizer results into word stems.

    stemmer_tr_snowball

    Uses the Snowball string processing language to convert Turkish language tokenizer results into word stems.

    stop_ar

    Removes tokens from tokenizer results that are unnecessary for a search, based on an Arabic dictionary.

    stop_bg

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Bulgarian dictionary.

    stop_ca

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Catalan dictionary.

    stop_ckb

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Kurdish dictionary.

    stop_cs

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Czech dictionary.

    stop_da

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Danish dictionary.

    stop_de

    Removes tokens from tokenizer results that are unnecessary for a search, based on a German dictionary.

    stop_el

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Greek dictionary.

    stop_en

    Removes tokens from tokenizer results that are unnecessary for a search, based on an English dictionary. For example, the token filter removes and, is, and the from tokenizer results.

    stop_es

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Castilian Spanish dictionary.

    stop_eu

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Basque dictionary.

    stop_fa

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Persian dictionary.

    stop_fi

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Finnish dictionary.

    stop_fr

    Removes tokens from tokenizer results that are unnecessary for a search, based on a French dictionary.

    stop_ga

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Gaelic dictionary.

    stop_gl

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Galician Spanish dictionary.

    stop_he

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Hebrew dictionary.

    stop_hi

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Hindi dictionary.

    stop_hr

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Croatian dictionary.

    stop_hu

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Hungarian dictionary.

    stop_hy

    Removes tokens from tokenizer results that are unnecessary for a search, based on an Armenian dictionary.

    stop_id

    Removes tokens from tokenizer results that are unnecessary for a search, based on an Indonesian dictionary.

    stop_it

    Removes tokens from tokenizer results that are unnecessary for a search, based on an Italian dictionary.

    stop_nl

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Dutch dictionary.

    stop_no

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Norwegian dictionary.

    stop_pt

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Portuguese dictionary.

    stop_ro

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Romanian dictionary.

    stop_ru

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Russian dictionary.

    stop_sv

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Swedish dictionary.

    stop_tr

    Removes tokens from tokenizer results that are unnecessary for a search, based on a Turkish dictionary.

    to_lower

    Converts all characters in tokens to lowercase.

    unique

Removes duplicate tokens from tokenizer results, so each distinct token appears only once.
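To illustrate how several of the filters in this reference combine for a single language, here is another illustrative sketch (the analyzer name fr_custom and the choice of tokenizer are assumptions, not a prescribed configuration) of a French analyzer that lowercases tokens, strips elided articles, removes French stop words, and stems what remains:

```json
{
  "analysis": {
    "analyzers": {
      "fr_custom": {
        "type": "custom",
        "tokenizer": "unicode",
        "token_filters": [
          "to_lower",
          "elision_fr",
          "stop_fr",
          "stemmer_fr_snowball"
        ]
      }
    }
  }
}
```

With this chain, a token such as l’avion is lowercased, reduced to avion by elision_fr, kept by stop_fr, and finally stemmed by stemmer_fr_snowball.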