Class BaseTokenizer

    • Field Detail

      • TOKENIZER_DEBUG_PROVIDER

        public static final ICommentProvider TOKENIZER_DEBUG_PROVIDER
    • Constructor Detail

      • BaseTokenizer

        public BaseTokenizer()
    • Method Detail

      • tokenizeWords

        public Token[] tokenizeWords(java.lang.String strOrig,
                                     ITokenizer.StemmingMode stemmingMode)
        Breaks a string into word-only tokens. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.

        This method is used to find fuzzy matches and glossary entries.

        Results can be cached for better performance.

        Specified by:
        tokenizeWords in interface ITokenizer
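
        The word-only filtering and caching described above can be sketched as follows. This is a minimal illustration built on the JDK's java.text.BreakIterator, not the code BaseTokenizer actually runs. The names WordTokenSketch, Span, and scan are hypothetical, stemming is left out, and the cache policy (an unbounded map keyed by the input string) is an assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class WordTokenSketch {
    /** Hypothetical stand-in for a token: where a word starts and how long it is. */
    static final class Span {
        final int offset;
        final int length;

        Span(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        String textFrom(String source) {
            return source.substring(offset, offset + length);
        }
    }

    // Assumed cache policy: one entry per distinct input string, never evicted.
    private static final Map<String, List<Span>> CACHE = new ConcurrentHashMap<>();

    /** Returns word-only spans, reusing a cached result for repeated input. */
    static List<Span> tokenizeWords(String text) {
        return CACHE.computeIfAbsent(text, WordTokenSketch::scan);
    }

    private static List<Span> scan(String text) {
        List<Span> spans = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Keep only segments that begin with a letter; digits,
            // punctuation, and whitespace are dropped.
            if (Character.isLetter(text.codePointAt(start))) {
                spans.add(new Span(start, end - start));
            }
        }
        return Collections.unmodifiableList(spans);
    }

    public static void main(String[] args) {
        String s = "Save 3 files, now!";
        for (Span span : tokenizeWords(s)) {
            System.out.println(span.offset + ": " + span.textFrom(s));
        }
    }
}
```

        The cached list is wrapped as unmodifiable because every caller that passes the same string receives the same instance.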
      • tokenizeWordsToStrings

        public java.lang.String[] tokenizeWordsToStrings(java.lang.String str,
                                                         ITokenizer.StemmingMode stemmingMode)
        Description copied from interface: ITokenizer
        Breaks a string into word-only strings. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.

        When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)

        This method is used for dictionary lookup.

        Results are not cached.

        Specified by:
        tokenizeWordsToStrings in interface ITokenizer
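
        The "stem first, then the original word if it differs" ordering can be illustrated with a small sketch. The class StemFirstSketch and its toyStem method are hypothetical. The toy stemmer (drop a trailing "s" from words longer than three characters) only stands in for a real stemmer and is not what any actual tokenizer uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class StemFirstSketch {
    /** Toy stemmer for illustration only: strips one trailing "s". */
    static String toyStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /**
     * For each word, emits its stem, followed by the original word
     * when the two differ.
     */
    static String[] expand(List<String> words, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            String stem = stemmer.apply(word);
            out.add(stem);
            if (!stem.equals(word)) {
                out.add(word);
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] result = expand(Arrays.asList("files", "open"), StemFirstSketch::toyStem);
        System.out.println(Arrays.toString(result));
    }
}
```

        Emitting both forms lets a dictionary lookup succeed whether the entry is keyed by the inflected word or by its stem.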
      • tokenizeVerbatim

        public Token[] tokenizeVerbatim(java.lang.String strOrig)
        Breaks a string into tokens. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.

        This method is used to mark string differences in the UI and to tune similarity.

        Results are not cached.

        Specified by:
        tokenizeVerbatim in interface ITokenizer
      • tokenizeVerbatimToStrings

        public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
        Description copied from interface: ITokenizer
        Breaks a string into strings. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.

        This method is used to mark string differences in the UI and for debugging purposes.

        Results are not cached.

        Specified by:
        tokenizeVerbatimToStrings in interface ITokenizer
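
        As a contrast with word-only tokenization, the sketch below splits a string at every word boundary and keeps each piece, including digits and punctuation. It again relies on java.text.BreakIterator and is not the actual implementation. The names VerbatimSketch and splitVerbatim are hypothetical, and keeping whitespace runs as separate pieces is an assumption of this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class VerbatimSketch {
    /** Splits text at every word boundary and keeps every piece, unstemmed. */
    static String[] splitVerbatim(String text) {
        List<String> parts = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitVerbatim("Save 3 files")));
    }
}
```

        Because nothing is discarded, concatenating the pieces reproduces the input exactly, which is the property a difference-marking view needs in order to map tokens back to positions in the original string.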
      • getSupportedLanguages

        public java.lang.String[] getSupportedLanguages()
        Description copied from interface: ITokenizer
        Returns an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers whose supported languages can only be determined at runtime, such as the HunspellTokenizer.

        To indicate that this method should be used, set the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.

        Specified by:
        getSupportedLanguages in interface ITokenizer
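
        The runtime-discovery convention can be sketched with a self-contained stand-in. Everything below is hypothetical: LangSupport imitates the role of the Tokenizer annotation, LanguageAware imitates the relevant part of ITokenizer, and resolveLanguages shows one way a caller might choose between declared and discovered languages. None of it is taken from the real classes.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

class RuntimeLanguagesSketch {
    /** Hypothetical stand-in for an annotation that lists supported languages. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface LangSupport {
        // Hypothetical sentinel meaning "ask the instance at runtime".
        String DISCOVER_AT_RUNTIME = "DISCOVER_AT_RUNTIME";

        String[] languages();
    }

    /** Hypothetical stand-in for the interface method documented above. */
    interface LanguageAware {
        String[] getSupportedLanguages();
    }

    /** Languages are fixed at compile time, so the annotation lists them. */
    @LangSupport(languages = { "de" })
    static class FixedTokenizer implements LanguageAware {
        public String[] getSupportedLanguages() {
            return new String[0];
        }
    }

    /** Languages depend on what is installed, so the sentinel is used. */
    @LangSupport(languages = { LangSupport.DISCOVER_AT_RUNTIME })
    static class DictionaryBackedTokenizer implements LanguageAware {
        public String[] getSupportedLanguages() {
            // A real implementation might scan installed dictionaries here.
            return new String[] { "en-US", "fr-FR" };
        }
    }

    /** Uses the declared list unless it holds only the sentinel. */
    static String[] resolveLanguages(LanguageAware tokenizer) {
        LangSupport ann = tokenizer.getClass().getAnnotation(LangSupport.class);
        if (ann == null) {
            return new String[0];
        }
        String[] declared = ann.languages();
        if (declared.length == 1 && LangSupport.DISCOVER_AT_RUNTIME.equals(declared[0])) {
            return tokenizer.getSupportedLanguages();
        }
        return declared;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolveLanguages(new FixedTokenizer())));
        System.out.println(Arrays.toString(resolveLanguages(new DictionaryBackedTokenizer())));
    }
}
```

        Requiring the sentinel to be the only entry keeps the two modes from mixing: a tokenizer either declares its languages statically or reports them all at runtime.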