Class DefaultTokenizer

  • All Implemented Interfaces:
    ITokenizer

    public class DefaultTokenizer
    extends java.lang.Object
    implements ITokenizer
    Methods for tokenizing strings.
    • Field Detail

      • EMPTY_TOKENS_LIST

        public static final Token[] EMPTY_TOKENS_LIST
      • EMPTY_STRINGS_LIST

        public static final java.lang.String[] EMPTY_STRINGS_LIST
    • Constructor Detail

      • DefaultTokenizer

        public DefaultTokenizer()
    • Method Detail

      • tokenizeWords

        public Token[] tokenizeWords​(java.lang.String strOrig,
                                     ITokenizer.StemmingMode stemmingMode)
        Breaks a string into word-only tokens. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.

        This method is used to find fuzzy matches and glossary entries.

        Results can be cached for better performance.

        Specified by:
        tokenizeWords in interface ITokenizer
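
        The word-only behavior described above can be illustrated with a minimal, self-contained sketch. It is not the DefaultTokenizer implementation: the class WordOnlySketch and its words helper are hypothetical, it works on plain strings rather than Token objects, and it ignores tags, stemming, and caching. It only shows the idea of splitting with a java.text.BreakIterator and dropping segments that contain no letters.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of the documented API.
final class WordOnlySketch {

    // Splits text at word boundaries and keeps only segments that
    // contain at least one letter (so numbers and whitespace are dropped).
    public static List<String> words(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
        breaker.setText(text);
        List<String> result = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetter)) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Save 3 files now")); // [Save, files, now]
    }
}
```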
      • tokenizeWordsToStrings

        public java.lang.String[] tokenizeWordsToStrings​(java.lang.String str,
                                                         ITokenizer.StemmingMode stemmingMode)
        Description copied from interface: ITokenizer
        Breaks a string into word-only strings. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.

        When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)

        This method is used for dictionary lookup.

        Results are not cached.

        Specified by:
        tokenizeWordsToStrings in interface ITokenizer
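
        The ordering rule for stems (stem first, then the original word if it differs) can be sketched as follows. This is an assumption-laden illustration: StemOrderSketch, toyStem, and withStems are hypothetical names, and toyStem is a deliberately crude stand-in (it strips one trailing "s") rather than any stemmer the tokenizer actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the documented API.
final class StemOrderSketch {

    // Toy stand-in for a real stemmer: strips a single trailing "s".
    public static String toyStem(String word) {
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Emits the stem first, then the original word only when it differs.
    public static List<String> withStems(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            String stem = toyStem(word);
            out.add(stem);
            if (!stem.equals(word)) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withStems(List.of("files", "open"))); // [file, files, open]
    }
}
```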
      • tokenizeVerbatim

        public Token[] tokenizeVerbatim​(java.lang.String strOrig)
        Description copied from interface: ITokenizer
        Breaks a string into tokens. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.

        This method is used to mark string differences in the UI and to tune similarity.

        Results are not cached.

        Specified by:
        tokenizeVerbatim in interface ITokenizer
      • tokenizeVerbatimToStrings

        public java.lang.String[] tokenizeVerbatimToStrings​(java.lang.String str)
        Description copied from interface: ITokenizer
        Breaks a string into strings. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.

        This method is used to mark string differences in the UI and for debugging purposes.

        Results are not cached.

        Specified by:
        tokenizeVerbatimToStrings in interface ITokenizer
      • getWordBreaker

        public static java.text.BreakIterator getWordBreaker()
        Returns an iterator to break sentences into words.
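
        For readers unfamiliar with java.text.BreakIterator, the sketch below shows the usual way to walk the boundaries such an iterator reports. It creates its own iterator with BreakIterator.getWordInstance so that it runs without this library; whether getWordBreaker returns an equivalent instance is an assumption. WordBreakerSketch and spans are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of the documented API.
final class WordBreakerSketch {

    // Returns every segment as a {start, end} pair, including
    // whitespace and number segments.
    public static List<int[]> spans(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
        breaker.setText(text);
        List<int[]> result = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            result.add(new int[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Save 3 files now";
        for (int[] span : spans(text)) {
            // Prints the offsets and the quoted segment text.
            System.out.println(span[0] + ".." + span[1] + " \"" + text.substring(span[0], span[1]) + "\"");
        }
    }
}
```

        Because consecutive spans share their boundaries, concatenating the segments in order reproduces the input string.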
      • isContains

        public static boolean isContains​(Token[] tokensList,
                                         Token tokenForCheck)
        Checks whether the array contains the given token.
      • isContainsAll

        public static boolean isContainsAll​(Token[] tokensList,
                                            Token[] listForFind,
                                            boolean notExact)
        Check if the listForFind tokens are present in tokensList.
        Parameters:
        tokensList - a list of tokens to be searched
        listForFind - a list of tokens to search for in tokensList
        notExact - true if the tokens in listForFind may be non-contiguous or appear in a different order in tokensList; false if they must match exactly.
        Returns:
        true if the tokens in listForFind are found in tokensList
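
        One possible reading of the notExact flag is sketched below on plain string arrays instead of Token arrays. This is an interpretation of the parameter description, not the actual implementation: ContainsAllSketch and containsAll are hypothetical, "exact" is assumed to mean a contiguous run in the same order, and the handling of duplicates and empty inputs is a choice made for the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the documented API.
final class ContainsAllSketch {

    public static boolean containsAll(String[] haystack, String[] needles, boolean notExact) {
        if (notExact) {
            // Loose mode: every needle must occur somewhere, in any order.
            List<String> pool = Arrays.asList(haystack);
            for (String needle : needles) {
                if (!pool.contains(needle)) {
                    return false;
                }
            }
            return true;
        }
        // Exact mode (assumed): needles must appear as one contiguous run.
        for (int i = 0; i + needles.length <= haystack.length; i++) {
            String[] window = Arrays.copyOfRange(haystack, i, i + needles.length);
            if (Arrays.equals(window, needles)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] text = {"open", "the", "file"};
        System.out.println(containsAll(text, new String[] {"the", "file"}, false));  // true
        System.out.println(containsAll(text, new String[] {"file", "open"}, false)); // false
        System.out.println(containsAll(text, new String[] {"file", "open"}, true));  // true
    }
}
```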
      • searchAll

        public static java.util.List<Token[]> searchAll​(Token[] tokensList,
                                                        Token[] listForFind,
                                                        boolean notExact)
        Find and return all tokens in tokensList that match the tokens in listForFind.
        Parameters:
        tokensList - a list of tokens to be searched
        listForFind - a list of tokens to search for in tokensList
        notExact - true if the tokens in listForFind may be non-contiguous or appear in a different order in tokensList; false if they must match exactly.
        Returns:
        A list containing each hit of the matched tokens. Each token array represents a different instance of listForFind that was found in tokensList.
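
        The idea of returning one entry per hit can be sketched for the exact case only. The documented method returns Token arrays; this simplified sketch works on strings and returns the start index of each hit instead. SearchAllSketch and exactHits are hypothetical names, and treating "exact" as a contiguous, same-order run is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the documented API.
final class SearchAllSketch {

    // Returns the start index of every contiguous occurrence of needles.
    public static List<Integer> exactHits(String[] haystack, String[] needles) {
        List<Integer> starts = new ArrayList<>();
        if (needles.length == 0) {
            return starts; // Sketch choice: an empty query yields no hits.
        }
        for (int i = 0; i + needles.length <= haystack.length; i++) {
            String[] window = Arrays.copyOfRange(haystack, i, i + needles.length);
            if (Arrays.equals(window, needles)) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] text = {"the", "file", "and", "the", "file"};
        System.out.println(exactHits(text, new String[] {"the", "file"})); // [0, 3]
    }
}
```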
      • getSupportedLanguages

        public java.lang.String[] getSupportedLanguages()
        Description copied from interface: ITokenizer
        Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer.

        To indicate that this method should be used, set the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.

        Specified by:
        getSupportedLanguages in interface ITokenizer