public class DefaultTokenizer extends java.lang.Object implements ITokenizer
Nested classes/interfaces inherited from interface ITokenizer: ITokenizer.StemmingMode
Modifier and Type | Field and Description |
---|---|
static java.lang.String[] | EMPTY_STRINGS_LIST |
static Token[] | EMPTY_TOKENS_LIST |
Constructor and Description |
---|
DefaultTokenizer() |
Modifier and Type | Method and Description |
---|---|
java.lang.String[] | getSupportedLanguages() Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. |
static java.text.BreakIterator | getWordBreaker() Returns an iterator to break sentences into words. |
static boolean | isContains(Token[] tokensList, Token tokenForCheck) Check if array contains token. |
static boolean | isContainsAll(Token[] tokensList, Token[] listForFind, boolean notExact) Check if the listForFind tokens are present in tokensList. |
static java.util.List<Token[]> | searchAll(Token[] tokensList, Token[] listForFind, boolean notExact) Find and return all tokens in tokensList that match the tokens in listForFind. |
Token[] | tokenizeVerbatim(java.lang.String strOrig) Breaks a string into tokens. |
java.lang.String[] | tokenizeVerbatimToStrings(java.lang.String str) Breaks a string into strings. |
Token[] | tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only tokens. |
java.lang.String[] | tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only strings. |
public static final Token[] EMPTY_TOKENS_LIST
public static final java.lang.String[] EMPTY_STRINGS_LIST
public Token[] tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens. Stemming is applied according to the supplied ITokenizer.StemmingMode.
This method is used to find fuzzy matches and glossary entries.
Results can be cached for better performance.
Specified by: tokenizeWords in interface ITokenizer
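The note that results can be cached can be illustrated with a small memoizing wrapper. This is only a sketch under stated assumptions, not the cache this class uses: `TokenCacheSketch` and its `Function`-based tokenizer are hypothetical stand-ins, strings replace `Token` objects, and the key is the input string alone (a real cache would presumably also key on the stemming mode).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper: remembers the result of tokenizing each distinct string.
final class TokenCacheSketch {
    private final Map<String, String[]> cache = new HashMap<>();
    private final Function<String, String[]> tokenizer;
    private int tokenizerCalls;

    TokenCacheSketch(Function<String, String[]> tokenizer) {
        this.tokenizer = tokenizer;
    }

    /** Returns the cached result for text, invoking the tokenizer only on a miss. */
    String[] tokenize(String text) {
        return cache.computeIfAbsent(text, key -> {
            tokenizerCalls++; // counts real tokenizer invocations
            return tokenizer.apply(key);
        });
    }

    int tokenizerCalls() {
        return tokenizerCalls;
    }

    public static void main(String[] args) {
        // Assumption: whitespace splitting stands in for real word tokenization.
        TokenCacheSketch cache = new TokenCacheSketch(s -> s.split("\\s+"));
        cache.tokenize("open the file");
        cache.tokenize("open the file");
        System.out.println("tokenizer calls: " + cache.tokenizerCalls());
    }
}
```

The sketch hands back the same array instance on every hit, so callers would have to treat the result as read-only. It is also single-threaded; concurrent use would need a thread-safe map.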
public java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Description copied from interface: ITokenizer
Breaks a string into word-only strings. Stemming is applied according to the supplied ITokenizer.StemmingMode.
When stemming is used, both the original word and its stem may be included in the results if they differ. (The stem comes first.)
This method is used for dictionary lookup.
Results are not cached.
Specified by: tokenizeWordsToStrings in interface ITokenizer
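The stem-first ordering described above can be sketched as follows. This is an illustration only: `StemFirstSketch`, `withStems`, and `toyStem` are hypothetical names, the toy stemmer merely strips a trailing "s", and the sketch assumes each stem is placed directly before its own word (the description could also be read as listing all stems first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class StemFirstSketch {

    /** Emits each word, preceded by its stem when the stem differs from the word. */
    static String[] withStems(List<String> words, UnaryOperator<String> stemmer) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            String stem = stemmer.apply(word);
            if (!stem.equals(word)) {
                result.add(stem); // the stem goes first
            }
            result.add(word);
        }
        return result.toArray(new String[0]);
    }

    /** Toy stand-in for a real stemmer: drops a trailing "s" from longer words. */
    static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    public static void main(String[] args) {
        String[] out = withStems(List.of("files", "open"), StemFirstSketch::toyStem);
        System.out.println(Arrays.toString(out));
    }
}
```

Words whose stem equals the word itself appear only once, which matches the "if they differ" condition.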
public Token[] tokenizeVerbatim(java.lang.String strOrig)
Description copied from interface: ITokenizer
Breaks a string into tokens.
This method is used to mark string differences in the UI and to tune similarity.
Results are not cached.
Specified by: tokenizeVerbatim in interface ITokenizer
public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Description copied from interface: ITokenizer
Breaks a string into strings.
This method is used to mark string differences in the UI and for debugging purposes.
Results are not cached.
Specified by: tokenizeVerbatimToStrings in interface ITokenizer
public static java.text.BreakIterator getWordBreaker()
Returns an iterator to break sentences into words.
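Because getWordBreaker() returns a java.text.BreakIterator, the usual boundary-walking loop applies. The sketch below is a minimal illustration that assumes the JDK's stock word iterator as a stand-in; the iterator this class actually returns may apply different rules. `WordBreakerSketch` and `words` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakerSketch {

    /** Splits text at word boundaries and keeps segments containing a letter or digit. */
    static List<String> words(String text) {
        // Assumption: BreakIterator.getWordInstance stands in for getWordBreaker().
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ENGLISH);
        breaker.setText(text);

        List<String> result = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE;
                start = end, end = breaker.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation; skip those segments.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Open the file, then save it."));
    }
}
```

A word iterator reports every boundary, including those around spaces and punctuation, so the filter is what turns the segments into a word-only list.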
public static boolean isContains(Token[] tokensList, Token tokenForCheck)
Check if the array contains the token.
public static boolean isContainsAll(Token[] tokensList, Token[] listForFind, boolean notExact)
Check if the listForFind tokens are present in tokensList.
Parameters:
tokensList - a list of tokens to be searched
listForFind - a list of tokens to search for in tokensList
notExact - true if the tokens in listForFind can be non-contiguous or in a different order in tokensList. If false, tokens must be exactly the same.
Returns:
true if the tokens in listForFind are found in tokensList
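The two matching modes selected by notExact can be sketched like this. It is one reading of the parameter description, not the actual implementation: the sketch assumes that "exactly the same" means a contiguous, in-order run, uses strings in place of Token objects, and relies on equals() for comparison. `ContainsAllSketch` and `containsAll` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ContainsAllSketch {

    /** Returns true if needles occur in haystack under the chosen matching mode. */
    static <T> boolean containsAll(T[] haystack, T[] needles, boolean notExact) {
        List<T> pool = Arrays.asList(haystack);
        if (notExact) {
            // Any order, gaps allowed: each needle only has to occur somewhere.
            for (T needle : needles) {
                if (!pool.contains(needle)) {
                    return false;
                }
            }
            return true;
        }
        // Exact mode (assumed): needles must appear as one contiguous, in-order run.
        // Collections.indexOfSubList returns -1 when no such run exists.
        return Collections.indexOfSubList(pool, Arrays.asList(needles)) >= 0;
    }

    public static void main(String[] args) {
        String[] haystack = {"save", "the", "file", "now"};
        System.out.println(containsAll(haystack, new String[] {"the", "file"}, false));
        System.out.println(containsAll(haystack, new String[] {"file", "the"}, false));
        System.out.println(containsAll(haystack, new String[] {"file", "the"}, true));
    }
}
```

In this reading, a reordered or gapped phrase fails the exact check but still passes when notExact is true.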
public static java.util.List<Token[]> searchAll(Token[] tokensList, Token[] listForFind, boolean notExact)
Find and return all tokens in tokensList that match the tokens in listForFind.
Parameters:
tokensList - a list of tokens to be searched
listForFind - a list of tokens to search for in tokensList
notExact - true if the tokens in listForFind can be non-contiguous or in a different order in tokensList. If false, tokens must be exactly the same.
Returns:
a list of each match for listForFind that was found in tokensList
public java.lang.String[] getSupportedLanguages()
Description copied from interface: ITokenizer
Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer.
Indicate that this should be used by setting the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
Specified by: getSupportedLanguages in interface ITokenizer