public interface ITokenizer
Modifier and Type | Interface and Description |
---|---|
static class | ITokenizer.StemmingMode |
Modifier and Type | Method and Description |
---|---|
java.lang.String[] | getSupportedLanguages() Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. |
Token[] | tokenizeVerbatim(java.lang.String str) Breaks a string into tokens. |
java.lang.String[] | tokenizeVerbatimToStrings(java.lang.String str) Breaks a string into strings. |
Token[] | tokenizeWords(java.lang.String str, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only tokens. |
java.lang.String[] | tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only strings. |
Token[] tokenizeWords(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens. Whether stemming is applied depends on the supplied ITokenizer.StemmingMode.
This method is used to find fuzzy matches and glossary entries.
Results can be cached for better performance.
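The caching mentioned above could look roughly like the following sketch. It is not the library's implementation: `WordTokenCache`, its nested `Mode` enum (a stand-in for ITokenizer.StemmingMode, whose constants are not listed here), and the use of `String[]` in place of `Token[]` are all assumptions made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical caching wrapper; not part of the documented API.
class WordTokenCache {

    // Stand-in for ITokenizer.StemmingMode (constant names are assumptions).
    enum Mode { NONE, STEM }

    private final Map<String, String[]> cache = new HashMap<>();
    private final BiFunction<String, Mode, String[]> tokenizer;

    // Counts real tokenizer calls; exposed only to make the caching observable.
    int computations = 0;

    WordTokenCache(BiFunction<String, Mode, String[]> tokenizer) {
        this.tokenizer = tokenizer;
    }

    synchronized String[] tokenizeWords(String str, Mode mode) {
        // The mode is part of the key, because the same string can
        // tokenize differently under different stemming modes.
        String key = mode.name() + ':' + str;
        String[] cached = cache.get(key);
        if (cached == null) {
            computations++;
            cached = tokenizer.apply(str, mode);
            cache.put(key, cached);
        }
        // Return a copy so callers cannot modify the cached array.
        return cached.clone();
    }
}
```

Returning a copy costs an allocation per call but protects the cached array from modification by callers. A real cache would probably also need a size bound or an eviction policy.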
java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only strings. Whether stemming is applied depends on the supplied ITokenizer.StemmingMode.
When stemming is used, both the original word and its stem may be included in the results if they differ; the stem comes first.
This method is used for dictionary lookup.
Results are not cached.
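The stem-first ordering can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `DemoStemmingMode` stands in for ITokenizer.StemmingMode, the "stemmer" only strips a trailing "s", and words are found by splitting on non-letter characters. A real tokenizer would use a proper language analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ITokenizer.StemmingMode (constant names are assumptions).
enum DemoStemmingMode { NONE, STEM }

class StemFirstDemo {

    // Toy stemmer (assumption): strips a trailing "s" from words longer than three letters.
    static String stem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Mimics the documented ordering: the stem first, then the original word if it differs.
    static String[] tokenizeWordsToStrings(String str, DemoStemmingMode mode) {
        List<String> result = new ArrayList<>();
        for (String word : str.split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            if (mode == DemoStemmingMode.NONE) {
                result.add(word);
                continue;
            }
            String stem = stem(word);
            result.add(stem);
            if (!stem.equals(word)) {
                result.add(word);
            }
        }
        return result.toArray(new String[0]);
    }
}
```

With this toy stemmer, "Two cats" yields "Two", "cat", "cats" when stemming is on, and "Two", "cats" when it is off. Keeping both forms lets a dictionary lookup hit an entry filed under either the stem or the inflected word.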
Token[] tokenizeVerbatim(java.lang.String str)
Breaks a string into tokens.
This method is used to mark string differences in the UI and to tune similarity.
Results are not cached.
java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Breaks a string into strings.
This method is used to mark string differences in the UI and for debugging purposes.
Results are not cached.
java.lang.String[] getSupportedLanguages()
Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer.
To indicate that this method should be used, set the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
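A minimal sketch of how such an annotation marker might be consulted is shown below. The `DemoTokenizer` annotation is a stand-in written for this example; its element name `languages`, the value of its constant, and the `LanguageResolver` helper are assumptions, not the real Tokenizer annotation or its consumers.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in for the Tokenizer annotation (element name is an assumption).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface DemoTokenizer {
    String DISCOVER_AT_RUNTIME = "DISCOVER_AT_RUNTIME";

    String[] languages() default {};
}

// The annotation contains only the marker, so languages come from the method below.
@DemoTokenizer(languages = { DemoTokenizer.DISCOVER_AT_RUNTIME })
class RuntimeLanguagesTokenizer {

    // A real tokenizer would inspect installed dictionaries; hard-coded here.
    public String[] getSupportedLanguages() {
        return new String[] { "en-US", "de-DE" };
    }
}

// Hypothetical helper showing how a caller could honor the marker.
class LanguageResolver {

    static String[] supportedLanguages(RuntimeLanguagesTokenizer tokenizer) {
        DemoTokenizer annotation = tokenizer.getClass().getAnnotation(DemoTokenizer.class);
        String[] declared = annotation.languages();
        boolean discover = declared.length == 1
                && DemoTokenizer.DISCOVER_AT_RUNTIME.equals(declared[0]);
        // Fall back to the statically declared list when the marker is absent.
        return discover ? tokenizer.getSupportedLanguages() : declared;
    }
}
```

The check requires the marker to be the only entry, matching the "contain only" wording above. A list that mixes the marker with real language codes would be treated as a plain static list in this sketch.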