public abstract class BaseTokenizer extends java.lang.Object implements ITokenizer
Nested classes/interfaces inherited from interface ITokenizer: ITokenizer.StemmingMode
| Modifier and Type | Field and Description |
|---|---|
| protected static int | DEFAULT_TOKENS_COUNT |
| protected static java.lang.String[] | EMPTY_STRING_LIST |
| protected static Token[] | EMPTY_TOKENS_LIST |
| protected boolean | shouldDelegateTokenizeExactly Indicates that tokenizeVerbatim(String) should use OmegaT's WordIterator to tokenize "exactly" for display. |
| static ICommentProvider | TOKENIZER_DEBUG_PROVIDER |
| Constructor and Description |
|---|
| BaseTokenizer() |
| Modifier and Type | Method and Description |
|---|---|
| protected Language | getEffectiveLanguage() |
| protected Language | getProjectLanguage() |
| protected org.apache.lucene.analysis.TokenStream | getStandardTokenStream(java.lang.String strOrig) Minimal implementation that returns the default implementation corresponding to all false parameters. |
| java.lang.String[] | getSupportedLanguages() Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. |
| protected abstract org.apache.lucene.analysis.TokenStream | getTokenStream(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed) |
| protected java.lang.String | printTest(java.lang.String[] strings, java.lang.String input) |
| protected java.lang.String | test(java.lang.String... args) |
| protected Token[] | tokenize(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace) |
| protected Token[] | tokenizeByCodePoint(java.lang.String strOrig) |
| protected java.lang.String[] | tokenizeByCodePointToStrings(java.lang.String strOrig) |
| protected java.lang.String[] | tokenizeToStrings(java.lang.String str, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace) |
| Token[] | tokenizeVerbatim(java.lang.String strOrig) Breaks a string into tokens. |
| java.lang.String[] | tokenizeVerbatimToStrings(java.lang.String str) Breaks a string into strings. |
| Token[] | tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only tokens. |
| java.lang.String[] | tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only strings. |
protected static final java.lang.String[] EMPTY_STRING_LIST

protected static final Token[] EMPTY_TOKENS_LIST

protected static final int DEFAULT_TOKENS_COUNT

protected boolean shouldDelegateTokenizeExactly
Indicates that tokenizeVerbatim(String) should use OmegaT's WordIterator to tokenize "exactly" for display. For language-specific tokenizers that maintain the property that (the concatenation of all tokens).equals(original string) == true, set this to false to use the language-specific tokenizer for everything.
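For illustration, here is a minimal sketch of a hypothetical subclass that clears this flag because its own tokenization is assumed to preserve the concatenation property; the class name and its trivial getTokenStream body are assumptions, not OmegaT code.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.omegat.tokenizer.BaseTokenizer;

// Hypothetical subclass: its own tokenization is assumed to satisfy
// (concatenation of all tokens).equals(original string), so it can handle
// tokenizeVerbatim(String) itself rather than delegating to WordIterator.
public class ExactDisplayTokenizer extends BaseTokenizer {

    public ExactDisplayTokenizer() {
        // Use this tokenizer's own logic for "exact" display tokenization.
        shouldDelegateTokenizeExactly = false;
    }

    @Override
    protected TokenStream getTokenStream(String strOrig, boolean stemsAllowed,
            boolean stopWordsAllowed) throws IOException {
        // Placeholder: fall back to the default stream provided by BaseTokenizer.
        return getStandardTokenStream(strOrig);
    }
}
```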
public static final ICommentProvider TOKENIZER_DEBUG_PROVIDER
public Token[] tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens. Stemming is applied according to the given ITokenizer.StemmingMode.
This method is used to find fuzzy matches and glossary entries. Results can be cached for better performance.
Specified by: tokenizeWords in interface ITokenizer
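A hedged usage sketch: how a matching component might call this method on whichever tokenizer the project has selected. The helper class is hypothetical, and the assumption that org.omegat.util.Token exposes getTextFromString(String) is taken from the OmegaT API rather than from this page.

```java
import org.omegat.tokenizer.ITokenizer;
import org.omegat.util.Token;

public class TokenizeWordsDemo {
    // Prints the word-only tokens a fuzzy matcher would see.
    static void printWordTokens(ITokenizer tok, String source) {
        Token[] words = tok.tokenizeWords(source, ITokenizer.StemmingMode.MATCHING);
        for (Token t : words) {
            // Each Token refers back into the original string by offset and length.
            System.out.println(t.getTextFromString(source));
        }
    }
}
```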
public java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Description copied from interface: ITokenizer
Breaks a string into word-only strings. Stemming is applied according to the given ITokenizer.StemmingMode.
When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)
This method is used for dictionary lookup. Results are not cached.
Specified by: tokenizeWordsToStrings in interface ITokenizer
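A similarly hedged sketch of a dictionary-lookup style call, illustrating the stem-before-original ordering described above. The concrete output in the comment depends entirely on the language and the underlying analyzer, so treat it as hypothetical.

```java
import org.omegat.tokenizer.ITokenizer;

public class DictionaryLookupDemo {
    // Returns candidate lookup keys for a dictionary query.
    static String[] lookupKeys(ITokenizer tok, String phrase) {
        String[] keys = tok.tokenizeWordsToStrings(phrase, ITokenizer.StemmingMode.GLOSSARY);
        // For "running dogs", a stemming tokenizer might yield
        // ["run", "running", "dog", "dogs"]: where stem and surface form differ,
        // the stem comes first, followed by the original word.
        return keys;
    }
}
```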
public Token[] tokenizeVerbatim(java.lang.String strOrig)
Breaks a string into tokens. This method is used to mark string differences in the UI and to tune similarity. Results are not cached.
Specified by: tokenizeVerbatim in interface ITokenizer
public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Description copied from interface: ITokenizer
Breaks a string into strings. This method is used to mark string differences in the UI and for debugging purposes. Results are not cached.
Specified by: tokenizeVerbatimToStrings in interface ITokenizer
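To make the verbatim contract concrete, here is a hedged sketch covering both tokenizeVerbatim and tokenizeVerbatimToStrings, tied to the concatenation property described for shouldDelegateTokenizeExactly; Token.getTextFromString(String) is again an assumption about the Token API.

```java
import org.omegat.tokenizer.ITokenizer;
import org.omegat.util.Token;

public class VerbatimDemo {
    // Re-assembles the verbatim tokens; for tokenizers that honour the
    // concatenation property, the result should equal the input string.
    static String reassemble(ITokenizer tok, String source) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tok.tokenizeVerbatim(source)) {
            sb.append(t.getTextFromString(source));
        }
        return sb.toString();
    }

    // The same split as plain strings, e.g. for logging a UI diff while debugging.
    static String[] pieces(ITokenizer tok, String source) {
        return tok.tokenizeVerbatimToStrings(source);
    }
}
```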
protected Token[] tokenizeByCodePoint(java.lang.String strOrig)
protected java.lang.String[] tokenizeByCodePointToStrings(java.lang.String strOrig)
protected Token[] tokenize(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace)
protected java.lang.String[] tokenizeToStrings(java.lang.String str, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace)
protected abstract org.apache.lucene.analysis.TokenStream getTokenStream(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed) throws java.io.IOException
Throws: java.io.IOException
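Since this is the single abstract extension point, a hedged sketch of an override may help. The branching on stemsAllowed and the fallback to getStandardTokenStream follow the pattern suggested by this page; the use of StandardAnalyzer is only to keep the sketch self-contained (it assumes a Lucene version with a no-argument StandardAnalyzer constructor), and a real subclass would substitute a language-specific analyzer and decide how stopWordsAllowed maps onto a stop-word set.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.omegat.tokenizer.BaseTokenizer;

public class MyLanguageTokenizer extends BaseTokenizer {

    @Override
    protected TokenStream getTokenStream(String strOrig, boolean stemsAllowed,
            boolean stopWordsAllowed) throws IOException {
        if (!stemsAllowed) {
            // No stemming requested: the default all-false stream from
            // BaseTokenizer is sufficient.
            return getStandardTokenStream(strOrig);
        }
        // A real subclass would build a language-specific Lucene analyzer here
        // and use stopWordsAllowed to choose its stop-word handling.
        Analyzer analyzer = new StandardAnalyzer();
        return analyzer.tokenStream("", new StringReader(strOrig));
    }
}
```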
protected org.apache.lucene.analysis.TokenStream getStandardTokenStream(java.lang.String strOrig) throws java.io.IOException
Minimal implementation that returns the default implementation corresponding to all false parameters.
Throws: java.io.IOException
public java.lang.String[] getSupportedLanguages()
Description copied from interface: ITokenizer
Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer. Indicate that this should be used by setting the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
Specified by: getSupportedLanguages in interface ITokenizer
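A hedged sketch of how a runtime-discovering tokenizer might be declared. The languages element name on the Tokenizer annotation and the subclass shown here are assumptions based on the description above, not copied from OmegaT sources.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.omegat.tokenizer.BaseTokenizer;
import org.omegat.tokenizer.Tokenizer;

// Supported languages are discovered at runtime, so OmegaT consults
// getSupportedLanguages() rather than a fixed list in the annotation.
@Tokenizer(languages = { Tokenizer.DISCOVER_AT_RUNTIME })
public class RuntimeDiscoveredTokenizer extends BaseTokenizer {

    @Override
    public String[] getSupportedLanguages() {
        // Compute the real list at runtime, e.g. by scanning installed
        // dictionaries, as the HunspellTokenizer does per the text above.
        return new String[] { "xx-YY" };
    }

    @Override
    protected TokenStream getTokenStream(String strOrig, boolean stemsAllowed,
            boolean stopWordsAllowed) throws IOException {
        return getStandardTokenStream(strOrig);
    }
}
```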
protected Language getEffectiveLanguage()
protected Language getProjectLanguage()
protected java.lang.String test(java.lang.String... args)
protected java.lang.String printTest(java.lang.String[] strings, java.lang.String input)