Package org.omegat.tokenizer
Class DefaultTokenizer
- java.lang.Object
  - org.omegat.tokenizer.DefaultTokenizer
- All Implemented Interfaces: ITokenizer
public class DefaultTokenizer extends java.lang.Object implements ITokenizer
Methods for tokenizing strings.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.omegat.tokenizer.ITokenizer
ITokenizer.StemmingMode
-
-
Field Summary
Fields (Modifier and Type, Field):
static java.lang.String[] EMPTY_STRINGS_LIST
static Token[] EMPTY_TOKENS_LIST
-
Constructor Summary
Constructors:
DefaultTokenizer()
-
Method Summary
Methods (Modifier and Type, Method, Description):
java.lang.String[] getSupportedLanguages()
Return an array of language strings (xx-yy) indicating the tokenizer's supported languages.
static java.text.BreakIterator getWordBreaker()
Returns an iterator to break sentences into words.
static boolean isContains(Token[] tokensList, Token tokenForCheck)
Checks whether the array contains the given token.
static boolean isContainsAll(Token[] tokensList, Token[] listForFind, boolean notExact)
Check if the listForFind tokens are present in tokensList.
static java.util.List<Token[]> searchAll(Token[] tokensList, Token[] listForFind, boolean notExact)
Find and return all tokens in tokensList that match the tokens in listForFind.
Token[] tokenizeVerbatim(java.lang.String strOrig)
Breaks a string into tokens.
java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Breaks a string into strings.
Token[] tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens.
java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only strings.
-
-
-
Field Detail
-
EMPTY_TOKENS_LIST
public static final Token[] EMPTY_TOKENS_LIST
-
EMPTY_STRINGS_LIST
public static final java.lang.String[] EMPTY_STRINGS_LIST
-
-
Method Detail
-
tokenizeWords
public Token[] tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.
This method is used to find fuzzy matches and glossary entries.
Results can be cached for better performance.
- Specified by: tokenizeWords in interface ITokenizer
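The word-only behaviour described above can be sketched with JDK classes alone. This is an illustrative approximation, not the DefaultTokenizer implementation: the class name WordOnlySketch and the rule "a word is any segment containing at least one letter" are assumptions made for the example, and stemming and tag handling are not modelled.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the DefaultTokenizer implementation.
class WordOnlySketch {

    // Returns only the segments that contain at least one letter,
    // so numbers, whitespace, and punctuation are dropped.
    static String[] wordsOnly(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ENGLISH);
        breaker.setText(text);
        List<String> words = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                words.add(segment);
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Prints Save, files, then, quit (one per line); "25" is skipped.
        for (String word : wordsOnly("Save 25 files, then quit.")) {
            System.out.println(word);
        }
    }
}
```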
-
tokenizeWordsToStrings
public java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Description copied from interface: ITokenizer
Breaks a string into word-only strings. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.
When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)
This method is used for dictionary lookup.
Results are not cached.
- Specified by: tokenizeWordsToStrings in interface ITokenizer
-
tokenizeVerbatim
public Token[] tokenizeVerbatim(java.lang.String strOrig)
Description copied from interface: ITokenizer
Breaks a string into tokens. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.
This method is used to mark string differences in the UI and to tune similarity.
Results are not cached.
- Specified by: tokenizeVerbatim in interface ITokenizer
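To illustrate why verbatim tokens suit marking differences in the UI, the following sketch keeps every segment of the input together with its position, so the original string can be rebuilt from the pieces. It uses only the JDK. The Span class is a hypothetical stand-in for a token with an offset and a length, and keeping whitespace segments is an assumption of the sketch rather than documented DefaultTokenizer behaviour.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; Span is a hypothetical stand-in, not OmegaT's Token.
class VerbatimSketch {

    // A segment of the input, identified by where it starts and how long it is.
    static final class Span {
        final int offset;
        final int length;

        Span(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Keeps every segment: words, numbers, punctuation, and whitespace.
    static List<Span> allSpans(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ENGLISH);
        breaker.setText(text);
        List<Span> spans = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            spans.add(new Span(start, end - start));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Press <b>F5</b> 2 times.";
        for (Span s : allSpans(text)) {
            String piece = text.substring(s.offset, s.offset + s.length);
            System.out.println(s.offset + "+" + s.length + " [" + piece + "]");
        }
    }
}
```

Because no segment is discarded, concatenating the spans in order reproduces the input exactly, which is the property a diff display relies on.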
-
tokenizeVerbatimToStrings
public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Description copied from interface: ITokenizer
Breaks a string into strings. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.
This method is used to mark string differences in the UI and for debugging purposes.
Results are not cached.
- Specified by: tokenizeVerbatimToStrings in interface ITokenizer
-
getWordBreaker
public static java.text.BreakIterator getWordBreaker()
Returns an iterator to break sentences into words.
-
isContains
public static boolean isContains(Token[] tokensList, Token tokenForCheck)
Checks whether the array contains the given token.
-
isContainsAll
public static boolean isContainsAll(Token[] tokensList, Token[] listForFind, boolean notExact)
Check if the listForFind tokens are present in tokensList.
- Parameters:
tokensList - a list of tokens to be searched
listForFind - a list of tokens to search for in tokensList
notExact - true if the tokens in listForFind can be non-contiguous or in a different order in tokensList. If false, tokens must be exactly the same.
- Returns:
true if the tokens in listForFind are found in tokensList
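The effect of the notExact flag can be illustrated with plain strings in place of Token objects. This is a sketch under stated assumptions, not the library's code: the class ContainsAllSketch is hypothetical, "exactly the same" is read here as "contiguous and in the same order", and the non-exact branch ignores how many times a token occurs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch using strings; not the DefaultTokenizer implementation.
class ContainsAllSketch {

    static boolean containsAll(String[] tokensList, String[] listForFind, boolean notExact) {
        List<String> haystack = Arrays.asList(tokensList);
        List<String> needle = Arrays.asList(listForFind);
        if (notExact) {
            // Any order, gaps allowed: every needle token must occur somewhere.
            return new HashSet<>(haystack).containsAll(needle);
        }
        // Assumed meaning of exact: the needle occurs as one contiguous run.
        return Collections.indexOfSubList(haystack, needle) >= 0;
    }

    public static void main(String[] args) {
        String[] tokens = {"open", "the", "project", "file"};
        System.out.println(containsAll(tokens, new String[] {"project", "file"}, false)); // true
        System.out.println(containsAll(tokens, new String[] {"file", "project"}, false)); // false
        System.out.println(containsAll(tokens, new String[] {"file", "open"}, true));     // true
    }
}
```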
-
searchAll
public static java.util.List<Token[]> searchAll(Token[] tokensList, Token[] listForFind, boolean notExact)
Find and return all tokens in tokensList that match the tokens in listForFind.
- Parameters:
tokensList - a list of tokens to be searched
listForFind - a list of tokens to search for in tokensList
notExact - true if the tokens in listForFind can be non-contiguous or in a different order in tokensList. If false, tokens must be exactly the same.
- Returns:
A list containing each hit of the matched tokens. Each token array represents a different instance of listForFind that was found in tokensList.
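The "one array per hit" shape of the return value can be sketched as follows. The example covers only the exact case, read here as a contiguous run in the same order, and reports each hit as the indices it occupies in the searched array rather than as Token objects. The class SearchAllSketch and the method findRuns are hypothetical names, and overlapping hits are reported separately as a choice of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch using strings and indices; not the DefaultTokenizer implementation.
class SearchAllSketch {

    // Returns one index array per contiguous occurrence of listForFind in tokensList.
    static List<int[]> findRuns(String[] tokensList, String[] listForFind) {
        List<int[]> hits = new ArrayList<>();
        if (listForFind.length == 0) {
            return hits; // nothing to look for
        }
        for (int start = 0; start + listForFind.length <= tokensList.length; start++) {
            boolean match = true;
            for (int k = 0; k < listForFind.length && match; k++) {
                match = tokensList[start + k].equals(listForFind[k]);
            }
            if (match) {
                int[] hit = new int[listForFind.length];
                for (int k = 0; k < hit.length; k++) {
                    hit[k] = start + k;
                }
                hits.add(hit);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens = {"save", "the", "file", "then", "save", "the", "project"};
        for (int[] hit : findRuns(tokens, new String[] {"save", "the"})) {
            System.out.println(Arrays.toString(hit)); // [0, 1] then [4, 5]
        }
    }
}
```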
-
getSupportedLanguages
public java.lang.String[] getSupportedLanguages()
Description copied from interface: ITokenizer
Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer.
Indicate that this should be used by setting the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
- Specified by: getSupportedLanguages in interface ITokenizer
-
-