Package org.omegat.tokenizer
Class LuceneSmartChineseTokenizer
java.lang.Object
  org.omegat.tokenizer.BaseTokenizer
    org.omegat.tokenizer.LuceneSmartChineseTokenizer
All Implemented Interfaces:
ITokenizer
public class LuceneSmartChineseTokenizer extends BaseTokenizer
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.omegat.tokenizer.ITokenizer
ITokenizer.StemmingMode
-
-
Field Summary
-
Fields inherited from class org.omegat.tokenizer.BaseTokenizer
TOKENIZER_DEBUG_PROVIDER
-
-
Constructor Summary
LuceneSmartChineseTokenizer()
-
Method Summary
Token[] tokenizeVerbatim(java.lang.String strOrig)
    Breaks a string into tokens.
java.lang.String[] tokenizeVerbatimToStrings(java.lang.String strOrig)
    Breaks a string into strings.
-
Methods inherited from class org.omegat.tokenizer.BaseTokenizer
getSupportedLanguages, tokenizeWords, tokenizeWordsToStrings
-
Method Detail
-
tokenizeVerbatim
public Token[] tokenizeVerbatim(java.lang.String strOrig)
Description copied from class: BaseTokenizer
Breaks a string into tokens. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used. This method is used to mark string differences in the UI and to tune similarity.
Results are not cached.
Specified by:
tokenizeVerbatim in interface ITokenizer
Overrides:
tokenizeVerbatim in class BaseTokenizer
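The contract described above (every non-whitespace piece is kept, nothing is stemmed) can be illustrated with a small self-contained sketch. Everything in it is an assumption made for illustration: `VerbatimSketch`, `Span`, and the regular expression are hypothetical stand-ins, not OmegaT's `Token` class or this tokenizer's implementation, and the sketch performs no Chinese word segmentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not OmegaT code.
final class VerbatimSketch {

    /** A token stored as an offset and length into the original string. */
    static final class Span {
        final int offset;
        final int length;

        Span(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        /** Recovers the token text, unchanged, from the string it was cut from. */
        String textFrom(String source) {
            return source.substring(offset, offset + length);
        }
    }

    // Assumed token shapes: an XML-like tag, a run of letters or digits,
    // or any other single non-whitespace character (punctuation, symbols).
    private static final Pattern TOKEN =
            Pattern.compile("</?\\w+/?>|[\\p{L}\\p{N}]+|\\S");

    /** Keeps words, numbers, tags, and punctuation; applies no stemming. */
    static Span[] tokenizeVerbatim(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end() - m.start()));
        }
        return spans.toArray(new Span[0]);
    }

    /** Same tokens as above, returned as their text. */
    static String[] tokenizeVerbatimToStrings(String text) {
        Span[] spans = tokenizeVerbatim(text);
        String[] out = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            out[i] = spans[i].textFrom(text);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: Saved | <b> | 3 | </b> | files | .
        System.out.println(String.join(" | ",
                tokenizeVerbatimToStrings("Saved <b>3</b> files.")));
    }
}
```

Storing offsets rather than copied text lets a caller map each token back to its position in the original segment, which is what highlighting in a UI needs.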
-
tokenizeVerbatimToStrings
public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String strOrig)
Description copied from interface: ITokenizer
Breaks a string into strings. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used. This method is used to mark string differences in the UI and for debugging purposes.
Results are not cached.
Specified by:
tokenizeVerbatimToStrings in interface ITokenizer
Overrides:
tokenizeVerbatimToStrings in class BaseTokenizer
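As a rough illustration of how verbatim token strings could be used to mark string differences, the sketch below flags the tokens of one array that fall outside its longest common subsequence with another. `TokenDiffSketch` and `markChanged` are hypothetical names, and the LCS approach is an assumption chosen for the example; it is not a description of how OmegaT computes differences.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not OmegaT code.
final class TokenDiffSketch {

    /**
     * Returns an array where changed[j] is true when current[j] is not part
     * of the longest common subsequence shared with previous.
     */
    static boolean[] markChanged(String[] previous, String[] current) {
        int n = previous.length;
        int m = current.length;

        // lcs[i][j] = LCS length of previous[i..] and current[j..].
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = previous[i].equals(current[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table; tokens matched along the way are unchanged.
        boolean[] changed = new boolean[m];
        Arrays.fill(changed, true);
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (previous[i].equals(current[j])) {
                changed[j] = false;
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        String[] before = {"Saved", "3", "files", "."};
        String[] after = {"Saved", "4", "files", "."};
        // prints: [false, true, false, false]
        System.out.println(Arrays.toString(markChanged(before, after)));
    }
}
```

Because the comparison is on unstemmed token text, a changed number or tag shows up as a difference, which matches the stated purpose of the verbatim methods.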
-
-