Package org.omegat.tokenizer
Interface ITokenizer
-
- All Known Implementing Classes:
BaseTokenizer, DefaultTokenizer, HunspellTokenizer, LuceneArabicTokenizer, LuceneArmenianTokenizer, LuceneBasqueTokenizer, LuceneBrazilianTokenizer, LuceneBulgarianTokenizer, LuceneCatalanTokenizer, LuceneCJKTokenizer, LuceneCzechTokenizer, LuceneDanishTokenizer, LuceneDutchTokenizer, LuceneEnglishTokenizer, LuceneFinnishTokenizer, LuceneFrenchTokenizer, LuceneGalicianTokenizer, LuceneGermanTokenizer, LuceneGreekTokenizer, LuceneHindiTokenizer, LuceneHungarianTokenizer, LuceneIndonesianTokenizer, LuceneIrishTokenizer, LuceneItalianTokenizer, LuceneJapaneseTokenizer, LuceneLatvianTokenizer, LuceneNorwegianTokenizer, LucenePersianTokenizer, LucenePolishTokenizer, LucenePortugueseTokenizer, LuceneRomanianTokenizer, LuceneRussianTokenizer, LuceneSmartChineseTokenizer, LuceneSpanishTokenizer, LuceneSwedishTokenizer, LuceneThaiTokenizer, LuceneTurkishTokenizer
public interface ITokenizer
Interface for string tokenization engines.
-
-
Nested Class Summary
Modifier and Type    Interface
static class         ITokenizer.StemmingMode
-
Method Summary
Modifier and Type    Method
    Description
java.lang.String[]   getSupportedLanguages()
    Return an array of language strings (xx-yy) indicating the tokenizer's supported languages.
Token[]              tokenizeVerbatim(java.lang.String str)
    Breaks a string into tokens.
java.lang.String[]   tokenizeVerbatimToStrings(java.lang.String str)
    Breaks a string into strings.
Token[]              tokenizeWords(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
    Breaks a string into word-only tokens.
java.lang.String[]   tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
    Breaks a string into word-only strings.
-
-
-
Method Detail
-
tokenizeWords
Token[] tokenizeWords(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.
This method is used to find fuzzy matches and glossary entries.
Results can be cached for better performance.
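The caching remark above could be realized along the following lines. This is only a sketch under stated assumptions: CachedWordTokenizer is a hypothetical helper, not an OmegaT class, and it wraps a plain String-to-String[] function instead of the real tokenizeWords signature. A cache for the real method would also need the stemming mode in its key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper (not part of OmegaT): memoizes the result of a
// word-only tokenization function, keyed by the input string.
class CachedWordTokenizer {
    private final Map<String, String[]> cache = new ConcurrentHashMap<>();
    private final Function<String, String[]> delegate;
    private final AtomicInteger misses = new AtomicInteger();

    CachedWordTokenizer(Function<String, String[]> delegate) {
        this.delegate = delegate;
    }

    // Returns the tokens for str, running the delegate only on a cache miss.
    String[] tokenize(String str) {
        String[] cached = cache.computeIfAbsent(str, s -> {
            misses.incrementAndGet();
            return delegate.apply(s);
        });
        // Hand out a copy so callers cannot modify the cached array.
        return cached.clone();
    }

    // Number of times the delegate actually ran.
    int missCount() {
        return misses.get();
    }
}
```

Returning a copy of the cached array is a deliberate choice in this sketch: it costs an allocation per call but keeps one caller from corrupting the result seen by another.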
-
tokenizeWordsToStrings
java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only strings. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.
When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)
This method is used for dictionary lookup.
Results are not cached.
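The stem-first ordering described above can be illustrated with a toy example. Everything here is an assumption made for illustration: StemFirstSketch and toyStem are hypothetical names, the boolean flag stands in for ITokenizer.StemmingMode, and the "strip a trailing s" rule is far cruder than any real stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration (not OmegaT code) of word-only output in which
// a stem precedes the original word whenever the two differ.
class StemFirstSketch {
    // A "word" is a run of letters; digits and punctuation are skipped.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Toy stemmer (assumption): drops one trailing "s" from words longer than three letters.
    static String toyStem(String word) {
        return word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    static String[] wordsToStrings(String str, boolean stem) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(str);
        while (m.find()) {
            String word = m.group();
            if (stem) {
                String stemmed = toyStem(word);
                if (!stemmed.equals(word)) {
                    out.add(stemmed); // the stem comes first
                }
            }
            out.add(word);
        }
        return out.toArray(new String[0]);
    }
}
```

With stemming enabled, the input "Cats chase 2 mice" yields Cat, Cats, chase, mice in this sketch: the number is dropped, and only "Cats" gains an extra entry because it is the only word whose stem differs.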
-
tokenizeVerbatim
Token[] tokenizeVerbatim(java.lang.String str)
Breaks a string into tokens. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.
This method is used to mark string differences in the UI and to tune similarity.
Results are not cached.
-
tokenizeVerbatimToStrings
java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Breaks a string into strings. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.
This method is used to mark string differences in the UI and for debugging purposes.
Results are not cached.
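The difference between verbatim and word-only output can be sketched as follows. This is not how any OmegaT tokenizer is implemented: VerbatimSketch is a hypothetical class, the regular expression is a simplification, and keeping whitespace runs as tokens is an assumption of this sketch, not something the description above states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration (not OmegaT code) contrasting verbatim output,
// which keeps every piece of the input, with word-only output.
class VerbatimSketch {
    // Alternatives, tried in order: tag, letters, digits, whitespace run, any other single character.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]+>|\\p{L}+|\\p{N}+|\\s+|.", Pattern.DOTALL);

    // Keeps everything, so concatenating the result reproduces the input.
    static String[] verbatim(String str) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            out.add(m.group());
        }
        return out.toArray(new String[0]);
    }

    // Keeps only the tokens that start with a letter.
    static String[] wordsOnly(String str) {
        List<String> out = new ArrayList<>();
        for (String token : verbatim(str)) {
            if (Character.isLetter(token.codePointAt(0))) {
                out.add(token);
            }
        }
        return out.toArray(new String[0]);
    }
}
```

Because nothing is discarded in the verbatim case, joining the tokens gives back the original string, which is the property a difference marker needs in order to map tokens back to positions in the text.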
-
getSupportedLanguages
java.lang.String[] getSupportedLanguages()
Returns an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers whose supported languages can only be determined at runtime, like the HunspellTokenizer.
To indicate that this method should be used, set the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
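One way a loader might honor the runtime-discovery marker is sketched below. All types here are local stand-ins so the example compiles on its own: TokenizerInfo imitates the Tokenizer annotation (its languages member and the value of its constant are assumptions), LanguageAware imitates the relevant part of ITokenizer, and LanguageResolver is a hypothetical helper. How OmegaT itself resolves languages is not described above.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-in for the Tokenizer annotation (assumed shape, for illustration only).
@Retention(RetentionPolicy.RUNTIME)
@interface TokenizerInfo {
    String DISCOVER_AT_RUNTIME = "DISCOVER_AT_RUNTIME";
    String[] languages() default {};
}

// Stand-in for the part of ITokenizer used here.
interface LanguageAware {
    String[] getSupportedLanguages();
}

// Languages are fixed at compile time, so the annotation lists them.
@TokenizerInfo(languages = { "en" })
class StaticEnglishTokenizer implements LanguageAware {
    public String[] getSupportedLanguages() {
        return new String[0];
    }
}

// Languages depend on what is installed, so the annotation holds only the marker.
@TokenizerInfo(languages = { TokenizerInfo.DISCOVER_AT_RUNTIME })
class DictionaryBackedTokenizer implements LanguageAware {
    public String[] getSupportedLanguages() {
        // Hard-coded here; a real tokenizer would inspect its dictionaries.
        return new String[] { "en-US", "fr-FR" };
    }
}

// Hypothetical resolver: asks the instance only when the marker is the sole entry.
class LanguageResolver {
    static String[] resolve(LanguageAware tokenizer) {
        TokenizerInfo info = tokenizer.getClass().getAnnotation(TokenizerInfo.class);
        if (info == null) {
            return new String[0];
        }
        String[] declared = info.languages();
        if (declared.length == 1 && TokenizerInfo.DISCOVER_AT_RUNTIME.equals(declared[0])) {
            return tokenizer.getSupportedLanguages();
        }
        return declared;
    }
}
```

In this sketch the annotation stays the single source of truth for statically known languages, and the instance method is consulted only when the annotation explicitly defers to it.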
-
-