public abstract class BaseTokenizer extends java.lang.Object implements ITokenizer
Nested classes/interfaces inherited from interface ITokenizer: ITokenizer.StemmingMode
| Modifier and Type | Field and Description |
|---|---|
| protected static int | DEFAULT_TOKENS_COUNT |
| protected static java.lang.String[] | EMPTY_STRING_LIST |
| protected static Token[] | EMPTY_TOKENS_LIST |
| protected boolean | shouldDelegateTokenizeExactly Indicates that tokenizeVerbatim(String) should use OmegaT's WordIterator to tokenize "exactly" for display. |
| static ICommentProvider | TOKENIZER_DEBUG_PROVIDER |
| Constructor and Description |
|---|
| BaseTokenizer() |
| Modifier and Type | Method and Description |
|---|---|
| protected Language | getEffectiveLanguage() |
| protected Language | getProjectLanguage() |
| protected org.apache.lucene.analysis.TokenStream | getStandardTokenStream(java.lang.String strOrig) Minimal implementation that returns the default implementation corresponding to all false parameters. |
| java.lang.String[] | getSupportedLanguages() Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. |
| protected abstract org.apache.lucene.analysis.TokenStream | getTokenStream(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed) |
| protected java.lang.String | printTest(java.lang.String[] strings, java.lang.String input) |
| protected java.lang.String | test(java.lang.String... args) |
| protected Token[] | tokenize(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace) |
| protected Token[] | tokenizeByCodePoint(java.lang.String strOrig) |
| protected java.lang.String[] | tokenizeByCodePointToStrings(java.lang.String strOrig) |
| protected java.lang.String[] | tokenizeToStrings(java.lang.String str, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace) |
| Token[] | tokenizeVerbatim(java.lang.String strOrig) Breaks a string into tokens. |
| java.lang.String[] | tokenizeVerbatimToStrings(java.lang.String str) Breaks a string into strings. |
| Token[] | tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only tokens. |
| java.lang.String[] | tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode) Breaks a string into word-only strings. |
protected static final java.lang.String[] EMPTY_STRING_LIST

protected static final Token[] EMPTY_TOKENS_LIST

protected static final int DEFAULT_TOKENS_COUNT

protected boolean shouldDelegateTokenizeExactly
Indicates that tokenizeVerbatim(String) should use OmegaT's WordIterator to tokenize "exactly" for display. For language-specific tokenizers that maintain the property that (the concatenation of all tokens).equals(original string) == true, set this to false to use the language-specific tokenizer for everything.
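For illustration, here is a minimal sketch of a hypothetical subclass that clears this flag because its own tokenization is assumed to preserve the concatenation property; the class name and its trivial getTokenStream body are assumptions, not OmegaT code.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.omegat.tokenizer.BaseTokenizer;

// Hypothetical subclass: its own tokenization is assumed to satisfy
// (concatenation of all tokens).equals(original string), so it can handle
// tokenizeVerbatim(String) itself rather than delegating to WordIterator.
public class ExactDisplayTokenizer extends BaseTokenizer {

    public ExactDisplayTokenizer() {
        // Use this tokenizer's own logic for "exact" display tokenization.
        shouldDelegateTokenizeExactly = false;
    }

    @Override
    protected TokenStream getTokenStream(String strOrig, boolean stemsAllowed,
            boolean stopWordsAllowed) throws IOException {
        // Placeholder: fall back to the default stream provided by BaseTokenizer.
        return getStandardTokenStream(strOrig);
    }
}
```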
public static final ICommentProvider TOKENIZER_DEBUG_PROVIDER
public Token[] tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode)
Breaks a string into word-only tokens. Stemming is applied according to the given ITokenizer.StemmingMode.
This method is used to find fuzzy matches and glossary entries. Results can be cached for better performance.
Specified by: tokenizeWords in interface ITokenizer
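A hedged usage sketch: how a matching component might call this method on whichever tokenizer the project has selected. The helper class is hypothetical, and the assumption that org.omegat.util.Token exposes getTextFromString(String) is taken from the OmegaT API rather than from this page.

```java
import org.omegat.tokenizer.ITokenizer;
import org.omegat.util.Token;

public class TokenizeWordsDemo {
    // Prints the word-only tokens a fuzzy matcher would see.
    static void printWordTokens(ITokenizer tok, String source) {
        Token[] words = tok.tokenizeWords(source, ITokenizer.StemmingMode.MATCHING);
        for (Token t : words) {
            // Each Token refers back into the original string by offset and length.
            System.out.println(t.getTextFromString(source));
        }
    }
}
```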
public java.lang.String[] tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)
Description copied from interface: ITokenizer
Breaks a string into word-only strings. Stemming is applied according to the given ITokenizer.StemmingMode.
When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)
This method is used for dictionary lookup. Results are not cached.
Specified by: tokenizeWordsToStrings in interface ITokenizer
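A similarly hedged sketch of a dictionary-lookup style call, illustrating the stem-before-original ordering described above. The concrete output in the comment depends entirely on the language and the underlying analyzer, so treat it as hypothetical.

```java
import org.omegat.tokenizer.ITokenizer;

public class DictionaryLookupDemo {
    // Returns candidate lookup keys for a dictionary query.
    static String[] lookupKeys(ITokenizer tok, String phrase) {
        String[] keys = tok.tokenizeWordsToStrings(phrase, ITokenizer.StemmingMode.GLOSSARY);
        // For "running dogs", a stemming tokenizer might yield
        // ["run", "running", "dog", "dogs"]: where stem and surface form differ,
        // the stem comes first, followed by the original word.
        return keys;
    }
}
```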
public Token[] tokenizeVerbatim(java.lang.String strOrig)
Breaks a string into tokens. This method is used to mark string differences in the UI and to tune similarity. Results are not cached.
Specified by: tokenizeVerbatim in interface ITokenizer
public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
Description copied from interface: ITokenizer
Breaks a string into strings. This method is used to mark string differences in the UI and for debugging purposes. Results are not cached.
Specified by: tokenizeVerbatimToStrings in interface ITokenizer
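To make the verbatim contract concrete, here is a hedged sketch covering both tokenizeVerbatim and tokenizeVerbatimToStrings, tied to the concatenation property described for shouldDelegateTokenizeExactly; Token.getTextFromString(String) is again an assumption about the Token API.

```java
import org.omegat.tokenizer.ITokenizer;
import org.omegat.util.Token;

public class VerbatimDemo {
    // Re-assembles the verbatim tokens; for tokenizers that honour the
    // concatenation property, the result should equal the input string.
    static String reassemble(ITokenizer tok, String source) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tok.tokenizeVerbatim(source)) {
            sb.append(t.getTextFromString(source));
        }
        return sb.toString();
    }

    // The same split as plain strings, e.g. for logging a UI diff while debugging.
    static String[] pieces(ITokenizer tok, String source) {
        return tok.tokenizeVerbatimToStrings(source);
    }
}
```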
protected Token[] tokenizeByCodePoint(java.lang.String strOrig)
protected java.lang.String[] tokenizeByCodePointToStrings(java.lang.String strOrig)
protected Token[] tokenize(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace)
protected java.lang.String[] tokenizeToStrings(java.lang.String str, boolean stemsAllowed, boolean stopWordsAllowed, boolean filterDigits, boolean filterWhitespace)
protected abstract org.apache.lucene.analysis.TokenStream getTokenStream(java.lang.String strOrig, boolean stemsAllowed, boolean stopWordsAllowed) throws java.io.IOException
Throws: java.io.IOException
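Since this is the single abstract extension point, a hedged sketch of an override may help. The branching on stemsAllowed and the fallback to getStandardTokenStream follow the pattern suggested by this page; the use of StandardAnalyzer is only to keep the sketch self-contained (it assumes a Lucene version with a no-argument StandardAnalyzer constructor), and a real subclass would substitute a language-specific analyzer and decide how stopWordsAllowed maps onto a stop-word set.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.omegat.tokenizer.BaseTokenizer;

public class MyLanguageTokenizer extends BaseTokenizer {

    @Override
    protected TokenStream getTokenStream(String strOrig, boolean stemsAllowed,
            boolean stopWordsAllowed) throws IOException {
        if (!stemsAllowed) {
            // No stemming requested: the default all-false stream from
            // BaseTokenizer is sufficient.
            return getStandardTokenStream(strOrig);
        }
        // A real subclass would build a language-specific Lucene analyzer here
        // and use stopWordsAllowed to choose its stop-word handling.
        Analyzer analyzer = new StandardAnalyzer();
        return analyzer.tokenStream("", new StringReader(strOrig));
    }
}
```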
protected org.apache.lucene.analysis.TokenStream getStandardTokenStream(java.lang.String strOrig) throws java.io.IOException
Minimal implementation that returns the default implementation corresponding to all false parameters.
Throws: java.io.IOException
public java.lang.String[] getSupportedLanguages()
Description copied from interface: ITokenizer
Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer. Indicate that this should be used by setting the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
Specified by: getSupportedLanguages in interface ITokenizer
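A hedged sketch of how a runtime-discovering tokenizer might be declared. The languages element name on the Tokenizer annotation and the subclass shown here are assumptions based on the description above, not copied from OmegaT sources.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.omegat.tokenizer.BaseTokenizer;
import org.omegat.tokenizer.Tokenizer;

// Supported languages are discovered at runtime, so OmegaT consults
// getSupportedLanguages() rather than a fixed list in the annotation.
@Tokenizer(languages = { Tokenizer.DISCOVER_AT_RUNTIME })
public class RuntimeDiscoveredTokenizer extends BaseTokenizer {

    @Override
    public String[] getSupportedLanguages() {
        // Compute the real list at runtime, e.g. by scanning installed
        // dictionaries, as the HunspellTokenizer does per the text above.
        return new String[] { "xx-YY" };
    }

    @Override
    protected TokenStream getTokenStream(String strOrig, boolean stemsAllowed,
            boolean stopWordsAllowed) throws IOException {
        return getStandardTokenStream(strOrig);
    }
}
```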
protected Language getEffectiveLanguage()
protected Language getProjectLanguage()
protected java.lang.String test(java.lang.String... args)
protected java.lang.String printTest(java.lang.String[] strings, java.lang.String input)