java.lang.Object
- org.omegat.tokenizer.DefaultTokenizer

All Implemented Interfaces: ITokenizer

public class DefaultTokenizer
extends java.lang.Object
implements ITokenizer

Methods for tokenizing strings.

Nested Class Summary
- Nested classes/interfaces inherited from interface org.omegat.tokenizer.ITokenizer
  ITokenizer.StemmingMode

Field Summary

Fields
Modifier and Type	Field	Description
`static java.lang.String[]`	`EMPTY_STRINGS_LIST`
`static Token[]`	`EMPTY_TOKENS_LIST`

Constructor Summary

Constructors
Constructor	Description
`DefaultTokenizer()`

Method Summary

Modifier and Type	Method	Description
`java.lang.String[]`	`getSupportedLanguages()`	Return an array of language strings (`xx-yy`) indicating the tokenizer's supported languages.
`static java.text.BreakIterator`	`getWordBreaker()`	Returns an iterator to break sentences into words.
`static boolean`	`isContains(Token[] tokensList, Token tokenForCheck)`	Checks whether `tokensList` contains `tokenForCheck`.
`static boolean`	`isContainsAll(Token[] tokensList, Token[] listForFind, boolean notExact)`	Check if the `listForFind` tokens are present in `tokensList`.
`static java.util.List<Token[]>`	`searchAll(Token[] tokensList, Token[] listForFind, boolean notExact)`	Find and return all tokens in `tokensList` that match the tokens in `listForFind`.
`Token[]`	`tokenizeVerbatim(java.lang.String strOrig)`	Breaks a string into tokens.
`java.lang.String[]`	`tokenizeVerbatimToStrings(java.lang.String str)`	Breaks a string into strings.
`Token[]`	`tokenizeWords(java.lang.String strOrig, ITokenizer.StemmingMode stemmingMode)`	Breaks a string into word-only tokens.
`java.lang.String[]`	`tokenizeWordsToStrings(java.lang.String str, ITokenizer.StemmingMode stemmingMode)`	Breaks a string into word-only strings.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - EMPTY_TOKENS_LIST
```
public static final Token[] EMPTY_TOKENS_LIST
```
  - EMPTY_STRINGS_LIST
```
public static final java.lang.String[] EMPTY_STRINGS_LIST
```
- Constructor Detail
  - DefaultTokenizer
```
public DefaultTokenizer()
```
- Method Detail
  - tokenizeWords
```
public Token[] tokenizeWords(java.lang.String strOrig,
                             ITokenizer.StemmingMode stemmingMode)
```
    Breaks a string into word-only tokens. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.
    This method is used to find fuzzy matches and glossary entries.
    Results can be cached for better performance.
    
    Specified by:
    
    tokenizeWords in interface ITokenizer
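
    The note about caching does not say how results are cached. As a minimal sketch of one way a caller might memoize tokenization results, the class below wraps an arbitrary tokenizing function in a map keyed by the input string. TokenCacheSketch, its fields, and the String[] result type are hypothetical stand-ins, not part of the OmegaT API; a cache for tokenizeWords would also need the stemming mode in its key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical memoizing wrapper; not part of OmegaT.
public class TokenCacheSketch {
    private final Map<String, String[]> cache = new ConcurrentHashMap<>();
    private final Function<String, String[]> tokenizer;
    // Counts how often the wrapped tokenizer actually ran.
    final AtomicInteger misses = new AtomicInteger();

    TokenCacheSketch(Function<String, String[]> tokenizer) {
        this.tokenizer = tokenizer;
    }

    /** Returns the cached result for text, computing it on first request only. */
    String[] tokenize(String text) {
        return cache.computeIfAbsent(text, t -> {
            misses.incrementAndGet();
            return tokenizer.apply(t);
        });
    }

    public static void main(String[] args) {
        // A whitespace split stands in for a real tokenizer (assumption).
        TokenCacheSketch cached = new TokenCacheSketch(s -> s.split(" "));
        cached.tokenize("open the file");
        cached.tokenize("open the file");
        System.out.println("tokenizer runs: " + cached.misses.get());
    }
}
```

    Because the same array instance is handed back on every hit, callers of such a cache should treat the result as read-only.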
  - tokenizeWordsToStrings
```
public java.lang.String[] tokenizeWordsToStrings(java.lang.String str,
                                                 ITokenizer.StemmingMode stemmingMode)
```
    Description copied from interface: ITokenizer
    
    Breaks a string into word-only strings. Numbers, tags, and other non-word tokens are NOT included in the result. Stemming can be used depending on the supplied ITokenizer.StemmingMode.
    When stemming is used, both the original word and its stem may be included in the results, if they differ. (The stem will come first.)
    This method is used for dictionary lookup.
    Results are not cached.
    
    Specified by:
    
    tokenizeWordsToStrings in interface ITokenizer
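
    The stem-first ordering described above can be illustrated with a small sketch. It assumes a hypothetical stemmer passed in as a function and a toy rule that strips a trailing "s"; StemOrderSketch and withStems are illustrative names, and the real stemming behavior depends on the tokenizer in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of "stem first, then the original word if it differs".
public class StemOrderSketch {
    static String[] withStems(String[] words, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            String stem = stemmer.apply(word);
            out.add(stem);              // the stem comes first
            if (!stem.equals(word)) {
                out.add(word);          // the original follows only when it differs
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Toy stemmer (assumption): drop one trailing "s".
        UnaryOperator<String> toy =
                w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        System.out.println(String.join(" ", withStems(new String[] {"cats", "dog"}, toy)));
    }
}
```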
  - tokenizeVerbatim
```
public Token[] tokenizeVerbatim(java.lang.String strOrig)
```
    Description copied from interface: ITokenizer
    
    Breaks a string into tokens. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.
    This method is used to mark string differences in the UI and to tune similarity.
    Results are not cached.
    
    Specified by:
    
    tokenizeVerbatim in interface ITokenizer
  - tokenizeVerbatimToStrings
```
public java.lang.String[] tokenizeVerbatimToStrings(java.lang.String str)
```
    Description copied from interface: ITokenizer
    
    Breaks a string into strings. Numbers, tags, and other non-word tokens are included in the result. Stemming is NOT used.
    This method is used to mark string differences in the UI and for debugging purposes.
    Results are not cached.
    
    Specified by:
    
    tokenizeVerbatimToStrings in interface ITokenizer
  - getWordBreaker
```
public static java.text.BreakIterator getWordBreaker()
```
    Returns an iterator to break sentences into words.
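
    As a rough illustration of how a java.text.BreakIterator is typically driven, the sketch below uses the JDK's stock BreakIterator.getWordInstance() rather than the iterator returned by getWordBreaker(), whose exact break rules are not described here. WordBreakSketch and split are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch using the JDK word iterator, not OmegaT's own.
public class WordBreakSketch {
    /** Splits text at every word boundary, keeping spaces and punctuation as pieces. */
    static List<String> split(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.ENGLISH);
        breaker.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = breaker.first();
        // Walk boundary to boundary until the iterator reports DONE.
        for (int end = breaker.next(); end != BreakIterator.DONE;
                start = end, end = breaker.next()) {
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    public static void main(String[] args) {
        for (String piece : split("Hello world")) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

    Because every boundary is kept, joining the pieces reproduces the input string; a word-only view would filter out the pieces that are not words.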
  - isContains
```
public static boolean isContains(Token[] tokensList,
                                 Token tokenForCheck)
```
    Checks whether tokensList contains tokenForCheck.
  - isContainsAll
```
public static boolean isContainsAll(Token[] tokensList,
                                    Token[] listForFind,
                                    boolean notExact)
```
    Check if the listForFind tokens are present in tokensList.
    
    Parameters:
    
    tokensList - a list of tokens to be searched
    
    listForFind - a list of tokens to search for in tokensList
    
    notExact - true if the tokens in listForFind can be non-contiguous or in a different order in tokensList. If false, tokens must be exactly the same.
    
    Returns:
    
    true if the tokens in listForFind are found in tokensList
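
    The effect of the notExact flag can be sketched with plain strings standing in for Token objects. This is an illustration under assumptions, not the OmegaT implementation: "exactly the same" is read here as "contiguous and in the same order", an empty search list is treated as not found, and ContainsAllSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; strings stand in for Token objects.
public class ContainsAllSketch {
    static boolean containsAll(String[] haystack, String[] needles, boolean notExact) {
        if (needles.length == 0) {
            return false; // choice made by this sketch only
        }
        if (notExact) {
            // Any order, gaps allowed: each needle just has to occur somewhere.
            List<String> pool = Arrays.asList(haystack);
            for (String needle : needles) {
                if (!pool.contains(needle)) {
                    return false;
                }
            }
            return true;
        }
        // Exact: needles must appear as one contiguous run in the same order.
        for (int i = 0; i + needles.length <= haystack.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(haystack, i, i + needles.length), needles)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] text = {"the", "quick", "brown", "fox"};
        String[] swapped = {"brown", "quick"};
        System.out.println(containsAll(text, swapped, true));
        System.out.println(containsAll(text, swapped, false));
    }
}
```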
  - searchAll
```
public static java.util.List<Token[]> searchAll(Token[] tokensList,
                                                Token[] listForFind,
                                                boolean notExact)
```
    Find and return all tokens in tokensList that match the tokens in listForFind.
    
    Parameters:
    
    tokensList - a list of tokens to be searched
    
    listForFind - a list of tokens to search for in tokensList
    
    notExact - true if the tokens in listForFind can be non-contiguous or in a different order in tokensList. If false, tokens must be exactly the same.
    
    Returns:
    
    A list containing each hit of the matched tokens. Each token array represents a different instance of listForFind that was found in tokensList.
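
    To show what "one token array per hit" can look like, the sketch below collects every contiguous occurrence of a search sequence, which corresponds to notExact being false under the assumption that exact matching means contiguous and in order. Tok is a hypothetical stand-in for Token that carries only the text and its offset, and SearchAllSketch, toks, and searchContiguous are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of collecting all contiguous hits; not the OmegaT code.
public class SearchAllSketch {
    /** Stand-in for Token: the token text plus its offset in the source string. */
    static final class Tok {
        final String text;
        final int offset;
        Tok(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    /** Builds stand-in tokens from single-space-separated words, recording offsets. */
    static Tok[] toks(String s) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        for (String word : s.split(" ")) {
            out.add(new Tok(word, pos));
            pos += word.length() + 1;
        }
        return out.toArray(new Tok[0]);
    }

    /** Returns one array per place where needles occurs as a contiguous run. */
    static List<Tok[]> searchContiguous(Tok[] haystack, Tok[] needles) {
        List<Tok[]> hits = new ArrayList<>();
        if (needles.length == 0) {
            return hits;
        }
        for (int i = 0; i + needles.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needles.length && match; j++) {
                match = haystack[i + j].text.equals(needles[j].text);
            }
            if (match) {
                // Copy the matched slice so each hit keeps its own offsets.
                hits.add(Arrays.copyOfRange(haystack, i, i + needles.length));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Tok[] text = toks("new york is not new jersey but new york");
        for (Tok[] hit : searchContiguous(text, toks("new york"))) {
            System.out.println("hit starts at offset " + hit[0].offset);
        }
    }
}
```

    Returning the matched tokens rather than a boolean lets a caller locate each occurrence in the original string, for example to highlight it.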
  - getSupportedLanguages
```
public java.lang.String[] getSupportedLanguages()
```
    Description copied from interface: ITokenizer
    
    Return an array of language strings (xx-yy) indicating the tokenizer's supported languages. Meant for tokenizers for which the supported languages can only be determined at runtime, like the HunspellTokenizer.
    Indicate that this should be used by setting the Tokenizer annotation to contain only Tokenizer.DISCOVER_AT_RUNTIME.
    
    Specified by:
    
    getSupportedLanguages in interface ITokenizer

