Package org.omegat.util
Class DeNormalize
java.lang.Object
    org.omegat.util.DeNormalize
public final class DeNormalize extends java.lang.Object
Denormalize a(n English) string in the ways listed below.
- Capitalize the first character in the string
- Detokenize
- Delete whitespace in front of periods and commas
- Join contractions
- Capitalize name titles (Mr Ms Miss Dr etc.)
- TODO: Handle surrounding characters ([{<"''">}])
- TODO: Join multi-period abbreviations (e.g. M.Phil. i.e.)
- TODO: Handle ambiguities like "st.", which can be an abbreviation for both "Saint" and "street"
- TODO: Capitalize both the title and the name of a person, e.g. Mr. Morton (named entities should be demarcated).
N.B. These methods all assume that every translation result to be denormalized has the following format:
- There is only one space between every pair of tokens
- There is no whitespace before the first token
- There is no whitespace after the final token
- Standard spaces are the only type of whitespace
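The four format assumptions above can be checked before calling any of the methods. The following is a minimal sketch, not part of DeNormalize: the class name `NormalizedFormatSketch` and the method `looksNormalized` are hypothetical names introduced for illustration.

```java
// Hypothetical helper, not part of org.omegat.util.DeNormalize.
final class NormalizedFormatSketch {

    private NormalizedFormatSketch() {
    }

    /**
     * Returns true if the line satisfies the four stated assumptions:
     * single spaces between tokens, no leading or trailing whitespace,
     * and no whitespace character other than the standard space.
     */
    static boolean looksNormalized(String line) {
        if (line.isEmpty()) {
            return true;
        }
        // No whitespace before the first token or after the final token.
        if (line.charAt(0) == ' ' || line.charAt(line.length() - 1) == ' ') {
            return false;
        }
        // Only one space between every pair of tokens.
        if (line.contains("  ")) {
            return false;
        }
        // Standard spaces are the only type of whitespace
        // (tabs, newlines, and non-breaking spaces are rejected).
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != ' ' && (Character.isWhitespace(c) || Character.isSpaceChar(c))) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, `looksNormalized("hello , world .")` holds, while a line containing a tab or a double space is rejected.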
-
Method Summary
static java.lang.String capitalizeI(java.lang.String line)
static java.lang.String capitalizeLineFirstLetter(java.lang.String line)
Capitalize the first letter of a line.
static java.lang.String capitalizeNameTitleAbbrvs(java.lang.String line)
Capitalize the first character of the titles of names: Mr Mrs Ms Miss Dr Prof
static java.lang.String joinContractions(java.lang.String line)
Scanning the line from left to right, a contraction suffix preceded by a space will become just the contraction suffix.
static java.lang.String joinHyphen(java.lang.String line)
Scanning from left to right, a hyphen surrounded by a space before and after it will become just the hyphen.
static java.lang.String joinPunctuationMarks(java.lang.String line)
Scanning from left to right, a comma or period preceded by a space will become just the comma/period.
static java.lang.String processSingleLine(java.lang.String normalized)
Apply all the denormalization methods to the normalized input line.
static java.lang.String replaceBracketTokens(java.lang.String line)
Case-insensitively replace all of the character sequences that represent a bracket character.
-
Method Detail
-
processSingleLine
public static java.lang.String processSingleLine(java.lang.String normalized)
Apply all the denormalization methods to the normalized input line.
Parameters:
normalized
Returns:
-
capitalizeLineFirstLetter
public static java.lang.String capitalizeLineFirstLetter(java.lang.String line)
Capitalize the first letter of a line. This should be the last denormalization step applied to a line.
Parameters:
line - The single-line input string
Returns:
The input string modified as described above
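The behaviour described for capitalizeLineFirstLetter can be approximated as follows. This is a minimal sketch under the assumption that the first character of the line is the one to capitalize; the real method may treat leading punctuation differently. The class name `FirstLetterSketch` is hypothetical.

```java
// Hypothetical illustration, not the DeNormalize source.
final class FirstLetterSketch {

    private FirstLetterSketch() {
    }

    /** Upper-cases the first character of the line and leaves the rest untouched. */
    static String capitalizeFirst(String line) {
        if (line == null || line.isEmpty()) {
            return line;
        }
        return Character.toUpperCase(line.charAt(0)) + line.substring(1);
    }
}
```

For example, `capitalizeFirst("hello, world.")` yields `"Hello, world."`.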
-
joinPunctuationMarks
public static java.lang.String joinPunctuationMarks(java.lang.String line)
Scanning from left to right, a comma or period preceded by a space will become just the comma/period.
Parameters:
line - The single-line input string
Returns:
The input string modified as described above
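The rule above amounts to deleting the space in front of each comma or period. A minimal sketch, assuming only those two punctuation marks are handled (the class name `PunctuationJoinSketch` is hypothetical and this is not the DeNormalize source):

```java
// Hypothetical illustration of the comma/period rule.
final class PunctuationJoinSketch {

    private PunctuationJoinSketch() {
    }

    /** Removes the single space that precedes a comma or a period. */
    static String joinPunctuation(String line) {
        // String.replace substitutes every literal occurrence, scanning left to right.
        return line.replace(" ,", ",").replace(" .", ".");
    }
}
```

With this sketch, `joinPunctuation("hello , world .")` yields `"hello, world."`.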
-
joinHyphen
public static java.lang.String joinHyphen(java.lang.String line)
Scanning from left to right, a hyphen surrounded by a space before and after it will become just the hyphen.
Parameters:
line - The single-line input string
Returns:
The input string modified as described above
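The hyphen rule can be sketched in the same way: every occurrence of space, hyphen, space collapses to a bare hyphen. The class name `HyphenJoinSketch` is hypothetical, and the code is an illustration of the described behaviour rather than the DeNormalize source.

```java
// Hypothetical illustration of the hyphen rule.
final class HyphenJoinSketch {

    private HyphenJoinSketch() {
    }

    /** Replaces each " - " (space, hyphen, space) with "-". */
    static String joinHyphen(String line) {
        return line.replace(" - ", "-");
    }
}
```

For example, `joinHyphen("a well - known fact")` yields `"a well-known fact"`. A hyphen with a space on only one side is left alone.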
-
joinContractions
public static java.lang.String joinContractions(java.lang.String line)
Scanning the line from left to right, a contraction suffix preceded by a space will become just the contraction suffix.
I.e., the preceding space will be deleted, joining the prefix to the suffix.
E.g. "wo n't" becomes "won't".
Parameters:
line - The single-line input string
Returns:
The input string modified as described above
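A sketch of the contraction rule follows. The suffix list (n't, 'll, 're, 've, 'd, 'm, 's) is an assumption chosen for illustration, since the documentation does not enumerate the suffixes, and the class name `ContractionJoinSketch` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; the suffix list is an assumption.
final class ContractionJoinSketch {

    // A space, then an assumed contraction suffix, not followed by another letter
    // (so that a token such as 'start is not mistaken for the suffix 's).
    private static final Pattern SPACE_BEFORE_SUFFIX =
            Pattern.compile(" (n't|'ll|'re|'ve|'d|'m|'s)(?!\\p{L})");

    private ContractionJoinSketch() {
    }

    /** Deletes the space in front of each contraction suffix. */
    static String joinContractions(String line) {
        return SPACE_BEFORE_SUFFIX.matcher(line).replaceAll("$1");
    }
}
```

With this sketch, `joinContractions("wo n't")` yields `"won't"`, matching the example in the method description.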
-
capitalizeNameTitleAbbrvs
public static java.lang.String capitalizeNameTitleAbbrvs(java.lang.String line)
Capitalize the first character of the titles of names: Mr Mrs Ms Miss Dr Prof
Parameters:
line - The single-line input string
Returns:
The input string modified as described above
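Title capitalization can be sketched with a regular expression over the six listed titles. The matching rules here are assumptions: a title must be a whole space-delimited token, optionally followed by a period. The class name `TitleCapitalizationSketch` is hypothetical and the code is not the DeNormalize source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the token-boundary rules are assumptions.
final class TitleCapitalizationSketch {

    // Group 1: start of line or a space. Group 2: the lower-case title.
    // The lookahead requires an optional period, then a space or end of line.
    private static final Pattern TITLE =
            Pattern.compile("(^| )(mr|mrs|ms|miss|dr|prof)(?=\\.?(?: |$))");

    private TitleCapitalizationSketch() {
    }

    /** Upper-cases the first character of each matched title token. */
    static String capitalizeTitles(String line) {
        Matcher m = TITLE.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String title = m.group(2);
            String fixed = Character.toUpperCase(title.charAt(0)) + title.substring(1);
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + fixed));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `capitalizeTitles("mr smith met dr jones")` yields `"Mr smith met Dr jones"`. The sketch does not disambiguate words that only look like titles, so the verb "miss" as a standalone token would also be capitalized.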
-
capitalizeI
public static java.lang.String capitalizeI(java.lang.String line)
-
replaceBracketTokens
public static java.lang.String replaceBracketTokens(java.lang.String line)
Case-insensitively replace all of the character sequences that represent a bracket character. Keys are token representations of abbreviations of titles for names that capitalize more than just the first letter.
Bracket token sequences: -lrb- -rrb- -lsb- -rsb- -lcb- -rcb-
See http://www.cis.upenn.edu/~treebank/tokenization.html
Parameters:
line - The single-line input string
Returns:
The input string modified as described above
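The six bracket tokens follow the Penn Treebank convention, in which -lrb-/-rrb- stand for round brackets, -lsb-/-rsb- for square brackets, and -lcb-/-rcb- for curly brackets. A minimal sketch of a case-insensitive replacement under that mapping (the class name `BracketTokenSketch` is hypothetical and this is not the DeNormalize source):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration using the Penn Treebank bracket tokens.
final class BracketTokenSketch {

    private static final Map<String, String> BRACKETS = new LinkedHashMap<>();

    static {
        BRACKETS.put("-lrb-", "(");
        BRACKETS.put("-rrb-", ")");
        BRACKETS.put("-lsb-", "[");
        BRACKETS.put("-rsb-", "]");
        BRACKETS.put("-lcb-", "{");
        BRACKETS.put("-rcb-", "}");
    }

    private BracketTokenSketch() {
    }

    /** Replaces every bracket token, in any letter case, with its bracket character. */
    static String replaceBracketTokens(String line) {
        String result = line;
        for (Map.Entry<String, String> e : BRACKETS.entrySet()) {
            result = Pattern.compile(Pattern.quote(e.getKey()), Pattern.CASE_INSENSITIVE)
                    .matcher(result)
                    .replaceAll(Matcher.quoteReplacement(e.getValue()));
        }
        return result;
    }
}
```

For example, `replaceBracketTokens("-LRB- see below -rrb-")` yields `"( see below )"`. The spaces around the brackets are left in place because this sketch only substitutes the tokens themselves.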
-