Class LevenshteinDistance
- java.lang.Object
-
- org.omegat.core.matching.LevenshteinDistance
-
- All Implemented Interfaces:
ISimilarityCalculator
public class LevenshteinDistance extends java.lang.Object implements ISimilarityCalculator
Class to compute Levenshtein Distance.Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.
For example,
- If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance. alex73's comment: We can't make 'compute' mathod static, because in this case LevenshteinDistance will not be thread-safe(see 'd' and 'p' arrays). We can't create these arrays inside 'compute' method, because it's enough slow operation. We have to create LevenshteinDistance instance one for each thread where we will call it. It's best way for best performance.
- See Also:
- Levenshtein Distance, in Three Flavors
-
-
Constructor Summary
Constructors Constructor Description LevenshteinDistance()
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description int
compute(Token[] s, Token[] t)
Compute Levenshtein distance between two lists.
-
-
-
Method Detail
-
compute
public int compute(Token[] s, Token[] t)
Compute Levenshtein distance between two lists.The difference between this impl. and the canonical one is that, rather than creating and retaining a matrix of size s.length()+1 by t.length()+1, we maintain two single-dimensional arrays of length s.length()+1.
The first, d, is the 'current working' distance array that maintains the newest distance cost counts as we iterate through the characters of String s. Each time we increment the index of String t we are comparing, d is copied to p, the second int[]. Doing so allows us to retain the previous cost counts as required by the algorithm (taking the minimum of the cost count to the left, up one, and diagonally up and to the left of the current cost count being calculated).
(Note that the arrays aren't really copied anymore, just switched... this is clearly much better than cloning an array or doing a System.arraycopy() each time through the outer loop.)
Effectively, the difference between the two implementations is this one does not cause an out of memory condition when calculating the LD over two very large strings.
For perfomance reasons the maximal number of compared items is
MAX_N
.- Specified by:
compute
in interfaceISimilarityCalculator
- Parameters:
s
- source segmentt
- target segment- Returns:
- similarity
-
-