public class LevenshteinDistance extends java.lang.Object implements ISimilarityCalculator
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.
For example,
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance. alex73's comment: We can't make 'compute' mathod static, because in this case LevenshteinDistance will not be thread-safe(see 'd' and 'p' arrays). We can't create these arrays inside 'compute' method, because it's enough slow operation. We have to create LevenshteinDistance instance one for each thread where we will call it. It's best way for best performance.
Constructor and Description |
---|
LevenshteinDistance() |
Modifier and Type | Method and Description |
---|---|
int |
compute(Token[] s,
Token[] t)
Compute Levenshtein distance between two lists.
|
public int compute(Token[] s, Token[] t)
The difference between this impl. and the canonical one is that, rather than creating and retaining a matrix of size s.length()+1 by t.length()+1, we maintain two single-dimensional arrays of length s.length()+1.
The first, d, is the 'current working' distance array that maintains the newest distance cost counts as we iterate through the characters of String s. Each time we increment the index of String t we are comparing, d is copied to p, the second int[]. Doing so allows us to retain the previous cost counts as required by the algorithm (taking the minimum of the cost count to the left, up one, and diagonally up and to the left of the current cost count being calculated).
(Note that the arrays aren't really copied anymore, just switched... this is clearly much better than cloning an array or doing a System.arraycopy() each time through the outer loop.)
Effectively, the difference between the two implementations is this one does not cause an out of memory condition when calculating the LD over two very large strings.
For perfomance reasons the maximal number of compared items is MAX_N
.
compute
in interface ISimilarityCalculator
s
- source segmentt
- target segment