Class LevenshteinDistance

  • All Implemented Interfaces:
    ISimilarityCalculator

    public class LevenshteinDistance
    extends java.lang.Object
    implements ISimilarityCalculator
    Class to compute Levenshtein Distance.

    Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.

    For example,

    • If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
    • If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

    The greater the Levenshtein distance, the more different the strings are.

    Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance. alex73's comment: We can't make 'compute' mathod static, because in this case LevenshteinDistance will not be thread-safe(see 'd' and 'p' arrays). We can't create these arrays inside 'compute' method, because it's enough slow operation. We have to create LevenshteinDistance instance one for each thread where we will call it. It's best way for best performance.

    See Also:
    Levenshtein Distance, in Three Flavors
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      int compute​(Token[] s, Token[] t)
      Compute Levenshtein distance between two lists.
      • Methods inherited from class java.lang.Object

        equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • LevenshteinDistance

        public LevenshteinDistance()
    • Method Detail

      • compute

        public int compute​(Token[] s,
                           Token[] t)
        Compute Levenshtein distance between two lists.

        The difference between this impl. and the canonical one is that, rather than creating and retaining a matrix of size s.length()+1 by t.length()+1, we maintain two single-dimensional arrays of length s.length()+1.

        The first, d, is the 'current working' distance array that maintains the newest distance cost counts as we iterate through the characters of String s. Each time we increment the index of String t we are comparing, d is copied to p, the second int[]. Doing so allows us to retain the previous cost counts as required by the algorithm (taking the minimum of the cost count to the left, up one, and diagonally up and to the left of the current cost count being calculated).

        (Note that the arrays aren't really copied anymore, just switched... this is clearly much better than cloning an array or doing a System.arraycopy() each time through the outer loop.)

        Effectively, the difference between the two implementations is this one does not cause an out of memory condition when calculating the LD over two very large strings.

        For perfomance reasons the maximal number of compared items is MAX_N.

        Specified by:
        compute in interface ISimilarityCalculator
        Parameters:
        s - source segment
        t - target segment
        Returns:
        similarity