Levenshtein Distance (or) Edit Distance

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.

For example,

    * If s is “test” and t is “test”, then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
    * If s is “test” and t is “tent”, then LD(s,t) = 1, because one substitution (change “s” to “n”) is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can’t spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

    * Spell checking
    * Speech recognition
    * DNA analysis
    * Plagiarism detection

The Algorithm is as follows

Set n to be the length of s.
Set m to be the length of t.

If n = 0, return m and exit.
If m = 0, return n and exit.

Construct a matrix containing 0..m rows and 0..n columns.

Initialize the first row to 0..n.
Initialize the first column to 0..m.

  Examine each character of s (i from 1 to n).
  Examine each character of t (j from 1 to m).
  If s[i] equals t[j], the cost is 0.
 If s[i] doesn’t equal t[j], the cost is 1.
  Set cell d[i,j] of the matrix equal to the minimum of:
  a. The cell immediately above plus 1: d[i-1,j] + 1.
  b. The cell immediately to the left plus 1: d[i,j-1] + 1.
  c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.

After the iteration steps are complete, the distance is found in cell d[n,m].

The Java Implementation of the above algorithm is as follows

public class EditDistance {
 
 private int getMinimum(int a, int b, int c)
 {
  return Math.min(a,Math.min(b, c));
 }
 
 public int findCost(String s1,String s2)
 {
  
  //Find the length of strings
  int s1Length = s1.length();
  int s2Length = s2.length();
  //declare two dimension array of string length + 1
  int matrix[][] = new int[s1Length+1][s2Length+1];
  
  //If Length of any one of the String is zero then return the length of the other as Cost
  if(s1Length==0) return s2Length;
  if(s2Length==0) return s1Length;
  
  //Initialize the first row with 0 to s1Length
  for(int i=0;i<=s1Length;i++)
   matrix[i][0] = i;
  
  //Initialize the first Column  with 0 to s2Length
  for(int i=0;i<=s2Length;i++)
   matrix[0][i] = i;  
  
  //Iterate through the Matrix
  for(int i=1;i<=s1Length;i++)
  {
   for(int j=1;j<=s2Length;j++)
   {
    
    int cost;
    //Check the Character at each location, if they are same then the Cost is Zero else Cost is One
    if(s1.charAt(i-1)==s2.charAt(j-1))
     cost=0;
    else
     cost=1;
    
    //Set the value of matrix[i][j] based on the logic of the minimum value
    //  of 1) Value of the Cell on the Top + 1
    //     2) Value of the Cell on the Left + 1
    //     3) Value of the Cell on the Left Top Diagonal + cost
    matrix[i][j] = getMinimum(matrix[i][j-1]+1, matrix[i-1][j]+1, matrix[i-1][j-1]+cost);
   }
  }
  
  // Used for debugging the Matrix
  /*
  for(int i=0;i<=s1Length;i++)
  {
   for(int j=0;j<=s2Length;j++)
    System.out.print(matrix[i][j]  + " ");
   System.out.println();
  }  
  */
  return matrix[s1Length][s2Length];
 }
public static void main(String[] args)
 {
  EditDistance d = new EditDistance();
  System.out.println("Distance is " + d.findCost("worse", "world"));
 }

 Reference Links:

http://www.merriampark.com/ld.htm

http://en.wikipedia.org/wiki/Levenshtein_distance