Difference between revisions of "Edit distance"

From HaskellWiki
Jump to navigation Jump to search
 
(Tried to tidy up the text a bit; no changes to the code)
 
(5 intermediate revisions by 3 users not shown)
Line 1: Line 1:
  +
[[Category:Code]]
Here is an implementation of
 
  +
The [http://en.wikipedia.org/wiki/Edit_distance edit distance] is the minimum number of (delete/insert/modify) operations required to edit one given string into another given string. The functions given on this page calculate the [http://en.wikipedia.org/wiki/Levenshtein_distance Levenshtein distance] of two strings.
   
  +
The edit distance of two strings can be found using a dynamic programming algorithm. The time complexity of this algorithm is O(nm), where n and m are the lengths of the input strings.
http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html
 
  +
  +
The function which follows uses an array to store the table used by the algorithm:
  +
  +
<haskell>
  +
editDistance :: Eq a => [a] -> [a] -> Int
  +
editDistance xs ys = table ! (m,n)
  +
where
  +
(m,n) = (length xs, length ys)
  +
x = array (1,m) (zip [1..] xs)
  +
y = array (1,n) (zip [1..] ys)
  +
  +
table :: Array (Int,Int) Int
  +
table = array bnds [(ij, dist ij) | ij <- range bnds]
  +
bnds = ((0,0),(m,n))
  +
  +
dist (0,j) = j
  +
dist (i,0) = i
  +
dist (i,j) = minimum [table ! (i-1,j) + 1, table ! (i,j-1) + 1,
  +
if x ! i == y ! j then table ! (i-1,j-1) else 1 + table ! (i-1,j-1)]
  +
</haskell>
  +
  +
  +
Lloyd Allison's paper, [http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html Lazy Dynamic-Programming can be Eager], describes a more efficient method for computing the edit distance.
  +
  +
The following Haskell function computes the edit distance in O(length a * (1 + dist a b)) time complexity. It is a translation of the function presented in Allison's paper, which is written in lazy ML.
   
 
<haskell>
 
<haskell>
dist :: Eq a => Int -> Int -> [a] -> [a] -> Int
+
dist :: Eq a => [a] -> [a] -> Int
 
dist a b
 
dist a b
 
= last (if lab == 0 then mainDiag
 
= last (if lab == 0 then mainDiag
 
else if lab > 0 then lowers !! (lab - 1)
 
else if lab > 0 then lowers !! (lab - 1)
 
else{- < 0 -} uppers !! (-1 - lab))
 
else{- < 0 -} uppers !! (-1 - lab))
where mainDiag = oneDiag anyInt a b (head uppers) (anyInt : head lowers)
+
where mainDiag = oneDiag a b (head uppers) (-1 : head lowers)
 
uppers = eachDiag a b (mainDiag : uppers) -- upper diagonals
 
uppers = eachDiag a b (mainDiag : uppers) -- upper diagonals
 
lowers = eachDiag b a (mainDiag : lowers) -- lower diagonals
 
lowers = eachDiag b a (mainDiag : lowers) -- lower diagonals
Line 23: Line 49:
 
thisdiag = firstelt : doDiag a b firstelt diagAbove (tail diagBelow)
 
thisdiag = firstelt : doDiag a b firstelt diagAbove (tail diagBelow)
 
lab = length a - length b
 
lab = length a - length b
anyInt = 0
 
 
min3 x y z = if x < y then x else min y z
 
min3 x y z = if x < y then x else min y z
 
</haskell>
 
</haskell>

Latest revision as of 13:46, 8 February 2010

The edit distance is the minimum number of (delete/insert/modify) operations required to edit one given string into another given string. The functions given on this page calculate the Levenshtein distance of two strings.

The edit distance of two strings can be found using a dynamic programming algorithm. The time complexity of this algorithm is O(nm), where n and m are the lengths of the input strings.

The function which follows uses an array to store the table used by the algorithm:

editDistance :: Eq a => [a] -> [a] -> Int
editDistance xs ys = table ! (m,n)
    where
    (m,n) = (length xs, length ys)
    x     = array (1,m) (zip [1..] xs)
    y     = array (1,n) (zip [1..] ys)
    
    table :: Array (Int,Int) Int
    table = array bnds [(ij, dist ij) | ij <- range bnds]
    bnds  = ((0,0),(m,n))
    
    dist (0,j) = j
    dist (i,0) = i
    dist (i,j) = minimum [table ! (i-1,j) + 1, table ! (i,j-1) + 1,
        if x ! i == y ! j then table ! (i-1,j-1) else 1 + table ! (i-1,j-1)]


Lloyd Allison's paper, Lazy Dynamic-Programming can be Eager, describes a more efficient method for computing the edit distance.

The following Haskell function computes the edit distance in O(length a * (1 + dist a b)) time complexity. It is a translation of the function presented in Allison's paper, which is written in lazy ML.

dist :: Eq a => [a] -> [a] -> Int
dist a b 
    = last (if lab == 0 then mainDiag
	    else if lab > 0 then lowers !! (lab - 1)
		 else{- < 0 -}   uppers !! (-1 - lab))
    where mainDiag = oneDiag a b (head uppers) (-1 : head lowers)
	  uppers = eachDiag a b (mainDiag : uppers) -- upper diagonals
	  lowers = eachDiag b a (mainDiag : lowers) -- lower diagonals
	  eachDiag a [] diags = []
	  eachDiag a (bch:bs) (lastDiag:diags) = oneDiag a bs nextDiag lastDiag : eachDiag a bs diags
	      where nextDiag = head (tail diags)
	  oneDiag a b diagAbove diagBelow = thisdiag
	      where doDiag [] b nw n w = []
		    doDiag a [] nw n w = []
		    doDiag (ach:as) (bch:bs) nw n w = me : (doDiag as bs me (tail n) (tail w))
			where me = if ach == bch then nw else 1 + min3 (head w) nw (head n)
		    firstelt = 1 + head diagBelow
		    thisdiag = firstelt : doDiag a b firstelt diagAbove (tail diagBelow)
	  lab = length a - length b
          min3 x y z = if x < y then x else min y z