💾 Download

String edit distance

Today we will cover another example of memoization, quite different in nature. It finds the "diff" between two strings, meaning what was changed to go from one to another.

The number of changes is often called the edit distance or Levenshtein distance. Here we are just going to find the diff. The result will be a visual string that makes it easy to see where the changes were.

In [9]:
# Run this first, it's from Week 4, day 1 lecture 
# See https://python-course.eu/advanced-python/memoization-decorators.php
# for an explanation of how this works and 
# https://realpython.com/primer-on-python-decorators/ for a description
# of python decorators in general.
def memoize(func):
  memory = {}
  def helper(*args):
    key = str(args)
    if key in memory:
      return memory[key]
    ans = func(*args)
    memory[key] = ans 
    return ans
  helper.memory = memory
  return helper

Base cases: First, the base case as in all recursions can be when s or t are empty strings. That's easy. Also, if they agree on the first characters, s[0] == t[0], then we can just recurse on the rest: in that case edit_distance(s, t) is equal to edit_distance(s[1:], t[1:]) because s[1:] is everything after the first character.

Recursion: Otherwise, there are three possibilities, which we solve by recursion. We then take the smallest distance. Either the first character of s is removed, the first character was changed, or the first character of t was inserted on s.

In [17]:
def edit_distance(s, t):
  '''Compute the minimal edit distance between two strings s and t, where an 
  an insertion counts as distance 1, a deletion counts as distance 1,
  and a replacement counts as distance 1.

  Example edit_distance('hello', 'hola') = 3 because:
  1) you delete one of the two 'l' chars at a cost of 1
  2) you replace the 'e' with an 'o' at a cost of 1
  3) you replace the 'o' with an 'a' at a cost of 1
  if len(s) == 0:
    return len(t)
  if len(t) == 0:
    return len(s)
  if s[0] == t[0]:
    return edit_distance(s[1:], t[1:])
  a = 1 + edit_distance(s[1:], t) # remove the first character of s
  b = 1 + edit_distance(s, t[1:]) # insert the first character of t
  c = 1 + edit_distance(s[1:], t[1:]) # substitute the first character
  return min(a, b, c)

edit_distance('fun', 'farming')
In [18]:
{"('', 'arming')": 6,
 "('', 'rming')": 5,
 "('', 'ming')": 4,
 "('', 'ing')": 3,
 "('', 'g')": 1,
 "('n', 'ng')": 1,
 "('', 'ng')": 2,
 "('n', 'ing')": 2,
 "('n', 'ming')": 3,
 "('n', 'rming')": 4,
 "('n', 'arming')": 5,
 "('n', '')": 1,
 "('', '')": 0,
 "('n', 'g')": 1,
 "('un', '')": 2,
 "('un', 'g')": 2,
 "('un', 'ng')": 2,
 "('un', 'ing')": 2,
 "('un', 'ming')": 3,
 "('un', 'rming')": 4,
 "('un', 'arming')": 5,
 "('fun', 'farming')": 5}

Further resources

Since this was my last lecture, I include further resources here:

  • Free Jupyter notebooks at Colab
  • VS code for editing python locally on your computer (it also can run Jupyter notebooks locally, and is great for any programming language or html etc)
  • repl.it – edit python online (in .py files, not notebooks)
  • Class notebooks and Python Puzzles – all the notebooks will be shared soon.
  • pythontutor.org is a great visual debugger for tiny programs. For debugging anything larger, use VS code.
  • Python for Everybody is a class that teaches fundamentals of Python from scratch
  • Art of Problem Solving is a great website for math problems and community for learning math
  • Codeforces and Leetcode are ocw.mit.edu -- Mit classes especially 6.0001 and 6.006