Dynamic programming: Hidden Markov Models

Following the introduction on dynamic programming, I’m giving an example using Hidden Markov Models (HMM). Since I’m not here to teach math or the usage of such tools in bioinformatics, but just to present an application of the method, I’ll try to keep everything simple.

Let’s try a simple example: the “grumpy cat model”. This cat can be in a good or bad mood, but since it is grumpy it is more likely to stay in a bad mood. Let’s say that a mood lasts the whole day. Also, when the cat is in a bad mood it is more likely to scratch the carpet than to purr, while the opposite happens when in a good mood.

What you want to achieve with the HMM is to know something about good and bad mood days given the observation of scratched carpet and repeated sounds. In other words, the emotional state of the cat is hidden and you would like to know it given the model.

In general, you need a set of states, the probability of moving from one state to another (given an input) and some observable features. In our “grumpy cat model” example: states can be “good mood” and “bad mood“, the probability of going from good to bad is 75% and from bad to good is 40%. The observable features are “scratched carpet“, with an 80% probability to happen when in bad mood and 30% while in good mood, while 20% and 70% are for “purr“, respectively.

Also we need to know how likely is to start in a bad mood, let’s say 60%. Reaching the END states has a 5% probability for both the bad and the good mood. In this example, the END is the death of the cat.

First and most simple application of dynamic programming to HMM is to answer the following question: how likely is to have a specific sequence of observation given the model? I.e. how likely is to have 5 days of scratched carpet, or 2 purr days + 3 scratched carpet days given our grumpy cat model?

This can be achieved by using the forward algorithm. It starts with a table, in our example with 4 rows (one of START, one for BAD mood, one for GOOD mood and one of END) and 7 columns (the 5 days plus one column for the begin and one for the end).

In the first column you have 1 for START and 0 for everything else (initialization of probabilities).

Beginning
Scratch Scratch Scratch Scratch Scratch Pr.
START 1
GOOD 0
BAD 0
END 0

The iteration is the following: the table cell in row I and column J+1 contains the probability of observing the feature J+1 while in state I (i.e. if I is BAD and J+1 is scratched carpet, you use the value 0.8) times the sum of the values in the previous column (J) each multiplied by the probability of moving from that row to the current one.

Day 1
Scratch Scratch Scratch Scratch Scratch Pr.
START 1
GOOD 0  0.4*0.3
BAD 0  0.6*0.8
END 0

This was easy: you have 0.4 probability of going from START to GOOD state, and you have 0.3 probability of a scratchy cat in a good mood. Then you have 0.6 probability of going from START to BAD and 0.8 probability of scratch while in bad MOOD.

Moving to the next column:

Day 2
Sc. Scratch Sc. Sc. Sc. Pr.
START 1
GOOD 0  0.12 0.3*(0.2*0.12 + 0.4*0.48)
BAD 0  0.48  0.8*(0.75*0.12 + 0.55*0.48)
END 0

In this case, for the GOOD state we have that either the cat stayed in a good mood, or it changed mood from a bad one to the good one. We have 0.2 chance of staying in a good mood times the chance on the good mood in the previous column (0.12), plus the chance of moving from a bad mood to a good mood (0.4) times the probability of bad mood in the previous column (0.48). The sum of these probabilities is multiplied by the chance of seeing a scratch while in a good mood (0.3).

The same reasoning applies for the bad mood.

This process is then iterated for the remaining columns, as shown in the following tables.

Day 3
Sc. Sc. Scratch Sc. Sc. Pr.
START 1
GOOD 0  0.12 0.065  0.3*(0.2*0.065 + 0.4*0.28)
BAD 0  0.48  0.28 0.8*(0.75*0.065 + 0.55*0.28)
END 0
Day 4
Sc. Sc. Sc. Scratch Sc. Pr.
START 1
GOOD 0  0.12 0.065 0.038  0.3*(0.2*0.038 + 0.4*0.163)
BAD 0  0.48  0.28 0.163  0.8* (0.75*0.038 + 0.55*0.163)
END 0
Day 5
Sc. Sc. Sc. Sc. Scratch Pr.
START 1
GOOD 0  0.12 0.065 0.038 0.022  0.3*(0.2*0.022 + 0.4*0.095)
BAD 0  0.48  0.28 0.163 0.095  0.8*(0.75*0.022 + 0.55*0.095)
END 0

After the last day, there is no emission, then the probabilities of the two states (moods) times the probability of reaching the end from that state (5% for both) are summed.

End
Sc. Sc. Sc. Sc. Sc. Pr.
START 1
GOOD 0  0.12 0.065 0.038 0.022 0.013
BAD 0  0.48  0.28 0.163 0.095 0.055
END 0  0.013*0.05 + 0.029*0.05

So, there is a 0.003, i.e. a 0.3% probability of observing 5 days of scratching given our grumpy cat model.

The dynamic programming approach evaluates just one column at the time while considering only the previous one. In this way, we can compute the probability for sequences of any length (i.e. as many days as we want) without using complex algorithms.

As exercise, you could try to compute the same matrix for the sequence of observations “Purr, Purr, Scratch, Scratch, Scratch”.

Jan 25 2016: the post has been edited to fix the transition probability to the END state.

Dynamic programming

Dynamic programming is a clever methodology used for solving very complex problems.

The basic idea is like construction works: you build a bigger wall by placing new bricks on existing ones. In dynamic programming, you identify a new solution by extending a previous one. Divide et impera to its best.

In the fields of bioinformatics and computational biology, two great examples of dynamic programming are used: alignment algorithms and Hidden Markov Models. I’m going to present some of them in the next posts, for now I would like to focus on the basic idea of dynamic programming, by using a very simple example.

At the basic of alignment and HMM algorithms you have a table. What you have to do is filling the table using some rules: you start from one side of the table, apply the rules to the next row or column and keep going on until the table is filled.

Let’s try the following: create a table of 5 rows and 10 columns.

As first step, you should always fill at leas a value into the table (initialization step). In our example, we put the numbers from 1 to 5 into the first column.

Now it is time to apply the iterative step: we have to repeat the same operation on all cells, starting from one side of the table and moving on.

In our example, the cell should contain the result of the following formula: value on the left times row number.

The first row should read as 1,1,1,1,1; the second 2,4,8,16,32 and so on. As you can see, each row contains the powers of the first number. What you needed to build it was just a local information: the previous value. So the 16 in the second row comes from the 8 in the third column, while the 16 in the fourth row is evaluated from the 4 on its left.

This is a very simple example, real dynamic programming algorithms may use the whole preceding column, complex mathematical functions on the previous values and so on. But the strategy is the same: if I change my function to “The maximum value on the preceding column times the minimum one, minus row number”, I would proceed in the same way: filling one column at the time, just using a different function.

 

Python scope and references

Variables are not that easy to understand for students. Often the name of the variable is something that has an ontological meaning for them, other times the usage of a variable in both the left and the right side of an assignment may be confusing.

But something they should not underestimate is the mixed management of references in Python.

To start, let’s first discuss scope (or visibility). Is a variable usable in a specific piece of code? Can I use the same name for a different variable?

While in other programming languages with a more strict syntax, the scope is straightforward, in Python is a bit more complicated.

Let’s start considering the simplest case: a global variable and its usage inside a function

a=1000

def f():
   print "Inside f",a

print a
f()
print a

So, the variable inside the function has the same value. Just to make thing more clear, you should remember that the value of a is the one at the time of execution of f(). I.e. it doesn’t matter if the variable is initialized before the definition:

def f():
   print "Inside f:",a

a = 73
f()
print a

What if I change the value of variable a inside f?

def f():
   a = 256
   print "Inside f:",a

a = 73
print a
f()
print a

As you can see, the variable inside the function is a different one. This is the scope in action: visibility of the inner variable is limited to the function. That applies also for parameters:

def f(a):
   a = 256
   print "Inside f:",a

a = 73
print a
f(a)
print a

So, everything assigned after the function name is local. Unless it is a reference.

A reference is like a pointer to something else: the memory location storing the information is the same, but it has different labels (i.e. variable names). This is true for lists:

l = [1,4]
t = l
l[0]=7
print l
print t

in this case, t is not a copy of l, it is just a different name for the same object. So changing the content of l would change the content of t, and vice-versa.
This happens for function parameters too: lists and dictionaries are passed as references, so changing the value inside a function would change the value of the original variable.

def f(a):
   a[0]=99
   print "Inside f:",a

a = [12,73]
print a
f(a)
print a

This is a classic example of a function with a side effect: the function changes something and you should be aware of that. Another example for side-effects due to references can be the following.
Consider that you may want to record every element added to any list into another one, like a log. What happens when you use the same function on the log itself?

log = []

def add(el,l):
    l.append(el)
    log.append(el)

a = [12,73]
add(2,a)
print a
print log
add(99,log)
print log

This happens because l and log point to the same memory location.

This is even worse when you try to create matrices using lists of list. Try the following:

def create_matrix(rows,cols):
    matrix = [[0]*cols]*rows
    return matrix

m = create_matrix(3,4)
for row in m:
    print row
m[1][1]=7
for row in m:
    print row

As you can see, changing an element in a row will change all the other elements in the same position in the other rows. This is because we have just one list as row, and every other is a reference to the same. While [0]*cols duplicates the integer value 0 (not a reference, but an actual value), the duplication of the rows is exactly as the assignment of a list to a different variable name: it creates a reference to the previous one.
Then, if you want to create a matrix, it may be better to use:

def create_matrix(rows,cols):
    matrix = []
    for i in range(rows):
       r = []
       for j in range(cols):
          r.append(0)
       matrix.append(r)
    return matrix

m = create_matrix(3,4)
for row in m:
    print row
m[1][1]=7
for row in m:
    print row

Recursion vs iteration

A recent tutorial for the students was about the differece between recursion and iteration.
The most effective way to explain it was the following:

  • iteration requires a set of values, and you perform the same operations on each of them until you are done
  • recursion implies an operation on an input, that has a result depending on the same operation on smaller inputs up to a very simple one (like zero or one) with an easy answer

The classic example for recursion is the computation of a factorial. The factorial of a number n, denoted by n!,  is the product of all the numbers up to n. So 3! = 1 x 2 x 3, 5! = 1 x 2 x 3 x 4 x 5.

It is easy to implement recursively in Python as:

def fact(n):
    if n==1:
       return 1
    else:
       return n*fact(n-1)

As you can see, we have the simple input when n equals to 1, otherwise the value depends on a smaller input (n-1).

If you can write a recursive function for a task, you can also write an iterative version of the same task.

def fact(n):
    result = 1
    for i in range(1,n+1):
        result = result * i
    return result

The iterative version requires a variable to store intermediate results.

A more complex example is the binary search. The input for binary search is a sorted list of elements, for example numbers. It is faster to find the presence of a value in the list using the following approach:

  1. look at the middle element
  2. if it is equal to the one we are looking for, we are done
  3. if it is greater, then our number should be in the lower half of the list
  4. if it is lesser, then our number should be in the upper half of the list
  5. repeat from step 1 for the appropriate half

In this case, the recursive version is straghtforward too:

def binary_search(l,n):
    #l is the list, n is the element to be found
    if len(l)>0:
       mid = len(l)/2
       if l[mid]==n:
          return True
       elif l[mid]>n:
          return binary_search(l[:mid],n)
       elif l[mid]<n:
          return binary_search(l[mid+1:],n)
    else: #empty list
       return False

However, if we also want the position of the element, we need few more parameters:

def binary_search(l,n,low,high):
    if high<low:
       return None
    else:
       mid = (high-low)/2+low
       if l[mid]==n:
          return mid
       elif l[mid]>n:
          return binary_search(l,n,low,mid-1)
       elif l[mid]<n:
          return binary_search(l,n,mid+1,high)

To create an iterative version of this, a while loop is required:

def binary_search(l,n,low,high):
    while low<=high:
        mid = (high-low)/2+low
        if l[mid]==n:
          return mid
        elif l[mid]>n:
          high = mid-1
        elif l[mid]<n:
          low = mid+1
    return None

In this last case, there is not a clear list of inputs, like in the factorial example, but we have an exit condition similar to the one used by recursion.

Switching dictionary keys and values in Python

Yesterday a student asked how could he may swap a dictionary so that keys become values and values keys.

One naive solution may be something like that:

new_dict = dict([(value,key) for key,value in old_dict.items()])

However, this solution has a major drawback: keys are guaranteed to be unique, while values may not. Given the fact that old_dict.items() returns a list of dictionary (key,value) elements in an arbitrary order, the last key having a specific value will be the final value in the new_dict. Example:

>>> old_dict = {12:1,76:3,23:1,3:3}
>>> old_dict.items()
dict_items([(76, 3), (3, 3), (12, 1), (23, 1)])
>>> new_dict = dict([(value,key) for key,value in old_dict.items()])
>>> new_dict
{1: 23, 3: 3}

Notice that (3,3) and (23,1) are the (key,value) pairs used to build he new dictionary.

Maybe the intended result is to have a (value,[key1,key2,..]) element for each value. This can be achieved using a for loop like:

for key,value in old_dict.items():
   if value in new_dict:
       new_dict[value].append(key)
   else:
       new_dict[value]=[key]

or by using a comprehension like:

new_dict = dict([(value, [key for key,v in old_dict.items() if v==value]) for value in set(old_dict.values())])

which gives {1: [12, 23], 3: [76, 3]} as result.

The latter example loops through the dictionary N times, where N is the number of distinct values, while the former performs just one loop.

Retrieving Python dictionary values in order

Complex data in Python may be represented by lists or dictionaries, and sometimes it may be helpful to sort it.

While sorting a list is quite straightforward using

l.sort()

or, if you have a list of lists/tuples and you prefer to sort on a specific value of them:

# sorts using the second value in the element
# example list with name,age,city
l=[['Alice',34,'Berlin'],
   ['Bob',29,'London'],
   ['Carol',30,'Rome']]
l.sort(key=lambda x: x[1])
# l is now sorted by age

Dicationaries are a different thing. Suppose that you have a dictionary mapping students to their grades, like:

grades = { 'Bruce': 100,
           'Diana': 98,
           'Barbara': 78,
           'Clark': 23}

If you print the content of the dictionary it may not appear in sorted order or even using the order in which you specified the keys:

for name in grades:
	print(name,grades[name])

on my machine I get:

Bruce 100
Clark 23
Diana 98
Barbara 78

If you want to get the list sorted by names, you should first sort the keys:

names = list(grades.keys())
names.sort()
for name in names:
	print(name,grades[name])

But what about sorting by values? (I.e. sorting by grades instead of names)

By using the previously shown key parameter of the sort method, you can convert the dictionary to a list of tuples, sort it using the second element of the tuple and then you get the result:

#convert dictionary into list
l = [(x,y) for (x,y) in grades.iteritems()]
l.sort(key=lambda x: x[1])

Then iterating through the list you should get:

Clark 23
Barbara 78
Diana 98
Bruce 100

Reading a FASTA file with Python

Since one of my tutoring duties is related to basic programming tasks for bioinformatics, I’m starting from there.

A good starting point is the exercise “Build a dictionary containing sequences from a FASTA file”.

The FASTA file format is a text based representation of a biological sequences. A FASTA file contains one or multiple entries: each entry consists of an header like, starting with the character > (greater than), and a set of lines (each up to 60 characters long) containing the sequence.

Of course you can use one of the already made libraries for the task: Bio.SeqIO.to_dict() or a loop using other methods from BioPython may be faster. However, this exercise is perfect in teaching how to think like a computer scientist.

First, the problem must be solved without using code: if you don’t know how to solve a problem, you can’t write a program that performs the solution.

In general, the FASTA file contains multiple entries, so we need a way to track the current entry name. We must distinguish the header from the sequence content.

Building from the inside-out:

#distinguish header from sequence
if line[0]=='>': #or line.startswith('>')
    #it is the header
else:
    #it is sequence

In the first case, we need to store the new sequence name and create its entry in the dictionary:

#distinguish header from sequence
if line[0]=='>': #or line.startswith('>')
    #it is the header
    name = line[1:] #discarding the initial >
    seqs[name] = ''
else:
    #it is sequence

We wrap everything inside a for loop, i.e. we are reading one line at the time:

for line in f:
    #let's discard the newline (if any)
    line = line.rstrip()
    #distinguish header from sequence
    if line[0]=='>': #or line.startswith('>')
        #it is the header
        name = line[1:] #discarding the initial >
        seqs[name] = ''
    else:
        #it is sequence

Now we have different ways to manage the sequence:

  1. build a string and update the dictionary when a new header or the end of file is found
  2. update the dictionary entry belonging to the current name

In the first case:

name = None
s = ''
for line in f:
    #let's discard the newline at the end (if any)
    line = line.rstrip()
    #distinguish header from sequence
    if line[0]=='>': #or line.startswith('>')
        #it is the header
        #if this is not the first sequence (i.e. we already read one)
        if name:
            seqs[name]=s
        name = line[1:] #discarding the initial >
        seqs[name] = ''
        s = ''
    else:
        #it is sequence
        s = s + line
#we must keep in mind the last entry: it was not saved
#since we didn't find a new header
if name: #just checking against an empty file
   seqs[name]=s

In the second approach:

name = None
for line in f:
    #let's discard the newline at the end (if any)
    line = line.rstrip()
    #distinguish header from sequence
    if line[0]=='>': #or line.startswith('>')
        #it is the header
        name = line[1:] #discarding the initial >
        seqs[name] = ''
    else:
        #it is sequence
        seqs[name] = seqs[name] + line

F can be a file object from open() or a list of lines from readlines() (I discourage the second one).