Python scope and references

Variables are not that easy to understand for students. Often the name of the variable is something that has an ontological meaning for them, other times the usage of a variable in both the left and the right side of an assignment may be confusing.

But something they should not underestimate is the mixed management of references in Python.

To start, let’s first discuss scope (or visibility). Is a variable usable in a specific piece of code? Can I use the same name for a different variable?

While in other programming languages with a more strict syntax, the scope is straightforward, in Python is a bit more complicated.

Let’s start considering the simplest case: a global variable and its usage inside a function

a=1000

def f():
   print "Inside f",a

print a
f()
print a

So, the variable inside the function has the same value. Just to make thing more clear, you should remember that the value of a is the one at the time of execution of f(). I.e. it doesn’t matter if the variable is initialized before the definition:

def f():
   print "Inside f:",a

a = 73
f()
print a

What if I change the value of variable a inside f?

def f():
   a = 256
   print "Inside f:",a

a = 73
print a
f()
print a

As you can see, the variable inside the function is a different one. This is the scope in action: visibility of the inner variable is limited to the function. That applies also for parameters:

def f(a):
   a = 256
   print "Inside f:",a

a = 73
print a
f(a)
print a

So, everything assigned after the function name is local. Unless it is a reference.

A reference is like a pointer to something else: the memory location storing the information is the same, but it has different labels (i.e. variable names). This is true for lists:

l = [1,4]
t = l
l[0]=7
print l
print t

in this case, t is not a copy of l, it is just a different name for the same object. So changing the content of l would change the content of t, and vice-versa.
This happens for function parameters too: lists and dictionaries are passed as references, so changing the value inside a function would change the value of the original variable.

def f(a):
   a[0]=99
   print "Inside f:",a

a = [12,73]
print a
f(a)
print a

This is a classic example of a function with a side effect: the function changes something and you should be aware of that. Another example for side-effects due to references can be the following.
Consider that you may want to record every element added to any list into another one, like a log. What happens when you use the same function on the log itself?

log = []

def add(el,l):
    l.append(el)
    log.append(el)

a = [12,73]
add(2,a)
print a
print log
add(99,log)
print log

This happens because l and log point to the same memory location.

This is even worse when you try to create matrices using lists of list. Try the following:

def create_matrix(rows,cols):
    matrix = [[0]*cols]*rows
    return matrix

m = create_matrix(3,4)
for row in m:
    print row
m[1][1]=7
for row in m:
    print row

As you can see, changing an element in a row will change all the other elements in the same position in the other rows. This is because we have just one list as row, and every other is a reference to the same. While [0]*cols duplicates the integer value 0 (not a reference, but an actual value), the duplication of the rows is exactly as the assignment of a list to a different variable name: it creates a reference to the previous one.
Then, if you want to create a matrix, it may be better to use:

def create_matrix(rows,cols):
    matrix = []
    for i in range(rows):
       r = []
       for j in range(cols):
          r.append(0)
       matrix.append(r)
    return matrix

m = create_matrix(3,4)
for row in m:
    print row
m[1][1]=7
for row in m:
    print row

Recursion vs iteration

A recent tutorial for the students was about the differece between recursion and iteration.
The most effective way to explain it was the following:

  • iteration requires a set of values, and you perform the same operations on each of them until you are done
  • recursion implies an operation on an input, that has a result depending on the same operation on smaller inputs up to a very simple one (like zero or one) with an easy answer

The classic example for recursion is the computation of a factorial. The factorial of a number n, denoted by n!,  is the product of all the numbers up to n. So 3! = 1 x 2 x 3, 5! = 1 x 2 x 3 x 4 x 5.

It is easy to implement recursively in Python as:

def fact(n):
    if n==1:
       return 1
    else:
       return n*fact(n-1)

As you can see, we have the simple input when n equals to 1, otherwise the value depends on a smaller input (n-1).

If you can write a recursive function for a task, you can also write an iterative version of the same task.

def fact(n):
    result = 1
    for i in range(1,n+1):
        result = result * i
    return result

The iterative version requires a variable to store intermediate results.

A more complex example is the binary search. The input for binary search is a sorted list of elements, for example numbers. It is faster to find the presence of a value in the list using the following approach:

  1. look at the middle element
  2. if it is equal to the one we are looking for, we are done
  3. if it is greater, then our number should be in the lower half of the list
  4. if it is lesser, then our number should be in the upper half of the list
  5. repeat from step 1 for the appropriate half

In this case, the recursive version is straghtforward too:

def binary_search(l,n):
    #l is the list, n is the element to be found
    if len(l)>0:
       mid = len(l)/2
       if l[mid]==n:
          return True
       elif l[mid]>n:
          return binary_search(l[:mid],n)
       elif l[mid]<n:
          return binary_search(l[mid+1:],n)
    else: #empty list
       return False

However, if we also want the position of the element, we need few more parameters:

def binary_search(l,n,low,high):
    if high<low:
       return None
    else:
       mid = (high-low)/2+low
       if l[mid]==n:
          return mid
       elif l[mid]>n:
          return binary_search(l,n,low,mid-1)
       elif l[mid]<n:
          return binary_search(l,n,mid+1,high)

To create an iterative version of this, a while loop is required:

def binary_search(l,n,low,high):
    while low<=high:
        mid = (high-low)/2+low
        if l[mid]==n:
          return mid
        elif l[mid]>n:
          high = mid-1
        elif l[mid]<n:
          low = mid+1
    return None

In this last case, there is not a clear list of inputs, like in the factorial example, but we have an exit condition similar to the one used by recursion.

Switching dictionary keys and values in Python

Yesterday a student asked how could he may swap a dictionary so that keys become values and values keys.

One naive solution may be something like that:

new_dict = dict([(value,key) for key,value in old_dict.items()])

However, this solution has a major drawback: keys are guaranteed to be unique, while values may not. Given the fact that old_dict.items() returns a list of dictionary (key,value) elements in an arbitrary order, the last key having a specific value will be the final value in the new_dict. Example:

>>> old_dict = {12:1,76:3,23:1,3:3}
>>> old_dict.items()
dict_items([(76, 3), (3, 3), (12, 1), (23, 1)])
>>> new_dict = dict([(value,key) for key,value in old_dict.items()])
>>> new_dict
{1: 23, 3: 3}

Notice that (3,3) and (23,1) are the (key,value) pairs used to build he new dictionary.

Maybe the intended result is to have a (value,[key1,key2,..]) element for each value. This can be achieved using a for loop like:

for key,value in old_dict.items():
   if value in new_dict:
       new_dict[value].append(key)
   else:
       new_dict[value]=[key]

or by using a comprehension like:

new_dict = dict([(value, [key for key,v in old_dict.items() if v==value]) for value in set(old_dict.values())])

which gives {1: [12, 23], 3: [76, 3]} as result.

The latter example loops through the dictionary N times, where N is the number of distinct values, while the former performs just one loop.

Retrieving Python dictionary values in order

Complex data in Python may be represented by lists or dictionaries, and sometimes it may be helpful to sort it.

While sorting a list is quite straightforward using

l.sort()

or, if you have a list of lists/tuples and you prefer to sort on a specific value of them:

# sorts using the second value in the element
# example list with name,age,city
l=[['Alice',34,'Berlin'],
   ['Bob',29,'London'],
   ['Carol',30,'Rome']]
l.sort(key=lambda x: x[1])
# l is now sorted by age

Dicationaries are a different thing. Suppose that you have a dictionary mapping students to their grades, like:

grades = { 'Bruce': 100,
           'Diana': 98,
           'Barbara': 78,
           'Clark': 23}

If you print the content of the dictionary it may not appear in sorted order or even using the order in which you specified the keys:

for name in grades:
	print(name,grades[name])

on my machine I get:

Bruce 100
Clark 23
Diana 98
Barbara 78

If you want to get the list sorted by names, you should first sort the keys:

names = list(grades.keys())
names.sort()
for name in names:
	print(name,grades[name])

But what about sorting by values? (I.e. sorting by grades instead of names)

By using the previously shown key parameter of the sort method, you can convert the dictionary to a list of tuples, sort it using the second element of the tuple and then you get the result:

#convert dictionary into list
l = [(x,y) for (x,y) in grades.iteritems()]
l.sort(key=lambda x: x[1])

Then iterating through the list you should get:

Clark 23
Barbara 78
Diana 98
Bruce 100

Reading a FASTA file with Python

Since one of my tutoring duties is related to basic programming tasks for bioinformatics, I’m starting from there.

A good starting point is the exercise “Build a dictionary containing sequences from a FASTA file”.

The FASTA file format is a text based representation of a biological sequences. A FASTA file contains one or multiple entries: each entry consists of an header like, starting with the character > (greater than), and a set of lines (each up to 60 characters long) containing the sequence.

Of course you can use one of the already made libraries for the task: Bio.SeqIO.to_dict() or a loop using other methods from BioPython may be faster. However, this exercise is perfect in teaching how to think like a computer scientist.

First, the problem must be solved without using code: if you don’t know how to solve a problem, you can’t write a program that performs the solution.

In general, the FASTA file contains multiple entries, so we need a way to track the current entry name. We must distinguish the header from the sequence content.

Building from the inside-out:

#distinguish header from sequence
if line[0]=='>': #or line.startswith('>')
    #it is the header
else:
    #it is sequence

In the first case, we need to store the new sequence name and create its entry in the dictionary:

#distinguish header from sequence
if line[0]=='>': #or line.startswith('>')
    #it is the header
    name = line[1:] #discarding the initial >
    seqs[name] = ''
else:
    #it is sequence

We wrap everything inside a for loop, i.e. we are reading one line at the time:

for line in f:
    #let's discard the newline (if any)
    line = line.rstrip()
    #distinguish header from sequence
    if line[0]=='>': #or line.startswith('>')
        #it is the header
        name = line[1:] #discarding the initial >
        seqs[name] = ''
    else:
        #it is sequence

Now we have different ways to manage the sequence:

  1. build a string and update the dictionary when a new header or the end of file is found
  2. update the dictionary entry belonging to the current name

In the first case:

name = None
s = ''
for line in f:
    #let's discard the newline at the end (if any)
    line = line.rstrip()
    #distinguish header from sequence
    if line[0]=='>': #or line.startswith('>')
        #it is the header
        #if this is not the first sequence (i.e. we already read one)
        if name:
            seqs[name]=s
        name = line[1:] #discarding the initial >
        seqs[name] = ''
        s = ''
    else:
        #it is sequence
        s = s + line
#we must keep in mind the last entry: it was not saved
#since we didn't find a new header
if name: #just checking against an empty file
   seqs[name]=s

In the second approach:

name = None
for line in f:
    #let's discard the newline at the end (if any)
    line = line.rstrip()
    #distinguish header from sequence
    if line[0]=='>': #or line.startswith('>')
        #it is the header
        name = line[1:] #discarding the initial >
        seqs[name] = ''
    else:
        #it is sequence
        seqs[name] = seqs[name] + line

F can be a file object from open() or a list of lines from readlines() (I discourage the second one).