Valuable recipes of core Python we love to skip, part 1


Recipe 1: Accessing private methods in Python

Python doesn’t support privacy directly, but relies on the programmer to know when it is safe to modify an attribute from the outside. After all, you should know how to use an object before using that object. It is, however, possible to achieve something like private attributes with a little trickery.

To make a method or attribute private (inaccessible from the outside), simply start its name with two underscores:
class Secretive:
    def __inaccessible(self):
        print "Bet you can't see me..."

    def accessible(self):
        print "The secret message is:"
        self.__inaccessible()

Now __inaccessible is inaccessible to the outside world, while it can still be used inside the class (for example, from accessible):

>>> s = Secretive()
>>> s.__inaccessible()
Traceback (most recent call last):
  File "<pyshell#112>", line 1, in ?
    s.__inaccessible()
AttributeError: Secretive instance has no attribute '__inaccessible'

>>> s.accessible()
The secret message is:
Bet you can't see me...

Although the double underscores are a bit strange, this seems like a standard private method, as found in other languages. What’s not so standard is what actually happens. Inside a class definition, all names beginning with a double underscore are “translated” by adding a single underscore and the class name to the beginning:

>>> Secretive._Secretive__inaccessible
<unbound method Secretive.__inaccessible>
If you know how this works behind the scenes, it is still possible to access private methods outside the class, even though you’re not supposed to:
>>> s._Secretive__inaccessible()
Bet you can't see me...

So, in short, you can’t be sure that others won’t access the methods and attributes of your objects, but this sort of name-mangling is a pretty strong signal that they shouldn’t.
If you don’t want the name-mangling effect, but you still want to send a signal for other objects to stay away, you can use a single initial underscore. This is mostly just a convention, but has some practical effects. For example, names with an initial underscore aren’t imported with starred imports.
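
For example, here is a minimal sketch of the single-underscore convention (the module name stack_module is made up for illustration):

# stack_module.py -- a hypothetical module
def push(item):
    print "pushing", item

def _helper():
    # single leading underscore: "internal, please stay away"
    print "internal detail"

# In another file:
# from stack_module import *
# push(1)        # fine, push was imported
# _helper()      # NameError: _helper is skipped by the starred import
# import stack_module
# stack_module._helper()   # still accessible if you really want it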

Recipe 2: Dictionary methods you should know perfectly

    

    has_key

The has_key method checks whether a dictionary has a given key. The expression d.has_key(k) is equivalent to k in d. The choice of which to use is largely a matter of taste, although has_key is on its way out of the language (it will be gone in Python 3.0).
Here is an example of how you might use has_key:

>>> d = {}
>>> d.has_key('name')
False
>>> d['name'] = 'Eric'
>>> d.has_key('name')
True
    

    items and iteritems

The items method returns all the items of the dictionary as a list of items in which each item is of the form (key, value). The items are not returned in any particular order:

>>> d = {'title': 'Python Web Site', 'url': 'http://www.python.org', 'spam': 0}
>>> d.items()
[('url', 'http://www.python.org'), ('spam', 0), ('title', 'Python Web Site')]
The iteritems method works in much the same way, but returns an iterator instead of a list:

>>> it = d.iteritems()
>>> it
<dictionary-itemiterator object at 0x...>
>>> list(it) # Convert the iterator to a list
[('url', 'http://www.python.org'), ('spam', 0), ('title', 'Python Web Site')]
Using iteritems may be more efficient in many cases, especially if you want to iterate over the result.
    keys and iterkeys

The keys method returns a list of the keys in the dictionary, while iterkeys returns an iterator over the keys.
    

    pop

The pop method can be used to get the value corresponding to a given key, and then remove the key-value pair from the dictionary:

>>> d = {'x': 1, 'y': 2}
>>> d.pop('x')
1
>>> d
{'y': 2}
    

    popitem

The popitem method is similar to list.pop, which pops off the last element of a list. Unlike list.pop, however, popitem pops off an arbitrary item because dictionaries don’t have a “last element” or any order whatsoever. This may be very useful if you want to remove and process the items one by one in an efficient way (without retrieving a list of the keys first):

>>> d
{'url': 'http://www.python.org', 'spam': 0, 'title': 'Python Web Site'}
>>> d.popitem()
('url', 'http://www.python.org')
>>> d
{'spam': 0, 'title': 'Python Web Site'}

Although popitem is similar to the list method pop, there is no dictionary equivalent of append (which adds an element to the end of a list). Because dictionaries have no order, such a method wouldn’t make any sense.
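
If you do want to remove and process items one by one, here is a small sketch (my own, not from the text above) that drains a dictionary with popitem:

d = {'spam': 0, 'title': 'Python Web Site', 'url': 'http://www.python.org'}
while d:                       # keep going until the dictionary is empty
    key, value = d.popitem()   # remove and return an arbitrary (key, value) pair
    print key, '=>', value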
    

    setdefault

The setdefault method is somewhat similar to get, in that it retrieves a value associated with a given key. In addition to the get functionality, setdefault sets the value corresponding to the given key if it is not already in the dictionary:

>>> d = {}
>>> d.setdefault('name', 'N/A')
'N/A'
>>> d
{'name': 'N/A'}
>>> d['name'] = 'Gumby'
>>> d.setdefault('name', 'N/A')
'Gumby'
>>> d
{'name': 'Gumby'}

As you can see, when the key is missing, setdefault returns the default and updates the dictionary accordingly. If the key is present, its value is returned and the dictionary is left unchanged. The default is optional, as with get; if it is left out, None is used:

>>> d = {}
>>> print d.setdefault('name')
None
>>> d
{'name': None}
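
A common idiom built on setdefault (my own example, not from the original text) is grouping values into a dictionary of lists without checking for missing keys first:

pairs = [('fruit', 'apple'), ('veg', 'carrot'), ('fruit', 'banana')]
groups = {}
for kind, name in pairs:
    # insert an empty list the first time a key is seen, then append either way
    groups.setdefault(kind, []).append(name)
print groups   # {'fruit': ['apple', 'banana'], 'veg': ['carrot']} (key order may vary)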

    

    update


The update method updates one dictionary with the items of another:

>>> d = {
'title': 'Python Web Site',
'url': 'http://www.python.org',
'changed': 'Mar 14 22:09:15 MET 2008'}
>>> x = {'title': 'Python Language Website'}
>>> d.update(x)
>>> d
{'url': 'http://www.python.org', 'changed':
'Mar 14 22:09:15 MET 2008', 'title': 'Python Language Website'}

The items in the supplied dictionary are added to the old one, supplanting any items there with the same keys.
The update method can be called in the same way as the dict function (or type constructor). This means that update can be called with a mapping, a sequence (or other iterable object) of (key, value) pairs, or keyword arguments.
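
As a quick sketch (my own, not from the original text), here are the three call styles; each call adds new items to d:

d = {'title': 'Python Web Site'}
d.update({'url': 'http://www.python.org'})            # another mapping
d.update([('changed', 'Mar 14 22:09:15 MET 2008')])   # a sequence of (key, value) pairs
d.update(spam=0)                                       # keyword arguments
print d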
    values and itervalues

The values method returns a list of the values in the dictionary (and itervalues returns an iterator over the values). Unlike keys, the list returned by values may contain duplicates:

>>> d = {}
>>> d[1] = 1
>>> d[2] = 2
>>> d[3] = 3
>>> d[4] = 1
>>> d.values()
[1, 2, 3, 1]

Recipe 3: How nested functions in Python can be used productively

Python functions may be nested; you can put one inside another. Here is an example:
def foo():
    def bar():
        print "Hello, world!"
    bar()

Nesting is normally not all that useful, but there is one particular application that stands out: using one function to “create” another. This means that you can (among other things) write functions like the following:

def multiplier(factor):
    def multiplyByFactor(number):
        return number*factor
    return multiplyByFactor

One function is inside another, and the outer function returns the inner one; that is, the function itself is returned—it is not called. What’s important is that the returned function still has access to the scope where it was defined; in other words, it carries its environment (and the associated local variables) with it!

Each time the outer function is called, the inner one gets redefined, and each time, the variable factor may have a new value. Because of Python’s nested scopes, this variable from the outer local scope (of multiplier) is accessible in the inner function later on, as follows:

>>> double = multiplier(2)
>>> double(5)
10
>>> triple = multiplier(3)
>>> triple(3)
9
>>> multiplier(5)(4)
20

A function such as multiplyByFactor that stores its enclosing scopes is called a closure. Normally, you cannot rebind variables in outer scopes. In Python 3.0, however, the keyword nonlocal is introduced. It is used in much the same way as global, and lets you assign to variables in outer (but nonglobal) scopes.
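
For completeness, here is a minimal Python 3 sketch of nonlocal (this example is mine, not part of the original text):

def counter():
    count = 0
    def increment():
        nonlocal count   # rebind the variable in the enclosing (non-global) scope
        count += 1
        return count
    return increment

tick = counter()
print(tick())   # 1
print(tick())   # 2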

Recipe 4: Launching a web browser from a Python program

The os.system function is useful for a lot of things, but for the specific task of launching a web browser, there’s an even better solution: the webbrowser module. It contains a function called open, which lets you automatically launch a web browser to open the given URL. For example, if you want your program to open the impythonist web site in a web browser (either starting a new browser or using one that is already running), you simply use this:

import webbrowser
webbrowser.open("http://www.impythonist.co.nr")

The page should pop up. Pretty nifty, huh?
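
If you want a bit more control, the webbrowser module also offers open_new and open_new_tab (whether you actually get a new window or a new tab depends on the browser and platform):

import webbrowser

webbrowser.open_new("http://www.impythonist.co.nr")      # request a new browser window
webbrowser.open_new_tab("http://www.impythonist.co.nr")  # request a new tab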
We will discuss a lot more. Stay tuned.

How to use Python's expressive power to solve a level-5 challenge

Hello friends. It took too long for me to post a new article, but Impythonist has come up with a really interesting insight this time.

OK, what does the title refer to? We have all been programming day and night, so why do we stumble when a new challenge is put in front of us?
This article is not intended for hardcore coders; its main focus is on beginners and intermediate programmers.

“Coding is an art that rises from deep within the heart”

So here we are going to look at a use-case challenge of the kind that arrives frequently in local or online coding contests. We could use any programming
language, but as a Pythonist I was privileged to use Python, a beautiful and elegant programming language.
The problem goes as follows:

CHALLENGE

“”” You are provided with a text file containing a list of integers.

Let S be the longest subsequence of consecutive increasing integers in this list,
such that the element at position n is greater than the element at position n-1.
If there are multiple sub-sequences of the same length, let S be the one that appears first in the list.
What is the absolute value of the sum of the elements in S?

For example, take the list [3, 4, 2, 5, 7, 7, 1, 2].
For this list, the longest subsequence of increasing integers has 3 elements: S = [2, 5, 7].
The absolute value of the sum of these elements is |2+5+7|=14
The text file is below:

https://app.box.com/s/7fjlraqe5dm6ik54cbgg

Step 1: Understand The Problem

It is not a simple question, but thinking about it a bit leads us to theories for attacking the problem.
At first sight the question suggests dynamic programming, but only a little application of dynamic programming is required here.

Step 2: Run the Python Interpreter

Now think about how the solution to this challenge could begin. What data types are required? What Python concepts are involved?

A little observation of the problem statement leads to the following conclusions:

1. File handling in Python will be needed.
2. A little string handling is required.
3. List manipulation is necessary, and nothing more.

Now go and get some coffee and roll up your sleeves before we get to the technique.

Step 3: No more steps. Jumping right into the code.

# My Version of the Solution to the Subsequence Challenge

 

1. fil=open("num.txt","r").readlines()
2. mylo=[]
3. for l in fil:
4.     mylo.append(int(l))
5. long_seq,i,temp,max_seq_index=0,0,1,0
6. while i<len(mylo):
7.     if long_seq>temp: temp,max_seq_index=long_seq,i-1
8.     long_seq,j=1,i
9.     while j!=len(mylo)-1 and mylo[j]<mylo[j+1] :
10.         long_seq+=1
11.         j+=1
12.     i+=1
13. total=abs(sum(mylo[max_seq_index:max_seq_index+temp]))
14. print "The longest sequence no in File is: %d and sum is %d"% (temp,total)

 

Just observe the above code: it is the complete solution to the problem. Your feet would ache if you tried to find
the solution in other, more traditional languages. Don't be panicked by the code; we are going to discuss each and every line.
line 1:
fil=open("num.txt","r").readlines()

It is simple: we are opening a file called num.txt in read mode. The readlines() method of the file object returns a list of
strings, one for each line in the file. Here num.txt is in the current directory; a path can be specified instead.

line 2:
mylo=[]

Here we plan to capture each line as a single value and store it in a list, so we create an empty list first.

line 3,4:
for l in fil:
mylo.append(int(l))

Now, for each element in the returned list fil, we cast the value to int and store it in the list mylo.
This gives us a list of values to work on. The actual logic starts in the next lines.

line 5:
long_seq,i,temp,max_seq_index=0,0,1,0

In this line we assign four variables in a single statement (multiple assignment). Here is what the variables mean.

long_seq ———> This variable stores the length of the increasing run measured at each iteration.
It may seem alien now, but things will settle down in a moment.

i ———-> This variable is used to iterate over each element of the list.

temp ———-> This variable stores the longest subsequence length found so far. It holds part of the answer:
the maximum number of elements satisfying the required condition s[i]<s[i+1].
It is initialized to 1 because at least one element always satisfies the condition.

max_seq_index —–> This variable holds the starting index of the longest subsequence. We later use it to compute
the sum of all elements of that longest subsequence.

line 6:
while i<len(mylo):

Here we iterate from the first element of the list mylo up to the last element.

line 7:
if long_seq>temp: temp,max_seq_index=long_seq,i-1

Here comes the tricky part. This construct is written by looking one step back, which is the dynamic-programming flavour of the solution. It says: if the run length
found in the previous iteration is greater than the previously recorded one, then update temp and store the index where that run starts into max_seq_index.
Why i-1 and not i? Because long_seq currently holds the length of the run that was measured starting at the previous position. This will become clear in a few minutes.

line 8:
long_seq,j=1,i

long_seq is a temporary variable for each iteration; it is a scout that works for temp. The analogy goes like this: if the money collected
by long_seq is greater, temp takes it; otherwise temp doesn't care. long_seq is sent into the battlefield initialized with one rupee, and whatever it
earns in the inner loop below (the length of the run it finds) is later handed over to temp. j is initialized to i so we can check whether a longer
subsequence starts at this position.

line 9-11:

while j!=len(mylo)-1 and mylo[j]<mylo[j+1] :
long_seq+=1
j+=1

The check j!=len(mylo)-1 guards against an index-out-of-bounds error in the next condition, mylo[j]<mylo[j+1]. If both conditions pass, then j
moves on to the next position and long_seq gains one point for the satisfied condition. This continues as long as the increasing run keeps
going; when the condition fails, i restarts with the next element.

line 12:

i+=1

Simple: this advances i to the next element of mylo.

line 13:

total=abs(sum(mylo[max_seq_index:max_seq_index+temp]))

Here total stores the sum of all the elements of the longest subsequence satisfying the condition. The temp value is used to slice the list
mylo and keep only the part that is the longest subsequence. The built-in sum() adds up the sliced list and abs() returns the absolute
value, converting a negative sum into a positive one.

line 14:

print "The longest sequence no in File is: %d and sum is %d"% (temp,total)

As is clear, this line prints the length of the longest subsequence and the sum asked for in the challenge.
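
For comparison, here is a more compact sketch of the same idea (my own version, not the code explained above; it assumes num.txt contains one integer per line):

mylo = [int(line) for line in open("num.txt")]

best_len, best_start = 1, 0
cur_len, cur_start = 1, 0
for i in range(1, len(mylo)):
    if mylo[i] > mylo[i-1]:              # the increasing run continues
        cur_len += 1
    else:                                # run broken: start a new one here
        cur_len, cur_start = 1, i
    if cur_len > best_len:               # keep the first longest run
        best_len, best_start = cur_len, cur_start
print abs(sum(mylo[best_start:best_start+best_len]))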

In this way Python constructs are highly optimized and can be used to write new libraries at lightning speed. The dynamic-programming approach here might
confuse you a little, but if you follow the execution path you will see how cleverly the problem was solved.
This is a small effort to improve design strategies for beginners. For further discussion, feel free to mail me.

my mail address : narenarya@live.com

Peter Norvig's 21-line spelling corrector using probability theory


Remembering the legend; in his own words, the story follows:

How to Write a Spelling Corrector

In the past week, two friends (Dean and Bill) independently told me they were amazed at how Google does spelling correction so well and quickly. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. (Yahoo and Microsoft are similar.) What surprised me is that I thought Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about statistical language processing problems such as spelling correction. But they didn’t, and come to think of it, there’s no reason they should: it was my expectations that were faulty, not their knowledge.

I figured they and many others could benefit from an explanation. The full details of an industrial-strength spell corrector are quite complex (you can read a little about it here or here). What I wanted to do here is to develop, in less than a page of code, a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.

So here, in 21 lines of Python 2.5 code, is the complete spelling corrector (almost all of the code works with 2.7 as well):

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The code defines the function correct, which takes a word as input and returns a likely correction of that word. For example:

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'

The version of edits1 shown here is a variation on one proposed by Darius Bacon; I think this is clearer than the version I originally had. Darius also fixed a bug in the function correct.

How It Works: Some Probability Theory

How does it work? First, a little theory. Given a word, we are trying to choose the most likely spelling correction for that word (the “correction” may be the original word itself). There is no way to know for sure (for example, should “lates” be corrected to “late” or “latest”?), which suggests we use probabilities. We will say that we are trying to find the correction c, out of all possible corrections, that maximizes the probability of c given the original word w:

argmax_c P(c|w)

By Bayes’ Theorem this is equivalent to:

argmax_c P(w|c) P(c) / P(w)

Since P(w) is the same for every possible c, we can ignore it, giving:

argmax_c P(w|c) P(c)

There are three parts of this expression. From right to left, we have:

  1. P(c), the probability that a proposed correction c stands on its own. This is called the language model: think of it as answering the question “how likely is c to appear in an English text?” So P(“the”) would have a relatively high probability, while P(“zxzxzxzyyy”) would be near zero.
  2. P(w|c), the probability that w would be typed in a text when the author meant c. This is the error model: think of it as answering “how likely is it that the author would type w by mistake when c was intended?”
  3. argmax_c, the control mechanism, which says to enumerate all feasible values of c, and then choose the one that gives the best combined probability score.

One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w=”thew” and the two candidate corrections c=”the” and c=”thaw”. Which has a higher P(c|w)? Well, “thaw” seems good because the only change is “a” to “e”, which is a small change. On the other hand, “the” seems good because “the” is a very common word, and perhaps the typist’s finger slipped off the “e” onto the “w”. The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.

Now we are ready to show how the program works. First P(c). We will read a big text file, big.txt, which consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. (On the plane home when I was writing the first version of the code all I had was a collection of Sherlock Holmes stories that happened to be on my laptop; I added the other sources later and stopped adding texts when they stopped helping, as we shall see in the Evaluation section.)

We then extract the individual words from the file (using the function words, which converts everything to lowercase, so that “the” and “The” will be the same and then defines a word as a sequence of alphabetic characters, so “don’t” will be seen as the two words “don” and “t”). Next we train a probability model, which is a fancy way of saying we count how many times each word occurs, using the function train. It looks like this:

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

At this point, NWORDS[w] holds a count of how many times the word w has been seen. There is one complication: novel words. What happens with a perfectly good word of English that wasn’t seen in our training data? It would be bad form to say the probability of a word is zero just because we haven’t seen it yet. There are several standard approaches to this problem; we take the easiest one, which is to treat novel words as if we had seen them once. This general process is called smoothing, because we are smoothing over the parts of the probability distribution that would have been zero, bumping them up to the smallest possible count. This is achieved through the class collections.defaultdict, which is like a regular Python dict (what other languages call hash tables) except that we can specify the default value of any key; here we use 1.
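
As a tiny illustration of that smoothing trick (my own snippet, not from the essay), a defaultdict built with lambda: 1 reports a count of 1 for any word it has never seen:

import collections

counts = collections.defaultdict(lambda: 1)   # unseen keys default to a count of 1
counts['the'] += 1
print counts['the']          # 2 -- the default 1 plus one observation
print counts['zxzxzxzyyy']   # 1 -- never observed, but not zero
# Note: the lookup itself inserts the key, the side effect discussed in Update 2 below.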

Now let’s look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). Here’s a function that returns a set of all words c that are one edit away from w:

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, len(edits1(‘something’)) — that is, the number of elements in the result of edits1(‘something’) — is 494.
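
A quick sanity check of those counts (my own snippet; it assumes the edits1 function defined above is in scope):

word = 'something'
n = len(word)
print n + (n - 1) + 26*n + 26*(n + 1)   # 511 = 54n + 25 candidate edits for n = 9
print len(edits1(word))                  # 494 once duplicates collapse into the set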

The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target. As we shall see shortly, I put together a development corpus of 270 spelling errors, and found that only 76% of them have edit distance 1. Perhaps the examples I found are harder than typical errors. Anyway, I thought this was not good enough, so we’ll need to consider edit distance 2. That’s easy: just apply edits1 to all the results of edits1:

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

This is easy to write, but we’re starting to get into some serious computation: len(edits2(‘something’)) is 114,324. However, we do get good coverage: of the 270 test cases, only 3 have an edit distance greater than 2. That is, edits2 will cover 98.9% of the cases; that’s good enough for me. Since we aren’t going beyond edit distance 2, we can do a small optimization: only keep the candidates that are actually known words. We still have to consider all the possibilities, but we don’t have to build up a big set of them. The function known_edits2 does this:

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

Now, for example, known_edits2(‘something’) is a set of just 4 words: {‘smoothing’, ‘seething’, ‘something’, ‘soothing’}, rather than the set of 114,324 words generated by edits2. That speeds things up by about 10%.

Now the only part left is the error model, P(w|c). Here’s where I ran into difficulty. Sitting on the plane, with no internet connection, I was stymied: I had no training data to build a model of spelling errors. I had some intuitions: mistaking one vowel for another is more probable than mistaking two consonants; making an error on the first letter of a word is less probable, etc. But I had no numbers to back that up. So I took a shortcut: I defined a trivial model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. By “known word” I mean a word that we have seen in the language model training data — a word in the dictionary. We can implement this strategy as follows:

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The function correct chooses as the set of candidate words the set with the shortest edit distance to the original word, as long as the set has some known words. Once it identifies the candidate set to consider, it chooses the element with the highest P(c) value, as estimated by the NWORDS model.

Evaluation

Now it is time to evaluate how well this program does. On the plane I tried a few examples, and it seemed okay. After my plane landed, I downloaded Roger Mitton’s Birkbeck spelling error corpus from the Oxford Text Archive. From that I extracted two test sets of corrections. The first is for development, meaning I get to look at it while I’m developing the program. The second is a final test set, meaning I’m not allowed to look at it, nor change my program after evaluating on it. This practice of having two sets is good hygiene; it keeps me from fooling myself into thinking I’m doing better than I am by tuning the program to one specific set of tests. Here I show an excerpt of the two tests and the function to run them; to see the complete set of tests (along with the rest of the program), see the file spell.py.

tests1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
    'accomodation acommodation acomodation', 'account': 'acount', ...}

tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
    'supposedly': 'supposidly', 'embellishing': 'embelishing', ...}

def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.clock()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target,wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w!=target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print '%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target])
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.clock()-start) )

print spelltest(tests1)
print spelltest(tests2) ## only do this after everything is debugged

This gives the following output:

{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 16, 'pct': 74, 'n': 270}
{'bad': 130, 'bias': None, 'unknown': 43, 'secs': 26, 'pct': 67, 'n': 400}

So on the development set of 270 cases, we get 74% correct in 13 seconds (a rate of 17 Hz), and on the final test set we get 67% correct (at 15 Hz).

Update: In the original version of this essay I incorrectly reported a higher score on both test sets, due to a bug. The bug was subtle, but I should have caught it, and I apologize for misleading those who read the earlier version. In the original version of spelltest, I left out the if bias: in the fourth line of the function (and the default value was bias=0, not bias=None). I figured that when bias = 0, the statement NWORDS[target] += bias would have no effect. In fact it does not change the value of NWORDS[target], but it does have an effect: it makes (target in NWORDS) true. So in effect the spelltest routine was cheating by making all the unknown words known. This was a humbling error, and I have to admit that much as I like defaultdict for the brevity it adds to programs, I think I would not have had this bug if I had used regular dicts.

Update 2: defaultdict strikes again. Darius Bacon pointed out that in the function correct, I had accessed NWORDS[w]. This has the unfortunate side-effect of adding w to the defaultdict, if w was not already there (i.e., if it was an unknown word). Then the next time, it would be present, and we would get the wrong answer. Darius correctly suggested changing to NWORDS.get. (This works because max(None, i) is i for any integer i.)

In conclusion, I met my goals for brevity, development time, and runtime speed, but not for accuracy.

Future Work

Let’s think about how we could do better. (I’ve done some more in a separate chapter for a book.) We’ll again look at all three factors of the probability model: (1) P(c); (2) P(w|c); and (3) argmax_c. We’ll look at examples of what we got wrong. Then we’ll look at some factors beyond the three…

  1. P(c), the language model. We can distinguish two sources of error in the language model. The more serious is unknown words. In the development set, there are 15 unknown words, or 5%, and in the final test set, 43 unknown words or 11%. Here are some examples of the output of spelltest with verbose=True:
    correct('economtric') => 'economic' (121); expected 'econometric' (1)
    correct('embaras') => 'embargo' (8); expected 'embarrass' (1)
    correct('colate') => 'coat' (173); expected 'collate' (1)
    correct('orentated') => 'orentated' (1); expected 'orientated' (1)
    correct('unequivocaly') => 'unequivocal' (2); expected 'unequivocally' (1)
    correct('generataed') => 'generate' (2); expected 'generated' (1)
    correct('guidlines') => 'guideline' (2); expected 'guidelines' (1)

    In this output we show the call to correct and the result (with the NWORDS count for the result in parentheses), and then the word expected by the test set (again with the count in parentheses). What this shows is that if you don’t know that ‘econometric’ is a word, you’re not going to be able to correct ‘economtric’. We could mitigate by adding more text to the training corpus, but then we also add words that might turn out to be the wrong answer. Note the last four lines above are inflections of words that do appear in the dictionary in other forms. So we might want a model that says it is okay to add ‘-ed’ to a verb or ‘-s’ to a noun.

    The second potential source of error in the language model is bad probabilities: two words appear in the dictionary, but the wrong one appears more frequently. I must say that I couldn’t find cases where this is the only fault; other problems seem much more serious.

    We can simulate how much better we might do with a better language model by cheating on the tests: pretending that we have seen the correctly spelled word 1, 10, or more times. This simulates having more text (and just the right text) in the language model. The function spelltest has a parameter, bias, that does this. Here’s what happens on the development and final test sets when we add more bias to the correctly-spelled words:

    Bias    Dev.   Test
       0    74%    67%
       1    74%    70%
      10    76%    73%
     100    82%    77%
    1000    89%    80%

    On both test sets we get significant gains, approaching 80-90%. This suggests that it is possible that if we had a good enough language model we might get to our accuracy goal. On the other hand, this is probably optimistic, because as we build a bigger language model we would also introduce words that are the wrong answer, which this method does not do.

    Another way to deal with unknown words is to allow the result of correct to be a word we have not seen. For example, if the input is “electroencephalographicallz”, a good correction would be to change the final “z” to a “y”, even though “electroencephalographically” is not in our dictionary. We could achieve this with a language model based on components of words: perhaps on syllables or suffixes (such as “-ally”), but it is far easier to base it on sequences of characters: 2-, 3- and 4-letter sequences.

  2. P(w|c), the error model. So far, the error model has been trivial: the smaller the edit distance, the smaller the error. This causes some problems, as the examples below show. First, some cases where correct returns a word at edit distance 1 when it should return one at edit distance 2:
    correct('reciet') => 'recite' (5); expected 'receipt' (14)
    correct('adres') => 'acres' (37); expected 'address' (77)
    correct('rember') => 'member' (51); expected 'remember' (162)
    correct('juse') => 'just' (768); expected 'juice' (6)
    correct('accesing') => 'acceding' (2); expected 'assessing' (1)

    Here, for example, the alteration of ‘d’ to ‘c’ to get from ‘adres’ to ‘acres’ should count more than the sum of the two changes ‘d’ to ‘dd’ and ‘s’ to ‘ss’.

    Also, some cases where we choose the wrong word at the same edit distance:

    correct('thay') => 'that' (12513); expected 'they' (4939)
    correct('cleark') => 'clear' (234); expected 'clerk' (26)
    correct('wer') => 'her' (5285); expected 'were' (4290)
    correct('bonas') => 'bones' (263); expected 'bonus' (3)
    correct('plesent') => 'present' (330); expected 'pleasant' (97)

    The same type of lesson holds: In ‘thay’, changing an ‘a’ to an ‘e’ should count as a smaller change than changing a ‘y’ to a ‘t’. How much smaller? It must be at least a factor of 2.5 to overcome the prior probability advantage of ‘that’ over ‘they’.

    Clearly we could use a better model of the cost of edits. We could use our intuition to assign lower costs for doubling letters and changing a vowel to another vowel (as compared to an arbitrary letter change), but it seems better to gather data: to get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. We need a lot of data to do this well. If we want to look at the change of one character for another, given a window of two characters on each side, that’s 26^6, which is over 300 million characters. You’d want several examples of each, on average, so we need at least a billion characters of correction data; probably safer with at least 10 billion.

    Note there is a connection between the language model and the error model. The current program has such a simple error model (all edit distance 1 words before any edit distance 2 words) that it handicaps the language model: we are afraid to add obscure words to the model, because if one of those obscure words happens to be edit distance 1 from an input word, then it will be chosen, even if there is a very common word at edit distance 2. With a better error model we can be more aggressive about adding obscure words to the dictionary. Here are some examples where the presence of obscure words in the dictionary hurts us:

    correct('wonted') => 'wonted' (2); expected 'wanted' (214)
    correct('planed') => 'planed' (2); expected 'planned' (16)
    correct('forth') => 'forth' (83); expected 'fourth' (79)
    correct('et') => 'et' (20); expected 'set' (325)
  3. The enumeration of possible corrections, argmax_c. Our program enumerates all corrections within edit distance 2. In the development set, only 3 words out of 270 are beyond edit distance 2, but in the final test set, there were 23 out of 400. Here they are:
    purple perpul
    curtains courtens
    minutes muinets
    
    successful sucssuful
    hierarchy heiarky
    profession preffeson
    weighted wagted
    inefficient ineffiect
    availability avaiblity
    thermawear thermawhere
    nature natior
    dissension desention
    unnecessarily unessasarily
    disappointing dissapoiting
    acquaintances aquantences
    thoughts thorts
    criticism citisum
    immediately imidatly
    necessary necasery
    necessary nessasary
    necessary nessisary
    unnecessary unessessay
    night nite
    minutes muiuets
    assessing accesing
    necessitates nessisitates
    

    We could consider extending the model by allowing a limited set of edits at edit distance 3. For example, allowing only the insertion of a vowel next to another vowel, or the replacement of a vowel for another vowel, or replacing close consonants like “c” to “s” would handle almost all these cases.

  4. There’s actually a fourth (and best) way to improve: change the interface to correct to look at more context. So far, correct only looks at one word at a time. It turns out that in many cases it is difficult to make a decision based only on a single word. This is most obvious when there is a word that appears in the dictionary, but the test set says it should be corrected to another word anyway:
    correct('where') => 'where' (123); expected 'were' (452)
    correct('latter') => 'latter' (11); expected 'later' (116)
    correct('advice') => 'advice' (64); expected 'advise' (20)

    We can’t possibly know that correct('where') should be ‘were’ in at least one case, but should remain ‘where’ in other cases. But if the query had been correct('They where going') then it seems likely that “where” should be corrected to “were”.

    The context of the surrounding words can help when there are obvious errors, but two or more good candidate corrections. Consider:

    correct('hown') => 'how' (1316); expected 'shown' (114)
    correct('ther') => 'the' (81031); expected 'their' (3956)
    correct('quies') => 'quiet' (119); expected 'queries' (1)
    correct('natior') => 'nation' (170); expected 'nature' (171)
    correct('thear') => 'their' (3956); expected 'there' (4973)
    correct('carrers') => 'carriers' (7); expected 'careers' (2)

    Why should ‘thear’ be corrected as ‘there’ rather than ‘their’? It is difficult to tell by the single word alone, but if the query were correct('There's no there thear') it would be clear.

    To build a model that looks at multiple words at a time, we will need a lot of data. Fortunately, Google has released a database of word counts for all sequences up to five words long, gathered from a corpus of a trillion words.

    I believe that a spelling corrector that scores 90% accuracy will need to use the context of the surrounding words to make a choice. But we’ll leave that for another day…

  5. We could improve our accuracy scores by improving the training data and the test data. We grabbed a million words of text and assumed they were all spelled correctly; but it is very likely that the training data contains several errors. We could try to identify and fix those. Less daunting a task is to fix the test sets. I noticed at least three cases where the test set says our program got the wrong answer, but I believe the program’s answer is better than the expected answer:
    correct('aranging') => 'arranging' (20); expected 'arrangeing' (1)
    correct('sumarys') => 'summary' (17); expected 'summarys' (1)
    correct('aurgument') => 'argument' (33); expected 'auguments' (1)

    We could also decide what dialect we are trying to train for. The following three errors are due to confusion about American versus British spelling (our training data contains both):

    correct('humor') => 'humor' (17); expected 'humour' (5)
    correct('oranisation') => 'organisation' (8); expected 'organization' (43)
    correct('oranised') => 'organised' (11); expected 'organized' (70)
  6. Finally, we could improve the implementation by making it much faster, without changing the results. We could re-implement in a compiled language rather than an interpreted one. We could have a lookup table that is specialized to strings rather than Python’s general-purpose dict. We could cache the results of computations so that we don’t have to repeat them multiple times. One word of advice: before attempting any speed optimizations, profile carefully to see where the time is actually going.
  7. Norvig is a pioneer not only in Python but in other fields as well. For his complete works, visit norvig.com (no www before the URL).

Mongo DB for Python Developers

Introduction to NoSQL Architecture with MongoDB for  Python Developers

This article covers using MongoDB as a way to get started with NoSQL. It presents an introduction to considering NoSQL, why MongoDB is a good NoSQL implementation to get started with, MongoDB shortcomings and tradeoffs, key MongoDB developer concepts, MongoDB architectural concepts (sharding, replica sets), using the console to learn MongoDB, and getting started with MongoDB for Python, Java and PHP developers. The article uses MongoDB, but many concepts introduced are common to other NoSQL solutions. The article should be useful for developers, ops and DevOps people who are new to NoSQL.

From no NoSQL to sure why not

The first time I heard of something that actually could be classified as NoSQL was from Warner Onstine, he is currently working on some CouchDB articles for InfoQ. Warner was going on and on about how great CouchDB was. This was before the term NoSQL was coined. I was skeptical, and had just been on a project that was converted from an XML Document Database back to Oracle due to issues with the XML Database implementation. I did the conversion. I did not pick the XML Database solution, or decide to convert it to Oracle. I was just the consultant guy on the project (circa 2005) who did the work after the guy who picked the XML Database moved on and the production issues started to happen.

This was my first document database. It bred skepticism and distrust of databases that were not established RDBMSs (Oracle, MySQL, etc.), although the incident did not create that skepticism on its own. Let me explain.

First there were all of the Object Oriented Database (OODB) folks for years preaching how it was going to be the next big thing. It did not happen yet. I hear 2013 will be the year of the OODB just like it was going to be 1997. Then there were the XML Database people preaching something very similar, which did not seem to happen either at least at the pervasive scale that NoSQL is happening.

My take was, ignore this document oriented approach and NoSQL, see if it goes away. To be successful, it needs some community behind it, some clear use case wins, and some corporate muscle/marketing, and I will wait until then. Sure the big guys need something like Dynamo and BigTable, but it is a niche I assumed. Then there was BigTable, MapReduce, Google App Engine, Dynamo in the news with white papers. Then Hadoop, Cassandra, MongoDB, Membase, HBase, and the constant small but growing drum beat of change and innovation. Even skeptics have limits.

Then in 2009, Eric Evans coined the term NoSQL to describe the growing list of open-source distributed databases. Now there is this NoSQL movement, three years in and counting. Like Ajax, giving something a name seems to inspire its growth, or perhaps we don’t name movements until there is already a groundswell. Either way, having a name like NoSQL with a common vision is important to changing the world, and you can see the community, use case wins, and corporate marketing muscle behind NoSQL. It has gone beyond the buzz stage. Also in 2009 came the first project I worked on that had mass scale-out requirements and used something classified as part of NoSQL.

2009 was also when MongoDB was released by 10gen, and the NoSQL movement was in full swing. Somehow MongoDB managed to move to the front of the pack in terms of mindshare, followed closely by Cassandra and others (see figure 1). MongoDB is listed as a top job trend on Indeed.com, #2 to be exact (behind HTML 5 and before iOS), which is fairly impressive given that MongoDB was a relative latecomer to the NoSQL party.

Figure 1: MongoDB leads the NoSQL pack

MongoDB takes early lead in NoSQL adoption race.

MongoDB is a distributed document-oriented, schema-less storage solution similar to CouchBase and CouchDB. MongoDB uses JSON-style documents to represent, query and modify data. Internally data is stored in BSON (binary JSON). MongoDB’s closest cousins seem to be CouchDB/Couchbase. MongoDB supports many clients/languages, namely, Python, PHP, Java, Ruby, C++, etc. This article is going to introduce key MongoDB concepts and then show basic CRUD/Query examples in JavaScript (part of MongoDB console application), Java, PHP and Python.

Disclaimer: I have no ties with the MongoDB community and no vested interest in their success or failure. I am not an advocate. I merely started to write about MongoDB because they seem to be the most successful, seem to have the most momentum for now, and in many ways typify the very diverse NoSQL market. MongoDB's success is largely due to having easy-to-use, familiar tools. I’d love to write about CouchDB, Cassandra, CouchBase, Redis, HBase or any number of NoSQL solutions if there were just more hours in the day, or stronger coffee, or if coffee somehow extended how much time I had. Redis seems truly fascinating.

MongoDB seems to have the right mix of features and ease of use, and has become a prototypical example of what a NoSQL solution should look like. MongoDB can be used as a sort of base of knowledge for understanding other solutions (compare/contrast). This article is not an endorsement. That said, if you want to get started with NoSQL, MongoDB is a great choice.

MongoDB: a gateway drug to NoSQL

MongoDB's combination of features, simplicity, community, and documentation makes it successful. The product itself has high availability, journaling (which is not always a given with NoSQL solutions), replication, auto-sharding, map-reduce, and an aggregation framework (so you don't have to use map-reduce directly for simple aggregations). MongoDB can scale reads as well as writes.

NoSQL, in general, has been reported to be more agile than a full RDBMS/SQL stack because of the problems with schema migration in SQL-based systems. Having been on large RDBMS systems and witnessed the trouble and toil of SQL schema migrations, I can tell you that this is a real pain to deal with. RDBMS/SQL often requires a lot of upfront design or a lot of schema migration later. In this way, NoSQL is viewed as more agile in that it lets applications worry about differences in versioning instead of forcing schema migration and larger upfront designs. To the MongoDB crowd, it is said that MongoDB has a dynamic schema, not no schema (sort of like the dynamic language versus untyped language argument from Ruby, Python, etc. developers).

MongoDB does not seem to require a lot of ramp-up time. Its early success may be attributed to the quality and ease of use of its client drivers, which were more of an afterthought for other NoSQL solutions (“Hey, here is our REST or XYZ wire protocol, deal with it yourself”). Compared to other NoSQL solutions, it has been said that MongoDB is easier to get started with. Also, with MongoDB many DevOps things come cheaply or free. This is not to say there are never any problems or that one should not do capacity planning. MongoDB has become for many an easy on-ramp to NoSQL, a gateway drug if you will.

MongoDB was built to be fast. Speed is a good reason to pick MongoDB, and raw speed shaped its architecture. Data is stored in memory using memory-mapped files. This means that the virtual memory manager, a highly optimized subsystem of modern operating systems, does the paging/caching. MongoDB also pads areas around documents so that they can be modified in place, making updates less expensive. MongoDB uses a binary protocol instead of REST like some other implementations. Also, data is stored in a binary format (BSON) instead of text (JSON, XML), which can speed up writes and reads.

Another reason MongoDB may do well is that it is easy to scale out reads and writes with replica sets and autosharding. You might expect that if MongoDB is so great there would be a lot of big names using it, and there are: MTV, Craigslist, Disney, Shutterfly, Foursquare, bit.ly, The New York Times, Barclay's, The Guardian, SAP, Forbes, National Archives UK, Intuit, GitHub, LexisNexis and many more.

Caveats

MongoDB indexes may not be as flexible as those of Oracle/MySQL/Postgres or even other NoSQL solutions. The order of index fields matters, as MongoDB uses B-trees. Real-time queries might not be as fast as in Oracle/MySQL and other NoSQL solutions, especially when dealing with array fields, and the query-plan optimization work is not as far along as in more mature solutions. You can quite easily make sure MongoDB is using the indexes you set up with the explain function. Not to scare you: MongoDB is good enough if your queries are simple and you do a little homework, and it is always improving.

MongoDB does not have or integrate a full text search engine like many other NoSQL solutions do (many use Lucene under the covers for indexing), although it seems to support basic text search better than most traditional databases.

Every version of MongoDB seems to add more features and address shortcomings of the previous releases. MongoDB added journaling a while back so it can have single-server durability; prior to this you really needed a replica or two to ensure some level of data safety. 10gen improved Mongo's replication and high availability with replica sets.

Another issue with current versions of MongoDB (2.0.5) is the lack of concurrency due to MongoDB having a global read/write lock, which allows reads to happen concurrently while write operations happen one at a time. There are workarounds involving shards and/or replica sets, but they are not always ideal and do not fit the "it should just work" mantra of MongoDB. Recently at the MongoSF conference, Dwight Merriman, co-founder and CEO of 10gen, discussed the concurrency internals of MongoDB v2.2 (a future release). Dwight explained that MongoDB 2.2 did a major refactor to add database-level concurrency, and will soon have collection-level concurrency now that the hard part of the concurrency refactoring is done. Also keep in mind that writes go to RAM and are eventually synced to disk, since MongoDB uses memory-mapped files; writes are not as expensive as if you were always waiting to sync to disk. Speed can mitigate concurrency issues.

This is not to say that MongoDB will never have shortcomings and engineering tradeoffs. Also, you can, will and sometimes should combine MongoDB with a relational database or with full-text search like Solr/Lucene for some applications. For example, if you run into issues with effectively building indexes to speed up some queries, you might need to combine MongoDB with Memcached. None of this is completely foreign, though, as it is not uncommon to pair an RDBMS with Memcached or Lucene/Solr. When to use MongoDB and when to develop a hybrid solution is beyond the scope of this article. In fact, when to use a SQL/RDBMS or another NoSQL solution is beyond the scope of this article, but it would be nice to hear more discussion on this topic.

The price you pay for MongoDB, one of the youngest but perhaps best-managed NoSQL solutions, is lack of maturity. It does not have a code base going back three decades like RDBMS systems. It does not have tons and tons of third-party management and development tools. There have been issues, there are issues and there will be issues, but MongoDB seems to work well for a broad class of applications, and is rapidly addressing many of them.

Also, finding skilled developers and ops staff (admins, DevOps, etc.) who are familiar with MongoDB or other NoSQL solutions might be tough. Somehow MongoDB seems to be the most approachable, or perhaps just the best marketed. Having worked on projects that used large NoSQL deployments, I can say that few people on the team really understood the product (its limitations, etc.), which leads to trouble.

In short, if you are kicking the tires of NoSQL, starting with MongoDB makes a lot of sense.

MongoDB concepts

MongoDB is document oriented but has many comparable concepts to traditional SQL/RDBMS solutions.

  1. Oracle: Schema, Tables, Rows, Columns
  2. MySQL: Database, Tables, Rows, Columns
  3. MongoDB: Database, Collections, Document, Fields
  4. MySQL/Oracle: Indexes
  5. MongoDB: Indexes
  6. MySQL/Oracle: Stored Procedures
  7. MongoDB: Stored JavaScript
  8. Oracle/MySQL: Database Schema
  9. MongoDB: Schema free!
  10. Oracle/MySQL: Foreign keys, and joins
  11. MongoDB: DBRefs, but mostly handled by client code
  12. Oracle/MySQL: Primary key
  13. MongoDB: ObjectID

If you have used MySQL or Oracle here is a good guide to similar processes in MongoDB:

Database Process Type    Oracle     MySQL         MongoDB
Daemon/Server            oracle     mysqld        mongod
Console Client           sqlplus    mysql         mongo
Backup utility           sqlplus    mysqldump     mongodump
Import utility           sqlplus    mysqlimport   mongoimport

You can see a trend here. Where possible, Mongo tries to follow the terminology of MySQL. They do this with console commands as well. If you are used to using MySQL, where possible, MongoDB tries to make the transition a bit less painful.

SQL operations versus MongoDB operations

MongoDB queries are similar in concept to SQL queries and use a lot of the same terminology. There is no special language or syntax to execute MongoDB queries; you simply assemble a JSON object. The MongoDB site has a complete set of example queries done in both SQL and MongoDB JSON docs to highlight the conceptual similarities. What follows is several small listings to compare MongoDB operations to SQL.

Insert

SQL
INSERT INTO CONTACTS (NAME, PHONE_NUMBER) VALUES('RICK HIGHTOWER','520-555-1212')
MongoDB
db.contacts.insert({name:'RICK HIGHTOWER',phoneNumber:'520-555-1212'})

Selects

SQL
SELECT name, phone_number FROM contacts WHERE age=30 ORDER BY name DESC
MongoDB
db.contacts.find({age:30}, {name:1,phoneNumber:1}).sort({name:-1})
SQL
SELECT name, phone_number FROM contacts WHERE age>30 ORDER BY name DESC
MongoDB
db.contacts.find({age:{$gt:30}}, {name:1,phoneNumber:1}).sort({name:-1})

Creating indexes

SQL
CREATE INDEX contact_name_idx ON contact(name DESC)
MongoDB
db.contacts.ensureIndex({name:-1})

Updates

SQL
UPDATE contacts SET phoneNumber='415-555-1212' WHERE name='Rick Hightower'
MongoDB
db.contacts.update({name:'Rick Hightower'}, {$set:{phoneNumber:'415-555-1212'}}, false, true)
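
Since this article is aimed at Python developers, here is a hedged pymongo sketch of the same operations (it assumes a mongod running locally on the default port; exact method names vary a little between pymongo versions):

import pymongo

client = pymongo.MongoClient()   # connects to localhost:27017 by default
db = client.test                 # use (or lazily create) the "test" database

# INSERT
db.contacts.insert_one({'name': 'RICK HIGHTOWER', 'phoneNumber': '520-555-1212'})

# SELECT name, phone_number ... WHERE age > 30 ORDER BY name DESC
cursor = db.contacts.find({'age': {'$gt': 30}}, {'name': 1, 'phoneNumber': 1})
for contact in cursor.sort('name', pymongo.DESCENDING):
    print contact

# CREATE INDEX on name, descending
db.contacts.create_index([('name', pymongo.DESCENDING)])

# UPDATE
db.contacts.update_one({'name': 'Rick Hightower'},
                       {'$set': {'phoneNumber': '415-555-1212'}})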

Additional features of note

MongoDB has many useful features like Geo Indexing (How close am I to X?), distributed file storage, capped collection (older documents auto-deleted), aggregation framework (like SQL projections for distributed nodes without the complexities of MapReduce for basic operations on distributed nodes), load sharing for reads via replication, auto sharding for scaling writes, high availability, and your choice of durability (journaling) and/or data safety (make sure a copy exists on other servers).

Architecture: Replica Sets, Autosharding

The model of MongoDB is such that you can start basic and adopt features as your growth/needs change, without too much trouble or change in design. MongoDB uses replica sets to provide read scalability and high availability. Autosharding is used to scale writes (and reads). Replica sets and autosharding go hand in hand if you need mass scale-out. With MongoDB, scaling out seems easier than with traditional approaches, as many things come built-in and happen automatically. Less operation/administration work and a lower TCO than other solutions seem likely. However, you still need capacity planning (a good guess), monitoring (test your guess), and the ability to determine your current needs (adjust your guess).

Replica Sets

The major advantages of replica sets are business continuity through high availability, data safety through data redundancy, and read scalability through load sharing (reads). Replica sets use a shared-nothing architecture. A fair bit of the brains of replica sets is in the client libraries. The client libraries are replica set aware. With replica sets, MongoDB language drivers know the current primary. A language driver is a library for a particular programming language; think JDBC driver or ODBC driver, but for MongoDB. All write operations go to the primary. If the primary is down, the drivers know how to get to the new primary (an elected new primary); this is auto failover for high availability. The data is replicated after writing. Drivers always write to the replica set's primary (called the master); the master then replicates to slaves. The primary is not fixed. The master/primary is nominated.

Typically you have at least three MongoDB instances in a replica set on different server machines (see figure 2). You can add more replicas of the primary if you like for read scalability, but you only need three for high availability failover. There is a way to sort of get down to two, but let's leave that out for this article. Except for this small tidbit, there are advantages to having three versus two in general. If you have two instances and one goes down, the remaining instance has to carry twice the load it had before. If you have three instances and one goes down, the load on each remaining instance only goes up by 50%. If you typically run your boxes at 50% capacity and you have an outage, that means your boxes will run at 75% capacity until you get the remaining box repaired or replaced. If business continuity is your thing or important for your application, then having at least three instances in a replica set sounds like a good plan anyway (not all applications need it).

Figure 2: Replica Sets

MongoDB Replica Sets

In general, once replication is set up it just works. However, in your monitoring, which is easy to do in MongoDB, you want to see how fast data is getting replicated from the primary (master) to replicas (slaves). The slower the replication is, the dirtier your reads are. The replication is by default async (non-blocking). Slave data and primary data can be out of sync for however long it takes to do the replication. There are already whole books written just on making Mongo scalable; if you work at a Foursquare-like company, or a company where high availability is very important, and use Mongo, I suggest reading such a book.

By default replication is non-blocking/async. This might be acceptable for some data (category descriptions in an online store), but not other data (shopping cart’s credit card transaction data). For important data, the client can block until data is replicated on all servers or written to the journal (journaling is optional). The client can force the master to sync to slaves before continuing. This sync blocking is slower. Async/non-blocking is faster and is often described as eventual consistency. Waiting for a master to sync is a form of data safety. There are several forms of data safety and options available to MongoDB from syncing to at least one other server to waiting for the data to be written to a journal (durability). Here is a list of some data safety options for MongoDB:

  1. Wait until write has happened on all replicas
  2. Wait until write is on two servers (primary and one other)
  3. Wait until write has occurred on majority of replicas
  4. Wait until write operation has been written to journal

(The above is not an exhaustive list of options.) The key word in each option above is wait. The more syncing and durability, the more waiting, and the harder it is to scale cost effectively.
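
To make that concrete, here is a hedged pymongo sketch (my own, not from the original article) of asking for a couple of these options on a single write; the keyword names are the getLastError-style options I believe the 2.x-era driver passes through, so treat this as an illustration rather than a reference:

import pymongo

connection = pymongo.Connection()
employees = connection["tutorial"]["employees"]

# Default write: fast, fire-and-forget style, eventually consistent.
employees.insert({"name": "Fast Eddie"})

# Wait until the write is acknowledged by a majority of the replica set
# members and written to the journal before returning (slower, safer).
employees.insert({"name": "Careful Carl"}, w="majority", j=True)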

Journaling: Is durability overvalued if RAM is the new Disk? Data Safety versus durability

It may seem strange to some that journaling was added as late as version 1.8 of MongoDB. Journaling is only now the default for 64-bit OSes as of MongoDB 2.0. Prior to that, you typically used replication to make sure write operations were copied to a replica before proceeding if the data was very important. The thought being that one server might go down, but two servers are very unlikely to go down at the same time. Unless somebody backs a truck over a high-voltage utility pole, causing all of your air conditioning equipment to stop working long enough for all of your servers to overheat at once, but that never happens (it happened to Rackspace and Amazon). And if you were worried about this, you would have replication across availability zones, but I digress.

At one point MongoDB did not have single-server durability; now it does with the addition of journaling. But this is far from a moot point. The general thought from the MongoDB community was, and maybe still is, that to achieve Web Scale, durability was a thing of the past. After all, memory is the new disk. If you could get the data onto a second server or two, then the chances of them all going down at once are very, very low. How often do servers go down these days? What are the chances of two servers going down at once? The general thought from the MongoDB community was (is?) that durability is overvalued and was just not Web Scale. Whether this is a valid point or not, there was much fun made about this at MongoDB's expense (rated R, Mature 17+).

As you recall, MongoDB uses memory-mapped files for its storage engine, so it can take a while for the data in memory to get synced to disk by the operating system. Thus if you did have several machines go down at once (which should be very rare), complete recoverability would be impossible. There were workarounds with tradeoffs; for example, to get around this (now non-issue) or minimize it, you could force MongoDB to do an fsync of the data in memory to the file system, but as you might guess, even with RAID level four and a really awesome server that can get slow quickly. The moral of the story is that MongoDB has journaling as well as many other options, so you can decide what the best engineering tradeoff among data safety, raw speed, and scalability is. You get to pick. Choose wisely.

The reality is that no solution offers “complete” reliability, and if you are willing to allow for some loss (which you can with some data), you can get enormous improvements in speed and scale. Let's face it: your virtual farm game data is just not as important as Wells Fargo's bank transactions. I know your mom will get upset when she loses the virtual tractor she bought for her virtual farm with her virtual money, but unless she pays real money she will likely get over it. I've lost a few posts on Twitter over the years, and I have not sued once. If your servers have an uptime of 99 percent and you block/replicate to three servers, then the probability of all of them being down at once is 0.01 to the third power, or 0.000001, so the probability of losing all of them is 1 in 1,000,000. Of course the uptime of modern operating systems (Linux) is much higher than this, so one in 100,000,000 or more is possible with just three servers. Amazon EC2 offers discounts if they can't maintain an SLA of 99.95% (other cloud providers have even higher SLAs). If you were worried about geographic problems you could replicate to another availability zone or geographic area connected with a high-speed WAN. How much speed and reliability do you need? How much money do you have?

An article on when to use MongoDB journaling versus the older recommendations would be a welcome addition. Generally it seems journaling is mostly a requirement for very sensitive financial data and single-server solutions. Your results may vary, and don't trust my math; it has been a few years since I got a B+ in statistics, and I am no expert on the SLAs of modern commodity servers (the above was just spitballing).

If you have ever used a single, non-clustered RDBMS for a production system that relied on frequent backups and a transaction log (journaling) for data safety, raise your hand. Ok, if you raised your hand, then you just may not need autosharding or replica sets. To start with MongoDB, just use a single server with journaling turned on. If you require speed, you can configure MongoDB journaling to batch writes to the journal (which is the default). This is a good model to start out with and probably very much like quite a few applications you have already worked on (assuming that most applications don't need high availability). The difference is, of course, that if your application is later deemed to need high availability, read scalability, or write scalability, MongoDB has you covered. Also, setting up high availability seems easier on MongoDB than on other more established solutions.

Figure 3: Simple setup with journaling and single server ok for a lot of applications

Simple Non-Sharded, non replicated installation
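
As a hedged illustration of that single-server-with-journaling starting point (the option names are from roughly the MongoDB 2.0 era this article covers, so check the current docs before relying on them), a minimal config file could look like this:

Listing: Single server with journaling (sketch)

dbpath=/etc/mongodb/data
journal=true
# journalCommitInterval=100   # milliseconds between batched journal commits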

If you can afford two other servers and your app reads more than it writes, you can get improved high availability and increased read scalability with replica sets. If your application is write intensive then you might need autosharding. The point is you don’t have to be Facebook or Twitter to use MongoDB. You can even be working on a one-off dinky application. MongoDB scales down as well as up.

Autosharding

Replica sets are good for failover and speeding up reads, but to speed up writes, you need autosharding. According to a talk by Roger Bodamer on Scaling with MongoDB, 90% of projects do not need autosharding. Conversely, almost all projects will benefit from the replication and high availability provided by replica sets. Also, once MongoDB improves its concurrency in version 2.2 and beyond, it may be the case that 97% of projects don't need autosharding.

Sharding allows MongoDB to scale horizontally. Sharding is also called partitioning. You assign each of your servers a portion of the data to hold, or the system does this for you. MongoDB can automatically change partitions for optimal data distribution and load balancing, and it allows you to elastically add new nodes (MongoDB instances). How to set up autosharding is beyond the scope of this introductory article. Autosharding can support automatic failover (along with replica sets). There is no single point of failure. Remember, 90% of deployments don't need sharding, but if you do need scalable writes (apps like Foursquare, Twitter, etc.), then autosharding was designed to work with minimal impact on your client code.

There are three main process actors for autosharding: mongod (the database daemon), mongos, and the client driver library. Each mongod instance gets a shard. Mongod is the process that manages databases and collections. Mongos is a router; it routes writes to the correct mongod instance for autosharding, and it also handles figuring out which shards will have data for a query. To the client driver, mongos looks more or less like a mongod process (autosharding is transparent to the client drivers).

Figure 4: MongoDB Autosharding

Sharding with MongoDB
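
To make those three process types concrete, here is a hedged sketch of starting them by hand; the host names, ports, and data paths are made up, and a real deployment also needs the shards registered via the admin commands:

$ mongod --shardsvr  --dbpath /data/shard1  --port 27018
$ mongod --configsvr --dbpath /data/config1 --port 27019
$ mongos --configdb cfg1.example.com:27019,cfg2.example.com:27019,cfg3.example.com:27019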

Autosharding increases write and read throughput, and helps with scale out. Replica sets are for high availability and read throughput. You can combine them as shown in figure 5.

Figure 5: MongoDB Autosharding plus Replica Sets for scalable reads, scalable writes, and high availability

MongoDB Autosharding for Scalable Reads, Scalable Writes and High Availability

You shard on an indexed field in a document. Mongos collaborates with config servers (mongod instances acting as config servers), which hold the shard topology (where the key ranges live). Shards are just normal mongod instances. Config servers hold meta-data about the cluster and are also mongod instances.

Shards are further broken down into 64 MB pieces called chunks. A chunk is 64 MB worth of documents for a collection. Config servers record which shard each chunk lives in. The autosharding happens by moving these chunks around and distributing them among the individual shards. The mongos processes have a balancer routine that wakes up every so often and checks how many chunks a particular shard has. If a particular shard has too many chunks (nine more chunks than another shard), then mongos starts to move data from one shard to another to balance the data capacity amongst the shards. Once the data is moved, the config servers are updated in a two-phase commit (updates to shard topology are only allowed if all three config servers are up).

The config servers contain a versioned shard topology and are the gatekeepers for autosharding balancing. This topology maps which shard has which keys. The config servers are like a DNS server for shards. The mongos process uses the config servers to find where shard keys live. Mongod instances are shards that can be replicated using replica sets for high availability. Mongos and config server processes do not need to be on their own servers and can live on a primary box of a replica set, for example. For sharding you need at least three config servers, and shard topologies cannot change unless all three are up at the same time. This ensures consistency of the shard topology. The full autosharding topology is shown in figure 6. An excellent talk on the internals of MongoDB sharding was given by Kristina Chodorow, author of Scaling MongoDB, at OSCON 2011 if you would like to know more.

Figure 6: MongoDB Autosharding full topology for large deployment including Replica Sets, Mongos routers, Mongod Instance, and Config Servers

MongoDB Autosharding full topology for large deployment including Replica Sets, mongos routers, mongod instance, client drivers and config servers

MapReduce

MongoDB has MapReduce capabilities for batch processing similar to Hadoop. Massive aggregation is possible through the divide-and-conquer nature of MapReduce. Before the Aggregation Framework, MongoDB's MapReduce could be used to implement what you might do with SQL projections (SQL GROUP BY). MongoDB has since added the Aggregation Framework, which negates the need for MapReduce for common aggregation cases. In MongoDB, Map and Reduce functions are written in JavaScript. These functions are executed on the servers (mongod), which allows the code to be next to the data it is operating on (think stored procedures, but meant to execute on distributed nodes and then be collected and filtered). The results can be copied to a results collection.

MongoDB also provides incremental MapReduce. This allows you to run MapReduce jobs over a collection, and then later run a second job over only the new documents in the collection. You can use this to reduce the work required by merging new data into the existing results collection.
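
Here is a rough pymongo sketch of a MapReduce job like the department/salary totals we compute with the Aggregation Framework below; the map and reduce bodies are still JavaScript, the collection and field names are just examples, and note that newer drivers have since dropped this helper:

import pymongo
from bson.code import Code

connection = pymongo.Connection()
db = connection["tutorial"]

# Map: emit one (department, salary) pair per employee document.
mapper = Code("function() { emit(this.dept_name, this.salary); }")
# Reduce: sum the salaries emitted for each department.
reducer = Code("function(key, values) { return Array.sum(values); }")

# The results are written server-side to the 'salary_totals' collection.
result = db.employees.map_reduce(mapper, reducer, out="salary_totals")
for doc in result.find():
    print doc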

Aggregation Framework

The Aggregation Framework was added in MongoDB 2.1. It is similar to SQL GROUP BY. Before the Aggregation Framework, you had to use MapReduce for things like SQL's GROUP BY. Using the Aggregation Framework is easier than MapReduce. Let's cover a small set of aggregation functions and their SQL equivalents, inspired by the Mongo docs, as follows:

Count

SQL
SELECT COUNT(*) FROM employees
MongoDB
db.employees.aggregate([
{ $group: {_id:null, count:{$sum:1}} }
])

Sum of salaries of employees who are not retired, grouped by department

SQL
SELECT dept_name, SUM(salary) FROM employees WHERE retired=false GROUP BY dept_name
MongoDB
db.employees.aggregate([
 { $match:{retired:false} }, 
 { $group:{_id:"$dept_name",total:{$sum:"$salary"}} } 
])

This is quite an improvement over using MapReduce for common projections like these. Mongo will take care of collecting and summing the results from multiple shards if need be.
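
And here is a hedged pymongo version of the department-salary pipeline above; with the 2.x-era Connection() API used later in this article the result comes back as a dict holding a 'result' list, while newer drivers return a cursor instead:

import pymongo

connection = pymongo.Connection()
db = connection["tutorial"]

pipeline = [
    {"$match": {"retired": False}},
    {"$group": {"_id": "$dept_name", "total": {"$sum": "$salary"}}},
]
out = db.employees.aggregate(pipeline)   # needs MongoDB 2.1+ and a driver that supports aggregate
for row in out["result"]:
    print row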

Installing MongoDB

Let’s mix in some code samples to try out along with the concepts.

To install MongoDB, go to their download page, then download and untar/unzip it to ~/mongodb-platform-version/. Next you want to create the directory that will hold the data and create a mongodb.config file (/etc/mongodb/mongodb.config) that points to said directory, as follows:

Listing: Installing MongoDB

$ sudo mkdir /etc/mongodb/data

$ cat /etc/mongodb/mongodb.config 
dbpath=/etc/mongodb/data

The /etc/mongodb/mongodb.config has one line dbpath=/etc/mongodb/data that tells mongo where to put the data. Next, you need to link mongodb to /usr/local/mongodb and then add it to the path environment variable as follows:

Listing: Setting up MongoDB on your path

$ sudo ln -s  ~/mongodb-platform-version/  /usr/local/mongodb
$ export PATH=$PATH:/usr/local/mongodb/bin

Run the server passing the configuration file that we created earlier.

Listing: Running the MongoDB server

$ mongod --config /etc/mongodb/mongodb.config

Mongo comes with a nice console application called mongo that lets you execute commands and JavaScript. JavaScript is to Mongo what PL/SQL is to Oracle's database. Let's fire up the console app and poke around.

Firing up the mongo console application

$ mongo
MongoDB shell version: 2.0.4
connecting to: test
…
> db.version()
2.0.4
>

One of the nice things about MongoDB is the self-describing console. It is easy to see what commands a MongoDB database supports with db.help(), as follows:

Client: mongo db.help()

> db.help()
DB methods:
db.addUser(username, password[, readOnly=false])
db.auth(username, password)
db.cloneDatabase(fromhost)
db.commandHelp(name) returns the help for the command
db.copyDatabase(fromdb, todb, fromhost)
db.createCollection(name, { size : ..., capped : ..., max : ... } )
db.currentOp() displays the current operation in the db
db.dropDatabase()
db.eval(func, args) run code server-side
db.getCollection(cname) same as db['cname'] or db.cname
db.getCollectionNames()
db.getLastError() - just returns the err msg string
db.getLastErrorObj() - return full status object
db.getMongo() get the server connection object
db.getMongo().setSlaveOk() allow this connection to read from the nonmaster member of a replica pair
db.getName()
db.getPrevError()
db.getProfilingStatus() - returns if profiling is on and slow threshold 
db.getReplicationInfo()
db.getSiblingDB(name) get the db at the same server as this one
db.isMaster() check replica primary status
db.killOp(opid) kills the current operation in the db
db.listCommands() lists all the db commands
db.logout()
db.printCollectionStats()
db.printReplicationInfo()
db.printSlaveReplicationInfo()
db.printShardingStatus()
db.removeUser(username)
db.repairDatabase()
db.resetError()
db.runCommand(cmdObj) run a database command.  if cmdObj is a string, turns it into { cmdObj : 1 }
db.serverStatus()
db.setProfilingLevel(level,{slowms}) 0=off 1=slow 2=all
db.shutdownServer()
db.stats()
db.version() current version of the server
db.getMongo().setSlaveOk() allow queries on a replication slave server
db.fsyncLock() flush data to disk and lock server for backups
db.fsyncUnlock() unlocks server following a db.fsyncLock()

You can see that some of the commands refer to concepts we discussed earlier. Now let's create an employee collection and do some CRUD operations on it.

Create Employee Collection

> use tutorial;
switched to db tutorial
> db.getCollectionNames();
[ ]
> db.employees.insert({name:'Rick Hightower', gender:'m', phone:'520-555-1212', age:42});
Mon Apr 23 23:50:24 [FileAllocator] allocating new datafile /etc/mongodb/data/tutorial.ns, ...

The use command switches to a database. If that database does not exist, it will be lazily created the first time we access it (write to it). The db object refers to the current database. The current database does not have any document collections to start with (this is why db.getCollectionNames() returns an empty list). To create a document collection, just insert a new document. Collections, like databases, are lazily created when they are actually used. You can see that two collections are created when we insert our first document into the employees collection as follows:

> db.getCollectionNames();
[ "employees", "system.indexes" ]

The first collection is our employees collection and the second collection is used to hold onto indexes we create.

To list all employees you just call the find method on the employees collection.

> db.employees.find()
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
    "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

The above is the query syntax for MongoDB. There is no separate SQL-like language. You just execute JavaScript code, passing documents, which are just JavaScript associative arrays, err, I mean JavaScript objects. To find a particular employee, you do this:

> db.employees.find({name:"Bob"})

Bob quit so to find another employee, you would do this:

> db.employees.find({name:"Rick Hightower"})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

The console application just prints out the document right to the screen. I don’t feel 42. At least I am not 100 as shown by this query:

> db.employees.find({age:{$lt:100}})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

Notice that to get employees younger than 100, you pass a document with a subdocument: the key is the operator ($lt) and the value is the operand (100). Mongo supports all of the operators you would expect, like $lt for less than, $gt for greater than, etc. If you know JavaScript, it is easy to inspect fields of a document, as follows:

> db.employees.find({age:{$lt:100}})[0].name
Rick Hightower

If we were going to query, sort or shard on employees.name, then we would need to create an index as follows:

db.employees.ensureIndex({name:1}); //ascending index, descending would be -1

Indexing by default is a blocking operation, so if you are indexing a large collection, it could take several minutes and perhaps much longer. This is not something you want to do casually on a production system. There are options to build indexes as a background task, to set up a unique index, and complications around indexing on replica sets, and much more. If you are running queries that rely on certain indexes to be performant, you can check to see if an index exists with db.employees.getIndexes(). You can also see a list of indexes as follows:

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "tutorial.employees", "name" : "_id_" }
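
For what it's worth, here is a hedged sketch of requesting a couple of those index options from Python with pymongo (covered in the next section); the employee_id field is hypothetical and the keyword names are the ones I believe the 2.x-era driver accepts:

import pymongo

connection = pymongo.Connection()
employees = connection["tutorial"]["employees"]

# Build the index in the background so it does not block other operations.
employees.ensure_index([("name", pymongo.ASCENDING)], background=True)

# A unique index on a (hypothetical) employee_id field.
employees.ensure_index([("employee_id", pymongo.ASCENDING)], unique=True)

print employees.index_information()   # inspect the indexes that now exist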

By default all documents get an object id. If you don't give a document an _id, it will be assigned one by the system (like a criminal suspect gets a lawyer). You can use that _id to look up an object as follows with findOne:

> db.employees.findOne({_id : ObjectId("4f964d3000b5874e7a163895")})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
   "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

Python MongoDB Setup

Setting up Python and MongoDB is quite easy since Python has its own package manager.

To install the MongoDB lib for Python on Mac OS X, you would do the following:

$ sudo env ARCHFLAGS='-arch i386 -arch x86_64' python -m easy_install pymongo

To install Python MongoDB on Linux or Windows do the following:

$ easy_install pymongo

or

$ pip install pymongo

If you don't have easy_install on your Linux box, you may have to do some sudo apt-get install python-setuptools or sudo yum install python-setuptools iterations, although it seems to be installed with most Linux distributions these days. If easy_install or pip is not installed on Windows, try reformatting your hard disk and installing a real OS, or if that is too inconvenient go here.

Once you have it all set up, you can write some code that is equivalent to the first console examples, as shown in figure 11.

Figure 11: Python code listing part 1

Python, MongoDB, Pymongo

Python has literals for maps, so working with Python is much closer to the JavaScript console examples from earlier than Java would be. Like Java, there are libraries for Python that work with MongoDB (MongoEngine, MongoKit, and more). Even executing queries is very close to the JavaScript experience, as shown in figure 12.

Figure 12: Python code listing part 2

Python, MongoDB, Pymongo

Here is the complete listing to make the cut-and-paste crowd (like me) happy.

Listing: Complete Python listing

import pymongo
from bson.objectid import ObjectId

connection = pymongo.Connection()

db = connection["tutorial"]
employees = db["employees"]

employees.insert({"name": "Lucas Hightower", 'gender':'m', 'phone':'520-555-1212', 'age':8})

cursor = db.employees.find()
for employee in db.employees.find():
    print employee

print employees.find({"name":"Rick Hightower"})[0]

cursor = employees.find({"age": {"$lt": 35}})
for employee in cursor:
     print "under 35: %s" % employee

diana = employees.find_one({"_id":ObjectId("4f984cce72320612f8f432bb")})
print "Diana %s" % diana

The output for the Python example is as follows:

{u'gender': u'm', u'age': 42.0, u'_id': ObjectId('4f964d3000b5874e7a163895'), u'name': u'Rick Hightower', u'phone':
u'520-555-1212'} 

{u'gender': u'f', u'age': 30, u'_id': ObjectId('4f984cae72329d0ecd8716c8'), u'name': u'Diana Hightower', u'phone':
u'520-555-1212'} 

{u'gender': u'm', u'age': 8, u'_id': ObjectId('4f9e111980cbd54eea000000'), u'name': u'Lucas Hightower', u'phone':
u'520-555-1212'}

Grab sites using Python: Create your own site grabber like IDM

Site Grabber Using Twisted Python
—————————————————————-

First, install the Twisted network engine. You can download it from https://twistedmatrix.com/trac/ (a direct installer is available for Python 2.7 on Windows). After installing Twisted, go to your editor and create the Python file below. Twisted is an event-driven network engine with a lot more capabilities; this is only a small use of it.

#########Grab.py###############

#######PROGRAM#################
from twisted.internet import reactor
from twisted.web.client import downloadPage
import sys

def printError(failure):
    print >>sys.stderr, failure

def stop(result):
    reactor.stop()

if len(sys.argv) != 3:
    print >>sys.stderr, "Usage: python grab.py <URL> <output file>"
    sys.exit(1)

d = downloadPage(sys.argv[1], sys.argv[2])
d.addErrback(printError)
d.addBoth(stop)
reactor.run()
###############################

Run it:

C:>python grab.py http://microsoft.com microsoft.html

The semantics of the above: the static page at http://microsoft.com will be saved to microsoft.html in the current directory.

Note:
Some people will get the error "no module named win32api" while running the program. If so, download the Windows extensions for Python from here:

http://sourceforge.net/projects/pywin32/files/pywin32/Build%20218/

Then rerun it; it should work.

Hint:

We can also use URL parsing and httplib to mine further links out of the page and save all of the linked pages as offline copies. I'm leaving that for you.
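
If you want a head start, here is a minimal sketch of the link-mining part in Python 2 (using urlparse and HTMLParser from the standard library rather than httplib; the microsoft.html file name is just the earlier example). Each of the resulting links could then be fed back into downloadPage().

import urlparse
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative links against the page we grabbed.
                    self.links.append(urlparse.urljoin(self.base_url, value))

collector = LinkCollector('http://microsoft.com')
collector.feed(open('microsoft.html').read())
for link in collector.links:
    print link    # each of these could be handed to downloadPage()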

Metaprogramming with Python


Metaprogramming with Python Idioms

Objects are created by other objects: special objects called “classes” that we can set up to spit out objects that are configured to our liking.

Classes are just objects, and they can be modified the same way:

>>> class Foo: pass
...
>>> Foo.field = 42
>>> x = Foo()
>>> x.field
42
>>> Foo.field2 = 99
>>> x.field2
99
>>> Foo.method = lambda self: "Hi!"
>>> x.method()
'Hi!'

To modify a class, you perform operations on it like any other object. You can add and subtract fields and methods, for example. The difference is that any change you make to a class affects all the objects of that class, even the ones that have already been instantiated.

What creates these special “class” objects? Other special objects, called metaclasses.

The default metaclass is called type and in the vast majority of cases it does the right thing. In some situations, however, you can gain leverage by modifying the way that classes are produced – typically by performing extra actions or injecting code. When this is the case, you can use metaclass programming to modify the way that some of your class objects are created.

It's worth re-emphasizing that in the vast majority of cases you don't need metaclasses; they are a fascinating toy, though, and the temptation to use them everywhere can be overwhelming. Some of the examples in this chapter will show both metaclass and non-metaclass solutions to a problem, so you can see that there's usually another (often simpler) approach.

Some of the functionality that was previously only available with metaclasses is now available in a simpler form using class decorators. It is still useful, however, to understand metaclasses, and certain results can still be achieved only through metaclass programming.
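
As a quick taste of that simpler form (my own minimal sketch, not from the original text), a class decorator can inject a method after the class object already exists, much like the SimpleMeta1 metaclass shown later does during class creation:

def add_marker(cls):
    "Class decorator: modify an already-created class and hand it back."
    cls.uses_decorator = lambda self: "Yes!"
    return cls

@add_marker
class Simple(object):
    def foo(self): pass

simple = Simple()
print simple.uses_decorator()

""" Output:
Yes!
"""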

Basic Metaprogramming

So metaclasses create classes, and classes create instances. Normally when we write a class, the default metaclass type is automatically invoked to create that class, and we aren’t even aware that it’s happening.

It’s possible to explicitly code the metaclass’ creation of a class. type called with one argument produces the type information of an existing class; type called with three arguments creates a new class object. The arguments when invoking type are the name of the class, a list of base classes, and a dictionary giving the namespace for the class (all the fields and methods). So the equivalent of:

class C: pass

is:

C = type('C', (), {})

Classes are often referred to as “types,” so this reads fairly sensibly: you’re calling a function that creates a new type based on its arguments.

We can also add base classes, fields and methods:

# Metaprogramming/MyList.py

def howdy(self, you):
    print("Howdy, " + you)

MyList = type('MyList', (list,), dict(x=42, howdy=howdy))

ml = MyList()
ml.append("Camembert")
print(ml)
print(ml.x)
ml.howdy("John")

print(ml.__class__.__class__)

""" Output:
['Camembert']
42
Howdy, John
<type 'type'>
"""

Note that printing the class of the class produces the metaclass.

The ability to generate classes programmatically using type opens up some interesting possibilities. Consider the GreenHouseLanguage.py example in the Jython chapter – all the subclasses in that case were written using repetitive code. We can automate the generation of the subclasses using type:

# Metaprogramming/GreenHouse.py

class Event(object):
    events = [] # static

    def __init__(self, action, time):
        self.action = action
        self.time = time
        Event.events.append(self)

    def __cmp__ (self, other):
        "So sort() will compare only on time."
        return cmp(self.time, other.time)

    def run(self):
        print("%.2f: %s" % (self.time, self.action))

    @staticmethod
    def run_events():
        Event.events.sort();
        for e in Event.events:
            e.run()

def create_mc(description):
    "Create subclass using the 'type' metaclass"
    class_name = "".join(x.capitalize() for x in description.split())
    def __init__(self, time):
        Event.__init__(self, description + " [mc]", time)
    globals()[class_name] = \
        type(class_name, (Event,), dict(__init__ = __init__))

def create_exec(description):
    "Create subclass by exec-ing a string"
    class_name = "".join(x.capitalize() for x in description.split())
    klass = """
class %s(Event):
    def __init__(self, time):
        Event.__init__(self, "%s [exec]", time)
""" % (class_name, description)
    exec klass in globals()

if __name__ == "__main__":
    descriptions = ["Light on", "Light off", "Water on", "Water off",
                    "Thermostat night", "Thermostat day", "Ring bell"]
    initializations = "ThermostatNight(5.00); LightOff(2.00); \
        WaterOn(3.30); WaterOff(4.45); LightOn(1.00); \
        RingBell(7.00); ThermostatDay(6.00)"
    [create_mc(dsc) for dsc in descriptions]
    exec initializations in globals()
    [create_exec(dsc) for dsc in descriptions]
    exec initializations in globals()
    Event.run_events()

""" Output:
1.00: Light on [mc]
1.00: Light on [exec]
2.00: Light off [mc]
2.00: Light off [exec]
3.30: Water on [mc]
3.30: Water on [exec]
4.45: Water off [mc]
4.45: Water off [exec]
5.00: Thermostat night [mc]
5.00: Thermostat night [exec]
6.00: Thermostat day [mc]
6.00: Thermostat day [exec]
7.00: Ring bell [mc]
7.00: Ring bell [exec]
"""

The Event base class is the same. The classes are created automatically using the create_mc() function, which takes its description argument and generates a class name from it. Then it defines an __init__() method, which it puts into the namespace dictionary for the type call, producing a new subclass of Event. Note that the resulting class must be inserted into the global namespace, otherwise it will not be seen.

This approach works fine, but then consider the subsequent create_exec() function, which accomplishes the same thing by calling exec on a string defining the class. This will be much easier to understand by the vast majority of the people reading your code: those who do not understand metaclasses.

The Metaclass Hook

So far, we’ve only used the type metaclass directly. Metaclass programming involves hooking our own operations into the creation of class objects. This is accomplished by:

  1. Writing a subclass of the metaclass type.
  2. Inserting the new metaclass into the class creation process using the metaclass hook.

In Python 2.x, the metaclass hook is a static field in the class called __metaclass__. In the ordinary case, this is not assigned so Python just uses type to create the class. But if you define __metaclass__ to point to a callable, Python will call __metaclass__() after the initial creation of the class object, passing in the class object, the class name, the list of base classes and the namespace dictionary.

Python 2.x also allows you to assign to the global __metaclass__ hook, which will be used if there is not a class-local __metaclass__ hook (is there an equivalent in Python 3?).

Thus, the basic process of metaclass programming looks like this:

# Metaprogramming/SimpleMeta1.py
# Two-step metaclass creation in Python 2.x

class SimpleMeta1(type):
    def __init__(cls, name, bases, nmspc):
        super(SimpleMeta1, cls).__init__(name, bases, nmspc)
        cls.uses_metaclass = lambda self : "Yes!"

class Simple1(object):
    __metaclass__ = SimpleMeta1
    def foo(self): pass
    @staticmethod
    def bar(): pass

simple = Simple1()
print([m for m in dir(simple) if not m.startswith('__')])
# A new method has been injected by the metaclass:
print simple.uses_metaclass()

""" Output:
['bar', 'foo', 'uses_metaclass']
Yes!
"""

By convention, when defining metaclasses cls is used rather than self as the first argument to all methods except __new__() (which uses mcl, for reasons explained later). cls is the class object that is being modified.

Note that the practice of calling the base-class constructor first (via super()) in the derived-class constructor should be followed with metaclasses as well.

__metaclass__ only needs to be callable, so in Python 2.x it’s possible to define __metaclass__ inline:

# Metaprogramming/SimpleMeta2.py
# Combining the steps for metaclass creation in Python 2.x

class Simple2(object):
    class __metaclass__(type):
        def __init__(cls, name, bases, nmspc):
            # This won't work:
            # super(__metaclass__, cls).__init__(name, bases, nmspc)
            # Less-flexible specific call:
            type.__init__(cls, name, bases, nmspc)
            cls.uses_metaclass = lambda self : "Yes!"

class Simple3(Simple2): pass
simple = Simple3()
print simple.uses_metaclass()

""" Output:
Yes!
"""

The compiler won’t accept the super() call because it says __metaclass__ hasn’t been defined, forcing us to use the specific call to type.__init__().

Because it only needs to be callable, it’s even possible to define __metaclass__ as a function:

# Metaprogramming/SimpleMeta3.py
# A function for __metaclass__ in Python 2.x

class Simple4(object):
    def __metaclass__(name, bases, nmspc):
        cls = type(name, bases, nmspc)
        cls.uses_metaclass = lambda self : "Yes!"
        return cls

simple = Simple4()
print simple.uses_metaclass()

""" Output:
Yes!
"""

As you’ll see, Python 3 doesn’t allow the syntax of these last two examples. Even so, the above example makes it quite clear what’s happening: the class object is created, then modified, then returned.

Note

Or does it allow that syntax?

The Metaclass Hook in Python 3

Python 3 changes the metaclass hook. It doesn’t disallow the __metaclass__ field, but it ignores it. Instead, you use a keyword argument in the base-class list:

class Simple1(object, metaclass = SimpleMeta1):
    ...

This means that none of the (clever) alternative ways of defining __metaclass__ directly as a class or function are available in Python 3 [[check this]]. All metaclasses must be defined as separate classes. This is probably just as well, as it makes metaclass programs more consistent and thus easier to read and understand.

Example: Self-Registration of Subclasses

It is sometimes convenient to use inheritance as an organizing mechanism – each subclass becomes an element of a group that you work on. For example, in CodeManager.py in the Comprehensions chapter, the subclasses of Language were all the languages that needed to be processed. Each Language subclass described specific processing traits for that language.

To solve this problem, consider a system that automatically keeps a list of all of its “leaf” subclasses (only the classes that have no inheritors). This way we can easily enumerate through all the subtypes:

# Metaprogramming/RegisterLeafClasses.py

class RegisterLeafClasses(type):
    def __init__(cls, name, bases, nmspc):
        super(RegisterLeafClasses, cls).__init__(name, bases, nmspc)
        if not hasattr(cls, 'registry'):
            cls.registry = set()
        cls.registry.add(cls)
        cls.registry -= set(bases) # Remove base classes
    # Metamethods, called on class objects:
    def __iter__(cls):
        return iter(cls.registry)
    def __str__(cls):
        if cls in cls.registry:
            return cls.__name__
        return cls.__name__ + ": " + ", ".join([sc.__name__ for sc in cls])

class Color(object):
    __metaclass__ = RegisterLeafClasses

class Blue(Color): pass
class Red(Color): pass
class Green(Color): pass
class Yellow(Color): pass
print(Color)
class PhthaloBlue(Blue): pass
class CeruleanBlue(Blue): pass
print(Color)
for c in Color: # Iterate over subclasses
    print(c)

class Shape(object):
    __metaclass__ = RegisterLeafClasses

class Round(Shape): pass
class Square(Shape): pass
class Triangular(Shape): pass
class Boxy(Shape): pass
print(Shape)
class Circle(Round): pass
class Ellipse(Round): pass
print(Shape)

""" Output:
Color: Red, Blue, Green, Yellow
Color: Red, CeruleanBlue, Green, PhthaloBlue, Yellow
Red
CeruleanBlue
Green
PhthaloBlue
Yellow
Shape: Square, Round, Boxy, Triangular
Shape: Square, Ellipse, Circle, Boxy, Triangular
"""

Two separate tests are used to show that the registries are independent of each other. Each test shows what happens when another level of leaf classes is added – the former leaf becomes a base class, and so is removed from the registry.

This also introduces metamethods, which are defined in the metaclass so that they become methods of the class. That is, you call them on the class rather than object instances, and their first argument is the class object rather than self.

Using Class Decorators

Using the inspect module

(As in the Comprehensions chapter)

Example: Making a Class “Final”

It is sometimes convenient to prevent a class from being inherited:

# Metaprogramming/Final.py
# Emulating Java's 'final'

class final(type):
    def __init__(cls, name, bases, namespace):
        super(final, cls).__init__(name, bases, namespace)
        for klass in bases:
            if isinstance(klass, final):
                raise TypeError(str(klass.__name__) + " is final")

class A(object):
    pass

class B(A):
    __metaclass__= final

print B.__bases__
print isinstance(B, final)

# Produces compile-time error:
class C(B):
    pass

""" Output:
(<class '__main__.A'>,)
True
...
TypeError: B is final
"""

During class object creation, we check to see if any of the bases are derived from final. Notice that using a metaclass makes the new type an instance of that metaclass, even though the metaclass doesn’t show up in the base-class list.

Because this process of checking for finality must be installed to happen as the subclasses are created, rather than afterwards as performed by class decorators, it appears that this is an example of something that requires metaclasses and can’t be accomplished with class decorators.

Using __init__ vs. __new__ in Metaclasses

It can be confusing when you see metaclass examples that appear to arbitrarily use __new__ or __init__ – why choose one over the other?

__new__ is called for the creation of a new class, while __init__ is called after the class is created, to perform additional initialization before the class is handed to the caller:

# Metaprogramming/NewVSInit.py
from pprint import pprint

class Tag1: pass
class Tag2: pass
class Tag3:
    def tag3_method(self): pass

class MetaBase(type):
    def __new__(mcl, name, bases, nmspc):
        print('MetaBase.__new__\n')
        return super(MetaBase, mcl).__new__(mcl, name, bases, nmspc)

    def __init__(cls, name, bases, nmspc):
        print('MetaBase.__init__\n')
        super(MetaBase, cls).__init__(name, bases, nmspc)

class MetaNewVSInit(MetaBase):
    def __new__(mcl, name, bases, nmspc):
        # First argument is the metaclass ``MetaNewVSInit``
        print('MetaNewVSInit.__new__')
        for x in (mcl, name, bases, nmspc): pprint(x)
        print('')
        # These all work because the class hasn't been created yet:
        if 'foo' in nmspc: nmspc.pop('foo')
        name += '_x'
        bases += (Tag1,)
        nmspc['baz'] = 42
        return super(MetaNewVSInit, mcl).__new__(mcl, name, bases, nmspc)

    def __init__(cls, name, bases, nmspc):
        # First argument is the class being initialized
        print('MetaNewVSInit.__init__')
        for x in (cls, name, bases, nmspc): pprint(x)
        print('')
        if 'bar' in nmspc: nmspc.pop('bar') # No effect
        name += '_y' # No effect
        bases += (Tag2,) # No effect
        nmspc['pi'] = 3.14159 # No effect
        super(MetaNewVSInit, cls).__init__(name, bases, nmspc)
        # These do work because they operate on the class object:
        cls.__name__ += '_z'
        cls.__bases__ += (Tag3,)
        cls.e = 2.718

class Test(object):
    __metaclass__ = MetaNewVSInit
    def __init__(self):
        print('Test.__init__')
    def foo(self): print('foo still here')
    def bar(self): print('bar still here')

t = Test()
print('class name: ' + Test.__name__)
print('base classes: ', [c.__name__ for c in Test.__bases__])
print([m for m in dir(t) if not m.startswith("__")])
t.bar()
print(t.e)

""" Output:
MetaNewVSInit.__new__
<class '__main__.MetaNewVSInit'>
'Test'
(<type 'object'>,)
{'__init__': <function __init__ at 0x7ecf0>,
 '__metaclass__': <class '__main__.MetaNewVSInit'>,
 '__module__': '__main__',
 'bar': <function bar at 0x7ed70>,
 'foo': <function foo at 0x7ed30>}

MetaBase.__new__

MetaNewVSInit.__init__
<class '__main__.Test_x'>
'Test'
(<type 'object'>,)
{'__init__': <function __init__ at 0x7ecf0>,
 '__metaclass__': <class '__main__.MetaNewVSInit'>,
 '__module__': '__main__',
 'bar': <function bar at 0x7ed70>,
 'baz': 42}

MetaBase.__init__

Test.__init__
class name: Test_x_z
('base classes: ', ['object', 'Tag1', 'Tag3'])
['bar', 'baz', 'e', 'tag3_method']
bar still here
2.718
"""

The primary difference is that when overriding __new__() you can change things like the ‘name’, ‘bases’ and ‘namespace’ arguments before you call the super constructor and it will have an effect, but doing the same thing in __init__() you won’t get any results from the constructor call.

One special case in __new__() is that you can manipulate things like __slots__, but in __init__() you can’t.

Note that, since the base-class version of __init__() doesn’t make any modifications, it makes sense to call it first, then perform any additional operations. In C++ and Java, the base-class constructor must be called as the first operation in a derived-class constructor, which makes sense because derived-class constructions can then build upon base-class foundations.

In many cases, the choice of __new__() vs __init__() is a style issue and doesn’t matter, but because __new__() can do everything and __init__() is slightly more limited, some people just start using __new__() and stick with it. This use can be confusing – I tend to hunt for the reason that __init__() has been chosen, and if I can’t find it wonder whether the author knew what they were doing. I prefer to only use __new__() when it has meaning – when you must in order to change things that only __new__() can change.

Class Methods and Metamethods

A metamethod can be called from either the metaclass or from the class, but not from an instance. A classmethod can be called from either a class or its instances, but is not part of the metaclass.

(Is a similar relationship true with attributes, or is it different?)
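
Here is a small illustration of that distinction (my own sketch, using the Python 2.x __metaclass__ hook from earlier; all the names are made up):

class WithMetamethod(type):
    def metamethod(cls):        # defined on the metaclass => a metamethod
        return "metamethod on " + cls.__name__

class Thing(object):
    __metaclass__ = WithMetamethod
    @classmethod
    def clsmethod(cls):         # an ordinary classmethod
        return "classmethod on " + cls.__name__

print Thing.metamethod()        # works: called on the class
print Thing.clsmethod()         # works
print Thing().clsmethod()       # works: classmethods are visible on instances
try:
    Thing().metamethod()        # fails: instances don't see metamethods
except AttributeError as e:
    print "AttributeError:", e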

Intercepting Class Creation

This example implements Singleton using metaclasses, by overriding the __call__() metamethod, which is invoked when a new instance is created:

# Metaprogramming/Singleton.py

class Singleton(type):
    instance = None
    def __call__(cls, *args, **kw):
        if not cls.instance:
             cls.instance = super(Singleton, cls).__call__(*args, **kw)
        return cls.instance

class ASingleton(object):
    __metaclass__ = Singleton

a = ASingleton()
b = ASingleton()
assert a is b
print(a.__class__.__name__, b.__class__.__name__)

class BSingleton(object):
    __metaclass__ = Singleton

c = BSingleton()
d = BSingleton()
assert c is d
print(c.__class__.__name__, d.__class__.__name__)
assert c is not a

""" Output:
('ASingleton', 'ASingleton')
('BSingleton', 'BSingleton')
"""

By overriding __call__() in the metaclass, the creation of instances is intercepted. Instance creation is bypassed if one already exists.

Note the dependence upon the behavior of static class fields. When cls.instance is first read, it gets the static value of instance from the metaclass, which is None. However, when the assignment is made, Python creates a local version for the particular class, and the next time cls.instance is read, it sees that local version. Because of this behavior, each class ends up with its own class-specific instance field (thus instance is not somehow being “inherited” from the metaclass).

A Class Decorator Singleton

# Metaprogramming/SingletonDecorator.py

def singleton(klass):
    "Simple replacement of object creation operation"
    def getinstance(*args, **kw):
        if not hasattr(klass, 'instance'):
            klass.instance = klass(*args, **kw)
        return klass.instance
    return getinstance

def singleton(klass):
    """
    More powerful approach: Change the behavior
    of the instances AND the class object.
    """
    class Decorated(klass):
        def __init__(self, *args, **kwargs):
            if hasattr(klass, '__init__'):
                klass.__init__(self, *args, **kwargs)
        def __repr__(self) : return klass.__name__ + " obj"
        __str__ = __repr__
    Decorated.__name__ = klass.__name__
    class ClassObject:
        def __init__(cls):
            cls.instance = None
        def __repr__(cls):
            return klass.__name__
        __str__ = __repr__
        def __call__(cls, *args, **kwargs):
            print str(cls) + " __call__ "
            if not cls.instance:
                cls.instance = Decorated(*args, **kwargs)
            return cls.instance
    return ClassObject()

@singleton
class ASingleton: pass

a = ASingleton()
b = ASingleton()
print(a, b)
print a.__class__.__name__
print ASingleton
assert a is b

@singleton
class BSingleton:
    def __init__(self, x):
        self.x = x

c = BSingleton(11)
d = BSingleton(22)
assert c is d
assert c is not a

""" Output:
ASingleton __call__
ASingleton __call__
(ASingleton obj, ASingleton obj)
ASingleton
ASingleton
BSingleton __call__
BSingleton __call__
"""

The __prepare__() Metamethod

One of the things you can’t do with class decorators is to replace the default dictionary. In Python 3 this is enabled with the __prepare__() metamethod:

@classmethod
def __prepare__(mcl, name, bases):
    return odict()  # odict: some ordered-dictionary class, e.g. collections.OrderedDict

For an example of using both __prepare__() and __slots__ in metaclasses, see Michele Simionato’s article.
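
For a slightly fuller (hedged) sketch in Python 3 syntax, here is a metaclass whose __prepare__() hands back a collections.OrderedDict, which lets the class record the order in which its members were defined:

from collections import OrderedDict

class OrderedMeta(type):
    @classmethod
    def __prepare__(mcl, name, bases):
        # Used as the namespace while the class body executes.
        return OrderedDict()
    def __new__(mcl, name, bases, nmspc):
        cls = super().__new__(mcl, name, bases, dict(nmspc))
        cls.member_order = [k for k in nmspc if not k.startswith('__')]
        return cls

class Recorded(metaclass=OrderedMeta):
    def first(self): pass
    def second(self): pass
    def third(self): pass

print(Recorded.member_order)

""" Output:
['first', 'second', 'third']
"""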

Module-level __metaclass__ Assignment

(Does this work in Python 3? If not is there an alternative?)

Metaclass Conflicts

Note that the metaclass argument is singular – you can’t attach more than one metaclass to a class. However, through multiple inheritance you can accidentally end up with more than one metaclass, and this produces a conflict which must be resolved.

http://code.activestate.com/recipes/204197/

Further Reading

Excellent step-by-step introduction to metaclasses:
http://cleverdevil.org/computing/78/
Metaclass intro and comparison of syntax between Python 2.x and 3.x:
http://mikewatkins.ca/2008/11/29/python-2-and-3-metaclasses/
David Mertz’s metaclass primer:
http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html
Three-part in-depth coverage of metaclasses on IBM Developer Works. Quite useful and authoritative:
Michele Simionato’s articles on Artima, with special emphasis on the difference between Python 2.x and 3.x metaclasses:

Once you understand the foundations, you can find lots of examples by searching for “metaclass” within the Python Cookbook: http://code.activestate.com/recipes/langs/python/

The printed version of the Python Cookbook has far fewer examples than the online version, but the print version has been filtered and edited and so tends to be more authoritative.

Ian Bicking writes about metaclasses:

Lots of good information about classes, types, metaclasses, etc., including historical stuff in the Python 2.2 docs (is this duplicated in later versions of the docs):

Open heart with Guido van Rossum, a lost interview of the Python creator, part 2


You spent so much time trying to create (preferably) one obvious way to do things. It seems like you’re of the opinion that doing things that way, the Python way, really lets you take advantage of Python.

Guido: I’m not sure that I really spend a lot of time making sure that there’s only one way. The “Zen of Python” is much younger than the language Python, and most defining characteristics of the language were there long before Tim Peters wrote it down as a form of poetry. I don’t think he expected it to be quite as widespread and successful when he wrote it up.

It’s a catchy phrase.

Guido: Tim has a way with words. “There’s only one way to do it” is actually in most cases a white lie. There are many ways to do data structures. You can use tuples and lists. In many cases, it really doesn’t matter that much whether you use a tuple or a list or sometimes a dictionary. It turns out usually if you look really carefully, one solution is objectively better because it works just as well in a number of situations, and there’s one or two cases where lists just works so much better than tuples when you keep growing them.

That comes more actually from the original ABC philosophy that was trying to be very sparse in the components. ABC actually shared a philosophy with ALGOL-68, which is now one of the deadest languages around, but was very influential. Certainly where I was at the time during the 80s, it was very influential because Adriaan van Wijngaarden was the big guy from ALGOL 68. He was still teaching classes when I went to college. I did one or two semesters where he was just telling anecdotes from the history of ALGOL 68 if he felt like it. He had been the director of CWI. Someone else was it by the time I joined. There were many people who had been very close with ALGOL 68. I think Lambert Meertens, the primary author of ABC, was also one of the primary editors of the ALGOL 68 report, which probably means he did a lot of the typesetting, but he may occasionally also have done quite a lot of the thinking and checking. He was clearly influenced by ALGOL 68's philosophy of providing constructs that can be combined in many different ways to produce all sorts of different data structures or ways of structuring a program.

It was definitely his influence that said, “We have lists or arrays, and they can contain any kind of other thing. They can contain numbers or strings, but they can also contain other arrays and tuples of other things. You can combine all of these things together.” Suddenly you don’t need a separate concept of a multidimensional array because an array of arrays solves that for any dimensionality. That philosophy of taking a few key things that cover different directions of flexibility and allow them to be combined was very much a part of ABC. I borrowed all of that almost without thinking about it very hard.

While Python tries to give the appearance that you can combine things in very flexible ways as long as you don’t try to nest statements inside expressions, there is actually a remarkable number of special cases in the syntax where in some cases a comma means a separation between parameters, and in other cases the comma means the items of a list, and in yet another case it means an implicit tuple.

There are a whole bunch of variations in the syntax where certain operators are not allowed because they would conflict with some surrounding syntax. That is never really a problem because you can always put an extra pair of parentheses around something when it doesn’t work. Because of that the syntax, at least from the parser author’s perspective, has grown quite a bit. Things like list comprehensions and generator expressions are syntactically still not completely unified. In Python 3000, I believe they are. There’s still some subtle semantic differences, but the syntax at least is the same.

Multiple Pythons
Does the parser get simpler in Python 3000?

Guido: Hardly. It didn’t become more complex, but it also didn’t really become simpler.

No more complex I think is a win.

Guido: Yeah.

Why the simplest, dumbest compiler imaginable?

Guido: That was originally a very practical goal, because I didn’t have a degree in code generation. There was just me, and I had to have the byte code generator behind me before I could do any other interesting work on the language.
I still believe that having a very simple parser is a good thing; after all, it is just the thing that turns the text into a tree that represents the structure of the program. If the syntax is so ambiguous that it takes really advanced parts of technology to figure it out, then human readers are probably confused half the time as well. It also makes it really hard to write another parser.

Python is incredibly simple to parse, at least at the syntactic level. At the lexical level, the analysis is relatively subtle because you have to read the indentation with a little stack that is embedded in the lexical analyzer, which is a counterexample for the theory of separation between lexical and grammatical analysis. Nevertheless, that is the right solution. The funny thing is that I love automatically generated parsers, but I do not believe very strongly in automatically generated lexical analysis. Python has always had a manually generated scanner and an automated parser.

People have written many different parsers for Python. Every port of Python to a different virtual machine, whether Jython or IronPython or PyPy, has its own parser, and it's no big deal because the parser is never a very complex piece of the project, because the structure of the language is such that you can very easily parse it with the most basic one-token lookahead recursive descent parser.

What makes parsers slow is actually ambiguities that can only be resolved by looking ahead until the end of the program. In natural languages there are many examples where it’s impossible to parse a sentence until you’ve read the last word and the arbitrary nesting in the sentence. Or there are sentences that can only be parsed if you actually know the person that they are talking about, but that’s a completely different situation. For parsing programming languages, I like my one-token lookahead.

That suggests to me that there may never be macros in Python because you have to perform another parsing phase then!

Guido: There are ways of embedding the macros in the parser that could probably work. I’m not at all convinced that macros solve any problem that is particularly pressing for Python, though. On the other hand, since the language is easy to parse, if you come up with some kind of hygienic set of macros that fit within the language syntax, it might be very simple to implement micro-evaluation as parse tree manipulations. That’s just not an area that I’m particularly interested in.

Why did you choose to use strict formatting in source code?

Guido: The choice of indentation for grouping was not a novel concept in Python; I inherited this from ABC, but it also occurred in occam, an older language. I don’t know if the ABC authors got the idea from occam, or invented it independently, or if there was a common ancestor. The idea may be attributed to Don Knuth, who proposed this as early as 1974.
Of course, I could have chosen not to follow ABC’s lead, as I did in other areas (e.g., ABC used uppercase for language keywords and procedure names, an idea I did not copy), but I had come to like the feature quite a bit while using ABC, as it seemed to do away with a certain type of pointless debate common amongst C users at the time, about where to place the curly braces. I also was well aware that readable code uses indentation voluntarily anyway to indicate grouping, and I had come across subtle bugs in code where the indentation disagreed with the syntactic grouping using curly braces—the programmer and any reviewers had assumed that the indentation matched the grouping and therefore not noticed the bug. Again, a long debugging session taught a valuable lesson.

Strict formatting should produce cleaner code and probably reduce the differences in the “layout” of the code of different programmers, but doesn’t this sound like forcing a human being to adapt to the machine, instead of the opposite path?

Guido: Quite the contrary—it helps the human reader more than it helps the machine; see the previous example. Probably the advantages of this approach are more visible when maintaining code written by another programmer.

New users are often put off by this initially, although I don’t hear about this so much any more; perhaps the people teaching Python have learned to anticipate this effect and counter it effectively.

I would like to ask you about multiple implementations of Python. There are four or five big implementations, including Stackless and PyPy.

Guido: Stackless, technically, is not a separate implementation. Stackless is often listed as a separate Python implementation because it is a fork of Python that replaces a pretty small part of the virtual machine with a different approach.

Basically the byte code dispatch, right?

Guido: Most of the byte code dispatch is very similar. I think the byte codes are the same and certainly all of the objects are the same. What they do differently is handle the case where one Python procedure calls another: they do that by manipulating objects, pushing a new frame onto a stack of frame objects while the same bit of C code remains in charge. The way it’s done in C Python is that, at that point, a C function is invoked which will then eventually invoke a new instance of the virtual machine. It’s not really the whole virtual machine, but the loop that interprets the byte code. There’s only one of those loops on the C stack in Stackless. In traditional C Python, you can have that same loop on your C stack many times. That’s the only difference.

PyPy, IronPython, Jython are separate implementations. I don’t know about something that translates to JavaScript, but I wouldn’t be surprised if someone had gotten quite far with that at some point. I have heard of experimental things that translate to OCaml and Lisp and who knows what. There once was something that translated to C code as well. Mark Hammond and Greg Stein worked on it in the late 90s, but they found out that the speedup that they could obtain was very, very modest. In the best circumstances, it would run twice as fast; also, the generated code was so large that you had these enormous binaries, and that became a problem.

Start-up time hurt you there.

Guido: I think the PyPy people are on the right track.

It sounds like you’re generally supportive of these implementations.

Guido: I have always been supportive of alternate implementations. From the day that Jim Hugunin walked in the door with a more or less completed JPython implementation, I was excited about it. In a sense, it counts as a validation of the language design. It also means that people can use their favorite language on the platform where otherwise they wouldn’t have access to it. We still have a way to go there, but it certainly helped me isolate which features were really features of the language that I cared about, and which features were features of a particular implementation where I was OK with other implementations doing things differently. That’s where we ended up on the unfortunately slippery slope of garbage collection.

That’s always a slippery slope.

Guido: But it’s also necessary. I cannot believe how long we managed to live with pure reference counting and no way to break cycles. I have always seen reference counting as a way of doing garbage collection, and not a particularly bad one. There used to be this holy war between reference counting versus garbage collection, and that always seemed rather silly to me.

Regarding these implementations again, I think Python is an interesting space because it has a pretty good specification. Certainly compared to other languages like Tcl, Ruby, and Perl 5. Was that something that came about because you wanted to standardize the language and its behavior, or because you were looking at multiple implementations, or something else?

Guido: It was probably more a side effect of the community process around PEPs and the multiple implementations. When I originally wrote the first set of documentation, I very enthusiastically started a language reference manual, which was supposed to be a sufficiently precise specification that someone from Mars or Jupiter could implement the language and get the semantics right. I never got anywhere near fulfilling that goal.

ALGOL 68 probably got the closest of any language ever with their highly mathematical specification. Other languages like C++ and JavaScript have managed through the sheer willpower of the standardization committee, especially in the case of C++. That’s obviously an incredibly impressive effort. At the same time, it takes so much manpower to write a specification that is that precise that my hope of getting something like that for Python was never realized.
What we do have is enough understanding of how the language is supposed to work, and enough unit tests, and enough people on hand that can answer to implementers of other versions in finite time. I know that, for example, the IronPython folks have been very conscientious in trying to run the entire Python test suite, and for every failure deciding if the test suite was really testing the specific behavior of the C Python implementation or if they actually had more work to do in their implementation.

The PyPy folks did the same thing, and they went one step further. They have a couple of people who are much smarter than I, and who have come up with an edge case probably prompted by their own thinking about how to generate code and how to analyze code in a JIT environment. They have actually contributed quite a few tests and disambiguations and questions when they found out that there was a particular combination of things that nobody had ever really thought about. That was very helpful. The process of having multiple implementations of the language has been tremendously helpful for getting the specification of the language disambiguated.

Do you foresee a time when C Python may not be the primary implementation?

Guido: That’s hard to see. I mean some people foresee a time where .NET rules the world; other people foresee a time where JVMs rule the world. To me, that all seems like wishful thinking. At the same time, I don’t know what will happen. There could be a quantum jump where, even though the computers that we know don’t actually change, a different kind of platform suddenly becomes much more prevalent and the rules are different.

Perhaps a shift away from the von Neumann architecture?

Guido: I wasn’t even thinking of that, but that’s certainly also a possibility. I was more thinking of what happens if mobile phones become the ubiquitous computing device. Mobile phones are only a few years behind the curve of the power of regular laptops, which suggests that in a few years, mobile phones, apart from the puny keyboard and screen, will have enough computing power so that you don’t need a laptop anymore. It may well be that mobile phones, for whatever platform politics, end up all having a JVM or some other standard environment where C Python is not the best approach and some other Python implementation would work much better.

There’s certainly also the question of what do we do when we have 64 cores on a chip, even in a laptop or in a cell phone. I don’t actually know if that should change the programming paradigm all that much for most of the things we do. There may be a use for some languages that let you specify incredibly subtle concurrent processes, but in most cases the average programmer cannot write correct thread-safe code anyway. Assuming that somehow the ascent of multiple cores forces them to do that is kind of unrealistic. I expect that multiple cores will certainly be useful, but they will be used for coarse-grained parallelism, which is better anyway, because with the enormous cost difference between cache hits and cache misses, main memory no longer really serves the function of shared memory. You want to have your processes as isolated as possible.

How should we deal with concurrency? At what level should this problem be dealt with or, even better, solved?

Guido: My feeling is that writing single-threaded code is hard enough, and writing multithreaded code is way harder—so hard that most people don’t have a hope of getting it right, and that includes myself. Therefore, I don’t believe that fine-grained synchronization primitives and shared memory are the solution—instead, I’d much rather see message-passing solutions get back in style. I’m pretty sure that changing all programming languages to add synchronization constructs is a bad idea.

I also still don’t believe that trying to remove the GIL from CPython will work. I do believe that some support for managing multiple processes (as opposed to threads) is a piece of the puzzle, and for that reason Python 2.6 and 3.0 will have a new standard library module, multiprocessing, that offers an API similar to that of the threading module for doing exactly that. As a bonus, it even supports processes running on different hosts!
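
A minimal sketch of what that looks like in practice; the multiprocessing API deliberately mirrors the start()/join() shape of threading.Thread:

from multiprocessing import Process

def worker(name):
    print("hello from %s" % name)

if __name__ == "__main__":   # the guard matters on platforms that spawn a fresh interpreter
    jobs = [Process(target=worker, args=("worker-%d" % i,)) for i in range(4)]
    for job in jobs:
        job.start()          # each worker runs in its own OS process
    for job in jobs:
        job.join()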

Expedients and Experience
Is there any tool or feature that you feel is missing when writing software?

Guido: If I could sketch on a computer as easily as I can with pencil and paper, I might be making more sketches while doing the hard thinking about a design. I fear that I’ll have to wait until the mouse is universally replaced by a pen (or your finger) that lets you draw on the screen. Personally, I feel terribly handicapped when using any kind of computerized drawing tool, even if I’m pretty good with pencil and paper—perhaps I inherited it from my father, who was an architect and was always making rough sketches, so I was always sketching as a teenager.

At the other end of the scale, I suppose I may not even know what I’m missing for spelunking large codebases. Java programmers have IDEs now that provide quick answers to questions like “where are the callers of this method?” or “where is this variable assigned to?” For large Python programs, this would also be useful, but the necessary static analysis is harder because of Python’s dynamic nature.

How do you test and debug your code?

Guido: Whatever is expedient. I do a lot of testing when I write code, but the testing method varies per project. When writing your basic pure algorithmic code, unit tests are usually great, but when writing code that is highly interactive or interfaces to legacy APIs, I often end up doing a lot of manual testing, assisted by command-line history in the shell or page-reload in the browser. As an (extreme) example, you can’t very well write a unit test for a script whose sole purpose is to shut down the current machine; sure, you can mock out the part that actually does the shut down, but you still have to test that part, too, or else how do you know that your script actually works?
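
A rough sketch of that mocking step, using the standard unittest.mock module (available since Python 3.3; the shut_down and main names are invented for the example):

import subprocess
from unittest import mock

def shut_down():
    # the part we cannot safely exercise in a test
    subprocess.call(["shutdown", "-h", "now"])

def main():
    # ...whatever checks the script performs first...
    shut_down()

# replace shut_down with a mock, then verify that main() would have called it
with mock.patch(__name__ + ".shut_down") as fake_shut_down:
    main()
    fake_shut_down.assert_called_once_with()
    print("main() calls shut_down() exactly once")
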
Testing something in different environments is also often hard to automate. Buildbot is great for large systems, but the overhead to set it up is significant, so for smaller systems often you just end up doing a lot of manual QA. I’ve gotten a pretty good intuition for doing QA, but unfortunately it’s hard to explain.

When should debugging be taught? And how?

Guido: Continuously. You are debugging your entire life. I just “debugged” a problem with my six-year-old son’s wooden train set where his trains kept getting derailed at a certain point on the track. Debugging is usually a matter of moving down an abstraction level or two, and helped by stopping to look carefully, thinking, and (sometimes) using the right tools.

I don’t think there is a single “right” way of debugging that can be taught at a specific point, even for a very specific target such as debugging program bugs. There is an incredibly large spectrum of possible causes for program bugs, including simple typos, “thinkos,” hidden limitations of underlying abstractions, and outright bugs in abstractions or their implementation. The right approach varies from case to case. Tools come into play mostly when the required analysis (“looking carefully”) is tedious and repetitive. I note that Python programmers often need few tools because the search space (the program being debugged) is so much smaller.

How do you resume programming?

Guido: This is actually an interesting question. I don’t recall ever looking consciously at how I do this, while I indeed deal with this all the time. Probably the tool I used most for this is version control: when I come back to a project I do a diff between my workspace and the repository, and that will tell me the state I’m in.

If I have a chance, I leave XXX markers in the unfinished code when I know I am about to be interrupted, telling me about specific subtasks. I sometimes also use something I picked up from Lambert Meertens some 25 years ago: leave a specific mark in the current source file at the place of the cursor. The mark I use is “HIRO,” in his honor. It is colloquial Dutch for “here” and selected for its unlikeliness to ever occur in finished code. :-)

At Google we also have tools integrated with Perforce that help me in an even earlier stage: when I come in to work, I might execute a command that lists each of the unfinished projects in my workspace, so as to remind me which projects I was working on the previous day. I also keep a diary in which I occasionally record specific hard-to-remember strings (like shell commands or URLs) that help me perform specific tasks for the project at hand—for example, the full URL to a server stats page, or the shell command that rebuilds the components I’m working on.

What are your suggestions to design an interface or an API?

Guido: Another area where I haven’t spent a lot of conscious thought about the best process, even though I’ve designed tons of interfaces (or APIs). I wish I could just include a talk by Josh Bloch on the subject here; he talked about designing Java APIs, but most of what he said would apply to any language. There’s lots of basic advice like picking clear names (nouns for classes, verbs for methods), avoiding abbreviations, consistency in naming, providing a small set of simple methods that provide maximal flexibility when combined, and so on. He is big on keeping the argument lists short: two to three arguments is usually the maximum you can have without creating confusion about the order. The worst thing is having several consecutive arguments that all have the same type; an accidental swap can go unnoticed for a long time then.

I have a few personal pet peeves: first of all, and this is specific to dynamic languages, don’t make the return type of a method depend on the value of one of the arguments; otherwise it may be hard to understand what’s returned if you don’t know the relationship—maybe the type-determining argument is passed in from a variable whose content you can’t easily guess while reading the code.
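
A hypothetical illustration of that first peeve (the names are invented): at the call site there is no way to tell which type load_config returns.

def load_config(as_dict):
    # anti-pattern: the *value* of as_dict decides the return type
    pairs = [("host", "localhost"), ("port", 8080)]
    return dict(pairs) if as_dict else pairs

use_dicts = False             # imagine this flag arriving from far away in the program
settings = load_config(use_dicts)
print(type(settings))         # dict or list? the reader of this line cannot tell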

Second, I dislike “flag” arguments that are intended to change the behavior of a method in some big way. With such APIs the flag is always a constant in actually observed parameter lists, and the call would be more readable if the API had separate methods: one for each flag value.
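
And a hypothetical sketch of the second peeve: a boolean flag that is always a literal constant at the call site usually reads better as two separate functions.

import zlib

# flag version: every call site ends up passing a literal True or False
def save(text, compress=False):
    data = text.encode("utf-8")
    return zlib.compress(data) if compress else data

# flag-free version: the call sites document themselves
def save_plain(text):
    return text.encode("utf-8")

def save_compressed(text):
    return zlib.compress(text.encode("utf-8"))

print(save("hello", compress=True) == save_compressed("hello"))   # True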

Another pet peeve is to avoid APIs that could create confusion about whether they return a new object or modify an object in place. This is the reason why in Python the list method sort() doesn’t return a value: this emphasizes that it modifies the list in place. As an alternative, there is the built-in sorted() function, which returns a new, sorted list.
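
A quick interactive reminder of that distinction:

>>> numbers = [3, 1, 2]
>>> numbers.sort()        # sorts in place and returns None, so nothing is echoed
>>> numbers
[1, 2, 3]
>>> sorted("bca")         # sorted() accepts any iterable and returns a new list
['a', 'b', 'c']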

Should application programmers adopt the “less is more” philosophy? How should they simplify the user interface to provide a shorter learning path?

Guido: When it comes to graphical user interfaces, it seems there’s finally growing support for my “less is more” position. The Mozilla foundation has hired Aza Raskin, son of the late Jef Raskin (co-designer of the original Macintosh UI), as a UI designer. Firefox 3 has at least one example of a UI that offers a lot of power without requiring buttons, configuration, preferences, or anything: the smart location bar watches what I type, compares it to things I’ve browsed to before, and makes useful suggestions. If I ignore the suggestions it will try to interpret what I type as a URL or, if that fails, as a Google query. Now that’s smart! And it replaces three or four pieces of functionality that would otherwise require separate buttons or menu items.

This reflects what Jef and Aza have been saying for so many years: the keyboard is such a powerful input device, let’s use it in novel ways instead of forcing users to do everything with the mouse, the slowest of all input devices. The beauty is that it doesn’t require new hardware, unlike Sci-Fi solutions proposed by others like virtual reality helmets or eye movement sensors, not to mention brainwave detectors.

There’s a lot to do of course—for example, Firefox’s Preferences dialog has the dreadful look and feel of anything coming out of Microsoft, with at least two levels of tabs and many modal dialogs hidden in obscure places. How am I supposed to remember that in order to turn off JavaScript I have to go to the Content tab? Are Cookies under the Privacy tab or under Security? Maybe Firefox 4 can replace the Preferences dialog with a “smart” feature that lets you type keywords so that if I start typing “pass,” it will take me to the section to configure passwords.

What do the lessons about the invention, further development, and adoption of your language say to people developing computer systems today and in the foreseeable future?

Guido: I have one or two small thoughts about this. I’m not the philosophical kind, so this is not the kind of question I like or to which I have a prepared response, but here’s one thing I realized early on that I did right with Python (and which Python’s predecessor, ABC, didn’t do, to its detriment). A system should be extensible by its users. Moreover, a large system should be extensible at two (or more) levels.

Since the first time I released Python to the general public, I got requests to modify the language to support certain kinds of use cases. My first response to such requests is always to suggest writing some Python code to cover their needs and put it in a module for their own use. This is the first level of extensibility—if the functionality is useful enough, it may end up in the standard library.

The second level of extensibility is to write an extension module in C (or in C++, or other languages). Extension modules can do certain things that are not feasible in pure Python (though the capabilities of pure Python have increased over the years). I would much rather add a C-level API so that extension modules can muck around in Python’s internal data structures, than change the language itself, since language changes are held to the highest possible standard of compatibility, quality, semantic clarity, etc. Also, “forks” in the language might happen when people “help themselves” by changing the language implementation in their own copy of the interpreter, which they may distribute to others as well. Such forks cause all sorts of problems, such as maintenance of the private changes as the core language also evolves, or merging multiple independently forked versions that other users might need to combine. Extension modules don’t have these problems; in practice most functionality needed by extensions is already available in the C API, so changes to the C API are rarely necessary in order to enable a particular extension.

Another thought is to accept that you don’t get everything right the first time. Early on during development, when you have a small number of early adopters as users, is the time to fix things drastically as soon as you notice a problem, never mind backward compatibility. A great anecdote I often like to quote, and which has been confirmed as truthful by someone who was there at the time, is that Stuart Feldman, the original author of “Make” in Unix v7, was asked to change the dependence of the Makefile syntax on hard tab characters. His response was something along the lines that he agreed tab was a problem, but that it was too late to fix since there were already a dozen or so users.

As the user base grows, you need to be more conservative, and at some point absolute backward compatibility is a necessity. There comes a point where you have accumulated so many misfeatures that this is no longer feasible. A good strategy to deal with this is what I’m doing with Python 3.0: announce a break with backward compatibility for one particular version, use the opportunity to fix as many such issues as possible, and give the user community a lot of time to deal with the transition.

In Python’s case, we’re planning to support Python 2.6 and 3.0 alongside each other for a long time—much longer than the usual support lifetime of older releases. We’re also offering several transitional strategies: an automated source-to-source conversion tool that is far from perfect, combined with optional warnings in version 2.6 about the use of functionality that will change in 3.0 (especially if the conversion tool cannot properly recognize the situation), as well as selective back-porting of certain 3.0 features to 2.6. At the same time, we’re not making 3.0 a total rewrite or a total redesign (unlike Perl 6 or, in the Python world, Zope 3), thereby minimizing the risk of accidentally dropping essential functionality.

One trend I’ve noticed in the past four or five years is much greater corporate adoption of dynamic languages. First PHP, Ruby in some contexts, definitely Python in other contexts, especially Google. That’s interesting to me. I wonder where these people were 20 years ago, when languages like Tcl and Perl, and Python a little bit later, were doing all of these useful things. Have you seen desire to make these languages more enterprise-friendly, whatever that means?

Guido: Enterprise-friendly is usually when the really smart people lose interest and the people of more mediocre skills have to somehow fend for themselves. I don’t know if Python is harder to use for mediocre people. In a sense you would think that there is quite a bit of damage you cannot do in Python because it’s all interpreted. On the other hand, if you write something really huge and you don’t use enough unit testing, you may have no idea what it actually does.

You’ve made the argument that a line of Python, a line of Ruby, a line of Perl, a line of PHP, may be 10 lines of Java code.

Guido: Often it is. I think that the adoption level in the enterprise world, even though there are certain packages of functionality that are helpful, is probably held back by the fears of very conservative managers. Imagine the people in charge of IT resources for 100,000 people in a company where IT is not a main product—maybe they are building cars, or doing insurance, or something else, but everything they do is touched by computers. The people in charge of that infrastructure necessarily have to be very conservative. They will go with stuff that looks like it has a big name attached, like maybe Sun or Microsoft, because they know that Sun and Microsoft screw up all the time, but these companies are obliged to recover from those screwups and fix them, even if it takes five years.

Open source projects traditionally have just not offered that same peace of mind to the average CIO. I don’t know exactly if and how and when that will change. It’s possible that if Microsoft or Sun suddenly supported Python on their respective VMs, programmers in enterprises would actually discover that they can get higher productivity without any downsides by using more advanced languages.

Open Heart with Guido van Rossum, a lost interview of the Python creator, part 1


Guido van Rossum
Python is a modern, general-purpose, high-level language developed by Guido van Rossum as a result of his work with the ABC programming language. Python’s philosophy is pragmatic; its users often speak of the Zen of Python, strongly preferring a single obvious way to accomplish any task. Ports exist for VMs such as Microsoft’s CLR and the JVM, but the primary implementation is CPython, still developed by van Rossum and other volunteers, who just released Python 3.0, a backward-incompatible rethinking of parts of the language and its core libraries.

The Pythonic Way
What differences are there between developing a programming language and developing a “common” software project?

Guido van Rossum: More than with most software projects, your most important users are programmers themselves. This gives a language project a high level of “meta” content. In the dependency tree of software projects, programming languages are pretty much at the bottom—everything else depends on one or more languages. This also makes it hard to change a language—an incompatible change affects so many dependents that it’s usually just not feasible. In other words, all mistakes, once released, are cast in stone. The ultimate example of this is probably C++, which is burdened with compatibility requirements that effectively require code written maybe 20 years ago to be still valid.

How do you debug a language?

Guido: You don’t. Language design is one area where agile development methodologies just don’t make sense—until the language is stable, few people want to use it, and you won’t find the bugs in the language definition until you have so many users that it’s too late to change things.

Of course there’s plenty in the implementation that can be debugged like any old program, but the language design itself pretty much requires careful design up front, because the cost of bugs is so exorbitant.

How do you decide when a feature should go in a library as an extension or when it needs to have support from the core language?

Guido: Historically, I’ve had a pretty good answer for that. One thing I noticed very early on was that everybody wants their favorite feature added to the language, and most people are relatively inexperienced about language design. Everybody is always proposing “let’s add this to the language,” “let’s have a statement that does X.” In many cases, the answer is, “Well, you can already do X or something almost like X by writing these two or three lines of code, and it’s not all that difficult.” You can use a dictionary, or you can combine a list and a tuple and a regular expression, or write a little metaclass—all of those things. I may even have had the original version of this answer from Linus, who seems to have a similar philosophy.

Telling people you can already do that and here is how is a first line of defense. The second thing is, “Well, that’s a useful thing and we can probably write or you can probably write your own module or class, and encapsulate that particular bit of abstraction.” Then the next line of defense is, “OK, this looks so interesting and useful that we’ll actually accept it as a new addition to the standard library, and it’s going to be pure Python.” And then, finally, there are things that just aren’t easy to do in pure Python and we’ll suggest or recommend how to turn them into a C extension. The C extensions are the last line of defense before we have to admit, “Well, yeah, this is so useful and you really cannot do this, so we’ll have to change the language.”
There are other criteria that determine whether it makes more sense to add something to the language or it makes more sense to add something to the library, because if it has to do with the semantics of namespaces or that kind of stuff, there’s really nothing you can do besides changing the language. On the other hand, the extension mechanism was made powerful enough that there is an amazing amount of stuff you can do from C code that extends the library and possibly even adds new built-in functionality without actually changing the language. The parser doesn’t change. The parse tree doesn’t change. The documentation for the language doesn’t change. All your tools still work, and yet you have added new functionality to your system.

I suppose there are probably features that you’ve looked at that you couldn’t implement in Python other than by changing the language, but you probably rejected them. What criteria do you use to say this is something that’s Pythonic, this is something that’s not Pythonic?

Guido: That’s much harder. That is probably, in many cases, more a matter of a gut feeling than anything else. People use the word Pythonic and “that is Pythonic” a lot, but nobody can give you a watertight definition of what it means for something to be Pythonic or un-Pythonic.

You have the “Zen of Python,” but beyond that?

Guido: That requires a lot of interpretation, like every good holy book. When I see a good or a bad proposal, I can tell if it is a good or bad proposal, but it’s really hard to write a set of rules that will help someone else to distinguish good language change proposals from bad change proposals.

Sounds almost like it’s a matter of taste as much as anything.

Guido: Well, the first thing is always try to say “no,” and see if they go away or find a way to get their itch scratched without changing the language. It’s remarkable how often that works. That’s more of an operational definition of “it’s not necessary to change the language.”

If you keep the language constant, people will still find a way to do what they need to do. Beyond that it’s often a matter of use cases coming from different areas where there is nothing application-specific. If something was really cool for the Web, that would not make it a good feature to add to the language. If something was really good for writing shorter functions or writing classes that are more maintainable, that might be a good thing to add to the language. It really needs to transcend application domains in general, and make things simpler or more elegant.

When you change the language, you affect everyone. There’s no feature that you can hide so well that most people don’t need to know about. Sooner or later, people will encounter code written by someone else that uses it, or they’ll encounter some obscure corner case where they have to learn about it because things don’t work the way they expected. Often elegance is also in the eye of the beholder. We had a recent discussion on one of the Python lists where people were arguing forcefully that using dollar instead of self-dot was much more elegant. I think their definition of elegance was number of keystrokes.

There’s an argument to make for parsimony there, but very much in the context of personal taste.

Guido: Elegance and simplicity and generality all are things that, to a large extent, depend on personal taste, because what seems to cover a larger area for me may not cover enough for someone else, and vice versa.

How did the Python Enhancement Proposal (PEP) process come about?

Guido: That’s a very interesting historical tidbit. I think it was mostly started and championed by Barry Warsaw, one of the core developers. He and I started working together in ‘95, and I think around 2000, he came up with the suggestion that we needed more of a formal process around language changes.

I tend to be slow in these things. I mean I wasn’t the person who discovered that we really needed a mailing list. I wasn’t the person who discovered that the mailing list got unwieldy and we needed a newsgroup. I wasn’t the person to propose that we needed a website. I was also not the person to propose that we needed a process for discussing and inventing language changes, and making sure to avoid the occasional mistake where things had been proposed and quickly accepted without thinking through all of the consequences.

At the time between 1995 and 2000, Barry, myself, and a few other core developers, Fred Drake, Ken Manheimer for a while, were all at CNRI, and one of the things that CNRI did was organize the IETF meetings. CNRI had this little branch that eventually split off that was a conference organizing bureau, and their only customer was the IETF. They later also did the Python conferences for a while, actually. Because of that it was a pretty easy boondoggle to attend IETF meetings even if they weren’t local. I certainly got a taste of the IETF process with its RFCs and its meeting groups and stages, and Barry also got a taste of that. When he proposed to do something similar for Python, that was an easy argument to make. We consciously decided that we wouldn’t make it quite as heavy-handed as the IETF RFCs had become by then, because Internet standards, at least some of them, affect way more industries and people and software than a Python change, but we definitely modeled it after that. Barry is a genius at coming up with good names, so I am pretty sure that PEP was his idea.

We were one of the first open source projects at the time to have something like this, and it’s been relatively widely copied. The Tcl/Tk community basically changed the title and used exactly the same defining document and process, and other projects have done similar things.

Do you find that adding a little bit of formalism really helps crystallize the design decisions around Python enhancements?

Guido: I think it became necessary as the community grew and I wasn’t necessarily able to judge every proposal on its value by itself. It has really been helpful for me to let other people argue over various details, and then come with relatively clear-cut conclusions.

Do they lead to a consensus where someone can ask you to weigh in on a single particular crystallized set of expectations and proposals?

Guido: Yes. It often works in a way where I initially give a PEP a thumbs-up in the sense that I say, “It looks like we have a problem here. Let’s see if someone figures out what the right solution is.” Often they come out with a bunch of clear conclusions on how the problem should be solved and also a bunch of open issues. Sometimes my gut feelings can help close the open issues. I’m very active in the PEP process when it’s an area that I’m excited about—if we had to add a new loop control statement, I wouldn’t want that to be designed by other people. Sometimes I stay relatively far away from it, like database APIs.

What creates the need for a new major version?

Guido: It depends on your definition of major. In Python, we generally consider releases like 2.4, 2.5, and 2.6 “major” events, which only happen every 18–24 months. These are the only occasions where we can introduce new features. Long ago, releases were done at the whim of the developers (me, in particular). Early this decade, however, the users requested some predictability—they objected against features being added or changed in “minor” revisions (e.g., 1.5.2 added major features compared to 1.5.1), and they wished the major releases to be supported for a certain minimum amount of time (18 months). So now we have more or less time-based major releases: we plan the series of dates leading up to a major release (e.g., when alpha and beta versions and release candidates are issued) long in advance, based on things like release manager availability, and we urge the developers to get their changes in well in advance of the final release date.

Features selected for addition to releases are generally agreed upon by the core developers, after (sometimes long) discussions on the merits of the feature and its precise specification. This is the PEP process: Python Enhancement Proposal, a document-based process not unlike the IETF’s RFC process or the Java world’s JSR process, except that we aren’t quite as formal, as we have a much smaller community of developers. In case of prolonged disagreement (either on the merits of a feature or on specific details), I may end up breaking a tie; my tie-breaking algorithm is mostly intuitive, since by the time it is invoked, rational argument has long gone out of the window.

The most contentious discussions are typically about user-visible language features; library additions are usually easy (as they don’t harm users who don’t care), and internal improvements are not really considered features, although they are constrained by pretty stringent backward compatibility at the C API level.
Since the developers are typically the most vocal users, I can’t really tell whether features are proposed by users or by developers—in general, developers propose features based on needs they perceived among the users they know. If a user proposes a new feature, it is rarely a success, since without a thorough understanding of the implementation (and of language design and implementation in general) it is nearly impossible to properly propose a new feature. We like to ask users to explain their problems without having a specific solution in mind, and then the developers will propose solutions and discuss the merits of different alternatives with the users.

There’s also the concept of a radically major or breakthrough version, like 3.0. Historically, 1.0 was evolutionarily close to 0.9, and 2.0 was also a relatively small step from 1.6. From now on, with the much larger user base, such versions are rare indeed, and provide the only occasion for being truly incompatible with previous versions. Major versions are made backward compatible with previous major versions with a specific mechanism available for deprecating features slated for removal.

How did you choose to handle numbers as arbitrary precision integers (with all the cool advantages you get) instead of the old (and super common) approach of passing them to the hardware?

Guido: I originally inherited this idea from Python’s predecessor, ABC. ABC used arbitrary precision rationals, but I didn’t like the rationals that much, so I switched to integers; for reals, Python uses the standard floating-point representation supported by the hardware (and so did ABC, with some prodding).

Originally Python had two types of integers: the customary 32-bit variety (“int”) and a separate arbitrary precision variety (“long”). Many languages do this, but the arbitrary precision variety is relegated to a library, like Bignum in Java and Perl, or GNU MP for C. In Python, the two have (nearly) always lived side-by-side in the core language, and users had to choose which one to use by appending an “L” to a number to select the long variety. Gradually this was considered an annoyance; in Python 2.2, we introduced automatic conversion to long when the mathematically correct result of an operation on ints could not be represented as an int (for example, 2**100).
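
In a Python 2.x interpreter, where the two types still live side by side, the promotion looks like this:

>>> type(2 ** 30)
<type 'int'>
>>> 2 ** 100              # too large for a machine int, so the result is quietly a long
1267650600228229401496703205376L
>>> type(2 ** 100)
<type 'long'>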

Previously, this would raise an OverflowError exception. There was once a time where the result would silently be truncated, but I changed it to raising an exception before ever letting others use the language. In early 1990, I wasted an afternoon debugging a short demo program I’d written implementing an algorithm that made non-obvious use of very large integers. Such debugging sessions are seminal experiences.

However, there were still certain cases where the two number types behaved slightly differently; for example, printing an int in hexadecimal or octal format would produce an unsigned outcome (e.g., –1 would be printed as FFFFFFFF), while doing the same on the mathematically equal long would produce a signed outcome (–1, in this case). In Python 3.0, we’re taking the radical step of supporting only a single integer type; we’re calling it int, but the implementation is largely that of the old long type.

Why do you call it a radical step?

Guido: Mostly because it’s a big deviation from current practice in Python. There was a lot of discussion about this, and people proposed various alternatives where two (or more) representations would be used internally, but completely or mostly hidden from end users (but not from C extension writers). That might perform a bit better, but in the end it was already a massive amount of work, and having two representations internally would just increase the effort of getting it right, and make interfacing to it from C code even hairier. We are now hoping that the performance hit is minor and that we can improve performance with other techniques like caching.

How did you adopt the “there should be one—and preferably only one—obvious way to do it” philosophy?

Guido: This was probably subconscious at first. When Tim Peters wrote the “Zen of Python” (from which you quote), he made explicit a lot of rules that I had been applying without being aware of them. That said, this particular rule (while often violated, with my consent) comes straight from the general desire for elegance in mathematics and computer science. ABC’s authors also applied it, in their desire for a small number of orthogonal types or concepts. The idea of orthogonality is lifted straight from mathematics, where it refers to the very definition of having one way (or one true way) to express something. For example, the XYZ coordinates of any point in 3D space are uniquely determined, once you’ve picked an origin and three basis vectors.

I also like to think that I’m doing most users a favor by not requiring them to choose between similar alternatives. You can contrast this with Java, where if you need a listlike data structure, the standard library offers many versions (a linked list, or an array list, and others), or C, where you have to decide how to implement your own list data type.

What is your take on static versus dynamic typing?

Guido: I wish I could say something simple like “static typing bad, dynamic typing good,” but it isn’t always that simple. There are different approaches to dynamic typing, from Lisp to Python, and different approaches to static typing, from C++ to Haskell. Languages like C++ and Java probably give static typing a bad name because they require you to tell the compiler the same thing several times over. Languages like Haskell and ML, however, use type inferencing, which is quite different, and has some of the same benefits as dynamic typing, such as more concise expression of ideas in code. However the functional paradigm seems to be hard to use on its own—things like I/O or GUI interaction don’t fit well into that mold, and typically are solved with the help of a bridge to a more traditional language, like C, for example.

In some situations the verbosity of Java is considered a plus; it has enabled the creation of powerful code-browsing tools that can answer questions like “where is this variable changed?” or “who calls this method?” Dynamic languages make answering such questions harder, because it’s often hard to find out the type of a method argument without analyzing every path through the entire codebase. I’m not sure how functional languages like Haskell support such tools; it could well be that you’d have to use essentially the same technique as for dynamic languages, since that’s what type inferencing does anyway—in my limited understanding!

Are we moving toward hybrid typing?

Guido: I expect there’s a lot to say for some kind of hybrid. I’ve noticed that most large systems written in a statically typed language actually contain a significant subset that is essentially dynamically typed. For example, GUI widget sets and database APIs for Java often feel like they are fighting the static typing every step of the way, moving most correctness checks to runtime.

A hybrid language with functional and dynamic aspects might be quite interesting. I should add that despite Python’s support for some functional tools like map() and lambda, Python does not have a functional-language subset: there is no type inferencing, and no opportunity for parallelization.

Why did you choose to support multiple paradigms?

Guido: I didn’t really; Python supports procedural programming, to some extent, and OO. These two aren’t so different, and Python’s procedural style is still strongly influenced by objects (since the fundamental data types are all objects). Python supports a tiny bit of functional programming—but it doesn’t resemble any real functional language, and it never will. Functional languages are all about doing as much as possible at compile time— the “functional” aspect means that the compiler can optimize things under a very strong guarantee that there are no side effects, unless explicitly declared. Python is about having the simplest, dumbest compiler imaginable, and the official runtime semantics actively discourage cleverness in the compiler like parallelizing loops or turning recursion into loops.

Python probably has the reputation of supporting functional programming based on the inclusion of lambda, map, filter, and reduce in the language, but in my eyes these are just syntactic sugar, and not the fundamental building blocks that they are in functional languages. The more fundamental property that Python shares with Lisp (not a functional language either!) is that functions are first-class objects, and can be passed around like any other object. This, combined with nested scopes and a generally Lisp-like approach to function state, makes it possible to easily implement concepts that superficially resemble concepts from functional languages, like currying, map, and reduce. The primitive operations that are necessary to implement those concepts are built in Python, where in functional languages, those concepts are the primitive operations. You can write reduce() in a few lines of Python. Not so in a functional language.
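
For instance, a bare-bones reduce() really is just a handful of lines (a sketch that ignores some corner cases of the built-in):

def my_reduce(function, iterable, initializer=None):
    it = iter(iterable)
    if initializer is None:
        accumulator = next(it)          # raises StopIteration on an empty sequence
    else:
        accumulator = initializer
    for element in it:
        accumulator = function(accumulator, element)
    return accumulator

print(my_reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))   # 15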

When you created the language, did you consider the type of programmers it might have attracted?

Guido: Yes, but I probably didn’t have enough imagination. I was thinking of professional programmers in a Unix or Unix-like environment. Early versions of the Python tutorial used a slogan something like “Python bridges the gap between C and shell programming,” because that was where I was myself, and the people immediately around me. It never occurred to me that Python would be a good language to embed in applications until people started asking about that.

The fact that it was useful for teaching first principles of programming in a middle school or college setting or for self-teaching was merely a lucky coincidence, enabled by the many ABC features that I kept—ABC was aimed specifically at teaching programming to nonprogrammers.

How do you balance the different needs of a language that should be easy to learn for novices versus a language that should be powerful enough for experienced programmers to do useful things? Is that a false dichotomy?

Guido: Balance is the word. There are some well-known traps to avoid, like stuff that is thought to help novices but annoys experts, and stuff that experts need but confuses novices. There’s plenty enough space in between to keep both sides happy. Another strategy is to have ways for experts to do advanced things that novices will never encounter—for example, the language supports metaclasses, but there’s no reason for novices to know about them.

The Good Programmer
How do you recognize a good programmer?

Guido: It takes time to recognize a good programmer. For example, it’s really hard to tell good from bad in a one-hour interview. When you work together with someone though, on a variety of problems, it usually becomes pretty clear which are the good ones. I hesitate to give specific criteria—I guess in general the good ones show creativity, learn quickly, and soon start producing code that works and doesn’t need a lot of changes before it’s ready to be checked in. Note that some folks are good at different aspects of programming than others—some folks are good at algorithms and data structures, others are good at large-scale integration, or protocol design, or testing, or API design, or user interfaces, or whatever other aspects of programming exist.

What method would you use to hire programmers?

Guido: Based on my interviewing experience in the past, I don’t think I’d be any good at hiring in the traditional way—my interview skills are nearly nonexistent on both sides of the table! I guess what I’d do would be to use some kind of apprentice system where I’d be working closely with people for quite some time and would eventually get a feeling for their strengths and weaknesses. Sort of the way an open source project works.

Is there any characteristic that becomes fundamental to evaluate if we are looking for great Python programmers?

Guido: I’m afraid you are asking this from the perspective of the typical manager who simply wants to hire a bunch of Python programmers. I really don’t think there’s a simple answer, and in fact I think it’s probably the wrong question. You don’t want to hire Python programmers. You want to hire smart, creative, self-motivated people.

If you check job ads for programmers, nearly all of them include a line about being able to work in a team. What is your opinion on the role of the team in programming? Do you still see space for the brilliant programmer who can’t work with others?

Guido: I am with the job ads in that one aspect. Brilliant programmers who can’t do teamwork shouldn’t get themselves in the position of being hired into a traditional programming position—it will be a disaster for all involved, and their code will be a nightmare for whoever inherits it. I actually think it’s a distinct lack of brilliance if you can’t do teamwork. Nowadays there are ways to learn how to work with other people, and if you’re really so brilliant you should be able to learn teamwork skills easily—it’s really not as hard as learning how to implement an efficient Fast Fourier Transform, if you set your mind about it.

Being the designer of Python, what advantages do you see when coding with your language compared to another skilled developer using Python?

Guido: I don’t know—at this point the language and VM have been touched by so many people that I’m sometimes surprised at how certain things work in detail myself! If I have an advantage over other developers, it probably has more to do with having used the language longer than anyone than with having written it myself. Over that long period of time, I have had the opportunity to ponder which operations are faster and which are slower—for example, I may be aware more than most users that locals are faster than globals (though others have gone overboard using this, not me!), or that functions and method calls are expensive (more so than in C or Java), or that the fastest data type is a tuple.
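
Two of those observations are easy to check with the standard timeit module (the absolute numbers will vary by machine and interpreter version):

import timeit

# a function call costs noticeably more than the inline expression it wraps
print("inline  : %.3f" % timeit.timeit("x + 1", setup="x = 1"))
print("call    : %.3f" % timeit.timeit("f(x)", setup="x = 1\ndef f(v): return v + 1"))

# building a tuple of constants is cheaper than building the equivalent list
print("tuple   : %.3f" % timeit.timeit("(1, 2, 3)"))
print("list    : %.3f" % timeit.timeit("[1, 2, 3]"))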

When it comes to using the standard library and beyond, I often feel that others have an advantage. For example, I write about one web application every few years, and the technology available changes each time, so I end up writing a “first” web app using a new framework or approach each time. And I still haven’t had the opportunity to do serious XML mangling in Python.

It seems that one of the features of Python is its conciseness. How does this affect the maintainability of the code?

Guido: I’ve heard of research as well as anecdotal evidence indicating that the error rate per number of lines of code is pretty consistent, regardless of the programming language used. So a language like Python where a typical application is just much smaller than, say, the same amount of functionality written in C++ or Java, would make that application much more maintainable. Of course, this is likely going to mean that a single programmer is responsible for more functionality. That’s a separate issue, but it still comes out in favor of Python: more productivity per programmer probably means fewer programmers on a team, which means less communication overhead, which according to The Mythical Man-Month [Frederick P. Brooks; Addison-Wesley Professional] goes up by the square of the team size, if I remember correctly.

What link do you see between the easiness of prototyping offered by Python and the effort needed to build a complete application?

Guido: I never meant Python to be a prototyping language. I don’t believe there should be a clear distinction between prototyping and “production” languages. There are situations where the best way to write a prototype would be to write a little throwaway C hack. There are other situations where a prototype can be created using no “programming” at all—for example, using a spreadsheet or a set of find and grep commands.

The earliest intentions I had for Python were simply for it to be a language to be used in cases where C was overkill and shell scripts became too cumbersome. That covers a lot of prototyping, but it also covers a lot of “business logic” (as it’s come to be called these days) that isn’t particularly greedy in computing resources but requires a lot of code to be written. I would say that most Python code is not written as a prototype but simply to get a job done. In most cases Python is fully up to the job, and there is no need to change much in order to arrive at the final application.

A common process is that a simple application gradually acquires more functionality, and ends up growing tenfold in complexity, and there is never a precise cutover point from prototype to final application. For example, the code review application Mondrian that I started at Google has probably grown tenfold in code size since I first released it, and it is still all written in Python. Of course, there are also examples where Python did eventually get replaced by a faster language—for example, the earliest Google crawler/indexer was (largely) written in Python—but those are the exceptions, not the rule.

How does the immediacy of Python affect the design process?

Guido: This is often how I work, and, at least for me, in general it works out well! Sure, I write a lot of code that I throw away, but it’s much less code than I would have written in any other language, and writing code (without even running it) often helps me tremendously in understanding the details of the problem. Thinking about how to rearrange the code so that it solves the problem in an optimal fashion often helps me think about the problem. Of course, this is not to be used as an excuse to avoid using a whiteboard to sketch out a design or architecture or interaction, or other early design techniques. The trick is to use the right tool for the job. Sometimes that’s a pencil and a napkin—other times it’s an Emacs window and a shell prompt.

Do you think that bottom-up program development is more suited to Python?

Guido: I don’t see bottom-up versus top-down as religious opposites like vi versus Emacs. In any software development process, there are times when you work bottom-up, and other times when you work top-down. Top-down probably means you’re dealing with something that needs to be carefully reviewed and designed before you can start coding, while bottom-up probably means that you are building new abstractions on top of existing ones, for example, creating new APIs. I’m not implying that you should start coding APIs without having some kind of design in mind, but often new APIs follow logically from the available lower-level APIs, and the design work happens while you are actually writing code.

When do you think Python programmers most appreciate its dynamic nature?

Guido: The language’s dynamic features are often most useful when you are exploring a large problem or solution space and you don’t know your way around yet—you can do a bunch of experiments, each using what you learned from the previous ones, without having too much code that locks you into a particular approach. Here it really helps that you can write very compact code in Python—writing 100 lines of Python to run an experiment once and then starting over is much more efficient than writing a 1,000-line framework for experimentation in Java and then finding out it solves the wrong problem!

From a security point of view, what does Python offer to the programmer?

Guido: That depends on the attacks you’re worried about. Python has automatic memory allocation, so Python programs aren’t prone to certain types of bugs that are common in C and C++ code, like buffer overflows or using deallocated memory, which have been the bread and butter of many attacks on Microsoft software. Of course the Python runtime itself is written in C, and indeed vulnerabilities have been found there over the years, and there are intentional escapes from the confines of the Python runtime, like the ctypes module that lets one call arbitrary C code.

Does its dynamic nature help or rather the opposite?

Guido: I don’t think the dynamic nature helps or hurts. One could easily design a dynamic language that has lots of vulnerabilities, or a static language that has none. However having a runtime, or virtual machine as is now the “hip” term, helps by constraining access to the raw underlying machine. This is coincidentally one of the reasons that Python is the first language supported by Google App Engine, the project in which I am currently participating.

How can a Python programmer check and improve his code security?

Guido: I think Python programmers shouldn’t worry much about security, certainly not without having a specific attack model in mind. The most important thing to look for is the same as in all languages: be suspicious of data provided by someone you don’t trust (for a web server, this is every byte of the incoming web request, even the headers). One specific thing to watch out for is regular expressions—it is easy to write a regular expression that runs in exponential time, so web applications that implement searches where the end user types in a regular expression should have some mechanism to limit the running time.
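
To make that regular-expression pitfall concrete, here is a minimal sketch (an illustration added here, not part of the interview): a pattern with nested quantifiers applied to input that almost matches. The running time roughly quadruples every time the input grows by two characters, which is the exponential behaviour a web application needs to guard against, for example by rejecting such user-supplied patterns or running the match under a hard time limit in a separate process.

import re
import time

# '(a+)+$' never matches 'aaa...ab', but Python's backtracking regex engine
# tries an exponential number of ways to split the run of 'a's before failing.
for n in (16, 18, 20, 22):
    start = time.time()
    re.match(r'(a+)+$', 'a' * n + 'b')
    print("%d characters: %.3f seconds" % (n + 1, time.time() - start))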

Is there any fundamental concept (general rule, point of view, mindset, principle) that you would suggest to be proficient in developing with Python?

Guido: I would say pragmatism. If you get too hung up about theoretical concepts like data hiding, access control, abstractions, or specifications, you aren’t a real Python programmer, and you end up wasting time fighting the language, instead of using (and enjoying) it; you’re also likely to use it inefficiently. Python is good if you’re an instant gratification junkie like myself. It works well if you enjoy approaches like extreme programming or other agile development methods, although even there I would recommend taking everything in moderation.

What do you mean by “fighting the language”?

Guido: That usually means that they’re trying to continue their habits that worked well with a different language.

A lot of the proposals to somehow get rid of explicit self come from people who have recently switched to Python and still haven’t gotten used to it. It becomes an obsession for them. Sometimes they come out with a proposal to change the language; other times they come up with some super-complicated metaclass that somehow makes self implicit. Usually things like that are super-inefficient or don’t actually work in a multithreaded environment or whatever other edge case, or they’re so obsessed about having to type those four characters that they change the convention from self to s or a capital S. People will turn everything into a class, and turn every access into an accessor method, where that is really not a wise thing to do in Python; you’ll just have more verbose code that is harder to debug and runs a lot slower. You know the expression “You can write FORTRAN in any language”? You can write Java in any language, too.

continued in part 2 of this article…

web2py: a simple, clean, but powerful web framework in Python



The Python programming language is a versatile, high-level language that can be used easily for web programming, among other tasks. Building web applications can be difficult and toilsome on your own. Web frameworks perform much of the heavy lifting and can set up a template on which to build your web application easily.

Python has many different web frameworks. One of the most interesting is web2py. Web2py is a full-stack web framework that can be used to completely develop your web app. It has SQL database integration, a multi-threaded web server, and comes with a complete web IDE for designing your program.

This allows you to easily program with a unified interface from anywhere you can access a web browser. You can easily create and test your application in the same interface.

We will be installing and working with this framework on an Ubuntu 12.04 VPS.

Install the Web2py Software


Your Ubuntu server instance should already come with Python installed by default. This takes care of one of the few things that web2py needs to run successfully.

The only other software that we need to install is the unzip package, so that we can extract the web2py files from the zip file we will be downloading:

sudo apt-get update
sudo apt-get install unzip

Now, we can get the framework from the project’s website. We will download this to our home folder:

cd ~
wget http://www.web2py.com/examples/static/web2py_src.zip

Now, we can unzip the file we just downloaded and move inside:

unzip web2py_src.zip
cd web2py

Now that we are inside the web2py directory, how do we install it? Well, one of the great things about web2py is that you do not install it. You can run it right from this folder by typing:

python web2py.py

However, this will only launch the web interface that is accessible on the local machine. This is a security feature, but it doesn’t help us since our framework is being hosted on a remote droplet.

To stop the server, type CTRL-C in the terminal. We will sort out the web access issue momentarily.

Create SSL Certificates to Allow Remote Access


To allow remote access, we must start the web framework with SSL. We need to create our certificates before we can do that. Luckily, the openssl package is already installed.

We can create an RSA key to use for certificate generation with the following command:

openssl genrsa -out server.key 2048

Using this key, we can then generate the certificate signing request (.csr) file:

openssl req -new -key server.key -out server.csr

Finally, we can use these two pieces to create an SSL certificate:

openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt

You should now have a server.key, server.csr, and server.crt file in your web2py directory. We can use these to start up the interface in a secure manner by passing some parameters to web2py when we call it:

python web2py.py -a 'admin_password' -c server.crt -k server.key -i 0.0.0.0 -p 8000

You can select whichever password you would like to use to log into your framework web interface. The 0.0.0.0 address allows the interface to be reached from remote systems.

You can access the interface by visiting:

https://your_ip:8000

You should get a warning that says that the SSL certificate is not signed by a known certificate authority.

DigitalOcean SSL not trusted

This is normal, since we signed the certificate ourselves. Click the button that allows you to proceed anyway.

How To Use the Web Interface to Design Your Program


Upon visiting the site, you should be presented with the default web2py application:

DigitalOcean web2py default app

If you click on the “Administrative Interface” button, you can sign in with the password you just chose when starting the server.

When you’re done, you should be taken to the admin interface:

DigitalOcean web2py admin interface

On the left, you can see three folders. These are actually included sample applications. If you click on the folder titles, you will be taken to the live apps. One is the “admin” interface that you are using now. The “welcome” app is the basic application that you saw before you signed in.

You can edit all of these applications (except the admin) from within this interface by clicking on the “Manage” drop-down menu and selecting “Edit”:

DigitalOcean web2py edit app

If you did this, get back to the main interface by clicking the “Site” link in the top navigation bar.

Let’s create a new app through the “New simple application” field on the right side of the page. We will name it “sample_app”:

DigitalOcean web2py new app

This takes us to the editing interface of our new “sample_app” application. It lists the standard MVC (model, view, and controller) files, as well as a languages directory, a static pages directory, and directories for modules, plugins, and private files.

DigitalOcean web2py main edit

This is actually just a graphical representation of what is going on in the filesystem. If you were to stop the web server with CTRL-C and navigate within the web2py directory to the “applications/sample_app” folder, you would see folders that match these categories.

If we visit the site by going to:

https://your_ip:8000/sample_app

We can see that it looks almost exactly like the welcome app that was included by default (the only difference being the title).

Exploring Web2py’s MVC Implementation


Web2py takes care of a lot of the details necessary to form a fully functioning app. However, you must know how these pieces interact in order to develop effectively within this system. Web2py implements a policy of “coding-by-convention”.

Coding-by-convention is a design choice selected in order to reduce the number of decisions a developer has to make. This means that a developer should only have to specify when they are moving away from a convention.

For instance, if the controller for an application is called “image_blog.py” and it has a function called “main”, web2py will try to serve this page with a view called “image_blog/main.html”. The “.html” part is the default if no other extension is specified in the controller. If we break this convention, we would have to specify the alternative. Otherwise, this is automatic.
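
As a concrete sketch of that convention (using the hypothetical file names from the example above), the controller and its automatically selected view might look like this:

# applications/sample_app/controllers/image_blog.py
def main():
    # With no extension specified, web2py renders the result of this function
    # with the view applications/sample_app/views/image_blog/main.html
    return dict(title="Image blog")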

We can see that there is a controller called default.py. This is the default controller that is used when no other is specified. If you click on the “Edit” button to the left of this controller, you can see the functions it defines:

DigitalOcean web2py default functions

If we return back to the main editing interface (by clicking the “Edit” link in the top navigation bar), we can see that the functions that define user-facing actions have a matching view:

DigitalOcean web2py matching view

Create A Sample Controller and View


Let’s try this out by creating our own controller first. Under the “Controllers” section, click “Create” to make a new controller:

DigitalOcean web2py hello controller

If we edit this file, we will see that it defines an “index” function that will be called if no other is requested. This serves as a default. This function simply returns a dictionary that stores a variable called message with a value of “hello from hello.py”.

# coding: utf8
# try something like
def index(): return dict(message="hello from hello.py")

You can edit this and click on the disk icon to save your work. For our purposes, this does exactly what we want it to do.

Next, we need to create the corresponding view to render the information that is being passed back (the dictionary that defines message). Click on the “Edit” button or the “<<back” button to return to the app directory.

Under the “Views” section create a new view. We need to match the controller and function for this view to be automatically applied:

DigitalOcean web2py hello view

Edit the file by clicking the button to the left of it. You should see the default view, which contains something like:

{{extend 'layout.html'}}
<h1>This is the hello/index.html template</h1>
{{=BEAUTIFY(response._vars)}}

This file extends, or builds off of the layout.html view. This is a useful method of keeping a consistent look between all of your pages. If we visit the page in our browser, we can see how it renders.

We can get to this page by visiting:

https://your_ip:8000/sample_app/hello

DigitalOcean web2py hello render

You can see that it rendered the results in our dictionary below the heading. We can print this directly by editing our view.

If we remove the last two lines, we can replace them with a reference to our message variable, since it already contains the complete message we want to show:

{{extend 'layout.html'}}
<h1>{{=message}}</h1>

Now, we should just see the message that the controller passed, rendered as an h1 header:

DigitalOcean web2py modified render

If we would like to get rid of the styling, we can remove the top line.

The {{=message}} is a way that we can embed Python code within our files. This allows us to dynamically generate content that is not necessarily available at the time the program is written.
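
As a small sketch of that idea (a hypothetical extension of the hello controller and view, not generated by web2py itself), the controller could pass a value computed at request time and the view could embed it:

# controllers/hello.py
import datetime

def index():
    # 'now' is recomputed on every request, so the rendered page changes each time
    return dict(message="hello from hello.py", now=datetime.datetime.now())

The matching view, views/hello/index.html, would then render it with:

{{extend 'layout.html'}}
<h1>{{=message}}</h1>
<p>Rendered at {{=now.strftime('%Y-%m-%d %H:%M:%S')}}</p>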

Conclusion


You should now have a very basic understanding of the web2py framework. More importantly, you should be able to see how easy it is to develop on this framework. The flexibility of being able to develop through text files or through a web interface means that you can work well in different environments.

The web interface also comes with a great number of tools for working with your growing application. You can see stack traces when a bug is introduced, you can easily package your application for deployment, and you can view a log of all of the errors your application has encountered during development.

Web2py is a framework, an IDE, and a development suite all in one package.

Resources:

website: http://www.web2py.com

video tutorials: http://www.youtube.com/channel/UC3ehYtHsuX19OPg7XTtBb4A?feature=watch

on vimeo: https://vimeo.com/user315328

Language translation with Python part 2


Stemming, lemmatization, and translation simplified

Stemming words

Stemming is a technique for removing affixes from a word, ending up with the stem. For example, the stem of “cooking” is “cook”, and a good stemming algorithm knows that the “ing” suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of its index while increasing retrieval accuracy.
One of the most common stemming algorithms is the Porter Stemming Algorithm, by Martin Porter. It is designed to remove and replace well known suffixes of English words, and its usage in NLTK will be covered next.

The resulting stem is not always a valid word. For example, the stem of “cookery” is “cookeri”. This is a feature, not a bug.
How to do it…

NLTK comes with an implementation of the Porter Stemming Algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem.

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'

How it works…

The PorterStemmer knows a number of regular word forms and suffixes, and uses that knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, that has the same root meaning.

There’s more…

There are other stemming algorithms out there besides the Porter Stemming Algorithm, such as the Lancaster Stemming Algorithm, developed at Lancaster University. NLTK includes it as the LancasterStemmer class. At the time of writing, there is no definitive research demonstrating the superiority of one algorithm over the other. However, Porter Stemming is generally the default choice.
All the stemmers covered next inherit from the StemmerI interface, which defines the stem() method.

LancasterStemmer
The LancasterStemmer functions just like the PorterStemmer, but can produce slightly different results. It is known to be slightly more aggressive than the PorterStemmer.

>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'

RegexpStemmer
You can also construct your own stemmer using the RegexpStemmer. It takes a single regular expression (either compiled or as a string) and will remove any prefix or suffix that matches.

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('ingleside')
'leside'

A RegexpStemmer should only be used in very specific cases that are not covered by the PorterStemmer or LancasterStemmer.
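
One such case (a hypothetical example) is a corpus with a domain-specific affix that the general-purpose stemmers know nothing about, such as a “meta” prefix:

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('^meta|s$')
>>> stemmer.stem('metadata')
'data'
>>> stemmer.stem('tables')
'table'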

SnowballStemmer
New in NLTK 2.0b9 is the SnowballStemmer, which supports 13 non-English languages. To use it, you create an instance with the name of the language you are using, and then call the stem() method. Here is a list of all the supported languages, and an example using the Spanish SnowballStemmer:

>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
u'hol'

Lemmatizing words with WordNet

Lemmatization is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word which means the same thing. But the word you end up with can be completely different. A few examples will explain lemmatization…

Getting ready

Be sure you have unzipped the wordnet corpus in nltk_data/corpora/wordnet. This will allow the WordNetLemmatizer to access WordNet.

How to do it…
We will use the WordNetLemmatizer to find lemmas:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
>>> lemmatizer.lemmatize('cookbooks')
'cookbook'

How it works…

The WordNetLemmatizer is a thin wrapper around the WordNet corpus, and uses the morphy() function of the WordNetCorpusReader to find a lemma. If no lemma is found, the word is returned as it is. Unlike with stemming, knowing the part of speech of the word is important. As demonstrated previously, “cooking” does not have a lemma unless you specify that the part of speech (pos) is a verb. This is because the default part of speech is a noun, and since “cooking” is not a noun, no lemma is found. “Cookbooks”, on the other hand, is a noun, and its lemma is the singular form, “cookbook”.
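
You can also call morphy() yourself through the wordnet corpus reader; here is a small sketch (assuming the wordnet corpus is installed as described in the Getting ready section):

>>> from nltk.corpus import wordnet
>>> wordnet.morphy('cooking', 'v')
'cook'
>>> wordnet.morphy('cookbooks')
'cookbook'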

There’s more…
Here’s an example that illustrates one of the major differences between stemming and lemmatization:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'

Instead of just chopping off the “es” like the PorterStemmer, the WordNetLemmatizer finds a valid root word. Where a stemmer only looks at the form of the word, the lemmatizer looks at the meaning of the word. And by returning a lemma, you will always get a valid word.

Combining stemming with lemmatization
Stemming and lemmatization can be combined to compress words more than either process can by itself. These cases are somewhat rare, but they do exist:

>>> stemmer.stem('buses')
'buse'
>>> lemmatizer.lemmatize('buses')
'bus'
>>> stemmer.stem('bus')
'bu'

In this example, stemming saves one character, lemmatizing saves two characters, and stemming the lemma saves a total of three characters out of five: a 60% compression rate. This level of word compression over many thousands of words, while unlikely to always produce such high gains, can still make a huge difference.
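
The “stemming the lemma” step mentioned above is simply a matter of chaining the two objects we already created:

>>> stemmer.stem(lemmatizer.lemmatize('buses'))
'bu'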

Finally, translation:

Translating text with Babelfish

Babelfish is an online language translation API provided by Yahoo. With it, you can translate text in a source language to a target language. NLTK comes with a simple interface for using it.

Getting ready
Be sure you are connected to the internet first. The babelfish.translate() function requires access to Yahoo’s online API in order to work.
How to do it…
To translate your text, you first need to know two things:
1. The language of your text or source language.
2. The language you want to translate to or target language.
Language detection is outside the scope of this recipe, so we will assume you already know the source and target languages.

>>> from nltk.misc import babelfish
>>> babelfish.translate('cookbook', 'english', 'spanish')
'libro de cocina'
>>> babelfish.translate('libro de cocina', 'spanish', 'english')
'kitchen book'
>>> babelfish.translate('cookbook', 'english', 'german')
'Kochbuch'
>>> babelfish.translate('kochbuch', 'german', 'english')
'cook book'

You cannot translate using the same language for both source and target. Attempting to do so will raise a BabelfishChangedError.
How it works…

The translate() function is a small function that sends a urllib request to http://babelfish.yahoo.com/translate_txt, and then searches the response for the translated text.

If Yahoo, for whatever reason, changes its HTML response to the point that translate() cannot identify the translated text, a BabelfishChangedError will be raised. This is unlikely to happen, but if it does, you may need to upgrade to a newer version of NLTK and/or report the error.

There’s more…
There is also a fun function called babelize() that translates back and forth between the source and target language until there are no more changes.

>>> for text in babelfish.babelize('cookbook', 'english', 'spanish'):
...     print text
cookbook
libro de cocina
kitchen book
libro de la cocina
book of the kitchen

Available languages
You can see all the languages available for translation by examining the available_languages attribute.

>>> babelfish.available_languages
['Portuguese', 'Chinese', 'German', 'Japanese', 'French', 'Spanish', 'Russian', 'Greek', 'English', 'Korean', 'Italian']

The lowercased version of each of these languages can be used as a source or target language for translation.