Forbes top 100 quote collecting spider with Dragline

 

In this post we write a jump-start, real-world spider using the Dragline crawling framework. By the end of this post, in addition to the Dragline API, we will also cover some basic concepts of Python.

Forbes compiles good quotes and displays them on its website. Suppose you wish to save them and forward them to your close friends. Because you are a programmer, and a Python coder at that, your intention is to fetch all of that data from the website using a spider. There are many spiders to study in the coming days, but here we discuss a basic one for fetching the Forbes quotes. The aim is to make you familiar with the Dragline framework.

As already explained in the previous post, Dragline has a lot of invisible features which make the spiders built with it smart. Hopefully you have already installed Dragline; if not, see the installation instructions in the previous post.

Task 1:

Learning the basics of the Dragline API

Dragline mainly consists of these major modules

      • dragline.http
      • dragline.htmlparser

dragline.http

It provides a Request class:

class dragline.http.Request(url, method='GET', form_data=None, headers={}, callback=None, meta=None)

Parameters:
  • url (string) – the URL of this request
  • method (string) – the HTTP method of this request. Defaults to 'GET'.
  • headers (dict) – the headers of this request.
  • callback (string) – name of the function to call after url is downloaded.
  • meta (dict) – A dict that contains arbitrary metadata for this request.
send()
This function sends HTTP requests.

Returns: response
Return type: dragline.http.Response
Raises: dragline.http.RequestError: when failed to fetch contents
>>> req = Request("http://www.example.org")
>>> response = req.send()
>>> print response.headers['status']
200

and a Response class:

class dragline.http.Response(url=None, body=None, headers=None, meta=None)

Parameters:
  • headers (dict) – the headers of this response.
  • body (str) – the response body.
  • meta (dict) – meta copied from request

This class is used to create a user-defined response for testing your spider, and it is useful in many other cases as well. It is much easier to work with than the get method of the Requests module.

dragline.htmlparser

This is the basic parser module for extracting content from HTML data; its main function is called HtmlParser. Even apart from the rest of Dragline, htmlparser on its own is a powerful parsing tool.

HtmlParser Function

dragline.htmlparser.HtmlParser(response)
Parameters: response (dragline.http.Response)

This function takes a response object as its argument and returns an lxml etree object.

The HtmlParser function returns an lxml object of type HtmlElement, which has a few powerful methods. All the details of the lxml object are discussed in the lxml.html.HtmlElement section.

First we should create an HtmlElement object by sending an appropriate URL as a parameter. The URL is that of the page we want to scrape.

The HtmlElement object is returned by the HtmlParser function of the dragline.htmlparser module:

>>> req = Request('www.gutenberg.com')
>>> parse_obj = HtmlParser(req.send())

The methods available on an HtmlElement object are:
extract_urls(xpath_expr)

This function fetches all the links from the web page in the response, filtered by the XPath expression given as its argument.

If no XPath is given, links are fetched from the entire document. Continuing the previous example with the HtmlElement object parse_obj:

>>> parse_obj.extract_urls('//div[@class="product"]')
xpath(expression)

This function directly accumulates the results of the XPath expression. It is used to fetch HTML body elements directly; for example, given the following document:

<html>
    <head>
    </head>
    <body>
        <div class="tree">
            <a href="http://www.treesforthefuture.org/">Botany</a>
        </div>
        <div class="animal">
            <a href="http://www.animalplanet.com/">Zoology</a>
        </div>
    </body>
</html>

then we can use an XPath expression like the following:

>>> parse_obj.xpath('//div[@class="tree"]/a/text()')
['Botany']
extract_text(xpath_expr)

This function grabs all the text from the web page. The XPath expression is an optional argument; if specified, only the text matching the expression is returned.

     >>> parse_obj.extract_text('//html')
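To see how these methods work together without touching the network, here is a minimal sketch that builds a user-defined Response (as described in the dragline.http section) and runs all three HtmlElement methods on the sample markup above; it assumes the Response constructor and HtmlParser behave exactly as documented:

from dragline.http import Response
from dragline.htmlparser import HtmlParser

body = """<html><body>
<div class="tree"><a href="http://www.treesforthefuture.org/">Botany</a></div>
<div class="animal"><a href="http://www.animalplanet.com/">Zoology</a></div>
</body></html>"""

# No network call is made: the response is constructed by hand for testing
fake_response = Response(url="http://www.example.org", body=body)
parse_obj = HtmlParser(fake_response)

print parse_obj.extract_urls('//div[@class="tree"]')    # links inside the tree div
print parse_obj.xpath('//div/a/text()')                 # ['Botany', 'Zoology']
print parse_obj.extract_text('//div[@class="animal"]')  # text under the animal div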

So now you have seen the main modules of Dragline and the important methods they provide.

Now let's begin our journey by writing a small spider.
First go to the folder where you want to save your spider and follow the procedure below.
  • $ mkdir samplespider
  • $ cd samplespider
  • $ dragline-admin init forbesquotes

This creates a spider called forbesquotes in your newly created samplespider directory.

You will now see a folder named forbesquotes inside samplespider; move into it:

  • $ cd forbesquotes
 Task 2:

Writing a spider to collect the top 100 quotes from Forbes

 

Below is the 26-line spider for extracting the top 100 quotes from Forbes.

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re

class Spider:
    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf
 
    def parse(self,response):
        html = HtmlParser(response)
        self.parseQuote(response) 
        for url in html.xpath('//span[@class="page_links"]/a/@href'):
            yield Request(url,callback="parseQuote")
 
    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

This is a 26-line spider written with Dragline. At first glance you may not understand much of it, so let's explain everything.

As already mentioned, when we create a new spider a new directory is created with the spider's name. It contains two files:

  • main.py
  • settings.py

main.py looks like the following, with a default class called Spider and the methods __init__ and parse.

from dragline.htmlparser import HtmlParser
from dragline.http import Request


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.example.org"
        self.allowed_domains = []
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

All of this is given to us out of the box, without having to write it again. We only need to concentrate on how to attack the problem.

1) The __init__ method sets the starting URL and the allowed domains from which the spider begins.

In our case the forbesquotes spider starts at self.start = 'http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes'

and sets self.allowed_domains = ['www.forbes.com']

This is a list, so it can hold any number of allowed domains.

Now our main.py looks like

from dragline.htmlparser import HtmlParser
from dragline.http import Request


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kev inkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

OK, now we should crawl through the pages, so let's write a function called parseQuote to process a page: its input is the response object, and its outcome is that the quotes on that page are written to a file. parseQuote should be called once for each page on which quotes are available. After adding the parseQuote function:

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

 
    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

If you observe parseQuote, only the first three lines are the framework's job; the remaining code is pure Python logic for stripping and cleaning the raw quotes fetched from the response and then writing them to a file.
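To see what that pure-Python logic does, here is a tiny illustration of the regex stripping on a hypothetical raw line resembling what the Forbes markup yields (the numbering prefix is removed before the quote is written out):

import re

raw = "17. Life is what happens to you while you are busy making other plans. John Lennon"  # hypothetical input
pat = re.compile(r'\d*\.')
# pat.search(raw).group() is "17.", so splitting on it keeps only the quote text
quote = raw.split(pat.search(raw).group(), 1)[1]
print quote.strip()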

parse is the function where spider execution starts. From there we supply callbacks for the pages we wish to navigate to, so the spider smartly follows the path we specify.

So now I am adding content to the parse method. After observing the structure of the web pages, I call parseQuote on the current response.

Next, using the extract_urls method of the Dragline HtmlElement object, I extract all the URLs matching the relevant XPath and yield them as requests with parseQuote as the callback. The resulting code looks like:

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re

class Spider:
    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf
 
    def parse(self,response):
        html = HtmlParser(response)
        self.parseQuote(response)
        for url in html.extract_urls('//span[@class="page_links"]/a'):
            yield Request(url,callback="parseQuote")

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

So now, after completing main.py, just go to the terminal and type one of the following commands to run the spider:

  • $ dragline .   (run from inside the spider directory)
  • $ dragline /path_to_spider/   (run from any other directory)

Then our spider starts running, displaying every processed URL in the terminal, and a new file quotes.txt is created in the current directory containing the top 100 quotes:

 Life isnt about getting and having, its about giving and being. 

 Whatever the mind of man can conceive and believe, it can achieve. Napoleon Hill

 Strive not to be a success, but rather to be of value. Albert Einstein

 Two roads diverged in a wood, and II took the one less traveled by, And that has made all the difference. Robert Frost

 I attribute my success to this: I never gave or took any excuse. Florence Nightingale

 You miss 100% of the shots you dont take. Wayne Gretzky

 Ive missed more than 9000 shots in my career. Ive lost almost 300 games. 26 times Ive been trusted to take the game winning shot and missed. Ive failed over and over and over again in my life. And that is why I succeed. Michael Jordan

 The most difficult thing is the decision to act, the rest is merely tenacity. Amelia Earhart

 Every strike brings me closer to the next home run. Babe Ruth

 Definiteness of purpose is the starting point of all achievement. W. Clement Stone

 We must balance conspicuous consumption with conscious capitalism. Kevin Kruse

 Life is what happens to you while youre busy making other plans. John Lennon

 We become what we think about. Earl Nightingale

Twenty years from now you will be more disappointed by the things that you didnt do than by the ones you did do, so throw off the bowlines, sail away from safe harbor, catch the trade winds in your sails. Explore, Dream, Discover. Mark Twain

Life is 10% what happens to me and 90% of how I react to it. Charles Swindoll

 There is only one way to avoid criticism: do nothing, say nothing, and be nothing. Aristotle

 Ask and it will be given to you; search, and you will find; knock and the door will be opened for you. Jesus

 The only person you are destined to become is the person you decide to be. Ralph Waldo Emerson

 Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau

 When I stand before God at the end of my life, I would hope that I would not have a single bit of talent left and could say, I used everything you gave me. Erma Bombeck

 Few things can help an individual more than to place responsibility on him, and to let him know that you trust him. Booker T. Washington

 Certain things catch your eye, but pursue only those that capture the heart. Ancient Indian Proverb

 Believe you can and youre halfway there. Theodore Roosevelt

 Everything youve ever wanted is on the other side of fear. George Addair

 We can easily forgive a child who is afraid of the dark; the real tragedy of life is when men are afraid of the light. Plato

 

 If youre offered a seat on a rocket ship, dont ask what seat! Just get on. Sheryl Sandberg

 First, have a definite, clear practical ideal; a goal, an objective. Second, have the necessary means to achieve your ends; wisdom, money, materials, and methods. Third, adjust all your means to that end. Aristotle

 If the wind will not serve, take to the oars. Latin Proverb

 You cant fall if you dont climb. But theres no joy in living your whole life on the ground. Unknown

 We must believe that we are gifted for something, and that this thing, at whatever cost, must be attained. Marie Curie

 Too many of us are not living our dreams because we are living our fears. Les Brown

 Challenges are what make life interesting and overcoming them is what makes life meaningful. Joshua J. Marine

 If you want to lift yourself up, lift up someone else. Booker T. Washington

 I have been impressed with the urgency of doing. Knowing is not enough; we must apply. Being willing is not enough; we must do. Leonardo da Vinci

 Limitations live only in our minds. But if we use our imaginations, our possibilities become limitless. Jamie Paolinetti

 You take your life in your own hands, and what happens? A terrible thing, no one to blame. Erica Jong

 Whats money? A man is a success if he gets up in the morning and goes to bed at night and in between does what he wants to do. Bob Dylan

 I didnt fail the test. I just found 100 ways to do it wrong. Benjamin Franklin

 Nothing is impossible, the word itself says, Im possible! Audrey Hepburn

 The only way to do great work is to love what you do. Steve Jobs

 If you can dream it, you can achieve it. Zig Ziglar

 Life isnt about getting and having, its about giving and being. 

 Whatever the mind of man can conceive and believe, it can achieve. Napoleon Hill

 Strive not to be a success, but rather to be of value. Albert Einstein
          and so on ....................
                 

So this is a very small example; it is really killing an ant with an axe. The main aim of this post is to introduce Dragline and make you familiar with it. Crawling is not always legal, so write spiders with the risks and benefits in mind. Many Python techniques were used along the way, such as list comprehensions and regular expressions. Hope you enjoyed it; comment if you have any queries.

Dragline, a Samson jawbone for crawling the web


Samson, a great hero from mythology, slew all his enemies with a single jawbone. Here we have a single tool to slay all the problems of crawling. The olden days of pain and patience are gone. Now a new library is emerging, built especially for writing powerful web spiders. This library will instantly turn you into an amazing spiderman who can play with hyperlinks. It is not just a tool for novices; enterprise-level work can be done with it with ease.

What is Dragline?

Dragline is a powerful Python framework for writing our own spiders. It can even be considered a full replacement for the other well-known web scraping frameworks that have evolved so far.

Dragline actually has many advantages over its ancestors in the field of crawling. The main features are not going to be discussed here; you can find out why Dragline is more sophisticated than the other scraping frameworks by navigating to this link: Dragline features

Where to get it?

You can download Dragline from the official python repository https://pypi.python.org/pypi/Dragline

We can also install Dragline with the following command if pip is installed on the system:

$ sudo pip install --pre dragline

or

C:\> pip install --pre dragline

1. Introduction to Dragline

Now we can begin our fun journey. I am going to show a real-world example in an upcoming post, but for now there are a few important points to ponder.

What is a spider? A spider is a program that crawls through web pages in a specified manner, where the specified manner is the way we ask our spider to run. You may be surprised that there are few good resources on crawling, and not even a single O'Reilly book on the subject of spiders, especially with Python.

Many good crawling frameworks are ignored, known only to the few developers (a surprisingly small group in such a large Python community) who really work in the enterprise industry. Is crawling a topic worth considering? Obviously yes, because crawlers are the main source for creating datasets and for fetching information programmatically.

Then why did this new framework emerge? The existing crawling frameworks have some drawbacks when we are working on huge projects. Young readers may be put off by words like "project" and "enterprise", but I ask you to take them lightly; I don't like them either, and everything should be plain. I used Dragline to write spiders for many websites, and it takes roughly 5 minutes to write a spider for a normal website. What amazing speed!

The complexity of crawling increases due to factors like:

a) JavaScript rendering of web pages

b) pages that load dynamically as you scroll down

c) rejection of HTTP requests by the server, i.e. timeouts

The first two factors are unavoidable and left to the genius of the programmer, but the last one can be handled by the library if it is smart. Dragline is good at pausing and resuming connections when a server's load is heavy. I want to keep the usage of Dragline in suspense for the time being, but if you are inspired by my words you can check it out right now. The documentation available is good, though not extraordinary, but you may be surprised at how well the framework works once you understand it.

Thanks for listening patiently; I assure you that Dragline won't disappoint you. We will meet next time with a real-world spider that amazes you and keeps your sleeves rolled up.

If you wish to see improvements to Dragline, you can contribute to Dragline's GitHub repository here:

Dragline github

 

 

Up and running with MongoDB and Python


MongoDB is a NoSQL database which stores records as documents. MongoDB has some really nice, unique tools that are not (all) present in any other solution.

1. Indexing

MongoDB supports generic secondary indexes, allowing a variety of fast queries, and provides unique, compound, and geospatial indexing capabilities as well.

2. Stored JavaScript

Instead of stored procedures, developers can store and use JavaScript functions and values on the server side.

3. Aggregation

MongoDB supports MapReduce and other aggregation tools.

4. Fixed-size collections

Capped collections are fixed in size and are useful for certain types of data, such as logs.

5. File storage

MongoDB supports an easy-to-use protocol for storing large files and file metadata.

Some features common to relational databases are not present in MongoDB, notably joins and complex multi-row transactions. These are architectural decisions to allow for scalability, because both of those features are difficult to provide efficiently in a distributed system.

After installing MongoDB, please run the following command to start the MongoDB server.

$ ./mongod        (Linux or Mac users)

C:\> mongod       (Windows users)

The server runs on port 27017 by default.

Next, if you want to perform raw database operations, open another terminal and type

$ ./mongo (or) C:\> mongo to launch the client database JavaScript shell.

Actually, the documents stored in MongoDB are in BSON format. Python developers only deal with storing dictionaries as documents in MongoDB and never look at the BSON format, so you may wonder how things work at all. The answer is that a driver called pymongo abstracts away the conversion of a Python dictionary into a valid MongoDB BSON document. So now we should install pymongo to integrate Python and MongoDB.

First let's look at how the basic MongoDB client JavaScript shell works.

initially the shell looks like this:

MongoDB shell version 2.2.7

connecting to: test

>

> x = 200
200
> x / 5;
40

> "Hello, World!".replace("World", "MongoDB");
Hello, MongoDB!

> function factorial (n) {
... if (n <= 1) return 1;
... return n * factorial(n - 1);
... }
> factorial(5);
120

All of the above are JavaScript expressions and functions, so the shell is a full-fledged JavaScript interpreter.

If you have a database called 'student', then we can switch from the test db to the student db as follows:

>use student

switched to db student

 

Insertion into MongoDB

First we create a post and insert it into the database.

> post = {"title" : "Naren",
... "ID" : "10C91A05B3",
... "date" : new Date()}

> db.student.insert(post)

Now the record post is saved in the student database.

 

Retrieving the record from MongoDB

>db.student.find()

This command returns all the documents in the student collection. It is equivalent to 'select * from student' in SQL.

>db.student.findOne()

This command returns only the first document from the student collection. We can also use criteria when fetching documents.

Note: _id is a default field that MongoDB adds to each document we insert. When fetching, you will see this extra field in addition to "title", "ID" and "date". In this way MongoDB provides a default primary key for each document.

If you want to select the record whose title is Naren, use the following command:

> db.student.findOne({"title":"Naren"})

 

Updating a document in MongoDB

case 1: Adding new field

If we want to add a new field percentage to the document, then first do

> post.percentage = 83

> db.student.update({"title":"Naren"}, post)

Now the document looks like

> db.student.findOne({"title":"Naren"})

{"title":"Naren", "ID":"10C91A05B3",

"date":"Sunday May 18 2014 09:18:22 PM",

"percentage":83}

case 2: modifying existing record

> db.student.update({"title":"Naren"}, {"$set":{"ID":"10C91A0501"}})

With the above command the database searches for the document whose title is Naren and sets its ID value to 10C91A0501.

'$set' is called a modifier; MongoDB is very rich in modifiers. Check the MongoDB manual for the complete list.

 

Deleting records in database

> db.student.remove({"title" : "Naren"})

This removes all the records with title "Naren".

Note: If you want to remove the entire student collection, don't loop over each document. Simply use the following command:

> db.student.drop()

The above command drops the student collection in one shot, and is preferred for performance reasons.

Jump-starting Python and MongoDB

First install the pymongo library for your Python interpreter; use "pip install pymongo" to do that. Once pymongo is installed, we are ready to play with MongoDB.


 

from datetime import datetime
import sys
from pymongo import Connection
from pymongo.errors import ConnectionFailure

def main():
    """Connects to MongoDB, inserts a document and reads it back."""
    try:
        c = Connection(host="localhost", port=27017)
        print "Connected successfully"
    except ConnectionFailure, e:
        sys.stderr.write("Couldn't connect: %s" % e)
        sys.exit(1)
    dbh = c['mydb']                # database handle for the 'mydb' database
    assert dbh.connection == c
    user_doc = {
        "username": "narenarya",
        "firstname": "naren",
        "secondname": "arya",
        "date_of_birth": datetime(1993, 6, 24),
        "email": "narenarya@live.com",
        "score": 80}
    dbh.users.insert(user_doc, safe=True)   # insert into the 'users' collection
    print "successfully inserted doc: %s" % user_doc
    print "Now retrieving result"
    fan = dbh.users.find({"firstname": "naren"})   # find() returns a cursor
    for user in fan:
        print user["email"]

if __name__ == "__main__":
    main()

  1. First you should start the MongoDB server by typing $ ./mongod or C:\> mongod.
  2. Don't be afraid of all those imports. All we really need is the Connection class from the pymongo package, so 'from pymongo import Connection' is required.
  3. The above two steps are compulsory. Next we get a Connection object to MongoDB with c = Connection(host, port). Here host="localhost" and port=27017 by default.
  4. Now we get a database handle to perform CRUD operations: dbh = c['mydb']. This fetches a handle for the database mydb; in this database we can store many collections (tables, in SQL terms).
  5. To insert a document called user_doc into the collection 'users', we simply call dbh.users.insert(user_doc). We don't need to create a collection called users first; MongoDB creates it automatically.
  6. Retrieving gives us a cursor object, here called "fan". For each document user in that cursor we print its 'email'. The other imports are for error handling if the connection fails, and datetime generates the date for user_doc.
  7. This is just a quick start to help you switch to MongoDB; there are good books on the topic, and a sketch of the remaining CRUD operations follows below. For any queries please mail me: narenarya@live.com
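For completeness, here is a small hedged sketch of update, remove and count using the same old-style pymongo Connection API as above; the collection and field names simply continue the example and are not from the original post:

from pymongo import Connection

c = Connection(host="localhost", port=27017)
dbh = c['mydb']

# Update: set a new score on every document whose firstname is "naren"
dbh.users.update({"firstname": "naren"}, {"$set": {"score": 90}}, multi=True)

# Remove: delete the documents matching the same criteria
dbh.users.remove({"firstname": "naren"})

# Count what is left in the collection
print dbh.users.count()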

Valuable recipes of core Python we love to skip, part 1


Recipe 1: Accessing private methods in Python

Python doesn’t support privacy directly, but relies on the programmer to know when it is safe to modify an attribute from the outside. After all, you should know how to use an object before using that object. It is, however, possible to achieve something like private attributes with a little trickery.

To make a method or attribute private (inaccessible from the outside), simply start its name with two underscores:
class Secretive:
    def __inaccessible(self):
        print "Bet you can't see me..."

    def accessible(self):
        print "The secret message is:"
        self.__inaccessible()

Now __inaccessible is inaccessible to the outside world, while it can still be used inside the class (for example, from accessible):

>>> s = Secretive()
>>> s.__inaccessible()
Traceback (most recent call last):

File "<pyshell#112>", line 1, in ?
s.__inaccessible()

AttributeError: Secretive instance has no attribute '__inaccessible'

>>> s.accessible()
The secret message is:
Bet you can't see me...

Although the double underscores are a bit strange, this seems like a standard private method, as found in other languages. What’s not so standard is what actually happens. Inside a class definition, all names beginning with a double underscore are “translated” by adding a single underscore and the class name to the beginning:

>>> Secretive._Secretive__inaccessible
<unbound method Secretive.__inaccessible>
If you know how this works behind the scenes, it is still possible to access private methods outside the class, even though you’re not supposed to:
>>> s._Secretive__inaccessible()
Bet you can't see me...

So, in short, you can’t be sure that others won’t access the methods and attributes of your objects, but this sort of name-mangling is a pretty strong signal that they shouldn’t.
If you don’t want the name-mangling effect, but you still want to send a signal for other objects to stay away, you can use a single initial underscore. This is mostly just a convention, but has some practical effects. For example, names with an initial underscore aren’t imported with starred imports.
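A tiny illustration of that single-underscore convention (the module name here is hypothetical, purely for the demo):

# privacy_demo.py (hypothetical module)
def public():
    return "visible"

def _internal():
    return "hidden from star imports"

# In another module or an interactive session:
# >>> from privacy_demo import *
# >>> public()       # works fine
# >>> _internal()    # NameError: the star import skipped the underscored name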

Recipe 2: Dictionary methods you should know perfectly

    

has_key

The has_key method checks whether a dictionary has a given key. The expression d.has_key(k) is equivalent to k in d. The choice of which to use is largely a matter of taste, although has_key is on its way out of the language (it will be gone in Python 3.0).
Here is an example of how you might use has_key:

>>> d = {}
>>> d.has_key('name')
False
>>> d['name'] = 'Eric'
>>> d.has_key('name')
True
    

items and iteritems

The items method returns all the items of the dictionary as a list of items in which each item is of the form (key, value). The items are not returned in any particular order:

>>> d = {'title': 'Python Web Site', 'url': 'http://www.python.org', 'spam': 0}
>>> d.items()
[('url', 'http://www.python.org'), ('spam', 0), ('title', 'Python Web Site')]
The iteritems method works in much the same way, but returns an iterator instead of a list:

>>> it = d.iteritems()
>>> it
<dictionary-itemiterator object at 0x...>
>>> list(it) # Convert the iterator to a list
[('url', 'http://www.python.org'), ('spam', 0), ('title', 'Python Web Site')]

Using iteritems may be more efficient in many cases (especially if you want to iterate over the result).

keys and iterkeys

The keys method returns a list of the keys in the dictionary, while iterkeys returns an iterator over the keys.
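A quick illustration, reusing the dictionary from the items example above (the key order shown simply mirrors that example; dictionaries promise no particular order):

>>> d = {'title': 'Python Web Site', 'url': 'http://www.python.org', 'spam': 0}
>>> d.keys()
['url', 'spam', 'title']
>>> d.iterkeys()
<dictionary-keyiterator object at 0x...>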
    

pop

The pop method can be used to get the value corresponding to a given key, and then remove the key-value pair from the dictionary:

>>> d = {'x': 1, 'y': 2}
>>> d.pop('x')
1
>>> d
{'y': 2}
    

popitem

The popitem method is similar to list.pop, which pops off the last element of a list. Unlike list.pop, however, popitem pops off an arbitrary item because dictionaries don't have a "last element" or any order whatsoever. This may be very useful if you want to remove and process the items one by one in an efficient way (without retrieving a list of the keys first):

>>> d
{'url': 'http://www.python.org', 'spam': 0, 'title': 'Python Web Site'}
>>> d.popitem()
('url', 'http://www.python.org')
>>> d
{'spam': 0, 'title': 'Python Web Site'}

Although popitem is similar to the list method pop, there is no dictionary equivalent of append (which adds an element to the end of a list). Because dictionaries have no order, such a method wouldn’t make any sense.
    

setdefault

The setdefault method is somewhat similar to get, in that it retrieves a value associated with a given key. In addition to the get functionality, setdefault sets the value corresponding to the given key if it is not already in the dictionary:

>>> d = {}
>>> d.setdefault('name', 'N/A')
'N/A'
>>> d
{'name': 'N/A'}
>>> d['name'] = 'Gumby'
>>> d.setdefault('name', 'N/A')
'Gumby'
>>> d
{'name': 'Gumby'}

As you can see, when the key is missing, setdefault returns the default and updates the dictionary accordingly. If the key is present, its value is returned and the dictionary is left unchanged. The default is optional, as with get; if it is left out, None is used:

>>> d = {}
>>> print d.setdefault('name')
None
>>> d
{'name': None}

    

update


The update method updates one dictionary with the items of another:

>>> d = {
'title': 'Python Web Site',
'url': 'http://www.python.org',
'changed': 'Mar 14 22:09:15 MET 2008'}
>>> x = {'title': 'Python Language Website'}
>>> d.update(x)
>>> d
{'url': 'http://www.python.org', 'changed':
'Mar 14 22:09:15 MET 2008', 'title': 'Python Language Website'}

The items in the supplied dictionary are added to the old one, supplanting any items there with the same keys.
The update method can be called in the same way as the dict function (or type constructor), as discussed earlier in this chapter. This means that update can be called with a mapping, a sequence (or other iterable object) of (key, value) pairs, or keyword arguments.
values and itervalues

The values method returns a list of the values in the dictionary (and itervalues returns an iterator over the values). Unlike keys, the list returned by values may contain duplicates:

>>> d = {}
>>> d[1] = 1
>>> d[2] = 2
>>> d[3] = 3
>>> d[4] = 1
>>> d.values()
[1, 2, 3, 1]

Recipe 3: How nested functions in Python can be used productively

Python functions may be nested; you can put one inside another. Here is an example:
def foo():
    def bar():
        print "Hello, world!"
    bar()

Nesting is normally not all that useful, but there is one particular application that stands out: using one function to “create” another. This means that you can (among other things) write functions like the following:

def multiplier(factor):
    def multiplyByFactor(number):
        return number*factor
    return multiplyByFactor

One function is inside another, and the outer function returns the inner one; that is, the function itself is returned—it is not called. What’s important is that the returned function still has access to the scope where it was defined; in other words, it carries its environment (and the associated local variables) with it!

Each time the outer function is called, the inner one gets redefined, and each time, the variable factor may have a new value. Because of Python’s nested scopes, this variable from the outer local scope (of multiplier) is accessible in the inner function later on, as follows:

>>> double = multiplier(2)
>>> double(5)
10
>>> triple = multiplier(3)
>>> triple(3)
9
>>> multiplier(5)(4)
20

A function such as multiplyByFactor that stores its enclosing scopes is called a closure. Normally, you cannot rebind variables in outer scopes. In Python 3.0, however, the keyword nonlocal is introduced. It is used in much the same way as global, and lets you assign to variables in outer (but nonglobal) scopes.
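To make the nonlocal remark concrete, here is a small Python 3 sketch (not from the original text) that rebinds a variable in the enclosing scope:

def counter():
    count = 0
    def increment():
        nonlocal count   # Python 3 only: rebind the enclosing-scope variable
        count += 1
        return count
    return increment

next_value = counter()
print(next_value())   # 1
print(next_value())   # 2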

Recipe 4: Launching a web browser from a Python program

The os.system function is useful for a lot of things, but for the specific task of launching a web browser, there’s an even better solution: the webbrowser module. It contains a function called open, which lets you automatically launch a web browser to open the given URL. For example, if you want your program to open the impythonist web site in a web browser (either starting a new browser or using one that is already running), you simply use this:

import webbrowser
webbrowser.open("http://www.impythonist.co.nr")

The page should pop up. Pretty nifty, huh?
We will discuss a lot more. Stay tuned.

How to use Python's expressive power to solve a level-5 challenge

Hello friends. It took me too long to post a new article, but Impythonist has come up with a really interesting insight this time.

OK, what does the title refer to? We have all programmed day and night, so why do we stumble when a new challenge is given to us?
This article is not intended for hardcore coders; the main focus is on beginners and intermediate programmers.

“The coding is an art which rises from deep of the heart”

So here we are going to look at a challenge of the kind that frequently appears in local or web coding contests. We can use any programming language, but as a Pythonist I was privileged to use Python, a beautiful and elegant programming language. The problem goes as follows:

CHALLENGE

“”” You are provided with a text file containing a list of integers.

Let S be the longest subsequence of consecutive increasing integers in this list,
such that the element at position n is greater than the element at position n-1.
If there are multiple sub-sequences of the same length, let S be the one that appears first in the list.
What is the absolute value of the sum of the elements in S?

For example, take the list [3, 4, 2, 5, 7, 7, 1, 2].
For this list, the longest subsequence of increasing integers has 3 elements: S = [2, 5, 7].
The absolute value of the sum of these elements is |2+5+7|=14
text file is below:

https://app.box.com/s/7fjlraqe5dm6ik54cbgg

Step 1: Understand The Problem

It is not a simple question, but thinking about it a little lets us form theories for attacking the problem. At first sight the question suggests dynamic programming, but only a little application of dynamic programming is required here.

Step 2: Run the Python interpreter

Now think about how the solution to the challenge could begin: what data types are required, and what Python concepts are involved?

With a little observation of the problem statement we draw the following conclusions:

1. File handling in Python is needed.
2. A little string handling is required.
3. List manipulation is necessary, and nothing more.

Now go and get some coffee and roll up your sleeves before we get to the technique.

Step 3: No more steps. Jumping right into the code.

# My version of the solution to the subsequence challenge

 

1. fil=open("num.txt","r").readlines()
2. mylo=[]
3. for l in fil:
4.     mylo.append(int(l))
5. long_seq,i,temp,max_seq_index=0,0,1,0
6. while i<len(mylo):
7.     if long_seq>temp: temp,max_seq_index=long_seq,i-1
8.     long_seq,j=1,i
9.     while j!=len(mylo)-1 and mylo[j]<mylo[j+1] :
10.         long_seq+=1
11.         j+=1
12.     i+=1
13. total=abs(sum(mylo[max_seq_index:max_seq_index+temp]))
14. print "The longest sequence no in File is: %d and sum is %d" % (temp,total)

 

Just observe the above code: it is the complete solution to the problem. Your feet will ache if you try to find the solution in other, more traditional languages. Don't panic at the sight of the code; we are going to discuss each and every line.
line 1:
fil=open("num.txt","r").readlines()

It is simple: we are opening a file called num.txt in reading mode. The readlines() method of the file object returns a 'list' of strings, one for each line in the file. Here num.txt is in the current directory; a path can be specified instead.

line 2:
mylo=[]

Here we plan to capture each line as a single value and store it in a list, so we create an empty list first.

line 3,4:
for l in fil:
    mylo.append(int(l))

Now, for each element in the returned list 'fil', we cast the value to int and store it in the list mylo. With this, the list is populated with values to work on. The actual logic starts from the next lines.

line 5:
long_seq,i,temp,max_seq_index=0,0,1,0

In this line we are assigning four variables in a single line. Here is what the variables mean:

long_seq ———> This variable stores the length of the increasing run found in the current iteration. It may seem alien, but things will settle down shortly.

i ———-> This variable is used to iterate over each element of the list.

temp ———-> This variable stores the length of the longest subsequence found so far. It holds part of the answer, i.e. the length of the longest run of elements satisfying the required condition s[i]<s[i+1]. It is initialized to 1 because at least one element satisfies the condition.

max_seq_index —–> This variable holds the starting index of the longest subsequence. It is later used to find the sum of all elements in that longest subsequence.

line 6:
while i<len(mylo):

Here we iterate from the first element of the list 'mylo' up to the last element.

line 7:
if long_seq>temp: temp,max_seq_index=long_seq,i-1

Here comes the tricky part: this construct is written by looking into the future, which is the flavour of dynamic programming here. It says that if the run length found in the previous iteration is greater than the previously recorded one, then update temp and store the index where that run started into max_seq_index. Why is it i-1 and not i? Because the run we are recording started at the previous index; this will become clear in a few minutes.

line 8:
long_seq,j=1,i

long_seq is a temporary variable for each iteration; it is a scout that works for temp. The analogy goes like this: if the money collected by long_seq is greater, temp takes it over; otherwise temp doesn't care. long_seq is sent into the battlefield initialized with one rupee, and whatever it earns in the inner loop is the run length that may later be fed to temp. j is initialized to i in order to check whether a longer subsequence starts here.

line 9-11:

while j!=len(mylo)-1 and mylo[j]<mylo[j+1] :
    long_seq+=1
    j+=1

The check j!=len(mylo)-1 is written to guard against an index-out-of-bounds error in the next condition, mylo[j]<mylo[j+1]. If both conditions pass, then j moves on to the next position and long_seq gains one point for satisfying the condition. This continues as long as the increasing run keeps going; when the condition fails, i restarts the scan from the next element.

line 12:

i+=1

Simple: this advances i to the next element of 'mylo'.

line 13:

total=abs(sum(mylo[max_seq_index:max_seq_index+temp]))

Here total stores the sum of all the elements of the longest subsequence satisfying the condition. The temp value is used to slice the list 'mylo' and return only the part of it that is the longest subsequence; the built-in sum() adds up the sliced list and abs() returns the absolute value, converting a negative sum into a positive one.

line 14:

print "The longest sequence no in File is: %d and sum is %d" % (temp,total)

As is clear, this line prints the length of the longest subsequence and the sum asked for in the challenge.

In this way, Python constructs are highly optimized and can be used to write new code at lightning speed. The dynamic programming approach here might confuse you a little, but if you follow the execution path you will see how cleverly the problem was solved. This is a small effort to improve design strategies for beginners; for further discussion feel free to mail me.
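For comparison, here is a compact alternative sketch of the same scan (my own rewrite, not the solution above): it walks the list once, remembering where the best strictly increasing run starts, and avoids the look-ahead bookkeeping:

def longest_increasing_run_sum(nums):
    best_start, best_len, start = 0, 0, 0
    for i in range(1, len(nums) + 1):
        # a run ends at the end of the list or when the next value does not increase
        if i == len(nums) or nums[i] <= nums[i - 1]:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    return abs(sum(nums[best_start:best_start + best_len]))

nums = [int(line) for line in open("num.txt")]
print longest_increasing_run_sum(nums)

On the example list [3, 4, 2, 5, 7, 7, 1, 2] this returns 14, matching the statement of the challenge.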

my mail address : narenarya@live.com

Peter Norvig's 21-line spelling corrector using probability theory


 

Remembering the legend; in his own words, the story follows:

How to Write a Spelling Corrector

In the past week, two friends (Dean and Bill) independently told me they were amazed at how Google does spelling correction so well and quickly. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. (Yahoo and Microsoft are similar.) What surprised me is that I thought Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about statistical language processing problems such as spelling correction. But they didn’t, and come to think of it, there’s no reason they should: it was my expectations that were faulty, not their knowledge.

I figured they and many others could benefit from an explanation. The full details of an industrial-strength spell corrector are quite complex (you can read a little about it here or here). What I wanted to do here is to develop, in less than a page of code, a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.

So here, in 21 lines of Python 2.5 code, is the complete spelling corrector (almost all of the code works with 2.7 as well):

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The code defines the function correct, which takes a word as input and returns a likely correction of that word. For example:

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'

The version of edits1 shown here is a variation on one proposed by Darius Bacon; I think this is clearer than the version I originally had. Darius also fixed a bug in the function correct.

How It Works: Some Probability Theory

How does it work? First, a little theory. Given a word, we are trying to choose the most likely spelling correction for that word (the “correction” may be the original word itself). There is no way to know for sure (for example, should “lates” be corrected to “late” or “latest”?), which suggests we use probabilities. We will say that we are trying to find the correction c, out of all possible corrections, that maximizes the probability of c given the original word w:

argmaxc P(c|w)

By Bayes’ Theorem this is equivalent to:

argmaxc P(w|c) P(c) / P(w)

Since P(w) is the same for every possible c, we can ignore it, giving:

argmaxc P(w|c) P(c)

There are three parts of this expression. From right to left, we have:

  1. P(c), the probability that a proposed correction c stands on its own. This is called the language model: think of it as answering the question “how likely is c to appear in an English text?” So P(“the”) would have a relatively high probability, while P(“zxzxzxzyyy”) would be near zero.
  2. P(w|c), the probability that w would be typed in a text when the author meant c. This is the error model: think of it as answering “how likely is it that the author would type w by mistake when c was intended?”
  3. argmaxc, the control mechanism, which says to enumerate all feasible values of c, and then choose the one that gives the best combined probability score.

One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w=”thew” and the two candidate corrections c=”the” and c=”thaw”. Which has a higher P(c|w)? Well, “thaw” seems good because the only change is “a” to “e”, which is a small change. On the other hand, “the” seems good because “the” is a very common word, and perhaps the typist’s finger slipped off the “e” onto the “w”. The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.

Now we are ready to show how the program works. First P(c). We will read a big text file, big.txt, which consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. (On the plane home when I was writing the first version of the code all I had was a collection of Sherlock Holmes stories that happened to be on my laptop; I added the other sources later and stopped adding texts when they stopped helping, as we shall see in the Evaluation section.)

We then extract the individual words from the file (using the function words, which converts everything to lowercase, so that “the” and “The” will be the same and then defines a word as a sequence of alphabetic characters, so “don’t” will be seen as the two words “don” and “t”). Next we train a probability model, which is a fancy way of saying we count how many times each word occurs, using the function train. It looks like this:

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

At this point, NWORDS[w] holds a count of how many times the word w has been seen. There is one complication: novel words. What happens with a perfectly good word of English that wasn’t seen in our training data? It would be bad form to say the probability of a word is zero just because we haven’t seen it yet. There are several standard approaches to this problem; we take the easiest one, which is to treat novel words as if we had seen them once. This general process is called smoothing, because we are smoothing over the parts of the probability distribution that would have been zero, bumping them up to the smallest possible count. This is achieved through the class collections.defaultdict, which is like a regular Python dict (what other languages call hash tables) except that we can specify the default value of any key; here we use 1.

Now let’s look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). Here’s a function that returns a set of all words c that are one edit away from w:

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, len(edits1(‘something’)) — that is, the number of elements in the result of edits1(‘something’) — is 494.

The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target. As we shall see shortly, I put together a development corpus of 270 spelling errors, and found that only 76% of them have edit distance 1. Perhaps the examples I found are harder than typical errors. Anyway, I thought this was not good enough, so we’ll need to consider edit distance 2. That’s easy: just apply edits1 to all the results of edits1:

def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

This is easy to write, but we’re starting to get into some serious computation: len(edits2(‘something’)) is 114,324. However, we do get good coverage: of the 270 test cases, only 3 have an edit distance greater than 2. That is, edits2 will cover 98.9% of the cases; that’s good enough for me. Since we aren’t going beyond edit distance 2, we can do a small optimization: only keep the candidates that are actually known words. We still have to consider all the possibilities, but we don’t have to build up a big set of them. The function known_edits2 does this:

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

Now, for example, known_edits2(‘something’) is a set of just 4 words: {‘smoothing’, ‘seething’, ‘something’, ‘soothing’}, rather than the set of 114,324 words generated by edits2. That speeds things up by about 10%.

Now the only part left is the error model, P(w|c). Here’s where I ran into difficulty. Sitting on the plane, with no internet connection, I was stymied: I had no training data to build a model of spelling errors. I had some intuitions: mistaking one vowel for another is more probable than mistaking two consonants; making an error on the first letter of a word is less probable, etc. But I had no numbers to back that up. So I took a shortcut: I defined a trivial model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. By “known word” I mean a word that we have seen in the language model training data — a word in the dictionary. We can implement this strategy as follows:

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

The function correct chooses as the set of candidate words the set with the shortest edit distance to the original word, as long as the set has some known words. Once it identifies the candidate set to consider, it chooses the element with the highest P(c) value, as estimated by the NWORDS model.

Evaluation

Now it is time to evaluate how well this program does. On the plane I tried a few examples, and it seemed okay. After my plane landed, I downloaded Roger Mitton’s Birkbeck spelling error corpus from the Oxford Text Archive. From that I extracted two test sets of corrections. The first is for development, meaning I get to look at it while I’m developing the program. The second is a final test set, meaning I’m not allowed to look at it, nor change my program after evaluating on it. This practice of having two sets is good hygiene; it keeps me from fooling myself into thinking I’m doing better than I am by tuning the program to one specific set of tests. Here I show an excerpt of the two tests and the function to run them; to see the complete set of tests (along with the rest of the program), see the file spell.py.

tests1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
    'accomodation acommodation acomodation', 'account': 'acount', ...}

tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
    'supposedly': 'supposidly', 'embellishing': 'embelishing', ...}

def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.clock()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target,wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w!=target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print '%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target])
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.clock()-start) )

print spelltest(tests1)
print spelltest(tests2) ## only do this after everything is debugged

This gives the following output:

{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 16, 'pct': 74, 'n': 270}
{'bad': 130, 'bias': None, 'unknown': 43, 'secs': 26, 'pct': 67, 'n': 400}

So on the development set of 270 cases, we get 74% correct in 13 seconds (a rate of 17 Hz), and on the final test set we get 67% correct (at 15 Hz).

Update: In the original version of this essay I incorrectly reported a higher score on both test sets, due to a bug. The bug was subtle, but I should have caught it, and I apologize for misleading those who read the earlier version. In the original version of spelltest, I left out the if bias: in the fourth line of the function (and the default value was bias=0, not bias=None). I figured that when bias = 0, the statement NWORDS[target] += bias would have no effect. In fact it does not change the value of NWORDS[target], but it does have an effect: it makes (target in NWORDS) true. So in effect the spelltest routine was cheating by making all the unknown words known. This was a humbling error, and I have to admit that much as I like defaultdict for the brevity it adds to programs, I think I would not have had this bug if I had used regular dicts.

Update 2: defaultdict strikes again. Darius Bacon pointed out that in the function correct, I had accessed NWORDS[w]. This has the unfortunate side-effect of adding w to the defaultdict, if w was not already there (i.e., if it was an unknown word). Then the next time, it would be present, and we would get the wrong answer. Darius correctly suggested changing to NWORDS.get. (This works because max(None, i) is i for any integer i.)
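To make that side effect concrete, here is a tiny illustration (not part of the original essay) of how merely reading a key inserts it into a defaultdict, while .get does not:

>>> import collections
>>> NWORDS = collections.defaultdict(lambda: 1)
>>> 'zzz' in NWORDS
False
>>> NWORDS['zzz']        # reading the key silently adds it with the default value
1
>>> 'zzz' in NWORDS
True
>>> NWORDS.get('qqq')    # .get does not add the key
>>> 'qqq' in NWORDS
False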

In conclusion, I met my goals for brevity, development time, and runtime speed, but not for accuracy.

Future Work

Let’s think about how we could do better. (I’ve done some more in a separate chapter for a book.) We’ll again look at all three factors of the probability model: (1) P(c); (2) P(w|c); and (3) argmaxc. We’ll look at examples of what we got wrong. Then we’ll look at some factors beyond the three…

  1. P(c), the language model. We can distinguish two sources of error in the language model. The more serious is unknown words. In the development set, there are 15 unknown words, or 5%, and in the final test set, 43 unknown words or 11%. Here are some examples of the output of spelltest with verbose=True:
    correct('economtric') => 'economic' (121); expected 'econometric' (1)
    correct('embaras') => 'embargo' (8); expected 'embarrass' (1)
    correct('colate') => 'coat' (173); expected 'collate' (1)
    correct('orentated') => 'orentated' (1); expected 'orientated' (1)
    correct('unequivocaly') => 'unequivocal' (2); expected 'unequivocally' (1)
    correct('generataed') => 'generate' (2); expected 'generated' (1)
    correct('guidlines') => 'guideline' (2); expected 'guidelines' (1)

    In this output we show the call to correct and the result (with the NWORDS count for the result in parentheses), and then the word expected by the test set (again with the count in parentheses). What this shows is that if you don’t know that ‘econometric’ is a word, you’re not going to be able to correct ‘economtric’. We could mitigate by adding more text to the training corpus, but then we also add words that might turn out to be the wrong answer. Note the last four lines above are inflections of words that do appear in the dictionary in other forms. So we might want a model that says it is okay to add ‘-ed’ to a verb or ‘-s’ to a noun.

    The second potential source of error in the language model is bad probabilities: two words appear in the dictionary, but the wrong one appears more frequently. I must say that I couldn’t find cases where this is the only fault; other problems seem much more serious.

    We can simulate how much better we might do with a better language model by cheating on the tests: pretending that we have seen the correctly spelled word 1, 10, or more times. This simulates having more text (and just the right text) in the language model. The function spelltest has a parameter, bias, that does this. Here’s what happens on the development and final test sets when we add more bias to the correctly-spelled words:

    Bias   Dev.  Test
    0      74%   67%
    1      74%   70%
    10     76%   73%
    100    82%   77%
    1000   89%   80%

    On both test sets we get significant gains, approaching 80-90%. This suggests that it is possible that if we had a good enough language model we might get to our accuracy goal. On the other hand, this is probably optimistic, because as we build a bigger language model we would also introduce words that are the wrong answer, which this method does not do.

    Another way to deal with unknown words is to allow the result of correct to be a word we have not seen. For example, if the input is “electroencephalographicallz”, a good correction would be to change the final “z” to an “y”, even though “electroencephalographically” is not in our dictionary. We could achieve this with a language model based on components of words: perhaps on syllables or suffixes (such as “-ally”), but it is far easier to base it on sequences of characters: 2-, 3- and 4-letter sequences.

  2. P(w|c), the error model. So far, the error model has been trivial: the smaller the edit distance, the smaller the error. This causes some problems, as the examples below show. First, some cases where correct returns a word at edit distance 1 when it should return one at edit distance 2:
    correct('reciet') => 'recite' (5); expected 'receipt' (14)
    correct('adres') => 'acres' (37); expected 'address' (77)
    correct('rember') => 'member' (51); expected 'remember' (162)
    correct('juse') => 'just' (768); expected 'juice' (6)
    correct('accesing') => 'acceding' (2); expected 'assessing' (1)

    Here, for example, the alteration of ‘d’ to ‘c’ to get from ‘adres’ to ‘acres’ should count more than the sum of the two changes ‘d’ to ‘dd’ and ‘s’ to ‘ss’.

    Also, some cases where we choose the wrong word at the same edit distance:

    correct('thay') => 'that' (12513); expected 'they' (4939)
    correct('cleark') => 'clear' (234); expected 'clerk' (26)
    correct('wer') => 'her' (5285); expected 'were' (4290)
    correct('bonas') => 'bones' (263); expected 'bonus' (3)
    correct('plesent') => 'present' (330); expected 'pleasant' (97)

    The same type of lesson holds: In ‘thay’, changing an ‘a’ to an ‘e’ should count as a smaller change than changing a ‘y’ to a ‘t’. How much smaller? It must be at least a factor of 2.5 to overcome the prior probability advantage of ‘that’ over ‘they’.

    Clearly we could use a better model of the cost of edits. We could use our intuition to assign lower costs for doubling letters and changing a vowel to another vowel (as compared to an arbitrary letter change), but it seems better to gather data: to get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. We need a lot of data to do this well. If we want to look at the change of one character for another, given a window of two characters on each side, that’s 26^6, which is over 300 million characters. You’d want several examples of each, on average, so we need at least a billion characters of correction data; probably safer with at least 10 billion.

    Note there is a connection between the language model and the error model. The current program has such a simple error model (all edit distance 1 words before any edit distance 2 words) that it handicaps the language model: we are afraid to add obscure words to the model, because if one of those obscure words happens to be edit distance 1 from an input word, then it will be chosen, even if there is a very common word at edit distance 2. With a better error model we can be more aggressive about adding obscure words to the dictionary. Here are some examples where the presence of obscure words in the dictionary hurts us:

    correct('wonted') => 'wonted' (2); expected 'wanted' (214)
    correct('planed') => 'planed' (2); expected 'planned' (16)
    correct('forth') => 'forth' (83); expected 'fourth' (79)
    correct('et') => 'et' (20); expected 'set' (325)
  3. The enumeration of possible corrections, argmaxc. Our program enumerates all corrections within edit distance 2. In the development set, only 3 words out of 270 are beyond edit distance 2, but in the final test set, there were 23 out of 400. Here they are:
    purple perpul
    curtains courtens
    minutes muinets
    
    successful sucssuful
    hierarchy heiarky
    profession preffeson
    weighted wagted
    inefficient ineffiect
    availability avaiblity
    thermawear thermawhere
    nature natior
    dissension desention
    unnecessarily unessasarily
    disappointing dissapoiting
    acquaintances aquantences
    thoughts thorts
    criticism citisum
    immediately imidatly
    necessary necasery
    necessary nessasary
    necessary nessisary
    unnecessary unessessay
    night nite
    minutes muiuets
    assessing accesing
    necessitates nessisitates
    

    We could consider extending the model by allowing a limited set of edits at edit distance 3. For example, allowing only the insertion of a vowel next to another vowel, or the replacement of a vowel for another vowel, or replacing close consonants like “c” to “s” would handle almost all these cases.

  4. There’s actually a fourth (and best) way to improve: change the interface to correct to look at more context. So far, correct only looks at one word at a time. It turns out that in many cases it is difficult to make a decision based only on a single word. This is most obvious when there is a word that appears in the dictionary, but the test set says it should be corrected to another word anyway:
    correct('where') => 'where' (123); expected 'were' (452)
    correct('latter') => 'latter' (11); expected 'later' (116)
    correct('advice') => 'advice' (64); expected 'advise' (20)

    We can’t possibly know that correct('where') should be ‘were’ in at least one case, but should remain ‘where’ in other cases. But if the query had been correct('They where going') then it seems likely that “where” should be corrected to “were”.

    The context of the surrounding words can help when there are obvious errors, but two or more good candidate corrections. Consider:

    correct('hown') => 'how' (1316); expected 'shown' (114)
    correct('ther') => 'the' (81031); expected 'their' (3956)
    correct('quies') => 'quiet' (119); expected 'queries' (1)
    correct('natior') => 'nation' (170); expected 'nature' (171)
    correct('thear') => 'their' (3956); expected 'there' (4973)
    correct('carrers') => 'carriers' (7); expected 'careers' (2)

    Why should ‘thear’ be corrected as ‘there’ rather than ‘their’? It is difficult to tell by the single word alone, but if the query were correct("There's no there thear") it would be clear.

    To build a model that looks at multiple words at a time, we will need a lot of data. Fortunately, Google has released a database of word counts for all sequences up to five words long, gathered from a corpus of a trillion words.

    I believe that a spelling corrector that scores 90% accuracy will need to use the context of the surrounding words to make a choice. But we’ll leave that for another day…

  5. We could improve our accuracy scores by improving the training data and the test data. We grabbed a million words of text and assumed they were all spelled correctly; but it is very likely that the training data contains several errors. We could try to identify and fix those. Less daunting a task is to fix the test sets. I noticed at least three cases where the test set says our program got the wrong answer, but I believe the program’s answer is better than the expected answer:
    correct('aranging') => 'arranging' (20); expected 'arrangeing' (1)
    correct('sumarys') => 'summary' (17); expected 'summarys' (1)
    correct('aurgument') => 'argument' (33); expected 'auguments' (1)

    We could also decide what dialect we are trying to train for. The following three errors are due to confusion about American versus British spelling (our training data contains both):

    correct('humor') => 'humor' (17); expected 'humour' (5)
    correct('oranisation') => 'organisation' (8); expected 'organization' (43)
    correct('oranised') => 'organised' (11); expected 'organized' (70)
  6. Finally, we could improve the implementation by making it much faster, without changing the results. We could re-implement in a compiled language rather than an interpreted one. We could have a lookup table that is specialized to strings rather than Python’s general-purpose dict. We could cache the results of computations so that we don’t have to repeat them multiple times (a minimal caching sketch follows this list). One word of advice: before attempting any speed optimizations, profile carefully to see where the time is actually going.
  7. Norvig is a pioneer not only in Python but in other fields as well. For his complete works, visit norvig.com (note: no www before the url).
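Picking up on point 6 above, here is a minimal caching sketch. It assumes the correct function defined earlier in the essay; the memo decorator name is mine, and it uses a plain dict cache, which avoids the defaultdict pitfalls described in the updates above.

def memo(f):
    "Cache the results of f so repeated calls with the same argument are free."
    cache = {}
    def wrapper(arg):
        if arg not in cache:
            cache[arg] = f(arg)
        return cache[arg]
    return wrapper

correct = memo(correct)  # later calls like correct('speling') reuse the cached answer

This helps most in real use, where the same misspellings show up again and again; in the test harness, where almost every word is unique, it buys little.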

Mongo DB for Python Developers

Introduction to NoSQL Architecture with MongoDB for  Python Developers

This article covers using MongoDB as a way to get started with NoSQL. It presents an introduction to considering NoSQL, why MongoDB is a good NoSQL implementation to get started with, MongoDB shortcomings and tradeoffs, key MongoDB developer concepts, MongoDB architectural concepts (sharding, replica sets), using the console to learn MongoDB, and getting started with MongoDB for Python, Java and PHP developers. The article uses MongoDB, but many of the concepts introduced are common to other NoSQL solutions. The article should be useful for developers, ops and DevOps people who are new to NoSQL.

From no NoSQL to sure why not

The first time I heard of something that could actually be classified as NoSQL was from Warner Onstine, who is currently working on some CouchDB articles for InfoQ. Warner was going on and on about how great CouchDB was. This was before the term NoSQL was coined. I was skeptical, having just been on a project that was converted from an XML Document Database back to Oracle due to issues with the XML Database implementation. I did the conversion. I did not pick the XML Database solution, or decide to convert it to Oracle. I was just the consultant guy on the project (circa 2005) who did the work after the guy who picked the XML Database moved on and the production issues started to happen.

That was my first document database, and the experience bred skepticism and distrust of databases that were not established RDBMSs (Oracle, MySQL, etc.). That incident alone did not create the skepticism, though. Let me explain.

First there were all of the Object Oriented Database (OODB) folks, who spent years preaching that OODBs were going to be the next big thing. That has not happened yet. I hear 2013 will be the year of the OODB, just like it was going to be in 1997. Then there were the XML Database people preaching something very similar, which did not seem to happen either, at least not at the pervasive scale that NoSQL is happening.

My take was: ignore this document-oriented approach and NoSQL, and see if it goes away. To be successful, it needs some community behind it, some clear use case wins, and some corporate muscle/marketing, and I will wait until then. Sure, the big guys need something like Dynamo and BigTable, but I assumed it was a niche. Then BigTable, MapReduce, Google App Engine, and Dynamo were in the news with white papers. Then came Hadoop, Cassandra, MongoDB, Membase, HBase, and the constant small but growing drum beat of change and innovation. Even skeptics have limits.

Then in 2009, Eric Evans coined the term NoSQL to describe the growing list of open-source distributed databases. Now there is this NoSQL movement, three years in and counting. Like Ajax, giving something a name seems to inspire its growth, or perhaps we don't name movements until there is already a groundswell. Either way, having a name like NoSQL with a common vision is important to changing the world, and you can see the community, use case wins, and corporate marketing muscle behind NoSQL. It has gone beyond the buzz stage. 2009 was also the year of the first project I worked on that had mass scale-out requirements and used something classified as part of NoSQL.

By 2009, when MongoDB was released by 10gen, the NoSQL movement was in full swing. Somehow MongoDB managed to move to the front of the pack in terms of mindshare, followed closely by Cassandra and others (see figure 1). MongoDB is listed as a top job trend on Indeed.com, #2 to be exact (behind HTML 5 and before iOS), which is fairly impressive given that MongoDB was a relative latecomer to the NoSQL party.

Figure 1: MongoDB leads the NoSQL pack


MongoDB is a distributed, document-oriented, schema-less storage solution similar to Couchbase and CouchDB. MongoDB uses JSON-style documents to represent, query and modify data. Internally, data is stored in BSON (binary JSON). MongoDB's closest cousins seem to be CouchDB/Couchbase. MongoDB supports many clients/languages, including Python, PHP, Java, Ruby, C++, etc. This article is going to introduce key MongoDB concepts and then show basic CRUD/Query examples in JavaScript (part of the MongoDB console application), Java, PHP and Python.

Disclaimer: I have no ties with the MongoDB community and no vested interest in their success or failure. I am not an advocate. I merely started to write about MongoDB because they seem to be the most successful, seem to have the most momentum for now, and in many ways typify the very diverse NoSQL market. MongoDB's success is largely due to having easy-to-use, familiar tools. I'd love to write about CouchDB, Cassandra, Couchbase, Redis, HBase or any number of other NoSQL solutions if there were just more hours in the day, or stronger coffee, or if coffee somehow extended how much time I had. Redis seems truly fascinating.

MongoDB seems to have the right mix of features and ease of use, and has become a prototypical example of what a NoSQL solution should look like. MongoDB can be used as a sort of knowledge base for understanding other solutions (compare/contrast). This article is not an endorsement. That said, if you want to get started with NoSQL, MongoDB is a great choice.

MongoDB a gateway drug to NoSQL

MongoDB's combination of features, simplicity, community, and documentation makes it successful. The product itself has high availability, journaling (which is not always a given with NoSQL solutions), replication, auto-sharding, MapReduce, and an aggregation framework (so you don't have to use MapReduce directly for simple aggregations). MongoDB can scale reads as well as writes.

NoSQL, in general, has been reported to be more agile than a full RDBMS/SQL stack due to the problems of schema migration in SQL-based systems. Having been on large RDBMS systems and witnessed the trouble and toil of doing SQL schema migrations, I can tell you that this is a real pain to deal with. RDBMS/SQL often requires a lot of upfront design or a lot of schema migration later. In this way, NoSQL is viewed as more agile in that it lets applications worry about differences in versioning instead of forcing schema migrations and larger upfront designs. To the MongoDB crowd, it is said that MongoDB has a dynamic schema, not no schema (sort of like the dynamic language versus untyped language argument from Ruby, Python, etc. developers).

MongoDB does not seem to require a lot of ramp-up time. Its early success may be attributed to the quality and ease of use of its client drivers, which were more of an afterthought for other NoSQL solutions ("Hey, here is our REST or XYZ wire protocol, deal with it yourself"). Compared to other NoSQL solutions, it has been said that MongoDB is easier to get started with. Also, with MongoDB many DevOps things come cheaply or free. This is not to say that there are never any problems or that you should not do capacity planning. MongoDB has become for many an easy on-ramp to NoSQL, a gateway drug if you will.

MongoDB was built to be fast. Speed is a good reason to pick MongoDB. Raw speed shaped the architecture of MongoDB. Data is stored in memory using memory-mapped files. This means that the virtual memory manager, a highly optimized facility of modern operating systems, does the paging/caching. MongoDB also pads the area around documents so that they can be modified in place, making updates less expensive. MongoDB uses a binary protocol instead of REST like some other implementations. Also, data is stored in a binary format instead of text (JSON, XML), which can speed up writes and reads.

Another reason MongoDB may do well is that it is easy to scale out reads and writes with replica sets and autosharding. You might expect that if MongoDB is so great there would be a lot of big names using it, and there are: MTV, Craigslist, Disney, Shutterfly, Foursquare, bit.ly, The New York Times, Barclay's, The Guardian, SAP, Forbes, National Archives UK, Intuit, github, LexisNexis and many more.

Caveats

MongoDB indexes may not be as flexible as those of Oracle/MySQL/Postgres or even other NoSQL solutions. The order of index fields matters, as MongoDB uses B-trees. Realtime queries might not be as fast as on Oracle/MySQL and other NoSQL solutions, especially when dealing with array fields, and the query plan optimization work is not as far along as in more mature solutions. You can make sure MongoDB is using the indexes you set up quite easily with the explain function. Not to scare you: MongoDB is good enough if your queries are simple and you do a little homework, and it is always improving.
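For example, from Python (the pymongo examples later in this article show how to get connected), a rough sketch of checking a query plan with explain on a 2.0-era server might look like this; the contacts collection and age field are illustrative:

import pymongo

contacts = pymongo.Connection()["tutorial"]["contacts"]
contacts.ensure_index([("age", pymongo.ASCENDING)])    # build an index on age

plan = contacts.find({"age": {"$gt": 30}}).explain()   # ask the server how it ran the query
print plan["cursor"]   # "BtreeCursor age_1" means the index was used; "BasicCursor" means a full scan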

MongoDB does not have or integrate a full text search engine like many other NoSQL solutions do (many use Lucene under the covers for indexing), although it seems to support basic text search better than most traditional databases.

Every version of MongoDB seems to add more features and address shortcomings of the previous releases. MongoDB added journaling a while back so it can have single server durability; prior to this you really needed a replica or two to ensure some level of data safety. 10gen improved Mongo's replication and high availability with replica sets.

Another issue with current versions of MongoDB (2.0.5) is lack of concurrency due to MongoDB having a global read/write lock, which allows reads to happen concurrently while write operations happen one at a time. There are workarounds for this involving shards and/or replica sets, but they are not always ideal and do not fit the "it should just work" mantra of MongoDB. Recently at the MongoSF conference, Dwight Merriman, co-founder and CEO of 10gen, discussed the concurrency internals of MongoDB v2.2 (a future release). Dwight described how MongoDB 2.2 did a major refactor to add database-level concurrency, and will soon have collection-level concurrency now that the hard part of the concurrency refactoring is done. Also keep in mind that writes go to RAM and eventually get synced to disk, since MongoDB uses memory-mapped files. Writes are not as expensive as if you were always waiting to sync to disk. Speed can mitigate concurrency issues.

This is not to say that MongoDB will never have shortcomings and engineering tradeoffs. Also, you can, will and sometimes should combine MongoDB with a relational database or with full text search like Solr/Lucene for some applications. For example, if you run into issues with effectively building indexes to speed up some queries you might need to combine MongoDB with Memcached. None of this is completely foreign though, as it is not uncommon to pair an RDBMS with Memcached or Lucene/Solr. When to use MongoDB and when to develop a hybrid solution is beyond the scope of this article. In fact, when to use SQL/RDBMS or another NoSQL solution is beyond the scope of this article, but it would be nice to hear more discussion on this topic.

The price you pay for MongoDB, one of the youngest but perhaps best managed NoSQL solutions, is lack of maturity. It does not have a code base going back three decades like RDBMS systems. It does not have tons and tons of third party management and development tools. There have been issues, there are issues and there will be issues, but MongoDB seems to work well for a broad class of applications, and is rapidly addressing many of its issues.

Also, finding skilled developers and ops people (admins, DevOps, etc.) who are familiar with MongoDB or other NoSQL solutions might be tough. Somehow MongoDB seems to be the most approachable, or perhaps just the best marketed. Having worked on projects that used large NoSQL deployments, I can say that few people on the team really understood the product (its limitations, etc.), which led to trouble.

In short if you are kicking the tires of NoSQL, starting with MongoDB makes a lot of sense.

MongoDB concepts

MongoDB is document oriented but has many comparable concepts to traditional SQL/RDBMS solutions.

  1. Oracle: Schema, Tables, Rows, Columns
  2. MySQL: Database, Tables, Rows, Columns
  3. MongoDB: Database, Collections, Document, Fields
  4. MySQL/Oracle: Indexes
  5. MongoDB: Indexes
  6. MySQL/Oracle: Stored Procedures
  7. MongoDB: Stored JavaScript
  8. Oracle/MySQL: Database Schema
  9. MongoDB: Schema free!
  10. Oracle/MySQL: Foreign keys, and joins
  11. MongoDB: DBRefs, but mostly handled by client code
  12. Oracle/MySQL: Primary key
  13. MongoDB: ObjectID

If you have used MySQL or Oracle here is a good guide to similar processes in MongoDB:

Database Process Type   Oracle    MySQL        MongoDB
Daemon/Server           oracle    mysqld       mongod
Console Client          sqlplus   mysql        mongo
Backup utility          sqlplus   mysqldump    mongodump
Import utility          sqlplus   mysqlimport  mongoimport

You can see a trend here: where possible, MongoDB tries to follow the terminology of MySQL, and it does this with console commands as well. If you are used to MySQL, the transition is a bit less painful.

SQL operations versus MongoDB operations

MongoDB queries are similar in concept to SQL queries and use a lot of the same terminology. There is no special language or syntax to execute MongoDB queries; you simply assemble a JSON object. The MongoDB site has a complete set of example queries done in both SQL and MongoDB JSON docs to highlight the conceptual similarities. What follows are several small listings comparing MongoDB operations to SQL.

Insert

SQL
INSERT INTO CONTACTS (NAME, PHONE_NUMBER) VALUES('RICK HIGHTOWER','520-555-1212')
MongoDB
db.contacts.insert({name:'RICK HIGHTOWER',phoneNumber:'520-555-1212'})

Selects

SQL
SELECT name, phone_number FROM contacts WHERE age=30 ORDER BY name DESC
MongoDB
db.contacts.find({age:30}, {name:1,phoneNumber:1}).sort({name:-1})
SQL
SELECT name, phone_number FROM contacts WHERE age>30 ORDER BY name DESC
MongoDB
db.contacts.find({age:{$gt:30}}, {name:1,phoneNumber:1}).sort({name:-1})

Creating indexes

SQL
CREATE INDEX contact_name_idx ON contacts(name DESC)
MongoDB
db.contacts.ensureIndex({name:-1})

Updates

SQL
UPDATE contacts SET phoneNumber='415-555-1212' WHERE name='Rick Hightower'
MongoDB
db.contacts.update({name:'Rick Hightower'}, {$set:{phoneNumber:'415-555-1212'}}, false, true)

Additional features of note

MongoDB has many useful features like geo indexing (how close am I to X?), distributed file storage, capped collections (older documents auto-deleted), an aggregation framework (like SQL projections, without the complexity of MapReduce for basic operations on distributed nodes), load sharing for reads via replication, auto sharding for scaling writes, high availability, and your choice of durability (journaling) and/or data safety (make sure a copy exists on other servers).
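As one small taste, capped collections are easy to try from Python. This is a minimal sketch; the collection name and size are made up:

import pymongo

db = pymongo.Connection()["tutorial"]

# A capped collection has a fixed size on disk; once it is full, the oldest
# documents are automatically discarded as new ones are inserted.
# (create_collection raises an error if the collection already exists.)
log = db.create_collection("log", capped=True, size=1048576)   # roughly 1 MB
log.insert({"msg": "user logged in"})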

Architecture: Replica Sets, Autosharding

The model of MongoDB is such that you can start basic and use features as your growth/needs change without too much trouble or change in design. MongoDB uses replica sets to provide read scalability, and high availability. Autosharding is used to scale writes (and reads). Replica sets and autosharding go hand in hand if you need mass scale out. With MongoDB scaling out seems easier than traditional approaches as many things seem to come built-in and happen automatically. Less operation/administration and a lower TCO than other solutions seems likely. However you still need capacity planning (good guess), monitoring (test your guess), and the ability to determine your current needs (adjust your guess).

Replica Sets

The major advantages of replica sets are business continuity through high availability, data safety through data redundancy, and read scalability through load sharing (reads). Replica sets use a shared-nothing architecture. A fair bit of the brains of replica sets lives in the client libraries, which are replica set aware. With replica sets, MongoDB language drivers know the current primary. (A language driver is a library for a particular programming language; think JDBC driver or ODBC driver, but for MongoDB.) All write operations go to the primary. If the primary is down, the drivers know how to get to the new primary (an elected new primary); this is auto failover for high availability. The data is replicated after writing. Drivers always write to the replica set's primary (called the master); the master then replicates to slaves. The primary is not fixed; the master/primary is nominated.

Typically you have at least three MongoDB instances in a replica set on different server machines (see figure 2). You can add more replicas of the primary if you like for read scalability, but you only need three for high availability failover. There is a way to sort of get down to two, but let's leave that out of this article. That small tidbit aside, there are advantages to having three versus two in general. If you have two instances and one goes down, the remaining instance carries 200% of its previous load. If you have three instances and one goes down, the load on the remaining instances only goes up by 50%. If you typically run your boxes at 50% capacity and you have an outage, your remaining boxes will run at 75% capacity until you get the downed box repaired or replaced. If business continuity is your thing or important for your application, then having at least three instances in a replica set sounds like a good plan anyway (not all applications need it).
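From a Python client, connecting to a replica set is a one-liner with pymongo's ReplicaSetConnection (available in pymongo 2.1+); the host names and replica set name below are made up:

import pymongo

# The driver discovers the current primary from the seed list and sends all
# writes to it; if the primary changes, the driver fails over automatically.
connection = pymongo.ReplicaSetConnection(
    "mongo1:27017,mongo2:27017,mongo3:27017", replicaSet="rs0")

db = connection["tutorial"]
db.employees.insert({"name": "Rick Hightower"})   # always goes to the primary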

Figure 2: Replica Sets


In general, once replication is set up, it just works. However, in your monitoring, which is easy to do in MongoDB, you want to see how fast data is getting replicated from the primary (master) to the replicas (slaves). The slower the replication is, the dirtier your reads are. Replication is async (non-blocking) by default. Slave data and primary data can be out of sync for however long it takes to do the replication. There are already whole books written just on making Mongo scalable; if you work at a Foursquare-like company, or a company where high availability is very important, and use Mongo, I suggest reading such a book.

By default, replication is non-blocking/async. This might be acceptable for some data (category descriptions in an online store) but not for other data (a shopping cart's credit card transaction data). For important data, the client can block until the data is replicated on all servers or written to the journal (journaling is optional). The client can force the master to sync to slaves before continuing. This sync blocking is slower. Async/non-blocking is faster and is often described as eventual consistency. Waiting for a master to sync is a form of data safety. There are several forms of data safety and options available to MongoDB, from syncing to at least one other server to waiting for the data to be written to a journal (durability). Here is a list of some data safety options for MongoDB:

  1. Wait until write has happened on all replicas
  2. Wait until write is on two servers (primary and one other)
  3. Wait until write has occurred on majority of replicas
  4. Wait until write operation has been written to journal

(The above is not an exhaustive list of options.) The key word in each option above is wait. The more syncing and durability, the more waiting, and the harder it is to scale cost-effectively.
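In Python these choices map onto write options on the driver's write methods. Here is a hedged sketch using pymongo 2.x-era keyword arguments (safe, w, j); the exact spellings vary between driver versions, and the employee names are made up:

import pymongo

employees = pymongo.Connection()["tutorial"]["employees"]

# Fire-and-forget: fastest, no acknowledgement at all.
employees.insert({"name": "Rick Hightower"})

# Wait until the write is on the primary plus one other member (option 2 above).
employees.insert({"name": "Diana Hightower"}, safe=True, w=2)

# Wait until a majority of replica set members have the write (option 3 above).
employees.insert({"name": "Lucas Hightower"}, safe=True, w="majority")

# Wait until the write has been committed to the journal (option 4 above).
employees.insert({"name": "Whitney Hightower"}, safe=True, j=True)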

Journaling: Is durability overvalued if RAM is the new Disk? Data Safety versus durability

It may seem strange to some that journaling was added as late as version 1.8 to MongoDB. Journaling is only now the default for 64-bit OS installs as of MongoDB 2.0. Prior to that, if the data was very important you typically used replication to make sure write operations were copied to a replica before proceeding. The thought being that one server might go down, but two servers are very unlikely to go down at the same time. Unless somebody backs a truck over a high voltage utility pole, causing all of your air conditioning equipment to stop working long enough for all of your servers to overheat at once, but that never happens (it happened to Rackspace and Amazon). And if you were worried about this, you would have replication across availability zones, but I digress.

At one point MongoDB did not have single server durability; now it does with the addition of journaling. But this is far from a moot point. The general thought from the MongoDB community was, and maybe still is, that to achieve Web Scale, durability was a thing of the past. After all, memory is the new disk. If you could get the data onto a second server or two, then the chances of them all going down at once is very, very low. How often do servers go down these days? What are the chances of two servers going down at once? The general thought from the MongoDB community was (is?) that durability is overvalued and just not Web Scale. Whether this is a valid point or not, there was much fun made of it at MongoDB's expense (rated R, Mature 17+).

As you recall, MongoDB uses memory-mapped files for its storage engine, so it can be a while before the data in memory gets synced to disk by the operating system. Thus if you did have several machines go down at once (which should be very rare), complete recoverability would be impossible. There were workarounds with tradeoffs: for example, to get around or minimize this issue (now a non-issue), you could force MongoDB to do an fsync of the data in memory to the file system, but as you might guess, even with RAID level four and a really awesome server, that can get slow quickly. The moral of the story is that MongoDB has journaling as well as many other options, so you can decide on the best engineering tradeoff among data safety, raw speed and scalability. You get to pick. Choose wisely.

The reality is that no solution offers "complete" reliability, and if you are willing to allow for some loss (which you can with some data), you can get enormous improvements in speed and scale. Let's face it: your virtual farm game data is just not as important as Wells Fargo's bank transactions. I know your mom will get upset when she loses the virtual tractor she bought for her virtual farm with her virtual money, but unless she pays real money she will likely get over it. I've lost a few posts on Twitter over the years, and I have not sued once. If your servers have an uptime of 99 percent and you block/replicate to three servers, then the probability of them all going down at once is 0.01^3 = 0.000001, or 1 in 1,000,000. Of course, the uptime of modern operating systems (Linux) is much higher than this, so one in 100,000,000 or better is possible with just three servers. Amazon EC2 offers discounts if they can't maintain an SLA of 99.95% (other cloud providers have even higher SLAs). If you were worried about geographic problems you could replicate to another availability zone or a geographic area connected with a high-speed WAN. How much speed and reliability do you need? How much money do you have?

An article on when to use MongoDB journaling versus the older recommendations would be a welcome addition. Generally it seems journaling is mostly a requirement for very sensitive financial data and single server solutions. Your results may vary; don't trust my math, it has been a few years since I got a B+ in statistics, and I am no expert on the SLAs of modern commodity servers (the above was just spitballing).

If you have ever used a single non-clustered RDBMS for a production system that relied on frequent backups and a transaction log (journaling) for data safety, raise your hand. Ok, if you raised your hand, then you just may not need autosharding or replica sets. To start with MongoDB, just use a single server with journaling turned on. If you require speed, you can configure MongoDB journaling to batch writes to the journal (which is the default). This is a good model to start out with and probably very much like quite a few applications you have already worked on (assuming that most applications don't need high availability). The difference is, of course, that if your application is later deemed to need high availability, read scalability, or write scalability, MongoDB has you covered. Also, setting up high availability seems easier on MongoDB than on other more established solutions.
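For instance, using the mongodb.config file created in the installation section later in this article, turning journaling on explicitly for a single server is one extra line (journal=true is the 2.0-era config option; on 64-bit builds it is already the default):

Listing: Single server with journaling

$ cat /etc/mongodb/mongodb.config
dbpath=/etc/mongodb/data
journal=true

$ mongod --config /etc/mongodb/mongodb.config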

Figure 3: Simple setup with journaling and single server ok for a lot of applications


If you can afford two other servers and your app reads more than it writes, you can get improved high availability and increased read scalability with replica sets. If your application is write intensive then you might need autosharding. The point is you don’t have to be Facebook or Twitter to use MongoDB. You can even be working on a one-off dinky application. MongoDB scales down as well as up.

Autosharding

Replica sets are good for failover and speeding up reads, but to speed up writes, you need autosharding. According to a talk by Roger Bodamer on Scaling with MongoDB, 90% of projects do not need autosharding. Conversely almost all projects will benefit from replication and high availability provided by replica sets. Also once MongoDB improves its concurrency in version 2.2 and beyond, it may be the case that 97% of projects don’t need autosharding.

Sharding allows MongoDB to scale horizontally. Sharding is also called partitioning. You assign each of your servers a portion of the data to hold, or the system does this for you. MongoDB can automatically change partitions for optimal data distribution and load balancing, and it allows you to elastically add new nodes (MongoDB instances). How to set up autosharding is beyond the scope of this introductory article. Autosharding can support automatic failover (along with replica sets). There is no single point of failure. Remember, 90% of deployments don't need sharding, but if you do need scalable writes (apps like Foursquare, Twitter, etc.), then autosharding was designed to work with minimal impact on your client code.

There are three main process actors for autosharding: mongod (the database daemon), mongos, and the client driver library. Each mongod instance gets a shard. Mongod is the process that manages databases and collections. Mongos is a router; it routes writes to the correct mongod instance for autosharding. Mongos also handles figuring out which shards will have data for a query. To the client driver, mongos looks more or less like a mongod process (autosharding is transparent to the client drivers).

Figure 4: MongoDB Autosharding


Autosharding increases write and read throughput, and helps with scale out. Replica sets are for high availability and read throughput. You can combine them as shown in figure 5.

Figure 5: MongoDB Autosharding plus Replica Sets for scalable reads, scalable writes, and high availability


You shard on an indexed field in a document. Mongos collaborates with config servers (mongod instances acting as config servers), which hold the shard topology (where the key ranges live). Shards are just normal mongod instances. Config servers hold meta-data about the cluster and are also mongod instances.

Shards are further broken down into 64 MB pieces called chunks. A chunk is 64 MB worth of documents for a collection. Config servers record which shard each chunk lives in. Autosharding happens by moving these chunks around and distributing them among the individual shards. The mongos processes have a balancer routine that wakes up every so often and checks how many chunks a particular shard has. If a particular shard has too many chunks (nine more chunks than another shard), then mongos starts to move data from one shard to another to balance the data capacity among the shards. Once the data is moved, the config servers are updated in a two phase commit (updates to the shard topology are only allowed when all three config servers are up).

The config servers contain a versioned shard topology and are the gatekeepers for autosharding balancing. This topology maps which shard has which keys. The config servers are like a DNS server for shards. The mongos processes use the config servers to find where shard keys live. Mongod instances are shards that can be replicated using replica sets for high availability. Mongos and config server processes do not need to be on their own servers and can live on the primary box of a replica set, for example. For sharding you need at least three config servers, and shard topologies cannot change unless all three are up at the same time. This ensures consistency of the shard topology. The full autosharding topology is shown in figure 6. An excellent talk on the internals of MongoDB sharding was given by Kristina Chodorow, author of Scaling MongoDB, at OSCON 2011, if you would like to know more.

Figure 6: MongoDB Autosharding full topology for large deployment including Replica Sets, Mongos routers, Mongod Instance, and Config Servers


MapReduce

MongoDB has MapReduce capabilities for batch processing, similar to Hadoop. Massive aggregation is possible through the divide-and-conquer nature of MapReduce. Before the aggregation framework, MongoDB's MapReduce could be used to implement what you might do with SQL projections (GROUP BY in SQL). MongoDB has since added the aggregation framework, which removes the need for MapReduce in common aggregation cases. In MongoDB, map and reduce functions are written in JavaScript. These functions are executed on the servers (mongod), which allows the code to be next to the data it is operating on (think stored procedures, but meant to execute on distributed nodes and then be collected and filtered). The results can be copied to a results collection.

MongoDB also provides incremental MapReduce. This allows you to run a MapReduce job over a collection and then later run a second job over only the new documents in the collection. You can use this to reduce the work required by merging new data into the existing results collection.
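Here is a rough pymongo sketch of a MapReduce job that counts employees per gender; the JavaScript functions are passed as bson Code objects, and the output collection name is made up:

import pymongo
from bson.code import Code

db = pymongo.Connection()["tutorial"]

map_fn = Code("function() { emit(this.gender, 1); }")                    # runs on each document
reduce_fn = Code("function(key, values) { return Array.sum(values); }")  # combines emitted values

# The results are written to the "gender_counts" collection on the server.
results = db.employees.map_reduce(map_fn, reduce_fn, out="gender_counts")
for doc in results.find():
    print doc     # e.g. {u'_id': u'm', u'value': 2.0}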

Aggregation Framework

The aggregation framework was added in MongoDB 2.1. It is similar to SQL GROUP BY. Before the aggregation framework, you had to use MapReduce for things like SQL's GROUP BY. Using the aggregation framework is easier than MapReduce. Let's cover a small set of aggregation functions and their SQL equivalents, inspired by the Mongo docs, as follows:

Count

SQL
SELECT COUNT(*) FROM employees
MongoDB
db.employees.aggregate([
{ $group: {_id:null, count:{$sum:1}} }
])

Sum of salaries by department for employees who are not retired

SQL
SELECT dept_name, SUM(salary) FROM employees WHERE retired=false GROUP BY dept_name
MongoDB
db.employees.aggregate([ 
 { $match:{retired:false} }, 
 { $group:{_id:"$dept_name",total:{$sum:"$salary"}} } 
])

This is quite an improvement over using MapReduce for common projections like these. Mongo will take care of collecting and summing the results from multiple shards if need be.
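From Python, the same department aggregation can be expressed as a plain list of dicts. This is a sketch assuming a pymongo version new enough to expose Collection.aggregate (2.3+); with older drivers you can issue the same command via db.command:

import pymongo

db = pymongo.Connection()["tutorial"]

pipeline = [
    {"$match": {"retired": False}},
    {"$group": {"_id": "$dept_name", "total": {"$sum": "$salary"}}},
]

response = db.employees.aggregate(pipeline)   # older drivers: db.command("aggregate", "employees", pipeline=pipeline)
for row in response["result"]:                # the 2.x-era response is a dict with a "result" list
    print row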

Installing MongoDB

Let’s mix in some code samples to try out along with the concepts.

To install MongoDB go to their download page, download and untar/unzip the download to ~/mongodb-platform-version/. Next you want to create the directory that will hold the data and create a mongodb.config file (/etc/mongodb/mongodb.config) that points to said directory as follows:

Listing: Installing MongoDB

$ sudo mkdir /etc/mongodb/data

$ cat /etc/mongodb/mongodb.config 
dbpath=/etc/mongodb/data

The /etc/mongodb/mongodb.config has one line dbpath=/etc/mongodb/data that tells mongo where to put the data. Next, you need to link mongodb to /usr/local/mongodb and then add it to the path environment variable as follows:

Listing: Setting up MongoDB on your path

$ sudo ln -s  ~/mongodb-platform-version/  /usr/local/mongodb
$ export PATH=$PATH:/usr/local/mongodb/bin

Run the server passing the configuration file that we created earlier.

Listing: Running the MongoDB server

$ mongod --config /etc/mongodb/mongodb.config

Mongo comes with a nice console application called mongo that lets you execute commands and JavaScript. JavaScript is to Mongo what PL/SQL is to Oracle's database. Let's fire up the console app and poke around.

Firing up the mongo console application

$ mongo
MongoDB shell version: 2.0.4
connecting to: test
…
> db.version()
2.0.4
>

One of the nice things about MongoDB is the self-describing console. It is easy to see what commands a MongoDB database supports with db.help(), as follows:

Client: mongo db.help()

> db.help()
DB methods:
db.addUser(username, password[, readOnly=false])
db.auth(username, password)
db.cloneDatabase(fromhost)
db.commandHelp(name) returns the help for the command
db.copyDatabase(fromdb, todb, fromhost)
db.createCollection(name, { size : ..., capped : ..., max : ... } )
db.currentOp() displays the current operation in the db
db.dropDatabase()
db.eval(func, args) run code server-side
db.getCollection(cname) same as db['cname'] or db.cname
db.getCollectionNames()
db.getLastError() - just returns the err msg string
db.getLastErrorObj() - return full status object
db.getMongo() get the server connection object
db.getMongo().setSlaveOk() allow this connection to read from the nonmaster member of a replica pair
db.getName()
db.getPrevError()
db.getProfilingStatus() - returns if profiling is on and slow threshold 
db.getReplicationInfo()
db.getSiblingDB(name) get the db at the same server as this one
db.isMaster() check replica primary status
db.killOp(opid) kills the current operation in the db
db.listCommands() lists all the db commands
db.logout()
db.printCollectionStats()
db.printReplicationInfo()
db.printSlaveReplicationInfo()
db.printShardingStatus()
db.removeUser(username)
db.repairDatabase()
db.resetError()
db.runCommand(cmdObj) run a database command.  if cmdObj is a string, turns it into { cmdObj : 1 }
db.serverStatus()
db.setProfilingLevel(level,{slowms}) 0=off 1=slow 2=all
db.shutdownServer()
db.stats()
db.version() current version of the server
db.getMongo().setSlaveOk() allow queries on a replication slave server
db.fsyncLock() flush data to disk and lock server for backups
db.fsyncUnlock() unlocks server following a db.fsyncLock()

You can see that some of the commands refer to concepts we discussed earlier. Now let's create an employees collection and do some CRUD operations on it.

Create Employee Collection

> use tutorial;
switched to db tutorial
> db.getCollectionNames();
[ ]
> db.employees.insert({name:'Rick Hightower', gender:'m', phone:'520-555-1212', age:42});
Mon Apr 23 23:50:24 [FileAllocator] allocating new datafile /etc/mongodb/data/tutorial.ns, ...

The use command selects a database. If that database does not exist, it will be lazily created the first time we access it (write to it). The db object refers to the current database. The current database does not have any document collections to start with (this is why db.getCollectionNames() returns an empty list). To create a document collection, just insert a new document. Collections, like databases, are lazily created when they are actually used. You can see that two collections were created when we inserted our first document into the employees collection as follows:

> db.getCollectionNames();
[ "employees", "system.indexes" ]

The first collection is our employees collection and the second collection is used to hold onto indexes we create.

To list all employees you just call the find method on the employees collection.

> db.employees.find()
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
    "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

The above is the query syntax for MongoDB. There is no separate SQL-like language; you just execute JavaScript code, passing documents, which are just JavaScript associative arrays, err, I mean JavaScript objects. To find a particular employee, you do this:

> db.employees.find({name:"Bob"})

Bob quit so to find another employee, you would do this:

> db.employees.find({name:"Rick Hightower"})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

The console application just prints out the document right to the screen. I don’t feel 42. At least I am not 100 as shown by this query:

> db.employees.find({age:{$lt:100}})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

Notice that to get employees under 100, you pass a document containing a subdocument whose key is the operator ($lt) and whose value is the threshold (100). Mongo supports all of the operators you would expect, like $lt for less than, $gt for greater than, etc. If you know JavaScript, it is easy to inspect fields of a document, as follows:

> db.employees.find({age:{$lt:100}})[0].name
Rick Hightower

If we were going to query, sort or shard on employees.name, then we would need to create an index as follows:

db.employees.ensureIndex({name:1}); //ascending index, descending would be -1

Indexing is by default a blocking operation, so if you are indexing a large collection, it could take several minutes and perhaps much longer. This is not something you want to do casually on a production system. There are options to build indexes as a background task, to set up a unique index, complications around indexing on replica sets, and much more. If you are running queries that rely on certain indexes to be performant, you can check whether an index exists with db.employees.getIndexes(). You can also see a list of indexes as follows:

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "tutorial.employees", "name" : "_id_" }

By default, all documents get an object id. If you do not give a document an _id, it will be assigned one by the system (like a criminal suspect gets a lawyer). You can use that _id to look up an object as follows with findOne:

> db.employees.findOne({_id : ObjectId("4f964d3000b5874e7a163895")})
{ "_id" : ObjectId("4f964d3000b5874e7a163895"), "name" : "Rick Hightower", 
   "gender" : "m", "phone" : "520-555-1212", "age" : 42 }

Python MongoDB Setup

Setting up Python and MongoDB is quite easy since Python has its own package manager.

To install the MongoDB lib for Python on Mac OS X, you would do the following:

$ sudo env ARCHFLAGS='-arch i386 -arch x86_64' python -m easy_install pymongo

To install Python MongoDB on Linux or Windows do the following:

$ easy_install pymongo

or

$ pip install pymongo

If you don't have easy_install on your Linux box you may have to do some sudo apt-get install python-setuptools or sudo yum install python-setuptools iterations, although it seems to be installed with most Linux distributions these days. If easy_install or pip is not installed on Windows, try reformatting your hard disk and installing a real OS, or if that is too inconvenient, go here.

Once you have it all set up, you can write some code that is equivalent to the first console examples, as shown in figure 11.

Figure 11: Python code listing part 1


Python has literals for maps, so working with Python is much closer to the JavaScript/console experience from earlier than Java is. Like Java, there are libraries for Python that work with MongoDB (MongoEngine, MongoKit, and more). Even executing queries is very close to the JavaScript experience, as shown in figure 12.

Figure 12: Python code listing part 2


Here is the complete listing to make the cut-and-paste crowd (like me) happy.

Listing: Complete Python listing

import pymongo
from bson.objectid import ObjectId

connection = pymongo.Connection()      # connects to localhost:27017 by default

db = connection["tutorial"]
employees = db["employees"]

# Insert a new employee document.
employees.insert({"name": "Lucas Hightower", 'gender':'m', 'phone':'520-555-1212', 'age':8})

# List all employees.
cursor = db.employees.find()
for employee in cursor:
    print employee

# Find a particular employee by name.
print employees.find({"name":"Rick Hightower"})[0]

# Find employees under 35 using the $lt operator.
cursor = employees.find({"age": {"$lt": 35}})
for employee in cursor:
     print "under 35: %s" % employee

# Look up a document by its _id (the ObjectId below comes from an earlier insert;
# substitute an _id that exists in your own collection).
diana = employees.find_one({"_id":ObjectId("4f984cce72320612f8f432bb")})
print "Diana %s" % diana

The output for the Python example is as follows:

{u'gender': u'm', u'age': 42.0, u'_id': ObjectId('4f964d3000b5874e7a163895'), u'name': u'Rick Hightower', u'phone':
u'520-555-1212'} 

{u'gender': u'f', u'age': 30, u'_id': ObjectId('4f984cae72329d0ecd8716c8'), u'name': u'Diana Hightower', u'phone':
u'520-555-1212'} 

{u'gender': u'm', u'age': 8, u'_id': ObjectId('4f9e111980cbd54eea000000'), u'name': u'Lucas Hightower', u'phone':
u'520-555-1212'}