Forbes top 100 quote collecting spider with Dragline

 

In this post we build a new spider with the Dragline crawling framework: a jump-start, real-world example. By the end of the post, in addition to the Dragline API, we will have covered some basic Python concepts.

Forbes compiles good quotes and displays them on its website. Suppose you want to save them and forward them to your close friends. Since you are a programmer, and a Python coder at that, you intend to fetch all the data from the website with a spider. We will look at many spiders in the coming days, but here we discuss a basic one for fetching Forbes quotes. The aim is to make you familiar with the Dragline framework.

As explained in the previous post, Dragline has many features working behind the scenes that make the spiders built with it smart. We assume Dragline is already installed. If not, see the installation instructions in the previous post.

Task 1:

Learning Basics of dragline api

Dragline mainly consists of these major modules

      • dragline.http
      • dragline.htmlparser

dragline.http

It has a Request class

class dragline.http.Request(url, method='GET', form_data=None, headers={}, callback=None, meta=None)

Parameters:
  • url (string) – the URL of this request
  • method (string) – the HTTP method of this request. Defaults to 'GET'.
  • headers (dict) – the headers of this request.
  • callback (string) – name of the function to call after url is downloaded.
  • meta (dict) – A dict that contains arbitrary metadata for this request.
send()
This method sends the HTTP request.

Returns: response
Return type: dragline.http.Response
Raises: dragline.http.RequestError: when failed to fetch contents
>>> req = Request("http://www.example.org")
>>> response = req.send()
>>> print response.headers['status']
200

and a Response class

class dragline.http.Response(url=None, body=None, headers=None, meta=None)

Parameters:
  • headers (dict) – the headers of this response.
  • body (str) – the response body.
  • meta (dict) – meta copied from request

This class can be used to create a user-defined response for testing your spider, among other uses. It is much easier to work with than the get method of the Requests module.

dragline.htmlparser

A basic parser module for extracting content from HTML data. Its main function is HtmlParser. Even apart from the rest of Dragline, htmlparser is a powerful parsing tool on its own.

HtmlParser Function

dragline.htmlparser.HtmlParser(response)
Parameters: response (dragline.http.Response)

This function takes a response object as its argument and returns an lxml etree object.

The object returned by HtmlParser is of type HtmlElement, which has a few useful methods. The details of the lxml object are discussed in the section lxml.html.HtmlElement.

First we create an HtmlElement object for the page we want to scrape, by requesting its URL and passing the response to HtmlParser.

The HtmlElement object is returned by the HtmlParser function of the dragline.htmlparser module:

>>> req = Request('http://www.gutenberg.com')
>>> parse_object = HtmlParser(req.send())

The methods available on the HtmlElement object are:

extract_urls(xpath_expr)

This function fetches all the links from the web page in the response that match the XPath given as its argument.

If no XPath is given, links are fetched from the entire document. Continuing the previous example, with parse_object as the HtmlElement:

>>> parse_object.extract_urls('//div[@class="product"]')

xpath(expression)

This function returns the results of the XPath expression directly. It is used to fetch HTML body elements directly. For example, given this HTML:

<html>
    <head>
    </head>
    <body>
        <div class="tree">
            <a href="http://www.treesforthefuture.org/">Botany</a>
        </div>
        <div class="animal">
            <a href="http://www.animalplanet.com/">Zoology</a>
        </div>
    </body>
</html>

we can use the following XPath expression:

>>> parse_object.extract_urls('//div[@class="tree"]')
extract_text(xpath_expr)

This function grabs all the text from the web page. The XPath argument is optional; if it is given, only the text matching the XPath expression is returned.

>>> parse_object.extract_text('//html')
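
To see what these XPath calls select, here is a minimal sketch that runs against the sample HTML above. It uses the standard library's xml.etree.ElementTree as a stand-in, not Dragline's HtmlParser, so the helper names links_under and text_under are hypothetical and only approximate what extract_urls and extract_text do. ElementTree supports only a subset of XPath and needs well-formed markup, which the sample happens to be.

```python
# Stand-in sketch with the standard library; this is NOT Dragline's HtmlParser.
import xml.etree.ElementTree as ET

PAGE = """<html>
    <head>
    </head>
    <body>
        <div class="tree">
            <a href="http://www.treesforthefuture.org/">Botany</a>
        </div>
        <div class="animal">
            <a href="http://www.animalplanet.com/">Zoology</a>
        </div>
    </body>
</html>"""

root = ET.fromstring(PAGE)


def links_under(root, path):
    """Hypothetical helper: href of every <a> under the elements matching path."""
    return [a.get("href") for node in root.findall(path) for a in node.iter("a")]


def text_under(root, path):
    """Hypothetical helper: non-blank text fragments under the matching elements."""
    return [t.strip() for node in root.findall(path)
            for t in node.itertext() if t.strip()]


# Links restricted to one div, then all text inside <body>.
tree_links = links_under(root, ".//div[@class='tree']")
body_text = text_under(root, ".//body")
print(tree_links)
print(body_text)
```

Restricting the path to the "tree" div returns only the Botany link, which is the same narrowing effect the XPath argument has in the Dragline calls.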

So now you know the main modules of Dragline and the important methods in them.

Now let’s begin our journey by writing a small spider.
First go to the folder where you want to save your spider and follow the procedure below.
  • $ mkdir samplespider
  • $ cd samplespider
  • $ dragline-admin init forbesquotes

This creates a spider called forbesquotes in your newly created samplespider directory.

You will now see a folder named forbesquotes inside samplespider. Move into it:

  • $ cd forbesquotes

Task 2:

Writing a spider for collecting the top 100 quotes from Forbes

This is the 26-line spider for extracting the top 100 quotes from Forbes.

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re

class Spider:
    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)
        self.parseQuote(response)  # quotes on the first page
        for url in html.xpath('//span[@class="page_links"]/a/@href'):
            yield Request(url,callback="parseQuote")  # remaining pages

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')  # leading quote number, e.g. "12."
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

That is a 26-line spider written with Dragline. At first glance it may not make much sense, so let’s go through it step by step.

As mentioned earlier, creating a new spider creates a new directory named after the spider. It contains two files:

  • main.py
  • settings.py

main.py looks like the following, with a default class called Spider and two methods, __init__ and parse.

from dragline.htmlparser import HtmlParser
from dragline.http import Request


class Spider:

    def __init__(self, conf):
       self.name = "forbesquotes"
       self.start = "http://www.example.org"
       self.allowed_domains = []
       self.conf = conf

    def parse(self,response):
       html = HtmlParser(response)

All of this is generated for us, so we do not have to write it by hand. Now we only need to concentrate on how to attack the problem.

1) The __init__ method sets the starting URL and the allowed domains for the spider.

In our case the forbesquotes spider starts at self.start = 'http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes'

and we set self.allowed_domains = ['www.forbes.com']

It is a list, so it can hold more than one allowed domain.
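
To make the idea of an allowed-domains list concrete, here is a rough sketch of the kind of check a crawler could apply before following a link. It is not taken from Dragline's source; the helper name is_allowed and the exact matching rule (an exact host-name match) are assumptions made for illustration.

```python
# Illustrative sketch only; Dragline's real filtering logic may differ.
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is in allowed_domains."""
    host = urlparse(url).hostname  # lower-cased host name without the port
    return host in allowed_domains


allowed = ['www.forbes.com']
print(is_allowed("http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes", allowed))
print(is_allowed("http://www.example.org/", allowed))
```

With a rule like this, links that lead away from www.forbes.com are simply skipped, which keeps the spider from wandering across the web.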

Now our main.py looks like

from dragline.htmlparser import HtmlParser
from dragline.http import Request


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

OK, now we should crawl through the pages, so let’s write a function called parseQuote to process a page. Its input is the response object, and its outcome is that the quotes from the response page are written to a file. parseQuote must run once for every page on which quotes are available. After adding the parseQuote function, main.py looks like this:

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re  # needed by parseQuote


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

If you look at parseQuote, only the first three lines involve the framework; the rest is pure Python logic for stripping and editing the raw quotes fetched from the response and then writing them to a file.
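
The pure-Python part can be tried on its own, without the framework. The sketch below applies the same two steps (dropping non-ASCII characters, then cutting off the leading quote number) to a few made-up fragments written in the style of the page; the sample strings and the helper name clean_quotes are assumptions. It also differs from the spider in two small ways: it skips fragments that have no number instead of slicing with [2:], and it uses \d+ rather than \d* so that a bare full stop is not mistaken for a number.

```python
import re

# A quote number followed by a full stop, e.g. "12."
NUMBER_PREFIX = re.compile(r'\d+\.')


def clean_quotes(fragments):
    """Hypothetical helper mirroring the cleanup steps in parseQuote."""
    cleaned = []
    for text in fragments:
        # Drop characters outside ASCII (curly apostrophes, dashes, ...).
        ascii_text = text.encode('ascii', 'ignore').decode('ascii')
        match = NUMBER_PREFIX.search(ascii_text)
        if match is None:
            continue  # not a numbered quote, so skip it
        # Keep everything after the first occurrence of the number.
        cleaned.append(ascii_text.split(match.group(), 1)[1])
    return cleaned


# Made-up sample input imitating the text nodes of the page.
sample = [
    u' ',
    u'3. Strive not to be a success, but rather to be of value. \u2013Albert Einstein',
    u'6. You miss 100% of the shots you don\u2019t take. \u2013Wayne Gretzky',
]
result = clean_quotes(sample)
for quote in result:
    print(quote)
```

Dropping the non-ASCII characters is also why words such as "dont" and "isnt" lose their apostrophes in the output file.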

parse is the function where the spider’s execution starts. From there we supply callbacks for the pages we wish to navigate to, so the spider follows exactly the path we specify.
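
Since a callback is passed as the name of a method, it may help to picture how a crawler could turn that name back into a call. The toy scheduler below is only a guess at the general mechanism, not Dragline's implementation: FakeRequest, DemoSpider and run are hypothetical names, and the callbacks receive a plain URL where the real framework would pass a response object.

```python
class FakeRequest(object):
    """Stand-in for illustration only; not dragline.http.Request."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback  # method name given as a string


class DemoSpider(object):
    def __init__(self):
        self.seen = []

    def parse(self, url):
        # Entry point: record the visit, then ask for two more pages.
        self.seen.append(('parse', url))
        for page in (2, 3):
            yield FakeRequest('%s/%d' % (url, page), callback='parseQuote')

    def parseQuote(self, url):
        self.seen.append(('parseQuote', url))


def run(spider, start_url):
    """Toy scheduler: look up each callback by name and call it."""
    queue = [FakeRequest(start_url, callback='parse')]
    while queue:
        request = queue.pop(0)
        handler = getattr(spider, request.callback)  # name -> bound method
        result = handler(request.url)
        if result is not None:
            queue.extend(result)  # follow-up requests yielded by the callback
    return spider.seen


visited = run(DemoSpider(), 'http://example.org/quotes')
for step in visited:
    print(step)
```

The point of the sketch is that parse only describes where to go next; whatever runs the spider decides when each callback is actually invoked.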

So now I add content to the parse method. After studying the structure of the web pages, I call parseQuote on the current response.

Next, using the extract_urls method of the Dragline HtmlElement object, I extract all the URLs matching the relevant XPath and yield a Request for each one with parseQuote as its callback. The resulting code looks like this:

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re

class Spider:
    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf
 
    def parse(self,response):
        html = HtmlParser(response)
        self.parseQuote(response)
        for url in html.extract_urls('//span[@class="page_links"]/a'):
            yield Request(url,callback="parseQuote")

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

Now, after completing main.py, go to the terminal and type one of the following commands to run the spider.

  • $ dragline .  (from inside the spider directory)
  • $ dragline /path_to_spider/  (from any other path)

The spider then starts running, displaying all processed URLs as information in the command prompt, and a new file, quotes.txt, is created in the current directory with the top 100 quotes:

 Life isnt about getting and having, its about giving and being. 

 Whatever the mind of man can conceive and believe, it can achieve. Napoleon Hill

 Strive not to be a success, but rather to be of value. Albert Einstein

 Two roads diverged in a wood, and II took the one less traveled by, And that has made all the difference. Robert Frost

 I attribute my success to this: I never gave or took any excuse. Florence Nightingale

 You miss 100% of the shots you dont take. Wayne Gretzky

 Ive missed more than 9000 shots in my career. Ive lost almost 300 games. 26 times Ive been trusted to take the game winning shot and missed. Ive failed over and over and over again in my life. And that is why I succeed. Michael Jordan

 The most difficult thing is the decision to act, the rest is merely tenacity. Amelia Earhart

 Every strike brings me closer to the next home run. Babe Ruth

 Definiteness of purpose is the starting point of all achievement. W. Clement Stone

 We must balance conspicuous consumption with conscious capitalism. Kevin Kruse

 Life is what happens to you while youre busy making other plans. John Lennon

 We become what we think about. Earl Nightingale

Twenty years from now you will be more disappointed by the things that you didnt do than by the ones you did do, so throw off the bowlines, sail away from safe harbor, catch the trade winds in your sails. Explore, Dream, Discover. Mark Twain

Life is 10% what happens to me and 90% of how I react to it. Charles Swindoll

 There is only one way to avoid criticism: do nothing, say nothing, and be nothing. Aristotle

 Ask and it will be given to you; search, and you will find; knock and the door will be opened for you. Jesus

 The only person you are destined to become is the person you decide to be. Ralph Waldo Emerson

 Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau

 When I stand before God at the end of my life, I would hope that I would not have a single bit of talent left and could say, I used everything you gave me. Erma Bombeck

 Few things can help an individual more than to place responsibility on him, and to let him know that you trust him. Booker T. Washington

 Certain things catch your eye, but pursue only those that capture the heart. Ancient Indian Proverb

 Believe you can and youre halfway there. Theodore Roosevelt

 Everything youve ever wanted is on the other side of fear. George Addair

 We can easily forgive a child who is afraid of the dark; the real tragedy of life is when men are afraid of the light. Plato

 

 If youre offered a seat on a rocket ship, dont ask what seat! Just get on. Sheryl Sandberg

 First, have a definite, clear practical ideal; a goal, an objective. Second, have the necessary means to achieve your ends; wisdom, money, materials, and methods. Third, adjust all your means to that end. Aristotle

 If the wind will not serve, take to the oars. Latin Proverb

 You cant fall if you dont climb. But theres no joy in living your whole life on the ground. Unknown

 We must believe that we are gifted for something, and that this thing, at whatever cost, must be attained. Marie Curie

 Too many of us are not living our dreams because we are living our fears. Les Brown

 Challenges are what make life interesting and overcoming them is what makes life meaningful. Joshua J. Marine

 If you want to lift yourself up, lift up someone else. Booker T. Washington

 I have been impressed with the urgency of doing. Knowing is not enough; we must apply. Being willing is not enough; we must do. Leonardo da Vinci

 Limitations live only in our minds. But if we use our imaginations, our possibilities become limitless. Jamie Paolinetti

 You take your life in your own hands, and what happens? A terrible thing, no one to blame. Erica Jong

 Whats money? A man is a success if he gets up in the morning and goes to bed at night and in between does what he wants to do. Bob Dylan

 I didnt fail the test. I just found 100 ways to do it wrong. Benjamin Franklin

 Nothing is impossible, the word itself says, Im possible! Audrey Hepburn

 The only way to do great work is to love what you do. Steve Jobs

 If you can dream it, you can achieve it. Zig Ziglar

 Life isnt about getting and having, its about giving and being. 

 Whatever the mind of man can conceive and believe, it can achieve. Napoleon Hill

 Strive not to be a success, but rather to be of value. Albert Einstein
          and so on ....................
                 

So this is a very small example; it is really like killing an ant with an axe. The main aim of this post is to introduce Dragline and make you familiar with it. Crawling is not always legal, so weigh the risks and benefits before you write a spider. Several Python techniques were used along the way, such as list comprehensions and regular expressions. Hope you enjoyed it. Comment if you have any queries.
