Screenshot top 20 websites from top 20 categories using Python


Yes, you heard it right. In this post we are going to simulate the Wayback Machine using Python: we are going to take screenshots of the top 20 websites from the top 20 categories.

We are building a project similar to http://www.waybackmachine.com, except that here we save a screenshot of each top website as an image on our own computer. Along with that, we save all those top websites' URLs in a text file for future use.

Let us build a SnapShotter

To build the SnapShotter (I named it that way), we need to answer two questions:

1. How do we get the URLs of the top 20 websites in the different categories?

2. How do we navigate to each URL and take a snapshot of it?

We will take a step-by-step approach, and everything will be clear by the end of this post. No hurry.

step 1: Know about spynner

First, let us look at how to screenshot a web page. Web pages are rendered by the WebKit browser engine, and there is a convenient Python wrapper around it (via PyQt) called spynner, which is easy to use. Why is it named spynner? Because it lets us drive a headless browser to test web page rendering, similar to PhantomJS, and so it acts like a spy.

I advise you to install spynner, but don't jump straight to pip: spynner depends on system packages (PyQt and WebKit) that pip alone won't set up. A clear installation procedure is given here: install Spynner.
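Before going further, a quick sanity check that the install worked (a minimal sketch; the file name is mine, and it only confirms the module imports and a headless browser can start):

#check_install.py - confirm spynner imports and a browser can be created
import spynner

browser = spynner.Browser()
print(spynner.__file__)  #path of the installed module
browser.close()          #release the browser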

Now open the Python terminal and type the following:

>>>import spynner
>>>browser = spynner.Browser()
>>>browser.load('http://www.example.com')
>>>browser.snapshot().save('example.png')

We create a browser instance, then load a URL into that headless browser (note that the scheme, http://, is required). The last line screenshots example.com and saves the PNG in the current working directory with the file name 'example.png'.
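The same three steps can be wrapped in a small reusable helper (a sketch; snap is a name I made up, and it assumes only the spynner calls shown above):

#snap.py - load a URL in a headless WebKit browser and save a PNG
import spynner

def snap(url, out_path):
    browser = spynner.Browser()
    try:
        browser.load(url)                  #full URL with http:// required
        browser.snapshot().save(out_path)  #format inferred from extension
    finally:
        browser.close()                    #release the browser either way

snap('http://www.example.com', 'example.png')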

So now we have a way to capture a web page as an image. Next, let's go and get the URLs required for our project.

step 2: Design the scraper

We need to write a small web crawler to fetch the required URLs. I found a website, http://www.top20.com, that lists the top 20 websites from the top 20 categories. First roam the website and see how it is designed. We have 400 URLs to screenshot, and doing that manually is a Herculean task; that is why we need a crawler.

#scraper.py
from lxml import html
import requests

def scrape(url, expr):
    #get the response
    page = requests.get(url)
    #build an lxml tree from the response body
    tree = html.fromstring(page.text)
    #use xpath() to fetch matching DOM elements, deduplicated via set
    url_box = set(tree.xpath(expr))
    return url_box
We create a new file called scraper.py with a function called scrape() in it; we will use it to build our crawler. Observe that the scrape() function takes a URL and an XPath expression as its arguments and returns the set of all matching URLs on the given web page. To crawl from one web page to another, we need all the navigation URLs on that page.
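For example, we can call it against the site's front page (the XPath expression here is the same one used later in the post; adjust it if the page's actual DOM differs):

from scraper import scrape

#fetch every href carried by <a class="link"> anchors on the front page
links = scrape('http://www.top20.com', '//a[@class="link"]/@href')
for link in links:
    print(link)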
step 3: Design the crawler body
Now we are going to write the code that scrapes the links of all the top websites from http://www.top20.com:
#SnapShotter.py
from scraper import scrape
import spynner

#Initializations
browser = spynner.Browser()
w = open('top20sites.txt','w')
base_url = 'http://www.top20.com'
Now we are done with the imports and initialization. The next job is to write handlers for navigating from one web page to another.
def scrape_page():
    for scraped_url in scrape(base_url,'//a[@class="link"]/@href'):
        yield scraped_url

The scrape_page() function calls scrape() with base_url and an XPath expression, gets the URL of each category page, and yields it. The XPath expression is designed entirely by observing the DOM structure of the web page. If you have doubts about writing XPath expressions, kindly refer to http://lxml.de/xpathxslt.html
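To see what that expression does in isolation, here is a tiny standalone demonstration with lxml (the HTML snippet is made up):

from lxml import html

#a miniature page with the same structure our XPath targets
doc = html.fromstring(
    '<ul>'
    '<li><a class="link" href="/games">Games</a></li>'
    '<li><a class="link" href="/news">News</a></li>'
    '<li><a class="ad" href="/sponsor">Sponsor</a></li>'
    '</ul>'
)
#select the href of every anchor whose class is exactly "link"
print(doc.xpath('//a[@class="link"]/@href'))  #['/games', '/news']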

def scrape_absolute_url():
    for scraped_url in scrape_page():
        for final_url in scrape(scraped_url,'//a[@class="link"]/@href'):
            yield final_url

This is the second generator, for the second-level pages, each of which lists the top 20 websites for one category. It gets each category link by calling scrape_page(), then sends each of those pages to scrape() with an XPath expression. This function yields the top website URLs, which we consume in another function called save_url().
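The control flow of these chained generators is easier to see in miniature (a self-contained sketch with made-up data):

#two chained generators: the outer one lazily pulls from the inner one,
#exactly like scrape_absolute_url() pulling from scrape_page()
def categories():
    for c in ['news', 'sports']:
        yield c

def sites():
    for c in categories():
        for i in (1, 2):
            yield '%s-site-%d' % (c, i)

print(list(sites()))
#['news-site-1', 'news-site-2', 'sports-site-1', 'sports-site-2']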

def save_url():
    for final_url in scrape_absolute_url():
        browser.load(final_url)
        #'/' and ':' are illegal in file names, so derive a safe name first
        filename = final_url.split('://', 1)[-1].rstrip('/').replace('/', '_')
        browser.snapshot().save('%s.png' % filename)
        w.write(final_url+'\n')

save_url() loads each URL yielded by scrape_absolute_url(), saves a screenshot of the page (after turning the URL into a file-system-safe name), and also writes the URL to the text file "top20sites.txt" that we opened earlier.
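The file-name line deserves a word: a raw URL such as http://www.example.com/ contains '/' and ':' and therefore cannot be used as a file name directly. Here is the transformation on its own (pure string handling, nothing beyond what the loop above does):

def url_to_filename(url):
    #strip the scheme, drop a trailing slash, flatten the path
    return url.split('://', 1)[-1].rstrip('/').replace('/', '_') + '.png'

print(url_to_filename('http://www.example.com/'))  #www.example.com.png
print(url_to_filename('http://site.com/a/b'))      #site.com_a_b.png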

step 4: Initiate the handlers
save_url()

This is the starting point of our program. We call save_url(), which pulls from scrape_absolute_url(), which in turn pulls from scrape_page(). See how the chained generators transfer control. Beautiful, isn't it?
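One practical caveat: if any one of the 400 sites fails to load, an unhandled exception would abort the whole run. A more defensive entry point might look like this (a sketch; save_url_safe is my name, and the broad except is deliberate so one bad site cannot stop the rest):

def save_url_safe():
    for final_url in scrape_absolute_url():
        try:
            browser.load(final_url)
            filename = final_url.split('://', 1)[-1].rstrip('/').replace('/', '_')
            browser.snapshot().save('%s.png' % filename)
            w.write(final_url + '\n')
        except Exception as e:
            #log and move on instead of killing the crawl
            print('skipping %s: %s' % (final_url, e))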

w.close()

Finally we close the file. That's it; our entire code looks like this:

step 5: Complete code
#SnapShotter.py
from scraper import scrape
import spynner

#Initializations
browser = spynner.Browser()
w = open('top20sites.txt','w')
base_url = 'http://www.top20.com'

#rock the spider from here

def scrape_page():
    for scraped_url in scrape(base_url,'//a[@class="link"]/@href'):
        yield scraped_url

def scrape_absolute_url():
    for scraped_url in scrape_page():
        for final_url in scrape(scraped_url,'//a[@class="link"]/@href'):
            yield final_url

def save_url():
    for final_url in scrape_absolute_url():
        browser.load(final_url)
        #derive a file-system-safe name from the URL
        filename = final_url.split('://', 1)[-1].rstrip('/').replace('/', '_')
        browser.snapshot().save('%s.png' % filename)
        w.write(final_url+'\n')

save_url()
w.close()

This completes our SnapShotter: you will get screenshot images in your directory along with a text file listing the URLs of all the top websites. Here is the text file that was generated for me: https://app.box.com/s/895ypei1mlzb2yk0p0gb

Hope you enjoyed this post. This is the basic way to scrape the web systematically.
