Yes, you heard it right. In this post we are going to simulate the Wayback Machine using Python. We are going to take screenshots of the top 20 websites from each of the top 20 categories.
We are creating a project similar to http://www.waybackmachine.com . But here we save a screenshot of each top website as an image on our own computer. Along with that, we save the URLs of all those top websites in a text file for future use.
Let us build a SnapShotter
To build the SnapShotter (I named it that way) we need to answer two questions.
1. How do we get the URLs of the top 20 websites in different categories?
2. How do we then navigate to each URL and take a snapshot of it?
For this we take a step-by-step approach. Everything will become clear in this post. No hurry.
step 1: Get to know spynner
First, how do we screenshot a web page? There are Python bindings for the WebKit browser engine that can help us. Better still, there is a wrapper library around WebKit that is easy to use, and its name is spynner. Why is it named spynner? Because it helps us perform headless testing of web page rendering, similar to PhantomJS, and acts like a spy in a war.
I advise you to install spynner, but don't jump straight to pip. A clear installation procedure is given here; refer to it once: install Spynner.
Now open the Python terminal and type the following:
>>> import spynner
>>> browser = spynner.Browser()
>>> browser.load('http://www.example.com')
>>> browser.snapshot().save('example.png')
First we create a browser instance. Next we load a URL into that headless browser. The last line takes a screenshot of example.com and saves it in the current working directory under the file name 'example.png'.
So now we have a way to capture a web page as an image. Next, let's get the URLs our project needs.
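The three REPL steps can be wrapped in one small function so they are easy to reuse. This is only a sketch: `capture`, `StubBrowser` and `StubImage` are hypothetical names, not part of spynner. The function relies only on the `load()` and `snapshot().save()` calls shown above, so it can be tried with a stand-in object even before spynner is installed.

```python
def capture(browser, url, path):
    """Load url in the given browser and save a screenshot to path."""
    browser.load(url)
    browser.snapshot().save(path)
    return path


class StubImage(object):
    """Stand-in for the image object returned by snapshot()."""
    def __init__(self, log):
        self.log = log

    def save(self, path):
        self.log.append(('save', path))


class StubBrowser(object):
    """Stand-in for spynner.Browser that only records the calls it gets."""
    def __init__(self):
        self.log = []

    def load(self, url):
        self.log.append(('load', url))

    def snapshot(self):
        return StubImage(self.log)


stub = StubBrowser()
capture(stub, 'http://www.example.com', 'example.png')
print(stub.log)
```

With a real browser the call would be capture(spynner.Browser(), 'http://www.example.com', 'example.png').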
step 2: Design the scraper
We need to write a small web crawler to fetch the required URLs. I found this website, http://www.top20.com , which lists the top 20 websites from the top 20 categories. First browse the website and see how it is designed. We need screenshots of around 400 URLs (20 categories x 20 sites). Doing this manually is a Herculean task, and that is why we need a crawler here.
#scraper.py
from lxml import html
import requests

def scrape(url, expr):
    # get the response
    page = requests.get(url)
    # build lxml tree from response body
    tree = html.fromstring(page.text)
    # use xpath() to fetch DOM elements
    url_box = set(tree.xpath(expr))
    return url_box
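Two things are worth knowing about scrape(): it returns a set, so the order of the links on the page is lost, and the href values come back exactly as written in the page, which may be relative paths. The post does not say whether top20.com uses relative links, so treat the following as a precaution. It is a minimal sketch with a hypothetical helper named `absolutize` that resolves each href against the page URL and removes duplicates while keeping the original order.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolutize(page_url, hrefs):
    """Resolve hrefs against page_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = absolutize('http://www.top20.com/news/',
                   ['/sports', 'http://a.example/', '/sports'])
print(links)
```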
step 3: Design crawler body
#SnapShotter.py
from scraper import scrape
import spynner

#Initializations
browser = spynner.Browser()
w = open('top20sites.txt', 'w')
base_url = 'http://www.top20.com'

def scrape_page():
    for scraped_url in scrape(base_url, '//a[@class="link"]/@href'):
        yield scraped_url
The scrape_page() function calls scrape() with base_url and an XPath expression, gets the URL of each category, and yields those URLs one at a time. The XPath expression was written purely by observing the DOM structure of the web page. If you have doubts about writing XPath expressions, kindly refer to http://lxml.de/xpathxslt.html
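To see what an expression like //a[@class="link"]/@href selects, it helps to run it on a tiny page. The sketch below uses the standard library's ElementTree instead of lxml so it runs without extra installs. ElementTree supports only a subset of XPath and has no /@href attribute step, so the code selects the matching a elements and reads href from each. The sample markup and its /news and /sports paths are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; not the real top20.com markup.
SAMPLE = """
<html><body>
  <a class="link" href="http://www.top20.com/news">News</a>
  <a class="nav" href="/about">About</a>
  <a class="link" href="http://www.top20.com/sports">Sports</a>
</body></html>
"""


def extract_links(markup):
    """Return the href of every <a class="link"> element, in document order."""
    root = ET.fromstring(markup)
    return [a.get('href') for a in root.findall(".//a[@class='link']")]


hrefs = extract_links(SAMPLE)
print(hrefs)
```

Only the two anchors with class "link" are picked up; the "nav" anchor is skipped, which is exactly the filtering the predicate in the XPath expression performs.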
def scrape_absolute_url():
    for scraped_url in scrape_page():
        for final_url in scrape(scraped_url, '//a[@class="link"]/@href'):
            yield final_url
This is the second generator, for the second-level pages, each of which lists the top 20 websites of one category. It gets each category link from the scrape_page() generator and passes it to scrape() along with an XPath expression. It then yields each top website URL, which we consume in another function called save_url().
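The way one generator feeds the next can be tried without touching the network. In this sketch the site is replaced by an in-memory dictionary; `FAKE_SITE`, `fake_scrape`, `category_urls` and `site_urls` are all hypothetical names, and the URLs are placeholders.

```python
# Hypothetical stand-in for the website: page URL -> links found on that page.
FAKE_SITE = {
    'http://www.top20.com': ['http://www.top20.com/news',
                             'http://www.top20.com/sports'],
    'http://www.top20.com/news': ['http://news-a.example',
                                  'http://news-b.example'],
    'http://www.top20.com/sports': ['http://sports-a.example'],
}


def fake_scrape(url):
    """Return the links 'found' on a page, or an empty list if unknown."""
    return FAKE_SITE.get(url, [])


def category_urls(base_url):
    """First level: yield one URL per category."""
    for url in fake_scrape(base_url):
        yield url


def site_urls(base_url):
    """Second level: yield every site URL listed under every category."""
    for category in category_urls(base_url):
        for site in fake_scrape(category):
            yield site


all_sites = list(site_urls('http://www.top20.com'))
print(all_sites)
```

Nothing is fetched until the consumer asks for the next item, so a category page is only "scraped" when the loop actually reaches it.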
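def save_url():
    for final_url in scrape_absolute_url():
        browser.load(final_url)
        # a URL contains ':' and '/', which cannot be used in a file name
        name = final_url.split('//')[-1].strip('/').replace('/', '_')
        browser.snapshot().save('%s.png' % name)
        w.write(final_url + '\n')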
save_url() takes a screenshot of every website URL it receives from scrape_absolute_url() and also writes that URL to the text file 'top20sites.txt', which we opened earlier.
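File names derived from URLs need some care, because characters such as ':', '/' and '?' are not allowed or are awkward in file names on most systems. Here is a minimal sketch of one way to build a safe name; `url_to_filename` is a hypothetical helper, and the choice of '_' as the replacement character is just an assumption.

```python
import re


def url_to_filename(url, ext='.png'):
    """Turn a URL into a file name made only of letters, digits, '.', '_' and '-'."""
    # drop the scheme, e.g. 'http://'
    name = re.sub(r'^[A-Za-z][A-Za-z0-9+.-]*://', '', url)
    # collapse every run of other characters into a single underscore
    name = re.sub(r'[^A-Za-z0-9._-]+', '_', name).strip('_')
    return (name or 'page') + ext


print(url_to_filename('http://www.example.com/'))
print(url_to_filename('https://example.com/a/b?x=1'))
```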
step 4: Initiate calling of handlers
This is the starting point of our program. We call save_url(), which pulls from scrape_absolute_url(), which in turn pulls from scrape_page(). See how the generators hand control back and forth. Beautiful, isn't it?
Next we need to close the file. That's it; our entire code looks like this.
step 5: Complete code
#SnapShotter.py
from scraper import scrape
import spynner

#Initializations
browser = spynner.Browser()
w = open('top20sites.txt', 'w')
base_url = 'http://www.top20.com'

#rock the spider from here
def scrape_page():
    for scraped_url in scrape(base_url, '//a[@class="link"]/@href'):
        yield scraped_url

def scrape_absolute_url():
    for scraped_url in scrape_page():
        for final_url in scrape(scraped_url, '//a[@class="link"]/@href'):
            yield final_url

def save_url():
    for final_url in scrape_absolute_url():
        browser.load(final_url)
        # a URL contains ':' and '/', which cannot be used in a file name
        name = final_url.split('//')[-1].strip('/').replace('/', '_')
        browser.snapshot().save('%s.png' % name)
        w.write(final_url + '\n')

save_url()
w.close()
This completes our SnapShotter. You will get the screenshots as image files in your directory, along with a text file listing the URLs of all the top websites. Here is the text file that was generated for me: https://app.box.com/s/895ypei1mlzb2yk0p0gb
Hope you enjoyed this post. This is the basic way to scrape the web systematically.
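With around 400 sites in one run, a single page that fails to load would stop the whole program and leave the text file unclosed. A possible hardening, sketched under assumptions: `run` and `flaky_capture` are hypothetical names, and the capture step is passed in as a plain function so the loop can be tried without spynner. The with statement closes the file even if something goes wrong, and the try/except keeps going after a bad URL.

```python
import os
import tempfile


def run(urls, capture, list_path):
    """Capture each URL, log the successes to list_path, collect the failures."""
    failed = []
    with open(list_path, 'w') as out:  # closed automatically, even on error
        for url in urls:
            try:
                capture(url)
            except Exception as exc:
                failed.append((url, str(exc)))
            else:
                out.write(url + '\n')
    return failed


def flaky_capture(url):
    """Stand-in for the real screenshot step; fails for one made-up URL."""
    if 'broken' in url:
        raise RuntimeError('could not load ' + url)


list_path = os.path.join(tempfile.mkdtemp(), 'top20sites.txt')
failed = run(['http://a.example', 'http://broken.example', 'http://b.example'],
             flaky_capture, list_path)
print(failed)
```

Only the URLs that were captured successfully end up in the text file; the failures come back as a list so they can be retried later.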