Note: For my other tech articles on JS and software development, visit https://medium.com/dev-bits
Many of us use libraries like requests, lxml, and BeautifulSoup to perform scraping.

I won't discuss frameworks like Scrapy or Dragline here, since the basic scraper underneath them is lxml. My favorite is lxml. Why? It lets you traverse the element tree directly with XPath, which I find much cleaner than BeautifulSoup's search-based approach. Here I am going to take a very interesting example. I was amazed to find that my article appeared in the recent PyCoders Weekly issue 147, so I am taking PyCoders Weekly as the example and scraping all the useful links from its archives. The link to the PyCoders Weekly archives is here.
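To show what I mean by element traversal, here is a minimal sketch; the HTML snippet is invented for illustration, but the XPath is the same one used later in this article:

```python
from lxml import html

# A made-up HTML fragment standing in for a real archive page
snippet = """
<div class="campaign">
  <a href="/archive/issue-147">Issue 147</a>
</div>
<div class="campaign">
  <a href="/archive/issue-146">Issue 146</a>
</div>
"""

tree = html.fromstring(snippet)
# XPath walks the parsed element tree directly; no regular expressions involved
links = tree.xpath('//div[@class="campaign"]/a/@href')
print(links)  # ['/archive/issue-147', '/archive/issue-146']
```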
import requests
from lxml import html

# Store the response
response = requests.get('http://pycoders.com/archive/')
# Build an lxml tree from the response body
tree = html.fromstring(response.text)
# Find all anchor hrefs inside the campaign divs
print(tree.xpath('//div[@class="campaign"]/a/@href'))
When I ran this, I got the following output.
It returned only three links. How is that possible, when there are nearly 133 PyCoders Weekly archives? The archive links are filled in by JavaScript after the page loads, and requests only sees the initial HTML the server sends, so the content we want is simply not in the response. Now let's think about tackling the problem.
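To see why the first attempt comes up nearly empty: on a JavaScript-driven page, the HTML the server sends is mostly an empty shell, and the content is injected by script after load. A toy illustration (the markup is invented for the example):

```python
from lxml import html

# What the server actually sends: an empty container plus a script tag.
raw_page = """
<html><body>
  <div id="archives"></div>
  <script>/* JavaScript fills #archives with campaign divs at load time */</script>
</body></html>
"""

tree = html.fromstring(raw_page)
# The XPath finds nothing, because the campaign divs do not exist yet
links = tree.xpath('//div[@class="campaign"]/a/@href')
print(links)  # []
```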
How can we get the content?
One approach to getting data from JavaScript-rendered web pages is to use the WebKit library. WebKit can do everything a browser can: for some browsers, WebKit is the underlying engine that renders web pages. WebKit is available as part of the Qt library, so if you have Qt and PyQt4 installed, you are ready to go.
You can install it with the command:
sudo apt-get install python-qt4
With that installed, let's retry the fetch, but with a different approach.
Here comes the solution
We first make the request through WebKit and wait until the page has fully loaded, then hand the completed HTML back in a variable. Then we parse that HTML with lxml and extract the results. This process is a little slower, but you will be surprised to see how perfectly the content is fetched.
Take this code for granted for now:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
The Render class renders the web page. It subclasses QWebPage and takes the URL of the page to scrape as input. Don't worry about the details: just remember that when we create a Render object, it loads the page completely and stores a frame containing all the information about it.
url = 'http://pycoders.com/archive/'
# This does the magic: loads everything
r = Render(url)
# result is a QString
result = r.frame.toHtml()
We store the rendered HTML in the variable result. It is a QString, not a plain Python string, so it must be converted before lxml can process it.
# The QString must be converted to a string before lxml can process it
formatted_result = str(result.toAscii())
# Next, build an lxml tree from formatted_result
tree = html.fromstring(formatted_result)
# Now, with the same XPath as before, fetch the archive URLs
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
print(archive_links)
This time it gives us all the archive links, and the output is fully populated.
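If the extracted hrefs turn out to be relative (which depends on how the page is written), they can be turned into absolute URLs with the standard library. A sketch, using hypothetical link values:

```python
try:
    from urlparse import urljoin       # Python 2
except ImportError:
    from urllib.parse import urljoin   # Python 3

base = 'http://pycoders.com/archive/'
# Hypothetical relative links, as the XPath might return them
archive_links = ['/archive/issue-147', '/archive/issue-146']
# Resolve each one against the page URL
absolute = [urljoin(base, link) for link in archive_links]
print(absolute)
# ['http://pycoders.com/archive/issue-147', 'http://pycoders.com/archive/issue-146']
```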
The complete code looks like this:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted; just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important: convert the QString to ASCII for lxml to process
tree = html.fromstring(str(result.toAscii()))
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
print(archive_links)
All the best.