Ultimate guide for scraping JavaScript rendered web pages

We have all scraped web pages: the HTML returned in the response contains our data, and we parse it to pull out the results we need. But when a page is built with JavaScript, the real data only appears after the rendering step. If we use the plain requests package in that situation, the responses that come back contain none of that data. Browsers know how to render and display the final result, but how can a program? Here is a powerful, simple solution for scraping any JavaScript-rendered website.

Many of us use the following libraries to perform scraping.



I am not mentioning Scrapy or Dragline here, since the underlying parser in both is lxml. My favorite is lxml. Why? It offers element-traversal methods rather than relying on the regular-expression style of matching the way BeautifulSoup does.

Here I am going to take a very interesting example. I was amazed to find that my article appeared in the recent PyCoders Weekly issue 147, so I am taking PyCoders Weekly as the example and scraping all the useful links from the PyCoders archives. The link to the PyCoders Weekly archives is here.
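To make the element-traversal point above concrete, here is a tiny, self-contained sketch; the HTML fragment in it is invented purely for illustration and has nothing to do with the PyCoders page.

from lxml import html

#A made-up fragment, just to show tree traversal instead of regex matching
doc = html.fromstring('<div class="campaign"><a href="/issue-147">Issue 147</a></div>')

link = doc.find('.//a')                 #walk down the tree to the anchor element
print link.get('href')                  #prints /issue-147
print link.getparent().get('class')     #walk back up to the parent div: campaign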


It is a completely JavaScript-rendered website. I want the links to all the archives, and then every link inside each archive post. How do we do that? First, I will show that the plain HTTP approach returns nothing.

import requests
from lxml import html

#storing response
response = requests.get('http://pycoders.com/archive/')

#creating lxml tree from response body
tree = html.fromstring(response.text)

#Extracting the archive link hrefs from the response
print tree.xpath('//div[@class="campaign"]/a/@href')

When I run this, I get the following output:


It returned only three links. How is that possible, when there are nearly 133 PyCoders Weekly archives? The JavaScript-generated archive links simply are not in the response. Now let us think about tackling the problem.

How can we get the content?

There is one good approach to getting data from JS-rendered web pages: the WebKit library. WebKit can do everything a browser can do. For some browsers, WebKit is the underlying engine that renders web pages. A WebKit port ships as part of the Qt library, so if you have installed Qt and PyQt4 you are ready to go.

You can install it with the following command:

sudo apt-get install python-qt4
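To confirm the bindings are importable before moving on, a quick check like this is enough (a minimal sketch; it only prints the Qt version PyQt4 was built against):

#Sanity check: can we import the Qt core and WebKit bindings?
from PyQt4.QtCore import QT_VERSION_STR
from PyQt4.QtWebKit import QWebPage

print QT_VERSION_STR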

Now everything is in place. Let us retry the fetch, but with a different approach.

Here comes the solution

We first make the request through WebKit and wait until everything has loaded completely, then capture the finished HTML into a variable. We then parse that HTML with lxml and extract the results. This process is a little slow, but you will be surprised to see how perfectly the content is fetched.

Take this code for granted for now:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()
  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

The Render class renders the web page. It subclasses QWebPage and takes the URL of the page to scrape as its constructor argument. It does a bit of Qt plumbing; don't worry about the details. Just remember that when we create a Render object, it loads everything and exposes a frame containing all the information of the fully rendered web page.

url = 'http://pycoders.com/archive/'  
#This does the magic. It loads everything.
r = Render(url)  
#result is a QString.
result = r.frame.toHtml()

We store the rendered HTML in the variable result. It is a QString, not a plain string that lxml can process, so we need to convert it before handing the content to lxml.

#The QString must be converted to a plain string before lxml can process it
formatted_result = str(result.toAscii())

#Next build lxml tree from formatted_result
tree = html.fromstring(formatted_result)

#Now, using the correct XPath, fetch the archive URLs
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
print archive_links

It gives us all the links to the archives, and the output is very well populated.


Next, create Render objects with these links as the URLs and extract the required content from each page; a minimal sketch of that loop follows below. The power of WebKit lets us render a web page programmatically and then fetch its data. Use this technique to get data from any JavaScript-rendered web page.
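Here is a minimal sketch of that second-level crawl, assuming the Render class defined above and the archive_links list we just extracted. Keep in mind that Qt expects a single QApplication per process, so for a long list of URLs you may need to create the QApplication once and share it across Render objects instead of letting each one build its own; the XPath below is only a placeholder for whatever links you actually want from each issue.

#Crawl each archive issue we just discovered (a sketch, not battle-tested)
for link in archive_links:
  #Render the issue page through WebKit
  r = Render(link)
  #Convert the QString result to a plain string for lxml
  issue_html = str(r.frame.toHtml().toAscii())
  issue_tree = html.fromstring(issue_html)
  #Placeholder XPath: grab every outgoing link from the rendered issue
  print issue_tree.xpath('//a/@href')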

The complete code looks like this.

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

#Take this class for granted. Just use the result of the rendering.
class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()
  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
#This step is important. Convert the QString to an ASCII string for lxml to process
tree = html.fromstring(str(result.toAscii()))
archive_links = tree.xpath('//div[@class="campaign"]/a/@href')
print archive_links


I have shown you a fully functional way to scrape a JavaScript-rendered web page. Apply this technique to automate any number of steps, or plug it into a scraping framework in place of its default fetching behavior; a small reusable wrapper along those lines is sketched below. It is slow, but it reliably returns the fully rendered content. I hope you enjoyed the post. Now try this on any website you think is tricky to scrape.
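Here is a minimal sketch of such a wrapper, assuming the Render class from above; the helper name fetch_rendered is my own and not part of any framework's API.

#A drop-in stand-in for requests.get(url).text when pages are JavaScript heavy
def fetch_rendered(url):
  #Render the page through WebKit and return the final HTML as a plain string
  r = Render(url)
  return str(r.frame.toHtml().toAscii())

tree = html.fromstring(fetch_rendered('http://pycoders.com/archive/'))
print tree.xpath('//div[@class="campaign"]/a/@href')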

All the best.

53 thoughts on “Ultimate guide for scraping JavaScript rendered web pages”

  1. When trying to implement it for a long iterative process, def _loadFinished is not getting called. Please reply (is there anything I am missing here? If possible, please give an example too).

    1. Actually, it is creating a Render object there. So if you try to create more objects, the first one may not render properly while _loadFinished is still running. There should be enough time for the objects to render; otherwise you should use some multi-threading implementation.

  2. Very nice, I’ve been bypassing all anti-scraping JS except for one. I’ll use this technique instead of ChromeDriver, which is way more resource-expensive and has some limitations.

  3. Thanks! This is great. I’m running into the same problem as the first poster though. I need to use this on multiple pages. Is there a way to delete the Render object before creating a new one? (the command ‘del’ doesn’t work. It kills my kernel)

  4. Python + Selenium Chrome webdriver set up to work on some scraping; lxml library for the parsing. I need some help as I cannot get my head around this problem. I have an ASP webform, which essentially is the scaffolding around the query buttons/menus etc. The form is filled with a JavaScript-rendered table, over several pages. Page control is served as the bottom-most element of a table. Each frame has ten pages; the previous/next frame is rendered through a pointer to the last/next page at the edges. The JavaScript does a postback with EventTarget and EventArgument fields. Because the pagination moves sequentially, I am using an outermost loop to assert page numbers. For example, asserting page #11 on frame one (ten pages per frame) will take the view to the next frame and the (Chrome) browser to page 11.
    Problem: Suppose the last page is n and now I assert page n+1. Selenium will not throw a 404 error. But since the last (n-th) page is still resident in the browser, the data scrape still dumps the n-th page. I cannot insert a stop condition as an explicit wait because the DOM is totally valid. Question: How, if at all, can I tear down the last JS-rendered table view before calling the next page? Intuition: if I can scrap the last page, I can use expected conditions to insert a stop condition. I have read far and wide but could not find a work-around. Any help will be useful, and gratefully acknowledged.

  5. Many sites are responding to me saying that I have private browsing turned on and need to turn it off. Is this common? Is it really turned on, or is this just some sort of scrape defense they have?

  6. Thanks for the guide. Can I just ask if there is a typo in the code below?
    archive_links = tree.xpath(‘//divass=”campaign”]/a/@href’)

    I could only get … tree.xpath(‘//div[@class=”campaign”]/a/@href’) to work

    1. Was gonna say this. Also,
      response = requests.get(‘http://pycoders.com/archive

      Great article though, I used to work with selenium webdriver for retrieving javascript generated html, but it needs an actual browser.
      This is much better, thanks!

  7. I’m impressed, I must say. Rarely do I encounter a
    blog that’s equally educative and amusing, and let me tell you, you have hit
    the nail on the head. The issue is something that not enough men and women are speaking intelligently about.
    Now i’m very happy I found this in my search for something
    relating to this.

  8. Hello Naren,

    It’s an awesome article.
    I am stuck scraping “http://www.thecampsite.org/directory” using your code, and I am still unable to see the data related to an employee (you can see the employee details when you click on an employee name). I kindly request you to help me with the approach for extracting the data related to all the employees in the link provided above.
    Do you recommend a similar approach? Yes, I followed your approach and am able to print the HTML data, but not able to get the employee data. There are multiple pages to loop through to get the details of all employees.


  9. If I loop over this code i get an error on the 3rd loop. I have even mixed up the URLs that I’m looking at, inserting one that works in position 1 and 2 for the one in position 3 & 3rd in the first two positions. Always fails on the 3rd. I’m wondering if I need to close out something in the first two so that a third can exist? Tried some googling but obviously don’t have the correct search terms.

    Process finished with exit code -1073741819 (0xC0000005)

  10. How would this be able to handle web pages with POST? For example, I want to send data to a URL, and scrape the results from the page that loads after the POST submission

  11. It is giving me some error:
    Traceback (most recent call last):
    File “jspage.py”, line 32, in
    archive_links = tree.xpath(‘//divass=”campaign”]/a/@href’)
    File “src/lxml/lxml.etree.pyx”, line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:61854)
    File “src/lxml/xpath.pxi”, line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:178516)
    File “src/lxml/xpath.pxi”, line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:177421)
    lxml.etree.XPathEvalError: Invalid expression

  12. Hey!

    Thanks for this, amazing post! Really a sore pain point for many scrapers out there.

    I am trying to scrape events off this website – https://world.timeout.com/

    the thing is that prior to scraping, I need to adjust the map to a specific region and play around with the filters, like changing the events displayed from ‘suggested’ to ‘most attended’.

    Any tips on how I may go about this?



  13. Hi. I’m trying to execute your example but I’m getting the following error. I know that it’s probably an environment issue, but can you help me find a way to solve this?

    “cannot connect to X server”

    I already tried to install Xvfb, but the problem continues.

      1. After your comment, I tried pyvirtualdisplay. So I imported Display and put this line right before your code.

        display = Display(visible=0, size=(800, 600))

        But it seems like the page is still not rendered. I’m using BeautifulSoup instead of lxml, but that’s not the problem.

        I put a print right after the “toHtml()” call and the page still has unresolved JavaScript.
        As I’m a real noob with pyvirtualdisplay, could I be doing something wrong?

        Do you have any Idea on how to solve it?

        Thank you for you patience and help.

        var _gaq = _gaq || [];
        _gaq.push(['_setAccount', 'UA-1990784-2']);

        (function() {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);

        var _gauges = _gauges || [];
        (function() {
        var t = document.createElement('script');
        t.type = 'text/javascript';
        t.async = true;
        t.id = 'gauges-tracker';
        t.setAttribute('data-site-id', '4f2c8f8f613f5d1549000076');
        t.src = '//secure.gaug.es/track.js';
        var s = document.getElementsByTagName('script')[0];
        s.parentNode.insertBefore(t, s);

  14. Hi, I finally got the “final” code, and it works well thanks to you.

    I crawled a supermarket site to get the products and prices, but I still have one issue.
    As the entire site is dynamic, the products page only loads a few of them. All the products only appear if you scroll down the page.

    This is my code.

    Could you help me load the complete page before parsing?

    This is an example page. It’s a beer section and has almost 250 beers. With my crawler, I only got 80 beers.

    Thank you again.

  15. I am getting the following error when executing the program, using Python 3.4:
    AttributeError: ‘str’ object has no attribute ‘toAscii’

  16. What if the site needs a login?
    I am trying to scrape a webpage that uses JS and requires login.

    I use the following piece of code to log in to the webpage, but I can’t integrate it with this blog’s tutorial. I reckon it is because Qt4 opens a new connection different from the one I used to log in, so it doesn’t recognize the session.
    When I run both the login code and the scrape code (same script), it logs in but doesn’t scrape. It just keeps running forever without ever printing anything or giving an error.

    Do you know a way to log in using Qt4?

    from lxml import html
    import requests

    #def main():
    session_requests = requests.session()
    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='_csrf_token']/@value")))[0]
    # Create payload
    payload = {
    "email": USERNAME,
    "password": PASSWORD,
    "_csrf_token": authenticity_token
    }
    print("Log in")
    # Perform login
    result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))

  17. I’m trying to scrape a JavaScript-heavy web site (with many )
    after I logged in to it (using spynner). I tried to play with your example with no success.
    Using spynner.show() I can tell the webpage is loaded fine, but the HTML contains the webpage prior to the JavaScript processing.

  18. Hi,

    I tried to use the same code given to get the response of a dynamic web page (with JavaScript results). However, the results are not consistent. For some executions I get the response as expected, with the results after JavaScript execution; in other cases only the HTML response without the JavaScript content gets rendered.

    Any suggestions on this?
