Samson, a great hero from mythology, killed all his enemies with a single jawbone. Here we have a single tool to slay all the problems of crawling. The old days of pain and patience are gone. A new library is emerging, built especially for writing powerful spiders for the web. This library will instantly turn you into an amazing Spider-Man who can play with hyperlinks. It is not a tool for novices, and with it enterprise-level work can be done with ease.
What is Dragline?
Dragline is a powerful Python framework for writing our own spiders. It is even considered a full-time replacement for the other well-known web scraping frameworks that are still evolving.
Dragline has many advantages over its predecessors in the field of crawling. The main features are not going to be discussed here. You can find out why Dragline is more sophisticated than the other scraping frameworks by following this link: Dragline features
Where to get it?
You can download Dragline from the official Python package repository: https://pypi.python.org/pypi/Dragline
We can also install Dragline with the following command if pip is installed on the system:
$ sudo pip install --pre dragline
C:\> pip install --pre dragline
1. Introduction to Dragline
Now we can begin our fun journey. I am going to show a real-world example in an upcoming post, but for now there are a few important points to ponder.
What is a spider? A spider is a program that crawls through web pages in a specified manner. Here, the specified manner is the way we ask our spider to run. You may be surprised that there are no good resources on crawling, and not even a single O'Reilly book on the subject of spiders, especially with Python.
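To make the idea of "crawling in a specified manner" concrete, here is a minimal sketch that uses only the Python standard library. It is not Dragline's API. The names `LinkParser`, `crawl`, `fake_fetch` and the in-memory `FAKE_SITE` are all made up for illustration, and the "manner" chosen here is simply breadth-first with a page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` is any callable mapping a URL to HTML text."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "website" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}


def fake_fetch(url):
    return FAKE_SITE.get(url, "")


if __name__ == "__main__":
    for page in crawl("http://example.test/", fake_fetch):
        print(page)
```

Swapping `fake_fetch` for a function that performs a real HTTP request would turn this into a working, if very naive, spider. A framework earns its keep by handling everything this sketch ignores.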
Many good crawling frameworks are overlooked and are known only to a few developers (few for such a large Python community) who really work in the enterprise industry. Is crawling a topic worth considering? Yes, obviously, because crawlers are the main source for creating datasets and for fetching information programmatically.
Then why did this new framework emerge? The existing crawling frameworks have some drawbacks when we are working on huge projects. Young readers may be frustrated by words like “project” and “enterprise”, but I ask you to take them lightly. I don’t like them either. Everything should be plain. I have used Dragline to write spiders for many websites, and it takes roughly 5 minutes to write a spider for a normal website. What amazing speed!
The complexity of crawling increases with factors such as:
b) Pages that load dynamically when scrolling down
c) The rejection of HTTP requests by the server, i.e. timeouts
The first two factors are unavoidable and are left to the ingenuity of the programmer, but the last one can be handled by the library if it is smart. Dragline is good at pausing and resuming the connection when the server load is heavy. I want to keep the usage of Dragline a suspense for the time being. But if you are inspired by my words, you can check it out right now. The available documentation is good, though not extraordinary. You may be surprised at how well the framework works once you understand it.
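To give a feel for what "pausing and resuming" on a timeout can look like, here is a small standard-library sketch of retrying with a growing delay. This is only an assumption about the general technique and not how Dragline does it internally. `fetch_with_retry` and `FlakyServer` are hypothetical names invented for this example.

```python
import socket
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a timeout, pause and try again with a growing delay."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except (socket.timeout, TimeoutError):
            if attempt == retries:
                raise  # out of attempts, let the caller decide what to do
            sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...


class FlakyServer:
    """Hypothetical stand-in for a busy server: times out a few times, then answers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, url):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated timeout")
        return "<html>ok</html>"


if __name__ == "__main__":
    server = FlakyServer(failures=2)
    # Use a short delay so the demo finishes quickly.
    print(fetch_with_retry(server, "http://example.test/", base_delay=0.1))
```

The `sleep` parameter is there only so the waiting can be replaced in tests. In a real spider you would pass your actual HTTP fetch function as `fetch`.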
Thanks for listening patiently. I assure you that Dragline won’t disappoint you. We will meet next time with a real-world spider that amazes you and keeps your sleeves rolled up to the shoulder.
If you would like to see improvements in Dragline, you can contribute to Dragline’s GitHub repository.