How do situations drive me to write my own applications?

This article shows how I created an audiobook downloader because I was tired of downloading books manually.

Who am I?

I am a big fan of books. I have an e-book reader, I listen to podcasts often, and I always download audiobooks before a journey. It is pleasant to listen to stories and novels rather than read them, using ears rather than eyes. So I keep track of websites that provide good audiobooks to listen to online: audible.com, iTunes and many more. Only a few provide downloadable audiobooks free of charge and without restrictions. The main players are http://www.librivox.org , http://www.loyalbooks.com and http://www.openculture.com/freeaudiobooks . Everything was fine for a while; I had adjusted to them. Then came the actual problem. The recordings on LibriVox and Loyal Books are volunteered, which means anybody can record and upload an audiobook. Some books are good to listen to, but many others are boring, with below-average narration. I wanted premium voices for zero bucks. I searched the web and found one such website: Lit2Go, at http://etc.usf.edu/lit2go/books/ . It contains classic books recorded with premium voices, and it is kept free because it is an academic project of http://etc.usf.edu/ .

What is the drawback of Lit2Go?

Lit2Go is a free service. It divides each book into a number of chapters and lets you play them online. You can download each chapter by right-clicking on it, the common way. But when a book has many chapters, downloading them all is a dead boring task: going to each chapter link, right-clicking, then copying the file into a folder. I did this frequently and always thought, “why don’t these people provide all the chapters in a single file?”.

How I overcame that?

This is my personal life, and Python still plays a role in it. Computing makes things as simple as talking about a movie with a friend. I created a home application called “createbook” that asks me which book to download, then downloads all its chapters into a single folder in a systematic order. Just by running a small script, I can now select and download an entire audiobook.

How does createbook work?

I just ran the script (a Windows user can simply double-click it) and was presented with a list of all the classic books. I selected 199 (Winesburg, Ohio), and after some time a folder named after the book appeared in my script directory, with all 26 chapters downloaded perfectly. Now I run it happily, copy the downloaded folder to my phone, and enjoy the premium voices without any pain of downloading. Python helped me in this situation; it may help you too. Just apply it and you will see the comfort you get. Below is the code for my application. I used lxml and requests to build it. In your case, find out the library that suits your situation and build your own fun applications.

#Just run it, select a book, sit back and sip coffee
import requests
from lxml import html
import os

def createbook(url):
    #fetch the book's page and create a folder named after the book
    res = requests.get(url)
    folder = url.split('/')[-2]
    if not os.path.exists(folder):
        os.makedirs(folder)
    tree = html.fromstring(res.text)
    #each chapter link lives in a <dl><dt><a> element on the book page
    parts = tree.xpath('//dl/dt/a/@href')
    for i in parts:
        res = requests.get(i)
        tree = html.fromstring(res.text)
        #the MP3 source of the chapter's audio player
        parturl = tree.xpath('//audio/source[@type="audio/mpeg"]/@src')
        for surl in parturl:
            with open('%s/%s' % (folder, surl.split('/')[-1]), 'wb') as handle:
                #stream the MP3 in 1 KB blocks instead of loading it all into memory
                response = requests.get(surl, stream=True)
                for block in response.iter_content(1024):
                    if not block:
                        break
                    handle.write(block)

    print '"%s" is successfully downloaded.....' % folder

res = requests.get('http://etc.usf.edu/lit2go/books/')
tree = html.fromstring(res.text)
books = [i.encode('utf-8') for i in tree.xpath('//figcaption[@class="title"]/a/text()')]
links = tree.xpath('//figcaption[@class="title"]/a/@href')

catalog = dict(zip(books,links))
numcatalog = enumerate(books,1)
chose = {}
print '@@@@@@@@@@ Select from the books below @@@@@@@@@@\n'

for i,j in numcatalog:
    print '%d) %s'%(i,j)
    chose[i] = catalog[j]

choice = int(raw_input('\nSelect Book:'))

print 'Your book started downloading.......'
createbook(chose[choice])

If you share my taste for listening to audiobooks, feel free to use this script to download classic books. You can find the same code in my git repository: https://github.com/narenaryan/makebook . The output book folder will be created in the same directory as your script. I advise you to use Python's computing power wherever it is required: create your own applications and be comfortable.

Anatomy and application of parallel programming in Python

Google data center clusters

What is Parallel Processing?

Parallel processing is the simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads. Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. In practice, it is often difficult to divide a program in such a way that separate CPUs or cores can execute different portions without interfering with each other. Most computers have just one CPU, but some models have several, and multi-core processor chips are becoming the norm. There are even computers with thousands of CPUs. After reading that Wikipedia definition, what are you thinking? The same stuff has been repeated in our ears for years. We all know the core concepts, but when it comes to implementation we take the step lazily. There are many reasons for that. In this post we are going to understand the basic concepts of parallel processing and distributed computing, and jump straight into a hands-on example of implementing them.

Why parallel processing?

Assume that you run a very popular website with a web Fibonacci calculator. If everything is processed sequentially, user requests are stored in a queue, and the program calculates each Fibonacci number and returns it, one request at a time. If the website gets 80,000 requests, sequential calculation keeps the last user waiting; his reload bar will spin forever. How do e-commerce websites serve their customers spontaneously, at light speed? They use parallelization techniques. Learn how to use them and apply them straight away in your programs; a minimal sketch of the idea follows.
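Here is a minimal sketch of that idea using only the standard multiprocessing module (the request values are made up for illustration): a pool of four workers crunches many requests at once instead of one after another.

#a sketch, not the website's real code: four workers serve many requests
from multiprocessing import Pool

def fib(n):
    a, b = 0, 1
    for _ in xrange(n):
        a, b = b, a + b
    return a

if __name__ == '__main__':
    requests = [100, 200, 300, 400]   #hypothetical user inputs
    pool = Pool(4)                    #four worker processes
    print pool.map(fib, requests)     #results come back in input order
    pool.close()
    pool.join()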

Let’s begin the show

A parallel program is a program whose work is distributed among various processes, which in turn are given to the cores of the processor to execute. Take a look at the smallest parallel-computing illustration: multiplying a matrix by a scalar. Sequentially, the scalar is multiplied into each element one after another. In parallel computing, each element is treated as a unit of work, and workers are allocated the task of multiplying the scalar into their own unit. The big task is distributed to 4 processes, as the sketch below shows.
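A tiny sketch of that illustration, assuming we split the matrix by rows and use 4 worker processes:

#multiply a scalar into a matrix, one row per unit of work
from multiprocessing import Pool

SCALAR = 3

def scale_row(row):
    return [SCALAR * x for x in row]

if __name__ == '__main__':
    matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]
    pool = Pool(4)                      #4 workers share the 4 rows
    print pool.map(scale_row, matrix)   #[[3, 6], [9, 12], [15, 18], [21, 24]]
    pool.close()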

Patterns for designing parallel structures

The criteria for designing a pattern for a parallel problem depend on the context of the problem itself. The number of workers to dispatch and the number of cores to use are all decided with the main problem in mind. But here we are going to see a universal pattern for parallel computing problems.

Pipeline concept

The concept is simple: a task is processed in different stages, and many workers are present in each stage. At stage 1 the stage-1 workers process the data, then it goes to stage 2, and so on. It is like the car-manufacturing assembly lines shown on Discovery and NatGeo: a car chassis is sent to a stage in which multiple robotic hands paint the body at the same time, then it moves to another stage in which another set of robotic workers fix the bolts, and so on. A toy pipeline in code follows.
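Here is a toy two-stage pipeline sketched with multiprocessing queues (the stages here, squaring and printing, are placeholders for real work like painting and bolting):

#a two-stage pipeline: stage 1 transforms items, stage 2 consumes them
from multiprocessing import Process, Queue

def stage1(inq, outq):
    for item in iter(inq.get, None):   #None acts as the stop signal
        outq.put(item * item)
    outq.put(None)                     #pass the stop signal downstream

def stage2(inq):
    for item in iter(inq.get, None):
        print 'finished part:', item

if __name__ == '__main__':
    q1, q2 = Queue(), Queue()
    workers = [Process(target=stage1, args=(q1, q2)),
               Process(target=stage2, args=(q2,))]
    for w in workers:
        w.start()
    for chassis in range(5):           #feed raw items into stage 1
        q1.put(chassis)
    q1.put(None)
    for w in workers:
        w.join()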

Analyzing best python tools for implementing parallelism

There are four ways to achieve concurrent processing in Python:

1) threading and concurrent.futures
2) multiprocessing and ProcessPoolExecutor
3) Parallel Python
4) Celery

The first two are available as built-in libraries. The next two are external libraries that do most of the job behind the scenes. A threading solution to a parallel problem is less preferable, since synchronization mechanisms need to be carefully implemented. In this post we keep things simple and clear: no locks, no synchronization techniques, just executing a program in a parallel way.

Parallel Python: a wonderful, simple, cool library for parallel as well as distributed computing

If you have already installed Parallel Python, fine. Otherwise just open your terminal, download the compressed file, extract it and run setup.py. The compressed files are available here: http://www.parallelpython.com/content/view/18/32/ . After pp is successfully installed, we will build a practical example to illustrate parallel processing using the Parallel Python library, which is a very good tool.

What Parallel Python can do?

The most important advantage of using PP is the abstraction that this module provides. Some important features of PP are as follows:

•     Automatic detection of the number of processors, to improve load balancing
•     The number of allocated processors can be changed at runtime
•     Load balancing at runtime
•     Auto-discovery of resources throughout the network

Parallel Python basics

We just need to create a Server from pp, then use the submit method of that instance to submit tasks to multiple processes.

import pp

#Server encapsulates and dispatches task to multiple processes
s = pp.Server()

#Server can also be started with multiple cores and other distributed systems

s = pp.Server(ncpus=4,ppservers = ("192.168.25.21", "192.168.25.9"))

# ncpus => number of cores to use, ppservers => IPs of the computers connected as a cluster
#There is one important function called submit for adding tasks to processes.

"""
submit(self, func, args=(), depfuncs=(), modules=(),
callback=None, callbackargs=(), group='default',
globals=None)

func -> the target function to execute,
args -> arguments to pass to func,
modules -> list of modules func must import to do its job,
callback -> a function to which the result of func is returned; we can process results there, e.g. send computed values back to the user
"""
#For example, see this submit call below (fibo_task, item and aggregate_results are placeholders).
s.submit(fibo_task, (item,), modules=('os',),
callback=aggregate_results)

With these basics, let us solve a real-world problem.

Problem (calculate great-circle distance)

Say I created a popular website for computing great-circle distances, and I need to process 80,000 requests; the data entered by users has to be processed, i.e. the results computed, concurrently. What is the great-circle distance? It is the distance between two locations on Earth, where a location is a tuple of (latitude, longitude). Calculating it uses a formula called the haversine formula. Don't panic at the big terms: it is just a function that returns the distance when two locations are given as arguments.

import math

#This is the haversine function. Don't worry about its contents

def haversine(lat1, lon1, lat2, lon2):
    R = 6372.8 # Earth radius in kilometers
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dLon/2)**2
    c = 2* math.asin(math.sqrt(a))
    #calculating KM
    a = R * c
    return a

Think of it this way: the haversine function takes lat1, lon1 and lat2, lon2 and returns the great-circle distance. For example, take California (37.0000° N, 120.0000° W) and New Jersey (40.0000° N, 74.0000° W). To find the distance between California and New Jersey, use the function above as distance_ca_nj = haversine(37.0000, 120.0000, 40.0000, 74.0000), which gives us the distance between California and New Jersey in KM.

How to make the haversine function execute in parallel

Here I am using four inputs from users; we can extend it to any number of inputs and make them run concurrently.

import os, pp

users = {'california_to_newJersey' : (37.0000, 120.0000, 40.0000, 74.0000),
         'oklahoma_to_texas'       : (35.5000, 98.0000, 31.0000, 100.0000),
         'arizona_to_kansas'       : (34.0000, 112.0000, 38.5000, 98.0000),
         'mississippi_to_boston'   : (33.0000, 90.0000, 42.3581, 71.0636)}

#This dict stores Great distance values for each key in users
result_dict = {}

Now we modify the haversine function so that it also takes the user's key and returns it along with a message.

def haversine(key,lat1, lon1, lat2, lon2):
    R = 6372.8 # Earth radius in kilometers
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dLon/2)**2
    c = 2* math.asin(math.sqrt(a))
    #calculating KM
    a = R * c
    message = "the Great Circle Distance calculated by pid %d was %d KM"%(os.getpid(), a)
    return (key, message)

Next we need to process the returned result, so let's create a callback function that takes the result and stores it in result_dict:
def aggregate_results(result):
    print "Computing results with PID [%d]" % os.getpid()
    result_dict[result[0]] = result[1]

Now let us write the main part that submits the tasks to processes, using all the functions we created above:
job_server = pp.Server(ncpus=4)

for key in users.keys():
    job_server.submit(haversine,
                      (key, users[key][0], users[key][1], users[key][2], users[key][3]),
                      modules=('os', 'math'),
                      callback=aggregate_results)

"""
Above line creates multiple processes and assign 
executing haversine function with arguments as key,expanded tuple.
Processes starts executing using all cores of your processor.
Wait for all processes complete execution before retrieving result
"""
job_server.wait()

#Next main process starts executing
print "Main process PID [%d]" % os.getpid()
for key, value in result_dict.items():
    print "For input %d, %s" % (key, value)

Run it and see the output: it works.

Why is there a common PID for all the processes? The program actually executes in parallel under the main process, and the callback runs in the main process, so the main process's PID is listed for each dispatched worker. Yet it executes in about a quarter of the time of the ordinary sequential program.
The total code can be found in my GitHub repository. Just save haversine.py from the repository and run it, and you will find its lightning execution speed.
For any discussion, please write to narenarya@live.com
Resources:

  • Parallel Programming with Python, by Jan Palach

Geospatial development simplified with Python


In this article I am going to show how easy it is to build a real world Geo-spatial application with python.

GIS?

What is geospatial development, or Geographic Information Systems (GIS)? According to Wikipedia:

Geographical information systems (GIS) constitute a large domain that provides a variety of capabilities designed to capture, store, manipulate, analyze, manage, and present all types of geographical data, and utilizes geospatial analysis in a variety of contexts, operations and applications.

Basic Applications

Geo-spatial analysis, using GIS, was developed for problems in the environmental and life sciences, in particular ecology, geology and epidemiology. It has extended to almost all industries including defense, intelligence, utilities, Natural Resources (i.e. Oil and Gas, Forestry etc.), social sciences, medicine and Public Safety (i.e. emergency management and criminology), disaster risk reduction and management (DRRM), and climate change adaptation (CCA). Spatial statistics typically result primarily from observation rather than experimentation.

For more visit this page http://en.wikipedia.org/wiki/Geospatial_analysis

What has Python got to do with it?

If somebody asks you to find the churches within 30 KM of your new residence, your first choice is the web: you find them manually by observing a map of your city. But it becomes complex if you want to know about locations of many types. GIS are systems that analyze earth data and fetch you the results. There are two types of earth data:

1) Raster data: data consisting of scanned images of the earth

2) Vector data: data consisting of drawn geometry

Analyzing both types of data is essential for many organizations, such as the military and survey agencies. In the early days, people with a mathematics background were hired for geospatial development, because all the basic operations and algorithms had to be designed mathematically from the ground up. But nowadays, bindings for existing libraries enable ordinary software developers to use wrappers for creating geospatial applications. If you choose the right programming language, like Python, you can create powerful geospatial applications in less time and with less effort. The programmer just needs to be familiar with geographic terminology: latitude, longitude, hemisphere, datum, meridian, directions, shapefiles and so on.

Python has three fantastic binding libraries, written on top of existing C++ libraries:

1) GDAL/OGR

2) PyQGIS

3) ArcPy

In this post I am going to show you a jump-start example: finding locations of a given type around a state in the United States. I have United States data with me; if data is available for your own region, you can use the same code to find theaters, parks and so on within a few kilometers. My application is totally offline, since I am analyzing a downloaded dataset.

I am going to use the GDAL/OGR library to create the application. You need to install the GDAL/OGR, pyproj and shapely Python libraries before starting to build it. For installing those libraries, see

http://gis.stackexchange.com/questions/9553/whats-the-easiest-way-to-install-gdal-and-ogr-for-python/124751#124751

Seeing is believing

See the application running and be confident. I am going to find the cliffs around Texas within a 50 KM range.

 

Here my main program is cliffs.py; the program prompts for the two-letter code of a state (TX for Texas, CA for California).

 

Now my application finally returns all the cliffs around Texas within a 50 KM range.

How I did that?

First we need to download the required datasets into our directory. The two datasets required here are:

1) https://app.box.com/s/2s7i5culjrkq6sm3fqr5

2) https://app.box.com/s/7qzk2y3bpvo264hgxs0r

 

The first file is Place-Find.zip; unzip it and save all its files in the same directory as the program cliffs.py. The second file is NationalFile_20141005.txt; keep this file in the same directory too. Now we have the required datasets.

I am creating a new file called settings.py, which builds the list of place categories to display:

#settings.py
cats= []
with open('NationalFile_20141005.txt','r') as fil:
    for i in fil.readlines():
        cats.append((i.rstrip().split('|'))[2])

#List the places and take input from User for Park,Bar,Hotel etc
#we can use dict(enumerate(set(cats))) here but we need to delete FEATURE_CLASS,Unknown fields from list
show_first = {k:v for k,v in enumerate(set(cats)) if v != 'FEATURE_CLASS' and v != 'Unknown'} 

Observe NationalFile_20141005.txt once: it consists of pipe-separated information about locations such as parks, bays and cliffs.
Next I am going to create the main program, cliffs.py. Here go the imports:
from __future__ import division
from settings import show_first

from osgeo import ogr
import shapely.geometry
import shapely.wkt

osgeo deals with opening shapefiles; shapely helps translate them into geometric shapes; division is imported so that we can convert KM into the angular distance to measure (1 degree ≈ 100 KM). We also import the show_first dictionary from settings.py.

shapefile = ogr.Open("tl_2014_us_cbsa.shp")
layer = shapefile.GetLayer(0)


This opens a shapefile in the same directory and creates a shapefile object. Next we create a layer using the GetLayer function; since the shapefile tl_2014_us_cbsa.shp has only one layer, we can use GetLayer(0) to fetch that single layer.

city = raw_input('\nEnter ISO2 code of city: ')
print '\nSelect category of place to search in and around the city\n'
for index,place in show_first.items():
    print '||%s|||||||%s'%(index,place)
place_choice = int(raw_input('\nEnter code of place from above listing: '))
place = show_first[place_choice]
distance = int(raw_input('\nEnter with in range distance(KM) to find %s: '%place))
#converting the KM range to angular distance: 100 KM ≈ 1 degree
MAX_DISTANCE = distance/100 # angular distance in degrees

The lines above are normal Python code for taking the user's preferences; the last line converts the distance range into an angular distance (for example, 50 KM becomes 0.5 degrees).

print "Loading urban areas..."

# Maps area name to Shapely polygon.
urbanAreas = {} 

for i in range(layer.GetFeatureCount()):
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME")
    geometry = feature.GetGeometryRef()
    shape = shapely.wkt.loads(geometry.ExportToWkt())
    dilatedShape = shape.buffer(MAX_DISTANCE)
    urbanAreas[name] = dilatedShape

In the code above we translate the geometry of each feature in the shapefile into a dilated (buffered) shape and store it in urbanAreas, which now holds a polygon for each urban area. A tiny shapely-only sketch of the dilate-and-test idea follows.
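Here is the sketch (the square and the point are made up for illustration):

#buffer() grows a geometry; contains() tests whether a point lies inside
import shapely.geometry

square = shapely.geometry.Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
grown = square.buffer(0.5)                      #dilate by 0.5 degrees (~50 KM)
pt = shapely.geometry.Point(1.3, 0.5)           #outside the square, inside the buffer
print square.contains(pt), grown.contains(pt)   #prints: False True

Now back to the program: we open NationalFile_20141005.txt and read the latitude and longitude of each candidate point; if a point falls within one of the dilated polygons we save it, else we ignore it.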
f = open("NationalFile_20141005.txt", "r")
result = {}
for line in f.readlines():
    chunks = line.rstrip().split("|")
    if chunks[2] == place and chunks[3] == city:
        cliffName = chunks[1]
        latitude = float(chunks[9])
        longitude = float(chunks[10])
        pt = shapely.geometry.Point(longitude, latitude)
        for urbanName,urbanArea in urbanAreas.items():
            if urbanArea.contains(pt):
                if not result.has_key(cliffName):
                    result[cliffName]=[urbanName]
                else:
                    result[cliffName].append(urbanName)

The result dictionary maps each cliff name to the list of areas it falls near. The code above is quite obvious. Now print the results:
print '\n---------------------%s--------------------\n'%place
for k,v in result.items():
    print k,'\n','=========================='
    for item in v:
        print item
    print '\n\n'
f.close()

This finishes our cliffs.py. The application works entirely offline; the datasets are huge because of the detail they hold. The final code looks this way:
#cliffs.py                    
from __future__ import division
from settings import show_first

from osgeo import ogr
import shapely.geometry
import shapely.wkt



shapefile = ogr.Open("tl_2014_us_cbsa.shp")
layer = shapefile.GetLayer(0)
#code for displaying categories and taking user input follows

city = raw_input('\nEnter ISO2 code of city: ')

print '\nSelect category of place to search in and around the city\n'
for index,place in show_first.items():
    print '||%s|||||||%s'%(index,place)

place_choice = int(raw_input('\nEnter code of place from above listing: '))
place = show_first[place_choice]

distance = int(raw_input('\nEnter with in range distance(KM) to find %s: '%place))

#converting the KM range to angular distance: 100 KM ≈ 1 degree
MAX_DISTANCE = distance/100 # angular distance in degrees
print "Loading urban areas..."

# Maps area name to Shapely polygon.
urbanAreas = {} 

for i in range(layer.GetFeatureCount()):
    feature = layer.GetFeature(i)
    name = feature.GetField("NAME")
    geometry = feature.GetGeometryRef()
    shape = shapely.wkt.loads(geometry.ExportToWkt())
    dilatedShape = shape.buffer(MAX_DISTANCE)
    urbanAreas[name] = dilatedShape

print "Checking %ss..."%place


f = open("NationalFile_20141005.txt", "r")
result = {}
for line in f.readlines():
    chunks = line.rstrip().split("|")
    if chunks[2] == place and chunks[3] == city:
        parkName = chunks[1]
        latitude = float(chunks[9])
        longitude = float(chunks[10])
        pt = shapely.geometry.Point(longitude, latitude)
        for urbanName,urbanArea in urbanAreas.items():
            if urbanArea.contains(pt):
                if not result.has_key(parkName):
                    result[parkName]=[urbanName]
                else:
                    result[parkName].append(urbanName)


print '\n---------------------%s--------------------\n'%place
for k,v in result.items():
    print k,'\n','=========================='
    for item in v:
        print item
    print '\n\n'
f.close()
This completes our application. We can do many other things, like finding country borders accurately, computing the length from one place to another, and satellite functions. Any average Python developer can excel in geospatial informatics, because he has a powerful programming language and open-source libraries that hide the complexity behind great mathematical functions. Code for this application is available at GITHUB.
If you haven't done GIS programming before, this might look like complex stuff to you. But learn the basics and look at it once again, and this post will read easily.

Quotebot, a clever twitter bot powered by Python

A person creating a twitter bot and passing some orders to it.

 

We can do everything these days with the help of computing: we can automate, we can play with the web, we can do anything. The main concern is collecting the different pieces of thought and combining them into an idea. The thing I did is somewhat crazy, but not senseless.

What is twitter bot?

A twitter bot is a program that posts to a twitter page autonomously, without the intervention of a human operator. Once the bot is initiated, nobody needs to look after it, because from that moment the bot can see to its job itself.

Quotebot, the story begins!

I created a twitter bot called Quotebot which posts quotes to a twitter page daily. There are lots of bots out there doing different things; in this post I am showcasing my twitter bot, written entirely in Python. My main idea was to post quotations daily, with the category of the quotations changing according to the day.

In the beginning I didn't have a single quote with me. So I quickly wrote a spider in the powerful crawling framework Dragline and collected nearly 93,000 quotes from different categories within a few minutes, storing them in MongoDB in 31 categories, since a month has at most 31 days.


Now I had the data source to publish, but quotes vary in length, and twitter will not let me post a message longer than 140 characters. I was dejected: I had collected the data, but twitter was limiting me. Then one thing came to my mind.

Twitter allows us to upload media, i.e. image files. So converting the quotes into images and then uploading those to twitter works out.

I used tweepy, a well-known Python library for handling twitter, to deal with updating statuses, and the Img4me API for converting the quotes into images. The main characteristics given to Quotebot are outlined below:

1) It posts at most 120 posts daily, one post every 5 minutes.

2) It only posts quotes matching the day of the month; each day of the month is classified as love day, science day, inspiration day and so on, and Quotebot posts only the corresponding quotes on a particular day.

3) Each quote fetched from the database is automatically converted into an image and then uploaded to the twitter page.

4) It logs the posting times and all network transactions.

5) There are no conflicts between posted data and to-be-posted data, since the bot uses the Redis data store to control post integrity.

6) Quotebot wakes up when the system boots, posts its 120 posts and then goes back to sleep.

Technologies used in building Quotebot

1) Tweepy library & Python

2) Redis

3) MongoDB

4) Img4me API


I think you are interested in how things work in the background. Tweepy is the most straightforward tool for working with twitter: posting, retweeting, deleting tweets, making friends, removing friends, following pages, everything can be done with tweepy. Tweak it yourself if you are interested; the link is http://tweepy.readthedocs.org/en/v2.3.0/getting_started.html#introduction . A minimal sketch of the kind of calls Quotebot relies on is below.
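Here is that sketch, using tweepy 2.x-era calls; the keys and the file name are placeholders, not my real credentials:

#post a text tweet and an image tweet with tweepy
import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

api.update_status('A plain text tweet')                        #simple status update
api.update_with_media('quote.png', status='Quote of the day')  #status with an image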

Where is the code for Quotebot?

Because it is not good to describe a thing and clutter the post with code at the same time, I am not describing the entire procedure here. But the fully functional code for Quotebot is open at my github repository. Feel free to explore the code.

https://github.com/narenaryan/Quotebot

Hey, where is that twitter page link?

The Quotebot functionality is 100% authentic. I posted this to show how we can automate things with the help of computing power, with the sweet language Python and light-speed Redis. The spider part only fetches data from a well-known website and stores the quotes in MongoDB. Here is the link to the twitter page the bot operates on:

https://twitter.com/QuotesAryan

Explore the code. Enjoy the thought. You can also view my other repositories on github:

https://github.com/narenaryan

 

How to wake up a Python script, while you are in a sound sleep?


We all know that programs die after execution; data persists only if it is serialized. Now consider the case where we need to back up all the logs, or delete something periodically. We need a scheduler for that. One great scheduler available for Linux systems is the CRON scheduler. There are two things of subtle importance here:

1) How to use the CRON scheduler to execute any command on a Linux computer.

2) How to use APScheduler to run some functions in a Python script at a particular time.

Crontab

The crontab (cron derives from chronos, Greek for time; tab stands for table) command, found in Unix and Unix-like operating systems, is used to schedule commands to be executed periodically. To see what crontabs are currently running on your system, you can open a terminal and run:

$ crontab -l

If we want to create a new job or edit an existing crontab job, just type:

$ crontab -e
If we wish to remove all crontabs, just type:

$ crontab -r

It removes all crontabs. What is a crontab? It is a file to which jobs are added, so we add jobs to the end of that file after launching the command “$ crontab -e”.

This will open the default editor (could be vi or pico; if you want, you can change the default editor) to let us manipulate the crontab. If you save and exit the editor, all your cron jobs are saved into the crontab. Cron jobs are written in the following format; any valid command can be used after the 5 stars:

* * * * * /bin/execute/this/script.sh
* * * * * [any_valid_linux_command]

Scheduling

 

As you can see there are 5 stars. The stars represent different date parts in the following order:

  • minute (from 0 to 59)
  • hour (from 0 to 23)
  • day of month (from 1 to 31)
  • month (from 1 to 12)
  • day of week (from 0 to 6) (0=Sunday)

Execute every minute

If you leave the star, or asterisk, it means every. Maybe that's a bit unclear. Let's use the previous example again:

* * * * * python /home/execute/this/funny.py

They are all still asterisks! So this means execute /home/execute/this/funny.py:

  • every minute
  • of every hour
  • of every day of the month
  • of every month
  • and every day in the week.

In short: This script is being executed every minute. Without exception.

Execute every Friday 1AM

So if we want to schedule the python script to run at 1AM every Friday, we would need the following cronjob:

0 1 * * 5 python /home/aryan/this/script.py

Get it? The script is now being executed when the system clock hits:

  • minute: 0
  • of hour: 1
  • of day of month: * (every day of month)
  • of month: * (every month)
  • and weekday: 5 (=Friday)

Execute on workdays 1AM

So if we want to schedule the python script to run Monday through Friday at 1 AM, we would need the following cronjob:

0 1 * * 1-5 python /bin/execute/this/script.py

Neat scheduling tricks

What if you’d want to run something every 10 minutes? Well you could do this:

0,10,20,30,40,50 * * * * python /bin/execute/this/script.py

But crontab allows you to do this as well:

*/10 * * * * python /bin/execute/this/script.py

 

Storing the crontab output

By default cron saves the output of /bin/execute/this/backup.py in the user’s mailbox (root in this case). But it’s prettier if the output is saved in a separate logfile. Here’s how:

*/10 * * * * python /bin/execute/this/backup.py >> /var/log/script_output.log 2>&1

Linux can report on different levels. There's standard output (STDOUT) and standard error (STDERR). STDOUT is marked 1, STDERR is marked 2. So the following statement tells Linux to store STDERR in STDOUT as well, creating one datastream for messages & errors:

2>&1

This is a shortcut illustration to get up and running with cron.

So now we understand how to run a Python script at a particular time. This happens outside the program: the programmer schedules the Python program manually. But sometimes we need to schedule from inside the program. For that we use a good library called APScheduler. You can install it with the following command:

$ sudo pip install apscheduler

After installing APScheduler, we can see how simple it is to schedule any job. Here we go a level deeper and schedule Python functions to execute at a particular time; the jobs here are Python functions.
from apscheduler.scheduler import Scheduler

#start the scheduler, i.e. create an instance and start it
sched = Scheduler()
sched.start()

def my_job():
    print 'Happy_Birthday,Aryan'

#schedule the job function my_job to greet me every year on my birthday
sched.add_cron_job(my_job, month=6, day=24, hour=0)

So this script greets me on my birthday by running the function every year. Running the script itself is handled by step 1, i.e. cron scheduling, and the scheduling inside the script is handled by APScheduler; see how many goodies are provided for us. Here my_job does a trivial task, but in real systems it could be anything: taking backups, deleting logs, housekeeping and so on.
There are lots of other things we can do with APScheduler, such as persisting jobs to a MongoDB or Redis job store. For full-fledged documentation, and especially for the cron format, go to the APScheduler documentation pages and explore yourself.

Last but not least

Sometimes we create Python applications for desktops using KDE or GTK+. Those applications should run at system startup, and for that we need to add the simple line below to the crontab file:

@reboot python /home/arya/Timepass.py &

Here @reboot says the command should be executed at reboot, and & tells the process to run in the background. This finishes our little chit-chat about CRON and APScheduler. Hope you enjoyed this post.

Understanding Egyptian multiplication via Python

Egyptian Multiplication

The ancient Egyptians used a curious way to multiply two numbers. The algorithm draws on the binary system: multiplication by 2, or just adding a number to itself. Unlike the Russian Peasant Multiplication, which determines the involved powers of 2 automatically, the Egyptian algorithm has an extra step where those powers have to be found explicitly.



Write the two multiplicands, with some room in between, as the captions of two columns of numbers. The first column starts with 1 and the second with the second multiplicand. Below, in each column, successively write the doubles of the preceding numbers. The first column will generate the sequence of powers of 2: 1, 2, 4, 8, ... Stop when the next power would exceed the first multiplicand. I'll use the same example as in the Russian Peasant Multiplication, 85×18:

 1    18
 2    36
 4    72
 8    144
16    288
32    576
64    1152

The right column is exactly the same as it would be in the Russian Peasant Multiplication. The left column consists of the powers of two; the marked ones (shown in red in the original illustration) are important: the corresponding entries in the right column add up to the product 85×18 = 1530:

18 + 72 + 288 + 1152 = 1530

Why are some powers of two marked while others are not? The marked ones add up to the first multiplicand:

  85 = 1 + 4 + 16 + 64,

which corresponds to the binary representation of 85:

  85 = 1010101₂.

According to the Rhind papyrus these powers are found the following way.

64 is included simply because it's the largest power of 2 below 85. Compute 85 − 64 = 21 and find the largest power of 2 below 21: 16. Compute 21 − 16 = 5 and find the largest power of 2 below 5: 4. Compute 5 − 4 = 1 and observe that the result, 1, is itself a power of 2: 1 = 2⁰. This is the reason to stop. The powers of two that go into 85 are 64, 16, 4, 1. A short code sketch of this greedy procedure follows.
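Here is the sketch (a helper written for illustration, not part of the program we build below):

#greedily peel off the largest power of 2 until nothing is left
def powers_of_two(n):
    powers, p = [], 1
    while p * 2 <= n:       #find the largest power of 2 not exceeding n
        p *= 2
    while n:
        if p <= n:
            powers.append(p)
            n -= p
        p //= 2
    return powers

print powers_of_two(85)     #[64, 16, 4, 1]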

For the product 18×85, with the roles of the multiplicands swapped, we get the same result, 1530.

 

It is also known as a variant of the Russian peasant algorithm.

Now let us tackle this problem in Python.

First, prepare the imports:

from __future__ import division
import math

 

We can design a function that returns the greatest power of 2 less than or equal to a given number, because we need this concept frequently.

def greatest2power(n,i=0):
    while int(math.pow(2,i)) <= n : i = i+1
    return int(math.pow(2,i-1))
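For example:

>>> greatest2power(85)
64
>>> greatest2power(21)
16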

Now let us take the inputs: a multiplicand and a multiplier.

m = int(raw_input('Enter multiplicand'))
n = int(raw_input('Enter multiplier'))

Now, as the description above says, set first to the greater of the two numbers and second to the lesser.

if m>n : first , second = m , n
else : first , second = n , m

We simulate the two columns for 85 and 18 with fcol and scol. seed is the running power of two used to populate those columns according to the algorithm.

fcol , scol = [] , []
seed = 1

Now we populate the two columns with the values the algorithm describes. The code snippet below is quite obvious.

while seed <= greatest2power(first):
    fcol.append(seed)
    scol.append(second*seed)
    seed = seed*2

Now we need to compute the valid powers of two by repeatedly subtracting from the first number, and store them in a list.

valid , backseed = [] , seed//2
while backseed>=1:
    valid.append(backseed)
    temp = backseed
    backseed = greatest2power(first-backseed)
    first = first - temp

The snippet above is analogous to (85 − 64 = 21), (21 − 16 = 5) and (5 − 4 = 1), so [64, 16, 4, 1] are the valid powers of 2.

Now we iterate over the zip of fcol and scol to fetch the column entry corresponding to each valid power of two.

answer = 0
for sol in valid:
    for a,b in zip(fcol,scol):
        if a==sol:
            answer = answer+b

Finally the result is stored in the answer variable, and we print it.

print 'The Egyptian Product is:%d'%answer 

What is special about this? We could instead do straightforward multiplication. The actual beauty of the Egyptian strategy is that they used only the number 2 in their calculation. As you can see in the program, doubling a number is just adding it to itself, so the Egyptians did multiplication with only the addition operator and the number 2, exactly as we did in the program. The complete code is here:

https://drive.google.com/file/d/0B6VAvV8caRaBd1JDQU5tendBVVE/view?usp=sharing

resources :

http://www.cut-the-knot.org/Curriculum/Algebra/EgyptianMultiplication.shtml

http://en.wikipedia.org/wiki/Ancient_Egyptian_multiplication

Alas, Julius Caesar didn’t have Python in 50 BC

 

 


We all know Julius Caesar as a Roman dictator, who is also notable for his early cryptography studies. The one thing most of us are unaware of is that hundreds of trees were cut down in 50 BC to provide cipher wheels to all the Roman generals. A cipher wheel is a data-encrypting device that uses the Caesar cipher algorithm, which gave the base idea for all modern encryption technologies.

Little past

The Roman ruler Julius Caesar (100 B.C. – 44 B.C.) used a very simple cipher for secret communication. He substituted each letter of the alphabet with a letter three positions further along. Later, any cipher that used this “displacement” concept for the creation of a cipher alphabet was referred to as a Caesar cipher. Of all the substitution-type ciphers, the Caesar cipher is the simplest to solve, since there are only 25 possible combinations.

What is a Cipher wheel ?

A cipher wheel is an encrypting device that consists of two concentric circles, an inner circle and an outer circle. The inner circle is fixed, and the outer circle is rotated randomly so that it stops at some point. Then the position of ‘A’ on the outer circle is tallied against the position of ‘A’ on the inner circle. That offset is taken as the key, and the resulting mapping between the positions of the outer and inner circles is used as the encrypting logic.

 

Here key = 3, since ‘A’ of the outer circle sits on ‘D’ of the inner circle.

Why would Julius Caesar wonder if he were alive?

If the message to encrypt is small, it can be encrypted with a cipher disk by hand. But if the message consists of thousands of lines, only computing power can make it as easy as a ‘Home Alone’ task. Unfortunately, Caesar didn't have a computer with a Python interpreter on it. If he were alive, he would wonder how simple it is to implement any mathematical algorithm in Python. We are now building a cipher wheel in Python, a minimal encryption program for communicating secrets.

Ready, set, go: build it

#cipherwheel.py
import string
from random import randrange

#functions for encryption and decryption

def encrypt(m):
    #define circular wheels
    inner_wheel = [i for i in string.lowercase]
    outer_wheel = inner_wheel
    #calculate a random secret key
    while True:
        key = randrange(26)
        if key!=0:
            break
    cipher_dict={}
    #map the encryption logic
    original_key =key
    for i in range(26):
        cipher_dict[outer_wheel[i]] = inner_wheel[key%26]
        key = key+1
    #getting encrypted message
    print 'Encrypted with secret key ->> %d\n'%original_key
    cipher = ''.join([cipher_dict[i] if i!=' ' else ' ' for i in m])
    return cipher,original_key

def decrypt(cipher,key):
    inner_wheel = [i for i in string.lowercase]    
    outer_wheel = inner_wheel
    cipher_dict={}
    for i in range(26):
        cipher_dict[outer_wheel[i]] = inner_wheel[key%26]
        key = key+1
    #decryption logic
    reverse_dict = dict(zip(cipher_dict.values() , cipher_dict.keys()))

    #getting original message back
    message = ''.join([reverse_dict[i] if i!=' ' else ' ' for i in cipher])
    return message

#Using cipher wheel here


while True:
    s = raw_input("Enter your secret message:")
    encrypted = encrypt(s)
    print 'encrypted message ->> %s\n'%(encrypted[0])
    print 'decrypted message ->> %s\n'%decrypt(encrypted[0],encrypted[1])

This is a small, basic encryption system that uses the Caesar cipher as its algorithm. Let us do an anatomy of the program and try to understand how it was built.

Anatomy of above Caesar wheel

First let us design the encrypt function for the cipher wheel. It is the encrypt() function in our program cipherwheel.py.

We need an inner wheel and an outer wheel, each initialized with the 26 letters. For that, use the string module's string.lowercase, which gives 'abcd...xyz'; we iterate over it to get a list of letters.

import string
inner_wheel = [i for i in string.lowercase] 
outer_wheel = inner_wheel

Now both the outer and inner circles are initialized with the list of letters. When the outer circle is rotated, it should stop at some random point, which becomes the key of the algorithm.

from random import randrange
#rotating outer circle i.e generating random key
while True:
    key = randrange(26)
    if key!=0:
        break

 

Here the program rotates the outer circle by generating the random key used to encrypt the message. The key variable is modified during encryption, so we keep a backup of it.

original_key =key

Now we need to create a mapping dictionary that maps 'a' of the outer circle to the letter of the inner circle at the position given by the key. For example, if key = 2 then 'a' of the outer circle is mapped to 'c' of the inner circle, because 'c' has index 2 in the list. This mapping procedure is done with the code below.

cipher_dict={}
for i in range(26):
    cipher_dict[outer_wheel[i]] = inner_wheel[key%26]
    key = key+1

 

This builds the mapping dictionary for the randomly generated key. Now we use this dictionary to translate the original message into the secret message.

cipher = ''.join([cipher_dict[i] if i!=' ' else ' ' for i in m])
return cipher,original_key

 

cipher is the secret message created with the encryption mapping dictionary cipher_dict. We return both the cipher and the randomly generated key from the encrypt() function.

The decryption process is similar, but we need to reverse-map the dictionary in order to get the original message back. This intelligent one-line tweak shows the expressive power of Python.

#reverse map the dictionary
reverse_dict = dict(zip(cipher_dict.values() , cipher_dict.keys()))

#get original message from cipher
message = ''.join([reverse_dict[i] if i!=' ' else ' ' for i in cipher])

 

The final output of the program prints the secret key, the encrypted message and the decrypted message for each input.

We got it: we designed a cipher wheel with Python. There are many other design aspects to explore, like using special symbols, or a mix of lower and upper case in the message.

Caution

This is the very basic encryption algorithm that came to the mind of Julius Caesar. He might not have expected that, with the very same Python, we can crack the algorithm in a few seconds, because only 26 combinations are needed for brute force, as the sketch below shows. So don't use this algorithm for commercial purposes (don't reveal it to kids). My intention is to show how to build practical things with Python. In the next article I'll come up with the 'transposition cipher', which is more powerful than the Caesar cipher, though not the most powerful one.
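Here is the sketch: try all 25 shifts and print every candidate plaintext; a human immediately spots the real one.

#brute-force a Caesar cipher by trying every possible key
import string

def crack(cipher):
    letters = string.lowercase
    for key in range(1, 26):
        plain = ''.join(
            letters[(letters.index(c) - key) % 26] if c in letters else c
            for c in cipher)
        print 'key=%2d -> %s' % (key, plain)

crack('dwwdfn dw gdzq')   #shows "attack at dawn" at key=3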
You can download the source code for the cipher wheel here: cipherwheel.py

Screenshot the top 20 websites from the top 20 categories using Python

 

Yes, you heard it right. In this post we are going to simulate the Wayback Machine using Python: we will take screenshots of the top 20 websites from the top 20 categories.

We are creating a project similar to http://www.waybackmachine.com , but here we save a screenshot of each top website as an image on our computer. Along with that, we save all those top websites' URLs in a text file for future use.

Let us build a SnapShotter

To build the SnapShotter (I named it that way) we need to answer two questions:

1. How do we get the URLs of the top 20 websites in the different categories?

2. How do we navigate to each URL and screenshot it?

We take a step-by-step approach; everything will become clear in this post. No hurry.

step 1 : Know about spynner

First, how do we screenshot a web page? Python bindings for WebKit can help us here, and there is a wrapper library built on top of WebKit that is easy to use: its name is spynner. Why is it named spynner? Because it performs headless loading and rendering of web pages, similar to PhantomJS, acting like a spy.

I advise you to install spynner. Don't jump to pip; a clear installation procedure is given here: install Spynner.

Now open the python terminal and type the following:

>>>import spynner
>>>browser = spynner.Browser()
>>>browser.load('www.example.com')
>>>browser.snapshot().save('example.png')

We create a browser instance, then load a URL into that headless browser. The last line screenshots example.com and saves the PNG in the current working directory with the file name 'example.png'.

So now we have a way to capture a web page as an image. Next, let's get the URLs required for our project.

step 2 : Design scraper

We need to write a small web crawler to fetch the required URLs of the top websites. I found the website http://www.top20.com , which lists the top 20 websites from the top 20 categories. First roam the website and see how it is designed. We need to screenshot 400+ URLs, and doing that manually is a Herculean task; that is why we need a crawler here.

#scraper.py
from lxml import html
import requests
def scrape(url,expr):
    #get the response
    page=requests.get(url)
    #build lxml tree from response body
    tree=html.fromstring(page.text)
    #use xpath() to fetch DOM elements
    url_box=set(tree.xpath(expr))
    return url_box

We create a new file called scraper.py with a function called scrape() in it, which we will use to build our crawler. Observe that the scrape() function takes a URL and an XPath expression as its arguments and returns a set of all the matching URLs in the given web page. For crawling from one web page to another, we need all the navigation URLs on each page.
step 3: Design crawler body

Now we write the code that scrapes the links of the top websites from http://www.top20.com :
#SnapShotter.py
from scraper import scrape
import spynner

#Initializations
browser=spynner.Browser()
w = open('top20sites.txt','w')
base_url = 'http://www.top20.com'

Now we are done with the imports and initialization. The next job is to write handlers for navigating from one web page to another.

def scrape_page():
    for scraped_url in scrape(base_url,'//a[@class="link"]/@href'):
        yield scraped_url

The scrape_page() function calls scrape() with base_url and an XPath expression, and yields the URL of each category. The XPath expression is designed entirely by observing the DOM structure of the web page. If you have doubts about writing XPath expressions, kindly refer to http://lxml.de/xpathxslt.html

def scrape_absolute_url():
    for scraped_url in scrape_page():
        for final_url in scrape(scraped_url,'//a[@class="link"]/@href'):
            yield final_url

This is the second callback, for the second-level pages, each of which lists the top 20 websites of one category. It gets each category link from the scrape_page() callback and sends it to scrape() with an XPath expression. This function yields each top website's URL, which we capture in another function called save_url().

def save_url():
    for final_url in scrape_absolute_url():
        browser.load(final_url)
        #replace '/' so that the full URL becomes a valid file name
        browser.snapshot().save('%s.png'%(final_url.replace('/','_')))
        w.write(final_url+'\n')

save_url() takes a screenshot of the website at each URL passed through it and also writes that URL to the text file "top20sites.txt" which we opened before.

step 4: Initiate calling of handlers
save_url()

This is the starting point of our program. We call save_url(), which calls scrape_absolute_url(), which in turn calls scrape_page(). See how the callbacks transfer control; beautiful, isn't it?

w.close()

Next we close the file. That's it; our entire code looks this way.

step 5: Complete code
#ScreenShotter.py
from scraper import scrape
import spynner

#Initializations
browser=spynner.Browser()
w = open('top20sites.txt','w')
base_url = 'http://www.top20.com'

#rock the spider from here

def scrape_page():
    for scraped_url in scrape(base_url,'//a[@class="link"]/@href'):
        yield scraped_url

def scrape_absolute_url():
    for scraped_url in scrape_page():
        for final_url in scrape(scraped_url,'//a[@class="link"]/@href'):
            yield final_url

def save_url():
    for final_url in scrape_absolute_url():
        browser.load(final_url)
        #replace '/' so that the full URL becomes a valid file name
        browser.snapshot().save('%s.png'%(final_url.replace('/','_')))
        w.write(final_url+'\n')

save_url()
w.close()

This completes our ScreenShotter. You will get image screenshots in your directory, along with a text file listing the URLs of all the top websites. Here is the text file that was generated for me: https://app.box.com/s/895ypei1mlzb2yk0p0gb

Hope you enjoyed this post. This is the basic way to scrape the web systematically.

How I satisfied a request from my friend with Python

 

You may have wondered about the title. Yes, my friend requested me to do a task for him. What the request was, and how I completed it, will become clear as you read this story.

The Story

Two days back, my friend Sai Madhu was leaving for his home town from our office in Kochi, which is 700 miles away. He had booked a train and was to leave in the afternoon. He is very passionate about cricket and never misses the score when the Indian cricket team is playing. On that day there was an ODI match between England and India. With only a feature phone and no smartphone to check the score on the train, he was crippled. He requested me to send him the score by SMS frequently, and I gave my word. I had lots of work to do and thought for a moment, "Did I unthinkingly give my word to him? No; whatever happens, I should send him the score." Then Python came to the rescue. The next section shows how I successfully handled Sai Madhu's request.

Python pulled rabbit out of hat

I thought, “why do I not automate the process?”, and a three-step plan came to mind:

1. Scrape the score from a live-score website

2. Send him an SMS containing the score, through a messaging API

3. Repeat the SMS every 3 minutes

Step 1:

import requests

from lxml import html

Step 2:

I am using twilio for sending the SMS here. You can sign up for a free account and enjoy sending SMS to verified numbers. There are lots of free SMS APIs, but twilio works perfectly. I verified Sai Madhu's number and was ready to go. To access our twilio account from a Python program, we need to install a package called twilio, which we get easily by typing:

$ sudo pip install twilio

Now the installation is over and the twilio library is ready to use. Do this final import:

from twilio.rest import TwilioRestClient

Now we have imported all the tools required to complete my automation job.

import requests
from lxml import html
from twilio.rest import TwilioRestClient
import time

 

I also imported the time package, to use its sleep function to put a 3-minute delay between messages. Twilio provides an Account SID and an Authorization token, which are required to send a message. We instantiate a Twilio REST client with those details. I define a separate function for the sending part and name it sendscore.

def sendscore(body):
    account_sid = "ACe5a382a0fe505XXXXXXXXXXXXXXXXXXX"
    auth_token = "cc2b50d82df3XXXXXXXXXXXXXXXXXXXXXXX"
    client = TwilioRestClient(account_sid, auth_token)
    message = client.messages.create(body=body,to="+919052108147",from_="+1 720-548-2740")
    print message.sid

 

So simple. First I create a twilio client, passing the Account SID and Authorization token as arguments, which gives me a connection to my twilio account. A twilio client offers many functions, for sending SMS, sending MMS, making calls and so on, but I chose SMS. client.messages is the instance that monitors message transactions, and I use its create function to send an SMS. The 'to' argument is the verified destination number; the 'from_' argument is the number allocated to us on signup along with the SID and token. If the message is sent successfully, the message SID is returned, otherwise an error message; I print it to know what actually happened. Now I design the Python code for scraping the score.

page=requests.get('http://sports.ndtv.com/cricket/live-scores')
tree=html.fromstring(page.text)
score=(tree.xpath('//div[@class="ckt-scr"]/text()')[0].lstrip()).rstrip()

Now I have a clean score, fetched with the xpath method on the lxml tree object and stripped of surrounding whitespace. Observe how the body of the HTML response is passed to html.fromstring() to build the element tree. Next I call the sendscore function with the fetched score as its argument; it sends a message with the score as its body to the 'to' number. I also log each send to monitor the process.

sendscore(score)
print "%s sent at:%s "%(score,time.ctime())
#to delay a message by three minutes 
time.sleep(180)

This process should run every three minutes, so I use a while loop and combine the two snippets above:

while True:
    page=requests.get('http://sports.ndtv.com/cricket/live-scores')
    tree=html.fromstring(page.text)
    score=(tree.xpath('//div[@class="ckt-scr"]/text()')[0].lstrip()).rstrip()
    sendscore(score)
    print "%s sent at:%s "%(score,time.ctime())
    time.sleep(180)

This completes the task. Putting it all together gives us the program that sends the England vs India score to Sai Madhu every three minutes:
#final script for sending score.I name it sendscore.py
from twilio.rest import TwilioRestClient
from lxml import html
import requests
import time



def sendscore(body):
    #Your account_sid goes here
    account_sid = "ACe5a382a0fe505faaXXXXXXXXXXXXXXXX"
    #Your authorization token goes here
    auth_token = "cc2b50d82df3a31cXXXXXXXXXXXXXXXXXX"
    client = TwilioRestClient(account_sid, auth_token)
    #to = "number to which SMS should be sent",from_="your twilio number"
    message = client.messages.create(body=body,to="+919052108147",from_="+1 720-548-2740")
    print message.sid

while True:
    page=requests.get('http://sports.ndtv.com/cricket/live-scores')
    tree=html.fromstring(page.text)
    score=(tree.xpath('//div[@class="ckt-scr"]/text()')[0].lstrip()).rstrip()
    sendscore(score)
    print "%s sent at:%s "%(score,time.ctime())
    time.sleep(180)

You can press Ctrl+C to quit the program. Once you start it, it automatically sends the score every three minutes. I designed this program in 5 minutes, and Sai Madhu was very impressed with the accuracy of the scores he got that day. "What can one not do with a computer, the internet and Python?"
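For longer runs I would also guard the loop body, so that one dropped connection does not kill the script. A minimal sketch, assuming requests' standard RequestException:

while True:
    try:
        page = requests.get('http://sports.ndtv.com/cricket/live-scores')
        tree = html.fromstring(page.text)
        score = tree.xpath('//div[@class="ckt-scr"]/text()')[0].strip()
        sendscore(score)
        print "%s sent at:%s " % (score, time.ctime())
    except (requests.RequestException, IndexError) as e:
        # network hiccup or changed markup: skip this round, try again next time
        print "skipped: %s" % e
    time.sleep(180)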

My intention in writing this post was to show how we can send an SMS from a Python program. We can also send MMS and make calls; a rough sketch of both follows, and you can refer to the complete API here: https://www.twilio.com/docs/api/rest .
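For the curious, here is that sketch, reusing the client created in sendscore; I am writing the media_url and url arguments from memory of this version of the API, so treat the exact parameter names as assumptions:

# MMS: same create call, with a picture attached via media_url
client.messages.create(body="score card", to="+919052108147", from_="+1 720-548-2740",
                       media_url=["http://example.com/scorecard.png"])

# Voice call: Twilio fetches TwiML speaking instructions from the given url
client.calls.create(to="+919052108147", from_="+1 720-548-2740",
                    url="http://example.com/say-score.xml")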
Hope you enjoyed the post. Bye.

Forbes top 100 quote collecting spider with Dragline

In this post we start a new real-world spider with the Dragline crawling framework. By the end of this post, in addition to the Dragline API, we will also cover some basic Python concepts.

Forbes compiles good quotes and displays them on its website, and you wish to save them to forward to your close friends. Because you are a programmer, a Python coder at that, your intention is to fetch all that data with a spider. There are many spiders to study in the coming days, but here we discuss a basic one for fetching Forbes quotes. This attempt is to make you familiar with the Dragline framework.

As explained in the previous post, Dragline has a lot of invisible features which make the spiders it creates smart. I am hoping you have already installed Dragline; if not, see the installation instructions in the previous post.

Task 1:

Learning the basics of the Dragline API

Dragline mainly consists of these major modules:

      • dragline.http
      • dragline.htmlparser

dragline.http

It has a Request class

class dragline.http.Request(url, method='GET', form_data=None, headers={}, callback=None, meta=None)

Parameters:
  • url (string) – the URL of this request
  • method (string) – the HTTP method of this request. Defaults to 'GET'.
  • headers (dict) – the headers of this request.
  • callback (string) – name of the function to call after url is downloaded.
  • meta (dict) – A dict that contains arbitrary metadata for this request.
send()
This method sends the HTTP request.

Returns: response
Return type: dragline.http.Response
Raises: dragline.http.RequestError: when failed to fetch contents
>>> req = Request("http://www.example.org")
>>> response = req.send()
>>> print response.headers['status']
200

 and a Response class

class dragline.http.Response(url=None, body=None, headers=None, meta=None)

Parameters:
  • headers (dict) – the headers of this response.
  • body (str) – the response body.
  • meta (dict) – meta copied from request

This class is used to create user-defined responses, which is handy for testing your spider and in many other cases. It is much easier than the Requests module's get method.
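For example, a hand-built response to exercise a parser offline might look like this (the body string here is made up):

>>> body = '<html><body><a href="http://www.example.org/a">a</a></body></html>'
>>> fake_response = Response(url="http://www.example.org", body=body)
>>> parse_object = HtmlParser(fake_response)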

dragline.htmlparser

This is the basic parser module for extracting content from HTML data; its main function is called HtmlParser. Apart from the rest of Dragline, htmlparser alone is a powerful parsing tool.

HtmlParser Function

dragline.htmlparser.HtmlParser(response)
Parameters: response (dragline.http.Response)

This function takes a response object as its argument and returns an lxml etree object.

The HtmlParser function returns an lxml object of type HtmlElement, which has a few useful methods. All the details of the lxml object are discussed in the section lxml.html.HtmlElement.

First we should create an HtmlElement object by sending the appropriate URL as a parameter; the URL is for the page we want to scrape.

The HtmlElement object is returned by the HtmlParser function of the dragline.htmlparser module:

>>> req = Request('http://www.gutenberg.com')
>>> parse_object = HtmlParser(req.send())

The methods available on the HtmlElement object are:

extract_urls(xpath_expr)

This function fetches all the links on the page that match the XPath expression given as its argument.

If no XPath is given, links are fetched from the entire document. From the previous example, let the HtmlElement object be parse_obj.

>>> parse_obj.extract_urls('//div[@class="product"]')

xpath(expression)

This function directly accumulates the results of the XPath expression. It is used to fetch HTML body elements directly. For example, given this document:

<html>
    <head>
    </head>
    <body>
        <div class="tree">
            <a href="http://www.treesforthefuture.org/">Botany</a>
        </div>
        <div class="animal">
            <a href="http://www.animalplanet.com/">Zoology</a>
        </div>
    </body>
</html>

then we can use XPath expressions like the following:

>>> parse_object.extract_urls('//div[@class="tree"]')
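The xpath method itself returns the raw results of the expression; against the same snippet, a call like this would give the anchor text (hypothetical usage, following lxml semantics):

>>> parse_object.xpath('//div[@class="animal"]/a/text()')
['Zoology']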
extract_text(xpath_expr)

This function grabs all the text from the specified web page. xpath is an optional argument; if specified, the text returned is restricted to the nodes matching the XPath expression.

     >>> parse_obj.extract_text('//html')

So now you have understood the main modules of Dragline and the important methods in them.

Now let's begin our journey by writing a small spider.
First go to the folder where you want to save your spider and follow the procedure below.
  • $ mkdir samplespider
  • $ cd samplespider
  • $ dragline-admin init forbesquotes

This creates a spider called forbesquotes in your newly created samplespider directory.

Now you will see a folder forbesquotes inside samplespider; traverse into it:

  • $ cd forbesquotes

Task 2:

Writing a spider for collecting the top 100 quotes from Forbes

Below is the 26-line spider for extracting the top 100 quotes from Forbes.

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re

class Spider:
    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)
        self.parseQuote(response)
        for url in html.xpath('//span[@class="page_links"]/a/@href'):
            yield Request(url,callback="parseQuote")

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

This is a 26-line spider with Dragline. At first glance you might not understand a bit of it, so let's explain everything.

As already mentioned, when we create a new spider a new directory is created with the spider's name. It consists of two files:

  • main.py
  • settings.py
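
So right after init, the layout is simply this (the tree drawing is mine; the file names are as above):

forbesquotes/
    main.py       # spider class: name, start URL, parse logic
    settings.py   # crawl configuration for this spider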

main.py looks like the following, with a default class called Spider and the methods __init__ and parse.

from dragline.htmlparser import HtmlParser
from dragline.http import Request


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.example.org"
        self.allowed_domains = []
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

All these things are given to us as a gift, without hardcoding them again. Now we just need to concentrate on how to attack the problem.

1) The __init__ method takes the starting URL and the allowed domains from which the spider begins.

In our case the forbesquotes spider starts at self.start = 'http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes'

and sets self.allowed_domains = ['www.forbes.com']

It is a list, so it can take any number of allowed domains; a quick sketch follows.
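For instance (the second domain here is hypothetical, just to show the shape):

self.allowed_domains = ['www.forbes.com', 'blogs.forbes.com']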

Now our main.py looks like this:

from dragline.htmlparser import HtmlParser
from dragline.http import Request


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kev inkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

OK, now we should crawl through the pages. So let's write a function called parseQuote for processing a page: its input is the response object, and its outcome is that the quotes on the response page are written to a file. parseQuote should run once for each page on which quotes are available. After adding the parseQuote function:

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re


class Spider:

    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

If you observe parseQuote, only the first three lines are the job of the framework; the remaining code is pure Python logic for stripping and editing the raw quotes fetched from the response and writing them to a file. A small worked example of that stripping logic follows.
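Here is how the number-stripping works on a single raw quote (the sample string is made up):

>>> import re
>>> pat = re.compile(r'\d*\.')
>>> raw = '12. We become what we think about. Earl Nightingale'
>>> pat.search(raw).group()
'12.'
>>> raw.split(pat.search(raw).group(), 1)[1]
' We become what we think about. Earl Nightingale'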

parse is the function where spider execution starts. We supply callbacks from there to the pages we wish to navigate; that means the spider follows exactly the path we mention.

So now I am adding content to the parse method. After observing the structure of the web pages, I call parseQuote on the current response.

Next, using the extract_urls method of the Dragline HtmlElement object, I extract all the pagination URLs with the relevant XPath and pass them as callbacks for the parseQuote function. The resulting code looks like:

from dragline.htmlparser import HtmlParser
from dragline.http import Request
import re

class Spider:
    def __init__(self, conf):
        self.name = "forbesquotes"
        self.start = "http://www.forbes.com/sites/kevinkruse/2013/05/28/inspirational-quotes"
        self.allowed_domains = ['www.forbes.com']
        self.conf = conf

    def parse(self,response):
        html = HtmlParser(response)
        self.parseQuote(response)
        for url in html.extract_urls('//span[@class="page_links"]/a'):
            yield Request(url,callback="parseQuote")

    def parseQuote(self,response):
        print response.url
        html = HtmlParser(response)
        title = html.xpath('//div[@class="body contains_vestpocket"]/p/text()')
        quotes = [i.encode('ascii',"ignore") for i in title if i!=' '][2:]
        pat = re.compile(r'\d*\.')
        with open('quotes.txt','a') as fil:
            for quote in [i.split(pat.search(i).group(),1)[1] for i in quotes]:
                fil.write('\n'+quote+'\n')

So now, after completing main.py, just go to the terminal and type the following command to run the spider:

  • $ dragline .  (run from inside the spider directory)
  • $ dragline /path_to_spider/  (run from any other path)

Then our spider starts running, displaying every processed URL in the command prompt, and a new file quotes.txt is created in the current directory with the top 100 quotes:

 Life isnt about getting and having, its about giving and being. 

 Whatever the mind of man can conceive and believe, it can achieve. Napoleon Hill

 Strive not to be a success, but rather to be of value. Albert Einstein

 Two roads diverged in a wood, and II took the one less traveled by, And that has made all the difference. Robert Frost

 I attribute my success to this: I never gave or took any excuse. Florence Nightingale

 You miss 100% of the shots you dont take. Wayne Gretzky

 Ive missed more than 9000 shots in my career. Ive lost almost 300 games. 26 times Ive been trusted to take the game winning shot and missed. Ive failed over and over and over again in my life. And that is why I succeed. Michael Jordan

 The most difficult thing is the decision to act, the rest is merely tenacity. Amelia Earhart

 Every strike brings me closer to the next home run. Babe Ruth

 Definiteness of purpose is the starting point of all achievement. W. Clement Stone

 We must balance conspicuous consumption with conscious capitalism. Kevin Kruse

 Life is what happens to you while youre busy making other plans. John Lennon

 We become what we think about. Earl Nightingale

Twenty years from now you will be more disappointed by the things that you didnt do than by the ones you did do, so throw off the bowlines, sail away from safe harbor, catch the trade winds in your sails. Explore, Dream, Discover. Mark Twain

Life is 10% what happens to me and 90% of how I react to it. Charles Swindoll

 There is only one way to avoid criticism: do nothing, say nothing, and be nothing. Aristotle

 Ask and it will be given to you; search, and you will find; knock and the door will be opened for you. Jesus

 The only person you are destined to become is the person you decide to be. Ralph Waldo Emerson

 Go confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau

 When I stand before God at the end of my life, I would hope that I would not have a single bit of talent left and could say, I used everything you gave me. Erma Bombeck

 Few things can help an individual more than to place responsibility on him, and to let him know that you trust him. Booker T. Washington

 Certain things catch your eye, but pursue only those that capture the heart. Ancient Indian Proverb

 Believe you can and youre halfway there. Theodore Roosevelt

 Everything youve ever wanted is on the other side of fear. George Addair

 We can easily forgive a child who is afraid of the dark; the real tragedy of life is when men are afraid of the light. Plato

 

 If youre offered a seat on a rocket ship, dont ask what seat! Just get on. Sheryl Sandberg

 First, have a definite, clear practical ideal; a goal, an objective. Second, have the necessary means to achieve your ends; wisdom, money, materials, and methods. Third, adjust all your means to that end. Aristotle

 If the wind will not serve, take to the oars. Latin Proverb

 You cant fall if you dont climb. But theres no joy in living your whole life on the ground. Unknown

 We must believe that we are gifted for something, and that this thing, at whatever cost, must be attained. Marie Curie

 Too many of us are not living our dreams because we are living our fears. Les Brown

 Challenges are what make life interesting and overcoming them is what makes life meaningful. Joshua J. Marine

 If you want to lift yourself up, lift up someone else. Booker T. Washington

 I have been impressed with the urgency of doing. Knowing is not enough; we must apply. Being willing is not enough; we must do. Leonardo da Vinci

 Limitations live only in our minds. But if we use our imaginations, our possibilities become limitless. Jamie Paolinetti

 You take your life in your own hands, and what happens? A terrible thing, no one to blame. Erica Jong

 Whats money? A man is a success if he gets up in the morning and goes to bed at night and in between does what he wants to do. Bob Dylan

 I didnt fail the test. I just found 100 ways to do it wrong. Benjamin Franklin

 Nothing is impossible, the word itself says, Im possible! Audrey Hepburn

 The only way to do great work is to love what you do. Steve Jobs

 If you can dream it, you can achieve it. Zig Ziglar

and so on ....................

So this is a very small example; it is like killing an ant with an axe. The main theme of this post is to introduce Dragline and make you familiar with it. Crawling is not always welcome or legal, so write your spiders with the risks and benefits in mind. Many Python techniques were used along the way, like the smart use of list comprehensions and regex. Hope you enjoyed. Comment if you have any queries.