Language Translation with Python, Part 1


Natural Language Toolkit intro

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. A new version with updates for Python 3 and NLTK 3 is in preparation.

Tokenizing text into sentences:

Tokenization is the process of splitting a string into a list of pieces, or tokens. We’ll start by splitting a paragraph into a list of sentences.
Getting ready

Installation instructions for NLTK are available on the NLTK website; the latest version as of this writing is 2.0b9. NLTK requires Python 2.4 or higher, but is not compatible with Python 3.0. The recommended Python version is 2.6.

Once you've installed NLTK, you'll also need to install the data by following the NLTK data installation instructions. We recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, or on Windows is C:\nltk_data. Make sure the Punkt tokenizer models are in the data directory and have been unpacked, so that there's a file at tokenizers/punkt/english.pickle.

Finally, to run the code examples, you'll need to start a Python console. On Mac and Linux/Unix, you can open a terminal and type python.

How to do it…
Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument.

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

So now we have a list of sentences that we can use for further processing.
How it works…

sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained, and works well for many European languages, so it knows what punctuation and characters mark the end of a sentence and the beginning of a new one.
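For contrast, here's a naive regex-based sentence splitter, the kind of rule Punkt's trained model improves on. This is only an illustrative sketch (the function name naive_sent_split is ours, not part of NLTK):

```python
import re

def naive_sent_split(text):
    # Split after sentence-final punctuation followed by whitespace.
    # Punkt's trained model handles abbreviations like "Dr." and other
    # edge cases that this simple rule gets wrong.
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sent_split("Hello World. It's good to see you."))
```

A sentence ending with an abbreviation would be split incorrectly here, which is exactly why the trained tokenizer is preferable.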

There’s more…

The instance used in sent_tokenize() is actually loaded on demand from a pickle file. So if you’re going to be tokenizing a lot of sentences, it’s more efficient to load the PunktSentenceTokenizer once, and call its tokenize() method instead.

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

Other languages

If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt and use it just like the English sentence tokenizer. Here’s an example for Spanish:

>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')

Tokenizing sentences using regular expressions:

Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, we only recommend using them if the word tokenizers covered elsewhere in this post are unacceptable.

Getting ready
First you need to decide how you want to tokenize a piece of text, as this will determine how you construct your regular expression. The choices are:
- Match on the tokens
- Match on the separators, or gaps
We’ll start with an example of the first, matching alphanumeric tokens plus single quotes so that we don’t split up contractions.
How to do it…
We’ll create an instance of the RegexpTokenizer, giving it a regular expression string to use for matching tokens.

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']

There’s also a simple helper function you can use in case you don’t want to instantiate the class.

>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", "[\w']+")
["Can't", 'is', 'a', 'contraction']

Now we finally have something that can treat contractions as whole words, instead of splitting them into tokens.
How it works…

The RegexpTokenizer works by compiling your pattern, then calling re.findall() on your text. You could do all this yourself using the re module, but the RegexpTokenizer implements the TokenizerI interface, just like all the word tokenizers from the previous recipe. This means it can be used by other parts of the NLTK package, such as corpus readers, which we’ll cover in detail in Chapter 3, Creating Custom Corpora. Many corpus readers need a way to tokenize the text they’re reading, and can take optional keyword arguments specifying an instance of a TokenizerI subclass. This way, you have the ability to provide your own tokenizer instance if the default tokenizer is unsuitable.
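A sketch of that equivalent approach using the re module directly, compiling the pattern once and calling findall on each text (the function name regexp_tokens is illustrative, not an NLTK API):

```python
import re

# Compile the token pattern once, then find all matches in the text,
# mirroring what RegexpTokenizer does internally in matching mode.
pattern = re.compile(r"[\w']+")

def regexp_tokens(text):
    return pattern.findall(text)

print(regexp_tokens("Can't is a contraction."))
```

The difference is that RegexpTokenizer also plugs into the TokenizerI interface, so it composes with the rest of NLTK.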

There’s more…

RegexpTokenizer can also work by matching the gaps, instead of the tokens. Instead of using re.findall(), the RegexpTokenizer will use re.split(). This is how the BlanklineTokenizer in nltk.tokenize is implemented.
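The gap-matching behavior can be sketched with re.split directly (gap_tokenize is our illustrative name, not NLTK's):

```python
import re

def gap_tokenize(text):
    # Split on runs of whitespace, as RegexpTokenizer does internally
    # when gaps=True; drop any empty strings that re.split can produce
    # at the edges of the text.
    return [tok for tok in re.split(r'\s+', text) if tok]

print(gap_tokenize("Can't is a contraction."))
```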

Simple whitespace tokenizer
Here's a simple example of using the RegexpTokenizer to tokenize on whitespace:

>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']

Notice that punctuation still remains in the tokens.

Tokenizing sentences into words:

In this recipe, we’ll split a sentence into individual words. The simple task of creating a list of words from a string is an essential part of all text processing.
How to do it…
Basic word tokenization is very simple: use the word_tokenize() function:

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']

How it works…
word_tokenize() is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer. It’s equivalent to the following:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']

It works by separating words using spaces and punctuation, and as you can see, it does not discard the punctuation, allowing you to decide what to do with it.
There’s more…

Ignoring the obviously named WhitespaceTokenizer and SpaceTokenizer, there are two other word tokenizers worth looking at: PunktWordTokenizer and WordPunctTokenizer. These differ from the TreebankWordTokenizer in how they handle punctuation and contractions, but they all inherit from TokenizerI.

TreebankWordTokenizer uses conventions found in the Penn Treebank corpus, which we’ll be using for training in Chapter 4, Part-of-Speech Tagging and Chapter 5, Extracting Chunks. One of these conventions is to separate contractions. For example:

>>> word_tokenize("can't")
['ca', "n't"]

If you find this convention unacceptable, then read on for alternatives, and see the next recipe for tokenizing with regular expressions.

An alternative word tokenizer is the PunktWordTokenizer. It splits on punctuation, but keeps it with the word instead of creating separate tokens.

>>> from nltk.tokenize import PunktWordTokenizer
>>> tokenizer = PunktWordTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'t", 'is', 'a', 'contraction.']

Another alternative word tokenizer is WordPunctTokenizer. It splits all punctuation into separate tokens.

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']
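WordPunctTokenizer is essentially a RegexpTokenizer built on the pattern \w+|[^\w\s]+, so its behavior can be sketched with the re module alone (word_punct_tokenize is our illustrative name):

```python
import re

def word_punct_tokenize(text):
    # Match either runs of word characters, or runs of non-word,
    # non-space characters (punctuation), as separate tokens.
    return re.findall(r'\w+|[^\w\s]+', text)

print(word_punct_tokenize("Can't is a contraction."))
```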

Translating one language to another will be discussed in Part 2, the continuation of this post. Stay tuned.

