Language translation with Python, part 2


Stemming, lemmatization, and translation simplified

Stemming words

Stemming is a technique for removing affixes from a word, leaving only the stem. For example, the stem of “cooking” is “cook”, and a good stemming algorithm knows that the “ing” suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.
One of the most common stemming algorithms is the Porter Stemming Algorithm, by Martin Porter. It is designed to remove and replace well known suffixes of English words, and its usage in NLTK will be covered next.
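As a preview of the stemmer covered below, here is a minimal sketch (my own illustration, not part of the original recipe) of such a stem-keyed index, in which three surface forms collapse into a single key:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> index = {}
>>> for word in ['cook', 'cooking', 'cooked', 'cookery']:
...     index.setdefault(stemmer.stem(word), []).append(word)
>>> sorted(index.items())
[('cook', ['cook', 'cooking', 'cooked']), ('cookeri', ['cookery'])]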

The resulting stem is not always a valid word. For example, the stem of “cookery” is “cookeri”. This is a feature, not a bug.
How to do it…

NLTK comes with an implementation of the Porter Stemming Algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem.

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'

How it works…

The PorterStemmer knows a number of regular word forms and suffixes, and uses that knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, that has the same root meaning.
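You can watch those steps collapse a few regular forms (a quick illustration of my own; exact outputs can vary slightly between NLTK versions):

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> [stemmer.stem(w) for w in ['flies', 'running', 'happiness']]
['fli', 'run', 'happi']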

There’s more…

There are other stemming algorithms out there besides the Porter Stemming Algorithm, such as the Lancaster Stemming Algorithm, developed at Lancaster University. NLTK includes it as the LancasterStemmer class. At the time of writing, there is no definitive research demonstrating the superiority of one algorithm over the other. However, Porter Stemming is generally the default choice.
All the stemmers covered next inherit from the StemmerI interface, which defines the stem() method, so they can be used interchangeably.
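A quick check (my own sketch) confirms the inheritance relationship:

>>> from nltk.stem import StemmerI, PorterStemmer, LancasterStemmer, RegexpStemmer
>>> [issubclass(cls, StemmerI) for cls in (PorterStemmer, LancasterStemmer, RegexpStemmer)]
[True, True, True]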

The LancasterStemmer functions just like the PorterStemmer, but can produce different results; it is known to be slightly more aggressive than the PorterStemmer.

>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'

You can also construct your own stemmer using the RegexpStemmer. It takes a single regular expression (either compiled or as a string) and will remove any prefix or suffix that matches.

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('ingleside')
'leside'

A RegexpStemmer should only be used in very specific cases that are not covered by the PorterStemmer or LancasterStemmer.
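If you do use one, anchoring the expression and setting the min argument (RegexpStemmer's minimum-length parameter) helps avoid surprises like the "ingleside" case above; a brief sketch of my own:

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing$', min=4)
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('ingleside')
'ingleside'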

New in NLTK 2.0b9 is the SnowballStemmer, which supports 13 non-English languages. To use it, you create an instance with the name of the language you are using, and then call the stem() method. Here is a list of all the supported languages, and an example using the Spanish SnowballStemmer:

>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
u'hol'

Lemmatizing words with WordNet

Lemmatization is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word that means the same thing, though it may be a completely different word. A few examples will make this clear.

Getting ready

Be sure you have unzipped the wordnet corpus in nltk_data/corpora/wordnet. This will allow the WordNetLemmatizer to access WordNet.

How to do it…
We will use the WordNetLemmatizer to find lemmas:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
>>> lemmatizer.lemmatize('cookbooks')
'cookbook'

How it works…

The WordNetLemmatizer is a thin wrapper around the WordNet corpus, and uses the morphy() function of the WordNetCorpusReader to find a lemma. If no lemma is found, the word is returned as it is. Unlike with stemming, knowing the part of speech of the word is important. As demonstrated previously, “cooking” does not have a lemma unless you specify that the part of speech (pos) is a verb. This is because the default part of speech is a noun, and since “cooking” is not a noun, no lemma is found. “Cookbooks”, on the other hand, is a noun, and its lemma is the singular form, “cookbook”.
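You can reproduce this behavior by calling morphy() directly through the wordnet corpus reader:

>>> from nltk.corpus import wordnet
>>> wordnet.morphy('cooking', wordnet.VERB)
'cook'
>>> wordnet.morphy('cookbooks', wordnet.NOUN)
'cookbook'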

There’s more…
Here’s an example that illustrates one of the major differences between stemming and lemmatization:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('believes')
'believ'
>>> lemmatizer.lemmatize('believes')
'belief'

Instead of just chopping off the “es” like the PorterStemmer, the WordNetLemmatizer finds a valid root word. Where a stemmer only looks at the form of the word, the lemmatizer looks at the meaning of the word. And by returning a lemma, you will always get a valid word.

Combining stemming with lemmatization
Stemming and lemmatization can be combined to compress words more than either process can by itself. These cases are somewhat rare, but they do exist:

>>> stemmer.stem('buses')
'buse'
>>> lemmatizer.lemmatize('buses')
'bus'
>>> stemmer.stem('bus')
'bu'

In this example, stemming saves one character, lemmatizing saves two, and stemming the lemma saves three out of five characters: a 60% compression rate. You are unlikely to see gains that large consistently, but over many thousands of words this kind of compression can still make a significant difference.
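If you want this combined behavior in one place, here is a minimal helper of my own (not part of NLTK) that lemmatizes first so the stemmer sees a valid root word:

>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> stemmer = PorterStemmer()
>>> lemmatizer = WordNetLemmatizer()
>>> def compress(word):
...     # lemmatize first, then stem the resulting lemma
...     return stemmer.stem(lemmatizer.lemmatize(word))
>>> compress('buses')
'bu'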

Finally, translation:

Translating text with Babelfish

Babelfish is an online language translation API provided by Yahoo. With it, you can translate text in a source language to a target language. NLTK comes with a simple interface for using it.

Getting ready
Be sure you are connected to the internet first. The babelfish.translate() function requires access to Yahoo’s online API in order to work.
How to do it…
To translate your text, you first need to know two things:
1. The language of your text or source language.
2. The language you want to translate to or target language.
Language detection is outside the scope of this recipe, so we will assume you already know the source and target languages.

>>> from nltk.misc import babelfish
>>> babelfish.translate('cookbook', 'english', 'spanish')
'libro de cocina'
>>> babelfish.translate('libro de cocina', 'spanish', 'english')
'kitchen book'
>>> babelfish.translate('cookbook', 'english', 'german')
'Kochbuch'
>>> babelfish.translate('kochbuch', 'german', 'english')
'cook book'

You cannot translate using the same language for both source and target. Attempting to do so will raise a BabelfishChangedError.
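If the languages come from user input, it is worth checking for this case up front; a small sketch of my own, continuing the session above:

>>> source, target = 'english', 'english'
>>> if source == target:
...     print 'source and target languages must differ'
... else:
...     print babelfish.translate('cookbook', source, target)
source and target languages must differ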
How it works…

The translate() function is a small function that sends a urllib request to Yahoo's Babelfish web service, and then searches the response for the translated text.

If Yahoo, for whatever reason, has changed their HTML response to the point that translate() cannot identify the translated text, a BabelfishChangedError will be raised. This is unlikely to happen, but if it does, you may need to upgrade to a newer version of NLTK and/or report the error.
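The underlying pattern is easy to picture; here is a rough sketch of my own of the request-and-scrape approach (the URL, parameter names, and HTML marker are illustrative placeholders, not Yahoo's actual endpoint):

>>> import re
>>> import urllib
>>> def scrape_translation(text, lang_pair, url):
...     # POST the text to a translation endpoint; 'text' and 'lp' are placeholder parameter names
...     params = urllib.urlencode({'text': text, 'lp': lang_pair})
...     response = urllib.urlopen(url, params).read()
...     # search the HTML response for the translated text (placeholder markup)
...     match = re.search(r'<div id="result">(.*?)</div>', response)
...     if match is None:
...         raise ValueError('cannot find translation in response')
...     return match.group(1)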

There’s more…
There is also a fun function called babelize() that translates back and forth between the source and target language until there are no more changes.

>>> for text in babelfish.babelize('cookbook', 'english', 'spanish'):
...     print text
libro de cocina
kitchen book
libro de la cocina
book of the kitchen

Available languages
You can see all the languages available for translation by examining the available_languages attribute.

>>> babelfish.available_languages
['Portuguese', 'Chinese', 'German', 'Japanese', 'French', 'Spanish', 'Russian', 'Greek', 'English', 'Korean', 'Italian']

The lowercased version of each of these languages can be used as a source or target language for translation.
