To us, here and now, it appears thus.

rants, ramblings and other things that make life worth living…

Is language Huffman coded?

with 5 comments

Huffman coding, for the uninitiated, is a compression scheme in computer science that assigns short binary representations to frequently used characters and longer ones to characters used less frequently. The idea is that what you lose by encoding rarely occurring characters with longer bitstrings, you more than make up for by encoding the most common characters with shorter bitstrings.
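
For the curious, here is a minimal sketch of the classic greedy construction in Python (the function name and the tie-breaking counter are my own choices, and it assumes at least two distinct characters in the input):

    import heapq
    from collections import Counter

    def huffman_code(text):
        # Map each character of the input to its Huffman bitstring.
        freq = Counter(text)
        # Heap entries: (frequency, tie-breaker, {char: partial code}).
        # The integer tie-breaker keeps Python from comparing the dicts.
        heap = [(f, i, {ch: ''}) for i, (ch, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)
            f2, _, b = heapq.heappop(heap)
            # Merge the two lightest subtrees, prefixing one bit to each code.
            merged = {ch: '0' + code for ch, code in a.items()}
            merged.update({ch: '1' + code for ch, code in b.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    codes = huffman_code('this sentence is its own frequency table')
    for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
        print(repr(ch), code)

Printing the codes sorted by length makes the point directly: the most frequent characters (here ‘e’ and the space) come out with the shortest bitstrings, and the rare letters get the longest.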

Having said that, something I noticed only lately is that human language seems to be encoded using a similar scheme: the more frequently a word is used, the shorter it tends to be. For example, most closed-class words like prepositions and determiners, and even many commonly used verbs, are short, while longer words are usually rarely used. Since I’ve been bragging lately about all this computing power that I have, it’s about time to flaunt it.

I used the Enron corpus, which contains about 250,000 unique email messages totalling approximately 1.5 gigs of text, small but not too small. I plotted word frequency against word length, and the results are as expected (click the image to see a better picture).
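
The tally behind the plot is simple. Here is a rough sketch of the counting in Python; the plain-text directory tree and the crude lowercase-letters-only tokenizer are simplifications of mine, not a faithful description of the corpus:

    import os
    import re
    from collections import Counter

    def length_stats(corpus_dir):
        # For each word length, tally the total number of tokens of that
        # length and collect the set of distinct words of that length.
        token_counts = Counter()
        distinct = {}
        word = re.compile(r'[a-z]+')
        for dirpath, _, filenames in os.walk(corpus_dir):
            for name in filenames:
                with open(os.path.join(dirpath, name), errors='ignore') as f:
                    for w in word.findall(f.read().lower()):
                        token_counts[len(w)] += 1
                        distinct.setdefault(len(w), set()).add(w)
        return token_counts, distinct

The raw plot is then just token_counts: total occurrences at each word length.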

There is an initial peak at a word length of about 3 characters, and then the frequency declines rapidly to almost nothing by a length of around 13. A much clearer picture of the encoding going on here can be had if we normalize the frequency counts by the number of distinct words of a given length: there are very few one-letter words (namely the determiner ‘a’) but more two-letter words (‘an’, ‘to’, ‘in’, etc.), and so on.
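
In code, the normalization is a one-liner on top of the tallies from the sketch above (token_counts and distinct are the names used there):

    def normalized_counts(token_counts, distinct):
        # Average occurrences per distinct word of each length: total
        # tokens of length n divided by the number of unique words of
        # length n.
        return {n: token_counts[n] / len(distinct[n])
                for n in sorted(token_counts)}

This is what the graph below shows: how hard the average word of each length is working.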

[Figure: normalized-graph.jpg, normalized word frequency vs. word length]

At first glance the graph above looks like a decaying exponential to me (meaning, good job on the compression). In fact, this is exactly the shape one would expect if the compression were being done well.

It’s really surprising to realise that what usually requires sophisticated, deliberate design (schemes like Huffman coding) can also be reproduced by a random, unplanned phenomenon like linguistic evolution.

Signing Off,
Vishnu Vyas.

Written by vishnuvyas

May 31, 2007 at 9:51 pm

5 Responses

  1. For text analysis, word and phrase frequency calculations, concordance, wordform frequencies and more, let me recommend our product called Textanz. This tool has become a must-have for many writers, editors and proofreaders over the last year. If interested, take a look at the detailed description and screenshots at
    http://www.cro-code.com/textanz.jsp

    Thank you.
    Alexander Potyomkin
    Cro-Code

    June 1, 2007 at 5:52 am

  2. Interesting idea, but I know how easy it is to see a pattern when you’re looking for it. Natural languages have been around much longer than any formal Huffman coding techniques. I think it is far more likely that language evolution has resorted to common sense over time, rather than formally creating a set of words. Dropping letters from words to shorten them, for example.

    I think there is one clear trait of natural languages which shows they couldn’t be Huffman codes: they have built-in redundancy. This redundancy catches spelling errors and ensures some level of accuracy for any message that may be received over a noisy channel. Huffman coding has no such redundancy, and in fact fails completely if used on its own over a noisy channel.

    Your results are clear; yes, there is a pattern to be observed. I’m sure there are other interesting patterns to observe in natural languages – maybe even enough to do a PhD thesis! Now there’s a thought…

    Anyway, I was happy to see someone else looking at this field of mathematics. It’s not something that crops up regularly in the blogosphere!

    Leechio

    June 1, 2007 at 8:47 am

  3. Interesting observation, Leechio. I can see the redundancy in terms of polysemous words, but catching spelling errors? Of that I am still not so sure.

    “I think it is far more likely that language evolution has resorted to common sense over time, rather than formally creating a set of words. Dropping letters from words to shorten them, for example.”

    That was more or less the point: linguistic evolution, the combined common sense of millions of independent individuals, can also come up with something as sophisticated as a Huffman code, hence removing any need for a formal system of dropping letters.

    vishnuvyas

    June 1, 2007 at 2:35 pm

  4. Who gave you permission to license Huffman coding to computer science? We communicationists will thwart any attempts to wrest its ownership from us.

    M

    June 3, 2007 at 3:42 am

  5. See “Zipf’s law”

    RiderOfGiraffes

    June 3, 2007 at 11:14 am

