Is language huffman coded?
Huffman coding, for the uninitiated is a sort of compression scheme in computer science that assignes short binary representations to frequently used characters and longer binary representations to the less frequently used. The idea is that what you lose by encoding less frequently occuring characters with longer bitstrings, you gain by encoding more frequently used characters with shorter bitstrings.
Having said that, something I noticed only lately is that human language seems like its encoded using a similar scheme. The more frequent a word is used, the shorter it is. For examples, most of the closed class words like prepositons and determiners and sometimes even commonly used verbs are short, while the ones which are longer are usually rarely used words. Since I’ve been bragging lately about all this computing power that I have, Its about time to flaunt it.
I used the enron corpus which contains about 250,000 unique email messages totalling to approximately 1.5 gigs of text, small but not too small. I plotted the word frequency vs length of the word and the results seem as expected (click the image to see a better picture).
There is an initial peak when the length of characters is about 3 or so and then the frequency rapidly declines to almost nothing when the length increases to around 13 or so. A much clearer picture of the encoding that goes on here can be obtained if we normalize the frequency counts by the number of words of a given length. There are very few one letter words (namely the determiner ‘a’) but more two letter words (‘an’,’to’,’in’,etc..) and so on.
At first look the above graph looks negative exponential to me (meaning, good job on the compression). Infact this is what one would expect if someone were to do a good job on the compression.
Its really suprising to realise that what usually requires sophisticated critical thinking (schemes like huffman encoding) can also be easily reproduced by random, unplanned phenomena like linguistic evolution.