Big Numbers 2 – The Hadoop Version.
Well, I’ve talked about big numbers before, but never before did I think that 1.5 gigabytes of text corpora could actually be small. Well, guess what – it is small. Tiny, even. This is what happens when you have the power to process multiple terabytes and all you have is a puny 1.5 gigabytes of data. That’s right, I’m now set up on two clusters – one of 3 nodes and one of 19 nodes – both running Hadoop, an open-source implementation of the Google File System and Google’s MapReduce framework.
So, what is all the fuss about Hadoop and MapReduce? Haven’t people been doing this kind of thing for a long, long time? Well, yes and no. The idea of distributing your computation and then combining the results has been around for ages, but instead of moving your data to the place of computation, Hadoop moves the computation to the location of the data. It lets you run multiple independent jobs called ‘maps’, each working on its own chunk of data, and then a ‘reduce’ step combines the output of all the maps into the final result you want. Programming in this model is fun, powerful, and really, really simple.
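To give a flavor of the model, here is a minimal sketch of the classic word-count example in plain Python – not the actual Hadoop API, just the map/reduce idea: each mapper emits (key, value) pairs from its own chunk of input, and the reducer combines all the values for a key into the final result.

```python
# A toy illustration of the map/reduce model (plain Python, not Hadoop itself).
from itertools import groupby
from operator import itemgetter

def mapper(chunk):
    # Each map works on its chunk of data independently,
    # emitting a (word, 1) pair for every word it sees.
    for word in chunk.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # The reduce step groups the mapper output by key and
    # combines the values into the final per-word counts.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

# In a real cluster the chunks would live on different nodes;
# here we just fake two chunks of a corpus.
chunks = ["the quick brown fox", "the lazy dog"]
pairs = [pair for chunk in chunks for pair in mapper(chunk)]
counts = dict(reducer(pairs))
```

In Hadoop proper, the framework handles splitting the data, shuffling the mapper output to the reducers, and re-running failed tasks – the programmer only writes the two functions above.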
I will probably put up some Ajaxy demos of the results I’ve gotten with my newfound computing power quite soon, so until then, stay tuned.