Archive for the ‘Computer Science’ Category
4 hours, yeah… 4 fucking hours, especially if you are a newbie to the whole “networking” – iptables, ipchains thingamajigs…
I am trying to set up a small python based annotation engine and I am planning to let it into the wild on the internet (the horror!!) and like any normal chump who sees the whole “web is the way to go for apps” mentality everywhere, I set up my application server behind apache using mod_proxy and let it run for some time. And “some time” in the fast moving internet space is 3 days, and on the third day I check my logs and I see lots and lots of random people from all over the world trying to hack my damn server. Well, this story is not about them.
So I decide to set up iptables – that’s a pretty darn good idea, you might say… except for one thing: I don’t know anything about iptables. So, after browsing for almost an hour through tutorials, howtos, message boards, google groups (has anyone noticed the search in google groups sucks?) I still couldn’t get anywhere.
Every tutorial out there seems to want to teach me what a TCP packet is, or what link layer protocols are, or the history of the whole iptables filtering business. Many would say that’s great, you learn from the basics, you get your concepts straight. And to them I say “F*#$ you”. I just want to secure my damn server, not take the RHCE. And finally, after three more hours of digging and reading the various “subtleties” of the IP protocol, I finally managed to figure out what to do to secure my server.
Write 2 lines. Yeah, just 2 lines – the result of spending 4 fucking hours is not enlightenment, just getting to write two lines. For those who are using mod_proxy and don’t have a linux networking guru to service you at your every beck and call, here are those two lines:
/sbin/iptables -A INPUT -p tcp -m tcp -s "your-hostname/ip/trusted subnet" --dport "application server port" -j ACCEPT
/sbin/iptables -A INPUT -p tcp -m tcp --dport "application server port" -j DROP
Where "your-hostname/ip/trusted subnet" should usually refer to the machine on which apache is running – in my case, the same machine. The "application server port" is the port on which CherryPy listens – by default I think it’s 8080. If you have multiple instances of CherryPy running, you would need to add similar rules for each instance (note: add the ACCEPT rules first, before you add the DROP rules).
Emacs, for those who know me, is something I am a big fan of, almost to the point of being religious. And recently I’ve found another one – Eclipse. Emacs, as most would know, is the ultimate editor, written in a dialect of lisp called elisp (which predates common lisp and the attempts to standardize lisp) – the result of a time and a place where almost every programmer wrote lisp, AI was a buzzword and Symbolics was a household name.
Thus, emacs was naturally written in the language of its time – lisp. With over 30 years under its belt, emacs is now a mature multipurpose software application that most people go to the extent of calling an operating system. What made emacs such a huge success story was that it was not only written in lisp, the language of the day, it was also extensible in lisp, the language that most programmers who first used emacs knew. Thus, every pet peeve of almost every programmer was solvable with just a few lines of elisp. Extensibility – that’s what made emacs a huge success. With packages for everything from terminal emulation, remote editing and newsreaders to even a web browser, emacs is one multipurpose software application.
With the coming of the AI winter, lisp lost ground and eventually gave way to Java. Java, being so heavily used in the past 10–20 years, has become the lingua franca of the time. And with Java we have another emacs incarnate, something that’s not only written in Java but also extensible in Java – eclipse. It has the same extensibility as emacs, though it is not as mature in terms of extensions. So, is Eclipse the next emacs?
It’s been quite a while since I wrote anything of any significance. My blog seems to have moved into a more or less vegetative state. Also, since I am in line for quite some writing in the coming days, I think it’s about time I did some emergency CPR here and got this blog back to life. Anyway, as a start, maybe I should start with a story. No, it’s not one about damsels in distress and charming princes. It’s a more mundane story about programming.
This happened not so long ago. I’ve always been a pretty good C++ programmer, and of late I’ve been doing a lot of my programming in python. Python, if I hadn’t mentioned before, is this amazing dynamic language which is amazingly easy to use and, more importantly, maintain. It’s one great language, except for its speed. For most practical purposes I never had any problems with the speed of python. But sometimes, when you have to wait for an hour to get some output on some data you are processing, it gets irritating. The task here was simple decipherment. I was basically using the EM algorithm (or to be more precise, the forward backward algorithm) for deciphering a piece of text. I managed to write a pretty good implementation of it in python, but it was slow – real slow.
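To give a flavour of the kind of code involved, here is a minimal sketch of just the forward pass of the forward backward algorithm, on a made-up two-state model. The states, probabilities and names here are all hypothetical toys for illustration, not my actual decipherment model:

```python
# A toy forward pass for an HMM: alpha[t][s] is the probability of
# seeing the first t+1 observations and ending in state s.
def forward(observations, states, start_p, trans_p, emit_p):
    alpha = [{}]
    # initialization: P(state) * P(first observation | state)
    for s in states:
        alpha[0][s] = start_p[s] * emit_p[s][observations[0]]
    # induction: sum over all states that could lead into s
    for t in range(1, len(observations)):
        alpha.append({})
        for s in states:
            alpha[t][s] = sum(alpha[t - 1][prev] * trans_p[prev][s]
                              for prev in states) * emit_p[s][observations[t]]
    return alpha

# hypothetical two-state model emitting two symbols 'x' and 'y'
states = ('A', 'B')
start_p = {'A': 0.6, 'B': 0.4}
trans_p = {'A': {'A': 0.7, 'B': 0.3}, 'B': {'A': 0.4, 'B': 0.6}}
emit_p = {'A': {'x': 0.9, 'y': 0.1}, 'B': {'x': 0.2, 'y': 0.8}}

alpha = forward(['x', 'y', 'x'], states, start_p, trans_p, emit_p)
likelihood = sum(alpha[-1].values())  # total probability of the sequence
```

The inner loop over states inside a loop over time steps is exactly the kind of tight numeric loop where interpreted python pays its overhead on every iteration.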
So, I sat down and rewrote the forward backward algorithm in C++ (in the time that my python program was running) and the speed difference was unbelievable. My C++ code ran 40 times faster than my hand optimized, psyco-compiled python code. If you have programmed in both C++ and python, you already knew that: C++ is faster than python, at least 10-fold on average. But that’s not the lesson here.
The most amazing thing was, I actually managed to write, debug and get a working version of the C++ program in less time than I would have expected it to take. That’s the most surprising part. So, I’ve decided to share my experience with you guys. One of the main things that really helped me during my C++ development was that not only did I have a very clear goal of what I was doing (which most software projects rarely have), I also had a very clear goal of how I was going to do it. This was because I had already implemented my original version in python.
Python, as someone has already said, is executable pseudo-code. I already had a very clear idea of what data structure to use where, how to model the various elements (in this case, the plain text, the cipher text, etc.) and how my models interact with each other. This was all already done; the only thing remaining was a more or less manual translation from python to C++. The whole lesson here is that python is not only a great language for exploratory programming, it’s a great language to prototype in as well.
I am sure that if I had started all this in C++ from the beginning, I would have been just too lazy to do all the refactoring that my code would have required. Changing from one type of object-method interface to another is pretty much a pain in C++. On the other hand, by the time I had my python code running, not only was it a correct working version, it was a well designed version as well. Any screw-ups in the initial design were promptly corrected without too much effort. Any useless “just in case virtual functions” that would have cropped up in my C++ were not there, because refactoring is so easy in python that you can fix such things as you go. And most of all, you can test for all the bigger logical errors that occur when you have multiple objects interacting with each other in a complicated program far more easily in python than in C++.
As an unexpected side effect, I picked up a couple of good habits from python that I would never have bothered with in C++ for my hobby programming. For example, unit testing. I write unit tests only if my projects get big enough that I think it’s worth the trouble, but with python, you always have this simple
if __name__ == '__main__' block which serves as a poor man’s unit test. Not too much trouble, yet worth every second you invest in writing simple tests there. These days, I do it as a matter of habit for all my python modules, and that’s one good habit that spontaneously extended to my C++. With a bit of preprocessor magic, you can do pretty much the same type of poor man’s unit testing in C++ as well, and this did save me some pain later.
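The pattern looks something like this – a module that runs a few sanity checks when executed directly, but stays quiet when imported. The cipher function here is a made-up toy, not code from my project:

```python
# poor man's unit test: checks run only when the module is executed
# directly (python thismodule.py), never when it is imported.

def caesar_shift(text, k):
    """Shift each lowercase letter by k places (a toy cipher)."""
    return ''.join(chr((ord(c) - ord('a') + k) % 26 + ord('a'))
                   if c.islower() else c
                   for c in text)

if __name__ == '__main__':
    # a couple of quick sanity checks
    assert caesar_shift('abc', 1) == 'bcd'
    assert caesar_shift(caesar_shift('hello', 5), -5) == 'hello'
    print('all tests passed')
```

Any other module can `import` this one and get `caesar_shift` without the tests ever running.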
Now that I’ve rather incoherently rambled on, I would like to summarize my experience. With python, you can not only prototype with great speed and get a clean implementation, you also end up picking up a lot of good habits along the way – habits that make you not only a better python programmer, but a better C++ programmer as well!
Huffman coding, for the uninitiated, is a compression scheme in computer science that assigns short binary representations to frequently used characters and longer binary representations to the less frequently used ones. The idea is that what you lose by encoding less frequently occurring characters with longer bitstrings, you gain by encoding more frequently used characters with shorter bitstrings.
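For the curious, the scheme is easy to sketch in a few lines of python – repeatedly merge the two least frequent subtrees, lengthening their codes by one bit each time (this is the standard algorithm, just a compact illustration, not production code):

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code: frequent symbols get shorter bitstrings."""
    freq = Counter(text)
    # heap entries: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(n, i, {ch: ''}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        # prefix one subtree's codes with '0', the other's with '1'
        merged = {ch: '0' + code for ch, code in c1.items()}
        merged.update({ch: '1' + code for ch, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_code('aaaabbc')
# the most frequent symbol 'a' ends up with the shortest code
```

Running it on a string like `'aaaabbc'` gives `'a'` a one-bit code while `'b'` and `'c'` each get two bits – exactly the frequent-is-short property discussed above.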
Having said that, something I noticed only lately is that human language seems like it’s encoded using a similar scheme. The more frequently a word is used, the shorter it is. For example, most of the closed class words like prepositions and determiners, and sometimes even commonly used verbs, are short, while the longer words are usually rarely used. Since I’ve been bragging lately about all this computing power that I have, it’s about time to flaunt it.
I used the Enron corpus, which contains about 250,000 unique email messages totalling approximately 1.5 gigs of text – small, but not too small. I plotted word frequency vs length of the word and the results seem as expected (click the image to see a better picture).
There is an initial peak when the length is about 3 characters or so, and then the frequency rapidly declines to almost nothing when the length increases to around 13 or so. A much clearer picture of the encoding that goes on here can be obtained if we normalize the frequency counts by the number of words of a given length. There are very few one letter words (namely the determiner ‘a’) but more two letter words (‘an’, ‘to’, ‘in’, etc.) and so on.
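The counting behind both plots boils down to something like this – raw token counts by length, then the same counts divided by how many distinct words of that length exist. This is a small sketch of the idea on an inline toy string, not the actual script I ran over the corpus:

```python
import re
from collections import Counter

def length_stats(text):
    """Token frequency by word length, raw and normalized by type count."""
    words = re.findall(r"[a-z']+", text.lower())
    freq_by_len = Counter(len(w) for w in words)   # total tokens per length
    types_by_len = Counter()                       # distinct words per length
    for w in set(words):
        types_by_len[len(w)] += 1
    # normalize: average number of tokens per distinct word of each length
    normalized = {k: freq_by_len[k] / types_by_len[k] for k in freq_by_len}
    return freq_by_len, normalized

freq, norm = length_stats("a man a plan a canal panama "
                          "to be or not to be that is the question")
# freq[1] == 3 ('a' three times), but norm[1] == 3.0 since 'a' is
# the only one letter type in this toy sample
```

Even on this tiny sample the normalization matters: one letter words are rare as types but heavily reused as tokens, which is the effect the second graph brings out.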
At first look the above graph looks like a negative exponential to me (meaning, good job on the compression). In fact, this is what one would expect if someone were to do a good job on the compression.
It’s really surprising to realise that what usually requires sophisticated critical thinking (schemes like Huffman encoding) can also be easily reproduced by random, unplanned phenomena like linguistic evolution.
Well, I’ve talked about big numbers before, but never before have I thought of 1.5 gigabytes of text corpora as actually small. Well, guess what – it is small. Tiny, even. This is what happens when you have the power to process multiple terabytes and all you have is a puny 1.5 gigabytes of data. That’s right, I’m now set up on two clusters, one of 3 nodes and one of 19 nodes, all running Hadoop – an open source version of the Google File System and Google’s MapReduce framework.
So, what is all the fuss about hadoop and map reduce? Haven’t people been doing such stuff for a long, long time? Well, yes and no. The idea of distributing your computation and then combining the results has been around for long, but what hadoop does is that instead of moving your data to the place of computation, it moves the computation to the location of the data. This allows you to run multiple independent jobs called ‘maps’, which work on each chunk of data independently, and then use their output in a ‘reduce’ step, which combines all the output of the map step into the final result that one desires. Programming in this model is fun, powerful and furthermore really, really simple.
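To see how simple, here is the model boiled down to a toy in-process word count in python. Real Hadoop jobs at the time were written in Java (or via streaming), so this is only a sketch of the map/reduce idea itself, with made-up chunks standing in for data blocks on different nodes:

```python
from collections import defaultdict
from itertools import chain

def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word, 1) for word in chunk.split()]

def reduce_step(pairs):
    """Reduce: sum up the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# each chunk would live on a different node; the maps run independently
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_step(chain.from_iterable(map_step(c) for c in chunks))
# counts["the"] == 3, counts["fox"] == 2
```

The point of the framework is that each `map_step` call needs nothing but its own chunk, so the scheduler is free to run it on whichever node already holds that chunk’s data.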
I will probably put up some ajaxy demos of some results that I’ve obtained with my new found computing power quite soon, so till then, stay tuned.
You know you are going to be an engineer, if not a scientist, if numbers fascinate you. And every time you encounter a big number for the first time your eyes pop open, baffled at even the existence of the concept – How can such large numbers exist? How many is really a mole? What! – the sun is 93 million miles from here? How far is 93 million miles really? These are the kinda questions you could ponder all day, and your brain would just tire from thinking about them.
Then there comes a time, so sneakily that you don’t suspect it has come. You become used to the numbers. They become fact. They become usual, and finally they become boring. Now the sun is no longer the exciting and unimaginable 93 million miles it was when you heard it for the first time. The mole is now just that ugly big number. It’s kind of a coming of age thing, a rite of passage if you will.
And computer science is no different; it offers its own bag of big numbers. In fact, you wouldn’t believe me if I told you most computer scientists are obsessed with big numbers. And so was I. My jaw dropped when I heard things like: this program consumes 300 MB of RAM. It takes 7 hours to finish. We need around 300 computers to do that. The pan-galactic gargle blaster effect. I was just as excited and confused as I was when I heard about the sun or the mole. But working with large data sets, multi-million word corpora and gigabyte size databases – for someone whose biggest database was a puny address book – these big numbers have become usual, rather sneakily if I might add. They’ve become matter of fact, common and boring. It’s not the pan-galactic gargle blaster anymore. It’s not even the strong russian vodka that any fellow comrade would swear by. It’s kinda become beer – American beer.
I guess I’ve come of age, the necessary rite of passage before I can contemplate terabytes, exponential order and thousand-node clusters.
Vish(!hick!) nu Vyas