Archive for the ‘Virtual Machines and Programming Languages’ Category
For everyone who comes from the C++ world of dangling pointers and manual resource management, Java looks deceptively similar until you look under the hood. (I’m pretty much a newbie to Java who has been forced to learn its quirks, primarily because of Hadoop.) Most of these tips are written for grad students whose life cycle can be succinctly summarized as “program, collect data, evaluate, rinse, repeat”.
Tip 1. Get a good IDE (read Eclipse)
Unlike the world of C++, where the standard library is essentially small and can for the most part be held in your head, Java has a massive standard library with features for everything under the sun. Keeping all of that in your head is not for the faint of heart. That’s where a good IDE like Eclipse comes in; its content assist features in particular can make your life a lot easier. The IDE can also take a lot of the pain out of managing your projects and libraries (because Java has preferred conventions for writing and using libraries). One more amazing thing about Eclipse is that it has a nifty Java debugger which lets you inspect, among other things, the threads currently running in your program.
If you think Eclipse is massive bloat, or you don’t have access to a powerful system for your development (this happens to me sometimes, when I have to ssh into some machine and change, recompile, rinse, repeat while running experiments), then get JDEE (an Emacs plug-in).
Tip 2. Get Ant for Building
Ant is a simple Java build system based on XML (yeah, yeah, you have to write XML, stop complaining). In fact, you can take their tutorial build file and start from there. It works with very few modifications unless your project has special needs (in which case, the Internet is your friend). Say goodbye to makefiles, man up and write some XML.
Tip 3. When you have multiple cores, use Java threads.
Java has threading built in. Even though thread synchronization has its pains, Java makes things a bit easier, and the best part is that on any Linux distribution with NPTL (which is almost every distribution out there) the Java threading model is essentially 1:1, one Java thread per native thread, which means you can use all those cores to do the hard work. This is great if you are running experiments under different conditions, where each run of the experiment is essentially independent of the others. In my experience this usually gives a performance boost of more than 20-25% (but then, I have access to a machine with 8 cores and 8 GB of RAM, so YMMV).
Here is a really nice tutorial to get started on Java threads
Tip 4. Learn to use Scanner and Console classes.
For those who mostly work with input and output in text files, Scanner is one of your best friends. It essentially encapsulates the work of reading a line, splitting it into parts and converting them into your favorite data types. And for interactive input, learn to use the Console class, which provides a nice interactive console. If the simple readLine method does not satisfy your needs, Java has a complete regex library, and again, the Internet is your friend.
Tip 5. Use jconsole
And for the last tip of the day, learn to use jconsole, a monitoring tool for Java applications which lets you inspect memory usage and active threads. If you are using the tool locally, then for the most part no configuration is required: type jconsole at a command prompt and you should see a window listing the Java processes that are running, along with their PIDs. Connect to the one you want to monitor and you are ready to track your Java process.
Things get a bit tricky with remote processes and security issues, so my advice is to set up a VPN or use ssh tunneling. More information on using jconsole is available here.
So, that’s all for now.
It’s been quite a while since I wrote anything of any significance. My blog seems to have moved into a more or less vegetative state. Also, since I am in line for quite a lot of writing in the coming days, I think it’s about time I did some emergency CPR here and got this blog back to life. Anyway, maybe I should start with a story. No, it’s not one about damsels in distress and charming princes. It’s a more mundane story about programming.
This happened not so long ago. I’ve always been a pretty good C++ programmer, and of late I’ve been doing a lot of my programming in Python. Python, if I haven’t mentioned it before, is an amazing dynamic language which is amazingly easy to use and, more importantly, to maintain. It’s one great language, except for its speed. For most practical purposes I never had any problems with the speed of Python. But when you have to wait an hour to get some output on the data you are processing, it gets irritating. The task here was simple decipherment. I was basically using the EM algorithm (or, to be more precise, the forward-backward algorithm) to decipher a piece of text. I managed to write a pretty good implementation of it in Python, but it was slow, real slow.
So I sat down and rewrote the forward-backward algorithm in C++ (in the time that my Python program was running), and the speed difference was unbelievable. My C++ code ran 40 times faster than my hand-optimized, psyco-compiled Python code. If you have programmed in both C++ and Python, you already knew that: C++ is faster than Python, at least 10-fold on average in my experience. But that’s not the lesson here.
The most amazing thing was that I actually managed to write, debug and get a working version of the C++ program in less time than I expected it to take. That’s the most surprising part, so I’ve decided to share my experience with you guys. One of the main things that really helped me during my C++ development was that I not only had a very clear goal of what I was doing (which most software projects rarely have), but also a very clear idea of how I was going to do it. This was because I had already implemented my original version in Python.
Python, as someone has already said, is executable pseudo-code. I had a very clear idea of what data structure to use where, how to model the various elements (in this case the plain text, the cipher text, etc.) and how my models interact with each other. This was all already done; the only thing remaining was a more or less manual translation from Python to C++. The whole lesson here is that Python is not only a great language for exploratory programming, but a great language for prototyping as well.
I am sure that if I had started all this in C++ from the beginning, I would have been just too lazy to do all the refactoring that my code would have required. Changing from one type of object-method interface to another is pretty much a pain in C++. On the other hand, by the time I had my Python code running, it was not only a correct working version but a well-designed one as well. Any screw-ups in the initial design were promptly corrected without too much effort. The useless “just in case” virtual functions that would have cropped up in my C++ were not there, because refactoring is so easy in Python that you can add them as you go. And most of all, the bigger logical errors that occur when multiple objects interact with each other in a complicated program are easier to test for in Python than in C++.
As an unexpected side effect, I picked up a couple of good habits from Python that I would never have bothered with in C++ for my hobby programming. Unit testing, for example. I normally write unit tests only if my projects get big enough that I think it’s worth the trouble, but with Python you always have this simple
if __name__ == '__main__' block, which serves as a poor man’s unit test. It’s not too much trouble, yet worth every second you invest in writing simple tests there. These days I do it as a matter of habit for all my Python modules, and that’s one good habit that spontaneously extended to my C++. With a bit of preprocessor magic you can do pretty much the same type of poor man’s unit testing in C++ as well, and this did save me some pain later.
Now that I’ve rather incoherently rambled on, I would like to summarize my experience. With Python you can not only prototype with great speed and get a clean implementation; you also end up picking up a lot of good habits on the way that make you not only a better Python programmer, but a better C++ programmer as well!
Caution: When I say esoteric, I mean non-mainstream, as opposed to things like INTERCAL or brainfuck.
The first thing that anyone who gets to know me in a professional capacity seems to find unusual about me is that I can program in a couple of languages that are very non-mainstream, things like Haskell and Smalltalk. They consider it a rather time-wasting, if not utterly useless, hobby.
One thing a friend of mine asked is why these languages even exist in the first place, since practically no one uses them. That’s one question I had never bothered to ask myself, in spite of getting to play with 20 or so languages. He considers languages such as Haskell practically useless, in the sense that there is almost no mainstream development going on in them, and sees very little point in even trying to develop new ones.
Being a language enthusiast, I came up with the plethora of standard reasons that language enthusiasts do. Trite, boring old reasons like productivity, higher expressive power and whatnot. Then there were the reasons which I always dish out in a half-believing manner, like how, if it weren’t for Sun’s marketing muscle, Smalltalk would have been the order of the day, and things like that. But what struck me as unusual was the part of the question about what purpose, if any at all, they serve, apart from satiating the bloated egos of self-proclaimed language enthusiasts.
Only after some deeper thinking could I answer that question for myself in a much clearer manner. Either through short-sightedness or arrogance, I had never seen this angle. These languages are fertile breeding grounds for newer ideas, paradigms and sometimes even groundbreaking innovations in the way we program (as opposed to just newer linguistic constructs). It is entirely plausible for those same innovations to come from the mainstream languages, and once in a while they do, like the STL for instance. But generally they don’t.
That, in my opinion, is the bane of any mainstream language. Mainstream languages, by virtue of being mainstream, have a tradition in the way things are done: style guides, language restrictions, limits of the runtime or other restrictions. New innovations, even if they are good, need a lot of pushing from within the community to gain any acceptance. On the other hand, in fringe languages like Haskell or ML there is less community inertia, if any at all, so they can easily push newer innovations; it’s much easier to fork into newer territories or basically explore the unexplored.
These are not just fringe languages, as I’ve referred to them before; they are in fact frontier languages. They are usually at the edges of current paradigms, and sometimes they just fall flat over the edge without ever coming up with anything new. On the other hand, sometimes truly interesting ideas come out of them. Many a time, these ideas are incorporated into older, more mainstream languages. But once in a while there comes an idea or a philosophy associated with a language that’s so different, groundbreaking and amazing that it simply is not possible to do the back-porting anymore. Then the language has no choice but to go mainstream; a case in point is Ruby and Ruby on Rails.
That’s what we need those esoteric languages for. That’s where these language enthusiasts come in. They are the ones who will discover the next big thing. That’s precisely the need for esoteric languages.
It’s been a long time since I wrote anything remotely techy. Though I’ve taken a small break from the outside world for personal reasons, I’ve still been hacking around, the usual fun stuff, you see. Now that Python 2.5 is out with a whole lot of new features, I thought I should check it out. And well, I’m impressed by the amount of work the guys have done, though I still have a few peeves with it too. This is just a cursory glance, and my opinions could well change. So here is what I’ve gleaned so far from less than two days of hacking around. I will discuss the features that I found the most interesting.
Speed : This is one area where they have made lots of improvements. There is a perceptible difference in speed between the previous release and this release. I moved a lot of the maintenance scripts that I use to clean up my database over to Python 2.5 without any problems. So a tip of the hat to the guys who’ve done the good work on speed. I haven’t checked compatibility so far, but if it doesn’t create new dependencies, I guess a lot of my mainstream projects would definitely get a power boost, because they are basically frameworks other people use.
Co-Routines : Well, generators just got better with this version of Python. They have been morphed into full-blooded co-routines, which already makes me think of the possibilities. For those who are new to these things, co-routines are functions on steroids. Normal functions have a single point of entry and a single point of return (which may be decided dynamically, of course). Co-routines are cousins of functions which can be entered at more than one point and left at more than one point, and hence they can be suspended and resumed. Though this looks like some form of primitive co-operative multitasking (which it is), they are also amazingly powerful programming constructs.
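To make the "suspended and resumed" part concrete, here is a minimal sketch of a generator used as a co-routine through send(), the method Python 2.5 added to generators. The running_average function is a made-up example, and the code is written for current Python (on 2.5 itself the priming call would be averager.next() rather than next(averager)).

```python
def running_average():
    """Co-routine: receives numbers via send() and yields the mean so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        # Execution pauses at this yield; send(value) resumes it with value.
        value = yield average
        total += value
        count += 1
        average = total / count


averager = running_average()
next(averager)               # prime the co-routine: run up to the first yield
first = averager.send(10)    # 10.0
second = averager.send(20)   # 15.0
third = averager.send(30)    # 20.0
averager.close()             # raises GeneratorExit inside the co-routine
```

The local variables total and count survive between calls to send(), which is exactly the state that an ordinary function would lose on return.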
There are lots of situations where you could put this to use. Sometimes it’s easier to imagine your program as moving from one state to another (exhaustive searches on graphs), or even better, there are lots of occasions where you can consider the whole program to be a graph and the flow of control as traversing that graph in a particular order. Say, a small MUD for example. It’s a classic situation where co-routines can be put to good use. Another great place where you could use these is in finding patterns. In conjunction with partials (which I will discuss below) you can probably write pretty fast regular expression matchers. A boon for all those bioinformatics guys.
Partials : If you have any experience in functional programming, this shouldn’t be new to you at all. Partials (partially applied functions) are functions which are built out of other functions with some arguments filled in with particular values. Kinda like C++’s default arguments, but a partial always takes fewer arguments than the original function it was constructed from. Let’s take an example so we can see them a bit more clearly.
Consider the function add, which takes two arguments a and b, so you define it as
add a b -> a + b
Now consider another function inc, which is a partial built from the original function:
inc a -> add 1 a
Of course, this inc a is a partial application of the add function. Here only one of the arguments of add is fixed, and the other is passed down from the inc function. Now, if the function could store state (or is a co-routine) and is partially applied in another function, then that gives us a powerful control structure. Imagine the possibilities…
Now that you have drooled over all that, I will give you the bad news. The syntax is absolutely clumsy. The syntax I used above is not the actual syntax, because the actual one is too verbose. This could just be my Lisp/Haskell biases talking, but in general I consider the syntax clumsy and not well thought out. So that’s a big disappointment there.
Overall, at first glance, Python 2.5 seems a good improvement with few disappointments. Since I’m itching to get something working with 2.5, time to get coding now.
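For comparison, here is a minimal sketch of the same add/inc example in real Python, using functools.partial, the standard-library helper that arrived with Python 2.5. The names add and inc are just the ones from the pseudo-notation above.

```python
from functools import partial


def add(a, b):
    """Plain two-argument function, the 'original' being specialised."""
    return a + b


# Fix the first argument of add to 1; inc takes only the remaining argument.
inc = partial(add, 1)

result = inc(41)  # equivalent to add(1, 41)
```

The partial object remembers both the wrapped function (inc.func) and the frozen arguments (inc.args), so it can be inspected as well as called.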
Vishnu Vyas (your favourite dungeon monkey)
Writing a virtual machine is serious fun. But one thing that you probably shouldn't be doing is designing the virtual machine and the compiler at the same time. You ask why? Well, there is a very good reason. If it's a single person doing both the development (coding) and the design of a virtual machine, then the part of your brain which lets you write code (the same part that also lets you sit through CAT) is working alongside the part that you use when you are doing much more creative stuff, say, like writing blogs.
Now, what's the problem with more of your brain working at the same time? Well, ask any schizophrenic and he will tell you. The reason you shouldn't do both at the same time is that when you are designing a virtual machine you are letting your creative side flow, whereas a compiler is not as much a problem of design as it is a problem of assembly. And the two require extremely different mind-sets.
When you are designing a compiler, it's like you are building something out of Lego blocks. You have to nitpick it to perfection. Every component has to talk to every other component properly. That means your interfaces, and the data structures that move from one component to the other, must be clearly defined. It is, in short, a process of pain and patience. But if you are out to write a compiler, don't fret, because the rewards are surely great. Write one compiler, and you will master almost all the programming languages that are out there. When you see the syntax trees flowing, the scopes unfolding, and the recursion proceeding in a cyclic loop generating code, you will attain nirvana. Trust me on that.
And now, about that virtual machine. This is more like solving a challenging math problem. It's not as if you are writing a proof for publication; it's just like solving a problem for fun. It's more of a creative endeavour. You have a billion options in design, each one with its own pros and cons, and each design decision affects the way certain language constructs must be generated. This is not only serious creative thinking, but an extremely worthwhile exercise in critical thinking as well. And this spell should never be interrupted by the part of the brain that is actually in charge of the compiler. When both are working together, the two parts get involved in a time-wasting dogfight, and sometimes you may even be seen arguing with yourself, raising doubts about your sanity in the minds of others. And those are just the minor irritations.
The major ones come up when the compiler brain comes to your VM brain and starts arguing about who should handle internal name resolution, or about the fact that lexical scopes are handled by renaming rather than by a true runtime spaghetti stack. Or, since the VM tries to be elegant and minimal, you can't even have RTTI that's better than C++'s, in spite of the compiler actually compiling some breed of dynamically typed language. It's like in the movies, when there is an angel on your right and a crazy guy in a red drape on your left, poking you with a trident.
The greatest mistake you could make while designing a virtual machine would be to let that evil little red guy win! So, if you are ever writing a virtual machine, don't let that evil little red guy win.
The thing that's been ringing in my mind recently is that garbage collection is an inherently random process. It's not just garbage collection: memory allocation and memory access patterns can also be modelled as random processes. And I also have a very strong hunch that they are all related. I wish I could test all this with some real data, but both my virtual machine and the compiler for it are still incomplete, and there are no real programs yet to test my theory on. But I am pretty confident that garbage collection can be modelled as a random process, and that doing so would give us some real insights into improving garbage collectors.
In fact, I have the following theory. The lifetime of an object on a gc'd heap is not just a function of its past lifetime (as in generational gc) but also a function of the amount of information it contains (one rough measure of that would be the size of the object itself). If I had an applicable and verifiable model, it would really do the world a lot of good, with gc'd languages being in fashion (and not seeming to go out of fashion anytime soon).
Also, I guess the design of a virtual machine would affect the way garbage collection takes place, but only in terms of the number of accesses (i.e., quantitatively rather than qualitatively). What would really matter is the memory allocation/garbage collection algorithm. I am also very interested in the role finalisation plays in garbage collection. (If you have ever tried writing a gc, you know finalisation is a big pain in the a**.)
But unfortunately, these are just theories, and every scientific theory should be tested on the solid ground of experimentation, and good experimentation requires good data. Since my final year project was just a VM lacking any angle, I guess I've finally got my angle. So I've decided to post here about what's happening whenever I have some results, or at least something interesting in this regard. So watch this space!
It’s been a long while since I blogged, partly because I am up to my neck in work and partly because I don’t have access to a computer outside of my job. For someone who’s been addicted to computers, spending time without one is highly frustrating, so frustrating that I have actually taken to reading books (the dead-tree kind) on economics!
I’ve been doing more than my fair share of development these days. Not that I am complaining, just that I have less time for the other interesting stuff I would like to do, like working on my pet projects or blogging. And due to self-imposed constraints on quality, especially with production code, test-driven development is slowly becoming second nature to me. I still can’t wrap my mind around writing the tests first, before sitting down to write the actual code. I’ve tried that, but my normal linear, sequential brain simply cannot handle it.
However, one workaround for people with fewer gray cells, like me, is to co-develop your test routines along with the code. That’s what I’ve been doing these days, and I can assure you that it saves you a lot of trouble. It surely has saved me tons of it. Unit testing has become a way of life for me, just that my tests are written “while” I write my code rather than before writing it (agile) or after (??) writing it.
Here is a little snippet from an experience that I would like to share. One of the hardest things to write is “proper” concurrent code. The multi-threaded kind is fun enough, and there everything resides on one computer and within the same address space, so sharing data is easy. Now imagine the same scenario, except that your threads might run on any of the many machines available in a cluster. Not only that, from the programmer’s point of view they should behave just like threads in a program, and sharing data should be just as easy.
One of the easiest ways of doing that is simply to use a virtual machine (like the JVM or .NET) and let the virtual machine handle the threading. It’s not too difficult to let the VM take care of threads and handle the data requests. It can all be done with some form of message passing (in fact, my first prototype actually passes ad-hoc message objects marshalled across a socket). The difficulty arises because, unlike on a uniprocessor machine, here you get true parallelism. And, as with any good gift, there is one big disadvantage to take care of: debugging this code is scary. When you have the whole infrastructure in place, there are so many components running in parallel that finding out what went wrong where is a job best left to Sherlock Holmes.
A manageable solution to that problem is unit testing, and writing the tests along with the code has some advantages. One thing that runs in your mind when you are writing your tests is “how would my code handle this test?”. In fact, I spent more than half my time working out (in my head) how my code would handle the tests.
This process, imho, has the same advantage that writing (pen/paper/MS Word style) has: it gives shape to your ideas. In fact, by the time I had finished writing some of my tests, I had a very good idea of the expected behaviour and the actual behaviour of my code. Sometimes I caught errors even before I ran the tests. Writing tests is actually a way of looking at your code from a completely different point of view. Also, when you are writing tests along with the code you are developing, the code is so fresh in your mind that it’s a lot easier to reason about than when you write the tests afterwards.
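As a small illustration of writing the test right next to the code, here is a sketch in Python. Everything in it is invented for the example: the Mailbox class, its method names and the test function are hypothetical, and it uses ordinary threads on one machine with the standard queue module as a stand-in for messages marshalled across a socket.

```python
import queue
import threading


class Mailbox:
    """Thread-safe inbox: any thread may post, the owner drains."""

    def __init__(self):
        self._messages = queue.Queue()

    def post(self, message):
        self._messages.put(message)

    def drain(self):
        """Return every message currently queued, without blocking."""
        received = []
        while True:
            try:
                received.append(self._messages.get_nowait())
            except queue.Empty:
                return received


def test_no_message_is_lost(senders=4, per_sender=50):
    # Written alongside Mailbox while asking:
    # "how would my code handle many senders posting at once?"
    box = Mailbox()

    def send_all(sender_id):
        for n in range(per_sender):
            box.post((sender_id, n))

    threads = [threading.Thread(target=send_all, args=(i,))
               for i in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    received = box.drain()
    expected = [(i, n) for i in range(senders) for n in range(per_sender)]
    # Arrival order across senders is not deterministic, so compare sorted.
    assert sorted(received) == expected
    return len(received)


if __name__ == "__main__":
    print(test_no_message_is_lost(), "messages delivered")
```

Note that the test deliberately checks a property (nothing lost, nothing duplicated) rather than an exact ordering, since ordering is the one thing concurrent code cannot promise.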
So, maybe, test-driven development has something to it after all.