Recently I needed to show a heat map of quite a lot of coordinate points for a little project of mine that ended up in a data visualization contest (which, unfortunately, I didn’t win, even though I made it to the finals). The idea was to show the distribution of georeferenced Wikipedia pages through a heat map, so when I first heard about openheatmap.com I knew it was the tool to use. OpenHeatMap.com is an excellent project by Pete Warden that takes a dataset as a CSV, Excel or Google Spreadsheet file and converts it into a nice, browsable heat map presentation.
For a problem I’m working on I got stuck in the classic local-maximum situation. After trying to work around the problem in several more or less creative ways, I thought of the simulated annealing algorithm. Considering it’s been a while since I last saw it, I searched for it on the web and, surprisingly, there is not much material about it, and the few bits I found are often contradictory. After quite a lot of digging I decided to write about it here. As a warning I should probably say that there will be some digging into basic statistics and complexity analysis, as well as a quick formal introduction to the knapsack problem. You should be able to follow even if you know nothing about those topics, but having some foundations in these areas will be of great help.
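To give a concrete picture of the technique before going into the details, here is a minimal sketch of simulated annealing applied to a toy 0/1 knapsack instance. The cooling parameters and the item values below are made up for illustration, not tuned for anything:

```python
import math
import random

def anneal_knapsack(values, weights, capacity,
                    t_start=10.0, t_min=0.01, cooling=0.9, steps=200):
    # A state is a 0/1 vector saying which items are in the knapsack;
    # we start from the empty (always feasible) knapsack.
    n = len(values)
    state = [0] * n

    def total(selection, amounts):
        return sum(a for s, a in zip(selection, amounts) if s)

    best, best_value = state[:], 0
    t = t_start
    while t > t_min:
        for _ in range(steps):
            # Neighbor move: flip one random item in or out.
            i = random.randrange(n)
            candidate = state[:]
            candidate[i] = 1 - candidate[i]
            if total(candidate, weights) > capacity:
                continue  # discard infeasible neighbors
            delta = total(candidate, values) - total(state, values)
            # Always accept improvements; accept worsening moves with
            # probability exp(delta / t), which shrinks as t cools down.
            if delta >= 0 or random.random() < math.exp(delta / t):
                state = candidate
                value = total(state, values)
                if value > best_value:
                    best, best_value = state[:], value
        t *= cooling  # geometric cooling schedule
    return best, best_value

# Toy instance: the true optimum is items 1 and 2 (value 22, weight 5).
random.seed(42)
best, value = anneal_knapsack([6, 10, 12], [1, 2, 3], capacity=5)
print(best, value)
```

The point of the acceptance rule is exactly the escape from local maxima: early on, when the temperature is high, the algorithm freely accepts downhill moves, and it becomes greedier as the temperature drops.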
A couple of days ago I finally got my Google Wave sandbox account. Given that I just finished developing my very first robot, I thought I’d share some impressions on the whole thing. From the user side, things are far from ready. Some important features are still missing: to name one, you can’t remove users from a wave once they have joined (nor, alternatively, is there any way to ignore a wave). Indeed, given that I joined several waves to try other people’s applications, I’m getting continuous notifications. But anyway, the whole thing feels to me like a great development playground where I can run all sorts of experiments.
One of the reasons I haven’t been writing on this blog much lately is that I’ve been terribly busy with university, having cleared six exams in six months. That said, for one of the three exams I still have left, I had to develop an inference engine written in C++. Since this was a fairly large project that had to deal with some NP-complete problems (see also: unification), and given that this was the first time I wrote something serious in C++ (i.e. something involving more than one class and not containing the string “Hello world”), I had the chance to learn quite a few new things.
I came back yesterday from the third Italian PyCon (aka pycon3), which was held in Florence, and all I can say is that it has been an amazing experience. I had the chance to meet a lot of great new people as well as the BDFL (who won’t be back in Europe for quite some time, as he said). Here follows a summary of what I think were the most interesting talks.
Last time I blogged about a new course I’m following at my university. This course, held by Pasquale Lops and Giovanni Semeraro, is so interesting that I’ll be developing a custom information retrieval engine as part of my internship project. I can’t tell much more at this point since the internship hasn’t started yet and I’m not sure I can release more details about this project (we’re still in the process of deciding if and how the whole thing will be released to the world).
In the meantime, I’ve been doing several experiments on this topic, mostly about the memory usage and performance of such a system on limited hardware. In practice this means implementing the algorithms you’ll be using and measuring the computational time they require.
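As a trivial example of the kind of measurement involved, the standard library’s timeit module is enough to time an implementation; the function below is just a placeholder for the real algorithm under test:

```python
import timeit

def linear_scan(haystack, needle):
    # Naive O(n) membership test, standing in for the algorithm under test.
    for item in haystack:
        if item == needle:
            return True
    return False

data = list(range(100000))
# Run the worst case (needle at the end) 50 times and report total seconds.
elapsed = timeit.timeit(lambda: linear_scan(data, 99999), number=50)
print("50 worst-case runs took %.3f seconds" % elapsed)
```

Repeating the statement many times, as timeit does, smooths out the noise you would get from a single wall-clock measurement.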
It has been about a week since I began a deeper study of information retrieval. Actually, everything began with a new course at my university on the subject, and I fell in love with it almost immediately. The fact is that this thing really got me interested, and I began doing some experiments (one involves django as well, keep reading to learn more).
This week I learned a lot about information retrieval, text categorization, natural language processing and machine learning. But the most relevant thing is: the principles are easy, their implementation is not. Most of the techniques are relatively simple, but you usually have to deal with very large datasets, and that can be challenging, since one of the main requirements in information retrieval is speed. It’s much more important to return slightly worse results in one second than better results in one hour: no one will ever use your system if it takes an hour to get a result. And if you’re considering storing your data in a database, forget about normalization; it wouldn’t really take you anywhere.
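To make the speed point concrete: the classic structure behind fast retrieval is the inverted index, which spends memory and indexing time up front so that looking up the documents containing a term is a single dictionary access. A minimal sketch, with made-up sample documents:

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: intersect the posting sets of every query term.
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = ["the quick brown fox", "quick tests run fast", "slow brown bear"]
idx = build_index(docs)
print(search(idx, "quick brown"))  # ids of documents containing both terms
```

Query time depends on the posting list sizes, not on the total size of the collection, which is exactly the trade-off the paragraph above is about.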
Running django with FastCGI is not a difficult task, thanks also to the excellent documentation provided. However, the docs only provide a very basic script to automate starting and stopping the FastCGI process, so today I wrote my own, letting the script handle the various situations so I don’t have to fix things manually when something goes wrong.
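For reference, the skeleton of such a script can be as small as the sketch below. All the paths and names are placeholders standing in for my setup, and the runfcgi options mirror the ones shown in django’s FastCGI documentation:

```python
import os
import signal
import subprocess
import sys

# Placeholder paths; adjust them to your own deployment.
PROJECT_DIR = "/path/to/mysite"
PIDFILE = "/tmp/mysite-fcgi.pid"
SOCKET = "/tmp/mysite-fcgi.sock"

def start():
    # Launch the daemonized FastCGI server from the project directory.
    subprocess.check_call(
        ["python", "manage.py", "runfcgi",
         "socket=" + SOCKET, "pidfile=" + PIDFILE, "daemonize=true"],
        cwd=PROJECT_DIR)

def stop():
    # Terminate the process recorded in the pidfile, if any.
    if not os.path.exists(PIDFILE):
        return False
    with open(PIDFILE) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError:
        pass  # stale pidfile: the process is already gone
    os.remove(PIDFILE)
    return True

def main(argv):
    actions = {"start": start, "stop": stop,
               "restart": lambda: (stop(), start())}
    if len(argv) != 2 or argv[1] not in actions:
        print("usage: %s start|stop|restart" % argv[0])
        return 2
    actions[argv[1]]()
    return 0
```

Hooked up with `sys.exit(main(sys.argv))` at the bottom, this gives a single entry point you can call from an init script or a watchdog cron job; the stale-pidfile handling in stop() is the kind of “various situations” cleanup I meant above.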
Today I finished one of my side projects: pytagram. Basically it generates an SVG file (which can subsequently be saved as eps/pdf/whatever and, if needed, manually manipulated) starting from a tree-like plain text file. This can be useful for generating cheat sheets or quick references to the classes or functions that belong to some project.
I did this to generate a django quick reference (here it is), since django has a lot of functions and I know their purpose, but I can never remember their names (and now two A4 sheets are right in front of me).
If you’re interested, check out the Google Code project page and grab your copy from the SVN repository.
There are tons of things that could be changed or optimized (e.g. adding optional short explanations of each function, more examples, an easier way to change colors, …), but the code already works quite well, so it can already be useful to people out there.
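For pytagram’s actual input format, see the repository; just to give an idea of the general approach, here is a toy sketch (not pytagram’s code, and with a made-up indentation convention) that turns an indentation-based outline into SVG text elements using only the standard library:

```python
import xml.etree.ElementTree as ET

def tree_to_svg(text, line_height=22, indent_width=24):
    # Each line's leading spaces give its depth (4 spaces per level);
    # deeper lines are drawn further to the right, one row per line.
    lines = [l for l in text.splitlines() if l.strip()]
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width="400", height=str(line_height * (len(lines) + 1)))
    y = line_height
    for line in lines:
        depth = (len(line) - len(line.lstrip())) // 4
        node = ET.SubElement(svg, "text",
                             x=str(10 + depth * indent_width), y=str(y))
        node.text = line.strip()
        y += line_height
    return ET.tostring(svg, encoding="unicode")

outline = "django.db\n    models\n        Model\n    connection"
print(tree_to_svg(outline))
```

Since SVG is just XML, ElementTree is all you need to emit it, and the result opens directly in any vector editor for further manual tweaking.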
When I was redesigning this site, I experimented with many different options for the header. Among the whole set of solutions I tried, I was very happy with the one I’m going to illustrate, even though I chose another one (the one you can see now) because it integrates better with the whole layout.