Category Archives: Python

Clustering coordinate points together with quad-trees

Recently I needed to show a heat map of quite a lot of coordinate points for a little project of mine that ended up in a data visualization contest (which unfortunately I didn’t win, even though I made it to the finalists). The idea was to show the distribution of georeferenced Wikipedia pages through a heat map, so when I first heard about openheatmap.com I knew it was the tool to use. OpenHeatMap.com is an excellent project by Pete Warden that takes a dataset as a CSV, Excel or Google Spreadsheet file and converts it into a nice, browsable heat map presentation.

My Italian PyCon experience

I came back yesterday from the third Italian PyCon (aka pycon3), which was held in Florence, and all I can say is that it has been an amazing experience. I had the chance to meet a lot of great new people as well as the BDFL (who won’t be back in Europe for quite some time, as he said). Here follows a summary of what I think were the most interesting talks.


Optimize your programs

Last time I blogged about a new course I’m following at my university. This course, held by Pasquale Lops and Giovanni Semeraro, is so interesting that I’ll be developing a custom information retrieval engine as part of my internship project. I can’t say much more at this point, since the internship hasn’t started yet and I’m not sure I can release more details about this project (we’re still in the process of deciding if and how the whole thing will be released to the world).

In the meantime, I’ve been doing several experiments on this topic, mostly about the memory usage and performance of such a system on limited hardware. In practice, this means implementing the algorithms you’ll be using and measuring the computational time they require.
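
A minimal sketch of what such a measurement could look like in Python, using only the standard library (`timeit` for wall-clock time, `tracemalloc` for peak memory). The toy inverted index, the `build_index` and `measure` names, and the sample documents are assumptions made for illustration, not the engine described in this post.

```python
import timeit
import tracemalloc
from collections import defaultdict


def build_index(docs):
    """Toy inverted index: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def measure(func, *args, repeat=3):
    """Return (best wall-clock seconds, peak bytes allocated) for func(*args)."""
    # Time several independent runs and keep the fastest one,
    # which is the least disturbed by other activity on the machine.
    best = min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))

    # Trace allocations in a separate run so tracing overhead
    # does not distort the timing above.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


# Hypothetical sample data, repeated to make the work measurable.
docs = ["the quick brown fox", "the lazy dog", "quick quick dog"] * 1000
seconds, peak_bytes = measure(build_index, docs)
print(f"best of 3: {seconds:.4f}s, peak memory: {peak_bytes / 1024:.1f} KiB")
```

The actual numbers depend heavily on the hardware and the Python version, so they are only meaningful when comparing alternative implementations on the same machine.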


What I learned about information retrieval in one week

It has been about a week since I began a deeper study of information retrieval. Actually, everything began with a new course about it at my university, and I fell in love almost immediately. This thing really got me interested, and I began doing some experiments (one involves Django as well, keep reading to know more).

In this week I learned a lot about information retrieval, text categorization, natural language processing and machine learning. But the most relevant thing is: the principles are easy, their implementation is not. Most of the techniques are relatively simple, but you usually have to deal with very large datasets, and this can be challenging, since one of the main requirements of information retrieval is time. It’s much more important to give fewer results in one second than better results in one hour. No one will ever care to use your system if it takes an hour to get some result. And if you’re considering storing your data in a database, forget about normalization; it won’t really get you anywhere.
