It has been about a week since I began doing a deeper study of infor­ma­tion retrieval. Actu­ally, every­thing just began with a new course at my uni­ver­sity about that and I just fallen in love almost imme­di­ately. The fact is that this thing really got me inter­ested, and I began doing some exper­i­ments (one involves django as well, keep read­ing to know more).

In this week I learned a lot of things about infor­ma­tion retrieval, text cat­e­go­riza­tion, nat­ural lan­guage pro­cess­ing and machine learn­ing. But the most rel­e­vant thing is: the prin­ci­ples are easy, their imple­men­ta­tion is not. The fact is that most of the tech­niques are rel­a­tively simple but you usu­ally have to deal with very large datasets and this could be chal­leng­ing, since one of the main require­ments about infor­ma­tion retrieval is time. It’s really much more impor­tant that you give less results in one second rather than giving better results in one hour. No one will ever care to use your system if it takes an hour to get some result. And if you’re con­sid­er­ing to store your data in a data­base forget about nor­mal­iza­tion, it wouldn’t really take you anywhere.

Talk­ing about stor­ing infor­ma­tions, you know that if you’re deal­ing with doc­u­ments most of the words are the so called stop words. Those stop words are words that doesn’t really mean any­thing, but they help the read­ers to get a better text flux. Clas­sic exam­ples of stop words are arti­cles like “the”, “a”, “an” or logic con­nec­tors like “or” and “and”. These words are so common that their pres­ence is quite use­less since they’re are… every­where. If you’re going to study infor­ma­tion retrieval than you’ll learn about a weight­ing tech­nique called tf-​idf that gives a weight near to 0 to these words, but since you’d prob­a­bly use a reverse index for words (an index that given a word, tells you in which doc­u­ments that word appears) you can under­stand that this would take a lot of space if you’re going to include stop words.

So one of the biggest issues until now is that you’re going to deal with extremely large datasets, so you have to strip as many things as pos­si­ble. Now con­sider those words: “fish­ing”, “fishes”, “fish”. They all talk about “fish”, and an user that is search­ing for “fish” would prob­a­bly be inter­ested in “fishes” or “fish­ing” as well. Addi­tion­ally, it’s use­less to store three words that are almost iden­ti­cal. So here comes the stem­ming that, by quot­ing the related wikipedia page, is the process for reduc­ing inflected (or some­times derived) words to their stem, base or root form – gen­er­ally a writ­ten word form. For­tu­nately, if you’re deal­ing with eng­lish texts, there’s the Porter algo­rithm that is the state-of-the-art algo­rithm for this sort of things. But that works only with eng­lish, so if your doc­u­ments are writ­ten in another lan­guage or they are writ­ten in mul­ti­ple lan­guages, things are going to be complicated.

This leads to think about the prob­lem of the lan­guage iden­ti­fi­ca­tion. How do you know if some text is writ­ten in a lan­guage or in another just by look­ing at it? Of course you can describe the document’s lan­guage with some kind of meta tag­ging, but not all the doc­u­ments have this kind of descrip­tion, just think about the web. There are some kind of sta­tis­ti­cal meth­ods based upon the clas­si­fi­ca­tion of n-grams but I haven’t deeply inves­ti­gated about them yet, so I can’t really say anything.

Now you got your col­lec­tion of doc­u­ments that match a cer­tain query. Now: how do you know what doc­u­ment is more rel­e­vant than another (in other words: how do you rank pages)? You got two alter­na­tives (well, prob­a­bly more, but I know just these at this moment): the tf-​idf that we said above and the cosine sim­i­lar­ity. The latter is an inter­est­ing one: con­sider the tf-​idf vec­tors of the doc­u­ments, then con­sider the query as a doc­u­ment too. Now plot those tf-​idf vec­tors and mea­sure their cosine of the angle between them. The more you’re near to 1, the more rel­e­vant is the document.

There are a lot of other impor­tant things that need to be said like the pre­ci­sion and recall con­cept, but that’s enough for now. I’ll talk about this another time.

Anyway I’m doing an exper­i­men­tal project named django search­able. It’s a plug­gable app for django that imple­ments an infor­ma­tion retrieval engine based on tf-​idf weight­ing. Play with it if you’re brave enough.