What I learned by information retrieval in one week
October 19th, 2008
It has been about a week since I began doing a deeper study of information retrieval. Actually, everything just began with a new course at my university about that and I just fallen in love almost immediately. The fact is that this thing really got me interested, and I began doing some experiments (one involves django as well, keep reading to know more).
In this week I learned a lot of things about information retrieval, text categorization, natural language processing and machine learning. But the most relevant thing is: the principles are easy, their implementation is not. The fact is that most of the techniques are relatively simple but you usually have to deal with very large datasets and this could be challenging, since one of the main requirements about information retrieval is time. It’s really much more important that you give less results in one second rather than giving better results in one hour. No one will ever care to use your system if it takes an hour to get some result. And if you’re considering to store your data in a database forget about normalization, it wouldn’t really take you anywhere.
Talking about storing informations, you know that if you’re dealing with documents most of the words are the so called stop words. Those stop words are words that doesn’t really mean anything, but they help the readers to get a better text flux. Classic examples of stop words are articles like “the”, “a”, “an” or logic connectors like “or” and “and”. These words are so common that their presence is quite useless since they’re are… everywhere. If you’re going to study information retrieval than you’ll learn about a weighting technique called tf-idf that gives a weight near to 0 to these words, but since you’d probably use a reverse index for words (an index that given a word, tells you in which documents that word appears) you can understand that this would take a lot of space if you’re going to include stop words.
So one of the biggest issues until now is that you’re going to deal with extremely large datasets, so you have to strip as many things as possible. Now consider those words: “fishing”, “fishes”, “fish”. They all talk about “fish”, and an user that is searching for “fish” would probably be interested in “fishes” or “fishing” as well. Additionally, it’s useless to store three words that are almost identical. So here comes the stemming that, by quoting the related wikipedia page, is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form. Fortunately, if you’re dealing with english texts, there’s the Porter algorithm that is the state-of-the-art algorithm for this sort of things. But that works only with english, so if your documents are written in another language or they are written in multiple languages, things are going to be complicated.
This leads to think about the problem of the language identification. How do you know if some text is written in a language or in another just by looking at it? Of course you can describe the document’s language with some kind of meta tagging, but not all the documents have this kind of description, just think about the web. There are some kind of statistical methods based upon the classification of n-grams but I haven’t deeply investigated about them yet, so I can’t really say anything.
Now you got your collection of documents that match a certain query. Now: how do you know what document is more relevant than another (in other words: how do you rank pages)? You got two alternatives (well, probably more, but I know just these at this moment): the tf-idf that we said above and the cosine similarity. The latter is an interesting one: consider the tf-idf vectors of the documents, then consider the query as a document too. Now plot those tf-idf vectors and measure their cosine of the angle between them. The more you’re near to 1, the more relevant is the document.
There are a lot of other important things that need to be said like the precision and recall concept, but that’s enough for now. I’ll talk about this another time.
Anyway I’m doing an experimental project named django searchable. It’s a pluggable app for django that implements an information retrieval engine based on tf-idf weighting. Play with it if you’re brave enough.
Hi,
I’m finishing a PhD in natural language processing (natural language generation and understanding). I spent my first few years building search engines, and studied them in great depth. You’ll find that the guys who work in SEO (search engine optimisation) have a pretty good understanding of how the Google algorithm works, PageRank, stemming, tagging, and all that stuff.
I also wrote a stemmer in my first year which is pretty simple really but has been really useful for me, as it stems to actual words, in their proper natural form. Quite important for NLG.
http://fizz.cmp.uea.ac.uk/Research/stemmer/
You’ll find that tf-idf has some limitations, such as not working very well on large document collections, it’s also not very good with synonyms so it’s hard for this method to find relationships between words.
N-grams - look at Witten-Bell discounting and smoothing.
Check out LSA too, although it also has some limitations but fewer. You can use neural-networks.
On my blog you’ll find some links to good NLP/AI tools:
http://scienceforseo.blogspot.com/2008/10/top-freeware-stuff.html
Good work, and enjoy your journey in NLP/IR!
Comment by CJ — October 20th, 2008 at 11:22
P.S: you’ll learn an awful lot by reading Google patents and looking at how their algo works (within the limitations of the knowledge shared by them) -look at SIGIR papers too, and go if you can, it’s a great conference.
cj
Comment by CJ — October 20th, 2008 at 11:24
I just made research.google.com my homepage :)
Anyway thanks for the good advices, I’m looking forward to this stuff right now. Thanks again!
Comment by Giuliani Vito, Ivan — October 20th, 2008 at 19:04
[…] The last time I blogged about a new course I’m following at my university. This course, held by Pasquale Lops and Giovanni Semeraro, is very interesting at the point that I’ll be developing a custom information retrieval engine as part of my internship project. I can’t tell much more at this point since the internship haven’t started yet and I’m not sure I can release more details about this project (we’re still in the process of deciding if and how the whole thing will be released to the world). […]
Pingback by Zeta-Puppis.com » Optimize your programs — December 2nd, 2008 at 22:17