<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Zeta-Puppis.com</title>
	<atom:link href="http://zeta-puppis.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://zeta-puppis.com</link>
	<description>my very own personal corner</description>
	<lastBuildDate>Wed, 07 Apr 2010 15:36:25 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Simulated Annealing</title>
		<link>http://zeta-puppis.com/2010/02/22/simulated-annealing/</link>
		<comments>http://zeta-puppis.com/2010/02/22/simulated-annealing/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 13:17:59 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[combinatorial problem]]></category>
		<category><![CDATA[knapsack problem]]></category>
		<category><![CDATA[local search]]></category>
		<category><![CDATA[np-complete]]></category>
		<category><![CDATA[simulated annealing]]></category>
		<category><![CDATA[stochastic]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=261</guid>
		<description><![CDATA[For a problem I&#8217;m working on I got stuck onto the classical situation of local maximum. After trying to work around the problem in several more or less creative ways, I thought of the simulated annealing algorithm. Considering it&#8217;s been a while since I last saw it I tried to search for it on the [...]]]></description>
			<content:encoded><![CDATA[<p>For a problem I&#8217;m working on I got stuck in the classic situation of a local maximum. After trying to work around the problem in several more or less creative ways, I thought of the <a href="http://en.wikipedia.org/wiki/Simulated_annealing">simulated annealing</a> algorithm. Since it had been a while since I last saw it, I searched the web for it and, surprisingly, there is not much material about it, and the few bits I found are often contradictory. After quite a lot of digging I decided to write about it here. As a warning, there will be some basic statistics and complexity analysis, as well as a quick formal introduction to the knapsack problem. You should be able to follow even if you don&#8217;t know anything about those topics, but some foundations in these areas would be of great help.<br />
<span id="more-261"></span><br />
Let&#8217;s begin with the knapsack problem. This is a classic combinatorial computer science problem whose decision version is known to be <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete</a>, which means that no algorithm is known that finds the exact optimal solution in polynomial time. In practice this means that most of the time we are happy with a good solution, as long as it&#8217;s not too far from the optimal one. In the simplest possible terms, you are a thief in a room with a set of objects that are worth something, but you have only one knapsack and it can carry at most a certain weight, so you have to choose carefully which objects to steal in order to maximize the earnings. For example, consider the following situation: you can carry at most 5kg, and there is a laptop that weighs more than 1kg and a 4kg safe full of pure diamonds. You can&#8217;t carry both of them, so you have to choose which one is better to carry, the diamonds or the laptop. A smart thief would choose the diamonds since their value is considerably higher than the&nbsp;laptop&#8217;s.</p>
<p>There are a few variations of the problem, but the most common one is named &#8220;0-1&#8221;: you can&#8217;t split an object across two or more bags; either you take the whole object or you leave it where it is. Mathematically speaking, consider a set of <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> objects, where each item <img src='http://s.wordpress.com/latex.php?latex=x_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_j' title='x_j' class='latex' /> is worth <img src='http://s.wordpress.com/latex.php?latex=p_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_j' title='p_j' class='latex' /> and weighs <img src='http://s.wordpress.com/latex.php?latex=w_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w_j' title='w_j' class='latex' /> with <img src='http://s.wordpress.com/latex.php?latex=1%20%5Cleq%20j%20%5Cleq%20n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1 \leq j \leq n' title='1 \leq j \leq n' class='latex' />. Then the goal is to maximize the following&nbsp;function:</p>
<img src='http://s.wordpress.com/latex.php?latex=q%28%5C%7Bx_1%2C%20x_2%2C%20%5Cldots%2C%20x_n%5C%7D%29%20%3D%20%5Csum_%7Bj%3D1%7D%5E%7Bn%7Dp_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q(\{x_1, x_2, \ldots, x_n\}) = \sum_{j=1}^{n}p_j' title='q(\{x_1, x_2, \ldots, x_n\}) = \sum_{j=1}^{n}p_j' class='latex' />
<p>While respecting the following constraint (<img src='http://s.wordpress.com/latex.php?latex=W&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='W' title='W' class='latex' /> being the maximum weight we can&nbsp;carry):</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Csum_%7Bj%3D1%7D%5E%7Bn%7Dw_j%20%5Cleq%20W&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_{j=1}^{n}w_j \leq W' title='\sum_{j=1}^{n}w_j \leq W' class='latex' />
<p>The first problem we have to face, then, is how to generate the items and how to calculate the value of a solution. The following example, like the others that follow, is written in Python, but it&#8217;s quite easy to understand, so porting it to another language shouldn&#8217;t be hard. I chose to generate 100 objects with integer values that range from 1 to 100$ (both ends included) using a <a href="http://en.wikipedia.org/wiki/Uniform_distribution_(continuous)">uniform distribution</a> (if you don&#8217;t know much about statistics, it means that every value in the range is equally likely). The same goes for the weights, except that they are real numbers ranging from 1 to 20 (the choice of the weight&#8217;s unit of measure is left to&nbsp;you).</p>
<pre><code>import random

def generate_items(n_items=100):
    "Generate a list of items that could be stolen"
    items = []
    for n in range(0, n_items):
        # use a uniform distribution both for values and for weights
        cost = random.randint(1, 100)
        weight = random.uniform(1, 20)
        items.append((cost, weight))
    return items</code></pre>
<p>So the items are nothing more than a list of pairs in the format <img src='http://s.wordpress.com/latex.php?latex=%28value_i%2C%20weight_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(value_i, weight_i)' title='(value_i, weight_i)' class='latex' />, one for every object <img src='http://s.wordpress.com/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_i' title='x_i' class='latex' />. A more realistic dataset would probably have used a <a href="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a> for the weights, but it&#8217;s trivial to change the generation function to work that way. For our purposes the uniform distribution does its&nbsp;job.</p>
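<p>To make the objective and the constraint above concrete, here is a minimal sketch that evaluates a candidate selection over a list of (value, weight) pairs like the one just described. The helper names <code>total_value</code>, <code>total_weight</code> and <code>is_admissible</code>, as well as the three hand-written items, are made up for illustration; a solution is assumed to be a list of item indexes.</p>

```python
# Hypothetical helpers: names and sample data are illustrative assumptions.

def total_value(solution, items):
    "Sum of the values of the selected items (the q function)"
    return sum(items[idx][0] for idx in solution)

def total_weight(solution, items):
    "Sum of the weights of the selected items"
    return sum(items[idx][1] for idx in solution)

def is_admissible(solution, items, max_weight):
    "A solution is admissible when it is not too heavy"
    return total_weight(solution, items) <= max_weight

# three items as (value, weight) pairs; a solution is a list of indexes
items = [(60, 10.0), (100, 20.0), (120, 30.0)]
solution = [1, 2]
print(total_value(solution, items))        # 220
print(is_admissible(solution, items, 50))  # True
print(is_admissible(solution, items, 49))  # False
```

<p>Keeping the value and the weight separate makes it easy to score solutions that are too heavy as well, which is handy when a search needs to compare them before discarding them.</p>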
<p>As we said above, the problem is NP-complete, so to be sure of finding the optimal solution we would generally need to visit the whole search space, which gets very big as the number of objects grows (with n objects there are 2<sup>n</sup> possible selections). Here comes simulated annealing. We won&#8217;t visit the search space exhaustively; we&#8217;d rather <em>generate</em> solutions. Indeed, it is a stochastic heuristic search algorithm. A heuristic is a function that measures how good or bad something is, and stochastic means that we move through the search space in a more or less random way. In practical terms it&#8217;s not greedy, in that it doesn&#8217;t always follow what the heuristic says but rather searches randomly around where the heuristic function points. For example, consider the needle in the haystack situation: an exhaustive search method would take every piece of straw, check that it&#8217;s not a needle, put it aside and repeat until the needle is found. You don&#8217;t want to proceed this way: you&#8217;re likely to finish sooner if you look randomly in the haystack and, if something stings you while you&#8217;re holding some straw, search in that handful, because the needle may be in there. In the knapsack problem we do have a heuristic: it&#8217;s the <img src='http://s.wordpress.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> function above which, for each solution (admissible or not), says how much it&#8217;s worth. In this case we define an admissible solution as one that is not too&nbsp;heavy.</p>
<p>A useful (and classic as well) example is blind hill climbing. You&#8217;re blind and stuck on a hill and you need to reach the top. A good principle that could lead you to the top is to feel the terrain and always follow the rising path. It <i>could</i>, because if you&#8217;re standing on a rock then you surely haven&#8217;t reached the top, yet the principle above no longer helps: every step leads down, so you have reached a local maximum (invert things and you get the same for a local minimum). Simulated annealing avoids this problem by trying worsening moves from time to time: even if this may not sound like a good idea, it helps to escape the situation described&nbsp;above.</p>
<p><img src="http://zeta-puppis.com/wp-content/uploads/2010/02/plotsurf.gif" class="align-center" /></p>
<p>In the function above, starting from the middle we could choose to climb towards the left maximum (which is a local maximum). Using hill climbing we&#8217;d be stuck there because we wouldn&#8217;t try other&nbsp;paths.</p>
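<p>The blind hill climbing described above can be sketched on a one-dimensional landscape. This is only an illustration under simple assumptions: the function name <code>hill_climb</code> and the list of heights are invented, and the climber only looks at the two adjacent positions.</p>

```python
def hill_climb(heights, start):
    "Move to the higher neighbour until no neighbour is higher"
    pos = start
    while True:
        # candidate positions: left and right neighbours inside the list
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(heights)]
        best = max(neighbours, key=lambda p: heights[p])
        if heights[best] <= heights[pos]:
            return pos  # no rising path left: a (possibly local) maximum
        pos = best

# a made-up landscape: a small peak at index 2, the real top at index 7
heights = [1, 3, 5, 4, 2, 6, 8, 9, 7]
print(hill_climb(heights, 3))  # 2, stuck on the local maximum
print(hill_climb(heights, 4))  # 7, the global maximum
```

<p>Two starting points that are next to each other end up on different peaks, which is exactly why a method that only ever goes up depends so much on where it starts.</p>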
<p>Simulated annealing takes its name from the process that metals go through when they cool down from their melting point. Indeed, the cooling process consists of many particles that change energy state (this statement may be inaccurate or plain wrong, but please forgive me as I never studied these things and all I know in this field comes from the simulated annealing algorithm itself); in particular, we can calculate the transition probability from one state to another. Considering two energy states <img src='http://s.wordpress.com/latex.php?latex=e_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_i' title='e_i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=e_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_j' title='e_j' class='latex' /> and a temperature <img src='http://s.wordpress.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='T' title='T' class='latex' />, switching from <img src='http://s.wordpress.com/latex.php?latex=e_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_i' title='e_i' class='latex' /> to <img src='http://s.wordpress.com/latex.php?latex=e_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_j' title='e_j' class='latex' /> has&nbsp;probability:</p>
<img src='http://s.wordpress.com/latex.php?latex=P%28e_i%2C%20e_j%20%7C%20T%29%20%3D%20e%5E%7B%28e_i%20-%20e_j%29%20%2F%20%28k_BT%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(e_i, e_j | T) = e^{(e_i - e_j) / (k_BT)}' title='P(e_i, e_j | T) = e^{(e_i - e_j) / (k_BT)}' class='latex' />
<p>Where <img src='http://s.wordpress.com/latex.php?latex=k_B&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k_B' title='k_B' class='latex' /> is a constant called <a href="http://en.wikipedia.org/wiki/Boltzmann_constant">Boltzmann&#8217;s constant</a>. But then, how do we apply this to our problem (or to a combinatorial search problem in general)? The most important concept to grasp is the switching of energy states. Just as a particle changes state, a solution might change. Indeed, for the knapsack problem there are many admissible solutions, each one with an associated earning. Of course we&#8217;d prefer the one with the highest earnings (and simulated annealing will help us find it), but it&#8217;s still perfectly acceptable to go from one solution to another as long as the other solution is also&nbsp;admissible.</p>
<p>So here is what simulated annealing does: if you find a better solution, go on and take that path (under this circumstance it behaves just like hill climbing), otherwise change state with probability <img src='http://s.wordpress.com/latex.php?latex=P%28e_i%2C%20e_j%20%7C%20T%29%20%3D%20e%5E%7B%28e_i%20-%20e_j%29%20%2F%20T%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(e_i, e_j | T) = e^{(e_i - e_j) / T}' title='P(e_i, e_j | T) = e^{(e_i - e_j) / T}' class='latex' /> (you may notice that Boltzmann&#8217;s constant is missing: it only converts a temperature into physical energy units, so here it can be absorbed into the temperature). In this formula e is an energy, for which lower is better; since we are maximizing a value instead, the code takes the new value minus the old one as the numerator of the exponent, which is negative for a worsening move. The effect of the temperature is that at higher temperatures the algorithm tries worsening moves quite often, while at lower temperatures the probability of a worsening move is lower, so as the temperature tends towards 0 it behaves much like the hill climbing&nbsp;algorithm.</p>
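<p>To see the effect of the temperature in numbers, here is a small sketch of the acceptance rule on its own. The function name <code>acceptance_probability</code> is an assumption made for this example, and <code>delta</code> is taken to be the new value minus the current one, since we are maximizing.</p>

```python
import math

def acceptance_probability(delta, temperature):
    """
    delta is new value minus current value (we are maximizing),
    so a worsening move has a negative delta
    """
    if delta > 0:
        return 1.0  # improving moves are always accepted
    return math.exp(delta / float(temperature))

# the same worsening move (value drops by 2) at different temperatures
for temperature in (10.0, 1.0, 0.1):
    print(temperature, acceptance_probability(-2, temperature))
```

<p>The same worsening move is accepted with a probability of roughly 0.82 at temperature 10, roughly 0.14 at temperature 1 and about 0.000000002 at temperature 0.1, which is practically hill climbing.</p>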
<pre><code>def simulated_annealing(solution, items, max_weight):
    "Apply the simulated annealing for solving the knapsack problem"
    best = solution
    best_value = compute_cost(solution, items)[0]
    current_sol = solution
    temperature = 1.0

    while True:
        current_value = compute_cost(best, items)[0]

        for i in range(0, COOLING_STEPS):
            moves = generate_moves(current_sol, items, max_weight)
            idx = random.randint(0, len(moves) - 1)
            random_move = moves[idx]

            delta = compute_cost(random_move, items)[0] - compute_cost(best, items)[0]

            if delta &gt; 0:
                best = random_move
                best_value = compute_cost(best, items)[0]
                current_sol = random_move
            else:
                if math.exp(delta / float(temperature)) &gt; random.random():
                    current_sol = random_move

        temperature = TEMP_ALPHA * temperature
        if current_value &gt;= best_value or temperature &lt;= 0:
            break
    return best</code></pre>
<p>And finally, that is simulated annealing. You start from a temperature of 1.0, then you run a certain number of cooling steps; in each of them you pick a random solution from the neighbours of the current one and, if it is better than the current best, it becomes the new best (and the new current solution). If its value is worse than the current best, the current solution is updated with the probability expressed above. After the cooling steps the temperature is decreased with an <a href="http://en.wikipedia.org/wiki/Exponential_decay">exponential decay</a> (usually it is&nbsp;<img src='http://s.wordpress.com/latex.php?latex=t%20%3D%20%5Calpha%20t%2C%200.8%20%3C%20%5Calpha%20%3C%200.9&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t = \alpha t, 0.8 &lt; \alpha &lt; 0.9' title='t = \alpha t, 0.8 &lt; \alpha &lt; 0.9' class='latex' />).</p>
<p>In the example above you don&#8217;t wait for the temperature to reach 0; you leave the loop if, after all the cooling steps, there hasn&#8217;t been any improvement. How large the number of cooling steps and the <img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' /> value should be is a matter of fine tuning. A different approach that allows you to get rid of the cooling steps is to make the temperature decrease slowly (<img src='http://s.wordpress.com/latex.php?latex=t%20%3D%20%20t%20%2F%20%281%20%2B%20%28%5Cbeta%20t%29%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t =  t / (1 + (\beta t))' title='t =  t / (1 + (\beta t))' class='latex' /> where <img src='http://s.wordpress.com/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\beta' title='\beta' class='latex' /> is a very small value like 0.01). Besides, a common improvement is to cache the moves within the cooling steps and regenerate them only when you find a new best value or change&nbsp;state.</p>
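<p>The two cooling rules mentioned so far can be compared side by side. The following is only a sketch: the function names are invented, and the default values for alpha and beta are simply the ones quoted in the text.</p>

```python
def geometric_cooling(temperature, alpha=0.8):
    "t = alpha * t"
    return alpha * temperature

def slow_cooling(temperature, beta=0.01):
    "t = t / (1 + beta * t)"
    return temperature / (1 + beta * temperature)

def temperature_after(schedule, steps, start=1.0):
    "Apply a cooling schedule a number of times"
    temperature = start
    for _ in range(steps):
        temperature = schedule(temperature)
    return temperature

for steps in (1, 10, 100):
    print(steps,
          temperature_after(geometric_cooling, steps),
          temperature_after(slow_cooling, steps))
```

<p>Starting from 1.0, the geometric rule is already around 0.1 after ten updates, while the slow rule needs a hundred updates just to halve the temperature. That is why the slow rule can be applied after every single move instead of after a whole batch of cooling steps.</p>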
<p>You can see that the most critical points are the neighbour generation and the cost computation. While the neighbour generation could be cached like I said above, the cost computation could be replaced with a probability estimate in order to reduce the time per cooling&nbsp;step.</p>
<p>But how do you apply the algorithm? If an empty solution is acceptable then you can just start with that and let the neighbour generator build a solution, but usually you start from a greedy solution (found through hill climbing, for example) or from a random&nbsp;one.</p>
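<p>As an illustration of a greedy starting point, here is a sketch that fills the knapsack by decreasing value/weight ratio. Note that this is a different greedy rule from the hill climbing mentioned above, chosen only because it is short; the function name <code>greedy_solution</code> and the sample items are assumptions.</p>

```python
def greedy_solution(items, max_weight):
    "Pick items by decreasing value/weight ratio while they still fit"
    order = sorted(range(len(items)),
                   key=lambda idx: items[idx][0] / float(items[idx][1]),
                   reverse=True)
    solution, weight = [], 0.0
    for idx in order:
        if weight + items[idx][1] <= max_weight:
            solution.append(idx)
            weight += items[idx][1]
    return solution

# items are (value, weight) pairs, a solution is a list of indexes
items = [(60, 10.0), (100, 20.0), (120, 30.0), (10, 1.0)]
print(greedy_solution(items, max_weight=40))  # [3, 0, 1]
```

<p>The result is a list of item indexes, so it can be handed to a search such as simulated annealing as its initial solution.</p>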
<p>Here follows the complete code. I used 1000 cooling steps and an <img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' /> value of 0.8. I start from a random solution, which is not that bad considering how little time it takes to compute. Indeed, random algorithms often perform really well on combinatorial problems; see <a href="http://www.cs.ubc.ca/labs/beta/Courses/CPSC532D-02/tutorial-slides.pdf">stochastic search</a> for more&nbsp;information.</p>
<pre><code>#!/usr/bin/env python
import math
import operator
import pprint
import random
import sys

COOLING_STEPS = 1000
TEMP_ALPHA = 0.8

random.seed()

def generate_items(n_items=100):
    "Generate a list of items that could be stolen"
    items = []
    for n in range(0, n_items):
        # use a uniform distribution both for values and for weights
        cost = random.randint(1, 100)
        weight = random.uniform(1, 20)
        items.append((cost, weight))
    return items

def main(args):
    items = generate_items()
    pprint.pprint(items)

    start_sol = generate_random_solution(items, max_weight=40)
    print("Random solution: %s" % start_sol)
    print("value: (cost: %d, weight: %f)" % compute_cost(start_sol, items))

    solution = simulated_annealing(start_sol, items, max_weight=40)
    print("Final solution: %s" % solution)
    print("value: (cost: %d, weight: %f)" % compute_cost(solution, items))

    return False

def generate_random_solution(items, max_weight):
    "Generate a starting random solution"

    # generate a random solution by adding random items
    # until we exceed the maximum weight
    solution = []
    while compute_cost(solution, items)[1] &lt;= max_weight:
        idx = random.randint(0, len(items) - 1)
        # skip duplicates
        if idx not in solution:
            solution.append(idx)
    # last item makes us get over the weight so simply remove it
    # we'll look for better results after
    solution = solution[:-1]
    return solution

def simulated_annealing(solution, items, max_weight):
    "Apply the simulated annealing for solving the knapsack problem"
    best = solution
    best_value = compute_cost(solution, items)[0]
    current_sol = solution
    temperature = 1.0

    while True:
        current_value = compute_cost(best, items)[0]

        for i in range(0, COOLING_STEPS):
            moves = generate_moves(current_sol, items, max_weight)
            idx = random.randint(0, len(moves) - 1)
            random_move = moves[idx]

            delta = compute_cost(random_move, items)[0] - \
                    compute_cost(best, items)[0]

            if delta &gt; 0:
                best = random_move
                best_value = compute_cost(best, items)[0]
                current_sol = random_move
            else:
                if math.exp(delta / float(temperature)) &gt; random.random():
                    current_sol = random_move

        temperature = TEMP_ALPHA * temperature
        if current_value &gt;= best_value or temperature &lt;= 0:
            break
    return best

def generate_moves(solution, items, max_weight):
    """
    Generate all the admissible moves starting from the input
    solution
    """
    moves = []
    # try to add another item and save as a possible move
    for idx, item in enumerate(items):
        if idx not in solution:
            move = solution[::]
            move.append(idx)

            if compute_cost(move, items)[1] &lt;= max_weight:
                moves.append(move)

    # try to remove one item
    for idx, item in enumerate(solution):
        move = solution[::]
        del move[idx]
        if move not in moves:
            moves.append(move)

    return moves

def compute_cost(solution, items):
    """
    Return a tuple in the format (total_value, total_weight)
    for the input solution
    """
    cost, weight = 0, 0
    for item in solution:
        cost += items[item][0]
        weight += items[item][1]
    return (cost, weight)

if __name__ == '__main__':
    sys.exit(main(sys.argv))</code></pre>
<p>The results are surprising. For three different sets of 100 items, each one with its own values and weights, here are the&nbsp;results:</p>
<pre><code>$ python sa.py
Random solution: [98, 71, 95]
value: (cost: 44, weight: 27.001685)
Final solution: [71, 95, 67, 9, 41, 33, 27]
value: (cost: 229, weight: 39.791386)

$ python sa.py
Random solution: [38, 16, 62, 31]
value: (cost: 124, weight: 36.863846)
Final solution: [38, 16, 62, 31, 5]
value: (cost: 194, weight: 38.970745)

$ python sa.py
Random solution: [61, 44, 48, 38]
value: (cost: 293, weight: 30.357135)
Final solution: [61, 44, 48, 38, 37, 5, 2]
value: (cost: 421, weight: 39.331549)</code></pre>
<p>We usually don&#8217;t leave much spare capacity, and the quality of the solutions found is quite good considering that the time required to compute them is&nbsp;~0.5s.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2010/02/22/simulated-annealing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Wave impressions from a developer point of view</title>
		<link>http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/</link>
		<comments>http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/#comments</comments>
		<pubDate>Thu, 27 Aug 2009 22:47:04 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Geekness]]></category>
		<category><![CDATA[beta]]></category>
		<category><![CDATA[developer]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[googlewave]]></category>
		<category><![CDATA[point of view]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[sandbox]]></category>
		<category><![CDATA[wave]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=235</guid>
		<description><![CDATA[A couple of days ago I finally had my Google Wave sandbox account. Given that I just finished developing my very first robot, I thought I&#8217;d share some impressions on the whole thing. From the user-side, things are far from being ready. Some important features are still missing, just to name one you can&#8217;t remove [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of days ago I finally got my <a href="http://wave.google.com">Google Wave</a> sandbox account. Given that I just finished developing my very first robot, <strong>I thought I&#8217;d share some impressions</strong> on the whole thing. From the user side, things are far from ready. Some important features are still missing: to name just one, you can&#8217;t remove users from a wave once they have joined (or, alternatively, there&#8217;s no way to ignore a wave). Indeed, given that I joined several waves to try other people&#8217;s applications, I&#8217;m getting continuous notifications. But anyway, the whole thing is to me like a great development playground where I can make all sorts of&nbsp;experiments.</p>
<p><span id="more-235"></span>They&#8217;ve been honest: when you filled in the registration form, they asked you if you were comfortable with changing APIs or an unstable system. That&#8217;s what you&#8217;ll find once you get your sandbox account. The APIs are there but haven&#8217;t been fully documented yet, and <strong>most of your work when developing a robot or gadget will be exploring the API sources</strong> (they&#8217;re open source, yay!) or searching for examples in the <a href="http://wave-samples-gallery.appspot.com/">samples gallery</a>, which is an invaluable resource by the&nbsp;way.</p>
<p><strong>Debugging is hard too</strong>, given that you can&#8217;t test what you&#8217;ve done locally but have to upload your code to <a href="http://appengine.google.com">AppEngine</a> to see if it works (actually AppEngine is the only platform they accept requests from, but they plan to allow every host that talks the <a href="http://www.waveprotocol.org">wave protocol</a> in the future). This means that if, for example, there&#8217;s a typo in the code (e.g. <code>appendText()</code> rather than <code>AppendText()</code>), you&#8217;d find out only by looking at the AppEngine&nbsp;logs.</p>
<p><strong>Be prepared to experience random failures too</strong>. Sometimes your robot is working correctly and is receiving the whole wavelet (which is the whole conversation thread), but its response is ignored by the server for some unknown&nbsp;reason.</p>
<p>Anyway, even though it&#8217;s clearly still a work in progress, I felt like <strong>the whole thing was quite exciting</strong> both from the user&#8217;s and from the developer&#8217;s point of view. The event model they designed for external applications perfectly fits the nature of The Wave and leaves room for some nice asynchronous applications. Hopefully, we&#8217;ll meet on Google Wave&nbsp;soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dealing with algorithms and data structures</title>
		<link>http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/</link>
		<comments>http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 21:49:08 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[compiler]]></category>
		<category><![CDATA[computational analysis]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[data structures]]></category>
		<category><![CDATA[locality]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[optimizer]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[processor]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=223</guid>
		<description><![CDATA[One of the reasons I haven&#8217;t been writing on this blog that much lately is that I&#8217;ve been terribly busy with university given that I just cleared out six exams in six months. That said, for one of my three exams that I still have left, I had to develop an inference engine written in [...]]]></description>
			<content:encoded><![CDATA[<p>One of the reasons I haven&#8217;t been writing much on this blog lately is that I&#8217;ve been terribly busy with university, given that I just cleared six exams in six months. That said, for one of the three exams I still have left, I had to develop an <a href="http://en.wikipedia.org/wiki/Inference_engine">inference engine</a> in C++. Since this was a fairly large project that had to deal with some sort of NP-complete problems (see also: <a href="http://en.wikipedia.org/wiki/Unification">unification</a>) and given that this was the first time I wrote something serious in C++ (i.e. something that would involve more than one class and that didn&#8217;t contain the &#8220;Hello world&#8221; string), I had the chance to learn quite a few new&nbsp;things.</p>
<p><span id="more-223"></span>First of all, no matter how good your data structure is or how well you implemented it, sooner or later you&#8217;ll hit a speed barrier that even the best data structure for the job can&#8217;t beat. That doesn&#8217;t mean you shouldn&#8217;t think about which data structure to use; besides, I&#8217;m really passionate about finding the right data structure for the job, so when I get the chance to deal with this kind of problem it feels like winning the lottery. <strong>Sometimes the algorithm you&#8217;re using simply has limits that can&#8217;t be beaten</strong> unless you radically adjust it or replace it with a better algorithm for the job. Say we have an implicit, unweighted graph, and we know that we&#8217;ll expand at most fifty nodes of about a hundred bytes each, and that expanding a node is computationally cheap. Now you want to find a particular node within that graph and, along with it, the shortest path back to the starting node. Given these considerations, what algorithm would you use? I&#8217;d go for breadth-first search: we don&#8217;t lose much time expanding the nodes, and we have relatively few of them which, unless we&#8217;re running on some severely limited hardware (and today even the chips in automated toilets can hold 5KB in memory), will take very little&nbsp;memory.</p>
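<p>As a rough sketch of that choice, here is a breadth-first search over an implicit graph, where the neighbours of a node are generated on demand by a function. The names <code>bfs_shortest_path</code> and <code>expand</code> and the toy graph over integers are made up for this example; they have nothing to do with the inference engine itself.</p>

```python
from collections import deque

def bfs_shortest_path(start, is_goal, expand):
    """
    Breadth-first search on an implicit graph: expand(node) generates
    the neighbours on demand. Returns the shortest path as a list.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if is_goal(node):
            # walk the parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in expand(node):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None  # the reachable part of the graph holds no goal

# made-up implicit graph: from n you can reach n + 1 and n * 2
def expand(n):
    return [n + 1, n * 2]

print(bfs_shortest_path(1, lambda n: n == 10, expand))  # [1, 2, 4, 5, 10]
```

<p>Every node is generated at most once and kept in the <code>parents</code> dictionary, which is cheap as long as nodes are small and inexpensive to expand, as assumed above.</p>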
<p>Now suppose we still have that implicit graph, with our fifty nodes of about a hundred bytes each, but now generating each node is computationally very expensive. The BFS above could still be applicable under certain circumstances, but we&#8217;d better look for alternatives. You can hold the graph in whatever data structure you want, but unless you replace the algorithm with something better there&#8217;s nothing you can do about it. Of course you can gain some speed by improving your data structure, but that&#8217;s not the point, since <strong>what makes the shortest path search slow is the generation of a new node</strong>. So what do you have to do? Try to generate as few nodes as possible. Say the nearest solution is ten edges away from the starting node. With BFS you&#8217;d first expand all the nodes directly connected to the starting node (one edge away), then from each of these you&#8217;d expand the nodes that are two edges away from the starting node, and so on, until you have expanded all the <em>levels</em> and reached the solution ten edges away. You got the solution eventually, but at a high price. What are the alternatives, then? Say that for each node we expand we know a numerical value that tells us how good that node is in terms of distance from what we&#8217;re looking for. This way we can follow only the good leads and leave the bad paths out of our search (that&#8217;s not entirely true, and I&#8217;ll say why in a minute); we limit the number of expanded nodes and, as a result, get a great speed-up. There&#8217;s an entire category of algorithms based on the principle that <em>you know something</em> about your problem that can help you out, and these are called informed algorithms. 
One that comes to mind is <a href="http://en.wikipedia.org/wiki/A*_search_algorithm">A*</a>, which is also quite simple to implement. These algorithms rely on the assumption that you can estimate <em>how good</em> the current state is. Indeed, the effectiveness of most of these algorithms depends on how good the function that gives you that estimate is. This function is called the <em>heuristic function</em>. But the heuristic is only an estimate, and it&#8217;s likely to be wrong in some cases. So you still end up expanding more nodes than strictly necessary and following some wrong leads, but you save yourself from exploring all the wrong leads that BFS would have forced you to&nbsp;explore.</p>
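<p>To make the difference concrete, here is a minimal sketch in Python (not the code from the project, which was C++). The 10x10 grid, the <code>neighbors</code> function and the Manhattan-distance heuristic are all assumptions chosen for illustration; the point is only to count how many nodes each search expands before reaching the goal.</p>

```python
import heapq
from collections import deque

SIZE = 10  # hypothetical 10x10 grid, used only for illustration


def neighbors(node):
    """Generate the neighbours of a cell on demand (an implicit graph)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < SIZE and 0 <= ny < SIZE:
            yield (nx, ny)


def bfs(start, goal):
    """Breadth-first search: returns (path length in edges, nodes expanded)."""
    dist = {start: 0}
    queue = deque([start])
    expanded = 0
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node], expanded
        expanded += 1
        for nxt in neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None, expanded


def manhattan(node, goal):
    """Heuristic: never overestimates the distance on a 4-connected grid."""
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])


def a_star(start, goal, heuristic=manhattan):
    """A* search: returns (path length in edges, nodes expanded)."""
    best = {start: 0}
    frontier = [(heuristic(start, goal), 0, start)]  # (f, g, node)
    closed = set()
    expanded = 0
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g, expanded
        if node in closed:
            continue  # stale heap entry, already expanded
        closed.add(node)
        expanded += 1
        for nxt in neighbors(node):
            new_g = g + 1
            if new_g < best.get(nxt, float("inf")):
                best[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt, goal), new_g, nxt))
    return None, expanded


print("BFS:", bfs((0, 0), (5, 5)))
print("A*: ", a_star((0, 0), (5, 5)))
```

<p>Both searches find a path of the same length, but A* expands fewer nodes because the heuristic steers it away from cells that lead away from the goal. When generating a node is the expensive part, that count is what matters.</p>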
<p>In the project I was talking about above, at a certain point we hit that limit. We initially tried switching data structures, and that saved some time, but we still couldn&#8217;t come up with any substantial time reduction. Then we switched algorithm and gained something like an 80% speed-up, and we even had the chance to use simpler data structures, which let us exploit things like caching; I suppose that accounts for a large part of that&nbsp;80%.</p>
<p>That said, here comes the second lesson: <strong>know your environment</strong>. You have to know how your processor works, what the difference is between, say, an L1 and an L2 cache, why disk access is slow and how it works, how paging works, and so on. These things leave a lot of room for optimization if you know how to use them. Caching, for example, is probably one of the things that can help you the most. If you know what a <a href="http://en.wikipedia.org/wiki/CPU_cache">cache line</a> is and how big it is on your processor, you&#8217;ll know how and when you can exploit the principle of data locality. Knowing that disk reads are not <em>that slow</em> could really help you too (I hear you crying: no, that&#8217;s not a typo. Reading from disk isn&#8217;t slow in itself; what makes the operation slow is the time the disk head takes to position itself. Once the head is in position, reading a whole block is probably faster than you imagine. Of course, I&#8217;m talking about old-style mechanical disks; with these shiny new solid state disks it&#8217;s another story). In the same way, it helps to know whether your program will cause many page faults because your data structure doesn&#8217;t fit in one page, or fragment the heap because your algorithm makes thousands of allocations and deallocations of different sizes. This is very important, even though I have to admit it&#8217;s one of the topics I&#8217;m lagging behind&nbsp;on.</p>
<p>So you&#8217;ve chosen the best algorithm in the world, you resurrected the dead in order to fit your data structure in exactly one page and you used condoms while you were coding, but your program is still very slow and you can&#8217;t find the reason. Well, I tell you, <strong>you&#8217;re probably using the wrong data structure</strong>, but at this point I guess you&#8217;d know. In the project I mentioned above we experienced exactly this. We did everything we could to speed up the algorithms and optimized everything that could be optimized, yet our program was slow until we realized that one thing was a real bottleneck for our purposes. Big O analysis is really useful, but you have to take into account that the little &#8216;n&#8217; in that big &#8216;O&#8217; is meant to be big. It tells you how the data structure behaves as n grows, so two algorithms or data structures that both have O(n log n) complexity can perform very differently on small inputs. In our case the bottleneck was the STL map implementation. It turns out that, under the hood, it is typically a <a href="http://en.wikipedia.org/wiki/Red-black_tree">red-black tree</a>, which was anything but fast in our program: we were spending 30% of the time inside the STL map looking up keys. The catch was that our maps held very few values: the most I saw was ten items, with an average of three or four. When we switched from the usual map to a very simple hash, we got a tremendous speed&nbsp;improvement.</p>
<p>The next thing is optimizing the little stuff. This won&#8217;t yield great speed-ups, but it&#8217;s still useful. We gained about one second just by switching one function&#8217;s calling convention to <a href="http://en.wikipedia.org/wiki/X86_calling_conventions#fastcall">fastcall</a>, and we got even greater benefits by forcing the inlining of some other&nbsp;functions.</p>
<p>Other than that, working on a large C++ code base has been challenging, mostly because of some C++ gotchas (e.g., why does something like <code>string fn() {}</code> compile even though it returns nothing?), even though I come from an assembler/C background and have a solid grounding in OO theory. In the end, though, it has probably been the most fun problem I&#8217;ve faced in recent years, and it will eventually be open sourced after the&nbsp;summer.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My Italian PyCon experience</title>
		<link>http://zeta-puppis.com/2009/05/11/my-italian-pycon-experience/</link>
		<comments>http://zeta-puppis.com/2009/05/11/my-italian-pycon-experience/#comments</comments>
		<pubDate>Mon, 11 May 2009 10:42:00 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Geekness]]></category>
		<category><![CDATA[Me]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[BDFL]]></category>
		<category><![CDATA[conf]]></category>
		<category><![CDATA[florence]]></category>
		<category><![CDATA[pycon]]></category>
		<category><![CDATA[pycon3]]></category>
		<category><![CDATA[python italia]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=216</guid>
		<description><![CDATA[I came back yesterday from the third Italian PyCon (aka pycon3), which was held in Florence, and all I can say is that it has been an amazing experience. I had the chance to meet a lot of great new people as well as the BDFL (who won&#8217;t be back in Europe for quite some time, [...]]]></description>
			<content:encoded><![CDATA[<p>I came back yesterday from the third Italian <a href="http://www.pycon.it">PyCon</a> (aka pycon3), which was held in Florence, and all I can say is that it has been an <strong>amazing experience</strong>. I had the chance to meet a lot of great new people as well as the <a href="http://neopythonic.blogspot.com">BDFL</a> (who won&#8217;t be back in Europe for quite some time, as he said). Here follows a summary of what I think were the most interesting&nbsp;talks.</p>
<p><span id="more-216"></span>On the first day there were two keynotes: &#8220;A retrospective of how the community helped build Python 3.0&#8221;, given by <strong>Guido van Rossum</strong>, and &#8220;Zen and the art of Abstractions&#8217; maintenance&#8221; by Alex Martelli. I can just say that they were two extremely interesting talks which, by the way, didn&#8217;t dive much (or at all, in Guido&#8217;s case) into&nbsp;code.</p>
<p>On the second day I really enjoyed two talks: &#8220;Erlang + Python, joining two worlds&#8221; by <a href="http://www.pycon.it/conference/speakers/lawrence-oluyede">Lawrence Oluyede</a> and a really great talk by Raymond Hettinger, &#8220;Easy AI with Python.&#8221; The former left me very curious about the world of functional languages, while the latter really impressed me with <strong>how easy it is to solve certain AI problems with Python</strong> (I had previously solved many of the problems Raymond talked about, but never in Python, and I had never really thought about even trying&nbsp;to).</p>
<p>On the third day <strong>Antonio Cangiano&#8217;s talk</strong> was enlightening. Even though it wasn&#8217;t really Python-specific, he gave great insight into how you can, well, &#8220;become rich with&nbsp;Python.&#8221;</p>
<p>Unfortunately I missed the Sunday afternoon talks since my plane was leaving at 3:00 pm, but in the end I can say that this was an incredible experience that I hope to repeat next year. And as a side note: <strong>the food was&nbsp;marvelous</strong>.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2009/05/11/my-italian-pycon-experience/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Optimize your programs</title>
		<link>http://zeta-puppis.com/2008/12/02/optimize-your-programs/</link>
		<comments>http://zeta-puppis.com/2008/12/02/optimize-your-programs/#comments</comments>
		<pubDate>Tue, 02 Dec 2008 20:17:39 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[optimizations]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[speed]]></category>
		<category><![CDATA[zlib]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=185</guid>
		<description><![CDATA[Last time I blogged about a new course I&#8217;m following at my university. This course, held by Pasquale Lops and Giovanni Semeraro, is so interesting that I&#8217;ll be developing a custom information retrieval engine as part of my internship project. I can&#8217;t tell much more at this point since the internship [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/">Last time</a> I blogged about a new course I&#8217;m following at my university. This course, held by <a href="http://www.di.uniba.it/~lops/lops.html">Pasquale Lops</a> and <a href="http://lacam.di.uniba.it:8000/people/semeraro.htm">Giovanni Semeraro</a>, is so interesting that I&#8217;ll be developing a <strong>custom information retrieval engine</strong> as part of my internship project. I can&#8217;t say much more at this point since the internship hasn&#8217;t started yet and I&#8217;m not sure I can release more details about this project (we&#8217;re still in the process of deciding if and how the whole thing will be released to the&nbsp;world).</p>
<p>In the meantime, I&#8217;ve been doing several experiments on this topic, mostly about the memory usage and performance of such a system on limited hardware. In practice this means implementing the algorithms you&#8217;ll be using and measuring the computational time they&nbsp;require.</p>
<p><span id="more-185"></span>One of the most common things our information retrieval engine has to do is take a document and compress it. Considering&nbsp;that:</p>
<ul>
<li>this is a fundamental piece of this IR&nbsp;engine</li>
<li>it will be used very&nbsp;often</li>
<li>it&#8217;s not rare to process very large&nbsp;documents</li>
</ul>
<p>it&#8217;s clear that this operation should be as efficient as&nbsp;possible.</p>
<p>I chose zlib as my compression library mainly for two&nbsp;reasons:</p>
<ul>
<li>it&#8217;s already included in Python (not really a strong point, since algorithms with better compression ratios are included in Python&nbsp;too)</li>
<li>it offers the best compromise between speed and compression&nbsp;ratio</li>
</ul>
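<p>If you want to check that compromise on your own data, a rough sketch like the following can help. The <code>compare_levels</code> helper and the sample text are made up for illustration; the only zlib calls used are <code>zlib.compress</code> and <code>zlib.decompress</code>, and the timings will obviously depend on your machine and input.</p>

```python
import time
import zlib


def compare_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed size in bytes, seconds taken)}."""
    results = {}
    for level in levels:
        started = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - started
        # sanity check: what we compressed must decompress to the original
        assert zlib.decompress(compressed) == data
        results[level] = (len(compressed), elapsed)
    return results


# hypothetical sample input; replace it with the bytes of a real document
sample = b"information retrieval engines compress many documents. " * 20000
for level, (size, seconds) in sorted(compare_levels(sample).items()):
    print("level %d: %d bytes in %.4f s" % (level, size, seconds))
```

<p>Level 6 is the default used throughout the scripts in this post; lower levels trade compression ratio for speed, higher levels do the opposite.</p>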
<p>Given the above considerations, let&#8217;s start coding our compression&nbsp;system.</p>
<p>As our example document we will use the PDF specification, available at the <a href="http://www.adobe.com/devnet/pdf/pdf_reference.html ">Adobe Development Center</a> (<a href="http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf">this is the file</a>), which is 8.6 MB in&nbsp;size.</p>
<p>So let&#8217;s start by doing things the basic&nbsp;way:</p>
<pre><code>#!/usr/bin/env python
# compress1.py
import zlib

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    cobj = zlib.compressobj(compression_level)
    out = b''  # bytes, not text: zlib works on binary data
    for line in input_fd:
        out += cobj.compress(line)
    out += cobj.flush()

    output_fd.write(out)

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    dobj = zlib.decompressobj()
    out = b''  # bytes, not text: zlib works on binary data
    for line in input_fd:
        out += dobj.decompress(line)
    out += dobj.flush()

    output_fd.write(out)

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    try:
        # unpack inside the try block so that missing arguments are caught too
        command, input_path, output_path = args[0], args[1], args[2]
        options[command](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")</code></pre>
<p>By running this program and performing a very basic profiling we get some&nbsp;indications:</p>
<pre>
kratorius@becks:~/compress$ time ./compress1.py compress PDF32000_2008.pdf compr.zlib
real    0m2.517s
user    0m1.496s
sys     0m0.060s

kratorius@becks:~/compress$ time ./compress1.py decompress compr.zlib decompr.pdf
real    0m0.640s
user    0m0.537s
sys     0m0.085s
</pre>
<p>We need 2.5 seconds to compress a file smaller than 10 MB. This is quite unacceptable, since it means we&#8217;re processing about 3.5 MB per second, so we need to understand what we&#8217;re doing wrong. I can spot at least two big mistakes in this&nbsp;script:</p>
<ol>
<li>we&#8217;re reading the input file line by line, which isn&#8217;t very efficient since this way <strong>we&#8217;re accessing the disk multiple times</strong> (not to mention that we&#8217;re also feeding the compressor line by line, which is inefficient and doesn&#8217;t make much sense for a binary file like our&nbsp;PDF)</li>
<li><strong>we keep the whole compressed output in memory</strong> until the compression is finished, which means that even if the script ran faster we&#8217;d still have very high memory usage, which is not&nbsp;optimal</li>
</ol>
<p>So here is the new version of our compression script, which tackles the disk access&nbsp;issue:</p>
<pre><code>#!/usr/bin/env python
# compress2.py
import zlib

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    out = zlib.compress(input_fd.read(), compression_level)
    output_fd.write(out)

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    out = zlib.decompress(input_fd.read())
    output_fd.write(out)

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    try:
        # unpack inside the try block so that missing arguments are caught too
        command, input_path, output_path = args[0], args[1], args[2]
        options[command](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")</code></pre>
<p>Let&#8217;s run our basic profiling&nbsp;again:</p>
<pre>kratorius@becks:~/compress$ time ./compress2.py compress PDF32000_2008.pdf compr.zlib
real    0m1.668s
user    0m1.337s
sys     0m0.079s

kratorius@becks:~/compress$ time ./compress2.py decompress compr.zlib decompr.pdf
real    0m0.561s
user    0m0.394s
sys     0m0.086s
</pre>
<p>We are now reading the whole input file into memory (minimizing disk accesses), compressing everything in memory and writing the compressed data to the output in a single shot. We got a good speed-up this way (from about 2.5 to about 1.7 seconds), but <strong>we have just increased our memory usage</strong>, since now we&#8217;re keeping both the input and the compressed file in memory. This could be fine if we only processed small files, but since we need a general approach, this solution is not that&nbsp;good.</p>
<p>We can do better. And we&#8217;ll do better in the third&nbsp;try:</p>
<pre><code>#!/usr/bin/env python
# compress3.py
import zlib

READ_BYTES = 2 * 1024 * 1024  # 2 MB

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    cobj = zlib.compressobj(compression_level)
    while True:
        rd = input_fd.read(READ_BYTES)
        if not rd:  # an empty read means we reached the end of the file
            break
        output_fd.write(cobj.compress(rd))

    output_fd.write(cobj.flush())

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    dobj = zlib.decompressobj()
    while True:
        rd = input_fd.read(READ_BYTES)
        if not rd:  # an empty read means we reached the end of the file
            break
        output_fd.write(dobj.decompress(rd))

    output_fd.write(dobj.flush())

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    try:
        # unpack inside the try block so that missing arguments are caught too
        command, input_path, output_path = args[0], args[1], args[2]
        options[command](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")</code></pre>
<p>And we finally reached our&nbsp;goal:</p>
<pre>kratorius@becks:~/compress$ time ./compress3.py compress PDF32000_2008.pdf compr.zlib
real    0m1.325s
user    0m1.226s
sys     0m0.070s

kratorius@becks:~/compress$ time ./compress3.py decompress compr.zlib decompr.pdf
real    0m0.534s
user    0m0.404s
sys     0m0.119s
</pre>
<p>This last attempt works because <strong>we&#8217;re still minimizing the disk accesses</strong> (small files are read in one go, larger ones in 2 MB chunks) and this time <strong>we&#8217;re reducing the memory usage</strong>,&nbsp;since:</p>
<ul>
<li>we read a 2Mb block from our input&nbsp;file</li>
<li>we compress the read&nbsp;input</li>
<li>we write it directly to our output&nbsp;file</li>
</ul>
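<p>On Python 3 the same read/compress/write loop could look like the sketch below. It is only an illustration, not the final script: <code>compress_stream</code>, <code>decompress_stream</code> and <code>CHUNK_SIZE</code> are names made up here, and the functions take already opened binary file objects so that they can be tried on in-memory buffers as well as on real files.</p>

```python
import io
import zlib

CHUNK_SIZE = 2 * 1024 * 1024  # the same 2 MB block size used above


def compress_stream(src, dst, level=6, chunk_size=CHUNK_SIZE):
    """Read src in chunks, compress each chunk and write it to dst."""
    cobj = zlib.compressobj(level)
    # iter() with a sentinel keeps calling read() until it returns b""
    for chunk in iter(lambda: src.read(chunk_size), b""):
        dst.write(cobj.compress(chunk))
    dst.write(cobj.flush())


def decompress_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Read compressed data from src in chunks and write the result to dst."""
    dobj = zlib.decompressobj()
    for chunk in iter(lambda: src.read(chunk_size), b""):
        dst.write(dobj.decompress(chunk))
    dst.write(dobj.flush())


# usage on in-memory buffers; with real files you would open them in
# binary mode ('rb' / 'wb'), ideally inside a `with` statement
original = b"some document content " * 1000
packed = io.BytesIO()
compress_stream(io.BytesIO(original), packed)
unpacked = io.BytesIO()
decompress_stream(io.BytesIO(packed.getvalue()), unpacked)
print(len(original), "bytes ->", len(packed.getvalue()), "bytes")
```

<p>At any moment only one input chunk and its compressed output are held in memory, which is the whole point of this third version.</p>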
<p>I&#8217;m sure there&#8217;s still room for improvement, but at this point we can be quite happy with what we achieved. You can find the final script, which also performs error checking and file locking, <a href="http://zeta-puppis.com/wp-content/uploads/2008/12/compress.py">here</a> (file locking works only on UNIX systems though; on Windows you should just comment out the <code>fcntl</code> lines). As always, suggestions are&nbsp;welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/12/02/optimize-your-programs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What I learned by information retrieval in one week</title>
		<link>http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/</link>
		<comments>http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/#comments</comments>
		<pubDate>Sun, 19 Oct 2008 16:38:24 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[IR]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[text categorization]]></category>
		<category><![CDATA[tf-idf]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=159</guid>
		<description><![CDATA[It has been about a week since I began studying information retrieval in more depth. Actually, it all began with a new course at my university on the subject, and I fell in love almost immediately. This thing really got me interested, and I began doing some experiments (one involves [...]]]></description>
			<content:encoded><![CDATA[<p>It has been about a week since I began studying information retrieval in more depth. Actually, it all began with a new course at my university on the subject, and I fell in love almost immediately. This thing really got me interested, and I began doing some experiments (one involves django as well, keep reading to know&nbsp;more).</p>
<p>In this week I learned a lot about information retrieval, text categorization, natural language processing and machine learning. But the most relevant thing is: <strong>the principles are easy, their implementation is not</strong>. Most of the techniques are relatively simple, but you usually have to deal with very large datasets, and this can be challenging, since one of the main requirements in information retrieval is time. It&#8217;s much more important to return fewer results in one second than better results in one hour. No one will ever bother to use your system if it takes an hour to get a result. And if you&#8217;re thinking of storing your data in a database, forget about normalization; it wouldn&#8217;t really take you&nbsp;anywhere.</p>
<p><span id="more-159"></span>Talking about storing information, if you&#8217;re dealing with documents you know that a large share of the words are so-called <em>stop words</em>. Stop words are words that don&#8217;t carry much meaning on their own, but help the text flow for the reader. Classic examples are articles like &#8220;the&#8221;, &#8220;a&#8221;, &#8220;an&#8221; or logical connectives like &#8220;or&#8221; and &#8220;and&#8221;. <strong>These words are so common that their presence is quite useless since they&#8217;re&#8230; everywhere</strong>. If you study information retrieval you&#8217;ll learn about a weighting technique called <a href="http://en.wikipedia.org/wiki/Tf-idf">tf-idf</a> that gives these words a weight close to 0, but since you&#8217;d probably use an inverted index for words (an index that, given a word, tells you which documents that word appears in), you can see that including stop words would take a lot of&nbsp;space.</p>
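<p>As a rough illustration of such an index, here is a minimal Python sketch. The stop word list is deliberately tiny, and <code>tokenize</code> and <code>build_inverted_index</code> are hypothetical helper names, not part of any library; a real engine would use a much longer list and a smarter tokenizer.</p>

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "or", "and"}  # tiny illustrative list


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents, stop_words=STOP_WORDS):
    """Map each non-stop word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in stop_words:
                index[word].add(doc_id)
    return index


docs = {1: "The fish and the boat", 2: "A boat or an anchor"}
index = build_inverted_index(docs)
for word in sorted(index):
    print(word, sorted(index[word]))
```

<p>Skipping the stop words keeps the index small: words like &#8220;the&#8221; would otherwise point to nearly every document in the collection.</p>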
<p>So one of the biggest issues so far is that you&#8217;re going to deal with extremely large datasets, so you have to strip out as much as possible. Now consider these words: &#8220;fishing&#8221;, &#8220;fishes&#8221;, &#8220;fish&#8221;. They are all about &#8220;fish&#8221;, and a user searching for &#8220;fish&#8221; would probably be interested in &#8220;fishes&#8221; or &#8220;fishing&#8221; as well. Additionally, it&#8217;s wasteful to store three words that are almost identical. This is where <em>stemming</em> comes in which, quoting the related <a href="http://en.wikipedia.org/wiki/Stemming">wikipedia page</a>, is the <cite>process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form</cite>. Fortunately, if you&#8217;re dealing with English texts, there&#8217;s the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter algorithm</a>, the classic and widely used algorithm for this sort of thing. But it works only with English, so <strong>if your documents are written in another language or in multiple languages, things are going to get&nbsp;complicated</strong>.</p>
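<p>Just to show the idea, here is a deliberately naive suffix-stripping sketch in Python. This is <em>not</em> the Porter algorithm, which applies several ordered rule sets; <code>naive_stem</code> and its suffix list are assumptions made up for this example, and it gets many words wrong (&#8220;running&#8221; becomes &#8220;runn&#8221;, for instance).</p>

```python
SUFFIXES = ("ing", "es", "s")  # checked in this order, longest first


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ("fishing", "fishes", "fish"):
    print(word, "->", naive_stem(word))
```

<p>All three words from the example above collapse to the same stem, so the index needs only one entry for them. For real use you&#8217;d want a proper stemmer implementation rather than a toy like this.</p>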
<p>This leads to the problem of language identification. How do you know which language a text is written in just by looking at it? Of course you can describe the document&#8217;s language with some kind of meta tagging, but not all documents carry this kind of description; just think about the web. There are statistical methods based on the classification of <a href="http://en.wikipedia.org/wiki/N-gram">n-grams</a>, but I haven&#8217;t investigated them in depth yet, so I can&#8217;t really say&nbsp;anything.</p>
<p>Now you have your collection of documents that <em>match</em> a certain query. How do you know which document is more relevant than another (in other words: how do you <em>rank</em> pages)? You have two alternatives (well, probably more, but these are the ones I know at the moment): <strong>the tf-idf we mentioned above and the <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a></strong>. The latter is an interesting one: take the tf-idf vectors of the documents, and treat the query as a document too. Now picture those tf-idf vectors in space and measure the cosine of the angle between each document&#8217;s vector and the query&#8217;s. The closer it is to 1, the more relevant the&nbsp;document.</p>
<p>There are a lot of other important things to cover, like the concepts of precision and recall, but that&#8217;s enough for now. I&#8217;ll talk about them another&nbsp;time.</p>
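<p>Here is a small Python sketch of that ranking step. It uses one common tf-idf variant (term frequency divided by document length, times the natural log of the inverse document frequency); other formulations exist, and <code>tf_idf_vectors</code> and <code>cosine</code> are names chosen for this example rather than taken from django searchable or any other library.</p>

```python
import math
from collections import Counter


def tf_idf_vectors(token_lists):
    """Return one sparse {term: weight} vector per list of tokens."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["fish", "boat"], ["boat", "anchor"], ["anchor", "rope"]]
query = ["fish"]
# the query is treated as one more document, as described above
vectors = tf_idf_vectors(docs + [query])
query_vector = vectors[-1]
scores = [cosine(query_vector, doc_vector) for doc_vector in vectors[:-1]]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

<p>The first document is the only one that shares a term with the query, so it ends up at the top of the ranking, while the other two get a score of zero.</p>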
<p>Anyway I&#8217;m doing an experimental project named <a href="http://code.google.com/p/django-searchable/">django searchable</a>. It&#8217;s a pluggable app for django that implements an information retrieval engine based on tf-idf weighting. Play with it if you&#8217;re brave&nbsp;enough.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Running Django with fastcgi</title>
		<link>http://zeta-puppis.com/2008/10/08/running-django-with-fastcgi/</link>
		<comments>http://zeta-puppis.com/2008/10/08/running-django-with-fastcgi/#comments</comments>
		<pubDate>Wed, 08 Oct 2008 18:18:01 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Django]]></category>
		<category><![CDATA[fcgi]]></category>
		<category><![CDATA[lighttpd]]></category>
		<category><![CDATA[pid]]></category>
		<category><![CDATA[runfcgi]]></category>
		<category><![CDATA[server]]></category>
		<category><![CDATA[socket]]></category>
		<category><![CDATA[startserver]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=152</guid>
		<description><![CDATA[Running django with fastcgi is not a difficult task, thanks in part to the excellent documentation provided. However, the docs provide only a very basic script to automate starting and stopping the fcgi process, so today I had to write my own so I don&#8217;t have to manually fix things if something goes wrong since I let my script [...]]]></description>
			<content:encoded><![CDATA[<p>Running django with fastcgi is not a difficult task, thanks in part to the excellent documentation provided. However, the docs provide only a very basic script to <em>automate</em> starting and stopping the fcgi process, so today I wrote my own: it handles the various situations by itself, so I don&#8217;t have to fix things manually if something goes&nbsp;wrong.</p>
<p><span id="more-152"></span>It has a very basic start/stop/restart interface, like normal startup scripts. Let&#8217;s see an example of basic&nbsp;usage:</p>
<pre><code>kratorius@becks:~/prj$ sudo sh startserver.sh start
Starting fcgi process... done!
kratorius@becks:~/prj$ sudo sh startserver.sh stop
kratorius@becks:~/prj$ sudo sh startserver.sh restart
fcgi process is not running
kratorius@becks:~/prj$ sudo sh startserver.sh start
Starting fcgi process... done!
kratorius@becks:~/prj$ sudo sh startserver.sh restart
Starting fcgi process... done!</code></pre>
<p>And here it&nbsp;is:</p>
<pre><code>#!/bin/bash
# Start and stop a django fcgi process
# Giuliani Vito Ivan &lt;giuliani.v@gmail.com&gt;

# project directory
PROJDIR=`pwd`

# user owner (usually www-data)
USER_OWNER="www-data"

# group owner (usually www-data)
GROUP_OWNER="www-data"

# extra python path (leave empty if unneeded)
PYTHONPATH="../python:.."

# do not edit anything below
PIDFILE="$PROJDIR/technosec.pid"
SOCKET="$PROJDIR/technosec.sock"

start_fcgi()
{
	# the socket is not a regular file, so test it with -S rather than -f
	if [ -f "$PIDFILE" ] || [ -S "$SOCKET" ]; then
		echo "The fcgi process is already running, please stop"
		echo "that before running another process"
	else
		echo -n "Starting fcgi process... "

		nohup /usr/bin/env - \
			PYTHONPATH=$PYTHONPATH \
			python manage.py runfcgi socket=$SOCKET pidfile=$PIDFILE &gt; /dev/null 2&gt;&amp;1 &amp;
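		# $? after a background job is always 0, so wait a moment and
		# check that the pid file has actually been created
		sleep 1
		if [ -f "$PIDFILE" ]; then
			chown $USER_OWNER:$GROUP_OWNER "$SOCKET"
			chown $USER_OWNER:$GROUP_OWNER "$PIDFILE"
			echo "done!"
			return 0
		else
			echo "failed!"
			return 1
		fi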
	fi

	return 1
}

stop_fcgi()
{
	cd $PROJDIR
	if [ -f $PIDFILE ]; then
		kill `cat -- $PIDFILE`
		rm -f -- $PIDFILE
		rm -f -- $SOCKET

		return 0
	else
		echo "fcgi process is not running"
		return 1
	fi
}

restart_fcgi()
{
	stop_fcgi
	if [ $? -eq 0 ]; then
		start_fcgi
	fi
}

case "$1" in
	start)
		start_fcgi
		;;
	stop)
		stop_fcgi
		;;
	restart)
		restart_fcgi
		;;
	*)
		echo "Usage: $0 {start|stop|restart}"
		exit 1
		;;
esac

exit 0</code></pre>
<p>Just in case you may want to download it, <a href="/wp-content/uploads/2008/10/startserver.sh">here it&nbsp;is</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/10/08/running-django-with-fastcgi/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing Pytagram</title>
		<link>http://zeta-puppis.com/2008/08/21/announcing-pytagram/</link>
		<comments>http://zeta-puppis.com/2008/08/21/announcing-pytagram/#comments</comments>
		<pubDate>Thu, 21 Aug 2008 14:42:11 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[pytagram]]></category>
		<category><![CDATA[svg]]></category>
		<category><![CDATA[toc]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=138</guid>
		<description><![CDATA[Today I just finished one of my side projects: pytagram. Basically it generates an SVG file (which can then be saved as eps/pdf/whatever and, if needed, edited by hand) starting from a tree-like plain text file. This can be useful for generating cheat sheets or quick references to the classes or functions that belong to some&#160;project.
I did this [...]]]></description>
			<content:encoded><![CDATA[<p>Today I just finished one of my side projects: pytagram. Basically it generates an SVG file (which can then be saved as eps/pdf/whatever and, if needed, edited by hand) starting from a tree-like plain text file. This can be useful for generating <strong>cheat sheets or quick references</strong> to the classes or functions that belong to some&nbsp;project.</p>
<p>I did this to generate a <a href="http://djangoproject.com">django</a> quick reference (<a href="http://zeta-puppis.com/wp-content/uploads/2008/08/django1.svg">here it is</a>), since django has a lot of functions and, while I know what they do, I can never remember their names (and now two A4 sheets are right in front of&nbsp;me).</p>
<p>If you&#8217;re interested in this, check out the <a href="http://code.google.com/p/pytagram/">google code project page</a> and grab your copy from the SVN&nbsp;repository.</p>
<p>There are <strong>tons of things that can be changed/optimized</strong> (e.g., adding an optional short explanation of each function, more examples, an easier way to change colors, &#8230;) but the code already works quite well, so it can already be useful to people out&nbsp;there.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/08/21/announcing-pytagram/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Alternate text colors with CSS</title>
		<link>http://zeta-puppis.com/2008/08/07/alternate-text-colors-with-css/</link>
		<comments>http://zeta-puppis.com/2008/08/07/alternate-text-colors-with-css/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 16:40:28 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[cascade style sheet]]></category>
		<category><![CDATA[css]]></category>
		<category><![CDATA[css trick]]></category>
		<category><![CDATA[experiment]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[style]]></category>
		<category><![CDATA[text manipulation]]></category>
		<category><![CDATA[trick]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=122</guid>
		<description><![CDATA[When I was redesigning this site, I experimented with many different options for the header. Among all the solutions I tried, I was very happy with the one I&#8217;m going to illustrate, even though I chose another one (the one you can see now) because it integrates better with the [...]]]></description>
			<content:encoded><![CDATA[<p>When I was redesigning this site, I experimented with many different options for the header. Among all the solutions I tried, I was very happy with the one I&#8217;m going to illustrate, even though I chose another one (the one you can see now) because it integrates better with the whole&nbsp;layout.</p>
<p><span id="more-122"></span></p>
<p>What I wanted to achieve was basically&nbsp;this:</p>
<div style="clear: both; margin: 0 auto; text-align: center"><img src="http://zeta-puppis.com/wp-content/uploads/2008/08/2colorstext.png" alt="" title="two colors text" width="465" height="80" class="size-full wp-image-123" /></div>
<p>I thought about various techniques to achieve that: a fixed image, text with an alternating background, experiments with PNG transparency, etc. All of these have their pros and cons, but in the end I came up with a simple solution that required just <strong>a few lines of CSS code</strong> and had fewer cons than the&nbsp;others.</p>
<p>Let&#8217;s see how I did it. First of all, we need a container for our text. I used a <code>div</code> with its id set to <code>text</code>. Our text goes inside <code>#text</code>. We&#8217;ll use absolute positioning within the <code>div</code>, so <code>#text</code> must have <code>position: relative</code>&nbsp;set.</p>
<p>Next, we need two <code>span</code>s containing the same text; these get the absolute positioning I mentioned, with <code>top: 0; left: 0;</code> set on both. Here is how the trick works. Because the two <code>span</code>s are absolutely positioned at the same coordinates (with the same font size), they overlap each other. Thanks to absolute positioning we can set the <code>height</code> on the <code>span</code>s, and this is the core of the trick: to make it work we have to set <strong>the first <code>span</code>&#8217;s height to half the height of the second&nbsp;<code>span</code></strong>.</p>
<p>But that&#8217;s not all. If we stop here, nothing will work. We also have to set <code>overflow: hidden</code> and <code>z-index: 1000</code> on the first <code>span</code>. The former clips the lower half of that <code>span</code>&#8217;s text, so the second <code>span</code> shows through below it; the latter keeps the clipped copy on top. As a side note, 1000 is not a mandatory value for <code>z-index</code>: even 1 would do, but 1000 is a common choice if you want the element to <em>always</em> be on&nbsp;top.</p>
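<p>The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code of the demo: the class names <code>top</code> and <code>bottom</code>, the colors, the sample text and the pixel sizes (including the explicit <code>line-height</code> and the container <code>height</code>) are all assumptions chosen for the&nbsp;example.</p>
<pre><code>&lt;style&gt;
  #text {
    position: relative;   /* positioning context for the two spans */
    height: 40px;         /* assumed: absolute children add no height */
    font-size: 40px;      /* assumed size, tweak together with heights */
    line-height: 40px;
  }
  #text span {
    position: absolute;   /* both spans sit at the same coordinates */
    top: 0;
    left: 0;
    white-space: nowrap;
  }
  #text .bottom {
    height: 40px;         /* full text height */
    color: #333;          /* assumed color for the lower half */
  }
  #text .top {
    height: 20px;         /* half the height of the second span */
    overflow: hidden;     /* clips the lower half of this copy */
    z-index: 1000;        /* keeps this copy above the other one */
    color: #999;          /* assumed color for the upper half */
  }
&lt;/style&gt;

&lt;div id="text"&gt;
  &lt;span class="top"&gt;Sample text&lt;/span&gt;
  &lt;span class="bottom"&gt;Sample text&lt;/span&gt;
&lt;/div&gt;</code></pre>
<p>Changing <code>font-size</code> here means changing <code>line-height</code> and both <code>height</code> values by hand, so that the first <code>span</code> stays at exactly half the height of the&nbsp;second.</p>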
<p><a href="http://zeta-puppis.com/wp-content/uploads/alternatecolorstext/">This is the final result</a>. I really like it, but I can&#8217;t use it on this design because it doesn&#8217;t integrate well with the whole layout. It also has some cons: <strong>you have to manually tweak the <code>font-size</code> and <code>height</code> values</strong>, since it&#8217;s not possible to use relative sizes, and your markup will contain the text twice, though this can be easily <em>fixed</em> with a bit of&nbsp;javascript.</p>
<p>I tested this under Firefox 3.0.2, Explorer 7, Opera 9.25 and Safari 3.1.2 and <strong>it works without any hack</strong>, though I&#8217;m quite sure it will have some issues with older versions of&nbsp;Explorer.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/08/07/alternate-text-colors-with-css/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Practical Django Projects</title>
		<link>http://zeta-puppis.com/2008/07/05/practical-django-projects/</link>
		<comments>http://zeta-puppis.com/2008/07/05/practical-django-projects/#comments</comments>
		<pubDate>Sat, 05 Jul 2008 12:02:58 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Django]]></category>
		<category><![CDATA[Me]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[book]]></category>
		<category><![CDATA[practical django projects]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=95</guid>
		<description><![CDATA[Due to my devotion to the Django web framework, I finally got my copy of Practical Django Projects, by James Bennett. I wasn&#8217;t expecting to have it so soon, but it was a beautiful surprise anyway (to tell the truth, I didn&#8217;t buy it: it was sent to me as a replacement prize for djangodash because I was [...]]]></description>
			<content:encoded><![CDATA[<p>Due to my devotion to the <a href="http://djangoproject.com">Django</a> web framework, I finally got my copy of <a href="http://www.amazon.com/dp/1590599969/">Practical Django Projects</a>, by <a href="http://b-list.org">James Bennett</a>. I wasn&#8217;t expecting to have it so soon, but it was a <strong>beautiful surprise</strong> anyway (to tell the truth, I didn&#8217;t buy it: it was sent to me as a <em>replacement prize</em> for <a href="http://djangodash.com">djangodash</a> because I was not eligible to get the G33K beers since I live outside the US. Thanks to the generosity of <a href="http://toastdriven.com">Daniel&nbsp;Lindsley</a>).</p>
<p><span id="more-95"></span></p>
<p><img src="/wp-content/uploads/2008/07/practical.jpg" alt="the Practical Django Projects book" width="400" height="308" class="aligncenter size-full wp-image-96" /></p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/07/05/practical-django-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
