<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Zeta-Puppis.com &#187; Coding</title>
	<atom:link href="http://zeta-puppis.com/category/coding/feed/" rel="self" type="application/rss+xml" />
	<link>http://zeta-puppis.com</link>
	<description>my very own personal corner</description>
	<lastBuildDate>Sat, 18 Feb 2012 12:53:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Clustering coordinate points together with quad-trees</title>
		<link>http://zeta-puppis.com/2010/10/02/clustering-coordinate-points-together-with-quad-trees/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=clustering-coordinate-points-together-with-quad-trees</link>
		<comments>http://zeta-puppis.com/2010/10/02/clustering-coordinate-points-together-with-quad-trees/#comments</comments>
		<pubDate>Sat, 02 Oct 2010 18:15:20 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[dataset]]></category>
		<category><![CDATA[openheatmap]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[quad tree]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=329</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2010/10/02/clustering-coordinate-points-together-with-quad-trees/" title="Clustering coordinate points together with quad-trees"></a>Recently I needed to show a heat map of quite a lot of coordinate points for a little project of mine that ended up in a data visualization contest (which unfortunately I didn&#8217;t win, even though I made it to &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2010/10/02/clustering-coordinate-points-together-with-quad-trees/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2010/10/02/clustering-coordinate-points-together-with-quad-trees/" title="Clustering coordinate points together with quad-trees"></a><p>Recently I needed to show a heat map of quite a lot of coordinate points for a little project of mine that ended up in a <a href="http://thisweekinrelevance.com/2010/09/07/twir-contest/">data visualization contest</a> (which unfortunately I didn&#8217;t win, even though I made it to the finalists). The idea was to show the distribution of the georeferenced Wikipedia pages through a heat map, so when I first heard about openheatmap.com I knew it was the tool to use. OpenHeatMap.com is an excellent project by <a href="http://petewarden.typepad.com">Pete Warden</a> that takes a dataset as a CSV, Excel or Google Spreadsheet file and converts it to a nice, browsable heat map presentation.<br />
<span id="more-329"></span><br />
The first step was to obtain a dataset I could work on. I first tried to work directly on the whole <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">Wikipedia database dump</a>, extracting all the georeferenced pages into a smaller dataset. I actually succeeded, but the result included only the English georeferenced pages. Also, extracting and converting coordinates to a common format would have been a real pain. So instead I decided to use the dump from the <a href="http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Wikipedia-World/en">wikipedia-world project</a>, which already provides the data as a CSV file built from all the downloadable Wikipedia dumps, including languages other than English. This dataset includes roughly 1,300,000 points, so I had to narrow down my options for processing&nbsp;it.</p>
<p>Once I had the dataset ready and knew how big it was I realized I had three&nbsp;options:</p>
<ol>
<li>the naive approach: just add every coordinate to the CSV&nbsp;file</li>
<li>use a reverse geocoding service to get the country each point belongs&nbsp;to</li>
<li>cluster sets of points&nbsp;together</li>
</ol>
<p>It was clear that the first approach wouldn&#8217;t work, for two reasons. First, there were just too many points for <a href="http://openheatmap.com">OHM</a> (the rendering is done on the client side and that would slow things down a lot). Second, I would just draw points onto a map without actually creating a &#8220;heat map&#8221;, so I soon discarded that option.<br />
Using a reverse geocoding service wouldn&#8217;t have worked either: I would have had to send too many requests to the service and it would have taken ages. Also, I would have ended up with per-country rather than per-city highlighting, and that would have distorted the final result. So the only viable option was to cluster sets of points together and then produce a CSV file that OHM would understand. Soon I realized I needed some sort of spatial index for a 2D space, which turned out to be a quad&nbsp;tree.</p>
<p>Before we dive deeper into how to cluster the points together we need to understand what a quad-tree is. Following the classic recursive definition of a tree, a quad-tree is a tree where each node, which represents a coordinate point, has up to four children. Each child occupies a position relative to its parent: north west, north east, south west or south&nbsp;east.</p>
<p><img src="http://zeta-puppis.com/wp-content/uploads/2010/10/quadtree.png" alt="" title="Quad Tree" width="300" height="200" class="alignnone size-full wp-image-333" /></p>
<p>One requirement for generating the heat map is knowing how many nodes we clustered together. Thus it&#8217;s easy to define a node item as an object storing the coordinates of the point and the number of nodes that have been aggregated on that point. We can thus define a class like this (all the code examples following will be in&nbsp;Python):</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> PointNode<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>:
    NW, NE, SW, SE = <span style="color: #ff4500;">0</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">3</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, lat, lon<span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>.<span style="color: black;">lat</span>, <span style="color: #008000;">self</span>.<span style="color: black;">lon</span> = <span style="color: #008000;">float</span><span style="color: black;">&#40;</span>lat<span style="color: black;">&#41;</span>, <span style="color: #008000;">float</span><span style="color: black;">&#40;</span>lon<span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">nodes</span> = <span style="color: black;">&#91;</span><span style="color: #008000;">None</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">*</span> <span style="color: #ff4500;">4</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">aggregate_no</span> = <span style="color: #ff4500;">1</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__str__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;%s, %s&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">lat</span>, <span style="color: #008000;">self</span>.<span style="color: black;">lon</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>We&#8217;ll do two operations on the quad-tree: insert a new node and visit the whole tree. We can give a recursive definition of the insertion using the quad-tree as underlying data structure. Given a&nbsp;node:</p>
<ol>
<li>if the tree&#8217;s node is empty just insert the node there and set the number of points clustered together for that node to&nbsp;1</li>
<li>if the new node is near the tree&#8217;s node then compute a &#8220;middle node&#8221;, substitute it to the tree&#8217;s node and increment the number of points clustered together for that&nbsp;node</li>
<li>otherwise find out where the new node belongs in the quad-tree (north west, north east, south west or south east) and insert it&nbsp;there</li>
</ol>
<p>The insert operation on the quad-tree can be thus coded like&nbsp;this:</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> qtree_insert<span style="color: black;">&#40;</span>root, node<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;&quot;&quot;
    Insert a point into the quad tree substituting a node with its
    midpoint if the nodes are near to each other (less than DISTANCE_LIMIT)
    &quot;&quot;&quot;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> root: <span style="color: #ff7700;font-weight:bold;">return</span> node
&nbsp;
    <span style="color: #808080; font-style: italic;"># if we are under the distance limit, replace the root node with the</span>
    <span style="color: #808080; font-style: italic;"># midpoint of the two nodes</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> point_distance<span style="color: black;">&#40;</span>root, node<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&lt;</span> DISTANCE_LIMIT:
        c = point_midpoint<span style="color: black;">&#40;</span>root, node<span style="color: black;">&#41;</span>
        c.<span style="color: black;">nodes</span> = root.<span style="color: black;">nodes</span>
        c.<span style="color: black;">aggregate_no</span> = root.<span style="color: black;">aggregate_no</span> + <span style="color: #ff4500;">1</span>
        root = c
    <span style="color: #ff7700;font-weight:bold;">else</span>:
        <span style="color: #808080; font-style: italic;"># otherwise just insert the node where it belongs</span>
&nbsp;
        <span style="color: #808080; font-style: italic;"># exploit PointNode child indexing (NW=0, NE=1, SW=2, SE=3): starting from</span>
        <span style="color: #808080; font-style: italic;"># NW, add 2 if the node lies to the south and 1 if it lies to the east</span>
        pos = PointNode.<span style="color: black;">NW</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> node.<span style="color: black;">lat</span> <span style="color: #66cc66;">&lt;</span> root.<span style="color: black;">lat</span>:
            pos += <span style="color: #ff4500;">2</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> node.<span style="color: black;">lon</span> <span style="color: #66cc66;">&gt;</span> root.<span style="color: black;">lon</span>:
            pos += <span style="color: #ff4500;">1</span>
&nbsp;
        root.<span style="color: black;">nodes</span><span style="color: black;">&#91;</span>pos<span style="color: black;">&#93;</span> = qtree_insert<span style="color: black;">&#40;</span>root.<span style="color: black;">nodes</span><span style="color: black;">&#91;</span>pos<span style="color: black;">&#93;</span>, node<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> root</pre></td></tr></table></div>

<p>The distance between two nodes can be computed using the Pythagorean formula with parallel meridians. With the Earth&#8217;s radius expressed in kilometers, this formula returns the distance between two points in kilometers, and it&#8217;s defined as: <img src='http://s.wordpress.com/latex.php?latex=D%3DR%5Csqrt%7B%28%5CDelta%5Cphi%29%5E2%2B%28%5CDelta%5Clambda%29%5E2%7D%5C%21&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='D=R\sqrt{(\Delta\phi)^2+(\Delta\lambda)^2}\!' title='D=R\sqrt{(\Delta\phi)^2+(\Delta\lambda)^2}\!' class='latex' /> where <img src='http://s.wordpress.com/latex.php?latex=R&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='R' title='R' class='latex' /> is the Earth&#8217;s radius and <img src='http://s.wordpress.com/latex.php?latex=%28%5Cphi_0%2C%5Clambda_0%29%5C%2C%5C%21&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(\phi_0,\lambda_0)\,\!' title='(\phi_0,\lambda_0)\,\!' class='latex' />, <img src='http://s.wordpress.com/latex.php?latex=%28%5Cphi_1%2C%5Clambda_1%29%5C%2C%5C%21&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(\phi_1,\lambda_1)\,\!' title='(\phi_1,\lambda_1)\,\!' class='latex' /> are the coordinates of the two points in radians (thus <img src='http://s.wordpress.com/latex.php?latex=%5CDelta%5Cphi&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta\phi' title='\Delta\phi' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5CDelta%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta\lambda' title='\Delta\lambda' class='latex' /> are the differences between the two points&#8217; latitudes and longitudes).<br />
Note that we can get the distance in miles, or in any other unit, just by converting the Earth&#8217;s radius accordingly. We should keep in mind though that this formula is not very accurate, especially far from the equator where the meridians converge, so if we need better accuracy we need to find a better&nbsp;alternative.</p>
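<p>As a rough sketch of the formula above, this is how the two helpers used by <code>qtree_insert</code> could look. The bodies of <code>point_distance</code> and <code>point_midpoint</code>, the <code>EARTH_RADIUS_KM</code> constant and the sample coordinates are assumptions made for illustration and may differ from the script in the repository. The midpoint is a plain arithmetic mean, which is only reasonable for points that are already close to each other:</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; use the radius in miles to get miles


class PointNode(object):
    """Minimal stand-in for the PointNode class defined earlier."""
    def __init__(self, lat, lon):
        self.lat, self.lon = float(lat), float(lon)
        self.nodes = [None] * 4
        self.aggregate_no = 1


def point_distance(a, b):
    """Pythagorean distance with parallel meridians, in kilometers."""
    d_phi = math.radians(b.lat - a.lat)      # latitude difference
    d_lambda = math.radians(b.lon - a.lon)   # longitude difference
    return EARTH_RADIUS_KM * math.sqrt(d_phi ** 2 + d_lambda ** 2)


def point_midpoint(a, b):
    """Arithmetic midpoint (assumption: the two points are near each other)."""
    return PointNode((a.lat + b.lat) / 2.0, (a.lon + b.lon) / 2.0)


# two sample points, coordinates in degrees
first = PointNode(41.9, 12.5)
second = PointNode(45.46, 9.19)
print(point_distance(first, second))
print(point_midpoint(first, second).lat, point_midpoint(first, second).lon)
```

<p>One degree of longitude always counts for about 111 km here, whatever the latitude, which is exactly where the inaccuracy mentioned above comes from.</p>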
<p>Browsing the whole tree can be done using a classic tree visit and, considering that for my purposes there&#8217;s no need to visit the nodes in any particular order, I chose a depth-first search (DFS) to save some memory. The final source code of the Python script can be found on <a href="http://github.com/kratorius/wikipedia-fun/blob/master/wikicoords.py">my repository on github</a>. The whole process takes slightly more than one minute on my desktop machine. This, instead, is the final heat map that I entered in the <a href="http://thisweekinrelevance.com/">This Week In Relevance</a>&nbsp;contest:</p>
<p><iframe width="600" height="450" src="http://www.openheatmap.com/embed.html?map=ChacmaFoliaceousnessTarata" ></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2010/10/02/clustering-coordinate-points-together-with-quad-trees/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Simulated Annealing</title>
		<link>http://zeta-puppis.com/2010/02/22/simulated-annealing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=simulated-annealing</link>
		<comments>http://zeta-puppis.com/2010/02/22/simulated-annealing/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 13:17:59 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[combinatorial problem]]></category>
		<category><![CDATA[knapsack problem]]></category>
		<category><![CDATA[local search]]></category>
		<category><![CDATA[np-complete]]></category>
		<category><![CDATA[simulated annealing]]></category>
		<category><![CDATA[stochastic]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=261</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2010/02/22/simulated-annealing/" title="Simulated Annealing"></a>For a problem I&#8217;m working on I got stuck in the classic situation of a local maximum. After trying to work around the problem in several more or less creative ways, I thought of the simulated annealing algorithm. Considering it&#8217;s been &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2010/02/22/simulated-annealing/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2010/02/22/simulated-annealing/" title="Simulated Annealing"></a><p>For a problem I&#8217;m working on I got stuck in the classic situation of a local maximum. After trying to work around the problem in several more or less creative ways, I thought of the <a href="http://en.wikipedia.org/wiki/Simulated_annealing">simulated annealing</a> algorithm. Considering it&#8217;s been a while since I last saw it, I searched for it on the web and, surprisingly, there is not much material about it, and the few bits I found are often contradictory. After quite a lot of digging I decided to write about it here. As a warning I should probably say that we will touch on some basic statistics and complexity analysis, as well as a quick formal introduction to the knapsack problem. You should be able to follow even if you don&#8217;t know anything about those topics, but having some foundations in these areas would be of great help.<br />
<span id="more-261"></span><br />
Let&#8217;s begin with the knapsack problem. This is a classic combinatorial computer science problem known to be <a href="http://en.wikipedia.org/wiki/NP-complete">NP-complete</a>, meaning that no algorithm is known that finds the exact optimal solution in polynomial time. In practice this means that most of the time we are happy with a good solution, as long as it&#8217;s not too far from the optimal one. In the simplest possible terms you are a thief in a room with a set of objects that are worth something, but you have only one knapsack, and that knapsack can carry at most a certain weight, so you have to choose carefully which objects to steal in order to maximize the earnings. For example consider the following situation: you can carry at most 5kg, and there is one laptop and a 4kg safe with pure diamonds inside. You can&#8217;t carry both of them so you have to choose what&#8217;s better to carry away, the diamonds or the laptop. A smart thief would choose the diamonds since their value is considerably higher than the&nbsp;laptop&#8217;s.</p>
<p>There are a few variations of the same problem but the most common one is named &#8220;0-1&#8221;: you can&#8217;t split an object over two or more carriers or bags; either you take the whole object or you leave it where it is. Mathematically speaking, consider a set of <img src='http://s.wordpress.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> objects (a candidate solution), where each item <img src='http://s.wordpress.com/latex.php?latex=x_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_j' title='x_j' class='latex' /> is worth <img src='http://s.wordpress.com/latex.php?latex=p_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_j' title='p_j' class='latex' /> and weighs <img src='http://s.wordpress.com/latex.php?latex=w_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w_j' title='w_j' class='latex' />, with <img src='http://s.wordpress.com/latex.php?latex=1%20%5Cleq%20j%20%5Cleq%20n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1 \leq j \leq n' title='1 \leq j \leq n' class='latex' />. Then the goal is to maximize the following&nbsp;function:</p>
<img src='http://s.wordpress.com/latex.php?latex=q%28%5C%7Bx_1%2C%20x_2%2C%20%5Cldots%2C%20x_n%5C%7D%29%20%3D%20%5Csum_%7Bj%3D1%7D%5E%7Bn%7Dp_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q(\{x_1, x_2, \ldots, x_n\}) = \sum_{j=1}^{n}p_j' title='q(\{x_1, x_2, \ldots, x_n\}) = \sum_{j=1}^{n}p_j' class='latex' />
<p>subject to the following constraint, where <img src='http://s.wordpress.com/latex.php?latex=W&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='W' title='W' class='latex' /> is the maximum weight we can&nbsp;carry:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Csum_%7Bj%3D1%7D%5E%7Bn%7Dw_j%20%5Cleq%20W&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_{j=1}^{n}w_j \leq W' title='\sum_{j=1}^{n}w_j \leq W' class='latex' />
<p>The first problem we have to face then is how to generate the items and how to calculate the value of a solution. The following example, like the others that will follow, is written in Python but it&#8217;s quite easy to understand, so porting it to another language shouldn&#8217;t be that hard. I chose to generate 50 objects with values that range from 1 to 99$ (both ends included) using a <a href="http://en.wikipedia.org/wiki/Uniform_distribution_(continuous)">uniform distribution</a> (if you don&#8217;t know much about statistics, it means that every value in the range is equally likely). The same goes for the weights, except they range from 1 to 20 (the choice of the unit of weight is left to&nbsp;you).</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> generate_items<span style="color: black;">&#40;</span>n_items=<span style="color: #ff4500;">100</span><span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;Generate a list of items that could be stolen&quot;</span>
    items = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> n <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, n_items<span style="color: black;">&#41;</span>:
        <span style="color: #808080; font-style: italic;"># use a uniform distribution both for values and for weights;</span>
        <span style="color: #808080; font-style: italic;"># randint includes both ends, so values go from 1 to 99</span>
        cost = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">99</span><span style="color: black;">&#41;</span>
        weight = <span style="color: #dc143c;">random</span>.<span style="color: black;">uniform</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">20</span><span style="color: black;">&#41;</span>
        items.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>cost, weight<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> items</pre></div></div>

<p>So the items are nothing more than a list of pairs in the format <img src='http://s.wordpress.com/latex.php?latex=%28value_i%2C%20weight_i%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(value_i, weight_i)' title='(value_i, weight_i)' class='latex' /> for every object <img src='http://s.wordpress.com/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_i' title='x_i' class='latex' />. A more realistic dataset would probably have used a <a href="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a> for the weights, but it&#8217;s trivial to change the generation function to work that way. For our purposes the uniform distribution does its&nbsp;job.</p>
<p>As we said above the problem is NP-complete, so we usually need to visit the whole search space (every subset of the objects) to get the optimal solution, and that space gets quite big as the number of objects grows. This is where simulated annealing comes in. We won&#8217;t visit the search space exhaustively; we&#8217;d rather <em>generate</em> solutions. Indeed, it is a stochastic heuristic search algorithm. A heuristic is a function that measures how good or bad something is, and stochastic means that we move through the search space in a more or less random way. In practical terms it&#8217;s not greedy, in that it doesn&#8217;t always follow what the heuristic says but rather searches randomly around where the heuristic function points. For example consider the needle in the haystack situation: an exhaustive search method would take every piece of straw, check that it&#8217;s not a needle, put it aside and repeat those moves until you find the needle. You don&#8217;t want to proceed this way: you&#8217;re likely to finish sooner if you look randomly in the haystack and, if something stings you while you&#8217;re holding some straw, search that bunch of straw, because the needle may be in there. In the knapsack problem we do have a heuristic: it&#8217;s the <img src='http://s.wordpress.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> function above that, for each solution (admissible or not), says how much it&#8217;s worth. In this case we define an admissible solution as one that is not too&nbsp;heavy.</p>
<p>A useful (and classic as well) example is blind hill climbing. You&#8217;re blind and stuck on a hill and you need to reach the top. A good principle that could lead you to the top is to feel the terrain and always follow the rising path. It <i>could</i>, because if you end up on top of a rock then you surely haven&#8217;t reached the top of the hill, yet the principle above no longer helps since every step leads down: you have reached a local maximum (invert things and you get the same situation for a local minimum). Simulated annealing avoids this problem by trying worsening moves from time to time: even if this may not sound like a good idea, it helps to escape the situation described&nbsp;above.</p>
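<p>To make the heuristic concrete, here is a minimal sketch of how the value and weight of a solution could be computed. It assumes that a solution is a list of indexes into the item list, which is a representation picked for this example. The only thing the code later in this post relies on is that <code>compute_cost</code> returns the value as its first element; the body shown here and the <code>is_admissible</code> helper are hypothetical:</p>

```python
def compute_cost(solution, items):
    """Return (total value, total weight) of the items picked by `solution`.

    Assumption: `solution` is a collection of indexes into `items`.
    """
    value, weight = 0, 0
    for idx in solution:
        item_value, item_weight = items[idx]
        value += item_value
        weight += item_weight
    return value, weight


def is_admissible(solution, items, max_weight):
    """A solution is admissible when it is not too heavy for the knapsack."""
    return compute_cost(solution, items)[1] <= max_weight


items = [(60, 10), (100, 20), (120, 30)]   # (value, weight) pairs
print(compute_cost([0, 2], items))          # (180, 40)
print(is_admissible([0, 2], items, 50))     # True
print(is_admissible([0, 1, 2], items, 50))  # False, it weighs 60
```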
<p><img src="http://zeta-puppis.com/wp-content/uploads/2010/02/plotsurf.gif" class="align-center" /></p>
<p>In the function plotted above, starting from the middle we could end up climbing the left peak (which is a local maximum). Using hill climbing we&#8217;d be stuck there because we wouldn&#8217;t try other&nbsp;paths.</p>
<p>Simulated annealing takes its name from the annealing process that metals go through when cooling down from their melting point. Indeed, the cooling process involves many particles that change energy state (this statement may be inaccurate or even plain wrong, but please forgive me as I never studied these things and all I know in this field comes from the simulated annealing algorithm itself); in particular we can calculate the transition probability from one state to another. Considering two energy states <img src='http://s.wordpress.com/latex.php?latex=e_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_i' title='e_i' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=e_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_j' title='e_j' class='latex' /> and a temperature <img src='http://s.wordpress.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='T' title='T' class='latex' />, switching from <img src='http://s.wordpress.com/latex.php?latex=e_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_i' title='e_i' class='latex' /> to <img src='http://s.wordpress.com/latex.php?latex=e_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e_j' title='e_j' class='latex' /> has&nbsp;probability:</p>
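<p>Getting stuck is easy to reproduce with a toy example. The sketch below runs a greedy hill climbing over a made-up one-dimensional landscape (the <code>heights</code> list and the <code>hill_climb</code> function are invented for this illustration, they are not the function plotted above):</p>

```python
def hill_climb(f, x, step=1):
    """Greedy ascent on integers: move to the better neighbour until none is better."""
    while True:
        best_neighbour = max((x - step, x + step), key=f)
        if f(best_neighbour) <= f(x):
            return x
        x = best_neighbour


# made-up landscape: local maximum at x=2 (height 4), global maximum at x=8 (height 9)
heights = [0, 2, 4, 1, 0, 3, 5, 7, 9, 6, 2]


def f(x):
    # anything outside the landscape counts as a bottomless pit
    return heights[x] if 0 <= x < len(heights) else float("-inf")


print(hill_climb(f, 3))   # 2: stuck on the local maximum
print(hill_climb(f, 5))   # 8: reaches the global maximum
```

<p>Starting from 3 the only rising path leads left, and once on top of the small peak every neighbour is lower, so the search stops there.</p>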
<img src='http://s.wordpress.com/latex.php?latex=P%28e_i%2C%20e_j%20%7C%20T%29%20%3D%20e%5E%7B%28e_i%20-%20e_j%29%20%2F%20%28k_BT%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(e_i, e_j | T) = e^{(e_i - e_j) / (k_BT)}' title='P(e_i, e_j | T) = e^{(e_i - e_j) / (k_BT)}' class='latex' />
<p>Where <img src='http://s.wordpress.com/latex.php?latex=k_B&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k_B' title='k_B' class='latex' /> is a constant called <a href="http://en.wikipedia.org/wiki/Boltzmann_constant">Boltzmann&#8217;s constant</a>. But then, how do we apply those statements to our problem (or to a combinatorial search problem in general)? The most important concept to grasp is the change of energy state. Just as a particle changes state, a solution may change. Indeed for the knapsack problem there are many admissible solutions, each one with an associated earning. Of course we&#8217;d prefer the one with the highest earnings (and simulated annealing will help us find it) but it&#8217;s still perfectly acceptable to go from one solution to another as long as the other solution is also&nbsp;admissible.</p>
<p>So here is what simulated annealing does: if you find a better solution, go on and take that path (under this circumstance it behaves just like hill climbing), otherwise change state with probability <img src='http://s.wordpress.com/latex.php?latex=P%28e_i%2C%20e_j%20%7C%20T%29%20%3D%20e%5E%7B%28e_i%20-%20e_j%29%20%2F%20T%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='P(e_i, e_j | T) = e^{(e_i - e_j) / T}' title='P(e_i, e_j | T) = e^{(e_i - e_j) / T}' class='latex' />, where the energy of a solution is its value with the sign flipped, so a worsening move goes to a higher energy and the probability stays below 1 (you may notice that Boltzmann&#8217;s constant is missing: that constant matters in thermodynamics, where it relates temperature to physical energy, and here it can be folded into the temperature). The effect of the temperature is that at higher temperatures the algorithm tries worsening moves quite often, while at lower temperatures the probability of a worsening move is lower, so as the temperature tends towards 0 it behaves more and more like the hill climbing&nbsp;algorithm.</p>
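<p>A quick numeric check shows the effect of the temperature. In this sketch <code>delta</code> is the value of the new solution minus the value of the current one, so it is negative for a worsening move; the function name <code>acceptance_probability</code> and the sample temperatures are picked only for illustration:</p>

```python
import math


def acceptance_probability(delta, temperature):
    """Probability of accepting a move whose value changes by `delta`.

    delta is new value minus current value, so it is negative for a
    worsening move; improving moves are always accepted.
    """
    if delta > 0:
        return 1.0
    return math.exp(delta / float(temperature))


# the same worsening move (we lose 1$) at three different temperatures
for temperature in (10.0, 1.0, 0.1):
    print(temperature, acceptance_probability(-1, temperature))
```

<p>The same move is accepted roughly 90% of the time at temperature 10, 37% of the time at temperature 1 and almost never (about 0.005%) at temperature 0.1.</p>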

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">math</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> simulated_annealing<span style="color: black;">&#40;</span>solution, items, max_weight<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;Apply simulated annealing to solve the knapsack problem&quot;</span>
    best = solution
    best_value = compute_cost<span style="color: black;">&#40;</span>solution, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
    current_sol = solution
    temperature = <span style="color: #ff4500;">1.0</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
        <span style="color: #808080; font-style: italic;"># value of the best solution before this round of cooling steps</span>
        current_value = compute_cost<span style="color: black;">&#40;</span>best, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, COOLING_STEPS<span style="color: black;">&#41;</span>:
            <span style="color: #808080; font-style: italic;"># pick a random neighbour of the current solution</span>
            moves = generate_moves<span style="color: black;">&#40;</span>current_sol, items, max_weight<span style="color: black;">&#41;</span>
            idx = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>moves<span style="color: black;">&#41;</span> - <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
            random_move = moves<span style="color: black;">&#91;</span>idx<span style="color: black;">&#93;</span>
&nbsp;
            delta = compute_cost<span style="color: black;">&#40;</span>random_move, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> - compute_cost<span style="color: black;">&#40;</span>best, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
&nbsp;
            <span style="color: #ff7700;font-weight:bold;">if</span> delta <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">0</span>:
                <span style="color: #808080; font-style: italic;"># improving move: always accept it</span>
                best = random_move
                best_value = compute_cost<span style="color: black;">&#40;</span>best, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                current_sol = random_move
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                <span style="color: #808080; font-style: italic;"># worsening move: accept it with probability e^(delta / temperature)</span>
                <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #dc143c;">math</span>.<span style="color: black;">exp</span><span style="color: black;">&#40;</span>delta / <span style="color: #008000;">float</span><span style="color: black;">&#40;</span>temperature<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #dc143c;">random</span>.<span style="color: #dc143c;">random</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
                    current_sol = random_move
&nbsp;
        temperature = TEMP_ALPHA <span style="color: #66cc66;">*</span> temperature
        <span style="color: #808080; font-style: italic;"># stop when a whole round of cooling steps brought no improvement</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> current_value <span style="color: #66cc66;">&gt;</span>= best_value <span style="color: #ff7700;font-weight:bold;">or</span> temperature <span style="color: #66cc66;">&lt;</span>= <span style="color: #ff4500;">0</span>:
            <span style="color: #ff7700;font-weight:bold;">break</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># assumption: the caller wants the best solution found and its value</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> best, best_value</pre></div></div>

<p>And finally, that is the simulated annealing. You start from a temperature of 1.0, then run a fixed number of cooling steps. In each step you pick a random item from the neighbours and, if it is better than the current best item, it becomes the new best item (and the new local solution). If the new item&#8217;s value is worse than the current best, you update the current local solution with the probability expressed above. After the cooling steps the temperature is decreased with an <a href="http://en.wikipedia.org/wiki/Exponential_decay">exponential decay</a> (usually it is&nbsp;<img src='http://s.wordpress.com/latex.php?latex=t%20%3D%20%5Calpha%20t%2C%200.8%20%3C%20%5Calpha%20%3C%200.9&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t = \alpha t, 0.8 &lt; \alpha &lt; 0.9' title='t = \alpha t, 0.8 &lt; \alpha &lt; 0.9' class='latex' />).</p>
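<p>The acceptance rule described above can be isolated in a small helper. This is only a sketch: the names <code>acceptance_probability</code> and <code>accept</code> are hypothetical and do not appear in the complete code, but the formula is the same one used in the loop, assuming a maximisation problem where a positive delta is an improvement.</p>

```python
import math
import random


def acceptance_probability(delta, temperature):
    """Probability of moving to a neighbour whose value differs by delta.

    Hypothetical helper: delta > 0 means the neighbour is better (we are
    maximising the value), so it is always accepted. Otherwise the chance
    shrinks as the move gets worse or as the temperature drops.
    """
    if delta > 0:
        return 1.0
    return math.exp(delta / float(temperature))


def accept(delta, temperature, rng=random):
    """Draw a random number in [0, 1) and decide whether to take the move."""
    return acceptance_probability(delta, temperature) > rng.random()
```

<p>At a high temperature a slightly worse move is accepted fairly often, which is what lets the search escape a local optimum; as the temperature drops the same move becomes very unlikely.</p>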
<p>In the example above you don&#8217;t wait for the temperature to reach 0: you leave the loop if, after all the cooling steps, there hasn&#8217;t been any improvement. How large the number of cooling steps and the <img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' /> value should be is a fine-tuning problem. A different approach that lets you get rid of the cooling steps is to make the temperature cool down slowly (<img src='http://s.wordpress.com/latex.php?latex=t%20%3D%20%20t%20%2F%20%281%20%2B%20%28%5Cbeta%20t%29%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t =  t / (1 + (\beta t))' title='t =  t / (1 + (\beta t))' class='latex' /> where <img src='http://s.wordpress.com/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\beta' title='\beta' class='latex' /> is a very small value like 0.01). Besides, a common improvement is to cache the moves within the cooling steps, regenerating them only when you find a new best value or change&nbsp;state.</p>
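<p>To get a feeling for how differently the two schedules behave, here is a minimal sketch that counts how many updates each one needs to bring the temperature from 1.0 down to a threshold. The function names and the 0.01 threshold are assumptions made for the illustration; the two formulas are the ones given above.</p>

```python
def exponential_cooling(t, alpha=0.8):
    """t = alpha * t, the schedule used in the example code."""
    return alpha * t


def slow_cooling(t, beta=0.01):
    """t = t / (1 + beta * t), the slower alternative."""
    return t / (1 + beta * t)


def steps_until(schedule, threshold, t=1.0, max_steps=1000000):
    """Count how many updates it takes for t to drop to the threshold."""
    steps = 0
    while t > threshold and steps < max_steps:
        t = schedule(t)
        steps += 1
    return steps
```

<p>With these values the exponential schedule gets below 0.01 in 21 updates, while the slow one needs around 9900. That is why the slow schedule can do without the inner loop of cooling steps: each temperature is visited only once, but there are many more of them.</p>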
<p>You can see that the most critical points are the neighbour generation and the cost computation. While the neighbour generation could be cached like I said above, the cost computation could be replaced with a probability estimate in order to reduce the time per cooling&nbsp;step.</p>
<p>But how do you apply the algorithm? If an empty solution is acceptable then you can just start with that and let the neighbour generator build a solution, but usually you start with a greedy solution (found through hill climbing, for example) or from a random&nbsp;one.</p>
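<p>As an illustration of a greedy starting point, here is a minimal sketch that picks items by decreasing value/weight ratio while they still fit. Note that this is a plain ratio-based greedy rather than hill climbing, and <code>generate_greedy_solution</code> is a hypothetical name; it only assumes the same representation used in the complete code (items as <code>(cost, weight)</code> tuples, a solution as a list of item indexes).</p>

```python
def generate_greedy_solution(items, max_weight):
    """Pick items by decreasing value/weight ratio while they fit."""
    # sort the item indexes so that the most valuable per unit of
    # weight come first
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / float(items[i][1]),
                   reverse=True)
    solution, weight = [], 0.0
    for idx in order:
        if weight + items[idx][1] <= max_weight:
            solution.append(idx)
            weight += items[idx][1]
    return solution
```

<p>The result can be passed to <code>simulated_annealing()</code> in place of the random starting solution.</p>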
<p>Here follows the complete code. I used 1000 cooling steps and an <img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' /> value of 0.8. I start from a random solution, whose behaviour is not that bad considering how little time it takes to compute. Indeed, random algorithms often perform really well on combinatorial problems; see <a href="http://www.cs.ubc.ca/labs/beta/Courses/CPSC532D-02/tutorial-slides.pdf">stochastic search</a> for some more&nbsp;information.</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">math</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">operator</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">pprint</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
&nbsp;
COOLING_STEPS = <span style="color: #ff4500;">1000</span>
TEMP_ALPHA = <span style="color: #ff4500;">0.8</span>
&nbsp;
<span style="color: #dc143c;">random</span>.<span style="color: black;">seed</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> generate_items<span style="color: black;">&#40;</span>n_items=<span style="color: #ff4500;">100</span><span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;Generate a list of items that could be stolen&quot;</span>
    items = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> n <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, n_items<span style="color: black;">&#41;</span>:
        <span style="color: #808080; font-style: italic;"># use a uniform distribution both for values and for weights</span>
        cost = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">100</span><span style="color: black;">&#41;</span>
        weight = <span style="color: #dc143c;">random</span>.<span style="color: black;">uniform</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">20</span><span style="color: black;">&#41;</span>
        items.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>cost, weight<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> items
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> main<span style="color: black;">&#40;</span>args<span style="color: black;">&#41;</span>:
    items = generate_items<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #dc143c;">pprint</span>.<span style="color: #dc143c;">pprint</span><span style="color: black;">&#40;</span>items<span style="color: black;">&#41;</span>
&nbsp;
    start_sol = generate_random_solution<span style="color: black;">&#40;</span>items, max_weight=<span style="color: #ff4500;">40</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;Random solution: %s&quot;</span> <span style="color: #66cc66;">%</span> start_sol
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;value: (cost: %d, weight: %f)&quot;</span> <span style="color: #66cc66;">%</span> compute_cost<span style="color: black;">&#40;</span>start_sol, items<span style="color: black;">&#41;</span>
&nbsp;
    solution = simulated_annealing<span style="color: black;">&#40;</span>start_sol, items, max_weight=<span style="color: #ff4500;">40</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;Final solution: %s&quot;</span> <span style="color: #66cc66;">%</span> solution
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;value: (cost: %d, weight: %f)&quot;</span> <span style="color: #66cc66;">%</span> compute_cost<span style="color: black;">&#40;</span>solution, items<span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">False</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> generate_random_solution<span style="color: black;">&#40;</span>items, max_weight<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;Generate a starting random solution&quot;</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># generate a random solution by adding a random item</span>
    <span style="color: #808080; font-style: italic;"># until we get over the weight</span>
    solution = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #ff7700;font-weight:bold;">while</span> compute_cost<span style="color: black;">&#40;</span>solution, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">&lt;</span>= max_weight:
        idx = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>items<span style="color: black;">&#41;</span> - <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
        <span style="color: #808080; font-style: italic;"># skip duplicates</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> idx <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> solution:
            solution.<span style="color: black;">append</span><span style="color: black;">&#40;</span>idx<span style="color: black;">&#41;</span>
    <span style="color: #808080; font-style: italic;"># last item makes us get over the weight so simply remove it</span>
    <span style="color: #808080; font-style: italic;"># we'll look for better results after</span>
    solution = solution<span style="color: black;">&#91;</span>:-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> solution
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> simulated_annealing<span style="color: black;">&#40;</span>solution, items, max_weight<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;Apply the simulated annealing for solving the knapsack problem&quot;</span>
    best = solution
    best_value = compute_cost<span style="color: black;">&#40;</span>solution, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
    current_sol = solution
    temperature = <span style="color: #ff4500;">1.0</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
        current_value = compute_cost<span style="color: black;">&#40;</span>best, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, COOLING_STEPS<span style="color: black;">&#41;</span>:
            moves = generate_moves<span style="color: black;">&#40;</span>current_sol, items, max_weight<span style="color: black;">&#41;</span>
            idx = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>moves<span style="color: black;">&#41;</span> - <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
            random_move = moves<span style="color: black;">&#91;</span>idx<span style="color: black;">&#93;</span>
&nbsp;
            delta = compute_cost<span style="color: black;">&#40;</span>random_move, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> - \
                    compute_cost<span style="color: black;">&#40;</span>best, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
&nbsp;
            <span style="color: #ff7700;font-weight:bold;">if</span> delta <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">0</span>:
                best = random_move
                best_value = compute_cost<span style="color: black;">&#40;</span>best, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                current_sol = random_move
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #dc143c;">math</span>.<span style="color: black;">exp</span><span style="color: black;">&#40;</span>delta / <span style="color: #008000;">float</span><span style="color: black;">&#40;</span>temperature<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #dc143c;">random</span>.<span style="color: #dc143c;">random</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
                    current_sol = random_move
&nbsp;
        temperature = TEMP_ALPHA <span style="color: #66cc66;">*</span> temperature
        <span style="color: #ff7700;font-weight:bold;">if</span> current_value <span style="color: #66cc66;">&gt;</span>= best_value <span style="color: #ff7700;font-weight:bold;">or</span> temperature <span style="color: #66cc66;">&lt;</span>= <span style="color: #ff4500;">0</span>:
            <span style="color: #ff7700;font-weight:bold;">break</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> best
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> generate_moves<span style="color: black;">&#40;</span>solution, items, max_weight<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;&quot;&quot;
    Generate all the admissible moves starting from the input
    solution
    &quot;&quot;&quot;</span>
    moves = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
    <span style="color: #808080; font-style: italic;"># try to add another item and save as a possible move</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> idx, item <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">enumerate</span><span style="color: black;">&#40;</span>items<span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">if</span> idx <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> solution:
            move = solution<span style="color: black;">&#91;</span>::<span style="color: black;">&#93;</span>
            move.<span style="color: black;">append</span><span style="color: black;">&#40;</span>idx<span style="color: black;">&#41;</span>
&nbsp;
            <span style="color: #ff7700;font-weight:bold;">if</span> compute_cost<span style="color: black;">&#40;</span>move, items<span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">&lt;</span>= max_weight:
                moves.<span style="color: black;">append</span><span style="color: black;">&#40;</span>move<span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># try to remove one item</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> idx, item <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">enumerate</span><span style="color: black;">&#40;</span>solution<span style="color: black;">&#41;</span>:
        move = solution<span style="color: black;">&#91;</span>::<span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">del</span> move<span style="color: black;">&#91;</span>idx<span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> move <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #ff7700;font-weight:bold;">in</span> moves:
            moves.<span style="color: black;">append</span><span style="color: black;">&#40;</span>move<span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">return</span> moves
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> compute_cost<span style="color: black;">&#40;</span>solution, items<span style="color: black;">&#41;</span>:
    <span style="color: #483d8b;">&quot;&quot;&quot;
    Return a tuple in the format (cost, weight)
    for the input solution
    &quot;&quot;&quot;</span>
    cost, weight = <span style="color: #ff4500;">0</span>, <span style="color: #ff4500;">0</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> item <span style="color: #ff7700;font-weight:bold;">in</span> solution:
        cost += items<span style="color: black;">&#91;</span>item<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
        weight += items<span style="color: black;">&#91;</span>item<span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#40;</span>cost, weight<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> __name__ == <span style="color: #483d8b;">'__main__'</span>:
    <span style="color: #dc143c;">sys</span>.<span style="color: black;">exit</span><span style="color: black;">&#40;</span>main<span style="color: black;">&#40;</span><span style="color: #dc143c;">sys</span>.<span style="color: black;">argv</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>The results are surprising. For three different sets of 100 items, each one with its own value and weight, here are the&nbsp;results:</p>
<pre>$ python sa.py
Random solution: [98, 71, 95]
value: (cost: 44, weight: 27.001685)
Final solution: [71, 95, 67, 9, 41, 33, 27]
value: (cost: 229, weight: 39.791386)

$ python sa.py
Random solution: [38, 16, 62, 31]
value: (cost: 124, weight: 36.863846)
Final solution: [38, 16, 62, 31, 5]
value: (cost: 194, weight: 38.970745)

$ python sa.py
Random solution: [61, 44, 48, 38]
value: (cost: 293, weight: 30.357135)
Final solution: [61, 44, 48, 38, 37, 5, 2]
value: (cost: 421, weight: 39.331549)
</pre>
<p>We usually don&#8217;t leave much free weight, and the quality of the solutions found is quite good considering that computing them takes only&nbsp;~0.5s.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2010/02/22/simulated-annealing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Google Wave impressions from a developer point of view</title>
		<link>http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=google-wave-impressions-from-a-developer-point-of-view</link>
		<comments>http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/#comments</comments>
		<pubDate>Thu, 27 Aug 2009 22:47:04 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Geekness]]></category>
		<category><![CDATA[beta]]></category>
		<category><![CDATA[developer]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[googlewave]]></category>
		<category><![CDATA[point of view]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[sandbox]]></category>
		<category><![CDATA[wave]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=235</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/" title="Google Wave impressions from a developer point of view"></a>A couple of days ago I finally had my Google Wave sandbox account. Given that I just finished developing my very first robot, I thought I&#8217;d share some impressions on the whole thing. From the user-side, things are far from &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/" title="Google Wave impressions from a developer point of view"></a><p>A couple of days ago I finally got my <a href="http://wave.google.com">Google Wave</a> sandbox account. Given that I just finished developing my very first robot, <strong>I thought I&#8217;d share some impressions</strong> on the whole thing. From the user side, things are far from being ready. Some important features are still missing: to name just one, you can&#8217;t remove users from a wave once they have joined (or, alternatively, there&#8217;s no way to ignore a wave). Indeed, given that I joined several waves to try other people&#8217;s applications, I&#8217;m getting continuous notifications. But anyway, the whole thing is to me like a great development playground where I can make all sorts of&nbsp;experiments.</p>
<p><span id="more-235"></span>They&#8217;ve been loyal: when you signed up the registration form, they asked you if you were comfortable with APIs changing or an instable system. That&#8217;s what you&#8217;ll find once you get your sandbox account. APIs are there but haven&#8217;t been fully documented yet and <strong>most of your work when developing some robot/gadget will be in exploring the API sources</strong> (they&#8217;re open source, yau!) or searching for some examples on the <a href="http://wave-samples-gallery.appspot.com/">samples gallery</a>, which is an invaluable resource by the&nbsp;way.</p>
<p><strong>Debugging is hard too</strong>, given that you can&#8217;t test what you&#8217;ve done locally: you have to upload your code to <a href="http://appengine.google.com">AppEngine</a> to see if it works (currently AppEngine is the only platform they accept requests from, but they plan to allow every host that speaks the <a href="http://www.waveprotocol.org">wave protocol</a> in the future). This means that if, for example, there&#8217;s a typo in the code (e.g. <code>appendText()</code> rather than <code>AppendText()</code>), you&#8217;d find out only by looking at the AppEngine&nbsp;logs.</p>
<p><strong>Be prepared to experience random failures too</strong>. Sometimes your robot is working correctly and is receiving the whole wavelet (which is the whole conversation thread), but its response is ignored by the server for some unknown&nbsp;reason.</p>
<p>Anyway, even though it&#8217;s clearly still a work in progress, I felt like <strong>the whole thing was quite exciting</strong> both from the user&#8217;s and the developer&#8217;s point of view. The event model they designed for external applications perfectly fits the nature of The Wave and gives room for some nice asynchronous applications. Hopefully, we&#8217;ll meet on Google Wave&nbsp;soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2009/08/28/google-wave-impressions-from-a-developer-point-of-view/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dealing with algorithms and data structures</title>
		<link>http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=dealing-with-algorithms-and-data-structures</link>
		<comments>http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 21:49:08 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[compiler]]></category>
		<category><![CDATA[computational analysis]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[data structures]]></category>
		<category><![CDATA[locality]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[optimizer]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[processor]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=223</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/" title="Dealing with algorithms and data structures"></a>One of the reasons I haven&#8217;t been writing on this blog that much lately is that I&#8217;ve been terribly busy with university given that I just cleared out six exams in six months. That said, for one of my three &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/" title="Dealing with algorithms and data structures"></a><p>One of the reasons I haven&#8217;t been writing on this blog that much lately is that I&#8217;ve been terribly busy with university, given that I just cleared six exams in six months. That said, for one of the three exams I still have left, I had to develop an <a href="http://en.wikipedia.org/wiki/Inference_engine">inference engine</a> written in C++. Since this was a fairly large project that had to deal with some NP-complete problems (see also: <a href="http://en.wikipedia.org/wiki/Unification">unification</a>) and given that this was the first time I wrote something serious in C++ (i.e. something that involved more than one class and didn&#8217;t contain the &#8220;Hello world&#8221; string), I had the chance to learn quite a few new&nbsp;things.</p>
<p><span id="more-223"></span>First thing: no matter how good your data structure is or how well you implemented it, sooner or later you&#8217;ll hit a speed barrier that even the best data structure for the job can&#8217;t beat. That doesn&#8217;t mean you shouldn&#8217;t think about which data structure to use; besides, I&#8217;m really passionate about finding the right data structure for the job, so having the chance to deal with this kind of problem feels like winning the lottery. <strong>Sometimes the algorithm you&#8217;re using simply has some limits that can&#8217;t be beaten</strong> unless you radically adjust your algorithm or replace it with a better algorithm for the job. Say we have an implicit, unweighted graph, but we know that at most we&#8217;d be expanding fifty nodes of about a hundred bytes each and that expanding a node is computationally cheap. Now you want to find a particular node within that graph, along with the shortest path to it from the starting node. Given these considerations, what algorithm would you use in this case? I&#8217;d go for breadth-first search, since we don&#8217;t lose much time expanding the nodes (the computational cost of each node expansion is low) and we have relatively few nodes that, unless we&#8217;re running on some special, limited hardware (even though today even the chips in automated toilets can hold 5 KB in memory), will take a very small amount of&nbsp;memory.</p>
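<p>The project was in C++, but to stay consistent with the other snippets on this blog here is a minimal Python sketch of breadth-first search on an implicit graph, where the neighbours of a node are generated on demand rather than stored. The function names and the toy graph (integers, where each number leads to <code>n + 1</code> and <code>n * 2</code>) are assumptions made up for the illustration.</p>

```python
from collections import deque


def bfs_shortest_path(start, is_goal, neighbours):
    """Return the shortest path from start to a goal node, or None."""
    parents = {start: None}  # also doubles as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if is_goal(node):
            # walk the parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in neighbours(node):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def toy_neighbours(n, limit=50):
    """Implicit graph: every number leads to n + 1 and n * 2, up to limit."""
    return [m for m in (n + 1, n * 2) if m <= limit]
```

<p>For example, <code>bfs_shortest_path(1, lambda n: n == 10, toy_neighbours)</code> returns <code>[1, 2, 4, 5, 10]</code>. Since every edge has the same cost, the first time the goal is taken out of the queue the path to it is guaranteed to be the shortest.</p>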
<p>Now suppose we still have that implicit graph, with our fifty nodes of about a hundred bytes each. But now we know that generating every node is very expensive in terms of computational cost. The BFS above could still be applicable under certain circumstances but we&#8217;d better look for alternatives. You can hold the graph in whatever data structure you want, but unless you decide to replace the algorithm with something better there&#8217;s nothing you can do about it. Of course you can gain some speed by improving your data structure but that&#8217;s not the point, since <strong>what makes the shortest path search slow is the generation of a new node</strong>. So what do you have to do? Try to generate as few nodes as possible. Indeed, say our nearest solution is ten edges away from the starting node: with BFS you&#8217;d first expand all the nodes which have a direct connection to the starting node (so one edge), then from each of these nodes you&#8217;d expand other nodes that are two edges away from the starting node, and so on until you expand all the <em>levels</em> and get to the solution, which is separated from the starting node by ten edges. You got the solution eventually, but it came at a high price. What are the alternatives then? Say we know, for each node we expand, a numerical value which says how good that node is in terms of distance from what we&#8217;re looking for. In this way we can follow only the good leads and leave the bad paths out of our search (that, though, is not entirely true, and I&#8217;ll say why in a minute); still, we have limited the number of expanded nodes and as a result we get a great speed-up. There&#8217;s an entire category of algorithms based upon the principle that <em>you know something</em> about your problem that can help you out in some cases, and these are called informed algorithms. 
One of those who comes to my mind is <a href="http://en.wikipedia.org/wiki/A*_search_algorithm">A*</a> which is quite simple to implement too. These algorithms are based upon the assumption that you can make some estimation of <em>how good</em> the current state is. Indeed, most of these algorithm&#8217;s accuracy comes from how good the function that gives you that estimation is. This function is called the <em>heuristic function</em>. But the heuristic function is an estimation and it&#8217;s likely to be wrong in some cases. So you still end up expanding more nodes than necessary and following some wrong leads, but now you save yourself from exploring all the wrong leads that the previous BFS would have forced you to&nbsp;do.</p>
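<p>To make the difference concrete, here is a minimal sketch (not the code from the project) that counts how many nodes BFS and A* expand on the same problem. The 20x20 grid, the start and goal cells and the Manhattan-distance heuristic are all assumptions picked for illustration; the implicit graph is generated on the fly by <code>neighbors</code>.</p>

```python
import heapq
from collections import deque

SIZE = 20                       # hypothetical 20x20 grid
START, GOAL = (0, 0), (10, 10)  # hypothetical endpoints


def neighbors(node):
    """Generate the 4-connected neighbours of a cell (the node 'expansion')."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < SIZE and 0 <= ny < SIZE:
            yield (nx, ny)


def bfs(start, goal):
    """Return (shortest path length in edges, number of nodes expanded)."""
    dist = {start: 0}
    queue = deque([start])
    expanded = 0
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node], expanded
        expanded += 1
        for nxt in neighbors(node):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None, expanded


def manhattan(node, goal):
    """Heuristic: an estimate of the remaining distance that never overshoots."""
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])


def a_star(start, goal):
    """Return (shortest path length in edges, number of nodes expanded)."""
    best = {start: 0}
    heap = [(manhattan(start, goal), 0, start)]  # (estimate, cost so far, node)
    closed = set()
    expanded = 0
    while heap:
        _, cost, node = heapq.heappop(heap)
        if node == goal:
            return cost, expanded
        if node in closed:
            continue  # stale queue entry, already expanded with a better cost
        closed.add(node)
        expanded += 1
        for nxt in neighbors(node):
            new_cost = cost + 1
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost + manhattan(nxt, goal), new_cost, nxt))
    return None, expanded


bfs_length, bfs_expanded = bfs(START, GOAL)
astar_length, astar_expanded = a_star(START, GOAL)
print("BFS: length %d, expanded %d" % (bfs_length, bfs_expanded))
print("A*:  length %d, expanded %d" % (astar_length, astar_expanded))
```

<p>Both searches find a path of the same length, but A* expands fewer nodes because the heuristic keeps it away from cells that lead away from the goal. When expanding a node is the expensive part, that count is what you pay&nbsp;for.</p>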
<p>In the project I was talking about above, at a certain point we hit the limit. We tried switching data structures at first, and that saved some time, but we still weren&#8217;t able to get any significant time reduction. Then we switched algorithm and gained something like an 80% speed up; we even had the chance to use a simpler data structure, which let us exploit things like caching, which I suppose accounts for a large part of that&nbsp;80%.</p>
<p>That said, here comes the second lesson: <strong>know your environment</strong>. You have to know how your processor works, what the difference is between, say, an L1 and an L2 cache, why disk access is slow and how it works, how paging works, and so on. These things leave a lot of room for optimization if you know how to use them. Caching, for example, is probably one of the things that can help you the most. If you know what a <a href="http://en.wikipedia.org/wiki/CPU_cache">cache line</a> is and how big it is on your processor, you&#8217;ll know how and when you can exploit data locality. Knowing that disk reads are not as slow as people say can really help too (I heard you crying: that&#8217;s not a typo, disk reads aren&#8217;t slow; what makes the operation slow is the time the disk&#8217;s head takes to position itself. Once the head is in position, reading a whole block is probably faster than you imagine. Of course, I&#8217;m talking about old-style mechanical disks; these shiny new solid state disks are another story). In the same way, it helps to know whether your program will cause many page faults because your data structure doesn&#8217;t fit in one page, or fragment the heap because your algorithm makes thousands of allocations and deallocations of different sizes. This is very important, even though I have to admit it&#8217;s one of the topics I&#8217;m lagging behind&nbsp;on.</p>
<p>So you&#8217;ve chosen the best algorithm in the world, you resurrected the dead in order to fit your data structure in exactly one page and you used condoms while you were coding, but your program is still very slow and you can&#8217;t find the reason. Well, I tell you, <strong>you&#8217;re probably using the wrong data structure</strong>, but at this point I guess you&#8217;d know. With the project I mentioned above, we experienced exactly this. We did everything we could to speed up the algorithms and optimized everything that could be optimized, yet our program was slow until we realized that one thing was a real bottleneck for our purpose. Big O analysis is really useful, but you have to take into account that the little &#8216;n&#8217; in that big &#8216;O&#8217; is meant to be big. It tells you how the data structure behaves as n grows, so two algorithms or data structures that both have O(n log n) complexity can perform very differently, especially on small inputs where constant factors dominate. In our case, the bottleneck was the STL map implementation. Under the hood it is typically a <a href="http://en.wikipedia.org/wiki/Red-black_tree">red-black tree</a>, which was anything but fast in our program: we were spending 30% of the time inside the STL map looking up keys. The problem was that our maps held very few values: the most I&#8217;ve seen was ten items, with an average of three or four. When we switched from the usual map to a very simple hash, we got a tremendous speed&nbsp;improvement.</p>
<p>The next thing is optimizing the little stuff. This won&#8217;t yield great speed ups, but it&#8217;s still useful. We gained about one second just by switching one function&#8217;s calling convention to <a href="http://en.wikipedia.org/wiki/X86_calling_conventions#fastcall">fastcall</a>, and we got even greater benefits by forcing the inlining of some other&nbsp;functions.</p>
<p>Other than that, working on a large C++ code base has been challenging, mostly because of some C++ gotchas (e.g.: why does something like <code>string fn() {}</code> compile at all, when falling off the end of a non-void function is undefined behavior?), even though I come from an assembler/C background and have a solid base in OO theory. In the end, though, it has probably been the most fun problem I&#8217;ve faced in the last few years, and it will eventually be open sourced after the&nbsp;summer.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2009/07/21/dealing-with-algorithms-and-data-structures/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Optimize your programs</title>
		<link>http://zeta-puppis.com/2008/12/02/optimize-your-programs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=optimize-your-programs</link>
		<comments>http://zeta-puppis.com/2008/12/02/optimize-your-programs/#comments</comments>
		<pubDate>Tue, 02 Dec 2008 20:17:39 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[optimizations]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[speed]]></category>
		<category><![CDATA[zlib]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=185</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2008/12/02/optimize-your-programs/" title="Optimize your programs"></a>The last time I blogged about a new course I&#8217;m following at my university. This course, held by Pasquale Lops and Giovanni Semeraro, is very interesting at the point that I&#8217;ll be developing a custom information retrieval engine as part &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2008/12/02/optimize-your-programs/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2008/12/02/optimize-your-programs/" title="Optimize your programs"></a><p><a href="http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/">Last time</a> I blogged about a new course I&#8217;m following at my university. This course, held by <a href="http://www.di.uniba.it/~lops/lops.html">Pasquale Lops</a> and <a href="http://lacam.di.uniba.it:8000/people/semeraro.htm">Giovanni Semeraro</a>, is so interesting that I&#8217;ll be developing a <strong>custom information retrieval engine</strong> as part of my internship project. I can&#8217;t say much more at this point, since the internship hasn&#8217;t started yet and I&#8217;m not sure I can release more details about this project (we&#8217;re still in the process of deciding if and how the whole thing will be released to the&nbsp;world).</p>
<p>In the meantime, I&#8217;ve been running several experiments on this topic, mostly about the memory usage and performance of such a system on limited hardware. In practice this means implementing the algorithms you&#8217;ll be using and measuring the computational time they&nbsp;require.</p>
<p><span id="more-185"></span>One of the most common things our information retrieval engine has to do is take a document and compress it. Considering&nbsp;that:</p>
<ul>
<li>this is a fundamental piece of this IR&nbsp;engine</li>
<li>it will be used very&nbsp;often</li>
<li>it&#8217;s not rare to process very large&nbsp;documents</li>
</ul>
<p>you can see that this operation should be as efficient as&nbsp;possible.</p>
<p>I chose to go with zlib as my compression library, mainly for two&nbsp;reasons:</p>
<ul>
<li>it&#8217;s already included in Python (not really a strong point, since algorithms with better compression ratios are included in Python&nbsp;too)</li>
<li>it offers the best compromise between speed and compression&nbsp;ratio</li>
</ul>
<p>Given the above considerations, let&#8217;s start coding our compression&nbsp;system.</p>
<p>As our example document we will use the PDF specification, available at the <a href="http://www.adobe.com/devnet/pdf/pdf_reference.html">Adobe Development Center</a> (<a href="http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf">this is the file</a>), which is 8.6 MB in&nbsp;size.</p>
<p>So let&#8217;s start by doing things the basic&nbsp;way:</p>
<pre><code>#!/usr/bin/env python
# compress1.py
import zlib

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    cobj = zlib.compressobj(compression_level)
    out = b''  # compressed data is bytes (b'' also works on Python 2.6+)
    for line in input_fd:
        out += cobj.compress(line)
    out += cobj.flush()

    output_fd.write(out)

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    dobj = zlib.decompressobj()
    out = b''  # decompressed data is bytes (b'' also works on Python 2.6+)
    for line in input_fd:
        out += dobj.decompress(line)
    out += dobj.flush()

    output_fd.write(out)

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    try:
        # Unpack inside the try block so that missing arguments are reported too
        input_path, output_path = args[1], args[2]
        options[args[0]](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")
</code></pre>
<p>By running this program and performing a very basic profiling we get some&nbsp;indications:</p>
<pre>
kratorius@becks:~/compress$ time ./compress1.py compress PDF32000_2008.pdf compr.zlib
real    0m2.517s
user    0m1.496s
sys     0m0.060s

kratorius@becks:~/compress$ time ./compress1.py decompress compr.zlib decompr.pdf
real    0m0.640s
user    0m0.537s
sys     0m0.085s
</pre>
<p>We need 2.5 seconds to compress a file smaller than 10 MB. This is quite unacceptable, since it means we&#8217;re processing only about 3.5 MB per second, so we need to understand what we&#8217;re doing wrong. I can spot at least two big errors in this&nbsp;script:</p>
<ol>
<li>we&#8217;re reading the input file line by line, which isn&#8217;t very efficient since this way <strong>we&#8217;re accessing the disk multiple times</strong> (not to mention that we&#8217;re also feeding the compressor line by line, which is inefficient and doesn&#8217;t make much sense for a binary file like our&nbsp;PDF)</li>
<li><strong>we keep the whole compressed output in memory</strong> until the compression is finished, which means that even if the script ran faster, we&#8217;d still have a very high memory usage, and that is not&nbsp;optimal</li>
</ol>
<p>So here is the new version of our compression script:</p>
<pre><code>#!/usr/bin/env python
# compress2.py
import zlib

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    out = zlib.compress(input_fd.read(), compression_level)
    output_fd.write(out)

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    out = zlib.decompress(input_fd.read())
    output_fd.write(out)

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    try:
        # Unpack inside the try block so that missing arguments are reported too
        input_path, output_path = args[1], args[2]
        options[args[0]](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")
</code></pre>
<p>Let&#8217;s perform our basic profiling&nbsp;again:</p>
<pre>kratorius@becks:~/compress$ time ./compress2.py compress PDF32000_2008.pdf compr.zlib
real    0m1.668s
user    0m1.337s
sys     0m0.079s

kratorius@becks:~/compress$ time ./compress2.py decompress compr.zlib decompr.pdf
real    0m0.561s
user    0m0.394s
sys     0m0.086s
</pre>
<p>We are now reading the whole input file into memory (minimizing the disk accesses), compressing everything in memory and writing the compressed file to the output in a single shot. We got a good speedup this way, but <strong>we have just increased our memory usage</strong>, since now we&#8217;re keeping both the input and the compressed file in memory. This could be optimal if we were only processing small files, but since we need a general approach, this solution is not that&nbsp;good.</p>
<p>We can do better. And we&#8217;ll do better in the third&nbsp;try:</p>
<pre><code>#!/usr/bin/env python
# compress3.py
import zlib

READ_BYTES = 2097152 # 2 MB (2 * 1024 * 1024 bytes)

def compress(input_path, output_path, compression_level=6):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    cobj = zlib.compressobj(compression_level)
    done = False
    while not done:
        rd = input_fd.read(READ_BYTES)
        done = rd == b''  # an empty read means end of file

        output_fd.write(cobj.compress(rd))

    output_fd.write(cobj.flush())

    input_fd.close()
    output_fd.close()

def decompress(input_path, output_path):
    input_fd = open(input_path, 'rb')
    output_fd = open(output_path, 'wb')

    dobj = zlib.decompressobj()
    done = False
    while not done:
        rd = input_fd.read(READ_BYTES)
        done = rd == b''  # an empty read means end of file

        output_fd.write(dobj.decompress(rd))

    output_fd.write(dobj.flush())

    input_fd.close()
    output_fd.close()

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]

    options = { 'compress': compress,
                'decompress': decompress,
    }

    try:
        # Unpack inside the try block so that missing arguments are reported too
        input_path, output_path = args[1], args[2]
        options[args[0]](input_path, output_path)
    except (KeyError, IndexError):
        print("Invalid arguments")
</code></pre>
<p>And we finally reached our&nbsp;goal:</p>
<pre>kratorius@becks:~/compress$ time ./compress3.py compress PDF32000_2008.pdf compr.zlib
real    0m1.325s
user    0m1.226s
sys     0m0.070s

kratorius@becks:~/compress$ time ./compress3.py decompress compr.zlib decompr.pdf
real    0m0.534s
user    0m0.404s
sys     0m0.119s
</pre>
<p>This last try works because <strong>we&#8217;re still minimizing the disk accesses</strong> (we read 2 MB chunks at a time, so a small file is still read in one go) and this time <strong>we&#8217;re reducing the memory usage</strong>&nbsp;since:</p>
<ul>
<li>we read a 2 MB block from our input&nbsp;file</li>
<li>we compress the read&nbsp;input</li>
<li>we write it directly to our output&nbsp;file</li>
</ul>
<p>I&#8217;m sure there&#8217;s still room for improvement, but at this point we can be quite happy with our achievement. You can find the final script, which performs error checking and file locking, <a href="http://zeta-puppis.com/wp-content/uploads/2008/12/compress.py">here</a> (file locking works only on UNIX systems though; on Windows you should just comment out the <code>fcntl</code> lines). As always, suggestions are&nbsp;welcome.</p>
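<p>As a quick sanity check of the chunked approach, here is a small sketch (Python 3, not the final script linked above) that runs the same read, compress, write loop entirely in memory and verifies that the data survives the round trip. The helper names, the 64 KB chunk size and the sample data are assumptions made for illustration.</p>

```python
import io
import zlib

CHUNK = 64 * 1024  # hypothetical chunk size, smaller than the 2 MB used above


def compress_stream(src, dst, level=6, chunk=CHUNK):
    """Compress the file-like object src into dst one chunk at a time."""
    cobj = zlib.compressobj(level)
    # iter(callable, sentinel) keeps calling read() until it returns b''
    for block in iter(lambda: src.read(chunk), b''):
        dst.write(cobj.compress(block))
    dst.write(cobj.flush())


def decompress_stream(src, dst, chunk=CHUNK):
    """Decompress the file-like object src into dst one chunk at a time."""
    dobj = zlib.decompressobj()
    for block in iter(lambda: src.read(chunk), b''):
        dst.write(dobj.decompress(block))
    dst.write(dobj.flush())


# Repetitive sample data, a bit over 1 MB, so several chunks are processed
data = b'information retrieval ' * 50000

packed = io.BytesIO()
compress_stream(io.BytesIO(data), packed)

restored = io.BytesIO()
decompress_stream(io.BytesIO(packed.getvalue()), restored)

print(len(data), len(packed.getvalue()), restored.getvalue() == data)
```

<p>The same two functions work unchanged with real files opened in binary mode, because they only rely on <code>read</code> and&nbsp;<code>write</code>.</p>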
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/12/02/optimize-your-programs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What I learned by information retrieval in one week</title>
		<link>http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-i-learned-by-information-retrieval-in-one-week</link>
		<comments>http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/#comments</comments>
		<pubDate>Sun, 19 Oct 2008 16:38:24 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[IR]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[text categorization]]></category>
		<category><![CDATA[tf-idf]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=159</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/" title="What I learned by information retrieval in one week"></a>It has been about a week since I began doing a deeper study of information retrieval. Actually, everything just began with a new course at my university about that and I just fallen in love almost immediately. The fact is &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/" title="What I learned by information retrieval in one week"></a><p>It has been about a week since I began a deeper study of information retrieval. Actually, everything began with a new course about it at my university, and I fell in love almost immediately. The fact is that this thing really got me interested, and I began doing some experiments (one involves django as well, keep reading to know&nbsp;more).</p>
<p>This week I learned a lot about information retrieval, text categorization, natural language processing and machine learning. But the most relevant thing is: <strong>the principles are easy, their implementation is not</strong>. Most of the techniques are relatively simple, but you usually have to deal with very large datasets, and this can be challenging, since one of the main requirements of information retrieval is time. It&#8217;s much more important to give fewer results in one second than better results in one hour. No one will ever care to use your system if it takes an hour to get some results. And if you&#8217;re considering storing your data in a database, forget about normalization: it wouldn&#8217;t really take you&nbsp;anywhere.</p>
<p><span id="more-159"></span>Talking about storing information, you know that if you&#8217;re dealing with documents a large share of the words are the so-called <em>stop words</em>. Stop words are words that don&#8217;t really carry any meaning on their own, but they help the text flow better for the reader. Classic examples of stop words are articles like &#8220;the&#8221;, &#8220;a&#8221;, &#8220;an&#8221; or logical connectors like &#8220;or&#8221; and &#8220;and&#8221;. <strong>These words are so common that their presence is quite useless since they&#8217;re&#8230; everywhere</strong>. If you&#8217;re going to study information retrieval then you&#8217;ll learn about a weighting technique called <a href="http://en.wikipedia.org/wiki/Tf-idf">tf-idf</a> that gives these words a weight close to 0, but since you&#8217;d probably use an inverted index for words (an index that, given a word, tells you in which documents that word appears), you can understand that including stop words would take a lot of&nbsp;space.</p>
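<p>A minimal sketch of such an index, assuming a tiny hand-written stop word list and whitespace tokenization (a real engine would use a much longer list and a proper tokenizer):</p>

```python
from collections import defaultdict

# Tiny illustrative stop word list, nowhere near complete
STOP_WORDS = {"the", "a", "an", "or", "and"}


def build_index(documents):
    """Map each non-stop word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical three-document collection
docs = {
    1: "the cat and the dog",
    2: "a dog or a fish",
    3: "the fish",
}

index = build_index(docs)
print(index["dog"])    # ids of the documents that mention "dog"
print("the" in index)  # stop words never make it into the index
```

<p>In this toy collection the stop words account for seven of the twelve tokens, which gives an idea of how much index space they would take in a real&nbsp;one.</p>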
<p>So one of the biggest issues so far is that you&#8217;re going to deal with extremely large datasets, so you have to strip away as much as possible. Now consider these words: &#8220;fishing&#8221;, &#8220;fishes&#8221;, &#8220;fish&#8221;. They all talk about &#8220;fish&#8221;, and a user who is searching for &#8220;fish&#8221; would probably be interested in &#8220;fishes&#8221; or &#8220;fishing&#8221; as well. Additionally, it&#8217;s useless to store three words that are almost identical. So here comes <em>stemming</em> which, quoting the related <a href="http://en.wikipedia.org/wiki/Stemming">wikipedia page</a>, is the <cite>process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form</cite>. Fortunately, if you&#8217;re dealing with English texts, there&#8217;s the <a href="http://tartarus.org/~martin/PorterStemmer/">Porter algorithm</a>, which is the de facto standard for this sort of thing. But it works only with English, so <strong>if your documents are written in another language, or in multiple languages, things are going to get&nbsp;complicated</strong>.</p>
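<p>Just to illustrate the idea, here is a deliberately naive suffix stripper. This is <em>not</em> the Porter algorithm, which applies several ordered rule steps with conditions on the remaining stem; the suffix list and the minimum stem length below are assumptions made up for the example.</p>

```python
# Hypothetical suffix list, checked in order; far from complete
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for word in ("fishing", "fishes", "fish"):
    print(word, naive_stem(word))
```

<p>The three words from the example collapse into the same stem, so only one entry needs to be stored. A stripper this simple also mangles plenty of words (it turns &#8220;this&#8221; into &#8220;thi&#8221;), which is exactly why a carefully designed algorithm like Porter&#8217;s is worth&nbsp;using.</p>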
<p>This leads to the problem of language identification. How do you know whether some text is written in one language or another just by looking at it? Of course you can describe the document&#8217;s language with some kind of meta tagging, but not all documents have this kind of description: just think about the web. There are statistical methods based upon the classification of <a href="http://en.wikipedia.org/wiki/N-gram">n-grams</a>, but I haven&#8217;t investigated them in depth yet, so I can&#8217;t really say&nbsp;anything.</p>
<p>Now you&#8217;ve got your collection of documents that <em>match</em> a certain query. How do you know which document is more relevant than another (in other words: how do you <em>rank</em> pages)? You&#8217;ve got two tools (well, probably more, but these are the ones I know at the moment): <strong>the tf-idf mentioned above and the <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a></strong>. The latter is an interesting one: take the tf-idf vectors of the documents, then treat the query as a document too and compute its vector. Now picture those vectors in space and measure the cosine of the angle between the query vector and each document vector. The closer the value is to 1, the more relevant the&nbsp;document.</p>
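<p>Here is a minimal sketch of that ranking, assuming whitespace tokenization, term frequency normalized by document length and <code>log(N / df)</code> as the idf (other variants of the weighting exist). The three documents and the query are made up, and the query is simply added to the collection as one more document.</p>

```python
import math
from collections import Counter


def tf_idf_vectors(texts):
    """Return (vocabulary, vectors); each vector holds tf * idf, one entry per word."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted(set(word for words in tokenized for word in words))
    n = len(texts)
    # Document frequency: in how many texts each word appears
    df = {w: sum(1 for words in tokenized if w in words) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[w] / len(words) * idf[w] for w in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical collection and query
documents = [
    "the fish swims in the sea",
    "the cat sleeps on the sofa",
    "fish and chips",
]
query = "fish sea"

# Treat the query as one more document, as described above
vocab, vectors = tf_idf_vectors(documents + [query])
query_vector = vectors[-1]
scores = [cosine(query_vector, vector) for vector in vectors[:-1]]

# Print the documents from most to least similar to the query
for score, text in sorted(zip(scores, documents), reverse=True):
    print("%.3f  %s" % (score, text))
```

<p>The document that shares both query words with the query ranks first, the one sharing only &#8220;fish&#8221; comes second, and the one sharing nothing gets a score of&nbsp;zero.</p>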
<p>There are a lot of other important things to be said, like the concepts of precision and recall, but that&#8217;s enough for now. I&#8217;ll talk about them another&nbsp;time.</p>
<p>Anyway, I&#8217;m working on an experimental project named <a href="http://code.google.com/p/django-searchable/">django searchable</a>. It&#8217;s a pluggable app for django that implements an information retrieval engine based on tf-idf weighting. Play with it if you&#8217;re brave&nbsp;enough.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/10/19/what-i-learned-by-information-retrieval-in-one-week/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Announcing Pytagram</title>
		<link>http://zeta-puppis.com/2008/08/21/announcing-pytagram/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=announcing-pytagram</link>
		<comments>http://zeta-puppis.com/2008/08/21/announcing-pytagram/#comments</comments>
		<pubDate>Thu, 21 Aug 2008 14:42:11 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[pytagram]]></category>
		<category><![CDATA[svg]]></category>
		<category><![CDATA[toc]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=138</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2008/08/21/announcing-pytagram/" title="Announcing Pytagram"></a>Today I just ended one of my side projects: pytagram. Basically it generates an SVG file (that can successively be saved as eps/pdf/whatever and eventually manually manipulated) starting from a tree-like plain text file. This can be useful for generating &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2008/08/21/announcing-pytagram/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2008/08/21/announcing-pytagram/" title="Announcing Pytagram"></a><p>Today I finished one of my side projects: pytagram. Basically, it generates an SVG file (which can subsequently be saved as eps/pdf/whatever and, if needed, edited by hand) starting from a tree-like plain text file. This can be useful for generating <strong>cheat sheets or quick references</strong> to the classes or functions that belong to some&nbsp;project.</p>
<p>I did this to generate a <a href="http://djangoproject.com">django</a> quick reference (<a href="http://zeta-puppis.com/wp-content/uploads/2008/08/django1.svg">here it is</a>), since django has a lot of functions and I know what their purpose is, but I can never remember their names (and now two A4 sheets are right in front of&nbsp;me).</p>
<p>If you&#8217;re interested in this, check out the <a href="http://code.google.com/p/pytagram/">google code project page</a> and grab your copy from the SVN&nbsp;repository.</p>
<p>There are <strong>tons of things that can be changed/optimized</strong> (e.g.: adding an optional short explanation of each function, adding more examples, an easier way to change colors, &#8230;) but the code is now working quite well, so it can already be useful to the people out&nbsp;there.</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/08/21/announcing-pytagram/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google, codejam and number conversions</title>
		<link>http://zeta-puppis.com/2008/06/26/google-codejam-and-number-conversions/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=google-codejam-and-number-conversions</link>
		<comments>http://zeta-puppis.com/2008/06/26/google-codejam-and-number-conversions/#comments</comments>
		<pubDate>Thu, 26 Jun 2008 11:23:19 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[alien numbers]]></category>
		<category><![CDATA[base]]></category>
		<category><![CDATA[codejam]]></category>
		<category><![CDATA[conversion]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[number]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=94</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2008/06/26/google-codejam-and-number-conversions/" title="Google, codejam and number conversions"></a>The decimal numeral system is composed of ten digits, which we represent as &#8220;0123456789&#8221; (the digits in a system are written from lowest to highest). Imagine you have discovered an alien numeral system composed of some number of digits, which &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2008/06/26/google-codejam-and-number-conversions/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2008/06/26/google-codejam-and-number-conversions/" title="Google, codejam and number conversions"></a><p>The decimal numeral system is composed of ten digits, which we represent as &#8220;0123456789&#8221; (the digits in a system are written from lowest to highest). Imagine you have discovered an alien numeral system composed of some number of digits, which may or may not be the same as those used in decimal. For example, if the alien numeral system were represented as &#8220;oF8&#8221;, then the numbers one through ten would be (F, 8, Fo, FF, F8, 8o, 8F, 88, Foo, FoF). We would like to be able to work with numbers in arbitrary alien systems. More generally, we want to be able <strong>to convert an arbitrary number that&#8217;s written in one alien system into a second alien&nbsp;system</strong>.</p>
<p><span id="more-94"></span>The above is exactly one of the practice problems of the <a href="http://code.google.com/codejam">google codejam</a> (I still don&#8217;t know if I&#8217;ll be able to join the event, since I&#8217;ll probably be very busy with university exams on the day of the qualification round). More generally, the problem is the conversion of a number (which isn&#8217;t necessarily made of the usual digits) from any base to any base. I figured out a solution and passed both the small input test and the large one. My solution is simple: <strong>convert the source number to base 10 and then convert the resulting base 10 number to the target base</strong>. There are well-known algorithms for doing this (just think that 1986, for example, is nothing but 1 * 10^3 + 9 * 10^2 + 8 * 10^1 + 6 * 10^0), and I ended up implementing my own solution in&nbsp;python.</p>
<p>The biggest issue here is that the source base <strong>can have symbols instead of digits</strong>. I solved it by mapping the symbols to an array and using the index of each symbol as its <em>digit value</em>. Here is my&nbsp;solution:</p>
<pre><code>#!/usr/bin/env python
import sys, array

def main(argv=None):
    if not argv:
        argv = sys.argv

    try:
        f = open(argv[1])
    except IOError:
        print("File doesn't exist")
        return 1  # non-zero exit status signals the failure

    try:
        i = 0
        for line in f:
            if i == 0:
                # first line: number of test cases that follow
                line_num = int(line)
            else:
                number, input_b, output_b = line.strip('\n').split(' ')
                print('Case #%d: %s' % (i, convert(number, input_b, output_b)))

            i += 1
    finally:
        f.close()

    return 0  # success

def convert(number, input_b, output_b):
    """
    Convert a number from any base to any base
    """

    return convert_from_10(convert_to_10(number, input_b), output_b)

def convert_to_10(input, base):
    """
    Input can be a number in any base, even in an 'alien' base.
    For example: 'Foo' could be a number in a numerical system
    whose digits are 'oF8'.

    Base is exactly the digits representation.
    If you want to convert that 'Foo' to base 10 then you must
    call ``convert_to_10('Foo', 'oF8')``.

    Remember that the number in ``base`` must be written in an
    ordered form

    Returns a string of the number in base 10
    """

    current_base = len(base)

    # A string already supports index lookups, so it can act as the
    # symbol-to-value map directly
    map_to_base = base

    i = len(input) - 1
    base_10 = 0
    for digit in input:
        base_10 += map_to_base.index(digit) * current_base**i
        i -= 1

    return str(base_10)

def convert_from_10(input, base):
    """
    ``input`` is a number in base 10, while ``base`` is the digit
    representation of the new base (for example, for base 16 this
    could be '0123456789ABCDEF' or for an alien base 3 could be
    'oF8').

    Returns the number converted from base 10 to the specified
    base
    """

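    # A string can be indexed directly, so it doubles as the
    # value-to-symbol map
    map_to_base = base

    current = int(input)
    if current == 0:
        # Zero is written with the lowest digit of the target base
        return map_to_base[0]

    base_n = ''
    while current != 0:
        base_n = map_to_base[current % len(base)] + base_n
        current = current // len(base)  # integer division

    return base_n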

if __name__ == '__main__':
    sys.exit(main())</code></pre>
<p>Given this input (the first line is the number of lines that&nbsp;follow):</p>
<pre><code>4
9 0123456789 oF8
Foo oF8 0123456789
13 0123456789abcdef 01
CODE O!CDE? A?JM!.</code></pre>
<p>I have the correct&nbsp;output:</p>
<pre><code>Case #1: Foo
Case #2: 9
Case #3: 10011
Case #4: JAM!</code></pre>
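<p>For comparison, here is a more compact sketch of the same two-step conversion (Python 3). Instead of computing the powers explicitly, it evaluates the positional expansion with Horner&#8217;s rule, multiplying the running value by the base before adding each digit. The function names are mine and are not part of the solution above.</p>

```python
def to_int(number, digits):
    """Read `number`, written with the symbols in `digits`, as an integer."""
    value = 0
    for symbol in number:
        # Horner's rule: shift the running value one position, add the new digit
        value = value * len(digits) + digits.index(symbol)
    return value


def from_int(value, digits):
    """Write the non-negative integer `value` using the symbols in `digits`."""
    if value == 0:
        return digits[0]
    out = []
    while value:
        value, remainder = divmod(value, len(digits))
        out.append(digits[remainder])
    return ''.join(reversed(out))  # digits were produced lowest first


def convert(number, source_digits, target_digits):
    """Convert `number` from the source numeral system to the target one."""
    return from_int(to_int(number, source_digits), target_digits)


print(convert("Foo", "oF8", "0123456789"))
print(convert("CODE", "O!CDE?", "A?JM!."))
```

<p>Python integers have arbitrary precision, so the intermediate value never overflows, even with very long&nbsp;inputs.</p>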
<p>Of course, this is one of the practice problems and you should try to solve it on your own (otherwise it&#8217;s useless to try to join the&nbsp;event).</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/06/26/google-codejam-and-number-conversions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>And djangodash is ended&#8230;</title>
		<link>http://zeta-puppis.com/2008/06/11/and-djangodash-is-ended/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=and-djangodash-is-ended</link>
		<comments>http://zeta-puppis.com/2008/06/11/and-djangodash-is-ended/#comments</comments>
		<pubDate>Wed, 11 Jun 2008 14:43:23 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[dash]]></category>
		<category><![CDATA[djangodash]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=93</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2008/06/11/and-djangodash-is-ended/" title="And djangodash is ended..."></a>And I came in 6th. So I won a shared 2 hosting plan at webfaction and a 12-pack of G33K B33R caffeinated root beer (I&#8217;m still trying to understand what exactly this is) from bawls. Anyway, here follows a short &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2008/06/11/and-djangodash-is-ended/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2008/06/11/and-djangodash-is-ended/" title="And djangodash is ended..."></a><p>And I came in 6th. So I won a shared 2 hosting plan at <a href="http://webfaction.com">webfaction</a> and a 12-pack of G33K B33R caffeinated root beer (I&#8217;m still trying to understand what exactly this is) from <a href="http://www.bawlstyle.com">bawls</a>. Anyway, here follows <strong>a short summary of what happened</strong> from Saturday through Tuesday (if you&#8217;re asking yourself why it didn&#8217;t end on Sunday, well, keep&nbsp;reading).</p>
<p>The competition began very well: I worked normally for the first part of the day, but then I had to stop for a while. When I came back, <strong>svn and the <a href="http://djangodash.com">djangodash</a> website were not working anymore</strong>. I initially thought it was a connection issue on my side, but other sites were working properly, so the problem was definitely on their&nbsp;end.</p>
<p><span id="more-93"></span>I just waited, then went to sleep. In the morning I found a message in my mailbox informing me of a big power outage at <a href="http://www.theplanet.com">The Planet</a>&#8217;s datacenter, where webfaction hosts a lot of their servers (the <a href="http://djangodash.com">djangodash</a> one among them), caused by the explosion of a power generator. <strong>The competition was then extended by another two days</strong>, so I decided to take a breather and wait until svn came back. But that didn&#8217;t happen on Sunday, so after a while I chose (as the mail suggested) to work locally without committing anything, at least until svn&nbsp;returned.</p>
<p>Then Monday came and I had other things to do, so I had to postpone <a href="http://djangodash.com">djangodash</a> until the evening, once I&#8217;d freed myself from other, more urgent things. On Monday <strong>I did very little coding</strong>, and the same on Tuesday. So by the end of the competition I hadn&#8217;t completed my project, or even reached the 50%&nbsp;milestone.</p>
<p>Today I discovered that <strong>I was one of the winners</strong> (OK, not really: 6th place is not that great a result, but at least I tried). I really have to thank the organizers for this event, and I hope to join another <a href="http://djangodash.com">djangodash</a> next year (maybe, as I said to one of them in an email thread, with the site/svn hosted in two different datacenters, just to be insured against possible thunderstorms, tornadoes, earthquakes and so on&#8230;). I have to say that I really enjoyed the whole thing, and I hope there will be more competitors next&nbsp;year!</p>
<p>If you want more details about how <a href="http://djangodash.com">djangodash</a> wrapped up, along with some stats, <a href="http://www.toastdriven.com/fresh/django-dash-factoids/">read this article</a> on the <a href="http://www.toastdriven.com">Toast Driven</a> website (that&#8217;s the company that ran the&nbsp;dash).</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/06/11/and-djangodash-is-ended/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Let meet at djangodash</title>
		<link>http://zeta-puppis.com/2008/05/04/let-meet-at-djangodash/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=let-meet-at-djangodash</link>
		<comments>http://zeta-puppis.com/2008/05/04/let-meet-at-djangodash/#comments</comments>
		<pubDate>Sun, 04 May 2008 10:20:59 +0000</pubDate>
		<dc:creator>kratorius</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[competition]]></category>
		<category><![CDATA[dash]]></category>
		<category><![CDATA[djangodash]]></category>
		<category><![CDATA[prizes]]></category>

		<guid isPermaLink="false">http://zeta-puppis.com/?p=89</guid>
		<description><![CDATA[<a href="http://zeta-puppis.com/2008/05/04/let-meet-at-djangodash/" title="Let meet at djangodash"></a>As many of you probably already know, the Django dash competition begins on May 31. Djangodash&#160;is: [&#8230;] is a chance for Django enthusiasts to flex their coding skills a little and put a fine point on “perfectionists with deadlines” &#8230;<p class="read-more"><a href="http://zeta-puppis.com/2008/05/04/let-meet-at-djangodash/">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<a href="http://zeta-puppis.com/2008/05/04/let-meet-at-djangodash/" title="Let meet at djangodash"></a><p>As many of you probably already know, the Django dash competition begins <strong>on May 31</strong>. <a href="http://djangodash.com">Djangodash</a>&nbsp;is:</p>
<blockquote><p>[&#8230;] is a chance for Django enthusiasts to flex their coding skills a little and put a fine point on “perfectionists with deadlines” by giving you a REAL deadline. 48 hours from start to stop to produce the best app you can and have a little fun in the&nbsp;process.</p></blockquote>
<p>I&#8217;ll be participating, so if you haven&#8217;t registered yet, <strong>do it now</strong>! And don&#8217;t forget to check out <a href="http://djangodash.com/sponsors/">the cool prizes</a>&nbsp;:)</p>
]]></content:encoded>
			<wfw:commentRss>http://zeta-puppis.com/2008/05/04/let-meet-at-djangodash/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

