One of the rea­sons I haven’t been writ­ing on this blog that much lately is that I’ve been ter­ri­bly busy with uni­ver­sity given that I just cleared out six exams in six months. That said, for one of my three exams that I still have left, I had to develop an infer­ence engine writ­ten in C++. Since this was a fairly large project that had to deal with some sort of NP-​complete prob­lems (see also: uni­fi­ca­tion) and given that this was the first time I wrote some­thing seri­ous in C++ (i.e.: that would involve more than a class and that didn’t con­tain the “Hello world” string) I had the chance to learn quite a few new things.

First thing, no matter how good your data struc­ture is or how well you imple­mented that, sooner or later you’ll meet a speed bar­rier that even the best data struc­ture for the job can’t beat. That doesn’t mean you don’t have to think about what data struc­ture to use, besides I’m really pas­sion­ate about find­ing the right data struc­ture for the job so if I have the chance to deal with this kind of prob­lems seems like I won the lot­tery. Some­times the algo­rithm you’re using simply has some limits that can’t be beaten unless you rad­i­cally adjust your algo­rithm or you com­pletely change it with some a better-algorithm-for-the-job. Say we have an implicit, non-​weighted graph, but we know that at most we’d be expand­ing fifty nodes of about a hun­dred bytes each and the expan­sion of every node comes cheap as com­pu­ta­tional cost. Now you want to find a par­tic­u­lar node within that graph and along with it you want to find the short­est path to the start­ing node. Given these con­sid­er­a­tions, what algo­rithm would you use in this case? I’d go for breadth-​first search, since we don’t lose much time expand­ing the nodes (com­pu­ta­tional cost of each node expan­sion), we have rel­a­tively few nodes that, unless we’re run­ning on some spe­cial hard­ware that is some­what lim­ited (even though today even those autom­a­tized toilet’s chips can hold 5Kb in memory), will take a very little amount of memory.

Now sup­pose we still have that implicit graph, with our fifty nodes of about a hun­dred bytes each. But now we know that gen­er­at­ing every node is very expen­sive in terms of com­pu­ta­tional costs. The BFS above could still be applic­a­ble under cer­tain cir­cum­stances but we’d better be look­ing for alter­na­tives. You can hold the graph in what­ever data struc­ture you want but unless you decide to change the algo­rithm with some­thing better there’s noth­ing you can do about. Of course you can gain some speed by improv­ing your data struc­ture but that’s not the point since what makes the short­est path search slow is the gen­er­a­tion of a new node. So what you have to do? Try to gen­er­ate as few nodes as pos­si­ble. Indeed, say our near­est solu­tion is ten edges far from the start­ing node, with BFS you’d expand first all the nodes which have a direct con­nec­tion to the start­ing node (so one edge), than from each of these nodes you’d be expand­ing other nodes that now will have two edges sep­a­rat­ing them from the start­ing nodes, and so on until you expand all the levels and get to the solu­tion which is sep­a­rated from the start­ing node by ten edges. You had the solu­tion even­tu­ally, but it came at an expen­sive price. What are the alter­na­tives then? Say we know, for each node we expand, a numer­i­cal value which says how good that node is in terms of dis­tance from what we’re look­ing for. In this way we can follow only the good leads and leav­ing the bad paths out of our research (that, though, is not entirely true, and I’ll say why in a minute) and still, we have lim­ited the number of expanded nodes and as result we had a great speed up. There’s an entire cat­e­gory of algo­rithms that are based upon the prin­ci­ple that you know some­thing of your prob­lem that can help you out in some cases and these are called informed algo­rithms. One of those who comes to my mind is A* which is quite simple to imple­ment too. These algo­rithms are based upon the assump­tion that you can make some esti­ma­tion of how good the cur­rent state is. Indeed, most of these algorithm’s accu­racy comes from how good the func­tion that gives you that esti­ma­tion is. This func­tion is called the heuris­tic func­tion. But the heuris­tic func­tion is an esti­ma­tion and it’s likely to be wrong in some cases. So you still end up expand­ing more nodes than nec­es­sary and fol­low­ing some wrong leads, but now you save your­self from explor­ing all the wrong leads that the pre­vi­ous BFS would have forced you to do.

For the project I was talk­ing above, at a cer­tain point we hit the limit. We tried switch­ing data struc­tures ini­tially and that saved some time but still we weren’t able to come up with any con­sis­tent time reduc­tion. Until we switched algo­rithm, then we gained some­thing like a 80% speed up and we even had the chance to use some sim­pler data struc­ture which allowed us to exploit some stuff like caching which I sup­pose is a large part of that 80%.

Saying that, here comes the second lesson: know your envi­ron­ment. You have to know how your proces­sor works, what’s the dif­fer­ence between, say, a L2 and a L1 cache, why disk access is slow and how it works, how the paging works and so on. These things can make great room for opti­miza­tion if you know how to use them. Caching, for exam­ple, is prob­a­bly one a thing that may give a great help to you. If you know what the cache line is and how big is on your proces­sor, you’ll know how and when you can exploit the data local­ity prin­ci­ple. Know­ing that disk read­ing is not that slow as it’s said could really help you (I heard you crying: I did not mistyped, disk read ain’t slow, what’s makes this oper­a­tion slow is the time the disk’s head takes to posi­tion itself. Once we got the head in posi­tion, read­ing a whole block is prob­a­bly faster than you imag­ine. Of course, I’m talk­ing about old-​style mechan­i­cal disks, with these shiny new solid state disks is another story). In the same way it can help know­ing that your pro­gram will not cause many page faults or won’t frag­ment the allo­ca­tion memory because your data struc­ture doesn’t fit in one page or your algo­rithm makes thou­sand of allocations/deallocations of dif­fer­ent size. This is very impor­tant, even though I got to admit is one of the topics I’m lag­ging behind.

Then you chosen the best algo­rithm in the world, you res­ur­rected the dead in order to fit your data struc­ture in exactly one page and you used con­doms while you were coding, but still your pro­gram is very slow and you still can’t find the reason. Well I tell you, prob­a­bly you’re using the wrong data struc­ture but at this point I guess you’d know. With the project I said above, we expe­ri­enced exactly this. We did every­thing we can to speed up the algo­rithms, we opti­mized every­thing could have been opti­mized but yet our pro­gram was slow until we real­ized that some­thing was a real bot­tle­neck for our pur­pose. Big Oh analy­sis it’s really useful, but you got to take into account that that little ‘n’ in that big ‘o’ is meant to be big. Indeed, that gives you an idea of how the data struc­ture works as n grows, so that two same algo­rithms or data struc­tures which both have a O(nlogn) com­plex­ity can be very dif­fer­ent. In our case, the bot­tle­neck was the STL map imple­men­ta­tion. It turns out that, under the hood, it is a red-​black tree that was every­thing but fast in our pro­gram. We were spend­ing 30% of the time within the STL map look­ing for the key. The prob­lem here really was that we hadn’t many values within the map, most I’ve seen were ten items but we had a medium of three or four items. When we switched from the usual map to a very simple hash, we had a ter­ri­ble speed improvement.

Next thing is opti­miz­ing little stuff. This won’t yield greet speed ups but yet, they’re useful. We gained about one second by only switch­ing one function’s call­ing con­ven­tion to fast­call and we had even greater ben­e­fits by forc­ing the inlin­ing of some other functions.

Other than that, having to work on a large C++ code base has been chal­leng­ing, most because of some C++ gotchas (i.e.: why some­thing like string fn() {} returns NULL implic­itly?) even though I come from a assembler/C back­ground and I have a solid OO theory base behind. In the end, though, it has been prob­a­bly the fun­ni­est prob­lem I faced in the latest years and will be even­tu­ally be open sourced after the summer.