Dealing with algorithms and data structures
July 21st, 2009
One of the reasons I haven’t been writing on this blog that much lately is that I’ve been terribly busy with university given that I just cleared out six exams in six months. That said, for one of my three exams that I still have left, I had to develop an inference engine written in C++. Since this was a fairly large project that had to deal with some sort of NP-complete problems (see also: unification) and given that this was the first time I wrote something serious in C++ (i.e.: that would involve more than a class and that didn’t contain the “Hello world” string) I had the chance to learn quite a few new things.
First thing, no matter how good your data structure is or how well you implemented that, sooner or later you’ll meet a speed barrier that even the best data structure for the job can’t beat. That doesn’t mean you don’t have to think about what data structure to use, besides I’m really passionate about finding the right data structure for the job so if I have the chance to deal with this kind of problems seems like I won the lottery. Sometimes the algorithm you’re using simply has some limits that can’t be beaten unless you radically adjust your algorithm or you completely change it with some a better-algorithm-for-the-job. Say we have an implicit, non-weighted graph, but we know that at most we’d be expanding fifty nodes of about a hundred bytes each and the expansion of every node comes cheap as computational cost. Now you want to find a particular node within that graph and along with it you want to find the shortest path to the starting node. Given these considerations, what algorithm would you use in this case? I’d go for breadth-first search, since we don’t lose much time expanding the nodes (computational cost of each node expansion), we have relatively few nodes that, unless we’re running on some special hardware that is somewhat limited (even though today even those automatized toilet’s chips can hold 5Kb in memory), will take a very little amount of memory.
Now suppose we still have that implicit graph, with our fifty nodes of about a hundred bytes each. But now we know that generating every node is very expensive in terms of computational costs. The BFS above could still be applicable under certain circumstances but we’d better be looking for alternatives. You can hold the graph in whatever data structure you want but unless you decide to change the algorithm with something better there’s nothing you can do about. Of course you can gain some speed by improving your data structure but that’s not the point since what makes the shortest path search slow is the generation of a new node. So what you have to do? Try to generate as few nodes as possible. Indeed, say our nearest solution is ten edges far from the starting node, with BFS you’d expand first all the nodes which have a direct connection to the starting node (so one edge), than from each of these nodes you’d be expanding other nodes that now will have two edges separating them from the starting nodes, and so on until you expand all the levels and get to the solution which is separated from the starting node by ten edges. You had the solution eventually, but it came at an expensive price. What are the alternatives then? Say we know, for each node we expand, a numerical value which says how good that node is in terms of distance from what we’re looking for. In this way we can follow only the good leads and leaving the bad paths out of our research (that, though, is not entirely true, and I’ll say why in a minute) and still, we have limited the number of expanded nodes and as result we had a great speed up. There’s an entire category of algorithms that are based upon the principle that you know something of your problem that can help you out in some cases and these are called informed algorithms. One of those who comes to my mind is A* which is quite simple to implement too. These algorithms are based upon the assumption that you can make some estimation of how good the current state is. Indeed, most of these algorithm’s accuracy comes from how good the function that gives you that estimation is. This function is called the heuristic function. But the heuristic function is an estimation and it’s likely to be wrong in some cases. So you still end up expanding more nodes than necessary and following some wrong leads, but now you save yourself from exploring all the wrong leads that the previous BFS would have forced you to do.
For the project I was talking above, at a certain point we hit the limit. We tried switching data structures initially and that saved some time but still we weren’t able to come up with any consistent time reduction. Until we switched algorithm, then we gained something like a 80% speed up and we even had the chance to use some simpler data structure which allowed us to exploit some stuff like caching which I suppose is a large part of that 80%.
Saying that, here comes the second lesson: know your environment. You have to know how your processor works, what’s the difference between, say, a L2 and a L1 cache, why disk access is slow and how it works, how the paging works and so on. These things can make great room for optimization if you know how to use them. Caching, for example, is probably one a thing that may give a great help to you. If you know what the cache line is and how big is on your processor, you’ll know how and when you can exploit the data locality principle. Knowing that disk reading is not that slow as it’s said could really help you (I heard you crying: I did not mistyped, disk read ain’t slow, what’s makes this operation slow is the time the disk’s head takes to position itself. Once we got the head in position, reading a whole block is probably faster than you imagine. Of course, I’m talking about old-style mechanical disks, with these shiny new solid state disks is another story). In the same way it can help knowing that your program will not cause many page faults or won’t fragment the allocation memory because your data structure doesn’t fit in one page or your algorithm makes thousand of allocations/deallocations of different size. This is very important, even though I got to admit is one of the topics I’m lagging behind.
Then you chosen the best algorithm in the world, you resurrected the dead in order to fit your data structure in exactly one page and you used condoms while you were coding, but still your program is very slow and you still can’t find the reason. Well I tell you, probably you’re using the wrong data structure but at this point I guess you’d know. With the project I said above, we experienced exactly this. We did everything we can to speed up the algorithms, we optimized everything could have been optimized but yet our program was slow until we realized that something was a real bottleneck for our purpose. Big Oh analysis it’s really useful, but you got to take into account that that little ‘n’ in that big ‘o’ is meant to be big. Indeed, that gives you an idea of how the data structure works as n grows, so that two same algorithms or data structures which both have a O(nlogn) complexity can be very different. In our case, the bottleneck was the STL map implementation. It turns out that, under the hood, it is a red-black tree that was everything but fast in our program. We were spending 30% of the time within the STL map looking for the key. The problem here really was that we hadn’t many values within the map, most I’ve seen were ten items but we had a medium of three or four items. When we switched from the usual map to a very simple hash, we had a terrible speed improvement.
Next thing is optimizing little stuff. This won’t yield greet speed ups but yet, they’re useful. We gained about one second by only switching one function’s calling convention to fastcall and we had even greater benefits by forcing the inlining of some other functions.
Other than that, having to work on a large C++ code base has been challenging, most because of some C++ gotchas (i.e.: why something like string fn() {} returns NULL implicitly?) even though I come from a assembler/C background and I have a solid OO theory base behind. In the end, though, it has been probably the funniest problem I faced in the latest years and will be eventually be open sourced after the summer.