Recently I needed to show a heat map of a quite a lot of coordinate points for a little project of mine that ended up in a data visualization contest (that unfortunately I didn’t win, even though I made to the finalists). The idea was to show the distribution of the georeferenced wikipedia pages through a heat map, so when I first heard about openheatmap.com I knew it was the tool to use. OpenHeatMap.com is an excellent project by Pete Warden that takes a dataset as a CSV, Excel or Google Spreadsheet file and convert it to a nice, browsable heat map presentation.
The first step was to obtain I dataset I could work on. I first tried to work directly onto the whole wikipedia database dump, extracting all the georeferenced pages in a smaller dataset. I actually succeeded but this work included only the english georeferenced pages. Also, extracting and converting coordinates to a common format would have been a real pain. So instead I decided to use the dump from the wikipedia-world project that already included data in a CSV file from all the downloadable wikipedia dumps which include other languages other than english. This dataset include roughly 1.300.000 points, so I had to narrow down some options to process it.
Once I had the dataset ready and knew how big it was I realized I had three options:
- the naive approach, just add every coordinate to the CSV file
- use a reverse geocoding service to get the country where the point belongs to
- cluster set of points together
It was clear that the first approach wouldn’t have worked for two reasons, the former being that there were just too many points for OHM (the rendering is done on client side and that would slow things a lot). Also, I would just draw points onto a map without effectively creating a “heat map” so I discarded that option soon.
Using a reverse geocoding service wouldn’t have worked too: I should have ran too many requests to a service like this and it would have taken ages. Also, I would have ended up with per-country rather than a per-city highlighting and that would have faked the final result. So it was clear that the only viable option was to cluster set of points together and then produce a CSV file that OHM would understand. Soon I realized I needed some sort of spatial indexing for a 2d space that turned out to be quad trees.
Before we dive deeper in how to cluster the points together we need to understand what’s a quad-tree. In the classic, recursive, definition of a tree, a quad-tree is a tree where each node, that represent a coordinate point, has up to four children. Each child represent a relative position to its father, being north west, north east, south west or south east.

One requirement for generating the heat map is knowing how many nodes we clustered together. Thus it’s easy to define a node item as an object storing the coordinates of the point and the number of nodes that have been aggregated on that point. We can thus define a class like this (all the code examples following will be in Python):
1 2 3 4 5 6 7 8 9 | class PointNode(object): NW, NE, SW, SE = 0, 1, 2, 3 def __init__(self, lat, lon): self.lat, self.lon = float(lat), float(lon) self.nodes = [None] * 4 self.aggregate_no = 1 def __str__(self): return "%s, %s" % (self.lat, self.lon) |
We’ll do two operations on the quad-tree: insert a new node and visit the whole tree. We can give a recursive definition of the insertion using the quad-tree as underlying data structure. Given a node:
- if the tree’s node is empty just insert the node there and set the number of points clustered together for that node to 1
- if the new node is near the tree’s node then compute a “middle node”, substitute it to the tree’s node and increment the number of points clustered together for that node
- otherwise find out where the new node belongs in the quad-tree (north west, north east, south west or south east) and insert it there
The insert operation on the quad-tree can be thus coded like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | def qtree_insert(root, node): """ Insert a point into the quad tree substituting a node with its midpoint if the nodes are near to each other (less than DISTANCE_LIMIT) """ if not root: return node # if we are under the distance limit, replace the root node with the # midpoint of the two nodes if point_distance(root, node) < DISTANCE_LIMIT: c = point_midpoint(root, node) c.nodes = root.nodes c.aggregate_no = root.aggregate_no + 1 root = c else: # otherwise just insert the node where it belongs # exploit PointNode child indexing (with NW being 0 we just need to add # the proper number to get what we need) pos = PointNode.NW if node.lat > root.lat: pos += 2 if node.lon > root.lon: pos += 1 root.nodes[pos] = qtree_insert(root.nodes[pos], node) return root |
The distance between two nodes can be computed using the pythagorean formula with parallel meridians. This formula returns the distance from two points in kilometers and it’s defined as: where
is the Earth’s radius and
,
are two points coordinates in radians (thus
and
are the differences between the two points longitudes and latitudes).
Note that we can return the distance in miles or whatever distance measure we want just by converting the Earth’s radius accordingly. We should keep in mind though that this formula it not very accurate so if we need better accuracy we need to find a better alternative.
Browsing the whole tree can be done using a classic tree visit and, considering that for my purposes there’s no need to visit the nodes in a special order I chosen DFS to save some memory. The final source code of the Python script can be found on my repository on github. The whole process takes slightly more than one minute on my desktop machine. This, instead, is the final heat map that joined the This Week In Relevance contest:
Great Post…thanks a lot. I had a lot of fun with your code. I actually decided to use great circle distance, and I kept a running tally of the lat/longs so I could get a weighted mid-point when replacing nodes. Thanks so much for helping me get going…