Clustering coordinate points together with quad-trees

Recently I needed to show a heat map of a quite a lot of coor­di­nate points for a little project of mine that ended up in a data visu­al­iza­tion con­test (that unfor­tu­nately I didn’t win, even though I made to the final­ists). The idea was to show the dis­tri­b­u­tion of the geo­ref­er­enced wikipedia pages through a heat map, so when I first heard about open​heatmap.com I knew it was the tool to use. Open​HeatMap.com is an excel­lent project by Pete Warden that takes a dataset as a CSV, Excel or Google Spread­sheet file and con­vert it to a nice, brows­able heat map pre­sen­ta­tion.

The first step was to obtain I dataset I could work on. I first tried to work directly onto the whole wikipedia data­base dump, extract­ing all the geo­ref­er­enced pages in a smaller dataset. I actu­ally suc­ceeded but this work included only the eng­lish geo­ref­er­enced pages. Also, extract­ing and con­vert­ing coor­di­nates to a common format would have been a real pain. So instead I decided to use the dump from the wikipedia-​world project that already included data in a CSV file from all the down­load­able wikipedia dumps which include other lan­guages other than eng­lish. This dataset include roughly 1.300.000 points, so I had to narrow down some options to process it.

Once I had the dataset ready and knew how big it was I real­ized I had three options:

  1. the naive approach, just add every coordinate to the CSV file
  2. use a reverse geocoding service to get the country where the point belongs to
  3. cluster set of points together

It was clear that the first approach wouldn’t have worked for two rea­sons, the former being that there were just too many points for OHM (the ren­der­ing is done on client side and that would slow things a lot). Also, I would just draw points onto a map with­out effec­tively cre­at­ing a “heat map” so I dis­carded that option soon.
Using a reverse geocod­ing ser­vice wouldn’t have worked too: I should have ran too many requests to a ser­vice like this and it would have taken ages. Also, I would have ended up with per-​country rather than a per-​city high­light­ing and that would have faked the final result. So it was clear that the only viable option was to clus­ter set of points together and then pro­duce a CSV file that OHM would under­stand. Soon I real­ized I needed some sort of spa­tial index­ing for a 2d space that turned out to be quad trees.

Before we dive deeper in how to clus­ter the points together we need to under­stand what’s a quad-​tree. In the clas­sic, recur­sive, def­i­n­i­tion of a tree, a quad-​tree is a tree where each node, that rep­re­sent a coor­di­nate point, has up to four chil­dren. Each child rep­re­sent a rel­a­tive posi­tion to its father, being north west, north east, south west or south east.

One require­ment for gen­er­at­ing the heat map is know­ing how many nodes we clus­tered together. Thus it’s easy to define a node item as an object stor­ing the coor­di­nates of the point and the number of nodes that have been aggre­gated on that point. We can thus define a class like this (all the code exam­ples fol­low­ing will be in Python):

1
2
3
4
5
6
7
8
9
class PointNode(object):
    NW, NE, SW, SE = 0, 1, 2, 3
    def __init__(self, lat, lon):
        self.lat, self.lon = float(lat), float(lon)
        self.nodes = [None] * 4
        self.aggregate_no = 1
 
    def __str__(self):
        return "%s, %s" % (self.lat, self.lon)

We’ll do two oper­a­tions on the quad-​tree: insert a new node and visit the whole tree. We can give a recur­sive def­i­n­i­tion of the inser­tion using the quad-​tree as under­ly­ing data struc­ture. Given a node:

  1. if the tree’s node is empty just insert the node there and set the number of points clustered together for that node to 1
  2. if the new node is near the tree’s node then compute a “middle node”, substitute it to the tree’s node and increment the number of points clustered together for that node
  3. otherwise find out where the new node belongs in the quad-tree (north west, north east, south west or south east) and insert it there

The insert oper­a­tion on the quad-​tree can be thus coded like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
def qtree_insert(root, node):
    """
    Insert a point into the quad tree substituting a node with its
    midpoint if the nodes are near to each other (less than DISTANCE_LIMIT)
    """
    if not root: return node
 
    # if we are under the distance limit, replace the root node with the
    # midpoint of the two nodes
    if point_distance(root, node) < DISTANCE_LIMIT:
        c = point_midpoint(root, node)
        c.nodes = root.nodes
        c.aggregate_no = root.aggregate_no + 1
        root = c
    else:
        # otherwise just insert the node where it belongs
 
        # exploit PointNode child indexing (with NW being 0 we just need to add
        # the proper number to get what we need)
        pos = PointNode.NW
        if node.lat > root.lat:
            pos += 2
        if node.lon > root.lon:
            pos += 1
 
        root.nodes[pos] = qtree_insert(root.nodes[pos], node)
    return root

The dis­tance between two nodes can be com­puted using the pythagorean for­mula with par­al­lel merid­i­ans. This for­mula returns the dis­tance from two points in kilo­me­ters and it’s defined as: D=R\sqrt{(\Delta\phi)^2+(\Delta\lambda)^2}\! where R is the Earth’s radius and (\phi_0,\lambda_0)\,\!, (\phi_1,\lambda_1)\,\! are two points coor­di­nates in radi­ans (thus \Delta\phi and \Delta\lambda are the dif­fer­ences between the two points lon­gi­tudes and lat­i­tudes).
Note that we can return the dis­tance in miles or what­ever dis­tance mea­sure we want just by con­vert­ing the Earth’s radius accord­ingly. We should keep in mind though that this for­mula it not very accu­rate so if we need better accu­racy we need to find a better alternative.

Brows­ing the whole tree can be done using a clas­sic tree visit and, con­sid­er­ing that for my pur­poses there’s no need to visit the nodes in a spe­cial order I chosen DFS to save some memory. The final source code of the Python script can be found on my repos­i­tory on github. The whole process takes slightly more than one minute on my desk­top machine. This, instead, is the final heat map that joined the This Week In Rel­e­vance contest:

  1. Great Post…thanks a lot. I had a lot of fun with your code. I actu­ally decided to use great circle dis­tance, and I kept a run­ning tally of the lat/longs so I could get a weighted mid-​point when replac­ing nodes. Thanks so much for help­ing me get going…

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">