Online Clustering in Python

A little while ago Stefano pointed me to Online Learning in Clojure, a really simple implementation of an the Pegasos online classification algorithm in the lisp-like (but compile to java bytecode) language Clojure. I implemented it myself in python, and it really is very simple. I’ll put it here some day when I have cleaned it up a bit.

By online in this context I mean an algorithm that does not keep all the examples in memory at the same time, but processes them one by one, keeping only aggregates or similar. Ideally it also does this in a single pass, i.e. you just look once at each example, the theory then goes that if you have enough examples they will still allow you to identify each class well. For linear binary classification problems this amazingly well, Pegasos for instance can train a binary-class SVM on 780,000 examples in less than 2 seconds. This is achieved by doing almost nothing for each example, the only operation is updating a weight-vector with the weighted example vector, where the weight depends on the class predicted by the weight vector at that point. The paper also includes pointers to Online Learning with Kernels, which describes how the same method can be adapted to work with kernel-methods that allow non-linear classification. I implemented this as well, and it makes the whole thing several times slower, but it’s still amazingly fast.

Now — this was during my holidays and I started wondering if you could also do clustering online. I’d had enough of reading papers at this point, so I just tried hacking something in python. It worked fairly well, but I had some problems when it came to merging clusters. Now a few weeks later I return to this, but this time I read the paper before. It turns out a very simple algorithm was published in 99: An on-Line Agglomerative Clustering Method for Non-Stationary Data, it was very similar to what I had pulled out of my made up, but where I had tried to find a complicated solution for deciding when to merge clusters they had a very simple solution. And it worked quite well, as long as the data was not too noisy. Luckily, a bit later, someone wrote a paper with the catchy title: Improving the Robustness of ‘Online Agglomerative Clustering Method’ Based on Kernel-Induce Distance Measures. They use kernel-functions for distance measures and a clever weighting to iron out the noise in the data. (Although the cluster-prototypes are still kept in normal space, so it’s not really a kernel-method?). NOW, (slowly getting to the point) I also implemented this in Python, and it rocks:

The algorithm has two parameters, the σ parameter for gaussian kernel used, and K, the number of cluster prototypes to maintain. The number of clusters is found automatically, although it cannot find more than K-1. The runtime is otherwise linear wrt. to the number of data-points, but scaled by K i.e. roughly:

O(KN)

This all got a bit detailed, sorry. I really just wanted to hit the youtube embed button and go. At least it’s a fun video.

Update:

Ah – in a bit of forget-the-attachment-in-a-”here-is-the-attachment”-email, I appear to have forgotten the python code :)

Grab it at: OnlineCluster.py, it is fairly straight-forward. It requires scipy, and will plot it’s results if you have pylab installed. (The video above was also generated using pylab). The only bit where readability suffered was using the heapq module for finding the clusters to merge. I’ve not used the module before, and it took a long goddamn while to figure out that since I wanted tuples in my queue and they compare funnily, I had to introduce an extra object with a __cmp__ method. The instances are represented as scipy arrays (i.e. vectors of floats), the clusterer will happily let you grow the dimsions as you go along.

Enjoy!

 

12 comments.

  1. [...] link is being shared on Twitter right now. @gromgull said Today's (and yesterday's) effort: Online [...]

  2. Hi,

    thanks for the nice post, and the code! I have a question wrt the lines 118–123:


    # make a new cluster for this point
    newc=Cluster(e)
    self.clusters.append(newc)
    self.updatedist(newc)

    self.n+=1

    since the new point is always added to an existing cluster at line 104:

    closest=self.clusters[min( c , key=operator.itemgetter(1))[0]]
    closest.add(e)

    why is it also assigned to a new cluster?

  3. Hi Manos,

    Good question — this also puzzled me initially. I guess the reasoning goes something like: When a new point arrives we do not know if this is an outlier or a genuine new cluster. Doing both assures we can both start building the new cluster when new points arrive in the same area later, OR we can later merge the newly created cluster into the existing one when it turns out that no more points arrive there.

    Makes sense?

    (Sorry about the delay, your comment was initially eaten by the spam-filter)

  4. Do you have anything that i could use for stationary data?

  5. Dagmare, there are lots and lots of libraries for normal clustering out there. For python I’ve used scikits.learn : http://scikit-learn.sourceforge.net/modules/clustering.html#clustering

  6. Hello, I’m trying to understand the algorithm and by looking at the video I am a bit confused. I notice that there are points that disappear in the video. For instance in frame 1 there is a point on the bottom right of the frame, but in frame 2 that point disappears and the centroid moves to the top right. This doesn’t make sense to me because when I read the algorithm I don’t see that they remove points, only clusters.

    Thanks

  7. Hi,

    thanks for the code! I had a question about the merge operation. It looks like the merge between two clusters only occurs if the number of clusters is above N. How come the number of clusters can be lower than N?

  8. mrm: The algorithm is incremental – at the start it has not yet seen N data-points and will only have created a cluster for each point seen. We only have to start merging once we have more than N clusters/datapoints.

  9. naydav: Very good question. I can only assume that I accidentally had some frame lying around from a previous run and forgot to clean up the working directory. I.e. it’s a bug :)

  10. Hi,

    Thanks for the answer. I understand that the number of clusters len(self.clusters) increases as new datapoints are introduced. However, self.N is a fixed parameter that is determined by the user at the beginning and does not change in the code. And the merge operation only happens if the number of clusters is larger or equal to N. So doesn’t that mean that the final number of clusters is fixed to N-1 and thus is determined by the user?

  11. Ah – I misunderstood your question. I must admit that it’s so long since I wrote this code I’ve forgotten, but it seems you are right. Merge is only ever called in this one place, so there will always be N-1 clusters.

    You could post-process at the “end” of course, and remove singleton clusters or similar.

  12. Brilliant. Any chance of seeing that pegasos SVM implementation too? :)

Post a comment.