Posts categorized “Machine Learning”.

Illustrating the kernel trick

For a one paragraph intro to SVMs and the kernel-trick I wanted a a graphic that I’ve seen in a book (although forgotten where, perhaps in Pattern Classification?):

Simple idea — show some 2D data points that are not linearly separable, then transform to 3D somehow, and show that they are. I found nothing on google (at least nothing that was high enough resolution to reuse, so I wrote some lines of python with pylab and matplotlib:

import math
import pylab
import scipy

def vlen(v):
return math.sqrt(scipy.vdot(v,v))


a=scipy.array([x for x in p if vlen(x)>1.3 and vlen(x)<2])
b=scipy.array([x for x in p if vlen(x)<0.8])

pylab.scatter(a[:,0], a[:,1], s=30, c="blue")
pylab.scatter(b[:,0], b[:,1], s=50, c="red", marker='s')


fig = pylab.figure()
from mpl_toolkits.mplot3d import Axes3D
ax = Axes3D(fig)

ax.scatter3D(map(vlen,a), a[:,0], a[:,1], s=30, c="blue")
ax.scatter3D(map(vlen,b), b[:,0], b[:,1], s=50, marker="s", c="red")


Take — adapt — use for anything you like, you can rotate the 3D plot in the window that is shown and you can save the figures as PDF etc. Unfortunately, the sizing of markers in the 3d plot is not yet implemented in the latest matplotlib (, so this only looks good with the latest SVN build.

The Machine Learning Algorithm with Capital A

A student came see me recently, wanting to do a Diplomarbeit (i.e. a MSc++) on a learning technique called Hierarchical Temporal Memory or HTMs. He had very specific ideas about what he wanted, and has already worked with the algorithm in his Projektarbeit (BSc++). I knew nothing about the approach, but remembered this reddit post, which was less than enthusiastic. I spoke to the student and read up on the thing a bit, and it seem interesting enough. It’s claimed to be close the way the brain learns/recognizes patterns and to be a general model of intelligence and  it will work for EVERYTHING. This reminded me of a few other things I’ve come across in the past years that claim to be the new Machine Learning algorithm with Capital A, i.e. the algorithm to end all other ML work, which will work on all problems, and so on. Here is a small collection of the three most interesting ones I remembered:

Hierarchical Temporal Models

HTMs are “invented” by Jeff Hawkin, whose track record includes making the Palm device and platform and later the Treo. Having your algorithm’s PR based on the celebrity status of the inventor is not really a good sign. The model is first presented in his book On Intelligence, which I’ve duly bought and am currently reading. The book is so far very interesting, although full of things like “[this is] how the brain actually works“, “Can we build intelligent machines? Yes. We can and we will.”, etc. As far as I understand, the model from the book was formally analysed and became the HTM algorithm in Dileep George‘s thesis: How the brain might work: A hierarchical and temporal model for learning and recognition. He applies to recognizing 32×32 pictures of various letters and objects.

The model is based on a hierarchy of sensing components, each dealing with a higher level of abstraction when processing input, the top of the hierarchy feeds into some traditional learning algorithm, such as a SVM for classification, or some clustering mechanism. In effect, the whole HTM is a form of feature pre-processing. The temporal aspect is introduces by the nodes observing their input (either the raw input, or their sub-nodes) over time, this (hopefully) gives rise to translation, rotation and  scale invariance, as the things you are watching move around.  I say watching here, because computer-vision seems to be the main application, although it’s of course applicable to EVERYTHING:

HTM technology has the potential to solve many difficult problems in machine learning, inference, and prediction. Some of the application areas we are exploring with our customers include recognizing objects in images, recognizing behaviors in videos, identifying the gender of a speaker, predicting traffic patterns, doing optical character recognition on messy text, evaluating medical images, and predicting click through patterns on the web.

The guys went on to make a company called Numenta for spreading this technique, they have a (not open-source, boo!) development framework you can play with.

Normalised Compression Distance

This beast goes under many names: compression based learning, compression-based dissimilarity measure, etc. The idea is in any case to reuse compression algorithms for learning, from good old DEFLATE algorithm from zip/gzip, to algorithms specific to some data-type, like DNA. The distance between things is then derived from how well they compress together with each other or with other data, and the distance metric can then be used for clustering, classification, anomaly detection, etc. The whole thing is supported by the theory of Kolmogorov Complexity and Minimum Description Length, i.e. it’s not just a hack.

I came across it back in 2003 in the Kuro5hin article Spam Filtering with gzip, back then I was very sceptical, thinking that any algorithm dedicated to doing classification MUST easily out-perform this. What I didn’t think about is that if you use the optimal compression for your data, then it finds all patterns in the data, and this is exactly what learning is about. Of course, gzip is pretty far from optimal, but it still works pretty well. I am not the only one who wasn’t convinced, this letter appeared in a physics journal in 2001, and led to some heated discussion: angry comment, angry reply, etc.

A bit later, I came across this again. Eamonn Keogh wrote Towards parameter-free data mining in 2004,  this paper makes a stronger case for this method being simple, easy and great and applicable to EVERYTHING:

[The Algorithm] can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.

A bit later again I came across Rudi Cilibrasi (his page is broken atm.) thesis on Statistical Inference Through Data Compression. He has more examples, more theory and most importantly open-source software for everything: CompLearn (also down atm., but there are packages in debian). The method is very nice in that it makes no assumptions about the format of the input, i.e. no feature vectors or similar. Here is a clustering tree generated from a bunch of different files types:

Markov Logic Networks

I first came across Markov Logic Networks in the paper: Automatically Refining the Wikipedia Infobox Ontology. Here they have two intertwined problems they use machine learning to solve, firstly they want to map wikipedia category pages to WordNet Synsets, and secondly they want to arrange the wikipedia categories in a hierarhcy, i.e. by learning is-a relationships between categories. The solve the problem in two ways, the traditional way by using training a SVM to do the WordNet mappings, and using these mappings as an additional features for training a second SVM to do the is-a learning. This is all good, and works reasonably well, but by using Markov Logic Networks they can use joint inference to tackle both tasks at once. This is good since the two problems are clearly not independent, and now evidence that two categories are related can feed back and improve the probability that the map to WordNet synsets that are also related. Also, it allows different is-a relations to influence each other, i.e. if Gouda is-a Cheese is-a Food, then Gouda is probaby also a Food.

The software used in the paper is made by the people at the University of Washington, and is available as open-source: Alchemy – algorithms for statistical relational learning and probabilistic logic inference. The system combines logical and statistical AI, building on network structures much like Bayesian belief networks, in the end it’s a bit like Prolog programming, but with probabilities for all facts and relations, and these probabilities can be learned from data. For example, this cheerful example  about people dying of cancer, given this dependency network and some data about friends who influence each other to smoke and dying, you can estimate the probability that you smoke if your friend does and the probability that you will get cancer:

Since I am writing about it here, it is clear that this is applicable to EVERYTHING:

Markov logic serves as a general framework which can not only be used for the emerging field of statistical relational learning, but also can handle many classical machine learning tasks which many users are familiar with. […] Markov logic presents a language to handle machine learning problems intuitively and comprehensibly. […] We have applied it to link prediction, collective classification, entity resolution, social network analysis and other problems.

The others

Out of the three I think I find Markov Logic Network to be the most interesting, perhaps because it nicely bridges the symbolic and sub-symbolic AI worlds. This was my personal problem since I cannot readily dismiss symbolic AI as a Semantic Web person, but the more I read about about kernels, conditional random fields, online-learning using gradient descent etc. the more I realise that rule-learning and inductive logic programming probably isn’t going to catch up any time soon. NCD is a nice hack, but I tested it on clustering RDF resources, comparing the distance measure from my thesis with gzip’ping the RDF/XML or Turtle, and it did much worse. HTM still strikes me as a bit over-hyped, but I will of course be sorry when the bring the intelligent robots to market in 2011.

Some other learning frameworks that nearly made it into this post:

(wow – this got much longer than I intended)

Noun-phrase Chunking for the Awful German Language

(updated 05/09/2011)

In the Organik project we’ve been using the noun-phrase extraction modules of OpenNLP toolkit to extract key concepts from text for doing Taxonomy Learning. OpenNLP comes with trained model files for English sentence detection, POS-tagging and either noun-phrase chunking or full parsing, and this works great.

Of course in Organik we have some German partners who insist on using their awful german language [1] for everything – confusing us with their weird grammar. Finding a solution to this has been on my TODO list for about a year now. I had access to the Tiger Corpus of 50,000 German sentences marked up with POS-tags and syntactic structure. I have tried to use this for training a model for NP chunking, either using the OpenNLP MaxEnt model or with conditional random fields as implemented in FlexCRF. However, the models never performed better than around 60% precision and recall, and testing showed that this was really not enough. Returning to this problem now once again I have looked more closely at the input data, it turns out the syntactic structures used in the Tiger Corpus are quite detailed, containing far higher granularity of tag-types than what I need. For instance the structure for “Zwar hat auch der HDE seinen Mitgliedern Tips gegeben, wie mit vermeintlichen Langfingern umzugehen sei.” (click for readable picture):

Here the entire “Tips […] wie mit vermeintlichen Langfingern umzugen sei”, is a noun-phrase. This (might) be linguistically correct, but it’s not very useful to me when I essentially want to do keyword extraction. Much more useful is the terms marked NK (Noun-Kernels), i.e. here “vermeintlichen Langfingern”. Another problem is that the tree is not continuous with regard to the original sentence, i.e. the word gegeben fits into the middle of the NP, but is not a part of it.

SO – I have preprocessed the entire corpus again, flattening the tree, taking the lowermost NK chain, or NP chunk as example. This gives me much shorter NPs in general, for which it is easier to learn a model AND the result is more useful in Organik. Running FlexCRF again on this data, splitting off a part of the data for testing, gives me a model with 94.03% F1-measure on the test data. This is quite comparable to what was achieved for English with FlexCRF in CRFChunker, for the WSJ corpus they report a F-Measure of 95%.

I cannot redistribute the corpus or training data, but here is the model as trained by FlexCRF for chunking: GermanNPChunkModel.tar.gz (17.7mb)

and for the POSTagging: GermanPOSTagModel.tar.gz (9.5mb)

Both are trained on 44,170 sentences, with about 900,000 words. The POSTagger was trained for 50 iterations, the Chunker for 100, both with 10% of the data used for testing.

In addition, here is a model file trained with OpenNLPs MaxEnt: OpenNLP_GermanChunk.bin.gz (5.2mb)

This was trained with the POS tags as generated by the German POStagger that ships with OpenNLP, and can be used with the OpenNLP tools like this:

java -cp $CP \
models/german/sentdetect/sentenceModel.bin.gz |
java -cp $CP \
models/german/tokenizer/tokenModel.bin.gz |
java -cp $CP -Xmx100m \
models/german/postag/posModel.bin.gz |
java -cp $CP \

That’s it. Let me know if you use it and it works for you!

Update 5/9/2011:
I have re-run the OpenNLP chunker with the most recent 1.5.1 version. The chunking model file format had changed, the updated file is here: I did not redo the POS tagging, so if it changed from the old version the accuracy will suffer.

I’ve also put up the scripts I used to prepare this – starting with some corpus in the TIGER-XML treebank encoding format in tiger.xml and opennlp libraries in lib/:

# Convert XML to Flex
python tiger.xml > tiger.flex
# Reoplace POS tags with those OpenNLP produces
python tiger.flex > tiger.opennlp
# Learn chunking model
sh tiger.opennlp

It’s a while since I wrote this code, so I am not 100% sure how it works any more, but it seems fine :)

[1] Completely unrelated, but to exercise your German parsing skills, check out some old newspaper articles. Die Zeit has their online archive available back to 1946, where you find sentence-gems like this: Zunächst waren wir geneigt, das geschaute Bild irgendwie umzurechnen auf materielle Werte, wir versuchten Arbeitskräfte und Zeit zu überschlagen, die nötig waren, um diese Wüste, die uns umgab, wieder neu gestalten zu können, herauszuführen aus diesem unfaßlichen Zustand der Zerstörung, überzuführen in eine Welt, die wir verstanden, in eine Welt,’ die uns bis dahin umgeben hatte. (ONE sentece!, from

FastMap in Python

This shows some strings positioned on a graph according to their Levenshtein string distance. (And let me note how appropriate it is that a man named Levenshtein should make a string distance algorithm)

The mapping from the distance matrix given by the string distance is then mapped to two dimensions using the FastMap algorithm by Christos Faloutsos and King-Ip Lin. This mapping can be also done with Multidimensional Scaling, (and I did so in my PhD work, I even claimed it was novel, but actually I was 40 years too late, oh well), but that algorithm is nasty and iterative. FastMap, as the name implies, is much faster, it doesn’t actually even need a full distance matrix (I think). It doesn’t always find the best solution, and it has a slight random element to it, so the solution might also vary each time it’s run, but it’s good enough for almost all cases. Again, I’ve implemented it in python – grab it here:

To get a feel for how it works, download the code, remove George Leary and see how it reverts to only one dimension for a sensible mapping.

The algorithm is straight-forward enough, for each dimension you want, repeat:

  1. use a heuristic to find the two most distant points
  2. the line between these two points becomes the first dimension, project all points to this line
  3. recurse :)

The distance measure used keeps the projection “in mind”, so the second iteration will be different to the first.The whole thing is a bit like Principal Component Analysis, but without requiring an original matrix of feature vectors.

This is already quite old, from 1995, and I am sure something better exists now, but it’s a nice little thing to have in the toolbox. I wonder if it can be used to estimate numerical feature values for nominal attributes, in cases where all possible values are known?

Online Clustering in Python

A little while ago Stefano pointed me to Online Learning in Clojure, a really simple implementation of an the Pegasos online classification algorithm in the lisp-like (but compile to java bytecode) language Clojure. I implemented it myself in python, and it really is very simple. I’ll put it here some day when I have cleaned it up a bit.

By online in this context I mean an algorithm that does not keep all the examples in memory at the same time, but processes them one by one, keeping only aggregates or similar. Ideally it also does this in a single pass, i.e. you just look once at each example, the theory then goes that if you have enough examples they will still allow you to identify each class well. For linear binary classification problems this amazingly well, Pegasos for instance can train a binary-class SVM on 780,000 examples in less than 2 seconds. This is achieved by doing almost nothing for each example, the only operation is updating a weight-vector with the weighted example vector, where the weight depends on the class predicted by the weight vector at that point. The paper also includes pointers to Online Learning with Kernels, which describes how the same method can be adapted to work with kernel-methods that allow non-linear classification. I implemented this as well, and it makes the whole thing several times slower, but it’s still amazingly fast.

Now — this was during my holidays and I started wondering if you could also do clustering online. I’d had enough of reading papers at this point, so I just tried hacking something in python. It worked fairly well, but I had some problems when it came to merging clusters. Now a few weeks later I return to this, but this time I read the paper before. It turns out a very simple algorithm was published in 99: An on-Line Agglomerative Clustering Method for Non-Stationary Data, it was very similar to what I had pulled out of my made up, but where I had tried to find a complicated solution for deciding when to merge clusters they had a very simple solution. And it worked quite well, as long as the data was not too noisy. Luckily, a bit later, someone wrote a paper with the catchy title: Improving the Robustness of ‘Online Agglomerative Clustering Method’ Based on Kernel-Induce Distance Measures. They use kernel-functions for distance measures and a clever weighting to iron out the noise in the data. (Although the cluster-prototypes are still kept in normal space, so it’s not really a kernel-method?). NOW, (slowly getting to the point) I also implemented this in Python, and it rocks:

The algorithm has two parameters, the σ parameter for gaussian kernel used, and K, the number of cluster prototypes to maintain. The number of clusters is found automatically, although it cannot find more than K-1. The runtime is otherwise linear wrt. to the number of data-points, but scaled by K i.e. roughly:


This all got a bit detailed, sorry. I really just wanted to hit the youtube embed button and go. At least it’s a fun video.


Ah – in a bit of forget-the-attachment-in-a-“here-is-the-attachment”-email, I appear to have forgotten the python code :)

Grab it at:, it is fairly straight-forward. It requires scipy, and will plot it’s results if you have pylab installed. (The video above was also generated using pylab). The only bit where readability suffered was using the heapq module for finding the clusters to merge. I’ve not used the module before, and it took a long goddamn while to figure out that since I wanted tuples in my queue and they compare funnily, I had to introduce an extra object with a __cmp__ method. The instances are represented as scipy arrays (i.e. vectors of floats), the clusterer will happily let you grow the dimsions as you go along.


Applying market basket analysis to RDF

Since 3 is a magic number I'd really like to have 3 different learning algorithms used Smeagol. Currently I have ILP and HAC-clustering, both applied in several different ways. Sequence/Basket analysis seems like a good candidate for third algorithm, since it's the only area of ML not covered yet. (ILP covering classification and more…)
Sequence analysis would of course require a time dimension the the data, which i'd really rather not get into, AND it was probably covered pretty well by Heather Maclaren.
Basket analysis is left, and my first attempt was quickly hacked up using Orange. The things in my baskets are predicate-value pairs, and each person becomes a basket on their own. I tried this on several data-sets i had lying around, here are some quick and dirty results:

A small subset of my IMDB Data (3534 triples) gave me:

rdf#type IMDB#Movie -> IMDB#languages English


rdf#type IMDB#Movie -> IMDB#country_USA

My email from the last 5 years as crawler by aperture (127615 triples) gave me the fascinating rule:

aperture:mimeType message/rfc822 -> rdf#type imap/Message

A subset of some old FOAF crawl stolen from JibberJim years ago gave me:

jim#isKnownBy norman.walsh#norman-walsh -> rdf#type foaf/Person

yes – fascinating indeed. I also found the Norman Walsh rule using ILP years ago, at least running this one was pretty fast.

I'm not sure what to conclude from this – none of the rules are groundbreaking OR that interesting. Maybe I can tweak the way items are represented, using just values or just predicates for example. I'll see tomorrow.

I also had a brain-storming session with myself and some gin'n'tonic today, and if I don't finish this PhD it's because the table wasn't big enough: