Posts categorized “Python”.

I’ll trie in python

In Koble the auto-completion of thing-names used for wiki-editting, instant-search and adding relations  is getting slower and slower, mainly because I do:

result=[]
things=listAllThings()
for t in things:
   if t.startswith(key): res.append(t)
for t in things:
   if key in t: res.append(t)

Going through the list twice makes sure I get all things that match well first (i.e. the start with the string I complete for), and then things matching less well later (they only contain the string).

Of course the world has made up a far better data-structure for indexing prefix’es of string, namely the trie, or prefix tree. James Tauber had already implemented one in python, and kindly made it available. His version didn’t do everything I needed, so I added a few methods. Here is my updated version:

http://gromgull.net/2009/11/trie.py

Enjoy!

Heat-maps of Semantic Web Predicate usage

It’s all Cygri’s fault — he encouraged me to add schema namespaces to the general areas on the semantic web cluster-tree. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full day later I have:

FOAF Predicates on the Semantic Web

It’s the same map as last time, laid using graphviz’s neato as before. The heat-map of the properties was computed from the feature-vector of predicate counts, first I mapped all predicates to their “namespace”, by the slightly-dodgy-but-good-enough heuristic of taking the part of the URI before the last # or / character. Then I split the map into a grid of NxN points (I think I used N=30 in the end), and compute a new feature vector for each point. This vector is the sum of the mapped vector for each of the domains, divided by the distance. I.e. (if you prefer math) each point’s vector becomes:

\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y),  pos_d)}}

Where D is the distance (here simple 2d euclidean), d is each domain, pos_d is that domains position in the figure and V_d is that domains feature vector. Normally it would be more natural to decrease the effect by the squared distance, but this gave less attractive results, and I ended up square-rooting it instead. The color is now simply on column of the resulting matrix normalised and mapped to a nice pylab colormap.

Now this was the fun and interesting part, and it took maybe 1 hour. As predicted. NOW, getting this plotted along with the nodes from the graph turned out to be a nightmare. Neato gave me the coordinates for the nodes, but would change them slightly when rendering to PNGs. Many hours of frustration later I ended up drawing all of it again with pylab, which worked really well. I would publish the code for this, but it’s so messy it makes grown men cry.

NOW I am off to analyse the result of the top-level domain interlinking on the billion triple data. The data-collection just finished running while I did this. … As he said.

Visualising predicate usage on the Semantic Web

So, not quite a billion triple challenge post, but the data is the same.  I had the idea that I compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates is used within each one. Then once I had the distance-metric, I could use FastMap to visualise it. It would be a quick hack, it would look smooth and great and be fun. In the end, many hours later, it wasn’t quick, the visual is not smooth (i.e. it doesn’t move) and I don’t know if it looks so great. It was fun though. Just go there and look at it:

PayLevelDomains cluster-tree

As you can see it’s a large PNG with the new-and-exciting ImageMap technology used to position the info-popup, or rather used to activate the JavaScript used for the popups. I tried at first with SVG, but I couldn’t get SVG and XHTML and Javascript to play along, I guess in Firefox 5 it will work. The graph is laid out and generated Graphviz’s neato, which also generated the imagemap.

So what do we actually see here? In short, a tree where domains that publish similar Semantic Web data are close to each other in the tree and have similar colours. In detail: I took the all PLDs that contained over 1,000 triples, this is around 7500, and counted the number of triples for each of the 500 most frequent predicates in the dataset. (These 500 predicates cover ≈94% of the data). This gave me a vector-space with 500 features for each of the PLDs, i.e. something like this:

geonames:nearbyFeature dbprop:redirect foaf:knows
dbpedia.org 0.01 0.8 0.1
livejournal.org 0 0 0.9
geonames.org 0.75 0 0.1

Each value is the percentage of triples from this PLD that used this predicate. In this vector space I used the cosine-similarity to compute a distance matrix for all PLDs. With this distance matrix I thought I could apply FastMap, but it worked really badly and looked like this:

Fastmapping the PLDs

So instead of FastMap I used maketree from the complearn tools, this generates trees from a distance matrix, it generates very good results, but it is an iterative optimisation and it takes forever for large instances. Around this time I realised I wasn’t going to be able to visualise all 7500 PLDs, and cut it down to the 2000, 1000, 500, 100, 50 largest PLDs. Now this worked fine, but the result looked like a bog-standard graphviz graph, and it wasn’t very exciting (i.e not at all like this colourful thing). Now I realised that since I actually had numeric feature vectors in the first place I wasn’t restrained to using FastMap to make up coordinates, and I used PCA to map the input vector-space to a 3-dimensional space, normalised the values to [0;255] and used these as RGB values for colour. Ah – lovely pastel.

I think I underestimated the time this would take by at least a factor of 20. Oh well. Time for lunch.

FastMap in Python

This shows some strings positioned on a graph according to their Levenshtein string distance. (And let me note how appropriate it is that a man named Levenshtein should make a string distance algorithm)

The mapping from the distance matrix given by the string distance is then mapped to two dimensions using the FastMap algorithm by Christos Faloutsos and King-Ip Lin. This mapping can be also done with Multidimensional Scaling, (and I did so in my PhD work, I even claimed it was novel, but actually I was 40 years too late, oh well), but that algorithm is nasty and iterative. FastMap, as the name implies, is much faster, it doesn’t actually even need a full distance matrix (I think). It doesn’t always find the best solution, and it has a slight random element to it, so the solution might also vary each time it’s run, but it’s good enough for almost all cases. Again, I’ve implemented it in python – grab it here: fastmap.py

To get a feel for how it works, download the code, remove George Leary and see how it reverts to only one dimension for a sensible mapping.

The algorithm is straight-forward enough, for each dimension you want, repeat:

  1. use a heuristic to find the two most distant points
  2. the line between these two points becomes the first dimension, project all points to this line
  3. recurse :)

The distance measure used keeps the projection “in mind”, so the second iteration will be different to the first.The whole thing is a bit like Principal Component Analysis, but without requiring an original matrix of feature vectors.

This is already quite old, from 1995, and I am sure something better exists now, but it’s a nice little thing to have in the toolbox. I wonder if it can be used to estimate numerical feature values for nominal attributes, in cases where all possible values are known?

Online Clustering in Python

A little while ago Stefano pointed me to Online Learning in Clojure, a really simple implementation of an the Pegasos online classification algorithm in the lisp-like (but compile to java bytecode) language Clojure. I implemented it myself in python, and it really is very simple. I’ll put it here some day when I have cleaned it up a bit.

By online in this context I mean an algorithm that does not keep all the examples in memory at the same time, but processes them one by one, keeping only aggregates or similar. Ideally it also does this in a single pass, i.e. you just look once at each example, the theory then goes that if you have enough examples they will still allow you to identify each class well. For linear binary classification problems this amazingly well, Pegasos for instance can train a binary-class SVM on 780,000 examples in less than 2 seconds. This is achieved by doing almost nothing for each example, the only operation is updating a weight-vector with the weighted example vector, where the weight depends on the class predicted by the weight vector at that point. The paper also includes pointers to Online Learning with Kernels, which describes how the same method can be adapted to work with kernel-methods that allow non-linear classification. I implemented this as well, and it makes the whole thing several times slower, but it’s still amazingly fast.

Now — this was during my holidays and I started wondering if you could also do clustering online. I’d had enough of reading papers at this point, so I just tried hacking something in python. It worked fairly well, but I had some problems when it came to merging clusters. Now a few weeks later I return to this, but this time I read the paper before. It turns out a very simple algorithm was published in 99: An on-Line Agglomerative Clustering Method for Non-Stationary Data, it was very similar to what I had pulled out of my made up, but where I had tried to find a complicated solution for deciding when to merge clusters they had a very simple solution. And it worked quite well, as long as the data was not too noisy. Luckily, a bit later, someone wrote a paper with the catchy title: Improving the Robustness of ‘Online Agglomerative Clustering Method’ Based on Kernel-Induce Distance Measures. They use kernel-functions for distance measures and a clever weighting to iron out the noise in the data. (Although the cluster-prototypes are still kept in normal space, so it’s not really a kernel-method?). NOW, (slowly getting to the point) I also implemented this in Python, and it rocks:

The algorithm has two parameters, the σ parameter for gaussian kernel used, and K, the number of cluster prototypes to maintain. The number of clusters is found automatically, although it cannot find more than K-1. The runtime is otherwise linear wrt. to the number of data-points, but scaled by K i.e. roughly:

O(KN)

This all got a bit detailed, sorry. I really just wanted to hit the youtube embed button and go. At least it’s a fun video.

Update:

Ah – in a bit of forget-the-attachment-in-a-”here-is-the-attachment”-email, I appear to have forgotten the python code :)

Grab it at: OnlineCluster.py, it is fairly straight-forward. It requires scipy, and will plot it’s results if you have pylab installed. (The video above was also generated using pylab). The only bit where readability suffered was using the heapq module for finding the clusters to merge. I’ve not used the module before, and it took a long goddamn while to figure out that since I wanted tuples in my queue and they compare funnily, I had to introduce an extra object with a __cmp__ method. The instances are represented as scipy arrays (i.e. vectors of floats), the clusterer will happily let you grow the dimsions as you go along.

Enjoy!

Integrating bibsonomy into Koble

A few years back I created koble – and used it on and off for my own notes ever since. (if you didn’t hear of it before, never mind – it’s essentially a semantic wiki)

So — Koble lets me create a page for a concept and then associate links with this page, in fact, it even has a bookmarklet for easily creating new things from pages or associating page to existing things. Sounds like anything you know? A bit like tagging with delicious or bibsonomy? Oh yes. So in the off periods of not using Koble I used bibsonomy for all my links, and I definitely use bibsonomy for keeping track of publications that I read skim the abstract and conclusion of. However, when it comes to writing notes about what I read I need the wiki. Sigh. So I bit the bullet, picked up the python showel and dug into the bibsonomy API, and a few hours later I have:

At first I worried about identity — what if the OnlineLearning of Koble is no the same as the OnlineLearning tag on bibsonomy? What if I have SemanticWeb vs. semantic-web? Maybe I should let users (read “myself”) selectively associated different things with different tags?

Then I learned to stop worrying and just get on with my life and now I show any links/publications where the tags matched if they are the same once all non-letters have been removed and the rest converted to lower-case.

Now I’ve fixed my tool — now nothing can stop me from actually doing some work.

Exhilarating Acceleration

Now that the Nepomuk review is finished I've had some time to play with the OpenMoko FreeRunner. The state of the standard phone-software is abysmal, but the hard-ware is cool AND it runs python :) Since Kaiserslautern was covered in snow this weekend and it was not so tempting to go out I spent a few hours playing with the accelerometers. The actual idea for what to do started with seeing this screenshot:

of the game Zing on OneMoreLevel.com. The game is nothing like I thought it was, I imagined the small things to be very abstract cars and not bugs trying to eat you. Oh well.

Anyway, within a few hours and some stealing Creative Commons content later I had a crappy car game:

controlled by the accelerometer of the FreeRunner, i.e. lie the FreeRunner flat, start the game, now when you tilt the FreeRunner in some direction the car drives that way! Trust me, the scrreenshot really does not do it justice, the feeling you get frome the immediate control and the amazing sound-effects is… well pretty much just like driving in real life!

Download tarball, untar on FreeRunner, make sure you have python-pygame and libpng3 installed and run python car.py. Press AUX to exit. The code is of course not pretty, but does at least show some basics of Accelerometer handling.

Web 0.2

Currently I am in Berlin at the Semantic Desktop Hands-On Workshop and I was working on a toy that I started on the train. No details now, it will appear here when finished (i.e. when 10% complete and I can get a screenshot).

For this toy I needed to get titles, publication dates and thumbnails of flickr photos. Flickr is amazingly open so this should be easy! Here we go:

  • Since this is a SEMANTIC workshop I will do this through RSS (which is kinda semantic…) and this makes it easy to work with any photostream from flickr (personal, contacts, tags, etc.). So I download the universal feed parser, it parses the feed, but loses the thumbnail and image elements. Bugger. Struggle for 20-30 min to make sure it's not my fault, but no, they are indeed gone.

  • Option 2: Find a flickr python API! Setup an API key, and try to make the FlickrAPI request authorisation, it needs a browser, leave it empty and it tries lynx. I only have links. I do this from the interactive python prompt and the flickrapi calls sys.exit when things go wrong. Great! Try links – it wont let me login, the meta-refresh always brings me back to the login page. What is the command to launch firefox on commandline in MacOSX? (prob. /Applications/Firefox.app/Contents/MacOS/firefox) – never mind. I give up – and revert to RSS.

  • Option 3: write my own rss "parser". "Parser" in the loosest sense here, no XML, just reg-exp the lines, relying on the order of RSS elements in the flickr rss feeds. This works nearly, till I get to parsing the publication dates. The date is given in some RFC format, i.e.: "Mon, 9 Apr 2007 04:07:19 -0800". Python has strftime and strptime to format and parse respectively. The format string for this is "%a, %d %b %Y %H:%M:%S %z" – however, %z (for the timezone offset) is not mentioned on the module documentation page, but it seems to work for strftime. However, parsing gives me:

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/_strptime.py", line 286, in strptime
format_regex = time_re.compile(format)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/_strptime.py", line 264, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/_strptime.py", line 256, in pattern
processed_format = "%s%s%s" % (processed_format,
KeyError: 'z'

Sigh…

Python saves the day from retarded iTunes application

Ever since I got my mac I've had endless problems with tagging (man, that word really is getting overloaded these days) my music collection. Not only did iTunes forget most of the tags I had lovely created in QuodLibet (a far superior tagger/music library/player), but also it did not support multiple values for a single tag (for instance a song by Miles Davis AND John Coltrane?).

Back then I was sort of saved by some applescript hacking, some patience, many hours of trying to face the fact that there were no good music players for MacOSX, and finally some effort to forget that I had previously lived in a perfect world… NOW, fast forward to earlier this week, I was reading about the British Recording Industry suing AllOfMP3, and a Norwegian newspaper writing that it was safe and sound for norwegians to buy music there still, (and maybe also the recent german crusade against illegal downloading…) So I decided to see what the fuzz was about and after I saw that they had several Ars Nova albums, which I have been unable to get hold of anywhere else in the world, I created an allofmp3 account and spent $10 to top up my balance… Shortly after I was the proud owner of "The Goddess of Darkness" and "The Book of the Dead", all for less than $3… Amazing. Unfortunately you have to download each song individually, which takes long if you get a bit shopping happy, downloading a zip of a whole album would make much more sense… I'm sure I could script the process though. (In doing this I found that each time I right-clicked and did save-as in firefox the whole application would freeze for 5s, googling a bit found that clearing the download list fixes this, because the list is kept in a file called "downloads.rdf", which is slow to update, *ahem*)

Now to the technical interesting bit: The MP3s I downloaded were 192kbps, with ID3 tag containing the Artist and song name, with filenames like this: 03-morgan_192_lame_cbr.mp3. Now the genre of all the songs I downloaded was set to "Blues", for no apparent reason, this held for Ars Nova, Bruce Springsteen and Røyksopp, but iTunes makes it easy to set the genre for a bunch of song. Worse, the ID3 tags did not contain track-numbers, and although this information is in the filename, iTunes has no easy way to extract this. QuodLibet/ExFalse of course does this very well indeed, but I've been unable to run them on MacOSX… BUT they are written in python, and the tag-handling is a separate library called Mutagen, and this runs nicely on MacOSX, so with 15 minutes I had my new MP3s tagged correctly! Hurrah.

The script is at http://www.dfki.uni-kl.de/~grimnes/2006 … knumber.py if anyone cares! If I feel inspired some day I might write a more general regexp based one, indeed I could probably steal most of it from QuodLibet…

(A side note, I need to stop reading Hunter S. Thompson to get rid of these looong rambling sentences with no point, OR maybe I have to read more… )

Dilbert OCR – Part I

Today's pointless hack was brought on by the fact that happened to be in possession of the complete dilbert comics up until 2005. Often I remember some particular strip, but have no way to sensible search thousands of GIF files. (I *could* join comics.com, but that would be less fun, and I already have nearly all the dilbert books, so this info is sort of 'mine' already, ahem). Also inspired by my recent re-reading of Hofstadter's Letter Spirit, I set to work. Note that there are good commercial solutions for this, and probably lots of well known algorithms etc., but I wanted to discover this myself. I did a quick google for tutorials on OCR, but nothing sensible came up.)
So, looking at some strips, the dilbert text seems to have many nice features for automatic extraction:

  • The font is always the same size/type
  • the letters are all capitals
  • the lines of text are always straights
  • there is always white-space behind the text
  • There is very little punctuation or digits.
  • with some exceptions to all of these, but these bits i can live without:

So with the gimp, python, ImageMagick, Numeric and PIL I set to work. First I cut out one of each letter, and auto-cropped in the gimp:


The first plan was to scan every pixel of the big image for a match with every letter, a match being defined as the sum of the errors for the overlapping pixels being below a certain threshold. Since most dilberts (apart from sunday strips) are black and white anyway, I did everything in gray-scale mode and the pixel values are simply byte values. The threshold for a "match" was set at 40 after some trial and error.
This process was slow as cancer, even when I changed from python lists to Numeric arrays. (later it turned, using a flat list and computing the offset as y*w+x is actually quicker than Numeric arrays, even without psyco, odd…) The first attempt, for a single letter only, looked like this:

Tweaking the numbers for the "match" threshold and removing testing for overlapping matches etc. speeded it up a bit, and I soon discovered another problem. Here illustrated by looking for matches for i's in the first frame:

The problem was that comparing the tightly cropped I image was matching lots of sub-sections of other letters. Since the thing was still horribly slow I decided to try a slightly different approach. I would detect the base-lines of each line of the text in picture, again, this should be pretty easy apart from a few comics where there is content next to the text, but I would worry about that later. First attempt at finding lines that were largely empty:

then group together blocks of lines and remove the ones to close to the top to fit a whole letter:

Now I retried the above matching of letters, trying each letter on every possible X position along each line, then order them by X value.
This produces the first real result, i.e. it spat out some text:


vititingacuitmer
visiting a customer

odurodftfitoas]
our office was

iodesitgneodwithltfhe
designed with the

citenceocgepfecghuitl
science of feng shui

Correct answers in italics. Interestingly this produces more letters than in the original :) So I try grouping the letters that are duplicates, i.e. both were detected in the same place:


v<it><it>ing a cu<i t>mer

<od>ur <od>f<tf><it>o as

<iod>es<it>gne<od> w<it>h lt<fh>e

c<it>ence <ocg><ep> fe<cg> hu<itl>

It's not quite useful yet… even if I generated all possible words for each ambiguous character I wouldn't get anything sensible.
Also, there is clearly a big problem in recognising S's and in telling Is and Ts apart. Maybe I shouldn't disallow overlapping boxes… this made it slightly slower again, but didn't improve results.

So after two evenings of hacking I now had some random text that was completely useless. My instinct told me I had probably taken the simple sum-of-error approach as far as it would go, and now – there is a fork in the path:

1. Keep considering the letters independently, but let the program learn, i.e. sit through a few sessions of : "i reckon this is an 'S', no that's a 'Z', try harder…. "

2. Make the matching pattern of each letter more aware of the special features of each letter, i.e. ignore the bits the I and T have in common, focus on the difference. Not sure how to do this when 3 (or more) letters match the same thing… I would probably just try a dodgy hack and see where I get to.

3. Neural networks does 2 much better than I can ever hope to.

I've done some more work on this now, and I will shortly follow it up with a chapter II! :)
I was a bit late in typing up this part I — and I needed a Saturday to find the time (as well as the need to prepare 150 slides on Semantic Web Services for Monday made it easy to want to do other things…)

PS: Leo was convinced I was wasting my time (BUT IT'S THE JOURNEY NOT THE DESTINATION!), and tried OmniPage Pro on some dilbert cartoons, and it sucked! Ha! This isn't completely pointless after all!