Posts categorized “Python”.

python, regex, unicode and brokenness

(This post included a complaint about handling of unicode codepoint >0xffff in python, including a literal such character, and it broke WordPress, which ate the remainder of the post after that character… and I am too lazy to retype it, so for now, no unicode)

I love python, I really do, but some things are … slightly irregular.

One of those things is the handling of unmatched regular expression groups when replacing. In python such a group returns None when matching, this is fine. But when replacing, this unmatched group will produce an error, rather than simple inserting the empty string. For example:

>>> re.sub('(ab)|(a)', r'\1\2', 'abc')
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 275, in filter
    return sre_parse.expand_template(template, match)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

There are plenty of people with this problem on the interwebs, and even a python bug report – most “solutions” involves re-writing your expression to make the unmatched group match empty string. Unfortunately, my input expression comes from the sparql 11 compliance tests and as much as I’d like I’m not really free to change it. So, it gets ugly:

And it works, at least in my python 2.7.3 …

RDFLib & Linked Open Data on the Appengine

Recently I’ve had the chance to use RDFLib a fair bit at work, and I’ve fixed lots of bugs and also written a few new bits. The new bits generally started as write-once and forget things, which I then needed again and again and I kept making them more general. The end result (for now) is two scripts that let you go from this CSV file to this webapp (via this N3 file). Actually – it’ll let you go from any CSV file to a Linked Open Data webapp, the app does content-negotiation and SPARQL as well as the HTML you just saw when you clicked on the link.
In the court

The dataset in this case, is a small collection of King Crimson albums – I spent a long time looking for some CSV data in the wild that had the features I wanted to show off, but failed, and copy/pasted this together from the completely broken CSV dump of the Freebase page.

To convert the CSV file you need a config file giving URI prefixes and some details on how to handle the different columns. The config file for the King Crimson albums looks like:

[csv2rdf]

class=http://example.org/schema/Album
base=http://example.org/resource/
propbase=http://example.org/schema/
defineclass=True
ident=(0,)
label=(0,)

out=kingcrimson.n3

col1=date("%d %B %Y")
col2=split(";", uri("http://example.org/resource/label/", "http://example.org/schema/Genre"))
col3=split("/")
col4=split(";")
col5=int(replace("min",""))

prop3=http://myotherschema.org/label

With this config file and the current HEAD of rdfextras you can run:

python -m rdfextras.tools.csv2rdf -f kingcrimson.config kingcrimson.csv

and get your RDF.

This tool is of course not the first or only of it’s kind – but it’s mine! You may also want to try Google Refine, which has much more powerful (and interactive!) editing possibilities than my hack. With the RDF extension, you can even export RDF directly.
One benefit of this script is that it’s stream-based and could be used on very large CSV files. Although, I believe Google Refine can also export actions taken in some form of batch script, but I never tried it.

With lots of shiny new RDF in my hand I wanted to make it accessible to people who do not enjoy looking at N3 in a text-editor and built the LOD application.
It’s built on the excellent Flask micro-web-framework and it’s now also part of rdfextras . If you have the newest version you can run it locally in Flask’s debug server like this:

python -m rdfextras.web.lod kingcrimson.n3

This runs great locally – and I’ve also deployed it within Apache, but not everyone has a mod_python ready Apache at hand, so I thought it would be nice to run it inside the Google Appengine.

Running the Flask app inside of appengine turned out to be amazingly easy, thanks to Francisco Souza for the pointers:

# main.py
from google.appengine.ext.webapp.util import run_wsgi_app
from rdfextras.web import lod

import rdflib
g=rdflib.Graph()
g.load("kingcrimson.n3", format='n3')

run_wsgi_app(lod.get(g))

Write your app.yaml and make this your handler for /* and you’re nearly good to go. To deploy this app to the appengine you also need all required libraries (rdflib, flask, etc.) inside your app directory, a shell script for this is here: install-deps.sh

Now, I am not really clear on the details on how the appengine works. Is this code run for every request? Or is the wsgi app persistent? When I deployed the LOD app inside apache using mod_python, it seems the app is created once, and server many requests over it’s lifetime.
In any case, RDFLib has no appengine compatible persistent store (who wants to write an rdflib store on top of the appengine datastore?), so the graph is kept in memory, perhaps it is re-parsed once for each request, perhaps not – this limits the scalability of this approach in any case. I also do not know the memory limitations of the appengine – or how efficient the rdflib in-memory store really is – but I assume there is a fairly low limit on the number of triple you can server this way. Inside apache I’ve deployed it on some hundred thousand triples in a BerkleyDB store.

There are several things that could be improved everywhere here – the LOD app in particular has some rough edges and bugs, but it’s being used internally in our project, so we might fix some of them given time. The CSV converter really needs a way to merge two columns, not just split them.

All the files you need to run this example yourself are under: http://gromgull.net/2011/08/rdflibLOD/ – let me know if you try it and if it works or breaks!

A quick and dirty guide to YOUR first time with RDF

(This is written for the challenge from http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/)

(To save you copy/pasting/typing you can download the examples from here: http://gromgull.net/2011/01/firstRDF/)

10 steps to make sense of RDF data:

  1. Install a debian or Ubuntu based system — I used Debian testing.
  2. Install rdflib and Berkely/Sleepycat DB by doing:
    sudo apt-get install python-rdflib python-bsddb3
    

    (I got rdflib version 2.4.2 – if you get version 3.X.X the code may look slightly different, let me know if you cannot work out the changes on your own)

  3. Find some data — I randomly picked the data behind the BIS Research Funding Explorer. You can find the raw RDF data on the source.data.gov.uk/data/ server. We will use the schema file from:
    http://source.data.gov.uk/data/research/bis-research-explorer/2010-03-04/research-schema.rdf

    and the education data from:

    http://source.data.gov.uk/data/education/bis-research-explorer/2010-03-04/education.data.gov.uk.nt

    We use the education data because it is smaller than the research data, only 500K vs 11M, and because there is a syntax error in the corresponding file for research :). In the same folders there are files called blahblah-void. These are statistics about the datasets, and we do not need them for this (see http://vocab.deri.ie/void/ for details).

  4. Load the data, type this into a python shell, or create a python file and run it:
    import rdflib
    
    g=rdflib.Graph('Sleepycat')
    g.open("db")
    
    g.load("http://source.data.gov.uk/data/education/bis-research-explorer/2010-03-04/education.data.gov.uk.nt", format='nt')
    g.load("http://source.data.gov.uk/data/research/bis-research-explorer/2010-03-04/research-schema.rdf")
    
    g.close()
    

    Note that the two files are in different RDF formats, both contain triples, but one is serialized as XML, the other in a ascii line-based format called N-Triples.You do not have to care about this, just tell rdflib to use the right parser with the format=X parameter, RDF/XML is the default.

  5. After the script has run there will be a new folder called db in the current directory, it contains the berkeley data-base files and indexes for the data. For the above example it’s about 1.5M
  6. Explore the data a bit, again type this into a python shell:
    • First open the DB again:
    • import rdflib
      g=rdflib.Graph('Sleepycat')
      g.open("db")
      len(g)
      
      -- Outputs: 3690 --
      

      The graph object is quite pythonic, and you can treat it like a collection of triples. Here len tells us we have loaded 3690 triples.

    • Find out what sorts of things this data describes. Things are typed by a triple with rdf:type as the predicate in RDF.
      for x in set(g.objects(None, rdflib.RDF.RDFNS["type"])): print x
      
      -- Outputs:
      
      http://www.w3.org/2002/07/owl#ObjectProperty
      
      
      http://www.w3.org/2002/07/owl#DatatypeProperty
      
      
      http://xmlns.com/foaf/0.1/Organization
      
      
      http://purl.org/vocab/aiiso/schema#Institution
      
      
      http://research.data.gov.uk/def/project/Location
      
      
      http://www.w3.org/1999/02/22-rdf-syntax-ns#Property
      
      
      http://www.w3.org/2000/01/rdf-schema#Class
      
      --
      

      rdflib gives you several handy functions that return python generators for doing simple triple based queries, here we used graph.objects, taking two parameters, the subjects and predicates to filter for, and returns a generator over all objects matching. rdflib also provides constants for the well-known RDF and RDFSchema vocabularies, we used this here to get the correct URI for the rdf:type predicate.

    • Now we know the data contains some Institutions, get a list using another rdflib triple-based query:
      for x in set(g.subjects(rdflib.RDF.RDFNS["type"], rdflib.URIRef('http://purl.org/vocab/aiiso/schema#Institution'))): print x
      
      -- Outputs:
      
      http://education.data.gov.uk/id/institution/UniversityOfWolverhampton
      
      
      http://education.data.gov.uk/id/institution/H-0081
      
      
      http://education.data.gov.uk/id/institution/H-0080
      
      ... (and many more) ...
      --
      

      This gives us a long list of all institutions. The set call here just iterates through the generator and removes duplicates.

    • Lets look at the triples about one in more detail:
      for t in g.triples((rdflib.URIRef('http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon'), None, None)): print map(str,t)
      -- Outputs:
      ['http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon', 'http://research.data.gov.uk/def/project/location', 'http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon/WC1E6BT']
      ['http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon', 'http://research.data.gov.uk/def/project/organisationName', 'University College London']
      ... (and many more) ...
      --
      

      This gives us a list of triples asserted about UCL, here we used the triples method of rdflib, it takes a single argument, a tuple representing the triple filters. The returned triples are also tuples, the map(str,t) just makes the output prettier.


  7. rdflib makes it very easy to work with triple based queries, but for more complex queries you quickly need SPARQL, this is also straight forward:
    PREFIX="""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX p: <http://research.data.gov.uk/def/project/>
    PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    """
    
    list(g.query(PREFIX+"SELECT ?x ?label WHERE { ?x rdfs:label ?label ; a aiiso:Institution . } "))[:10]
    

    The prefixes defined here at the start lets us use short names instead of full URIs in the queries. The graph.query method returns a generator over tuples of variables bindings. This lists the first 10 – this is more or less the same as we did before, list all institutions, but this time also get the human readable label.

  8. Now a slightly more complicated example. Ask the knowledge base to find all institutions classified as public sector that took part in some project together:

     
    
    r=list(g.query(PREFIX+"""SELECT DISTINCT ?x ?xlabel ?y ?ylabel WHERE { 
       ?x rdfs:label ?xlabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p . 
    
       ?y rdfs:label ?ylabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p .
    
       FILTER (?x != ?y) } LIMIT 10 """))
    
    for x in r[:3]: print map(str,x)
    
    -- Outputs:
    ['http://education.data.gov.uk/id/institution/H-0155', 'Nottingham University', 'http://education.data.gov.uk/id/institution/H-0159', 'The University of Sheffield']
    ['http://education.data.gov.uk/id/institution/H-0155', 'University of Nottingham', 'http://education.data.gov.uk/id/institution/H-0159', 'University of Sheffield']
    ['http://education.data.gov.uk/id/institution/H-0159', 'Sheffield University', 'http://education.data.gov.uk/id/institution/H-0155', 'University of Nottingham']
    --
    

    All fairly straight forward, the FILTER is there to make sure the two institutions we find are not the same.
    (Disclaimer: there is a bug in rdflib (http://code.google.com/p/rdfextras/issues/detail?id=2) that makes this query take very long :( – it should be near instantaneous, but takes maybe 10 seconds for me. )

  9. The data we loaded so far do not have any details on the project that actually got funded, only the URI, for example: http://research.data.gov.uk/doc/project/tsb/100232. You can go there with your browser and find out that this is a project called “Nuclear transfer enhancement technology for bio processing and tissue engineering” – luckily so can rdflib, just call graph.load on the URI. Content-negotiation on the server will make sure that rdflib gets machine readable RDF when it asks. A for-loop over a rdflib triple query and loading all the project descriptions is left as an exercise to the reader :)
  10. That’s it! There are many places to go from here, just keep trying things out – if you get stuck try asking questions on http://www.semanticoverflow.com/ or in the IRC chatroom at irc://irc.freenode.net:6667/swig. Have fun!

Illustrating the kernel trick

For a one paragraph intro to SVMs and the kernel-trick I wanted a a graphic that I’ve seen in a book (although forgotten where, perhaps in Pattern Classification?):

Simple idea — show some 2D data points that are not linearly separable, then transform to 3D somehow, and show that they are. I found nothing on google (at least nothing that was high enough resolution to reuse, so I wrote some lines of python with pylab and matplotlib:

import math
import pylab
import scipy

def vlen(v):
return math.sqrt(scipy.vdot(v,v))

p=scipy.randn(100,2)

a=scipy.array([x for x in p if vlen(x)>1.3 and vlen(x)<2])
b=scipy.array([x for x in p if vlen(x)<0.8])

pylab.scatter(a[:,0], a[:,1], s=30, c="blue")
pylab.scatter(b[:,0], b[:,1], s=50, c="red", marker='s')

pylab.savefig("linear.png")

fig = pylab.figure()
from mpl_toolkits.mplot3d import Axes3D
ax = Axes3D(fig)
ax.view_init(30,-110)

ax.scatter3D(map(vlen,a), a[:,0], a[:,1], s=30, c="blue")
ax.scatter3D(map(vlen,b), b[:,0], b[:,1], s=50, marker="s", c="red")

pylab.savefig("tranformed.png")

pylab.show()

Take — adapt — use for anything you like, you can rotate the 3D plot in the window that is shown and you can save the figures as PDF etc. Unfortunately, the sizing of markers in the 3d plot is not yet implemented in the latest matplotlib (0.99.1.2-3), so this only looks good with the latest SVN build.

I’ll trie in python

In Koble the auto-completion of thing-names used for wiki-editting, instant-search and adding relations  is getting slower and slower, mainly because I do:

result=[]
things=listAllThings()
for t in things:
   if t.startswith(key): res.append(t)
for t in things:
   if key in t: res.append(t)

Going through the list twice makes sure I get all things that match well first (i.e. the start with the string I complete for), and then things matching less well later (they only contain the string).

Of course the world has made up a far better data-structure for indexing prefix’es of string, namely the trie, or prefix tree. James Tauber had already implemented one in python, and kindly made it available. His version didn’t do everything I needed, so I added a few methods. Here is my updated version:

http://gromgull.net/2009/11/trie.py

Enjoy!

Heat-maps of Semantic Web Predicate usage

It’s all Cygri‘s fault — he encouraged me to add schema namespaces to the general areas on the semantic web cluster-tree. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full day later I have:

FOAF Predicates on the Semantic Web

It’s the same map as last time, laid using graphviz’s neato as before. The heat-map of the properties was computed from the feature-vector of predicate counts, first I mapped all predicates to their “namespace”, by the slightly-dodgy-but-good-enough heuristic of taking the part of the URI before the last # or / character. Then I split the map into a grid of NxN points (I think I used N=30 in the end), and compute a new feature vector for each point. This vector is the sum of the mapped vector for each of the domains, divided by the distance. I.e. (if you prefer math) each point’s vector becomes:

\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y),  pos_d)}}

Where D is the distance (here simple 2d euclidean), d is each domain, pos_d is that domains position in the figure and V_d is that domains feature vector. Normally it would be more natural to decrease the effect by the squared distance, but this gave less attractive results, and I ended up square-rooting it instead. The color is now simply on column of the resulting matrix normalised and mapped to a nice pylab colormap.

Now this was the fun and interesting part, and it took maybe 1 hour. As predicted. NOW, getting this plotted along with the nodes from the graph turned out to be a nightmare. Neato gave me the coordinates for the nodes, but would change them slightly when rendering to PNGs. Many hours of frustration later I ended up drawing all of it again with pylab, which worked really well. I would publish the code for this, but it’s so messy it makes grown men cry.

NOW I am off to analyse the result of the top-level domain interlinking on the billion triple data. The data-collection just finished running while I did this. … As he said.

Visualising predicate usage on the Semantic Web

So, not quite a billion triple challenge post, but the data is the same.  I had the idea that I compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates is used within each one. Then once I had the distance-metric, I could use FastMap to visualise it. It would be a quick hack, it would look smooth and great and be fun. In the end, many hours later, it wasn’t quick, the visual is not smooth (i.e. it doesn’t move) and I don’t know if it looks so great. It was fun though. Just go there and look at it:

PayLevelDomains cluster-tree

As you can see it’s a large PNG with the new-and-exciting ImageMap technology used to position the info-popup, or rather used to activate the JavaScript used for the popups. I tried at first with SVG, but I couldn’t get SVG and XHTML and Javascript to play along, I guess in Firefox 5 it will work. The graph is laid out and generated Graphviz‘s neato, which also generated the imagemap.

So what do we actually see here? In short, a tree where domains that publish similar Semantic Web data are close to each other in the tree and have similar colours. In detail: I took the all PLDs that contained over 1,000 triples, this is around 7500, and counted the number of triples for each of the 500 most frequent predicates in the dataset. (These 500 predicates cover ≈94% of the data). This gave me a vector-space with 500 features for each of the PLDs, i.e. something like this:

geonames:nearbyFeature dbprop:redirect foaf:knows
dbpedia.org 0.01 0.8 0.1
livejournal.org 0 0 0.9
geonames.org 0.75 0 0.1

Each value is the percentage of triples from this PLD that used this predicate. In this vector space I used the cosine-similarity to compute a distance matrix for all PLDs. With this distance matrix I thought I could apply FastMap, but it worked really badly and looked like this:

Fastmapping the PLDs

So instead of FastMap I used maketree from the complearn tools, this generates trees from a distance matrix, it generates very good results, but it is an iterative optimisation and it takes forever for large instances. Around this time I realised I wasn’t going to be able to visualise all 7500 PLDs, and cut it down to the 2000, 1000, 500, 100, 50 largest PLDs. Now this worked fine, but the result looked like a bog-standard graphviz graph, and it wasn’t very exciting (i.e not at all like this colourful thing). Now I realised that since I actually had numeric feature vectors in the first place I wasn’t restrained to using FastMap to make up coordinates, and I used PCA to map the input vector-space to a 3-dimensional space, normalised the values to [0;255] and used these as RGB values for colour. Ah – lovely pastel.

I think I underestimated the time this would take by at least a factor of 20. Oh well. Time for lunch.

FastMap in Python

This shows some strings positioned on a graph according to their Levenshtein string distance. (And let me note how appropriate it is that a man named Levenshtein should make a string distance algorithm)

The mapping from the distance matrix given by the string distance is then mapped to two dimensions using the FastMap algorithm by Christos Faloutsos and King-Ip Lin. This mapping can be also done with Multidimensional Scaling, (and I did so in my PhD work, I even claimed it was novel, but actually I was 40 years too late, oh well), but that algorithm is nasty and iterative. FastMap, as the name implies, is much faster, it doesn’t actually even need a full distance matrix (I think). It doesn’t always find the best solution, and it has a slight random element to it, so the solution might also vary each time it’s run, but it’s good enough for almost all cases. Again, I’ve implemented it in python – grab it here: fastmap.py

To get a feel for how it works, download the code, remove George Leary and see how it reverts to only one dimension for a sensible mapping.

The algorithm is straight-forward enough, for each dimension you want, repeat:

  1. use a heuristic to find the two most distant points
  2. the line between these two points becomes the first dimension, project all points to this line
  3. recurse :)

The distance measure used keeps the projection “in mind”, so the second iteration will be different to the first.The whole thing is a bit like Principal Component Analysis, but without requiring an original matrix of feature vectors.

This is already quite old, from 1995, and I am sure something better exists now, but it’s a nice little thing to have in the toolbox. I wonder if it can be used to estimate numerical feature values for nominal attributes, in cases where all possible values are known?

Online Clustering in Python

A little while ago Stefano pointed me to Online Learning in Clojure, a really simple implementation of an the Pegasos online classification algorithm in the lisp-like (but compile to java bytecode) language Clojure. I implemented it myself in python, and it really is very simple. I’ll put it here some day when I have cleaned it up a bit.

By online in this context I mean an algorithm that does not keep all the examples in memory at the same time, but processes them one by one, keeping only aggregates or similar. Ideally it also does this in a single pass, i.e. you just look once at each example, the theory then goes that if you have enough examples they will still allow you to identify each class well. For linear binary classification problems this amazingly well, Pegasos for instance can train a binary-class SVM on 780,000 examples in less than 2 seconds. This is achieved by doing almost nothing for each example, the only operation is updating a weight-vector with the weighted example vector, where the weight depends on the class predicted by the weight vector at that point. The paper also includes pointers to Online Learning with Kernels, which describes how the same method can be adapted to work with kernel-methods that allow non-linear classification. I implemented this as well, and it makes the whole thing several times slower, but it’s still amazingly fast.

Now — this was during my holidays and I started wondering if you could also do clustering online. I’d had enough of reading papers at this point, so I just tried hacking something in python. It worked fairly well, but I had some problems when it came to merging clusters. Now a few weeks later I return to this, but this time I read the paper before. It turns out a very simple algorithm was published in 99: An on-Line Agglomerative Clustering Method for Non-Stationary Data, it was very similar to what I had pulled out of my made up, but where I had tried to find a complicated solution for deciding when to merge clusters they had a very simple solution. And it worked quite well, as long as the data was not too noisy. Luckily, a bit later, someone wrote a paper with the catchy title: Improving the Robustness of ‘Online Agglomerative Clustering Method’ Based on Kernel-Induce Distance Measures. They use kernel-functions for distance measures and a clever weighting to iron out the noise in the data. (Although the cluster-prototypes are still kept in normal space, so it’s not really a kernel-method?). NOW, (slowly getting to the point) I also implemented this in Python, and it rocks:

The algorithm has two parameters, the σ parameter for gaussian kernel used, and K, the number of cluster prototypes to maintain. The number of clusters is found automatically, although it cannot find more than K-1. The runtime is otherwise linear wrt. to the number of data-points, but scaled by K i.e. roughly:

O(KN)

This all got a bit detailed, sorry. I really just wanted to hit the youtube embed button and go. At least it’s a fun video.

Update:

Ah – in a bit of forget-the-attachment-in-a-”here-is-the-attachment”-email, I appear to have forgotten the python code :)

Grab it at: OnlineCluster.py, it is fairly straight-forward. It requires scipy, and will plot it’s results if you have pylab installed. (The video above was also generated using pylab). The only bit where readability suffered was using the heapq module for finding the clusters to merge. I’ve not used the module before, and it took a long goddamn while to figure out that since I wanted tuples in my queue and they compare funnily, I had to introduce an extra object with a __cmp__ method. The instances are represented as scipy arrays (i.e. vectors of floats), the clusterer will happily let you grow the dimsions as you go along.

Enjoy!

Integrating bibsonomy into Koble

A few years back I created koble – and used it on and off for my own notes ever since. (if you didn’t hear of it before, never mind – it’s essentially a semantic wiki)

So — Koble lets me create a page for a concept and then associate links with this page, in fact, it even has a bookmarklet for easily creating new things from pages or associating page to existing things. Sounds like anything you know? A bit like tagging with delicious or bibsonomy? Oh yes. So in the off periods of not using Koble I used bibsonomy for all my links, and I definitely use bibsonomy for keeping track of publications that I read skim the abstract and conclusion of. However, when it comes to writing notes about what I read I need the wiki. Sigh. So I bit the bullet, picked up the python showel and dug into the bibsonomy API, and a few hours later I have:

At first I worried about identity — what if the OnlineLearning of Koble is no the same as the OnlineLearning tag on bibsonomy? What if I have SemanticWeb vs. semantic-web? Maybe I should let users (read “myself”) selectively associated different things with different tags?

Then I learned to stop worrying and just get on with my life and now I show any links/publications where the tags matched if they are the same once all non-letters have been removed and the rest converted to lower-case.

Now I’ve fixed my tool — now nothing can stop me from actually doing some work.