RDFLib & Linked Open Data on the Appengine

Recently I’ve had the chance to use RDFLib a fair bit at work, and I’ve fixed lots of bugs and also written a few new bits. The new bits generally started as write-once and forget things, which I then needed again and again and I kept making them more general. The end result (for now) is two scripts that let you go from this CSV file to this webapp (via this N3 file). Actually – it’ll let you go from any CSV file to a Linked Open Data webapp, the app does content-negotiation and SPARQL as well as the HTML you just saw when you clicked on the link.
In the court

The dataset in this case, is a small collection of King Crimson albums – I spent a long time looking for some CSV data in the wild that had the features I wanted to show off, but failed, and copy/pasted this together from the completely broken CSV dump of the Freebase page.

To convert the CSV file you need a config file giving URI prefixes and some details on how to handle the different columns. The config file for the King Crimson albums looks like:

[csv2rdf]

class=http://example.org/schema/Album
base=http://example.org/resource/
propbase=http://example.org/schema/
defineclass=True
ident=(0,)
label=(0,)

out=kingcrimson.n3

col1=date("%d %B %Y")
col2=split(";", uri("http://example.org/resource/label/", "http://example.org/schema/Genre"))
col3=split("/")
col4=split(";")
col5=int(replace("min",""))

prop3=http://myotherschema.org/label

With this config file and the current HEAD of rdfextras you can run:

python -m rdfextras.tools.csv2rdf -f kingcrimson.config kingcrimson.csv

and get your RDF.

This tool is of course not the first or only of it’s kind – but it’s mine! You may also want to try Google Refine, which has much more powerful (and interactive!) editing possibilities than my hack. With the RDF extension, you can even export RDF directly.
One benefit of this script is that it’s stream-based and could be used on very large CSV files. Although, I believe Google Refine can also export actions taken in some form of batch script, but I never tried it.

With lots of shiny new RDF in my hand I wanted to make it accessible to people who do not enjoy looking at N3 in a text-editor and built the LOD application.
It’s built on the excellent Flask micro-web-framework and it’s now also part of rdfextras . If you have the newest version you can run it locally in Flask’s debug server like this:

python -m rdfextras.web.lod kingcrimson.n3

This runs great locally – and I’ve also deployed it within Apache, but not everyone has a mod_python ready Apache at hand, so I thought it would be nice to run it inside the Google Appengine.

Running the Flask app inside of appengine turned out to be amazingly easy, thanks to Francisco Souza for the pointers:

# main.py
from google.appengine.ext.webapp.util import run_wsgi_app
from rdfextras.web import lod

import rdflib
g=rdflib.Graph()
g.load("kingcrimson.n3", format='n3')

run_wsgi_app(lod.get(g))

Write your app.yaml and make this your handler for /* and you’re nearly good to go. To deploy this app to the appengine you also need all required libraries (rdflib, flask, etc.) inside your app directory, a shell script for this is here: install-deps.sh

Now, I am not really clear on the details on how the appengine works. Is this code run for every request? Or is the wsgi app persistent? When I deployed the LOD app inside apache using mod_python, it seems the app is created once, and server many requests over it’s lifetime.
In any case, RDFLib has no appengine compatible persistent store (who wants to write an rdflib store on top of the appengine datastore?), so the graph is kept in memory, perhaps it is re-parsed once for each request, perhaps not – this limits the scalability of this approach in any case. I also do not know the memory limitations of the appengine – or how efficient the rdflib in-memory store really is – but I assume there is a fairly low limit on the number of triple you can server this way. Inside apache I’ve deployed it on some hundred thousand triples in a BerkleyDB store.

There are several things that could be improved everywhere here – the LOD app in particular has some rough edges and bugs, but it’s being used internally in our project, so we might fix some of them given time. The CSV converter really needs a way to merge two columns, not just split them.

All the files you need to run this example yourself are under: http://gromgull.net/2011/08/rdflibLOD/ – let me know if you try it and if it works or breaks!

 

Post a comment.