Posts by gromgull.

Multithreaded SciPy/NumPy with OpenBLAS on debian

Some months ago, just after I got an 8-core CPU, I wasted a weekend trying to get SciPy/NumPy to build against OpenBLAS. OpenBLAS is neat, as it has built-in and automatic support for multi-threading; for things like computing the dot product of large matrices this can really be a time saver.

I was using roughly these instructions, but it was too complicated and I got nowhere and gave up.

Then I got a new MacBook, and set it up using homebrew rather than macports, and I noticed that NumPy was built against OpenBLAS by default. Now, it would be a shame to let debian be any worse off…

Luckily, it was much easier than I thought:

1. Install the openblas and lapack libraries and dev-headers

sudo apt-get install libopenblas-dev liblapack-dev

2. Setup a virtualenv

To make sure we do not mess up the whole system, set up a virtualenv (if you ever install more than 3 Python packages and do not yet know about virtualenv, you really should get to know it, it’s a little piece of magic!):

virtualenv env
source env/bin/activate

3. Install NumPy

In Debian, openblas/lapack fit into the alternatives system, and the implementation you choose gets symlinked into /usr/lib. However, this confuses NumPy, and you must point it to the real location, i.e. /usr/lib/openblas-base.
Download and unpack NumPy:

mkdir env/download
pip install -d env/download numpy
mkdir env/build
cd env/build
tar xf ../download/numpy-1.7.1.tar.gz

Now create a site.cfg file with the following content:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

Build/install NumPy:

python setup.py install

You can now check the file env/lib/python2.7/site-packages/numpy/__config__.py to make sure it found the right libs, mine looks like this:

lapack_info={'libraries': ['lapack'], 
    'library_dirs': ['/usr/lib'], 'language': 'f77'}
atlas_threads_info={'libraries': ['openblas'], 
    'library_dirs': ['/usr/local/lib'], 
    'language': 'c', 
    'define_macros': [('ATLAS_WITHOUT_LAPACK', None)], 
    'include_dirs': ['/usr/local/include']}
blas_opt_info={'libraries': ['openblas'], 
    'library_dirs': ['/usr/local/lib'], 
    'language': 'c', 
    'define_macros': [('ATLAS_INFO', '"\\"None\\""')], 
    'include_dirs': ['/usr/local/include']}

4. Install SciPy

If NumPy installs cleanly, SciPy can simply be installed with pip:

pip install scipy

5. Test!

Using these scripts you can test your NumPy and SciPy installation. By activating/deactivating the virtualenv, you can test with and without OpenBLAS. If you have several CPU cores, you should see that with OpenBLAS up to 4 of them are used.
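The test scripts are linked rather than reproduced here, but a minimal sketch of that kind of benchmark might look like this (the matrix size and the exact operations are my choice for illustration, not necessarily what the linked scripts use):

import time
import numpy as np
from scipy import linalg

N = 1500
A = np.random.random((N, N))
B = np.random.random((N, N))
SPD = np.dot(A, A.T) + N * np.eye(N)   # symmetric positive definite, for cholesky

def timed(f):
    t0 = time.time()
    f()
    return time.time() - t0

def bench(label, f, repeat=3):
    # best-of-three wall-clock timing
    best = min(timed(f) for _ in range(repeat))
    print("%s: %s sec" % (label, best))

bench("dot", lambda: np.dot(A, B))
bench("cholesky", lambda: linalg.cholesky(SPD))
bench("svd", lambda: linalg.svd(A))

Run it once inside the virtualenv and once outside to compare; with OpenBLAS you should see several cores busy in top while it runs.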

Without OpenBLAS I get:

NumPy 
dot: 0.901498508453 sec

SciPy
cholesky: 0.11981959343 sec
svd: 3.64697360992 sec

with OpenBLAS:

NumPy
dot: 0.0569217920303 sec

SciPy
cholesky: 0.0204758167267 sec
svd: 0.81153883934 sec

On finding duplicate images

I got a new shiny MacBook in my new job at Bakken & Baeck, and figured it was time for a new start, so I am de-commissioning my old MacBook and, with it, the profile and files that are so old they used to live on a PowerBook. Most things were easy, until I got to the photos. Over the years I have imported photos to the laptop while travelling, but always tried to import them again to my real, backed-up photo archive at home when I got there – unless my SD card was full while travelling, or I forgot, or something else went wrong. That means I am fairly sure MOST of the photos on the laptop are also in my archive, but also fairly sure some are not.
And of course, each photo, even an out-of-focus, under-exposed test-shot, is a little piece of personal memory, a beautiful little diamond of DATA, and must at all cost NOT BE LOST.

The photos are mostly in iPhoto (but not all), in a mix of old-style mixed up in iPhotos own folder structures, and in folders I have named.

“Easy” I thought: I trust the computer, I normally use Picasa, and it will detect duplicates when importing! Using


find . -iname \*.jpg -print0 | xargs -0 -I{} cp -v --backup=t {} /disks/1tb/tmp/photos/

I can copy all JPGs from the Pictures folder into one big folder without overwriting files with duplicate names (I ❤ coreutils?), then let Picasa sort it out for me.

Easily done: 7500 photos in one folder. Picasa thinks a bit and detects some duplicates, but not nearly enough. Several photos I KNOW are archived are not flagged as dupes. I give up trusting Picasa. I know who I can trust:

md5sum!

(In retrospect, I should have trusted Picasa a bit, and at least removed the ones it DID claim were duplicates)

So, next step: compute the md5sum of all 7500 new photos and of all 45,000 already archived photos. Write the shell script, go to work, return, write the Python to find all duplicates, delete the ones from the laptop.
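The shell script and Python are not shown in the post, but a rough sketch of the idea – hashing directly in Python rather than calling md5sum, with made-up paths – would be:

import hashlib
import os

def md5(path, chunksize=1 << 20):
    # hash the file in chunks, so big photos never have to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunksize), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(root):
    # map md5 digest -> list of files under root with that content
    hashes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".jpg"):
                path = os.path.join(dirpath, name)
                hashes.setdefault(md5(path), []).append(path)
    return hashes

archive = hash_tree("/disks/1tb/photo-archive")  # hypothetical archive location
laptop = hash_tree("/disks/1tb/tmp/photos")

dupes = [p for digest, paths in laptop.items() if digest in archive for p in paths]
print("%d exact duplicates" % len(dupes))
# ... and os.remove() each of them, once you trust the list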

Success! 3700 duplicates gone! But wait! There are still many photos I know for a fact are duplicates. I pick one at random and inspect it. It IS the same photo, but one JPG is 3008×2008 and the other is 3040×2024, and the white-balance is very slightly different. Now I understand: back when I had more time, I shot exclusively in RAW, and these are two JPGs produced from the same RAW file, one by iPhoto, one by UFRaw; the iPhoto one is slightly smaller and has worse color. Bah.

Now it’s getting late, my Friday night is slipping away between my fingers, but I am damned if I give up now. Next step: EXIF data! Both files have EXIF intact, and both are (surprise!) taken at the same time. Now, I don’t want to go and look up the EXIF tags of all 45,000 archived photos just now, but I can filter by filename: if two files have the same basename (IMGP1234) AND are taken at the same time, I am willing to risk deleting one of them.

So with the help of the EXIF parsing library from https://github.com/ianare/exif-py and a bit of python:
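The snippet itself did not survive in this copy, but a sketch of the approach described (same basename plus same EXIF timestamp means duplicate), using the modern exifread packaging of that library and hypothetical paths, could look like:

import os
import exifread  # https://github.com/ianare/exif-py

def taken_at(path):
    # read just the EXIF headers, not the whole image
    with open(path, "rb") as f:
        tags = exifread.process_file(f, details=False)
    tag = tags.get("EXIF DateTimeOriginal")
    return str(tag) if tag else None

def key(path):
    # e.g. ("IMGP1234", "2009:07:12 14:03:22")
    base = os.path.splitext(os.path.basename(path))[0].upper()
    return base, taken_at(path)

def jpgs(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".jpg"):
                yield os.path.join(dirpath, name)

archive_keys = set(key(p) for p in jpgs("/disks/1tb/photo-archive"))  # hypothetical paths

for path in jpgs("/disks/1tb/tmp/photos"):
    base, stamp = key(path)
    if stamp is not None and (base, stamp) in archive_keys:
        print("duplicate: %s" % path)  # os.remove(path) once you are feeling brave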

it is done! Some ~3500 more duplicates removed!

I am left with 202 photos that may actually be SAVED FROM ETERNAL OBLIVION! (Looking more carefully, about half are actually out-of-focus or test shots, or nonsense I probably DID copy to the archive, but then deleted.) It was certainly worth spending the entire evening 3 days in a row on this!

Now I can go back to hating hotmail for having deleted all the emails I received pre 2001…

Google Reader and the death of the open web

Well fuck. Google kills google reader.

Some random points:

This is upsetting because I use the damn thing a lot, as @bobdc says:

At some point RSS became synonymous with Google Reader, and now Google tells me that although it seems to me like “everyone” uses Reader, this is because I have geek friends, and “normal” people have no idea Google Reader exists. It is safe to say that these normal people have even less idea that RSS/Atom exists and that there are choices outside of Google Reader. If not the last nail in the coffin for RSS, it is certainly the BIG nail that made it impossible to pry open the coffin again. (It was already clear that RSS was geeky-poweruser only when Twitter killed the Atom feeds last year.) Even if there now may be a market for NEW RSS readers, and maybe even some innovation in that space, it won’t matter any more, since people will just stop publishing RSS feeds.

RSS was probably the last standards-driven eco-system where real integration of stuff happened in ways people probably didn’t foresee. I am sad to say so, but I cannot see FOAF pop up to kill the big social networks any time soon. OpenID became “sign in with Google / Facebook / Yahoo”, and the places where I can type in my OWN OpenID URL are again only hard-core tech places. OAuth2 is a mess where there is a Google version and a Facebook version, and interop is on paper only. We lost the open web and now we have APIs to nice, corporate, sanitized social networks.

As web-archaeologists well know, RSS was initially RDF-based (RDF Site Summary, before it became Really Simple Syndication). Now, look at your Facebook feed: there is clearly some sort of “ontology” here. There are basic updates, essentially just some text, but then there are updates with pictures, updates with youtube videos. There are updates where actions are possible: invites to parties, invites to terrible games, invites to share-your-birthday! apps, etc. Wouldn’t it have been cool if we had kept an RDF-extensible version of RSS? Then I could have published my extensions to RSS items somewhere; if your client didn’t support it, it would fall back on default rendering, giving you just a URL, but if you had the right widget, you would get a richer representation of the thing (and these days maybe I could even publish it with some JavaScript snippet to render the widget, like Twitter does with the twitter widget I embedded above) … old semantic web dreams never die – they just get covered in a layer of cynicism!

32bit firefox/thunderbird on debian amd64

New computer at work last week, now with 16gb of RAM I totally do not need, but with that much it was clear that running a 32bit Linux was no longer an option.
So Debian amd64 was installed. Now, I’ve looked at the thunderbird/firefox icons for so long that I cannot live with the iceweasel/icedove branding. Only the Firefox nightly exists as a 64bit build, where again the branding is different, so installing the 32bit build was necessary.

This is really just a problem that arises because of my own stubborn refusal to change my ways: if I ran the firefox nightly, lived with iceweasel, or simply ran ubuntu, it would all be fine.
I’ve pieced together this information twice now, so it’s time to write it down. Also, maybe someone else has exactly the same weird “legacy” problems I do.

Run all commands below as root.

First, set up multiarch. Beware: when I first did this about a year ago, it messed up conflict resolution in aptitude and I had to fall back on plain apt-get; I hear it might work fine now.

 
dpkg --add-architecture i386
apt-get update

Then install the basic libraries firefox/thunderbird needs (use ldd on the firefox binaries/libraries to find this list):

apt-get install libgtk2.0-0:i386 libatk1.0-0:i386 libgdk-pixbuf2.0-0:i386 \
     libldap-2.4-2:i386 libdbus-glib-1-2:i386 libpango1.0-0:i386 libglib2.0-0:i386

I use the greybird theme – to make thunderbird/firefox look the part, install the 32bit version of the gtk-engines used:

apt-get install gtk2-engines-murrine:i386 gtk2-engines-pixbuf:i386

Then install firefox/thunderbird normally – if you keep the folder owned/writable by your user they will auto-update fine.

python, regex, unicode and brokenness

(This post included a complaint about handling of unicode codepoint >0xffff in python, including a literal such character, and it broke WordPress, which ate the remainder of the post after that character… and I am too lazy to retype it, so for now, no unicode)

I love python, I really do, but some things are … slightly irregular.

One of those things is the handling of unmatched regular expression groups when replacing. In Python such a group returns None when matching, which is fine. But when replacing, an unmatched group will produce an error, rather than simply inserting the empty string. For example:

>>> re.sub('(ab)|(a)', r'\1\2', 'abc')
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 275, in filter
    return sre_parse.expand_template(template, match)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

There are plenty of people with this problem on the interwebs, and even a python bug report – most “solutions” involve re-writing your expression to make the unmatched group match the empty string. Unfortunately, my input expression comes from the SPARQL 1.1 compliance tests, and as much as I’d like to, I’m not really free to change it. So, it gets ugly:
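The gist is not included here, but the general trick is to do the backreference expansion yourself in a replacement function, since a function is free to map unmatched (None) groups to the empty string. A minimal sketch, handling only plain \N backreferences:

import re

def sub_empty_unmatched(pattern, template, string):
    # expand \1, \2, ... manually, turning unmatched groups into ''
    def repl(match):
        return re.sub(
            r"\\(\d+)",
            lambda g: match.group(int(g.group(1))) or "",
            template,
        )
    return re.sub(pattern, repl, string)

print(sub_empty_unmatched("(ab)|(a)", r"\1\2", "abc"))  # -> 'abc', no exception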

And it works, at least in my python 2.7.3 …

Some basic BTC2012 Stats

(The figure shows the biggest domains publishing data, and the links between them – mouse-over the edges to highlight, and choose the linking predicate from the drop-down list)

So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. In previous years I’ve done this all using unix tools, sed/awk/grep and friends. This year I figured I’d do it all in python. To get reasonable performance two things were crucial:

  • the python gzip module has decompression implemented in python; using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!) – see the sketch below
  • I wrote an N-Quads “parser” in cython, taking advantage of the very regular output of ld-spider
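A minimal sketch of the pipe-to-gunzip trick (the chunk filename is made up, and the real N-Quads handling was a cython parser, not the naive split below):

import subprocess
from collections import Counter

def lines_from_gzip(path):
    # let an external gunzip do the decompression and just read from the pipe
    p = subprocess.Popen(["gunzip", "-c", path], stdout=subprocess.PIPE)
    for line in p.stdout:
        yield line
    p.stdout.close()
    p.wait()

contexts = Counter()
for line in lines_from_gzip("btc-2012-chunk-000.gz"):
    # ld-spider output is regular enough that the context is always the
    # second-to-last space-separated token: <s> <p> <o> <context> .
    contexts[line.rsplit(b" ", 2)[-2]] += 1

print(contexts.most_common(10))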

This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.
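The LevelDB layer is not shown in the post either; a sketch of what a thin most-frequently-used cache in front of LevelDB could look like (using the plyvel bindings, which is an assumption since the post does not say which binding was used, and bytes keys/values):

import plyvel
from collections import Counter

class MFUCachedDB(object):
    # hot keys stay in an in-memory dict, the rest live in LevelDB on disk

    def __init__(self, path, max_cached=500000):
        self.db = plyvel.DB(path, create_if_missing=True)
        self.cache = {}
        self.hits = Counter()
        self.max_cached = max_cached

    def get(self, key, default=None):
        self.hits[key] += 1
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        return default if value is None else value

    def put(self, key, value):
        self.hits[key] += 1
        self.cache[key] = value
        if len(self.cache) > self.max_cached:
            self._evict()

    def _evict(self):
        # keep the most frequently used half in memory, flush the rest to disk
        keep = set(k for k, _ in self.hits.most_common(self.max_cached // 2))
        with self.db.write_batch() as batch:
            for key, value in self.cache.items():
                if key not in keep:
                    batch.put(key, value)
        self.cache = dict((k, v) for k, v in self.cache.items() if k in keep)
        self.hits = Counter(dict((k, self.hits[k]) for k in self.cache))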

I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:

  • 1.4B triples all in all
  • 1082 different namespaces are used
  • 9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domain, essentially data.gov.uk, instead of gov.uk, but livejournal.com, instead of bob.livejournal.com)
  • 183M unique subjects are described
  • 57k unique predicates
  • 192M unique resources as objects
  • 156M unique literals
  • 152M triples are rdf:type statements, and 296k types are used. Resources with multiple types are common: 45M resources have two types, 40M just one.

 

Top 10 Context PLDs

count context pld
751,352,061 data.gov.uk
198,090,262 dbpedia.org
101,241,556 freebase.com
101,082,592 livejournal.com
44,331,145 opera.com
41,544,819 dbtropes.org
39,200,538 legislation.gov.uk
36,969,163 identi.ca
29,447,217 ontologycentral.com
14,949,592 rdfize.com

 

Top 10 Namespaces

count namespace
336,911,630 http://www.w3.org/1999/02/22-rdf-syntax-ns#
191,669,089 http://www.w3.org/2000/01/rdf-schema#
143,650,096 http://xmlns.com/foaf/0.1/
133,845,241 http://reference.data.gov.uk/def/intervals/
115,692,342 http://www.w3.org/2006/time#
71,016,514 http://www.w3.org/2006/http#
69,715,106 http://rdf.freebase.com/ns/
66,058,545 http://www.w3.org/2004/02/skos/core#
53,246,991 http://purl.org/dc/terms/
50,444,755 http://dbpedia.org/property/

 

Top 10 Types

count type
39,345,307 intervals:Second
39,345,280 intervals:CalendarSecond
12,841,127 foaf:Person
7,623,831 foaf:Document
1,896,136 qb:Observation
1,851,173 fb:common.topic
1,712,877 intervals:Minute
1,712,875 intervals:CalendarMinute
1,328,921 owl:Thing
1,280,763 metalex:BibliographicExpression

As usual, although many namespaces/hosts/types are used, the distribution is skewed, and the most common elements quickly account for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context-plds/namespaces occurring more than N times (the X axis is logarithmic):

So the steeper the curve, the longer the tail of infrequently occurring elements. For example, less than 5% of types occur more than 100 times, but very few context-plds occur less than 10 times. However, when you look at the actual density the picture changes; here we plot the cumulative density, so although most types occur less than 100 times, the majority of the data uses only the most frequent types:

So the steeper the curve at the end, the more of the data is covered by the few most frequent elements. For example, the top 5% most frequent namespaces and context-plds cover over 99% of the data, but the top 5% of types “only” 97%.

A different (maybe useless?) view of this, is this histogram with exponentially increasing bucket-sizes, again with a log-scale, so they look the same size:

Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distributions of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I won’t put it here.

I don’t even want to talk about how long I spent making these graphs. I have wanted to graph this since the first BTC dataset I looked at, but previously I always fell back on “top n% of the elements cover n% of the data” tables.
The graphs are all done in pylab and exported as SVG (yay!). Playing with them was all done in the ipython notebook, which is really pleasant to work with.
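In essence each of those plots is just sorted counts and cumulative sums; a toy sketch of the cumulative-density flavour (with made-up Zipf-ish counts standing in for the real type/namespace/context frequencies, and not exactly the axes used above):

import numpy as np
import matplotlib.pyplot as plt

# made-up skewed frequencies standing in for e.g. rdf:type counts
rng = np.random.RandomState(0)
counts = rng.zipf(1.5, 10000)

freqs = np.sort(counts)[::-1].astype(float)   # most frequent element first
coverage = np.cumsum(freqs) / freqs.sum()     # fraction of all data covered

plt.semilogx(np.arange(1, len(freqs) + 1), coverage)
plt.xlabel("N most frequent elements")
plt.ylabel("fraction of the data covered")
plt.savefig("coverage.svg")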

Finally – the Chord-diagram on top shows links between context PLDs – mouse over each host to see outgoing links. This is only the top 19 PLD domains and the top 10 properties linking domains that themselves publish RDF data – this is important, as there are predicates used to link to non-semantic-web resources that would dominate otherwise. The graphics and interaction are all done with the excellent D3 library.

I will try to come up with some more interesting visualisations based on links between instances of various types soon!

Three simple methods for graphing datastreams

This is not very cutting edge, nor all THAT exciting – but it seemed worth putting the things together into one post for someone else who encounters the same scenario.

The problem is this: you have a stream of data-points coming in, you do not know how many points there will be, and you do not know the range of the numbers. You would like to draw a pretty little graph of this data. Since the number of points is unknown, but potentially massive, you cannot keep all the numbers in memory; you look at each one once and pass it on. It’s essentially an “online” (learning) scenario. The goal here is to draw a small graph that gives an idea of the trend of the data, not careful analysis – so a fair bit of approximation is ok.

In our case the numbers are sensors readings from agricultural machines, coming in through ISOXML (you probably really don’t want to go there, but it’s all in ISO 11783) documents. The machine makes readings every second, which adds up when it drives for days. This isn’t quite BIG DATA, but there may be millions of points, and in our case we want to run on a fairly standard PC, so an online approach was called for.

For the first and simplest problem, you have data coming in at regular intervals, and you want a (much) shorter array of values to plot. This class keeps an array of some length N; at first we just fill it with the incoming numbers. If we see more than N numbers, we “zoom out” by a factor of two: the N numbers we’ve seen are rewritten to N/2 numbers by averaging pairs, and from then on every two incoming numbers are averaged into each cell of our array. Until we reach N*2 numbers, then we zoom out again, now averaging every 4 numbers, etc. Rewriting the array is a bit of work, but it only happens log(total number of points) times. In the end you end up with somewhere between N/2 and N points.

/**
 * Create a summary time-series of some time-series of unspecified length. 
 * 
 * The final time-series will be somewhere between N/2 and N long
 * (Unless fewer numbers are given, of course)
 * 
 * For example:
 * TimeLogSummariser t = new TimeLogSummariser(4);
 * for (int i=1; i<20; i++) t.add(i);
 * => [3.0, 7.25, 18.0]
 * 
 */
public class TimeLogSummariser { 
	
	double data[];
	int N;
	int n=1;
	int i=0;
	public TimeLogSummariser(int N) { 
		if (N%2 != 0 ) N++; 
		this.N=N; 
		data=new double[N];
	}
	public void zoomOut() { 
		for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
		for (int j=N/2;j<N;j++) data[j]=0;
		n*=2;
	}
	
	public void add(double d) { 
		int j=i%n;
		int idx=i/n;
		
		if (idx>=N) { 
			zoomOut(); 
			j=i%n;
			idx=i/n;
		}
		data[idx]=(data[idx]*j+d)/(j+1);
		i++;
	}
	
	public double[] getData() {
		return Arrays.copyOfRange(data, 0, i/n+1); 
	}
	
}

Now, after programming this I realised that most of my tractor-sensor data actually does not come in at regular intervals. You’ll get stuff every second for a while, then suddenly a 3 second gap, or a 10 minute smoke-break, or a 40 minute lunch-break. Treating every point as equal does not give you the graph you want. This only makes things slightly more complicated though: instead of always stepping one step in the array, the step depends on the difference in time between the two points. We assume (and hope) that data always arrives in sequential order. Also, since we may take many steps at once now, we may need to “zoom out” more than once to fit in the array. Otherwise the code is almost the same:

 
/**
 * Create a summary time-series of some time-series of unspecified length. 
 * 
 * The final time-series will be somewhere between N/2 and N long
 * (Unless fewer numbers are given, of course)
 * 
 * This class will correctly do averages over time. 
 * The constructor takes the stepsize as a number of milliseconds, 
 * i.e. the number of milliseconds between each recording. 
 * Each value is then given with a date. 
 * 
 * For example:
 * DiffTimeLogSummariser t = new DiffTimeLogSummariser(4, 1000);
 * for (int i=1; i<20; i++) t.add(i, new Date(i*2000));
 * => [3.0, 7.25, 18.0]
 * 
 */
public class DiffTimeLogSummariser { 
	
	double data[];
	
	private int N;

	int n=1;
	double i=0;

	private long stepsize;

	private Date last;
	private Date start;
	
	public DiffTimeLogSummariser(int N, long stepsize) { 
		this.stepsize=stepsize;
		if (N%2 != 0 ) N++; 
		this.N=N;
		data=new double[N];
	}
	public void zoomOut() { 
		for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
		for (int j=N/2;j<N;j++) data[j]=0;
		n*=2;
	}
	
	public void add(double d, Date time) {
		
		long diff;
		if (last!=null) {
			diff=time.getTime()-last.getTime();
		} else { 
			start=time;
			diff=0;
		}
		if (diff<0) { 
			System.err.println("DiffTimeLogSummarizer got diff<0, ignoring.");
			return ; 
		}
		
		i+=diff/(double)stepsize;
		
		int j=(int) (Math.round(i)%n);
		int idx=(int) (Math.round(i)/n);
		
		while (idx>=N) { 
			zoomOut(); 
			j=(int) (Math.round(i)%n);
			idx=(int) (Math.round(i)/n);
		}
		data[idx]=(data[idx]*j+d)/(j+1);
		last=time;
	}
	
	public double[] getData() {
		return Arrays.copyOfRange(data, 0, (int) (i/n+1)); 
	}
	
	public Date getStart() {
		return start;
	}

	public long getStepsize() {
		return stepsize*n;
	}
}

This assumes that if there is a gap in the data, that means the value was 0 at these intervals. Whether this is true depends on your application, in some cases it would probably make more sense to assume no data means the value was unchanged. I will leave this as an exercise for the reader :)

Finally, sometimes you are more interested in the distribution of the values you get than in how they vary over time. Computing histograms on the fly is also possible; for uniform bin-sizes the algorithm is almost the same as above. The tricky bits here I’ve stolen from Per on stackoverflow:

/**
 * Create a histogram of N bins from some series of unknown length. 
 * 
 */
public class TimeLogHistogram { 

	int N;  // assume for simplicity that N is even
	int counts[];

	double lowerBound;
	double binSize=-1;

	public TimeLogHistogram(int N) { 
		this.N=N; 
		counts=new int[N];
	}

	public void add(double x) { 
		if (binSize==-1) {
			lowerBound=x;
			binSize=1;
		}
		int i=(int) (Math.floor(x-lowerBound)/binSize);

		if (i<0 || i>=N) {
			if (i>=N) 
				while (x-lowerBound >=N*binSize) zoomUp(); 
			else if (i<0) 
				while (lowerBound > x) zoomDown();
			i=(int) (Math.floor(x-lowerBound)/binSize);
		}
		counts[i]++;
	}

	private void zoomDown() {
		lowerBound-=N*binSize;
		binSize*=2;
		for (int j=N-1;j>N/2-1;j--) counts[j]=counts[2*j-N]+counts[2*j-N+1];
		for (int j=0;j<N/2;j++) counts[j]=0;
	}

	private void zoomUp() {
		binSize*=2;
		for (int j=0;j<N/2;j++) counts[j]=counts[2*j]+counts[2*j+1];
		for (int j=N/2;j<N;j++) counts[j]=0;
	}

	public double getLowerBound() {
		return lowerBound;
	}
	public double getBinSize() {
		return binSize;
	}
	public int[] getCounts() {
		return counts;
	}
	public int getN() { 
		return N;
	}
}

Histograms with uneven bin-sizes get trickier, but are apparently still possible, see:

Sudipto Guha, Nick Koudas, and Kyuseok Shim: Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst. 31(1):396-438 (2006)

That’s it – I could have put an actual graph figure in here somewhere, but it would just be some random numbers, so just imagine one.

(Apologies for Java code examples today, your regular python programming will resume shortly)

RDFLib & Linked Open Data on the Appengine

Recently I’ve had the chance to use RDFLib a fair bit at work, and I’ve fixed lots of bugs and also written a few new bits. The new bits generally started as write-once-and-forget things, which I then needed again and again, so I kept making them more general. The end result (for now) is two scripts that let you go from this CSV file to this webapp (via this N3 file). Actually – it’ll let you go from any CSV file to a Linked Open Data webapp; the app does content-negotiation and SPARQL, as well as the HTML you just saw when you clicked on the link.

The dataset in this case, is a small collection of King Crimson albums – I spent a long time looking for some CSV data in the wild that had the features I wanted to show off, but failed, and copy/pasted this together from the completely broken CSV dump of the Freebase page.

To convert the CSV file you need a config file giving URI prefixes and some details on how to handle the different columns. The config file for the King Crimson albums looks like:

[csv2rdf]

class=http://example.org/schema/Album
base=http://example.org/resource/
propbase=http://example.org/schema/
defineclass=True
ident=(0,)
label=(0,)

out=kingcrimson.n3

col1=date("%d %B %Y")
col2=split(";", uri("http://example.org/resource/label/", "http://example.org/schema/Genre"))
col3=split("/")
col4=split(";")
col5=int(replace("min",""))

prop3=http://myotherschema.org/label

With this config file and the current HEAD of rdfextras you can run:

python -m rdfextras.tools.csv2rdf -f kingcrimson.config kingcrimson.csv

and get your RDF.

This tool is of course not the first or only of its kind – but it’s mine! You may also want to try Google Refine, which has much more powerful (and interactive!) editing possibilities than my hack. With the RDF extension, you can even export RDF directly.
One benefit of this script is that it’s stream-based and could be used on very large CSV files. I believe Google Refine can also export the actions taken as some form of batch script, but I never tried it.

With lots of shiny new RDF in my hand I wanted to make it accessible to people who do not enjoy looking at N3 in a text-editor and built the LOD application.
It’s built on the excellent Flask micro-web-framework and it’s now also part of rdfextras. If you have the newest version you can run it locally in Flask’s debug server like this:

python -m rdfextras.web.lod kingcrimson.n3

This runs great locally – and I’ve also deployed it within Apache, but not everyone has a mod_python ready Apache at hand, so I thought it would be nice to run it inside the Google Appengine.

Running the Flask app inside of appengine turned out to be amazingly easy, thanks to Francisco Souza for the pointers:

# main.py
from google.appengine.ext.webapp.util import run_wsgi_app
from rdfextras.web import lod

import rdflib
g=rdflib.Graph()
g.load("kingcrimson.n3", format='n3')

run_wsgi_app(lod.get(g))

Write your app.yaml and make this your handler for /*, and you’re nearly good to go. To deploy this app to the appengine you also need all the required libraries (rdflib, flask, etc.) inside your app directory; a shell script for this is here: install-deps.sh

Now, I am not really clear on the details of how the appengine works. Is this code run for every request? Or is the wsgi app persistent? When I deployed the LOD app inside apache using mod_python, it seemed the app was created once and served many requests over its lifetime.
In any case, RDFLib has no appengine-compatible persistent store (who wants to write an rdflib store on top of the appengine datastore?), so the graph is kept in memory; perhaps it is re-parsed once for each request, perhaps not – this limits the scalability of this approach in any case. I also do not know the memory limitations of the appengine – or how efficient the rdflib in-memory store really is – but I assume there is a fairly low limit on the number of triples you can serve this way. Inside apache I’ve deployed it on some hundred thousand triples in a BerkeleyDB store.

There are several things that could be improved everywhere here – the LOD app in particular has some rough edges and bugs, but it’s being used internally in our project, so we might fix some of them given time. The CSV converter really needs a way to merge two columns, not just split them.

All the files you need to run this example yourself are under: http://gromgull.net/2011/08/rdflibLOD/ – let me know if you try it and if it works or breaks!

Voting in the Eurovision Song Contest

Yesterday I was made to watch the Eurovision song contest. I went to bed before the voting ended, so I woke up to find that Azerbaijan had won. Which was curious, since they were awful (or perhaps not, since this is Eurovision).

At the official webpage you can get the voting breakdown, where we can see that all the countries I have lived in (Norway, UK and Germany) gave Azerbaijan 0 points. Clearly the Eurovision has been ruined by all these new East-bloc countries, who in a giant conspiracy only vote for each other, rendering us western countries with real musical talent without a chance. To confirm my suspicion I grabbed the result table, python, scipy and matplotlib. Compute the correlation matrix for the columns, run PCA on this and plot the first two components (if all that meant nothing to you, the result is that countries who tend to distribute their votes similarly are close to each other in the diagram):
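The post does not include the analysis code, but the computation described is only a handful of lines; here is a sketch with a random stand-in for the real voting table (rows = voting country, columns = points given to each of the other countries):

import numpy as np
import matplotlib.pyplot as plt

countries = ["Norway", "Sweden", "Denmark", "Germany", "UK",
             "Azerbaijan", "Greece", "Cyprus"]
rng = np.random.RandomState(1)
votes = rng.randint(0, 13, size=(len(countries), len(countries)))  # stand-in data

# how similarly do two countries hand out their points?
corr = np.corrcoef(votes)

# PCA via SVD: project the correlation matrix onto its first two components
centered = corr - corr.mean(axis=0)
_u, _s, vt = np.linalg.svd(centered, full_matrices=False)
xy = centered.dot(vt[:2].T)

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), name in zip(xy, countries):
    plt.annotate(name, (x, y))
plt.savefig("eurovision-pca.svg")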

Here the truths are clear as day – there is a Scandinavian conspiracy: Norway and Sweden are really the same country, and Denmark is almost the same. Greece and Cyprus are one country (ahem – sorry Turkey, you are not far away). The East-bloc cabal is in the lower right. Malta is nothing like anyone else, it’s almost like it’s an island ….

I think the only solution is to go back to the 1960 version of Eurovision, the first year that all countries that matter took part.

Seriously though – it’s fun to see how close this is to the actual geography of Europe, rotate the map a bit, and Scandinavia, UK, Italy are all in the right place.

PS: this is also pretty funny, but it seems someone takes this more seriously than perhaps they should: Eurovision Voting Fraud

Trope Bingo

This week we went to see Thor, and it seemed guaranteed to be a Trope-fest, and over lunch we came up with the idea of a “trope bingo”. 30 minutes with python, SPARQL and dbtropes.org in the last part of the afternoon and it was done.

In the end, the film was good: Thor was the Big Ham, uttered the BIG NO, and they have token Asian and Black Norse Gods. However, the cinema was too dark to actually play Bingo.

Today I got around to “porting” the script to PHP, so now you can play too! Click here:

Trope Bingo!

Not much to say about this one – the following query extracts all tropes from dbtropes which have more than 200 instances:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skip:
SELECT * WHERE {
?trope a skip:FeatureClass ; rdfs:label ?label ; rdfs:comment ?comment .
{ SELECT (count(*) AS ?count) ?trope WHERE { ?f a ?trope . } GROUP BY ?trope }
FILTER (?count>200)
}

The tropes are stored in a CSV file and we pick 25 at random. See the source.
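For completeness, the random pick for a 5×5 card is the easy part; in Python it is roughly this (assuming a two-column label,comment CSV, which may not be the exact layout used):

import csv
import random

with open("tropes.csv") as f:
    tropes = list(csv.reader(f))

card = random.sample(tropes, 25)
for row in range(0, 25, 5):
    print(" | ".join(label for label, _comment in card[row:row + 5]))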