Posts categorized “Coding”.

Multithreaded SciPy/NumPy with OpenBLAS on debian

Some months ago, just after I got an 8-core CPU, I wasted a weekend trying to get SciPy/NumPy to build against OpenBLAS. OpenBLAS is neat, as it has built-in and automatic support for multi-threading; for things like computing the dot product of large matrices this can really be a time saver.

I was using roughly these instructions, but it was too complicated and I got nowhere and gave up.

Then I got a new MacBook, and set it up using homebrew rather than macports, and I noticed that NumPy was built against OpenBLAS by default. Now, it would be a shame to let debian be any worse off…

Luckily, it was much easier than I thought:

1. Install the openblas and lapack libraries and dev-headers

sudo apt-get install libopenblas-dev liblapack-dev

2. Set up a virtualenv

To make sure we do not mess up the whole system, set up a virtualenv (if you ever install more than 3 python packages and do not yet know about virtualenv, you really should get to know it, it’s a little piece of magic!):

virtualenv env
source env/bin/activate

3. Install NumPy

In Debian, openblas/lapack fit into the alternatives system, and the implementation you choose gets symlinked into /usr/lib. However, this confuses NumPy, and you must point it to the right place, i.e. /usr/lib/openblas-base.
Download and unpack NumPy:

mkdir env/download
pip install -d env/download numpy
mkdir env/build
cd env/build
tar xf ../download/numpy-1.7.1.tar.gz

Now, inside the unpacked numpy-1.7.1 directory, create a site.cfg file with the following content:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

Build/install NumPy:

python setup.py install

You can now check the file env/lib/python2.7/site-packages/numpy/__config__.py to make sure it found the right libs, mine looks like this:

lapack_info={'libraries': ['lapack'], 
    'library_dirs': ['/usr/lib'], 'language': 'f77'}
atlas_threads_info={'libraries': ['openblas'], 
    'library_dirs': ['/usr/local/lib'], 
    'language': 'c', 
    'define_macros': [('ATLAS_WITHOUT_LAPACK', None)], 
    'include_dirs': ['/usr/local/include']}
blas_opt_info={'libraries': ['openblas'], 
    'library_dirs': ['/usr/local/lib'], 
    'language': 'c', 
    'define_macros': [('ATLAS_INFO', '"\\"None\\""')], 
    'include_dirs': ['/usr/local/include']}

4. Install SciPy

If NumPy installs cleanly, SciPy can simply be installed with pip:

pip install scipy

5. Test!

Using these scripts you can test your NumPy and SciPy installation. By activating/deactivating the virtualenv, you can test with and without OpenBLAS. If you have several CPU cores, you can see that with OpenBLAS up to 4 of them are used.
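
If the link to those scripts ever goes away, a rough benchmark along these lines will do the job (this is only a sketch, not the exact scripts I used; the matrix sizes are arbitrary, so the absolute numbers will not match mine):

import time
import numpy
import scipy.linalg

N = 1500
a = numpy.random.random((N, N))
b = numpy.random.random((N, N))

t = time.time()
numpy.dot(a, b)
print "dot:", time.time() - t, "sec"

# symmetric positive definite matrix for the Cholesky factorisation
s = numpy.dot(a, a.T) + N * numpy.eye(N)
t = time.time()
scipy.linalg.cholesky(s)
print "cholesky:", time.time() - t, "sec"

t = time.time()
scipy.linalg.svd(a)
print "svd:", time.time() - t, "sec"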

Without OpenBLAS I get:

NumPy 
dot: 0.901498508453 sec

SciPy
cholesky: 0.11981959343 sec
svd: 3.64697360992 sec

with OpenBLAS:

NumPy
dot: 0.0569217920303 sec

SciPy
cholesky: 0.0204758167267 sec
svd: 0.81153883934 sec

On finding duplicate images

I got a new shiny MacBook in my new job at Bakken & Baeck, and figured it was time for a new start, so I am de-commissioning my old MacBook and with it the profile and files that are so old they used to be on a PowerBook. Most things were easy, until I got to the photos. Over the years I have imported photos to the laptop while travelling, but always tried to import them again to my real backed-up photo archive at home when I got there, unless my SD card was full while travelling, or I forgot, or something else went wrong. That means I am fairly sure MOST of the photos on the laptop are also in my archive, but also fairly sure some are not.
And of course, each photo, even an out-of-focus, under-exposed test-shot, is a little piece of personal memory, a beautiful little diamond of DATA, and must at all costs NOT BE LOST.

The photos are mostly (but not all) in iPhoto, in a mix of old-style imports mixed up in iPhoto’s own folder structure, and folders I have named myself.

“Easy” I thought, I trust the computer, I normally use Picasa, it will detect duplicates when importing! Using


find . -iname \*.jpg -print0 | xargs -0 -I{} cp -v --backup=t {} /disks/1tb/tmp/photos/

I can copy all JPGs from the Pictures folder into one big folder without overwriting files with duplicate names (I ❤ coreutils?), then let Picasa sort it out for me.

Easily done: 7500 photos in one folder. Picasa thinks a bit and detects some duplicates, but far from enough. Several photos I KNOW are archived are not flagged as dupes. I give up trusting Picasa. I know who I can trust:

md5sum!

(In retrospect, I should have trusted Picasa a bit, and at least removed the ones it DID claim were duplicates.)

So, next step, compute the md5sum of all 7500 new photos, and of all 45,000 already archived photos. Write the shell-script, go to work, return, write the python to find all duplicates, delete the ones from the laptop.
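
The scripts themselves were nothing fancy; roughly the idea was this (a sketch, with made-up paths, and it only prints the candidates rather than deleting anything):

import os
import hashlib

def md5s(root):
    # map md5 hex digest -> list of files with that digest
    res = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for f in filenames:
            if not f.lower().endswith('.jpg'):
                continue
            path = os.path.join(dirpath, f)
            digest = hashlib.md5(open(path, 'rb').read()).hexdigest()
            res.setdefault(digest, []).append(path)
    return res

archive = md5s('/disks/1tb/photoarchive')   # made-up path
laptop = md5s('/disks/1tb/tmp/photos')      # made-up path

for digest, paths in laptop.items():
    if digest in archive:
        for p in paths:
            print "duplicate:", p, "==", archive[digest][0]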

Success! 3700 duplicates gone! But wait! There are still many photos I know for a fact are duplicates. I pick one at random and inspect it. It IS the same photo, but one JPG is 3008×2008 and the other is 3040×2024, and the white-balance is very slightly different. Now I understand: back when I had more time, I shot exclusively in RAW, and these are two JPGs produced from the same RAW file, one by iPhoto, one by UFRaw; the iPhoto one is slightly smaller and has worse color. Bah.

Now it’s getting late, my Friday night is slipping away between my fingers, but I am damned if I give up now. Next step: EXIF data! Both files have their EXIF intact, and both were (surprise!) taken at the same time. Now, I don’t want to go and look up the EXIF tags on all 45,000 archived photos just now, but I can filter by filename: if two files have the same basename (IMGP1234) AND were taken at the same time, I am willing to risk deleting one of them.

So with the help of the EXIF parsing library from https://github.com/ianare/exif-py and a bit of python:
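
(The actual snippet is not included in this version of the post. The sketch below shows the idea; it assumes the newer exifread module name, while older releases of the same library were imported as EXIF, and the paths are made up.)

import os
import exifread

def exif_date(path):
    with open(path, 'rb') as f:
        tags = exifread.process_file(f, details=False)
    tag = tags.get('EXIF DateTimeOriginal')
    return str(tag) if tag else None

def index(root):
    # map (basename, capture time) -> full path
    res = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for f in filenames:
            if f.lower().endswith('.jpg'):
                path = os.path.join(dirpath, f)
                res[(os.path.splitext(f)[0].lower(), exif_date(path))] = path
    return res

archive = index('/disks/1tb/photoarchive')   # made-up path
for key, path in index('/disks/1tb/tmp/photos').items():
    if key[1] is not None and key in archive:
        print "duplicate:", path, "==", archive[key]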

it is done! Some ~3500 more duplicates removed!

I am left with 202 photos that may actually be SAVED FROM ETERNAL OBLIVION! (Looking more carefully, about half are actually out-of-focus shots, test shots, or nonsense I probably DID copy to the archive but then deleted.) It was certainly worth spending the entire evening on this, 3 days in a row!

Now I can go back to hating Hotmail for having deleted all the emails I received pre-2001…

python, regex, unicode and brokenness

(This post included a complaint about the handling of unicode codepoints >0xffff in python, including a literal such character, and it broke WordPress, which ate the remainder of the post after that character… and I am too lazy to retype it, so for now, no unicode.)

I love python, I really do, but some things are … slightly irregular.

One of those things is the handling of unmatched regular expression groups when replacing. In python such a group returns None when matching; this is fine. But when replacing, an unmatched group will produce an error, rather than simply inserting the empty string. For example:

>>> re.sub('(ab)|(a)', r'\1\2', 'abc')
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 275, in filter
    return sre_parse.expand_template(template, match)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_parse.py", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

There are plenty of people with this problem on the interwebs, and even a python bug report – most “solutions” involve re-writing your expression to make the unmatched group match the empty string. Unfortunately, my input expression comes from the SPARQL 1.1 compliance tests, and as much as I’d like to, I’m not really free to change it. So, it gets ugly:
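
(My actual workaround is not included in this version of the post; below is a sketch of one such ugly hack, which expands numeric back-references by hand so that unmatched groups simply become empty strings.)

import re

def sub_empty_unmatched(pattern, template, string):
    # like re.sub, but \1, \2, ... referring to unmatched groups expand
    # to '' instead of raising "unmatched group"
    # (only handles plain numeric back-references, which is enough here)
    def repl(match):
        return re.sub(r'\\(\d+)',
                      lambda m: match.group(int(m.group(1))) or '',
                      template)
    return re.sub(pattern, repl, string)

print sub_empty_unmatched('(ab)|(a)', r'\1\2', 'abc')   # -> abc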

And it works, at least in my python 2.7.3 …

Three simple methods for graphing datastreams

This is not very cutting edge, nor all THAT exciting – but it seemed worth putting these things together into one post for someone else who encounters the same scenario.

The problem is this: you have a stream of data-points coming in, you do not know how many points there will be, and you do not know the range of the numbers. You would like to draw a pretty little graph of this data. Since the number of points is unknown, but potentially massive, you cannot keep all the numbers in memory, so you look at each one once and pass it on. It’s essentially an “online” (learning) scenario. The goal here is to draw a small graph to give an idea of the trend of the data, not careful analysis – so a fair bit of approximation is ok.

In our case the numbers are sensor readings from agricultural machines, coming in through ISOXML documents (you probably really don’t want to go there, but it’s all in ISO 11783). The machine makes readings every second, which adds up when it drives for days. This isn’t quite BIG DATA, but there may be millions of points, and in our case we want to run on a fairly standard PC, so an online approach was called for.

For the first and simplest problem, you have data coming in at regular intervals, and you want a (much) shorter array of values to plot. This class keeps an array of some length N; at first we just fill it with the incoming numbers. If we see more than N numbers, we “zoom out” by a factor of two, rewriting the N numbers we’ve seen to N/2 numbers by averaging pairs, and from then on, every two incoming numbers are averaged into each cell of our array. Until we reach N*2 numbers, then we zoom out again, now averaging every 4 numbers, etc. Rewriting the array is a bit of work, but it only happens log(total number of points) times. In the end you have somewhere between N/2 and N points.

/**
 * Create a summary time-series of some time-series of unspecified length. 
 * 
 * The final time-series will be somewhere between N/2 and N long
 * (Unless fewer numbers are given, of course)
 * 
 * For example:
 * TimeLogSummariser t = new TimeLogSummariser(4);
 * for (int i=1; i<20; i++) t.add(i);
 * => [3.0, 7.25, 18.0]
 * 
 */
public class TimeLogSummariser { 
	
	double data[];
	int N;
	int n=1;
	int i=0;
	public TimeLogSummariser(int N) { 
		if (N%2 != 0 ) N++; 
		this.N=N; 
		data=new double[N];
	}
	public void zoomOut() { 
		for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
		for (int j=N/2;j<N;j++) data[j]=0;
		n*=2;
	}
	
	public void add(double d) { 
		int j=i%n;
		int idx=i/n;
		
		if (idx>=N) { 
			zoomOut(); 
			j=i%n;
			idx=i/n;
		}
		data[idx]=(data[idx]*j+d)/(j+1);
		i++;
	}
	
	public double[] getData() {
		return Arrays.copyOfRange(data, 0, i/n+1); 
	}
	
}

Now, after programming this I realised that most of my tractor-sensor data actually does not come in at regular intervals. You’ll get stuff every second for a while, then suddenly a 3 second gap, or a 10 minute smoke-break or a 40 minute lunch-break. Treating every point as equal does not give you the graph you want. This only makes things slightly more complicated though: instead of always stepping one step in the array, the step depends on the difference in time between two points. We assume (and hope) that data always arrives in sequential order. Also, since we may take many steps at once now, we may need to “zoom out” more than once to fit in the array. Otherwise the code is almost the same:

 
/**
 * Create a summary time-series of some time-series of unspecified length. 
 * 
 * The final time-series will be somewhere between N/2 and N long
 * (Unless fewer numbers are given, of course)
 * 
 * This class will correctly do averages over time. 
 * The constructor takes the stepsize as a number of milliseconds, 
 * i.e. the number of milliseconds between each recording. 
 * Each value is then given with a date. 
 * 
 * For example:
 * DiffTimeLogSummariser t = new DiffTimeLogSummariser(4, 1000);
 * for (int i=1; i<20; i++) t.add(i, new Date(i*2000));
 * => [3.0, 7.25, 18.0]
 * 
 */
public class DiffTimeLogSummariser { 
	
	double data[];
	
	private int N;

	int n=1;
	double i=0;

	private long stepsize;

	private Date last;
	private Date start;
	
	public DiffTimeLogSummariser(int N, long stepsize) { 
		this.stepsize=stepsize;
		if (N%2 != 0 ) N++; 
		this.N=N;
		data=new double[N];
	}
	public void zoomOut() { 
		for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
		for (int j=N/2;j<N;j++) data[j]=0;
		n*=2;
	}
	
	public void add(double d, Date time) {
		
		long diff;
		if (last!=null) {
			diff=time.getTime()-last.getTime();
		} else { 
			start=time;
			diff=0;
		}
		if (diff<0) { 
			System.err.println("DiffTimeLogSummarizer got diff<0, ignoring.");
			return ; 
		}
		
		i+=diff/(double)stepsize;
		
		int j=(int) (Math.round(i)%n);
		int idx=(int) (Math.round(i)/n);
		
		while (idx>=N) { 
			zoomOut(); 
			j=(int) (Math.round(i)%n);
			idx=(int) (Math.round(i)/n);
		}
		data[idx]=(data[idx]*j+d)/(j+1);
		last=time;
	}
	
	public double[] getData() {
		return Arrays.copyOfRange(data, 0, (int) (i/n+1)); 
	}
	
	public Date getStart() {
		return start;
	}

	public long getStepsize() {
		return stepsize*n;
	}
}

This assumes that a gap in the data means the value was 0 during those intervals. Whether this is true depends on your application; in some cases it would probably make more sense to assume that no data means the value was unchanged. I will leave this as an exercise for the reader :)

Finally, sometimes you are more interested in the distribution of the values you get than in how they vary over time. Computing histograms on the fly is also possible; for uniform bin-sizes the algorithm is almost the same as above. The tricky bits here I’ve stolen from Per on Stack Overflow:

/**
 * Create a histogram of N bins from some series of unknown length. 
 * 
 */
public class TimeLogHistogram { 

	int N;  // assume for simplicity that N is even
	int counts[];

	double lowerBound;
	double binSize=-1;

	public TimeLogHistogram(int N) { 
		this.N=N; 
		counts=new int[N];
	}

	public void add(double x) { 
		if (binSize==-1) {
			lowerBound=x;
			binSize=1;
		}
		int i=(int) (Math.floor(x-lowerBound)/binSize);

		if (i<0 || i>=N) {
			if (i>=N) 
				while (x-lowerBound >=N*binSize) zoomUp(); 
			else if (i<0) 
				while (lowerBound > x) zoomDown();
			i=(int) (Math.floor(x-lowerBound)/binSize);
		}
		counts[i]++;
	}

	private void zoomDown() {
		lowerBound-=N*binSize;
		binSize*=2;
		for (int j=N-1;j>N/2-1;j--) counts[j]=counts[2*j-N]+counts[2*j-N+1];
		for (int j=0;j<N/2;j++) counts[j]=0;
	}

	private void zoomUp() {
		binSize*=2;
		for (int j=0;j<N/2;j++) counts[j]=counts[2*j]+counts[2*j+1];
		for (int j=N/2;j<N;j++) counts[j]=0;
	}

	public double getLowerBound() {
		return lowerBound;
	}
	public double getBinSize() {
		return binSize;
	}
	public int[] getCounts() {
		return counts;
	}
	public int getN() { 
		return N;
	}
}

Histograms with uneven bin-sizes get trickier, but are apparently still possible, see:

Approximation and streaming algorithms for histogram construction problems.
Sudipto Guha, Nick Koudas, and Kyuseok Shim. ACM Trans. Database Syst. 31(1):396–438 (2006).

That’s it – I could have put in an actual graph figure here somewhere, but it would just be some random numbers, so just imagine one.

(Apologies for Java code examples today, your regular python programming will resume shortly)

RDFLib & Linked Open Data on the Appengine

Recently I’ve had the chance to use RDFLib a fair bit at work, and I’ve fixed lots of bugs and also written a few new bits. The new bits generally started as write-once-and-forget things, which I then needed again and again, so I kept making them more general. The end result (for now) is two scripts that let you go from this CSV file to this webapp (via this N3 file). Actually, they’ll let you go from any CSV file to a Linked Open Data webapp; the app does content-negotiation and SPARQL as well as the HTML you just saw when you clicked the link.
In the court

The dataset in this case is a small collection of King Crimson albums – I spent a long time looking for some CSV data in the wild that had the features I wanted to show off, but failed, and copy/pasted this together from the completely broken CSV dump of the Freebase page.

To convert the CSV file you need a config file giving URI prefixes and some details on how to handle the different columns. The config file for the King Crimson albums looks like:

[csv2rdf]

class=http://example.org/schema/Album
base=http://example.org/resource/
propbase=http://example.org/schema/
defineclass=True
ident=(0,)
label=(0,)

out=kingcrimson.n3

col1=date("%d %B %Y")
col2=split(";", uri("http://example.org/resource/label/", "http://example.org/schema/Genre"))
col3=split("/")
col4=split(";")
col5=int(replace("min",""))

prop3=http://myotherschema.org/label

With this config file and the current HEAD of rdfextras you can run:

python -m rdfextras.tools.csv2rdf -f kingcrimson.config kingcrimson.csv

and get your RDF.

This tool is of course not the first or only of its kind – but it’s mine! You may also want to try Google Refine, which has much more powerful (and interactive!) editing possibilities than my hack. With the RDF extension, you can even export RDF directly.
One benefit of this script is that it’s stream-based and could be used on very large CSV files. Although I believe Google Refine can also export the actions taken as some form of batch script, I never tried it.

With lots of shiny new RDF in my hands I wanted to make it accessible to people who do not enjoy looking at N3 in a text-editor, so I built the LOD application.
It’s built on the excellent Flask micro-web-framework, and it’s now also part of rdfextras. If you have the newest version you can run it locally in Flask’s debug server like this:

python -m rdfextras.web.lod kingcrimson.n3

This runs great locally – and I’ve also deployed it within Apache, but not everyone has a mod_python-ready Apache at hand, so I thought it would be nice to run it inside the Google Appengine.

Running the Flask app inside of appengine turned out to be amazingly easy, thanks to Francisco Souza for the pointers:

# main.py
from google.appengine.ext.webapp.util import run_wsgi_app
from rdfextras.web import lod

import rdflib
g=rdflib.Graph()
g.load("kingcrimson.n3", format='n3')

run_wsgi_app(lod.get(g))

Write your app.yaml, make this your handler for /*, and you’re nearly good to go. To deploy this app to the appengine you also need all the required libraries (rdflib, flask, etc.) inside your app directory; a shell script for this is here: install-deps.sh
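
For reference, the app.yaml can be as simple as something like this (a sketch; the application name is made up, and the runtime/api_version values may need adjusting for your SDK):

application: rdflib-lod-demo
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: main.py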

Now, I am not really clear on the details of how the appengine works. Is this code run for every request? Or is the wsgi app persistent? When I deployed the LOD app inside apache using mod_python, it seemed the app was created once and served many requests over its lifetime.
In any case, RDFLib has no appengine-compatible persistent store (who wants to write an rdflib store on top of the appengine datastore?), so the graph is kept in memory; perhaps it is re-parsed once for each request, perhaps not – this limits the scalability of this approach in any case. I also do not know the memory limitations of the appengine – or how efficient the rdflib in-memory store really is – but I assume there is a fairly low limit on the number of triples you can serve this way. Inside apache I’ve deployed it on some hundred thousand triples in a BerkeleyDB store.

There are several things that could be improved everywhere here – the LOD app in particular has some rough edges and bugs, but it’s being used internally in our project, so we might fix some of them given time. The CSV converter really needs a way to merge two columns, not just split them.

All the files you need to run this example yourself are under: http://gromgull.net/2011/08/rdflibLOD/ – let me know if you try it and if it works or breaks!

Creating animations with graphviz

Here is a really pointless hack – Joern Hees asked me if I knew of any tools to force-layout and visualise RDF graphs. We wondered about graphviz, but he wanted an interactive tool. When he left I wondered if it wouldn’t be quite easy to at least make an animation with graphviz. Of course it took longer than the 10 minutes I expected, but it sort of worked. Based on the graphviz siblings example graph:

# get example
wget http://www.graphviz.org/Gallery/directed/siblings.gv.txt

# create the initial random layout
neato -Gstart=rand -Gmaxiter=1 -o i.dot siblings.gv.txt

# create 200 pngs for each iteration 
for x in $(seq 200) ; do neato -Gmaxiter=$x -Tpng -o $(printf "%03d" $x).png i.dot ; done

# resize so they are all the same size - graphviz sizing (-Gsize=4,4) is specified in inches and does not always produce PNGs of the same size.
for f in *.png ; do convert $f -resize 500x500! out.png ; mv out.png $f ; done

# make movie
mencoder mf://*.png -mf w=450:h=500:fps=10:type=png -ovc lavc -lavcopts vcodec=mpeg4:mbd=2:trell -oac copy -o output.avi

# upload to youtube
# profit!

(Of course there are many other tools that are much better than this – this is really a “because I can” case.)

A quick and dirty guide to YOUR first time with RDF

(This is written for the challenge from http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/)

(To save you copy/pasting/typing you can download the examples from here: http://gromgull.net/2011/01/firstRDF/)

10 steps to make sense of RDF data:

  1. Install a debian or Ubuntu based system — I used Debian testing.
  2. Install rdflib and Berkeley/Sleepycat DB by doing:
    sudo apt-get install python-rdflib python-bsddb3
    

    (I got rdflib version 2.4.2 – if you get version 3.X.X the code may look slightly different, let me know if you cannot work out the changes on your own)

  3. Find some data — I randomly picked the data behind the BIS Research Funding Explorer. You can find the raw RDF data on the source.data.gov.uk/data/ server. We will use the schema file from:

    http://source.data.gov.uk/data/research/bis-research-explorer/2010-03-04/research-schema.rdf

    and the education data from:

    http://source.data.gov.uk/data/education/bis-research-explorer/2010-03-04/education.data.gov.uk.nt

    We use the education data because it is smaller than the research data, only 500K vs 11M, and because there is a syntax error in the corresponding file for research :). In the same folders there are files called blahblah-void. These are statistics about the datasets, and we do not need them for this (see http://vocab.deri.ie/void/ for details).

  4. Load the data, type this into a python shell, or create a python file and run it:
    import rdflib
    
    g=rdflib.Graph('Sleepycat')
    g.open("db")
    
    g.load("http://source.data.gov.uk/data/education/bis-research-explorer/2010-03-04/education.data.gov.uk.nt", format='nt')
    g.load("http://source.data.gov.uk/data/research/bis-research-explorer/2010-03-04/research-schema.rdf")
    
    g.close()
    

    Note that the two files are in different RDF formats: both contain triples, but one is serialized as XML, the other in an ASCII line-based format called N-Triples. You do not have to care about this, just tell rdflib to use the right parser with the format=X parameter; RDF/XML is the default.

  5. After the script has run there will be a new folder called db in the current directory; it contains the Berkeley DB files and indexes for the data. For the above example it’s about 1.5M.
  6. Explore the data a bit, again type this into a python shell:
    • First open the DB again:
    • import rdflib
      g=rdflib.Graph('Sleepycat')
      g.open("db")
      len(g)
      
      -- Outputs: 3690 --
      

      The graph object is quite pythonic, and you can treat it like a collection of triples. Here len tells us we have loaded 3690 triples.

    • Find out what sorts of things this data describes. Things are typed by a triple with rdf:type as the predicate in RDF.
      for x in set(g.objects(None, rdflib.RDF.RDFNS["type"])): print x
      
      -- Outputs:
      http://www.w3.org/2002/07/owl#ObjectProperty
      http://www.w3.org/2002/07/owl#DatatypeProperty
      http://xmlns.com/foaf/0.1/Organization
      http://purl.org/vocab/aiiso/schema#Institution
      http://research.data.gov.uk/def/project/Location
      http://www.w3.org/1999/02/22-rdf-syntax-ns#Property
      http://www.w3.org/2000/01/rdf-schema#Class
      --
      

      rdflib gives you several handy functions that return python generators for doing simple triple-based queries. Here we used graph.objects, which takes two parameters, the subject and predicate to filter for, and returns a generator over all matching objects. rdflib also provides constants for the well-known RDF and RDF Schema vocabularies; we used this here to get the correct URI for the rdf:type predicate.

    • Now we know the data contains some Institutions, get a list using another rdflib triple-based query:
      for x in set(g.subjects(rdflib.RDF.RDFNS["type"], rdflib.URIRef('http://purl.org/vocab/aiiso/schema#Institution'))): print x
      
      -- Outputs:
      http://education.data.gov.uk/id/institution/UniversityOfWolverhampton
      http://education.data.gov.uk/id/institution/H-0081
      http://education.data.gov.uk/id/institution/H-0080
      ... (and many more) ...
      --
      

      This gives us a long list of all institutions. The set call here just iterates through the generator and removes duplicates.

    • Let’s look at the triples about one of them in more detail:
      for t in g.triples((rdflib.URIRef('http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon'), None, None)): print map(str,t)
      -- Outputs:
      ['http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon', 'http://research.data.gov.uk/def/project/location', 'http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon/WC1E6BT']
      ['http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon', 'http://research.data.gov.uk/def/project/organisationName', 'University College London']
      ... (and many more) ...
      --
      

      This gives us a list of triples asserted about UCL. Here we used the triples method of rdflib; it takes a single argument, a tuple representing the triple filters. The returned triples are also tuples; the map(str,t) just makes the output prettier.

  7. rdflib makes it very easy to work with triple-based queries, but for more complex queries you quickly need SPARQL. This is also straightforward:
    PREFIX="""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX p: <http://research.data.gov.uk/def/project/>
    PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    """
    
    list(g.query(PREFIX+"SELECT ?x ?label WHERE { ?x rdfs:label ?label ; a aiiso:Institution . } "))[:10]
    

    The prefixes defined at the start let us use short names instead of full URIs in the queries. The graph.query method returns a generator over tuples of variable bindings. This lists the first 10 – it is more or less the same as before, listing all institutions, but this time we also get the human-readable label.

  8. Now a slightly more complicated example. Ask the knowledge base to find all institutions classified as public sector that took part in some project together:

     
    
    r=list(g.query(PREFIX+"""SELECT DISTINCT ?x ?xlabel ?y ?ylabel WHERE { 
       ?x rdfs:label ?xlabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p . 
    
       ?y rdfs:label ?ylabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p .
    
       FILTER (?x != ?y) } LIMIT 10 """))
    
    for x in r[:3]: print map(str,x)
    
    -- Outputs:
    ['http://education.data.gov.uk/id/institution/H-0155', 'Nottingham University', 'http://education.data.gov.uk/id/institution/H-0159', 'The University of Sheffield']
    ['http://education.data.gov.uk/id/institution/H-0155', 'University of Nottingham', 'http://education.data.gov.uk/id/institution/H-0159', 'University of Sheffield']
    ['http://education.data.gov.uk/id/institution/H-0159', 'Sheffield University', 'http://education.data.gov.uk/id/institution/H-0155', 'University of Nottingham']
    --
    

    All fairly straightforward; the FILTER is there to make sure the two institutions we find are not the same.
    (Disclaimer: there is a bug in rdflib (http://code.google.com/p/rdfextras/issues/detail?id=2) that makes this query take very long :( – it should be near instantaneous, but takes maybe 10 seconds for me. )

  9. The data we loaded so far does not have any details on the projects that actually got funded, only their URIs, for example: http://research.data.gov.uk/doc/project/tsb/100232. You can go there with your browser and find out that this is a project called “Nuclear transfer enhancement technology for bio processing and tissue engineering” – luckily, so can rdflib: just call graph.load on the URI. Content-negotiation on the server will make sure that rdflib gets machine-readable RDF when it asks. A for-loop over an rdflib triple query, loading all the project descriptions, is left as an exercise to the reader :) (a rough sketch follows after this list).
  10. That’s it! There are many places to go from here, just keep trying things out – if you get stuck try asking questions on http://www.semanticoverflow.com/ or in the IRC chatroom at irc://irc.freenode.net:6667/swig. Have fun!
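
For step 9, a rough sketch of that for-loop could look like this (the predicate URI is the p:project property used in the SPARQL queries above):

import rdflib

g = rdflib.Graph('Sleepycat')
g.open("db")

PROJECT = rdflib.URIRef('http://research.data.gov.uk/def/project/project')

for project in set(g.objects(None, PROJECT)):
    try:
        g.load(project)   # content-negotiation gives us RDF back
    except Exception, e:
        print "could not load", project, e

g.close()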

Illustrating the kernel trick

For a one-paragraph intro to SVMs and the kernel-trick I wanted a graphic that I’ve seen in a book (although I’ve forgotten where, perhaps in Pattern Classification?):

Simple idea — show some 2D data points that are not linearly separable, then transform them to 3D somehow, and show that they are. I found nothing on google (at least nothing that was high enough resolution to reuse), so I wrote some lines of python with pylab and matplotlib:

import math
import pylab
import scipy

def vlen(v):
    return math.sqrt(scipy.vdot(v,v))

p=scipy.randn(100,2)

a=scipy.array([x for x in p if vlen(x)>1.3 and vlen(x)<2])
b=scipy.array([x for x in p if vlen(x)<0.8])

pylab.scatter(a[:,0], a[:,1], s=30, c="blue")
pylab.scatter(b[:,0], b[:,1], s=50, c="red", marker='s')

pylab.savefig("linear.png")

fig = pylab.figure()
from mpl_toolkits.mplot3d import Axes3D
ax = Axes3D(fig)
ax.view_init(30,-110)

ax.scatter3D(map(vlen,a), a[:,0], a[:,1], s=30, c="blue")
ax.scatter3D(map(vlen,b), b[:,0], b[:,1], s=50, marker="s", c="red")

pylab.savefig("tranformed.png")

pylab.show()

Take — adapt — use for anything you like; you can rotate the 3D plot in the window that is shown, and you can save the figures as PDF etc. Unfortunately, the sizing of markers in the 3D plot is not yet implemented in the latest matplotlib (0.99.1.2-3), so this only looks good with the latest SVN build.

HTTP File Uploads in PHP

And by this I mean uploading files from a PHP script to another HTTP URL, essentially submitting a web-form with a file-field from PHP. I needed this in Organik; it took me some hours to find out how. My hacky result is here for the world to reuse:

http://github.com/gromgull/randombits/blob/master/http_file_upload.php

Enjoy.

Semantic Web Clusterball

From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

Semantic Web Clusterball

I started this as a part of the Billion Triple Challenge work; it shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG, which I could not get to embed here, so click on that image, mouse over things, and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate; mousing over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X, but it looked roughly ok in Chrome too.)

The data is extracted from the BTC triples by computing the Pay-Level-Domain (PLD, essentially the top-level domain, but with special rules for .co.uk domains and similar) for the subjects and objects and, if they differ, counting the predicates that link them. I.e. a triple:

dbpedia:Albert_Einstein rdf:type foaf:Person.

would count as a link between http://dbpedia.org and http://xmlns.com for the rdf:type predicate. Counting all links like this gives us the top cross-domain linking predicates:

predicate links
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 60,813,659
http://www.w3.org/2000/01/rdf-schema#seeAlso 16,698,110
http://www.w3.org/2002/07/owl#sameAs 4,872,501
http://xmlns.com/foaf/0.1/weblog 4,627,271
http://www.aktors.org/ontology/portal#has-date 3,873,224
http://xmlns.com/foaf/0.1/page 3,273,613
http://dbpedia.org/property/hasPhotoCollection 2,556,532
http://xmlns.com/foaf/0.1/img 2,012,761
http://xmlns.com/foaf/0.1/depiction 1,556,066
http://www.geonames.org/ontology#wikipediaArticle 735,145

Most frequent is of course rdf:type, since most schemas live on different domains from the data, and most things have a type. The ball linked above excludes rdf:type, since it’s not really a link. You can also see a version including rdf:type. The rest of the properties are more link-like; I am not sure what is going on with akt:has-date though, anyone?
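
For the curious, the counting step boils down to something like this sketch (not the actual code; pld() here is a crude stand-in for a proper pay-level-domain function, and triples stands in for an iterator over the BTC dump):

from collections import defaultdict
from urlparse import urlparse

def pld(uri):
    # crude stand-in: proper PLD extraction needs the public-suffix rules
    return urlparse(uri).netloc

# stand-in for an iterator over the BTC dump
triples = [('http://dbpedia.org/resource/Albert_Einstein',
            'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
            'http://xmlns.com/foaf/0.1/Person')]

counts = defaultdict(int)   # (subject PLD, object PLD, predicate) -> links
for s, p, o in triples:
    if not (s.startswith('http') and o.startswith('http')):
        continue   # skip literals and blank nodes
    if pld(s) != pld(o):
        counts[pld(s), pld(o), p] += 1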

The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: Wikipedia Clusterball. His is nicer, since he has core nodes inside the ball. He points out that the “clustering” of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My “clustering” method was very simple: I swap each node with the one giving the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It’s essentially a greedy hill-climbing method, and I am sure it’s far from optimal, but it does at least something. For comparison, here is the ball on top without clustering applied.
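
A sketch of that greedy optimisation (not the actual code behind the figure, and it accepts any improving swap rather than only the single best one, but it is the same hill-climbing idea):

import random

def total_edge_length(order, edges):
    # edges: dict mapping (node_a, node_b) -> number of links
    pos = dict((node, i) for i, node in enumerate(order))
    n = len(order)
    total = 0
    for (a, b), w in edges.items():
        d = abs(pos[a] - pos[b])
        total += w * min(d, n - d)   # distance around the circle
    return total

def cluster(nodes, edges, restarts=5):
    best_score, best_order = None, None
    for _ in range(restarts):
        order = list(nodes)
        random.shuffle(order)
        score = total_edge_length(order, edges)
        improved = True
        while improved:
            improved = False
            for i in range(len(order)):
                for j in range(i + 1, len(order)):
                    order[i], order[j] = order[j], order[i]   # try a swap
                    new_score = total_edge_length(order, edges)
                    if new_score < score:
                        score, improved = new_score, True
                    else:   # no improvement, swap back
                        order[i], order[j] = order[j], order[i]
        if best_score is None or score < best_score:
            best_score, best_order = score, list(order)
    return best_order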

The whole thing was of course hacked up in python; the javascript for the mouse-over etc. of the SVG uses prototype. I wanted to share the code, but it’s a horrible mess, and I’d rather not spend the time to clean it up – see the update below. The data used to generate this is available either as a download: data.txt.gz (19Mb, 10,000 host-pairs and top 500 predicates), or as a subset on Many Eyes (2,500 host-pairs and top 100 predicates – uploading 19Mb of data to Many Eyes crashed my Firefox :)

UPDATE: Richard Stirling asked for the code, so I spent 30 min cleaning it up a bit; grab it here: swball_code.tar.gz. It includes the data and code needed to recreate the example above.