Posts categorized “Semantic Web”.

Google Reader and the death of the open web

Well fuck. Google kills Google Reader.

Some random points:

This is upsetting because I use the damn thing a lot, as @bobdc says:

At some point RSS became synonymous with Google Reader. Now Google tells me that although to me it seems like “everyone” uses Reader, this is because I have geek friends, and “normal” people have no idea Google Reader exists. It is safe to say that these normal people have even less idea that RSS/Atom exists and that there are choices outside of Google Reader. If not the last nail in the coffin for RSS, it is certainly the BIG nail that made it impossible to pry the coffin open again. (It was already clear that RSS was geeky-poweruser only when Twitter killed the Atom feeds last year). Even if there now may be a market for NEW RSS readers, and maybe even some innovation in that space, it won’t matter any more, since people will just stop publishing RSS feeds.

RSS was probably the last standards-driven ecosystem where real integration of stuff happened in ways people probably didn’t foresee. I am sad to say so, but I cannot see FOAF popping up to kill the big social networks any time soon. OpenID became “sign-in with Google / Facebook / Yahoo” and the places where I can type in my OWN OpenID URL are again only hard-core tech places. OAuth2 is a mess where there is a Google version and a Facebook version, and interop is on paper only. We lost the open web and now we have APIs to a nice corporate sanitized social network.

As web-archaeologists well know, RSS was initially RDF based (RDF Site Summary, before it became Really Simple Syndication). Now, look at your Facebook feed: there is clearly some sort of “ontology” here. There are basic updates, essentially just some text, but then there are updates with pictures and updates with youtube videos. There are updates where actions are possible: invites to parties, invites to terrible games, invites to share your birthday! apps, etc. Wouldn’t it have been cool if we had kept an RDF extensible version of RSS? Then I could have published my extensions to RSS items somewhere; if your client didn’t support them, it would fall back on default rendering, giving you just a URL, but if you had the right widget, you would get a richer representation of the thing (and these days maybe I could even publish it with some JavaScript snippet to render the widget, like Twitter does with the twitter widget I embedded above) … old semantic web dreams never die – they just get covered in a layer of cynicism!

Some basic BTC2012 Stats

(The figure shows the biggest domains publishing data, and links between them – mouse over the edges to highlight, choose the linking predicate from the drop-down list)

So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. The previous years I’ve done this all using unix tools, sed/awk/grep and friends. This year I figured I’d do it all in python. To get reasonable performance two things were crucial:

  • the python gzip module has decompression implemented in python, using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!)
  • I wrote an N-Quads “parser” in cython, taking advantage of the very regular output of ld-spider

This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.
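
Roughly, that kind of pipeline can be sketched like this in Python 3. This is a minimal sketch, not the actual code: it assumes a `gunzip` binary on the PATH, the function names are made up for illustration, and a plain whitespace split stands in for the cython parser. The split only works for pulling out the subject and predicate, since literals in the object position may contain spaces.

```python
import subprocess
from collections import Counter


def gunzip_lines(path):
    """Yield decoded lines from a .gz file by reading from a gunzip pipe."""
    proc = subprocess.Popen(["gunzip", "-c", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8", errors="replace")
    finally:
        proc.stdout.close()
        proc.wait()


def count_predicates(lines):
    """Count predicate occurrences in an iterable of N-Quads lines."""
    counts = Counter()
    for line in lines:
        # Subjects and predicates contain no spaces in N-Quads, so two
        # splits are enough to isolate the predicate.
        parts = line.split(None, 2)
        if len(parts) == 3 and not line.startswith("#"):
            counts[parts[1]] += 1
    return counts
```

Something like `count_predicates(gunzip_lines("data.nq.gz")).most_common(10)` would then give a top-10 list, where `data.nq.gz` is a placeholder file name.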

I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:

  • 1.4B triples all in all
  • 1082 different namespaces are used
  • 9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domains)
  • 183M unique subjects are described
  • 57k unique predicates
  • 192M unique resources as objects
  • 156M unique literals
  • 152M triples are rdf:type statements, 296k types are used. Resources with multiple types are common: 45M resources have two types, 40M just one.


Top 10 Context PLDs

count context pld


Top 10 Namespaces

count namespace


Top 10 Types

count type
39,345,307 intervals:Second
39,345,280 intervals:CalendarSecond
12,841,127 foaf:Person
7,623,831 foaf:Document
1,896,136 qb:Observation
1,851,173 fb:common.topic
1,712,877 intervals:Minute
1,712,875 intervals:CalendarMinute
1,328,921 owl:Thing
1,280,763 metalex:BibliographicExpression

As usual, although many namespaces/hosts/types are used, the distribution is skewed: the most common elements quickly account for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context-plds/namespaces occurring more than N times (the X axis is logarithmic):

So the steeper the curve, the longer the tail of infrequently occurring elements. For example, less than 5% of types occur more than 100 times, but very few context-pld’s occur less than 10 times. However, when you look at the actual density, the picture changes, here we plot the cumulative density, so although most types occur less than 100 times, the majority of the data uses only the most frequent types:

So the steeper the curve at the end, the more of the data is covered by the few most frequent elements. For example, the top 5% most frequent namespaces and context-plds cover over 99% of the data, but the top 5% of types cover “only” 97%.
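
To make the “top 5% covers N% of the data” idea concrete, here is a small Python 3 sketch. The counts are toy numbers invented for illustration, not BTC data, and `coverage_of_top` is a hypothetical helper name.

```python
from collections import Counter


def coverage_of_top(counts, fraction):
    """Share of all occurrences covered by the top `fraction` of elements."""
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Toy data: two very common types and a long tail of rare ones.
type_counts = Counter({"foaf:Person": 9000, "foaf:Document": 900})
type_counts.update({"ex:Rare%d" % i: 1 for i in range(100)})

print(coverage_of_top(type_counts, 0.05))
```

With 102 distinct types, the top 5% is just five of them, yet they account for nearly all occurrences, which is the same shape the graph shows.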

A different (maybe useless?) view of this, is this histogram with exponentially increasing bucket-sizes, again with a log-scale, so they look the same size:

Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distribution of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I won’t put it here.
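
For reference, “exponentially increasing bucket-sizes” can be read as: bucket b holds everything occurring between 2^b and 2^(b+1)-1 times. A minimal Python 3 sketch of that bucketing follows; base 2 is an assumption (the post does not say which base the plot used), and the function name is made up.

```python
from collections import Counter


def log2_histogram(occurrences):
    """Bucket b counts the elements occurring in [2**b, 2**(b+1)) times."""
    buckets = Counter()
    for n in occurrences:
        if n > 0:
            # bit_length() - 1 is floor(log2(n)) for positive integers
            buckets[n.bit_length() - 1] += 1
    return dict(sorted(buckets.items()))


print(log2_histogram([1, 1, 2, 3, 4, 7, 8, 1000]))
```

On a log-scale axis these buckets all come out the same width, which is why the bars look evenly sized.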

I don’t even want to talk about how long I spent making these graphs. I wanted to graph this since the first BTC dataset I looked at, but previously always fell back on “top n% of the elements cover n% of the data” tables.
The graphs are all done in pylab, exported as SVG (yay!). Playing with them was all done with the ipython notebook, which is really pleasant to work with.

Finally – the Chord-diagram on top shows links between context PLDs – mouse over each host to see outgoing links. This is only the top 19 PLD domains and the top 10 properties linking domains that themselves publish RDF data – this is important, as there are predicates used to link to non-semantic web resources that dominate otherwise. The graphic and interaction are all done with the excellent D3 Library.

I will try to come up with some more interesting visualisations based on links between instances of various types soon!

RDFLib & Linked Open Data on the Appengine

Recently I’ve had the chance to use RDFLib a fair bit at work, and I’ve fixed lots of bugs and also written a few new bits. The new bits generally started as write-once and forget things, which I then needed again and again and I kept making them more general. The end result (for now) is two scripts that let you go from this CSV file to this webapp (via this N3 file). Actually – it’ll let you go from any CSV file to a Linked Open Data webapp, the app does content-negotiation and SPARQL as well as the HTML you just saw when you clicked on the link.
In the court

The dataset in this case, is a small collection of King Crimson albums – I spent a long time looking for some CSV data in the wild that had the features I wanted to show off, but failed, and copy/pasted this together from the completely broken CSV dump of the Freebase page.

To convert the CSV file you need a config file giving URI prefixes and some details on how to handle the different columns. The config file for the King Crimson albums looks like:




col1=date("%d %B %Y")
col2=split(";", uri("", ""))


With this config file and the current HEAD of rdfextras you can run:

python -m -f kingcrimson.config kingcrimson.csv

and get your RDF.

This tool is of course not the first or only of its kind – but it’s mine! You may also want to try Google Refine, which has much more powerful (and interactive!) editing possibilities than my hack. With the RDF extension, you can even export RDF directly.
One benefit of this script is that it’s stream-based and could be used on very large CSV files. Although, I believe Google Refine can also export actions taken in some form of batch script, but I never tried it.

With lots of shiny new RDF in my hand I wanted to make it accessible to people who do not enjoy looking at N3 in a text-editor and built the LOD application.
It’s built on the excellent Flask micro-web-framework and it’s now also part of rdfextras . If you have the newest version you can run it locally in Flask’s debug server like this:

python -m rdfextras.web.lod kingcrimson.n3

This runs great locally – and I’ve also deployed it within Apache, but not everyone has a mod_python ready Apache at hand, so I thought it would be nice to run it inside the Google Appengine.

Running the Flask app inside of appengine turned out to be amazingly easy, thanks to Francisco Souza for the pointers:

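from google.appengine.ext.webapp.util import run_wsgi_app
from rdfextras.web import lod

import rdflib

# load the data into an in-memory graph
g = rdflib.Graph()
g.load("kingcrimson.n3", format='n3')

# assumption: lod.get(graph) returns the WSGI (Flask) app that serves the graph
run_wsgi_app(lod.get(g))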


Write your app.yaml and make this your handler for /* and you’re nearly good to go. To deploy this app to the appengine you also need all required libraries (rdflib, flask, etc.) inside your app directory, a shell script for this is here:

Now, I am not really clear on the details of how the appengine works. Is this code run for every request? Or is the wsgi app persistent? When I deployed the LOD app inside apache using mod_python, it seems the app is created once, and serves many requests over its lifetime.
In any case, RDFLib has no appengine compatible persistent store (who wants to write an rdflib store on top of the appengine datastore?), so the graph is kept in memory, perhaps it is re-parsed once for each request, perhaps not – this limits the scalability of this approach in any case. I also do not know the memory limitations of the appengine – or how efficient the rdflib in-memory store really is – but I assume there is a fairly low limit on the number of triples you can serve this way. Inside apache I’ve deployed it on some hundred thousand triples in a BerkeleyDB store.

There are several things that could be improved everywhere here – the LOD app in particular has some rough edges and bugs, but it’s being used internally in our project, so we might fix some of them given time. The CSV converter really needs a way to merge two columns, not just split them.

All the files you need to run this example yourself are under: – let me know if you try it and if it works or breaks!

Trope Bingo

This week we went to see Thor, and it seemed guaranteed to be a Trope-fest, and over lunch we came up with the idea of a “trope bingo”. 30 minutes with python, SPARQL and in the last part of the afternoon and it was done.

In the end, the film was good, Thor was the Big Ham, utters the BIG NO and they have token Asian and Black Norse Gods. However, the cinema was too dark to actually play Bingo.

Today I got around to “porting” the script to PHP, so now you can play too! Click here:

Trope Bingo!

Not much to say about this one – the following query extracts all tropes from dbtropes which have more than 200 instances:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skip:
SELECT ?trope ?label ?comment WHERE {
  ?trope a skip:FeatureClass ; rdfs:label ?label ; rdfs:comment ?comment .
  { SELECT (count(*) AS ?count) ?trope WHERE { ?f a ?trope . } GROUP BY ?trope }
  FILTER (?count>200)
}

The tropes are stored in a CSV file, we pick 25 randomly. See the source.
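
The picking step is simple enough to sketch in a few lines of Python 3. This is not the actual script: it assumes the trope label sits in the first CSV column, and the in-memory CSV below is made-up stand-in data.

```python
import csv
import io
import random


def bingo_card(csv_file, size=5, seed=None):
    """Pick size*size distinct tropes at random and lay them out as a grid."""
    labels = [row[0] for row in csv.reader(csv_file) if row]
    rng = random.Random(seed)
    picked = rng.sample(labels, size * size)
    return [picked[i * size:(i + 1) * size] for i in range(size)]


# Stand-in for the real CSV file: 40 made-up trope labels, one per row.
fake_csv = io.StringIO("\n".join("Trope %d" % i for i in range(40)))
card = bingo_card(fake_csv, seed=1)
for row in card:
    print(" | ".join(row))
```

`random.sample` draws without replacement, so no trope shows up twice on the same card.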

A quick and dirty guide to YOUR first time with RDF

(This is written for the challenge from

(To save you copy/pasting/typing you can download the examples from here:

10 steps to make sense of RDF data:

  1. Install a debian or Ubuntu based system — I used Debian testing.
  2. Install rdflib and Berkeley/Sleepycat DB by doing:
    sudo apt-get install python-rdflib python-bsddb3

    (I got rdflib version 2.4.2 – if you get version 3.X.X the code may look slightly different, let me know if you cannot work out the changes on your own)

  3. Find some data — I randomly picked the data behind the BIS Research Funding Explorer. You can find the raw RDF data on the server. We will use the schema file from:

    and the education data from:

    We use the education data because it is smaller than the research data, only 500K vs 11M, and because there is a syntax error in the corresponding file for research :). In the same folders there are files called blahblah-void. These are statistics about the datasets, and we do not need them for this (see for details).

  4. Load the data, type this into a python shell, or create a python file and run it:
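    import rdflib
    g = rdflib.Graph('Sleepycat')  # assumption: Berkeley DB store, kept in the db folder step 5 mentions
    g.open("db", create=True)
    g.load("", format='nt')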

    Note that the two files are in different RDF formats: both contain triples, but one is serialized as XML, the other in an ASCII line-based format called N-Triples. You do not have to care about this, just tell rdflib to use the right parser with the format=X parameter; RDF/XML is the default.

  5. After the script has run there will be a new folder called db in the current directory, it contains the Berkeley DB files and indexes for the data. For the above example it’s about 1.5M.
  6. Explore the data a bit, again type this into a python shell:
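    • First open the DB again:
      import rdflib
      g = rdflib.Graph('Sleepycat')  # assumption: the Berkeley DB store in the db folder from step 5
      g.open("db")
      len(g)
      -- Outputs: 3690 --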

      The graph object is quite pythonic, and you can treat it like a collection of triples. Here len tells us we have loaded 3690 triples.

    • Find out what sorts of things this data describes. Things are typed by a triple with rdf:type as the predicate in RDF.
      for x in set(g.objects(None, rdflib.RDF.RDFNS["type"])): print x
      -- Outputs:

      rdflib gives you several handy functions that return python generators for doing simple triple based queries. Here we used graph.objects, which takes two parameters, the subject and predicate to filter on, and returns a generator over all matching objects. rdflib also provides constants for the well-known RDF and RDFSchema vocabularies, we used this here to get the correct URI for the rdf:type predicate.

    • Now we know the data contains some Institutions, get a list using another rdflib triple-based query:
      for x in set(g.subjects(rdflib.RDF.RDFNS["type"], rdflib.URIRef(''))): print x
      -- Outputs:
      ... (and many more) ...

      This gives us a long list of all institutions. The set call here just iterates through the generator and removes duplicates.

    • Lets look at the triples about one in more detail:
      for t in g.triples((rdflib.URIRef(''), None, None)): print map(str,t)
      -- Outputs:
      ['', '', '']
      ['', '', 'University College London']
      ... (and many more) ...

      This gives us a list of triples asserted about UCL, here we used the triples method of rdflib, it takes a single argument, a tuple representing the triple filters. The returned triples are also tuples, the map(str,t) just makes the output prettier.

  7. rdflib makes it very easy to work with triple based queries, but for more complex queries you quickly need SPARQL, this is also straightforward:
    PREFIX = """
    PREFIX owl: <>
    PREFIX foaf: <>
    PREFIX p: <>
    PREFIX aiiso: <>
    PREFIX geo: <>
    PREFIX skos: <>
    PREFIX rdf: <>
    PREFIX rdfs: <>
    """
    list(g.query(PREFIX+"SELECT ?x ?label WHERE { ?x rdfs:label ?label ; a aiiso:Institution . } "))[:10]

    The prefixes defined here at the start let us use short names instead of full URIs in the queries. The graph.query method returns a generator over tuples of variable bindings. This lists the first 10 – this is more or less the same as we did before, list all institutions, but this time also get the human readable label.

  8. Now a slightly more complicated example. Ask the knowledge base to find all institutions classified as public sector that took part in some project together:

    r=list(g.query(PREFIX+"""SELECT DISTINCT ?x ?xlabel ?y ?ylabel WHERE { 
       ?x rdfs:label ?xlabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p . 
       ?y rdfs:label ?ylabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p .
       FILTER (?x != ?y) } LIMIT 10 """))
    for x in r[:3]: print map(str,x)
    -- Outputs:
    ['', 'Nottingham University', '', 'The University of Sheffield']
    ['', 'University of Nottingham', '', 'University of Sheffield']
    ['', 'Sheffield University', '', 'University of Nottingham']

    All fairly straightforward, the FILTER is there to make sure the two institutions we find are not the same.
    (Disclaimer: there is a bug in rdflib ( that makes this query take very long :( – it should be near instantaneous, but takes maybe 10 seconds for me. )

  9. The data we loaded so far do not have any details on the project that actually got funded, only the URI, for example: You can go there with your browser and find out that this is a project called “Nuclear transfer enhancement technology for bio processing and tissue engineering” – luckily so can rdflib, just call graph.load on the URI. Content-negotiation on the server will make sure that rdflib gets machine readable RDF when it asks. A for-loop over a rdflib triple query and loading all the project descriptions is left as an exercise to the reader :)
  10. That’s it! There are many places to go from here, just keep trying things out – if you get stuck try asking questions on or in the IRC chatroom at irc:// Have fun!
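
If the triple-pattern queries in step 6 feel like magic, this toy Python 3 sketch shows the idea behind them: a pattern is a 3-tuple where None means “match anything”. It is a stand-in written for illustration, not how rdflib is implemented, and the `ex:` names are made up.

```python
def match(triples, pattern):
    """Yield triples matching an (s, p, o) pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


triples = [
    ("ex:ucl", "rdf:type", "aiiso:Institution"),
    ("ex:ucl", "rdfs:label", "University College London"),
    ("ex:proj1", "rdf:type", "ex:Project"),
]

# Like g.triples((subject, None, None)): everything said about one subject.
about_ucl = list(match(triples, ("ex:ucl", None, None)))

# Like set(g.objects(None, rdf_type)): the distinct types in the data.
types = {o for _, _, o in match(triples, (None, "rdf:type", None))}

print(about_ucl)
print(types)
```

A real store answers the same patterns from indexes instead of scanning every triple, which is what the Berkeley DB files from step 5 are for.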

Schema usage in the BTC2010 data

A little while back I spent about 1 CPU week computing which hosts use which namespaces in the BTC2010 data, i.e. I computed a matrix with hosts as rows, schemas as columns and in each cell the number of triples using that namespace that each host published. My plan was to use this to create a co-occurrence matrix for schemas, and then use this for computing similarities for hierarchical clustering. And I did. And it was not very amazing. Like Ed Summers’ neat LOD graph I wanted to use Protovis to make it pretty. Then, after making several versions, each uglier than the last, I realised that just looking at the clustering tree as a javascript datastructure was just as useful, and I gave up on the whole clustering thing.
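
For the curious, going from a host × namespace matrix to co-occurrence counts and a similarity score can be sketched like this in Python 3. The matrix is tiny made-up data, and cosine similarity is an assumption on my part here: the text above does not pin down which similarity measure fed the clustering.

```python
import math
from collections import defaultdict

# Hypothetical host -> {namespace: triple count} matrix, stored sparsely.
matrix = {
    "host-a": {"foaf": 120, "dc": 30},
    "host-b": {"foaf": 80, "rdfs": 10},
    "host-c": {"dc": 5, "rdfs": 50},
}


def cooccurrence(matrix):
    """Count, for each pair of namespaces, the hosts that use both."""
    co = defaultdict(int)
    for namespaces in matrix.values():
        used = sorted(namespaces)
        for i, a in enumerate(used):
            for b in used[i + 1:]:
                co[(a, b)] += 1
    return dict(co)


def cosine(matrix, ns_a, ns_b):
    """Cosine similarity of two namespace columns across all hosts."""
    col_a = [row.get(ns_a, 0) for row in matrix.values()]
    col_b = [row.get(ns_b, 0) for row in matrix.values()]
    dot = sum(x * y for x, y in zip(col_a, col_b))
    norm = math.sqrt(sum(x * x for x in col_a)) * math.sqrt(sum(y * y for y in col_b))
    return dot / norm if norm else 0.0


print(cooccurrence(matrix))
print(cosine(matrix, "foaf", "dc"))
```

Pairwise similarities like these are the usual input to a hierarchical clustering routine.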

Not wanting the spent CPU hours to go to waste, I instead coded up a direct view of the original matrix. Getting a bit carried away, I made a crappy non-animated, non-smooth version of Moritz Stefaner’s elastic lists using jquery-ui’s tablesorter plugin.

At you can see the result. Clicking on a namespace will show only hosts publishing triples using this schema, and only schemas that co-occur with the one you picked. Conversely, clicking on a host will show the namespaces published by that host, and only hosts that use the same schemas (this makes less intuitive sense for hosts than for namespaces). You even get a little protovis histogram of the distribution of hosts/namespaces!

The usual caveats for the BTC data apply, i.e. this is a random sampling of parts of the semantic web, it doesn’t really mean anything :)

Redundancy in the BTC2010 Data, it’s only 1.4B triples!

In a comment here, Andreas Harth mentions that publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, not only is guilty of this, by stripping the fourth quad component of the data and removing duplicate triples the original 3,171,793,030 quads turn into “only” 1,441,499,718 triples.

36,123,031 triples occurred more than once in the data, 42 of these even more than 100,000 times. The top redundant triples are:

#triples subj pred obj
470,903 prot:A rdf:type prot:Chains
470,778 prot:A prot:Chain “A”^^<>
470,748 prot:A prot:ChainName “Chain A”^^<>
413,647 rdf:type gr:BusinessEntity
366,073 foaf:Document rdfs:seeAlso
361,900 dcmitype:Text rdfs:seeAlso
254,567 swrc:InProceedings rdfs:seeAlso
184,530 foaf:Agent rdfs:seeAlso
159,627 rdfs:label “flickr(tm) wrappr”@en
150,417 rdf:type owl:ObjectProperty
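
The counting itself boils down to dropping the fourth component and tallying what is left. A minimal Python 3 sketch, with made-up context names, and with the caveat that an in-memory Counter like this would not survive 3 billion quads (at that scale you would need an on-disk store or a sort-and-count pass):

```python
from collections import Counter


def triple_counts(quads):
    """Drop the context from each (s, p, o, c) quad and count the triples."""
    return Counter((s, p, o) for s, p, o, _context in quads)


quads = [
    ("prot:A", "rdf:type", "prot:Chains", "ctx:1"),
    ("prot:A", "rdf:type", "prot:Chains", "ctx:2"),
    ("prot:A", "rdf:type", "prot:Chains", "ctx:3"),
    ("ex:b", "rdfs:label", "b", "ctx:1"),
]

counts = triple_counts(quads)
unique_triples = len(counts)
redundant = [(n, t) for t, n in counts.most_common() if n > 1]

print(unique_triples)
print(redundant)
```

Here four quads collapse into two unique triples, one of which was published in three different contexts.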

This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)

At least I can now count the most common namespaces again, this time only from triples:

#triples namespace
275,920,526 foaf
181,683,388 rdf
106,130,939 rdfs
34,959,224 dc11
16,674,480 gr
12,733,566 rss
12,368,342 dcterm
8,334,653 swrc

Compared to the numbers for quads, data-gov had exactly the same number of triples (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the most-redundant award, going from 16% of all quads to only 1.1% of all triples. Comparing change since 2009 still puts GoodRelations up high though, so no need for them to worry:

% change namespace
3969.768833 gr

And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from openlink as well, but I’ve done enough boring digging now.

Now I’m done doing boring counting – next time I hope I can do some more fun visualisation, like Ed!!

SKOS Concepts in the BTC2010 data

Again Dan Brickley is making me work :) This time looking at the “hidden” schema that is SKOS concepts, (hidden because it is not really apparent when just looking at normal rdf:types). Dan suggested looking at topics used with FOAF, i.e. objects of foaf:topic, foaf:primaryTopic and foaf:interest triples, and also things used with Dublin Core subject (I used both and

I found 1,136,475 unique FOAF topics in 8,119,528 triples, only 4,470 are bnodes, and only 265 (! i.e. only 0.02%) are literals. The top 10 topics are all of the type, with varying numbers of ?s, this is obviously what people entered into the interest field of livejournal. More interesting are perhaps the top hosts:

#triples host

So a lot of these topics are from FOAF exports of livejournal and friends. What I did not do, at least not yet, was to compare the list of FOAF topics with the things actually declared to be of type skos:Concept, this would be interesting.

Dublin Core looks quite different: it gives us 552,596 topics in 4,018,726 triples, but only 2,979 are resources, of which 921 are bnodes; the rest (i.e. 99.4%) are all literals.
The top 10 subjects according to DC are:

#triples subject
91,534 日記
38,566 写真
35,514 メル友募集
32,150 NAPLES
30,973 business
28,342 独り言
27,543 SoE Report
24,102 Congress
23,954 音楽

I do not even know what language most of these are (anyone?). Looking a bit further down the list, there are lots of government, education, crime, etc. Perhaps we can blame for this? I could have kept track of the named-graphs these came from, but I didn’t. Maybe next time.

You can download the full raw counts for all subjects: FOAF topics (7.6mb), FOAF hosts and DC Topics (23mb).

BTC2009/2010 Raw Counts

Dan Brickley asked, so I put up the complete files with counts for predicates, namespaces, types, hosts, and pay-level domains here:

Uploading them to manyeyes or similar would perhaps be more modern, but it was too much work :)

Aggregates over BTC2010 namespaces

Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more – and it gets slightly less boring.

First predicates, yesterday I had the raw count per predicate. Much more interesting is the namespaces the predicates are defined in. These are the top 10:

#triples namespace
860,532,348 rdfs
588,063,466 rdf
527,347,381 gr
284,679,897 foaf
44,119,248 dc11
17,233,778 rss
13,009,685 owl

(prefix abbreviations are made from – I am too lazy to fix the missing ones)

Now it gets interesting – because I did exactly this last year as well, and now we can compare!


In 2009 there were 3,817 different namespaces, this year we have 3,911, but actually only 2,945 occur in both. The biggest dropouts, i.e. namespaces that occurred last year, but not at all this year are:

#triples namespace
5,443,549 nie
242,427 nco
203,283 osag
159,536 nexif

I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples?


Looking the other way around, what namespaces are new and popular this year, we get:

#triples namespace
5,001,909 fec

Here the introduction of and were the big events last year.


For the namespaces that occurred both years we can find the biggest gainers. Here I calculated what ratio of the total triples each namespace constituted each year, and the increase in this ratio from 2009 to 2010. For example, GoodRelations, on top here, constituted nearly 16% of all triples in 2010, but only 2.91e-4% of all triples last year, which makes its share roughly 57,000 times bigger, for a cool increase of about 5,700,000% :)

gain namespace
57058.38 gr
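
The “gain” column is just this year’s share divided by last year’s share. A small Python 3 sketch with the GoodRelations numbers quoted in these posts; the function names are made up, and since the 2009 share is only known here to three significant digits, the result lands near the table value rather than exactly on it.

```python
def share(namespace_triples, total_triples):
    """Fraction of all triples that use a given namespace."""
    return namespace_triples / total_triples


def gain(share_now, share_before):
    """Ratio of this year's share to last year's; below 1 means a loss."""
    return share_now / share_before


# GoodRelations in 2010: predicate count over the total number of quads.
gr_2010 = share(527_347_381, 3_171_793_030)
# 2009 share as quoted in the text (2.91e-4 %), converted to a fraction.
gr_2009 = 2.91e-4 / 100

ratio = gain(gr_2010, gr_2009)
percent_increase = (ratio - 1) * 100

print(ratio, percent_increase)
```

A gain above 1 means the namespace grew its share of the data, and a gain below 1, as in the losers table, means it shrank.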


Similarly, we have the biggest losers, the ones who lost the most:

gain namespace
0.000191 sioct
0.000380 vcard
0.000418 affy
0.000719 urn://
0.000787 akts

If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)

With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.


(a bit later)

You can get the full datasets for this from many eyes: