(The figure shows the biggest domains publishing data and the links between them – mouse over the edges to highlight them, and choose a linking predicate from the drop-down list)
So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. In previous years I’ve done this all using unix tools, sed/awk/grep and friends. This year I figured I’d do it all in python. To get reasonable performance, two things were crucial:
- the python gzip module is slow, since much of its decompression handling is implemented in python; using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!)
- I wrote an N-Quads “parser” in cython, taking advantage of the very regular output of ld-spider
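Roughly, the two tricks look like the sketch below. This is not the original code: the real parser was written in cython, while this one uses a plain regular expression, and the names `parse_nquad` and `stream_quads` are made up for illustration. It assumes the `gunzip` binary is on the PATH and that every line holds exactly four terms followed by a dot.

```python
import re
import subprocess

# Assumption: each line is "<subject> <predicate> <object> <context> ." and the
# context is always an IRI in angle brackets. Real N-Quads allows more than this.
NQUAD = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s+(<[^>]*>)\s*\.\s*$')


def parse_nquad(line):
    """Split one N-Quads line into (subject, predicate, object, context), or None."""
    match = NQUAD.match(line)
    return match.groups() if match else None


def stream_quads(path):
    """Yield quads from a gzipped N-Quads file by reading from a gunzip pipe."""
    proc = subprocess.Popen(["gunzip", "-c", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            quad = parse_nquad(raw.decode("utf-8", errors="replace"))
            if quad is not None:  # silently skip lines that do not fit the pattern
                yield quad
    finally:
        proc.stdout.close()
        proc.wait()
```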
This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.
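The caching layer can be sketched like this. It is an illustration under assumptions, not the original code: a plain dict stands in for LevelDB, and the class name `CachedCounter`, its `capacity` parameter and the evict-the-least-used-half policy are all invented for the example.

```python
class CachedCounter:
    """Count keys, keeping often-used keys in memory and spilling the rest to a store."""

    def __init__(self, backing, capacity=100000):
        self.backing = backing    # any dict-like store; here a stand-in for LevelDB
        self.capacity = capacity  # maximum number of keys held in memory
        self.cache = {}           # key -> running count
        self.uses = {}            # key -> number of updates while cached

    def add(self, key, n=1):
        if key not in self.cache:
            if len(self.cache) >= self.capacity:
                self._evict()
            # Pull the previous total (if any) back into memory
            self.cache[key] = self.backing.get(key, 0)
            self.uses[key] = 0
        self.cache[key] += n
        self.uses[key] += 1

    def _evict(self):
        # Write the least-used half of the cached keys back to the store
        victims = sorted(self.uses, key=self.uses.get)[: max(1, len(self.cache) // 2)]
        for key in victims:
            self.backing[key] = self.cache.pop(key)
            del self.uses[key]

    def flush(self):
        """Write everything still in memory to the backing store."""
        for key, value in self.cache.items():
            self.backing[key] = value
        self.cache.clear()
        self.uses.clear()
```

Because the skew in the data is so strong, a small cache like this absorbs most updates and only the long tail ever touches the disk.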
I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:
- 1.4B triples all in all
- 1082 different namespaces are used
- 9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domain, essentially data.gov.uk, instead of gov.uk, but livejournal.com, instead of bob.livejournal.com)
- 183M unique subjects are described
- 57k unique predicates
- 192M unique resources as objects
- 156M unique literals
- 152M triples are rdf:type statements, and 296k types are used. Resources with multiple types are common: 45M resources have two types, 40M just one.
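To make the PLD idea concrete, here is a minimal sketch of how a context URL could be reduced to its pay-level domain. A proper implementation would consult the full Public Suffix List; the tiny `MULTI_LABEL_SUFFIXES` set and the function name `pay_level_domain` are assumptions made for this example only.

```python
from urllib.parse import urlparse

# Hand-picked sample of two-label public suffixes; NOT a complete list.
MULTI_LABEL_SUFFIXES = {"gov.uk", "co.uk", "ac.uk", "org.uk", "com.au"}


def pay_level_domain(url):
    """Return the registrable domain of a URL, e.g. data.gov.uk or livejournal.com."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return host
    # The public suffix is either the last label or a known two-label suffix
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    # The PLD is the public suffix plus one more label to its left
    return ".".join(labels[-(suffix_len + 1):])
```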
Top 10 Context PLDs
Top 10 Namespaces
Top 10 Types
As usual, although many namespaces/hosts/types are used, the distribution is skewed: the most common elements quickly account for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context-plds/namespaces occurring more than N times (the X axis is logarithmic):
So the steeper the curve, the longer the tail of infrequently occurring elements. For example, fewer than 5% of types occur more than 100 times, but very few context-plds occur fewer than 10 times. However, when you look at the actual density, the picture changes. Here we plot the cumulative density, so although most types occur fewer than 100 times, the majority of the data uses only the most frequent types:
So the steeper the curve at the end, the more of the data is covered by the few most frequent elements. For example, the top 5% most frequent namespaces and context-plds cover over 99% of the data, but the top 5% of types “only” 97%.
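The two quantities behind these graphs can be computed from a simple table of occurrence counts. The sketch below assumes the counts live in a dict mapping each element (type, namespace or context PLD) to how often it occurs; the function names are invented for illustration.

```python
def share_occurring_more_than(counts, n):
    """Fraction of unique elements that occur more than n times (first graph)."""
    return sum(1 for c in counts.values() if c > n) / len(counts)


def coverage_of_top(counts, fraction):
    """Fraction of all occurrences covered by the top `fraction` of elements (second graph)."""
    ordered = sorted(counts.values(), reverse=True)
    k = max(1, int(len(ordered) * fraction))  # always include at least one element
    return sum(ordered[:k]) / sum(ordered)
```

Evaluating the first function over a range of N values, or the second over a range of fractions, gives the points for each curve.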
A different (maybe useless?) view of this is the following histogram with exponentially increasing bucket sizes, again on a log scale, so the buckets look the same size:
Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distributions of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I won’t put it here.
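For what it’s worth, the bucketing itself is simple. The sketch below assumes buckets that double in size (powers of two); the base actually used for the histogram is not stated here, and the function name is made up.

```python
from collections import Counter


def log2_histogram(counts):
    """Bucket elements by occurrence count: bucket b holds counts in [2**b, 2**(b+1))."""
    hist = Counter()
    for c in counts.values():
        if c > 0:
            # For a positive integer c, bit_length() - 1 equals floor(log2(c))
            hist[c.bit_length() - 1] += 1
    return dict(sorted(hist.items()))
```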
I don’t even want to talk about how long I spent making these graphs. I have wanted to graph this since the first BTC dataset I looked at, but previously always fell back on “top n% of the elements cover n% of the data” tables.
The graphs are all done in pylab and exported as SVG (yay!). Playing with them was all done in the ipython notebook, which is really pleasant to work with.
Finally – the chord diagram at the top shows links between context PLDs – mouse over each host to see its outgoing links. It only covers the top 19 PLDs and the top 10 properties linking domains that themselves publish RDF data – this is important, as otherwise predicates used to link to non-semantic-web resources dominate. The graphic and interaction are all done with the excellent D3 library.
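A chord layout such as D3’s takes a square matrix of flows as input, so the data preparation boils down to counting links between pairs of PLDs. A minimal sketch, assuming the links have already been reduced to (source PLD, target PLD) pairs for one predicate, and assuming links within the same PLD are left out; `link_matrix` is a name invented for this example.

```python
def link_matrix(links, top_plds):
    """Count links between the given PLDs; rows are sources, columns are targets."""
    index = {pld: i for i, pld in enumerate(top_plds)}
    matrix = [[0] * len(top_plds) for _ in top_plds]
    for src, dst in links:
        # Keep only links where both ends are in the selected PLDs
        # (i.e. both publish RDF data), and skip links within one PLD
        if src in index and dst in index and src != dst:
            matrix[index[src]][index[dst]] += 1
    return matrix
```

Dumping the result with `json.dumps(matrix)` gives something that can be handed straight to the JavaScript side.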
I will try to come up with some more interesting visualisations based on links between instances of various types soon!