Some basic BTC2012 Stats

(The figure shows the biggest domains publishing data and the links between them. Mouse over the edges to highlight them, and choose the linking predicate from the drop-down list.)

So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. In previous years I’ve done this all using Unix tools: sed, awk, grep and friends. This year I figured I’d do it all in Python. To get reasonable performance, two things were crucial:

  • the Python gzip module does its decompression in Python; using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!)
  • I wrote an N-Quads “parser” in Cython, taking advantage of the very regular output of ld-spider
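
Roughly, the two tricks look like the sketch below. This is not the original code: `stream_gzip_lines` and `parse_nquad` are made-up names, and the parser is a plain-Python regex rather than Cython. It assumes a `gunzip` binary on the PATH and the regular, single-space-separated line layout, so it is not a full N-Quads grammar.

```python
import re
import subprocess


def stream_gzip_lines(path):
    """Yield decoded lines from a .gz file by piping it through external gunzip."""
    proc = subprocess.Popen(["gunzip", "-c", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8", errors="replace")
    finally:
        # Runs on normal exhaustion and when the caller stops iterating early.
        proc.stdout.close()
        proc.wait()


# subject, predicate, object, context, then " ." at the end of the line.
# Assumption: terms are separated by single spaces and the context is an IRI.
NQUAD = re.compile(r'^(\S+) (\S+) (.+) (<[^>]*>) \.\s*$')


def parse_nquad(line):
    """Return (subject, predicate, object, context), or None if the line does not fit."""
    match = NQUAD.match(line)
    return match.groups() if match else None
```

Usage would be something like `for line in stream_gzip_lines("chunk.nq.gz"): quad = parse_nquad(line)`, where the file name is just a placeholder.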

This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.
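
A minimal sketch of the counting-with-a-cache idea follows. The class name `CachedCounter` is hypothetical, a plain dict stands in for LevelDB (any dict-like store with `get` and item assignment would do), and the eviction policy of spilling the rarest half is my assumption for illustration, not necessarily what the original cache did.

```python
from collections import Counter


class CachedCounter:
    """Count keys in memory, spilling the rarest keys to a backing store when full."""

    def __init__(self, store, max_items=1_000_000):
        self.store = store          # dict-like stand-in for an on-disk store
        self.max_items = max_items  # how many distinct keys may stay in memory
        self.hot = Counter()

    def add(self, key, n=1):
        self.hot[key] += n
        if len(self.hot) > self.max_items:
            self._spill()

    def _spill(self):
        # Keep the most frequent half in memory, add the rest to the store.
        keep = self.max_items // 2
        ranked = self.hot.most_common()
        for key, n in ranked[keep:]:
            self.store[key] = self.store.get(key, 0) + n
        self.hot = Counter(dict(ranked[:keep]))

    def flush(self):
        """Write all remaining in-memory counts to the store."""
        for key, n in self.hot.items():
            self.store[key] = self.store.get(key, 0) + n
        self.hot.clear()
```

After a final `flush()`, the store holds the complete counts; the frequent keys rarely touch it in between, which is where the speed-up comes from.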

I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:

  • 1.4B triples all in all
  • 1082 different namespaces are used
  • 9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domains, i.e. data.gov.uk rather than gov.uk, but livejournal.com rather than bob.livejournal.com)
  • 183M unique subjects are described
  • 57k unique predicates
  • 192M unique resources as objects
  • 156M unique literals
  • 152M triples are rdf:type statements, and 296k types are used. Resources with multiple types are common: 45M resources have two types, 40M just one.
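
To make the Pay-Level-Domain idea concrete, here is a small sketch. The function name and the tiny hard-coded `PUBLIC_SUFFIXES` set are assumptions for illustration only; a real run would need the full Public Suffix List.

```python
from urllib.parse import urlsplit

# Tiny illustrative subset, NOT the real Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "uk", "gov.uk", "co.uk"}


def pay_level_domain(url):
    """Return the public suffix plus one label to its left, e.g. data.gov.uk."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host
```

With this, `http://reference.data.gov.uk/...` maps to data.gov.uk, and `http://bob.livejournal.com/...` maps to livejournal.com.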

 

Top 10 Context PLDs

count context pld
751,352,061 data.gov.uk
198,090,262 dbpedia.org
101,241,556 freebase.com
101,082,592 livejournal.com
44,331,145 opera.com
41,544,819 dbtropes.org
39,200,538 legislation.gov.uk
36,969,163 identi.ca
29,447,217 ontologycentral.com
14,949,592 rdfize.com

 

Top 10 Namespaces

count namespace
336,911,630 http://www.w3.org/1999/02/22-rdf-syntax-ns#
191,669,089 http://www.w3.org/2000/01/rdf-schema#
143,650,096 http://xmlns.com/foaf/0.1/
133,845,241 http://reference.data.gov.uk/def/intervals/
115,692,342 http://www.w3.org/2006/time#
71,016,514 http://www.w3.org/2006/http#
69,715,106 http://rdf.freebase.com/ns/
66,058,545 http://www.w3.org/2004/02/skos/core#
53,246,991 http://purl.org/dc/terms/
50,444,755 http://dbpedia.org/property/

 

Top 10 Types

count type
39,345,307 intervals:Second
39,345,280 intervals:CalendarSecond
12,841,127 foaf:Person
7,623,831 foaf:Document
1,896,136 qb:Observation
1,851,173 fb:common.topic
1,712,877 intervals:Minute
1,712,875 intervals:CalendarMinute
1,328,921 owl:Thing
1,280,763 metalex:BibliographicExpression

As usual, although many namespaces/hosts/types are used, the distribution is skewed: the most common elements quickly account for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context PLDs/namespaces occurring more than N times (the X axis is logarithmic):

So the steeper the curve, the longer the tail of infrequently occurring elements. For example, fewer than 5% of types occur more than 100 times, but very few context PLDs occur fewer than 10 times. However, when you look at the actual density, the picture changes. Here we plot the cumulative density: although most types occur fewer than 100 times, the majority of the data uses only the most frequent types:

So the steeper the curve at the end, the more of the data is covered by the few most frequent elements. For example, the top 5% most frequent namespaces and context PLDs cover over 99% of the data, but the top 5% of types “only” 97%.
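
The two quantities behind these curves can be sketched as below, assuming `counts` is a dict mapping each element (type, context PLD or namespace) to how often it occurs. The function names are made up for this example, and this is not the code that produced the graphs.

```python
def fraction_occurring_more_than(counts, n):
    """Share of unique elements that occur more than n times (first graph)."""
    return sum(1 for c in counts.values() if c > n) / len(counts)


def top_share_coverage(counts, top_fraction):
    """Share of all occurrences covered by the top_fraction most frequent elements."""
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, int(len(ranked) * top_fraction))  # always include at least one element
    return sum(ranked[:k]) / sum(ranked)
```

So a statement like “the top 5% of types cover 97% of the data” corresponds to `top_share_coverage(type_counts, 0.05)` coming out at about 0.97, with `type_counts` being whatever dict holds the per-type counts.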

A different (maybe useless?) view of this is this histogram with exponentially increasing bucket sizes, again on a log scale, so they look the same size:

Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distribution of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I won’t put it here.
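
For what it’s worth, exponentially growing buckets are easy to compute. This is a sketch under the assumption that the buckets double in size (1, 2–3, 4–7, …); the function name is made up and the original bucket boundaries may well have differed.

```python
from collections import Counter


def log2_histogram(counts):
    """Map bucket b to the number of elements whose count lies in [2**b, 2**(b+1))."""
    hist = Counter()
    for n in counts.values():
        if n > 0:
            # bit_length() - 1 equals floor(log2(n)) for positive integers.
            hist[n.bit_length() - 1] += 1
    return dict(hist)
```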

I don’t even want to talk about how long I spent making these graphs. I wanted to graph this since the first BTC dataset I looked at, but previously always fell back on “top n% of the elements cover n% of the data” tables.
The graphs are all done in pylab, exported as SVG (yay!). Playing with them was all done with the IPython notebook, which is really pleasant to work with.

Finally, the chord diagram on top shows links between context PLDs; mouse over each host to see outgoing links. This is only the top 19 PLDs and the top 10 properties linking domains that themselves publish RDF data. This restriction is important, as predicates used to link to non-semantic-web resources would otherwise dominate. The graphic and interaction are all done with the excellent D3 library.
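
The link counts feeding such a diagram could be gathered roughly as follows. This is a sketch with hypothetical names: `quads` is any iterable of (subject, predicate, object, context) tuples in N-Quads term syntax, `pld_of` is a caller-supplied function mapping a term to its PLD, and `publishing_plds` is the set of PLDs that appear as contexts. Skipping links within the same PLD is my assumption.

```python
from collections import Counter


def count_cross_domain_links(quads, pld_of, publishing_plds):
    """Count (source PLD, predicate, target PLD) links between RDF-publishing domains."""
    links = Counter()
    for _subject, predicate, obj, context in quads:
        if not obj.startswith("<"):
            continue  # literals and blank nodes cannot link to another domain
        src, dst = pld_of(context), pld_of(obj)
        # Only keep links that leave the source domain and land on a domain
        # that itself publishes RDF data.
        if src != dst and dst in publishing_plds:
            links[(src, predicate, dst)] += 1
    return links
```

Keeping the top domains and predicates is then a matter of calling `most_common()` on sums over this counter.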

I will try to come up with some more interesting visualisations based on links between instances of various types soon!

2 comments.

  1. I don’t understand how Freebase is listed as third in number of triples yet is completely invisible in the chord diagram, particularly when it’s basically fully linked with DBpedia.

    Is it because the triples are published at a subdomain (rdf.freebase.com) or is there something else going on?

  2. @Tom Morris: the sub-domain shouldn’t matter, since we compute the Pay-Level-Domain (i.e. rdf.freebase.com and freebase.com both get merged)

    In the data I analysed it seems Freebase is not linked to anything, so although it contains many triples, it does not have many links.

    Perhaps the dbpedia links somehow did not make it into the BTC data?
