(still) nothing clever

Some basic BTC2012 Stats

(The figure shows the biggest domains publishing data, and the links between them – mouse over the edges to highlight, choose the linking predicate from the drop-down list)

So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. In previous years I did all of this with unix tools: sed/awk/grep and friends. This year I figured I'd do it all in python. To get reasonable performance two things were crucial:

the python gzip module is slow for this (its file handling is implemented in python); using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!)
I wrote an N-Quads “parser” in cython, taking advantage of the very regular output of ld-spider
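To make those two points concrete, here is a minimal sketch in plain python (not the actual cython code). The names gunzip_lines and parse_nquad are made up for illustration, the regex only covers the regular one-quad-per-line shape described above, and it assumes a gunzip binary is on the PATH.

```python
import re
import subprocess

# One term: <uri>, _:blank, or "literal" with optional @lang / ^^<datatype>.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[\w-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$')


def gunzip_lines(path):
    """Yield decoded lines from a .gz file by piping it through gunzip."""
    proc = subprocess.Popen(["gunzip", "-c", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8", errors="replace")
    finally:
        proc.stdout.close()
        proc.wait()


def parse_nquad(line):
    """Return (subject, predicate, object, context) or None if the line is odd."""
    m = QUAD.match(line)
    return m.groups() if m else None
```

Feeding gunzip_lines(...) through parse_nquad and skipping the None results gives a stream of 4-tuples to count things on.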

This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.
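The caching layer could look roughly like the sketch below. This is an assumption about the general idea, not the code I actually ran: CachedCounter is a hypothetical name, and the backing store is any dict-like object (a plain dict here, standing in for a LevelDB wrapper). Frequent keys stay in memory, and the infrequent ones are pushed out to the store when the in-memory table gets too big.

```python
from collections import Counter


class CachedCounter:
    """Count keys in memory, spilling the least frequent ones to a store."""

    def __init__(self, store, max_hot=100000):
        self.store = store      # dict-like: needs .get() and item assignment
        self.hot = Counter()    # in-memory counts for frequently seen keys
        self.max_hot = max_hot

    def add(self, key, n=1):
        self.hot[key] += n
        if len(self.hot) > self.max_hot:
            self._evict()

    def _evict(self):
        # Keep the most frequent half in memory, add the rest to the store.
        keep = self.max_hot // 2
        ranked = self.hot.most_common()
        for key, n in ranked[keep:]:
            self.store[key] = self.store.get(key, 0) + n
        self.hot = Counter(dict(ranked[:keep]))

    def flush(self):
        # Write everything still in memory out to the store.
        for key, n in self.hot.items():
            self.store[key] = self.store.get(key, 0) + n
        self.hot.clear()
```

Because the distribution is so skewed, most increments hit keys that are already in memory, so the slow store is touched comparatively rarely.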

I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:

1.4B triples all in all
1082 different namespaces are used
9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domain: essentially data.gov.uk instead of gov.uk, but livejournal.com instead of bob.livejournal.com)
183M unique subjects are described
57k unique predicates
192M unique resources as objects
156M unique literals
152M triples are rdf:type statements, 296k types are used. Resources with multiple types are common: 45M resources have two types, 40M just one.
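For the curious, the Pay-Level-Domain idea mentioned in the list above can be sketched like this. It is only an illustration: pay_level_domain is a made-up name, and SUFFIXES is a tiny hand-picked subset, whereas a real run would need the full public suffix list.

```python
from urllib.parse import urlsplit

# Illustrative subset only; an assumption, not the real public suffix list.
SUFFIXES = {"com", "org", "uk", "gov.uk", "co.uk"}


def pay_level_domain(url, suffixes=SUFFIXES):
    """Return the public suffix of the URL's host plus one more label."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the longest candidate suffix first, then keep one label in front of it.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[max(i - 1, 0):])
    return host  # unknown suffix: fall back to the full host name
```

With this, http://bob.livejournal.com/ and http://livejournal.com/ both end up as livejournal.com, while data.gov.uk stays separate from the rest of gov.uk.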

Top 10 Context PLDs

count	context pld
751,352,061	data.gov.uk
198,090,262	dbpedia.org
101,241,556	freebase.com
101,082,592	livejournal.com
44,331,145	opera.com
41,544,819	dbtropes.org
39,200,538	legislation.gov.uk
36,969,163	identi.ca
29,447,217	ontologycentral.com
14,949,592	rdfize.com

Top 10 Namespaces

count	namespace
336,911,630	http://www.w3.org/1999/02/22-rdf-syntax-ns#
191,669,089	http://www.w3.org/2000/01/rdf-schema#
143,650,096	http://xmlns.com/foaf/0.1/
133,845,241	http://reference.data.gov.uk/def/intervals/
115,692,342	http://www.w3.org/2006/time#
71,016,514	http://www.w3.org/2006/http#
69,715,106	http://rdf.freebase.com/ns/
66,058,545	http://www.w3.org/2004/02/skos/core#
53,246,991	http://purl.org/dc/terms/
50,444,755	http://dbpedia.org/property/

Top 10 Types

count	type
39,345,307	intervals:Second
39,345,280	intervals:CalendarSecond
12,841,127	foaf:Person
7,623,831	foaf:Document
1,896,136	qb:Observation
1,851,173	fb:common.topic
1,712,877	intervals:Minute
1,712,875	intervals:CalendarMinute
1,328,921	owl:Thing
1,280,763	metalex:BibliographicExpression

As usual, although many namespaces/hosts/types are used, the distribution is skewed: the most common elements quickly account for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context-plds/namespaces occurring more than N times (the X axis is logarithmic):

So the steeper the curve, the longer the tail of infrequently occurring elements. For example, less than 5% of types occur more than 100 times, but very few context-plds occur fewer than 10 times. However, when you look at the actual density, the picture changes. Here we plot the cumulative density, so although most types occur fewer than 100 times, the majority of the data uses only the most frequent types:

So the steeper the curve at the end, the more of the data is covered by the few most frequent elements. For example, the top 5% most frequent namespaces and context-plds cover over 99% of the data, but the top 5% of types “only” 97%.
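The two measures behind these graphs are simple enough to sketch. The function names here are made up for illustration, and counts is assumed to be a dict mapping each element (type, context-pld or namespace) to how often it occurs.

```python
import math


def share_occurring_more_than(counts, n):
    """Fraction of unique elements that occur more than n times."""
    return sum(1 for c in counts.values() if c > n) / len(counts)


def coverage_of_top(counts, fraction):
    """Fraction of all occurrences covered by the top `fraction` of elements."""
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)
```

The first function gives one point on the first curve, and calling the second with fraction=0.05 gives the kind of "top 5% cover X% of the data" numbers quoted above.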

A different (maybe useless?) view of this is this histogram with exponentially increasing bucket sizes, again on a log scale, so the buckets look the same size:

Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distributions of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I won’t put it here.
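Whatever it shows, the exponentially growing buckets themselves are easy to compute. A minimal sketch, assuming powers of two as the bucket boundaries (that choice, and the name log2_histogram, are assumptions for illustration): bucket k holds the elements that occur between 2^k and 2^(k+1) - 1 times.

```python
from collections import Counter


def log2_histogram(counts):
    """Map bucket index k to the number of elements with 2**k <= count < 2**(k+1)."""
    hist = Counter()
    for n in counts.values():
        if n > 0:
            # bit_length() - 1 is floor(log2(n)) without float rounding issues
            hist[n.bit_length() - 1] += 1
    return dict(sorted(hist.items()))
```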

I don’t even want to talk about how long I spent making these graphs. I have wanted to graph this since the first BTC dataset I looked at, but previously always fell back on “top n% of the elements cover n% of the data” tables.
The graphs are all done in pylab, exported as SVG (yay!). Playing with them was all done with the ipython notebook, which is really pleasant to work with.

Finally – the chord diagram on top shows links between context PLDs – mouse over each host to see outgoing links. This is only the top 19 PLDs and the top 10 properties linking domains that themselves publish RDF data – this is important, as there are predicates used to link to non-semantic web resources that dominate otherwise. The graphic and interaction are all done with the excellent D3 Library.
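The link counting behind the diagram can be sketched as below. This is a guess at a reasonable definition rather than the exact code: a link goes from the PLD of the context to the PLD of an object URI, and is only kept if the target PLD is itself in the set of publishing PLDs. The names count_links, pld and publishers are hypothetical, and pld is whatever function maps a URL to its Pay-Level-Domain.

```python
from collections import Counter


def count_links(quads, pld, publishers):
    """Count (source pld, predicate, target pld) links between RDF publishers."""
    links = Counter()
    for s, p, o, c in quads:
        if not o.startswith("<"):
            continue  # literals and blank nodes do not link to another domain
        src, dst = pld(c.strip("<>")), pld(o.strip("<>"))
        # Skip links within one domain and links to non-publishing domains.
        if src != dst and dst in publishers:
            links[(src, p, dst)] += 1
    return links
```

Summing the result per predicate gives the top linking properties, and summing per (source, target) pair gives the widths of the chords.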

I will try to come up with some more interesting visualisations based on links between instances of various types soon!

Posted by gromgull at 10:04 am on July 14th, 2012.
Categories: Billion Triple Challenge, Statistics.

2 comments.

I don’t understand how Freebase is listed as third in number of triples yet is completely invisible in the chord diagram, particularly when it’s basically fully linked with DBpedia.

Is it because the triples are published at a subdomain (rdf.freebase.com) or is there something else going on?

Posted by Tom Morris on December 24th, 2012.
@Tom Morris: the sub-domain shouldn’t matter, since we compute the Pay-Level-Domain (i.e. rdf.freebase.com and freebase.com both get merged).

In the data I analysed it seems Freebase is not linked to anything, so although it contains many triples, it does not have many links.

Perhaps the dbpedia links somehow did not make it into the BTC data?

Posted by gromgull on January 9th, 2013.
