BTC Statistics I

As I said, I wanted to try looking into the Billion Triple Challenge using unix command-line tools. The ISWC deadline set me back a bit, but now I’ve got it going.

The first step was to get rid of those pesky literals, as they contain all sorts of crazy characters that make my lazy parsing tricky. A bit of python later and I converted:

<> <> "Edd Dumbill" <> .
<> <> "edd" <> .
<> <> "Henry Story" <> .

to:

<> <> "000_1" <> .
<> <> "000_2" <> .
<> <> "000_3" <> .

i.e. each literal was replaced with chunknumber_literalnumber, and the actual literals were stored in another file. Now it was open for simply splitting the files by space and using cut, awk, sed, sort, uniq, etc. to do everything I wanted. (At least, that’s what I thought. As it turned out, the initial data contained URIs with spaces, and my “parsing” broke … then I fixed it by replacing > < with >\t<, used tab as the field delimiter, and I was laughing. The data has since been fixed, but I kept my original since I was too lazy to download 17GB again.)
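
The literal replacement described above could look roughly like the sketch below. This is not the original script: the regular expression, the function name `replace_literals` and the example.org URIs are all assumptions made for illustration, and the regex only handles plain double-quoted literals with an optional language tag or datatype.

```python
import re

# Assumed literal syntax: a double-quoted string (backslash escapes allowed),
# optionally followed by a language tag (@en) or a datatype (^^<uri>).
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?')


def replace_literals(lines, chunk="000"):
    """Swap every literal for a chunknumber_literalnumber placeholder.

    Returns the rewritten lines plus a dict mapping each placeholder id
    to the original literal, so the literals can be stored separately.
    """
    literals = {}
    rewritten = []
    for line in lines:
        def substitute(match):
            key = f"{chunk}_{len(literals) + 1}"
            literals[key] = match.group(0)
            return f'"{key}"'
        rewritten.append(LITERAL.sub(substitute, line))
    return rewritten, literals


# Hypothetical input; the URIs are made up for the example.
sample = [
    '<http://example.org/a> <http://example.org/name> "Edd Dumbill" <http://example.org/src> .',
    '<http://example.org/a> <http://example.org/nick> "edd" <http://example.org/src> .',
]
rewritten, literals = replace_literals(sample)
```

After this step the placeholders contain no spaces or odd characters, which is what makes the naive field splitting workable.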

So, now I’ve computed a few random statistics, nothing amazingly interesting yet. I’ll put up a bit here at a time. Today: THE PREDICATES!

The full data set contains 136,188 unique predicates. Of these:

  • 112,966 occur more than once
  • 62,937 more than 10 times
  • 24,125 more than 100
  • 8,178 more than 1000
  • 2,045 more than 10000

623 of them have URIs starting with <file://> – they will certainly be very useful for the semantic web.

Note that although 136k different predicates seems like a great deal, many of them are hardly used at all. In fact, if you only look at the top 10,000 most used predicates, you still cover 92% of the triples.
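
The numbers above were computed with command-line tools (something along the lines of cut, sort and uniq -c on the predicate column). As a rough Python equivalent, assuming tab-separated lines with the predicate in the second field, the thresholds and the top-N coverage could be computed like this. The function names and the tiny sample are made up for illustration.

```python
from collections import Counter


def predicate_counts(lines):
    """Count how often each predicate (second tab-separated field) occurs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # skip anything that does not look like a statement
            counts[fields[1]] += 1
    return counts


def more_than(counts, threshold):
    """Number of predicates that occur more than `threshold` times."""
    return sum(1 for n in counts.values() if n > threshold)


def coverage(counts, top_n):
    """Fraction of all statements covered by the `top_n` most used predicates."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(top_n))
    return top / total if total else 0.0


# Hypothetical four-statement sample.
sample = [
    "<s1>\t<p1>\t<o1>\t<c> .",
    "<s2>\t<p1>\t<o2>\t<c> .",
    "<s3>\t<p1>\t<o3>\t<c> .",
    "<s1>\t<p2>\t<o1>\t<c> .",
]
counts = predicate_counts(sample)
```

Here `more_than(counts, 1)` corresponds to the "occur more than once" bullet, and `coverage(counts, 10000)` to the 92% figure on the real data.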

As also mentioned on the official BTC stats page, the most used predicates are:

triples predicate
143,293,758 rdf:type
53,869,968 rdfs:seeAlso
35,811,115 foaf:knows
32,895,374 foaf:nick
23,266,469 foaf:weblog
22,326,441 dc:title
19,565,730 akt:has-author
19,157,120 sioc:links_to
18,257,337 skos:subject

Note that these are computed from the whole corpus, not just a sample, and for instance for the top property there is a difference of a massive 13,139. That means the official stats are off by almost 0.01%! I don’t know how we can work under these conditions…

Moving on, I assigned each predicate to a namespace. I did this by matching them against a list of known namespace prefixes; if the URI didn’t start with any of those, I made the namespace the URI up to the last # or /, whichever appeared later. The most used namespaces were:

triples namespace
244,854,345 foaf
224,325,132 dbpprop
167,911,029 rdf
80,721,580 rdfs
64,313,022 akt
63,850,346 geonames
58,675,733 dc
44,572,003 rss
31,502,395 sioc
21,156,972 skos
14,801,992 geo
9,812,367 content
8,623,124 owl
6,813,536 xhtml
5,443,549 nie
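
The namespace assignment behind this table can be sketched as follows. The three entries in `KNOWN_PREFIXES` are just a small illustrative subset (they are the real foaf, rdf and rdfs namespace URIs, but not the full list that was used), and the function name is an assumption.

```python
# Illustrative subset of a prefix list; the real list was much longer.
KNOWN_PREFIXES = {
    "http://xmlns.com/foaf/0.1/": "foaf",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs",
}


def namespace_of(uri, known=KNOWN_PREFIXES):
    """Return the prefix name for a known namespace, else a guessed namespace URI."""
    for namespace, name in known.items():
        if uri.startswith(namespace):
            return name
    # Fallback: keep everything up to and including the last '#' or '/',
    # whichever appears later in the URI.
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1] if cut >= 0 else uri
```

For example, `namespace_of("http://xmlns.com/foaf/0.1/knows")` gives `foaf`, while an unknown URI such as `http://example.org/vocab#thing` falls back to `http://example.org/vocab#`. With overlapping prefixes you would want to try the longest one first, which this sketch does not do.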

I included the top 15, since number 15 is the NEPOMUK Information Element Ontology, and I found it funny that it was used so widely. Another thing that is funny is that RDFS is used almost 10x as much as OWL (even ignoring the RDF namespace, which defines things like rdf:Property that are also used by schemas).

I tried to plot this data as well, since Knud pointed out that you need a nice long-tail graph these days. However, for both predicates and namespaces there are a (relatively) huge number of things that only occur once or twice, and if you plot a histogram these dominate the whole graph, even with a logarithmic Y axis. In the end I’ve ended up plotting the run length encoding of the data, i.e. how many namespaces occur once, twice, three times, etc.:

Here the X axis shows the number of occurrences and the Y axis shows how many things occur this often. I.e. the top left point is all the random noise that occurs once, such as file:/cygdrive/c/WINDOWS/Desktop/rdf.n3, file:/tmp/filem8INvE and other useful URLs. The two bottom-right points are foaf and dbpprop.
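
The data behind such a plot is a count of counts, which is the same thing as run-length encoding the sorted list of frequencies. A minimal sketch, with a made-up function name and toy numbers rather than the real BTC counts:

```python
from collections import Counter


def frequency_of_frequencies(counts):
    """Map 'number of occurrences' to 'how many items occur that often'.

    `counts` maps an item (predicate, namespace, ...) to its frequency.
    The result is a sorted list of (occurrences, number_of_items) pairs,
    i.e. the X and Y values of the plot described above.
    """
    return sorted(Counter(counts.values()).items())


# Toy example: three items occur once, one occurs twice, two occur five times.
example = {"a": 1, "b": 1, "c": 1, "d": 2, "foaf": 5, "rdf": 5}
points = frequency_of_frequencies(example)
```

For the toy input, `points` is `[(1, 3), (2, 1), (5, 2)]`.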

I don’t know about the graph – I have a feeling it lies somehow, in a way a histogram doesn’t. But I don’t know. Anyone?

Anyway – most things of the BTC I have plotted have a similarly shaped frequency distribution, i.e. the plain predicate frequencies and the subject/object frequencies all look the same. The literals are more interesting; if I have the time I’ll write them up tomorrow. Still, it’s all pretty boring – I hope to detect duplicate triples from different sources once I’m done with this. I expect to find at least 10 copies of the FOAF schema.

Plotting Precision and Recall values

While helping Benjamin write our recent ISWC paper, we were torn between plotting recall, precision or the f1-measure (a non-linear combination of the two). In theory the f1-measure is the best since it leaves you with fewer numbers to display, but we focussed especially on recall, so we wanted to show that explicitly. In the end we plotted each approach we compared in a 2D plot of recall against precision. We briefly thought about showing f-measure in the same chart, but the deadline came and there was no time left.

Now the deadline is gone, but I made it work in R:

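The fun thing about this is that both Ben and I tried to “solve” the f-measure equation for either R or P, i.e. from:

$$F = \frac{2PR}{P + R}$$

get to (shown here for P; swap P and R for the other case):

$$P = \frac{FR}{2R - F}$$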

And it took me at least three attempts to get it right. I am glad I’ve not forgotten any of my school math… ahem.
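
For the record, the algebra goes F(P + R) = 2PR, then FR = P(2R - F), then P = FR / (2R - F), which only makes sense for R > F/2. The sketch below uses that to compute the points of a constant-F curve in a recall/precision plot. It is a Python illustration with made-up function names, not the R code linked below.

```python
def f1(precision, recall):
    """F1-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_for(f, recall):
    """Precision needed to reach F-measure `f` at the given recall.

    Returns None when no precision in (0, 1] can reach `f` at that recall.
    """
    denominator = 2 * recall - f
    if denominator <= 0:
        return None
    p = f * recall / denominator
    return p if p <= 1 else None


def iso_f_curve(f, steps=50):
    """(recall, precision) points along which the F-measure equals `f`."""
    points = []
    for i in range(1, steps + 1):
        r = i / steps
        p = precision_for(f, r)
        if p is not None:
            points.append((r, p))
    return points
```

Drawing `iso_f_curve(f)` for a handful of values such as 0.2, 0.4, 0.6 and 0.8 gives the contour lines that let you read the f-measure off a precision/recall scatter plot.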

Get the R code at: