BTC Statistics I

As I said, I wanted to try looking into the Billion Triple Challenge using unix command-line tools. The ISWC deadline set me back a bit, but now I’ve got it going.

The first step was to get rid of those pesky literals, as they contain all sorts of crazy characters that make my lazy parsing tricky. A bit of Python later and I converted:


<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/name> "Edd Dumbill" <http://www.w3.org/People/Berners-Lee/card> .
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/nick> "edd" <http://www.w3.org/People/Berners-Lee/card> .
<http://bblfish.net/people/henry/card#me> <http://xmlns.com/foaf/0.1/name> "Henry Story" <http://www.w3.org/People/Berners-Lee/card> .

to

<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/name> "000_1" <http://www.w3.org/People/Berners-Lee/card> .
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/nick> "000_2" <http://www.w3.org/People/Berners-Lee/card> .
<http://bblfish.net/people/henry/card#me> <http://xmlns.com/foaf/0.1/name> "000_3" <http://www.w3.org/People/Berners-Lee/card> .

i.e. each literal was replaced with chunknumber_literalnumber, and the actual literals were stored in another file. Now it was open for simply splitting the lines by space and using cut, awk, sed, sort, uniq, etc. to do everything I wanted. (At least, that’s what I thought; it turned out the initial data contained URIs with spaces, and my “parsing” broke … I then fixed it by replacing > < with >\t<, used tab as the field delimiter, and I was laughing. The data has since been fixed, but I kept my original since I was too lazy to download 17GB again.)
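
For completeness, the pre-processing looked roughly like the sketch below. This is not the original script: the file names, the chunk label "000" and the literal-matching regular expression are my assumptions about how it was done.

import re

# a literal with optional datatype or language tag, as it appears in the dump
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?')

def strip_literals(in_path, out_path, lit_path, chunk="000"):
    counter = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst, \
         open(lit_path, "w", encoding="utf-8") as lits:
        for line in src:
            def replace(match):
                nonlocal counter
                counter += 1
                key = "%s_%d" % (chunk, counter)
                lits.write("%s\t%s\n" % (key, match.group(0)))  # keep the real literal
                return '"%s"' % key                             # placeholder in the quad
            line = LITERAL.sub(replace, line)
            # and make the quads safely splittable, as described above
            dst.write(line.replace("> <", ">\t<"))

strip_literals("chunk-000.nq", "chunk-000.clean.nq", "chunk-000.literals.tsv")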

So, now I’ve computed a few random statistics, nothing amazingly interesting yet. I’ll put a bit here at a time; today: THE PREDICATES!

The full data set contains 136,188 unique predicates; of these:

  • 112,966 occur more than once
  • 62,937 more than 10 times
  • 24,125 more than 100 times
  • 8,178 more than 1,000 times
  • 2,045 more than 10,000 times

623 of them have URIs starting with <file://> – they will certainly be very useful for the semantic web.

Note that although 136k different predicates seems like a great deal, many of them are hardly used at all; in fact, if you only look at the top 10,000 most-used predicates, you still cover 92% of the triples.
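
These numbers fall out of the cut/sort/uniq style of pipeline over the tab-delimited files mentioned above; a rough Python equivalent (the file name and the field position are assumptions) looks like this:

from collections import Counter

counts = Counter()
with open("chunk-000.clean.nq", encoding="utf-8") as quads:
    for line in quads:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1   # second tab-separated field = predicate

print(len(counts), "unique predicates")
for threshold in (1, 10, 100, 1000, 10000):
    print(sum(1 for c in counts.values() if c > threshold),
          "occur more than", threshold, "times")

# coverage of the top-N most used predicates
total = sum(counts.values())
top = sum(c for _, c in counts.most_common(10000))
print("top 10,000 predicates cover %.0f%% of the triples" % (100.0 * top / total))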

As also mentioned on the official BTC stats page, the most used predicates are:

triples predicate
156,448,093 http://dbpedia.org/property/wikilink
143,293,758 rdf:type
53,869,968 rdfs:seeAlso
35,811,115 foaf:knows
32,895,374 foaf:nick
23,266,469 foaf:weblog
22,326,441 dc:title
19,565,730 akt:has-author
19,157,120 sioc:links_to
18,257,337 skos:subject

Note that these are computed from the whole corpus, not just a sample, and for instance for the top property there is a difference of a massive 13,139 compared to the official count (about 0.008% of 156,448,093). That means the official stats are off by almost 0.01%! I don’t know how we can work under these conditions…

Moving on, I assigned each predicate to a namespace. I did this by matching them against the list at prefix.cc; if the URI didn’t start with any of those prefixes, I made the namespace the URI up to the last # or /, whichever appeared later. The most used namespaces were:

triples namespace
244,854,345 foaf
224,325,132 dbpprop
167,911,029 rdf
80,721,580 rdfs
64,313,022 akt
63,850,346 geonames
58,675,733 dc
44,572,003 rss
31,502,395 sioc
21,156,972 skos
14,801,992 geo
10,691,295 http://dbpedia.org/ontology
10,239,809 http://www.kisti.re.kr/isrl/ResearchRefOntology
9,812,367 content
9,661,682 http://www.rdfabout.com/rdf/schema/vote
8,623,124 owl
6,837,606 http://rdf.freebase.com/ns
6,813,536 xhtml
5,443,549 nie
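
For reference, the namespace-assignment rule described above could be sketched like this; the PREFIXES table here is only a tiny stand-in for the full prefix.cc list:

# Sketch of the namespace-assignment rule: longest known prefix wins,
# otherwise fall back to the URI up to the last '#' or '/'.
PREFIXES = {
    "http://xmlns.com/foaf/0.1/": "foaf",
    "http://dbpedia.org/property/": "dbpprop",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://www.w3.org/2000/01/rdf-schema#": "rdfs",
    "http://www.w3.org/2002/07/owl#": "owl",
}

def namespace_of(predicate_uri):
    for prefix, name in PREFIXES.items():
        if predicate_uri.startswith(prefix):
            return name
    # fall back to the URI up to the last '#' or '/', whichever appears later
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind("/"))
    return predicate_uri[:cut] if cut > 0 else predicate_uri

print(namespace_of("http://xmlns.com/foaf/0.1/knows"))          # foaf
print(namespace_of("http://dbpedia.org/ontology/birthPlace"))   # http://dbpedia.org/ontology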

I included the top 19, since number 19 (nie) is the NEPOMUK Information Element Ontology, and I found it funny that it was used so widely. Another funny thing is that RDFS is used more than 10x as much as OWL (even ignoring the RDF namespace, which defines things like rdf:Property that are also used by schemas).

I tried to plot this data as well, since Knud pointed out that you need a nice long-tail graph these days. However, for both predicates and namespaces there is a (relatively) huge number of things that occur only once or twice; if you plot a histogram, these dominate the whole graph, even with a logarithmic Y axis. In the end I plotted the run-length encoding of the data, i.e. how many namespaces occur once, twice, three times, etc.:

Here the X axis shows the number of occurrences and the Y axis shows how many things occur that often, i.e. the top-left point is all the random noise that occurs once, such as file:/cygdrive/c/WINDOWS/Desktop/rdf.n3, file:/tmp/filem8INvE and other useful URLs. The bottom two right points are foaf and dbpprop.
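
The plot itself was done in R (see the comments below), but the run-length-encoding computation is simple enough to sketch in Python as well; the toy ns_counts values and the log-log axes are assumptions on my part:

from collections import Counter
import matplotlib.pyplot as plt

# ns_counts maps namespace -> number of triples; these values are only a
# toy stand-in for the real counts in the table above
ns_counts = Counter({"foaf": 244854345, "dbpprop": 224325132,
                     "some-file-uri-1": 1, "some-file-uri-2": 1})

# run-length encoding of the counts: how many namespaces occur once, twice, ...
freq_of_freq = Counter(ns_counts.values())
xs, ys = zip(*sorted(freq_of_freq.items()))

plt.loglog(xs, ys, "o")          # log-log axes are an assumption about the plot
plt.xlabel("number of occurrences")
plt.ylabel("namespaces occurring that often")
plt.savefig("namespace-freq-of-freq.png")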

I don’t know about the graph – I have a feeling it lies somehow, in a way a histogram doesn’t. But I don’t know. Anyone?

Anyway – most things in the BTC that I have plotted have a similarly shaped frequency distribution, i.e. the plain predicate frequencies and the subject/object frequencies all look the same. The literals are more interesting; if I have the time I’ll write them up tomorrow. Still, it’s all pretty boring – I hope to detect duplicate triples from different sources once I’m done with this. I expect to find at least 10 copies of the FOAF schema.

5 comments.

  1. That is a beautiful graph… did you use biggles?

  2. No, as I’ve not mentioned above, it was done using R :)

  3. Which data store did you use to load the BTC data?
    I am using Virtuoso to load a sample BTC file that is named btc-2009-small.nq
    I am unable to load this data as I am having some problems with namespaces, format of the data set etc.

    Do you have any suggestions on how to load this data set into a triple store?

    Thanks,
    Pramod

  4. Hi Pramod,

    Did you actually read the post? The first line says “using unix command-line tools.” and then I go on about fixing the format. :)

    Anyway – I did not use any triple store, I kept the files in n-triples format and I’ve been processing them line by line. A full scan through all triples for some operation takes about 8 hours on my machine. This means I cannot do normal queries, but I can do many other things.

    I don’t know if any RDF store could load this data, at least not unless you have fantastic hardware (i.e. 64GB of RAM, etc.), but you may look into 4store, bigdata, or the Virtuoso you already have.

  5. […] my Billion Triple Challenge data-set statistics posts, if you only just got here, catch up on part I, II or  […]
