As I said, I wanted to try looking into the Billion Triple Challenge using Unix command-line tools. The ISWC deadline set me back a bit, but now I’ve got it going.
The first step was to get rid of those pesky literals, as they contain all sorts of crazy characters that make my lazy parsing tricky. A bit of Python later and I converted:
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/name> "Edd Dumbill" <http://www.w3.org/People/Berners-Lee/card> .
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/nick> "edd" <http://www.w3.org/People/Berners-Lee/card> .
<http://bblfish.net/people/henry/card#me> <http://xmlns.com/foaf/0.1/name> "Henry Story" <http://www.w3.org/People/Berners-Lee/card> .
into:
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/name> "000_1" <http://www.w3.org/People/Berners-Lee/card> .
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/nick> "000_2" <http://www.w3.org/People/Berners-Lee/card> .
<http://bblfish.net/people/henry/card#me> <http://xmlns.com/foaf/0.1/name> "000_3" <http://www.w3.org/People/Berners-Lee/card> .
i.e. each literal was replaced with chunknumber_literalnumber, and the actual literals were stored in another file. The data was now open to simply splitting on spaces and using cut, awk, sed, sort, uniq, etc. to do everything I wanted. (At least, that’s what I thought; as it turned out, the initial data contained URIs with spaces, and my “parsing” broke. I fixed it by replacing > < with >\t<, used tab as the field delimiter, and I was laughing. The data has now been fixed, but I kept my original since I was too lazy to download 17GB again.)
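The replacement step can be sketched roughly like this. This is only an approximation of the idea, not my actual script: the regex assumes literals are plain double-quoted strings with no escaped quotes inside, which real N-Quads data will happily violate.

```python
import re

# Rough sketch: a double-quoted literal, optionally followed by a
# language tag or a datatype. Escaped quotes are NOT handled.
LITERAL = re.compile(r'"[^"]*"(?:@[\w-]+|\^\^<[^>]*>)?')

def strip_literals(lines, chunk):
    """Replace each literal with a "chunknumber_literalnumber" placeholder.

    Returns the rewritten lines plus a lookup table mapping each
    placeholder back to the original literal (what would go in the
    separate literals file).
    """
    table = {}
    counter = 0
    out = []
    for line in lines:
        def repl(match):
            nonlocal counter
            counter += 1
            key = f"{chunk:03d}_{counter}"
            table[key] = match.group(0)
            return f'"{key}"'
        out.append(LITERAL.sub(repl, line))
    return out, table
```

With chunk 0, the first literal in the chunk becomes "000_1", the second "000_2", and so on, matching the examples above.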
So, now I’ve computed a few random statistics, nothing amazingly interesting yet. I’ll put up a bit here at a time. Today: THE PREDICATES!
The full data set contains 136,188 unique predicates. Of these:
- 112,966 occur more than once
- 62,937 more than 10 times
- 24,125 more than 100
- 8,178 more than 1000
- 2,045 more than 10000
623 of them have URIs starting with <file://> – they will certainly be very useful for the semantic web.
Note that although 136k different predicates seems like a great deal, many of them are hardly used at all; in fact, the top 10,000 most-used predicates alone cover 92% of the triples.
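The counting itself is plain sort | uniq -c territory; the same idea in Python looks roughly like this, assuming the tab-delimited format from the preprocessing step above, with the predicate as the second field:

```python
from collections import Counter

def predicate_counts(lines):
    """Count predicate frequencies from tab-delimited quad lines.

    Assumes the "> <" -> ">\t<" fix described above, so the predicate
    is simply the second tab-separated field.
    """
    return Counter(line.split("\t")[1] for line in lines)

def coverage(counts, k):
    """Fraction of all triples covered by the k most frequent predicates."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total
```

On the full data set, coverage(counts, 10000) is where the 92% figure comes from.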
As also mentioned on the official BTC stats page, the most used predicates are:
Note that these are computed from the whole corpus, not just a sample; for instance, for the top property there is a difference of a massive 13,139. That means the official stats are off by almost 0.01%! I don’t know how we can work under these conditions…
Moving on, I assigned each predicate to a namespace. I did this by matching them against the list at prefix.cc; if the URI didn’t start with any of those, I made the namespace the URI up to the last # or /, whichever appeared later. The most used namespaces were:
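The assignment rule can be sketched like this, with a stubbed-in prefix list of a few entries standing in for the full prefix.cc data:

```python
# A few sample entries only -- the real prefix.cc list is much longer.
PREFIX_CC = [
    "http://xmlns.com/foaf/0.1/",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
]

def namespace(uri):
    """Map a predicate URI to a namespace.

    First try the known prefixes; otherwise fall back to everything up
    to and including the last '#' or '/', whichever appears later.
    """
    for prefix in PREFIX_CC:
        if uri.startswith(prefix):
            return prefix
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri
```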
I included the top 19, since number 19 is the NEPOMUK Information Element Ontology, and I found it funny that it was used so widely. Another funny thing is that RDFS is used more than 10x as much as OWL (even ignoring the RDF namespace, which defines things like rdf:Property that are also used by schemas). I tried to plot this data as well, since Knud pointed out that you need a nice long-tail graph these days. However, for both predicates and namespaces there is a (relatively) huge number of things that occur only once or twice; if you plot a histogram, these dominate the whole graph, even with a logarithmic Y axis. In the end I plotted the run-length encoding of the data, i.e. how many namespaces occur once, twice, three times, etc.:
Here the X axis shows the number of occurrences and the Y axis shows how many things occur that often. I.e. the top-left point is all the random noise that occurs once, such as file:/cygdrive/c/WINDOWS/Desktop/rdf.n3, file:/tmp/filem8INvE and other useful URLs. The bottom two points on the right are foaf and dbprop.
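What I’m calling a “run length encoding” here is really just a frequency-of-frequencies count, which is a one-liner once you have the namespace (or predicate) counts:

```python
from collections import Counter

def frequency_of_frequencies(counts):
    """Given item -> occurrence count, return occurrence -> number of
    items occurring that often (the points plotted in the graph above)."""
    return Counter(counts.values())
```

Plotting the result on log-log axes gives the graph above: the mass of things seen once or twice on the left, foaf and dbprop on the far right.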
I don’t know about the graph – I have a feeling it lies somehow, in a way a histogram doesn’t. But I don’t know. Anyone?
Anyway – most things of the BTC I have plotted have a similarly shaped frequency distribution: the plain predicate frequencies and the subject/object frequencies all look the same. The literals are more interesting; if I have the time I’ll write them up tomorrow. Still, it’s all pretty boring – I hope to detect duplicate triples from different sources once I’m done with this. I expect to find at least 10 copies of the FOAF schema.