This is the next part of the BTC statistics; this time I look at the subjects of the triples. Oh my, isn’t it exciting. Actually, I’ve had all the numbers for this ready for a while, but holidays and real work have kept me from typing it up. So, BTC overall contains:
- 128,079,322 unique subjects
- 118,205,618 have more than a single triple
- 19,037,202 more than 10
- 1,302,353 more than 100
- 25,741 more than 1000
- 223 more than 10000
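For the curious, here is a minimal sketch of how a breakdown like the one above could be computed. It assumes the quads have already been parsed into (subject, predicate, object, context) tuples and fit in memory, which is certainly not true for the full BTC dump; the function names and the toy data are made up for illustration, and it counts every statement once per context.

```python
from collections import Counter

def triples_per_subject(quads):
    """Count how many statements each subject appears in."""
    return Counter(subject for subject, _p, _o, _c in quads)

def subjects_over(counts, thresholds=(1, 10, 100, 1000, 10000)):
    """For each threshold t, count the subjects with more than t statements."""
    return {t: sum(1 for n in counts.values() if n > t) for t in thresholds}

# Toy data (hypothetical): (subject, predicate, object, context) tuples.
quads = [
    ("ex:a", "ex:p", "ex:x", "ex:g1"),
    ("ex:a", "ex:p", "ex:y", "ex:g1"),
    ("ex:a", "ex:q", "ex:z", "ex:g2"),
    ("ex:b", "ex:p", "ex:x", "ex:g1"),
    ("_:b0", "ex:p", "ex:x", "ex:g1"),
    ("_:b0", "ex:q", "ex:y", "ex:g1"),
]

counts = triples_per_subject(quads)
summary = subjects_over(counts, thresholds=(1, 2))
print(len(counts), "unique subjects")
for threshold, n in summary.items():
    print(n, "more than", threshold)
```

At BTC scale the same idea would more likely be done with external sorting (sort and uniq -c on the subject column) than with an in-memory Counter.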
Out of these 128M subjects, 59,423,933 are blank nodes. Only 17,089 of the subjects are file:// URIs; I really expected many more to have snuck in. At first sight it may seem very odd that so many subjects have more than 1000 triples. What could those possibly be? However, when looking at the 10 subjects with the most triples it becomes clear:
| Triples | Subject |
|---|---|
| 138,618 | swrc:InProceedings |
| 154,721 | http://sw.opencyc.org/concept/Mx4rZOAVeiYGEdqAAAACs2IMmw |
| 172,599 | http://sw.opencyc.org/2008/06/10/concept/ |
| 195,167 | dctype:Text |
| 209,623 | foaf:Document |
| 358,090 | http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA |
| 362,161 | foaf:holdsAccount |
| 479,323 | http://sw.opencyc.org/concept/ |
| 697,520 | http://sw.cyc.com/CycAnnotations_v1#label |
| 930,025 | http://sw.cyc.com/CycAnnotations_v1#externalID |
Most of these are parts of schemas, i.e. properties or classes (perhaps all? I don’t know enough about CYC use to say what http://sw.opencyc.org/2008/06/10/concept/ is). Looking at the data, out of the hundreds of thousands of triples about foaf:holdsAccount, for instance, 180,552 of the triples are:
foaf:holdsAccount rdf:type rdfs:Property .
And 180,390 are the triple:
foaf:holdsAccount rdf:type owl:InverseFunctionalProperty .
Of course each of these is in a different context. At first I thought this meant that someone was keeping hundreds of thousands of copies of the FOAF ontology around, but then all the other FOAF properties and classes would also be the subject of lots of triples. Looking at the contexts where these triples came from, there are 180,574 contexts containing the first triple. 180,389 of them are from Kanzaki’s flickr2foaf script (the remaining are 150 variations on http://xmlns.com/foaf and 30-odd random contexts). However, the output from flickr2foaf does not include the schema information; it only uses foaf:holdsAccount (and many foaf:OnlineAccount instances). My guess as to what has happened is that someone has crawled this: each profile, such as mine, will contain rdfs:seeAlso links to all my flickr contacts, and each of these pages will use foaf:holdsAccount. Then they applied some sort of inference that materialised the triples above, adding them once for each context the property appeared in. This inference cannot be basic RDFS inference, since it also adds owl:InverseFunctionalProperty, and it has not been applied to all the BTC data, but only to some contexts. I wonder if there is a way to recover which contexts this has been applied to, and then perhaps find out which triples are redundant, i.e. could be re-inferred from the other triples?
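The "which contexts contain this triple" lookup described above can be sketched roughly as follows. This is only an illustration under the assumption that the quads are available as in-memory (subject, predicate, object, context) tuples; the helper names and the context URIs in the toy data are hypothetical, not the real BTC contexts.

```python
from collections import defaultdict

def contexts_per_triple(quads):
    """Map each (s, p, o) triple to the set of contexts it occurs in."""
    seen = defaultdict(set)
    for s, p, o, c in quads:
        seen[(s, p, o)].add(c)
    return seen

def widely_repeated(quads, min_contexts):
    """Return {triple: number of contexts} for triples found in at least
    `min_contexts` distinct contexts."""
    return {
        triple: len(contexts)
        for triple, contexts in contexts_per_triple(quads).items()
        if len(contexts) >= min_contexts
    }

# Toy data (hypothetical contexts): one schema triple repeated per context.
schema_triple = ("foaf:holdsAccount", "rdf:type", "owl:InverseFunctionalProperty")
quads = [
    (*schema_triple, "http://example.org/profile/1"),
    (*schema_triple, "http://example.org/profile/2"),
    (*schema_triple, "http://example.org/profile/3"),
    ("ex:me", "foaf:holdsAccount", "_:acct", "http://example.org/profile/1"),
]

repeated = widely_repeated(quads, min_contexts=2)
print(repeated)
```

Triples that show up in a suspiciously large number of contexts are the candidates for having been materialised by inference rather than published by the source.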
Now, all these triples about foaf:holdsAccount and CYC concepts also tell us something else: this isn’t really the Billion Triple Challenge. Since many of the triples are duplicates, it is the Billion Quad Challenge, which I guess is not so catchy. After a few more CPU cycles spent on piping things through sort and uniq (my favourite activity!), I know that out of the original 1,151,383,508 quads, there are actually only 1,150,846,965 unique quads, i.e. about 500K duplicates, and more interestingly, there are only 906,166,056 unique triples, i.e. 245M duplicates. I guess it’s not the Billion Triple Challenge either :) Now, with only 900M triples, it should be easy!
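To make the quad/triple distinction concrete, here is a small sketch of the same deduplication idea, followed by the arithmetic behind the "about 500K" and "245M" figures. The real counts came from sort and uniq over the dump; this in-memory Python version, its function name, and its toy data are only illustrative assumptions.

```python
def dedup_counts(quads):
    """Return (total quads, unique quads, unique triples).

    A quad is (s, p, o, context); dropping the context gives the triple.
    """
    quads = list(quads)
    unique_quads = set(quads)
    unique_triples = {(s, p, o) for s, p, o, _c in unique_quads}
    return len(quads), len(unique_quads), len(unique_triples)

# Toy data (hypothetical): the same triple in two contexts, plus one
# exact duplicate quad.
quads = [
    ("ex:a", "ex:p", "ex:x", "ex:g1"),
    ("ex:a", "ex:p", "ex:x", "ex:g1"),  # duplicate quad
    ("ex:a", "ex:p", "ex:x", "ex:g2"),  # same triple, different context
    ("ex:b", "ex:p", "ex:y", "ex:g1"),
]
total, n_quads, n_triples = dedup_counts(quads)
print(total, n_quads, n_triples)

# The figures quoted in the post:
original_quads = 1_151_383_508
unique_quads = 1_150_846_965
unique_triples = 906_166_056
duplicate_quads = original_quads - unique_quads      # 536,543
duplicate_triples = original_quads - unique_triples  # 245,217,452
print(duplicate_quads, duplicate_triples)
```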
(BTW: No graphs this time, sorry! Also — I know I said I would talk about the literal values this time, but I changed my mind, next time!)
UPDATE:
Gianluca Demartini asked an interesting question: why are nearly half of the subjects blank nodes? I don’t really know, but I can speculate. 46% of the subject IDs are blank nodes, and these account for ≈30% of the triples in the dataset. I was hoping these 30% would be badly distributed, i.e. that there were a few blank nodes with lots and lots of triples, but alas, the blank-node/triple distribution breaks down like this:
- 57,457,905 – over 1
- 1,931,363 – over 10
- 189,487 – over 100
- 3,901 – over 1000
- 50 – over 10000
You need to include the 43,916,862 largest bnode descriptions to cover 90% of these triples, i.e. we cannot quickly ignore the biggest ones and move on with our lives. I won’t give you the top N bnodes since these are more or less randomly generated IDs, but looking at some of the “largest” bnodes, they all look like sitemap files that have been converted to RDF. For example, the largest blank node is _:genid1http-3A-2F-2Fwww-2Eindexedvisuals-2Ecom-2Findexedvisuals-2Exml, which appears to be an RDF version of the sitemap for www.indexedvisuals.com.
Now, this bnode alone is the subject of 32,984 triples, and all but one of these are triples with the property http://www.google.com/schemas/sitemap/0.84url and another bnode as the object. I guess this is the case for many of the largest bnodes, and probably for many of those object nodes in turn. (Although a highly scientific grep for bnode IDs that contain “sitemap” returns only about 100K cases; a better count is underway.)
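The "how many of the largest bnodes do you need to cover 90% of the triples" figure above is a cumulative count over the nodes sorted by size. A minimal sketch, assuming you already have the number of triples per blank node as a list of integers (the function name and the toy numbers are made up):

```python
def nodes_needed(counts, fraction=0.9):
    """Number of largest entries needed to cover `fraction` of the total."""
    target = fraction * sum(counts)
    covered = 0
    # Walk from the largest node down, accumulating triples until
    # the target share is reached.
    for i, n in enumerate(sorted(counts, reverse=True), start=1):
        covered += n
        if covered >= target:
            return i
    return 0  # only reached for an empty input

# Toy data (hypothetical): triples per blank node.
counts = [60, 25, 10, 3, 2]
needed_90 = nodes_needed(counts, fraction=0.9)
needed_50 = nodes_needed(counts, fraction=0.5)
print(needed_90, needed_50)
```

If a handful of huge nodes dominated, the answer would be tiny compared to the number of nodes; needing 43,916,862 of them says the triples are spread fairly evenly.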
So in conclusion — bah! Who knows? Who needs bnodes anyway? :)
UPDATE II:
I did a proper count of how many of the blank nodes are sitemap nodes like the indexedvisuals one above, and it’s only 27! :) There goes that theory. These 27 do account for 71,985 triples with the 0.84url predicate, but this is still a tiny fraction of the data. In the next post we will also see that a huge percentage of these bnodes have proper types, giving additional evidence that they are genuine, real, interesting parts of the data, not just some weird artifact.