Redundancy in the BTC2010 data: it’s only 1.4B triples!

In a comment here, Andreas Harth mentions that publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, not only is guilty of this: stripping the fourth quad component from the data and removing duplicate triples turns the original 3,171,793,030 quads into “only” 1,441,499,718 triples.
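The stripping step can be sketched with standard tools, assuming (as holds for the BTC dumps) that each quad’s context is a plain <...> URI sitting right before the trailing “ .”; the file names here are made up, and a tiny inline sample stands in for the real 3.2B-quad dump:

```shell
#!/bin/sh
# Tiny stand-in for the BTC N-Quads dump (hypothetical file name).
cat > quads.nq <<'EOF'
<http://a> <http://p> "Chain A" <http://ctx1> .
<http://a> <http://p> "Chain A" <http://ctx2> .
<http://b> <http://p> <http://o> <http://ctx1> .
EOF

# Drop the context URI (the <...> token right before the final " ."),
# leaving N-Triples, then let sort -u remove duplicate triples.
# sed is used instead of awk because literals may contain spaces.
sed 's/ <[^>]*> \.$/ ./' quads.nq | sort -u > triples.nt
wc -l < triples.nt    # 2 distinct triples remain from 3 quads
```

On the full dump you would of course need a disk-based sort; the chunked variant is described in the comments below.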

36,123,031 triples occurred more than once in the data, and 42 of them even more than 100,000 times each. The top redundant triples are:

#triples subj pred obj
470,903 prot:A rdf:type prot:Chains
470,778 prot:A prot:Chain “A”^^<>
470,748 prot:A prot:ChainName “Chain A”^^<>
413,647 rdf:type gr:BusinessEntity
366,073 foaf:Document rdfs:seeAlso
361,900 dcmitype:Text rdfs:seeAlso
254,567 swrc:InProceedings rdfs:seeAlso
184,530 foaf:Agent rdfs:seeAlso
159,627 rdfs:label “flickr(tm) wrappr”@en
150,417 rdf:type owl:ObjectProperty

This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)

At least I can now count the most common namespaces again, this time only from triples:

#triples namespace
275,920,526 foaf
181,683,388 rdf
106,130,939 rdfs
34,959,224 dc11
16,674,480 gr
12,733,566 rss
12,368,342 dcterm
8,334,653 swrc
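The namespace counting can be approximated like this (a sketch with invented data: the predicate is the second whitespace-separated field in N-Triples and never contains spaces, so plain awk is safe here, and truncating at the last “/” or “#” is only a rough stand-in for proper prefix mapping):

```shell
#!/bin/sh
# Tiny stand-in for the deduplicated triples file (hypothetical name).
cat > triples.nt <<'EOF'
<http://a> <http://xmlns.com/foaf/0.1/name> "x" .
<http://a> <http://xmlns.com/foaf/0.1/mbox> <mailto:x@y> .
<http://a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://t> .
EOF

# Take the predicate (field 2), chop the local name after the last
# '/' or '#', then count occurrences of each namespace.
awk '{ print $2 }' triples.nt \
  | sed 's|[^/#]*>$||' \
  | sort | uniq -c | sort -rn
```

On the sample above this puts the foaf namespace on top with a count of 2, mirroring the table.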

Compared to the numbers for quads, data-gov had exactly the same number of triples (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M, and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the most-redundant award, going from 16% of all quads to only 1.1% of all triples. Comparing the change since 2009 still puts GoodRelations up high though, so no need for them to worry:

% change namespace
3969.768833 gr

And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from OpenLink as well, but I’ve done enough boring digging for now.

Now I’m done with the boring counting; next time I hope I can have more fun with visualisation, like Ed!


  1. Enjoying the stats that you are posting.

  2. Nice work on all the stats!

    For posterity, the reason for the duplication is discussed here:

    …the issue has been (mostly) fixed since.

  3. Nice work! How did you remove all duplicate triples?

  4. Xin: As might be explained in an earlier (2009?) post, all the processing was done with Unix command-line tools: sort, grep, awk, uniq, etc. It’s not fast, but scalability is only limited by your disk space [and patience :) ]

    This was done by counting all the lines, then sorting the chunks individually, then merge-sorting all of them, piping the result through uniq to remove duplicate lines, and counting again.
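A toy version of that chunked pipeline (chunk size and file names invented; modern GNU sort does the external chunking itself, so a plain `sort -u big.nt` would also work):

```shell
#!/bin/sh
# Tiny stand-in for a file too big to sort in memory.
cat > big.nt <<'EOF'
c
a
b
a
c
EOF

split -l 2 big.nt chunk.                       # split into 2-line chunks
for f in chunk.*; do sort "$f" -o "$f"; done   # sort each chunk in place
sort -m chunk.* | uniq > dedup.nt              # merge pre-sorted chunks, drop duplicate lines
wc -l < dedup.nt                               # 3 unique lines from 5 input lines
```

`sort -m` only merges already-sorted inputs, so its memory use stays small no matter how large the chunks are.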
