Redundancy in the BTC2010 Data, it’s only 1.4B triples!

In a comment here, Andreas Harth mentions that kaufkauf.net publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, kaufkauf.net is not the only guilty party: stripping the fourth quad component from the data and removing duplicate triples turns the original 3,171,793,030 quads into “only” 1,441,499,718 triples.
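Roughly, the reduction looks like this (a minimal sketch; the chunk and output file names and the sort options are made up, and the sed pattern assumes the context is always a URI in angle brackets right before the trailing " ."):

    zcat btc-2010-chunk-*.gz \
      | sed -E 's/ <[^>]+> \.$/ ./' \
      | sort -S 2G -T /scratch \
      > triples-sorted.nt                        # context stripped, duplicates still in

    uniq triples-sorted.nt > triples-dedup.nt    # remove duplicate triples

    wc -l triples-sorted.nt triples-dedup.nt     # ~3.17G lines vs ~1.44G distinct triples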

36,123,031 triples occurred more than once in the data, and 42 of them even more than 100,000 times. The top redundant triples are:

#triples subj pred obj
470,903 prot:A rdf:type prot:Chains
470,778 prot:A prot:Chain “A”^^<http://www.w3.org/2001/XMLSchema#string>
470,748 prot:A prot:ChainName “Chain A”^^<http://www.w3.org/2001/XMLSchema#string>
413,647 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy rdf:type gr:BusinessEntity
366,073 foaf:Document rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Document%3E
361,900 dcmitype:Text rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://purl.org/dc/dcmitype/Text%3E
254,567 swrc:InProceedings rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://swrc.ontoware.org/ontology%23InProceedings%3E
184,530 foaf:Agent rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Agent%3E
159,627 http://www4.wiwiss.fu-berlin.de/flickrwrappr/ rdfs:label “flickr(tm) wrappr”@en
150,417 http://purl.org/obo/owl/OBO_REL#part_of rdf:type owl:ObjectProperty
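Once the context-stripped triples are sorted, these occurrence counts fall out of uniq -c more or less for free; a sketch, reusing the made-up triples-sorted.nt from above:

    uniq -c triples-sorted.nt | sort -rn | head        # most redundant triples first
    uniq -c triples-sorted.nt | awk '$1 > 1' | wc -l   # triples occurring more than once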

This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)

At least I can now count the most common namespaces again, this time over triples rather than quads:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
275,920,526 foaf
181,683,388 rdf
106,130,939 rdfs
34,959,224 dc11
33,289,653 http://purl.uniprot.org/core
16,674,480 gr
12,733,566 rss
12,368,342 dcterm
8,334,653 swrc
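A rough sketch of that count, assuming “namespace” means the predicate URI cut off after its last '#' or '/' (the actual counting may well have been done differently):

    awk '{ print $2 }' triples-dedup.nt \
      | sed -E 's/^<//; s/>$//; s|[^#/]*$||' \
      | sort | uniq -c | sort -rn | head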

Compared to the numbers for quads, data-gov had exactly the same number of triples (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M, and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the most-redundant award, going from 16% of all quads to only 1.1% of all triples. Comparing the change since 2009 still puts GoodRelations up high though, so no need for them to worry:

% change namespace
5579.997758 http://www.openlinksw.com/schema/attribution
4802.937827 http://www.openrdf.org/schema/serql
3969.768833 gr
2659.804256 urn:lsid:ubio.org:predicates:recordVersion
2655.011816 urn:lsid:ubio.org:predicates:lexicalStatus
2621.864105 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
2619.867255 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
1539.092138 urn:lsid:ubio.org:predicates:hasCAVConcept
1063.282710 urn:lsid:lsid.zoology.gla.ac.uk:predicates:vernacularName
928.135900 http://spiele.j-crew.de/wiki/Spezial:URIResolver
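For reference, this kind of comparison boils down to joining two “count namespace” files and computing the relative change; a sketch, assuming % change = 100 × (2010 − 2009) / 2009 and made-up file names:

    join -1 2 -2 2 <(sort -k2 ns-counts-2009.txt) <(sort -k2 ns-counts-2010.txt) \
      | awk '{ printf "%.6f %s\n", 100 * ($3 - $2) / $2, $1 }' \
      | sort -rn | head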

And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from openlink as well, but I’ve done enough boring digging now.

Now I’m done with the boring counting – next time I hope to have more fun with visualisations, like Ed!

4 comments.

  1. Enjoying the stats that you are posting.

  2. Nice work on all the stats!

    For posterity, the reason for the kaufkauf.net duplication is discussed here:

    http://groups.google.com/group/pedantic-web/browse_thread/thread/ec03de1159eb5697?pli=1

    …the issue has been (mostly) fixed since.

  3. Nice work! How did you remove all duplicate triples?

  4. Xin: As I might have explained in an earlier (2009?) post – all the processing was done with Unix command-line tools: sort, grep, awk, uniq, etc. It’s not fast, but scalability is only limited by your disk space [and patience :)! ]

    This was done by counting all the lines, sorting the chunks individually, merge-sorting them together, piping the result through uniq to remove duplicate lines, and counting again.
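    Roughly like this (chunk names and sort options made up for illustration):

        for f in chunk-*.nt; do
            sort -S 1G "$f" > "$f.sorted"           # sort each chunk on its own
        done
        sort -m chunk-*.nt.sorted | uniq | wc -l    # merge, deduplicate, count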
