Redundancy in the BTC2010 Data, it’s only 1.4B triples!

In a comment here, Andreas Harth mentions that kaufkauf.net publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, kaufkauf.net is not the only guilty party: stripping the fourth quad component from the data and removing duplicate triples turns the original 3,171,793,030 quads into “only” 1,441,499,718 triples.
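Roughly, the reduction looks like this (a minimal sketch; the chunk and output file names and the sort options are made up, and the sed pattern assumes the context is always a URI in angle brackets right before the trailing " ."):

    zcat btc-2010-chunk-*.gz \
      | sed -E 's/ <[^>]+> \.$/ ./' \
      | sort -S 2G -T /scratch \
      > triples-sorted.nt                        # context stripped, duplicates still in

    uniq triples-sorted.nt > triples-dedup.nt    # remove duplicate triples

    wc -l triples-sorted.nt triples-dedup.nt     # ~3.17G lines vs ~1.44G distinct triples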

36,123,031 triples occurred more than once in the data, and 42 of them even more than 100,000 times. The top redundant triples are:

#triples subj pred obj
470,903 prot:A rdf:type prot:Chains
470,778 prot:A prot:Chain “A”^^<http://www.w3.org/2001/XMLSchema#string>
470,748 prot:A prot:ChainName “Chain A”^^<http://www.w3.org/2001/XMLSchema#string>
413,647 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy rdf:type gr:BusinessEntity
366,073 foaf:Document rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Document%3E
361,900 dcmitype:Text rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://purl.org/dc/dcmitype/Text%3E
254,567 swrc:InProceedings rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://swrc.ontoware.org/ontology%23InProceedings%3E
184,530 foaf:Agent rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Agent%3E
159,627 http://www4.wiwiss.fu-berlin.de/flickrwrappr/ rdfs:label “flickr(tm) wrappr”@en
150,417 http://purl.org/obo/owl/OBO_REL#part_of rdf:type owl:ObjectProperty
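Once the context-stripped triples are sorted, these occurrence counts fall out of uniq -c more or less for free; a sketch, reusing the made-up triples-sorted.nt from above:

    uniq -c triples-sorted.nt | sort -rn | head        # most redundant triples first
    uniq -c triples-sorted.nt | awk '$1 > 1' | wc -l   # triples occurring more than once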

This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)

At least I can now count the most common namespaces again, this time over triples rather than quads:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
275,920,526 foaf
181,683,388 rdf
106,130,939 rdfs
34,959,224 dc11
33,289,653 http://purl.uniprot.org/core
16,674,480 gr
12,733,566 rss
12,368,342 dcterm
8,334,653 swrc
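A rough sketch of that count, assuming “namespace” means the predicate URI cut off after its last '#' or '/' (the actual counting may well have been done differently):

    awk '{ print $2 }' triples-dedup.nt \
      | sed -E 's/^<//; s/>$//; s|[^#/]*$||' \
      | sort | uniq -c | sort -rn | head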

Compared to the numbers for quads, data-gov had exactly the same number of triples (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M, and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the most-redundant award, going from 16% of all quads to only 1.1% of all triples. Comparing the change since 2009 still puts GoodRelations up high though, so no need for them to worry:

% change namespace
5579.997758 http://www.openlinksw.com/schema/attribution
4802.937827 http://www.openrdf.org/schema/serql
3969.768833 gr
2659.804256 urn:lsid:ubio.org:predicates:recordVersion
2655.011816 urn:lsid:ubio.org:predicates:lexicalStatus
2621.864105 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
2619.867255 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
1539.092138 urn:lsid:ubio.org:predicates:hasCAVConcept
1063.282710 urn:lsid:lsid.zoology.gla.ac.uk:predicates:vernacularName
928.135900 http://spiele.j-crew.de/wiki/Spezial:URIResolver
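For reference, this kind of comparison boils down to joining two “count namespace” files and computing the relative change; a sketch, assuming % change = 100 × (2010 − 2009) / 2009 and made-up file names:

    join -1 2 -2 2 <(sort -k2 ns-counts-2009.txt) <(sort -k2 ns-counts-2010.txt) \
      | awk '{ printf "%.6f %s\n", 100 * ($3 - $2) / $2, $1 }' \
      | sort -rn | head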

And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from openlink as well, but I’ve done enough boring digging now.

Now I’m done with the boring counting – next time I hope to have more fun with visualisations, like Ed!

4 comments.

  1. Enjoying the stats that you are posting.

  2. Nice work on all the stats!

    For posterity, the reason for the kaufkauf.net duplication is discussed here:

    http://groups.google.com/group/pedantic-web/browse_thread/thread/ec03de1159eb5697?pli=1

    …the issue has been (mostly) fixed since.

  3. Nice work! How did you remove all duplicate triples?

  4. Xin: As I might have explained in an earlier (2009?) post – all the processing was done with Unix command-line tools: sort, grep, awk, uniq, etc. It’s not fast, but scalability is only limited by your disk space [and patience :)! ]

    This was done by counting all the lines, sorting the chunks individually, merge-sorting them together, piping the result through uniq to remove duplicate lines, and counting again.
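    Roughly like this (chunk names and sort options made up for illustration):

        for f in chunk-*.nt; do
            sort -S 1G "$f" > "$f.sorted"           # sort each chunk on its own
        done
        sort -m chunk-*.nt.sorted | uniq | wc -l    # merge, deduplicate, count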
