In a comment here, Andreas Harth mentions that kaufkauf.net publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, kaufkauf.net is not the only one guilty of this: stripping the fourth quad component from the data and removing duplicate triples turns the original 3,171,793,030 quads into “only” 1,441,499,718 triples.
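Roughly, the quad-to-triple step looks something like this (only a sketch, with made-up chunk file names; it assumes the context is always a URI sitting just before the final dot of each N-Quads line):

```sh
# Drop the context (4th element) from every N-Quads line, then count the
# distinct triples that remain. Chunk file names are invented here.
zcat btc-2010-chunk-*.nq.gz \
  | sed -E 's/ <[^>]+> \.$/ ./' \
  | LC_ALL=C sort -u \
  | wc -l
```

LC_ALL=C just keeps sort in plain byte order, which is faster and stays consistent with uniq further down.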
36,123,031 triples occurred more than once in the data, 42 of these even more than 100,000 times. The top redundant triples are:
#occurrences | subj | pred | obj |
---|---|---|---|
470,903 | prot:A | rdf:type | prot:Chains |
470,778 | prot:A | prot:Chain | "A"^^<http://www.w3.org/2001/XMLSchema#string> |
470,748 | prot:A | prot:ChainName | "Chain A"^^<http://www.w3.org/2001/XMLSchema#string> |
413,647 | http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy | rdf:type | gr:BusinessEntity |
366,073 | foaf:Document | rdfs:seeAlso | http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Document%3E |
361,900 | dcmitype:Text | rdfs:seeAlso | http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://purl.org/dc/dcmitype/Text%3E |
254,567 | swrc:InProceedings | rdfs:seeAlso | http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://swrc.ontoware.org/ontology%23InProceedings%3E |
184,530 | foaf:Agent | rdfs:seeAlso | http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Agent%3E |
159,627 | http://www4.wiwiss.fu-berlin.de/flickrwrappr/ | rdfs:label | "flickr(tm) wrappr"@en |
150,417 | http://purl.org/obo/owl/OBO_REL#part_of | rdf:type | owl:ObjectProperty |
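Counts like the ones in the table above can be produced with the same sort of pipeline, keeping the duplicates and letting uniq -c do the counting (again just a sketch, with an invented file name):

```sh
# Count how often each context-stripped triple occurs and show the most
# repeated ones.
LC_ALL=C sort btc-2010-triples-with-dups.nt \
  | uniq -c \
  | sort -rn \
  | head
```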
This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)
At least I can now count the most common namespaces again, this time over the deduplicated triples (a sketch of the counting pipeline follows the table):
#triples | namespace |
---|---|
651,432,324 | http://data-gov.tw.rpi.edu/vocab/p/90 |
275,920,526 | foaf |
181,683,388 | rdf |
106,130,939 | rdfs |
34,959,224 | dc11 |
33,289,653 | http://purl.uniprot.org/core |
16,674,480 | gr |
12,733,566 | rss |
12,368,342 | dcterm |
8,334,653 | swrc |
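The namespace tally is more of the same; here is a simplified sketch that only looks at the predicate position and chops the local name off after the last '/' or '#' (file name invented):

```sh
# Extract the predicate of each triple, strip the local name to get the
# namespace, and count the namespaces.
awk '{ print $2 }' btc-2010-triples.nt \
  | sed -E 's!([/#])[^/#>]*>$!\1!' \
  | LC_ALL=C sort \
  | uniq -c \
  | sort -rn \
  | head
```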
Compared to the numbers for quads, data-gov had exactly the same number of triples as quads (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M, and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the “most redundant” award, dropping from 16% of all quads to only 1.1% of all triples. Comparing the change since 2009 still puts GoodRelations near the top though, so no need for them to worry:
% change | namespace |
---|---|
5579.997758 | http://www.openlinksw.com/schema/attribution |
4802.937827 | http://www.openrdf.org/schema/serql |
3969.768833 | gr |
2659.804256 | urn:lsid:ubio.org:predicates:recordVersion |
2655.011816 | urn:lsid:ubio.org:predicates:lexicalStatus |
2621.864105 | urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping |
2619.867255 | urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank |
1539.092138 | urn:lsid:ubio.org:predicates:hasCAVConcept |
1063.282710 | urn:lsid:lsid.zoology.gla.ac.uk:predicates:vernacularName |
928.135900 | http://spiele.j-crew.de/wiki/Spezial:URIResolver |
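The percentages above could be computed by joining the per-namespace counts for the two years on the namespace column; this is a hypothetical sketch with invented file names, each file assumed to hold “count namespace” pairs:

```sh
# Join the 2009 and 2010 namespace counts on the namespace (field 2) and
# compute the relative change in percent.
join -j 2 <(sort -k2 ns-counts-2009.txt) <(sort -k2 ns-counts-2010.txt) \
  | awk '{ printf "%.6f %s\n", ($3 - $2) / $2 * 100, $1 }' \
  | sort -rn | head
```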
And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from openlink as well, but I’ve done enough boring digging now.
Now I’m done with the boring counting – next time I hope I can do some more fun visualisations, like Ed!!
Enjoying the stats that you are posting.
Posted by Jesse Weaver on September 18th, 2010.
Nice work on all the stats!
For posterity, the reason for the kaufkauf.net duplication is discussed here:
http://groups.google.com/group/pedantic-web/browse_thread/thread/ec03de1159eb5697?pli=1
…the issue has been (mostly) fixed since.
Posted by Aidan Hogan on August 21st, 2011.
Nice work! How did you remove all duplicate triples?
Posted by Xin on November 30th, 2011.
Xin: As I might have explained in an earlier (2009?) post – all the processing was done with unix command-line tools: sort, grep, awk, uniq etc. It’s not fast, but scalability is only limited by your diskspace [and patience :)! ]
This was done by counting all the lines, then sorting each chunk individually, then merge-sorting all the sorted chunks, piping the result through uniq to remove duplicate lines, and counting again.
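Something along these lines (chunk file names made up):

```sh
# Sort each chunk on its own, merge the pre-sorted chunks with sort -m,
# drop duplicate lines with uniq, and count what is left.
for f in chunk-*.nt; do LC_ALL=C sort "$f" > "$f.sorted"; done
LC_ALL=C sort -m chunk-*.nt.sorted | uniq | wc -l
```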
Posted by gromgull on November 30th, 2011.