Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more – and it gets slightly less boring.
First, predicates: yesterday I had the raw count per predicate. Much more interesting are the namespaces the predicates are defined in. These are the top 10:
#triples | namespace |
---|---|
860,532,348 | rdfs |
651,432,324 | http://data-gov.tw.rpi.edu/vocab/p/90 |
588,063,466 | rdf |
527,347,381 | gr |
284,679,897 | foaf |
44,119,248 | dc11 |
41,961,046 | http://purl.uniprot.org/core |
17,233,778 | rss |
13,661,605 | http://www.proteinontology.info/po.owl |
13,009,685 | owl |
(prefix abbreviations are made from prefix.cc – I am too lazy to fix the missing ones)
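The per-namespace counts above could be produced roughly like this. This is only a sketch: I am assuming a namespace is whatever comes before the last `#` or `/` of the predicate URI, and that a URI without either separator (such as some URNs) counts as its own namespace. The names `namespace_of` and `count_namespaces` are made up for the example.

```python
from collections import Counter


def namespace_of(predicate_uri: str) -> str:
    """Cut the local name off a predicate URI at the last '#' or '/' (assumed rule)."""
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind("/"))
    # No separator at all: treat the whole URI as its own namespace.
    return predicate_uri if cut < 0 else predicate_uri[:cut]


def count_namespaces(predicates):
    """Count how many times each namespace occurs in a stream of predicate URIs."""
    return Counter(namespace_of(p) for p in predicates)


# Tiny hand-made sample, one entry per quad; not taken from the BTC data.
sample = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://xmlns.com/foaf/0.1/name",
    "urn:lsid:ubio.org:predicates:recordVersion",
]
counts = count_namespaces(sample)
for ns, n in counts.most_common(10):
    print(n, "|", ns)
```

Mapping the full namespace URIs to short prefixes such as `rdfs` or `foaf` would be a separate lookup step and is left out here.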
Now it gets interesting – because I did exactly this last year as well, and now we can compare!
Dropouts
In 2009 there were 3,817 different namespaces; this year we have 3,911, but only 2,945 occur in both. The biggest dropouts, i.e. namespaces that occurred last year but not at all this year, are:
#triples | namespace |
---|---|
10,239,809 | http://www.kisti.re.kr/isrl/ResearchRefOntology |
5,443,549 | nie |
1,571,547 | http://ontologycentral.com/2009/01/eurostat/ns |
1,094,963 | http://sindice.com/exfn/0.1 |
320,155 | http://xmdr.org/ont/iso11179-3e3draft_r4.owl |
307,534 | http://cb.semsol.org/ns |
242,427 | nco |
203,283 | osag |
187,600 | http://auswiki.org/index.php/Special:URIResolver |
159,536 | nexif |
I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples.
Newcomers
Looking the other way around, at which namespaces are new and popular this year, we get:
#triples | namespace |
---|---|
651,432,324 | http://data-gov.tw.rpi.edu/vocab/p/90 |
5,001,909 | fec |
2,689,813 | http://transport.data.gov.uk/0/ontology/traffic |
543,835 | http://rdf.geospecies.org/ont/geospecies |
526,304 | http://data-gov.tw.rpi.edu/vocab/p/401 |
469,446 | http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf |
446,120 | http://education.data.gov.uk/def/school |
223,726 | http://www.w3.org/TR/rdf-schema |
190,890 | http://wecowi.de/wiki/Spezial:URIResolver |
166,511 | http://data-gov.tw.rpi.edu/vocab/p/10 |
Here the introduction of data.gov and data.gov.uk was the big event of the last year.
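Dropouts and newcomers are just set differences over the two years' namespace counts. A minimal sketch, assuming each year is available as a dict from namespace to triple count; the function name `compare_years` and the toy numbers are invented for illustration and are not the BTC figures.

```python
def compare_years(counts_old, counts_new, top=10):
    """Split namespaces into shared, dropouts and newcomers.

    Dropouts are ranked by their count in the old year,
    newcomers by their count in the new year.
    """
    old, new = set(counts_old), set(counts_new)
    shared = old & new
    dropouts = sorted(old - new, key=counts_old.get, reverse=True)[:top]
    newcomers = sorted(new - old, key=counts_new.get, reverse=True)[:top]
    return shared, dropouts, newcomers


# Toy counts, made up for the example.
counts_2009 = {"ex:kept": 100, "ex:gone-big": 50, "ex:gone-small": 5}
counts_2010 = {"ex:kept": 120, "ex:new-big": 70, "ex:new-small": 3}

shared, dropouts, newcomers = compare_years(counts_2009, counts_2010)
print(len(shared), dropouts, newcomers)
```

With the real data, the sizes of the three groups would be what gives the 3,817 / 3,911 / 2,945 figures quoted earlier.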
Winners
For the namespaces that occurred in both years we can find the biggest gainers. Here I calculated what share of the total triples each namespace constituted each year, and the ratio of the 2010 share to the 2009 share. For example, GoodRelations, on top here, constituted nearly 16% of all triples in 2010, but only 2.91e-4% of all triples last year, for a cool increase by a factor of about 57,000, i.e. roughly 5,700,000% :)
gain | namespace |
---|---|
57058.38 | gr |
2636.34 | http://www.openlinksw.com/schema/attribution |
2182.81 | http://www.openrdf.org/schema/serql |
1944.68 | http://www.w3.org/2007/OWL/testOntology |
1235.02 | http://referata.com/wiki/Special:URIResolver |
1211.35 | urn:lsid:ubio.org:predicates:recordVersion |
1208.09 | urn:lsid:ubio.org:predicates:lexicalStatus |
1194.66 | urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping |
1191.39 | urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank |
701.66 | urn:lsid:ubio.org:predicates:hasCAVConcept |
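The gain column can be reproduced along these lines. This is a sketch under the assumption that gain means the namespace's share of all triples in 2010 divided by its share in 2009, which is what the GoodRelations numbers suggest (16% divided by 2.91e-4% is in the region of 55,000). The function name `share_gains` and the toy counts are made up.

```python
def share_gains(counts_old, counts_new):
    """For namespaces present in both years, return new share / old share.

    A value above 1 means the namespace makes up a larger part of the
    data than before; a value below 1 means its share shrank.
    """
    total_old = sum(counts_old.values())
    total_new = sum(counts_new.values())
    gains = {}
    for ns in counts_old.keys() & counts_new.keys():
        share_old = counts_old[ns] / total_old
        share_new = counts_new[ns] / total_new
        gains[ns] = share_new / share_old
    return gains


# Toy counts, not the BTC figures.
old = {"ex:small": 1, "ex:big": 99}
new = {"ex:small": 50, "ex:big": 50, "ex:only-new": 0}

gains = share_gains(old, new)
for ns, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{gain:.2f} | {ns}")
```

Note that the totals include namespaces that only occur in one of the years, since the share is relative to all triples of that year.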
Losers
Similarly, we have the biggest losers, the namespaces whose share of all triples shrank the most (a gain below 1 means the share dropped):
gain | namespace |
---|---|
0.000185 | http://purl.org/obo/metadata |
0.000191 | sioct |
0.000380 | vcard |
0.000418 | affy |
0.000438 | http://www.geneontology.org/go |
0.000677 | http://tap.stanford.edu/data |
0.000719 | urn://wymiwyg.org/knobot/default |
0.000787 | akts |
0.000876 | http://wymiwyg.org/ontologies/language-selection |
0.000904 | http://wymiwyg.org/ontologies/knobot |
If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)
With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.
Update
(a bit later)
You can get the full datasets for this from Many Eyes:
[…] This post was mentioned on Twitter by Gunnar Grimnes, martin hepp. martin hepp said: Nice statistics: #goodrelations accounts for 16 % of all triples on the web & rose by 570.000 % compared to 2009: http://bit.ly/9zJDoa #lod […]
Posted by Tweets that mention (still) nothing clever — Aggregates over BTC2010 namespaces -- Topsy.com on September 2nd, 2010.
Hi Gunnar,
nice stats… as for the triple stats, did you actually use unique triples or quads? kaufkauf.net was publishing the same small set of triples under many URIs, which could be the reason for the inflated GoodRelations numbers (http://groups.google.com/group/pedantic-web/browse_thread/thread/ec03de1159eb5697/2d59d0ac5f6b4220).
Best regards,
Andreas.
Posted by Andreas Harth on September 10th, 2010.
Andreas,
This is counting the number of quads a predicate occurs in, so if kaufkauf.net publishes the same triple in two different contexts, it is counted twice. I guess dropping the fourth (context) part, removing duplicates and doing the analysis again would be very interesting. I shall see if I find the time!
Cheers!
Posted by gromgull on September 10th, 2010.
Andreas, I looked into this a bit more and put it here: http://gromgull.net/blog/2010/09/redundancy-in-the-btc2010-data-its-only-1-1b-triples/
Posted by gromgull on September 15th, 2010.