(still) nothing clever

Aggregates over BTC2010 namespaces

Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more – and it gets slightly less boring.

First, predicates: yesterday I had the raw count per predicate. Much more interesting are the namespaces the predicates are defined in. These are the top 10:

#triples	namespace
860,532,348	rdfs
651,432,324	http://data-gov.tw.rpi.edu/vocab/p/90
588,063,466	rdf
527,347,381	gr
284,679,897	foaf
44,119,248	dc11
41,961,046	http://purl.uniprot.org/core
17,233,778	rss
13,661,605	http://www.proteinontology.info/po.owl
13,009,685	owl

(prefix abbreviations are made from prefix.cc – I am too lazy to fix the missing ones)
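For anyone wanting to reproduce this kind of aggregate, here is a minimal sketch in Python. It assumes a namespace is everything before the last '#' or '/' of the predicate URI, and that a URI with neither (such as the urn:lsid ones) is kept whole. That matches how the namespaces look in these tables, but it is a guess, not necessarily how the numbers here were produced. The function names and the counts are made up for illustration.

```python
from collections import Counter


def namespace_of(predicate_uri):
    """Return everything before the last '#' or '/' (assumed splitting rule).

    URIs containing neither character are returned unchanged.
    """
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind("/"))
    return predicate_uri[:cut] if cut > 0 else predicate_uri


def aggregate_by_namespace(predicate_counts):
    """Sum per-predicate counts into per-namespace counts."""
    totals = Counter()
    for predicate, count in predicate_counts.items():
        totals[namespace_of(predicate)] += count
    return totals


# Toy per-predicate counts, NOT real BTC2010 figures.
predicate_counts = {
    "http://www.w3.org/2000/01/rdf-schema#label": 5,
    "http://www.w3.org/2000/01/rdf-schema#seeAlso": 3,
    "http://xmlns.com/foaf/0.1/name": 4,
    "urn:lsid:ubio.org:predicates:recordVersion": 2,
}

totals = aggregate_by_namespace(predicate_counts)

# Print a top-10 table in the same tab-separated layout as above.
for namespace, count in totals.most_common(10):
    print(f"{count:,}\t{namespace}")
```

Replacing full namespace URIs with short prefixes would be a separate lookup step and is left out here.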

Now it gets interesting – because I did exactly this last year as well, and now we can compare!

Dropouts

In 2009 there were 3,817 different namespaces; this year we have 3,911, but only 2,945 occur in both. The biggest dropouts, i.e. namespaces that occurred last year but not at all this year, are:

#triples	namespace
10,239,809	http://www.kisti.re.kr/isrl/ResearchRefOntology
5,443,549	nie
1,571,547	http://ontologycentral.com/2009/01/eurostat/ns
1,094,963	http://sindice.com/exfn/0.1
320,155	http://xmdr.org/ont/iso11179-3e3draft_r4.owl
307,534	http://cb.semsol.org/ns
242,427	nco
203,283	osag
187,600	http://auswiki.org/index.php/Special:URIResolver
159,536	nexif

I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples.
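The year-on-year comparison boils down to set operations on the namespace names. Below is a rough sketch, assuming each year's aggregate is a plain dict from namespace to triple count. The helper names and the toy counts are hypothetical; only the three totals at the end come from the numbers above.

```python
def compare_years(counts_2009, counts_2010):
    """Split namespaces into dropouts, newcomers and those seen in both years."""
    ns_2009, ns_2010 = set(counts_2009), set(counts_2010)
    dropouts = ns_2009 - ns_2010   # present in 2009, gone in 2010
    newcomers = ns_2010 - ns_2009  # absent in 2009, present in 2010
    shared = ns_2009 & ns_2010     # present in both years
    return dropouts, newcomers, shared


def top(names, counts, n=10):
    """Return the n largest (count, namespace) pairs among the given names."""
    return sorted(((counts[ns], ns) for ns in names), reverse=True)[:n]


# Toy counts, NOT the real data.
counts_2009 = {"nie": 40, "foaf": 100, "gr": 1}
counts_2010 = {"foaf": 120, "gr": 500, "fec": 30}

dropouts, newcomers, shared = compare_years(counts_2009, counts_2010)
print(top(dropouts, counts_2009))   # dropouts ranked by their 2009 count
print(top(newcomers, counts_2010))  # newcomers ranked by their 2010 count

# Consistency check on the totals quoted in the post.
total_2009, total_2010, in_both = 3817, 3911, 2945
print(total_2009 - in_both)  # namespaces only in 2009: 872
print(total_2010 - in_both)  # namespaces only in 2010: 966
```

So of the 3,817 namespaces from 2009, 872 dropped out, and 966 of this year's 3,911 are new.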

Newcomers

Looking the other way around, at the namespaces that are new and popular this year, we get:

#triples	namespace
651,432,324	http://data-gov.tw.rpi.edu/vocab/p/90
5,001,909	fec
2,689,813	http://transport.data.gov.uk/0/ontology/traffic
543,835	http://rdf.geospecies.org/ont/geospecies
526,304	http://data-gov.tw.rpi.edu/vocab/p/401
469,446	http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf
446,120	http://education.data.gov.uk/def/school
223,726	http://www.w3.org/TR/rdf-schema
190,890	http://wecowi.de/wiki/Spezial:URIResolver
166,511	http://data-gov.tw.rpi.edu/vocab/p/10

Here the introduction of data.gov and data.gov.uk was the big event of the past year.

Winners

For the namespaces that occurred in both years we can find the biggest gainers. I calculated the share of all triples that each namespace made up in each year, and then the factor by which that share grew from 2009 to 2010. For example, GoodRelations, at the top here, made up nearly 16% of all triples in 2010 but only 2.91e-4% last year, a gain factor of about 57,000, or a cool increase of roughly 5,700,000% :)

gain	namespace
57058.38	gr
2636.34	http://www.openlinksw.com/schema/attribution
2182.81	http://www.openrdf.org/schema/serql
1944.68	http://www.w3.org/2007/OWL/testOntology
1235.02	http://referata.com/wiki/Special:URIResolver
1211.35	urn:lsid:ubio.org:predicates:recordVersion
1208.09	urn:lsid:ubio.org:predicates:lexicalStatus
1194.66	urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
1191.39	urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
701.66	urn:lsid:ubio.org:predicates:hasCAVConcept
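The gain column can be sketched as follows, assuming it is the 2010 share of all triples divided by the 2009 share. That reading is consistent with the GoodRelations example (16% divided by 2.91e-4% is roughly 55,000), but the function names and toy counts below are made up for illustration.

```python
def shares(counts):
    """Map each namespace to its fraction of all triples in that year."""
    total = sum(counts.values())
    return {ns: count / total for ns, count in counts.items()}


def gains(counts_2009, counts_2010):
    """Share in 2010 divided by share in 2009, for namespaces in both years."""
    s09, s10 = shares(counts_2009), shares(counts_2010)
    return {ns: s10[ns] / s09[ns] for ns in s09.keys() & s10.keys()}


def percent_increase(gain):
    """Convert a gain factor into a percentage increase."""
    return (gain - 1) * 100


# Toy counts, NOT the real data.
counts_2009 = {"gr": 1, "foaf": 99}
counts_2010 = {"gr": 50, "foaf": 50, "fec": 100}

result = gains(counts_2009, counts_2010)

# Biggest gainers first; a gain below 1 means the share shrank.
for ns, gain in sorted(result.items(), key=lambda item: item[1], reverse=True):
    print(f"{gain:.2f}\t{ns}")
```

Normalising by each year's total matters because the two crawls differ in size, so raw counts alone would not be comparable.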

Losers

Similarly, we have the biggest losers, the namespaces whose share of all triples shrank the most (a gain below 1 means the share went down):

gain	namespace
0.000185	http://purl.org/obo/metadata
0.000191	sioct
0.000380	vcard
0.000418	affy
0.000438	http://www.geneontology.org/go
0.000677	http://tap.stanford.edu/data
0.000719	urn://wymiwyg.org/knobot/default
0.000787	akts
0.000876	http://wymiwyg.org/ontologies/language-selection
0.000904	http://wymiwyg.org/ontologies/knobot

If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)

With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.

Update

(a bit later)

You can get the full datasets for this from many eyes:

Posted by gromgull at 2:07 pm on September 2nd, 2010.
Categories: Billion Triple Challenge, Statistics, Uncategorized.

5 comments.

[…] This post was mentioned on Twitter by Gunnar Grimnes, martin hepp. martin hepp said: Nice statistics: #goodrelations accounts for 16 % of all triples on the web & rose by 570.000 % compared to 2009: http://bit.ly/9zJDoa #lod […]

Posted by Tweets that mention (still) nothing clever — Aggregates over BTC2010 namespaces -- Topsy.com on September 2nd, 2010.

Hi Gunnar,

nice stats… as for the triple stats, did you actually use unique triples or quads? kaufkauf.net was publishing the same small set of triples under many URIs, which could be the reason for the inflated GoodRelations numbers (http://groups.google.com/group/pedantic-web/browse_thread/thread/ec03de1159eb5697/2d59d0ac5f6b4220).

Best regards,
Andreas.

Posted by Andreas Harth on September 10th, 2010.

Andreas,

This counts the number of quads a predicate occurs in, so if kaufkauf.net publishes the same triple in two different contexts, it is counted twice. I guess dropping the fourth (context) element, removing duplicates and doing the analysis again would be very interesting. I shall see if I find the time!

Cheers!

Posted by gromgull on September 10th, 2010.

[…] a comment here, Andreas Harth mentions that kaufkauf.net publishes the same triples in many contexts, and that […]

Posted by (still) nothing clever — Redundancy in the BTC2010 Data, it’s only 1.1B triples! on September 15th, 2010.

Andreas, I looked into this a bit more and put it here: http://gromgull.net/blog/2010/09/redundancy-in-the-btc2010-data-its-only-1-1b-triples/

Posted by gromgull on September 15th, 2010.
