SKOS Concepts in the BTC2010 data

Again Dan Brickley is making me work :) This time looking at the “hidden” schema that is SKOS concepts, (hidden because it is not really apparent when just looking at normal rdf:types). Dan suggested looking at topics used with FOAF, i.e. objects of foaf:topic, foaf:primaryTopic and foaf:interest triples, and also things used with Dublin Core subject (I used both and

I found 1,136,475 unique FOAF topics in 8,119,528 triples; only 4,470 are bnodes, and only 265 (! i.e. only about 0.02%) are literals. The top 10 topics are all of the type, with varying number of ?s; this is obviously what people entered into the interest field of LiveJournal. More interesting are perhaps the top hosts:

#triples host

So a lot of these topics are from FOAF exports of LiveJournal and friends. What I did not do, at least not yet, was compare the list of FOAF topics with the things actually declared to be of type skos:Concept. That would be interesting.
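For anyone who wants to repeat this kind of count, here is a minimal sketch of how it could be done in Python. It is not the script I actually ran. It assumes the data is in N-Quads, uses a deliberately simple regular expression instead of a real parser, and the names `count_topics`, `kind` and `host_of` are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

FOAF = "http://xmlns.com/foaf/0.1/"
TOPIC_PREDICATES = {FOAF + "topic", FOAF + "primaryTopic", FOAF + "interest"}

# One term: <uri>, _:bnode, or "literal" with optional @lang or ^^<datatype>.
# This is a simplification and will skip lines a full N-Quads parser accepts.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'(?:\s+' + TERM + r')?\s*\.\s*$'
)

def kind(term):
    """Classify a raw term as 'uri', 'bnode' or 'literal'."""
    if term.startswith("<"):
        return "uri"
    if term.startswith("_:"):
        return "bnode"
    return "literal"

def host_of(uri_term):
    """Return the host part of a <uri> term, or '' if it cannot be parsed."""
    try:
        return urlparse(uri_term[1:-1]).netloc
    except ValueError:
        return ""

def count_topics(lines, predicates=TOPIC_PREDICATES):
    """Count objects of the given predicates.

    Returns (triples per topic, unique topics per kind, triples per host).
    """
    topics = Counter()
    hosts = Counter()
    for line in lines:
        m = QUAD.match(line)
        if not m:
            continue  # comment, blank line, or something the regex cannot handle
        _, p, o, _ = m.groups()
        if p[1:-1] not in predicates:
            continue
        topics[o] += 1
        if kind(o) == "uri":
            hosts[host_of(o)] += 1
    kinds = Counter(kind(t) for t in topics)
    return topics, kinds, hosts
```

Calling `count_topics(open(path, encoding="utf-8"))` on one chunk of the data gives the per-topic triple counts, the number of unique topics that are URIs, bnodes or literals, and the per-host triple counts. For the whole dataset the counters would have to be merged across chunks, and a proper N-Quads parser would be the safer choice.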

Dublin Core looks quite different. It gives us 552,596 topics in 4,018,726 triples, but only 2,979 of these are resources, of which 921 are bnodes; the rest (i.e. 99.4%) are all literals.
The top 10 subjects according to DC are:

#triples subject
91,534 日記
38,566 写真
35,514 メル友募集
32,150 NAPLES
30,973 business
28,342 独り言
27,543 SoE Report
24,102 Congress
23,954 音楽

I do not even know what language most of these are (anyone?). Looking a bit further down the list, there are lots of government, education, crime, etc. Perhaps we can blame for this? I could have kept track of the named graphs these came from, but I didn’t. Maybe next time.
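Keeping track of the named graphs would not take much. Below is a rough sketch of what "next time" could look like, again assuming N-Quads input where the fourth term is the graph. The function name `graph_hosts_by_value` is hypothetical, the two predicate URIs are the Dublin Core elements and terms versions of subject, and the regular expression is the same simplification as above rather than a full parser.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Both Dublin Core namespaces define a "subject" property.
DC_SUBJECT = {
    "http://purl.org/dc/elements/1.1/subject",
    "http://purl.org/dc/terms/subject",
}

# Simplified term pattern: <uri>, _:bnode, or "literal" with @lang or ^^<datatype>.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'(?:\s+' + TERM + r')?\s*\.\s*$'
)

def graph_hosts_by_value(lines, predicates=DC_SUBJECT):
    """For each DC subject value, count triples per host of the named graph."""
    by_value = defaultdict(Counter)
    for line in lines:
        m = QUAD.match(line)
        if not m:
            continue
        _, p, o, g = m.groups()
        if p[1:-1] not in predicates:
            continue
        if g is None or not g.startswith("<"):
            continue  # plain triple or bnode graph label: no host to record
        try:
            host = urlparse(g[1:-1]).netloc
        except ValueError:
            continue  # malformed graph URI
        by_value[o][host] += 1
    return by_value
```

With this, `by_value['"business"'].most_common(5)` would list the hosts that contribute most triples for that subject. Note that the keys are raw terms, so the same string with and without a language tag counts as two different values.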

You can download the full raw counts for all subjects: FOAF topics (7.6 MB), FOAF hosts and DC Topics (23 MB).

