SKOS Concepts in the BTC2010 data

Again Dan Brickley is making me work :) This time looking at the “hidden” schema that is SKOS concepts, (hidden because it is not really apparent when just looking at normal rdf:types). Dan suggested looking at topics used with FOAF, i.e. objects of foaf:topic, foaf:primaryTopic and foaf:interest triples, and also things used with Dublin Core subject (I used both http://purl.org/dc/elements/1.1/subject and http://purl.org/dc/terms/subject.

I found 1,136,475 unique FOAF topics in 8,119,528 triples, only 4,470 are bnodes, and only 265 (! i.e. only 0.002%) are literals. The top 10 topics are all of the type http://www.livejournal.com/interests.bml?int=??????, with varying number of ?s, this is obviously what people entered into the interest field of livejournal. More interesting are perhaps the top hosts:

#triples host
5,191,771 www.livejournal.com
1,819,836 www.deadjournal.com
771,439 www.vox.com
78,290 klab.lv
75,285 lj.rossia.org
70,380 lod.geospecies.org
18,398 my.opera.com
16,251 dbpedia.org
11,481 www.wasab.dk
9,815 wiki.sembase.at

So a lot of these topics are from FOAF exports of livejournal and friends. What I did not do, at least not yet, was to compare the list of FOAF topics with the things actually declared to be of type skos:Concept, this would be interesting.

Dublin Core looks quite different, it gives us 552,596 topics in 4,018,726 triples, but only 2,979 resources out of 921 are bnodes, the rest (i.e. 99.4%) are all literals.
The top 10 subjects according to DC are:

#triples subject
91,534 日記
38,566 写真
35,514 メル友募集
32,150 NAPLES
30,973 business
28,342 独り言
27,543 SoE Report
24,102 Congress
23,954 音楽
20,097

I do not even know what language most of these are (anyone?). Looking a bit further down the list, there are lots of government, education, crime, etc. Perhaps we can blame data.gov for this? I could have have kept track of the named-graphs these came from, but I didn’t. Maybe next time.

You can download the full raw counts for all subjects: FOAF topics (7.6mb), FOAF hosts and DC Topics (23mb).

Post a comment.