An Objective look at the Billion Triple Data

For completeness, Besbes is telling me to write up the final stats from the BTC data, this time for the object part of the triples. I am afraid this data is quite dull and there is not much of interest to say, so it’ll be mostly tables. Enjoy :)

The BTC data contains 279,710,101 unique objects in total. Out of these:

  • 90,007,431 appear more than once
  • 7,995,747 more than 10 times
  • 748,214 more than 100
  • 43,479 more than 1,000
  • 3,209 more than 10,000
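For reference, threshold counts like the ones above can be derived from a plain frequency table. This is only a minimal sketch on toy data, not the script used for the real numbers: the function name `threshold_counts` and the sample list are made up for illustration, and an in-memory `Counter` is an assumption that would likely not hold for 280M distinct objects (there, sorting the data on disk and counting runs is the more realistic route).

```python
from collections import Counter

def threshold_counts(objects, thresholds=(1, 10, 100, 1000, 10000)):
    """Return {t: number of distinct objects that occur more than t times}."""
    freq = Counter(objects)  # object -> number of triples it appears in
    return {t: sum(1 for n in freq.values() if n > t) for t in thresholds}

# Toy input (hypothetical, not BTC data): "a" occurs 12 times, "b" twice, "c" once.
sample = ["a"] * 12 + ["b"] * 2 + ["c"]
counts = threshold_counts(sample)
print(counts)
```

Each threshold is strict ("more than"), matching the wording of the list.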

The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects are:

#triples object
2,584,960 http://www.geonames.org/ontology#P
2,645,095 http://www.aktors.org/ontology/portal#Article-Reference
2,681,771 http://www.w3.org/2002/07/owl#Class
5,616,326 http://www.aktors.org/ontology/portal#Person
7,544,903 http://www.geonames.org/ontology#Feature
9,115,801 http://en.wikipedia.org/
12,124,378 http://xmlns.com/foaf/0.1/OnlineAccount
13,687,049 http://purl.org/rss/1.0/item
14,172,852 http://rdfs.org/sioc/types#WikiArticle
38,795,942 http://xmlns.com/foaf/0.1/Person

Apart from the Wikipedia link, all of these are types. No literals appear in the top 10 table. Of the 116M unique literals, 12,845,021 have a language tag and 2,067,768 have a datatype. The top 10 literals are:

#triples literal
722,221 “0”^^xsd:integer
969,929 “1”
1,024,654 “Nay”
1,036,054 “Copyright © 2009 craigslist, inc.”
1,056,799 “text”
1,061,692 “text/html”
1,159,311 “0”
1,204,996 “en-us”
2,049,638 “Aye”
2,310,681 “application/rdf+xml”

I can’t be bothered to check it now, but I guess the many “Aye”s and “Nay”s come from IRC chat logs (#SWIG?).
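The split into resources, plain literals, language-tagged literals and datatyped literals used in the counts above can be sketched roughly as follows. This assumes the object terms are available in N-Triples syntax, which the post does not actually state. The function name `classify_object` and its category labels are made up for illustration, and how blank nodes were counted in the real numbers is unknown, so they get their own bucket here.

```python
def classify_object(term):
    """Classify one object term written in N-Triples syntax (assumed input format)."""
    term = term.strip()
    if term.startswith("<") and term.endswith(">"):
        return "resource"
    if term.startswith("_:"):
        return "blank"
    if term.startswith('"'):
        # Language tags and datatype IRIs cannot contain a double quote,
        # so the last quote in the term closes the lexical form.
        suffix = term[term.rfind('"') + 1:]
        if suffix.startswith("@"):
            return "lang-literal"
        if suffix.startswith("^^"):
            return "typed-literal"
        return "plain-literal"
    return "unknown"

examples = [
    "<http://xmlns.com/foaf/0.1/Person>",
    '"Aye"',
    '"en-us"@en',
    '"0"^^<http://www.w3.org/2001/XMLSchema#integer>',
]
kinds = [classify_object(t) for t in examples]
print(kinds)
```

Note that a literal whose text merely looks like a language code, such as “en-us” in the table, is still a plain literal unless it carries an actual @ tag.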

Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 Unicode characters long. I wonder about this: it is suspiciously close to 2^16 (65,536) bytes, and since some Unicode characters take more than one byte, could it have been truncated? The distribution of literal lengths looks like this:

Most literals are around 10 characters long. There is a peak at 19, which I seem to remember was caused by the standard time format (e.g. 2005-10-30T10:45UTC) being exactly 19 characters.
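To make the truncation suspicion and the 19-character peak concrete, here is a small sketch that compares character length with byte length and builds a length histogram. It assumes the data is UTF-8 encoded, which is a guess. The helper names and the synthetic string are invented for illustration and say nothing about what the real longest literal contains. The arithmetic is simply that 2^16 = 65,536 bytes, which is 292 more than 65,244, so a literal of that length with 292 two-byte characters would land exactly on the limit.

```python
from collections import Counter

LIMIT_BYTES = 2 ** 16  # 65,536

def char_and_byte_length(text):
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

def length_histogram(literals):
    """Map literal length in characters to the number of literals of that length."""
    return Counter(len(text) for text in literals)

# Synthetic string, NOT the real BTC literal: 65,244 characters,
# of which 292 are "é", which takes two bytes in UTF-8.
synthetic = "a" * (65244 - 292) + "\u00e9" * 292
chars, nbytes = char_and_byte_length(synthetic)
print(chars, nbytes, nbytes == LIMIT_BYTES)

# The timestamp format mentioned above is 19 characters long.
timestamp = "2005-10-30T10:45UTC"
hist = length_histogram(["Aye", "Nay", "text", timestamp])
print(len(timestamp), hist)
```

This only shows that a byte-based cut-off at 2^16 is consistent with the observed length; it does not prove that truncation happened.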

That’s it! I believe I now have published all my numbers on BTC :)
