For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object part of the triples. I am afraid this data is quite dull and there is not much of interest to say, so it’ll be mostly tables. Enjoy :)
The BTC data contains 279,710,101 unique objects in total. Out of these:
- 90,007,431 appear more than once
- 7,995,747 more than 10 times
- 748,214 more than 100
- 43,479 more than 1,000
- 3,209 more than 10,000
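
To pin down what "more than N times" means in the list above, here is a minimal sketch of the counting. It is not the pipeline used for the real numbers: the function name `threshold_counts` and the toy input are made up for illustration, and an in-memory counter would most likely not be practical for 280M distinct objects.

```python
from collections import Counter

def threshold_counts(objects, thresholds=(1, 10, 100, 1000, 10000)):
    """Return (number of unique objects, {t: objects occurring more than t times})."""
    freq = Counter(objects)
    over = {t: sum(1 for n in freq.values() if n > t) for t in thresholds}
    return len(freq), over

# Toy input, not BTC data: "a" occurs 12 times, "b" 3 times, "c" once.
sample = ["a"] * 12 + ["b"] * 3 + ["c"]
unique, over = threshold_counts(sample)
print(unique, over)
```

Each bucket is cumulative, so an object counted under "more than 100" is also counted under "more than 10" and "more than once".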
The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects are:
#triples | object |
---|---|
2,584,960 | http://www.geonames.org/ontology#P |
2,645,095 | http://www.aktors.org/ontology/portal#Article-Reference |
2,681,771 | http://www.w3.org/2002/07/owl#Class |
5,616,326 | http://www.aktors.org/ontology/portal#Person |
7,544,903 | http://www.geonames.org/ontology#Feature |
9,115,801 | http://en.wikipedia.org/ |
12,124,378 | http://xmlns.com/foaf/0.1/OnlineAccount |
13,687,049 | http://purl.org/rss/1.0/item |
14,172,852 | http://rdfs.org/sioc/types#WikiArticle |
38,795,942 | http://xmlns.com/foaf/0.1/Person |
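
The resource/literal split can be sketched as a simple check on the serialized term. This assumes the objects are available as N-Triples-style terms (`<...>` for URIs, `_:` for blank nodes, `"..."` for literals), which is an assumption about the input format. Counting blank nodes as resources is also only a guess, and the helper names are hypothetical.

```python
from collections import Counter

def classify(term):
    """Literals start with a double quote; everything else counts as a resource."""
    # Assumption: blank nodes (_:b0) are lumped in with URI resources.
    return "literal" if term.startswith('"') else "resource"

def is_file_uri(term):
    """True for resources using the file:// scheme."""
    return term.startswith("<file://")

# Toy terms, not taken from the BTC dump.
terms = {
    "<http://xmlns.com/foaf/0.1/Person>",
    "<file:///tmp/foo.rdf>",
    "_:b0",
    '"Aye"',
    '"0"^^<http://www.w3.org/2001/XMLSchema#integer>',
}
counts = Counter(classify(t) for t in terms)
file_uris = sum(1 for t in terms if is_file_uri(t))
print(counts, file_uris)
```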
Apart from the Wikipedia link, all are types. No literals appear in the top 10 table. Of the 116M unique literals, 12,845,021 have a language tag and 2,067,768 have a datatype. The top 10 literals are:
#triples | literal |
---|---|
722,221 | “0”^^xsd:integer |
969,929 | “1” |
1,024,654 | “Nay” |
1,036,054 | “Copyright © 2009 craigslist, inc.” |
1,056,799 | “text” |
1,061,692 | “text/html” |
1,159,311 | “0” |
1,204,996 | “en-us” |
2,049,638 | “Aye” |
2,310,681 | “application/rdf+xml” |
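
Note that “0” and “0”^^xsd:integer show up as separate rows: a plain literal and a typed literal with the same lexical form are different RDF terms. A rough sketch of telling plain, language-tagged and typed literals apart, assuming N-Triples-style serialization with full datatype URIs; the regular expression is a simplification and not a complete N-Triples parser.

```python
import re

# Simplified pattern: quoted lexical form, then an optional @lang or ^^<datatype>.
LITERAL = re.compile(
    r'^"(?P<lex>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^<(?P<dtype>[^>]*)>)?$'
)

def literal_kind(term):
    """Return 'plain', 'lang' or 'datatype'; None if the term is not a literal."""
    m = LITERAL.match(term)
    if m is None:
        return None
    if m.group("lang") is not None:
        return "lang"
    if m.group("dtype") is not None:
        return "datatype"
    return "plain"

print(literal_kind('"0"'))
print(literal_kind('"0"^^<http://www.w3.org/2001/XMLSchema#integer>'))
print(literal_kind('"Aye"@en'))
```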
I can’t be bothered to check it now, but I guess the many Aye’s & Nay’s come from IRC chatlogs (#SWIG?).
Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 Unicode characters long (I wonder about this: it seems very close to 2^16 = 65,536 bytes, and with some Unicode characters taking more than one byte, could it be truncated?). The distribution of literal lengths looks like this:
Most literals are around 10 characters in length. There is a peak at 19, which I seem to remember was caused by the standard time format (e.g. 2005-10-30T10:45UTC) being exactly 19 characters.
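
The difference between character length and byte length matters for the truncation guess. A small sketch, assuming the literals were stored as UTF-8 (which I have not verified); the helper names and sample strings are only for illustration.

```python
from collections import Counter

def length_histogram(literals):
    """Map literal length in Unicode code points to the number of literals."""
    return Counter(len(s) for s in literals)

def utf8_len(s):
    """Length in bytes when encoded as UTF-8 (assumed storage encoding)."""
    return len(s.encode("utf-8"))

sample = [
    "2005-10-30T10:45UTC",
    "Aye",
    "Nay",
    "text/html",
    "Copyright © 2009 craigslist, inc.",
]
hist = length_histogram(sample)
copyright_notice = sample[-1]

print(sorted(hist.items()))
print(len(copyright_notice), utf8_len(copyright_notice))  # © takes two bytes in UTF-8
print(2**16 - 65244)  # extra bytes needed for a 65,244-character literal to fill 2^16 bytes
```

So a literal cut off at exactly 2^16 bytes would need 292 bytes' worth of multi-byte overhead to end up at 65,244 characters, which is plausible for a long text but not something I have checked.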
That’s it! I believe I now have published all my numbers on BTC :)