For completeness, Besbes is telling me to write up the final stats from the BTC data, this time for the object part of the triples. I am afraid this data is quite dull and there is not much of interest to say, so it’ll be mostly tables. Enjoy :)
The BTC data contains 279,710,101 unique objects in total. Out of these:
- 90,007,431 appear more than once
- 7,995,747 appear more than 10 times
- 748,214 appear more than 100 times
- 43,479 appear more than 1,000 times
- 3,209 appear more than 10,000 times
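For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch in Python. It is not how the numbers above were produced: the function name `threshold_counts` and the toy data are made up for illustration, and an in-memory `Counter` would not scale to 280M objects without some external sorting or sharding.

```python
from collections import Counter

def threshold_counts(objects, thresholds=(1, 10, 100, 1_000, 10_000)):
    """Return the number of unique objects and, for each threshold t,
    how many unique objects occur more than t times."""
    freq = Counter(objects)
    over = {t: sum(1 for n in freq.values() if n > t) for t in thresholds}
    return len(freq), over

# Toy input (hypothetical): the object column of a handful of triples.
sample = ["foaf:Person"] * 12 + ["owl:Class"] * 3 + ['"Aye"']
unique, over = threshold_counts(sample)
print(unique, over)
```

Each threshold is counted independently, so the buckets are cumulative, just like the list above: an object seen 12 times is counted both under "more than once" and under "more than 10 times".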
The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects, listed in ascending order of frequency, are:
| #triples | object |
|---|---|
| 2,584,960 | http://www.geonames.org/ontology#P |
| 2,645,095 | http://www.aktors.org/ontology/portal#Article-Reference |
| 2,681,771 | http://www.w3.org/2002/07/owl#Class |
| 5,616,326 | http://www.aktors.org/ontology/portal#Person |
| 7,544,903 | http://www.geonames.org/ontology#Feature |
| 9,115,801 | http://en.wikipedia.org/ |
| 12,124,378 | http://xmlns.com/foaf/0.1/OnlineAccount |
| 13,687,049 | http://purl.org/rss/1.0/item |
| 14,172,852 | http://rdfs.org/sioc/types#WikiArticle |
| 38,795,942 | http://xmlns.com/foaf/0.1/Person |
Apart from the Wikipedia link, all of these are types. No literals appear in the top 10 table. Of the 116M unique literals, 12,845,021 have a language tag and 2,067,768 have a datatype tag. The top 10 literals, again in ascending order of frequency, are:
| #triples | literal |
|---|---|
| 722,221 | “0”^^xsd:integer |
| 969,929 | “1” |
| 1,024,654 | “Nay” |
| 1,036,054 | “Copyright © 2009 craigslist, inc.” |
| 1,056,799 | “text” |
| 1,061,692 | “text/html” |
| 1,159,311 | “0” |
| 1,204,996 | “en-us” |
| 2,049,638 | “Aye” |
| 2,310,681 | “application/rdf+xml” |
I can’t be bothered to check it now, but I guess the many Aye’s & Nay’s come from IRC chatlogs (#SWIG?).
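The split into plain, language-tagged and datatyped literals mentioned above can be sketched roughly as follows. This assumes the objects are available as N-Triples-style terms; the regular expression is a simplification of the real grammar, and the name `classify` is made up for this example.

```python
import re

# Simplified N-Triples-style literal: "lexical form", optionally followed
# by @lang or ^^datatype. This is an approximation, not the full grammar.
LITERAL = re.compile(
    r'^"(?P<lex>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^(?P<dtype>\S+))?$'
)

def classify(term):
    """Classify an object term as 'resource', 'plain', 'lang' or 'datatype'."""
    m = LITERAL.match(term)
    if m is None:
        return "resource"  # anything that is not a quoted literal
    if m.group("lang"):
        return "lang"
    if m.group("dtype"):
        return "datatype"
    return "plain"

print(classify('"Aye"'))             # plain literal
print(classify('"0"^^xsd:integer'))  # datatyped literal
print(classify('"hello"@en-us'))     # language-tagged literal
```

Note that “0” and “0”^^xsd:integer are counted as different literals in the table above, which is why both show up.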
Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 Unicode characters long. I wonder about this: it seems very close to 2^16 (65,536) bytes, and since some Unicode characters take more than one byte, could it have been truncated? The distribution of literal lengths looks like this:
Most literals are around 10 characters long. There is a peak at 19, which I seem to remember was caused by the standard time format (e.g. 2005-10-30T10:45UTC) being exactly 19 characters.
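A quick sanity check of the length figures, as a sketch only: `length_histogram` is a made-up helper, lengths are counted in Unicode code points, and UTF-8 is assumed for the byte comparison (the post does not say which encoding the data used).

```python
from collections import Counter

def length_histogram(literals):
    """Map literal length (in Unicode code points) to number of literals."""
    return Counter(len(s) for s in literals)

# The timestamp example from the text is indeed 19 characters long.
timestamp = "2005-10-30T10:45UTC"
assert len(timestamp) == 19

# Gap between the longest literal and 2^16.
longest = 65_244
limit = 2 ** 16
gap = limit - longest

# Characters and bytes differ for non-ASCII text (UTF-8 assumed here).
copyright_sign = "©"
chars, utf8_bytes = len(copyright_sign), len(copyright_sign.encode("utf-8"))

hist = length_histogram(["Aye", "Nay", "text"])
print(gap, chars, utf8_bytes, hist)
```

The gap comes out at 292, so a cut-off at 65,536 bytes would fit if the longest literal happened to carry 292 extra bytes from multi-byte characters. That is only a plausibility check, not proof of truncation.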
That’s it! I believe I now have published all my numbers on BTC :)