Billions and billions and billions (on a map)

Time for a few more BTC statistics, this time looking at the contexts. The BTC data comes from 50,207,171 different URLs. Of these (a rough counting sketch follows the list):

  • 35,423,929 yielded more than a single triple
  • 10,278,663 yielded more than 10 triples, covering 85% of the full data
  • 1,574,458 yielded more than 100 triples, covering 63%
  • 133,369 yielded more than 1,000 triples, covering 30%
  • 3,759 yielded more than 10,000 triples, covering 7%
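
For the curious, here is a minimal sketch of how such a per-context histogram can be computed from the N-Quads dump. The file name is made up, and it assumes the usual layout where the context URI is the fourth term on each line, just before the trailing " .":

```python
from collections import Counter

# Count triples per context: the context is the second-to-last
# space-separated token on each N-Quads line.
context_counts = Counter()
with open("btc-2009.nq") as f:              # hypothetical file name
    for line in f:
        parts = line.rsplit(" ", 2)
        if len(parts) == 3:
            # strip the angle brackets around the context URI
            context_counts[parts[1].strip("<>")] += 1

total = sum(context_counts.values())
for threshold in (1, 10, 100, 1000, 10000):
    big = [n for n in context_counts.values() if n > threshold]
    print(threshold, len(big), "%.0f%%" % (100 * sum(big) / total))
```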

The biggest contexts were as follows:

triples context
7,186,445 http://sw.deri.org/svn/sw/2008/03/MetaX/vocab.rdf#aperture-1.2.0
410,659 http://lsdis.cs.uga.edu/projects/semdis/swetodblp/march2007/swetodblp_march_2007_part_22.rdf
273,644 http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
237,685 http://dbpedia.org/resource/リレット
196,239 http://dbpedia.org/resource/スケルツォ
194,730 http://dbpedia.org/resource/オークション
178,842 http://www.reveredata.com/reports/store/index-companies.rss
165,948 http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl
165,506 http://www.cs.man.ac.uk/~dturi/ontologies/go-assocdb/go-termdb.owl
160,592 http://dbpedia.org/resource/をがわいちろを

It’s pretty cool that someone crawled 7 million triples with aperture and put it online :) (the link is 404 now though, so you can’t easily check what it was). Also, none of the huge dbpedia pages seem to give any info; I am not quite sure what is going on there. Perhaps some encoding trouble somewhere?

As the official BTC statistics page already shows, it is more interesting to group the contexts by host. Computing the same Pay-Level-Domains (PLDs) as they did, I get these hosts contributing the most triples:

triples context
278,566,771 dbpedia.org
133,266,773 livejournal.com
94,748,441 rkbexplorer.com
84,896,760 geonames.org
61,339,034 mybloglog.com
53,492,284 sioc-project.org
23,970,898 qdos.com
23,745,914 hi5.com
23,459,199 kanzaki.com
17,691,303 rdfabout.com
15,784,386 plode.us
15,208,914 dbtune.org
13,548,946 craigslist.org
10,155,861 l3s.de
10,028,115 opencyc.org

Again, this is computed from the whole dataset, not just a subset, but interestingly it differs quite a lot from the “official” statistics; in fact, I’ve “lost” over 100M triples from dbpedia. I am not sure why this happens. A handful of context URLs were so strange that python’s urlparse module did not produce a hostname, but they only account for about 100,000 triples, and summing over the hosts I did find gives the right total number of triples (i.e. one billion :). So unless there is something fundamentally wrong with the way I find the PLDs, I am almost forced to conclude that the official stats are WRONG!
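
For what it’s worth, here is roughly the kind of grouping I mean. This is a sketch, not the actual script: it uses urlparse to get the hostname and a naive “last two labels” heuristic instead of a proper public-suffix-based Pay-Level-Domain lookup:

```python
from collections import Counter
from urllib.parse import urlparse     # the post used Python 2's urlparse module

def pay_level_domain(context_url):
    """Very naive PLD: last two labels of the hostname.
    A proper implementation would consult the public-suffix list."""
    host = urlparse(context_url).hostname
    if not host or "." not in host:
        return None                    # the odd malformed context URL
    return ".".join(host.split(".")[-2:])

pld_counts = Counter()
for context, triples in context_counts.items():   # context_counts from the first sketch
    pld = pay_level_domain(context)
    if pld:
        pld_counts[pld] += triples

for pld, triples in pld_counts.most_common(15):
    print(f"{triples:>12,}  {pld}")
```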

UPDATE: The official numbers must be wrong, because if you sum them all you get 1,504,548,700, i.e. over 1.5 billion triples for just the top 50 domains alone. This cannot be true, since the actual number of triples is “just” 1,151,383,508.

More fun than the table above is using hostip.info to geocode the IPs of these servers and put them on a map. Now, the hostip database is not perfect; in fact, it’s pretty poor, and some hosts with A LOT of triples are missing (such as livejournal.com). I could perhaps have used the country codes of the URLs as a fall-back solution, but I was too lazy.
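
Something along these lines, although I used the downloadable hostip.info database rather than the web API; the endpoint and JSON field names below are from memory, so treat them as assumptions rather than gospel:

```python
import json
import socket
from urllib.request import urlopen

def geocode_host(host):
    """Resolve a host to an IP and ask hostip.info where it is.
    The endpoint and JSON field names here are assumptions, not verified."""
    try:
        ip = socket.gethostbyname(host)
    except OSError:
        return None
    url = "http://api.hostip.info/get_json.php?ip=%s&position=true" % ip
    data = json.loads(urlopen(url).read())
    lat, lng = data.get("lat"), data.get("lng")
    if lat and lng:
        return float(lat), float(lng), data.get("country_name")
    return None          # no record in hostip.info, as happens for e.g. livejournal.com

# e.g. geocode_host("dbpedia.org")
```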

Now for drawing the map I thought I could use Many Eyes, but it turned out not to be as easy as I imagined. After uploading the dataset I found that although Many Eyes has a map visualisation, it does not use lat/lon coordinates, but relies instead on country names. Here is what it would have looked like if done by lat/lon (you have to imagine the world map though):

Trying again, I went back to the hostip.info database, got the country of each host, added up the numbers for each country (Many Eyes does not do any aggregation), and uploaded a triples-by-country dataset. This I could visualise on a map, shading each country according to the number of triples, but it’s kinda boring:
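
The aggregation itself is just a group-by, something like the following, reusing pld_counts and geocode_host from the sketches above (Many Eyes only needs a simple country/value table):

```python
from collections import Counter

# Aggregate triples per country, reusing pld_counts and geocode_host
# from the earlier sketches.
triples_by_country = Counter()
for pld, triples in pld_counts.items():
    info = geocode_host(pld)
    if info is None:
        continue                    # host missing from hostip.info
    _lat, _lng, country = info
    triples_by_country[country or "Unknown"] += triples

# One row per country: dump a two-column table to paste into Many Eyes.
for country, triples in triples_by_country.most_common():
    print("%s\t%d" % (country, triples))
```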

Giving up on Many Eyes, I tried the Google Visualisation API instead. Surely they would have a smooth, zoomable map visualisation? Not quite. They have a map, but it’s Flash-based, only supports “zooming” into pre-defined regions, and does a complete reload when changing region. Also, it only supports 400 data points. All the data is embedded in the Javascript though. I couldn’t get it to embed here, so click:

[map screenshot]

Now I am sure I could hack something together that would use proper Google Maps and would actually let you zoom nicely, etc. BUT I think I’ve really spent enough time on this now.

Keep your eyes peeled for the next episode where we find out why the semantic web has more triples of length 19 than any other.
