Billions and billions and billions (on a map)

Time for a few more BTC statistics, this time looking at the contexts. The BTC data comes from 50,207,171 different URLs. Of these (a rough counting sketch follows the list):

  • 35,423,929 yielded more than a single triple
  • 10,278,663 yielded more than 10 triples, covering 85% of the full data
  • 1,574,458 yielded more than 100 triples, covering 63%
  • 133,369 yielded more than 1,000 triples, covering 30%
  • 3,759 yielded more than 10,000 triples, covering 7%
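
For the curious, here is a minimal sketch of how such a per-context histogram can be computed from the N-Quads dump. The file name is made up, and it assumes the usual layout where the context URI is the fourth term on each line, just before the trailing " .":

```python
from collections import Counter

# Count triples per context: the context is the second-to-last
# space-separated token on each N-Quads line.
context_counts = Counter()
with open("btc-2009.nq") as f:              # hypothetical file name
    for line in f:
        parts = line.rsplit(" ", 2)
        if len(parts) == 3:
            # strip the angle brackets around the context URI
            context_counts[parts[1].strip("<>")] += 1

total = sum(context_counts.values())
for threshold in (1, 10, 100, 1000, 10000):
    big = [n for n in context_counts.values() if n > threshold]
    print(threshold, len(big), "%.0f%%" % (100 * sum(big) / total))
```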

The biggest contexts were as follows:

triples context
7,186,445 http://sw.deri.org/svn/sw/2008/03/MetaX/vocab.rdf#aperture-1.2.0
410,659 http://lsdis.cs.uga.edu/projects/semdis/swetodblp/march2007/swetodblp_march_2007_part_22.rdf
273,644 http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
237,685 http://dbpedia.org/resource/リレット
196,239 http://dbpedia.org/resource/スケルツォ
194,730 http://dbpedia.org/resource/オークション
178,842 http://www.reveredata.com/reports/store/index-companies.rss
165,948 http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl
165,506 http://www.cs.man.ac.uk/~dturi/ontologies/go-assocdb/go-termdb.owl
160,592 http://dbpedia.org/resource/をがわいちろを

It’s pretty cool that someone crawled 7 million triples with aperture and put it online :) (the link is 404 now though, so you can’t easily check what it was). Also, none of the huge dbpedia pages seem to give any info; I am not quite sure what is going on there. Perhaps some encoding trouble somewhere?

As the official BTC statistics page already shows, it is more interesting to group the contexts by host. Computing the same Pay-Level-Domains (PLDs) as they did, I get these hosts contributing the most triples:

triples context
278,566,771 dbpedia.org
133,266,773 livejournal.com
94,748,441 rkbexplorer.com
84,896,760 geonames.org
61,339,034 mybloglog.com
53,492,284 sioc-project.org
23,970,898 qdos.com
23,745,914 hi5.com
23,459,199 kanzaki.com
17,691,303 rdfabout.com
15,784,386 plode.us
15,208,914 dbtune.org
13,548,946 craigslist.org
10,155,861 l3s.de
10,028,115 opencyc.org

Again, this is computed from the whole dataset, not just a subset, but interestingly it differs quite a lot from the “official” statistics; in fact, I’ve “lost” over 100M triples from dbpedia. I am not sure why this happens. A handful of context URLs were so strange that python’s urlparse module did not produce a hostname, but they only account for about 100,000 triples, and summing over the hosts I did find gives the right total number of triples (i.e. one billion :). So unless there is something fundamentally wrong with the way I find the PLDs, I am almost forced to conclude that the official stats are WRONG!
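
For what it’s worth, here is roughly the kind of grouping I mean. This is a sketch, not the actual script: it uses urlparse to get the hostname and a naive “last two labels” heuristic instead of a proper public-suffix-based Pay-Level-Domain lookup:

```python
from collections import Counter
from urllib.parse import urlparse     # the post used Python 2's urlparse module

def pay_level_domain(context_url):
    """Very naive PLD: last two labels of the hostname.
    A proper implementation would consult the public-suffix list."""
    host = urlparse(context_url).hostname
    if not host or "." not in host:
        return None                    # the odd malformed context URL
    return ".".join(host.split(".")[-2:])

pld_counts = Counter()
for context, triples in context_counts.items():   # context_counts from the first sketch
    pld = pay_level_domain(context)
    if pld:
        pld_counts[pld] += triples

for pld, triples in pld_counts.most_common(15):
    print(f"{triples:>12,}  {pld}")
```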

UPDATE: The official numbers must be wrong, because if you sum them all you get 1,504,548,700, i.e. over 1.5 billion triples for just the top 50 domains alone. This cannot be true, since the actual number of triples is “just” 1,151,383,508.

More fun than the table above is using hostip.info to geocode the IPs of these servers and put them on a map. Now, the hostip database is not perfect; in fact, it’s pretty poor, and some hosts with A LOT of triples are missing (such as livejournal.com). I could perhaps have used the country codes of the URLs as a fall-back solution, but I was too lazy.
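
Something along these lines, although I used the downloadable hostip.info database rather than the web API; the endpoint and JSON field names below are from memory, so treat them as assumptions rather than gospel:

```python
import json
import socket
from urllib.request import urlopen

def geocode_host(host):
    """Resolve a host to an IP and ask hostip.info where it is.
    The endpoint and JSON field names here are assumptions, not verified."""
    try:
        ip = socket.gethostbyname(host)
    except OSError:
        return None
    url = "http://api.hostip.info/get_json.php?ip=%s&position=true" % ip
    data = json.loads(urlopen(url).read())
    lat, lng = data.get("lat"), data.get("lng")
    if lat and lng:
        return float(lat), float(lng), data.get("country_name")
    return None          # no record in hostip.info, as happens for e.g. livejournal.com

# e.g. geocode_host("dbpedia.org")
```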

Now for drawing the map I thought I could use Many Eyes, but it turned out not to be as easy as I imagined. After uploading the dataset I found that although Many Eyes has a map visualisation, it does not use lat/lon coordinates, but relies instead on country names. Here is what it would have looked like if done by lat/lon (you have to imagine the world map though):

Trying again, I went back to the hostip.info database, got the country of each host, added up the numbers for each country (Many Eyes does not do any aggregation), and uploaded a triples-by-country dataset. This I could visualise on a map, shading each country according to the number of triples, but it’s kinda boring:
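
The aggregation itself is just a group-by, something like the following, reusing pld_counts and geocode_host from the sketches above (Many Eyes only needs a simple country/value table):

```python
from collections import Counter

# Aggregate triples per country, reusing pld_counts and geocode_host
# from the earlier sketches.
triples_by_country = Counter()
for pld, triples in pld_counts.items():
    info = geocode_host(pld)
    if info is None:
        continue                    # host missing from hostip.info
    _lat, _lng, country = info
    triples_by_country[country or "Unknown"] += triples

# One row per country: dump a two-column table to paste into Many Eyes.
for country, triples in triples_by_country.most_common():
    print("%s\t%d" % (country, triples))
```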

Giving up on Many Eyes, I tried the Google Visualisation API instead. Surely they would have a smooth, zoomable map visualisation? Not quite. They have a map, but it’s Flash-based, only supports “zooming” into pre-defined regions, and does a complete reload when changing region. Also, it only supports 400 data points. All the data is embedded in the Javascript though. I couldn’t get it to embed here, so click:

[map screenshot]

Now I am sure I could hack something together that would use proper Google Maps and would actually let you zoom nicely, etc. BUT I think I’ve really spent enough time on this now.

Keep your eyes peeled for the next episode where we find out why the semantic web has more triples of length 19 than any other.
