So, not quite a Billion Triple Challenge post, but the data is the same. I had the idea of comparing the Pay-Level-Domains (PLDs) of the triples' contexts based on which predicates are used within each one. Once I had a distance metric, I could use FastMap to visualise it. It would be a quick hack, it would look smooth and great, and it would be fun. In the end, many hours later, it wasn't quick, the visual is not smooth (i.e. it doesn't move), and I don't know if it looks so great. It was fun though. Just go there and look at it:
As you can see, it's a large PNG with the new-and-exciting image-map technology used to position the info-popups, or rather to trigger the JavaScript used for the popups. I tried SVG at first, but I couldn't get SVG, XHTML and JavaScript to play along; I guess it will work in Firefox 5. The graph is laid out and rendered with Graphviz's neato, which also generated the imagemap.
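For the record, the rendering step is essentially one neato call that writes both the PNG and the client-side imagemap. Something like the sketch below; the filenames are placeholders and this is a reconstruction, not the exact command I ran:

```python
# Rough reconstruction of the rendering step (filenames are placeholders):
# one neato run that emits both the big PNG and the client-side imagemap
# (cmapx) that the JavaScript popups hook into.
import subprocess

subprocess.run(
    ["neato",
     "-Tpng", "-o", "graph.png",     # the large raster image
     "-Tcmapx", "-o", "graph.map",   # <map> element to paste into the HTML page
     "graph.dot"],
    check=True,
)
```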
So what do we actually see here? In short, a tree where domains that publish similar Semantic Web data are close to each other in the tree and have similar colours. In detail: I took all the PLDs that contained over 1,000 triples (around 7,500 of them) and, for each, counted the number of triples using each of the 500 most frequent predicates in the dataset. (These 500 predicates cover ≈94% of the data.) This gave me a vector space with 500 features for each PLD, i.e. something like this:
| PLD             | geonames:nearbyFeature | dbprop:redirect | foaf:knows | … |
| dbpedia.org     | 0.01                   | 0.8             | 0.1        | … |
| livejournal.org | 0                      | 0               | 0.9        | … |
| geonames.org    | 0.75                   | 0               | 0.1        | … |
| …               |                        |                 |            |   |
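In code, building a matrix like the one above, plus the cosine-distance matrix I describe in a second, might look roughly like this sketch (the `counts` dict and `predicates` list are placeholder names, not the code I actually ran):

```python
# Sketch only: `counts` maps each PLD to {predicate: number of triples},
# `predicates` is the list of the 500 most frequent predicates.
# Assumes every PLD uses at least one of those predicates.
import numpy as np

def pld_matrix_and_distances(counts, predicates):
    plds = sorted(counts)
    # One row per PLD, one column per predicate, values = raw triple counts.
    X = np.array([[counts[pld].get(p, 0) for p in predicates] for pld in plds],
                 dtype=float)
    # Normalise each row to the fraction of that PLD's triples per predicate.
    X /= X.sum(axis=1, keepdims=True)
    # Cosine similarity between all pairs of rows, turned into a distance.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    distances = 1.0 - (X @ X.T) / (norms @ norms.T)
    return plds, X, distances
```

The distance matrix is what then gets fed to FastMap (and later maketree) below.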
Each value in the table is the fraction of triples from that PLD that use the given predicate. In this vector space I used cosine similarity to compute a distance matrix over all PLDs. With this distance matrix I thought I could apply FastMap, but it worked really badly and looked like this:
So instead of FastMap I used maketree from the complearn tools. It builds trees from a distance matrix and gives very good results, but it is an iterative optimisation and takes forever for large instances. Around this time I realised I wasn't going to be able to visualise all 7,500 PLDs, and cut it down to the 2000, then 1000, 500, 100, and finally 50 largest PLDs. This worked fine, but the result looked like a bog-standard Graphviz graph, and it wasn't very exciting (i.e. not at all like this colourful thing). Then I realised that since I actually had numeric feature vectors in the first place, I wasn't restricted to using FastMap to make up coordinates: I used PCA to map the input vector space down to 3 dimensions, normalised the values to [0, 255] and used these as RGB colour values. Ah, lovely pastel.
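The colouring step amounts to something like this (again just a sketch, not my actual code; `X` is the PLD-by-predicate matrix from the earlier sketch, and I'm using scikit-learn's PCA here for illustration):

```python
# Sketch of the colouring: project the feature vectors to 3 dimensions with
# PCA, rescale each component to [0, 255], and treat the result as RGB.
import numpy as np
from sklearn.decomposition import PCA

def pastel_colours(X):
    coords = PCA(n_components=3).fit_transform(X)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    rgb = np.rint(255 * (coords - lo) / (hi - lo)).astype(int)
    # Hex strings, e.g. for Graphviz node fill colours.
    return ["#%02x%02x%02x" % tuple(int(v) for v in row) for row in rgb]
```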
I think I underestimated the time this would take by at least a factor of 20. Oh well. Time for lunch.
For help on SVG try http://tech.groups.yahoo.com/group/svg-developers
Posted by stelt on September 9th, 2009.
This looks really nice! Thanks for doing a writeup and for the links to the transformation tools. maketree is interesting; I'm interested in what you'd try next, supposing you did have to include all the nodes?
Posted by drewp on September 10th, 2009.