It’s all Cygri‘s fault — he encouraged me to add schema namespaces to the general areas on the semantic web cluster-tree. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full day later I have:
It’s the same map as last time, laid using graphviz’s neato as before. The heat-map of the properties was computed from the feature-vector of predicate counts, first I mapped all predicates to their “namespace”, by the slightly-dodgy-but-good-enough heuristic of taking the part of the URI before the last # or / character. Then I split the map into a grid of NxN points (I think I used N=30 in the end), and compute a new feature vector for each point. This vector is the sum of the mapped vector for each of the domains, divided by the distance. I.e. (if you prefer math) each point’s vector becomes:
Where is the distance (here simple 2d euclidean), is each domain, is that domains position in the figure and is that domains feature vector. Normally it would be more natural to decrease the effect by the squared distance, but this gave less attractive results, and I ended up square-rooting it instead. The color is now simply on column of the resulting matrix normalised and mapped to a nice pylab colormap.
Now this was the fun and interesting part, and it took maybe 1 hour. As predicted. NOW, getting this plotted along with the nodes from the graph turned out to be a nightmare. Neato gave me the coordinates for the nodes, but would change them slightly when rendering to PNGs. Many hours of frustration later I ended up drawing all of it again with pylab, which worked really well. I would publish the code for this, but it’s so messy it makes grown men cry.
NOW I am off to analyse the result of the top-level domain interlinking on the billion triple data. The data-collection just finished running while I did this. … As he said.