(still) nothing clever

Posts categorized “Semantic Web”.

BTC2010 Basic stats

Another year, another billion triple dataset. This time it was released the same time my daughter was born, so running the stats script was delayed for a bit.

This year we’ve got a few more triples, perhaps making up for the fact that it wasn’t actually one billion last year :) we’ve now got 3.1B triples (or 3,171,793,030 if you want to be exact).

I’ve not had a chance to do anything really fun with this, so I’ll just dump the stats:

Subjects

159,185,186 unique subjects
147,663,612 occur in more than a single triple
12,647,098 more than 10 times
5,394,733 more 100
313,493 more than 1,000
46,116 more than 10,000
and 53 more than 100,000 times

For an average of 19.9252 per unique triple. Like last year, I am not sure if having more than 100,000 triples with the same subject really is useful for anyone?

Looking only at bnodes used as subjects we get:

100,431,757 unique subjects
98,744,109 occur in more than a single triple
1,465,399 more than 10 times
266,759 more 100
4,956 more than 1,000
48 more than 10,000

So 100M out of 159M subjects are bnodes, but they are used less often than the named resources.

The top subjects are as follows:

#triples	subject
1,412,709	http://www.proteinontology.info/po.owl#A
895,776	http://openean.kaufkauf.net/id/
827,295	http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy
492,756	cycann:externalID
481,000	http://purl.uniprot.org/citations/15685292
445,430	foaf:Document
369,567	cycann:label
362,391	dcmitype:Text
357,309	http://sw.opencyc.org/concept/
349,988	http://purl.uniprot.org/citations/16973872

I do not know enough about the Proteine ontology to know why po:A is so popular. CYC we already had last year here, and I guess all products exposed by BestBuy have this URI as a subject.

Predicates

95,379 unique predicates
83,370 occur in more than one triples
46,710 more than 10
18,385 more than 100
5,395 more than 1,000
1,271 more than 10,000
548 more than 100,000

The average predicate occurred in 33254.6 triples.

#triples	predicate
557,268,190	rdf:type
384,891,996	rdfs:isDefinedBy
215,041,142	gr:hasGlobalLocationNumber
184,881,132	rdfs:label
175,141,343	rdfs:comment
168,719,459	gr:hasEAN_UCC-13
131,029,818	gr:hasManufacturer
112,635,203	rdfs:seeAlso
71,742,821	foaf:nick
71,036,882	foaf:knows

The usual suspects, rdf:type, comment, label, seeAlso and a bit of FOAF. New this year is lots of GoodRelations data!

Objects – Resources

Ignoring literals for the moment, looking only at resource-objects, we have:

192,855,067 unique resources
36,144,147 occur in more than a single triple
2,905,294 more than 10 times
197,052 more 100
20,011 more than 1,000
2,752 more than 10,000
and 370 more than 100,000 times

On average 7.72834 triples per object. This is both named objects and bnodes, looking at the bnodes only we get:

97,617,548 unique resources
616,825 occur in more than a single triple
8,632 more than 10 times
2,167 more 100
1 more than 1,000

Since BNode IDs are only valid within a certain file it is limited how often then can appear, but still almost half the overall objects are bnodes.

The top ten bnode IDs are pretty boring, but the top 10 named resources are:

#triples	resource-object
215,532,631	gr:BusinessEntity
215,153,113	ean:businessentities/
168,205,900	gr:ProductOrServiceModel
167,789,556	http://openean.kaufkauf.net/id/
71,051,459	foaf:Person
10,373,362	foaf:OnlineAccount
6,842,729	rss:item
6,025,094	rdf:Statement
4,647,293	foaf:Document
4,230,908	http://purl.uniprot.org/core/Resource

These are pretty much all types – compare to:

Types

A “type” being the object that occurs in a triple where rdf:type is the predicate gives us:

170,020 types
91,479 occur in more than a single triple
20,196 more than 10 times
4,325 more 100
1,113 more than 1,000
258 more than 10,000
and 89 more than 100,000 times

On average each type is used 3277.7 times, and the top 10 are:

#triples	type
215,536,042	gr:BusinessEntity
168,208,826	gr:ProductOrServiceModel
71,520,943	foaf:Person
10,447,941	foaf:OnlineAccount
6,886,401	rss:item
6,066,069	rdf:Statement
4,674,162	foaf:Document
4,260,056	http://purl.uniprot.org/core/Resource
4,001,282	http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry
3,405,101	owl:Class

Not identical to the top resources, but quite similar. Lots of FOAF and new this year, lots of GoodRelations.

Contexts

Something changed with regard to context handling for BTC2010, this year we only have 8M contexts, last year we had over 35M.
I wonder if perhaps all of dbpedia is in one context this year?

8,126,834 unique contexts
8,048,574 occur in more than a single triple
6,211,398 more than 10 times
1,493,520 more 100
321,466 more than 1,000
61,360 more than 10,000
and 4799 more than 100,000 times

For an average of 389.958 triples per context. The 10 biggest contexts are:

#triples	context
302,127	http://data-gov.tw.rpi.edu/raw/402/data-402.rdf
273,644	http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
259,824	http://static.cpantesters.org/author/M/MIYAGAWA.rss
207,513	http://data-gov.tw.rpi.edu/raw/401/data-401.rdf
193,944	http://static.cpantesters.org/author/D/DROLSKY.rss
189,528	http://static.cpantesters.org/author/S/SMUELLER.rss
170,899	http://data-gov.tw.rpi.edu/raw/59/data-59.rdf
166,454	http://zaltys.net/ontology/AKTiveSAOntology.owl
166,454	http://www.zaltys.net/ontology/AKTiveSAOntology.owl
165,948	http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl

This concludes my boring stats dump for BTC2010 for now. Some information on literals and hopefully some graphs will come soon! I also plan to look into how these stats changed from last year – so far I see much more GoodRelations, but there must be other fun changes!

Posted by gromgull at 2:07 pm on September 1st, 2010. 3 comments... »
Categories: Billion Triple Challenge, Statistics, Uncategorized.

Semantic Web Clusterball

From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

I started this is a part of the Billion Triple Challenge work, it shows the how different sites on Semantic Web are linked together. The whole thing is an interactive SVG, I could not get it to embed here, so click on that image and mouse over things and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate, mouse over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X, it looked roughly ok in Chrome though.)

The data is extracted from the BTC triples by computing the Pay-Level-Domain (PLD, essentially the top-level domain, but with special rules for .co.uk domains and similar) for the subjects and objects, and if they differ, count the predicates that link them. I.e. a triple:

dbpedia:Albert_Einstein rdf:type foaf:Person.

would count as a link between http://dbpedia.org and http://xmlns.com for the rdf:type predicate. Counting all links like this gives us the top cross-domain linking predicates:

predicate	links
http://www.w3.org/1999/02/22-rdf-syntax-ns#type	60,813,659
http://www.w3.org/2000/01/rdf-schema#seeAlso	16,698,110
http://www.w3.org/2002/07/owl#sameAs	4,872,501
http://xmlns.com/foaf/0.1/weblog	4,627,271
http://www.aktors.org/ontology/portal#has-date	3,873,224
http://xmlns.com/foaf/0.1/page	3,273,613
http://dbpedia.org/property/hasPhotoCollection	2,556,532
http://xmlns.com/foaf/0.1/img	2,012,761
http://xmlns.com/foaf/0.1/depiction	1,556,066
http://www.geonames.org/ontology#wikipediaArticle	735,145

Most frequent is of course rdf:type, since most schemas are from different domains to the data, and most things have a type. The ball linked above is excluding type, since it’s not really a link. You can also see a version including rdf:type. The rest of the properties are more link-like, I am not sure what is going on with the akt:has-date though, anyone?

The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: Wikipedia Clusterball. His is nicer since he has core nodes inside the ball. He points out that the “clustering” of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My “clustering” method was very simple, I swap each node with the one giving me the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It’s essentially a greedy hill-climbing method, and I am sure it’s far from optimal, but it does at least something. For comparison, here is the ball on top without clustering applied.

The whole thing was of course hacked up in python, the javascript for the mouse-over etc. of the SVG uses prototype. I wanted to share the code, but it’s a horrible mess, and I’d rather not spend the time to clean it up. If you want it, drop me a line., see below. The data used to generate this is available either as a download: data.txt.gz (19Mb, 10,000 host-pairs and top 500 predicates), or a subset on Many Eyes (2,500 host-pairs and top 100 predicates, uploading 19Mb of data to Many Eyes crashed my Firefox :)

UPDATE: Richard Stirling asked for the code, so I spent 30 min cleaning it up a bit, grab it here: swball_code.tar.gz It includes the data+code needed to recreate the example above.

Posted by gromgull at 12:25 pm on January 6th, 2010. 2 comments... »
Categories: Billion Triple Challenge, in progress, SVG, Visualisation.

An Objective look at the Billion Triple Data

For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it’ll be mostly tables. Enjoy :)

The BTC data contains 279,710,101 unique objects in total. Out of these:

90,007,431 appear more than once
7,995,747 more than 10 times
748,214 more than 100
43,479 more than 1,000
3,209 more than 10,000

The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects are:

#triples	object
2,584,960	http://www.geonames.org/ontology#P
2,645,095	http://www.aktors.org/ontology/portal#Article-Reference
2,681,771	http://www.w3.org/2002/07/owl#Class
5,616,326	http://www.aktors.org/ontology/portal#Person
7,544,903	http://www.geonames.org/ontology#Feature
9,115,801	http://en.wikipedia.org/
12,124,378	http://xmlns.com/foaf/0.1/OnlineAccount
13,687,049	http://purl.org/rss/1.0/item
14,172,852	http://rdfs.org/sioc/types#WikiArticle
38,795,942	http://xmlns.com/foaf/0.1/Person

Apart from the wikipedia link, all are types. No literals appear in the top 10 table. For the 116M unique literals we have 12,845,021 literals with a language tag and 2,067,768 with a datatype tag. The top 10 literals are:

#triples	literal
722,221	“0”^^xsd:integer
969,929	“1”
1,024,654	“Nay”
1,036,054	“Copyright © 2009 craigslist, inc.”
1,056,799	“text”
1,061,692	“text/html”
1,159,311	“0”
1,204,996	“en-us”
2,049,638	“Aye”
2,310,681	“application/rdf+xml”

I can’t be bothered to check it now, but I guess the many Aye’s & Nay’s come from IRC chatlogs (#SWIG?).

Finally, I looked at the length of the literals used in the data, the longest literal is 65,244 unicode characters long (I wonder about this — this seems very close to 2¹⁶ bytes, some unicode characters with more than one byte, could it be truncated?). The distribution of literals/lenghts looks like this:

The most literals are around 10 characters in length, there is a peak for 19, which I seem to remember was caused by the standard time format (i.e. 2005-10-30T10:45UTC) being exactly 19 characters.

That’s it! I believe I now have published all my numbers on BTC :)

Posted by gromgull at 4:44 pm on December 11th, 2009. No comments... »
Categories: Billion Triple Challenge, Semantic Web, Statistics, Uncategorized.

DBTropes

Know TvTropes.org? As pointed out by XKCD, a great place to lose hours of time reading about SoBadIt’sHorrible, HighOctaneNightmareFuel and thousands of other tropes, all with examples from comics, films, tv-series etc.

DFKI colleague Malte Kiesel has done the right thing and just released his linked open data wrapper for tvtropes, natuerally names dbTropes.org. Now go read about DiabolusExMachina, it will of course do content-negotiation so try it with your favourite RDF browser.

I helped too — I made the stylesheet and the “logo” :)

Posted by gromgull at 2:31 pm on December 10th, 2009. No comments... »
Categories: Everything Else, Semantic Web.

Typical Semantic Web Data

This is the fourth of my Billion Triple Challenge data-set statistics posts, if you only just got here, catch up on part I, II or III.

I had these numbers ready for a long time, but never found the time to type it up as the it is not so exciting. However CaptSolo asked for it now to put in his very-soon-to-be-finished thesis, so I’ll hurry up. This is all about the classes used in the BTC data, i.e. the rdf:type triples.
Overall the data contains 143,293,758 type triples, assigning 283,815 different types to 104,562,695 different things. For the types themselves:

213,281 types are used more than once
94,455 used more than 10
14,862 more than 100
1,730 more than 1000
288 more than 10000

If we take only these 288 top ones we cover 92% of all types triples, we can cover 90% of the typed things with only 105 types and over 50% of the data with only foaf:Person, sioc:WikiArticle, rss:Item and foaf:OnlineAccount. Out of all the “types” used 12,319 were BNodes, which is odd, but I guess possible, and 204 are literals, which is even odder. The top 10 types are:

#triples	type URI
1,859,499	wordnet:Person
2,309,652	foaf:Document
2,645,091	akt:Article-Reference
2,680,081	owl:Class
5,616,163	akt:Person
7,544,797	geonames:Feature
12,123,375	foaf:OnlineAccount
13,686,988	rss:item
14,172,851	sioc:WikiArticle
38,790,680	foaf:Person

Now for the things the types are assigned to, out of the 104,562,965 things with types, 52,865,376 are BNodes. If you pay attention you will now have realised that many things have more than one type assigned (143M type triples⇒104M things). In fact:

7,026,972 things have more than one type triple.
612,467 has more than 10
35,201 more than 100
1,025 more than 1,000
40 more than 10,000

Note I am talking here of type triples, i.e. the top 40 things may well have the same type assigned 10,000 times. The things having over 10,000 types assigned is a product of the partially inclusion of inferred triples in the data. For instance, for every context where RDFS inference has been applied, all properties will have rdf:type rdf:Property inferred. Looking at the number of unique types per thing shows that:

2,979,968 things have more than one type
78,208 have more than 10
4 more than 100

The 10 things with most unique types are all pretty boring:

#types	URI
74	http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000959f60
75	http://dbpedia.org/resource/Arnold_Schwarzenegger
88	http://oiled.man.example.net/test#V822576
91	http://oiled.man.example.net/test#V21027
91	http://oiled.man.example.net/test#V21029
91	http://oiled.man.example.net/test#V21030
105	http://oiled.man.example.net/test#V16459
136	http://www.w3.org/2002/03owlt/description-logic/consistent501#T
136	http://www.w3.org/2002/03owlt/description-logic/inconsistent502#T
171	http://oiled.man.example.net/test#V21026

Likewise the 10 things with the most types assigned, all product of materialised inferred triples:

#triples	URI
57,533	http://sw.opencyc.org/2008/06/10/concept/
58,838	http://semantic-mediawiki.org/swivt/1.0#creationDate
58,838	http://semantic-mediawiki.org/swivt/1.0#page
58,838	http://semantic-mediawiki.org/swivt/1.0#Subject
89,521	http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA
121,138	http://en.wikipedia.org/
159,773	http://sw.opencyc.org/concept/
232,505	http://sw.cyc.com/CycAnnotations_v1#label
361,113	http://xmlns.com/foaf/0.1/holdsAccount
465,010	http://sw.cyc.com/CycAnnotations_v1#externalID

That’s it — I hope it changed your life! :)

Posted by gromgull at 5:15 pm on September 28th, 2009. One comment... »
Categories: Billion Triple Challenge, Semantic Web, Statistics.

Heat-maps of Semantic Web Predicate usage

It’s all Cygri‘s fault — he encouraged me to add schema namespaces to the general areas on the semantic web cluster-tree. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full day later I have:

It’s the same map as last time, laid using graphviz’s neato as before. The heat-map of the properties was computed from the feature-vector of predicate counts, first I mapped all predicates to their “namespace”, by the slightly-dodgy-but-good-enough heuristic of taking the part of the URI before the last # or / character. Then I split the map into a grid of NxN points (I think I used N=30 in the end), and compute a new feature vector for each point. This vector is the sum of the mapped vector for each of the domains, divided by the distance. I.e. (if you prefer math) each point’s vector becomes:

$\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y), pos_d)}}$

Where $D$ is the distance (here simple 2d euclidean), $d$ is each domain, $pos_d$ is that domains position in the figure and $V_d$ is that domains feature vector. Normally it would be more natural to decrease the effect by the squared distance, but this gave less attractive results, and I ended up square-rooting it instead. The color is now simply on column of the resulting matrix normalised and mapped to a nice pylab colormap.

Now this was the fun and interesting part, and it took maybe 1 hour. As predicted. NOW, getting this plotted along with the nodes from the graph turned out to be a nightmare. Neato gave me the coordinates for the nodes, but would change them slightly when rendering to PNGs. Many hours of frustration later I ended up drawing all of it again with pylab, which worked really well. I would publish the code for this, but it’s so messy it makes grown men cry.

NOW I am off to analyse the result of the top-level domain interlinking on the billion triple data. The data-collection just finished running while I did this. … As he said.

Posted by gromgull at 2:10 pm on September 11th, 2009. No comments... »
Categories: Billion Triple Challenge, Python, Semantic Web, Statistics, Visualisation.

Visualising predicate usage on the Semantic Web

So, not quite a billion triple challenge post, but the data is the same. I had the idea that I compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates is used within each one. Then once I had the distance-metric, I could use FastMap to visualise it. It would be a quick hack, it would look smooth and great and be fun. In the end, many hours later, it wasn’t quick, the visual is not smooth (i.e. it doesn’t move) and I don’t know if it looks so great. It was fun though. Just go there and look at it:

As you can see it’s a large PNG with the new-and-exciting ImageMap technology used to position the info-popup, or rather used to activate the JavaScript used for the popups. I tried at first with SVG, but I couldn’t get SVG and XHTML and Javascript to play along, I guess in Firefox 5 it will work. The graph is laid out and generated Graphviz‘s neato, which also generated the imagemap.

So what do we actually see here? In short, a tree where domains that publish similar Semantic Web data are close to each other in the tree and have similar colours. In detail: I took the all PLDs that contained over 1,000 triples, this is around 7500, and counted the number of triples for each of the 500 most frequent predicates in the dataset. (These 500 predicates cover ≈94% of the data). This gave me a vector-space with 500 features for each of the PLDs, i.e. something like this:

	geonames:nearbyFeature	dbprop:redirect	foaf:knows	…
dbpedia.org	0.01	0.8	0.1
livejournal.org	0	0	0.9
geonames.org	0.75	0	0.1
…

Each value is the percentage of triples from this PLD that used this predicate. In this vector space I used the cosine-similarity to compute a distance matrix for all PLDs. With this distance matrix I thought I could apply FastMap, but it worked really badly and looked like this:

So instead of FastMap I used maketree from the complearn tools, this generates trees from a distance matrix, it generates very good results, but it is an iterative optimisation and it takes forever for large instances. Around this time I realised I wasn’t going to be able to visualise all 7500 PLDs, and cut it down to the 2000, 1000, 500, 100, 50 largest PLDs. Now this worked fine, but the result looked like a bog-standard graphviz graph, and it wasn’t very exciting (i.e not at all like this colourful thing). Now I realised that since I actually had numeric feature vectors in the first place I wasn’t restrained to using FastMap to make up coordinates, and I used PCA to map the input vector-space to a 3-dimensional space, normalised the values to [0;255] and used these as RGB values for colour. Ah – lovely pastel.

I think I underestimated the time this would take by at least a factor of 20. Oh well. Time for lunch.

Posted by gromgull at 12:02 pm on September 9th, 2009. 3 comments... »
Categories: Billion Triple Challenge, Python, Semantic Web, Statistics, Visualisation.

The Subject Matter (or it’s a scam – there are only 900M!)

This is the next part of the BTC statistics, this time I look at the subjects of the triples. Oh my, isn’t it exciting. Actually, I’ve had all the numbers for this ready for a while, but holidays and real work has kept me from typing it up. So, BTC overall contains:

128,079,322 unique subjects
118,205,618 has more than a single triple
19,037,202 more than 10
1,302,353 more than 100
25,741 more than 1000
223 more than 10000

Out of these 128M subjects 59,423,933 are blank nodes. Only 17,089 of them are file:// URIs, I really expected many more to have snuck in. At first sight it may seem very odd that so many subjects have more than 1000 triples — what could those possibly be? However, when looking at the 10 subjects with the most triples it becomes clear:

138,618	swrc:InProceedings
154,721	http://sw.opencyc.org/concept/Mx4rZOAVeiYGEdqAAAACs2IMmw
172,599	http://sw.opencyc.org/2008/06/10/concept/
195,167	dctype:Text
209,623	foaf:Document
358,090	http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA
362,161	foaf:holdsAccount
479,323	http://sw.opencyc.org/concept/
697,520	http://sw.cyc.com/CycAnnotations_v1#label
930,025	http://sw.cyc.com/CycAnnotations_v1#externalID

Most of these are parts of schemas, i.e. properties or classes (perhaps all? I don’t know enough about CYC use to say what http://sw.opencyc.org/2008/06/10/concept/ is). Looking at the data, out of the hundred-thousand of triples about foaf:holdsAccount for instance, 180,552 of the triples are:
foaf:holdsAccount rdf:type rdfs:Property .
And 180,390 are the triple:
foaf:holdsAccount rdf:type owl:InverseFunctionalProperty .
Of course each of these are in different context. At first I thought this meant that someone was keeping hundreds of thousand of the FOAF ontology around, but of course then all the other FOAF properties and classes would also be the subject of lots of triples. Looking at the contexts where these triples came from there are 180,574 contexts containing the first triple. 180,389 of them are from Kanzaki’s flickr2foaf script (the remaining are 150 variations on http://xmlns.com/foaf and 30 odd random contexts). However, the output from flickr2foaf does not include the schema information, it only uses use foaf:holdsAccount (and many foaf:OnlineAccount instances). My guess to what has happened is that someone has crawled this, each profile, such as mine will contain rdfs:seeAlso links to all my flickr contacts, and each of these pages will use foaf:holdsAccount. Then they applied some sort of inference that materialised the triples above, adding it once for each context it appeared in. This inference cannot be basic RDFS inference, since it also adds owl:InverseFunctionalProperty, and it has not been applied to all the BTC data, but only to some context. I wonder if there is a way to recover which contexts this has been applied to, and then perhaps finding out which triples are redundant, i.e. they could be re-inferred from the other triples?

Now, all these triples about foaf:holdsAccount and CYC concepts also tells us something else: this isn’t really the Billion Triple Challenge, since many of the triples are duplicate, it is the Billion Quad challenge, which I guess is not so catchy. A few more CPU cycles spent on piping things through sort, and uniq (my favourite activity!) I know that out of the original 1,151,383,508 quads, there are actually only 1,150,846,965 uniqe quads, i.e. about 500K duplicates, and more interestingly, there are only 906,166,056 unique triples, i.e. 245M duplicates. I guess it’s not the Billion triple challenge either :) — now with only 900M triples it should be easy!

(BTW: No graphs this time, sorry! Also — I know I said I would talk about the literal values this time, but I changed my mind, next time!)

UPDATE:

Gianluca Demartini asked an interesting question: Why is nearly half the subjects blank nodes? I don’t really know – but I can speculate. 46% of the subject IDs are blank-nodes, these account for ≈30% of the triples in the dataset. I was hoping these 30% would be badly distributed i.e. that there was some few blank nodes with lots and lots of triples, but alas, the blank-node/triple distribution breaks down like this :

57,457,905 – over 1
1,931,363 – over 10
189,487 – over 100
3,901 – over 1000
50 – over 10000

You need to include the 43,916,862 largest bnodes descriptions to cover 90% of these triples, i.e. we cannot quickly ignore the biggest ones and move on with our lives. I wont give you the top N bnodes since this is more or less random generated IDs, but looking at some of the “largest” bnodes they all look like sitemap files that have been converted to RDF — for example, the largest blank node is _:genid1http-3A-2F-2Fwww-2Eindexedvisuals-2Ecom-2Findexedvisuals-2Exml, this appers to be an RDF version of the sitemap for www.indexedvisuals.com

Now, this bnode alone is the subject of 32,984 triples, and all of these apart from one is a triples with property http://www.google.com/schemas/sitemap/0.84url and another bnode as an object. I guess this is the case for many of the largest bnodes, and probably many of those nodes in return. (Although a highly scientific grep for bnode IDs that contain “sitemap” returns only about 100K cases — a better count is underway.)

So in conclusion — bah! Who knows? Who needs bnodes anyway? :)

UPDATE II:

I did a proper count of how many of the blank-nodes are sitemap nodes like the indexedvisuals above, and it’s only 27! :) There goes that theory. These 27 do account for 71,985 triples with the 0.84url predicate, but this is still a tiny amount of the data. In the next post we will also see that a huge percentage of these bnodes have proper types, giving additional evidence that they are genuine real interesting parts of the data, not just some weird artifact.

Posted by gromgull at 3:01 pm on August 17th, 2009. 3 comments... »
Categories: Billion Triple Challenge, Semantic Web, Statistics.

Billions and billions and billions (on a map)

Time for a few more BTC statistics, this time looking at the contexts. The BTC data comes from 50,207,171 different URLs, out of these:

35,423,929 yielded more than a single triple
10,278,663 yielded more than 10 triples, and covers 85% of the full data.
1,574,458 more than 100 covers 63%
133,369 more than 1000 covers 30%
3,759 more than 10000 covers 7%

The biggest context were as follows:

triples	context
7,186,445	http://sw.deri.org/svn/sw/2008/03/MetaX/vocab.rdf#aperture-1.2.0
410,659	http://lsdis.cs.uga.edu/projects/semdis/swetodblp/march2007/swetodblp_march_2007_part_22.rdf
273,644	http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
237,685	http://dbpedia.org/resource/リレット
196,239	http://dbpedia.org/resource/スケルツォ
194,730	http://dbpedia.org/resource/オークション
178,842	http://www.reveredata.com/reports/store/index-companies.rss
165,948	http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl
165,506	http://www.cs.man.ac.uk/~dturi/ontologies/go-assocdb/go-termdb.owl
160,592	http://dbpedia.org/resource/をがわいちろを

It’s pretty cool that someone crawled 7 million triples with aperture and put it online :) – the link is 404 now though, so you can’t easily check what it was. Also, none of the huge dbpedia pages seem to give any info, I am not quite sure what is going on there. Perhaps some encoding trouble somewhere?

As the official BTC statistics page already shows, it is more interesting when you group the context by the ones with the same host, computing the same Pay-Level-Domains as they did I get the hosts contributing the most triples as:

triples	context
278,566,771	dbpedia.org
133,266,773	livejournal.com
94,748,441	rkbexplorer.com
84,896,760	geonames.org
61,339,034	mybloglog.com
53,492,284	sioc-project.org
23,970,898	qdos.com
23,745,914	hi5.com
23,459,199	kanzaki.com
17,691,303	rdfabout.com
15,784,386	plode.us
15,208,914	dbtune.org
13,548,946	craigslist.org
10,155,861	l3s.de
10,028,115	opencyc.org

Again, this is computed from the whole dataset, not just a subset, but interestingly it differs quite a lot from the “official” statistics, in fact, I’ve “lost” over 100M triples from dbpedia. I am not sure why this happens, a handful of context URLs where so strange that python’s urlparse module did you produce a hostname, but they only account for about 100,000 triples. Summing for the hosts I did find I get the right number of triples (i.e. one billion :). So unless there is something fundamentally wrong about the way I find the PLD, I am almost forced to conclude that the official stats are WRONG!

UPDATE: The official numbers must be wrong, because if you sum them all you get 1,504,548,700 – i.e. over 1.5Billion triples for just the top 50 domains alone. This cannot be true, since actual number of triples is “just” 1,151,383,508.

More fun than the table above is using hostip.info to geocode the IPs of these servers and put them on map. Now the hostip database is not perfect, in fact, it’s pretty poor, some hosts with A LOT of triples are missing (such as livejournal.com). I could perhaps have used the country codes of the URLs as a fall-back solution, but I was too lazy.

Now for drawing the map I thought I could use Many Eyes, but it turned out not to be as easy as I imagined. After uploading the dataset I found that although Many Eyes has a map visualisation, it does not use lat/lon coordinates, but relies instead on country name. Here is what it would have looked like if done by lat/lon, you have to imagine the world map though:

Trying again, I used the hostip.info database again, and got the country of each host, and added up the numbers for each country (Many Eyes does not do any aggregation) and uploaded a triples by country dataset. This I could visualise on a map, shading each country according to the number of triples, but it’s kinda boring:

Giving up on Many Eyes I tried the Google Visualisation API instead. Surely they would have a smooth zoomable map visualisation? Not quite. They have a map, it’s flash based, only supports “zoom” into pre-defined regions and does a complete reload when changing region. Also, it only supports 400 data points. All the data is embedded in the Javascript though. I couldn’t get it to embed here, so click:

Now I am sure I could hack something together than would use proper Google maps, and would actually let you zoom nicely, etc. BUT I think I’ve really spent enough time on this now.

Keep your eyes peeled for the next episode where we find out why the semantic web has more triples of length 19 than any other.

Posted by gromgull at 12:50 pm on July 16th, 2009. One comment... »
Categories: Billion Triple Challenge, Semantic Web, Statistics, Visualisation.

BTC Statistics I

As I said, I wanted to try looking into the billion triple challenge using unix command-line tools. ISWC deadline set me back a bit, but now I’ve got it going.

First step was to get rid of those pesty literals as they contain all sort of crazy character that make my lazy parsing tricky. A bit of python later and I converted:

<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/name> "Edd Dumbill" <http://www.w3.org/People/Berners-Lee/card> . <http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/nick> "edd" <http://www.w3.org/People/Berners-Lee/card> . <http://bblfish.net/people/henry/card#me> <http://xmlns.com/foaf/0.1/name> "Henry Story" <http://www.w3.org/People/Berners-Lee/card> .
to
<http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/name> "000_1" <http://www.w3.org/People/Berners-Lee/card> . <http://www.w3.org/People/Berners-Lee/card#edd> <http://xmlns.com/foaf/0.1/nick> "000_2" <http://www.w3.org/People/Berners-Lee/card> . <http://bblfish.net/people/henry/card#me> <http://xmlns.com/foaf/0.1/name> "000_3" <http://www.w3.org/People/Berners-Lee/card> .
i.e. each literal was replaced with chunknumber_literalnumber, and the actual literals stored in another file. Now it was open for simply splitting the files by space and using cut, awk, sed, sort, uniq, etc. to do everything I wanted. (At least, that’s what I though, as it turned out the initial data contained URIs with spaces, and my “parsing” broke … then I fixed it by replacing > < with >\t<, and used tab as a field delimiter and I was laughing. The data has now been fixed, but I kept my original since I was too lazy to download 17GB again)

So, now I’ve computed a few random statistics, nothing amazingly interesting yet. I’ll put a bit her eat a time, today: THE PREDICATES!

The full data set contains 136,188 unique predicates of these:

112,966 occur more than once
62,937 more than 10 times
24,125 more than 100
8,178 more than 1000
2,045 more than 10000

623 of them have URIs starting with <file://> – they will certainly be very useful for the semantic web.

Note that although 136k different predicates seems like a great deal, many of them are hardly used at all, in fact, if you only look at the top 10,000 most used predicates, you still cover 92% of the triples.

As also mentioned on the official BTC stats page, the most used predicates are:

triples	predicate
156,448,093	http://dbpedia.org/property/wikilink
143,293,758	rdf:type
53,869,968	rdfs:seeAlso
35,811,115	foaf:knows
32,895,374	foaf:nick
23,266,469	foaf:weblog
22,326,441	dc:title
19,565,730	akt:has-author
19,157,120	sioc:links_to
18,257,337	skos:subject

Note that these are computed from the whole corpus, not just a sample, and for instance for the top property there is a difference of a massive 13,139. That means the official stats are off by almost 0.01%! I don’t know how we can work under these conditions…

Moving on I assigned each predicate to a namespace, I did this by matching them with the list at prefix.cc, if the the URI didn’t start with any of those I made the namespace the URI up to the last # or /, whatever appeared later. The most used namespaces were:

triples	namespace
244,854,345	foaf
224,325,132	dbpprop
167,911,029	rdf
807,21,580	rdfs
64,313,022	akt
63,850,346	geonames
58,675,733	dc
44,572,003	rss
31,502,395	sioc
21,156,972	skos
14,801,992	geo
10,691,295	http://dbpedia.org/ontology
10,239,809	http://www.kisti.re.kr/isrl/ResearchRefOntology
9,812,367	content
9,661,682	http://www.rdfabout.com/rdf/schema/vote
8,623,124	owl
6,837,606	http://rdf.freebase.com/ns
6,813,536	xhtml
5,443,549	nie

I included the top 19, since number is the NEPOMUK Information Element Ontology, and I found it funny that it was used so widely. Another thing that is funny is that RDFS is used more than 10x as much as OWL (even ignoring the RDF namespace, defining things like rdf:Property, also used by schemas). I tried to plot this data as well, since Knud pointed out that you need a nice long-tail graph these days. However, for both predicates and namespaces there are a (relatively) huge number of things that only occur once or twice, if you plot a histogram these dominate the whole graph, even with logarithmic Y axis. In the end I’ve ended up plotting the run length encoding of the data, i.e. how many namespaces occur once, twice, three times, etc. :

Here the X axis shows how the number of occurrences and the Y axis shows how many things occur this often. I.e. the top left point is all the random noise that occurs once, such as file:/cygdrive/c/WINDOWS/Desktop/rdf.n3, file:/tmp/filem8INvE and other useful URLs. The bottom two right points are foaf and dbprop.

I don’t know about the graph – I have a feeling it lies somehow, in a way a histogram doesn’t. But I don’t know. Anyone?

Anyway – most things of the BTC I have plotted have a similarily shaped frequency distribution, i.e. the plain predicate frequencies, the subject/object frequencies are all the same. The literals are more interesting, if I have the time I’ll write them up tomorrow. Still it’s all pretty boring – I hope to detect duplicate triples from different sources once I’m done with this. I expect to find at least 10 copies of the FOAF schema.

Posted by gromgull at 8:55 pm on June 25th, 2009. 5 comments... »
Categories: Billion Triple Challenge, R, Semantic Web, Statistics, Visualisation.