Posts categorized “Billion Triple Challenge”.

Some basic BTC2012 Stats

(The figure shows the biggest domains publishing data, and the links between them – mouse over the edges to highlight them, and choose the linking predicate from the drop-down list)

So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. In previous years I’ve done all of this using unix tools, sed/awk/grep and friends. This year I figured I’d do it all in python. To get reasonable performance, two things were crucial:

  • the python gzip module has decompression implemented in python; using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!)
  • I wrote an N-Quads “parser” in cython, taking advantage of the very regular output of ld-spider
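A rough sketch of both tricks in plain python (not cython). The function names and the regular expression are my own illustration, not the code used for these stats; the pattern assumes one well-formed quad per line and will not cope with every legal N-Quads construct. It calls `gzip -dc`, which does the same job as gunzip.

```python
import re
import subprocess

# Hypothetical pattern for "very regular" N-Quads lines: subject, predicate,
# object and context separated by whitespace, terminated by " ."
QUAD = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'                       # subject: IRI or bnode
    r'(<[^>]*>)\s+'                              # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"\S*)\s+'   # object: IRI, bnode or literal
    r'(<[^>]*>)\s*\.\s*$'                        # context: IRI
)


def parse_quad(line):
    """Return (s, p, o, c) for a regular N-Quads line, or None if it does not match."""
    m = QUAD.match(line)
    return m.groups() if m else None


def stream_lines(path):
    """Yield decoded lines from a .gz file by reading a pipe from `gzip -dc`."""
    proc = subprocess.Popen(["gzip", "-dc", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8", errors="replace")
    finally:
        proc.stdout.close()
        proc.wait()
```

Something like `for line in stream_lines("data.nq.gz"): quad = parse_quad(line)` would then be the main loop (the file name is made up).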

This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.
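For the curious, the caching idea looks roughly like this. It is only a sketch under my own assumptions: a plain dict stands in for LevelDB, and the eviction rule (write out the least frequent half when the cache is full) is a guess at what a “thin layer” could be, not the exact policy I used.

```python
class CachedCounter:
    """Counts keys, keeping the most frequently used ones in memory."""

    def __init__(self, store, max_hot=100_000):
        self.store = store      # dict-like backing store (stand-in for LevelDB)
        self.hot = {}           # in-memory counts for frequently seen keys
        self.max_hot = max_hot

    def add(self, key, n=1):
        if key not in self.hot:
            if len(self.hot) >= self.max_hot:
                self._evict()
            # Pull any earlier count for this key back from the store.
            self.hot[key] = self.store.get(key, 0)
        self.hot[key] += n

    def _evict(self):
        # Write the least frequent half of the in-memory keys to the store.
        by_count = sorted(self.hot, key=self.hot.get)
        for key in by_count[: max(1, len(by_count) // 2)]:
            self.store[key] = self.hot.pop(key)

    def flush(self):
        # Write everything still in memory to the store.
        self.store.update(self.hot)
        self.hot.clear()
```

With skewed data most increments hit the in-memory dict, and only the long tail ever touches the store.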

I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:

  • 1.4B triples all in all
  • 1082 different namespaces are used
  • 9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domain, essentially data.gov.uk, instead of gov.uk, but livejournal.com, instead of bob.livejournal.com)
  • 183M unique subjects are described
  • 57k unique predicates
  • 192M unique resources as objects
  • 156M unique literals
  • 152M triples are rdf:type statements, 296k types are used. Resources with multiple types are common: 45M resources have two types, 40M just one.

 

Top 10 Context PLDs

count context pld
751,352,061 data.gov.uk
198,090,262 dbpedia.org
101,241,556 freebase.com
101,082,592 livejournal.com
44,331,145 opera.com
41,544,819 dbtropes.org
39,200,538 legislation.gov.uk
36,969,163 identi.ca
29,447,217 ontologycentral.com
14,949,592 rdfize.com

 

Top 10 Namespaces

count namespace
336,911,630 http://www.w3.org/1999/02/22-rdf-syntax-ns#
191,669,089 http://www.w3.org/2000/01/rdf-schema#
143,650,096 http://xmlns.com/foaf/0.1/
133,845,241 http://reference.data.gov.uk/def/intervals/
115,692,342 http://www.w3.org/2006/time#
71,016,514 http://www.w3.org/2006/http#
69,715,106 http://rdf.freebase.com/ns/
66,058,545 http://www.w3.org/2004/02/skos/core#
53,246,991 http://purl.org/dc/terms/
50,444,755 http://dbpedia.org/property/

 

Top 10 Types

count type
39,345,307 intervals:Second
39,345,280 intervals:CalendarSecond
12,841,127 foaf:Person
7,623,831 foaf:Document
1,896,136 qb:Observation
1,851,173 fb:common.topic
1,712,877 intervals:Minute
1,712,875 intervals:CalendarMinute
1,328,921 owl:Thing
1,280,763 metalex:BibliographicExpression

As usual, although many namespaces/hosts/types are used, the distribution is skewed: the most common elements quickly account for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context-plds/namespaces occurring more than N times (the X axis is logarithmic):

So the steeper the curve, the longer the tail of infrequently occurring elements. For example, less than 5% of types occur more than 100 times, but very few context-plds occur fewer than 10 times. However, when you look at the actual density, the picture changes. Here we plot the cumulative density, so although most types occur fewer than 100 times, the majority of the data uses only the most frequent types:

So the steeper the curve at the end, the more of the data is covered by the few most frequent elements. For example, the top 5% most frequent namespaces and context-plds cover over 99% of the data, but the top 5% of types “only” 97%.
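The two views boil down to two small functions over a dict of counts. This is a minimal sketch with made-up toy numbers, not the pylab code behind the graphs; the function names are mine.

```python
from collections import Counter


def share_above(counts, n):
    """Fraction of unique elements that occur more than n times."""
    return sum(1 for c in counts.values() if c > n) / len(counts)


def coverage_of_top(counts, fraction):
    """Fraction of all occurrences covered by the top `fraction` of elements."""
    ordered = sorted(counts.values(), reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Toy counts (made up, not BTC data): one dominant type and a long tail.
toy = Counter({"foaf:Person": 900, "foaf:Document": 80, "ex:A": 10,
               "ex:B": 5, "ex:C": 3, "ex:D": 1, "ex:E": 1})
print(share_above(toy, 100))       # share of types occurring more than 100 times
print(coverage_of_top(toy, 0.15))  # share of the data covered by the top 15% of types
```

The first function gives one point on the first graph, the second one point on the density graph.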

A different (maybe useless?) view of this is this histogram with exponentially increasing bucket sizes, again with a log scale, so the buckets look the same size:

Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distribution of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I won’t put it here.
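For what it is worth, the bucketing itself is simple. A sketch, assuming buckets that double in size (the actual base I used may have differed):

```python
from collections import Counter


def log2_buckets(counts):
    """Map bucket index b to the number of elements with 2**b <= count < 2**(b+1)."""
    # For a positive int c, c.bit_length() - 1 equals floor(log2(c)).
    return Counter(c.bit_length() - 1 for c in counts if c > 0)


print(log2_buckets([1, 1, 2, 3, 4, 7, 8, 1000]))
```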

I don’t even want to talk about how long I spent making these graphs. I wanted to graph this since the first BTC dataset I looked at, but previously always fell back on “top n% of the elements cover n% of the data” tables.
The graphs are all done in pylab, exported as SVG (yay!). Playing with them was all done with the ipython notebook, which is really pleasant to work with.

Finally – the Chord-diagram on top shows links between context PLDs – mouse over each host to see outgoing links. This is only the top 19 PLD domains and the top 10 properties linking domains that themselves publish RDF data – this is important, as there are predicates used to link to non-semantic web resources that dominate otherwise. The graphic and interaction are all done with the excellent D3 Library.

I will try to come up with some more interesting visualisations based on links between instances of various types soon!

Schema usage in the BTC2010 data

A little while back I spent about 1 CPU week computing which hosts use which namespaces in the BTC2010 data, i.e. I computed a matrix with hosts as rows, schemas as columns and, in each cell, the number of triples using that namespace that each host published. My plan was to use this to create a co-occurrence matrix for schemas, and then use this for computing similarities for hierarchical clustering. And I did. And it was not very amazing. Like Ed Summers’ neat LOD graph, I wanted to use Protovis to make it pretty. Then, after making one version uglier than the next, I realised that just looking at the clustering tree as a javascript datastructure was just as useful, and I gave up on the whole clustering thing.
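A small sketch of the matrix-to-similarity step, on a made-up toy matrix. The dict-of-dicts layout and the choice of cosine similarity are assumptions for illustration; I am not claiming this is the exact measure that went into the clustering.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence(matrix):
    """Count, for each pair of namespaces, the hosts that use both."""
    pairs = defaultdict(int)
    for namespaces in matrix.values():
        for a, b in combinations(sorted(namespaces), 2):
            pairs[a, b] += 1
    return dict(pairs)


def cosine(matrix, ns_a, ns_b):
    """Cosine similarity of two namespace columns (triple counts per host)."""
    dot = sum(row.get(ns_a, 0) * row.get(ns_b, 0) for row in matrix.values())
    norm_a = sqrt(sum(row.get(ns_a, 0) ** 2 for row in matrix.values()))
    norm_b = sqrt(sum(row.get(ns_b, 0) ** 2 for row in matrix.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up toy matrix: host -> {namespace: number of triples}
matrix = {
    "example.org": {"foaf": 10, "rdfs": 5},
    "example.net": {"foaf": 20, "dc11": 2},
    "example.com": {"rdfs": 7, "dc11": 1},
}
print(cooccurrence(matrix))
print(cosine(matrix, "foaf", "rdfs"))
```

The pairwise similarities would then be the input to whatever hierarchical clustering routine you prefer.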

Not wanting the spent CPU hours to go to waste, I instead coded up a direct view of the original matrix. Getting a bit carried away, I made a crappy non-animated, non-smooth version of Moritz Stefaner’s elastic lists using jquery-ui’s tablesorter plugin.

At http://gromgull.net/2010/10/btc/explore.html you can see the result. Clicking on a namespace will show only hosts publishing triples using this schema, and only schemas that co-occur with the one you picked. Conversely, clicking on a host will show the namespaces published by that host, and only hosts that use the same schemas (this makes less intuitive sense for hosts than for namespaces). You even get a little protovis histogram of the distribution of hosts/namespaces!
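The filtering behind the two kinds of click can be sketched like this. The real page does it in javascript; the python below, the function names and the toy matrix are only an illustration of the logic.

```python
def select_namespace(matrix, namespace):
    """Hosts using `namespace`, and all namespaces those hosts publish."""
    hosts = {h for h, row in matrix.items() if namespace in row}
    namespaces = set()
    for h in hosts:
        namespaces.update(matrix[h])
    return hosts, namespaces


def select_host(matrix, host):
    """Namespaces published by `host`, and the hosts sharing any of them."""
    namespaces = set(matrix.get(host, ()))
    hosts = {h for h, row in matrix.items() if not namespaces.isdisjoint(row)}
    return namespaces, hosts


# Made-up toy matrix: host -> {namespace: number of triples}
matrix = {
    "example.org": {"foaf": 10, "rdfs": 5},
    "example.net": {"foaf": 20, "dc11": 2},
    "example.com": {"rdfs": 7, "dc11": 1},
}
print(select_namespace(matrix, "foaf"))
print(select_host(matrix, "example.com"))
```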

The usual caveats for the BTC data apply, i.e. this is a random sampling of parts of the semantic web, it doesn’t really mean anything :)

Redundancy in the BTC2010 Data, it’s only 1.4B triples!

In a comment here, Andreas Harth mentions that kaufkauf.net publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, kaufkauf.net is not the only one guilty of this: by stripping the fourth quad component of the data and removing duplicate triples, the original 3,171,793,030 quads turn into “only” 1,441,499,718 triples.

36,123,031 triples occurred more than once in the data, 42 of these even more than 100,000 times. The top redundant triples are:

#triples subj pred obj
470,903 prot:A rdf:type prot:Chains
470,778 prot:A prot:Chain “A”^^<http://www.w3.org/2001/XMLSchema#string>
470,748 prot:A prot:ChainName “Chain A”^^<http://www.w3.org/2001/XMLSchema#string>
413,647 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy rdf:type gr:BusinessEntity
366,073 foaf:Document rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Document%3E
361,900 dcmitype:Text rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://purl.org/dc/dcmitype/Text%3E
254,567 swrc:InProceedings rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://swrc.ontoware.org/ontology%23InProceedings%3E
184,530 foaf:Agent rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Agent%3E
159,627 http://www4.wiwiss.fu-berlin.de/flickrwrappr/ rdfs:label “flickr(tm) wrappr”@en
150,417 http://purl.org/obo/owl/OBO_REL#part_of rdf:type owl:ObjectProperty
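The counting itself is conceptually trivial. A toy sketch with made-up quads follows; on the real data the counter obviously does not fit in a plain in-memory dict, so treat this as the idea rather than the implementation.

```python
from collections import Counter


def triple_redundancy(quads):
    """Count how often each (s, p, o) occurs once the context is dropped."""
    return Counter((s, p, o) for s, p, o, _context in quads)


quads = [  # made-up example quads
    ("ex:a", "rdf:type", "ex:T", "http://example.org/doc1"),
    ("ex:a", "rdf:type", "ex:T", "http://example.org/doc2"),
    ("ex:b", "rdfs:label", '"b"', "http://example.org/doc1"),
]
counts = triple_redundancy(quads)
unique_triples = len(counts)                            # triples left after deduplication
repeated = sum(1 for n in counts.values() if n > 1)     # triples that occurred more than once
print(unique_triples, repeated, counts.most_common(1))
```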

This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)

At least I can now count the most common namespaces again, this time only from triples:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
275,920,526 foaf
181,683,388 rdf
106,130,939 rdfs
34,959,224 dc11
33,289,653 http://purl.uniprot.org/core
16,674,480 gr
12,733,566 rss
12,368,342 dcterm
8,334,653 swrc

Compared to the numbers for quads, data-gov had exactly the same number of triples (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the most-redundant award, going from 16% of all quads to only 1.1% of all triples. Comparing change since 2009 still puts GoodRelations up high though, so no need for them to worry:

% change namespace
5579.997758 http://www.openlinksw.com/schema/attribution
4802.937827 http://www.openrdf.org/schema/serql
3969.768833 gr
2659.804256 urn:lsid:ubio.org:predicates:recordVersion
2655.011816 urn:lsid:ubio.org:predicates:lexicalStatus
2621.864105 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
2619.867255 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
1539.092138 urn:lsid:ubio.org:predicates:hasCAVConcept
1063.282710 urn:lsid:lsid.zoology.gla.ac.uk:predicates:vernacularName
928.135900 http://spiele.j-crew.de/wiki/Spezial:URIResolver

And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from openlink as well, but I’ve done enough boring digging now.

Now I’m done doing boring counting – next time I hope I can have more fun visualisation, like Ed!!

SKOS Concepts in the BTC2010 data

Again Dan Brickley is making me work :) This time looking at the “hidden” schema that is SKOS concepts (hidden because it is not really apparent when just looking at normal rdf:types). Dan suggested looking at topics used with FOAF, i.e. objects of foaf:topic, foaf:primaryTopic and foaf:interest triples, and also things used with Dublin Core subject (I used both http://purl.org/dc/elements/1.1/subject and http://purl.org/dc/terms/subject).

I found 1,136,475 unique FOAF topics in 8,119,528 triples; only 4,470 are bnodes, and only 265 (! i.e. only about 0.02%) are literals. The top 10 topics are all of the form http://www.livejournal.com/interests.bml?int=??????, with varying numbers of ?s; this is obviously what people entered into the interest field of livejournal. More interesting are perhaps the top hosts:

#triples host
5,191,771 www.livejournal.com
1,819,836 www.deadjournal.com
771,439 www.vox.com
78,290 klab.lv
75,285 lj.rossia.org
70,380 lod.geospecies.org
18,398 my.opera.com
16,251 dbpedia.org
11,481 www.wasab.dk
9,815 wiki.sembase.at

So a lot of these topics are from FOAF exports of livejournal and friends. What I did not do, at least not yet, was to compare the list of FOAF topics with the things actually declared to be of type skos:Concept, this would be interesting.
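Roughly, the FOAF topic extraction looks like the sketch below. It assumes triples arrive as N-Triples-style strings (IRIs in angle brackets, bnodes as `_:…`, literals in double quotes); the function names are mine.

```python
from collections import Counter
from urllib.parse import urlparse

TOPIC_PREDICATES = {
    "<http://xmlns.com/foaf/0.1/topic>",
    "<http://xmlns.com/foaf/0.1/primaryTopic>",
    "<http://xmlns.com/foaf/0.1/interest>",
}


def term_kind(term):
    """Classify an N-Triples term as 'bnode', 'literal' or 'uri'."""
    if term.startswith("_:"):
        return "bnode"
    if term.startswith('"'):
        return "literal"
    return "uri"


def topic_stats(triples):
    """Return (kind counts over unique topics, triple counts per topic host)."""
    topics, hosts = set(), Counter()
    for _s, p, o in triples:
        if p in TOPIC_PREDICATES:
            topics.add(o)
            if term_kind(o) == "uri":
                hosts[urlparse(o.strip("<>")).netloc] += 1
    return Counter(term_kind(t) for t in topics), hosts
```

Swapping the predicate set for the two Dublin Core subject URIs gives the DC variant.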

Dublin Core looks quite different: it gives us 552,596 topics in 4,018,726 triples, but only 2,979 are resources, 921 of which are bnodes; the rest (i.e. 99.4%) are all literals.
The top 10 subjects according to DC are:

#triples subject
91,534 日記
38,566 写真
35,514 メル友募集
32,150 NAPLES
30,973 business
28,342 独り言
27,543 SoE Report
24,102 Congress
23,954 音楽
20,097

I do not even know what language most of these are (anyone?). Looking a bit further down the list, there are lots of government, education, crime, etc. Perhaps we can blame data.gov for this? I could have kept track of the named-graphs these came from, but I didn’t. Maybe next time.

You can download the full raw counts for all subjects: FOAF topics (7.6mb), FOAF hosts and DC Topics (23mb).

BTC2009/2010 Raw Counts

Dan Brickley asked, so I put up the complete files with counts for predicates, namespaces, types, hosts, and pay-level domains here: http://gromgull.net/2010/09/btc2010data/.

Uploading them to manyeyes or similar would perhaps be more modern, but it was too much work :)

Aggregates over BTC2010 namespaces

Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more – and it gets slightly less boring.

First predicates, yesterday I had the raw count per predicate. Much more interesting is the namespaces the predicates are defined in. These are the top 10:

#triples namespace
860,532,348 rdfs
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
588,063,466 rdf
527,347,381 gr
284,679,897 foaf
44,119,248 dc11
41,961,046 http://purl.uniprot.org/core
17,233,778 rss
13,661,605 http://www.proteinontology.info/po.owl
13,009,685 owl

(prefix abbreviations are made from prefix.cc – I am too lazy to fix the missing ones)

Now it gets interesting – because I did exactly this last year as well, and now we can compare!

Dropouts

In 2009 there were 3,817 different namespaces, this year we have 3,911, but actually only 2,945 occur in both. The biggest dropouts, i.e. namespaces that occurred last year, but not at all this year are:

#triples namespace
10,239,809 http://www.kisti.re.kr/isrl/ResearchRefOntology
5,443,549 nie
1,571,547 http://ontologycentral.com/2009/01/eurostat/ns
1,094,963 http://sindice.com/exfn/0.1
320,155 http://xmdr.org/ont/iso11179-3e3draft_r4.owl
307,534 http://cb.semsol.org/ns
242,427 nco
203,283 osag
187,600 http://auswiki.org/index.php/Special:URIResolver
159,536 nexif

I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples?

Newcomers

Looking the other way around, what namespaces are new and popular this year, we get:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
5,001,909 fec
2,689,813 http://transport.data.gov.uk/0/ontology/traffic
543,835 http://rdf.geospecies.org/ont/geospecies
526,304 http://data-gov.tw.rpi.edu/vocab/p/401
469,446 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf
446,120 http://education.data.gov.uk/def/school
223,726 http://www.w3.org/TR/rdf-schema
190,890 http://wecowi.de/wiki/Spezial:URIResolver
166,511 http://data-gov.tw.rpi.edu/vocab/p/10

Here the introduction of data.gov and data.gov.uk were the big events last year.
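Both the dropout and the newcomer lists are just set differences over the two years’ namespace counts. A sketch with made-up numbers (the function name and the toy dicts are mine):

```python
def compare_years(last_year, this_year):
    """Split namespaces into dropouts, newcomers and those seen in both years."""
    old, new = set(last_year), set(this_year)
    dropouts = sorted(old - new, key=last_year.get, reverse=True)
    newcomers = sorted(new - old, key=this_year.get, reverse=True)
    return dropouts, newcomers, old & new


# Made-up counts: namespace -> number of triples
counts_2009 = {"foaf": 100, "nie": 50, "nco": 5}
counts_2010 = {"foaf": 300, "gr": 500, "fec": 20}
print(compare_years(counts_2009, counts_2010))
```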

Winners

For the namespaces that occurred both years we can find the biggest gainers. Here I calculated what ratio of the total triples each namespace constituted each year, and the increase in this ratio from 2009 to 2010. For example, GoodRelations, on top here, constituted about 16% of all triples in 2010, but only 2.91e-4% of all triples last year, for a cool increase by a factor of roughly 57,000 (about 5,700,000%) :)

gain namespace
57058.38 gr
2636.34 http://www.openlinksw.com/schema/attribution
2182.81 http://www.openrdf.org/schema/serql
1944.68 http://www.w3.org/2007/OWL/testOntology
1235.02 http://referata.com/wiki/Special:URIResolver
1211.35 urn:lsid:ubio.org:predicates:recordVersion
1208.09 urn:lsid:ubio.org:predicates:lexicalStatus
1194.66 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
1191.39 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
701.66 urn:lsid:ubio.org:predicates:hasCAVConcept
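As a sanity check, the GoodRelations gain can be redone from numbers quoted elsewhere on this page. The 2009 share below is the rounded 2.91e-4% from the text, so the result only approximately matches the 57058.38 in the table, which presumably used the unrounded share.

```python
def share(count, total):
    """Fraction of all triples that use a given namespace."""
    return count / total


# 2010: GoodRelations triples and total triples, as quoted on this page.
gr_2010 = share(527_347_381, 3_171_793_030)
# 2009: the rounded share quoted in the text (2.91e-4 percent).
gr_2009 = 2.91e-4 / 100
gain = gr_2010 / gr_2009
print(f"{gr_2010:.1%} of triples in 2010, gain factor {gain:,.0f}")
```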

Losers

Similarly, we have the biggest losers, the ones who lost the most:

gain namespace
0.000185 http://purl.org/obo/metadata
0.000191 sioct
0.000380 vcard
0.000418 affy
0.000438 http://www.geneontology.org/go
0.000677 http://tap.stanford.edu/data
0.000719 urn://wymiwyg.org/knobot/default
0.000787 akts
0.000876 http://wymiwyg.org/ontologies/language-selection
0.000904 http://wymiwyg.org/ontologies/knobot

If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)

With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.

Update

(a bit later)

You can get the full datasets for this from many eyes:

BTC2010 Basic stats

Another year, another billion triple dataset. This time it was released at the same time my daughter was born, so running the stats script was delayed for a bit.

This year we’ve got a few more triples, perhaps making up for the fact that it wasn’t actually one billion last year :) we’ve now got 3.1B triples (or 3,171,793,030 if you want to be exact).

I’ve not had a chance to do anything really fun with this, so I’ll just dump the stats:

Subjects

  • 159,185,186 unique subjects
  • 147,663,612 occur in more than a single triple
  • 12,647,098 more than 10 times
  • 5,394,733 more than 100
  • 313,493 more than 1,000
  • 46,116 more than 10,000
  • and 53 more than 100,000 times

For an average of 19.9252 triples per unique subject. Like last year, I am not sure if having more than 100,000 triples with the same subject really is useful for anyone?
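All the “more than N times” lists in this post come from the same kind of loop over a dict of counts. A minimal sketch (the function names are mine), plus a check of the average from the totals quoted above:

```python
def threshold_counts(counts, thresholds=(1, 10, 100, 1_000, 10_000, 100_000)):
    """For each threshold t, count the keys that occur more than t times."""
    return {t: sum(1 for c in counts.values() if c > t) for t in thresholds}


def average_per_key(counts):
    """Average number of triples per unique key."""
    return sum(counts.values()) / len(counts)


# Total number of quads divided by the number of unique subjects.
print(round(3_171_793_030 / 159_185_186, 4))
```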

Looking only at bnodes used as subjects we get:

  • 100,431,757 unique subjects
  • 98,744,109 occur in more than a single triple
  • 1,465,399 more than 10 times
  • 266,759 more than 100
  • 4,956 more than 1,000
  • 48 more than 10,000

So 100M out of 159M subjects are bnodes, but they are used less often than the named resources.

The top subjects are as follows:

#triples subject
1,412,709 http://www.proteinontology.info/po.owl#A
895,776 http://openean.kaufkauf.net/id/
827,295 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy
492,756 cycann:externalID
481,000 http://purl.uniprot.org/citations/15685292
445,430 foaf:Document
369,567 cycann:label
362,391 dcmitype:Text
357,309 http://sw.opencyc.org/concept/
349,988 http://purl.uniprot.org/citations/16973872

I do not know enough about the Protein Ontology to know why po:A is so popular. CYC we already had here last year, and I guess all products exposed by BestBuy have this URI as a subject.

Predicates

  • 95,379 unique predicates
  • 83,370 occur in more than one triple
  • 46,710 more than 10
  • 18,385 more than 100
  • 5,395 more than 1,000
  • 1,271 more than 10,000
  • 548 more than 100,000

The average predicate occurred in 33254.6 triples.

#triples predicate
557,268,190 rdf:type
384,891,996 rdfs:isDefinedBy
215,041,142 gr:hasGlobalLocationNumber
184,881,132 rdfs:label
175,141,343 rdfs:comment
168,719,459 gr:hasEAN_UCC-13
131,029,818 gr:hasManufacturer
112,635,203 rdfs:seeAlso
71,742,821 foaf:nick
71,036,882 foaf:knows

The usual suspects, rdf:type, comment, label, seeAlso and a bit of FOAF. New this year is lots of GoodRelations data!

Objects – Resources

Ignoring literals for the moment, looking only at resource-objects, we have:

  • 192,855,067 unique resources
  • 36,144,147 occur in more than a single triple
  • 2,905,294 more than 10 times
  • 197,052 more than 100
  • 20,011 more than 1,000
  • 2,752 more than 10,000
  • and 370 more than 100,000 times

On average 7.72834 triples per object. This is both named objects and bnodes, looking at the bnodes only we get:

  • 97,617,548 unique resources
  • 616,825 occur in more than a single triple
  • 8,632 more than 10 times
  • 2,167 more than 100
  • 1 more than 1,000

Since BNode IDs are only valid within a certain file, there is a limit to how often they can appear, but still almost half the overall objects are bnodes.

The top ten bnode IDs are pretty boring, but the top 10 named resources are:

#triples resource-object
215,532,631 gr:BusinessEntity
215,153,113 ean:businessentities/
168,205,900 gr:ProductOrServiceModel
167,789,556 http://openean.kaufkauf.net/id/
71,051,459 foaf:Person
10,373,362 foaf:OnlineAccount
6,842,729 rss:item
6,025,094 rdf:Statement
4,647,293 foaf:Document
4,230,908 http://purl.uniprot.org/core/Resource

These are pretty much all types – compare to:

Types

A “type” being the object that occurs in a triple where rdf:type is the predicate gives us:

  • 170,020 types
  • 91,479 occur in more than a single triple
  • 20,196 more than 10 times
  • 4,325 more than 100
  • 1,113 more than 1,000
  • 258 more than 10,000
  • and 89 more than 100,000 times

On average each type is used 3277.7 times, and the top 10 are:

#triples type
215,536,042 gr:BusinessEntity
168,208,826 gr:ProductOrServiceModel
71,520,943 foaf:Person
10,447,941 foaf:OnlineAccount
6,886,401 rss:item
6,066,069 rdf:Statement
4,674,162 foaf:Document
4,260,056 http://purl.uniprot.org/core/Resource
4,001,282 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry
3,405,101 owl:Class

Not identical to the top resources, but quite similar. Lots of FOAF and new this year, lots of GoodRelations.

Contexts

Something changed with regard to context handling for BTC2010, this year we only have 8M contexts, last year we had over 35M.
I wonder if perhaps all of dbpedia is in one context this year?

  • 8,126,834 unique contexts
  • 8,048,574 occur in more than a single triple
  • 6,211,398 more than 10 times
  • 1,493,520 more than 100
  • 321,466 more than 1,000
  • 61,360 more than 10,000
  • and 4,799 more than 100,000 times

For an average of 389.958 triples per context. The 10 biggest contexts are:

#triples context
302,127 http://data-gov.tw.rpi.edu/raw/402/data-402.rdf
273,644 http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
259,824 http://static.cpantesters.org/author/M/MIYAGAWA.rss
207,513 http://data-gov.tw.rpi.edu/raw/401/data-401.rdf
193,944 http://static.cpantesters.org/author/D/DROLSKY.rss
189,528 http://static.cpantesters.org/author/S/SMUELLER.rss
170,899 http://data-gov.tw.rpi.edu/raw/59/data-59.rdf
166,454 http://zaltys.net/ontology/AKTiveSAOntology.owl
166,454 http://www.zaltys.net/ontology/AKTiveSAOntology.owl
165,948 http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl

This concludes my boring stats dump for BTC2010 for now. Some information on literals and hopefully some graphs will come soon! I also plan to look into how these stats changed from last year – so far I see much more GoodRelations, but there must be other fun changes!

Semantic Web Clusterball

From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

Semantic Web Clusterball

I started this as a part of the Billion Triple Challenge work; it shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG. I could not get it to embed here, so click on that image, mouse over things and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate, and mousing over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X, it looked roughly ok in Chrome though.)

The data is extracted from the BTC triples by computing the Pay-Level-Domain (PLD, essentially the top-level domain, but with special rules for .co.uk domains and similar) for the subjects and objects, and if they differ, count the predicates that link them. I.e. a triple:

dbpedia:Albert_Einstein rdf:type foaf:Person.

would count as a link between http://dbpedia.org and http://xmlns.com for the rdf:type predicate. Counting all links like this gives us the top cross-domain linking predicates:

predicate links
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 60,813,659
http://www.w3.org/2000/01/rdf-schema#seeAlso 16,698,110
http://www.w3.org/2002/07/owl#sameAs 4,872,501
http://xmlns.com/foaf/0.1/weblog 4,627,271
http://www.aktors.org/ontology/portal#has-date 3,873,224
http://xmlns.com/foaf/0.1/page 3,273,613
http://dbpedia.org/property/hasPhotoCollection 2,556,532
http://xmlns.com/foaf/0.1/img 2,012,761
http://xmlns.com/foaf/0.1/depiction 1,556,066
http://www.geonames.org/ontology#wikipediaArticle 735,145

Most frequent is of course rdf:type, since most schemas are from different domains to the data, and most things have a type. The ball linked above is excluding type, since it’s not really a link. You can also see a version including rdf:type. The rest of the properties are more link-like, I am not sure what is going on with the akt:has-date though, anyone?
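In code, the link counting is roughly the following. The PLD function is a deliberately crude stand-in: a proper version needs the full public suffix list, while the tiny suffix set here is only an assumption to make the example run.

```python
from collections import Counter
from urllib.parse import urlparse

# Tiny, incomplete stand-in for the public suffix list (assumption).
TWO_LEVEL_SUFFIXES = {"co.uk", "gov.uk", "ac.uk", "org.uk"}


def pld(uri):
    """Approximate pay-level domain of a URI, e.g. data.gov.uk or dbpedia.org."""
    labels = urlparse(uri).netloc.lower().split(":")[0].split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def cross_domain_links(triples):
    """Count predicates on triples whose subject and object PLDs differ."""
    links = Counter()
    for s, p, o in triples:
        if not o.startswith("http"):   # skip literals and bnodes
            continue
        src, dst = pld(s), pld(o)
        if src and dst and src != dst:
            links[src, dst, p] += 1
    return links


example = [("http://dbpedia.org/resource/Albert_Einstein", "rdf:type",
            "http://xmlns.com/foaf/0.1/Person")]
print(cross_domain_links(example))
```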

The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: Wikipedia Clusterball. His is nicer since he has core nodes inside the ball. He points out that the “clustering” of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My “clustering” method was very simple, I swap each node with the one giving me the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It’s essentially a greedy hill-climbing method, and I am sure it’s far from optimal, but it does at least something. For comparison, here is the ball on top without clustering applied.
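The greedy swapping can be sketched as below. This is a cleaned-up illustration rather than the original (messy) code: the cost function, a weighted distance around the circle, and all the names are assumptions on my part.

```python
import random


def edge_distance(order, edges):
    """Total weighted distance between linked nodes placed around a circle."""
    pos = {node: i for i, node in enumerate(order)}
    n = len(order)
    total = 0
    for (a, b), weight in edges.items():
        d = abs(pos[a] - pos[b])
        total += weight * min(d, n - d)
    return total


def hill_climb(order, edges):
    """For each node, apply the swap that lowers the cost most; repeat until stuck."""
    order = list(order)
    best = edge_distance(order, edges)
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            best_j = None
            for j in range(len(order)):
                if i == j:
                    continue
                order[i], order[j] = order[j], order[i]   # try the swap
                cost = edge_distance(order, edges)
                order[i], order[j] = order[j], order[i]   # undo it
                if cost < best:
                    best, best_j = cost, j
            if best_j is not None:
                order[i], order[best_j] = order[best_j], order[i]
                improved = True
    return order, best


def cluster(nodes, edges, restarts=5, seed=0):
    """Run hill_climb from several random orderings and keep the cheapest."""
    rng = random.Random(seed)
    results = []
    for _ in range(restarts):
        start = list(nodes)
        rng.shuffle(start)
        results.append(hill_climb(start, edges))
    return min(results, key=lambda result: result[1])
```

Each pass recomputes the full cost for every candidate swap, which is wasteful but perfectly fine for a few dozen nodes.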

The whole thing was of course hacked up in python, the javascript for the mouse-over etc. of the SVG uses prototype. I wanted to share the code, but it’s a horrible mess, and I’d rather not spend the time to clean it up. If you want it, drop me a line, or see below. The data used to generate this is available either as a download: data.txt.gz (19Mb, 10,000 host-pairs and top 500 predicates), or a subset on Many Eyes (2,500 host-pairs and top 100 predicates, uploading 19Mb of data to Many Eyes crashed my Firefox :)

UPDATE: Richard Stirling asked for the code, so I spent 30 min cleaning it up a bit, grab it here: swball_code.tar.gz It includes the data+code needed to recreate the example above.

An Objective look at the Billion Triple Data

For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it’ll be mostly tables. Enjoy :)

The BTC data contains 279,710,101 unique objects in total. Out of these:

  • 90,007,431 appear more than once
  • 7,995,747 more than 10 times
  • 748,214 more than 100
  • 43,479 more than 1,000
  • 3,209 more than 10,000

The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects are:

#triples object
2,584,960 http://www.geonames.org/ontology#P
2,645,095 http://www.aktors.org/ontology/portal#Article-Reference
2,681,771 http://www.w3.org/2002/07/owl#Class
5,616,326 http://www.aktors.org/ontology/portal#Person
7,544,903 http://www.geonames.org/ontology#Feature
9,115,801 http://en.wikipedia.org/
12,124,378 http://xmlns.com/foaf/0.1/OnlineAccount
13,687,049 http://purl.org/rss/1.0/item
14,172,852 http://rdfs.org/sioc/types#WikiArticle
38,795,942 http://xmlns.com/foaf/0.1/Person

Apart from the wikipedia link, all are types. No literals appear in the top 10 table. For the 116M unique literals we have 12,845,021 literals with a language tag and 2,067,768 with a datatype tag. The top 10 literals are:

#triples literal
722,221 “0”^^xsd:integer
969,929 “1”
1,024,654 “Nay”
1,036,054 “Copyright © 2009 craigslist, inc.”
1,056,799 “text”
1,061,692 “text/html”
1,159,311 “0”
1,204,996 “en-us”
2,049,638 “Aye”
2,310,681 “application/rdf+xml”

I can’t be bothered to check it now, but I guess the  many Aye’s & Nay’s come from IRC chatlogs (#SWIG?).

Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 unicode characters long (I wonder about this: it seems very close to 2^16 bytes, and with some unicode characters taking more than one byte, could it be truncated?). The distribution of literals/lengths looks like this:

Most literals are around 10 characters in length; there is a peak at 19, which I seem to remember was caused by the standard time format (i.e. 2005-10-30T10:45UTC) being exactly 19 characters.
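The length distribution is a one-liner over the literals, and the two numeric hunches above are easy to check. A small sketch (the function name is mine):

```python
from collections import Counter


def length_histogram(literals):
    """Map literal length (in characters) to the number of literals of that length."""
    return Counter(len(text) for text in literals)


print(len("2005-10-30T10:45UTC"))   # 19
print(2 ** 16)                      # 65536

# Characters and UTF-8 bytes differ for non-ASCII text: 2 characters, 6 bytes.
text = "日記"
print(len(text), len(text.encode("utf-8")))
```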

That’s it! I believe I now have published all my numbers on BTC :)

Typical Semantic Web Data

This is the fourth of my Billion Triple Challenge data-set statistics posts; if you only just got here, catch up on part I, II or III.

I had these numbers ready for a long time, but never found the time to type them up, as it is not so exciting. However, CaptSolo asked for it now to put in his very-soon-to-be-finished thesis, so I’ll hurry up. This is all about the classes used in the BTC data, i.e. the rdf:type triples.
Overall the data contains 143,293,758 type triples, assigning 283,815 different types to 104,562,695 different things.  For the types themselves:

  • 213,281 types are used more than once
  • 94,455 used more than 10
  • 14,862 more than 100
  • 1,730 more than 1000
  • 288 more than 10000

If we take only these 288 top ones we cover 92% of all type triples; we can cover 90% of the typed things with only 105 types, and over 50% of the data with only foaf:Person, sioc:WikiArticle, rss:Item and foaf:OnlineAccount. Out of all the “types” used, 12,319 were BNodes, which is odd, but I guess possible, and 204 are literals, which is even odder. The top 10 types are:

#triples type URI
1,859,499 wordnet:Person
2,309,652 foaf:Document
2,645,091 akt:Article-Reference
2,680,081 owl:Class
5,616,163 akt:Person
7,544,797 geonames:Feature
12,123,375 foaf:OnlineAccount
13,686,988 rss:item
14,172,851 sioc:WikiArticle
38,790,680 foaf:Person

Now for the things the types are assigned to, out of the 104,562,965 things with types, 52,865,376 are BNodes. If you pay attention you will now have realised that many things have more than one type assigned (143M type triples⇒104M things). In fact:

  • 7,026,972 things have more than one type triple.
  • 612,467 have more than 10
  • 35,201 more than 100
  • 1,025 more than 1,000
  • 40 more than 10,000

Note I am talking here of type triples, i.e. the top 40 things may well have the same type assigned 10,000 times. The things having over 10,000 types assigned are a product of the partial inclusion of inferred triples in the data. For instance, for every context where RDFS inference has been applied, all properties will have rdf:type rdf:Property inferred. Looking at the number of unique types per thing shows that:

  • 2,979,968 things have more than one type
  • 78,208 have more than 10
  • 4 more than 100
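The difference between the two lists is just whether repeated (thing, type) pairs are counted once or every time. A minimal sketch with made-up data:

```python
from collections import Counter, defaultdict


def type_stats(type_triples):
    """Return (type triples per thing, unique types per thing)."""
    triples_per_thing = Counter()
    types_per_thing = defaultdict(set)
    for thing, rdf_type in type_triples:
        triples_per_thing[thing] += 1
        types_per_thing[thing].add(rdf_type)
    unique = {thing: len(types) for thing, types in types_per_thing.items()}
    return triples_per_thing, unique


# The same inferred type repeated in three contexts, plus one other type.
pairs = [("ex:knows", "rdf:Property")] * 3 + [("ex:knows", "owl:ObjectProperty")]
print(type_stats(pairs))
```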

The 10 things with most unique types are all pretty boring:

#types URI
74 http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000959f60
75 http://dbpedia.org/resource/Arnold_Schwarzenegger
88 http://oiled.man.example.net/test#V822576
91 http://oiled.man.example.net/test#V21027
91 http://oiled.man.example.net/test#V21029
91 http://oiled.man.example.net/test#V21030
105 http://oiled.man.example.net/test#V16459
136 http://www.w3.org/2002/03owlt/description-logic/consistent501#T
136 http://www.w3.org/2002/03owlt/description-logic/inconsistent502#T
171 http://oiled.man.example.net/test#V21026

Likewise the 10 things with the most types assigned, all the product of materialised inferred triples:

#triples URI
57,533 http://sw.opencyc.org/2008/06/10/concept/
58,838 http://semantic-mediawiki.org/swivt/1.0#creationDate
58,838 http://semantic-mediawiki.org/swivt/1.0#page
58,838 http://semantic-mediawiki.org/swivt/1.0#Subject
89,521 http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA
121,138 http://en.wikipedia.org/
159,773 http://sw.opencyc.org/concept/
232,505 http://sw.cyc.com/CycAnnotations_v1#label
361,113 http://xmlns.com/foaf/0.1/holdsAccount
465,010 http://sw.cyc.com/CycAnnotations_v1#externalID

That’s it — I hope it changed your life! :)