<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>(still) nothing clever &#187; Semantic Web</title>
	<atom:link href="http://gromgull.net/blog/category/semantic-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://gromgull.net/blog</link>
	<description></description>
	<lastBuildDate>Tue, 07 Sep 2010 09:25:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/>		<item>
		<title>BTC2009/2010 Raw Counts</title>
		<link>http://gromgull.net/blog/2010/09/btc20092010-raw-counts/</link>
		<comments>http://gromgull.net/blog/2010/09/btc20092010-raw-counts/#comments</comments>
		<pubDate>Tue, 07 Sep 2010 09:25:18 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=494</guid>
		<description><![CDATA[Dan Brickley asked, so I put up the complete files with counts for predicates, namespaces, types, hosts, and pay-level domains here: http://gromgull.net/2010/09/btc2010data/.
Uploading them to manyeyes or similar would perhaps be more modern, but it was too much work :)
]]></description>
			<content:encoded><![CDATA[<p><a href="http://danbri.org/words/">Dan Brickley</a> asked, so I put up the complete files with counts for predicates, namespaces, types, hosts, and pay-level domains here: <a href="http://gromgull.net/2010/09/btc2010data/">http://gromgull.net/2010/09/btc2010data/</a>.</p>
<p>Uploading them to manyeyes or similar would perhaps be more modern, but it was too much work :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/09/btc20092010-raw-counts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Aggregates over BTC2010 namespaces</title>
		<link>http://gromgull.net/blog/2010/09/aggregates-over-btc2010-namespaces/</link>
		<comments>http://gromgull.net/blog/2010/09/aggregates-over-btc2010-namespaces/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 13:07:54 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=483</guid>
		<description><![CDATA[Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more &#8211; and it gets slightly less boring.
First predicates, yesterday I had the raw count per predicate. Much more interesting is the namespaces the predicates are defined in. These are the top 10:



#triples	namespace
860,532,348	rdfs
651,432,324	http://data-gov.tw.rpi.edu/vocab/p/90
588,063,466	rdf
527,347,381	gr
284,679,897	foaf
44,119,248	dc11
41,961,046	http://purl.uniprot.org/core
17,233,778	rss
13,661,605	http://www.proteinontology.info/po.owl
13,009,685	owl

(prefix abbreviations are made from prefix.cc &#8211; I am [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://gromgull.net/blog/2010/09/btc2010-basic-stats/">Yesterday</a> I dumped the most basic BTC2010 stats. Today I have processed them a bit more &#8211; and it gets slightly less boring.</p>
<p>First predicates, yesterday I had the raw count per predicate. Much more interesting is the namespaces the predicates are defined in. These are the top 10:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>860,532,348</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#">rdfs</a></td>
</tr>
<tr>
<td>651,432,324</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/90">http://data-gov.tw.rpi.edu/vocab/p/90</a></td>
</tr>
<tr>
<td>588,063,466</td>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#">rdf</a></td>
</tr>
<tr>
<td>527,347,381</td>
<td><a href="http://purl.org/goodrelations/v1#">gr</a></td>
</tr>
<tr>
<td>284,679,897</td>
<td><a href="http://xmlns.com/foaf/0.1/">foaf</a></td>
</tr>
<tr>
<td>44,119,248</td>
<td><a href="http://purl.org/dc/elements/1.1/">dc11</a></td>
</tr>
<tr>
<td>41,961,046</td>
<td><a href="http://purl.uniprot.org/core">http://purl.uniprot.org/core</a></td>
</tr>
<tr>
<td>17,233,778</td>
<td><a href="http://purl.org/rss/1.0/">rss</a></td>
</tr>
<tr>
<td>13,661,605</td>
<td><a href="http://www.proteinontology.info/po.owl">http://www.proteinontology.info/po.owl</a></td>
</tr>
<tr>
<td>13,009,685</td>
<td><a href="http://www.w3.org/2002/07/owl#">owl</a></td>
</tr>
</tbody>
</table>
<p>(prefix abbreviations are made from prefix.cc &#8211; I am too lazy to fix the missing ones)</p>
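<p>As a rough sketch of how per-predicate counts can be rolled up into per-namespace counts: the code below assumes a namespace is everything before the last <em>#</em> or <em>/</em> of the predicate URI, and that a URI with neither is its own namespace. That rule is my guess at a reasonable split, not necessarily the exact one used for the table above; the function names are made up for illustration.</p>

```python
from collections import Counter


def namespace_of(predicate_uri):
    """Return everything before the last '#' or '/' of a predicate URI."""
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind("/"))
    if cut == -1:
        # No separator at all (e.g. some urn: predicates): keep the URI whole.
        return predicate_uri
    return predicate_uri[:cut]


def triples_per_namespace(predicate_counts):
    """Sum per-predicate triple counts into per-namespace totals."""
    totals = Counter()
    for predicate, count in predicate_counts.items():
        totals[namespace_of(predicate)] += count
    return totals


# Three counts taken from the predicate table in the BTC2010 basic stats post.
sample = {
    "http://www.w3.org/2000/01/rdf-schema#label": 184881132,
    "http://www.w3.org/2000/01/rdf-schema#comment": 175141343,
    "http://xmlns.com/foaf/0.1/nick": 71742821,
}
for namespace, total in triples_per_namespace(sample).most_common():
    print(f"{total:,}\t{namespace}")
```

<p>Sorting the resulting totals and keeping the first ten gives a table of the same shape as the one above.</p>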
<p>Now it gets interesting &#8211; because I did exactly this last year as well, and now we can compare!</p>
<h2>Dropouts</h2>
<p>In 2009 there were 3,817 different namespaces, this year we have 3,911, but actually only 2,945 occur in both. The biggest <em>dropouts</em>, i.e. namespaces that occurred last year, but not at all this year are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>10,239,809</td>
<td><a href="http://www.kisti.re.kr/isrl/ResearchRefOntology">http://www.kisti.re.kr/isrl/ResearchRefOntology</a></td>
</tr>
<tr>
<td>5,443,549</td>
<td><a href="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">nie</a></td>
</tr>
<tr>
<td>1,571,547</td>
<td><a href="http://ontologycentral.com/2009/01/eurostat/ns">http://ontologycentral.com/2009/01/eurostat/ns</a></td>
</tr>
<tr>
<td>1,094,963</td>
<td><a href="http://sindice.com/exfn/0.1">http://sindice.com/exfn/0.1</a></td>
</tr>
<tr>
<td>320,155</td>
<td><a href="http://xmdr.org/ont/iso11179-3e3draft_r4.owl">http://xmdr.org/ont/iso11179-3e3draft_r4.owl</a></td>
</tr>
<tr>
<td>307,534</td>
<td><a href="http://cb.semsol.org/ns">http://cb.semsol.org/ns</a></td>
</tr>
<tr>
<td>242,427</td>
<td><a href="http://www.semanticdesktop.org/ontologies/2007/03/22/nco#">nco</a></td>
</tr>
<tr>
<td>203,283</td>
<td><a href="http://www.ordnancesurvey.co.uk/ontology/AdministrativeGeography/v2.0/AdministrativeGeography.rdf#">osag</a></td>
</tr>
<tr>
<td>187,600</td>
<td><a href="http://auswiki.org/index.php/Special:URIResolver">http://auswiki.org/index.php/Special:URIResolver</a></td>
</tr>
<tr>
<td>159,536</td>
<td><a href="http://www.semanticdesktop.org/ontologies/2007/05/10/nexif#">nexif</a></td>
</tr>
</tbody>
</table>
<p>I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples?</p>
<h2>Newcomers</h2>
<p>Looking the other way around, what namespaces are new and popular this year, we get:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>651,432,324</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/90">http://data-gov.tw.rpi.edu/vocab/p/90</a></td>
</tr>
<tr>
<td>5,001,909</td>
<td><a href="http://www.rdfabout.com/rdf/schema/usfec/">fec</a></td>
</tr>
<tr>
<td>2,689,813</td>
<td><a href="http://transport.data.gov.uk/0/ontology/traffic">http://transport.data.gov.uk/0/ontology/traffic</a></td>
</tr>
<tr>
<td>543,835</td>
<td><a href="http://rdf.geospecies.org/ont/geospecies">http://rdf.geospecies.org/ont/geospecies</a></td>
</tr>
<tr>
<td>526,304</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/401">http://data-gov.tw.rpi.edu/vocab/p/401</a></td>
</tr>
<tr>
<td>469,446</td>
<td><a href="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf">http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf</a></td>
</tr>
<tr>
<td>446,120</td>
<td><a href="http://education.data.gov.uk/def/school">http://education.data.gov.uk/def/school</a></td>
</tr>
<tr>
<td>223,726</td>
<td><a href="http://www.w3.org/TR/rdf-schema">http://www.w3.org/TR/rdf-schema</a></td>
</tr>
<tr>
<td>190,890</td>
<td><a href="http://wecowi.de/wiki/Spezial:URIResolver">http://wecowi.de/wiki/Spezial:URIResolver</a></td>
</tr>
<tr>
<td>166,511</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/10">http://data-gov.tw.rpi.edu/vocab/p/10</a></td>
</tr>
</tbody>
</table>
<p>Here the introduction of <a href="http://data.gov">data.gov</a> and <a href="http://data.gov.uk">data.gov.uk</a> was the big event of the past year.</p>
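<p>The dropout and newcomer lists boil down to set differences over the two years' namespace counts. A minimal sketch, assuming each year is available as a dict from namespace to triple count (the function name and the toy numbers are invented for illustration):</p>

```python
def compare_years(counts_2009, counts_2010):
    """Return (dropouts, newcomers, shared) namespaces, biggest first."""
    old, new = set(counts_2009), set(counts_2010)
    # Dropouts are ranked by their 2009 size, newcomers by their 2010 size.
    dropouts = sorted(old - new, key=counts_2009.get, reverse=True)
    newcomers = sorted(new - old, key=counts_2010.get, reverse=True)
    shared = old.intersection(new)
    return dropouts, newcomers, shared


# Toy counts for illustration only, not the real BTC numbers.
counts_2009 = {"nie": 50, "nco": 20, "foaf": 900}
counts_2010 = {"fec": 40, "foaf": 1200}
dropouts, newcomers, shared = compare_years(counts_2009, counts_2010)
print(dropouts, newcomers, shared)
```

<p>With the totals quoted above (3,817 namespaces in 2009, 3,911 in 2010, 2,945 in both) this works out to 872 dropouts and 966 newcomers.</p>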
<h2>Winners</h2>
<p>For the namespaces that occurred in both years we can find the biggest gainers. Here I calculated what ratio of the total triples each namespace constituted each year, and the increase in this ratio from 2009 to 2010. For example, GoodRelations, on top here, constituted about 16.6% of all triples in 2010, but only 2.91e-4% of all triples last year, for a cool 57,000-fold increase (roughly 5,700,000%) :)</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>gain</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>57058.38</td>
<td><a href="http://purl.org/goodrelations/v1#">gr</a></td>
</tr>
<tr>
<td>2636.34</td>
<td><a href="http://www.openlinksw.com/schema/attribution">http://www.openlinksw.com/schema/attribution</a></td>
</tr>
<tr>
<td>2182.81</td>
<td><a href="http://www.openrdf.org/schema/serql">http://www.openrdf.org/schema/serql</a></td>
</tr>
<tr>
<td>1944.68</td>
<td><a href="http://www.w3.org/2007/OWL/testOntology">http://www.w3.org/2007/OWL/testOntology</a></td>
</tr>
<tr>
<td>1235.02</td>
<td><a href="http://referata.com/wiki/Special:URIResolver">http://referata.com/wiki/Special:URIResolver</a></td>
</tr>
<tr>
<td>1211.35</td>
<td><a href="urn:lsid:ubio.org:predicates:recordVersion">urn:lsid:ubio.org:predicates:recordVersion</a></td>
</tr>
<tr>
<td>1208.09</td>
<td><a href="urn:lsid:ubio.org:predicates:lexicalStatus">urn:lsid:ubio.org:predicates:lexicalStatus</a></td>
</tr>
<tr>
<td>1194.66</td>
<td><a href="urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping">urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping</a></td>
</tr>
<tr>
<td>1191.39</td>
<td><a href="urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank">urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank</a></td>
</tr>
<tr>
<td>701.66</td>
<td><a href="urn:lsid:ubio.org:predicates:hasCAVConcept">urn:lsid:ubio.org:predicates:hasCAVConcept</a></td>
</tr>
</tbody>
</table>
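<p>The gain column is just the ratio between a namespace's share of all triples in the two years. A small sketch of that arithmetic for GoodRelations, assuming the "total" is the full 3,171,793,030 triples of BTC2010 and using the rounded 2009 share quoted in the text (so the result is only approximate):</p>

```python
def share(count, total):
    """Fraction of all triples whose predicate lives in one namespace."""
    return count / total


def gain(share_2009, share_2010):
    """Factor by which a namespace's share grew from 2009 to 2010."""
    return share_2010 / share_2009


# GoodRelations: the 2010 count and total come from the posts; the 2009
# share is the rounded 2.91e-4% from the text, converted to a fraction.
gr_2010 = share(527_347_381, 3_171_793_030)
gr_2009 = 2.91e-4 / 100
print(f"2010 share: {gr_2010:.2%}, gain: {gain(gr_2009, gr_2010):,.0f}x")
```

<p>This lands close to the 57058.38 in the table; the small difference comes from the rounding of the 2009 share. A gain below 1, as in the losers table, means the share shrank.</p>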
<h2>Losers</h2>
<p>Similarly, we have the biggest losers, the ones who lost the most:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>gain</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>0.000185</td>
<td><a href="http://purl.org/obo/metadata">http://purl.org/obo/metadata</a></td>
</tr>
<tr>
<td>0.000191</td>
<td><a href="http://rdfs.org/sioc/types#">sioct</a></td>
</tr>
<tr>
<td>0.000380</td>
<td><a href="http://www.w3.org/2006/vcard/ns#">vcard</a></td>
</tr>
<tr>
<td>0.000418</td>
<td><a href="http://www.affymetrix.com/community/publications/affymetrix/tmsplice#">affy</a></td>
</tr>
<tr>
<td>0.000438</td>
<td><a href="http://www.geneontology.org/go">http://www.geneontology.org/go</a></td>
</tr>
<tr>
<td>0.000677</td>
<td><a href="http://tap.stanford.edu/data">http://tap.stanford.edu/data</a></td>
</tr>
<tr>
<td>0.000719</td>
<td><a href="urn://wymiwyg.org/knobot/default">urn://wymiwyg.org/knobot/default</a></td>
</tr>
<tr>
<td>0.000787</td>
<td><a href="http://www.aktors.org/ontology/support#">akts</a></td>
</tr>
<tr>
<td>0.000876</td>
<td><a href="http://wymiwyg.org/ontologies/language-selection">http://wymiwyg.org/ontologies/language-selection</a></td>
</tr>
<tr>
<td>0.000904</td>
<td><a href="http://wymiwyg.org/ontologies/knobot">http://wymiwyg.org/ontologies/knobot</a></td>
</tr>
</tbody>
</table>
<p>If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)</p>
<p>With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.</p>
<h2>Update</h2>
<p>(a bit later)</p>
<p>You can get the full datasets for this from Many Eyes:</p>
<ul>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/number-of-triples-per-predicate-na">Namespaces 2009</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/number-of-triples-per-predicate-na-2">Namespaces 2010</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/namespaces-that-dropped-out-betwee">Dropouts</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/namespaces-that-appeared-between-b">Newcomers</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/namespace-change-between-btc-2009-">Changes</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/09/aggregates-over-btc2010-namespaces/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>BTC2010 Basic stats</title>
		<link>http://gromgull.net/blog/2010/09/btc2010-basic-stats/</link>
		<comments>http://gromgull.net/blog/2010/09/btc2010-basic-stats/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 13:07:18 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=474</guid>
		<description><![CDATA[Another year, another billion triple dataset. This time it was released the same time my daughter was born, so running the stats script was delayed for a bit.
This year we&#8217;ve got a few more triples, perhaps making up for the fact that it wasn&#8217;t actually one billion last year :) we&#8217;ve now got 3.1B triples [...]]]></description>
			<content:encoded><![CDATA[<p>Another year, <a href="http://km.aifb.kit.edu/projects/btc-2010/">another billion triple dataset</a>. This time it was released the same time my daughter was born, so running the stats script was delayed for a bit.</p>
<p>This year we&#8217;ve got a few more triples, perhaps making up for the fact that it wasn&#8217;t actually one billion last year :) we&#8217;ve now got 3.1B triples (or 3,171,793,030 if you want to be exact).</p>
<p>I&#8217;ve not had a chance to do anything really fun with this, so I&#8217;ll just dump the stats:</p>
<h2>Subjects</h2>
<ul>
<li>159,185,186 unique subjects</li>
<li>147,663,612 occur in more than a single triple</li>
<li>12,647,098 more than 10 times</li>
<li>5,394,733 more than 100</li>
<li>313,493 more than 1,000</li>
<li>46,116 more than 10,000</li>
<li>and 53 more than 100,000 times</li>
</ul>
<p>For an average of 19.9252 triples per unique subject. Like last year, I am not sure if having more than 100,000 triples with the same subject really is useful for anyone?</p>
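<p>The threshold counts and the average above can be reproduced with a simple frequency count. This is a minimal in-memory sketch, assuming the subjects arrive as a plain iterable of strings; for 3.1B triples the real thing needs something more frugal (sorting on disk, for example), and the names here are made up for illustration.</p>

```python
from collections import Counter

THRESHOLDS = (1, 10, 100, 1_000, 10_000, 100_000)


def occurrence_summary(terms):
    """Return (unique terms, {threshold: terms seen more often}, mean)."""
    counts = Counter(terms)
    if not counts:
        return 0, {t: 0 for t in THRESHOLDS}, 0.0
    above = {
        t: sum(1 for seen in counts.values() if seen > t) for t in THRESHOLDS
    }
    mean = sum(counts.values()) / len(counts)
    return len(counts), above, mean


# Toy subject column: "a" in three triples, "b" in one, "c" in two.
unique, above, mean = occurrence_summary(["a", "a", "a", "b", "c", "c"])
print(unique, above[1], mean)  # prints: 3 2 2.0
```

<p>The same function works for predicates, objects, types and contexts; only the column fed into it changes. As a sanity check, 3,171,793,030 triples divided by 159,185,186 unique subjects gives the 19.9252 quoted above.</p>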
<p>Looking only at bnodes used as subjects we get:</p>
<ul>
<li>100,431,757 unique subjects</li>
<li>98,744,109 occur in more than a single triple</li>
<li>1,465,399 more than 10 times</li>
<li>266,759 more than 100</li>
<li>4,956 more than 1,000</li>
<li>48 more than 10,000</li>
</ul>
<p>So 100M out of 159M subjects are bnodes, but they are used less often than the named resources. </p>
<p>The top subjects are as follows:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>subject</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>1,412,709</td>
<td><a href="http://www.proteinontology.info/po.owl#A">http://www.proteinontology.info/po.owl#A</a></td>
</tr>
<tr>
<td>895,776</td>
<td><a href="http://openean.kaufkauf.net/id/">http://openean.kaufkauf.net/id/</a></td>
</tr>
<tr>
<td>827,295</td>
<td><a href="http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy">http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy</a></td>
</tr>
<tr>
<td>492,756</td>
<td><a href="http://sw.cyc.com/CycAnnotations_v1#externalID">cycann:externalID</a></td>
</tr>
<tr>
<td>481,000</td>
<td><a href="http://purl.uniprot.org/citations/15685292">http://purl.uniprot.org/citations/15685292</a></td>
</tr>
<tr>
<td>445,430</td>
<td><a href="http://xmlns.com/foaf/0.1/Document">foaf:Document</a></td>
</tr>
<tr>
<td>369,567</td>
<td><a href="http://sw.cyc.com/CycAnnotations_v1#label">cycann:label</a></td>
</tr>
<tr>
<td>362,391</td>
<td><a href="http://purl.org/dc/dcmitype/Text">dcmitype:Text</a></td>
</tr>
<tr>
<td>357,309</td>
<td><a href="http://sw.opencyc.org/concept/">http://sw.opencyc.org/concept/</a></td>
</tr>
<tr>
<td>349,988</td>
<td><a href="http://purl.uniprot.org/citations/16973872">http://purl.uniprot.org/citations/16973872</a></td>
</tr>
</tbody>
</table>
<p>I do not know enough about the Protein Ontology to know why <em>po:A</em> is so popular. Cyc was already here last year, and I guess all products exposed by BestBuy have this URI as a subject.</p>
<h2>Predicates</h2>
<ul>
<li>95,379 unique predicates</li>
<li>83,370 occur in more than one triple</li>
<li>46,710 more than 10</li>
<li>18,385 more than 100</li>
<li>5,395 more than 1,000</li>
<li>1,271 more than 10,000</li>
<li>548 more than 100,000</li>
</ul>
<p>The average predicate occurred in 33254.6 triples.</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>predicate</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>557,268,190</td>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">rdf:type</a></td>
</tr>
<tr>
<td>384,891,996</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#isDefinedBy">rdfs:isDefinedBy</a></td>
</tr>
<tr>
<td>215,041,142</td>
<td><a href="http://purl.org/goodrelations/v1#hasGlobalLocationNumber">gr:hasGlobalLocationNumber</a></td>
</tr>
<tr>
<td>184,881,132</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#label">rdfs:label</a></td>
</tr>
<tr>
<td>175,141,343</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#comment">rdfs:comment</a></td>
</tr>
<tr>
<td>168,719,459</td>
<td><a href="http://purl.org/goodrelations/v1#hasEAN_UCC-13">gr:hasEAN_UCC-13</a></td>
</tr>
<tr>
<td>131,029,818</td>
<td><a href="http://purl.org/goodrelations/v1#hasManufacturer">gr:hasManufacturer</a></td>
</tr>
<tr>
<td>112,635,203</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#seeAlso">rdfs:seeAlso</a></td>
</tr>
<tr>
<td>71,742,821</td>
<td><a href="http://xmlns.com/foaf/0.1/nick">foaf:nick</a></td>
</tr>
<tr>
<td>71,036,882</td>
<td><a href="http://xmlns.com/foaf/0.1/knows">foaf:knows</a></td>
</tr>
</tbody>
</table>
<p>The usual suspects, rdf:type, comment, label, seeAlso and a bit of FOAF. New this year is lots of GoodRelations data!</p>
<h2>Objects &#8211; Resources</h2>
<p>Ignoring literals for the moment, looking only at resource-objects, we have: </p>
<ul>
<li>192,855,067 unique resources</li>
<li>36,144,147 occur in more than a single triple</li>
<li>2,905,294 more than 10 times</li>
<li>197,052 more than 100</li>
<li>20,011 more than 1,000</li>
<li>2,752 more than 10,000</li>
<li>and 370 more than 100,000 times</li>
</ul>
<p>On average 7.72834 triples per object. This covers both named objects and bnodes; looking at the bnodes only we get:</p>
<ul>
<li>97,617,548 unique resources</li>
<li>616,825 occur in more than a single triple</li>
<li>8,632 more than 10 times</li>
<li>2,167 more than 100</li>
<li>1 more than 1,000</li>
</ul>
<p>Since BNode IDs are only valid within a certain file, there is a limit to how often they can appear, but still almost half the overall objects are bnodes.</p>
<p>The top ten bnode IDs are pretty boring, but the top 10 named resources are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>resource-object</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>215,532,631</td>
<td><a href='http://purl.org/goodrelations/v1#BusinessEntity'>gr:BusinessEntity</a></td>
</tr>
<tr>
<td>215,153,113</td>
<td><a href='http://openean.kaufkauf.net/id/businessentities/'>ean:businessentities/</a></td>
</tr>
<tr>
<td>168,205,900</td>
<td><a href='http://purl.org/goodrelations/v1#ProductOrServiceModel'>gr:ProductOrServiceModel</a></td>
</tr>
<tr>
<td>167,789,556</td>
<td><a href='http://openean.kaufkauf.net/id/'>http://openean.kaufkauf.net/id/</a></td>
</tr>
<tr>
<td>71,051,459</td>
<td><a href='http://xmlns.com/foaf/0.1/Person'>foaf:Person</a></td>
</tr>
<tr>
<td>10,373,362</td>
<td><a href='http://xmlns.com/foaf/0.1/OnlineAccount'>foaf:OnlineAccount</a></td>
</tr>
<tr>
<td>6,842,729</td>
<td><a href='http://purl.org/rss/1.0/item'>rss:item</a></td>
</tr>
<tr>
<td>6,025,094</td>
<td><a href='http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement'>rdf:Statement</a></td>
</tr>
<tr>
<td>4,647,293</td>
<td><a href='http://xmlns.com/foaf/0.1/Document'>foaf:Document</a></td>
</tr>
<tr>
<td>4,230,908</td>
<td><a href='http://purl.uniprot.org/core/Resource'>http://purl.uniprot.org/core/Resource</a></td>
</tr>
</tbody>
</table>
<p>These are pretty much all types &#8211; compare to: </p>
<h2>Types</h2>
<p>A &#8220;type&#8221; being the object that occurs in a triple where <i>rdf:type</i> is the predicate gives us:</p>
<ul>
<li>170,020 types</li>
<li>91,479 occur in more than a single triple</li>
<li>20,196 more than 10 times</li>
<li>4,325 more than 100</li>
<li>1,113 more than 1,000</li>
<li>258 more than 10,000</li>
<li>and 89 more than 100,000 times</li>
</ul>
<p>On average each type is used 3277.7 times, and the top 10 are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>type</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>215,536,042</td>
<td><a href='http://purl.org/goodrelations/v1#BusinessEntity'>gr:BusinessEntity</a></td>
</tr>
<tr>
<td>168,208,826</td>
<td><a href='http://purl.org/goodrelations/v1#ProductOrServiceModel'>gr:ProductOrServiceModel</a></td>
</tr>
<tr>
<td>71,520,943</td>
<td><a href='http://xmlns.com/foaf/0.1/Person'>foaf:Person</a></td>
</tr>
<tr>
<td>10,447,941</td>
<td><a href='http://xmlns.com/foaf/0.1/OnlineAccount'>foaf:OnlineAccount</a></td>
</tr>
<tr>
<td>6,886,401</td>
<td><a href='http://purl.org/rss/1.0/item'>rss:item</a></td>
</tr>
<tr>
<td>6,066,069</td>
<td><a href='http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement'>rdf:Statement</a></td>
</tr>
<tr>
<td>4,674,162</td>
<td><a href='http://xmlns.com/foaf/0.1/Document'>foaf:Document</a></td>
</tr>
<tr>
<td>4,260,056</td>
<td><a href='http://purl.uniprot.org/core/Resource'>http://purl.uniprot.org/core/Resource</a></td>
</tr>
<tr>
<td>4,001,282</td>
<td><a href='http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry'>http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry</a></td>
</tr>
<tr>
<td>3,405,101</td>
<td><a href='http://www.w3.org/2002/07/owl#Class'>owl:Class</a></td>
</tr>
</tbody>
</table>
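<p>For reference, the definition of a "type" used here amounts to a one-line filter. A minimal sketch, assuming the triples have already been parsed into (subject, predicate, object) tuples; the <em>ex:</em> names are made-up examples:</p>

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"


def type_counts(triples):
    """Count how often each resource is the object of an rdf:type triple."""
    return Counter(obj for _subj, pred, obj in triples if pred == RDF_TYPE)


# Two typing triples and one ordinary link, which the filter ignores.
triples = [
    ("ex:alice", RDF_TYPE, FOAF + "Person"),
    ("ex:bob", RDF_TYPE, FOAF + "Person"),
    ("ex:alice", FOAF + "knows", "ex:bob"),
]
print(type_counts(triples).most_common())
```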
<p>Not identical to the top resources, but quite similar. Lots of FOAF and new this year, lots of GoodRelations.</p>
<h2>Contexts</h2>
<p>Something changed with regard to context handling for BTC2010, this year we only have 8M contexts, last year we had over 35M.<br />
I wonder if perhaps all of dbpedia is in one context this year?</p>
<ul>
<li>8,126,834 unique contexts</li>
<li>8,048,574 occur in more than a single triple</li>
<li>6,211,398 more than 10 times</li>
<li>1,493,520 more than 100</li>
<li>321,466 more than 1,000</li>
<li>61,360 more than 10,000</li>
<li>and 4,799 more than 100,000 times</li>
</ul>
<p>For an average of 389.958 triples per context. The 10 biggest contexts are: </p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>context</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>302,127</td>
<td><a href='http://data-gov.tw.rpi.edu/raw/402/data-402.rdf'>http://data-gov.tw.rpi.edu/raw/402/data-402.rdf</a></td>
</tr>
<tr>
<td>273,644</td>
<td><a href='http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf'>http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf</a></td>
</tr>
<tr>
<td>259,824</td>
<td><a href='http://static.cpantesters.org/author/M/MIYAGAWA.rss'>http://static.cpantesters.org/author/M/MIYAGAWA.rss</a></td>
</tr>
<tr>
<td>207,513</td>
<td><a href='http://data-gov.tw.rpi.edu/raw/401/data-401.rdf'>http://data-gov.tw.rpi.edu/raw/401/data-401.rdf</a></td>
</tr>
<tr>
<td>193,944</td>
<td><a href='http://static.cpantesters.org/author/D/DROLSKY.rss'>http://static.cpantesters.org/author/D/DROLSKY.rss</a></td>
</tr>
<tr>
<td>189,528</td>
<td><a href='http://static.cpantesters.org/author/S/SMUELLER.rss'>http://static.cpantesters.org/author/S/SMUELLER.rss</a></td>
</tr>
<tr>
<td>170,899</td>
<td><a href='http://data-gov.tw.rpi.edu/raw/59/data-59.rdf'>http://data-gov.tw.rpi.edu/raw/59/data-59.rdf</a></td>
</tr>
<tr>
<td>166,454</td>
<td><a href='http://zaltys.net/ontology/AKTiveSAOntology.owl'>http://zaltys.net/ontology/AKTiveSAOntology.owl</a></td>
</tr>
<tr>
<td>166,454</td>
<td><a href='http://www.zaltys.net/ontology/AKTiveSAOntology.owl'>http://www.zaltys.net/ontology/AKTiveSAOntology.owl</a></td>
</tr>
<tr>
<td>165,948</td>
<td><a href='http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl'>http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl</a></td>
</tr>
</tbody>
</table>
<p>This concludes my boring stats dump for BTC2010 for now. Some information on literals and hopefully some graphs will come soon! I also plan to look into how these stats changed from last year &#8211; so far I see much more GoodRelations, but there must be other fun changes!</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/09/btc2010-basic-stats/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Semantic Web Clusterball</title>
		<link>http://gromgull.net/blog/2010/01/semantic-web-clusterball/</link>
		<comments>http://gromgull.net/blog/2010/01/semantic-web-clusterball/#comments</comments>
		<pubDate>Wed, 06 Jan 2010 11:25:29 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[SVG]]></category>
		<category><![CDATA[Visualisation]]></category>
		<category><![CDATA[in progress]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=402</guid>
		<description><![CDATA[From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

I started this as a part of the Billion Triple Challenge work; it shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG; I could not get it to embed here, so click on that image and [...]]]></description>
			<content:encoded><![CDATA[<p>From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:</p>
<p style="text-align: center;"><a href="http://gromgull.net/2010/01/swball/swball.svg"><img class="aligncenter" style="border: 0pt none;" title="Semantic Web Clusterball" src="http://farm5.static.flickr.com/4064/4250011607_245b975a26.jpg" alt="Semantic Web Clusterball" width="500" height="469" /></a></p>
<p>I started this as a part of the <a href="http://gromgull.net/blog/category/semantic-web/billion-triple-challenge/">Billion Triple Challenge work</a>; it shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG; I could not get it to embed here, so click on that image and mouse over things and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate, and mousing over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X; it looked roughly OK in Chrome though.)</p>
<p>The data is extracted from the BTC triples by computing the <em>Pay-Level-Domain</em> (PLD, essentially the domain name someone pays to register, usually one label below the top-level domain, but with special rules for .co.uk domains and similar) for the subjects and objects and, if they differ, counting the predicates that link them. I.e. a triple:</p>
<p><code>dbpedia:Albert_Einstein rdf:type foaf:Person. </code></p>
<p>would count as a link between <em>http://dbpedia.org </em>and <em>http://xmlns.com</em> for the<em> rdf:type</em> predicate. Counting all links like this gives us the top cross-domain linking predicates:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>predicate</th>
<th>links</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</a></td>
<td style="text-align: right;">60,813,659</td>
</tr>
<tr>
<td><a href="http://www.w3.org/2000/01/rdf-schema#seeAlso">http://www.w3.org/2000/01/rdf-schema#seeAlso</a></td>
<td style="text-align: right;">16,698,110</td>
</tr>
<tr>
<td><a href="http://www.w3.org/2002/07/owl#sameAs">http://www.w3.org/2002/07/owl#sameAs</a></td>
<td style="text-align: right;">4,872,501</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/weblog">http://xmlns.com/foaf/0.1/weblog</a></td>
<td style="text-align: right;">4,627,271</td>
</tr>
<tr>
<td><a href="http://www.aktors.org/ontology/portal#has-date">http://www.aktors.org/ontology/portal#has-date</a></td>
<td style="text-align: right;">3,873,224</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/page">http://xmlns.com/foaf/0.1/page</a></td>
<td style="text-align: right;">3,273,613</td>
</tr>
<tr>
<td><a href="http://dbpedia.org/property/hasPhotoCollection">http://dbpedia.org/property/hasPhotoCollection</a></td>
<td style="text-align: right;">2,556,532</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/img">http://xmlns.com/foaf/0.1/img</a></td>
<td style="text-align: right;">2,012,761</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/depiction">http://xmlns.com/foaf/0.1/depiction</a></td>
<td style="text-align: right;">1,556,066</td>
</tr>
<tr>
<td><a href="http://www.geonames.org/ontology#wikipediaArticle">http://www.geonames.org/ontology#wikipediaArticle</a></td>
<td style="text-align: right;">735,145</td>
</tr>
</tbody>
</table>
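<p>The link counting described above can be sketched roughly like this. Note the big assumption: a proper PLD computation needs the full public suffix list, while this sketch uses a tiny hard-coded set of two-part suffixes as a stand-in, and it does not handle IP-address hosts. The function names are mine, not taken from the actual script.</p>

```python
from collections import Counter
from urllib.parse import urlsplit

# Tiny stand-in for the public suffix list; a real run needs the full list.
TWO_PART_SUFFIXES = {"co.uk", "org.uk", "ac.uk", "gov.uk", "com.au"}


def pay_level_domain(uri):
    """Return the PLD of a URI, or None for bnodes, literals and URNs."""
    host = urlsplit(uri).hostname
    if host is None:
        return None
    labels = host.split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_PART_SUFFIXES else 2
    return ".".join(labels[-keep:])


def cross_domain_links(triples):
    """Count (subject PLD, object PLD, predicate) for triples crossing PLDs."""
    links = Counter()
    for subj, pred, obj in triples:
        src, dst = pay_level_domain(subj), pay_level_domain(obj)
        if src and dst and src != dst:
            links[(src, dst, pred)] += 1
    return links


# The example triple from the text, written out with full URIs.
einstein = (
    "http://dbpedia.org/resource/Albert_Einstein",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://xmlns.com/foaf/0.1/Person",
)
print(cross_domain_links([einstein]))
```

<p>Summing these counters per predicate, ignoring the domain pair, gives a ranking like the table above.</p>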
<p>Most frequent is of course <em>rdf:type</em>, since most schemas are from different domains to the data, and most things have a type. The ball linked above is excluding type, since it&#8217;s not really a <em>link</em>. You can also see <a href="http://gromgull.net/2010/01/swball/swball_type.svg">a version including <em>rdf:type</em>.</a> The rest of the properties are more <em>link-like</em>, I am not sure what is going on with the <em>akt:has-date </em>though, anyone?</p>
<p>The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: <a href="http://www.chrisharrison.net/projects/clusterball/index.html">Wikipedia Clusterball</a>. His is nicer since he has core nodes <em>inside </em>the ball. He points out that the &#8220;clustering&#8221; of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My &#8220;clustering&#8221; method was very simple, I swap each node with the one giving me the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It&#8217;s essentially a greedy hill-climbing method, and I am sure it&#8217;s far from optimal, but it does at least something. For comparison, <a href="http://gromgull.net/2010/01/swball/swball_nocluster.svg">here is the ball on top without clustering applied</a>.</p>
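<p>For the curious, the greedy swapping described above looks roughly like the sketch below. It is a reconstruction from the description, not the original code: it assumes nodes sit at evenly spaced slots on a circle, that "edge distance" is the link weight times the number of slots between the two ends, and all names are invented for illustration.</p>

```python
import random


def layout_cost(order, edges):
    """Total weighted circular distance between linked nodes."""
    pos = {node: i for i, node in enumerate(order)}
    n = len(order)
    total = 0
    for a, b, weight in edges:
        gap = abs(pos[a] - pos[b])
        total += weight * min(gap, n - gap)
    return total


def improve(order, edges):
    """Give each node its best swap; repeat until no swap lowers the cost."""
    order = list(order)
    best = layout_cost(order, edges)
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            best_j = None
            for j in range(len(order)):
                if j == i:
                    continue
                # Try the swap, measure it, then undo it again.
                order[i], order[j] = order[j], order[i]
                cost = layout_cost(order, edges)
                order[i], order[j] = order[j], order[i]
                if best > cost:
                    best, best_j = cost, j
            if best_j is not None:
                order[i], order[best_j] = order[best_j], order[i]
                improved = True
    return order, best


def cluster(nodes, edges, restarts=5, seed=0):
    """Run the greedy pass from several random orders and keep the best."""
    rng = random.Random(seed)
    best_order, best_cost = None, None
    for _ in range(restarts):
        start = list(nodes)
        rng.shuffle(start)
        order, cost = improve(start, edges)
        if best_cost is None or best_cost > cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


# Toy graph: three strongly linked pairs that should end up side by side.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "d", 5), ("b", "e", 5), ("c", "f", 5)]
print(cluster(nodes, edges))
```

<p>Recomputing the full cost for every candidate swap is wasteful, but it keeps the sketch short; for more than a few hundred nodes one would only recompute the edges touching the two swapped nodes.</p>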
<p>The whole thing was of course hacked up in python, the javascript for the mouse-over etc. of the SVG uses <a href="http://www.prototypejs.org/">prototype</a>. I wanted to share the code, but it&#8217;s a horrible mess, and I&#8217;d rather not spend the time to clean it up. If you want it, <span style="text-decoration: line-through;">drop me a line.</span>, see below. The data used to generate this is available either as a download: <a href="http://gromgull.net/2010/01/swball/data.txt.gz">data.txt.gz</a> (19Mb, 10,000 host-pairs and top 500 predicates), or <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/semantic-web-links/versions/1">a subset on Many Eyes</a> (2,500 host-pairs and top 100 predicates, uploading 19Mb of data to Many Eyes crashed my Firefox :)</p>
<p><strong>UPDATE</strong>: <a href="http://twitter.com/Rchards">Richard Stirling </a>asked for the code, so I spent 30 min cleaning it up a bit, grab it here: <a href="http://gromgull.net/2010/01/swball/swball_code.tar.gz">swball_code.tar.gz</a> It includes the data+code needed to recreate the example above.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/01/semantic-web-clusterball/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>An Objective look at the Billion Triple Data</title>
		<link>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/</link>
		<comments>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/#comments</comments>
		<pubDate>Fri, 11 Dec 2009 15:44:18 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=394</guid>
		<description><![CDATA[For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it&#8217;ll be mostly tables. Enjoy :)
The BTC data contains 279,710,101 unique objects in total. Out [...]]]></description>
			<content:encoded><![CDATA[<p>For completeness, <a href="http://www.cs.univie.ac.at/employee.php?tab=teaching&amp;eid=223">Besbes</a> is telling me to write up the final stats from the BTC data, for the <em>object-part</em> of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it&#8217;ll be mostly tables. Enjoy :)</p>
<p>The BTC data contains 279,710,101 unique objects in total. Out of these:</p>
<ul>
<li>90,007,431 appear more than once</li>
<li>7,995,747 more than 10 times</li>
<li>748,214 more than 100</li>
<li>43,479 more than 1,000</li>
<li>3,209 more than 10,000</li>
</ul>
<p>The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are <em>file://</em> URIs. The top 10 objects are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>object</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>2,584,960</td>
<td><a href="http://www.geonames.org/ontology#P">http://www.geonames.org/ontology#P</a></td>
</tr>
<tr>
<td>2,645,095</td>
<td><a href="http://www.aktors.org/ontology/portal#Article-Reference">http://www.aktors.org/ontology/portal#Article-Reference</a></td>
</tr>
<tr>
<td>2,681,771</td>
<td><a href="http://www.w3.org/2002/07/owl#Class">http://www.w3.org/2002/07/owl#Class</a></td>
</tr>
<tr>
<td>5,616,326</td>
<td><a href="http://www.aktors.org/ontology/portal#Person">http://www.aktors.org/ontology/portal#Person</a></td>
</tr>
<tr>
<td>7,544,903</td>
<td><a href="http://www.geonames.org/ontology#Feature">http://www.geonames.org/ontology#Feature</a></td>
</tr>
<tr>
<td>9,115,801</td>
<td><a href="http://en.wikipedia.org/">http://en.wikipedia.org/</a></td>
</tr>
<tr>
<td>12,124,378</td>
<td><a href="http://xmlns.com/foaf/0.1/OnlineAccount">http://xmlns.com/foaf/0.1/OnlineAccount</a></td>
</tr>
<tr>
<td>13,687,049</td>
<td><a href="http://purl.org/rss/1.0/item">http://purl.org/rss/1.0/item</a></td>
</tr>
<tr>
<td>14,172,852</td>
<td><a href="http://rdfs.org/sioc/types#WikiArticle">http://rdfs.org/sioc/types#WikiArticle</a></td>
</tr>
<tr>
<td>38,795,942</td>
<td><a href="http://xmlns.com/foaf/0.1/Person">http://xmlns.com/foaf/0.1/Person</a></td>
</tr>
</tbody>
</table>
<p>Apart from the wikipedia link, all are types. No literals appear in the top 10 table. For the 116M unique literals we have 12,845,021 literals with a language tag and 2,067,768 with a datatype tag. The top 10 literals are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>literal</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>722,221</td>
<td>&#8220;0&#8243;^^xsd:integer</td>
</tr>
<tr>
<td>969,929</td>
<td>&#8220;1&#8243;</td>
</tr>
<tr>
<td>1,024,654</td>
<td>&#8220;Nay&#8221;</td>
</tr>
<tr>
<td>1,036,054</td>
<td>&#8220;Copyright &#169; 2009 craigslist, inc.&#8221;</td>
</tr>
<tr>
<td>1,056,799</td>
<td>&#8220;text&#8221;</td>
</tr>
<tr>
<td>1,061,692</td>
<td>&#8220;text/html&#8221;</td>
</tr>
<tr>
<td>1,159,311</td>
<td>&#8220;0&#8243;</td>
</tr>
<tr>
<td>1,204,996</td>
<td>&#8220;en-us&#8221;</td>
</tr>
<tr>
<td>2,049,638</td>
<td>&#8220;Aye&#8221;</td>
</tr>
<tr>
<td>2,310,681</td>
<td>&#8220;application/rdf+xml&#8221;</td>
</tr>
</tbody>
</table>
<p>I can&#8217;t be bothered to check it now, but I guess the many Aye&#8217;s &amp; Nay&#8217;s come from IRC chatlogs (#SWIG?).</p>
<p>Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 unicode characters long (I wonder about this &#8211; it is very close to 2<sup>16</sup> bytes if some of the characters take more than one byte, so could it be truncated?). The distribution of literals/lengths looks like this:</p>
<p style="text-align: left;"><a href="http://www.flickr.com/photos/gromgull/4176130623/sizes/o/"><img class="aligncenter" style="border: 0pt none;" title="Literal lengths" src="http://farm3.static.flickr.com/2746/4176130623_754a99c096.jpg" alt="" width="500" height="400" /></a>Most literals are around 10 characters in length, and there is a peak at 19, which I seem to remember was caused by the standard time format (e.g. 2005-10-30T10:45UTC) being exactly 19 characters.</p>
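<p>The length distribution and the truncation question can be explored with a few lines of Python. This is only a sketch on made-up sample literals; the helper names are hypothetical, and it assumes that the suspected limit would apply to the UTF-8 encoded size rather than to the character count.</p>

```python
from collections import Counter


def length_histogram(literals):
    """Map literal length in characters to the number of literals of that length."""
    return Counter(len(text) for text in literals)


def utf8_size(text):
    """Number of bytes the literal occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Made-up sample literals; the last one contains a two-byte character.
sample_literals = ["2005-10-30T10:45UTC", "text/html", "Aye", "caf\u00e9"]
histogram = length_histogram(sample_literals)
byte_sizes = {text: utf8_size(text) for text in sample_literals}
```

<p>A literal whose byte size sits at or just under 65,536 while its character count is lower would support the truncation guess.</p>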
<p style="text-align: left;">That&#8217;s it! I believe I now have published all my numbers on BTC :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DBTropes</title>
		<link>http://gromgull.net/blog/2009/12/dbtropes/</link>
		<comments>http://gromgull.net/blog/2009/12/dbtropes/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 13:31:48 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Everything Else]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=387</guid>
		<description><![CDATA[
Know TvTropes.org? As pointed out by XKCD, a great place to lose hours of time reading about SoBadIt&#8217;sHorrible, HighOctaneNightmareFuel and thousands of other tropes, all with examples from comics, films, tv-series etc.
DFKI colleague Malte Kiesel has done the right thing and just released his linked open data wrapper for tvtropes, naturally named dbTropes.org. Now go [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2009/12/logo.png"><img class="aligncenter size-full wp-image-388" style="border: 0pt none;" title="logo" src="http://gromgull.net/blog/wp-content/uploads/2009/12/logo.png" alt="logo" width="200" height="50" /></a></p>
<p style="text-align: left;">Know <a href="http://tvtropes.org">TvTropes.org</a>? As <a href="http://xkcd.com/609/">pointed out by XKCD</a>, a great place to lose hours of time reading about <a href="http://tvtropes.org/pmwiki/pmwiki.php/DarthWiki/ptitlew9bltta3dv6n?from=Main.SoBadItsHorrible">SoBadIt&#8217;sHorrible</a>, <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/HighOctaneNightmareFuel">HighOctaneNightmareFuel</a> and thousands of other <em>tropes, </em>all with examples from comics, films, tv-series etc.</p>
<p>DFKI colleague <a href="http://www.dfki.uni-kl.de/~kiesel/">Malte Kiesel</a> has done the right thing and just released his linked open data wrapper for tvtropes, naturally named <a href="http://dbtropes.org">dbTropes.org</a>. Now go read about <a href="http://dbtropes.org/resource/Main/DiabolusExMachina">DiabolusExMachina</a>. It will of course do content-negotiation, so try it with your favourite RDF browser.</p>
<p>I helped too &#8211; I made the stylesheet and the &#8220;logo&#8221; :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/12/dbtropes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Typical Semantic Web Data</title>
		<link>http://gromgull.net/blog/2009/09/typical-semantic-web-data/</link>
		<comments>http://gromgull.net/blog/2009/09/typical-semantic-web-data/#comments</comments>
		<pubDate>Mon, 28 Sep 2009 16:15:54 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=317</guid>
		<description><![CDATA[This is the fourth of my Billion Triple Challenge data-set statistics posts, if you only just got here, catch up on part I, II or III.
I had these numbers ready for a long time, but never found the time to type it up as it is not so exciting. However CaptSolo asked for it [...]]]></description>
			<content:encoded><![CDATA[<p>This is the fourth of my <a href="http://vmlion25.deri.ie/index.html">Billion Triple Challenge data-set</a> statistics posts, if you only just got here, catch up on part <a href="http://gromgull.net/blog/2009/06/btc-statistics-i/">I</a>, <a href="http://gromgull.net/blog/2009/07/billions-and-billions-and-billions-on-a-map/">II</a> or <a href="http://gromgull.net/blog/2009/08/the-subject-matter-or-its-a-scam-there-are-only-900m/">III</a>.</p>
<p>I had these numbers ready for a long time, but never found the time to type it up as it is not so exciting. However <a href="http://captsolo.net/">CaptSolo</a> asked for it now to put in his very-soon-to-be-finished thesis, so I&#8217;ll hurry up. This is all about the classes used in the BTC data, i.e. the <em>rdf:type</em> triples.<br />
Overall the data contains 143,293,758 type triples, assigning 283,815 different types to 104,562,695 different things. For the types themselves:</p>
<ul>
<li>213,281 types are used more than once</li>
<li> 94,455 used more than 10</li>
<li>14,862 more than 100</li>
<li>1,730 more than 1000</li>
<li>288 more than 10000</li>
</ul>
<p>If we take only these 288 top ones we cover 92% of all type triples. We can cover 90% of the typed things with only 105 types and over 50% of the data with only <em>foaf:Person, sioc:WikiArticle</em>, <em>rss:item </em>and <em>foaf:OnlineAccount</em>. Out of all the &#8220;types&#8221; used 12,319 were BNodes, which is odd, but I guess possible, and 204 are literals, which is even odder. The top 10 types are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>type URI</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td class='firstcol'>1,859,499</td>
<td>wordnet:Person</td>
</tr>
<tr>
<td class='firstcol'>2,309,652</td>
<td>foaf:Document</td>
</tr>
<tr>
<td class='firstcol'>2,645,091</td>
<td>akt:Article-Reference</td>
</tr>
<tr>
<td class='firstcol'>2,680,081</td>
<td>owl:Class</td>
</tr>
<tr>
<td class='firstcol'>5,616,163</td>
<td>akt:Person</td>
</tr>
<tr>
<td class='firstcol'>7,544,797</td>
<td>geonames:Feature</td>
</tr>
<tr>
<td class='firstcol'>12,123,375</td>
<td>foaf:OnlineAccount</td>
</tr>
<tr>
<td class='firstcol'>13,686,988</td>
<td>rss:item</td>
</tr>
<tr>
<td class='firstcol'>14,172,851</td>
<td>sioc:WikiArticle</td>
</tr>
<tr>
<td class='firstcol'>38,790,680</td>
<td>foaf:Person</td>
</tr>
</tbody>
</table>
<p>Now for the things the types are assigned to, out of the 104,562,965 things with types, 52,865,376 are BNodes. If you pay attention you will now have realised that many things have more than one type assigned (143M type triples &#8658; 104M things). In fact:</p>
<ul>
<li>7,026,972 things have more than one type triple.</li>
<li>612,467 have more than 10</li>
<li>35,201 more than 100</li>
<li>1,025 more than 1,000</li>
<li>40 more than 10,000</li>
</ul>
<p>Note I am talking here of <em>type triples</em>, i.e. the top 40 things may well have the same type assigned 10,000 times. The things having over 10,000 types assigned are a product of the partial inclusion of inferred triples in the data. For instance, for every context where RDFS inference has been applied, all properties will have <em>rdf:type rdf:Property </em>inferred. Looking at the number of unique types per thing shows that:</p>
<ul>
<li>2,979,968 things have more than one type</li>
<li>78,208 have more than 10</li>
<li>4 more than 100</li>
</ul>
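<p>The difference between counting <em>type triples</em> and counting <em>unique types</em> per thing can be sketched like this. It is a minimal illustration that assumes the quads have already been parsed into (subject, predicate, object, context) tuples; the function name and the sample data are made up.</p>

```python
from collections import Counter, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def type_counts(quads):
    """Return (type triples per thing, unique types per thing) for rdf:type quads."""
    triples_per_thing = Counter()
    types_per_thing = defaultdict(set)
    for subject, predicate, obj, context in quads:
        if predicate != RDF_TYPE:
            continue
        triples_per_thing[subject] += 1      # counted once per context
        types_per_thing[subject].add(obj)    # counted once per distinct type
    unique_per_thing = {s: len(types) for s, types in types_per_thing.items()}
    return triples_per_thing, unique_per_thing


# Made-up sample: the same inferred type statement repeated in three contexts.
sample_quads = [
    ("foaf:holdsAccount", RDF_TYPE, "rdf:Property", "ctx1"),
    ("foaf:holdsAccount", RDF_TYPE, "rdf:Property", "ctx2"),
    ("foaf:holdsAccount", RDF_TYPE, "rdf:Property", "ctx3"),
    ("ex:alice", RDF_TYPE, "foaf:Person", "ctx1"),
    ("ex:alice", "foaf:name", "Alice", "ctx1"),
]
type_triples, unique_types = type_counts(sample_quads)
```

<p>In the sample, the property has three type triples but only one unique type, which is the pattern that the materialised inferences produce at a much larger scale.</p>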
<p>The 10 things with most <em>unique </em>types are all pretty boring:</p>
<table class='stattable'>
<thead>
<tr>
<th style="font-weight: bold">#types</th>
<th><strong>URI</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td class="firstcol">74</td>
<td>http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000959f60</td>
</tr>
<tr>
<td class="firstcol">75</td>
<td>http://dbpedia.org/resource/Arnold_Schwarzenegger</td>
</tr>
<tr>
<td class="firstcol">88</td>
<td>http://oiled.man.example.net/test#V822576</td>
</tr>
<tr>
<td class="firstcol">91</td>
<td>http://oiled.man.example.net/test#V21027</td>
</tr>
<tr>
<td class="firstcol">91</td>
<td>http://oiled.man.example.net/test#V21029</td>
</tr>
<tr>
<td class="firstcol">91</td>
<td>http://oiled.man.example.net/test#V21030</td>
</tr>
<tr>
<td class="firstcol">105</td>
<td>http://oiled.man.example.net/test#V16459</td>
</tr>
<tr>
<td class="firstcol">136</td>
<td>http://www.w3.org/2002/03owlt/description-logic/consistent501#T</td>
</tr>
<tr>
<td class="firstcol">136</td>
<td>http://www.w3.org/2002/03owlt/description-logic/inconsistent502#T</td>
</tr>
<tr>
<td class="firstcol">171</td>
<td>http://oiled.man.example.net/test#V21026</td>
</tr>
</tbody>
</table>
<p>Likewise the 10 things with the most types assigned, all products of materialised inferred triples:</p>
<table class='stattable'>
<thead>
<tr style="font-weight: bold;">
<th>#triples</th>
<th>URI</th>
</tr>
</thead>
<tbody>
<tr>
<td class="firstcol">57,533</td>
<td>http://sw.opencyc.org/2008/06/10/concept/</td>
</tr>
<tr>
<td class="firstcol">58,838</td>
<td>http://semantic-mediawiki.org/swivt/1.0#creationDate</td>
</tr>
<tr>
<td class="firstcol">58,838</td>
<td>http://semantic-mediawiki.org/swivt/1.0#page</td>
</tr>
<tr>
<td class="firstcol">58,838</td>
<td>http://semantic-mediawiki.org/swivt/1.0#Subject</td>
</tr>
<tr>
<td class="firstcol">89,521</td>
<td>http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA</td>
</tr>
<tr>
<td class="firstcol">121,138</td>
<td>http://en.wikipedia.org/</td>
</tr>
<tr>
<td class="firstcol">159,773</td>
<td>http://sw.opencyc.org/concept/</td>
</tr>
<tr>
<td class="firstcol">232,505</td>
<td>http://sw.cyc.com/CycAnnotations_v1#label</td>
</tr>
<tr>
<td class="firstcol">361,113</td>
<td>http://xmlns.com/foaf/0.1/holdsAccount</td>
</tr>
<tr>
<td class="firstcol">465,010</td>
<td>http://sw.cyc.com/CycAnnotations_v1#externalID</td>
</tr>
</tbody>
</table>
<p>That&#8217;s it &#8211; I hope it changed your life! :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/09/typical-semantic-web-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Heat-maps of Semantic Web Predicate usage</title>
		<link>http://gromgull.net/blog/2009/09/heat-maps-of-semantic-web-predicate-usage/</link>
		<comments>http://gromgull.net/blog/2009/09/heat-maps-of-semantic-web-predicate-usage/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 13:10:01 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=358</guid>
		<description><![CDATA[It&#8217;s all Cygri&#8217;s fault &#8211; he encouraged me to add schema namespaces to the general areas on the semantic web cluster-tree. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s all <a href="http://richard.cyganiak.de/">Cygri</a>&#8217;s fault &#8211; he <a href="http://chatlogs.planetrdf.com/swig/2009-09-10.html#T11-27-04">encouraged me</a> to add schema namespaces to the general areas on the <a href="http://gromgull.net/2009/09/btcclustertree/tree.html">semantic web cluster-tree</a>. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full day later I have:</p>
<p style="text-align: left;"><a href="http://gromgull.net/2009/09/heat/heat.html"><img class="aligncenter size-medium wp-image-359" style="border: 0pt none;" title="FOAF Predicates on the Semantic Web" src="http://gromgull.net/blog/wp-content/uploads/2009/09/foaf-300x202.png" alt="FOAF Predicates on the Semantic Web" width="300" height="202" /></a></p>
<p style="text-align: left;">It&#8217;s the same map as last time, laid using graphviz&#8217;s neato as before. The heat-map of the properties was computed from the feature-vector of predicate counts, first I mapped all predicates to their &#8220;namespace&#8221;, by the slightly-dodgy-but-good-enough heuristic of taking the part of the URI before the last # or / character. Then I split the map into a grid of NxN points (I think I used N=30 in the end), and compute a new feature vector for each point. This vector is the sum of the mapped vector for each of the domains, divided by the distance. I.e. (if you prefer math) each point&#8217;s vector becomes:</p>
<p style="text-align: center;"><img src="http://quicklatex.com/cache/ql_eab1808c3a47a79ec4468286c166cd63.gif" alt="\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y),Â  pos_d)}} " title="\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y),Â  pos_d)}} " style="vertical-align: -22px; border: none;"/></p>
<p style="text-align: left;">Where <img src="http://quicklatex.com/cache/ql_f623e75af30e62bbd73d6df5b50bb7b5.gif" alt="D" title="D" style="vertical-align: 0px; border: none;"/> is the distance (here simple 2d euclidean), <img src="http://quicklatex.com/cache/ql_8277e0910d750195b448797616e091ad.gif" alt="d" title="d" style="vertical-align: 0px; border: none;"/> is each domain, <img src="http://quicklatex.com/cache/ql_35786283530fc15de9505a0cf7b2dfe7.gif" alt="pos_d" title="pos_d" style="vertical-align: -4px; border: none;"/> is that domains position in the figure and <img src="http://quicklatex.com/cache/ql_ac9efe23bd710abe094c643a0b6e9a39.gif" alt="V_d" title="V_d" style="vertical-align: -3px; border: none;"/> is that domains feature vector. <em>Normally</em> it would be more natural to decrease the effect by the squared distance, but this gave less attractive results, and I ended up square-rooting it instead. The color is now simply on column of the resulting matrix normalised and mapped to a <a href="http://www.scipy.org/Cookbook/Matplotlib/Show_colormaps">nice pylab colormap.</a></p>
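<p>The two steps, the namespace heuristic and the distance-weighted sum per grid point, can be sketched in plain Python as follows. This is an illustration under assumptions, not the original (unpublished) code: the function names are made up, and the small <code>EPSILON</code> guard for a grid point that coincides with a domain position is an addition the post does not mention.</p>

```python
import math
from collections import Counter

EPSILON = 1e-9  # assumed guard against division by zero, not from the post


def namespace(uri):
    """The part of the URI before its last '#' or '/' character."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    if cut == -1:
        return uri
    return uri[:cut]


def point_vector(x, y, domains):
    """Sum of the domain vectors, each divided by the square root of its distance to (x, y).

    domains is a list of ((dx, dy), vector) pairs where vector maps
    a namespace to its count for that domain.
    """
    total = Counter()
    for (dx, dy), vector in domains:
        dist = math.hypot(x - dx, y - dy)          # plain 2d euclidean distance
        weight = 1.0 / math.sqrt(max(dist, EPSILON))
        for ns, count in vector.items():
            total[ns] += weight * count
    return total


# Made-up example: one domain at the origin using a single namespace.
foaf = namespace("http://xmlns.com/foaf/0.1/knows")
example_domains = [((0.0, 0.0), {foaf: 4.0})]
example_point = point_vector(4.0, 0.0, example_domains)
```

<p>Evaluating <code>point_vector</code> at each of the NxN grid points and picking one namespace gives the column of values that is then normalised and coloured.</p>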
<p style="text-align: left;">Now this was the fun and interesting part, and it took maybe 1 hour. As predicted. NOW, getting this plotted along with the nodes from the graph turned out to be a nightmare. Neato gave me the coordinates for the nodes, but would change them slightly when rendering to PNGs. Many hours of frustration later I ended up drawing all of it again with <a href="http://matplotlib.sourceforge.net/">pylab</a>, which worked really well. I would publish the code for this, but it&#8217;s so messy it makes grown men cry.</p>
<p style="text-align: left;">NOW I am off to analyse the result of the top-level domain interlinking on the billion triple data. The data-collection just finished running while I did this. &#8230; As <a href="http://chatlogs.planetrdf.com/swig/2009-09-10.html#T11-28-32">he said.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/09/heat-maps-of-semantic-web-predicate-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualising predicate usage on the Semantic Web</title>
		<link>http://gromgull.net/blog/2009/09/visualising-predicate-usage-on-the-semantic-web/</link>
		<comments>http://gromgull.net/blog/2009/09/visualising-predicate-usage-on-the-semantic-web/#comments</comments>
		<pubDate>Wed, 09 Sep 2009 11:02:29 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=337</guid>
		<description><![CDATA[So, not quite a billion triple challenge post, but the data is the same. I had the idea that I could compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates are used within each one. Then once I had the distance-metric, I could use FastMap to visualise it. It would be [...]]]></description>
			<content:encoded><![CDATA[<p>So, not quite a billion triple challenge post, but the data is the same. I had the idea that I could compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates are used within each one. Then once I had the distance-metric, I could use <a href="http://gromgull.net/blog/2009/08/fastmap-in-python/">FastMap</a> to visualise it. It would be a quick hack, it would look smooth and great and be fun. In the end, many hours later, it wasn&#8217;t quick, the visual is not smooth (i.e. it doesn&#8217;t move) and I don&#8217;t know if it looks so great. It was fun though. Just go there and look at it:</p>
<p style="text-align: center;"><a href="http://gromgull.net/2009/09/btcclustertree/tree.html"><img class="size-full wp-image-338 aligncenter" style="border: 0pt none;" title="PayLevelDomains cluster-tree" src="http://gromgull.net/blog/wp-content/uploads/2009/09/smalltree.png" alt="PayLevelDomains cluster-tree" width="400" height="263" /></a></p>
<p style="text-align: left;">As you can see it&#8217;s a large PNG with the new-and-exciting <a href="http://en.wikipedia.org/wiki/Image_map">ImageMap</a> technology used to position the info-popup, or rather used to activate the JavaScript used for the popups. I tried at first with SVG, but I couldn&#8217;t get SVG and XHTML and Javascript to play along, I guess in Firefox 5 it will work. The graph is laid out and generated by <a href="http://www.graphviz.org/">Graphviz</a>&#8217;s <a href="http://www.graphviz.org/pdf/neatoguide.pdf">neato</a>, which also generated the imagemap.</p>
<p style="text-align: left;">So what do we actually see here? In short, a tree where domains that publish similar Semantic Web data are close to each other in the tree and have similar colours. In detail: I took all the PLDs that contained over 1,000 triples, this is around 7500, and counted the number of triples for each of the 500 most frequent predicates in the dataset. (These 500 predicates cover &#8776;94% of the data). This gave me a vector-space with 500 features for each of the PLDs, i.e. something like this:</p>
<table style="border-color: #aaa; border-collapse: collapse; text-align: center; font-family: mono;" border="1" cellspacing="5">
<colgroup>
<col width="77"></col>
<col width="77"></col>
<col width="77"></col>
<col width="77"></col>
<col width="77"></col>
</colgroup>
<tbody>
<tr>
<td width="77" height="16" align="LEFT"></td>
<td width="77" align="LEFT">geonames:nearbyFeature</td>
<td width="77" align="LEFT">dbprop:redirect</td>
<td width="77" align="LEFT">foaf:knows</td>
<td width="77" align="LEFT">&#8230;</td>
</tr>
<tr>
<td height="16" align="LEFT">dbpedia.org</td>
<td align="RIGHT">0.01</td>
<td align="RIGHT">0.8</td>
<td align="RIGHT">0.1</td>
<td align="LEFT"></td>
</tr>
<tr>
<td height="16" align="LEFT">livejournal.org</td>
<td align="RIGHT">0</td>
<td align="RIGHT">0</td>
<td align="RIGHT">0.9</td>
<td align="LEFT"></td>
</tr>
<tr>
<td height="16" align="LEFT">geonames.org</td>
<td align="RIGHT">0.75</td>
<td align="RIGHT">0</td>
<td align="RIGHT">0.1</td>
<td align="LEFT"></td>
</tr>
<tr>
<td height="16" align="LEFT">&#8230;</td>
<td align="LEFT"></td>
<td align="LEFT"></td>
<td align="LEFT"></td>
<td align="LEFT"></td>
</tr>
</tbody>
</table>
<p style="text-align: left;">Each value is the fraction of triples from this PLD that used this predicate. In this vector space I used the <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine-similarity</a> to compute a distance matrix for all PLDs. With this distance matrix I thought I could apply FastMap, but it worked really badly and looked like this:</p>
<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2009/09/fastmap.png"><img class="size-medium wp-image-340 aligncenter" style="border: 0pt none;" title="Fastmapping the PLDs " src="http://gromgull.net/blog/wp-content/uploads/2009/09/fastmap-300x226.png" alt="Fastmapping the PLDs " width="300" height="226" /></a></p>
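<p>For reference, a distance matrix of this kind can be built from the per-PLD predicate vectors along these lines. It is a sketch under one explicit assumption: the distance is taken to be 1 minus the cosine similarity, which the post does not spell out. The vectors reuse the illustrative values from the table above, and the function names are made up.</p>

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention for empty vectors
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Return the sorted PLD names and the full pairwise distance matrix."""
    names = sorted(vectors)
    rows = []
    for a in names:
        rows.append([cosine_distance(vectors[a], vectors[b]) for b in names])
    return names, rows


# Illustrative vectors taken from the example table (fraction of triples per predicate).
pld_vectors = {
    "dbpedia.org": {"geonames:nearbyFeature": 0.01, "dbprop:redirect": 0.8, "foaf:knows": 0.1},
    "livejournal.org": {"foaf:knows": 0.9},
    "geonames.org": {"geonames:nearbyFeature": 0.75, "foaf:knows": 0.1},
}
pld_names, pld_distances = distance_matrix(pld_vectors)
```

<p>The resulting matrix is symmetric with zeros on the diagonal, which is the shape of input that both FastMap and tree-building tools expect.</p>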
<p style="text-align: left;">So instead of FastMap I used maketree from the <a href="http://complearn.org/">complearn tools</a>. This generates trees from a distance matrix and gives very good results, but it is an iterative optimisation and it takes forever for large instances. Around this time I realised I wasn&#8217;t going to be able to visualise all 7500 PLDs, and cut it down to the <span style="text-decoration: line-through;">2000</span>, <span style="text-decoration: line-through;">1000</span>, <span style="text-decoration: line-through;">500</span>, <span style="text-decoration: line-through;">100</span>, 50 largest PLDs. Now this worked fine, but the result looked like a bog-standard graphviz graph, and it wasn&#8217;t very exciting (i.e. not at all like <a href="http://www.pitchinteractive.com/colour_economy/">this colourful thing</a>). Now I realised that since I actually had numeric feature vectors in the first place I wasn&#8217;t restricted to using FastMap to make up coordinates, and I used <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> to map the input vector-space to a 3-dimensional space, normalised the values to [0;255] and used these as RGB values for colour. Ah &#8211; lovely pastel.</p>
<p style="text-align: left;">I think I underestimated the time this would take by at least a factor of 20. Oh well. Time for lunch.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/09/visualising-predicate-usage-on-the-semantic-web/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Subject Matter (or it&#8217;s a scam &#8211; there are only 900M!)</title>
		<link>http://gromgull.net/blog/2009/08/the-subject-matter-or-its-a-scam-there-are-only-900m/</link>
		<comments>http://gromgull.net/blog/2009/08/the-subject-matter-or-its-a-scam-there-are-only-900m/#comments</comments>
		<pubDate>Mon, 17 Aug 2009 14:01:31 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=304</guid>
		<description><![CDATA[This is the next part of the BTC statistics, this time I look at the subjects of the triples. Oh my, isn&#8217;t it exciting. Actually, I&#8217;ve had all the numbers for this ready for a while, but holidays and real work have kept me from typing it up. So, BTC overall contains:

128,079,322 unique subjects
118,205,618 has [...]]]></description>
			<content:encoded><![CDATA[<p>This is the next part of the <a href="http://vmlion25.deri.ie/index.html">BTC</a> statistics, this time I look at the <em>subjects </em>of the triples. Oh my, isn&#8217;t it exciting. Actually, I&#8217;ve had all the numbers for this ready for a while, but holidays and real work have kept me from typing it up. So, BTC overall contains:</p>
<ul>
<li>128,079,322 unique subjects</li>
<li>118,205,618 have more than a single triple</li>
<li>19,037,202 more than 10</li>
<li>1,302,353 more than 100</li>
<li>25,741 more than 1000</li>
<li>223 more than 10000</li>
</ul>
<p>Out of these 128M subjects 59,423,933 are blank nodes. Only 17,089 of them are <em>file://</em> URIs, I really expected many more to have snuck in. At first sight it may seem very odd that so many subjects have more than 1000 triples &#8211; what could those possibly be? However, when looking at the 10 subjects with the most triples it becomes clear:</p>
<table style="font-family: mono" border="0" cellspacing="3">
<tbody>
<tr>
<td style="text-align: right">138,618</td>
<td><a href="http://swrc.ontoware.org/ontology#InProceedings">swrc:InProceedings</a></td>
</tr>
<tr>
<td style="text-align: right">154,721</td>
<td><a href="http://sw.opencyc.org/concept/Mx4rZOAVeiYGEdqAAAACs2IMmw">http://sw.opencyc.org/concept/Mx4rZOAVeiYGEdqAAAACs2IMmw</a></td>
</tr>
<tr>
<td style="text-align: right">172,599</td>
<td><a href="http://sw.opencyc.org/2008/06/10/concept/">http://sw.opencyc.org/2008/06/10/concept/</a></td>
</tr>
<tr>
<td style="text-align: right">195,167</td>
<td><a href="http://purl.org/dc/dcmitype/Text">dctype:Text</a></td>
</tr>
<tr>
<td style="text-align: right">209,623</td>
<td><a href="http://xmlns.com/foaf/0.1/Document">foaf:Document</a></td>
</tr>
<tr>
<td style="text-align: right">358,090</td>
<td><a href="http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA">http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA</a></td>
</tr>
<tr>
<td style="text-align: right">362,161</td>
<td><a href="http://xmlns.com/foaf/0.1/holdsAccount">foaf:holdsAccount</a></td>
</tr>
<tr>
<td style="text-align: right">479,323</td>
<td><a href="http://sw.opencyc.org/concept/">http://sw.opencyc.org/concept/</a></td>
</tr>
<tr>
<td style="text-align: right">697,520</td>
<td><a href="http://sw.cyc.com/CycAnnotations_v1#label">http://sw.cyc.com/CycAnnotations_v1#label</a></td>
</tr>
<tr>
<td style="text-align: right">930,025</td>
<td><a href="http://sw.cyc.com/CycAnnotations_v1#externalID">http://sw.cyc.com/CycAnnotations_v1#externalID</a></td>
</tr>
</tbody>
</table>
<p>Most of these are parts of schemas, i.e. properties or classes (perhaps all? I don&#8217;t know enough about <a href="http://opencyc.org/">CYC</a> use to say what <em>http://sw.opencyc.org/2008/06/10/concept/ </em>is). Looking at the data, out of the hundreds of thousands of triples about <em>foaf:holdsAccount</em> for instance, 180,552 of the triples are:<br />
<code><br />
foaf:holdsAccount rdf:type rdf:Property .<br />
</code><br />
And 180,390 are the triple:<br />
<code><br />
foaf:holdsAccount rdf:type owl:InverseFunctionalProperty .<br />
</code><br />
Of course each of these is in a different context. At first I thought this meant that someone was keeping hundreds of thousands of copies of the FOAF ontology around, but of course then all the other FOAF properties and classes would also be the subject of lots of triples. Looking at the contexts where these triples came from there are 180,574 contexts containing the first triple. 180,389 of them are from <a href="http://purl.org/net/kanzaki/flickr2foaf">Kanzaki&#8217;s flickr2foaf</a> script (the remaining are 150 variations on <em>http://xmlns.com/foaf</em> and 30 odd random contexts). However, the output from flickr2foaf does not include the schema information, it only uses <em>foaf:holdsAccount </em>(and many <em>foaf:OnlineAccount </em>instances). My guess as to what has happened is that someone has crawled this: each profile, such as <a href="http://www.kanzaki.com/works/2005/misc/flickr2foaf?u=gromgull">mine</a>, will contain <em>rdfs:seeAlso </em>links to all my flickr contacts, and each of these pages will use <em>foaf:holdsAccount</em>. Then they applied some sort of inference that materialised the triples above, adding them once for each context they appeared in. This inference cannot be basic RDFS inference, since it also adds <em>owl:InverseFunctionalProperty</em>, and it has not been applied to all the BTC data, but only to some contexts. I wonder if there is a way to recover which contexts this has been applied to, and then perhaps find out which triples are <em>redundant</em>, i.e. they could be re-inferred from the other triples?</p>
<p>Now, all these triples about foaf:holdsAccount and CYC concepts also tell us something else: this isn&#8217;t really the Billion <strong>Triple</strong> Challenge, since many of the triples are duplicates, it is the Billion <strong>Quad</strong> challenge, which I guess is not so catchy. After a few more CPU cycles spent on piping things through sort and uniq (my favourite activity!), I know that out of the original 1,151,383,508 quads, there are actually only 1,150,846,965 unique quads, i.e. about 500K duplicates, and more interestingly, there are only 906,166,056 unique triples, i.e. 245M duplicates. I guess it&#8217;s not the <strong>Billion </strong>triple challenge either :) &#8211; now with only 900M triples it should be easy!</p>
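<p>For the curious, here is a minimal sketch of the quad versus triple distinction behind those numbers. It is not the pipeline I actually ran (that was sort and uniq on disk, and in-memory sets would not cope with a billion lines); the helper names <em>split_quad</em> and <em>count_unique</em> are made up for illustration, and it assumes every line has the shape <em>s p o c .</em> with a context term that contains no whitespace.</p>

```python
def split_quad(line):
    """Split one N-Quads line into (triple_text, context).

    Assumption: the line looks like 's p o c .' and the context term has no
    whitespace in it, so it can be peeled off from the right-hand side.
    """
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    triple, context = body.rsplit(None, 1)
    return triple, context


def count_unique(lines):
    """Return (total quads, unique quads, unique triples)."""
    quads = set()
    triples = set()
    total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        triple, context = split_quad(line)
        total += 1
        quads.add((triple, context))  # same triple in two contexts = two quads
        triples.add(triple)           # ... but only one triple
    return total, len(quads), len(triples)


# Hypothetical sample data: one triple, seen in two contexts, one exact repeat.
T = "<http://xmlns.com/foaf/0.1/holdsAccount> <http://example.org/p> <http://example.org/o>"
sample = [
    T + " <http://example.org/ctx/a> .",
    T + " <http://example.org/ctx/b> .",
    T + " <http://example.org/ctx/a> .",
]
print(count_unique(sample))
```

<p>The same triple repeated in a new context is a new quad but not a new triple, which is why the unique-triple count drops so much further than the unique-quad count.</p>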
<p>(BTW: No graphs this time, sorry! Also &#8211; I know I said I would talk about the literal values this time, but I changed my mind, next time!)</p>
<p><strong>UPDATE:</strong></p>
<p><a href="http://gianlucademartini.net/">Gianluca Demartini</a> asked an interesting question: Why are nearly half the subjects blank nodes? I don&#8217;t really know &#8211; but I can speculate. 46% of the subject IDs are blank nodes, and these account for &#8776;30% of the triples in the dataset. I was hoping these 30% would be badly distributed, i.e. that there were a few blank nodes with lots and lots of triples, but alas, the blank-node/triple distribution breaks down like this:</p>
<ul>
<li> 57,457,905 &#8211; over 1</li>
<li> 1,931,363 &#8211; over 10</li>
<li> 189,487 &#8211; over 100</li>
<li> 3,901 &#8211; over 1000</li>
<li> 50 &#8211; over 10000</li>
</ul>
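<p>A rough sketch of how a breakdown like this could be computed from the subject column alone. The function names are hypothetical, "over N" is assumed to mean strictly more than N triples, and a Counter in memory is only realistic for a sample, not the full dataset. The last helper counts how many of the largest bnode descriptions are needed to cover a given share of the bnode triples.</p>

```python
from collections import Counter


def bnode_triple_counts(subjects):
    """Count triples per blank-node subject (blank-node IDs start with '_:')."""
    return Counter(s for s in subjects if s.startswith("_:"))


def count_over(counts, thresholds=(1, 10, 100, 1000, 10000)):
    """For each threshold t, count the blank nodes with more than t triples."""
    return {t: sum(1 for n in counts.values() if n > t) for t in thresholds}


def largest_needed(counts, share=0.9):
    """How many of the largest descriptions cover `share` of the bnode triples."""
    target = share * sum(counts.values())
    covered = 0
    ordered = sorted(counts.values(), reverse=True)
    for needed, n in enumerate(ordered, start=1):
        covered += n
        if covered >= target:
            return needed
    return 0  # no blank nodes at all


# Hypothetical subject column: one subject per triple.
subjects = ["_:a"] * 12 + ["_:b"] * 2 + ["_:c"] + ["<http://example.org/x>"] * 5
counts = bnode_triple_counts(subjects)
print(count_over(counts, thresholds=(1, 10)))
print(largest_needed(counts, share=0.9))
```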
<p>You need to include the 43,916,862 largest bnode descriptions to cover 90% of these triples, i.e. we cannot quickly ignore the biggest ones and move on with our lives. I won&#8217;t give you the top N bnodes since these are more or less randomly generated IDs, but looking at some of the &#8220;largest&#8221; bnodes, they all look like <a href="http://www.sitemaps.org/">sitemap</a> files that have been converted to RDF &#8211; for example, the largest blank node is<em> _:genid1http-3A-2F-2Fwww-2Eindexedvisuals-2Ecom-2Findexedvisuals-2Exml</em>, which appears to be an RDF version of <a href="http://www.indexedvisuals.com/indexedvisuals.xml">the sitemap</a> for <a href="http://www.indexedvisuals.com/">www.indexedvisuals.com</a></p>
<p>Now, this bnode alone is the subject of 32,984 triples, and all of these apart from one are triples with the property <em>http://www.google.com/schemas/sitemap/0.84url </em>and another bnode as the object. I guess this is the case for many of the largest bnodes, and probably many of those nodes in turn. (Although a highly scientific <em>grep</em> for bnode IDs that contain &#8220;sitemap&#8221; returns only about 100K cases &#8211; a better count is underway.)</p>
<p>So in conclusion &#8211; bah! Who knows? Who needs bnodes anyway? :)</p>
<p><strong>UPDATE II: </strong></p>
<p>I did a proper count of how many of the blank nodes are sitemap nodes like the indexedvisuals one above, and it&#8217;s only 27! :) There goes that theory. These 27 <em>do </em>account for 71,985 triples with the <em>0.84url </em>predicate, but this is still a tiny fraction of the data. In the next post we will also see that a huge percentage of these bnodes have proper types, giving additional evidence that they are genuine, interesting parts of the data, not just some weird artifact.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/08/the-subject-matter-or-its-a-scam-there-are-only-900m/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
