<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>(still) nothing clever</title>
	<atom:link href="http://gromgull.net/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://gromgull.net/blog</link>
	<description></description>
	<lastBuildDate>Thu, 02 Sep 2010 14:21:13 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/>		<item>
		<title>Aggregates over BTC2010 namespaces</title>
		<link>http://gromgull.net/blog/2010/09/aggregates-over-btc2010-namespaces/</link>
		<comments>http://gromgull.net/blog/2010/09/aggregates-over-btc2010-namespaces/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 13:07:54 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=483</guid>
		<description><![CDATA[Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more &#8211; and it gets slightly less boring.
First, predicates: yesterday I had the raw count per predicate. Much more interesting are the namespaces the predicates are defined in. These are the top 10:



#triples	namespace
860,532,348	rdfs
651,432,324	http://data-gov.tw.rpi.edu/vocab/p/90
588,063,466	rdf
527,347,381	gr
284,679,897	foaf
44,119,248	dc11
41,961,046	http://purl.uniprot.org/core
17,233,778	rss
13,661,605	http://www.proteinontology.info/po.owl
13,009,685	owl



(prefix abbreviations are made from prefix.cc – I am [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://gromgull.net/blog/2010/09/btc2010-basic-stats/">Yesterday</a> I dumped the most basic BTC2010 stats. Today I have processed them a bit more &#8211; and it gets slightly less boring.</p>
<p>First, predicates: yesterday I had the raw count per predicate. Much more interesting are the namespaces the predicates are defined in. These are the top 10:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>860,532,348</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#">rdfs</a></td>
</tr>
<tr>
<td>651,432,324</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/90">http://data-gov.tw.rpi.edu/vocab/p/90</a></td>
</tr>
<tr>
<td>588,063,466</td>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#">rdf</a></td>
</tr>
<tr>
<td>527,347,381</td>
<td><a href="http://purl.org/goodrelations/v1#">gr</a></td>
</tr>
<tr>
<td>284,679,897</td>
<td><a href="http://xmlns.com/foaf/0.1/">foaf</a></td>
</tr>
<tr>
<td>44,119,248</td>
<td><a href="http://purl.org/dc/elements/1.1/">dc11</a></td>
</tr>
<tr>
<td>41,961,046</td>
<td><a href="http://purl.uniprot.org/core">http://purl.uniprot.org/core</a></td>
</tr>
<tr>
<td>17,233,778</td>
<td><a href="http://purl.org/rss/1.0/">rss</a></td>
</tr>
<tr>
<td>13,661,605</td>
<td><a href="http://www.proteinontology.info/po.owl">http://www.proteinontology.info/po.owl</a></td>
</tr>
<tr>
<td>13,009,685</td>
<td><a href="http://www.w3.org/2002/07/owl#">owl</a></td>
</tr>
</tbody>
</table>
<p>(prefix abbreviations are made from prefix.cc – I am too lazy to fix the missing ones)</p>
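<p>A minimal sketch of how per-predicate counts can be rolled up into per-namespace totals. The split rule (cut after the last <code>#</code>, otherwise after the last <code>/</code>) and the function names are assumptions made for illustration; the entries in the table above have no trailing separator, so the script that produced them evidently split slightly differently. The three counts come from the top-predicates table in the basic stats post.</p>

```python
from collections import Counter


def namespace_of(predicate_uri):
    """Return the URI up to and including its last '#' or '/' (assumed rule)."""
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind("/"))
    if cut == -1:
        # No separator at all: treat the whole URI as its own namespace.
        return predicate_uri
    return predicate_uri[: cut + 1]


def aggregate_by_namespace(predicate_counts):
    """Sum per-predicate triple counts into per-namespace totals."""
    totals = Counter()
    for predicate, count in predicate_counts.items():
        totals[namespace_of(predicate)] += count
    return totals


# Three counts taken from the BTC2010 top-predicates table.
predicate_counts = {
    "http://www.w3.org/2000/01/rdf-schema#label": 184881132,
    "http://www.w3.org/2000/01/rdf-schema#comment": 175141343,
    "http://xmlns.com/foaf/0.1/nick": 71742821,
}

# Print namespaces with their totals, biggest first.
for namespace, total in aggregate_by_namespace(predicate_counts).most_common():
    print(f"{total:,}\t{namespace}")
```

<p>With the full predicate list as input, <code>most_common(10)</code> would give a top-10 table of this shape.</p>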
<p>Now it gets interesting &#8211; because I did exactly this last year as well, and now we can compare!</p>
<h2>Dropouts</h2>
<p>In 2009 there were 3,817 different namespaces; this year we have 3,911, but only 2,945 occur in both. The biggest <em>dropouts</em>, i.e. namespaces that occurred last year but not at all this year, are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>10,239,809</td>
<td><a href="http://www.kisti.re.kr/isrl/ResearchRefOntology">http://www.kisti.re.kr/isrl/ResearchRefOntology</a></td>
</tr>
<tr>
<td>5,443,549</td>
<td><a href="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">nie</a></td>
</tr>
<tr>
<td>1,571,547</td>
<td><a href="http://ontologycentral.com/2009/01/eurostat/ns">http://ontologycentral.com/2009/01/eurostat/ns</a></td>
</tr>
<tr>
<td>1,094,963</td>
<td><a href="http://sindice.com/exfn/0.1">http://sindice.com/exfn/0.1</a></td>
</tr>
<tr>
<td>320,155</td>
<td><a href="http://xmdr.org/ont/iso11179-3e3draft_r4.owl">http://xmdr.org/ont/iso11179-3e3draft_r4.owl</a></td>
</tr>
<tr>
<td>307,534</td>
<td><a href="http://cb.semsol.org/ns">http://cb.semsol.org/ns</a></td>
</tr>
<tr>
<td>242,427</td>
<td><a href="http://www.semanticdesktop.org/ontologies/2007/03/22/nco#">nco</a></td>
</tr>
<tr>
<td>203,283</td>
<td><a href="http://www.ordnancesurvey.co.uk/ontology/AdministrativeGeography/v2.0/AdministrativeGeography.rdf#">osag</a></td>
</tr>
<tr>
<td>187,600</td>
<td><a href="http://auswiki.org/index.php/Special:URIResolver">http://auswiki.org/index.php/Special:URIResolver</a></td>
</tr>
<tr>
<td>159,536</td>
<td><a href="http://www.semanticdesktop.org/ontologies/2007/05/10/nexif#">nexif</a></td>
</tr>
</tbody>
</table>
<p>I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples.</p>
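<p>The dropout and newcomer lists are plain set differences over the two years' namespaces. From the counts above, 3,817 - 2,945 = 872 namespaces dropped out and 3,911 - 2,945 = 966 are new. A small sketch, with made-up toy counts and a hypothetical helper name:</p>

```python
def compare_years(counts_2009, counts_2010):
    """Split namespaces into dropouts, newcomers and those seen in both years."""
    old, new = set(counts_2009), set(counts_2010)
    # Rank each group by the triple count from the year it appeared in.
    dropouts = sorted(old - new, key=counts_2009.get, reverse=True)
    newcomers = sorted(new - old, key=counts_2010.get, reverse=True)
    return dropouts, newcomers, old.intersection(new)


# Made-up toy counts, only to show the shape of the data.
counts_2009 = {"ns:a": 50, "ns:b": 7, "ns:c": 20}
counts_2010 = {"ns:a": 80, "ns:d": 300, "ns:e": 2}

dropouts, newcomers, both = compare_years(counts_2009, counts_2010)
print(dropouts)      # namespaces only in 2009, biggest first
print(newcomers)     # namespaces only in 2010, biggest first
print(sorted(both))  # namespaces present in both years
```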
<h2>Newcomers</h2>
<p>Looking the other way around, at which namespaces are new and popular this year, we get:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>651,432,324</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/90">http://data-gov.tw.rpi.edu/vocab/p/90</a></td>
</tr>
<tr>
<td>5,001,909</td>
<td><a href="http://www.rdfabout.com/rdf/schema/usfec/">fec</a></td>
</tr>
<tr>
<td>2,689,813</td>
<td><a href="http://transport.data.gov.uk/0/ontology/traffic">http://transport.data.gov.uk/0/ontology/traffic</a></td>
</tr>
<tr>
<td>543,835</td>
<td><a href="http://rdf.geospecies.org/ont/geospecies">http://rdf.geospecies.org/ont/geospecies</a></td>
</tr>
<tr>
<td>526,304</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/401">http://data-gov.tw.rpi.edu/vocab/p/401</a></td>
</tr>
<tr>
<td>469,446</td>
<td><a href="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf">http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf</a></td>
</tr>
<tr>
<td>446,120</td>
<td><a href="http://education.data.gov.uk/def/school">http://education.data.gov.uk/def/school</a></td>
</tr>
<tr>
<td>223,726</td>
<td><a href="http://www.w3.org/TR/rdf-schema">http://www.w3.org/TR/rdf-schema</a></td>
</tr>
<tr>
<td>190,890</td>
<td><a href="http://wecowi.de/wiki/Spezial:URIResolver">http://wecowi.de/wiki/Spezial:URIResolver</a></td>
</tr>
<tr>
<td>166,511</td>
<td><a href="http://data-gov.tw.rpi.edu/vocab/p/10">http://data-gov.tw.rpi.edu/vocab/p/10</a></td>
</tr>
</tbody>
</table>
<p>Here the introductions of <a href="http://data.gov">data.gov</a> and <a href="http://data.gov.uk">data.gov.uk</a> were the big events of the last year.</p>
<h2>Winners</h2>
<p>For the namespaces that occurred in both years we can find the biggest gainers. Here I calculated what share of the total triples each namespace constituted each year, and the ratio of the 2010 share to the 2009 share. For example, GoodRelations, on top here, constituted over 16% of all triples in 2010, but only 2.91e-4% of all triples last year, for a cool gain of a factor of about 57,000 (an increase of roughly 5,700,000%) :)</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>gain</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>57058.38</td>
<td><a href="http://purl.org/goodrelations/v1#">gr</a></td>
</tr>
<tr>
<td>2636.34</td>
<td><a href="http://www.openlinksw.com/schema/attribution">http://www.openlinksw.com/schema/attribution</a></td>
</tr>
<tr>
<td>2182.81</td>
<td><a href="http://www.openrdf.org/schema/serql">http://www.openrdf.org/schema/serql</a></td>
</tr>
<tr>
<td>1944.68</td>
<td><a href="http://www.w3.org/2007/OWL/testOntology">http://www.w3.org/2007/OWL/testOntology</a></td>
</tr>
<tr>
<td>1235.02</td>
<td><a href="http://referata.com/wiki/Special:URIResolver">http://referata.com/wiki/Special:URIResolver</a></td>
</tr>
<tr>
<td>1211.35</td>
<td><a href="urn:lsid:ubio.org:predicates:recordVersion">urn:lsid:ubio.org:predicates:recordVersion</a></td>
</tr>
<tr>
<td>1208.09</td>
<td><a href="urn:lsid:ubio.org:predicates:lexicalStatus">urn:lsid:ubio.org:predicates:lexicalStatus</a></td>
</tr>
<tr>
<td>1194.66</td>
<td><a href="urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping">urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping</a></td>
</tr>
<tr>
<td>1191.39</td>
<td><a href="urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank">urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank</a></td>
</tr>
<tr>
<td>701.66</td>
<td><a href="urn:lsid:ubio.org:predicates:hasCAVConcept">urn:lsid:ubio.org:predicates:hasCAVConcept</a></td>
</tr>
</tbody>
</table>
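<p>The <em>gain</em> column is a ratio of shares, so a value above 1 means a namespace grew faster than the dataset as a whole. A minimal sketch of the calculation, with hypothetical function names; the 2010 GoodRelations and total figures are the ones reported in these posts, while the 2009 inputs in the last line are made up purely to exercise the function:</p>

```python
def share(count, total):
    """Fraction of all triples that use predicates from one namespace."""
    return count / total


def gain(count_old, total_old, count_new, total_new):
    """Ratio of the new year's share to the old year's share."""
    return share(count_new, total_new) / share(count_old, total_old)


# 2010 figures as reported: GoodRelations triples and all triples.
gr_2010 = 527347381
total_2010 = 3171793030
print(f"gr share in 2010: {share(gr_2010, total_2010):.2%}")

# Made-up inputs: a namespace going from 1% to 10% of all triples.
print(gain(count_old=10, total_old=1000, count_new=300, total_new=3000))
```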
<h2>Losers</h2>
<p>Similarly, we have the biggest losers, the ones who lost the most:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>gain</th>
<th><strong>namespace</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>0.000185</td>
<td><a href="http://purl.org/obo/metadata">http://purl.org/obo/metadata</a></td>
</tr>
<tr>
<td>0.000191</td>
<td><a href="http://rdfs.org/sioc/types#">sioct</a></td>
</tr>
<tr>
<td>0.000380</td>
<td><a href="http://www.w3.org/2006/vcard/ns#">vcard</a></td>
</tr>
<tr>
<td>0.000418</td>
<td><a href="http://www.affymetrix.com/community/publications/affymetrix/tmsplice#">affy</a></td>
</tr>
<tr>
<td>0.000438</td>
<td><a href="http://www.geneontology.org/go">http://www.geneontology.org/go</a></td>
</tr>
<tr>
<td>0.000677</td>
<td><a href="http://tap.stanford.edu/data">http://tap.stanford.edu/data</a></td>
</tr>
<tr>
<td>0.000719</td>
<td><a href="urn://wymiwyg.org/knobot/default">urn://wymiwyg.org/knobot/default</a></td>
</tr>
<tr>
<td>0.000787</td>
<td><a href="http://www.aktors.org/ontology/support#">akts</a></td>
</tr>
<tr>
<td>0.000876</td>
<td><a href="http://wymiwyg.org/ontologies/language-selection">http://wymiwyg.org/ontologies/language-selection</a></td>
</tr>
<tr>
<td>0.000904</td>
<td><a href="http://wymiwyg.org/ontologies/knobot">http://wymiwyg.org/ontologies/knobot</a></td>
</tr>
</tbody>
</table>
<p>If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)</p>
<p>With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.</p>
<h2>Update</h2>
<p>(a bit later)</p>
<p>You can get the full datasets for this from Many Eyes:</p>
<ul>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/number-of-triples-per-predicate-na">Namespaces 2009</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/number-of-triples-per-predicate-na-2">Namespaces 2010</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/namespaces-that-dropped-out-betwee">Dropouts</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/namespaces-that-appeared-between-b">Newcomers</a></li>
<li><a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/namespace-change-between-btc-2009-">Changes</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/09/aggregates-over-btc2010-namespaces/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BTC2010 Basic stats</title>
		<link>http://gromgull.net/blog/2010/09/btc2010-basic-stats/</link>
		<comments>http://gromgull.net/blog/2010/09/btc2010-basic-stats/#comments</comments>
		<pubDate>Wed, 01 Sep 2010 13:07:18 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=474</guid>
		<description><![CDATA[Another year, another billion triple dataset. This time it was released at the same time my daughter was born, so running the stats script was delayed for a bit.
This year we&#8217;ve got a few more triples, perhaps making up for the fact that it wasn&#8217;t actually one billion last year :) we&#8217;ve now got 3.1B triples [...]]]></description>
			<content:encoded><![CDATA[<p>Another year, <a href="http://km.aifb.kit.edu/projects/btc-2010/">another billion triple dataset</a>. This time it was released at the same time my daughter was born, so running the stats script was delayed for a bit.</p>
<p>This year we&#8217;ve got a few more triples, perhaps making up for the fact that it wasn&#8217;t actually one billion last year :) We&#8217;ve now got 3.1B triples (or 3,171,793,030 if you want to be exact).</p>
<p>I&#8217;ve not had a chance to do anything really fun with this, so I&#8217;ll just dump the stats:</p>
<h2>Subjects</h2>
<ul>
<li>159,185,186 unique subjects</li>
<li>147,663,612 occur in more than a single triple</li>
<li>12,647,098 more than 10 times</li>
<li>5,394,733 more than 100</li>
<li>313,493 more than 1,000</li>
<li>46,116 more than 10,000</li>
<li>and 53 more than 100,000 times</li>
</ul>
<p>For an average of 19.9252 triples per unique subject. Like last year, I am not sure whether having more than 100,000 triples with the same subject is really useful for anyone.</p>
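<p>Numbers like these come from counting triples per subject and then counting how many subjects exceed each threshold; the average is simply total triples divided by unique subjects (3,171,793,030 / 159,185,186). A rough sketch of that bookkeeping follows. The function name and the tiny in-memory input are illustrative assumptions, and a dataset of this size would need a streaming or on-disk approach rather than a Python list.</p>

```python
from bisect import bisect_right
from collections import Counter

THRESHOLDS = (1, 10, 100, 1000, 10000, 100000)


def subject_stats(triples):
    """Count triples per subject and summarise the distribution."""
    per_subject = Counter(subject for subject, _predicate, _obj in triples)
    counts = sorted(per_subject.values())
    # Number of subjects that occur in more than t triples, for each threshold t.
    more_than = {t: len(counts) - bisect_right(counts, t) for t in THRESHOLDS}
    average = sum(counts) / len(counts)
    return len(counts), more_than, average


# Tiny made-up example; the real input would be the parsed dataset.
triples = [
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:nick", "a"),
    ("ex:b", "rdf:type", "foaf:Person"),
]
unique, more_than, average = subject_stats(triples)
print(unique, more_than[1], average)
```

<p>The same function works for predicates, objects or contexts by picking a different position of the tuple.</p>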
<p>Looking only at bnodes used as subjects we get:</p>
<ul>
<li>100,431,757 unique subjects</li>
<li>98,744,109 occur in more than a single triple</li>
<li>1,465,399 more than 10 times</li>
<li>266,759 more than 100</li>
<li>4,956 more than 1,000</li>
<li>48 more than 10,000</li>
</ul>
<p>So 100M out of 159M subjects are bnodes, but they are used less often than the named resources. </p>
<p>The top subjects are as follows:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>subject</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>1,412,709</td>
<td><a href="http://www.proteinontology.info/po.owl#A">http://www.proteinontology.info/po.owl#A</a></td>
</tr>
<tr>
<td>895,776</td>
<td><a href="http://openean.kaufkauf.net/id/">http://openean.kaufkauf.net/id/</a></td>
</tr>
<tr>
<td>827,295</td>
<td><a href="http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy">http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy</a></td>
</tr>
<tr>
<td>492,756</td>
<td><a href="http://sw.cyc.com/CycAnnotations_v1#externalID">cycann:externalID</a></td>
</tr>
<tr>
<td>481,000</td>
<td><a href="http://purl.uniprot.org/citations/15685292">http://purl.uniprot.org/citations/15685292</a></td>
</tr>
<tr>
<td>445,430</td>
<td><a href="http://xmlns.com/foaf/0.1/Document">foaf:Document</a></td>
</tr>
<tr>
<td>369,567</td>
<td><a href="http://sw.cyc.com/CycAnnotations_v1#label">cycann:label</a></td>
</tr>
<tr>
<td>362,391</td>
<td><a href="http://purl.org/dc/dcmitype/Text">dcmitype:Text</a></td>
</tr>
<tr>
<td>357,309</td>
<td><a href="http://sw.opencyc.org/concept/">http://sw.opencyc.org/concept/</a></td>
</tr>
<tr>
<td>349,988</td>
<td><a href="http://purl.uniprot.org/citations/16973872">http://purl.uniprot.org/citations/16973872</a></td>
</tr>
</tbody>
</table>
<p>I do not know enough about the Protein Ontology to know why <em>po:A</em> is so popular. CYC was already here last year, and I guess all products exposed by BestBuy have this URI as a subject.</p>
<h2>Predicates</h2>
<ul>
<li>95,379 unique predicates</li>
<li>83,370 occur in more than one triple</li>
<li>46,710 more than 10</li>
<li>18,385 more than 100</li>
<li>5,395 more than 1,000</li>
<li>1,271 more than 10,000</li>
<li>548 more than 100,000</li>
</ul>
<p>The average predicate occurred in 33254.6 triples.</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>predicate</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>557,268,190</td>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">rdf:type</a></td>
</tr>
<tr>
<td>384,891,996</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#isDefinedBy">rdfs:isDefinedBy</a></td>
</tr>
<tr>
<td>215,041,142</td>
<td><a href="http://purl.org/goodrelations/v1#hasGlobalLocationNumber">gr:hasGlobalLocationNumber</a></td>
</tr>
<tr>
<td>184,881,132</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#label">rdfs:label</a></td>
</tr>
<tr>
<td>175,141,343</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#comment">rdfs:comment</a></td>
</tr>
<tr>
<td>168,719,459</td>
<td><a href="http://purl.org/goodrelations/v1#hasEAN_UCC-13">gr:hasEAN_UCC-13</a></td>
</tr>
<tr>
<td>131,029,818</td>
<td><a href="http://purl.org/goodrelations/v1#hasManufacturer">gr:hasManufacturer</a></td>
</tr>
<tr>
<td>112,635,203</td>
<td><a href="http://www.w3.org/2000/01/rdf-schema#seeAlso">rdfs:seeAlso</a></td>
</tr>
<tr>
<td>71,742,821</td>
<td><a href="http://xmlns.com/foaf/0.1/nick">foaf:nick</a></td>
</tr>
<tr>
<td>71,036,882</td>
<td><a href="http://xmlns.com/foaf/0.1/knows">foaf:knows</a></td>
</tr>
</tbody>
</table>
<p>The usual suspects, rdf:type, comment, label, seeAlso and a bit of FOAF. New this year is lots of GoodRelations data!</p>
<h2>Objects &#8211; Resources</h2>
<p>Ignoring literals for the moment, looking only at resource-objects, we have: </p>
<ul>
<li>192,855,067 unique resources</li>
<li>36,144,147 occur in more than a single triple</li>
<li>2,905,294 more than 10 times</li>
<li>197,052 more than 100</li>
<li>20,011 more than 1,000</li>
<li>2,752 more than 10,000</li>
<li>and 370 more than 100,000 times</li>
</ul>
<p>On average 7.72834 triples per object. This covers both named objects and bnodes; looking at the bnodes only we get:</p>
<ul>
<li>97,617,548 unique resources</li>
<li>616,825 occur in more than a single triple</li>
<li>8,632 more than 10 times</li>
<li>2,167 more than 100</li>
<li>1 more than 1,000</li>
</ul>
<p>Since bnode IDs are only valid within a single file, there is a limit to how often they can appear, but still about half of the overall objects are bnodes.</p>
<p>The top ten bnode IDs are pretty boring, but the top 10 named resources are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>resource-object</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>215,532,631</td>
<td><a href='http://purl.org/goodrelations/v1#BusinessEntity'>gr:BusinessEntity</a></td>
</tr>
<tr>
<td>215,153,113</td>
<td><a href='http://openean.kaufkauf.net/id/businessentities/'>ean:businessentities/</a></td>
</tr>
<tr>
<td>168,205,900</td>
<td><a href='http://purl.org/goodrelations/v1#ProductOrServiceModel'>gr:ProductOrServiceModel</a></td>
</tr>
<tr>
<td>167,789,556</td>
<td><a href='http://openean.kaufkauf.net/id/'>http://openean.kaufkauf.net/id/</a></td>
</tr>
<tr>
<td>71,051,459</td>
<td><a href='http://xmlns.com/foaf/0.1/Person'>foaf:Person</a></td>
</tr>
<tr>
<td>10,373,362</td>
<td><a href='http://xmlns.com/foaf/0.1/OnlineAccount'>foaf:OnlineAccount</a></td>
</tr>
<tr>
<td>6,842,729</td>
<td><a href='http://purl.org/rss/1.0/item'>rss:item</a></td>
</tr>
<tr>
<td>6,025,094</td>
<td><a href='http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement'>rdf:Statement</a></td>
</tr>
<tr>
<td>4,647,293</td>
<td><a href='http://xmlns.com/foaf/0.1/Document'>foaf:Document</a></td>
</tr>
<tr>
<td>4,230,908</td>
<td><a href='http://purl.uniprot.org/core/Resource'>http://purl.uniprot.org/core/Resource</a></td>
</tr>
</tbody>
</table>
<p>These are pretty much all types – compare to: </p>
<h2>Types</h2>
<p>A &#8220;type&#8221; being the object that occurs in a triple where <i>rdf:type</i> is the predicate gives us:</p>
<ul>
<li>170,020 types</li>
<li>91,479 occur in more than a single triple</li>
<li>20,196 more than 10 times</li>
<li>4,325 more than 100</li>
<li>1,113 more than 1,000</li>
<li>258 more than 10,000</li>
<li>and 89 more than 100,000 times</li>
</ul>
<p>On average each type is used 3277.7 times, and the top 10 are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>type</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>215,536,042</td>
<td><a href='http://purl.org/goodrelations/v1#BusinessEntity'>gr:BusinessEntity</a></td>
</tr>
<tr>
<td>168,208,826</td>
<td><a href='http://purl.org/goodrelations/v1#ProductOrServiceModel'>gr:ProductOrServiceModel</a></td>
</tr>
<tr>
<td>71,520,943</td>
<td><a href='http://xmlns.com/foaf/0.1/Person'>foaf:Person</a></td>
</tr>
<tr>
<td>10,447,941</td>
<td><a href='http://xmlns.com/foaf/0.1/OnlineAccount'>foaf:OnlineAccount</a></td>
</tr>
<tr>
<td>6,886,401</td>
<td><a href='http://purl.org/rss/1.0/item'>rss:item</a></td>
</tr>
<tr>
<td>6,066,069</td>
<td><a href='http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement'>rdf:Statement</a></td>
</tr>
<tr>
<td>4,674,162</td>
<td><a href='http://xmlns.com/foaf/0.1/Document'>foaf:Document</a></td>
</tr>
<tr>
<td>4,260,056</td>
<td><a href='http://purl.uniprot.org/core/Resource'>http://purl.uniprot.org/core/Resource</a></td>
</tr>
<tr>
<td>4,001,282</td>
<td><a href='http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry'>http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry</a></td>
</tr>
<tr>
<td>3,405,101</td>
<td><a href='http://www.w3.org/2002/07/owl#Class'>owl:Class</a></td>
</tr>
</tbody>
</table>
<p>Not identical to the top resources, but quite similar. Lots of FOAF and new this year, lots of GoodRelations.</p>
<h2>Contexts</h2>
<p>Something changed with regard to context handling for BTC2010: this year we only have 8M contexts, whereas last year we had over 35M.<br />
I wonder if perhaps all of DBpedia is in one context this year?</p>
<ul>
<li>8,126,834 unique contexts</li>
<li>8,048,574 occur in more than a single triple</li>
<li>6,211,398 more than 10 times</li>
<li>1,493,520 more than 100</li>
<li>321,466 more than 1,000</li>
<li>61,360 more than 10,000</li>
<li>and 4,799 more than 100,000 times</li>
</ul>
<p>For an average of 389.958 triples per context. The 10 biggest contexts are: </p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>context</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>302,127</td>
<td><a href='http://data-gov.tw.rpi.edu/raw/402/data-402.rdf'>http://data-gov.tw.rpi.edu/raw/402/data-402.rdf</a></td>
</tr>
<tr>
<td>273,644</td>
<td><a href='http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf'>http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf</a></td>
</tr>
<tr>
<td>259,824</td>
<td><a href='http://static.cpantesters.org/author/M/MIYAGAWA.rss'>http://static.cpantesters.org/author/M/MIYAGAWA.rss</a></td>
</tr>
<tr>
<td>207,513</td>
<td><a href='http://data-gov.tw.rpi.edu/raw/401/data-401.rdf'>http://data-gov.tw.rpi.edu/raw/401/data-401.rdf</a></td>
</tr>
<tr>
<td>193,944</td>
<td><a href='http://static.cpantesters.org/author/D/DROLSKY.rss'>http://static.cpantesters.org/author/D/DROLSKY.rss</a></td>
</tr>
<tr>
<td>189,528</td>
<td><a href='http://static.cpantesters.org/author/S/SMUELLER.rss'>http://static.cpantesters.org/author/S/SMUELLER.rss</a></td>
</tr>
<tr>
<td>170,899</td>
<td><a href='http://data-gov.tw.rpi.edu/raw/59/data-59.rdf'>http://data-gov.tw.rpi.edu/raw/59/data-59.rdf</a></td>
</tr>
<tr>
<td>166,454</td>
<td><a href='http://zaltys.net/ontology/AKTiveSAOntology.owl'>http://zaltys.net/ontology/AKTiveSAOntology.owl</a></td>
</tr>
<tr>
<td>166,454</td>
<td><a href='http://www.zaltys.net/ontology/AKTiveSAOntology.owl'>http://www.zaltys.net/ontology/AKTiveSAOntology.owl</a></td>
</tr>
<tr>
<td>165,948</td>
<td><a href='http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl'>http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl</a></td>
</tr>
</tbody>
</table>
<p>This concludes my boring stats dump for BTC2010 for now. Some information on literals and hopefully some graphs will come soon! I also plan to look into how these stats changed from last year &#8211; so far I see much more GoodRelations, but there must be other fun changes!</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/09/btc2010-basic-stats/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Illustrating the kernel trick</title>
		<link>http://gromgull.net/blog/2010/05/illustrating-the-kernel-trick/</link>
		<comments>http://gromgull.net/blog/2010/05/illustrating-the-kernel-trick/#comments</comments>
		<pubDate>Sun, 16 May 2010 10:00:05 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=461</guid>
		<description><![CDATA[For a one-paragraph intro to SVMs and the kernel-trick I wanted a graphic that I&#8217;ve seen in a book (although forgotten where, perhaps in Pattern Classification?):

Simple idea — show some 2D data points that are not linearly separable, then transform to 3D somehow, and show that they are. I found nothing on google [...]]]></description>
			<content:encoded><![CDATA[<p>For a one-paragraph intro to SVMs and the kernel-trick I wanted a graphic that I&#8217;ve seen in a book (although forgotten where, perhaps in <a href="http://www.amazon.com/Pattern-Classification-2nd-Richard-Duda/dp/0471056693/ref=dp_cp_ob_b_title_2">Pattern Classification</a>?):</p>
<div style="text-align: center"><a href="http://farm2.static.flickr.com/1198/4608086713_a2799c4418_o.png"><img style="border: 0pt none;" title="2d data" src="http://farm2.static.flickr.com/1198/4608086713_9995534f6c_m.jpg" alt="" width="240" height="180" /></a><a href="http://farm5.static.flickr.com/4006/4608694916_1c6f17b2c6_o.png"><img style="border: 0pt none;" title="3d data" src="http://farm5.static.flickr.com/4006/4608694916_8f54e7035b_m.jpg" alt="" width="240" height="180" /></a></div>
<p>Simple idea: show some 2D data points that are not linearly separable, then transform to 3D somehow, and show that they are. I found nothing on Google (at least nothing that was high enough resolution to reuse), so I wrote some lines of Python with pylab and matplotlib:</p>
<pre class="brush:python">
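import math

import numpy
import pylab
# importing Axes3D registers the "3d" projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D


def vlen(v):
    """Euclidean length of a vector."""
    return math.sqrt(numpy.vdot(v, v))


# 100 random 2D points from a standard normal distribution
p = numpy.random.randn(100, 2)

# class a: a ring around the origin, class b: a blob in the centre
a = numpy.array([x for x in p if vlen(x) &gt; 1.3 and vlen(x) &lt; 2])
b = numpy.array([x for x in p if vlen(x) &lt; 0.8])

# 2D plot: no straight line separates the ring from the blob
pylab.scatter(a[:, 0], a[:, 1], s=30, c="blue")
pylab.scatter(b[:, 0], b[:, 1], s=50, c="red", marker="s")
pylab.savefig("linear.png")

# 3D plot: use the vector length as an extra coordinate
fig = pylab.figure()
ax = fig.add_subplot(111, projection="3d")
ax.view_init(30, -110)

ax.scatter3D([vlen(x) for x in a], a[:, 0], a[:, 1], s=30, c="blue")
ax.scatter3D([vlen(x) for x in b], b[:, 0], b[:, 1], s=50, marker="s", c="red")
pylab.savefig("transformed.png")

pylab.show()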
</pre>
<p>Take, adapt, use for anything you like. You can rotate the 3D plot in the window that is shown, and you can save the figures as PDF etc. Unfortunately, the sizing of markers in the 3D plot is not yet implemented in the latest matplotlib (0.99.1.2-3), so this only looks good with the latest SVN build.</p>
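<p>Why the extra coordinate helps can also be checked numerically. The sketch below uses the same lift as the 3D plot, mapping a point (x, y) to (|p|, x, y), and shows that the first coordinate alone leaves a gap between the two classes, so a flat plane separates them. To stay dependency-free it samples the two classes directly by radius instead of filtering Gaussian points as the script above does; that change and the helper names are assumptions made for illustration.</p>

```python
import math
import random

random.seed(0)


def sample(r_min, r_max, n):
    """Draw n 2D points whose distance from the origin lies in [r_min, r_max]."""
    points = []
    for _ in range(n):
        r = random.uniform(r_min, r_max)
        angle = random.uniform(0, 2 * math.pi)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


def lift(point):
    """Map (x, y) to (|p|, x, y), the transform used for the 3D plot."""
    x, y = point
    return (math.hypot(x, y), x, y)


core = [lift(p) for p in sample(0.0, 0.8, 50)]  # the red squares
ring = [lift(p) for p in sample(1.3, 2.0, 50)]  # the blue dots

# In 3D the first coordinate alone separates the classes: any plane
# "first coordinate = c" with c strictly inside this gap splits them.
gap_low = max(p[0] for p in core)
gap_high = min(p[0] for p in ring)
print(f"separating plane anywhere between {gap_low:.2f} and {gap_high:.2f}")
```

<p>In the original 2D coordinates the same boundary is a circle, which is exactly what a linear classifier cannot draw.</p>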
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/05/illustrating-the-kernel-trick/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Machine Learning Algorithm with Capital A</title>
		<link>http://gromgull.net/blog/2010/03/the-machine-learning-algorithm-with-capital-a/</link>
		<comments>http://gromgull.net/blog/2010/03/the-machine-learning-algorithm-with-capital-a/#comments</comments>
		<pubDate>Wed, 10 Mar 2010 12:08:36 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=435</guid>
		<description><![CDATA[A student came to see me recently, wanting to do a Diplomarbeit (i.e. an MSc++) on a learning technique called Hierarchical Temporal Memory or HTMs. He had very specific ideas about what he wanted, and had already worked with the algorithm in his Projektarbeit (BSc++). I knew nothing about the approach, but remembered this reddit post, [...]]]></description>
			<content:encoded><![CDATA[<p>A student came to see me recently, wanting to do a Diplomarbeit (i.e. an MSc++) on a learning technique called <a href="http://en.wikipedia.org/wiki/Hierarchical_Temporal_Memory">Hierarchical Temporal Memory</a> or HTMs. He had very specific ideas about what he wanted, and had already worked with the algorithm in his Projektarbeit (BSc++). I knew nothing about the approach, but remembered <a href="http://www.reddit.com/r/MachineLearning/comments/a0mwd/dear_reddit_machine_learning_is_jeff_hawkings/">this reddit post</a>, which was less than enthusiastic. I spoke to the student and read up on the thing a bit, and it seems interesting enough. It&#8217;s claimed to be close to the way the brain learns/recognizes patterns, to be a general model of intelligence, and to work for EVERYTHING. This reminded me of a few other things I&#8217;ve come across in the past years that claim to be the new Machine Learning algorithm with Capital <strong>A</strong>, i.e. the algorithm to end all other ML work, which will work on all problems, and so on. Here is a small collection of the three most interesting ones I remembered:</p>
<h2 class="inpage">Hierarchical Temporal Memory</h2>
<p>HTMs were &#8220;invented&#8221; by <a href="http://en.wikipedia.org/wiki/Jeff_Hawkins">Jeff Hawkins</a>, whose track record includes making the Palm device and platform and later the Treo. Having your algorithm&#8217;s PR based on the celebrity status of the inventor is not really a good sign. The model was first presented in his book <a href="http://en.wikipedia.org/wiki/On_Intelligence"><em>On Intelligence</em></a>, which I&#8217;ve duly bought and am currently reading. The book is so far very interesting, although full of things like <em>&#8220;[this is] how the brain actually works&#8221;</em>, <em>&#8220;Can we build intelligent machines? &#8230; Yes. We can and we will.&#8221;</em>, etc. As far as I understand, the model from the book was formally analysed and became the HTM algorithm in <a title="Dileep George (page does not exist)" href="http://en.wikipedia.org/w/index.php?title=Dileep_George&amp;action=edit&amp;redlink=1">Dileep George</a>&#8217;s thesis: <em>How the brain might work: A hierarchical and temporal model for learning and recognition</em>. He applies it to recognizing 32&#215;32 pictures of various letters and objects.</p>
<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2010/03/htm.png"><img class="aligncenter size-full wp-image-446" style="border: 0pt none;" title="Hierarchical Temporal Memory" src="http://gromgull.net/blog/wp-content/uploads/2010/03/htm.png" alt="" width="400" /></a></p>
<p>The model is based on a <em>hierarchy</em> of sensing components, each dealing with a higher level of abstraction when processing input. The top of the hierarchy feeds into some <em>traditional</em> learning algorithm, such as an SVM for classification, or some clustering mechanism. In effect, the whole HTM is a form of feature pre-processing. The <em>temporal</em> aspect is introduced by the nodes observing their input (either the raw input, or their sub-nodes) over time; this (hopefully) gives rise to translation, rotation and scale invariance, as the things you are watching move around. I say <em>watching</em> here, because computer vision seems to be the main application, although it&#8217;s of course applicable to EVERYTHING:</p>
<blockquote><p>HTM technology has the potential to solve many difficult problems in  machine learning, inference, and prediction.  Some of the application  areas we are exploring with our customers include recognizing objects in  images, recognizing behaviors in videos, identifying the gender of a  speaker, predicting traffic patterns, doing optical character  recognition on messy text, evaluating medical images, and predicting  click through patterns on the web.</p></blockquote>
<p>The guys went on to make a company called <a href="http://www.numenta.com/">Numenta</a> for spreading this technique, they have a (not open-source, boo!) <a href="http://www.numenta.com/about-numenta/numenta-technology-2.php">development framework you can play with</a>.</p>
<h2 class="inpage">Normalised Compression Distance</h2>
<p>This beast goes under many names: <em>compression based learning, compression-based dissimilarity measure,</em> etc. The idea is in any case to reuse compression algorithms for learning, from the good old <a href="http://en.wikipedia.org/wiki/DEFLATE">DEFLATE</a> algorithm from zip/gzip, to algorithms specific to some data-type, like <a href="http://www1.spms.ntu.edu.sg/~chenxin/paper/GIW99.pdf">DNA</a>. The <em>distance</em> between things is then derived from how well they compress together with each other or with other data, and the distance metric can then be used for clustering, classification, anomaly detection, etc. The whole thing is supported by the theory of Kolmogorov Complexity and Minimum Description Length, i.e. it&#8217;s not just a hack.</p>
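<p>To make the idea concrete, here is a minimal sketch of the usual NCD formula, using Python&#8217;s zlib (i.e. DEFLATE) as the compressor. The function names and the sample byte strings are made up for illustration; any other compressor could be swapped in for <code>csize</code>.</p>
<pre class="brush:python">
import zlib

def csize(data):
    # Compressed size in bytes, using DEFLATE via zlib at maximum level.
    return len(zlib.compress(data, 9))

def ncd(x, y):
    # Normalised Compression Distance:
    # (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Made-up sample data: two near-identical texts and one unrelated blob.
english = b"the quick brown fox jumps over the lazy dog " * 20
similar = b"the quick brown fox leaps over the lazy cat " * 20
other = bytes(range(256)) * 4

print(ncd(english, similar))  # small: the two texts share most patterns
print(ncd(english, other))    # close to 1: nothing in common to exploit
</pre>
<p>Things that compress well together end up close to each other, and the resulting pairwise distances can be fed to whatever clustering or nearest-neighbour method you like.</p>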
<p>I came across it back in 2003 in the <a href="http://www.kuro5hin.org/">Kuro5hin</a> article<a href="http://www.kuro5hin.org/story/2003/1/25/224415/367"> Spam Filtering with gzip</a>, back then I was very sceptical, thinking that any algorithm dedicated to doing classification MUST easily out-perform this. What I didn&#8217;t think about is that if you use the <em>optimal </em>compression for your data, then it finds all patterns in the data, and this is exactly what learning is about. Of course, gzip is pretty far from optimal, but it still works pretty well. I am not the only one who wasn&#8217;t convinced, <a href="http://arxiv.org/abs/cond-mat/0108530">this letter</a> appeared in a physics journal in 2001, and led to some heated discussion: <a href="http://arxiv.org/abs/cond-mat/0202383">angry comment</a>,<a href="http://arxiv.org/abs/cond-mat/0203275"> angry reply</a>, etc.</p>
<p>A bit later, I came across this again. Eamonn Keogh wrote <a href="http://portal.acm.org/citation.cfm?id=1014052.1014077">Towards  parameter-free data mining</a> in 2004,  this paper makes a stronger case for this method being simple, easy and great and applicable to EVERYTHING:</p>
<blockquote><p>[The Algorithm] can be implemented using any off-the-shelf compression algorithm with  the addition of just a dozen or so lines of code. We will show that this  approach is competitive or superior to the state-of-the-art approaches  in anomaly/interestingness detection, classification, and clustering  with empirical tests on time series/DNA/text/video datasets.</p></blockquote>
<p>A bit later again I came across <a href="http://cilibrar.com/">Rudi Cilibrasi</a>&#8217;s (his page is broken atm.) thesis on <a href="http://www.illc.uva.nl/Publications/Dissertations/DS-2007-01.text.pdf">Statistical Inference Through Data Compression</a>. He has more examples, more theory and most importantly open-source software for everything: <a href="http://www.complearn.org/">CompLearn</a> (also down atm., but there are packages in debian). The method is very nice in that it makes no assumptions about the format of the input, i.e. no feature vectors or similar. Here is a clustering tree generated from a bunch of different file types:</p>
<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2010/03/ncd.png"><img class="aligncenter size-full wp-image-444" style="border: 0pt none;" title="Example NCD Clustering Tree" src="http://gromgull.net/blog/wp-content/uploads/2010/03/ncd.png" alt="" width="600" /></a></p>
<h2 class="inpage">Markov Logic Networks</h2>
<p>I first came across Markov Logic Networks in the paper: <a href="http://alchemy.cs.washington.edu/papers/wu08/">Automatically  Refining the Wikipedia Infobox Ontology</a>. Here they have two intertwined problems they use machine learning to solve: firstly they want to map Wikipedia category pages to WordNet Synsets, and secondly they want to arrange the Wikipedia categories in a hierarchy, i.e. by learning <em>is-a </em>relationships between categories. They solve the problem in two ways. The <em>traditional </em>way is to train an SVM to do the WordNet mappings, and to use these mappings as additional features for training a second SVM to do the <em>is-a </em>learning. This is all good, and works reasonably well, but by using Markov Logic Networks they can use <em>joint inference </em>to tackle both tasks at once. This is good since the two problems are clearly not independent, and now evidence that two categories are related can feed back and improve the probability that they map to WordNet synsets that are also related. Also, it allows different <em>is-a</em> relations to influence each other, i.e. if <em>Gouda is-a Cheese is-a Food</em>, then <em>Gouda</em> is probably also a <em>Food</em>.</p>
<p>The software used in the paper is made by the people at the <a href="http://www.cs.washington.edu/">University of Washington</a>, and is available as open-source: <a href="http://alchemy.cs.washington.edu/">Alchemy &#8211; algorithms for statistical relational learning and  probabilistic logic inference</a>. The system combines logical and statistical AI, building on network structures much like Bayesian belief networks. In the end it&#8217;s a bit like Prolog programming, but with probabilities for all facts and relations, and these probabilities can be learned from data. Take this cheerful example about people dying of cancer: given this dependency network and some data about friends who influence each other to smoke, and about who dies, you can estimate the probability that you smoke if your friend does, and the probability that you will get cancer:</p>
<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2010/03/mln.png"><img class="aligncenter size-full wp-image-450" style="border: 0pt none;" title="Smoking and Cancer Markov Network" src="http://gromgull.net/blog/wp-content/uploads/2010/03/mln.png" alt="" width="502" height="219" /></a></p>
<p>Since I am writing about it here, it is clear that this is applicable to EVERYTHING:</p>
<blockquote><p>Markov logic serves as a general framework which can not only be used  for the emerging field of statistical relational learning, but also can  handle many classical machine learning tasks which many users are familiar with.  [...] Markov logic presents a language to handle machine learning  problems intuitively and comprehensibly. [...] We have applied it to link prediction, collective classiﬁcation, entity resolution, social network analysis and other problems.</p></blockquote>
<h2 class="inpage">The others</h2>
<p>Out of the three I think I find Markov Logic Networks to be the most interesting, perhaps because they nicely bridge the symbolic and sub-symbolic AI worlds. This was my personal problem since I cannot readily dismiss symbolic AI as a Semantic Web person, but the more I read about kernels, conditional random fields, online-learning using gradient descent etc. the more I realise that rule-learning and inductive logic programming probably aren&#8217;t going to catch up any time soon. NCD is a nice hack, but I tested it on clustering RDF resources, comparing the distance measure from my thesis with gzip&#8217;ping the RDF/XML or Turtle, and it did much worse. HTM still strikes me as a bit over-hyped, but I will of course be sorry when they bring the intelligent robots to market in 2011.</p>
<p>Some other learning <em>frameworks</em> that nearly made it into this post:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Recurrent_neural_networks">Recurrent Neural Networks</a> as championed by <a href="http://www.idsia.ch/~juergen/">Jürgen Schmidhuber</a>, i.e. neural networks with loops. They can do auto-associative recall, and of course EVERYTHING, but they are a bastard to train.</li>
<li><a href="http://www.hutter1.net/ai/aixigentle.htm">AIXI</a>, the universal learning agent, which can (theoretically) learn EVERYTHING. And I mean really EVERYTHING, and really THEORETICALLY. As championed by <a href="http://www.hutter1.net/">Marcus Hutter</a>, and <a href="http://www.vetta.org/">Shane Legg</a> in his thesis: <a href="http://www.vetta.org/2008/07/machine-super-intelligence/">Machine Super Intelligence</a> which I&#8217;ve not read, but at least the title is funky.</li>
</ul>
<p>(wow &#8211; this got much longer than I intended)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/03/the-machine-learning-algorithm-with-capital-a/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>HTTP File Uploads in PHP</title>
		<link>http://gromgull.net/blog/2010/02/http-file-uploads-in-php/</link>
		<comments>http://gromgull.net/blog/2010/02/http-file-uploads-in-php/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 12:54:30 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=433</guid>
		<description><![CDATA[And by this I mean uploading files from a PHP script to another HTTP URL, essentially submitting a web-form with a file-field from PHP. I needed this in Organik, it took me some hours to find out how. My hacky result is here for the world to reuse:
http://github.com/gromgull/randombits/blob/master/http_file_upload.php
Enjoy.
]]></description>
			<content:encoded><![CDATA[<p>And by this I mean uploading files <em>from </em>a PHP script to another HTTP URL, essentially submitting a web-form with a file-field from PHP. I needed this in Organik, it took me some hours to find out how. My hacky result is here for the world to reuse:</p>
<p><a href="http://github.com/gromgull/randombits/blob/master/http_file_upload.php">http://github.com/gromgull/randombits/blob/master/http_file_upload.php</a></p>
<p>Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/02/http-file-uploads-in-php/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Noun-phrase Chunking for the Awful German Language</title>
		<link>http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/</link>
		<comments>http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 11:47:15 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[NLP]]></category>
		<category><![CDATA[OrganikProject]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=417</guid>
		<description><![CDATA[In the Organik project we&#8217;ve been using the noun-phrase extraction modules of OpenNLP toolkit to extract key concepts from text for doing Taxonomy Learning. OpenNLP comes with trained model files for English sentence detection, POS-tagging and either noun-phrase chunking or full parsing, and this works great.
Of course in Organik we have some German partners who [...]]]></description>
			<content:encoded><![CDATA[<p>In the<a href="http://organik-project.eu/"> Organik project</a> we&#8217;ve been using the noun-phrase extraction modules of<a href="http://opennlp.sourceforge.net/"> OpenNLP toolkit </a>to extract key concepts from text for doing Taxonomy Learning. OpenNLP comes with trained model files for English sentence detection, POS-tagging and either noun-phrase chunking or full parsing, and this works great.</p>
<p>Of course in Organik we have some German partners who insist on using their <a href="http://www.crossmyt.com/hc/linghebr/awfgrmlg.html">awful german language</a> <a href="#fn1">[1]</a> for everything &#8211; confusing us with their weird grammar. Finding a solution to this has been on my TODO list for about a year now. I had access to the <a href="http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/">Tiger Corpus</a> of 50,000 German sentences marked up with POS-tags and syntactic structure. I have tried to use this for training a model for NP chunking, either using the OpenNLP MaxEnt model or with conditional random fields as implemented in <a href="http://flexcrfs.sourceforge.net/">FlexCRF</a>. However, the models never performed better than around 60% precision and recall, and testing showed that this was really not enough. Returning to this problem now once again I have looked more closely at the input data, it turns out the syntactic structures used in the Tiger Corpus are quite detailed, containing far higher granularity of tag-types than what I need. For instance the structure for <em>&#8220;Zwar hat auch der HDE seinen Mitgliedern Tips gegeben, wie mit vermeintlichen Langfingern umzugehen sei</em><em>.&#8221;</em> (click for readable picture):</p>
<p style="text-align: center;"><a href="http://farm3.static.flickr.com/2749/4270534809_cf3972c2cb_o.png"><img class="aligncenter" style="border: 0pt none;" src="http://farm3.static.flickr.com/2749/4270534809_32fee0840e.jpg" alt="" width="500" height="128" /></a></p>
<p>Here the entire <em>&#8220;Tips [...] wie mit vermeintlichen Langfingern umzugehen sei&#8221;</em> is a noun-phrase. This (might) be linguistically correct, but it&#8217;s not very useful to me when I essentially want to do keyword extraction. Much more useful are the terms marked NK (Noun-Kernels), i.e. here <em>&#8220;vermeintlichen Langfingern&#8221;</em>. Another problem is that the tree is not <em>continuous </em>with regard to the original sentence, i.e. the word <em>gegeben</em> fits into the middle of the NP, but is not a part of it.</p>
<p>SO &#8211; I have preprocessed the entire corpus again, flattening the tree and taking the lowermost NK chain, or NP chunk, as the example. This gives me much shorter NPs in general, for which it is easier to learn a model AND the result is more useful in Organik. Running FlexCRF again on this data, splitting off a part of the data for testing, gives me a model with 94.03% F1-measure on the test data. This is quite comparable to what was achieved for English with FlexCRF in <a href="http://crfchunker.sourceforge.net/">CRFChunker</a>; for the WSJ corpus they report an F-measure of 95%.</p>
<p>I cannot redistribute the corpus or training data, but here is the model as trained by FlexCRF for chunking:<a href="http://gromgull.net/2010/01/npchunking/GermanNPChunkModel.tar.gz"> GermanNPChunkModel.tar.gz</a> (17.7mb)</p>
<p>and for the POSTagging: <a href="http://gromgull.net/2010/01/npchunking/GermanPOSTagModel.tar.gz">GermanPOSTagModel.tar.gz</a> (9.5mb)</p>
<p>Both are trained on 44,170 sentences, with about 900,000 words. The POSTagger was trained for 50 iterations, the Chunker for 100, both with 10% of the data used for testing.</p>
<p>In addition, here is a model file trained with OpenNLP&#8217;s MaxEnt: <a href="http://gromgull.net/2010/01/npchunking/OpenNLP_GermanChunk.bin.gz">OpenNLP_GermanChunk.bin.gz</a> (5.2mb)</p>
<p>This was trained with the POS tags as generated by the <a href="http://opennlp.sourceforge.net/models/german/">German POStagger that ships with OpenNLP</a>, and can be used with the OpenNLP tools like this:</p>
<p><code><br />
java -cp $CP opennlp.tools.lang.german.SentenceDetector \<br />
models/german/sentdetect/sentenceModel.bin.gz  |<br />
java -cp $CP opennlp.tools.lang.german.Tokenizer \<br />
models/german/tokenizer/tokenModel.bin.gz |<br />
java -cp $CP -Xmx100m opennlp.tools.lang.german.PosTagger \<br />
models/german/postag/posModel.bin.gz |<br />
java -cp $CP opennlp.tools.lang.english.TreebankChunker \<br />
models/german/chunking/GermanChunk.bin.gz<br />
</code></p>
<p>That&#8217;s it. Let me know if you use it and it works for you!</p>
<hr /><p><a name="fn1">[1]</a> Completely unrelated, but to exercise your German parsing skills, check out some old newspaper articles. Die Zeit has their online archive available back to 1946, where you find sentence-gems like this: <em>Zunächst waren wir geneigt, das geschaute Bild irgendwie umzurechnen auf materielle Werte, wir versuchten Arbeitskräfte und Zeit zu überschlagen, die nötig waren, um diese Wüste, die uns umgab, wieder neu gestalten zu können, herauszuführen aus diesem unfaßlichen Zustand der Zerstörung, überzuführen in eine Welt, die wir verstanden, in eine Welt, die uns bis dahin umgeben hatte. </em>(ONE sentence!, from <a href="http://www.zeit.de/1946/01/Rueckkehr-nach-Deutschland">http://www.zeit.de/1946/01/Rueckkehr-nach-Deutschland</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Semantic Web Clusterball</title>
		<link>http://gromgull.net/blog/2010/01/semantic-web-clusterball/</link>
		<comments>http://gromgull.net/blog/2010/01/semantic-web-clusterball/#comments</comments>
		<pubDate>Wed, 06 Jan 2010 11:25:29 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[SVG]]></category>
		<category><![CDATA[Visualisation]]></category>
		<category><![CDATA[in progress]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=402</guid>
		<description><![CDATA[From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

I started this as a part of the Billion Triple Challenge work, it shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG, I could not get it to embed here, so click on that image and [...]]]></description>
			<content:encoded><![CDATA[<p>From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:</p>
<p style="text-align: center;"><a href="http://gromgull.net/2010/01/swball/swball.svg"><img class="aligncenter" style="border: 0pt none;" title="Semantic Web Clusterball" src="http://farm5.static.flickr.com/4064/4250011607_245b975a26.jpg" alt="Semantic Web Clusterball" width="500" height="469" /></a></p>
<p>I started this as a part of the <a href="http://gromgull.net/blog/category/semantic-web/billion-triple-challenge/">Billion Triple Challenge work</a>, it shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG, I could not get it to embed here, so click on that image and mouse over things and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate, and mousing over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X, it looked roughly ok in Chrome though.)</p>
<p>The data is extracted from the BTC triples by computing the <em>Pay-Level-Domain</em> (PLD, essentially the domain you pay to register, e.g. dbpedia.org, with special rules for .co.uk domains and similar) for the subjects and objects, and if they differ, counting the predicates that link them. I.e. a triple:</p>
<p><code>dbpedia:Albert_Einstein rdf:type foaf:Person. </code></p>
<p>would count as a link between <em>http://dbpedia.org </em>and <em>http://xmlns.com</em> for the<em> rdf:type</em> predicate. Counting all links like this gives us the top cross-domain linking predicates:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>predicate</th>
<th>links</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</a></td>
<td style="text-align: right;">60,813,659</td>
</tr>
<tr>
<td><a href="http://www.w3.org/2000/01/rdf-schema#seeAlso">http://www.w3.org/2000/01/rdf-schema#seeAlso</a></td>
<td style="text-align: right;">16,698,110</td>
</tr>
<tr>
<td><a href="http://www.w3.org/2002/07/owl#sameAs">http://www.w3.org/2002/07/owl#sameAs</a></td>
<td style="text-align: right;">4,872,501</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/weblog">http://xmlns.com/foaf/0.1/weblog</a></td>
<td style="text-align: right;">4,627,271</td>
</tr>
<tr>
<td><a href="http://www.aktors.org/ontology/portal#has-date">http://www.aktors.org/ontology/portal#has-date</a></td>
<td style="text-align: right;">3,873,224</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/page">http://xmlns.com/foaf/0.1/page</a></td>
<td style="text-align: right;">3,273,613</td>
</tr>
<tr>
<td><a href="http://dbpedia.org/property/hasPhotoCollection">http://dbpedia.org/property/hasPhotoCollection</a></td>
<td style="text-align: right;">2,556,532</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/img">http://xmlns.com/foaf/0.1/img</a></td>
<td style="text-align: right;">2,012,761</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/depiction">http://xmlns.com/foaf/0.1/depiction</a></td>
<td style="text-align: right;">1,556,066</td>
</tr>
<tr>
<td><a href="http://www.geonames.org/ontology#wikipediaArticle">http://www.geonames.org/ontology#wikipediaArticle</a></td>
<td style="text-align: right;">735,145</td>
</tr>
</tbody>
</table>
<p>Most frequent is of course <em>rdf:type</em>, since most schemas are from different domains to the data, and most things have a type. The ball linked above is excluding type, since it&#8217;s not really a <em>link</em>. You can also see <a href="http://gromgull.net/2010/01/swball/swball_type.svg">a version including <em>rdf:type</em>.</a> The rest of the properties are more <em>link-like</em>, I am not sure what is going on with the <em>akt:has-date </em>though, anyone?</p>
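<p>The link counting described above can be sketched roughly like this. This is not the code I actually ran: the function names are made up for illustration, and the PLD rule uses a tiny hand-made list of two-label suffixes where a real run would need the full public suffix list.</p>
<pre class="brush:python">
from collections import Counter
from urllib.parse import urlparse

# Tiny hand-made list of two-label suffixes (an assumption for this sketch;
# a real run would need the full public suffix list).
TWO_LABEL_SUFFIXES = {"co.uk", "ac.uk", "org.uk", "com.au"}

def pay_level_domain(uri):
    # Return the pay-level domain of a URI, or None if it has no host
    # (e.g. for literals or blank nodes).
    host = urlparse(uri).hostname
    if not host:
        return None
    labels = host.split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])

def count_links(triples):
    # Count predicates on triples whose subject and object PLDs differ.
    per_predicate = Counter()
    for s, p, o in triples:
        s_pld, o_pld = pay_level_domain(s), pay_level_domain(o)
        if s_pld and o_pld and s_pld != o_pld:
            per_predicate[p] += 1
    return per_predicate

triples = [
    ("http://dbpedia.org/resource/Albert_Einstein",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person"),
    ("http://dbpedia.org/resource/Albert_Einstein",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Einstein"),
]
links = count_links(triples)
print(links)
</pre>
<p>Only the first triple crosses from dbpedia.org to xmlns.com, so only <em>rdf:type</em> gets counted; the second stays inside dbpedia.org and is ignored.</p>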
<p>The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: <a href="http://www.chrisharrison.net/projects/clusterball/index.html">Wikipedia Clusterball</a>. His is nicer since he has core nodes <em>inside </em>the ball. He points out that the &#8220;clustering&#8221; of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My &#8220;clustering&#8221; method was very simple, I swap each node with the one giving me the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It&#8217;s essentially a greedy hill-climbing method, and I am sure it&#8217;s far from optimal, but it does at least something. For comparison, <a href="http://gromgull.net/2010/01/swball/swball_nocluster.svg">here is the ball on top without clustering applied</a>.</p>
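<p>For the curious, the greedy swapping with random restarts looks roughly like the sketch below. It is a simplified reconstruction, not the original code: all names are hypothetical, edges are unweighted, and &#8220;edge distance&#8221; is assumed to be the number of steps between two nodes around the circle.</p>
<pre class="brush:python">
import random

def cost(order, edges):
    # Sum of circular distances between the positions of linked nodes.
    pos = {node: i for i, node in enumerate(order)}
    n = len(order)
    total = 0
    for a, b in edges:
        d = abs(pos[a] - pos[b])
        total += min(d, n - d)
    return total

def best_swap(order, i, edges):
    # Evaluate swapping position i with every position j. j == i is the
    # "do nothing" option, and its False flag makes it win any tie.
    options = []
    for j in range(len(order)):
        trial = list(order)
        trial[i], trial[j] = trial[j], trial[i]
        options.append((cost(trial, edges), j != i, j))
    return min(options)

def hill_climb(order, edges):
    # Keep applying the best strictly improving swap for each node
    # until a full pass changes nothing.
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            new_cost, is_swap, j = best_swap(order, i, edges)
            if is_swap:
                order[i], order[j] = order[j], order[i]
                improved = True
    return order

def cluster(nodes, edges, restarts=5, seed=0):
    # Run the hill-climber from several random orderings, keep the best.
    rng = random.Random(seed)
    solutions = []
    for _ in range(restarts):
        start = list(nodes)
        rng.shuffle(start)
        result = hill_climb(start, edges)
        solutions.append((cost(result, edges), result))
    return min(solutions)

# Made-up example: six nodes linked in a chain a-d-b-e-c-f.
nodes = list("abcdef")
edges = [("a", "d"), ("d", "b"), ("b", "e"), ("e", "c"), ("c", "f")]
best_cost, best_order = cluster(nodes, edges)
print(best_cost, best_order)
</pre>
<p>Every applied swap strictly lowers the total distance, so the loop always terminates, but only in a local optimum; that is why the random restarts are there.</p>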
<p>The whole thing was of course hacked up in python, the javascript for the mouse-over etc. of the SVG uses <a href="http://www.prototypejs.org/">prototype</a>. I wanted to share the code, but it&#8217;s a horrible mess, and I&#8217;d rather not spend the time to clean it up. If you want it, <span style="text-decoration: line-through;">drop me a line.</span>, see below. The data used to generate this is available either as a download: <a href="http://gromgull.net/2010/01/swball/data.txt.gz">data.txt.gz</a> (19Mb, 10,000 host-pairs and top 500 predicates), or <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/semantic-web-links/versions/1">a subset on Many Eyes</a> (2,500 host-pairs and top 100 predicates, uploading 19Mb of data to Many Eyes crashed my Firefox :)</p>
<p><strong>UPDATE</strong>: <a href="http://twitter.com/Rchards">Richard Stirling </a>asked for the code, so I spent 30 min cleaning it up a bit, grab it here: <a href="http://gromgull.net/2010/01/swball/swball_code.tar.gz">swball_code.tar.gz</a> It includes the data+code needed to recreate the example above.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/01/semantic-web-clusterball/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>An Objective look at the Billion Triple Data</title>
		<link>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/</link>
		<comments>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/#comments</comments>
		<pubDate>Fri, 11 Dec 2009 15:44:18 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=394</guid>
		<description><![CDATA[For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it&#8217;ll be mostly tables. Enjoy :)
The BTC data contains 279,710,101 unique objects in total. Out [...]]]></description>
			<content:encoded><![CDATA[<p>For completeness, <a href="http://www.cs.univie.ac.at/employee.php?tab=teaching&amp;eid=223">Besbes</a> is telling me to write up the final stats from the BTC data, for the <em>object-part</em> of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it&#8217;ll be mostly tables. Enjoy :)</p>
<p>The BTC data contains 279,710,101 unique objects in total. Out of these:</p>
<ul>
<li>90,007,431 appear more than once</li>
<li>7,995,747 more than 10 times</li>
<li>748,214 more than 100</li>
<li>43,479 more than 1,000</li>
<li>3,209 more than 10,000</li>
</ul>
<p>The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are <em>file://</em> URIs. The top 10 objects are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>object</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>2,584,960</td>
<td><a href="http://www.geonames.org/ontology#P">http://www.geonames.org/ontology#P</a></td>
</tr>
<tr>
<td>2,645,095</td>
<td><a href="http://www.aktors.org/ontology/portal#Article-Reference">http://www.aktors.org/ontology/portal#Article-Reference</a></td>
</tr>
<tr>
<td>2,681,771</td>
<td><a href="http://www.w3.org/2002/07/owl#Class">http://www.w3.org/2002/07/owl#Class</a></td>
</tr>
<tr>
<td>5,616,326</td>
<td><a href="http://www.aktors.org/ontology/portal#Person">http://www.aktors.org/ontology/portal#Person</a></td>
</tr>
<tr>
<td>7,544,903</td>
<td><a href="http://www.geonames.org/ontology#Feature">http://www.geonames.org/ontology#Feature</a></td>
</tr>
<tr>
<td>9,115,801</td>
<td><a href="http://en.wikipedia.org/">http://en.wikipedia.org/</a></td>
</tr>
<tr>
<td>12,124,378</td>
<td><a href="http://xmlns.com/foaf/0.1/OnlineAccount">http://xmlns.com/foaf/0.1/OnlineAccount</a></td>
</tr>
<tr>
<td>13,687,049</td>
<td><a href="http://purl.org/rss/1.0/item">http://purl.org/rss/1.0/item</a></td>
</tr>
<tr>
<td>14,172,852</td>
<td><a href="http://rdfs.org/sioc/types#WikiArticle">http://rdfs.org/sioc/types#WikiArticle</a></td>
</tr>
<tr>
<td>38,795,942</td>
<td><a href="http://xmlns.com/foaf/0.1/Person">http://xmlns.com/foaf/0.1/Person</a></td>
</tr>
</tbody>
</table>
<p>Apart from the wikipedia link, all are types. No literals appear in the top 10 table. For the 116M unique literals we have 12,845,021 literals with a language tag and 2,067,768 with a datatype tag. The top 10 literals are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>literal</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>722,221</td>
<td>&#8220;0&#8221;^^xsd:integer</td>
</tr>
<tr>
<td>969,929</td>
<td>&#8220;1&#8221;</td>
</tr>
<tr>
<td>1,024,654</td>
<td>&#8220;Nay&#8221;</td>
</tr>
<tr>
<td>1,036,054</td>
<td>&#8220;Copyright © 2009 craigslist, inc.&#8221;</td>
</tr>
<tr>
<td>1,056,799</td>
<td>&#8220;text&#8221;</td>
</tr>
<tr>
<td>1,061,692</td>
<td>&#8220;text/html&#8221;</td>
</tr>
<tr>
<td>1,159,311</td>
<td>&#8220;0&#8221;</td>
</tr>
<tr>
<td>1,204,996</td>
<td>&#8220;en-us&#8221;</td>
</tr>
<tr>
<td>2,049,638</td>
<td>&#8220;Aye&#8221;</td>
</tr>
<tr>
<td>2,310,681</td>
<td>&#8220;application/rdf+xml&#8221;</td>
</tr>
</tbody>
</table>
<p>I can&#8217;t be bothered to check it now, but I guess the  many Aye&#8217;s &amp; Nay&#8217;s come from IRC chatlogs (#SWIG?).</p>
<p>Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 unicode characters long (I wonder about this: it is very close to 2<sup>16</sup> bytes once you allow for some unicode characters taking more than one byte, so could it be truncated?). The distribution of literal lengths looks like this:</p>
<p style="text-align: left;"><a href="http://www.flickr.com/photos/gromgull/4176130623/sizes/o/"><img class="aligncenter" style="border: 0pt none;" title="Literal lengths" src="http://farm3.static.flickr.com/2746/4176130623_754a99c096.jpg" alt="" width="500" height="400" /></a>Most literals are around 10 characters in length; there is a peak at 19, which I seem to remember was caused by the standard time format (i.e. 2005-10-30T10:45UTC) being exactly 19 characters.</p>
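<p style="text-align: left;">Both length observations are easy to check in Python. This is just a small sketch; the non-ASCII sample word is made up for illustration and not taken from the BTC data.</p>
<pre class="brush:python">
# Character count versus UTF-8 byte count for a made-up literal.
literal = "Zun\u00e4chst"   # 8 characters, one of them non-ASCII
n_chars = len(literal)
n_bytes = len(literal.encode("utf-8"))
print(n_chars, n_bytes)

# The time format mentioned above.
stamp = "2005-10-30T10:45UTC"
print(len(stamp))
</pre>
<p style="text-align: left;">So a literal cut off at a fixed number of bytes can easily end up with slightly fewer characters than that, which would fit the suspicious 65,244.</p>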
<p style="text-align: left;">That&#8217;s it! I believe I now have published all my numbers on BTC :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DBTropes</title>
		<link>http://gromgull.net/blog/2009/12/dbtropes/</link>
		<comments>http://gromgull.net/blog/2009/12/dbtropes/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 13:31:48 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Everything Else]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=387</guid>
		<description><![CDATA[
Know TvTropes.org? As pointed out by XKCD, a great place to lose hours of time reading about SoBadIt&#8217;sHorrible, HighOctaneNightmareFuel and thousands of other tropes, all with examples from comics, films, tv-series etc.
DFKI colleague Malte Kiesel has done the right thing and just released his linked open data wrapper for tvtropes, naturally named dbTropes.org. Now go [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2009/12/logo.png"><img class="aligncenter size-full wp-image-388" style="border: 0pt none;" title="logo" src="http://gromgull.net/blog/wp-content/uploads/2009/12/logo.png" alt="logo" width="200" height="50" /></a></p>
<p style="text-align: left;">Know <a href="http://tvtropes.org">TvTropes.org</a>? As <a href="http://xkcd.com/609/">pointed out by XKCD</a>, a great place to lose hours of time reading about <a href="http://tvtropes.org/pmwiki/pmwiki.php/DarthWiki/ptitlew9bltta3dv6n?from=Main.SoBadItsHorrible">SoBadIt&#8217;sHorrible</a>, <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/HighOctaneNightmareFuel">HighOctaneNightmareFuel</a> and thousands of other <em>tropes, </em>all with examples from comics, films, tv-series etc.</p>
<p>DFKI colleague <a href="http://www.dfki.uni-kl.de/~kiesel/">Malte Kiesel</a> has done the right thing and just released his linked open data wrapper for tvtropes, naturally named <a href="http://dbtropes.org">dbTropes.org</a>. Now go read about <a href="http://dbtropes.org/resource/Main/DiabolusExMachina">DiabolusExMachina</a>, it will of course do content-negotiation so try it with your favourite RDF browser.</p>
<p>I helped too — I made the stylesheet and the &#8220;logo&#8221; :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/12/dbtropes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I&#8217;ll trie in python</title>
		<link>http://gromgull.net/blog/2009/11/ill-trie-in-python/</link>
		<comments>http://gromgull.net/blog/2009/11/ill-trie-in-python/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 10:25:45 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Koble]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=385</guid>
		<description><![CDATA[In Koble, the auto-completion of thing-names used for wiki-editing, instant-search and adding relations is getting slower and slower, mainly because I do:

result=[]
things=listAllThings()
for t in things:
   if t.startswith(key): result.append(t)
for t in things:
   if key in t and not t.startswith(key): result.append(t)

Going through the list twice makes sure I get all things that match well first [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://koble.net/">Koble</a>, the auto-completion of thing-names used for wiki-editing, instant-search and adding relations is getting slower and slower, mainly because I do:</p>
<pre class="brush:python">
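result = []
things = listAllThings()
# First pass: names that start with the key are the best matches.
for t in things:
    if t.startswith(key):
        result.append(t)
# Second pass: names that only contain the key, skipping those already added.
for t in things:
    if key in t and not t.startswith(key):
        result.append(t)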
</pre>
<p>Going through the list twice makes sure I get all things that match well first (i.e. they start with the string I am completing), and then things that match less well later (they only contain the string).</p>
<p>Of course, the world has already come up with a far better data structure for indexing the prefixes of strings, namely the <a href="http://www.itl.nist.gov/div897/sqg/dads/HTML/trie.html">trie, or prefix tree</a>. <a href="http://jtauber.com/">James Tauber</a> had already implemented one in Python, and <a href="http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/">kindly made it available.</a> His version didn&#8217;t do everything I needed, so I added a few methods. Here is my updated version:</p>
<p><a href="http://gromgull.net/2009/11/trie.py">http://gromgull.net/2009/11/trie.py</a></p>
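<p>For anyone who has not met a trie before, here is a minimal sketch of the idea. It is not the code in trie.py and not James Tauber&#8217;s implementation: the class and method names (Trie, add, with_prefix) are made up for illustration. Each node keeps a dictionary from one character to a child node, so a prefix lookup only walks as many nodes as the prefix has characters before it starts collecting matches.</p>
<pre class="brush:python">
# Illustrative sketch only; names are hypothetical and not taken from trie.py.
class Trie:
    def __init__(self):
        self.children = {}   # maps one character to the child Trie node
        self.is_end = False  # True if a stored string ends at this node

    def add(self, word):
        """Insert word, creating nodes along its path as needed."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def with_prefix(self, prefix):
        """Return all stored strings that start with prefix, sorted."""
        # Walk down the path spelled by the prefix.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # nothing stored under this prefix
        # Collect every stored string in the subtree below that node.
        found = []
        stack = [(node, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_end:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie()
for name in ["koble", "kobra", "python", "pylons"]:
    trie.add(name)
print(trie.with_prefix("ko"))  # prints ['koble', 'kobra']
</pre>
<p>Note that a plain trie like this only speeds up the &#8220;starts with&#8221; pass. Finding names that merely contain the string would still need something else, such as a scan over the list or an index over suffixes.</p>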
<p>Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/11/ill-trie-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
