<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>(still) nothing clever</title>
	<atom:link href="http://gromgull.net/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://gromgull.net/blog</link>
	<description></description>
	<lastBuildDate>Tue, 02 Feb 2010 12:54:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/>		<item>
		<title>HTTP File Uploads in PHP</title>
		<link>http://gromgull.net/blog/2010/02/http-file-uploads-in-php/</link>
		<comments>http://gromgull.net/blog/2010/02/http-file-uploads-in-php/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 12:54:30 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=433</guid>
		<description><![CDATA[And by this I mean uploading files from a PHP script to another HTTP URL, essentially submitting a web-form with a file-field from PHP. I needed this in Organik, and it took me some hours to find out how. My hacky result is here for the world to reuse:
http://github.com/gromgull/randombits/blob/master/http_file_upload.php
Enjoy.
]]></description>
			<content:encoded><![CDATA[<p>And by this I mean uploading files <em>from</em> a PHP script to another HTTP URL, essentially submitting a web-form with a file-field from PHP. I needed this in Organik, and it took me some hours to find out how. My hacky result is here for the world to reuse:</p>
<p><a href="http://github.com/gromgull/randombits/blob/master/http_file_upload.php">http://github.com/gromgull/randombits/blob/master/http_file_upload.php</a></p>
<p>Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/02/http-file-uploads-in-php/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Noun-phrase Chunking for the Awful German Language</title>
		<link>http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/</link>
		<comments>http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 11:47:15 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[NLP]]></category>
		<category><![CDATA[OrganikProject]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=417</guid>
		<description><![CDATA[In the Organik project we&#8217;ve been using the noun-phrase extraction modules of OpenNLP toolkit to extract key concepts from text for doing Taxonomy Learning. OpenNLP comes with trained model files for English sentence detection, POS-tagging and either noun-phrase chunking or full parsing, and this works great.
Of course in Organik we have some German partners who [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://organik-project.eu/">Organik project</a> we&#8217;ve been using the noun-phrase extraction modules of the <a href="http://opennlp.sourceforge.net/">OpenNLP toolkit</a> to extract key concepts from text for doing Taxonomy Learning. OpenNLP comes with trained model files for English sentence detection, POS-tagging and either noun-phrase chunking or full parsing, and this works great.</p>
<p>Of course in Organik we have some German partners who insist on using their <a href="http://www.crossmyt.com/hc/linghebr/awfgrmlg.html">awful German language</a> <a href="#fn1">[1]</a> for everything &#8211; confusing us with their weird grammar. Finding a solution to this has been on my TODO list for about a year now. I had access to the <a href="http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/">Tiger Corpus</a> of 50,000 German sentences marked up with POS-tags and syntactic structure. I have tried to use this for training a model for NP chunking, either using the OpenNLP MaxEnt model or with conditional random fields as implemented in <a href="http://flexcrfs.sourceforge.net/">FlexCRF</a>. However, the models never performed better than around 60% precision and recall, and testing showed that this was really not enough. Returning to this problem once again, I have looked more closely at the input data. It turns out the syntactic structures used in the Tiger Corpus are quite detailed, with a far finer granularity of tag-types than I need. For instance, the structure for <em>&#8220;Zwar hat auch der HDE seinen Mitgliedern Tips gegeben, wie mit vermeintlichen Langfingern umzugehen sei.&#8221;</em> (click for readable picture):</p>
<p style="text-align: center;"><a href="http://farm3.static.flickr.com/2749/4270534809_cf3972c2cb_o.png"><img class="aligncenter" style="border: 0pt none;" src="http://farm3.static.flickr.com/2749/4270534809_32fee0840e.jpg" alt="" width="500" height="128" /></a></p>
<p>Here the entire <em>&#8220;Tips [...] wie mit vermeintlichen Langfingern umzugehen sei&#8221;</em> is a noun-phrase. This might be linguistically correct, but it&#8217;s not very useful to me when I essentially want to do keyword extraction. Much more useful are the terms marked NK (Noun-Kernels), i.e. here <em>&#8220;vermeintlichen Langfingern&#8221;</em>. Another problem is that the tree is not <em>continuous</em> with regard to the original sentence, i.e. the word <em>gegeben</em> sits in the middle of the NP, but is not a part of it.</p>
<p>SO &#8211; I have preprocessed the entire corpus again, flattening the tree and taking the lowermost NK chain, or NP chunk, as the example. This gives me much shorter NPs in general, for which it is easier to learn a model, AND the result is more useful in Organik. Running FlexCRF again on this data, with part of the data split off for testing, gives me a model with 94.03% F1-measure on the test data. This is quite comparable to what was achieved for English with FlexCRF in <a href="http://crfchunker.sourceforge.net/">CRFChunker</a>: for the WSJ corpus they report an F-measure of 95%.</p>
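<p>For reference, chunk-level precision, recall and F1 can be computed from IOB-tagged output roughly like this. It is only a minimal Python sketch, not the FlexCRF evaluation code: the tag names (<em>B-NP</em>, <em>I-NP</em>, <em>O</em>) and the function names are assumptions made for illustration.</p>

```python
def np_spans(tags):
    """Return the set of (start, end) index pairs of NP chunks in an IOB tag list."""
    spans = set()
    start = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        if start is not None and tag != "I-NP":
            spans.add((start, i))
            start = None
        if tag == "B-NP":
            start = i
    return spans


def chunk_f1(gold_tags, predicted_tags):
    """Chunk-level precision, recall and F1: a chunk counts only if both ends match."""
    gold = np_spans(gold_tags)
    predicted = np_spans(predicted_tags)
    correct = len(gold.intersection(predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

<p>Scoring whole chunks rather than single tags is the stricter choice: a chunk that is one token too long counts as wrong, which matters if the chunks are meant to be used as keywords.</p>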
<p>I cannot redistribute the corpus or training data, but here is the model as trained by FlexCRF for chunking:<a href="http://gromgull.net/2010/01/npchunking/GermanNPChunkModel.tar.gz"> GermanNPChunkModel.tar.gz</a> (17.7mb)</p>
<p>and for the POSTagging: <a href="http://gromgull.net/2010/01/npchunking/GermanPOSTagModel.tar.gz">GermanPOSTagModel.tar.gz</a> (9.5mb)</p>
<p>Both are trained on 44,170 sentences, with about 900,000 words. The POSTagger was trained for 50 iterations, the Chunker for 100, both with 10% of the data used for testing.</p>
<p>In addition, here is a model file trained with OpenNLP&#8217;s MaxEnt: <a href="http://gromgull.net/2010/01/npchunking/OpenNLP_GermanChunk.bin.gz">OpenNLP_GermanChunk.bin.gz</a> (5.2mb)</p>
<p>This was trained with the POS tags as generated by the <a href="http://opennlp.sourceforge.net/models/german/">German POStagger that ships with OpenNLP</a>, and can be used with the OpenNLP tools like this:</p>
<p><code><br />
java -cp $CP opennlp.tools.lang.german.SentenceDetector \<br />
models/german/sentdetect/sentenceModel.bin.gz  |<br />
java -cp $CP opennlp.tools.lang.german.Tokenizer \<br />
models/german/tokenizer/tokenModel.bin.gz |<br />
java -cp $CP -Xmx100m opennlp.tools.lang.german.PosTagger \<br />
models/german/postag/posModel.bin.gz |<br />
java -cp $CP opennlp.tools.lang.english.TreebankChunker \<br />
models/german/chunking/GermanChunk.bin.gz<br />
</code></p>
<p>That&#8217;s it. Let me know if you use it and it works for you!</p>
<hr /><a name="fn1">[1]</a> Completely unrelated, but to exercise your German parsing skills, check out some old newspaper articles. Die Zeit has their online archive available back to 1946, where you find sentence-gems like this: <em>Zunächst waren wir geneigt, das geschaute Bild irgendwie umzurechnen auf materielle Werte, wir versuchten Arbeitskräfte und Zeit zu überschlagen, die nötig waren, um diese Wüste, die uns umgab, wieder neu gestalten zu können, herauszuführen aus diesem unfaßlichen Zustand der Zerstörung, überzuführen in eine Welt, die wir verstanden, in eine Welt,&#8217; die uns bis dahin umgeben hatte. </em>(ONE sentence!, from <a href="http://www.zeit.de/1946/01/Rueckkehr-nach-Deutschland">http://www.zeit.de/1946/01/Rueckkehr-nach-Deutschland</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/01/noun-phrase-chunking-for-the-awful-german-language/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Semantic Web Clusterball</title>
		<link>http://gromgull.net/blog/2010/01/semantic-web-clusterball/</link>
		<comments>http://gromgull.net/blog/2010/01/semantic-web-clusterball/#comments</comments>
		<pubDate>Wed, 06 Jan 2010 11:25:29 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[SVG]]></category>
		<category><![CDATA[Visualisation]]></category>
		<category><![CDATA[in progress]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=402</guid>
		<description><![CDATA[From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:

I started this as part of the Billion Triple Challenge work. It shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG, I could not get it to embed here, so click on that image and [...]]]></description>
			<content:encoded><![CDATA[<p>From the I-will-never-actually-finish-this department I bring you the Semantic Web Cluster-ball:</p>
<p style="text-align: center;"><a href="http://gromgull.net/2010/01/swball/swball.svg"><img class="aligncenter" style="border: 0pt none;" title="Semantic Web Clusterball" src="http://farm5.static.flickr.com/4064/4250011607_245b975a26.jpg" alt="Semantic Web Clusterball" width="500" height="469" /></a></p>
<p>I started this as part of the <a href="http://gromgull.net/blog/category/semantic-web/billion-triple-challenge/">Billion Triple Challenge work</a>. It shows how different sites on the Semantic Web are linked together. The whole thing is an interactive SVG, I could not get it to embed here, so click on that image and mouse over things and be amazed. Clicking on the different predicates in the SVG will toggle showing that predicate, and mousing over any link will show how many links are currently being shown. (NOTE: Only really tested in Firefox 3.5.X, it looked roughly ok in Chrome though.)</p>
<p>The data is extracted from the BTC triples by computing the <em>Pay-Level-Domain</em> (PLD, essentially the top-level domain, but with special rules for .co.uk domains and similar) for the subjects and objects, and if they differ, count the predicates that link them. I.e. a triple:</p>
<p><code>dbpedia:Albert_Einstein rdf:type foaf:Person. </code></p>
<p>would count as a link between <em>http://dbpedia.org </em>and <em>http://xmlns.com</em> for the<em> rdf:type</em> predicate. Counting all links like this gives us the top cross-domain linking predicates:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>predicate</th>
<th>links</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</a></td>
<td style="text-align: right;">60,813,659</td>
</tr>
<tr>
<td><a href="http://www.w3.org/2000/01/rdf-schema#seeAlso">http://www.w3.org/2000/01/rdf-schema#seeAlso</a></td>
<td style="text-align: right;">16,698,110</td>
</tr>
<tr>
<td><a href="http://www.w3.org/2002/07/owl#sameAs">http://www.w3.org/2002/07/owl#sameAs</a></td>
<td style="text-align: right;">4,872,501</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/weblog">http://xmlns.com/foaf/0.1/weblog</a></td>
<td style="text-align: right;">4,627,271</td>
</tr>
<tr>
<td><a href="http://www.aktors.org/ontology/portal#has-date">http://www.aktors.org/ontology/portal#has-date</a></td>
<td style="text-align: right;">3,873,224</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/page">http://xmlns.com/foaf/0.1/page</a></td>
<td style="text-align: right;">3,273,613</td>
</tr>
<tr>
<td><a href="http://dbpedia.org/property/hasPhotoCollection">http://dbpedia.org/property/hasPhotoCollection</a></td>
<td style="text-align: right;">2,556,532</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/img">http://xmlns.com/foaf/0.1/img</a></td>
<td style="text-align: right;">2,012,761</td>
</tr>
<tr>
<td><a href="http://xmlns.com/foaf/0.1/depiction">http://xmlns.com/foaf/0.1/depiction</a></td>
<td style="text-align: right;">1,556,066</td>
</tr>
<tr>
<td><a href="http://www.geonames.org/ontology#wikipediaArticle">http://www.geonames.org/ontology#wikipediaArticle</a></td>
<td style="text-align: right;">735,145</td>
</tr>
</tbody>
</table>
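<p>The counting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used for the numbers in the table: the suffix set is a tiny hypothetical stand-in for a real public-suffix list, and the function names are made up for the example.</p>

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, tiny stand-in for a real public-suffix list.
TWO_LEVEL_SUFFIXES = {"co.uk", "ac.uk", "org.uk", "com.au"}


def pay_level_domain(uri):
    """Naive PLD: last two host labels, or three for suffixes like .co.uk."""
    host = urlparse(uri).hostname
    if not host:
        return None  # literals, bnodes, file:// URIs and the like
    labels = host.lower().split(".")
    if ".".join(labels[-2:]) in TWO_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def count_cross_domain_links(triples):
    """Count (subject PLD, object PLD, predicate) for triples that cross domains."""
    links = Counter()
    for subject, predicate, obj in triples:
        source = pay_level_domain(subject)
        target = pay_level_domain(obj)
        if source and target and source != target:
            links[(source, target, predicate)] += 1
    return links
```

<p>Fed with the expanded form of the Einstein triple above, this would count one <em>rdf:type</em> link from <em>dbpedia.org</em> to <em>xmlns.com</em>.</p>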
<p>Most frequent is of course <em>rdf:type</em>, since most schemas are from different domains to the data, and most things have a type. The ball linked above is excluding type, since it&#8217;s not really a <em>link</em>. You can also see <a href="http://gromgull.net/2010/01/swball/swball_type.svg">a version including <em>rdf:type</em>.</a> The rest of the properties are more <em>link-like</em>, I am not sure what is going on with the <em>akt:has-date </em>though, anyone?</p>
<p>The visualisation idea is of course not mine, mainly I stole it from Chris Harrison: <a href="http://www.chrisharrison.net/projects/clusterball/index.html">Wikipedia Clusterball</a>. His is nicer since he has core nodes <em>inside </em>the ball. He points out that the &#8220;clustering&#8221; of nodes along the edge is important, as this brings out the structure of whatever is being mapped. My &#8220;clustering&#8221; method was very simple, I swap each node with the one giving me the largest decrease in edge distance, then repeat until the solution no longer improves. I couple this with a handful of random restarts and take the best solution. It&#8217;s essentially a greedy hill-climbing method, and I am sure it&#8217;s far from optimal, but it does at least something. For comparison, <a href="http://gromgull.net/2010/01/swball/swball_nocluster.svg">here is the ball on top without clustering applied</a>.</p>
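<p>Roughly, the swap-based hill-climbing with random restarts looks like the sketch below. It is a reconstruction for illustration, not the code in the tarball: I assume here that &#8220;edge distance&#8221; is the link weight times the number of steps between the two nodes around the ring, and that edges come as a dict from node pairs to weights. All names are hypothetical.</p>

```python
import random


def edge_distance(order, edges):
    """Sum of weighted distances around the circle for a node ordering."""
    position = {node: i for i, node in enumerate(order)}
    n = len(order)
    total = 0
    for (a, b), weight in edges.items():
        gap = abs(position[a] - position[b])
        total += weight * min(gap, n - gap)  # shorter way around the ring
    return total


def hill_climb(order, edges):
    """For each node, apply the swap that lowers the cost most; repeat until stuck."""
    order = list(order)
    best = edge_distance(order, edges)
    improved = True
    while improved:
        improved = False
        for i in range(len(order)):
            best_j = None
            for j in range(len(order)):
                if j == i:
                    continue
                order[i], order[j] = order[j], order[i]
                cost = edge_distance(order, edges)
                order[i], order[j] = order[j], order[i]  # undo trial swap
                if cost < best:
                    best, best_j = cost, j
            if best_j is not None:
                order[i], order[best_j] = order[best_j], order[i]
                improved = True
    return order, best


def cluster(nodes, edges, restarts=5, seed=0):
    """Run the hill-climber from several random orderings and keep the best."""
    rng = random.Random(seed)
    best_order, best_cost = None, None
    for _ in range(restarts):
        start = list(nodes)
        rng.shuffle(start)
        order, cost = hill_climb(start, edges)
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost
```

<p>Every accepted swap strictly lowers the cost, so each run stops in a local optimum. The restarts are there because that optimum depends heavily on the starting order.</p>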
<p>The whole thing was of course hacked up in python, the javascript for the mouse-over etc. of the SVG uses <a href="http://www.prototypejs.org/">prototype</a>. I wanted to share the code, but it&#8217;s a horrible mess, and I&#8217;d rather not spend the time to clean it up. If you want it, <span style="text-decoration: line-through;">drop me a line.</span>, see below. The data used to generate this is available either as a download: <a href="http://gromgull.net/2010/01/swball/data.txt.gz">data.txt.gz</a> (19Mb, 10,000 host-pairs and top 500 predicates), or <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/datasets/semantic-web-links/versions/1">a subset on Many Eyes</a> (2,500 host-pairs and top 100 predicates, uploading 19Mb of data to Many Eyes crashed my Firefox :)</p>
<p><strong>UPDATE</strong>: <a href="http://twitter.com/Rchards">Richard Stirling </a>asked for the code, so I spent 30 min cleaning it up a bit, grab it here: <a href="http://gromgull.net/2010/01/swball/swball_code.tar.gz">swball_code.tar.gz</a> It includes the data+code needed to recreate the example above.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2010/01/semantic-web-clusterball/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>An Objective look at the Billion Triple Data</title>
		<link>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/</link>
		<comments>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/#comments</comments>
		<pubDate>Fri, 11 Dec 2009 15:44:18 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=394</guid>
		<description><![CDATA[For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it&#8217;ll be mostly tables. Enjoy :)
The BTC data contains 279,710,101 unique objects in total. Out [...]]]></description>
			<content:encoded><![CDATA[<p>For completeness, <a href="http://www.cs.univie.ac.at/employee.php?tab=teaching&amp;eid=223">Besbes</a> is telling me to write up the final stats from the BTC data, for the <em>object-part</em> of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it&#8217;ll be mostly tables. Enjoy :)</p>
<p>The BTC data contains 279,710,101 unique objects in total. Out of these:</p>
<ul>
<li>90,007,431 appear more than once</li>
<li>7,995,747 more than 10 times</li>
<li>748,214 more than 100</li>
<li>43,479 more than 1,000</li>
<li>3,209 more than 10,000</li>
</ul>
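<p>Numbers like these are just frequency counts cut at a few thresholds. A minimal Python sketch follows (the function name is made up, and an in-memory counter is an assumption that would probably not hold for the full 280M objects, where some sort-and-count on disk is more realistic):</p>

```python
from collections import Counter


def frequency_buckets(objects, thresholds=(1, 10, 100, 1000, 10000)):
    """Count unique objects in total and those seen more than each threshold."""
    counts = Counter(objects)
    buckets = {t: sum(1 for c in counts.values() if c > t) for t in thresholds}
    return len(counts), buckets
```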
<p>The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are <em>file://</em> URIs. The top 10 objects are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>object</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>2,584,960</td>
<td><a href="http://www.geonames.org/ontology#P">http://www.geonames.org/ontology#P</a></td>
</tr>
<tr>
<td>2,645,095</td>
<td><a href="http://www.aktors.org/ontology/portal#Article-Reference">http://www.aktors.org/ontology/portal#Article-Reference</a></td>
</tr>
<tr>
<td>2,681,771</td>
<td><a href="http://www.w3.org/2002/07/owl#Class">http://www.w3.org/2002/07/owl#Class</a></td>
</tr>
<tr>
<td>5,616,326</td>
<td><a href="http://www.aktors.org/ontology/portal#Person">http://www.aktors.org/ontology/portal#Person</a></td>
</tr>
<tr>
<td>7,544,903</td>
<td><a href="http://www.geonames.org/ontology#Feature">http://www.geonames.org/ontology#Feature</a></td>
</tr>
<tr>
<td>9,115,801</td>
<td><a href="http://en.wikipedia.org/">http://en.wikipedia.org/</a></td>
</tr>
<tr>
<td>12,124,378</td>
<td><a href="http://xmlns.com/foaf/0.1/OnlineAccount">http://xmlns.com/foaf/0.1/OnlineAccount</a></td>
</tr>
<tr>
<td>13,687,049</td>
<td><a href="http://purl.org/rss/1.0/item">http://purl.org/rss/1.0/item</a></td>
</tr>
<tr>
<td>14,172,852</td>
<td><a href="http://rdfs.org/sioc/types#WikiArticle">http://rdfs.org/sioc/types#WikiArticle</a></td>
</tr>
<tr>
<td>38,795,942</td>
<td><a href="http://xmlns.com/foaf/0.1/Person">http://xmlns.com/foaf/0.1/Person</a></td>
</tr>
</tbody>
</table>
<p>Apart from the wikipedia link, all are types. No literals appear in the top 10 table. For the 116M unique literals we have 12,845,021 literals with a language tag and 2,067,768 with a datatype tag. The top 10 literals are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>literal</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>722,221</td>
<td>&#8220;0&#8221;^^xsd:integer</td>
</tr>
<tr>
<td>969,929</td>
<td>&#8220;1&#8221;</td>
</tr>
<tr>
<td>1,024,654</td>
<td>&#8220;Nay&#8221;</td>
</tr>
<tr>
<td>1,036,054</td>
<td>&#8220;Copyright © 2009 craigslist, inc.&#8221;</td>
</tr>
<tr>
<td>1,056,799</td>
<td>&#8220;text&#8221;</td>
</tr>
<tr>
<td>1,061,692</td>
<td>&#8220;text/html&#8221;</td>
</tr>
<tr>
<td>1,159,311</td>
<td>&#8220;0&#8221;</td>
</tr>
<tr>
<td>1,204,996</td>
<td>&#8220;en-us&#8221;</td>
</tr>
<tr>
<td>2,049,638</td>
<td>&#8220;Aye&#8221;</td>
</tr>
<tr>
<td>2,310,681</td>
<td>&#8220;application/rdf+xml&#8221;</td>
</tr>
</tbody>
</table>
<p>I can&#8217;t be bothered to check it now, but I guess the many Aye&#8217;s &amp; Nay&#8217;s come from IRC chatlogs (#SWIG?).</p>
<p>Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 Unicode characters long (I wonder about this: it is very close to 2<sup>16</sup> bytes, and some Unicode characters take more than one byte, so could it be truncated?). The distribution of literal lengths looks like this:</p>
<p style="text-align: left;"><a href="http://www.flickr.com/photos/gromgull/4176130623/sizes/o/"><img class="aligncenter" style="border: 0pt none;" title="Literal lengths" src="http://farm3.static.flickr.com/2746/4176130623_754a99c096.jpg" alt="" width="500" height="400" /></a>Most literals are around 10 characters in length. There is a peak at 19, which I seem to remember was caused by the standard time format (i.e. 2005-10-30T10:45UTC) being exactly 19 characters.</p>
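<p>To make the truncation guess concrete: character count and byte count only differ when a string contains non-ASCII characters. The Python sketch below uses an invented mix of characters (I have not looked at what the longest literal actually contains) to show how 65,244 characters could come out at exactly 2<sup>16</sup> bytes in UTF-8.</p>

```python
def lengths(literal):
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(literal), len(literal.encode("utf-8"))


ascii_only = "x" * 65244
# Hypothetical mix: 292 a-umlauts, each taking 2 bytes in UTF-8.
with_umlauts = "x" * 64952 + "\u00e4" * 292

print(lengths(ascii_only))    # (65244, 65244)
print(lengths(with_umlauts))  # (65244, 65536)
print(len("2005-10-30T10:45UTC"))  # 19
```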
<p style="text-align: left;">That&#8217;s it! I believe I now have published all my numbers on BTC :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/12/an-objective-look-at-the-billion-triple-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DBTropes</title>
		<link>http://gromgull.net/blog/2009/12/dbtropes/</link>
		<comments>http://gromgull.net/blog/2009/12/dbtropes/#comments</comments>
		<pubDate>Thu, 10 Dec 2009 13:31:48 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Everything Else]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=387</guid>
		<description><![CDATA[
Know TvTropes.org? As pointed out by XKCD, a great place to lose hours of time reading about SoBadIt&#8217;sHorrible, HighOctaneNightmareFuel and thousands of other tropes, all with examples from comics, films, tv-series etc.
DFKI colleague Malte Kiesel has done the right thing and just released his linked open data wrapper for tvtropes, naturally named dbTropes.org. Now go [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2009/12/logo.png"><img class="aligncenter size-full wp-image-388" style="border: 0pt none;" title="logo" src="http://gromgull.net/blog/wp-content/uploads/2009/12/logo.png" alt="logo" width="200" height="50" /></a></p>
<p style="text-align: left;">Know <a href="http://tvtropes.org">TvTropes.org</a>? As <a href="http://xkcd.com/609/">pointed out by XKCD</a>, a great place to lose hours of time reading about <a href="http://tvtropes.org/pmwiki/pmwiki.php/DarthWiki/ptitlew9bltta3dv6n?from=Main.SoBadItsHorrible">SoBadIt&#8217;sHorrible</a>, <a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/HighOctaneNightmareFuel">HighOctaneNightmareFuel</a> and thousands of other <em>tropes, </em>all with examples from comics, films, tv-series etc.</p>
<p>DFKI colleague <a href="http://www.dfki.uni-kl.de/~kiesel/">Malte Kiesel</a> has done the right thing and just released his linked open data wrapper for tvtropes, naturally named <a href="http://dbtropes.org">dbTropes.org</a>. Now go read about <a href="http://dbtropes.org/resource/Main/DiabolusExMachina">DiabolusExMachina</a>. It will of course do content-negotiation, so try it with your favourite RDF browser.</p>
<p>I helped too &#8211; I made the stylesheet and the &#8220;logo&#8221; :)</p>
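<p>If you have no RDF browser at hand, content-negotiation just means asking for a different media type in the <em>Accept</em> header. Here is a small Python sketch that only builds the request. That the server answers <em>application/rdf+xml</em> with RDF is my assumption about a typical linked-data setup, not something checked against dbTropes.</p>

```python
from urllib.request import Request


def rdf_request(uri, mime_type="application/rdf+xml"):
    """Build (but do not send) a GET request that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": mime_type})


request = rdf_request("http://dbtropes.org/resource/Main/DiabolusExMachina")
print(request.get_header("Accept"))  # application/rdf+xml
# To actually fetch it, pass the request to urllib.request.urlopen().
```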
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/12/dbtropes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I&#8217;ll trie in python</title>
		<link>http://gromgull.net/blog/2009/11/ill-trie-in-python/</link>
		<comments>http://gromgull.net/blog/2009/11/ill-trie-in-python/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 10:25:45 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Koble]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=385</guid>
		<description><![CDATA[In Koble the auto-completion of thing-names used for wiki-editing, instant-search and adding relations is getting slower and slower, mainly because I do:

result=[]
things=listAllThings()
for t in things:
   if t.startswith(key): result.append(t)
for t in things:
   if key in t and not t.startswith(key): result.append(t)

Going through the list twice makes sure I get all things that match well first [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://koble.net/">Koble</a> the auto-completion of thing-names used for wiki-editing, instant-search and adding relations is getting slower and slower, mainly because I do:</p>
<pre>
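# key is the string typed so far; listAllThings() returns all thing-names
result=[]
things=listAllThings()
# first pass: names that start with the key (best matches)
for t in things:
   if t.startswith(key): result.append(t)
# second pass: names that merely contain the key, skipping those already added
for t in things:
   if key in t and not t.startswith(key): result.append(t)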
</pre>
<p>Going through the list twice makes sure I get all things that match well first (i.e. they start with the string I complete for), and then things matching less well later (they only contain the string).</p>
<p>Of course the world has made up a far better data-structure for indexing prefixes of strings, namely the <a href="http://www.itl.nist.gov/div897/sqg/dads/HTML/trie.html">trie, or prefix tree</a>. <a href="http://jtauber.com/">James Tauber</a> had already implemented one in python, and <a href="http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/">kindly made it available.</a> His version didn&#8217;t do everything I needed, so I added a few methods. Here is my updated version:</p>
<p><a href="http://gromgull.net/2009/11/trie.py">http://gromgull.net/2009/11/trie.py</a></p>
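<p>If you only want the idea and not the full file, a trie fits in a few lines. The sketch below is neither James Tauber&#8217;s implementation nor my extended version linked above, just a minimal illustration with made-up method names.</p>

```python
class Trie:
    """Minimal prefix tree: each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def add(self, word):
        node = self
        for char in word:
            node = node.children.setdefault(char, Trie())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        node = self
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        found = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.is_word:
                found.append(word)
            for char, child in current.children.items():
                stack.append((child, word + char))
        return sorted(found)
```

<p>The lookup walks down one node per character of the prefix and then only visits the matching subtree, instead of scanning every thing-name. Note that this only speeds up the starts-with pass. A plain trie does nothing for the contains pass.</p>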
<p>Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/11/ill-trie-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Typical Semantic Web Data</title>
		<link>http://gromgull.net/blog/2009/09/typical-semantic-web-data/</link>
		<comments>http://gromgull.net/blog/2009/09/typical-semantic-web-data/#comments</comments>
		<pubDate>Mon, 28 Sep 2009 16:15:54 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=317</guid>
		<description><![CDATA[This is the fourth of my Billion Triple Challenge data-set statistics posts. If you only just got here, catch up on part I, II or III.
I had these numbers ready for a long time, but never found the time to type it up as it is not so exciting. However CaptSolo asked for it [...]]]></description>
			<content:encoded><![CDATA[<p>This is the fourth of my <a href="http://vmlion25.deri.ie/index.html">Billion Triple Challenge data-set</a> statistics posts. If you only just got here, catch up on part <a href="http://gromgull.net/blog/2009/06/btc-statistics-i/">I</a>, <a href="http://gromgull.net/blog/2009/07/billions-and-billions-and-billions-on-a-map/">II</a> or <a href="http://gromgull.net/blog/2009/08/the-subject-matter-or-its-a-scam-there-are-only-900m/">III</a>.</p>
<p>I had these numbers ready for a long time, but never found the time to type it up as it is not so exciting. However <a href="http://captsolo.net/">CaptSolo</a> asked for it now to put in his very-soon-to-be-finished thesis, so I&#8217;ll hurry up. This is all about the classes used in the BTC data, i.e. the <em>rdf:type</em> triples.<br />
Overall the data contains 143,293,758 type triples, assigning 283,815 different types to 104,562,695 different things. For the types themselves:</p>
<ul>
<li>213,281 types are used more than once</li>
<li>94,455 used more than 10</li>
<li>14,862 more than 100</li>
<li>1,730 more than 1,000</li>
<li>288 more than 10,000</li>
</ul>
<p>If we take only these 288 top ones we cover 92% of all type triples. We can cover 90% of the typed things with only 105 types, and over 50% of the data with only <em>foaf:Person</em>, <em>sioc:WikiArticle</em>, <em>rss:item</em> and <em>foaf:OnlineAccount</em>. Out of all the &#8220;types&#8221; used, 12,319 were BNodes, which is odd, but I guess possible, and 204 are literals, which is even odder. The top 10 types are:</p>
<table class="stattable" border="0">
<thead>
<tr>
<th>#triples</th>
<th><strong>type URI</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td class='firstcol'>1,859,499</td>
<td>wordnet:Person</td>
</tr>
<tr>
<td class='firstcol'>2,309,652</td>
<td>foaf:Document</td>
</tr>
<tr>
<td class='firstcol'>2,645,091</td>
<td>akt:Article-Reference</td>
</tr>
<tr>
<td class='firstcol'>2,680,081</td>
<td>owl:Class</td>
</tr>
<tr>
<td class='firstcol'>5,616,163</td>
<td>akt:Person</td>
</tr>
<tr>
<td class='firstcol'>7,544,797</td>
<td>geonames:Feature</td>
</tr>
<tr>
<td class='firstcol'>12,123,375</td>
<td>foaf:OnlineAccount</td>
</tr>
<tr>
<td class='firstcol'>13,686,988</td>
<td>rss:item</td>
</tr>
<tr>
<td class='firstcol'>14,172,851</td>
<td>sioc:WikiArticle</td>
</tr>
<tr>
<td class='firstcol'>38,790,680</td>
<td>foaf:Person</td>
</tr>
</tbody>
</table>
<p>Now for the things the types are assigned to: out of the 104,562,965 things with types, 52,865,376 are BNodes. If you pay attention you will now have realised that many things have more than one type assigned (143M type triples ⇒ 104M things). In fact:</p>
<ul>
<li>7,026,972 things have more than one type triple.</li>
<li>612,467 have more than 10</li>
<li>35,201 more than 100</li>
<li>1,025 more than 1,000</li>
<li>40 more than 10,000</li>
</ul>
<p>Note I am talking here of <em>type triples</em>, i.e. the top 40 things may well have the same type assigned 10,000 times. The things having over 10,000 types assigned are a product of the partial inclusion of inferred triples in the data. For instance, for every context where RDFS inference has been applied, all properties will have <em>rdf:type rdf:Property</em> inferred. Looking at the number of unique types per thing shows that:</p>
<ul>
<li>2,979,968 things have more than one type</li>
<li>78,208 have more than 10</li>
<li>4 more than 100</li>
</ul>
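<p>The difference between the two lists above is only whether repeated (thing, type) pairs are counted once or every time they occur. A small Python sketch of that distinction, assuming the type triples have already been reduced to (thing, type) pairs (the function name is invented for the example):</p>

```python
from collections import defaultdict


def type_stats(type_triples):
    """Per thing: number of rdf:type triples and number of distinct types."""
    triple_count = defaultdict(int)
    unique_types = defaultdict(set)
    for thing, rdf_type in type_triples:
        triple_count[thing] += 1
        unique_types[thing].add(rdf_type)
    return {thing: (triple_count[thing], len(unique_types[thing]))
            for thing in triple_count}
```

<p>A property that gets <em>rdf:type rdf:Property</em> inferred in three contexts ends up as (3, 1): three type triples, one unique type.</p>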
<p>The 10 things with most <em>unique </em>types are all pretty boring:</p>
<table class='stattable'>
<thead>
<tr>
<th style="font-weight: bold">#types</th>
<th><strong>URI</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td class="firstcol">74</td>
<td>http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000959f60</td>
</tr>
<tr>
<td class="firstcol">75</td>
<td>http://dbpedia.org/resource/Arnold_Schwarzenegger</td>
</tr>
<tr>
<td class="firstcol">88</td>
<td>http://oiled.man.example.net/test#V822576</td>
</tr>
<tr>
<td class="firstcol">91</td>
<td>http://oiled.man.example.net/test#V21027</td>
</tr>
<tr>
<td class="firstcol">91</td>
<td>http://oiled.man.example.net/test#V21029</td>
</tr>
<tr>
<td class="firstcol">91</td>
<td>http://oiled.man.example.net/test#V21030</td>
</tr>
<tr>
<td class="firstcol">105</td>
<td>http://oiled.man.example.net/test#V16459</td>
</tr>
<tr>
<td class="firstcol">136</td>
<td>http://www.w3.org/2002/03owlt/description-logic/consistent501#T</td>
</tr>
<tr>
<td class="firstcol">136</td>
<td>http://www.w3.org/2002/03owlt/description-logic/inconsistent502#T</td>
</tr>
<tr>
<td class="firstcol">171</td>
<td>http://oiled.man.example.net/test#V21026</td>
</tr>
</tbody>
</table>
<p>Likewise the 10 things with the most types assigned, all a product of materialised inferred triples:</p>
<table class='stattable'>
<thead>
<tr style="font-weight: bold;">
<th>#triples</th>
<th>URI</th>
</tr>
</thead>
<tbody>
<tr>
<td class="firstcol">57,533</td>
<td>http://sw.opencyc.org/2008/06/10/concept/</td>
</tr>
<tr>
<td class="firstcol">58,838</td>
<td>http://semantic-mediawiki.org/swivt/1.0#creationDate</td>
</tr>
<tr>
<td class="firstcol">58,838</td>
<td>http://semantic-mediawiki.org/swivt/1.0#page</td>
</tr>
<tr>
<td class="firstcol">58,838</td>
<td>http://semantic-mediawiki.org/swivt/1.0#Subject</td>
</tr>
<tr>
<td class="firstcol">89,521</td>
<td>http://sw.opencyc.org/concept/Mx4rwLSVCpwpEbGdrcN5Y29ycA</td>
</tr>
<tr>
<td class="firstcol">121,138</td>
<td>http://en.wikipedia.org/</td>
</tr>
<tr>
<td class="firstcol">159,773</td>
<td>http://sw.opencyc.org/concept/</td>
</tr>
<tr>
<td class="firstcol">232,505</td>
<td>http://sw.cyc.com/CycAnnotations_v1#label</td>
</tr>
<tr>
<td class="firstcol">361,113</td>
<td>http://xmlns.com/foaf/0.1/holdsAccount</td>
</tr>
<tr>
<td class="firstcol">465,010</td>
<td>http://sw.cyc.com/CycAnnotations_v1#externalID</td>
</tr>
</tbody>
</table>
<p>That&#8217;s it &#8211; I hope it changed your life! :)</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/09/typical-semantic-web-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Heat-maps of Semantic Web Predicate usage</title>
		<link>http://gromgull.net/blog/2009/09/heat-maps-of-semantic-web-predicate-usage/</link>
		<comments>http://gromgull.net/blog/2009/09/heat-maps-of-semantic-web-predicate-usage/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 13:10:01 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=358</guid>
		<description><![CDATA[It&#8217;s all Cygri&#8217;s fault &#8211; he encouraged me to add schema namespaces to the general areas on the semantic web cluster-tree. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s all <a href="http://richard.cyganiak.de/">Cygri</a>&#8217;s fault &#8211; he <a href="http://chatlogs.planetrdf.com/swig/2009-09-10.html#T11-27-04">encouraged me</a> to add schema namespaces to the general areas on the <a href="http://gromgull.net/2009/09/btcclustertree/tree.html">semantic web cluster-tree</a>. Now, again I misjudged horribly how long this was going to take. I thought the general idea was simple enough, I already had the data. One hour should do it. And now one full day later I have:</p>
<p style="text-align: left;"><a href="http://gromgull.net/2009/09/heat/heat.html"><img class="aligncenter size-medium wp-image-359" style="border: 0pt none;" title="FOAF Predicates on the Semantic Web" src="http://gromgull.net/blog/wp-content/uploads/2009/09/foaf-300x202.png" alt="FOAF Predicates on the Semantic Web" width="300" height="202" /></a></p>
<p style="text-align: left;">It&#8217;s the same map as last time, laid using graphviz&#8217;s neato as before. The heat-map of the properties was computed from the feature-vector of predicate counts, first I mapped all predicates to their &#8220;namespace&#8221;, by the slightly-dodgy-but-good-enough heuristic of taking the part of the URI before the last # or / character. Then I split the map into a grid of NxN points (I think I used N=30 in the end), and compute a new feature vector for each point. This vector is the sum of the mapped vector for each of the domains, divided by the distance. I.e. (if you prefer math) each point&#8217;s vector becomes:</p>
<p style="text-align: center;"><img src="http://quicklatex.com/cache/ql_eab1808c3a47a79ec4468286c166cd63.gif" alt="\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y), pos_d)}} " title="\displaystyle V_{x,y}= \sum_d\frac{V_d}{\sqrt{D( (x,y), pos_d)}} " style="vertical-align: -22px; border: none;"/></p>
<p style="text-align: left;">Where <img src="http://quicklatex.com/cache/ql_f623e75af30e62bbd73d6df5b50bb7b5.gif" alt="D" title="D" style="vertical-align: 0px; border: none;"/> is the distance (here simple 2D Euclidean), <img src="http://quicklatex.com/cache/ql_8277e0910d750195b448797616e091ad.gif" alt="d" title="d" style="vertical-align: 0px; border: none;"/> is each domain, <img src="http://quicklatex.com/cache/ql_35786283530fc15de9505a0cf7b2dfe7.gif" alt="pos_d" title="pos_d" style="vertical-align: -4px; border: none;"/> is that domain&#8217;s position in the figure and <img src="http://quicklatex.com/cache/ql_ac9efe23bd710abe094c643a0b6e9a39.gif" alt="V_d" title="V_d" style="vertical-align: -3px; border: none;"/> is that domain&#8217;s feature vector. <em>Normally</em> it would be more natural to decrease the effect by the squared distance, but this gave less attractive results, and I ended up square-rooting it instead. The color is now simply one column of the resulting matrix, normalised and mapped to a <a href="http://www.scipy.org/Cookbook/Matplotlib/Show_colormaps">nice pylab colormap.</a></p>
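<p style="text-align: left;">The computation described above can be sketched in plain Python roughly as follows. This is not the original (messy) code: the function names, the unit-square coordinates and the small <em>eps</em> that avoids dividing by zero when a grid point sits exactly on a domain are all assumptions.</p>

```python
import math

def namespace(uri):
    """Heuristic namespace: the part of the URI before the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut] if cut >= 0 else uri

def grid_vectors(domains, n=30, eps=1e-9):
    """Feature vector for every point of an n x n grid over the unit square.

    `domains` is a list of ((x, y), vector) pairs. Each grid point gets
    sum_d V_d / sqrt(D(point, pos_d)), with D the 2D Euclidean distance.
    """
    dims = len(domains[0][1])
    grid = []
    for i in range(n):
        row = []
        for j in range(n):
            x, y = i / (n - 1), j / (n - 1)
            v = [0.0] * dims
            for (px, py), vec in domains:
                dist = math.hypot(x - px, y - py)
                weight = 1.0 / math.sqrt(dist + eps)  # eps is an assumption
                for k in range(dims):
                    v[k] += weight * vec[k]
            row.append(v)
        grid.append(row)
    return grid

def heat(grid, k):
    """Column k of the grid vectors, normalised to the range 0..1."""
    vals = [[cell[k] for cell in row] for row in grid]
    lo = min(min(row) for row in vals)
    hi = max(max(row) for row in vals)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in vals]

# Two toy domains in opposite corners, each using only one namespace.
domains = [((0.0, 0.0), [1.0, 0.0]), ((1.0, 1.0), [0.0, 1.0])]
h = heat(grid_vectors(domains, n=5), 0)
```

<p style="text-align: left;">The normalised values in <em>h</em> are what would then be handed to a colormap; the first namespace is hottest in the corner where its only domain sits and coolest in the opposite corner.</p>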
<p style="text-align: left;">Now this was the fun and interesting part, and it took maybe 1 hour. As predicted. NOW, getting this plotted along with the nodes from the graph turned out to be a nightmare. Neato gave me the coordinates for the nodes, but would change them slightly when rendering to PNGs. Many hours of frustration later I ended up drawing all of it again with <a href="http://matplotlib.sourceforge.net/">pylab</a>, which worked really well. I would publish the code for this, but it&#8217;s so messy it makes grown men cry.</p>
<p style="text-align: left;">NOW I am off to analyse the result of the top-level domain interlinking on the billion triple data. The data-collection just finished running while I did this. &#8230; As <a href="http://chatlogs.planetrdf.com/swig/2009-09-10.html#T11-28-32">he said.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/09/heat-maps-of-semantic-web-predicate-usage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualising predicate usage on the Semantic Web</title>
		<link>http://gromgull.net/blog/2009/09/visualising-predicate-usage-on-the-semantic-web/</link>
		<comments>http://gromgull.net/blog/2009/09/visualising-predicate-usage-on-the-semantic-web/#comments</comments>
		<pubDate>Wed, 09 Sep 2009 11:02:29 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Billion Triple Challenge]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=337</guid>
		<description><![CDATA[So, not quite a billion triple challenge post, but the data is the same. I had the idea that I could compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates are used within each one. Then once I had the distance-metric, I could use FastMap to visualise it. It would be [...]]]></description>
			<content:encoded><![CDATA[<p>So, not quite a billion triple challenge post, but the data is the same. I had the idea that I could compare the Pay-Level-Domains (PLD) of the context of the triples based on what predicates are used within each one. Then once I had the distance-metric, I could use <a href="http://gromgull.net/blog/2009/08/fastmap-in-python/">FastMap</a> to visualise it. It would be a quick hack, it would look smooth and great and be fun. In the end, many hours later, it wasn&#8217;t quick, the visual is not smooth (i.e. it doesn&#8217;t move) and I don&#8217;t know if it looks so great. It was fun though. Just go there and look at it:</p>
<p style="text-align: center;"><a href="http://gromgull.net/2009/09/btcclustertree/tree.html"><img class="size-full wp-image-338 aligncenter" style="border: 0pt none;" title="PayLevelDomains cluster-tree" src="http://gromgull.net/blog/wp-content/uploads/2009/09/smalltree.png" alt="PayLevelDomains cluster-tree" width="400" height="263" /></a></p>
<p style="text-align: left;">As you can see it&#8217;s a large PNG with the new-and-exciting <a href="http://en.wikipedia.org/wiki/Image_map">ImageMap</a> technology used to position the info-popup, or rather used to activate the JavaScript used for the popups. I tried at first with SVG, but I couldn&#8217;t get SVG and XHTML and JavaScript to play along; I guess in Firefox 5 it will work. The graph is laid out and generated by <a href="http://www.graphviz.org/">Graphviz</a>&#8217;s <a href="http://www.graphviz.org/pdf/neatoguide.pdf">neato</a>, which also generated the imagemap.</p>
<p style="text-align: left;">So what do we actually see here? In short, a tree where domains that publish similar Semantic Web data are close to each other in the tree and have similar colours. In detail: I took all the PLDs that contained over 1,000 triples (around 7500 of them), and counted the number of triples for each of the 500 most frequent predicates in the dataset. (These 500 predicates cover &#8776;94% of the data.) This gave me a vector-space with 500 features for each of the PLDs, i.e. something like this:</p>
<table style="border-color: #aaa; border-collapse: collapse; text-align: center; font-family: mono;" border="1" cellspacing="5">
<colgroup>
<col width="77"></col>
<col width="77"></col>
<col width="77"></col>
<col width="77"></col>
<col width="77"></col>
</colgroup>
<tbody>
<tr>
<td width="77" height="16" align="LEFT"></td>
<td width="77" align="LEFT">geonames:nearbyFeature</td>
<td width="77" align="LEFT">dbprop:redirect</td>
<td width="77" align="LEFT">foaf:knows</td>
<td width="77" align="LEFT">&#8230;</td>
</tr>
<tr>
<td height="16" align="LEFT">dbpedia.org</td>
<td align="RIGHT">0.01</td>
<td align="RIGHT">0.8</td>
<td align="RIGHT">0.1</td>
<td align="LEFT"></td>
</tr>
<tr>
<td height="16" align="LEFT">livejournal.org</td>
<td align="RIGHT">0</td>
<td align="RIGHT">0</td>
<td align="RIGHT">0.9</td>
<td align="LEFT"></td>
</tr>
<tr>
<td height="16" align="LEFT">geonames.org</td>
<td align="RIGHT">0.75</td>
<td align="RIGHT">0</td>
<td align="RIGHT">0.1</td>
<td align="LEFT"></td>
</tr>
<tr>
<td height="16" align="LEFT">&#8230;</td>
<td align="LEFT"></td>
<td align="LEFT"></td>
<td align="LEFT"></td>
<td align="LEFT"></td>
</tr>
</tbody>
</table>
<p style="text-align: left;">Each value is the fraction of triples from this PLD that used this predicate. In this vector space I used the <a href="http://en.wikipedia.org/wiki/Cosine_similarity">cosine-similarity</a> to compute a distance matrix for all PLDs. With this distance matrix I thought I could apply FastMap, but it worked really badly and looked like this:</p>
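<p style="text-align: left;">As a rough illustration of the distance-matrix step, here is a small Python sketch using the toy values from the table above. Turning the similarity into a distance as 1 minus the cosine similarity is an assumption (the exact conversion is not stated here), as are the function names and the handling of all-zero vectors.</p>

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # assumption: an all-zero vector is treated as maximally distant
    return 1.0 - dot / (na * nb)

def distance_matrix(vectors):
    """Return (sorted names, square matrix of pairwise cosine distances)."""
    names = sorted(vectors)
    matrix = [[cosine_distance(vectors[a], vectors[b]) for b in names]
              for a in names]
    return names, matrix

# Toy feature vectors: nearbyFeature, redirect, knows.
vectors = {
    "dbpedia.org": [0.01, 0.8, 0.1],
    "livejournal.org": [0.0, 0.0, 0.9],
    "geonames.org": [0.75, 0.0, 0.1],
}
names, dist = distance_matrix(vectors)
```

<p style="text-align: left;">With the real data this would be a roughly 7500 x 7500 matrix over 500-dimensional vectors, so something like numpy would be a more sensible tool than nested lists.</p>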
<p style="text-align: center;"><a href="http://gromgull.net/blog/wp-content/uploads/2009/09/fastmap.png"><img class="size-medium wp-image-340 aligncenter" style="border: 0pt none;" title="Fastmapping the PLDs " src="http://gromgull.net/blog/wp-content/uploads/2009/09/fastmap-300x226.png" alt="Fastmapping the PLDs " width="300" height="226" /></a></p>
<p style="text-align: left;">So instead of FastMap I used maketree from the <a href="http://complearn.org/">complearn tools</a>. This generates trees from a distance matrix and gives very good results, but it is an iterative optimisation and it takes forever for large instances. Around this time I realised I wasn&#8217;t going to be able to visualise all 7500 PLDs, and cut it down to the <span style="text-decoration: line-through;">2000</span>, <span style="text-decoration: line-through;">1000</span>, <span style="text-decoration: line-through;">500</span>, <span style="text-decoration: line-through;">100</span>, 50 largest PLDs. Now this worked fine, but the result looked like a bog-standard graphviz graph, and it wasn&#8217;t very exciting (i.e. not at all like <a href="http://www.pitchinteractive.com/colour_economy/">this colourful thing</a>). Now I realised that since I actually had numeric feature vectors in the first place I wasn&#8217;t restricted to using FastMap to make up coordinates, so I used <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> to map the input vector-space to a 3-dimensional space, normalised the values to [0;255] and used these as RGB values for colour. Ah &#8211; lovely pastel.</p>
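<p style="text-align: left;">The PCA-to-colour trick can be sketched in dependency-free Python like this. It is only an illustration under assumptions: the principal directions are found with plain power iteration and deflation rather than whatever PCA routine was actually used, and the function names and toy vectors are made up.</p>

```python
import math

def principal_directions(rows, k=3, iters=200):
    """Centre the rows; return (centred rows, first k principal directions)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(d)]
           for a in range(d)]
    directions = []
    for c in range(k):
        # Deterministic, non-uniform start vector (an arbitrary choice).
        v = [1.0 / (j + c + 1) for j in range(d)]
        norm = math.sqrt(sum(t * t for t in v))
        v = [t / norm for t in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(t * t for t in w))
            if norm < 1e-12:  # no variance left to find
                break
            v = [t / norm for t in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(d) for b in range(d))
        directions.append(v)
        # Deflation: remove the direction just found from the covariance.
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigenvalue * v[a] * v[b]
    return centred, directions

def rgb_colours(rows):
    """Project onto 3 principal directions and scale each one to 0..255."""
    centred, directions = principal_directions(rows, k=3)
    projected = [[sum(x * w for x, w in zip(r, v)) for v in directions]
                 for r in centred]
    channels = []
    for c in range(3):
        column = [p[c] for p in projected]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0
        channels.append([int(round(255 * (x - lo) / span)) for x in column])
    return list(zip(*channels))

# Toy feature vectors, one row per PLD.
vectors = [[0.01, 0.8, 0.1], [0.0, 0.0, 0.9], [0.75, 0.0, 0.1], [0.1, 0.1, 0.7]]
colours = rgb_colours(vectors)
```

<p style="text-align: left;">Each PLD ends up with one (R, G, B) tuple, and PLDs with similar predicate usage get similar colours. The sign of each principal direction is arbitrary, so the exact colours can flip between implementations.</p>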
<p style="text-align: left;">I think I underestimated the time this would take by at least a factor of 20. Oh well. Time for lunch.</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/09/visualising-predicate-usage-on-the-semantic-web/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>FastMap in Python</title>
		<link>http://gromgull.net/blog/2009/08/fastmap-in-python/</link>
		<comments>http://gromgull.net/blog/2009/08/fastmap-in-python/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 21:19:12 +0000</pubDate>
		<dc:creator>gromgull</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Visualisation]]></category>

		<guid isPermaLink="false">http://gromgull.net/blog/?p=333</guid>
		<description><![CDATA[This shows some strings positioned on a graph according to their Levenshtein string distance. (And let me note how appropriate it is that a man named Levenshtein should make a string distance algorithm)

The distance matrix given by the string distance is then mapped to two dimensions using the FastMap algorithm by Christos [...]]]></description>
			<content:encoded><![CDATA[<p>This shows some strings positioned on a graph according to their <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein string distance</a>. (And let me note how appropriate it is that a man named <em>Levenshtein</em> should make a string distance algorithm)</p>
<p style="text-align: center;"><a href="http://www.flickr.com/photos/gromgull/3875974572/"><img class="aligncenter" style="border: 0pt none;" src="http://farm3.static.flickr.com/2645/3875974572_33ff6b4795.jpg" alt="" width="500" height="377" /></a></p>
<p>The distance matrix given by the string distance is then mapped to two dimensions using the <a href="http://portal.acm.org/citation.cfm?id=223812">FastMap algorithm by Christos Faloutsos and King-Ip Lin</a>. This mapping can also be done with <a href="http://en.wikipedia.org/wiki/Multidimensional_scaling">Multidimensional Scaling</a> (and I did so in my PhD work, I even claimed it was novel, but actually I was 40 years too late, oh well), but that algorithm is nasty and iterative. FastMap, as the name implies, is much faster, and it doesn&#8217;t actually even need a full distance matrix (I think). It doesn&#8217;t always find the best solution, and it has a slight random element to it, so the solution might also vary each time it&#8217;s run, but it&#8217;s good enough for almost all cases. Again, I&#8217;ve implemented it in Python &#8211; grab it here: <a href="http://gromgull.net/2009/08/fastmap.py">fastmap.py</a></p>
<p>To get a feel for how it works, download the code, remove <em>George Leary</em> and see how it reverts to only one dimension for a sensible mapping.</p>
<p>The algorithm is straightforward enough. For each dimension you want, repeat:</p>
<ol>
<li>use a heuristic to find the two most distant points</li>
<li>the line between these two points becomes the next dimension, project all points to this line</li>
<li>recurse :)</li>
</ol>
<p>The distance measure used keeps the projection &#8220;in mind&#8221;, so the second iteration will be different to the first. The whole thing is a bit like <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a>, but without requiring an original matrix of feature vectors.</p>
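<p>The steps above can be sketched in a few lines of Python. This is a compact illustration written for this post and not the contents of fastmap.py: the function signature, the number of pivot-search rounds and the clamping of negative squared distances to zero are assumptions.</p>

```python
import math
import random

def levenshtein(s, t):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def fastmap(objects, dist, k=2, seed=0, pivot_rounds=5):
    """Map objects to k coordinates using only the distance function."""
    rng = random.Random(seed)
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        """Squared distance left over after the first `col` projections."""
        v = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            v -= (coords[i][c] - coords[j][c]) ** 2
        return max(v, 0.0)  # clamp: non-Euclidean distances can go negative

    def farthest(i, col):
        return max(range(n), key=lambda j: d2(i, j, col))

    for col in range(k):
        # Heuristic for two distant pivots: hop to the farthest point a few times.
        b = rng.randrange(n)
        for _ in range(pivot_rounds):
            a = farthest(b, col)
            b = farthest(a, col)
        dab2 = d2(a, b, col)
        if dab2 == 0:
            break  # nothing left to spread out; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        # Project every object onto the line through the two pivots.
        for i in range(n):
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords

words = ["kitten", "sitting", "mitten", "fitting"]
word_coords = fastmap(words, levenshtein, k=2)

# Points that really live on a line need only one dimension.
line = [0, 3, 10]
line_coords = fastmap(line, lambda x, y: abs(x - y), k=2)
```

<p>The second example shows the &#8220;reverts to only one dimension&#8221; effect in miniature: once the first projection explains all the distances, there is nothing left for the second one to do.</p>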
<p>This is already quite old, from 1995, and I am sure something better exists now, but it&#8217;s a nice little thing to have in the toolbox. I wonder if it can be used to estimate numerical feature values for nominal attributes, in cases where all possible values are known?</p>
]]></content:encoded>
			<wfw:commentRss>http://gromgull.net/blog/2009/08/fastmap-in-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
