Posts categorized “Uncategorized”.

32bit firefox/thunderbird on debian amd64

New computer at work last week, now with 16GB of RAM I totally do not need, but with this much memory it was clear that running a 32-bit Linux was no longer an option.
So Debian amd64 was installed. Now, I’ve looked at the thunderbird/firefox icons for so long that I cannot live with the iceweasel/icedove branding. Only the Firefox nightly exists as a 64-bit build, where again the branding is different, so installing the 32-bit builds was necessary.

This is really just a problem that arises from my own stubborn refusal to change my ways: if I ran the Firefox nightly, lived with iceweasel, or simply ran Ubuntu, it would all be fine.
I’ve pieced together this information twice now, so time to write it down. Also, maybe someone else has exactly the same weird “legacy” problems I do.

Run all commands below as root.

First, set up multiarch. Beware: when I first did this about a year ago, it messed up conflict resolution in aptitude and I had to fall back on plain apt-get; I hear it might work fine now.

 
dpkg --add-architecture i386
apt-get update

Then install the basic libraries firefox/thunderbird need (use ldd on the firefox binaries/libraries to find this list; a small sketch for this follows the install command below):

apt-get install libgtk2.0-0:i386 libatk1.0-0:i386 libgdk-pixbuf2.0-0:i386 \
     libldap-2.4-2:i386 libdbus-glib-1-2:i386 libpango1.0-0:i386 libglib2.0-0:i386
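
To see whether anything is still missing, you can ask ldd directly. Here is a minimal Python sketch that lists the libraries the loader cannot resolve – the binary path is only an assumption about where firefox is unpacked, adjust as needed:

#!/usr/bin/env python
# Run ldd on the 32-bit firefox binary and print every library the loader
# cannot resolve, i.e. candidates for further :i386 packages.
import subprocess

BINARY = "/opt/firefox/firefox"   # hypothetical install location

output = subprocess.run(["ldd", BINARY], capture_output=True, text=True).stdout
for line in output.splitlines():
    if "not found" in line:
        print(line.strip())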

I use the Greybird theme – to make thunderbird/firefox look the part, install the 32-bit versions of the GTK engines it uses:

apt-get install gtk2-engines-murrine:i386 gtk2-engines-pixbuf:i386

Then install firefox/thunderbird normally – if you keep the folder owned/writable by your user they will auto-update fine.

Aggregates over BTC2010 namespaces

Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more – and it gets slightly less boring.

First, predicates: yesterday I had the raw count per predicate. Much more interesting are the namespaces the predicates are defined in. These are the top 10:

#triples namespace
860,532,348 rdfs
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
588,063,466 rdf
527,347,381 gr
284,679,897 foaf
44,119,248 dc11
41,961,046 http://purl.uniprot.org/core
17,233,778 rss
13,661,605 http://www.proteinontology.info/po.owl
13,009,685 owl

(prefix abbreviations are made from prefix.cc – I am too lazy to fix the missing ones)
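
For what it’s worth, the namespace counts are obtained by chopping each predicate at the last ‘#’ or ‘/’ and summing the per-predicate counts. A minimal sketch, assuming a hypothetical “count predicate” file (e.g. the output of sort | uniq -c):

from collections import Counter

def namespace(uri):
    """Everything up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1] if cut > 0 else uri

ns_counts = Counter()
with open("predicate-counts.txt") as f:   # hypothetical "count predicate" dump
    for line in f:
        count, pred = line.split(None, 1)
        ns_counts[namespace(pred.strip().strip("<>"))] += int(count)

for ns, n in ns_counts.most_common(10):
    print(f"{n:>15,} {ns}")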

Now it gets interesting – because I did exactly this last year as well, and now we can compare!

Dropouts

In 2009 there were 3,817 different namespaces; this year we have 3,911, but actually only 2,945 occur in both. The biggest dropouts, i.e. namespaces that occurred last year but not at all this year, are:

#triples namespace
10,239,809 http://www.kisti.re.kr/isrl/ResearchRefOntology
5,443,549 nie
1,571,547 http://ontologycentral.com/2009/01/eurostat/ns
1,094,963 http://sindice.com/exfn/0.1
320,155 http://xmdr.org/ont/iso11179-3e3draft_r4.owl
307,534 http://cb.semsol.org/ns
242,427 nco
203,283 osag
187,600 http://auswiki.org/index.php/Special:URIResolver
159,536 nexif

I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion altogether, although it was a bit of a freak occurrence last year. I am also not sure how we lost 10M research ontology triples.
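
The dropout and newcomer lists are just set differences over the two years’ namespace counts. A minimal sketch of the dropout side, assuming hypothetical “count namespace” dumps for each year:

def load(path):
    """Read a 'count namespace' file into a dict of namespace -> triple count."""
    with open(path) as f:
        return {ns.strip(): int(n) for n, ns in (line.split(None, 1) for line in f)}

ns2009 = load("namespaces-2009.txt")   # hypothetical filenames
ns2010 = load("namespaces-2010.txt")

for ns in sorted(set(ns2009) - set(ns2010), key=ns2009.get, reverse=True)[:10]:
    print(f"{ns2009[ns]:>12,} {ns}")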

Newcomers

Looking the other way around, at the namespaces that are new and popular this year, we get:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
5,001,909 fec
2,689,813 http://transport.data.gov.uk/0/ontology/traffic
543,835 http://rdf.geospecies.org/ont/geospecies
526,304 http://data-gov.tw.rpi.edu/vocab/p/401
469,446 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf
446,120 http://education.data.gov.uk/def/school
223,726 http://www.w3.org/TR/rdf-schema
190,890 http://wecowi.de/wiki/Spezial:URIResolver
166,511 http://data-gov.tw.rpi.edu/vocab/p/10

Here the introduction of data.gov and data.gov.uk was the big event of the past year.

Winners

For the namespaces that occurred in both years we can find the biggest gainers. Here I calculated what ratio of the total triples each namespace constituted each year, and the increase in this ratio from 2009 to 2010. For example, GoodRelations, on top here, constituted nearly 16% of all triples in 2010, but only 2.91e-4% of all triples last year, for a cool increase by a factor of over 57,000 :)
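
For reference, a minimal sketch of this gain calculation, assuming the same hypothetical “count namespace” dumps as above (the loader is repeated so the snippet stands alone):

def load(path):
    """Read a 'count namespace' file into a dict of namespace -> triple count."""
    with open(path) as f:
        return {ns.strip(): int(n) for n, ns in (line.split(None, 1) for line in f)}

ns2009 = load("namespaces-2009.txt")   # hypothetical filenames
ns2010 = load("namespaces-2010.txt")
total2009, total2010 = sum(ns2009.values()), sum(ns2010.values())

# ratio of each year's total, 2010 share divided by 2009 share,
# for namespaces present in both years
gains = {ns: (ns2010[ns] / total2010) / (ns2009[ns] / total2009)
         for ns in ns2009.keys() & ns2010.keys()}

for ns, gain in sorted(gains.items(), key=lambda item: -item[1])[:10]:
    print(f"{gain:>10.2f} {ns}")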

gain namespace
57058.38 gr
2636.34 http://www.openlinksw.com/schema/attribution
2182.81 http://www.openrdf.org/schema/serql
1944.68 http://www.w3.org/2007/OWL/testOntology
1235.02 http://referata.com/wiki/Special:URIResolver
1211.35 urn:lsid:ubio.org:predicates:recordVersion
1208.09 urn:lsid:ubio.org:predicates:lexicalStatus
1194.66 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
1191.39 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
701.66 urn:lsid:ubio.org:predicates:hasCAVConcept

Losers

Similarly, we have the biggest losers, the ones who lost the most:

gain namespace
0.000185 http://purl.org/obo/metadata
0.000191 sioct
0.000380 vcard
0.000418 affy
0.000438 http://www.geneontology.org/go
0.000677 http://tap.stanford.edu/data
0.000719 urn://wymiwyg.org/knobot/default
0.000787 akts
0.000876 http://wymiwyg.org/ontologies/language-selection
0.000904 http://wymiwyg.org/ontologies/knobot

If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)

With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.

Update

(a bit later)

You can get the full datasets for this from Many Eyes.

BTC2010 Basic stats

Another year, another billion triple dataset. This time it was released the same time my daughter was born, so running the stats script was delayed for a bit.

This year we’ve got a few more triples, perhaps making up for the fact that it wasn’t actually one billion last year :) we’ve now got 3.1B triples (or 3,171,793,030 if you want to be exact).

I’ve not had a chance to do anything really fun with this, so I’ll just dump the stats:

Subjects

  • 159,185,186 unique subjects
  • 147,663,612 occur in more than a single triple
  • 12,647,098 more than 10 times
  • 5,394,733 more than 100
  • 313,493 more than 1,000
  • 46,116 more than 10,000
  • and 53 more than 100,000 times

For an average of 19.9252 triples per unique subject. Like last year, I am not sure if having more than 100,000 triples with the same subject really is useful for anyone?
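
All the “more than N times” buckets in these posts come out of the same simple pass over per-item counts. A minimal sketch for the subjects, assuming a hypothetical file of per-subject counts (e.g. from cutting out the subject column, then sort | uniq -c):

thresholds = [1, 10, 100, 1000, 10000, 100000]
buckets = dict.fromkeys(thresholds, 0)
triples = uniques = 0

# hypothetical input: one "count subject" line per unique subject
with open("subject-counts.txt") as f:
    for line in f:
        n = int(line.split()[0])
        triples += n
        uniques += 1
        for t in thresholds:
            if n > t:
                buckets[t] += 1

print(f"{uniques:,} unique subjects, {triples / uniques:.4f} triples each on average")
for t in thresholds:
    print(f"{buckets[t]:,} occur more than {t:,} times")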

Looking only at bnodes used as subjects we get:

  • 100,431,757 unique subjects
  • 98,744,109 occur in more than a single triple
  • 1,465,399 more than 10 times
  • 266,759 more than 100
  • 4,956 more than 1,000
  • 48 more than 10,000

So 100M out of 159M subjects are bnodes, but they are used less often than the named resources.

The top subjects are as follows:

#triples subject
1,412,709 http://www.proteinontology.info/po.owl#A
895,776 http://openean.kaufkauf.net/id/
827,295 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy
492,756 cycann:externalID
481,000 http://purl.uniprot.org/citations/15685292
445,430 foaf:Document
369,567 cycann:label
362,391 dcmitype:Text
357,309 http://sw.opencyc.org/concept/
349,988 http://purl.uniprot.org/citations/16973872

I do not know enough about the Protein Ontology to know why po:A is so popular. CYC we already had last year here, and I guess all products exposed by BestBuy have this URI as a subject.

Predicates

  • 95,379 unique predicates
  • 83,370 occur in more than one triple
  • 46,710 more than 10
  • 18,385 more than 100
  • 5,395 more than 1,000
  • 1,271 more than 10,000
  • 548 more than 100,000

The average predicate occurred in 33,254.6 triples.

#triples predicate
557,268,190 rdf:type
384,891,996 rdfs:isDefinedBy
215,041,142 gr:hasGlobalLocationNumber
184,881,132 rdfs:label
175,141,343 rdfs:comment
168,719,459 gr:hasEAN_UCC-13
131,029,818 gr:hasManufacturer
112,635,203 rdfs:seeAlso
71,742,821 foaf:nick
71,036,882 foaf:knows

The usual suspects, rdf:type, comment, label, seeAlso and a bit of FOAF. New this year is lots of GoodRelations data!

Objects – Resources

Ignoring literals for the moment, looking only at resource-objects, we have:

  • 192,855,067 unique resources
  • 36,144,147 occur in more than a single triple
  • 2,905,294 more than 10 times
  • 197,052 more than 100
  • 20,011 more than 1,000
  • 2,752 more than 10,000
  • and 370 more than 100,000 times

On average 7.72834 triples per object. This is both named objects and bnodes, looking at the bnodes only we get:

  • 97,617,548 unique resources
  • 616,825 occur in more than a single triple
  • 8,632 more than 10 times
  • 2,167 more than 100
  • 1 more than 1,000

Since bnode IDs are only valid within a single file, there is a limit to how often they can appear, but still almost half the overall objects are bnodes.

The top ten bnode IDs are pretty boring, but the top 10 named resources are:

#triples resource-object
215,532,631 gr:BusinessEntity
215,153,113 ean:businessentities/
168,205,900 gr:ProductOrServiceModel
167,789,556 http://openean.kaufkauf.net/id/
71,051,459 foaf:Person
10,373,362 foaf:OnlineAccount
6,842,729 rss:item
6,025,094 rdf:Statement
4,647,293 foaf:Document
4,230,908 http://purl.uniprot.org/core/Resource

These are pretty much all types – compare to:

Types

Counting as a “type” every object that occurs in a triple where rdf:type is the predicate, we get:

  • 170,020 types
  • 91,479 occur in more than a single triple
  • 20,196 more than 10 times
  • 4,325 more than 100
  • 1,113 more than 1,000
  • 258 more than 10,000
  • and 89 more than 100,000 times

On average each type is used 3,277.7 times, and the top 10 are:

#triples type
215,536,042 gr:BusinessEntity
168,208,826 gr:ProductOrServiceModel
71,520,943 foaf:Person
10,447,941 foaf:OnlineAccount
6,886,401 rss:item
6,066,069 rdf:Statement
4,674,162 foaf:Document
4,260,056 http://purl.uniprot.org/core/Resource
4,001,282 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry
3,405,101 owl:Class

Not identical to the top resources, but quite similar. Lots of FOAF and new this year, lots of GoodRelations.
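
For reference, these type counts are simply the object counts restricted to triples whose predicate is rdf:type. A minimal sketch over the dump (filename hypothetical; naive whitespace splitting is good enough here because the object of an rdf:type triple is a resource, not a literal):

from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
types = Counter()

with open("btc-2010.nq") as f:   # hypothetical filename for the dump
    for line in f:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == RDF_TYPE:
            types[parts[2]] += 1

for t, n in types.most_common(10):
    print(f"{n:>12,} {t}")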

Contexts

Something changed with regard to context handling for BTC2010: this year we only have 8M contexts, whereas last year we had over 35M.
I wonder if perhaps all of dbpedia is in one context this year?

  • 8,126,834 unique contexts
  • 8,048,574 occur in more than a single triple
  • 6,211,398 more than 10 times
  • 1,493,520 more than 100
  • 321,466 more than 1,000
  • 61,360 more than 10,000
  • and 4,799 more than 100,000 times

For an average of 389.958 triples per context. The 10 biggest contexts are:

#triples context
302,127 http://data-gov.tw.rpi.edu/raw/402/data-402.rdf
273,644 http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
259,824 http://static.cpantesters.org/author/M/MIYAGAWA.rss
207,513 http://data-gov.tw.rpi.edu/raw/401/data-401.rdf
193,944 http://static.cpantesters.org/author/D/DROLSKY.rss
189,528 http://static.cpantesters.org/author/S/SMUELLER.rss
170,899 http://data-gov.tw.rpi.edu/raw/59/data-59.rdf
166,454 http://zaltys.net/ontology/AKTiveSAOntology.owl
166,454 http://www.zaltys.net/ontology/AKTiveSAOntology.owl
165,948 http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl

This concludes my boring stats dump for BTC2010 for now. Some information on literals and hopefully some graphs will come soon! I also plan to look into how these stats changed from last year – so far I see much more GoodRelations, but there must be other fun changes!

The Machine Learning Algorithm with Capital A

A student came to see me recently, wanting to do a Diplomarbeit (i.e. an MSc++) on a learning technique called Hierarchical Temporal Memory, or HTM. He had very specific ideas about what he wanted, and had already worked with the algorithm in his Projektarbeit (BSc++). I knew nothing about the approach, but remembered this reddit post, which was less than enthusiastic. I spoke to the student and read up on the thing a bit, and it seems interesting enough. It’s claimed to be close to the way the brain learns/recognizes patterns, to be a general model of intelligence, and to work for EVERYTHING. This reminded me of a few other things I’ve come across in the past years that claim to be the new Machine Learning Algorithm with Capital A, i.e. the algorithm to end all other ML work, which will work on all problems, and so on. Here is a small collection of the three most interesting ones I remembered:

Hierarchical Temporal Memory

HTMs are “invented” by Jeff Hawkins, whose track record includes making the Palm device and platform and later the Treo. Having your algorithm’s PR based on the celebrity status of the inventor is not really a good sign. The model is first presented in his book On Intelligence, which I’ve duly bought and am currently reading. The book is so far very interesting, although full of things like “[this is] how the brain actually works“, “Can we build intelligent machines? Yes. We can and we will.”, etc. As far as I understand, the model from the book was formally analysed and became the HTM algorithm in Dileep George‘s thesis: How the brain might work: A hierarchical and temporal model for learning and recognition. He applies it to recognizing 32×32 pictures of various letters and objects.

The model is based on a hierarchy of sensing components, each dealing with a higher level of abstraction when processing input. The top of the hierarchy feeds into some traditional learning algorithm, such as an SVM for classification, or some clustering mechanism. In effect, the whole HTM is a form of feature pre-processing. The temporal aspect is introduced by the nodes observing their input (either the raw input or their sub-nodes) over time; this (hopefully) gives rise to translation, rotation and scale invariance, as the things you are watching move around. I say watching here because computer vision seems to be the main application, although it’s of course applicable to EVERYTHING:

HTM technology has the potential to solve many difficult problems in machine learning, inference, and prediction. Some of the application areas we are exploring with our customers include recognizing objects in images, recognizing behaviors in videos, identifying the gender of a speaker, predicting traffic patterns, doing optical character recognition on messy text, evaluating medical images, and predicting click through patterns on the web.

The guys went on to found a company called Numenta to spread this technique; they have a (not open-source, boo!) development framework you can play with.

Normalised Compression Distance

This beast goes under many names: compression-based learning, compression-based dissimilarity measure, etc. The idea is in any case to reuse compression algorithms for learning, from the good old DEFLATE algorithm used in zip/gzip to compression algorithms specific to some data type, like DNA. The distance between things is then derived from how well they compress together with each other or with other data, and this distance metric can be used for clustering, classification, anomaly detection, etc. The whole thing is supported by the theory of Kolmogorov Complexity and Minimum Description Length, i.e. it’s not just a hack.
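
The recipe is easy to sketch: the normalised compression distance between x and y is (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. A minimal version using zlib, with hypothetical input files:

import zlib

def csize(data):
    """Compressed size in bytes, using plain zlib/DEFLATE."""
    return len(zlib.compress(data, 9))

def ncd(x, y):
    """Normalised compression distance: near 0 for similar data, near 1 for unrelated."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# hypothetical input files -- any two blobs of bytes will do
a = open("mail1.txt", "rb").read()
b = open("mail2.txt", "rb").read()
print(ncd(a, b))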

I came across it back in 2003 in the Kuro5hin article Spam Filtering with gzip. Back then I was very sceptical, thinking that any algorithm dedicated to doing classification MUST easily out-perform this. What I didn’t think about is that if you use the optimal compression for your data, then it finds all patterns in the data, and this is exactly what learning is about. Of course, gzip is pretty far from optimal, but it still works pretty well. I am not the only one who wasn’t convinced; this letter appeared in a physics journal in 2001 and led to some heated discussion: angry comment, angry reply, etc.

A bit later, I came across this again. Eamonn Keogh wrote Towards parameter-free data mining in 2004; this paper makes a stronger case for the method being simple, easy, great and applicable to EVERYTHING:

[The Algorithm] can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.

A bit later again I came across Rudi Cilibrasi’s (his page is broken atm.) thesis on Statistical Inference Through Data Compression. He has more examples, more theory and, most importantly, open-source software for everything: CompLearn (also down atm., but there are packages in debian). The method is very nice in that it makes no assumptions about the format of the input, i.e. no feature vectors or similar. Here is a clustering tree generated from a bunch of different file types:

Markov Logic Networks

I first came across Markov Logic Networks in the paper Automatically Refining the Wikipedia Infobox Ontology. Here they have two intertwined problems they use machine learning to solve: firstly they want to map wikipedia category pages to WordNet synsets, and secondly they want to arrange the wikipedia categories in a hierarchy, i.e. learn is-a relationships between categories. They solve the problem in two ways. The traditional way is to train an SVM to do the WordNet mappings, and use these mappings as additional features for training a second SVM to do the is-a learning. This is all good, and works reasonably well, but by using Markov Logic Networks they can use joint inference to tackle both tasks at once. This is good since the two problems are clearly not independent, and now evidence that two categories are related can feed back and improve the probability that they map to WordNet synsets that are also related. It also allows different is-a relations to influence each other, i.e. if Gouda is-a Cheese is-a Food, then Gouda is probably also a Food.

The software used in the paper is made by the people at the University of Washington, and is available as open-source: Alchemy – algorithms for statistical relational learning and probabilistic logic inference. The system combines logical and statistical AI, building on network structures much like Bayesian belief networks; in the end it’s a bit like Prolog programming, but with probabilities for all facts and relations, and these probabilities can be learned from data. Take this cheerful example about people dying of cancer: given a dependency network and some data about friends who influence each other to smoke, and about people dying, you can estimate the probability that you smoke if your friend does, and the probability that you will get cancer:
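
For reference, the core of Markov logic (this is the standard Richardson–Domingos formulation, not something specific to this paper) is a log-linear model over possible worlds:

P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big)

where n_i(x) is the number of true groundings of formula i in the world x, w_i is the weight attached to that formula, and Z normalises over all possible worlds. Learning fits the weights w_i from data; inference computes marginal probabilities under this distribution, which is what gives you the “probability that you will get cancer” above.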

Since I am writing about it here, it is clear that this is applicable to EVERYTHING:

Markov logic serves as a general framework which can not only be used for the emerging field of statistical relational learning, but also can handle many classical machine learning tasks which many users are familiar with. […] Markov logic presents a language to handle machine learning problems intuitively and comprehensibly. […] We have applied it to link prediction, collective classification, entity resolution, social network analysis and other problems.

The others

Out of the three I think I find Markov Logic Networks the most interesting, perhaps because they nicely bridge the symbolic and sub-symbolic AI worlds. This is a personal problem for me: as a Semantic Web person I cannot readily dismiss symbolic AI, but the more I read about kernels, conditional random fields, online learning using gradient descent, etc., the more I realise that rule-learning and inductive logic programming probably aren’t going to catch up any time soon. NCD is a nice hack, but I tested it on clustering RDF resources, comparing the distance measure from my thesis with gzip’ping the RDF/XML or Turtle, and it did much worse. HTM still strikes me as a bit over-hyped, but I will of course be sorry when they bring the intelligent robots to market in 2011.

Some other learning frameworks that nearly made it into this post:

  • http://www.idsia.ch/~juergen/

(wow – this got much longer than I intended)

An Objective look at the Billion Triple Data

For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it’ll be mostly tables. Enjoy :)

The BTC data contains 279,710,101 unique objects in total. Out of these:

  • 90,007,431 appear more than once
  • 7,995,747 more than 10 times
  • 748,214 more than 100
  • 43,479 more than 1,000
  • 3,209 more than 10,000

The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects are:

#triples object
2,584,960 http://www.geonames.org/ontology#P
2,645,095 http://www.aktors.org/ontology/portal#Article-Reference
2,681,771 http://www.w3.org/2002/07/owl#Class
5,616,326 http://www.aktors.org/ontology/portal#Person
7,544,903 http://www.geonames.org/ontology#Feature
9,115,801 http://en.wikipedia.org/
12,124,378 http://xmlns.com/foaf/0.1/OnlineAccount
13,687,049 http://purl.org/rss/1.0/item
14,172,852 http://rdfs.org/sioc/types#WikiArticle
38,795,942 http://xmlns.com/foaf/0.1/Person

Apart from the wikipedia link, all are types. No literals appear in the top 10 table. For the 116M unique literals we have 12,845,021 literals with a language tag and 2,067,768 with a datatype tag. The top 10 literals are:

#triples literal
722,221 “0”^^xsd:integer
969,929 “1”
1,024,654 “Nay”
1,036,054 “Copyright © 2009 craigslist, inc.”
1,056,799 “text”
1,061,692 “text/html”
1,159,311 “0”
1,204,996 “en-us”
2,049,638 “Aye”
2,310,681 “application/rdf+xml”

I can’t be bothered to check it now, but I guess the many Ayes & Nays come from IRC chatlogs (#SWIG?).

Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 unicode characters long (I wonder about this; it is suspiciously close to 2^16 bytes, and since some unicode characters take more than one byte, could it be truncated?). The distribution of literal lengths looks like this:

Most literals are around 10 characters long; there is a peak at 19 characters, which I seem to remember was caused by the standard time format (e.g. 2005-10-30T10:45UTC) being exactly 19 characters.
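
The histogram itself is trivial to produce; a minimal sketch, assuming a hypothetical file with one literal value per line (however you extracted them):

from collections import Counter

lengths = Counter()
with open("literals.txt", encoding="utf-8") as f:   # hypothetical: one literal per line
    for line in f:
        lengths[len(line.rstrip("\n"))] += 1

for length in sorted(lengths):
    print(f"{length:>6} {lengths[length]:,}")

# sanity check for the peak at 19:
print(len("2005-10-30T10:45UTC"))   # -> 19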

That’s it! I believe I now have published all my numbers on BTC :)

new site, new blog

I have been using http://gromgull.net as a hub for linking to my many online identities, and also as an OpenID delegate URL, but it was never a proper host; it just redirected to http://semikolon.co.uk/gromgull. This redirection was getting annoying, as different OpenID clients would save the OpenID URL differently: some would use what I typed, i.e. http://gromgull.net, some would use what this redirected to, and some would use the myopenid ID I actually delegated to. When the RSS feed of my (now old) phpsimpleblog also broke, it was time to upgrade properly. So here we go with a new domain and a wordpress installation. Well worth it for the two times a year I blog. (All the old posts have been moved here, with lots of hassle, but old comments have not.)

In other news, I was at ESWC09, organising SFSW09 as usual. All great fun; the Scripting Challenge at SFSW had especially high-quality entries this year. The winners are listed at the challenge page, and I would also recommend watching the screencast for Anca Luca’s Practical Semantic Works – a Bridge from the Users’ Web to the Semantic Web – although she did not win, she shows some amazing presentation skills! The whole experience is also documented on flickr.

Finally, ESWC brought the Billion Triple Challenge to my attention, and I wondered if I could possibly do some data-mining of some sort on this data. Downloading it, I quickly realised that it will not fit into any RDF database that I keep lying around, but since the data is in a nice one-triple-per-line N-QUADS format, I can process it with command-line tools like awk, sed, sort and friends. I promptly set to work, writing scripts for extracting literals (since they make the command-line processing trickier) and sorting and counting like mad. A week of CPU time later I realised that something was amiss: I had predicates that were simply “and”, and subject URIs that were <file://Documents … bugger. As it turns out, the data-set had some bugs, er, features, like URLs with spaces in them, and I’ve had to rewrite my script. Once it works, the details will appear here. (Andreas Harth agrees that this is a feature btw, and a new version of the BTC dataset will appear later.)
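
For the record, once the URIs no longer contain spaces, splitting an N-QUADS line only needs care in the object position, since that is the only place a literal with spaces can appear. A minimal sketch (filename hypothetical):

from collections import Counter

def parse_nquad(line):
    """Split one N-QUADS line into (subject, predicate, object, context).

    Subjects, predicates and contexts never contain spaces (in the fixed
    data), so only the object -- which may be a literal with spaces --
    needs any care: it is simply everything in the middle.
    """
    line = line.rstrip().rstrip(".").rstrip()   # drop the trailing " ."
    subj, pred, rest = line.split(None, 2)
    obj, ctx = rest.rsplit(None, 1)
    return subj, pred, obj, ctx

predicates = Counter()
with open("btc-2009-fixed.nq") as f:   # hypothetical filename
    for line in f:
        s, p, o, c = parse_nquad(line)
        predicates[p] += 1

print(predicates.most_common(10))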

End-user Requirement Analysis

I often refer to this one Dilbert comic, but I can never find it. Today I spent the time googling and found it using google book search. Now archived here for future reference:

Why’ve you drugged their onions!?

It's been a long time since an update here now, been busy bowling and drinking beer mostly. The bowling team is doing well, think we have a real shot at making it in the tournament this year. Got my favourite bowling ball from the bowling ball shining place yesterday, so now I'm ready to rumble.

Haven't had much time to play Starcraft lately though, and I'm afraid that mnem has been playing online and practicing.

See you later.

The window of shame

Time waster 1: Flash Elements TD (from Chris)

Time waster 2: Boomshine (you need the music to really appreciate it!)

(Title from here)

Gnowsis + Python = sweet bliss

Today I finished implementing the XML-RPC interface to Gnowsis 0.9 that we are doing for the Semouse people.
As an added bonus I can get away from evil Java for a bit and interface with Gnowsis in Python!

Gnowsis exports a range of XML-RPC methods; the ones I made today make up an interface to the different stores of gnowsis-server. Some javadoc will appear soon, but there are methods like addTriple, removeTriple, querySelect, queryConstruct.

Now with this in place I can write the python code:

#!/usr/bin/env python

from xmlrpclib import ServerProxy

server = ServerProxy("http://127.0.0.1:9993")

#print server.listMethods()
print getattr(server, "gnowsis-server_dataaccess.querySelect")("gunnar", "fulltext")

And it works! As soon as I can convince Leo we will rewrite the whole gnowsis GUI in python! :)

Should anyone feel like trying this, the code is in svn…

With the latest wiki stuff working, Gnowsis is really starting to free itself from the chains of ugly Swing GUI hell.