Posts categorized “PhD”.

quiet before the storm

Running like crazy for the final sprint of the PhD thesis at the moment, so no time to do anything fun to write about here, or to write here for that matter.

In an extremely motivated moment I googled "finishing a phd is impossible" and found this:

A graduate school survival guide

It is useful and funny overall, and includes several good quotes, like this one:

The Feynman Problem Solving Algorithm:
1) Write down the problem.
2) Think very hard.
3) Write down the solution.

I wish I had thought of that.

Also included is this one, oh-so-very appropriate for my planning-to-learn thesis:

"Failing to plan is planning to fail."

Plotting graphs with R

This morning I spent 3 hours, from 8am to 11am, trying to force R into plotting beautiful graphs. To increase the challenge I do all my R work through rpy, since the R command-line drives me crazy. Problems, solutions and lessons learned were as follows (if you've never used R this will look like gibberish):

  • To get square line ends you have to set the lend parameter using par BEFORE you call plot.

  • Getting Greek characters and other math notation into labels is rather complicated, and right out goddamn impossible through rpy. In the end I fell back to constructing long R commands as strings, then using r(command) to execute them.

  • It is not possible to have diagonal axis labels.

  • At least in my version of R (=2.1.0, 2.4.0 crashes with rpy), the axis command ignores the xaxp set with par when at is not specified. Use axTicks specifying xaxp instead.

In the end I got what I wanted though:

The axes here were generated with:

axis(1, at=c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0),
     labels=c(expression(paste("KNN ")),
              expression(paste("KNN ", gamma, "1")),
              expression(paste("KNN ", gamma, "2")),
              expression(paste("SLIPPER ", gamma, "1")),
              expression(paste("SLIPPER ", gamma, "2"))),
     las=3, srt=90)

Sub-scripting the number after gamma should also be possible, but I gave up.
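For what it's worth, here is a minimal Python sketch of the "build a long R command as a string" approach, including the subscript. It assumes R's plotmath notation, where gamma[1] renders as a gamma with subscript 1. The helper names plotmath_label and axis_command are made up for this illustration. The sketch only builds the string; with rpy it would then be passed to r(command).

```python
def plotmath_label(name, gamma_index=None):
    """Return an R plotmath expression for a label such as KNN or KNN gamma_1."""
    if gamma_index is None:
        return 'expression(paste("%s"))' % name
    # gamma[1] is plotmath notation for a gamma with subscript 1
    return 'expression(paste("%s ", gamma[%d]))' % (name, gamma_index)


def axis_command(labels, side=1, las=3):
    """Build an R axis() call with one tick per label, at positions 1..n."""
    positions = ", ".join(str(i) for i in range(1, len(labels) + 1))
    return "axis(%d, at=c(%s), labels=c(%s), las=%d)" % (
        side, positions, ", ".join(labels), las)


labels = [
    plotmath_label("KNN"),
    plotmath_label("KNN", 1),
    plotmath_label("SLIPPER", 2),
]
command = axis_command(labels)
print(command)
```

Generating the tick positions from the number of labels keeps at and labels the same length, which axis() requires.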

Oh and before you say: "why didn't you just use excel/openoffice" – Because where would the wonderful vector-based PDF antialiased goodness be then?

N3 Language Definitions for source-highlight

I just found I had to include some RDF files in the appendix of my thesis. Going with the "normal people shouldn't ever have to see RDF/XML" line, I will include them as N3 (which also makes sense since that's how I wrote them).

source-highlight helped me typeset beautiful Prolog before, but lacks a language definition for N3. Luckily the definition language is simple, and by stealing definitions left, right and center this only took 20 minutes.

I give you: the n3.lang file, example output for html, and for latex here as PDF.

On writing

I was reading E.W. Dijkstra's EWD1000 today, and came across this quote:

If there is one "scientific" discovery I am proud of, it is the discovery of the habit of writing without publication in mind. I experience it as a liberating habit: without it, doing the work becomes one thing and writing it down becomes another one, which is often viewed as an unpleasant burden. When working and writing have merged, that burden has been taken away.

and this one:

[blah – about something he wrote] .. Had I only written with publication in mind, it would never have seen the light of day. […] The only way to discover that a neglected or ignored topic is worth writing about is to write about it.

Related is probably that I've been spending a lot of time actually writing my thesis these days (I'm now at 39 pages, hurrah!), which is fun because putting things on paper forces me to think about them, and I realise how incomplete some of my initial "research" was…

While writing my thesis I also come across lots of questions that I did not previously consider:

  • Does one write a thesis in first person? Or is it still "we did blah"? I feel odd writing "I decided X, I discovered Y";
  • Will I print my thesis in color? Partially?
  • How do you choose what symbols to use for formulas/theorems/etc? I'm sure I read about this somewhere before, but the closest I can find now is this Guide to Writing Mathematics;
  • How much detail do you include on well-known technologies? For instance, I write about multidimensional scaling. The method is from the 50s, surely anyone who cares knows by now? :) On the other hand, some details might be useful for a discussion.

…and now it's probably time to get back to *real* work/writing.

AI without all the work

I finally got around to reading Brooks' intelligence without representation, given to me by Frank ages ago.

The paper discusses an interesting architecture for intelligent programs that does not rely on a representation of the external world (etc., read the paper). Which is interesting… BUT more interesting, I found, is the discussion of how many AI researchers deceive themselves by using overly simple scenarios for their experiments, either virtual worlds such as box-world, or simplified versions of the real world, with matte walls, colour-coded objects etc. (There is a related, but kind of opposite, argument made by Hofstadter, which I won't go into here.)

Brooks argues that the only way to develop intelligent systems is

[…] to build completely autonomous mobile agents that co-exist in the world with humans, and are seen by those humans as intelligent beings in their own right.

These claims were made in 1987, and in the last 20 years the internet has brought us a completely new "real" world where many people spend hours every day. We now have a complex and, for any human purpose, practically infinite world of things to interact with.

The internet removed one layer of the difficulties of perception: the need to interpret the not very well understood, noisy, high-bandwidth channels of sound and vision. Instead an intelligent "creature" (to borrow Brooks' terminology) can work on textual documents. These are still noisy, but understanding natural language at least seems slightly easier than image understanding.

Perhaps with the advent of the Semantic Web the life of AI researchers has become much easier again. What previously was an unrealistic "abstraction" of the problem, i.e. ignoring the text-parsing and understanding problems, claiming that someone else would solve this and that our work takes the already extracted semantic content as input, has now become quite a reasonable argument.

I suppose what I'm saying at the end of the day is "Thank you Semantic Web people", you have made it possible for me to work on what I grandiosely call an "intelligent" system, without having to solve ALL the problems!

GOOOOOAAAAALLL directed learning…

This is the first real result from Smeagol, where it actually makes a plan to learn and does succeed!


I realise that this diagram is probably completely incomprehensible to anyone but me, so here is a quick explanation:

1. The "top-level goal" given in this case is to answer the query:

[] ql:select ( ?s ) ;
   ql:where {
     ?s a bibtex:InProceedings .
     ?s foaf:maker ?a .
     ?a pers:expertIn wiki:Semantic_Web
   } ;
   ql:results ?r .

i.e. give all papers written by Semantic Web Experts.

2. Smeagol first makes the trivial plan to read some data (there are many issues about this, I skip them all), and perform the query.

3. The trivial plan fails (otherwise this would be boring), because there are no people who are Semantic Web experts who have written any papers (in my extremely artificial hand-made dataset). Papers written by other people DO exist…

4. Several plans are made that may introduce more triples matching any of the patterns evaluated in the query before it failed. In this particular case my heuristic reordered the query to be [(?s a InProceedings), (?a expertIn SemanticWeb), (?s maker ?a)], i.e. find all the people and papers first and do the join later. It was the very last pattern that failed to match, so further results for any of these patterns could be useful.

5. The easiest plan (the actions have weights) is chosen first: read the ontology for the bibtex classes, and attempt to find more "things" that are "InProceedings" based on RDFS inference. Maybe they are only explicitly declared to be of a subclass of InProceedings, for example. Unfortunately, this is not the case, and this plan fails.

6. Smeagol returns to the second easiest plan, which involves actual learning: Find the set of things that are already of type InProceedings => attempt to use ILP to learn a description of this set => use this description to classify further instances as InProceedings. Alas, this also fails. In this case it could not find any good negative examples, but I won't go into that here.

7. On to the third easiest plan, which uses the same pattern as the previous plan, but finds a set of people who are Semantic Web experts instead. In this case the description learning DOES work and we learn the rule:

{ ?A <> <> }
{ ?A rabbi:learnedCategory "blah" }.

8. This rule is then used to classify more instances as Semantic Web experts, and since I very carefully constructed this example it finds one! :)

9. After the sub-plan has succeeded, Smeagol returns to the previous failed plan and re-tries the failed action, and now the query works. Hurrah!
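The control flow in steps 2 to 9 can be sketched roughly as below. This is not Smeagol's actual code, just a minimal Python illustration of "try the action, on failure run repair plans cheapest first, then re-try". The function run_with_repairs, the plan names and the weights are all invented for the example.

```python
def run_with_repairs(action, repair_plans):
    """Try an action; on failure run repair plans, cheapest first, then retry.

    `action` is a callable returning True on success.
    `repair_plans` is a list of (weight, name, callable) tuples.
    Returns the steps attempted as (name, succeeded) pairs.
    """
    log = []
    ok = action()
    log.append(("query", ok))
    if ok:
        return log
    for weight, name, plan in sorted(repair_plans, key=lambda p: p[0]):
        succeeded = plan()
        log.append((name, succeeded))
        if succeeded:
            # A repair worked, so re-try the action that failed earlier.
            log.append(("query", action()))
            break
    return log


# Toy state standing in for the hand-made dataset (hypothetical values).
known_experts = set()
authors = {"paper1": "alice"}


def query():
    # Succeeds once at least one paper author is a known expert.
    return any(a in known_experts for a in authors.values())


def rdfs_inference():
    return False  # finds no further InProceedings


def learn_inproceedings():
    return False  # no good negative examples


def learn_experts():
    known_experts.add("alice")  # the learned rule classifies one more expert
    return True


plans = [
    (3, "learn experts", learn_experts),
    (1, "RDFS inference", rdfs_inference),
    (2, "learn InProceedings", learn_inproceedings),
]
log = run_with_repairs(query, plans)
print(log)
```

The real planner chains actions through preconditions and effects. The sketch only shows the ordering by weight and the re-try after a successful sub-plan.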

All in all this took a ridiculous number of lines of code, many hours of debugging, and the result still isn't very impressive, but hopefully I have ironed out most planning bugs now and I can get to work on creating the final set of examples that will finish off this PhD! :)

Applying market basket analysis to RDF

Since 3 is a magic number I'd really like to have 3 different learning algorithms used in Smeagol. Currently I have ILP and HAC-clustering, both applied in several different ways. Sequence/Basket analysis seems like a good candidate for the third algorithm, since it's the only area of ML not covered yet. (ILP covering classification and more…)
Sequence analysis would of course require a time dimension in the data, which I'd really rather not get into, AND it was probably covered pretty well by Heather Maclaren.
Basket analysis is left, and my first attempt was quickly hacked up using Orange. The things in my baskets are predicate-value pairs, and each person becomes a basket on their own. I tried this on several data-sets I had lying around, here are some quick and dirty results:

A small subset of my IMDB Data (3534 triples) gave me:

rdf#type IMDB#Movie -> IMDB#languages English


rdf#type IMDB#Movie -> IMDB#country_USA

My email from the last 5 years as crawled by Aperture (127615 triples) gave me the fascinating rule:

aperture:mimeType message/rfc822 -> rdf#type imap/Message

A subset of some old FOAF crawl stolen from JibberJim years ago gave me:

jim#isKnownBy norman.walsh#norman-walsh -> rdf#type foaf/Person

Yes, fascinating indeed. I also found the Norman Walsh rule using ILP years ago; at least running this one was pretty fast.

I'm not sure what to conclude from this – none of the rules are groundbreaking OR that interesting. Maybe I can tweak the way items are represented, using just values or just predicates for example. I'll see tomorrow.
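To make the representation concrete, here is a minimal sketch of the basket construction in plain Python. The actual runs used Orange, which this does not reproduce. It builds one basket of (predicate, value) pairs per subject and reports single-item rules with their support and confidence. The function names, thresholds and toy triples are made up for illustration.

```python
from collections import defaultdict
from itertools import permutations


def baskets_from_triples(triples):
    """Group (predicate, value) pairs by subject: one basket per subject."""
    baskets = defaultdict(set)
    for s, p, o in triples:
        baskets[s].add((p, o))
    return dict(baskets)


def pair_rules(baskets, min_support=0.5, min_confidence=0.9):
    """Return single-item rules lhs -> rhs as (lhs, rhs, support, confidence)."""
    n = len(baskets)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for lhs, rhs in permutations(items, 2):
            pair_count[(lhs, rhs)] += 1
    rules = []
    for (lhs, rhs), both in pair_count.items():
        support = both / n                    # share of baskets with both items
        confidence = both / item_count[lhs]   # share of lhs baskets that also have rhs
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Invented toy data, not taken from the IMDB set.
triples = [
    ("m1", "rdf:type", "Movie"), ("m1", "language", "English"),
    ("m2", "rdf:type", "Movie"), ("m2", "language", "English"),
    ("m3", "rdf:type", "Movie"), ("m3", "language", "French"),
]
rules = pair_rules(baskets_from_triples(triples))
for lhs, rhs, support, confidence in rules:
    print(lhs, "->", rhs, round(support, 2), round(confidence, 2))
```

On this toy data the only rule that survives is "language English -> rdf:type Movie", which is about as exciting as the real ones. Dropping the predicate or the value from each item would only mean changing what baskets_from_triples puts into the sets.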

I also had a brain-storming session with myself and some gin'n'tonic today, and if I don't finish this PhD it's because the table wasn't big enough:

i love it when a plan comes together

Recently I decided that for the planning part of Smeagol it would make sense if all the available actions were specified in RDF. This seemed sensible since I was re-factoring the action/planning part of Smeagol anyway – it was all pretty much coded in one coffee-fuelled weekend just before some paper deadline and was now completely incomprehensible to anyone.

Moving to RDF also seemed right and elegant because I enjoy moving "things" into a more explicit and structured format; like how my very first thoughts on the internal structuring of Smeagol moved from being hard-coded to being dynamically planned, or how someone (Malte?) pointed out that Semantic Wikis and the Infoboxes of Wikipedia are steps on a road where soon all content will be represented in structured form.

Anyway, I digress. An action in Smeagol now looks like this:
(I steal freely from N3 Rules and cwm in general).

@prefix : <>.
@prefix ac: <>.
@prefix math: <>.
@prefix rdf: <>.
@prefix rdfs: <>.

ac:read a :Action ;
rdfs:label "read" ;
:in ( ?u ) ;
:out () ;
:preconditions { ?u :source ?source . ?u :triples ?t1 } ;
:effects { ?u :triples ?t2 . ?t2 math:greaterThan ?t1} .

This represents an action "read something about this URI (?u)" – it needs a source of information about this URI to be known beforehand, and if executed it will result in Smeagol knowing more triples about ?u than it did before. Hardly rocket science, but I like the idea of using :triples as a magic property, which is evaluated rather than looked up in the graph. The idea is of course old, because cwm and N3 rules/queries do exactly the same with builtin functions. Oh well – this time I got the idea from the handling of full-text Lucene queries in SPARQL as done in Gnowsis.
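A minimal sketch of what "evaluated rather than looked up" could mean, in plain Python over a set of triples. This is an illustration only, not Smeagol's implementation. The names MAGIC_PROPERTIES, count_triples and values are hypothetical, and :triples is assumed here to mean "the number of triples with this subject".

```python
def count_triples(graph, subject):
    """Number of triples in the graph that have `subject` as their subject."""
    return sum(1 for s, _, _ in graph if s == subject)


# Predicates that are computed on demand instead of being stored in the graph.
MAGIC_PROPERTIES = {":triples": count_triples}


def values(graph, subject, predicate):
    """Return the objects for (subject, predicate), evaluating magic properties."""
    if predicate in MAGIC_PROPERTIES:
        return [MAGIC_PROPERTIES[predicate](graph, subject)]
    return [o for s, p, o in graph if s == subject and p == predicate]


graph = {("ex:doc", ":source", "ex:file")}
before = values(graph, "ex:doc", ":triples")[0]

# Stand-in for executing ac:read: one new triple about ex:doc becomes known.
graph.add(("ex:doc", "dc:title", "A title"))
after = values(graph, "ex:doc", ":triples")[0]

# The math:greaterThan check from the :effects block.
effect_holds = after > before
print(before, after, effect_holds)
```

Because :triples is never stored, the precondition and the effect can both refer to it without the graph needing to be updated to keep the count in sync.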

So this evening I finished the planning module of Smeagol, and while I was developing it I used a list of only 2 actions for testing. And when it appeared to work I uncommented the other actions in the n3 file, and whoho! It just worked! And 4 different plans were generated! … and then I thought of the subject line and sniggered to myself.