Applying market basket analysis to RDF

Since 3 is a magic number, I'd really like to have 3 different learning algorithms used in Smeagol. Currently I have ILP and HAC clustering, both applied in several different ways. Sequence/basket analysis seems like a good candidate for the third algorithm, since it's the only area of ML not covered yet (ILP covering classification and more…).
Sequence analysis would of course require a time dimension in the data, which I'd really rather not get into, and it was probably covered pretty well by Heather Maclaren.
That leaves basket analysis, and my first attempt was quickly hacked up using Orange. The things in my baskets are predicate-value pairs, and each person becomes a basket on their own. I tried this on several datasets I had lying around; here are some quick and dirty results:

A small subset of my IMDB Data (3534 triples) gave me:

rdf#type IMDB#Movie -> IMDB#languages English


rdf#type IMDB#Movie -> IMDB#country_USA

My email from the last 5 years, as crawled by Aperture (127615 triples), gave me the fascinating rule:

aperture:mimeType message/rfc822 -> rdf#type imap/Message

A subset of some old FOAF crawl stolen from JibberJim years ago gave me:

jim#isKnownBy norman.walsh#norman-walsh -> rdf#type foaf/Person

Yes, fascinating indeed. I also found the Norman Walsh rule using ILP years ago, but at least running this one was pretty fast.
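
For anyone wanting to play along, the basket setup described above (one basket per subject, predicate-value pairs as items) can be sketched in a few lines of plain Python. This is only a rough sketch and not the Orange code: the toy triples, the thresholds and the names build_baskets and mine_rules are all made up for illustration, and it only looks for rules with a single item on each side.

```python
from collections import defaultdict
from itertools import permutations


def build_baskets(triples):
    """Group (subject, predicate, value) triples into one basket per subject.

    Each item in a basket is a (predicate, value) pair.
    """
    baskets = defaultdict(set)
    for subject, predicate, value in triples:
        baskets[subject].add((predicate, value))
    return dict(baskets)


def mine_rules(baskets, min_support=0.5, min_confidence=0.9):
    """Return rules A -> B as (A, B, support, confidence), one item per side."""
    n = len(baskets)
    item_count = defaultdict(int)   # number of baskets containing an item
    pair_count = defaultdict(int)   # number of baskets containing both items
    for items in baskets.values():
        for a in items:
            item_count[a] += 1
        for a, b in permutations(items, 2):
            pair_count[(a, b)] += 1

    rules = []
    for (a, b), both in pair_count.items():
        support = both / n                 # fraction of baskets with A and B
        confidence = both / item_count[a]  # fraction of A-baskets that have B
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical toy data, loosely shaped like the IMDB example.
TRIPLES = [
    ("m1", "rdf#type", "IMDB#Movie"),
    ("m1", "IMDB#languages", "English"),
    ("m2", "rdf#type", "IMDB#Movie"),
    ("m2", "IMDB#languages", "English"),
    ("m3", "rdf#type", "IMDB#Movie"),
    ("m3", "IMDB#languages", "French"),
    ("p1", "rdf#type", "IMDB#Person"),
]

rules = mine_rules(build_baskets(TRIPLES), min_support=0.5, min_confidence=0.6)
for a, b, support, confidence in rules:
    print(a, "->", b, f"support={support:.2f} confidence={confidence:.2f}")
```

Even on the toy data the only rules that survive involve the rdf#type item, which sits in nearly every basket. Items that frequent will dominate the output whatever the thresholds are.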

I'm not sure what to conclude from this – none of the rules are groundbreaking OR that interesting. Maybe I can tweak the way items are represented, using just values or just predicates for example. I'll see tomorrow.
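
The representation tweak could look something like the sketch below: the same one-basket-per-subject grouping, with a switch for what counts as an item. The mode names, the build_baskets helper and the toy FOAF-ish triples are assumptions for illustration only, not something taken from Orange or Smeagol.

```python
from collections import defaultdict


def build_baskets(triples, mode="pair"):
    """One basket per subject; `mode` picks what counts as an item.

    mode="pair"      -> (predicate, value) tuples
    mode="predicate" -> just the predicate
    mode="value"     -> just the value
    """
    pick = {
        "pair": lambda p, o: (p, o),
        "predicate": lambda p, o: p,
        "value": lambda p, o: o,
    }[mode]
    baskets = defaultdict(set)
    for subject, predicate, value in triples:
        baskets[subject].add(pick(predicate, value))
    return dict(baskets)


# Hypothetical toy triples.
TRIPLES = [
    ("jim", "rdf#type", "foaf/Person"),
    ("jim", "foaf/knows", "norman"),
    ("norman", "rdf#type", "foaf/Person"),
]

by_pair = build_baskets(TRIPLES, "pair")
by_predicate = build_baskets(TRIPLES, "predicate")
by_value = build_baskets(TRIPLES, "value")
```

Predicate-only baskets would give rules about which properties tend to occur together, and value-only baskets rules about which resources or literals do, so the three variants are asking quite different questions of the same data.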

I also had a brain-storming session with myself and some gin'n'tonic today, and if I don't finish this PhD it's because the table wasn't big enough:
