## Posts categorized “Java”.

### Three simple methods for graphing datastreams

This is not very cutting edge, nor all THAT exciting – but it seemed worth putting the things together into one post for someone else who encounters the same scenario.

The problem is this: You have a stream of data-points coming in, you do not know how many points and you do not know the range of the numbers. You would like to draw a pretty little graph of this data. Since there is an unknown number of points, but potentially massive, you cannot keep all numbers in memory, so look at each once and pass it on. It’s essentially an “online (learning) scenario. The goal here is to draw small graph to give an idea of the trend of the data, not careful analysis – so a fair bit of approximation is ok.

In our case the numbers are sensors readings from agricultural machines, coming in through ISOXML (you probably really don’t want to go there, but it’s all in ISO 11783) documents. The machine makes readings every second, which adds up when it drives for days. This isn’t quite BIG DATA, but there may be millions of points, and in our case we want to run on a fairly standard PC, so an online approach was called for.

For the first and simplest problem, you have data coming in at regular intervals, and you want a (much) shorter array of values to plot. This class keeps an array of some length N, at first we just fill it with the incoming numbers. If we see more than N numbers, we “zoom out” by a factor of two, rewrite the N numbers we’ve seen to N/2 numbers by averaging them, and from now on, every two incoming numbers are averaged into each cell in our array. Until we reach N*2 numbers, then we zoom out again, now averaging every 4 numbers, etc. Rewriting the array is a bit of work, but it only happens log(your total number of points) times. In the end you end up with somewhere between N/2 and N points.

```/**
* Create a summary time-series of some time-series of unspecified length.
*
* The final time-series will be somewhere between N/2 and N long
* (Unless less numbers are given of course)
*
* For example:
* TimeLogSummariser t = new DDISummaryGraph.TimeLogSummariser(4);
* for (int i=1; i<20; i++) t.add(i);
* => [3.0, 7.25, 18.0]
*
*/
public class TimeLogSummariser {

double data[];
int N;
int n=1;
int i=0;
public TimeLogSummariser(int N) {
if (N%2 != 0 ) N++;
this.N=N;
data=new double[N];
}
public void zoomOut() {
for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
for (int j=N/2;j<N;j++) data[j]=0;
n*=2;
}

int j=i%n;
int idx=i/n;

if (idx>=N) {
zoomOut();
j=i%n;
idx=i/n;
}
data[idx]=(data[idx]*j+d)/(j+1);
i++;
}

public double[] getData() {
return Arrays.copyOfRange(data, 0, i/n+1);
}

}
```

Now, after programming this I realised that most of my tractor-sensor data actually does not come in at regular intervals. You’ll get stuff every second for a while, then suddenly a 3 second gap, or a 10 minute smoke-break or a 40 minute lunch-break. Treating every point as equal does not give you the graph you want. This only makes things slightly more complicated though, instead always stepping one step in the array, the step depends on the difference in time between the two points. We assume (and hope) that data always arrives in sequential order. Also, since we may take many steps at once now, we may need to “zoom out” more than once to fit in the array. Otherwise the code is almost the same:

```
/**
* Create a summary time-series of some time-series of unspecified length.
*
* The final time-series will be somewhere between N/2 and N long
* (Unless less numbers are given of course)
*
* This class will do correctly do averages over time.
* The constructor takes the stepsize as a number of milleseconds
* i.e. the number of milliseconds between each recording.
* Each value is then given with a date.
*
* For example:
* DiffDDISummaryGraph t = new DiffDDISummaryGraph.TimeLogSummariser(4, 1000);
* for (int i=1; i<20; i++) t.add(i, new Date(i*2000));
* => [3.0, 7.25, 18.0]
*
*/
public class DiffTimeLogSummariser {

double data[];

private int N;

int n=1;
double i=0;

private long stepsize;

private Date last;
private Date start;

public DiffTimeLogSummariser(int N, long stepsize) {
this.stepsize=stepsize;
if (N%2 != 0 ) N++;
this.N=N;
data=new double[N];
}
public void zoomOut() {
for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
for (int j=N/2;j<N;j++) data[j]=0;
n*=2;
}

public void add(double d, Date time) {

long diff;
if (last!=null) {
diff=time.getTime()-last.getTime();
} else {
start=time;
diff=0;
}
if (diff<0) {
System.err.println("DiffTimeLogSummarizer got diff<0, ignoring.");
return ;
}

i+=diff/(double)stepsize;

int j=(int) (Math.round(i)%n);
int idx=(int) (Math.round(i)/n);

while (idx>=N) {
zoomOut();
j=(int) (Math.round(i)%n);
idx=(int) (Math.round(i)/n);
}
data[idx]=(data[idx]*j+d)/(j+1);
last=time;
}

public double[] getData() {
return Arrays.copyOfRange(data, 0, (int) (i/n+1));
}

public Date getStart() {
return start;
}

public long getStepsize() {
return stepsize*n;
}
}
```

This assumes that if there is a gap in the data, that means the value was 0 at these intervals. Whether this is true depends on your application, in some cases it would probably make more sense to assume no data means the value was unchanged. I will leave this as an exercise for the reader :)

Finally, sometimes you are more interested in the distribution of the values you get, rather than how they vary over time. Computing histograms on the fly is also possible, for uniform bin-sizes the algorithm is almost the same as above. The tricky bits here I’ve stolen from Per on stackoverflow:

```/**
* Create a histogram of N bins from some series of unknown length.
*
*/
public class TimeLogHistogram {

int N;  // assume for simplicity that N is even
int counts[];

double lowerBound;
double binSize=-1;

public TimeLogHistogram(int N) {
this.N=N;
counts=new int[N];
}

if (binSize==-1) {
lowerBound=x;
binSize=1;
}
int i=(int) (Math.floor(x-lowerBound)/binSize);

if (i<0 || i>=N) {
if (i>=N)
while (x-lowerBound >=N*binSize) zoomUp();
else if (i<0)
while (lowerBound > x) zoomDown();
i=(int) (Math.floor(x-lowerBound)/binSize);
}
counts[i]++;
}

private void zoomDown() {
lowerBound-=N*binSize;
binSize*=2;
for (int j=N-1;j>N/2-1;j--) counts[j]=counts[2*j-N]+counts[2*j-N+1];
for (int j=0;j<N/2;j++) counts[j]=0;
}

private void zoomUp() {
binSize*=2;
for (int j=0;j<N/2;j++) counts[j]=counts[2*j]+counts[2*j+1];
for (int j=N/2;j<N;j++) counts[j]=0;
}

public double getLowerBound() {
return lowerBound;
}
public double getBinSize() {
return binSize;
}
public int[] getCounts() {
return counts;
}
public int getN() {
return N;
}
}
```

Histograms with uneven binsizes gets trickier, but is apparently still possible, see:

Approximation and streaming algorithms for histogram construction problems.
Sudipto Guha, Nick Koudas, and Kyuseok Shim ACM Trans. Database Syst. 31(1):396-438 (2006)

That’s it – I could have put in an actual graph figure here somewhere, but it just be some random numbers, just imagine one.

(Apologies for Java code examples today, your regular python programming will resume shortly)

### What’s in the Axis2_in_OSGI bundle? PAIN!

And more pain!

In the last two days I've been trying to deploy Axis2 as an OSGI bundle, and it has proven almost completely impossible.

The reason for this was the general idea that axis2 should be better than axis1 (see, it goes to 11, eh, to 2) and the developers are very excited about it, saying "Axis2 is very much straight forward and friendly to use than it's predecessor.". Now after two days and no hair left I wish I had googled "Axis2 sucks" earlier, cause it gave me this jewel of an article: Axis2 – why bother.

Read it all, but I have to pull out this quote:

[..blah, it sucks..] It really is a gift that keeps on giving. Deployment brings its own special joy sauce to burn your eyes out with and make your bottom cry rivers of brown sadness. Everything is hardcoded to a specific context path, and deploying the simplest hello world service is more likely than not to result in a jbossian stacktracefest.

PS: This post included no sensible or useful information, sorry.

### Launching files from Java

(I try again, with considerably less love than last time)

Opening files in their "natural" application was something I knew had to be done sooner or later for aperture/gnowsis/nepomuk, but I put it off for as long as possible because I had a horrible feeling that to get it to work on all platforms it would be long and messy and maybe involve nasty C code.

Imagine my surprise when in a few hours today I got it working for files, directories and web links, and this on both windows, macosx and linux (kde/gnome). All without leaving the world of Java. The code is so short I will paste it here (beautifully formatted by code2html!)

```private void windowsopen(URI uri) throws IOException {
Runtime.getRuntime().exec(
new String [] { "rundll32", "url.dll,FileProtocolHandler",uri.toString() });
}

private void linuxopen(URI uri) throws IOException {
// TODO: I don't know how reliable this is.
// It's set correctly for kde/gnome on my machine
if (System.getenv("DESKTOP_SESSION").toLowerCase().contains("kde")) {
//kde:
Runtime.getRuntime().exec(new String [] { "kfmclient","exec",uri.toString()} );
} else {
//Default to gnome as it complains less if it's not running.
Runtime.getRuntime().exec(new String [] { "gnome-open",uri.toString()} );
}
}

private void macopen(URI url) throws IOException {
try {
Class macopener = Class.forName("com.apple.eio.FileManager");
Method m = macopener.getMethod("openURL",new Class[] {String.class});
m.invoke(null,new Object[] {url.toString()});
} catch (Exception e) {
throw new IOException("Could not open URI: "+url+" - "+e);
}
}
```

This is for opening http links, but the file code is almost identical, KDE and Windows doesn't like file:// URIs so I convert them to filenames first, and launching things in Windows is done by executing "cmd /c blah" instead (Thanks Michael Sintek!).

Still damn easy. It's not very well tested yet – but it works on my three instances of each OS! If anyone feels like testing it the code is in the aperture cvs repos, FileOpener.java and HttpOpener.java (although it's only just committed so it'll take a while for the sourceforge cvs mirrors to catch up)

### Mork

Plan for Today: write an Aperture datasource for the thunderbird addressbook, this to pass time before lunch while I didn't have my mac, and should be quick and easy.

Fast forward till lunchtime, where I've found that Thunderbird uses the MORK format for it's addressbook. Mork is a "barely machine readable", "stupidest file format I've ever seen", "look-ma-I-wrote-a-DB" excuse for a DB format.

It was designed by some netscape engineer in the 90s who then disappeared and now no-one knows how it works. It looks a bit like this:

<(91=2)(81=Test)(82=Bob)(83=)(84=Test Bob)(87=bobnickname)(85
=bob@tester.com)(88=bobby@serial.co.uk)(80=0)(89=workphone)(8A
=homephone)(8B=faxphone)(8C=papegerphone)(8D=mobphone)(8E
=bobscreenname)(8F=441048b8)(86=1)(90=bill)>
{1:^80 {(k^BF:c)(s=9)}
[1:^82(^BE=2)]
[1(^83^81)(^84^82)(^85=)(^86=)(^87^84)(^88^87)(^89^85)(^8A^85)(^8B^88)
(^8C=)(^8D=)(^8E=0)(^8F=0)(^90^89)(^91^8A)(^92^8B)(^93^8C)(^94^8D)
(^95=)(^96=)(^97=)(^98=)(^99=)(^9A=)(^9B=)(^9C=)(^9D=)(^9E=)(^9F=)…

Luckily someone else has sort of reverse-engineered it, and there exist partial perl parser and one python version.
I've now added to the madness by converting the python one to Java. It sucked and took all day. (And someone pointed out I could have done it with jython, oh well). To make up for it I'm going to share it with the world.