This is not very cutting edge, nor all THAT exciting – but it seemed worth putting the things together into one post for someone else who encounters the same scenario.
The problem is this: You have a stream of data-points coming in, you do not know how many points and you do not know the range of the numbers. You would like to draw a pretty little graph of this data. Since there is an unknown number of points, but potentially massive, you cannot keep all numbers in memory, so look at each once and pass it on. It’s essentially an “online (learning) scenario. The goal here is to draw small graph to give an idea of the trend of the data, not careful analysis – so a fair bit of approximation is ok.
In our case the numbers are sensors readings from agricultural machines, coming in through ISOXML (you probably really don’t want to go there, but it’s all in ISO 11783) documents. The machine makes readings every second, which adds up when it drives for days. This isn’t quite BIG DATA, but there may be millions of points, and in our case we want to run on a fairly standard PC, so an online approach was called for.
For the first and simplest problem, you have data coming in at regular intervals, and you want a (much) shorter array of values to plot. This class keeps an array of some length N, at first we just fill it with the incoming numbers. If we see more than N numbers, we “zoom out” by a factor of two, rewrite the N numbers we’ve seen to N/2 numbers by averaging them, and from now on, every two incoming numbers are averaged into each cell in our array. Until we reach N*2 numbers, then we zoom out again, now averaging every 4 numbers, etc. Rewriting the array is a bit of work, but it only happens log(your total number of points) times. In the end you end up with somewhere between N/2 and N points.
/**
* Create a summary time-series of some time-series of unspecified length.
*
* The final time-series will be somewhere between N/2 and N long
* (Unless less numbers are given of course)
*
* For example:
* TimeLogSummariser t = new DDISummaryGraph.TimeLogSummariser(4);
* for (int i=1; i<20; i++) t.add(i);
* => [3.0, 7.25, 18.0]
*
*/
public class TimeLogSummariser {
double data[];
int N;
int n=1;
int i=0;
public TimeLogSummariser(int N) {
if (N%2 != 0 ) N++;
this.N=N;
data=new double[N];
}
public void zoomOut() {
for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
for (int j=N/2;j<N;j++) data[j]=0;
n*=2;
}
public void add(double d) {
int j=i%n;
int idx=i/n;
if (idx>=N) {
zoomOut();
j=i%n;
idx=i/n;
}
data[idx]=(data[idx]*j+d)/(j+1);
i++;
}
public double[] getData() {
return Arrays.copyOfRange(data, 0, i/n+1);
}
}
Now, after programming this I realised that most of my tractor-sensor data actually does not come in at regular intervals. You’ll get stuff every second for a while, then suddenly a 3 second gap, or a 10 minute smoke-break or a 40 minute lunch-break. Treating every point as equal does not give you the graph you want. This only makes things slightly more complicated though, instead always stepping one step in the array, the step depends on the difference in time between the two points. We assume (and hope) that data always arrives in sequential order. Also, since we may take many steps at once now, we may need to “zoom out” more than once to fit in the array. Otherwise the code is almost the same:
/**
* Create a summary time-series of some time-series of unspecified length.
*
* The final time-series will be somewhere between N/2 and N long
* (Unless less numbers are given of course)
*
* This class will do correctly do averages over time.
* The constructor takes the stepsize as a number of milleseconds
* i.e. the number of milliseconds between each recording.
* Each value is then given with a date.
*
* For example:
* DiffDDISummaryGraph t = new DiffDDISummaryGraph.TimeLogSummariser(4, 1000);
* for (int i=1; i<20; i++) t.add(i, new Date(i*2000));
* => [3.0, 7.25, 18.0]
*
*/
public class DiffTimeLogSummariser {
double data[];
private int N;
int n=1;
double i=0;
private long stepsize;
private Date last;
private Date start;
public DiffTimeLogSummariser(int N, long stepsize) {
this.stepsize=stepsize;
if (N%2 != 0 ) N++;
this.N=N;
data=new double[N];
}
public void zoomOut() {
for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
for (int j=N/2;j<N;j++) data[j]=0;
n*=2;
}
public void add(double d, Date time) {
long diff;
if (last!=null) {
diff=time.getTime()-last.getTime();
} else {
start=time;
diff=0;
}
if (diff<0) {
System.err.println("DiffTimeLogSummarizer got diff<0, ignoring.");
return ;
}
i+=diff/(double)stepsize;
int j=(int) (Math.round(i)%n);
int idx=(int) (Math.round(i)/n);
while (idx>=N) {
zoomOut();
j=(int) (Math.round(i)%n);
idx=(int) (Math.round(i)/n);
}
data[idx]=(data[idx]*j+d)/(j+1);
last=time;
}
public double[] getData() {
return Arrays.copyOfRange(data, 0, (int) (i/n+1));
}
public Date getStart() {
return start;
}
public long getStepsize() {
return stepsize*n;
}
}
This assumes that if there is a gap in the data, that means the value was 0 at these intervals. Whether this is true depends on your application, in some cases it would probably make more sense to assume no data means the value was unchanged. I will leave this as an exercise for the reader :)
Finally, sometimes you are more interested in the distribution of the values you get, rather than how they vary over time. Computing histograms on the fly is also possible, for uniform bin-sizes the algorithm is almost the same as above. The tricky bits here I’ve stolen from Per on stackoverflow:
/**
* Create a histogram of N bins from some series of unknown length.
*
*/
public class TimeLogHistogram {
int N; // assume for simplicity that N is even
int counts[];
double lowerBound;
double binSize=-1;
public TimeLogHistogram(int N) {
this.N=N;
counts=new int[N];
}
public void add(double x) {
if (binSize==-1) {
lowerBound=x;
binSize=1;
}
int i=(int) (Math.floor(x-lowerBound)/binSize);
if (i<0 || i>=N) {
if (i>=N)
while (x-lowerBound >=N*binSize) zoomUp();
else if (i<0)
while (lowerBound > x) zoomDown();
i=(int) (Math.floor(x-lowerBound)/binSize);
}
counts[i]++;
}
private void zoomDown() {
lowerBound-=N*binSize;
binSize*=2;
for (int j=N-1;j>N/2-1;j--) counts[j]=counts[2*j-N]+counts[2*j-N+1];
for (int j=0;j<N/2;j++) counts[j]=0;
}
private void zoomUp() {
binSize*=2;
for (int j=0;j<N/2;j++) counts[j]=counts[2*j]+counts[2*j+1];
for (int j=N/2;j<N;j++) counts[j]=0;
}
public double getLowerBound() {
return lowerBound;
}
public double getBinSize() {
return binSize;
}
public int[] getCounts() {
return counts;
}
public int getN() {
return N;
}
}
Histograms with uneven binsizes gets trickier, but is apparently still possible, see:
Approximation and streaming algorithms for histogram construction problems.
Sudipto Guha, Nick Koudas, and Kyuseok Shim ACM Trans. Database Syst. 31(1):396-438 (2006)
That’s it – I could have put in an actual graph figure here somewhere, but it just be some random numbers, just imagine one.
(Apologies for Java code examples today, your regular python programming will resume shortly)