Posts categorized “Statistics”.

Multithreaded SciPy/NumPy with OpenBLAS on debian

Some months ago, just after I got an 8-core CPU, I wasted a weekend trying to get SciPy/NumPy to build against OpenBLAS. OpenBLAS is neat, as it has built-in and automatic support for multi-threading, for things like computing the dot-matrix of large matrices this can really be a time saver.

I was using roughly these instructions, but it was too complicated and I got nowhere and gave up.

Then I got a new MacBook, and set it up using homebrew rather than macports, and I noticed that NumPy was built against OpenBLAS by default. Now, it would be a shame to let debian be any worse off…

Luckily, it was much easier than I thought:

1. Install the openblas and lapack libraries and dev-headers

sudo apt-get install libopenblas-dev liblapack-dev

2. Setup a virtualenv

To make sure we do not mess up the whole system, setup a virtualenv (if you ever install more than 3 python packages, and do not yet know about virtualenv, you really should get to know it, it’s a little piece of magic!):

virtualenv env
source env/bin/activate

3. Install NumPy

In Debian, openblas/lapack fit into the alternatives system, and the implementation you chose gets symlinked to /usr/lib, however, this confuses numpy and you must point it to the right place, i.e. to /usr/lib/openblas-base
Download and unpack NumPy:

mkdir evn/download
pip install -d env/download numpy
mkdir env/build
cd env/build
tar xf ../download/numpy-1.7.1.tar.gz

Now create a site.cfg file with the following content:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

Build/install NumPy:

python setup.py install

You can now check the file env/lib/python2.7/site-packages/numpy/__config__.py to make sure it found the right libs, mine looks like this:

lapack_info={'libraries': ['lapack'], 
    'library_dirs': ['/usr/lib'], 'language': 'f77'}
atlas_threads_info={'libraries': ['openblas'], 
    'library_dirs': ['/usr/local/lib'], 
    'language': 'c', 
    'define_macros': [('ATLAS_WITHOUT_LAPACK', None)], 
    'include_dirs': ['/usr/local/include']}
blas_opt_info={'libraries': ['openblas'], 
    'library_dirs': ['/usr/local/lib'], 
    'language': 'c', 
    'define_macros': [('ATLAS_INFO', '"\\"None\\""')], 
    'include_dirs': ['/usr/local/include']}

4. Install SciPy

If NumPy installs cleanly, SciPy can simple be installed with pip:

pip install scipy

5. Test!

Using these scripts you can test your NumPy and SciPy installation. Be activating/deactivating the virtualenv, you can test with/without OpenBLAS. If you have several CPU cores, you can see that with OpenBLAS up to 4 CPUs should also be used.

Without OpenBLAS I get:

NumPy 
dot: 0.901498508453 sec

SciPy
cholesky: 0.11981959343 sec
svd: 3.64697360992 sec

with OpenBLAS:

NumPy
dot: 0.0569217920303 sec

SciPy
cholesky: 0.0204758167267 sec
svd: 0.81153883934 sec

Some basic BTC2012 Stats

(The Figure shows the biggest domains publishing data, and links between them – mouse-over the edges to highlight, chose linking predicate from the drop-down list)

So it’s that time of year again, and the Billion Triple Challenge Dataset for 2012 has been posted.
This coincided with our project demo being finished, so I had some time to spare. The previous years I’ve done this all using unix tools, sed/awk/grep and friends. This year I figured I’d do it all in python. To get reasonable performance two things were crucial:

  • the python gzip module has decompression implemented in python, using subprocess and reading from a pipe to gunzip is MUCH faster (thanks Jörn!)
  • I wrote a an N-Quads “parser” in cython, taking advantage of the very regular output of ld-spider

This meant that for simple operations, like adding up things in a hash-table in memory, I could stream-process about 500,000 triples per second. For things that did not fit in memory, I used LevelDB with a thin layer of most-frequently-used caching around it.

I’m happy to see that DbTropes is part of the data this year!
So – the basic stats:

  • 1.4B triples all in all
  • 1082 different namespaces are used
  • 9.2M unique contexts, from 831 top-level document PLDs (Pay-Level-Domain, essentially data.gov.uk, instead of gov.uk, but livejournal.com, instead of bob.livejournal.com)
  • 183M unique subjects are described
  • 57k unique predicates
  • 192M unique resources as objects
  • 156M unique literals
  • 152M triples are rdf:type statements, 296k types are used. Resource with multiple types are common, 45M resources have two types, 40M just one.

 

Top 10 Context PLDs

count context pld
751,352,061 data.gov.uk
198,090,262 dbpedia.org
101,241,556 freebase.com
101,082,592 livejournal.com
44,331,145 opera.com
41,544,819 dbtropes.org
39,200,538 legislation.gov.uk
36,969,163 identi.ca
29,447,217 ontologycentral.com
14,949,592 rdfize.com

 

Top 10 Namespaces

count namespace
336,911,630 http://www.w3.org/1999/02/22-rdf-syntax-ns#
191,669,089 http://www.w3.org/2000/01/rdf-schema#
143,650,096 http://xmlns.com/foaf/0.1/
133,845,241 http://reference.data.gov.uk/def/intervals/
115,692,342 http://www.w3.org/2006/time#
71,016,514 http://www.w3.org/2006/http#
69,715,106 http://rdf.freebase.com/ns/
66,058,545 http://www.w3.org/2004/02/skos/core#
53,246,991 http://purl.org/dc/terms/
50,444,755 http://dbpedia.org/property/

 

Top 10 Types

count type
39,345,307 intervals:Second
39,345,280 intervals:CalendarSecond
12,841,127 foaf:Person
7,623,831 foaf:Document
1,896,136 qb:Observation
1,851,173 fb:common.topic
1,712,877 intervals:Minute
1,712,875 intervals:CalendarMinute
1,328,921 owl:Thing
1,280,763 metalex:BibliographicExpression

As usual, although many namespaces/hosts/types are used, the distribution is skewed, the most common elements quickly accounts for most of the data. This graph shows the cumulative occurrences (i.e. % of total unique elements) of types/context-plds/namespaces occurring more than N times (the X axis is logarithmic):

So the steeper the curve, the longer the tail of infrequently occurring elements. For example, less than 5% of types occur more than 100 times, but very few context-pld’s occur less than 10 times. However, when you look at the actual density, the picture changes, here we plot the cumulative density, so although most types occur less than 100 times, the majority of the data uses only the most frequent types:

So the steeper the curve at the end, the more of the data is covered by the few most frequent element. For example, the top 5% most frequent namespaces and context-plds cover over 99% of the data, but the top 5% of types “only” 97%.

A different (maybe useless?) view of this, is this histogram with exponentially increasing bucket-sizes, again with a log-scale, so they look the same size:

Here we see … actually I’ll be damned if I know what we see here. Maybe I should have done more stats courses at uni instead of, say, Java Programming. Clearly the difference between the distribution of the three things is shown somehow. I’ve spent so long on this now though, there’s no way I wont put it here.

I don’t even want to talk about how long I spent making these graphs. I wanted to graph this since the first BTC dataset I looked at, but previously always fell back at “top n% of the elements cover n% of the data” tables.
They graphs are all done in pylab, exported as SVG (yay!). Playing with them was all done with the ipython notebook, which is really pleasant to work with.

Finally – the Chord-diagram on top shows links between context PLDs – mouse over each host to see outgoing links. This is only the top 19 PLD domains and the top 10 properties linking domains that themselves publish RDF data – this is important, as there are predicates used to link to non-semantic web resources that dominate otherwise. The graphic and interaction is all done with the excellent D3 Library.

I will try to come up with some more interesting visualisations based on links between instances of various types soon!

Three simple methods for graphing datastreams

This is not very cutting edge, nor all THAT exciting – but it seemed worth putting the things together into one post for someone else who encounters the same scenario.

The problem is this: You have a stream of data-points coming in, you do not know how many points and you do not know the range of the numbers. You would like to draw a pretty little graph of this data. Since there is an unknown number of points, but potentially massive, you cannot keep all numbers in memory, so look at each once and pass it on. It’s essentially an “online (learning) scenario. The goal here is to draw small graph to give an idea of the trend of the data, not careful analysis – so a fair bit of approximation is ok.

In our case the numbers are sensors readings from agricultural machines, coming in through ISOXML (you probably really don’t want to go there, but it’s all in ISO 11783) documents. The machine makes readings every second, which adds up when it drives for days. This isn’t quite BIG DATA, but there may be millions of points, and in our case we want to run on a fairly standard PC, so an online approach was called for.

For the first and simplest problem, you have data coming in at regular intervals, and you want a (much) shorter array of values to plot. This class keeps an array of some length N, at first we just fill it with the incoming numbers. If we see more than N numbers, we “zoom out” by a factor of two, rewrite the N numbers we’ve seen to N/2 numbers by averaging them, and from now on, every two incoming numbers are averaged into each cell in our array. Until we reach N*2 numbers, then we zoom out again, now averaging every 4 numbers, etc. Rewriting the array is a bit of work, but it only happens log(your total number of points) times. In the end you end up with somewhere between N/2 and N points.

/**
 * Create a summary time-series of some time-series of unspecified length. 
 * 
 * The final time-series will be somewhere between N/2 and N long
 * (Unless less numbers are given of course)
 * 
 * For example:
 * TimeLogSummariser t = new DDISummaryGraph.TimeLogSummariser(4);
 * for (int i=1; i<20; i++) t.add(i);
 * => [3.0, 7.25, 18.0]
 * 
 */
public class TimeLogSummariser { 
	
	double data[];
	int N;
	int n=1;
	int i=0;
	public TimeLogSummariser(int N) { 
		if (N%2 != 0 ) N++; 
		this.N=N; 
		data=new double[N];
	}
	public void zoomOut() { 
		for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
		for (int j=N/2;j<N;j++) data[j]=0;
		n*=2;
	}
	
	public void add(double d) { 
		int j=i%n;
		int idx=i/n;
		
		if (idx>=N) { 
			zoomOut(); 
			j=i%n;
			idx=i/n;
		}
		data[idx]=(data[idx]*j+d)/(j+1);
		i++;
	}
	
	public double[] getData() {
		return Arrays.copyOfRange(data, 0, i/n+1); 
	}
	
}

Now, after programming this I realised that most of my tractor-sensor data actually does not come in at regular intervals. You’ll get stuff every second for a while, then suddenly a 3 second gap, or a 10 minute smoke-break or a 40 minute lunch-break. Treating every point as equal does not give you the graph you want. This only makes things slightly more complicated though, instead always stepping one step in the array, the step depends on the difference in time between the two points. We assume (and hope) that data always arrives in sequential order. Also, since we may take many steps at once now, we may need to “zoom out” more than once to fit in the array. Otherwise the code is almost the same:

 
/**
 * Create a summary time-series of some time-series of unspecified length. 
 * 
 * The final time-series will be somewhere between N/2 and N long
 * (Unless less numbers are given of course)
 * 
 * This class will do correctly do averages over time. 
 * The constructor takes the stepsize as a number of milleseconds 
 * i.e. the number of milliseconds between each recording. 
 * Each value is then given with a date. 
 * 
 * For example:
 * DiffDDISummaryGraph t = new DiffDDISummaryGraph.TimeLogSummariser(4, 1000);
 * for (int i=1; i<20; i++) t.add(i, new Date(i*2000));
 * => [3.0, 7.25, 18.0]
 * 
 */
public class DiffTimeLogSummariser { 
	
	double data[];
	
	private int N;

	int n=1;
	double i=0;

	private long stepsize;

	private Date last;
	private Date start;
	
	public DiffTimeLogSummariser(int N, long stepsize) { 
		this.stepsize=stepsize;
		if (N%2 != 0 ) N++; 
		this.N=N;
		data=new double[N];
	}
	public void zoomOut() { 
		for (int j=0;j<N/2;j++) data[j]=(data[j*2]+data[2*j+1])/2;
		for (int j=N/2;j<N;j++) data[j]=0;
		n*=2;
	}
	
	public void add(double d, Date time) {
		
		long diff;
		if (last!=null) {
			diff=time.getTime()-last.getTime();
		} else { 
			start=time;
			diff=0;
		}
		if (diff<0) { 
			System.err.println("DiffTimeLogSummarizer got diff<0, ignoring.");
			return ; 
		}
		
		i+=diff/(double)stepsize;
		
		int j=(int) (Math.round(i)%n);
		int idx=(int) (Math.round(i)/n);
		
		while (idx>=N) { 
			zoomOut(); 
			j=(int) (Math.round(i)%n);
			idx=(int) (Math.round(i)/n);
		}
		data[idx]=(data[idx]*j+d)/(j+1);
		last=time;
	}
	
	public double[] getData() {
		return Arrays.copyOfRange(data, 0, (int) (i/n+1)); 
	}
	
	public Date getStart() {
		return start;
	}

	public long getStepsize() {
		return stepsize*n;
	}
}

This assumes that if there is a gap in the data, that means the value was 0 at these intervals. Whether this is true depends on your application, in some cases it would probably make more sense to assume no data means the value was unchanged. I will leave this as an exercise for the reader :)

Finally, sometimes you are more interested in the distribution of the values you get, rather than how they vary over time. Computing histograms on the fly is also possible, for uniform bin-sizes the algorithm is almost the same as above. The tricky bits here I’ve stolen from Per on stackoverflow:

/**
 * Create a histogram of N bins from some series of unknown length. 
 * 
 */
public class TimeLogHistogram { 

	int N;  // assume for simplicity that N is even
	int counts[];

	double lowerBound;
	double binSize=-1;

	public TimeLogHistogram(int N) { 
		this.N=N; 
		counts=new int[N];
	}

	public void add(double x) { 
		if (binSize==-1) {
			lowerBound=x;
			binSize=1;
		}
		int i=(int) (Math.floor(x-lowerBound)/binSize);

		if (i<0 || i>=N) {
			if (i>=N) 
				while (x-lowerBound >=N*binSize) zoomUp(); 
			else if (i<0) 
				while (lowerBound > x) zoomDown();
			i=(int) (Math.floor(x-lowerBound)/binSize);
		}
		counts[i]++;
	}

	private void zoomDown() {
		lowerBound-=N*binSize;
		binSize*=2;
		for (int j=N-1;j>N/2-1;j--) counts[j]=counts[2*j-N]+counts[2*j-N+1];
		for (int j=0;j<N/2;j++) counts[j]=0;
	}

	private void zoomUp() {
		binSize*=2;
		for (int j=0;j<N/2;j++) counts[j]=counts[2*j]+counts[2*j+1];
		for (int j=N/2;j<N;j++) counts[j]=0;
	}

	public double getLowerBound() {
		return lowerBound;
	}
	public double getBinSize() {
		return binSize;
	}
	public int[] getCounts() {
		return counts;
	}
	public int getN() { 
		return N;
	}
}

Histograms with uneven binsizes gets trickier, but is apparently still possible, see:

Approximation and streaming algorithms for histogram construction problems.
Sudipto Guha, Nick Koudas, and Kyuseok Shim ACM Trans. Database Syst. 31(1):396-438 (2006)

That’s it – I could have put in an actual graph figure here somewhere, but it just be some random numbers, just imagine one.

(Apologies for Java code examples today, your regular python programming will resume shortly)

Voting in the Eurovision Song Contest

Yesterday I was made to watch the Eurovision song contest. I went to bed before the voting ended, so I woke up to find that Azerbaijan had won. Which was curious, since they were awful (or perhaps not, since this is Eurovision).

At the official webpage you can get the voting breakdown, where we can see that all the countries I have lived in (Norway, UK and Germany) gave Azerbaijan 0 points. Clearly the Eurovision has been ruined by all these new East-block counties, who in a giant conspiracy who only vote for each other, rendering us western countries with real musical talent without a chance. To confirm my suspicion I grabbed the result table, python, scipy and matplotlib. Compute the correlation matrix for the columns, run PCA on this and plot the first two components (if all that meant nothing to you, the result is that countries who tend to distribute their votes similarly are close to each other in the diagram):

Here the truths are clear as day – there is a Scandinavian conspiracy, Norway and Sweden are really the same country, Denmark is almost the same. Greece and Cyprus is one country (ahem – sorry Turkey, you are not far away). The Eastblock cabal is in the lower right. Malta is nothing like anyone else, it’s almost like it’s an island  ….

I think the only solution is to go back to the 1960 version of Eurovision, the first year that all countries that matter took part.

Seriously though – it’s fun to see how close this is to the actual geography of Europe, rotate the map a bit, and Scandinavia, UK, Italy are all in the right place.

PS: this is also pretty funny, but it seems someone takes this more seriously than perhaps they should: Eurovision Voting Fraud

Schema usage in the BTC2010 data

A little while back I spent about 1 CPU week computing which hosts use which namespaces in the BTC2010 data, i.e. I computed a matrix with hosts as rows, schemas as columns and each cell the number of triples using that namespace each host published. My plan was to use this to create a co-occurrence matrix for schemas, and then use this for computing similarities for hierarchical clustering. And I did. And it was not very amazing. Like Ed Summer’s neat LOD graph I wanted to use Protovis to make it pretty. Then, after making one version, uglier than the next I realised that just looking at the clustering tree as a javascript datastructure was just as useful, I gave up on the whole clustering thing.

Not wanting spent CPU hours go to waste, I instead coded up a direct view of the original matrix, getting a bit carried away I made a crappy non-animated, non-smooth version of Moritz Stefaner’s elastic lists using jquery-ui’s tablesorter plugin.

At http://gromgull.net/2010/10/btc/explore.html you can see the result. Clicking one a namespace will show only hosts publishing triples using this schema, and only schemas that co-occur with the one you picked. Conversely, click on a host will show the namespaces published by that host, and only hosts that use the same schemas (this makes less intuitive sense for hosts than for namespaces). You even get a little protovis histogram of the distribution of hosts/namespaces!

The usually caveats for the BTC data applies, i.e. this is a random sampling of parts of the semantic web, it doesn’t really mean anything :)

Redundancy in the BTC2010 Data, it’s only 1.4B triples!

In a comment here, Andreas Harth mentions that kaufkauf.net publishes the same triples in many contexts, and that this may skew the statistics a bit. As it turns out, not only kaufkauf.net is guilty of this, by stripping the fourth quad component of the data and removing duplicate triples the original 3,171,793,030 quads turn into “only” 1,441,499,718 triples.

36,123,031 triples occurred more than once in the data, 42 of these even more than 100,000 times. The top redundant triples are:

#triples subj pred obj
470,903 prot:A rdf:type prot:Chains
470,778 prot:A prot:Chain “A”^^<http://www.w3.org/2001/XMLSchema#string>
470,748 prot:A prot:ChainName “Chain A”^^<http://www.w3.org/2001/XMLSchema#string>
413,647 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy rdf:type gr:BusinessEntity
366,073 foaf:Document rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Document%3E
361,900 dcmitype:Text rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://purl.org/dc/dcmitype/Text%3E
254,567 swrc:InProceedings rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://swrc.ontoware.org/ontology%23InProceedings%3E
184,530 foaf:Agent rdfs:seeAlso http://dblp.l3s.de/d2r/sparql?query=DESCRIBE+%3Chttp://xmlns.com/foaf/0.1/Agent%3E
159,627 http://www4.wiwiss.fu-berlin.de/flickrwrappr/ rdfs:label “flickr(tm) wrappr”@en
150,417 http://purl.org/obo/owl/OBO_REL#part_of rdf:type owl:ObjectProperty

This is all unfortunate, because I’ve been analysing the BTC data pretending that it’s a snapshot of the semantic web. Which perhaps it is? The data out there does of course look like this. Does the context of a triple change what it MEANS? If we had a trust/provenance stack in place I guess it would. Actually, I am not sure what this means for my statistics :)

At least I can now count the most common namespaces again, this time only from triples:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
275,920,526 foaf
181,683,388 rdf
106,130,939 rdfs
34,959,224 dc11
33,289,653 http://purl.uniprot.org/core
16,674,480 gr
12,733,566 rss
12,368,342 dcterm
8,334,653 swrc

Compare to the numbers for quads, data-gov had exactly the same number of triples (no redundancy!), whereas rdf dropped from 588M to 181M, rdfs from 860M to 106M and GoodRelations from 527M to 16M. Looking at all namespaces, GoodRelations wins the most redundant award from 16% of all quads, to only 1.1% of all triples. Comparing change since 2009 still puts GoodRelations up high though, so no need for them to worry:

% change namespace
5579.997758 http://www.openlinksw.com/schema/attribution
4802.937827 http://www.openrdf.org/schema/serql
3969.768833 gr
2659.804256 urn:lsid:ubio.org:predicates:recordVersion
2655.011816 urn:lsid:ubio.org:predicates:lexicalStatus
2621.864105 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
2619.867255 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
1539.092138 urn:lsid:ubio.org:predicates:hasCAVConcept
1063.282710 urn:lsid:lsid.zoology.gla.ac.uk:predicates:vernacularName
928.135900 http://spiele.j-crew.de/wiki/Spezial:URIResolver

And if I understood Kingsley Idehen correctly, there is something fishy about the attribution namespace from openlink as well, but I’ve done enough boring digging now.

Now I’m done doing boring counting – next time I hope I can have more fun visualisation, like Ed!!

SKOS Concepts in the BTC2010 data

Again Dan Brickley is making me work :) This time looking at the “hidden” schema that is SKOS concepts, (hidden because it is not really apparent when just looking at normal rdf:types). Dan suggested looking at topics used with FOAF, i.e. objects of foaf:topic, foaf:primaryTopic and foaf:interest triples, and also things used with Dublin Core subject (I used both http://purl.org/dc/elements/1.1/subject and http://purl.org/dc/terms/subject.

I found 1,136,475 unique FOAF topics in 8,119,528 triples, only 4,470 are bnodes, and only 265 (! i.e. only 0.002%) are literals. The top 10 topics are all of the type http://www.livejournal.com/interests.bml?int=??????, with varying number of ?s, this is obviously what people entered into the interest field of livejournal. More interesting are perhaps the top hosts:

#triples host
5,191,771 www.livejournal.com
1,819,836 www.deadjournal.com
771,439 www.vox.com
78,290 klab.lv
75,285 lj.rossia.org
70,380 lod.geospecies.org
18,398 my.opera.com
16,251 dbpedia.org
11,481 www.wasab.dk
9,815 wiki.sembase.at

So a lot of these topics are from FOAF exports of livejournal and friends. What I did not do, at least not yet, was to compare the list of FOAF topics with the things actually declared to be of type skos:Concept, this would be interesting.

Dublin Core looks quite different, it gives us 552,596 topics in 4,018,726 triples, but only 2,979 resources out of 921 are bnodes, the rest (i.e. 99.4%) are all literals.
The top 10 subjects according to DC are:

#triples subject
91,534 日記
38,566 写真
35,514 メル友募集
32,150 NAPLES
30,973 business
28,342 独り言
27,543 SoE Report
24,102 Congress
23,954 音楽
20,097

I do not even know what language most of these are (anyone?). Looking a bit further down the list, there are lots of government, education, crime, etc. Perhaps we can blame data.gov for this? I could have have kept track of the named-graphs these came from, but I didn’t. Maybe next time.

You can download the full raw counts for all subjects: FOAF topics (7.6mb), FOAF hosts and DC Topics (23mb).

BTC2009/2010 Raw Counts

Dan Brickley asked, so I put up the complete files with counts for predicates, namespaces, types, hosts, and pay-level domains here: http://gromgull.net/2010/09/btc2010data/.

Uploading them to manyeyes or similar would perhaps be more modern, but it was too much work :)

Aggregates over BTC2010 namespaces

Yesterday I dumped the most basic BTC2010 stats. Today I have processed them a bit more – and it gets slightly less boring.

First predicates, yesterday I had the raw count per predicate. Much more interesting is the namespaces the predicates are defined in. These are the top 10:

#triples namespace
860,532,348 rdfs
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
588,063,466 rdf
527,347,381 gr
284,679,897 foaf
44,119,248 dc11
41,961,046 http://purl.uniprot.org/core
17,233,778 rss
13,661,605 http://www.proteinontology.info/po.owl
13,009,685 owl

(prefix abbreviations are made from prefix.cc \u2013 I am too lazy to fix the missing ones)

Now it gets interesting – because I did exactly this last year as well, and now we can compare!

Dropouts

In 2009 there were 3,817 different namespaces, this year we have 3,911, but actually only 2,945 occur in both. The biggest dropouts, i.e. namespaces that occurred last year, but not at all this year are:

#triples namespace
10,239,809 http://www.kisti.re.kr/isrl/ResearchRefOntology
5,443,549 nie
1,571,547 http://ontologycentral.com/2009/01/eurostat/ns
1,094,963 http://sindice.com/exfn/0.1
320,155 http://xmdr.org/ont/iso11179-3e3draft_r4.owl
307,534 http://cb.semsol.org/ns
242,427 nco
203,283 osag
187,600 http://auswiki.org/index.php/Special:URIResolver
159,536 nexif

I am of course shocked and saddened to see that the Nepomuk Information Elements ontology has fallen out of fashion all together, although it was a bit of a freak occurrence last year. I am not sure how we lost 10M research ontology triples?

Newcomers

Looking the other way around, what namespaces are new and popular this year, we get:

#triples namespace
651,432,324 http://data-gov.tw.rpi.edu/vocab/p/90
5,001,909 fec
2,689,813 http://transport.data.gov.uk/0/ontology/traffic
543,835 http://rdf.geospecies.org/ont/geospecies
526,304 http://data-gov.tw.rpi.edu/vocab/p/401
469,446 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf
446,120 http://education.data.gov.uk/def/school
223,726 http://www.w3.org/TR/rdf-schema
190,890 http://wecowi.de/wiki/Spezial:URIResolver
166,511 http://data-gov.tw.rpi.edu/vocab/p/10

Here the introduction of data.gov and data.gov.uk were the big events last year.

Winners

For the namespaces that occurred both years we can find the biggest gainers. Here I calculated what ratio of the total triples each namespace constituted each year, and the increase in this ratio from 2009 to 2010. For example, GoodRelations, on top here, constituted nearly 16% of all triples in 2010, but only 2.91e-4% of all triples last year, for a cool increase of 570,000% :)

gain namespace
57058.38 gr
2636.34 http://www.openlinksw.com/schema/attribution
2182.81 http://www.openrdf.org/schema/serql
1944.68 http://www.w3.org/2007/OWL/testOntology
1235.02 http://referata.com/wiki/Special:URIResolver
1211.35 urn:lsid:ubio.org:predicates:recordVersion
1208.09 urn:lsid:ubio.org:predicates:lexicalStatus
1194.66 urn:lsid:lsid.zoology.gla.ac.uk:predicates:mapping
1191.39 urn:lsid:lsid.zoology.gla.ac.uk:predicates:rank
701.66 urn:lsid:ubio.org:predicates:hasCAVConcept

Losers

Similarly, we have the biggest losers, the ones who lost the most:

gain namespace
0.000185 http://purl.org/obo/metadata
0.000191 sioct
0.000380 vcard
0.000418 affy
0.000438 http://www.geneontology.org/go
0.000677 http://tap.stanford.edu/data
0.000719 urn://wymiwyg.org/knobot/default
0.000787 akts
0.000876 http://wymiwyg.org/ontologies/language-selection
0.000904 http://wymiwyg.org/ontologies/knobot

If your namespace is a loser, do not worry, remember that BTC is a more or less arbitrary snapshot of SOME semantic web data, and you can always catch up next year! :)

With a bit of luck I will do this again for the Pay-Level-Domains for the context URLs tomorrow.

Update

(a bit later)

You can get the full datasets for this from many eyes:

BTC2010 Basic stats

Another year, another billion triple dataset. This time it was released the same time my daughter was born, so running the stats script was delayed for a bit.

This year we’ve got a few more triples, perhaps making up for the fact that it wasn’t actually one billion last year :) we’ve now got 3.1B triples (or 3,171,793,030 if you want to be exact).

I’ve not had a chance to do anything really fun with this, so I’ll just dump the stats:

Subjects

  • 159,185,186 unique subjects
  • 147,663,612 occur in more than a single triple
  • 12,647,098 more than 10 times
  • 5,394,733 more 100
  • 313,493 more than 1,000
  • 46,116 more than 10,000
  • and 53 more than 100,000 times

For an average of 19.9252 per unique triple. Like last year, I am not sure if having more than 100,000 triples with the same subject really is useful for anyone?

Looking only at bnodes used as subjects we get:

  • 100,431,757 unique subjects
  • 98,744,109 occur in more than a single triple
  • 1,465,399 more than 10 times
  • 266,759 more 100
  • 4,956 more than 1,000
  • 48 more than 10,000

So 100M out of 159M subjects are bnodes, but they are used less often than the named resources.

The top subjects are as follows:

#triples subject
1,412,709 http://www.proteinontology.info/po.owl#A
895,776 http://openean.kaufkauf.net/id/
827,295 http://products.semweb.bestbuy.com/company.rdf#BusinessEntity_BestBuy
492,756 cycann:externalID
481,000 http://purl.uniprot.org/citations/15685292
445,430 foaf:Document
369,567 cycann:label
362,391 dcmitype:Text
357,309 http://sw.opencyc.org/concept/
349,988 http://purl.uniprot.org/citations/16973872

I do not know enough about the Proteine ontology to know why po:A is so popular. CYC we already had last year here, and I guess all products exposed by BestBuy have this URI as a subject.

Predicates

  • 95,379 unique predicates
  • 83,370 occur in more than one triples
  • 46,710 more than 10
  • 18,385 more than 100
  • 5,395 more than 1,000
  • 1,271 more than 10,000
  • 548 more than 100,000

The average predicate occurred in 33254.6 triples.

#triples predicate
557,268,190 rdf:type
384,891,996 rdfs:isDefinedBy
215,041,142 gr:hasGlobalLocationNumber
184,881,132 rdfs:label
175,141,343 rdfs:comment
168,719,459 gr:hasEAN_UCC-13
131,029,818 gr:hasManufacturer
112,635,203 rdfs:seeAlso
71,742,821 foaf:nick
71,036,882 foaf:knows

The usual suspects, rdf:type, comment, label, seeAlso and a bit of FOAF. New this year is lots of GoodRelations data!

Objects – Resources

Ignoring literals for the moment, looking only at resource-objects, we have:

  • 192,855,067 unique resources
  • 36,144,147 occur in more than a single triple
  • 2,905,294 more than 10 times
  • 197,052 more 100
  • 20,011 more than 1,000
  • 2,752 more than 10,000
  • and 370 more than 100,000 times

On average 7.72834 triples per object. This is both named objects and bnodes, looking at the bnodes only we get:

  • 97,617,548 unique resources
  • 616,825 occur in more than a single triple
  • 8,632 more than 10 times
  • 2,167 more 100
  • 1 more than 1,000

Since BNode IDs are only valid within a certain file it is limited how often then can appear, but still almost half the overall objects are bnodes.

The top ten bnode IDs are pretty boring, but the top 10 named resources are:

#triples resource-object
215,532,631 gr:BusinessEntity
215,153,113 ean:businessentities/
168,205,900 gr:ProductOrServiceModel
167,789,556 http://openean.kaufkauf.net/id/
71,051,459 foaf:Person
10,373,362 foaf:OnlineAccount
6,842,729 rss:item
6,025,094 rdf:Statement
4,647,293 foaf:Document
4,230,908 http://purl.uniprot.org/core/Resource

These are pretty much all types – compare to:

Types

A “type” being the object that occurs in a triple where rdf:type is the predicate gives us:

  • 170,020 types
  • 91,479 occur in more than a single triple
  • 20,196 more than 10 times
  • 4,325 more 100
  • 1,113 more than 1,000
  • 258 more than 10,000
  • and 89 more than 100,000 times

On average each type is used 3277.7 times, and the top 10 are:

#triples type
215,536,042 gr:BusinessEntity
168,208,826 gr:ProductOrServiceModel
71,520,943 foaf:Person
10,447,941 foaf:OnlineAccount
6,886,401 rss:item
6,066,069 rdf:Statement
4,674,162 foaf:Document
4,260,056 http://purl.uniprot.org/core/Resource
4,001,282 http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry
3,405,101 owl:Class

Not identical to the top resources, but quite similar. Lots of FOAF and new this year, lots of GoodRelations.

Contexts

Something changed with regard to context handling for BTC2010, this year we only have 8M contexts, last year we had over 35M.
I wonder if perhaps all of dbpedia is in one context this year?

  • 8,126,834 unique contexts
  • 8,048,574 occur in more than a single triple
  • 6,211,398 more than 10 times
  • 1,493,520 more 100
  • 321,466 more than 1,000
  • 61,360 more than 10,000
  • and 4799 more than 100,000 times

For an average of 389.958 triples per context. The 10 biggest contexts are:

#triples context
302,127 http://data-gov.tw.rpi.edu/raw/402/data-402.rdf
273,644 http://www.ling.helsinki.fi/kit/2004k/ctl310semw/WordNet/wordnet_nouns-20010201.rdf
259,824 http://static.cpantesters.org/author/M/MIYAGAWA.rss
207,513 http://data-gov.tw.rpi.edu/raw/401/data-401.rdf
193,944 http://static.cpantesters.org/author/D/DROLSKY.rss
189,528 http://static.cpantesters.org/author/S/SMUELLER.rss
170,899 http://data-gov.tw.rpi.edu/raw/59/data-59.rdf
166,454 http://zaltys.net/ontology/AKTiveSAOntology.owl
166,454 http://www.zaltys.net/ontology/AKTiveSAOntology.owl
165,948 http://lsdis.cs.uga.edu/~satya/Satya/jan24.owl

This concludes my boring stats dump for BTC2010 for now. Some information on literals and hopefully some graphs will come soon! I also plan to look into how these stats changed from last year – so far I see much more GoodRelations, but there must be other fun changes!