For completeness, Besbes is telling me to write up the final stats from the BTC data, for the object-part of the triples. I am afraid this data is quite dull, and there are not many interesting things to say, so it’ll be mostly tables. Enjoy :)
The BTC data contains 279,710,101 unique objects in total. Out of these:
- 90,007,431 appear more than once
- 7,995,747 more than 10 times
- 748,214 more than 100
- 43,479 more than 1,000
- 3,209 more than 10,000
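For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of the counting step. It is not the script used for the numbers above. The function name `threshold_counts` and the toy data are made up for illustration, and it assumes the objects fit in memory, which they would not at BTC scale.

```python
from collections import Counter

def threshold_counts(objects, thresholds=(1, 10, 100, 1000, 10000)):
    """Return {t: number of distinct objects that occur more than t times}."""
    freq = Counter(objects)
    return {t: sum(1 for n in freq.values() if n > t) for t in thresholds}

# Toy data standing in for the object column of the triples (hypothetical)
sample = ["a"] * 12 + ["b"] * 2 + ["c"]
print(threshold_counts(sample, thresholds=(1, 10)))
```

On the real data the same idea would more likely be a sort-and-count pass over a file rather than an in-memory `Counter`.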
The 280M objects are split into 162,764,271 resources and 116,945,830 literals. 13,538 of the resources are file:// URIs. The top 10 objects are:
| #triples | object |
| --- | --- |
| 2,584,960 | http://www.geonames.org/ontology#P |
| 2,645,095 | http://www.aktors.org/ontology/portal#Article-Reference |
| 2,681,771 | http://www.w3.org/2002/07/owl#Class |
| 5,616,326 | http://www.aktors.org/ontology/portal#Person |
| 7,544,903 | http://www.geonames.org/ontology#Feature |
| 9,115,801 | http://en.wikipedia.org/ |
| 12,124,378 | http://xmlns.com/foaf/0.1/OnlineAccount |
| 13,687,049 | http://purl.org/rss/1.0/item |
| 14,172,852 | http://rdfs.org/sioc/types#WikiArticle |
| 38,795,942 | http://xmlns.com/foaf/0.1/Person |
Apart from the Wikipedia link, all are types. No literals appear in the top 10 table. Of the 116M unique literals, 12,845,021 have a language tag and 2,067,768 have a datatype tag. The top 10 literals are:
| #triples | literal |
| --- | --- |
| 722,221 | "0"^^xsd:integer |
| 969,929 | "1" |
| 1,024,654 | "Nay" |
| 1,036,054 | "Copyright © 2009 craigslist, inc." |
| 1,056,799 | "text" |
| 1,061,692 | "text/html" |
| 1,159,311 | "0" |
| 1,204,996 | "en-us" |
| 2,049,638 | "Aye" |
| 2,310,681 | "application/rdf+xml" |
I can’t be bothered to check it now, but I guess the many Aye’s & Nay’s come from IRC chatlogs (#SWIG?).
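The split into plain, language-tagged and datatyped literals can be sketched roughly like this. It assumes the literals are in N-Triples style serialisation (`"..."`, `"..."@lang`, `"..."^^<datatype>`). The regex and the name `classify_literal` are my own illustration, not the script behind the numbers above, and the pattern is deliberately simpler than the full grammar.

```python
import re

# N-Triples style literal: "lexical form", optionally followed by
# @language-tag or ^^<datatype URI>. Simplified, not the full grammar.
LITERAL_RE = re.compile(
    r'^"(?P<lex>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^<(?P<dtype>[^>]+)>)?$'
)

def classify_literal(term):
    """Return 'language', 'datatype', 'plain' or 'not-a-literal'."""
    m = LITERAL_RE.match(term)
    if m is None:
        return "not-a-literal"
    if m.group("lang"):
        return "language"
    if m.group("dtype"):
        return "datatype"
    return "plain"

for term in ['"Aye"', '"hello"@en-us',
             '"0"^^<http://www.w3.org/2001/XMLSchema#integer>',
             '<http://xmlns.com/foaf/0.1/Person>']:
    print(term, classify_literal(term))
```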
Finally, I looked at the length of the literals used in the data. The longest literal is 65,244 unicode characters long (I wonder about this: it seems very close to 2^16 bytes, and since some unicode characters take more than one byte, could it be truncated?). The distribution of literal lengths looks like this:
[Figure: distribution of literal lengths]
Most literals are around 10 characters in length. There is a peak at 19, which I seem to remember was caused by the standard time format (i.e. 2005-10-30T10:45UTC) being exactly 19 characters.
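A small sketch of the length statistics, with made-up sample data. The helper name `length_histogram` is hypothetical. It also shows why a character count can sit slightly below a byte limit: 2^16 is 65,536, which is 292 more than 65,244, so a few hundred extra bytes from multi-byte UTF-8 characters would be enough to explain the gap if the literal was cut at 64 KiB. That is only a plausibility check, not a confirmation.

```python
from collections import Counter

def length_histogram(literals):
    """Map literal length (in unicode characters) to number of literals."""
    return Counter(len(s) for s in literals)

# The timestamp format mentioned above is 19 characters long
timestamp = "2005-10-30T10:45UTC"
print(len(timestamp))

# Character length and UTF-8 byte length differ for non-ASCII text
s = "\u00e9" * 3  # three e-acute characters, two bytes each in UTF-8
print(len(s), len(s.encode("utf-8")))

# Gap between 2^16 bytes and the longest literal's character count
print(2 ** 16 - 65244)

print(length_histogram(["Aye", "Nay", "text", timestamp]))
```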
That’s it! I believe I now have published all my numbers on BTC :)
Posted by gromgull at 4:44 pm on December 11th, 2009.
Categories: Billion Triple Challenge, Semantic Web, Statistics, Uncategorized.
I have been using http://gromgull.net as a hub for linking to my many online identities and also as an OpenID delegate URL, but it was never a proper host, it just redirected to http://semikolon.co.uk/gromgull. This redirection was getting annoying, as different OpenID clients would save the OpenID URL differently: some would use what I typed, i.e. http://gromgull.net, some would use what this redirected to, and some would use the myopenid ID I actually delegated to. When the RSS feed of my (now old) phpsimpleblog also broke, it was time to upgrade properly. So here we go with a new domain and a WordPress installation. Well worth it for the two times a year I blog. (All the old posts have been moved here, with lots of hassle, but old comments have not.)
In other news I was at ESWC09, organising SFSW09 as usual. All great fun, the Scripting challenge at SFSW had especially high quality entries this year, the winners are listed at the challenge page, and I would also recommend watching the screencast for Anca Luca’s Practical Semantic Works – a Bridge from the Users’ Web to the Semantic Web – although she did not win, she shows some amazing presentation skills! The whole experience is also documented on flickr.
Finally, ESWC brought the Billion Triple Challenge to my attention, and I wondered if I could possibly do some data-mining of some sort on this data. Downloading it I quickly realised that it will not fit into any RDF database that I keep lying around, but since the data is in a nice one-statement-per-line N-QUADS format, I can process it with commandline tools, like awk, sed, sort and friends. I promptly set to work, writing scripts for extracting literals (since they make the commandline processing trickier) and sorting and counting like mad. A week of CPU time later I realised that something was amiss: I have predicates that are simply “and”, and subject URIs that are <file://Documents … bugger. As it turns out the data-set had some bugs, sorry, features, like URLs with spaces in them, and I’ve had to rewrite my script. Once it works the details will appear here. (Andreas Harth agrees that this is a feature btw, and a new version of the BTC dataset will appear later.)
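To illustrate what goes wrong, here is a rough sketch of the difference between naive whitespace splitting and a tokenizer that keeps `<...>` and `"..."` together. This is not my actual script. The regex, the name `split_quad` and the example line (with its example.org URIs) are invented for illustration, and the pattern is a simplification of the real N-Quads grammar.

```python
import re

# Treat <...> and "..." as single terms, so spaces inside them do not
# split a statement into the wrong number of fields.
TERM_RE = re.compile(r'''
    <[^>]*>                      # URI reference, spaces tolerated
  | _:\S+                        # blank node
  | "(?:[^"\\]|\\.)*"            # literal body with backslash escapes
    (?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?   # optional language tag or datatype
''', re.VERBOSE)

def split_quad(line):
    """Return the terms of one N-Quads line; the trailing '.' is dropped."""
    return TERM_RE.findall(line)

# Hypothetical line with a space in the subject URI and in the literal
line = ('<http://example.org/my doc> <http://example.org/p> '
        '"this and that" <http://example.org/ctx> .')

print(line.split())      # naive split: the subject and literal fall apart
print(split_quad(line))  # four terms: subject, predicate, object, context
```

With the naive split, the word "and" from the literal lands in its own field, which is the kind of thing that produces predicates that are simply "and".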
Posted by gromgull at 3:37 pm on June 16th, 2009.
Categories: Uncategorized.
I often refer to this one Dilbert comic, but I can never find it. Today I spent the time googling and found it using google book search. Now archived here for future reference:

Posted by gromgull at 4:01 pm on October 14th, 2008.
Categories: Uncategorized.
It's been a long time since an update here now, been busy bowling and drinking beer mostly. The bowling team is doing well, think we have a real shot at making it in the tournament this year. Got my favourite bowling ball from the bowling ball shining place yesterday, so now I'm ready to rumble.
Haven't had much time to play Starcraft lately though, and I'm afraid that mnem has been playing online and practicing.
Até logo.
Posted by gromgull at 10:34 pm on April 9th, 2008.
Categories: Uncategorized.
Time waster 1: Flash Elements TD (from Chris)
Time waster 2: Boomshine (you need the music to really appreciate it!)
(Title from here)
Posted by gromgull at 3:01 pm on April 26th, 2007.
Categories: Uncategorized.
Today I finished implementing the XML-RPC interface to Gnowsis 0.9 that we are doing for the Semouse people.
As an added bonus I can get away from evil Java for a bit and interface with Gnowsis in python!
Gnowsis exports a range of XML-RPC methods, the ones I made today make up an interface to the different stores of gnowsis-server. Some javadoc will appear soon, but there are methods like addTriple, removeTriple, querySelect, queryConstruct.
Now with this in place I can write the python code:
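#!/usr/bin/env python
# Python 3: the old xmlrpclib module is now xmlrpc.client
from xmlrpc.client import ServerProxy

# Gnowsis XML-RPC endpoint on localhost
server = ServerProxy("http://127.0.0.1:9993")
# print(server.listMethods())
# The method name contains a hyphen, so it has to be looked up with getattr
print(getattr(server, "gnowsis-server_dataaccess.querySelect")("gunnar", "fulltext"))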
And it works! As soon as I can convince Leo we will rewrite the whole gnowsis GUI in python! :)
Should anyone feel like trying this, the code is in svn…
With the latest wiki stuff working Gnowsis is really starting to free itself from the chains of ugly Swing GUI hell.
Posted by gromgull at 11:06 pm on March 8th, 2006.
Categories: Uncategorized.