A quick and dirty guide to YOUR first time with RDF

(This is written for the challenge from http://memespring.co.uk/2011/01/linked-data-rdfsparql-documentation-challenge/)

(To save you copy/pasting/typing you can download the examples from here: http://gromgull.net/2011/01/firstRDF/)

10 steps to make sense of RDF data:

  1. Install a Debian- or Ubuntu-based system. I used Debian testing.
  2. Install rdflib and Berkeley/Sleepycat DB by doing:
    sudo apt-get install python-rdflib python-bsddb3
    

    (I got rdflib version 2.4.2. If you get version 3.X.X the code may look slightly different; let me know if you cannot work out the changes on your own.)

  3. Find some data — I randomly picked the data behind the BIS Research Funding Explorer. You can find the raw RDF data on the source.data.gov.uk/data/ server. We will use the schema file from:
    http://source.data.gov.uk/data/research/bis-research-explorer/2010-03-04/research-schema.rdf

    and the education data from:

    http://source.data.gov.uk/data/education/bis-research-explorer/2010-03-04/education.data.gov.uk.nt

    We use the education data because it is smaller than the research data, only 500K vs 11M, and because there is a syntax error in the corresponding file for research :). In the same folders there are files called blahblah-void. These are statistics about the datasets, and we do not need them for this (see http://vocab.deri.ie/void/ for details).

  4. Load the data. Type this into a Python shell, or put it in a Python file and run it:
    import rdflib
    
    g=rdflib.Graph('Sleepycat')
    g.open("db")
    
    g.load("http://source.data.gov.uk/data/education/bis-research-explorer/2010-03-04/education.data.gov.uk.nt", format='nt')
    g.load("http://source.data.gov.uk/data/research/bis-research-explorer/2010-03-04/research-schema.rdf")
    
    g.close()
    

    Note that the two files are in different RDF formats. Both contain triples, but one is serialized as XML and the other in an ASCII line-based format called N-Triples. You do not have to care about this; just tell rdflib to use the right parser with the format=X parameter. RDF/XML is the default.
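
To make the N-Triples format a bit more concrete, here is a rough sketch of how one line splits into subject, predicate and object. It is Python 3, standard library only, and the regular expression is a simplification made up for this illustration (no blank nodes, language tags, datatypes or escapes). It is not how rdflib parses the file.

```python
import re

# Simplified shape of a line: <subject> <predicate> (<object-uri> | "literal") .
# Assumption: no blank nodes, language tags, datatypes or escaped quotes.
NT_LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.\s*$')


def parse_nt_line(line):
    """Return (subject, predicate, object) for one simple N-Triples line."""
    m = NT_LINE.match(line.strip())
    if m is None:
        raise ValueError("not a simple N-Triples line: %r" % line)
    s, p, o_uri, o_lit = m.groups()
    # Exactly one of the two object groups matched.
    return (s, p, o_uri if o_uri is not None else o_lit)


# One triple per line, full URIs in angle brackets, terminated by a dot.
example = ('<http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon> '
           '<http://research.data.gov.uk/def/project/organisationName> '
           '"University College London" .')

triple = parse_nt_line(example)
print(triple)
```

The RDF/XML file carries the same kind of triples, just wrapped in XML elements, which is why the format parameter is the only thing that changes between the two load calls.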

  5. After the script has run there will be a new folder called db in the current directory. It contains the Berkeley DB files and indexes for the data. For the above example it is about 1.5M.
  6. Explore the data a bit. Again, type this into a Python shell:
    • First open the DB again:
      import rdflib
      g=rdflib.Graph('Sleepycat')
      g.open("db")
      len(g)
      
      -- Outputs: 3690 --
      

      The graph object is quite pythonic, and you can treat it like a collection of triples. Here len tells us we have loaded 3690 triples.

    • Find out what sorts of things this data describes. In RDF, things are typed by a triple with rdf:type as the predicate.
      for x in set(g.objects(None, rdflib.RDF.RDFNS["type"])): print x
      
      -- Outputs:
      http://www.w3.org/2002/07/owl#ObjectProperty
      http://www.w3.org/2002/07/owl#DatatypeProperty
      http://xmlns.com/foaf/0.1/Organization
      http://purl.org/vocab/aiiso/schema#Institution
      http://research.data.gov.uk/def/project/Location
      http://www.w3.org/1999/02/22-rdf-syntax-ns#Property
      http://www.w3.org/2000/01/rdf-schema#Class
      --
      

      rdflib gives you several handy functions that return Python generators for simple triple-based queries. Here we used graph.objects, which takes two parameters, the subject and predicate to filter for, and returns a generator over all matching objects. rdflib also provides constants for the well-known RDF and RDF Schema vocabularies; we used one here to get the correct URI for the rdf:type predicate.

    • Now that we know the data contains some Institutions, get a list of them using another rdflib triple-based query:
      for x in set(g.subjects(rdflib.RDF.RDFNS["type"], rdflib.URIRef('http://purl.org/vocab/aiiso/schema#Institution'))): print x
      
      -- Outputs:
      http://education.data.gov.uk/id/institution/UniversityOfWolverhampton
      http://education.data.gov.uk/id/institution/H-0081
      http://education.data.gov.uk/id/institution/H-0080
      ... (and many more) ...
      --
      

      This gives us a long list of all institutions. The set call here just iterates through the generator and removes duplicates.

    • Let's look at the triples about one of them in more detail:
      for t in g.triples((rdflib.URIRef('http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon'), None, None)): print map(str,t)
      -- Outputs:
      ['http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon', 'http://research.data.gov.uk/def/project/location', 'http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon/WC1E6BT']
      ['http://education.data.gov.uk/id/institution/UniversityColledgeOfLondon', 'http://research.data.gov.uk/def/project/organisationName', 'University College London']
      ... (and many more) ...
      --
      

      This gives us a list of triples asserted about UCL. Here we used the triples method of rdflib, which takes a single argument: a tuple representing the triple filters. The returned triples are also tuples; the map(str,t) just makes the output prettier.
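
If the None-as-wildcard idea is still fuzzy, the following toy model may help. It is a sketch in plain Python 3 over a list of string tuples. The function names mirror the rdflib methods used above, but the implementation and the sample label are made up for illustration. The real methods live on the Graph object and return rdflib terms rather than strings.

```python
def triples(store, pattern):
    """Yield every triple in store that matches pattern; None matches anything."""
    for triple in store:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


def objects(store, subject=None, predicate=None):
    """Yield the object of every triple matching subject and predicate."""
    for _, _, o in triples(store, (subject, predicate, None)):
        yield o


def subjects(store, predicate=None, obj=None):
    """Yield the subject of every triple matching predicate and object."""
    for s, _, _ in triples(store, (None, predicate, obj)):
        yield s


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
INSTITUTION = "http://purl.org/vocab/aiiso/schema#Institution"
BASE = "http://education.data.gov.uk/id/institution/"

# Tiny stand-in for the loaded graph; the label text is invented.
store = [
    (BASE + "H-0080", RDF_TYPE, INSTITUTION),
    (BASE + "H-0081", RDF_TYPE, INSTITUTION),
    (BASE + "H-0081", RDFS_LABEL, "made-up label"),
]

types = set(objects(store, None, RDF_TYPE))                       # which types occur?
institutions = sorted(set(subjects(store, RDF_TYPE, INSTITUTION)))  # who has that type?
about = list(triples(store, (BASE + "H-0081", None, None)))       # everything about one subject

print(types)
print(institutions)
print(about)
```

All three helpers are generators, which is why the examples wrap them in set or list before printing.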

  7. rdflib makes it very easy to work with triple-based queries, but for more complex queries you quickly need SPARQL. This is also straightforward:
    PREFIX="""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX p: <http://research.data.gov.uk/def/project/>
    PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    """
    
    list(g.query(PREFIX+"SELECT ?x ?label WHERE { ?x rdfs:label ?label ; a aiiso:Institution . } "))[:10]
    

    The prefixes defined at the start let us use short names instead of full URIs in the queries. The graph.query method returns a generator over tuples of variable bindings. This lists the first 10. It is more or less the same as what we did before, listing all institutions, but this time we also get the human-readable label.
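
What a prefix does is plain string substitution, which the sketch below imitates in Python 3. The expand helper is hypothetical and exists only for this illustration; the SPARQL engine does the equivalent internally. The bare keyword a in the query is SPARQL shorthand for rdf:type.

```python
# A few of the prefixes declared in the PREFIX block above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "aiiso": "http://purl.org/vocab/aiiso/schema#",
    "p": "http://research.data.gov.uk/def/project/",
}


def expand(name, prefixes=PREFIXES):
    """Hypothetical helper: turn 'prefix:local' into the full URI."""
    if name == "a":  # SPARQL keyword, shorthand for rdf:type
        name = "rdf:type"
    prefix, local = name.split(":", 1)
    return prefixes[prefix] + local


print(expand("aiiso:Institution"))  # → http://purl.org/vocab/aiiso/schema#Institution
print(expand("rdfs:label"))         # → http://www.w3.org/2000/01/rdf-schema#label
print(expand("a"))                  # → http://www.w3.org/1999/02/22-rdf-syntax-ns#type
```

The first result is the same Institution URI that was spelled out in full in the triple-based query in step 6.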

  8. Now a slightly more complicated example. Ask the knowledge base to find all institutions classified as public sector that took part in some project together:

     
    
    r=list(g.query(PREFIX+"""SELECT DISTINCT ?x ?xlabel ?y ?ylabel WHERE { 
       ?x rdfs:label ?xlabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p . 
    
       ?y rdfs:label ?ylabel ; 
          a aiiso:Institution ; 
          p:organisationSize 'Public Sector' ; 
          p:project ?p .
    
       FILTER (?x != ?y) } LIMIT 10 """))
    
    for x in r[:3]: print map(str,x)
    
    -- Outputs:
    ['http://education.data.gov.uk/id/institution/H-0155', 'Nottingham University', 'http://education.data.gov.uk/id/institution/H-0159', 'The University of Sheffield']
    ['http://education.data.gov.uk/id/institution/H-0155', 'University of Nottingham', 'http://education.data.gov.uk/id/institution/H-0159', 'University of Sheffield']
    ['http://education.data.gov.uk/id/institution/H-0159', 'Sheffield University', 'http://education.data.gov.uk/id/institution/H-0155', 'University of Nottingham']
    --
    

    All fairly straightforward; the FILTER is there to make sure the two institutions we find are not the same.
    (Disclaimer: there is a bug in rdflib (http://code.google.com/p/rdfextras/issues/detail?id=2) that makes this query take a very long time :( It should be near instantaneous, but takes maybe 10 seconds for me.)
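
The shape of this query is a self-join on the shared ?p variable. The sketch below redoes that join in plain Python 3 on made-up data (the project ids are invented, and labels and the 'Public Sector' condition are left out) to show why each pair turns up in both orders in the output above.

```python
from itertools import permutations

# Made-up toy data: institution id -> set of project ids.
# These are NOT values from the real dataset.
projects = {
    "H-0155": {"proj-1", "proj-2"},
    "H-0159": {"proj-1"},
    "H-0080": {"proj-3"},
}


def collaborating_pairs(projects):
    """Ordered pairs (x, y) with x != y that share at least one project."""
    pairs = []
    # permutations() never pairs an item with itself, like FILTER (?x != ?y).
    for x, y in permutations(sorted(projects), 2):
        if projects[x] & projects[y]:  # a common project plays the role of ?p
            pairs.append((x, y))
    return pairs


pairs = collaborating_pairs(projects)
print(pairs)  # → [('H-0155', 'H-0159'), ('H-0159', 'H-0155')]
```

The != test only rules out an institution being paired with itself, so (x, y) and (y, x) both survive. In this sketch, swapping permutations for combinations would keep a single ordering.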

  9. The data we loaded so far does not have any details on the projects that actually got funded, only their URIs, for example: http://research.data.gov.uk/doc/project/tsb/100232. You can go there with your browser and find out that this is a project called “Nuclear transfer enhancement technology for bio processing and tissue engineering”. Luckily so can rdflib: just call graph.load on the URI. Content negotiation on the server will make sure that rdflib gets machine-readable RDF when it asks. A for-loop over an rdflib triple query that loads all the project descriptions is left as an exercise to the reader :)
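
Content negotiation boils down to an HTTP Accept header. The sketch below (Python 3, standard library only) builds such a request without sending it. The exact header rdflib sends is not shown in this post, so application/rdf+xml, the media type for RDF/XML, is an assumption, and rdf_request is a made-up helper name.

```python
from urllib.request import Request


def rdf_request(uri):
    """Build (but do not send) an HTTP request that asks for RDF/XML."""
    # Assumption: the server picks the representation from the Accept header.
    return Request(uri, headers={"Accept": "application/rdf+xml"})


# The example project URI from the text; a set removes duplicates.
project_uris = {"http://research.data.gov.uk/doc/project/tsb/100232"}

prepared = [rdf_request(uri) for uri in sorted(project_uris)]
for req in prepared:
    print(req.full_url, req.get_header("Accept"))
```

A browser asks for HTML at the same URI and gets a web page back. Passing one of these requests to urllib.request.urlopen would actually fetch the document, assuming the server is still online.
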
  10. That’s it! There are many places to go from here, just keep trying things out – if you get stuck try asking questions on http://www.semanticoverflow.com/ or in the IRC chatroom at irc://irc.freenode.net:6667/swig. Have fun!

6 comments.

  1. Nice tutorial, but maybe a warning will help people avoid the pitfall I fell into:
    If you ever find yourself iterating over a lot of IRIs the way described in 9 you should try to find the responsible SPARQL endpoint
    – that would be faster
    – cause less traffic
    – and is probably more complete as dereferencing the IRI may return an incomplete set (as mentioned by TimBL in http://www.w3.org/DesignIssues/LinkedData.html).

    Bigger warning for DBpedia: they simply cut off after the first 2001 triples:
    g = rdflib.Graph()
    g.load("http://dbpedia.org/resource/United_States")
    len(g)
    # 2001
    print [x for x in g.objects(None, rdflib.RDFS["label"])]
    # []
    Not even a label in the result set :(

  2. @Jörn: good point – this is a bug in dbpedia as far as I am concerned, but I do not think they agree. I wonder if some HTTP hack could be used to “fix” this – i.e. a special return code “20X OK – but that’s not all!” (perhaps 206 Partial Content?) and then allow an offset to get the rest?

  3. […] application. You can find contributions by Bill Roberts, Christopher Gutteridge, Pezholio, Gunnar Aastrand Grimnes, Tom Morris, Jeni Tennison (and here), Niklas Lindström, Felix Ostrowski, and John Goodwin. […]

  4. Good day! I have a question. How can I use a variable instead of the value 'Institution'?
    """SELECT DISTINCT ?x ?xlabel ?y ?ylabel WHERE {
    ?x rdfs:label ?xlabel ;
    a aiiso:Institution ;
    p:organisationSize 'Public Sector' ;
    p:project ?p .

    For example:
    val = 'Institution'
    """SELECT DISTINCT ?x ?xlabel ?y ?ylabel WHERE {
    ?x rdfs:label ?xlabel ;
    a aiiso:val ;
    p:organisationSize 'Public Sector' ;
    p:project ?p .
    ????

  5. Only 5 months later :(

    For the record, you can put variables in any place in the query:
    SELECT DISTINCT * WHERE {
    ?x rdfs:label ?xlabel ;
    a ?type ;
    p:organisationSize 'Public Sector' ;
    p:project ?p .
    }

    Now you get the type of the resource back as a variable.
    You can also FILTER on this variable if you want, checking that it contains a certain string for example.

    If you compose your query programmatically, you can either build the query from Python strings or, better, keep the query as above and pass initBindings={"type": MY_TYPE} to graph.query.
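
For the first option, building the query from Python strings, a minimal sketch could look like the following (Python 3, standard library only). QUERY_TEMPLATE and build_query are names made up for this example, and the identifier check is an extra precaution rather than anything rdflib requires. The PREFIX block from step 7 still has to be prepended before passing the result to graph.query.

```python
import re

# Shortened version of the query from the question; %s marks the class name.
QUERY_TEMPLATE = """SELECT DISTINCT ?x ?xlabel WHERE {
  ?x rdfs:label ?xlabel ;
     a aiiso:%s ;
     p:organisationSize 'Public Sector' ;
     p:project ?p .
}"""


def build_query(local_name):
    """Hypothetical helper: insert a class name into the query text."""
    # Only accept plain identifiers so a stray value cannot change the query.
    if not re.match(r"^[A-Za-z_][A-Za-z0-9_]*$", local_name):
        raise ValueError("unexpected class name: %r" % local_name)
    return QUERY_TEMPLATE % local_name


val = "Institution"
query = build_query(val)
print(query)
```

The initBindings route avoids pasting strings together at all, which is why it is the better choice when it is available.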

  6. […] A quick and dirty guide to YOUR first time with RDF: another example of querying Uk government data found on data.gov.uk using RdfLib and Berkely/Sleepycat DB. […]
