The GTR technical approach

Paul Chitson, BBSRC Head of Information Services

The GtR technical team have started the process of pulling together data from various Research Council funded projects and building a technology platform through which to make it available to a wide variety of audiences.

The project is being managed and delivered within an Agile framework, and is perhaps the first cross-council project in the Research Councils to adopt this approach. This means we will not only be trying to hit key milestones but also aiming to release as much information and as many early previews as we can. This blog post is the first from the technical team and, we hope, sets a precedent for the way we wish to continue.

The process began with the gathering of user stories from potential audiences. These have since enabled us to understand the content that is of interest, the ways in which people may want to interface with the data and, of equal importance, what people do not want.

The requirements gathering phase has informed our initial thoughts on the solution architecture and how the data will be stored.

The data will be extracted from various sources, transformed and then stored in two forms. The first will be in a relational database using the CERIF data model. Initially, the intention was to provide an interface that would output data in a CERIF format, but following a workshop with Keith Jeffery (President of euroCRIS) we felt that we could use it as the internal storage model. However, we also have a commitment to provide data in a linked data format (RDF) via a SPARQL interface, so we have chosen to store the data a second time as RDF in a triple store (rather than use an RDF wrapper over the relational data) and benefit from a SPARQL interface out of the box. We will, wherever possible, use existing open source technologies and integrate rather than build from scratch. Any code or customisations we make will be made freely available.
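
To make the two-form storage concrete, here is a minimal sketch in Python. It is an illustration only: the table and column names, the example.org URIs and the sample record are all hypothetical, a plain list of tuples stands in for a triple store, and the real CERIF schema and GtR pipeline are considerably richer than this.

```python
import sqlite3

# Hypothetical extracted record; the real source systems and fields will differ.
extracted = [
    {"id": "P1", "title": "  Wheat genomics ", "funder": "BBSRC"},
]

def transform(record):
    """Normalise one extracted record (here: just trim whitespace)."""
    return {key: value.strip() for key, value in record.items()}

def store_relational(conn, record):
    """Form 1: a row in a relational table (simplified, not the CERIF schema)."""
    conn.execute(
        "INSERT INTO project (id, title, funder) VALUES (?, ?, ?)",
        (record["id"], record["title"], record["funder"]),
    )

def to_triples(record, base="http://example.org/project/"):
    """Form 2: subject-predicate-object triples, as a triple store would hold."""
    subject = base + record["id"]
    return [
        (subject, "http://example.org/vocab/title", record["title"]),
        (subject, "http://example.org/vocab/funder", record["funder"]),
    ]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (id TEXT PRIMARY KEY, title TEXT, funder TEXT)")

triples = []
for raw in extracted:
    clean = transform(raw)
    store_relational(conn, clean)
    triples.extend(to_triples(clean))

rows = conn.execute("SELECT id, title, funder FROM project").fetchall()
```

Both stores are filled from the same transformed record, so the relational and RDF views should stay consistent with each other.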

GtR technology at a glance:

  • Data storage model will be CERIF. This ensures compliance with many universities, research organisations and consumers of research data;
  • A user portal with full text and faceted search (SOLR);
  • REST-based interface providing rich information in both human and machine readable forms;
  • SPARQL interface for querying a triple store (initially using the JENA Software stack);
  • Web services providing data in OAI-PMH and CERIF formats.

 

We are at the start of an exciting journey and invite you to add your input along the way.

13 Responses to The GTR technical approach

  1. Simon Kerridge August 2, 2012 at 16:49 #

    Looks like a great start to GtR.
    You will know that there are a number of JISC funded projects that are currently using CERIF, in particular the IRIOS-2 Project, which is converting RCUK Grants data into CERIF. You may well find it useful to speak to the project team about the various possible representations of the RCUK data in CERIF, as they may already have solved some of the problems that you may encounter. Perhaps you might wish to speak to Kevin Ginty, the project manager at Sunderland.

    • Darren Hunter August 3, 2012 at 15:40 #

      Thanks Simon – will do.

  2. Christopher Gutteridge August 29, 2012 at 08:04 #

    Hi, my name is Christopher Gutteridge, I run http://data.southampton.ac.uk/ and I’m currently working with a project to get universities to start sharing lists of (some of) their facilities and equipment, using linked open data. At first glance this could be seen as competing but that is certainly not our intent.

    Part of the indirect benefit will be to get universities to start thinking about URIs for things.

    Our data model is (deliberately) less detailed than CERIF, as it does not add the temporal reification, i.e. we say “University X has Facility Y” whereas, if I’ve understood CERIF, it would say “University X and Facility Y had a relationship between /daterange/”. This is a level of complexity too far for what we’re doing, but certainly not incompatible.
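
The difference between the two styles can be sketched in a few lines of Python. This is a rough illustration rather than either project's actual model: the `Link` class, the `hasFacility` predicate and the sample dates are hypothetical. Each dated link carries its own date range, and the simple statements are derived by keeping only the links that are current on a given day.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Link:
    """A relationship that carries its own date range (the reified form)."""
    subject: str
    predicate: str
    obj: str
    start: date
    end: Optional[date] = None  # None means "still current"

def current_statements(links, today):
    """Collapse dated links into plain (subject, predicate, object) statements,
    keeping only those that are current on the given day."""
    return [
        (link.subject, link.predicate, link.obj)
        for link in links
        if link.start <= today and (link.end is None or today <= link.end)
    ]

links = [
    Link("University X", "hasFacility", "Facility Y", date(2005, 1, 1)),
    Link("University X", "hasFacility", "Facility Z", date(2001, 1, 1), date(2009, 12, 31)),
]

simple = current_statements(links, today=date(2012, 8, 29))
```

Here only the link to Facility Y survives, because the link to Facility Z ended in 2009. Going the other way (from simple statements back to dated links) is not possible without extra information, which is why the simpler model is compatible with the richer one but not equivalent to it.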

    Where possible we’d prefer to use CERIF terms, but I’m not sure what the preferred namespace is. There are a few projects to map CERIF to RDF but these don’t appear official.

    It seems quite likely that some of the data we’re shaking loose will be meat for your grinder.

    On behalf of the data..ac.uk community, we’ve registered data.ac.uk with the intent of using subdomains for aggregator services and to provide ‘cool uris’ for elements in uk academia, so they don’t get stuck with shonky URIs based around the domain of the university or project which initially defines them. I can see initiatives like GTR being able to make good use of this.

    Lastly, we’ve a free two-day seminar about sharing data about university equipment, coming up in the middle of September: http://www.uniquip.ecs.soton.ac.uk/

    • Christopher Gutteridge August 29, 2012 at 08:05 #

      ps.
      Our data can be browsed here: http://data.southampton.ac.uk/facilities.html (a lightweight portal on top of the RDF)
      raw data here:
      http://data.southampton.ac.uk/dataset/facilities.html

    • Paul Chitson September 3, 2012 at 12:26 #

      Christopher,
      Thanks for the post and useful information. We have been wrestling with the CERIF/RDF conundrum as well and would be interested in seeing how your work develops and potentially linking it in with the information available on GTR. Your work on facilities sharing fits well with one of the user stories we have, which is “how do I find facilities of a certain type in my area”. Perhaps they may also want a geographic search facility, but I can see a lot of value in linking this data to our GtR pages (so we can show what facilities an organisation has, or link to more information on a specific facility used by a project). Will you maintain URIs for facilities that are no longer current? GtR displays data on both current and past projects and will therefore reference facilities that have probably ceased operation.
      The seminar sounds interesting although I’m not sure I can make both days.

      • Christopher Gutteridge September 3, 2012 at 12:39 #

        My approach will be to ask to get the data under an open license so that it can very easily be reused by other projects.

        URIs: The plan for this is to use the organisation’s own URI for things, and mint “placeholder” ones if they are not yet doing that themselves. It’s clear that there’s more to facilities and equipment than what is seen from the “top down” view. One of our early decisions is that when we say “Facility” we mean something fairly loose, not the strict definition of an “RCUK costed facility”.

        We believe that a practical solution is to ask organisations to publish RDF or simple CSV/Excel files on their website. If they publish CSV we’ll ask them to publish a boilerplate short RDF file listing the location(s) of the CSV files and the license etc. This allows a verifiable chain of trust and licenses while providing a single central dataset for people to access.
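
As a rough sketch of that arrangement, the Python below builds a tiny N-Triples manifest pointing at a CSV file and its license, and parses the CSV itself. Everything specific here is an assumption made for illustration: the CSV columns, the example.org URIs, and the choice of the Dublin Core `license` and DCAT `downloadURL` properties (attached directly to the dataset as a simplification). It is not the format the project actually settled on.

```python
import csv
import io

# Hypothetical CSV an organisation might publish; the column names are assumptions.
CSV_TEXT = "name,type,contact\nNMR Suite,Facility,nmr@example.org\n"

DCT_LICENSE = "http://purl.org/dc/terms/license"
DCAT_DOWNLOAD = "http://www.w3.org/ns/dcat#downloadURL"

def manifest_ntriples(dataset_uri, csv_urls, license_uri):
    """Build a tiny N-Triples manifest: where the CSV files are and their license."""
    lines = [f"<{dataset_uri}> <{DCT_LICENSE}> <{license_uri}> ."]
    for url in csv_urls:
        lines.append(f"<{dataset_uri}> <{DCAT_DOWNLOAD}> <{url}> .")
    return "\n".join(lines)

def read_facilities(csv_text):
    """Parse the published CSV into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

manifest = manifest_ntriples(
    "http://example.org/dataset/facilities",
    ["http://example.org/facilities.csv"],
    "http://example.org/licence/open",
)
facilities = read_facilities(CSV_TEXT)
```

An aggregator would only need to fetch the manifest from the organisation's own website to learn where the data lives and under what terms it may be reused.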

        Historical data. For this purpose we’re absolutely not interested in things which are not current. This reduces the requirements on data providers to something manageable, but this could easily be munged out of CERIF RDF.

        Generally I’m trying to bootstrap something pragmatic and not adding anything non-essential.

        • Christopher Gutteridge September 3, 2012 at 12:40 #

          ps. You’d be very welcome for a single day. There’s a meal on the first evening.

          • Paul Chitson September 27, 2012 at 15:41 #

            It was good to see you at UNIQUIP. It was refreshing to be in a group of people who had a desire to get something working and start simple. I felt that the proposed model, of sites putting some limited information on facilities out quickly and pulling that into a searchable repository on data.ac.uk, was something that I would recommend within BBSRC. The short discussion about licensing was also helpful, as GtR has been considering what license to publish under initially, and trying to use the same terms and conditions across open data sites would be helpful. Although publishing under CC-0 obviates the need to understand any license, I’m not sure we are ready for that yet.

            I believe that if you get the facilities up on data.ac.uk we would like to harvest/present the information through GTR if possible, as some SMEs have stated that they would like to find institutions that have certain capabilities to work with, and that is more about current facilities than historic ones. The historic information is to facilitate a reader’s understanding of how a research project was conducted.

            I was also listening in during your discussion with Brigitte around the CERIF model, its potential use for facilities and the difficulties that might pose. My colleagues have been working closely with Brigitte (who has been extremely helpful and spent much time supporting the mapping process) and found the process both beneficial and a little more time consuming than using your own bespoke data model. Your suggestion that they publish cookbook type examples (akin to those on GoodRelations) demonstrating how the model might be applied to certain problems could help in its adoption.

        • Brigitte Joerg December 5, 2012 at 20:59 #

          Yes, in CERIF XML (which is intended for exchange) the dates are optional. In the relational CERIF world, however, the dates are part of the primary key, i.e. the identifier of the link entity. Different worlds, different rules.

          So, in RDF, dates are not common in links, and CERIF XML probably comes closest to meeting these requirements.
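
The effect of putting the dates in the primary key can be seen with a small SQLite sketch. The table and column names below are hypothetical and heavily simplified, not taken from the CERIF schema; the point is only that the date range forms part of the link's identity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified link table: the date range is part of the primary key.
conn.execute(
    """
    CREATE TABLE org_facility_link (
        org_id      TEXT NOT NULL,
        facility_id TEXT NOT NULL,
        start_date  TEXT NOT NULL,
        end_date    TEXT NOT NULL,
        PRIMARY KEY (org_id, facility_id, start_date, end_date)
    )
    """
)

insert = "INSERT INTO org_facility_link VALUES (?, ?, ?, ?)"

# The same pair may be linked twice, provided the date ranges differ.
conn.execute(insert, ("X", "Y", "2001-01-01", "2005-12-31"))
conn.execute(insert, ("X", "Y", "2008-01-01", "2012-12-31"))

# Repeating an identical link (same pair, same dates) violates the key.
try:
    conn.execute(insert, ("X", "Y", "2001-01-01", "2005-12-31"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM org_facility_link").fetchone()[0]
```

Because the dates identify the link, a relational record cannot simply omit them, whereas an exchange format with optional dates can.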

          “Historical data. For this purpose we’re absolutely not interested in things which are not current. This reduces the requirements on data providers to something manageable, but this could easily be munged out of CERIF RDF.”

  3. Brigitte Joerg December 5, 2012 at 21:06 #

    yes, we are in contact with Chris to enable the Soton Facilities RDF requirements (currently) from a CERIF XML Facility description. Having requirements is like a cookbook.

  4. Jenny grostock December 12, 2012 at 01:01 #

    Are there any GtR release dates coming up, or mailing lists to join?

    • Paul Chitson December 12, 2012 at 14:11 #

      GTR Beta was launched officially at 12:00 on 12/12/2012 and is now publicly available at http://gtr.rcuk.ac.uk/. The beta has a static data set which will not be updated (except to fix issues) until early 2013. At some point in the first quarter of 2013 we hope to have a more regular pull of data and further APIs available.

      Each of the detailed screens (non-search) also has an XML version which can be accessed by setting the response type header or adding “.xml” to the end of the ID. There is also a rudimentary CERIF interface which we will publish details of over the coming days/weeks.
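
The two routes to the XML version might look like this from a Python client. This is a sketch under assumptions: the path and ID are made up for illustration, and "response type header" is read here as the standard HTTP `Accept` header, which the comment above does not confirm. No request is actually sent.

```python
from urllib.request import Request

BASE = "http://gtr.rcuk.ac.uk"

def xml_request_via_header(path):
    """Ask for XML through content negotiation (assumes the Accept header is meant)."""
    return Request(BASE + path, headers={"Accept": "application/xml"})

def xml_url_via_suffix(path):
    """Ask for XML by appending ".xml" to the ID at the end of the path."""
    return BASE + path + ".xml"

# Hypothetical path and ID, for illustration only.
detail_path = "/project/EXAMPLE-ID"

request = xml_request_via_header(detail_path)
url = xml_url_via_suffix(detail_path)
# Either could then be passed to urllib.request.urlopen() to fetch the document.
```

The suffix form is convenient for pasting into a browser, while the header form keeps a single URL per resource.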
