GTR – Dealing with the Data Challenges

At the outset of the GTR project, one of the major challenges we identified was establishing the scope of the data and the semantics of terms associated with it.

A little background to the challenge

In 2011, the seven Research Councils finished consolidating 5 grants transaction processing systems into a single shared solution. This has helped to reduce some of the differences between the format of council data, but not always the meaning of values held. Whilst many of the semantic differences are purely a labelling issue, some are more deeply entrenched in the ways councils work with their respective research communities. To this end, we are taking time to unpick and agree a common set of terms under which we can publish data.

Alongside these semantic equivalence challenges come issues over whether we are permitted to publish some of the data we hold. The GtR portal will be publishing data from various Research Council systems, and these systems are often the result of previous projects to migrate legacy data. The terms under which the information was originally gathered vary on whether the data can be made public, although they generally permit the Research Councils to publish information for the furtherance of their missions. We are working to ensure that information gathered for one purpose is fit for release and does not break any expectations of privacy or confidentiality.

The form and content of data gathered for one purpose (e.g. to provide evidence of the impact of research) may not be suitable, or indeed easily consumable, for another purpose. It may also be missing linkages that would have been made had the aim been publication. For example, we may not be able to link a publication to more than one person on a research project, as we gather it as evidence of outcomes against a project. Over time, as GtR develops, we would expect it to inform the way we gather data, so that the data delivers value both as evidence of what we have funded and as a resource for those wanting to exploit it in either a public or commercial interest.

Why did we choose CERIF?

CERIF was chosen as the storage mechanism for GtR data for two main reasons:

  • The common challenges of storing research and related data have already been discussed and documented by a wide group of skilled and knowledgeable people. Developing our own bespoke model to deal with the same challenges seemed wasted effort.
  • CERIF is highly regarded within the research community, and we felt it was important to store the data in a consistent way to facilitate exploitation and information exchange with many universities, research organisations and consumers of research data.

Following our decision to adopt CERIF we invested time in understanding the model and sought the advice of Brigitte Joerg. Brigitte has been instrumental in helping us understand the CERIF model and how to populate it with data from our staging area.

The Cost of Adoption

Adopting CERIF brings a steep learning curve and also adds time. Most of this additional time has been spent documenting and defining the semantics of every attribute and how it relates to others. This in-depth understanding of each attribute, notwithstanding the challenge of getting 7 bodies to agree, is critical to identifying where it belongs in the CERIF model. If your data is already well defined then you will be in a strong position. If, like most systems, it is not, then you will need to agree a set of terms and definitions with the business for every attribute that you want to map to CERIF before proceeding.

This issue most clearly manifests itself in populating the CERIF semantic layer with your vocabularies. Only when the vocabulary data has been added can you begin loading your data sets. For example:

Project A has a current status of Authorised. To be able to represent that information against a project in CERIF you need to populate the CERIF semantic layer with two key pieces of data: (a) the definition of project status, and (b) the meaning of the term ‘Authorised’. With this semantic reference data in place you can relate the project to the “Authorised” status. This has a lot of similarities with the open data RDF world, but CERIF holds its semantic information both in the table structure (the attributes of the tables) and in the contents of the semantic layer.
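To make the Project A example a little more concrete, here is a minimal sketch using SQLite. The table names follow CERIF conventions (cfClassScheme, cfClass, cfProj, cfProj_Class), but the columns are heavily trimmed, terms are kept directly on cfClass for brevity, and every identifier and definition is invented for illustration. It is not the GtR schema or loading code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the references below
conn.executescript("""
    -- Semantic layer: a scheme ("what is project status?") and its terms.
    CREATE TABLE cfClassScheme (cfClassSchemeId TEXT PRIMARY KEY, cfDescr TEXT);
    CREATE TABLE cfClass (
        cfClassId TEXT PRIMARY KEY,
        cfClassSchemeId TEXT NOT NULL REFERENCES cfClassScheme(cfClassSchemeId),
        cfTerm TEXT NOT NULL,
        cfDef TEXT);
    -- Base entity and the link table that classifies it.
    CREATE TABLE cfProj (cfProjId TEXT PRIMARY KEY, cfAcro TEXT);
    CREATE TABLE cfProj_Class (
        cfProjId TEXT NOT NULL REFERENCES cfProj(cfProjId),
        cfClassSchemeId TEXT NOT NULL REFERENCES cfClassScheme(cfClassSchemeId),
        cfClassId TEXT NOT NULL REFERENCES cfClass(cfClassId));
""")

def set_project_class(conn, proj_id, scheme_id, class_id):
    """Relate a project to a term from the semantic layer."""
    conn.execute(
        "INSERT INTO cfProj_Class (cfProjId, cfClassSchemeId, cfClassId) "
        "VALUES (?, ?, ?)",
        (proj_id, scheme_id, class_id))

def project_terms(conn, proj_id, scheme_id):
    """Return the terms a project carries within one classification scheme."""
    rows = conn.execute(
        "SELECT c.cfTerm FROM cfProj_Class pc "
        "JOIN cfClass c ON c.cfClassId = pc.cfClassId "
        "WHERE pc.cfProjId = ? AND pc.cfClassSchemeId = ?",
        (proj_id, scheme_id)).fetchall()
    return [row[0] for row in rows]

conn.execute("INSERT INTO cfProj VALUES ('project-a', 'Project A')")

# Without the vocabulary in place the status cannot be loaded.
try:
    set_project_class(conn, "project-a", "project-status", "authorised")
    loaded_without_vocabulary = True
except sqlite3.IntegrityError:
    loaded_without_vocabulary = False

# (a) define what "project status" means, (b) define the term "Authorised".
conn.execute(
    "INSERT INTO cfClassScheme VALUES "
    "('project-status', 'Illustrative definition of project status')")
conn.execute(
    "INSERT INTO cfClass VALUES "
    "('authorised', 'project-status', 'Authorised', "
    "'Illustrative definition of Authorised')")

set_project_class(conn, "project-a", "project-status", "authorised")
print(project_terms(conn, "project-a", "project-status"))
```

The first attempt to record the status is rejected because the scheme and term do not exist yet, which is the ordering constraint described above: vocabulary first, data sets second.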

Another key point is understanding what to do with free text and date fields that the CERIF model hasn’t catered for. There are a couple of options for storing them. The simplest is to extend the existing tables by adding attributes, but this is definitely not supported by the CERIF task group. The more appropriate route is to add additional tables containing the attributes. Of course, if you want the additional fields to have language variations, then you face a decision: adopt the same model used in CERIF, and so maintain some logical consistency between the model and the extension, or adopt your own model. However you choose to resolve the issue of unsupported fields, you should try to maintain clean semantics, as it is tempting to put something in that solves a problem now and then live with the consequences of that decision going forward.
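As a rough illustration of the "additional tables" route, the sketch below adds a separate table for a free text field and keys it by project, language code and translation type, which is our reading of the pattern CERIF uses for its own multilingual attributes. The table xProjImpactSummary, its column xImpactSummary, the sample text and the lookup helper are all hypothetical names made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE cfProj (cfProjId TEXT PRIMARY KEY);
    -- Hypothetical extension table: the core cfProj table is left untouched.
    -- One row per project, language and translation type.
    CREATE TABLE xProjImpactSummary (
        cfProjId TEXT NOT NULL REFERENCES cfProj(cfProjId),
        cfLangCode TEXT NOT NULL,
        cfTrans TEXT NOT NULL,          -- assumed: 'o' original, 'h' human translation
        xImpactSummary TEXT NOT NULL,
        PRIMARY KEY (cfProjId, cfLangCode, cfTrans));
""")

def impact_summary(conn, proj_id, lang, fallback="en"):
    """Return the summary in the requested language, else the fallback, else None."""
    row = conn.execute(
        "SELECT xImpactSummary FROM xProjImpactSummary "
        "WHERE cfProjId = ? AND cfLangCode IN (?, ?) "
        "ORDER BY (cfLangCode = ?) DESC LIMIT 1",   # requested language sorts first
        (proj_id, lang, fallback, lang)).fetchone()
    return row[0] if row else None

conn.execute("INSERT INTO cfProj VALUES ('project-a')")
conn.executemany(
    "INSERT INTO xProjImpactSummary VALUES (?, ?, ?, ?)",
    [("project-a", "en", "o", "Summary in English"),
     ("project-a", "cy", "h", "Crynodeb yn Gymraeg")])

print(impact_summary(conn, "project-a", "cy"))
print(impact_summary(conn, "project-a", "fr"))  # no French row, falls back to English
```

Keeping the extension in its own table means the core model can be upgraded without touching it, and reusing the language columns keeps queries against the extension looking like queries against the rest of the model.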

We have made good progress on the subset of data that we hope to launch in November and have managed to gain agreement for the vocabulary (although some definitions are still being hammered out). The use of CERIF has definitely been a journey of discovery so far and we hope to keep you updated with progress and more detail on how we have mapped the data into CERIF.

Note: If an attribute is added that could be of general interest to the community, we will notify the CERIF task group and they will consider it as an addition to the model.

JISC and Research Councils UK work to reduce reporting burden on universities

Matt Jukes, MRC Digital Communication Manager –

JISC have just published a blogpost outlining joint activity between them and us here at the Research Councils that is of particular interest to anyone following the Gateway to Research project. Rather than reproduce it here I recommend you check it out over on the JISC blog. There is a great deal of interesting work being undertaken in this area.

The GTR technical approach

Paul Chitson, BBSRC Head of Information Services –

The GtR technical team have started the process of pulling together data from various Research Council funded projects and building a technology platform through which to make it available to a wide variety of audiences.

The project is being managed and delivered within an Agile framework, perhaps the first cross council project in the Research Councils to adopt this approach.  This means we will not only be trying to hit key milestones but also aiming to release as much information and as many early previews as we can.  This blog post is the first from the technical team and hopefully sets a precedent for the way we wish to continue.

The process began with the gathering of user stories from potential audiences which have since enabled us to understand the content that is of interest, the ways in which people may want to interface with the data and, of equal importance, what people do not want.

The requirements gathering phase has informed our initial thoughts on the solution architecture and how the data will be stored.

The data will be extracted from various sources, transformed and then stored in two forms. The first will be in a relational database using the CERIF data model. Initially, the intention was to provide an interface that would output data in a CERIF format, but following a workshop with Keith Jeffery (President of euroCRIS) we felt that we could use it as the internal storage model. However, we also have a commitment to provide data in a linked data format (RDF) via a SPARQL interface, so we have chosen to store the data in two forms (rather than use an RDF wrapper) and benefit from a SPARQL interface out of the box. We will, wherever possible, use existing open source technologies and integrate rather than build from scratch. Any code or customisations we make will be made freely available.
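The idea of storing each record in two forms can be sketched roughly as follows. This is only an illustration: the flat project table stands in for the CERIF database, a plain Python list stands in for the triple store, and the example.org namespace and the status predicate are invented. Only the Dublin Core title property is a real vocabulary term.

```python
import sqlite3

BASE = "http://example.org/gtr/"                 # hypothetical namespace
DC_TITLE = "http://purl.org/dc/terms/title"      # Dublin Core Terms "title"

def to_ntriples(record):
    """Render one transformed record as N-Triples lines (minimal escaping only)."""
    subject = f"<{BASE}project/{record['id']}>"
    title = record["title"].replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'{subject} <{DC_TITLE}> "{title}" .',
        f"{subject} <{BASE}vocab/status> <{BASE}status/{record['status']}> .",
    ]

def store_both(conn, triples, record):
    """Write the same record to the relational store and to the triple list."""
    conn.execute(
        "INSERT INTO project (id, title, status) VALUES (?, ?, ?)",
        (record["id"], record["title"], record["status"]))
    triples.extend(to_ntriples(record))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE project (id TEXT PRIMARY KEY, title TEXT, status TEXT)")
triples = []  # stand-in for a real triple store

store_both(conn, triples,
           {"id": "project-a", "title": "Project A", "status": "authorised"})

for line in triples:
    print(line)
```

Writing both forms at load time, rather than wrapping the relational store on the fly, is what lets a triple store serve SPARQL queries directly.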

GtR technology at a glance:

  • Data storage model will be CERIF. This ensures compliance with many universities, research organisations and consumers of research data;
  • A user portal with full text and faceted search (SOLR);
  • REST based interface providing rich information in both human and machine readable forms;
  • SPARQL interface for querying a triple store (initially using the JENA Software stack);
  • Web services providing data in OAI-PMH and CERIF formats.
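For readers unfamiliar with OAI-PMH, a harvesting request is simply an HTTP GET with a verb and a few arguments. The sketch below builds such a request URL. The endpoint address is a placeholder, since no GtR endpoint has been published in this post, and oai_dc is used because it is the metadata format every OAI-PMH repository is required to support.

```python
from urllib.parse import urlencode

# Placeholder endpoint, not a real GtR address.
OAI_ENDPOINT = "https://gtr.example.org/oai"

def list_records_url(metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request, optionally limited by a start date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # e.g. "2012-11-01"
    return f"{OAI_ENDPOINT}?{urlencode(params)}"

print(list_records_url(from_date="2012-11-01"))
```

A harvester would fetch this URL and then follow any resumption token in the response to page through the remaining records.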

 

We are at the start of an exciting journey and invite you to add your input along the way.

Introducing the Gateway to Research

Catherine Coates, Director of Business Innovation at EPSRC and the SRO for the Gateway to Research project, introduces the project and explains why the Gateway is important –

The UK’s Research Councils host a significant quantity of data which provides information on the research and training that they support, as well as the outcomes of that research. This is of huge potential interest and value to business and many other organisations, particularly universities that already make similar data publicly available. The Research Councils together are determined to play their part in making the data we hold freely and easily available for others to use as they see fit, including seeding collaborations and helping interested parties to find out who holds knowledge, what it is and where it sits, so that they can make contact with people who can help them.

With the Gateway to Research project, we envisage an integrated Research Council data set that enables data sharing across the government, private and university sectors.

For example, although currently our data is in the public domain, accessible through our websites, it isn’t easy to navigate what seven Councils hold when using seven different websites!

So we want to create a smart way to make that easy, with common data standards and interoperability, so anyone can access it and use it as they see fit. This is the Gateway to Research concept. Not a controlling gateway but an open door!

We aim to produce an integrated data depositing and harvesting experience for universities and other stakeholders.

We need your help in making this happen. We will use this blog to engage with interested parties regarding the platforms, technologies and data formats that we will be delivering. Help us to deliver the functionality and user experience that will enable you to use our data.

This blog will be updated as often as is practical when there is new information to share. Realistically this will be once a week at most. We will, however, endeavour to engage with questions on a more frequent basis.