GtR – Dealing with the Data Challenges

At the outset of the GtR project, one of the major challenges we identified was establishing the scope of the data and the semantics of the terms associated with it.

A little background to the challenge

In 2011, the seven Research Councils finished consolidating five grants transaction processing systems into a single shared solution. This has helped to reduce some of the differences in the format of council data, but not always in the meaning of the values held. Whilst many of the semantic differences are purely a labelling issue, some are more deeply entrenched in the ways councils work with their respective research communities. To this end, we are taking time to unpick and agree a common set of terms under which we can publish data.

Alongside these semantic equivalence challenges come issues over whether we are permitted to publish some of the data we hold. The GtR portal will be publishing data from various Research Council systems, and these systems are often the result of previous projects to migrate legacy data. The terms under which the information was originally gathered vary on whether the data can be made public, although they generally permit the Research Councils to publish information for the furtherance of their missions. We are working to ensure that information gathered for one purpose is suitable and fit for release and does not break any expectations of privacy or confidentiality.

The form and content of data gathered for one purpose (e.g. to provide the basis of evidence of the impact of research) may not be suitable, or indeed easily consumable, for another purpose. It may also be missing some linkages that would have been made if the aim had been publication. For example, we may not be able to link a publication to more than one person on a research project, because we gather it as evidence of outcomes against a project. Over time, as GtR develops, we would expect it to inform the way we gather data so that we can ensure it delivers value both as evidence of what we have funded and as a resource for those wanting to exploit it in either a public or commercial interest.

Why did we choose CERIF?

CERIF was chosen as the storage mechanism for GtR data for two main reasons:

  • The common challenges of storing research and related data have already been discussed and documented by a wide group of skilled and knowledgeable people. Developing our own bespoke model to deal with the same challenges seemed wasted effort
  • CERIF is highly regarded within the research community and we felt it was important to store the data in a consistent way to facilitate exploitation and information exchange with many universities, research organisations and consumers of research data

Following our decision to adopt CERIF, we invested time in understanding the model and sought the advice of Brigitte Joerg. Brigitte has been instrumental in helping us understand the CERIF model and how to populate it with data from our staging area.

The Cost of Adoption

Adopting CERIF brings a steep learning curve and also adds time. Most of this additional time has been spent documenting and defining the semantics of every attribute and how it relates to others. This in-depth understanding, notwithstanding the challenge of getting seven bodies to agree, is critical to identifying where each attribute belongs in the CERIF model. If your data is already well defined, you will be in a strong position. If, as in most systems, it is not, you will need to agree a set of terms and definitions with the business for every attribute that you want to map to CERIF before proceeding.

This issue most clearly manifests itself in populating the CERIF semantic layer with your vocabularies. Only when the vocabulary data has been added can you begin loading your data sets. For example:

Project A has a current status of Authorised. To represent that information against a project in CERIF, you need to populate the CERIF semantic layer with two key pieces of data: (a) the definition of project status, and (b) the meaning of the term ‘Authorised’. With this semantic reference data in place, you can relate the project to the ‘Authorised’ status. This has a lot in common with the open data RDF world, but CERIF holds its semantic information both in the table structure (the attributes of the tables) and in the contents of the semantic layer.
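To make the ordering constraint concrete, here is a minimal sketch in Python using an in-memory SQLite database. The table and column names (class_scheme, class, project, project_class) are simplified stand-ins invented for illustration, not the actual CERIF schema, and the definitions are placeholders. It only shows the general idea: the link between a project and its status cannot be loaded until the vocabulary it refers to exists.

```python
import sqlite3

# Hypothetical, simplified tables: NOT the real CERIF schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce references between tables
conn.executescript("""
    CREATE TABLE class_scheme (
        scheme_id  TEXT PRIMARY KEY,
        definition TEXT NOT NULL
    );
    CREATE TABLE class (
        class_id   TEXT PRIMARY KEY,
        scheme_id  TEXT NOT NULL REFERENCES class_scheme(scheme_id),
        term       TEXT NOT NULL,
        definition TEXT NOT NULL
    );
    CREATE TABLE project (
        project_id TEXT PRIMARY KEY,
        title      TEXT NOT NULL
    );
    CREATE TABLE project_class (
        project_id TEXT NOT NULL REFERENCES project(project_id),
        class_id   TEXT NOT NULL REFERENCES class(class_id),
        PRIMARY KEY (project_id, class_id)
    );
""")

conn.execute("INSERT INTO project VALUES ('A', 'Project A')")

# Attempt to record the status before the vocabulary exists.
try:
    conn.execute("INSERT INTO project_class VALUES ('A', 'authorised')")
    loaded_before_vocabulary = True
except sqlite3.IntegrityError:
    loaded_before_vocabulary = False

# (a) Define what "project status" means; (b) define the term itself.
# Both definitions below are placeholder text, not agreed wording.
conn.execute(
    "INSERT INTO class_scheme VALUES ('project-status', 'placeholder: what project status means')"
)
conn.execute(
    "INSERT INTO class VALUES ('authorised', 'project-status', 'Authorised', "
    "'placeholder: what Authorised means')"
)

# With the reference data in place, the project can be related to the status.
conn.execute("INSERT INTO project_class VALUES ('A', 'authorised')")

row = conn.execute("""
    SELECT p.title, c.scheme_id, c.term
    FROM project_class pc
    JOIN project p ON p.project_id = pc.project_id
    JOIN class   c ON c.class_id   = pc.class_id
""").fetchone()
```

In this sketch the first attempt to link the project fails because the term it points at has not been defined, which mirrors why the vocabulary has to be agreed and loaded before the data sets.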

Another key point is understanding what to do with free text and date fields that the CERIF model hasn’t catered for. There are a couple of options for storing them. The simplest is to extend the existing tables by adding attributes, but this is definitely not supported by the CERIF task group. The more appropriate route is to add additional tables containing the attributes. If you want the additional fields to have language variations, you then face a decision: adopt the same model used in CERIF, which maintains some logical consistency between the model and the extension, or adopt your own. However you choose to resolve the issue of unsupported fields, you should try to maintain clean semantics, as it is tempting to put in something that solves a problem now and then have to live with the consequences of that decision going forward.
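As a rough illustration of the "additional tables" route, the Python sketch below keeps a core table untouched and stores an unsupported free text field in a separate extension table keyed by project and language code. All names here (project, ext_project_summary, summary_for) are hypothetical, and keying by language code is only an assumption about how one might mirror a per-language layout; it is not taken from the CERIF specification or from the GtR implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Stand-in for a core model table; left exactly as defined.
    CREATE TABLE project (
        project_id TEXT PRIMARY KEY
    );
    -- Hypothetical extension table holding the extra free text field,
    -- one row per project and language.
    CREATE TABLE ext_project_summary (
        project_id TEXT NOT NULL REFERENCES project(project_id),
        lang_code  TEXT NOT NULL,
        summary    TEXT NOT NULL,
        PRIMARY KEY (project_id, lang_code)
    );
""")

conn.execute("INSERT INTO project VALUES ('A')")
conn.executemany(
    "INSERT INTO ext_project_summary VALUES (?, ?, ?)",
    [
        ("A", "en", "Summary text (English)"),
        ("A", "cy", "Summary text (Welsh)"),
    ],
)


def summary_for(conn, project_id, lang_code, fallback="en"):
    """Return the summary in the requested language, else the fallback, else None."""
    for code in (lang_code, fallback):
        row = conn.execute(
            "SELECT summary FROM ext_project_summary "
            "WHERE project_id = ? AND lang_code = ?",
            (project_id, code),
        ).fetchone()
        if row is not None:
            return row[0]
    return None


# The core table's columns are unchanged by the extension.
core_columns = [r[1] for r in conn.execute("PRAGMA table_info(project)")]
```

The design choice being illustrated is that the extension lives beside the model rather than inside it, so the core tables stay comparable with other CERIF users while the extra fields remain queryable through a join or a helper such as summary_for.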

We have made good progress on the subset of data that we hope to launch in November and have managed to gain agreement for the vocabulary (although some definitions are still being hammered out). The use of CERIF has definitely been a journey of discovery so far and we hope to keep you updated with progress and more detail on how we have mapped the data into CERIF.

Note: If an attribute is added that could be of general interest to the community, we will notify the CERIF task group, who will consider it as an addition to the model.