Building a Semantic Capability Stack to Support FAIR Knowledge Graphs at Scale

This is an abbreviated version of a presentation from Ontotext’s Knowledge Graph Forum 2022 titled “Building a Semantic Capability Stack to Support FAIR Knowledge Graphs at Scale” by Martin Romacker, Product Manager Roche Data Marketplace, Roche

February 7, 2024 · Martin Romacker

When we talk about expenses for generating new data assets, we typically focus only on the planned and immediately visible costs. For example, when we run an assay, we need materials such as compounds or plates, we need infrastructure like offices or lab devices to run the assay, and we need FTEs for the planning, execution, and readout.  

However, what we usually don’t talk about when generating an asset are the huge invisible or unplanned costs that occur at a later stage, when the data needs to be made available for analysis or secondary usage. These include ETL processes, searching, accessing, data cleansing, data curation, semantic data integration, and the IT infrastructure to support it all. As a result, a big portion of the IT capacity in Pharma is tied up in data integration.

But there is a way out of this hamster wheel and it is what we call information procurement. It requires a good information architecture – an information-centric data organization that is semantic and meaningful. It also requires an efficient process of creating, acquiring, and integrating standardized information types into all information-driven R&D activities. 

Biomedical Ontologies and Terminologies

There are a lot of standards we have to deal with in Pharma and their number keeps growing. There are also different schemas that hold more or less the same data but look completely different. So, if we multiply the schemas by the universe of standards out there, we move into an intractable space. And, although we have services like the EMBL-EBI Ontology Xref service providing the various mappings between resources, these mappings still need to be maintained and kept up to date. All this requires a lot of resources.

The current situation also leads to a loss of information, because it is not always easy to find out whether two concepts are semantically equivalent or not. It also impedes interoperability – even though there are standards for sharing ontology mappings, the situation still isn’t optimal.

The Linked Data Illusion

The current Linked Open Data Cloud rests on the assumption that if we talk about the same thing, our data is linked. But that’s not true. Our data is linked only when it has referential identity, which comes from interoperability. It doesn’t come from creating classes and properties again and again and then trying to map them.

The same happens in organizations dealing with data coming from disparate sources. At the end of the day, they end up with a lot of additional costs and project delays. Resources are misallocated, because data scientists spend 80% of their time finding, accessing, and cleansing data. This also results in the information loss I’ve already mentioned and severely impacts insight creation and data monetization.

FAIR Data Architecture

Instead, we should aim to build an open public-private semantic infrastructure of fully standardized FAIR (findable, accessible, interoperable, and reusable) applications, services, and data. 

The first element we need for such an architecture is to set up our terminology management. In other words, all the different concepts we work with should have a Uniform Resource Identifier (URI). The terminology sets formed by these concepts should also have identifiers.
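As a minimal sketch of what this could look like, here is a SKOS-style terminology fragment built with Python and rdflib (assuming rdflib is available; the namespace and concept names are purely illustrative, not actual Roche or public identifiers):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Illustrative namespace - a stand-in for a real terminology namespace
TERM = Namespace("https://example.org/terminology/")

g = Graph()
g.bind("skos", SKOS)

# The terminology set itself gets an identifier ...
g.add((TERM.IndicationVocabulary, RDF.type, SKOS.ConceptScheme))

# ... and so does every concept it contains
g.add((TERM.Asthma, RDF.type, SKOS.Concept))
g.add((TERM.Asthma, SKOS.prefLabel, Literal("Asthma", lang="en")))
g.add((TERM.Asthma, SKOS.inScheme, TERM.IndicationVocabulary))

print(g.serialize(format="turtle"))
```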

The next element is the variables. We have a schema definition, and every variable should have an identifier. Ideally, we should be able to link these identifiers to other standards like OMOP, ODC, SPHN, etc. But we can also treat the whole schema as an object with a URI. So, on the one hand, we can see our schema as a vocabulary that has different names and different standards, and, on the other, we can see it as a metadata object. In this way, the elements of the schema form a data dictionary or metadata dictionary, which we can then put into a metadata registry.
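A sketch of this idea, under the same assumptions as above (rdflib available, all namespaces and variable names illustrative; the external namespace merely stands in for a standard such as OMOP or SPHN):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, SKOS, RDF

# Illustrative namespaces - a stand-in for an internal schema registry
# and for an external standard's identifier space
SCHEMA = Namespace("https://example.org/schemata/clinical-study/")
EXT = Namespace("https://example.org/external-standard/")

g = Graph()

# The schema as a whole is an addressable metadata object with its own URI
g.add((SCHEMA.ClinicalStudySchema, RDF.type, URIRef("https://example.org/meta/Schema")))

# Each variable is an element of the data dictionary with its own URI ...
g.add((SCHEMA.ageAtEnrollment, RDFS.label, Literal("Age at enrollment", lang="en")))
g.add((SCHEMA.ageAtEnrollment, RDFS.isDefinedBy, SCHEMA.ClinicalStudySchema))

# ... and can be mapped to the corresponding element of another standard
g.add((SCHEMA.ageAtEnrollment, SKOS.exactMatch, EXT.age_at_enrollment))

print(g.serialize(format="turtle"))
```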

The last element is about knowledge graphs and there are two different approaches. 

One approach is to translate a table into a graph using simple graph generation. Here, we select one of the columns as the subject, the variables become predicates, and the values in the cells become objects. The problem is that we lose a lot of semantics in the process, because this method requires semantic simplification.
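A minimal sketch of this row-to-triples conversion (again assuming rdflib; the column names and the choice of “subject_id” as the subject column are hypothetical):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("https://example.org/data/")

# One row of a tabular dataset; "subject_id" is chosen as the subject column
row = {"subject_id": "S001", "age": "54", "treatment": "Drug A"}

g = Graph()
subject = EX[row["subject_id"]]
for column, value in row.items():
    if column == "subject_id":
        continue
    # Every remaining column becomes a predicate, every cell value a literal object
    g.add((subject, EX[column], Literal(value)))

# The result is flat: "age" is just a string, with no unit and no further structure
print(g.serialize(format="turtle"))
```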

Another approach is to build a semantic graph with a more complex representation. For example, we can have a study subject that has an “age”, and the “age” has an “age unit” and an “age value”. We can also have a “treatment” for this subject, which can be a very complex object. For example, it can have a “route of administration” and a “frequency of application”. It can have an “active dose”, which has a “dosage form”, which has an “active substance”, which has a “strength”, which has a “unit”, and so on.
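The following sketch shows how such a nested structure could be expressed as a graph (rdflib assumed; all class and property names are illustrative placeholders, not an actual standard model):

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("https://example.org/model/")

g = Graph()
subject = EX.StudySubject_S001

# "Age" is its own node carrying a value and a unit, instead of a bare literal
age = BNode()
g.add((subject, EX.hasAge, age))
g.add((age, EX.ageValue, Literal(54, datatype=XSD.integer)))
g.add((age, EX.ageUnit, EX.Year))

# "Treatment" is a complex object that can be decomposed further and further
treatment = BNode()
g.add((subject, EX.hasTreatment, treatment))
g.add((treatment, EX.routeOfAdministration, EX.Oral))
g.add((treatment, EX.frequencyOfApplication, Literal("twice daily")))

dose = BNode()
g.add((treatment, EX.hasActiveDose, dose))
g.add((dose, EX.dosageForm, EX.Tablet))
g.add((dose, EX.activeSubstance, EX.SubstanceX))
g.add((dose, EX.strengthValue, Literal(50, datatype=XSD.integer)))
g.add((dose, EX.strengthUnit, EX.Milligram))

print(g.serialize(format="turtle"))
```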

The beauty of semantic technology is that it’s compositional. It can provide a common model that everybody in the industry can agree upon as a standard and then use the same URIs, the same variables, and the same code list. 

Why FAIR and Knowledge Graphs Might Fail

But, so far, all this is contrary to what we’ve been doing. So, where do we go wrong?

  • We fail to make it clear what FAIR is – the implications of FAIRification for how we work in knowledge management and IT projects are not fully understood. For many people in our organizations, it is not clear that breaking up silos doesn’t mean building an integration layer on top of a silo, but that the silos should go away. 
  • There is a misunderstanding of the scope of data FAIRification and data quality – FAIR data and high-quality data are not the same. Having a tool that automatically checks the FAIRness of the data doesn’t mean that the data is correct or standards-compliant. What is measured relates to global identifiers, standardized schemata, ontologies, etc.
  • We cannot measure FAIR maturity because the maturity model is almost incomprehensible – this heavily impacts correct adoption of the FAIR data principles, and the model needs to be translated into something actionable. 
  • The biomedical community doesn’t converge on standards – we rather add to the chaos instead of harmonizing, both in terms of semantic resources and projects, all of which results in an intractable knowledge space. 
  • The FAIR community stays in a bubble – insiders connect with insiders, while outreach and integration with the business are poor. Beyond a vague understanding, there is little support from management for a real breakthrough.
  • Scaling up knowledge management based on FAIR resources and standards requires an operational backbone of FAIR data and services, but there is no community strategy for the basic resources and their maintenance.
  • FAIR does not speak to key communities such as IT Architects, Master Data Management, Data Managers, etc. Implementing a landscape of FAIR data, services, and applications requires high engagement with these communities, but until now this hasn’t happened. 
  • Using RDF and OWL to build ontologies, as well as creating knowledge graphs, does not prevent us from establishing new data silos or producing unFAIR data.

Standardization and Capability Stack

At Roche, we have an in-house developed tool that covers the semantic capability stack – from terminology management to metadata to building ontologies. We now use Ontotext’s GraphDB as a high-performance graph database. We have integrated more than 100 production applications with the terminology part, and we have features that provide terminologies in context.

This customization is the main factor in our success. When people want to use the Roche terminology system and, for example, need indications, we don’t dump 6,000 indications on them. Instead, they get exactly what they want, which is maybe 100–200. They can also change the preferred label, set up their own hierarchies, etc. All of this is intrinsically held together by the URIs and makes it possible to get a complete landscape of how specific concepts are used.

We have now extended this to Schemata, which provides a data dictionary for defining the various variables we have, and each variable has the same URI regardless of the context. So Schemata are interoperable by design.

We also have a tool for semantic modeling. It provides the reference terminologies we need to apply, and whenever we want to use a specific concept – for example, a pathway, a genetic variant, a drug, etc. – it will have the same URI in all the models we build.

To Sum It Up

We need high-quality, standardized, and linked data, which is the foundation for digitization and insight generation. We’ve seen that the FAIR principles intrinsically tie data management to semantic technologies. So, we need to work with RDF, RDFS, and OWL, and have FAIR data that is linked data by design. We’ve also seen that information procurement based on FAIR knowledge graphs supports transformation-less data integration.

It’s also important to note that FAIR data is about the HOW and not only about the THAT. It’s about finding something with URIs, not with names. Right now, what we are all trying to do as a community is to find new architectural approaches around data and information. And the interoperability of the terminologies, the metadata, the schemata, the dataset models, and the ontologies is key.
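To make the “finding with URIs, not with names” point concrete, here is a small sketch (rdflib assumed; the two mini-datasets and all URIs are hypothetical) where a single concept URI connects records from different sources even though no shared label exists:

```python
from rdflib import Graph

g = Graph()
# Two illustrative datasets that both reference the same concept URI
g.parse(data="""
@prefix ex:   <https://example.org/terminology/> .
@prefix lab:  <https://example.org/lab/> .
@prefix clin: <https://example.org/clinical/> .

lab:Assay42  lab:studiesDisease  ex:Asthma .
clin:Trial7  clin:hasIndication  ex:Asthma .
""", format="turtle")

# Looking up the URI finds both records - no string matching on names required
results = g.query("""
    SELECT ?record ?property WHERE {
        ?record ?property <https://example.org/terminology/Asthma> .
    }
""")
for record, prop in results:
    print(record, prop)
```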

Last but not least, we need to converge. A single company can’t solve the issues we have to face. So, there is an urgency to act as a community and to build these semantic capabilities together. 

Martin Romacker, Product Manager Roche Data Marketplace at Roche

Martin Romacker works on terminologies, data curation, data harmonization, data standards, data governance, information architecture, and FAIR data integration at scale. He is very enthusiastic about any strategic, tactical, or operational work executed to turn data into real assets. Martin believes that the “Science of Data” is just as important as Data Science, as data is ubiquitous in the Pharma value chain and the foundation of every value-generating activity.