Read about the FAIR data principles: a relatively new concept for data discoverability and management that has quickly gained traction in the scientific data community.
When we talk about expenses for generating new data assets, we typically focus only on the planned and immediately visible costs. For example, when we run an assay, we need materials such as compounds or plates, we need infrastructure like offices or lab devices to run the assay, and we need FTEs for the planning, execution, and readout.
However, what we usually don't talk about when generating an asset are the huge invisible or unplanned costs that occur at a later stage, when the data needs to be made available for analysis or secondary usage. These include ETL processes, searching, accessing, data cleansing, data creation, semantic data integration, and the IT infrastructure to support all of this. As a result, a big portion of the IT capacity in Pharma is tied up by data integration.
But there is a way out of this hamster wheel and it is what we call information procurement. It requires a good information architecture – an information-centric data organization that is semantic and meaningful. It also requires an efficient process of creating, acquiring, and integrating standardized information types into all information-driven R&D activities.
There are a lot of standards we have to deal with in Pharma and their number keeps growing. There are also different schemas, which hold more or less the same data but look completely different. So, if we multiply the schemas by the universe of standards out there, we are moving into an intractable space. And, although we have services like the EMBL-EBI Ontology Xref service providing a schema of all the various mappings between resources, these still need to be maintained and kept up-to-date. All this requires a lot of resources.
The current situation also leads to a loss of information because it is not always easy to find out if two concepts are semantically equivalent or not. It also impedes interoperability – even though there are standards for sharing ontology mappings, it still isn’t optimal.
The current Linked Open Data Cloud rests on the assumption that if we talk about the same thing, our data is linked. But that's not true. Our data is linked only when it has a referential identity, which comes from interoperability. It doesn't come from creating classes and properties again and again, and then trying to map them.
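To make referential identity concrete, here is a minimal sketch in Python. It assumes two hypothetical data sources that both use the ChEBI URI for acetylsalicylic acid (aspirin); the property names and values are illustrative, not real data. Because both sources share the same identifier, integration needs no schema mapping at all:

```python
# Sketch: data links only through shared referential identity.
# If two sources use the same URI for aspirin, joining them is trivial;
# if each invents its own classes and properties, a mapping step is needed.

# ChEBI URI for acetylsalicylic acid (aspirin)
CHEBI_ASPIRIN = "http://purl.obolibrary.org/obo/CHEBI_15365"

# Two hypothetical sources describing the same compound (values illustrative)
source_a = {CHEBI_ASPIRIN: {"solubility_mg_per_ml": 3}}
source_b = {CHEBI_ASPIRIN: {"indication": "pain"}}

# Integration is a plain merge keyed on the URI, with no mapping tables.
merged = {uri: {**source_a.get(uri, {}), **source_b.get(uri, {})}
          for uri in set(source_a) | set(source_b)}

print(merged[CHEBI_ASPIRIN])
```

Had each source minted its own local identifier for aspirin, this one-line merge would instead require exactly the kind of hand-maintained cross-reference mappings described above.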
The same happens in organizations dealing with data coming from disparate sources. At the end of the day, they inevitably create a lot of additional costs and project delays. They have to misallocate resources, because data scientists spend 80% of their time finding, accessing, and cleansing data. This also results in the information loss I've already mentioned and severely impacts insight creation and data monetization.
Instead, we should aim to build an open public-private semantic infrastructure of fully standardized FAIR (findable, accessible, interoperable, and reusable) applications, services, and data.
The first element we need for such an architecture is to set up our terminology management. In other words, all the different concepts we work with should have a Uniform Resource Identifier (URI). The terminology sets formed by these concepts should also have identifiers.
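A minimal sketch of this idea, with all URIs and identifiers invented for illustration: every concept is keyed by a URI rather than by its (changeable) label, and a terminology set is itself a URI-identified object that simply references concept URIs.

```python
# Sketch of URI-based terminology management (all names hypothetical).

BASE = "https://example.org/terminology/"

# Each concept is identified by a stable URI, not by its label.
concepts = {
    BASE + "concept/0001": {"prefLabel": "diabetes mellitus"},
    BASE + "concept/0002": {"prefLabel": "hypertension"},
}

# A terminology set (e.g. a list of indications) also has its own URI
# and only references concept URIs, never raw strings.
terminology_sets = {
    BASE + "set/indications-cardio": [BASE + "concept/0002"],
}

def resolve(uri: str) -> str:
    """Look up a concept's preferred label by its URI."""
    return concepts[uri]["prefLabel"]

for set_uri, members in terminology_sets.items():
    print(set_uri, "->", [resolve(m) for m in members])
```

Because labels are attached to URIs rather than the other way around, a preferred label can change without breaking any set that references the concept.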
The next element is the variables. We have a schema definition and every variable should have an identifier. Ideally, we should be able to link these identifiers to other standards like OMOP, ODC, SPHN, etc. But we can also treat the whole schema as an object with a URI. So, on the one hand, we can see our schema as a vocabulary with different names and different standards, and, on the other, as a metadata object. In this way, the elements of the schema form a data dictionary or metadata dictionary, which we can then put in a metadata registry.
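The following sketch illustrates this double view, assuming hypothetical URIs and variable names (the OMOP field names shown are only indicative mappings, not an authoritative crosswalk): each variable carries its own URI, and the schema as a whole is a URI-identified metadata object that a registry can index.

```python
# Sketch: variables with URIs, and the schema itself as a metadata object.
# All URIs, variable names, and mappings are illustrative assumptions.

SCHEMA_URI = "https://example.org/schema/clinical-demographics"

variables = {
    "age": {"uri": "https://example.org/variable/age",
            "maps_to": {"OMOP": "year_of_birth (derived)"}},
    "sex": {"uri": "https://example.org/variable/sex",
            "maps_to": {"OMOP": "gender_concept_id"}},
}

# The whole schema is a metadata object with its own URI, so a metadata
# registry can index it like any other resource: the schema URI points
# to the data dictionary formed by its variable URIs.
metadata_registry = {
    SCHEMA_URI: sorted(v["uri"] for v in variables.values())
}

print(metadata_registry[SCHEMA_URI])
```

The key property is that the variable URI stays the same in every dataset that uses it, which is what makes schemas interoperable by design.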
The last element is about knowledge graphs and there are two different approaches.
One approach is to translate a table into a graph with a simple graph generation. Here, we select one of the columns as a subject, the variables become predicates and the values in the cells become objects. The problem here is that we lose a lot of semantics in the process because this method requires semantic simplification.
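The simple translation can be sketched as follows, with an invented two-row table: one column is chosen as the subject, the remaining column names become predicates, and cell values become objects.

```python
# Sketch of simple table-to-graph generation (table contents invented).

table = [
    {"subject_id": "patient-001", "age": 54, "treatment": "drug-A 10mg oral"},
    {"subject_id": "patient-002", "age": 61, "treatment": "drug-B 5mg iv"},
]

def table_to_triples(rows, subject_col):
    """Turn each row into (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = row[subject_col]
        for column, value in row.items():
            if column != subject_col:
                triples.append((subject, column, value))
    return triples

triples = table_to_triples(table, "subject_id")
print(triples)
```

Note how "drug-A 10mg oral" survives only as a single opaque string: the substance, strength, and route of administration are flattened into one value, which is exactly the semantic simplification this approach forces.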
Another approach is to build a semantic graph with a more complex representation. For example, we can have a study subject that has an "age", and the "age" has an "age unit" and an "age value". We can also have a "treatment" for this subject, which can be a very complex object. For example, it can have a "route of administration" and a "frequency of application". It can have an "active dose", which has a "dosage form", which has an "active substance", which has a "strength", which has a "unit", and so on.
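The richer, compositional shape can be sketched like this, with all property names and values invented for illustration (this is not a Roche or standards-body model): "age" and "treatment" become structured objects whose parts can each carry their own shared definition and URI.

```python
# Sketch of the compositional representation (all names illustrative).

study_subject = {
    "id": "patient-001",
    "age": {"age_value": 54, "age_unit": "year"},
    "treatment": {
        "route_of_administration": "oral",
        "frequency_of_application": "once daily",
        "active_dose": {
            "dosage_form": "tablet",
            "active_substance": {
                "name": "drug-A",
                "strength": {"value": 10, "unit": "mg"},
            },
        },
    },
}

# Because the model is compositional, each nested piece (a strength,
# a unit, a substance) can reuse the same shared definition wherever
# it appears, instead of being re-encoded in an opaque string.
strength = (study_subject["treatment"]["active_dose"]
            ["active_substance"]["strength"])
print(strength)
```

Compare this with the flattened "drug-A 10mg oral" string from the simple approach: here the strength, unit, and route remain separately queryable.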
The beauty of semantic technology is that it’s compositional. It can provide a common model that everybody in the industry can agree upon as a standard and then use the same URIs, the same variables, and the same code list.
But, so far, all this is contrary to what we’ve been doing. So, where do we go wrong?
At Roche, we have an in-house developed tool that covers the semantic capability stack – from terminology management through metadata to building ontologies. We now use Ontotext's GraphDB as a high-performance graph database. We have integrated more than 100 productive applications on the terminology part and we have features providing terminologies in context.
This customization is the main factor in our success. When people want to use the Roche terminology system and, for example, need indications, we don't dump 6,000 indications on them. Instead, they get exactly what they want, which is maybe 100-200. They can also change the preferred label, set up their own hierarchies, etc. All of this is intrinsically held together by the URIs, which makes it possible to have a complete landscape of where specific concepts are used.
We have now extended this to Schemata, which provides a data dictionary for defining the various variables we have, and each variable has the same URI regardless of the context. So Schemata are interoperable by design.
We also have a tool for semantic modeling. It provides reference terminologies we need to apply and whenever we want to use a specific concept like, for example, pathway, genetic variant, drug, etc., it will have the same URI in all the models we build.
We need high-quality, standardized, and linked data, which is the foundation for digitization and insight generation. We've seen that FAIR principles intrinsically tie data management to semantic technologies. So, we need to work with RDF, RDFS, and OWL, and have FAIR data that is linked data by design. We've also seen that information procurement based on FAIR knowledge graphs supports transformation-less data integration.
It's also important to note that FAIR data is about the HOW, not only about the THAT: it's about finding things by URIs, not by names. Right now, what we all try to do as a community is to find new architectural approaches around data and information. And the interoperability of the terminologies, the metadata, the schemata, the dataset models, and the ontologies is key.
Last but not least, we need to converge. A single company can’t solve the issues we have to face. So, there is an urgency to act as a community and to build these semantic capabilities together.