Graphs on the Ground Part II: Knowledge Graphs in the Life Sciences

The dependence of life science companies on large public data sources as well as their use of both structured and unstructured data make knowledge graphs a natural fit for the industry. Today, they support key functions from drug development to literature review and regulatory reporting.

December 17, 2021 | 7 min. read | Joe Hilleary


A friend of mine went whale watching recently. He spent a small fortune on tickets for his family and then proceeded to sit on a boat for four hours scouring the ocean for any sign of life. He came back a little seasick and without having seen a thing. Why do I bring this up? Because without the right tools, searching for data in the life sciences is a lot like whale watching. You know what you’re looking for is out there, but the sea of information is vast, and, despite their expense, time-consuming manual searches often miss critical insights.

Companies in the life sciences face data challenges on two fronts:

Volume. Organizations in this sector often deal with multiple repositories of millions of documents on top of proprietary data from labs and other internal sources.

Variety. This data comes from both private and public sources and in structured and unstructured formats, making it difficult to create a unified, queryable view of existing knowledge.

In this article, the second entry in our Graphs on the Ground series (see our first article on financial services), we will explore how knowledge graphs combat these challenges across four critical activities within the world of life sciences and pharmaceuticals.

Drug development

Let’s start with the core function of every pharmaceutical company—drug discovery. Developing a new treatment or therapy takes an extraordinary amount of scientific data. In addition to the data they generate, organizations rely on public and other external resources for research data, gene information, and other knowledge shared across the discipline. Only by making connections between all of these sources and their own proprietary data can enterprises identify potential new treatments.

Unfortunately, the process of making those connections remains highly manual. Because data is siloed within each resource, scientists cannot easily explore the information available to them about a particular chemical compound or medical condition. A knowledge graph-based approach solves this problem by unifying the information within an intuitively linked structure.

Domain experts can document their knowledge of terms and concepts within the graph and then connect the nodes to data stored in any number of external sources. (See Figure 1.)

Figure 1. Simplified Graph of Drug Features

Instead of searching for different aspects in different sources, scientists can start anywhere—with a particular chemical, a certain study, or a condition of interest—and immediately begin examining links from that node to data from any number of other sources. Organizations can even use the knowledge graph to store information on how to retrieve data from each source, automating the task of regularly extracting updates. The knowledge graph-based approach is especially effective for uncovering new uses for existing drugs. Linking data about existing chemical compounds makes it much easier to find previously overlooked connections. As a result, companies are better able to generate value from past research.
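To make the "start anywhere" idea concrete, here is a minimal sketch in Python. It is not how any particular graph database works internally; it only assumes that the graph can be expressed as subject-predicate-object triples. Every name in it (CompoundA, GeneX, ConditionY, Study-001, and the helper functions) is a hypothetical placeholder, not real pharmacological data.

```python
from collections import defaultdict, deque

# Hypothetical triples (subject, predicate, object); every name is a placeholder.
TRIPLES = [
    ("CompoundA", "targets", "GeneX"),
    ("GeneX", "associated_with", "ConditionY"),
    ("Study-001", "investigates", "CompoundA"),
    ("Study-001", "reports_on", "ConditionY"),
    ("CompoundB", "targets", "GeneX"),
]


def build_index(triples):
    """Index each triple under both endpoints so a walk can start at any node."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
        index[obj].append(("inverse:" + predicate, subject))
    return index


def explore(index, start, max_hops=2):
    """Breadth-first walk; returns {node: hop distance} for nodes within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, neighbor in index[node]:
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


index = build_index(TRIPLES)
# Start from a condition of interest rather than from a drug or a study.
reachable = explore(index, "ConditionY", max_hops=2)
```

Starting from the condition, the walk reaches the placeholder CompoundB two hops away through the shared gene, which is the kind of previously overlooked link that drug repurposing depends on.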

A critical component of knowledge graphs’ effectiveness in this field is their ability to introduce structure to unstructured data. Many rich sources of information in the medical world are written documents with poor-quality metadata. Coupling natural language processing (NLP) with a knowledge graph, as Ontotext has done for many of its clients, creates a positive feedback loop that improves document discoverability. NLP models trained on collections of terms and concepts, such as those held in a knowledge graph, are better able to sort through documents to extract information. That information enhances the graph, which in turn improves the NLP model. Ultimately, this allows for rich search that supports complex queries across both structured and unstructured sources.
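The feedback loop can be illustrated with a toy example. The sketch below swaps the trained NLP model for a simple dictionary matcher, so it should be read as an illustration of the loop rather than as Ontotext's method. The concept identifiers, the labels, the "also known as" rule, and both documents are invented for this example.

```python
import re

# Hypothetical labels held in the graph: concept id -> known surface forms.
LABELS = {
    "concept:CompoundA": {"compound a"},
    "concept:ConditionY": {"condition y"},
}


def tag(text, labels):
    """Return the concept ids whose labels occur as whole words in the text."""
    lowered = text.lower()
    found = set()
    for concept, forms in labels.items():
        if any(re.search(r"\b" + re.escape(f) + r"\b", lowered) for f in forms):
            found.add(concept)
    return found


def learn_aliases(text, labels):
    """Toy enrichment rule: '<label> (also known as Z)' adds Z as a new label."""
    lowered = text.lower()
    added = 0
    for forms in labels.values():
        for form in list(forms):  # copy, because the set grows inside the loop
            pattern = re.escape(form) + r" \(also known as ([^)]+)\)"
            for alias in re.findall(pattern, lowered):
                if alias not in forms:
                    forms.add(alias)
                    added += 1
    return added


doc1 = "Compound A (also known as CA-17) was well tolerated."
doc2 = "Patients with Condition Y received CA-17 for six weeks."

before = tag(doc2, LABELS)          # the graph does not know the alias yet
added = learn_aliases(doc1, LABELS)  # extraction enriches the graph
after = tag(doc2, LABELS)           # the richer graph improves tagging
```

Before the first document is processed, the second document is tagged only with the condition. Once the alias has been added to the graph, the same document is also linked to the compound.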

Clinical study

Clinical studies are another key activity in the world of life sciences. To publish a study, scientists must create a draft report that consolidates all of their data. As with drug discovery, this data is typically a mixture of structured and unstructured sources. Often, searches for particular documents return thousands of results, most of which don’t actually apply directly to the research. In one case, an Ontotext customer had more than 500,000 documents contained in data silos related to three clinical studies. When dealing with that volume of data, manual processes run the risk of someone overlooking a critical piece of information and creating a liability for the company.

A knowledge graph paired with a graph-trained NLP model once again provides a solution. The model can readily identify the structure of clinical documents, extract the important relationships, and add them to the graph. Researchers can link concepts such as inclusion and exclusion criteria to their formal definitions in external databases and connect adverse events to particular combinations of conditions. Ultimately, this approach not only saves time but also reduces the risk of human error when dealing with data at a massive scale.

Literature review

Life science companies don’t just publish studies; they also ingest them. Once again, knowledge graphs lend a hand. Monitoring scientific literature requires scouring public sources with millions of articles for information about adverse events and competitors’ research. In many respects, this task is the opposite of the last example. Researchers must break down articles into their key data to extract insights. Using NLP, organizations can process these documents automatically, feeding the takeaways into the knowledge graph, where they become readily accessible. Figure 2 shows a simplified graph reflecting information about an adverse patient reaction from a case report.

Figure 2. Simplified Graph of an Adverse Reaction

The knowledge graph essentially becomes a unified interface for exploring millions of articles from multiple repositories. With this approach, researchers can easily track competitors’ progress through clinical trials or set up notifications and alerts about topics of interest. Relevant information becomes far more retrievable because it is liberated from the silo of the external source and enriched with metadata from the knowledge graph.
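As a rough illustration of the alerting idea, the sketch below assumes that an upstream NLP step has already turned each article into a handful of facts. The Fact structure, the team names, and the article contents are all hypothetical; a production system would more likely run stored queries against the graph itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One relationship extracted from an article (all values are placeholders)."""
    article_id: str
    subject: str
    predicate: str
    obj: str


# Hypothetical output of the extraction step for two articles.
FACTS = [
    Fact("article-1", "Patient-1", "took", "CompoundA"),
    Fact("article-1", "Patient-1", "experienced", "Rash"),
    Fact("article-2", "CompetitorTrial-7", "entered", "Phase2"),
]

# Hypothetical subscriptions: who wants to hear about which concepts.
WATCHLIST = {
    "safety_team": {"CompoundA"},
    "strategy_team": {"CompetitorTrial-7"},
}


def matching_alerts(facts, watchlist):
    """Map each watcher to the ids of articles that mention a followed concept."""
    alerts = {}
    for fact in facts:
        for watcher, concepts in watchlist.items():
            if fact.subject in concepts or fact.obj in concepts:
                alerts.setdefault(watcher, set()).add(fact.article_id)
    return alerts


alerts = matching_alerts(FACTS, WATCHLIST)
```

Because the match runs against graph concepts rather than raw text, the safety team is pointed to the case report and the strategy team to the competitor update without either group searching the source repositories directly.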

Regulatory reporting

Like companies in the financial services vertical, which we examined in the first installment of this series, companies in the life sciences must routinely report information to a variety of regulatory agencies. As with the other use cases, the ability of a knowledge graph to consolidate information from both proprietary and public sources is invaluable. Rather than manually tracking down all of the sources needed to generate the appropriate reports, organizations can map them within the graph.

A knowledge graph for regulatory reporting stores information about the agencies as well as the research. It contains definitions specific to certain authorities and relates those to terms used by others. It can also store information about previously asked questions. As a result, graph algorithms applied to this type of graph can produce similarity scores for questions asked by different regulators. These scores can point report authors in the right direction and save them time in answering closely related questions.
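The article does not say which graph algorithm produces these scores, so the sketch below uses Jaccard similarity over the concepts linked to each question as a simple stand-in. The agency names, question identifiers, and concept sets are invented for illustration.

```python
from itertools import combinations

# Hypothetical graph links: question id -> concepts the question is connected to.
QUESTION_CONCEPTS = {
    "agency_a:q1": {"adverse_event", "dosage", "CompoundA"},
    "agency_b:q7": {"adverse_event", "CompoundA", "pediatric_use"},
    "agency_b:q9": {"manufacturing_site"},
}


def jaccard(a, b):
    """Share of concepts two questions have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(questions):
    """Score every pair of questions and return them from most to least similar."""
    pairs = [
        (jaccard(questions[x], questions[y]), x, y)
        for x, y in combinations(sorted(questions), 2)
    ]
    return sorted(pairs, reverse=True)


ranking = rank_similar(QUESTION_CONCEPTS)
best_score, first, second = ranking[0]
```

Here the two questions from different agencies share two of four distinct concepts, so they score 0.5 and rise to the top of the ranking. An author answering one of them could begin from the response already prepared for the other.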

Making it a Reality

So how does one begin building a knowledge graph for the life sciences?

  1. Start with a specific function. One of the examples above might be a fit for your company, but let business needs drive the development process. Many knowledge graph initiatives fail because the initial vision is too expansive. Choosing one use case and demonstrating value within a set time frame will make it easier to justify expanding the graph to other applications down the road. It’s better to have a small, functional graph that meets a need than to spend years constructing a comprehensive knowledge graph that never gets used.
  2. Take advantage of existing resources. Knowledge graphs aren’t new to the life sciences. Realizing the unique utility of knowledge graphs within the industry, early adopters have already done a lot of the heavy lifting needed to build domain-specific knowledge graphs. As a service to its clients, Ontotext provides updated access to over 200 of the existing public ontologies, datasets, and graph-compatible sources for the life sciences. Building on these and other extant resources gives newcomers a jumpstart on knowledge graph development.

Although my friend’s whale watching excursion was a bust, plenty of research vessels have no problem finding whales in the same area to tag and track them. Why? Because they use the right tools. No technology solves every data problem, but choosing the right one can make certain tasks much easier. And for many functions in the life sciences, knowledge graphs are a natural fit.

Do you want to learn how Ontotext’s knowledge graph-based technology can help in your particular use case?




Data Scientist

Joe Hilleary is a writer and a data enthusiast. He believes that we are living through a pivotal moment in the evolution of data technology and is dedicated to helping organizations find the best ways to leverage their information. He holds a B.A. from Bowdoin College and, when not researching the latest developments in the world of data, can be found exploring the woods and rocky coasts of Maine.