George Anadiotis discusses the RDF graph data model with Ontotext CTO Vassil Momtchev in this engaging ZDNet interview.
Eric Kavanagh: You are doing some really interesting things at Ontotext. I’ve been studying you and your focus on the Life Sciences and Healthcare space, but maybe to start you can give us an overview of what Ontotext is and what you are trying to accomplish.
Vassil Momtchev: Okay. We’re really a quite interesting company and I’ll start from the very beginning. The company was started by a few developers as a spin-off from one of the Bulgarian software solution companies.
Ironically, the company was started in 2001, just after the dot-com bubble burst. What was even funnier was that the company’s vision was to develop and deliver intelligent software applications, branded by the word semantics, right after the AI winter. Looking back, I think it was probably a really crazy idea and it was a long, hard road in the beginning.
During this period, we invested enormous time in technology development to solve really complex knowledge management problems and also text analysis problems. One of the biggest challenges was that there was not much software available and most of it operated as a black box. So, text analytics and knowledge management had quite unpredictable performance. During the early years, our mission was to develop and popularize well-engineered semantic tools and products.
Nowadays, Ontotext still keeps its original company spirit; we’re still quite crazy guys. This is the same DNA for the company. Obviously, the market has matured and has allowed us to scale up our business, so we are now up to 70 people and we have offices not only in Bulgaria but also in the U.S. and UK as well as contractors in New Zealand and various places in Europe.
Another thing that has changed is that we have been able to attract many high profile customers from multiple industries. Healthcare and Life Sciences are obviously industries that have big applications of Semantic Technology, but we also have very high-profile customers from the Media, Publishing and other similar sectors like cultural heritage and so on. We are actually active in many different horizontals but, as you said, the most active verticals are Healthcare and Pharma, and Publishing and Media.
Eric Kavanagh: Let’s dig into Semantic Technology and how it works. A lot of people understand semantics in language – you look for different meanings of similar words or different words that have similar meanings, basically, and you try to use semantics to align certain dimensions. So, maybe in the Pharmaceutical industry, it could be along the lines of certain products that are used, certain pharmaceuticals and what they do. In the Media, you’re looking to align different kinds of media that have similar threads or similar topics, is that right?
Vassil Momtchev: Yes, this is absolutely right and I’m going to give you a few examples. Our solutions are based on three fundamental technologies, which we have mastered over the years. We have quite a powerful platform, which can be applied to different use cases.
The very basic first technology is the knowledge base design or the ontology modeling. Ontologies are the conceptualization of knowledge, which explains how different concepts are linked and represented. Recently, many big companies, like Google for example, started initiatives to popularize the knowledge graph and the graph of information. This is very similar to the way human beings understand information: you know some concepts and all the meaning is based on the relationships between those concepts. This is one of the foundational pillars of Semantic Technology: knowledge base design and ontology modeling.
One of the more obvious applications is ontology-based text mining. This is necessary because a lot of the available information in the world is unstructured. Using knowledge bases, which are actually optimized for doing text analysis, helps to identify concepts in text and to disambiguate meaning. Also this is another way to populate and learn new facts from unstructured text. For instance, you see that one company was acquired by another company and you can populate knowledge bases by using text analysis.
The third and perhaps the most obvious application is how you can interact with this information and knowledge. We support various types of usage scenarios. The most obvious one is semantic search or asking analytical questions, for instance, “Which are the clinical studies that show asthma as an adverse event?” “For these types of clinical studies, where they were mentioned, what was the company?” Then by using the knowledge base, you can identify all the different product names. In the pharma domain, if you look at drugs, which seems like a quite simple domain, you know that there are so many complexities. “What is the brand name?” “What is the generic name?” “What are the active ingredients?” “Who is the manufacturer?” “Who is the packager?” And so on. By using these types of knowledge bases, you can really do much more powerful analytical queries and searches.
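As a minimal sketch of the kind of analytical question described above, the following Python snippet stores a few facts as subject, predicate, object triples and follows the links from an adverse event to the company behind the drug. Every identifier, predicate name and fact here is invented for illustration; it is not Ontotext's data or API, and a real system would use a graph database and a query language rather than Python lists.

```python
# Toy knowledge base as (subject, predicate, object) triples.
# All identifiers and facts below are hypothetical examples.
TRIPLES = [
    ("drug:D1", "brandName", "BrandA"),
    ("drug:D1", "genericName", "genericx"),
    ("drug:D1", "manufacturer", "org:AcmePharma"),
    ("drug:D2", "brandName", "BrandB"),
    ("drug:D2", "genericName", "genericx"),
    ("drug:D2", "manufacturer", "org:OtherCo"),
    ("study:S1", "testsDrug", "drug:D1"),
    ("study:S1", "adverseEvent", "cond:Asthma"),
    ("study:S2", "testsDrug", "drug:D2"),
    ("study:S2", "adverseEvent", "cond:Headache"),
]


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def manufacturers_with_adverse_event(condition):
    """Which companies make drugs whose studies report this adverse event?"""
    result = set()
    for study, _, _ in match(p="adverseEvent", o=condition):
        for _, _, drug in match(s=study, p="testsDrug"):
            for _, _, company in match(s=drug, p="manufacturer"):
                result.add(company)
    return result


print(manufacturers_with_adverse_event("cond:Asthma"))
```

Because brand name, generic name and manufacturer are separate links on the same drug identifier, the same small graph can answer questions from any of those angles without being redesigned.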
Eric Kavanagh: I remember 15 years or more ago, the whole concept of knowledge management became very popular and it really did not deliver on the initial promise. I think it’s for the reasons you mentioned earlier. I like the way you described that when you said back in the day, the technologies were all black boxes. So they were unpredictable in terms of their usability, they were difficult to put into practice, and I think that for a variety of reasons, the industry just wasn’t quite ready yet for the kind of things that you’re doing now.
But now we have these graph technologies that we can use and we also have these knowledge bases that, if I understand it correctly, are really providing a sort of a foundation for companies to construct a view of their business or their products or their services or their content or whatever the case may be. So with this knowledge base, you’re really providing a seed or a foundation for companies to build a much more clear view of what’s happening in their world, right?
Vassil Momtchev: Yes. You are absolutely right. With knowledge bases, there was a big limitation around how many problems they could solve. There was so much hype and over-expectation that this technology could solve nearly every computer science and business problem by using intelligence. But in reality, Semantic Technology seems to be a slightly simplified version of Artificial Intelligence that is applied to more practical problems. As I said, a knowledge base is really used to represent very complex information describing complex business analytical cases. Or, it can be used for very powerful text analysis.
A key distinction of this technology is that if you have a complex problem, you don’t know what the right questions to ask are. So you try to model your data in the most accurate way possible. After you design all the knowledge and the information and links between the different entities, you start to analyze it from different perspectives.
This is what makes the semantic approach and design a little bit different. You don’t know what your questions are in the beginning. You know what your data is and you know how to represent it, and then you ask all these complex analytical questions afterward.
Eric Kavanagh: That’s a really good point about the comparison of what we might call traditional business intelligence to what Ontotext is doing. The fact is the world doesn’t fit into a nice neat box. I think that’s one of the constraints of traditional business intelligence, which all tends to be very row and column based. It’s all very two-dimensional and sometimes three-dimensional, but the world is multi-dimensional. The world has so many different aspects to it and so many facets and so many components, and there are dependencies between these things and relationships between these things. With your knowledge base and your technology, it seems to me you are creating a more accurate representation of more complex constructs like we talked about in media or life sciences or so forth. Is that a fair assessment?
Vassil Momtchev: Yes, this is an absolutely fair assessment. I will use a few very simple examples. For instance, in the clinical domain, physicians all write down “malignant tumor in breast.” This is a quite popular construct, known as a diagnosis.
Semantic Technology allows you, if you use a knowledge base, in the case of this malignant tumor, to have two types of entities. One entity is a morphological abnormality and the other is the anatomical location. By combining these two things in the knowledge base you know that this is breast cancer. Obviously, every physician knows that these two entities are combined. But you will never find it explicitly in the text. So you need some kind of background knowledge to better understand the text.
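The idea of combining a morphological abnormality with an anatomical location can be sketched in a few lines of Python. The lookup table, the term lists and the function name below are assumptions made up for this illustration; a real knowledge base would hold these relationships as linked entities from a clinical terminology rather than as a hard-coded dictionary.

```python
from typing import Optional

# Hypothetical background knowledge:
# (morphological abnormality, anatomical location) -> diagnosis
DIAGNOSIS_BY_PARTS = {
    ("malignant tumor", "breast"): "breast cancer",
    ("malignant tumor", "lung"): "lung cancer",
}
MORPHOLOGIES = ["malignant tumor"]
LOCATIONS = ["breast", "lung"]


def infer_diagnosis(text: str) -> Optional[str]:
    """Combine the two entity types found in the text into a diagnosis."""
    lowered = text.lower()
    # Naive substring matching stands in for real entity recognition.
    morphology = next((m for m in MORPHOLOGIES if m in lowered), None)
    location = next((loc for loc in LOCATIONS if loc in lowered), None)
    return DIAGNOSIS_BY_PARTS.get((morphology, location))


print(infer_diagnosis("Malignant tumor in breast"))
```

The phrase "breast cancer" never appears in the input; it comes only from the background knowledge, which is the point being made in the interview.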
This is also applicable if you have to analyze very heterogeneous types of data sources. So semantic knowledge is really applicable to highly heterogeneous and complex domains, where you have many different data sources. The methodology of representing all the information as a graph of linked entities allows you to model a variety of data much more easily.
In other use cases with a very simple type of a system, such as with bank transactions, you don’t need a semantic system. You need some sort of operational and transactional system. But if you need a complex analytical query, which comes from various sources, and you need to integrate knowledge extracted from text, Semantic Technology is definitely the design you should consider.
Eric Kavanagh: That’s a really good point. And I think you’re giving a really good explanation for when, where and how to use some of this Semantic Technology. Certainly in the Life Sciences, when you consider the number of different facets or components, even the number of chemical compounds that come into play with various pharmaceuticals and various conditions in the human body – well, there’s no end to those really.
What you’re doing, you’re creating the framework that allows subject matter experts to build upon past knowledge and slowly but surely move their way through a process of understanding the relationships between, let’s just use the example of pharmaceuticals and the human body and the outcomes. And by doing so, you are opening up a pathway for illumination, right?
Vassil Momtchev: Absolutely, yes. I can share more details and insights for the typical types of applications. Ontotext is a company that delivers value on two levels. The first level is really designing information architecture in the enterprise, because the complexity of the data and the landscape of the different systems is quite big. Most of the data is really locked into some sort of data silos. The very first step for Ontotext is always to create a nice, structured information architecture, which allows us to liberate the data and makes linking the data as easy as possible.
The second step is to deliver complete applications, which solve specific business problems. What makes us really unique is that we capture the full variety of all these complexities. From information architecture, to importing the data, to making the knowledge model, to extracting the text and then designing a user-friendly interface, we provide the platform and also the complete application development.
I can give you more insights into the Semantic Technology and what are the typical applications.
Eric Kavanagh: Sure.
Vassil Momtchev: There are cases where pharmaceutical companies use a lot of PDF and Word documents. Their authoring structure is really to follow the processes required by their regulators. But there is a huge amount of knowledge in these documents. And the question really is: how to extract this knowledge. If you have a simple search system like a classic Boolean keyword search and you type the word “diabetes,” you’ll find thousands of documents. And diabetes on its own is so unspecific.
Semantic Technology allows you to put everything into the right context. It applies Natural Language Processing technology, which understands the structure of the PDF and Word documents, to extract the meaning of every section, and then to find the word “diabetes” and what it means. Because if the word diabetes was mentioned in the section under “serious adverse events,” that means it was a serious adverse event in this particular clinical trial; if it’s in the indication, then it is probably the reason for treatment. Without understanding the context, which is typical for most of the classical search engine systems, you won’t be able to know if diabetes is the reason for or the consequence of the treatment.
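To make the section-context idea concrete, here is a small Python sketch that reads a plain-text document, tracks which section it is in, and reports the role a term plays there. The section headings, the role names and the sample document are all hypothetical, and it assumes headings sit on their own line; real regulatory documents would need proper PDF or Word structure parsing.

```python
import re

# Hypothetical mapping from section heading to what a mention there means.
SECTION_ROLES = {
    "indication": "reason for treatment",
    "serious adverse events": "adverse event",
}


def roles_of_term(document: str, term: str):
    """Return (section, role) for each line in a known section mentioning term."""
    roles = []
    current = None
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    for line in document.splitlines():
        stripped = line.strip()
        heading = stripped.rstrip(":").lower()
        if heading in SECTION_ROLES:
            current = heading  # entering a new known section
            continue
        if current and pattern.search(stripped):
            roles.append((current, SECTION_ROLES[current]))
    return roles


doc = """Indication:
Patients with hypertension.
Serious Adverse Events:
Two participants developed diabetes.
"""

print(roles_of_term(doc, "diabetes"))
```

A plain keyword search would return this document for "diabetes" either way; the sketch additionally says that here it is a consequence of the treatment, not the reason for it.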
I also want to emphasize the importance of text analysis. We have really mastered many different algorithms for linguistic analysis, and now our algorithms can also detect various linguistic constructs like negation. If you find the word but there is negation – that there was no diabetes or it was not a brain tumor – it has a completely opposite meaning. This is very frustrating for physicians when they use traditional technology. The technology doesn’t understand them and it only returns the keywords using simple keyword matching.
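Negation handling can be illustrated with a deliberately simple rule: look for a negation cue in the few words before the term. The cue list and the window size below are assumptions chosen for the example, not the algorithms Ontotext uses, and a production system would need much richer linguistic analysis.

```python
import re

# Hypothetical, deliberately short list of negation cues.
NEGATION_CUES = ("no", "not", "without", "denies")


def is_negated(sentence: str, term: str, window: int = 3) -> bool:
    """True if a negation cue appears within `window` words before the term."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    term_tokens = term.lower().split()
    n = len(term_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            preceding = tokens[max(0, i - window):i]
            if any(cue in preceding for cue in NEGATION_CUES):
                return True
    return False


print(is_negated("There was no diabetes.", "diabetes"))
print(is_negated("It was not a brain tumor.", "brain tumor"))
print(is_negated("The patient has diabetes.", "diabetes"))
```

Simple keyword matching would treat all three sentences as hits for the term; the check above separates the first two from the third.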
Eric Kavanagh: I’m guessing that you are able to assign certain properties? When you are going through the process of semantic search, for example, and you find a particular term, maybe it’s a medical term, I’m guessing in your tool you have some capability of manually inserting some definitions or fleshing some of these concepts out manually, such that you can re-assess the data. Is that how it works?
Vassil Momtchev: Yes, this is how it works. But for simplicity, for most of our customers we provide ready-built terminologies and ontologies. So we support all the popular ontologies and terminologies that are required by any domain such as MedDRA, which is used to report on various medical events. What makes us special is that we can give you all the standard ontologies and they are ready to use. Obviously, in some specific scenarios you may want to modify them. But the idea is that you want to start from somewhere and get everything ready.
This is how this process works: you can imagine a graph of knowledge, and some of the information is already there because it is considered popular common knowledge. Once you start to use the system and import your documents, you can create new connections in this graph, which are derived from the documents. Obviously, the system and our graph database support the ability to keep track of which document originated this particular fact. For every fact, you can go back to the original document and see why it was extracted and why it was there. Was it common knowledge imported into the system, or was it something that was learned by importing this type of document?
And this is the full process. You have one big graph of data, which is initially populated with some common knowledge, and you put in some additional data, you further enrich it. This is the full process of using the graph to enrich new data and also to introduce the meanings between the entities.
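The provenance tracking described above can be sketched as a graph that remembers, for every fact, where it came from. The class name, the source labels and the facts below are invented for the example; this is an assumption about one possible shape, not a description of how Ontotext's graph database stores provenance.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy fact store that records the source(s) of every fact."""

    def __init__(self):
        # (subject, predicate, object) -> set of source identifiers
        self._sources = defaultdict(set)

    def add(self, subject, predicate, obj, source):
        self._sources[(subject, predicate, obj)].add(source)

    def sources_of(self, subject, predicate, obj):
        """Where did this fact come from? Empty set if the fact is unknown."""
        return set(self._sources.get((subject, predicate, obj), set()))


graph = ProvenanceGraph()
# Seeded common knowledge (hypothetical).
graph.add("drug:D1", "treats", "cond:Diabetes", source="common-knowledge")
# A fact learned from an imported document (hypothetical file name).
graph.add("drug:D1", "adverseEvent", "cond:Asthma", source="doc:report-17.pdf")
# The same fact confirmed by a second document.
graph.add("drug:D1", "adverseEvent", "cond:Asthma", source="doc:report-42.pdf")

print(graph.sources_of("drug:D1", "adverseEvent", "cond:Asthma"))
```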
Eric Kavanagh: That’s great. Please describe how it works for those who don’t fully understand. You have this graph technology and you talk about an RDF graph, which stands for Resource Description Framework. Can you explain why an RDF graph underneath this technology is so important?
Vassil Momtchev: Yes, the easiest way to explain this is in terms of what computer algorithms can do with the data: the RDF model lets them tell the difference between a string and a thing. In the RDF database, everything is described as a subject, predicate and object, and every object can be one of two different things. One is some sort of a label, which is just a string; the other is an identifier, which is different from the label.
And by using these, you can solve complex analytical problems. If you have homonyms, words that sound the same but have different meanings, they will have different identifiers. For instance, the word “cold” in the clinical domain might have two very different meanings. COLD can be an acronym for Chronic Obstructive Lung Disease, and the other cold is really just the common cold, which you know is caused by a virus.
By using the RDF data model, you can make a distinction between these two concepts and they can have two different identifiers. And when you mention this in the text, you can identify which of the two entities is mentioned. Obviously, in order to do this, you have to know what these models are and you have to train them to disambiguate the meanings of the words in the right way.
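The difference between a string and a thing can be shown with a handful of label triples: two distinct identifiers share the label "cold". The identifiers and the `label` predicate name are made up for this sketch; in real RDF data the identifiers would be IRIs and the labels would be literals.

```python
# Each thing has its own identifier; labels are just strings attached to it.
# Identifiers and labels below are hypothetical examples.
LABELS = [
    ("cond:COLD", "label", "cold"),
    ("cond:COLD", "label", "chronic obstructive lung disease"),
    ("cond:CommonCold", "label", "cold"),
    ("cond:CommonCold", "label", "common cold"),
]


def things_for_string(text):
    """All identifiers whose label equals the given string (case-insensitive)."""
    wanted = text.lower()
    return sorted({s for s, p, o in LABELS if p == "label" and o == wanted})


print(things_for_string("cold"))
print(things_for_string("common cold"))
```

The string "cold" maps to two candidate things, so something else, such as the surrounding context, has to pick between them.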
Eric Kavanagh: That’s a really good point. And as you suggest, you have to train the model. Who are the people in organizations who wind up using the technologies? I understand it would be different for different industries; a Media company would be different from a Life Sciences company. I’m sure that researchers love this type of technology because it gives them such a good mechanism for furthering their knowledge and finding associations. But can you talk about a couple of the different kinds of professionals?
Vassil Momtchev: Typically, we have two types of processes we use to train the algorithms and make them intelligent. So, the first process is if you have a lot of domain knowledge, we’ll manually combine all of the information. In this case, we can develop a statistical model to understand human behavior and, based on the context, decide which of the two meanings is applicable. This is more like a mathematical model. You’ll find some related entities from the context and, based on the frequencies in the context, you decide this is a common cold and not the obstructive lung disease.
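A very rough sketch of the frequency-based approach: score each candidate meaning by how often the surrounding words have been seen near it, and pick the highest score. The counts below are invented numbers standing in for statistics learned from annotated text, and the scoring rule is an assumption chosen for brevity, not the model Ontotext trains.

```python
import re
from collections import Counter

# Hypothetical counts of context words observed near each meaning of "cold".
CONTEXT_COUNTS = {
    "cond:CommonCold": Counter({"runny": 8, "nose": 9, "sneezing": 7, "virus": 5}),
    "cond:COLD": Counter({"smoking": 9, "spirometry": 6, "chronic": 8, "breathlessness": 7}),
}


def disambiguate(sentence):
    """Pick the meaning whose known context words best match the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    scores = {
        sense: sum(counts[w] for w in words)  # unseen words count as 0
        for sense, counts in CONTEXT_COUNTS.items()
    }
    return max(scores, key=scores.get)


print(disambiguate("Patient has a cold with runny nose and sneezing"))
print(disambiguate("History of smoking; spirometry confirms cold"))
```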
The other approach we practice, and it is actually our area of expertise to know which approach we should apply, is to use a domain expert. These domain experts know all these terminologies and can define rules as to how they should be disambiguated.
This is part of the biggest complexity in this process. In order to deliver working solutions that satisfy the business needs, it is important to know which approach to apply and how to integrate everything from the data import to text analysis to the semantic search, and to allow the end users to understand all these complexities and to guide them to the right information.
Eric Kavanagh: Right. We’re in an increasingly complex world and it seems to me if you start looking at the Internet of Things and just the amount of information that’s available through the Internet, you can understand why a technology like this can be so useful. Again, for distilling very complex information environments down to these knowledge bases that you describe, which you can then use for the foundation of really moving forward with your organization.
I think you are very well positioned to help companies deal with these complex information environments.
Just one last question. What are your goals for the next two to three to five years in terms of improving Healthcare, for example?
Vassil Momtchev: I can answer this question in two ways. The first is that we are a company that is currently in a growth phase. Obviously, what we want to do is to have more successful projects and to make our business more scalable. We had a really good start with high profile customers but it’s always good to reach more segments of the market. So one of the challenges is to replicate that success. This is what we are working hard to do in order to reach more customers.
The bigger and more strategic goal for me, and it’s like a personal challenge, is obviously to generate a critical mass of new knowledge. Actually, it’s not to create new knowledge but to enrich the existing knowledge and to make it possible to generate new types of relationships and hypotheses between the information points. Once all of this critical mass of analytical information is generated, I’m sure that somebody will benefit.
It would make me happiest to see our technology solve really big problems in the Healthcare field and to help humans to live more positive, healthy lives.
Eric Kavanagh: You said an interesting thing a moment ago and really I think it speaks to the value of this technology. A lot of times it’s not discovering new knowledge, it’s revealing the knowledge that already exists. We have this expression in English – don’t reinvent the wheel. The point is that the wheel has already been invented. You don’t have to spend too much time reinventing the wheel. It seems to me that’s a big value you provide, helping organizations use large sets of complex data to realize and uncover what has already been learned and already been discovered. Right?
Vassil Momtchev: Absolutely, yes. This is one of the key aspects. There is a lot of knowledge that has already been generated but what makes it more difficult is to create the right links and to find the right correlation to the data to pull in the real hypotheses.
There is so much information that is going to be generated; what is really difficult for every business is to digest it and to make it more complete.
Eric Kavanagh: That’s wonderful stuff. Vassil, thank you so much for your time today.
Vassil Momtchev: Thank you for the wonderful talk.