Learn how connecting text mining to a graph database like GraphDB can help you improve your decision making.
In the not too distant past, enterprises were pursuing a “360-degree view” of their data. At the time, the phrase mainly referred to their structured data, which was neatly housed in Relational Database Management Systems (RDBMS). But with today’s onslaught of unstructured and semi-structured data (such as documents, blogs, news articles, research reports, etc.), we think we can indulge ourselves and extend it to mean more: the full semantic circle.
This post summarizes how the integration between Text Mining and RDF Triplestores (also known as semantic graph databases) provides closed loop semantics. As you read, you may find that you have some questions. Don’t hesitate to ask them in the comments section below.
When you start researching RDF and RDF triplestores, one of the first things you learn is that data is not born in RDF format. To get it there, you need a data transformation tool (like GraphDB’s OntoRefine). Such a tool converts data into subject–predicate–object triples (hence “triplestore”), also known as RDF statements, and loads them into the RDF database.
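To make the idea of a triple concrete, here is a minimal Python sketch of what such a transformation does conceptually. It does not show how OntoRefine works internally; the `http://example.org/` namespace, the column names, the sample row and the `row_to_triples` helper are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical namespace; any base URI you control would do.
BASE = "http://example.org/"

def row_to_triples(row, id_column):
    """Turn one tabular row into (subject, predicate, object) triples."""
    subject = f"<{BASE}{row[id_column]}>"
    triples = []
    for column, value in row.items():
        if column == id_column or not value:
            continue  # the ID becomes the subject; empty cells yield no triple
        predicate = f"<{BASE}{column}>"
        # Values are emitted as plain literals; no escaping is attempted here.
        triples.append((subject, predicate, f'"{value}"'))
    return triples

# Illustrative data, not taken from any real dataset.
table = io.StringIO("id,name,worksFor\ntom,Tom,Acme\n")
triples = [t for row in csv.DictReader(table) for t in row_to_triples(row, "id")]

for s, p, o in triples:
    print(s, p, o, ".")  # N-Triples-style output, one statement per line
```

Each non-empty cell becomes one statement about the row's subject, which is why a table with many columns turns into many small triples rather than one wide record.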
Traditionally, data comes structured in tables (relational databases), which means that you need to define how your data is organized and how the pieces of data relate to one another before you add any new information. It also means that the resulting output is pretty basic and lacks richness.
In contrast, the RDF graph structure is more robust (it can handle massive amounts of data of all kinds and from various sources) and more flexible (it does not need its schema re-defined every time you add new data).
For example, standard RDF covers some basic classifications of terms and relationships (just like relational databases): Tom is a Person, Persons work for Organizations, and Organizations exist in specific Locations. But RDF can go much further: a Person is a resident of a Location, which is itself the capital of a larger entity.
Another benefit of RDF is that each concept in the RDF structure is identified by a Uniform Resource Identifier (URI), and this allows you to create a rich context around everything.
For example, the name “Paris” can refer to both the city of Paris (France) as well as the city of Paris (Texas). Clearly, both concepts have one and the same label. However, what makes them different is their unique URI and the context built around them by all the other concepts to which they are connected.
Therefore, our “Paris” in the first instance will have a unique ID and is more likely to be connected to the concepts “France”, “Eiffel Tower”, “River Seine”, “The Louvre”, etc., whereas “Paris” in the second instance will have a different ID and is more likely to be connected to the concepts “US”, “Texas”, “Samuel Bell Maxey House”, etc. The RDF graph structures effectively represent complex knowledge by modeling relationships in a semantic network.
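The Paris example can be sketched in a few lines of Python. This is only an illustration of the idea that one label can map to several identifiers, each with its own neighborhood of connected concepts. The `http://example.org/` URIs and the predicate names (`label`, `locatedIn`, `hasLandmark`) are hypothetical and do not come from any real vocabulary.

```python
# Hypothetical URIs; real datasets would supply their own identifiers.
PARIS_FR = "http://example.org/resource/Paris_France"
PARIS_TX = "http://example.org/resource/Paris_Texas"

triples = [
    (PARIS_FR, "label", "Paris"),
    (PARIS_FR, "locatedIn", "France"),
    (PARIS_FR, "hasLandmark", "Eiffel Tower"),
    (PARIS_FR, "hasLandmark", "The Louvre"),
    (PARIS_TX, "label", "Paris"),
    (PARIS_TX, "locatedIn", "Texas"),
    (PARIS_TX, "hasLandmark", "Samuel Bell Maxey House"),
]

def resources_with_label(label):
    """All URIs that carry the given human-readable label."""
    return sorted({s for s, p, o in triples if p == "label" and o == label})

def context_of(uri):
    """The concepts a URI is directly connected to, excluding its label."""
    return {o for s, p, o in triples if s == uri and p != "label"}

for uri in resources_with_label("Paris"):
    print(uri, "->", sorted(context_of(uri)))
```

The label lookup returns two distinct resources, and their contexts do not overlap, which is exactly what lets an application tell the two cities apart.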
RDF Knowledge Graphs are available both as free structure graphs and as already modeled domain-specific thesauri, taxonomies and ontologies in either proprietary or open source form. Having your data in RDF significantly increases your data interoperability. It also helps you reach a bigger audience by benefiting from the common understanding of standard Open Data repositories across the globe.
Text Mining, on the other hand, enables you to enrich your data by incorporating packaged Text Mining pipelines for specific domains such as Pharma, Science Publishing, Cultural Heritage, Finance Publishing, News, etc. These pipelines create RDF and automatically insert the statements in your RDF triplestore.
Augmenting the transformation of unstructured text into structured information with a domain-specific Knowledge Graph approach to data representation produces a very rich set of new knowledge. With this rich set, search and discovery applications take on a new potential.
Unfortunately, RDF triplestores and Text Mining are usually not tightly coupled. Organizations either simply use RDF to build large knowledge models or they only process text to extract some information.
While most graph databases simply provide a repository for this information, Ontotext’s GraphDB™ is tightly integrated with Text Mining pipelines through its Concept Extraction Service (CES) API. This powerful coupling means that as new information is extracted, it refers to existing knowledge in GraphDB™ and can be easily inserted into the database.
In general, the Concept Extraction Service uses an open-source framework for Text Mining and retrieves enriched data in RDF format. Organizations typically customize these pipelines, which consist of any set of Text Mining algorithms for scoring, machine learning, disambiguation or any other Text Mining techniques.
It is important to note that these Text Mining pipelines create RDF in a linear fashion and feed GraphDB™. Once the RDF is enriched and stored in the database, the created tags can be modified, edited or removed. This is particularly useful when integrated with Linked Open Data sources.
When the source information changes, updates are propagated to the database automatically. For example, let’s say your Text Mining pipeline is referencing Freebase as its Linked Open Data source for names of organizations. If an organization changes its name or a new subsidiary is announced in Freebase, this information will be updated as reference-able metadata in GraphDB™.
As we already said, in an RDF triplestore, relationships are represented as new and dynamic properties (predicates). This is why GraphDB™ can take a statement and apply its inferencing capabilities to materialize all the possible inferred relationships to that statement. The result is additional intelligence and faster queries.
Let’s see how it works. Let’s start with a known fact:
Barack Obama was elected as president of the United States.
A Text Mining pipeline can easily make the relationship between Barack Obama and the position President of the United States. Still, this fact is time-bound: it holds only for a certain period. In 4 years, it could be different.
In RDF, however, you will model the fact as follows:
<Barack Obama (person)> <is_president> <USA (country)> <Document ID>
<Document ID> dc:date <document date>
Once this provenance is preserved, you can ask the RDF triplestore (via SPARQL) who the president of the United States was back in 2002. This is also known as “multiple versions of the truth”.
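As a rough sketch of why keeping the source document and its date matters, the following Python snippet stores each fact together with a document ID and answers a date-scoped question. In a real triplestore you would express this as a SPARQL query, not as Python. The document IDs, the document dates and the `presidents_reported_in` helper are assumptions chosen purely for illustration.

```python
from datetime import date

# Each statement carries its source document as a fourth element.
quads = [
    ("George_W_Bush", "is_president", "USA", "doc1"),
    ("Barack_Obama", "is_president", "USA", "doc2"),
]

# Document metadata, the counterpart of: <Document ID> dc:date <document date>.
# The IDs and dates are made up for this example.
doc_dates = {
    "doc1": date(2002, 1, 29),
    "doc2": date(2009, 1, 20),
}

def presidents_reported_in(year, country="USA"):
    """People reported as president in documents dated within `year`."""
    return sorted({
        s for s, p, o, doc in quads
        if p == "is_president" and o == country and doc_dates[doc].year == year
    })

print(presidents_reported_in(2002))
print(presidents_reported_in(2009))
```

Both statements coexist in the store without contradicting each other, because each one is qualified by the document it came from.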
Let’s look at another example.
In June of 2014, Semprana was rejected for treatment of migraines by the FDA.
A Text Mining pipeline would determine Semprana as a prescription drug and insert definitions or knowledge from other sources. It would do the same for the other concepts in the sentences such as migraine and FDA and would identify the date as June 2014.
And this is where the power of inference kicks in. GraphDB™ can take this statement and produce a report for all migraine drugs rejected by the FDA.
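To illustrate the general idea of materializing inferred statements and then reporting on them, here is a small forward-chaining sketch in Python. It is not GraphDB's reasoner and does not use its rule language. The predicate names, the class hierarchy and the second rule are hypothetical, invented only to show the mechanics.

```python
# Explicit facts, as a Text Mining pipeline might have produced them.
facts = {
    ("Semprana", "type", "PrescriptionDrug"),
    ("Semprana", "indicatedFor", "Migraine"),
    ("Semprana", "rejectedBy", "FDA"),
    ("PrescriptionDrug", "subClassOf", "Drug"),
}

def materialize(facts):
    """Apply the rules repeatedly until no new statement appears."""
    inferred = set(facts)
    while True:
        new = set()
        for s, p, o in inferred:
            # Rule 1: class membership propagates up the subClassOf hierarchy.
            if p == "type":
                for s2, p2, o2 in inferred:
                    if p2 == "subClassOf" and s2 == o:
                        new.add((s, "type", o2))
            # Rule 2 (hypothetical): a Drug indicated for migraine is a MigraineDrug.
            if p == "indicatedFor" and o == "Migraine" and (s, "type", "Drug") in inferred:
                new.add((s, "type", "MigraineDrug"))
        if new <= inferred:
            return inferred  # fixed point reached
        inferred |= new

def migraine_drugs_rejected_by(agency, graph):
    """Report every MigraineDrug that the given agency rejected."""
    return sorted(
        s for s, p, o in graph
        if p == "rejectedBy" and o == agency and (s, "type", "MigraineDrug") in graph
    )

graph = materialize(facts)
print(migraine_drugs_rejected_by("FDA", graph))
```

Nobody stated that Semprana is a migraine drug. That statement exists only after materialization, and the report query depends on it.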
It’s worth mentioning (although detail is beyond the scope of this post) that one of the unique attributes of GraphDB™ is its ability to update the RDF repository together with all the inferred relationships without a substantial performance hit when an inferred statement is retracted.
When developing Text Mining pipelines, each solution may utilize a different set of tools depending on your particular use case. Disambiguation, for example, can take place solely in a Text Mining pipeline through machine learning and “training” a pipeline on a specific domain. For example, “Orange” in a health and wellness context would most likely refer to the fruit, while in a geographical context it would most likely refer to the county in southern California.
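The intuition behind context-based disambiguation can be shown with a deliberately simple keyword-overlap score. This is a stand-in for the trained machine learning models described above, not an example of them. The candidate senses, their context words and the `disambiguate` function are assumptions made for the sake of the sketch.

```python
# Hypothetical candidate senses with hand-picked context words.
CANDIDATES = {
    "Orange (fruit)": {"vitamin", "juice", "diet", "fruit", "nutrition"},
    "Orange County, California": {"california", "county", "coast", "city", "population"},
}

def disambiguate(text, candidates=CANDIDATES):
    """Pick the sense whose context words overlap most with the text."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    scores = {sense: len(words & context) for sense, context in candidates.items()}
    best = max(scores, key=scores.get)
    # With no overlapping context at all, refuse to guess.
    return best if scores[best] > 0 else None

print(disambiguate("Orange juice is rich in vitamin C."))
print(disambiguate("Orange is a county on the California coast."))
```

A production pipeline would learn these associations from domain-specific training data, but the principle is the same: the surrounding words decide which concept a mention refers to.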
The tightly-coupled integration of Text Mining and RDF triplestores makes the end-to-end process of structuring unstructured data, enriching domain-specific content and feeding a dynamic repository of facts much easier to operationalize. In this dance, Text Mining takes care of correctly tagging each concept while the RDF database ensures that only documents about this concept are served in the search results.
This blend of technologies happens to be unique in the market. The result is a full semantic circle where dynamic curation, authoring and reporting can be executed on an enterprise scale. Ontotext offers all of this technology in the Ontotext Platform.