
Text Mining to Triplestores – The Full Semantic Circle

February 10, 2015 7 mins. read Milena Yankova


In the not too distant past, enterprises were pursuing a “360-degree view” of their data. At the time, the phrase mainly referred to their structured data, which was neatly housed in Relational Database Management Systems (RDBMS). But with today’s onslaught of unstructured and semi-structured data (such as documents, blogs, news articles, research reports, etc.), we think we can indulge ourselves and extend it to mean more: the full semantic circle.

This post summarizes how the integration between Text Mining and RDF Triplestores (also known as semantic graph databases) provides closed loop semantics. As you read, you may find that you have some questions. Don’t hesitate to ask them in the comments section below.

Read our White Paper: The Truth about Triplestores

The Scoop About Triplestores

When you start researching RDF and RDF triplestores, one of the first things you learn is that data is not born in RDF format. To get it there, you need a data transformation tool (like GraphDB’s OntoRefine). Such a tool converts data into subject–predicate–object triples (hence “triplestore”), also known as RDF statements, and loads them into the RDF database.
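As a rough illustration of that conversion, the sketch below turns one tabular record into subject–predicate–object triples. It is not how OntoRefine works internally; the `http://example.org/` namespace, the `row_to_triples` helper and the sample record are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"


def row_to_triples(row: Dict[str, str], key: str) -> List[Triple]:
    """Turn one tabular record into subject-predicate-object triples.

    The column named by `key` identifies the subject; every other
    column becomes a predicate whose object is the cell value.
    """
    subject = EX + "id/" + row[key]
    return [
        (subject, EX + "prop/" + column, value)
        for column, value in row.items()
        if column != key
    ]


# A made-up record, as it might appear in a relational table.
row = {"id": "tom", "type": "Person", "worksFor": "Acme"}
triples = row_to_triples(row, key="id")

for triple in triples:
    print(triple)
```

Each non-key column yields exactly one statement, so adding a new column later simply adds new triples without redefining anything that already exists.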

Traditionally, data comes structured in tables (relational databases), which means that you need to define how your data is organized and how the pieces of data relate to one another before you add any new information. It also means that the resulting output is pretty basic and lacks richness.

In contrast, the RDF graph structure is more robust (it can handle massive amounts of data of all kinds and from various sources) and more flexible (it does not need its schema re-defined every time you add new data).

For example, standard RDF covers some basic classifications of terms and relationships (just like relational databases): Tom is a Person. Persons work for Organizations. Organizations exist in specific Locations. But RDF can go much further: Person is a resident of Location which is the capital of Location.

Another benefit of RDF is that each concept in the RDF structure is referred to by its own Uniform Resource Identifier (URI), which allows you to create a rich context around everything.

For example, the name “Paris” can refer to both the city of Paris (France) as well as the city of Paris (Texas). Clearly, both concepts have one and the same label. However, what makes them different is their unique URI and the context built around them by all the other concepts to which they are connected.

Therefore, “Paris” in the first instance will have a unique ID and is more likely to be connected to the concepts “France”, “Eiffel Tower”, “River Seine”, “The Louvre”, etc., whereas “Paris” in the second instance will have a different ID and is more likely to be connected to the concepts “US”, “Texas”, “Samuel Bell Maxey House”, etc.
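The Paris example can be sketched with a handful of triples held in a plain Python set. The `ex:` identifiers and the `ex:locatedIn` predicate are invented for this illustration and do not come from any real dataset; the point is only that one label maps to two URIs, and that the neighbours of each URI tell them apart.

```python
LABEL = "rdfs:label"

# Two distinct resources share the label "Paris".
# All ex: identifiers are hypothetical.
triples = {
    ("ex:Paris_France", LABEL, "Paris"),
    ("ex:Paris_France", "ex:locatedIn", "ex:France"),
    ("ex:Eiffel_Tower", "ex:locatedIn", "ex:Paris_France"),
    ("ex:Paris_Texas", LABEL, "Paris"),
    ("ex:Paris_Texas", "ex:locatedIn", "ex:Texas"),
    ("ex:Samuel_Bell_Maxey_House", "ex:locatedIn", "ex:Paris_Texas"),
}


def resources_with_label(label):
    """Return every URI that carries the given label."""
    return sorted(s for s, p, o in triples if p == LABEL and o == label)


def context_of(uri):
    """Return all resources directly connected to `uri`, in either direction."""
    outgoing = {o for s, p, o in triples if s == uri and p != LABEL}
    incoming = {s for s, p, o in triples if o == uri}
    return outgoing | incoming


candidates = resources_with_label("Paris")
for uri in candidates:
    print(uri, sorted(context_of(uri)))
```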

The RDF graph structures effectively represent complex knowledge by modeling relationships in a semantic network.

RDF Knowledge Graphs are available both as free structure graphs and as already modeled domain-specific thesauri, taxonomies and ontologies in either proprietary or open source form. Having your data in RDF significantly increases your data interoperability. It also helps you reach a bigger audience by benefiting from the common understanding of standard Open Data repositories across the globe.

RDF Graph

When Text Mining Joins the Dance

Text Mining, on the other hand, enables you to enrich your data by incorporating packaged Text Mining pipelines for specific domains such as Pharma, Science Publishing, Cultural Heritage, Finance Publishing, News, etc. These pipelines create RDF and automatically insert the statements in your RDF triplestore.

Augmenting the transformation of unstructured text into structured information with a domain-specific Knowledge Graph approach to data representation produces a very rich set of new knowledge. With this rich set, search and discovery applications take on a new potential.

Unfortunately, RDF triplestores and Text Mining are usually not tightly coupled. Organizations either simply use RDF to build large knowledge models or they only process text to extract some information.

While most graph databases simply provide a repository for this information, Ontotext’s GraphDB™ is tightly integrated with Text Mining pipelines through its Concept Extraction Service (CES) API. This powerful coupling means that as new information is extracted, it refers to existing knowledge in GraphDB™ and can be easily inserted into the database.


How Does This Actually Work?

In general, the Concept Extraction Service uses an open-source framework for Text Mining and retrieves enriched data in RDF format. Organizations typically customize these pipelines, which consist of any set of Text Mining algorithms for scoring, machine learning, disambiguation or any other Text Mining techniques.

It is important to note that these Text Mining pipelines create RDF in a linear fashion and feed GraphDB™. Once the RDF is enriched and stored in the database, the created tags can be modified, edited or removed. This is particularly useful when integrated with Linked Open Data sources.

When the source information changes, updates to the database are populated automatically. For example, let’s say your Text Mining pipeline is referencing Freebase as its Linked Open Data source for names of organizations. If an organization changes its name or a new subsidiary is announced in Freebase, this information will be updated as reference-able metadata in GraphDB™.

Examples Speak Louder Than Explanations

As we already said, in an RDF triplestore, relationships are represented as new and dynamic properties (predicates). This is why GraphDB™ can take a statement and apply its inferencing capabilities to materialize all the possible inferred relationships to that statement. The result is additional intelligence and faster queries.
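To make “materializing inferred relationships” concrete, here is a minimal forward-chaining sketch in plain Python. It is not GraphDB’s reasoner: the two rules, the predicate names (`residentOf`, `capitalOf`, `partOf`, `livesIn`) and the starting facts are assumptions chosen to echo the resident-of and capital-of example earlier in the post.

```python
def apply_rules(triples):
    """Derive new triples from the current set using two toy rules."""
    derived = set()
    for a, p1, b in triples:
        for c, p2, d in triples:
            if b != c:
                continue
            # Rule 1: a residentOf b, b capitalOf d  =>  a livesIn d
            if p1 == "residentOf" and p2 == "capitalOf":
                derived.add((a, "livesIn", d))
            # Rule 2: a livesIn b, b partOf d  =>  a livesIn d
            if p1 == "livesIn" and p2 == "partOf":
                derived.add((a, "livesIn", d))
    return derived


def materialize(facts):
    """Repeat the rules until no new triple appears (a fixpoint)."""
    closure = set(facts)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


facts = {
    ("Tom", "residentOf", "Paris"),
    ("Paris", "capitalOf", "France"),
    ("France", "partOf", "Europe"),
}
closure = materialize(facts)
for triple in sorted(closure - facts):
    print("inferred:", triple)
```

Because the inferred statements are stored alongside the explicit ones, a later query for where Tom lives is a simple lookup rather than a chain of joins computed at query time.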

Let’s see how it works. Let’s start with a known fact:

Barack Obama was elected as president of the United States.

A Text Mining pipeline can easily extract the relationship between Barack Obama and the position President of the United States. Still, this fact only holds for a period of time. In 4 years, it could be different.

In RDF, however, you will model the fact as follows:

<Barack Obama (person)> <is_president> <USA (country)> <Document ID>
<Document ID> dc:date <document date>

Once this provenance is preserved, you can ask the RDF triplestore (via SPARQL) who the president of the United States was back in 2002. This is also known as “multiple versions of the truth”.
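The idea can be sketched in plain Python by storing each statement together with the ID of the document it came from, plus a date for each document. In a real triplestore this lookup would be a SPARQL query; the document IDs, the document dates and the `ex:` names below are hypothetical and exist only for this illustration.

```python
from datetime import date

# Each statement carries a fourth element: the source document.
# Document IDs and dates are made up for this sketch.
quads = [
    ("ex:George_W_Bush", "ex:is_president", "ex:USA", "doc:1"),
    ("ex:Barack_Obama", "ex:is_president", "ex:USA", "doc:2"),
]

# Plays the role of the <Document ID> dc:date <document date> statements.
doc_dates = {
    "doc:1": date(2002, 6, 1),
    "doc:2": date(2009, 1, 20),
}


def president_in(year):
    """Return who is stated to be president in documents dated `year`."""
    return sorted(
        {
            s
            for s, p, o, doc in quads
            if p == "ex:is_president"
            and o == "ex:USA"
            and doc_dates[doc].year == year
        }
    )


print(president_in(2002))
print(president_in(2009))
```

Both statements stay in the store at the same time; the document date decides which one answers a given question.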

Let’s look at another example.

In June of 2014, Semprana was rejected for treatment of migraines by the FDA.

A Text Mining pipeline would identify Semprana as a prescription drug and insert definitions or knowledge from other sources. It would do the same for the other concepts in the sentence, such as migraine and FDA, and would identify the date as June 2014.

And this is where the power of inference kicks in. GraphDB™ can take this statement and produce a report for all migraine drugs rejected by the FDA.
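A toy version of such a report is sketched below. It uses a plain join over a set of triples rather than GraphDB’s reasoner, and the predicate names are assumptions. `ex:DrugX` is a placeholder, not a real product, included only so that the filter has something to exclude.

```python
facts = {
    # Statements a pipeline might extract from the Semprana sentence.
    ("ex:Semprana", "rdf:type", "ex:PrescriptionDrug"),
    ("ex:Semprana", "ex:indicatedFor", "ex:Migraine"),
    ("ex:Semprana", "ex:rejectedBy", "ex:FDA"),
    ("ex:Semprana", "ex:decisionDate", "2014-06"),
    # Hypothetical placeholder, not a real product.
    ("ex:DrugX", "rdf:type", "ex:PrescriptionDrug"),
    ("ex:DrugX", "ex:indicatedFor", "ex:Migraine"),
    ("ex:DrugX", "ex:approvedBy", "ex:FDA"),
}


def subjects(predicate, obj):
    """Return every subject that has the given predicate and object."""
    return {s for s, p, o in facts if p == predicate and o == obj}


def rejected_drugs(condition, agency):
    """List prescription drugs for `condition` that `agency` rejected."""
    drugs = subjects("rdf:type", "ex:PrescriptionDrug")
    for_condition = subjects("ex:indicatedFor", condition)
    rejected = subjects("ex:rejectedBy", agency)
    return sorted(drugs & for_condition & rejected)


report = rejected_drugs("ex:Migraine", "ex:FDA")
print(report)
```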

It’s worth mentioning (although detail is beyond the scope of this post) that one of the unique attributes of GraphDB™ is its ability to update the RDF repository together with all the inferred relationships without a substantial performance hit when an inferred statement is retracted.

Important Takeaways

Ontotext Platform

When developing Text Mining pipelines, each solution may utilize a different set of tools depending on your particular use case. Disambiguation, for example, can take place solely in a Text Mining pipeline through machine learning and “training” a pipeline on a specific domain. In a health and wellness context, “Orange” would most likely refer to the fruit, while in a geographical context it would more likely refer to Orange County in Southern California.
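The “Orange” case can be illustrated with a deliberately simple stand-in for a trained model: count how many context words from the text overlap with a small keyword set for each candidate. This keyword-overlap heuristic is not the machine learning approach described above, and the candidate URIs and keyword sets are invented for the sketch.

```python
# Hypothetical candidates, each with hand-picked context keywords.
CANDIDATES = {
    "ex:Orange_fruit": {"fruit", "vitamin", "juice", "diet"},
    "ex:Orange_County_California": {"county", "california", "city", "coast"},
}


def disambiguate(text):
    """Pick the candidate whose keywords overlap most with the text."""
    words = set(text.lower().split())
    # Sorting the keys makes tie-breaking deterministic.
    return max(
        sorted(CANDIDATES),
        key=lambda uri: len(CANDIDATES[uri] & words),
    )


health = disambiguate("An orange a day adds vitamin C to your diet")
geo = disambiguate("Orange is a county in southern California")
print(health)
print(geo)
```

A production pipeline would learn these associations from domain-specific training data instead of relying on fixed keyword lists.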

The tightly-coupled integration of Text Mining and RDF triplestores makes the end-to-end process of structuring unstructured data, enriching domain-specific content and feeding a dynamic repository of facts much easier to operationalize. In this dance, Text Mining takes care of correctly tagging each concept while the RDF database ensures that only documents about this concept are served in the search results.

This blend of technologies happens to be unique in the market. The result is a full semantic circle where dynamic curation, authoring and reporting can be executed on an enterprise scale. Ontotext offers all of this technology in the Ontotext Platform.

Want to learn more about RDF triplestores like Ontotext’s GraphDB, which powers the Ontotext Platform?

Read our White Paper: The Truth about Triplestores


A bright lady with a PhD in Computer Science, Milena's path started in the role of a developer, passed through project and quickly led her to product management. For her a constant source of miracles is how technology supports and alters our behaviour, engagement and social connections.
