Text Mining & Graph Databases – Two Technologies that Work Well Together

July 5, 2014 6 mins. read Milena Yankova

Weaving Data into Text

Graph databases and Text Mining work well together: you can apply Natural Language Processing to extract meaning from free-flowing text and then store the results in a graph database, where they can be used for knowledge discovery and analysis.

Why Graph Databases?

Graph databases, and especially semantic graph databases (also called RDF triplestores), can smoothly integrate heterogeneous data from multiple sources and make it all interlinked. They can store hundreds of billions of facts (in the form of RDF triples) about any concept imaginable.

Read our White Paper: The Truth about Triplestores

Today, there are many freely available interlinked facts from sources such as DBpedia, GeoNames, Wikidata and so on, and their number continues to grow as we speak. Some estimate their total between 150 and 200 billion right now. This freely available data is called Linked Open Data (LOD) and it can be a good source of information to power your graph database.

However, the real power of LOD comes when you transform your own data into RDF triples and then connect your proprietary knowledge to open world knowledge.

Another equally important capability of graph databases is inference, in which new knowledge is derived from existing facts. Here is a simple example of inference using two pre-existing facts: Fido is a dog and a dog is a mammal. From these facts, we can infer that Fido is also a mammal. When such new facts are materialized and stored in a graph database, your search results become much more relevant, opening new avenues for actionable insights.
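To make the Fido example concrete, here is a minimal sketch of materializing inferred facts in plain Python. It is not how any particular graph database implements inference; the predicate names `is_a` and `subclass_of` are illustrative assumptions standing in for the type and subclass relations an RDF triplestore would use.

```python
# Facts as (subject, predicate, object) triples.
# "is_a" and "subclass_of" are illustrative predicate names (assumptions).
facts = {
    ("Fido", "is_a", "Dog"),
    ("Dog", "subclass_of", "Mammal"),
}

def materialize(triples):
    """Repeat two rules until no new facts appear:
    1. x is_a C,        C subclass_of D  =>  x is_a D
    2. C subclass_of D, D subclass_of E  =>  C subclass_of E
    """
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            if p not in ("is_a", "subclass_of"):
                continue
            for (s2, p2, o2) in inferred:
                # Follow one subclass link upward from the current object.
                if p2 == "subclass_of" and s2 == o:
                    new.add((s, p, o2))
        if new <= inferred:
            return inferred
        inferred |= new

all_facts = materialize(facts)
print(("Fido", "is_a", "Mammal") in all_facts)  # True
```

The inferred triple is stored alongside the original ones, so a later search for mammals finds Fido without any extra reasoning at query time.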

But if you want to add even more power to your data, you can use Text Mining techniques to extract the salient facts from free-flowing texts and then add them to the facts in your graph database. The resulting dataset is richer and much more useful to analyze, visualize, aggregate and report on.

Unfortunately, the market for this type of technology is still fragmented. Some vendors only sell graph databases and leave it up to you to determine how to do the Text Mining part. Others offer just Text Mining and let you figure out where to store the results.

We at Ontotext, however, believe that graph databases and Text Mining should go hand in hand. Both of these technologies power our Ontotext Platform and are at the core of our Technology Solutions.

More About Text Mining

But what does a typical Text Mining process look like?

Here’s a simplified explanation in 5 steps:

Step 1 – Extracting Concepts (Facts) from Free-Flowing Text

In this step, the text from your unstructured data (such as documents, blogs, news articles, research reports, etc.) is processed by a Text Mining pipeline.

There are, of course, different stages that take place during this step, such as sentence splitting, tokenization, Named Entity Recognition (NER), etc. But, basically, it’s about subjecting the text of the document to a series of actions that break it down into smaller pieces in order to add a layer of meaning to the raw content.
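The stages named above can be sketched in a few lines of Python. This is a deliberately naive illustration, not a production pipeline: the sentence splitter and tokenizer are simple regular expressions, and `GAZETTEER` is a hypothetical lookup table standing in for a real NER component.

```python
import re

# Hypothetical gazetteer: a tiny lookup of known names and their types.
GAZETTEER = {
    "Sally": "Person",
    "Gary": "Person",
    "Tampa": "Location",
    "Banking Corp": "Organization",
}

def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Words and standalone punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

def recognize_entities(sentence):
    # Check longer names first so multi-word names are listed before shorter ones.
    found = []
    for name in sorted(GAZETTEER, key=len, reverse=True):
        if re.search(r"\b" + re.escape(name) + r"\b", sentence):
            found.append((name, GAZETTEER[name]))
    return found

text = "Sally worked at Banking Corp. Gary lives in Tampa."
for sentence in split_sentences(text):
    print(tokenize(sentence), recognize_entities(sentence))
```

Real pipelines replace each function with a trained or rule-based component, but the shape stays the same: each stage consumes the output of the previous one and adds a layer of annotation.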

Information Extraction

Step 2 – Enriching Concepts by Using Linked Open Data

As we have already mentioned, there are many freely available datasets that describe places, people, events, music, etc. You can use these sources of Linked Open Data to identify the concepts you have extracted from the text and get additional information about each of them.

For example, if your text mentions Bruce Springsteen, you can link the extracted concept to the MusicBrainz Linked Open Dataset. As a result, you will derive a wealth of additional information about the singer including songs he has written, albums, concerts, biographical information and much more.

In this way, enriching your original data with additional data opens even greater possibilities for Text Mining.
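As a rough sketch of what enrichment looks like in code, the example below links an extracted mention to a record and turns that record into triples. `LOD_STUB` is a stand-in for a real Linked Open Data source, and the identifier in it is a made-up placeholder, not an actual MusicBrainz URI; in practice the lookup would be a query against the public dataset.

```python
# Stand-in for a Linked Open Data source (assumption for illustration).
# The "id" value is a placeholder, not a real MusicBrainz identifier.
LOD_STUB = {
    "Bruce Springsteen": {
        "id": "example:artist/bruce-springsteen",
        "type": "MusicArtist",
        "albums": ["Born to Run", "Born in the U.S.A."],
    },
}

def enrich(mention, source=LOD_STUB):
    """Return triples describing the mention, or an empty list if unknown."""
    record = source.get(mention)
    if record is None:
        return []
    entity = record["id"]
    triples = [(entity, "label", mention), (entity, "type", record["type"])]
    triples += [(entity, "album", title) for title in record["albums"]]
    return triples

for triple in enrich("Bruce Springsteen"):
    print(triple)
```

Because the result is just more triples, it can be stored next to the facts extracted from your own text without any schema changes.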

Step 3 – Identifying Relationships Between Concepts

The next step is to identify the relationships between the different concepts in your text. Here are some examples:

Sally worked at Banking Corp.
Gary lives in Tampa.
Tampa is a city in Florida.
Gary worked with Sally.
Sally plays golf.

Some of these relationships are explicitly described in the text but some have been added through the process of LOD enrichment.

Because graph databases represent relationships in a Knowledge Graph structure, they can easily express not only the relationship between two concepts in a statement but also multiple relationships across the whole dataset. This makes graph databases ideal for storing facts extracted from text, and a powerful tool for relationship-centered analytics and knowledge discovery.
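The example statements above map directly onto triples. The sketch below stores them in a plain Python list and answers a two-hop question ("who lives in Florida?") by chaining two relationships. The predicate names and the `match` helper are assumptions made for illustration; a real graph database would answer the same question with a query language such as SPARQL.

```python
# The example statements as (subject, predicate, object) triples.
# Predicate names are illustrative assumptions.
triples = [
    ("Sally", "worked_at", "Banking Corp"),
    ("Gary", "lives_in", "Tampa"),
    ("Tampa", "city_in", "Florida"),
    ("Gary", "worked_with", "Sally"),
    ("Sally", "plays", "golf"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def residents_of(state):
    # Two-hop query: person lives_in city, and city city_in state.
    cities = {city for (city, _, _) in match(p="city_in", o=state)}
    return sorted(person for (person, _, city) in match(p="lives_in")
                  if city in cities)

print(residents_of("Florida"))  # ['Gary']
print(match(s="Sally"))         # everything stated about Sally
```

Note that the text never says "Gary lives in Florida". The answer comes from combining a fact found in the text with one added through enrichment.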

Step 4 – Disambiguating One Concept from Another

In this step, the concepts detected in the text are disambiguated and linked to their specific instances in the graph database. In other words, here you have to determine whether mentions in the text that appear to be similar refer to the same specific instance or not.

For example, figuring out whether “Paris” in a text refers to the capital of France, or to Paris, Texas, or to Paris Hilton, or maybe even to the movie “Paris, Texas” is crucial for the correct understanding of this text. And without a good understanding, no reliable analysis can happen.

As ambiguity is a common problem in free-flowing text, the ability to teach a computer to pick the correct instance is very powerful.
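One very simple way to approximate disambiguation is to score each candidate by how many of its typical context words appear in the surrounding text. The sketch below does exactly that. It is a keyword-overlap heuristic rather than a learned model, and the context words in `CANDIDATES` are hand-picked assumptions, so treat it as an illustration of the idea only.

```python
import re

# Hypothetical candidates for the mention "Paris", each with a few
# hand-picked context words (assumptions for illustration).
CANDIDATES = {
    "Paris (capital of France)": {"france", "capital", "seine", "eiffel"},
    "Paris, Texas (city)": {"texas", "county", "lamar"},
    "Paris Hilton": {"hilton", "heiress", "celebrity", "hotel"},
    "Paris, Texas (film)": {"film", "movie", "wenders", "director"},
}

def disambiguate(text):
    """Pick the candidate whose context words overlap most with the text.

    Returns None when no candidate has any overlap.
    """
    words = set(re.findall(r"\w+", text.lower()))
    scores = {name: len(context & words) for name, context in CANDIDATES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("The Eiffel Tower is the best-known landmark in Paris, France."))
print(disambiguate("Paris is lovely."))
```

The second call returns `None` because nothing in the sentence favors one candidate over another. A real system would fall back on other signals at that point, such as the rest of the document or how often each instance is mentioned in general.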

Disambiguation

Step 5 – Semantically Indexing Everything

In this final step, all facts, together with references to the documents they came from, are indexed and stored in your graph database (along with the ontology, the model used for classifying the concepts and the relationships between them).

This provides robust capabilities for search and knowledge discovery. For example, after indexing your concepts and the quotation relationships in a dataset of news, you can ask what Trump has said about Putin.
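A question like "what has X said about Y" becomes a simple lookup once quotation relationships are indexed together with their source documents. The sketch below shows one possible shape for such an index. The class, its method names, the speakers, the quotes and the document identifiers are all hypothetical placeholders, not real data or a real product API.

```python
from collections import defaultdict

class QuoteIndex:
    """Toy index of quotation facts, keeping a reference to each source document."""

    def __init__(self):
        self._by_pair = defaultdict(list)

    def add(self, speaker, about, quote, doc_id):
        # Store the fact under (speaker, subject) with its provenance.
        self._by_pair[(speaker, about)].append((quote, doc_id))

    def said_about(self, speaker, about):
        # Return a copy so callers cannot modify the index by accident.
        return list(self._by_pair.get((speaker, about), []))

index = QuoteIndex()
# Placeholder data only: speakers, quotes and document ids are made up.
index.add("Person A", "Person B", "placeholder quote 1", "doc-001")
index.add("Person A", "Person B", "placeholder quote 2", "doc-007")
index.add("Person B", "Person A", "placeholder quote 3", "doc-007")

for quote, doc_id in index.said_about("Person A", "Person B"):
    print(doc_id, quote)
```

Keeping the document identifier with every fact is what lets you trace each answer back to the text it was extracted from.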

Important Takeaways

When you think of Text Mining in these simple terms, it’s not too difficult to understand how the free-flowing text in your documents can be transformed into meaningful insights.

The ability to process unstructured data, transform it into structured intelligence, and store the results in a graph database along with a classification system will give your business a huge competitive advantage.

Today, as organizations make the critical decision to discover the hidden meaning behind the massive amounts of unstructured data sitting in their different legacy systems, Text Mining is becoming increasingly important. Coupling this process with the power of a graph database like Ontotext’s GraphDB™ to query, aggregate, report on and visualize this data fuels faster knowledge discovery and smarter decision-making.

A bright lady with a PhD in Computer Science, Milena's path started in the role of a developer, passed through project and quickly led her to product management. For her a constant source of miracles is how technology supports and alters our behaviour, engagement and social connections.
