Connecting the Dots to Turn Data Into Knowledge: Entity Linking

The advantages and disadvantages of different ways to do entity linking based on reconciliation, inference, SPARQL and Kafka

August 26, 2022 · 10 min read · Radostin Nanov

Knowledge isn’t simply derived from raw data; it emerges from linking datapoints. When data is properly linked, you can easily find other, related data. A basic datapoint would be “Paris is the capital of France”. The problem with this datapoint is that we don’t know which Paris we are referring to – is it Paris, Texas, Paris Hilton, or Paris in France?

Furthermore, what does “France” stand for? Is it the country or the Canadian river? And what, exactly, is a “capital”? Knowledge derived from that would be to know that the city of Paris is the capital city of the country of France. We can keep iterating over this, deriving more and more links from the basic building blocks of our data – of course, provided that it is linked.

Here’s an evolution of that same idea. Once we have a data graph, we can link it with other knowledge models. As illustrated in the diagram above, the fundamental difference between an arbitrary graph data structure and a knowledge graph is that the latter includes a knowledge model which then affords automated reasoning capabilities one may decide to take advantage of.

Turning data into knowledge requires an explicit knowledge model, an ontology, that can combine a conventional data schema with other types of topical or terminological knowledge: taxonomies, controlled vocabularies, domain models and business rules. In the rest of this post, we assume the presence of such a knowledge model and focus on linking data elements to such models and to each other.

Without data linking, you can have different data lakes that do not collaborate to form a common graph. Many enterprises struggle at exactly this critical step – how to derive useful knowledge given their basic data.

And the first step to untangling the web of data is to identify the strings and turn them into proper identifiers. In mainstream enterprise data management, such identifiers belong to the so-called Reference data (“data used to classify or categorize other data”) and Master data (“data about the business entities that provide context for business transactions”). The distinction between these two flavors is irrelevant to some projects, so we group both under the term “Reference Entity Data”.

Test case

It is much easier to illustrate our point if we have a common frame of reference. So, let’s make up some data for our use case.

Source Data

Turtle data:

@prefix bvz: <https://data.com/bvz/> .

<https://data.com/bvz/collection/D001> a bvz:DataCollection ;
    bvz:owner "William Hank" ;
    bvz:ownerID "U100333" ;
    bvz:creator "Sandra Hellen" ;
    bvz:creatorID "U100334" ;
    bvz:reference "D001" .

Reference Entity Data

We want to know how the entity in the Source Data, the “data collection”, is related to the entities in the Reference Entity Data. In this example, this corresponds to knowing that the data collection D001 has the owner William Hank (identified by either name or ID) and the creator Sandra Hellen.
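The reference entity data itself is not reproduced here; a minimal Turtle sketch consistent with the rest of this post (the Person type and schema.org identifier used by the inference rule, and the name fields used by the Kafka connector) might look like:

```turtle
@prefix schema: <https://schema.org/> .

<https://data.com/person/U100333> a <https://data.com/Person> ;
    schema:identifier "U100333" ;
    schema:givenName "William" ;
    schema:familyName "Hank" .

<https://data.com/person/U100334> a <https://data.com/Person> ;
    schema:identifier "U100334" ;
    schema:givenName "Sandra" ;
    schema:familyName "Hellen" .
```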

OOTB entity linking solution

The out-of-the-box solution for linking entities in GraphDB is the reconciliation capability of Ontotext Refine. It is robust, configurable and text-based. Ontotext has developed a Reconciliation Service that allows you to leverage the GraphDB Elasticsearch (ES) connector and reconcile against your own data. With it, you can create ES indexes from your RDF repository, feed it some data, and provide an ES query template written in Mustache. The add-on also generates a default query automatically if you don’t want to customize.

However, this service is intended for reconciliation against textual references. You can easily reconcile, for example, Paris as described in the introductory paragraph. As such, it may not be ideally suited for reconciling identifiers. Furthermore, it would require some additional logic.

The process here would look like this:

  1. Extract the ownerID from the Source data.
  2. Reconcile the ownerID against the Reference entity data, trying to match the comment field. It is a 1:1 match, so it should be straightforward. The result should be <https://data.com/person/U100333>.
  3. Do the same for creatorID.
  4. Add data that <https://data.com/bvz/collection/D001> has owner <https://data.com/person/U100333>  and creator <https://data.com/person/U100334>.

The same process could be applied to reconciling based on names. The Elasticsearch analyzer is interesting here – specifying a custom analyzer, tailor-made for names, could be useful when checking that “Hank, William” and “William Hank” are one and the same.
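As a sketch, assuming the index settings are under your control, an analyzer along these lines would lowercase tokens while the standard tokenizer drops the punctuation, so “Hank, William” and “William Hank” produce the same token set:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "person_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```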

Entity linking via inference

The GraphDB engine uses a forward-chaining reasoner that materializes triples on ingestion. Using this methodology, the incoming statements would create inferred statements within the reference system. The process is relatively straightforward and efficient and would require little to no custom code. Furthermore, there are other benefits, such as automatic retraction of assertions and the ability to enable or disable the link with a query modifier. There are two major caveats:

  1. This is useful only if the data is kept within the same repository.
  2. This is useful only for statement patterns. GraphDB works with the entity pool directly, assigning each IRI a unique value and then comparing those values, thus obtaining very fast inference. However, this means we cannot do value inference, where we perform a mathematical operation or regex check on the data.

A custom rule that implements this would be:

Custom inference rule

 
Id: collection_ownership

    collection <rdf:type> <https://data.com/bvz/DataCollection>
    collection <https://data.com/bvz/ownerID> ownerID
    person <rdf:type> <https://data.com/Person>
    person <https://schema.org/identifier> ownerID
    ------------------------------------
    collection <https://data.com/bvz/has_owner> person

You wouldn’t need custom rules for most use cases, though. GraphDB comes with multiple default rulesets and you would only need to provide the proper ontology. In this case, you can achieve this with a property chain axiom and an inverseOf property:

Toy RDF ontology, Turtle-formatted:

@prefix bvz: <https://data.com/bvz/> .
@prefix schema: <https://schema.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

bvz:has_owner owl:propertyChainAxiom bvz:ownershipChain .
bvz:ownershipChain rdf:first bvz:ownerID ;
    rdf:rest ( bvz:inverseID ) .
bvz:inverseID owl:inverseOf schema:identifier .

To explain what is happening: we declare that “has_owner” can be obtained by following a chain of properties. The first property of that chain, which we call the “ownershipChain”, would be the “ownerID”. The rest of that chain consists of only one property. It is the inverse of “identifier”. This corresponds to the following chain of links:

collection -> ownerID -> (inverse) identifier -> person

The combination of ownerID and inverse identifier is then compressed via owl:propertyChainAxiom into “has_owner”. This set of ontological triples can be used with the built-in OWL2-RL ruleset to generate the inferred knowledge.
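With this ontology and the data loaded into a repository using the OWL2-RL ruleset, the inferred link can be checked with a simple ASK, using the IRIs from our test case:

```sparql
PREFIX bvz: <https://data.com/bvz/>

ASK {
    <https://data.com/bvz/collection/D001>
        bvz:has_owner <https://data.com/person/U100333> .
}
```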

SPARQL entity linking

When both systems are SPARQL-enabled, it is possible to link entities via a query. This doesn’t necessarily mean that both the source and reference data have to be ingested within GraphDB or another RDF database. Data could be virtualized or RDF-ized only temporarily.

The main benefit of SPARQL entity linking is that the data doesn’t need to be ingested. Furthermore, SPARQL filters and clauses are much more expressive than inference rules or reconciliation queries. If data is ingested, even temporarily, you can also benefit from the power of Lucene indices, Elasticsearch, the Graph Path and Similarity plugins, and many others.

A custom SPARQL linker can be implemented as a SELECT, CONSTRUCT or INSERT query – any of the three works:

  • SELECT – if the link you want to form is temporary, e.g., “is this person from IAM also in the application database?”
  • CONSTRUCT – if the link is temporary, but you also want to process it with RDF tools, e.g., if you are consuming it from an RDF library.
  • INSERT – if the link you want to form is permanent, e.g., storing “owl:sameAs” or another relationship, like the “has_owner” example above.

Here is an example of SPARQL entity linking on “Name”:

SPARQL name linking
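The query itself is not reproduced here; a sketch of what such an INSERT might look like, assuming the schema.org name properties on the person entities, is:

```sparql
PREFIX bvz: <https://data.com/bvz/>
PREFIX schema: <https://schema.org/>

INSERT {
    ?collection bvz:has_owner ?person .
}
WHERE {
    ?collection a bvz:DataCollection ;
        bvz:owner ?ownerName .
    ?person schema:givenName ?givenName ;
        schema:familyName ?familyName .
    FILTER (?ownerName = CONCAT(?givenName, " ", ?familyName))
}
```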

Truth maintenance and Kafka producer entity linking

GraphDB also features a Kafka producer and consumer. Given that the reference entity data and/or source data is stored in GraphDB, the Kafka connector may produce messages on insertion and deletion. Here’s an example for creating the user William Hank and deleting the user Sandra Hellen.

Data deleted, data inserted

{
  "https://data.com/person/U100333": {
    "familyName": "Hank",
    "givenName": "William"
  },
  "https://data.com/person/U100334": null
}

This assumes a Kafka connector that listens for the type data:Person and indexes the fields “familyName” and “givenName”. Full entity filters are supported, as they are for all other connectors. This tooling can be used together with a Kafka consumer, like a Python script, which then fires a SPARQL query, just like the one described earlier, or obtains the missing data in any other way.
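A minimal sketch of that consumer-side logic in Python, translating a connector message into SPARQL updates, might look as follows (Kafka and HTTP plumbing omitted; the predicates are those of our toy model):

```python
import json

HAS_OWNER = "https://data.com/bvz/has_owner"
OWNER_ID = "https://data.com/bvz/ownerID"
IDENTIFIER = "https://schema.org/identifier"

def message_to_updates(value: str) -> list[str]:
    """Turn a connector message (entity IRI -> fields, or null on deletion)
    into a list of SPARQL update strings."""
    updates = []
    for iri, fields in json.loads(value).items():
        if fields is None:
            # Truth maintenance: the person was deleted, so drop its links.
            updates.append(f"DELETE WHERE {{ ?c <{HAS_OWNER}> <{iri}> }}")
        else:
            # Link every collection whose ownerID matches this person's identifier.
            updates.append(
                f"INSERT {{ ?c <{HAS_OWNER}> <{iri}> }} WHERE {{ "
                f"?c <{OWNER_ID}> ?id . <{iri}> <{IDENTIFIER}> ?id }}"
            )
    return updates
```

Each returned string would then be fired at the repository’s SPARQL endpoint.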

GraphDB 9.10 also introduced smart updates, which allow the consumer to fire a SPARQL INSERT or DELETE based on data consumed from Kafka.

Smart INSERT

This is combined with a Kafka message structured like so:

Kafka message

{
  "https://data.com/bvz/collection/D001": {
    "bvz:has_owner": "https://data.com/person/U100333"
  }
}

The disadvantage here is that we are forced to do the mapping from producer to consumer by hand, with some custom logic. However, we can make this easier: GraphDB also introduced advanced SPARQL templates. Assume, once again, that we are inserting the data specified at the beginning of this section. This can immediately kick off a further SPARQL INSERT:
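A hypothetical template with that effect (the variable names mirror the message fields; the exact template is illustrative) could be:

```sparql
PREFIX bvz: <https://data.com/bvz/>

INSERT {
    <https://data.com/bvz/collection/D001> bvz:has_owner ?id .
}
WHERE {
    FILTER (?givenName = "William" && ?familyName = "Hank")
}
```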

When the Kafka message comes in, it will substitute the ?givenName and ?familyName variables with the values from the message and the ?id variable with the title of the message. As a result, only William Hank will be identified as the owner of collection D001. Sandra Hellen would be rejected by the SPARQL filter.

One of the strong suits of this approach is that it can be used for truth maintenance. Since a deletion produces a message with a null value, you can use it to trigger a SPARQL DELETE that, for example, cleans the data from the has_owner_rule graph.

Summary and recommendations

In the end, all ways to do entity linking have advantages and disadvantages that depend on your use case.

  • Reconciliation-based linking is great for resolving text references to entities, but requires a little code to automate. The more complicated the source data you start with, the more code will be required to prepare it for automation. Ontotext Refine can help you with data cleaning.
  • Inference-based entity linking performs really well and comes with automatic truth maintenance, keeping the solution as native to GraphDB as possible. However, it depends on using only graph patterns for linking and on ingesting all data within the same repository.
  • SPARQL-based linking is versatile and relatively performant, but requires that both source and reference data are accessible via SPARQL. It also isn’t automatic, unlike inference.
  • A Kafka-based solution can be made more versatile than a pure SPARQL solution. However, it does require some custom logic and sacrifices some performance due to requiring an intermediate step.

Ultimately, which approach is best depends on your particular needs. At Ontotext, we have a lot of experience with all of these approaches and can help you thrive in this field.


Solution/System Architect at Ontotext

Radostin Nanov has a MEng in Computer Systems and Software Engineering from the University of York. He joined Ontotext in 2017 and progressed through many of the company's teams as a software engineer working on the Ontotext Cognitive Cloud, GraphDB and finally Ontotext Platform before settling into his current role as a Solution Architect in the Knowledge Graph Solutions team.
