Read our first post from this series about how to turn your disparate data into visualized knowledge, starting with a step-by-step guide for data ingestion…
Knowledge isn’t simply derived from raw data; it comes from linking datapoints. When data is properly linked, you can easily find other, related data. A datapoint would be “Paris is the capital of France”. The problem with this basic datapoint is that we don’t know which Paris we are referring to – is it Paris, Texas, Paris Hilton, or Paris in France?
Furthermore, what does “France” stand for? Is it the country or the Canadian river? And what, exactly, is a “capital”? Knowledge derived from that would be to know that the city of Paris is the capital city of the country of France. We can keep iterating over this, deriving more and more links from the basic building blocks of our data – of course, provided that it is linked.
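To make this concrete: in a linked data representation, the ambiguous strings are replaced with unambiguous identifiers. Using DBpedia URIs (chosen here purely for illustration), the datapoint becomes a statement between two well-defined resources:

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:France dbo:capital dbr:Paris .

Because dbr:Paris and dbr:France are identifiers rather than strings, they carry their own links to further facts, which is exactly where the knowledge comes from.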
Here’s an evolution of that same idea. Once we have a data graph, we can link it with other knowledge models. As illustrated in the diagram above, the fundamental difference between an arbitrary graph data structure and a knowledge graph is that the latter includes a knowledge model, which enables automated reasoning capabilities you may decide to take advantage of.
Turning data into knowledge requires an explicit knowledge model, an ontology, that can combine a conventional data schema with other types of topical or terminological knowledge: taxonomies, controlled vocabularies, domain models and business rules. In the rest of this post, we assume the presence of such a knowledge model and focus on linking data elements to such models and to each other.
Without data linking, you end up with separate data lakes that never come together to form a common graph. Many enterprises struggle at exactly this critical step: how to derive useful knowledge from their basic data.
And the first step to untangling the web of data is to identify the strings and turn them into proper identifiers. In mainstream enterprise data management, such identifiers belong to the so-called Reference data (“data used to classify or categorize other data”) and Master data (“data about the business entities that provide context for business transactions”). The distinction between these two flavors is irrelevant for many projects, so we refer to both simply as “Reference Entity Data”.
It is much easier to illustrate our point if we have a common frame of reference. So, let’s make up some data for our use case.
Source Data
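For illustration, here is one possible RDF rendering of the source data. The property names bvz:ownerName, bvz:creatorID and bvz:creatorName, as well as the plain-literal IDs, are assumptions made for this sketch:

@prefix bvz: <https://data.com/bvz/> .

<https://data.com/bvz/collection/D001> a bvz:DataCollection ;
    bvz:ownerID     "U100333" ;
    bvz:ownerName   "William Hank" ;
    bvz:creatorID   "U100334" ;
    bvz:creatorName "Sandra Hellen" .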
Reference Entity Data
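And one possible rendering of the reference entity data, reusing schema.org properties for the identifiers and names:

@prefix schema: <https://schema.org/> .

<https://data.com/person/U100333> a <https://data.com/Person> ;
    schema:identifier "U100333" ;
    schema:givenName  "William" ;
    schema:familyName "Hank" .

<https://data.com/person/U100334> a <https://data.com/Person> ;
    schema:identifier "U100334" ;
    schema:givenName  "Sandra" ;
    schema:familyName "Hellen" .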
We want to know that the entity in the Source Data, the “data collection”, is related to the entities in the Reference Entity Data. In this example, this corresponds to knowing that the data collection D001 has the owner William Hank and the creator Sandra Hellen, each identified either by name or by ID.
The out-of-the-box solution for linking entities in GraphDB is the reconciliation capability of Ontotext Refine. It is robust, configurable and text-based. Ontotext has developed a Reconciliation Service that allows you to leverage the GraphDB Elasticsearch (ES) connector and reconcile against your own data. With it, you can create ES indexes from your RDF repository, feed it some data and provide an ES query template written in Mustache. The add-on also generates a default query if you don’t want to customize it.
However, this service is intended for reconciliation against textual references. You can easily reconcile, for example, Paris as described in the introductory paragraph. As such, it may not be ideally suited for reconciling identifiers. Furthermore, it would require some additional logic to turn the reconciliation results into actual links.
The process here would look like this: reconcile the source data’s owner ID against the reference entity data to obtain <https://data.com/person/U100333>, and then write the statements that <https://data.com/bvz/collection/D001> has owner <https://data.com/person/U100333> and creator <https://data.com/person/U100334>.
The same process could be applied to reconciling based on names. The Elasticsearch analyzer is interesting here – specifying a custom analyzer, tailor-made for names, could be useful when checking that “Hank, William” and “William Hank” are one and the same.
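As a sketch of that idea, an Elasticsearch index backing the reconciliation could define a custom analyzer for person names. The settings below are purely illustrative, not the connector’s defaults:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "person_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

Since the analyzer reduces both “Hank, William” and “William Hank” to the same set of lowercase tokens, a match query against a field analyzed this way will treat them as the same name.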
The GraphDB engine uses a forward-chaining reasoner that materializes triples on ingestion. With this approach, the incoming statements would create inferred statements within the reference system. The process is relatively straightforward and efficient, and it requires little to no custom code. There are further benefits, such as automatic retraction of assertions and the ability to enable or disable the inferred links with a query modifier. There are two major caveats:
A custom rule that implements this would be:
Custom inference rule
Id: collection_ownership
    collection <rdf:type> <https://data.com/bvz/DataCollection>
    collection <https://data.com/bvz/ownerID> ownerID
    person <rdf:type> <https://data.com/Person>
    person <https://schema.org/identifier> ownerID
    ------------------------------------
    collection <https://data.com/bvz/has_owner> person
You wouldn’t need custom rules for most use cases, though. GraphDB comes with multiple default rulesets and you would only need to provide the proper ontology. In this case, you can achieve this with a property chain axiom and an inverseOf property:
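A minimal Turtle sketch of such an ontology fragment, using the prefixes from our example data, could look like this:

@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix bvz:    <https://data.com/bvz/> .
@prefix schema: <https://schema.org/> .

bvz:has_owner owl:propertyChainAxiom ( bvz:ownerID [ owl:inverseOf schema:identifier ] ) .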
To explain what is happening: we declare that “has_owner” can be obtained by following a chain of properties – call it the “ownership chain”. The first property of that chain is “ownerID”. The rest of the chain consists of only one property: the inverse of “identifier”. This corresponds to the following chain of links:
collection -> ownerID -> (inverse) identifier -> person
The combination of “ownerID” and the inverse of “identifier” is thus collapsed via owl:propertyChainAxiom into “has_owner”. This set of ontological triples can then be used with the built-in OWL2-RL ruleset to generate the inferred knowledge: in our example, the statement that collection D001 has the owner <https://data.com/person/U100333>.
When both systems are SPARQL-enabled, it is possible to link entities via a query. This doesn’t necessarily mean that both the source and reference data have to be ingested within GraphDB or another RDF database. Data could be virtualized or RDF-ized only temporarily.
The main benefit of SPARQL entity linking is that the data doesn’t need to be ingested. Furthermore, SPARQL filters and clauses allow much finer-grained matching than inference rules or reconciliation queries. If the data is ingested, even temporarily, you can also benefit from the power of Lucene indexes, Elasticsearch, the Graph Path and Similarity plugins, and many others.
A custom SPARQL linker can be implemented as a CONSTRUCT, INSERT or SELECT query – any of these formats works.
Here is an example of SPARQL entity linking on “Name”:
SPARQL name linking
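A minimal sketch of such a query, assuming the source data carries the owner’s name as a literal bvz:ownerName (an illustrative property) and the reference persons use the schema.org name properties, could be written as an INSERT:

PREFIX bvz:    <https://data.com/bvz/>
PREFIX schema: <https://schema.org/>

INSERT {
    ?collection bvz:has_owner ?person .
} WHERE {
    ?collection a bvz:DataCollection ;
                bvz:ownerName ?ownerName .
    ?person a <https://data.com/Person> ;
            schema:givenName  ?givenName ;
            schema:familyName ?familyName .
    FILTER (CONTAINS(?ownerName, ?givenName) && CONTAINS(?ownerName, ?familyName))
}

The same pattern works as a CONSTRUCT or SELECT if you prefer to review the candidate links before materializing them.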
GraphDB also features a Kafka producer and consumer. If the reference entity data and/or the source data is stored in GraphDB, the Kafka connector can produce messages on insertion and deletion. Here’s an example of the messages produced when creating the user William Hank and deleting the user Sandra Hellen.
Data deleted, data inserted
{ "https://data.com/person/U100333": { "familyName": "Hank", "givenName": "William" }, "https://data.com/person/U100334": null }
This assumes a Kafka connector that listens for the type data:Person and indexes the fields “familyName” and “givenName”. Full entity filters are supported, as they are for all other connectors. This tooling can be combined with a Kafka consumer, such as a Python script, which then fires a SPARQL query just like the one described earlier, or obtains the missing data in some other way.
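As a sketch of that setup, a small Python consumer could fire a name-linking update for every incoming person. The kafka-python and requests libraries, the topic name, the repository URL and the bvz:ownerName property are all assumptions made for this example:

import json
import requests
from kafka import KafkaConsumer

# Hypothetical GraphDB repository holding the source data (RDF4J-style endpoint).
UPDATE_ENDPOINT = "http://localhost:7200/repositories/source/statements"

consumer = KafkaConsumer(
    "persons",                              # hypothetical topic fed by the GraphDB Kafka connector
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    person_iri = message.key.decode("utf-8")
    person = message.value
    if person is None:
        continue                            # deletion message; handled separately
    # Link any collection whose owner name mentions both names to this person.
    update = f"""
        PREFIX bvz: <https://data.com/bvz/>
        INSERT {{
            ?collection bvz:has_owner <{person_iri}> .
        }} WHERE {{
            ?collection a bvz:DataCollection ;
                        bvz:ownerName ?ownerName .
            FILTER (CONTAINS(?ownerName, "{person['givenName']}") &&
                    CONTAINS(?ownerName, "{person['familyName']}"))
        }}
    """
    # Fire the SPARQL update against the repository's statements endpoint.
    requests.post(UPDATE_ENDPOINT, data={"update": update})

A production consumer would add error handling, batching and authentication, but the overall flow stays the same.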
GraphDB 9.10 also introduced smart updates, which allow you to fire a SPARQL INSERT or DELETE based on the data consumed from Kafka.
Smart INSERT
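For the Kafka message shown below, the net effect of such a smart update would be an insert along these lines, written out here as plain SPARQL for illustration:

PREFIX bvz: <https://data.com/bvz/>

INSERT DATA {
    <https://data.com/bvz/collection/D001>
        bvz:has_owner <https://data.com/person/U100333> .
}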
This is combined with a Kafka message structured like so:
Kafka message
{ "https://data.com/bvz/collection/D001": { "bvz:has_owner": "https://data.com/person/U100333" } }
The disadvantage here is that we have to do the mapping from producer to consumer by hand, with some custom logic. However, we can make this easier: GraphDB also introduced advanced SPARQL templates. Assume, once again, that we are inserting the data specified at the beginning of this section. This can immediately kick off a further SPARQL INSERT:
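A sketch of such a template, reusing the illustrative bvz:ownerName property and an assumed IRI for the has_owner_rule named graph, might look like this:

PREFIX bvz: <https://data.com/bvz/>

INSERT {
    GRAPH <https://data.com/graph/has_owner_rule> {
        ?collection bvz:has_owner ?id .
    }
} WHERE {
    ?collection a bvz:DataCollection ;
                bvz:ownerName ?ownerName .
    FILTER (CONTAINS(?ownerName, ?givenName) && CONTAINS(?ownerName, ?familyName))
}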
When the Kafka message comes in, it will substitute the ?givenName and ?familyName variables with the values from the message and the ?id variable with the title of the message. As a result, only William Hank will be identified as the owner of collection D001. Sandra Hellen would be rejected by the SPARQL filter.
One of the strong suits of this approach is that it can be used for truth maintenance. When a message contains NULL for a value, you can use it to trigger a SPARQL DELETE that, for example, cleans up the corresponding data in the has_owner_rule graph.
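Such a cleanup could be as simple as the following DELETE, where ?id is again substituted with the IRI of the removed person and the graph IRI remains our assumed one:

PREFIX bvz: <https://data.com/bvz/>

DELETE WHERE {
    GRAPH <https://data.com/graph/has_owner_rule> {
        ?collection bvz:has_owner ?id .
    }
}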
In the end, every approach to entity linking has advantages and disadvantages, and the best choice depends on your particular use case and needs. At Ontotext we have a lot of experience with all of these approaches and can help you thrive in this field.
Give GraphDB a try today!