Power to the Filters! GraphDB Introduces Improvements to the Connectors in its 10th Edition.

The GraphDB connectors in a new age - new capabilities, best practices and refactoring.

December 2, 2022 · 11 min read · Radostin Nanov

A feature, even if well developed, supported and used, eventually gets deprecated. It is the circle of software life. Fortunately, the best-loved features get redesigned. This has been the fate of GraphDB's much-lauded connectors.

Let’s take a look at the connectors of GraphDB and see where they stand now and how much ground they covered in their small leap forward.

The Status Quo

The connectors have long been one of the flagship features of GraphDB. They offer a powerful and performant way of synchronizing RDF data into non-RDF stores. The original trinity of connectors – Lucene, Solr and Elasticsearch – is suited for full-text search, faceted search and aggregations. More recently, there was a newcomer to the group: the Kafka connector, which serializes incoming triples as JSON messages.

All the connectors are based on the same ingestion mechanism – when a document gets changed, this is immediately reflected in the connected secondary stores. A “document” in this context is a logical collection of triples, and the change can be as small as a single triple. The incremental design means low latency for update synchronization and minimal need for third-party synchronization solutions. Connectors are specialized plugins that offer an update and query mechanism; they index RDF data in secondary stores. There are many other GraphDB plugins, including the MongoDB Connector. However, MongoDB only offers a query mechanism, which sets it apart from the other connectors.

Obviously, not all RDF data should be stored in secondary indexes – they are best kept to their specialized use cases. So, ever since their inception, the connectors of GraphDB have offered a way to select only the relevant data from all the triples in the database. This is called “filtering”.

In GraphDB 9, there was a single entity filter, mostly concerned with the specific value of a given field. Its capabilities included:

  • Comparisons
  • Boolean logic
  • Set membership
  • Regular expressions

Beyond simple filtering, you could:

  • Filter by the previous element in the chain. To give an example, if you have a “child” field that contains data about a child, you can go up to the “parent” and check if that parent has a specific value – and not index the “child” if the filter fails.
  • Access additional elements that are not indexed. For example, you can have a “child” field and a non-indexed property, `example:height`. Then, using the construct `?child -> example:height < 100`, you could filter by height. This is limited to one predicate, with no property chains.
  • Filtering by graph.
  • Filtering by language tag – for example, filter only the values in English and Bulgarian.
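Put together, a GraphDB 9 entity filter could combine several of these capabilities in one expression. The sketch below reuses only constructs shown elsewhere in this post; the fields and IRIs are illustrative, and the exact syntax of the old entity filter may differ:

```sparql
# Set membership on an indexed field, combined with boolean logic:
?type in (<urn:Person>, <urn:Organization>) and bound(?name)

# Reaching a non-indexed element one predicate away from a field:
?child -> example:height < 100

# Graph filtering, e.g. excluding inferred statements:
graph(?name) not in (<http://www.ontotext.com/implicit>)
```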

Power! Less Limited Power!

GraphDB 10 brings a number of improvements, giving the filters more flexibility and further capabilities. This comes with a small downside: the filters are slightly more involved to configure, and you have to migrate any connectors built in GraphDB 9. The tradeoffs are more than worth it.

Splitting the Entity Filter

The reason you need to migrate connectors to GraphDB 10 is a large change to the entity filter: it has been split into four parts. Part of the reason for this split is that the singular entity filter in GraphDB 9 was sometimes unclear in its functionality. Sometimes a filter would remove the whole document; other times, it would affect only a specific field. There are, essentially, two types of filters, applied at two levels.

  • The value filters – filtering a specific value. This can be done at a specific field level, or at the top level. If this happens at the top level, the entire document is removed. If the value filter is applied at the field level, only this specific field value will be removed.
    Notably, the value filter applied at the top level is applied before any fields are generated. Therefore, all that you can filter against is the root object, denoted by $this. This filter allows you to fail fast – if an object is obviously not interesting to us, we can remove it immediately, before we have spent any computational resources on processing it.
  • Document filters – filtering the whole document. This can be done at the top level, rejecting the whole document, or per nested document, rejecting nested documents. Those filters are applied last, after all fields have been computed and can access data from all of the field values.
    Within a nested document, all fields are considered in the context of that nested document. For example, the field “parent.child.name” is addressed simply as “name” from the context of the nested “child” document.
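In connector configuration terms, the two filter types map to valueFilter and documentFilter keys, which can appear at the top level or on individual fields. The sketch below uses the keys from the reference configuration at the end of this post; the field and the filter expressions themselves are illustrative assumptions:

```json
{
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": ["http://xmlns.com/foaf/0.1/name"],
      "valueFilter": "$this != \"N/A\""
    }
  ],
  "valueFilter": "$this -> <urn:isCompliant> not in (\"true\"^^xsd:boolean)",
  "documentFilter": "bound(?name)"
}
```

The top-level valueFilter sees only $this (the root object) and fails fast; the field-level valueFilter drops individual values; the documentFilter runs last, after all fields are computed.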

Two-Variable Filtering

Very often, you run into situations where you want to compare two fields. Assume a financial compliance scenario: for tax audit reasons, you want to filter people whose ?netIncome < ?netExpenditure. This is applied as a top-level document filter. Only people with a dubious financial balance would be further evaluated for tax evasion.

Previously, in GraphDB 9, this was technically possible by treating the second variable as an additional element beyond the chain of the root field. However, this capability was beholden to the simple path restriction – you can't follow multiple steps or apply alternate paths.

For example, parent(?netIncome) -> urn:netExpenditure > $this would work. However, parent(?netIncome) -> (urn:expenditure | urn:net) > $this would not! In GraphDB 10, this is as easy as declaring $this < ?netExpenditure at the level of the ?netIncome field.
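Sketched as a field definition, using the JSON keys from the reference configuration at the end of the post (the urn:netIncome IRI is illustrative, and placing the expression in the field's valueFilter is an assumption):

```json
{
  "fieldName": "netIncome",
  "propertyChain": ["urn:netIncome"],
  "valueFilter": "$this < ?netExpenditure"
}
```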

Dependencies are resolved smartly by reordering the fields. To continue our financial example, consider a value filter on the expensivePurchases field: we only want to index purchases greater than monthlyIncome. This means that all filters on monthlyIncome will be applied first.

Circular dependencies are accounted for – if you define a circular dependency, an error will be thrown informing you that you have defined an invalid filter. Of course, this applies to multi-step dependencies!

New Filter Capabilities

The new version isn’t only about making filters more flexible and predictable. There are other changes as well.

First, there is now a direct function isExplicit(?field). This is shorthand for the previous approach, graph(?field) not in (<http://www.ontotext.com/implicit>). The new construct is shorter and easier to understand.
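Side by side:

```sparql
# GraphDB 9: check the statement's graph explicitly
graph(?field) not in (<http://www.ontotext.com/implicit>)

# GraphDB 10: equivalent shorthand
isExplicit(?field)
```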

The ALL() quantifier. By default, a document passes a filter if at least one of its values matches the declared condition. For example, with the filter ?nationality = "German", a document passes even if the nationality is both German and British. Using ALL(?nationality) = "German" makes the filter stricter: a document passes only if every nationality value matches. Previously, this couldn't really be achieved.
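Assuming a nationality literal such as "German", the two variants look like this:

```sparql
# Passes if at least one value matches (default semantics):
?nationality = "German"

# Passes only if every value matches:
ALL(?nationality) = "German"
```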

Putting it All Together

Those are the new capabilities of the connectors in GraphDB 10. But what more can we do compared to GraphDB 9? To illustrate, let's build a small example.

Fiscal Compliance

Suppose we have a simple RDF database of people and their purchases. We want to present a view of their activities, including their purchases’ time and location. This would require faceted searches. Our experts are skilled in working with Kibana, and it offers ready tooling for time series analysis and mapping. Sounds like a good fit for the Elasticsearch connector.

To start with, we have two basic customers in our database, Dudley and Snidely. They are both instances of the foaf:Person class. This calls for a type configuration in our new connector.

Dudley has already been audited. This is reflected in the database and, therefore, he has nothing to worry about. Snidely, however, is yet to be checked. Dudley has done right and can be filtered out at step one by our top-level value filter.

Note that you could usually do this with != rather than not in. However, since isCompliant is simply not bound for Snidely, we would need to check for non-equality or unboundness. Here, not in serves as a convenient shorthand for both.
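This is the top-level value filter as it appears in the reference configuration at the end of the post:

```sparql
$this -> <http://example.org/compliance/isCompliant> not in ("true"^^xsd:boolean)
```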

Now, we are interested in some basic information for all of the (not yet compliant!) people in our database such as their name and their monthly and yearly income. We would store all those values, without filtering.

Notice how we use the full IRI for the property chain attribute. The connectors are not namespace-aware, so you have to use full IRIs. We do the same for income and monthly income.
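For example, the name field from the reference configuration, with the full foaf IRI spelled out:

```json
{
  "fieldName": "name",
  "propertyChain": ["http://xmlns.com/foaf/0.1/name"],
  "multivalued": false
}
```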

Now, for the key part of the exercise, the listing of all suspicious spendings. Each person has numerous purchases and other expenses. They are all RDF objects associated with a specific person. Each of these objects contains a price. Proper purchases also contain a location and time. The structure only has two layers, but that’s still quite enough to complicate things. However, nested objects can’t be declared via the UI. So, we would switch to using JSON embedded in a SPARQL connector creation request.

To begin with, nested expense records can be obtained as compliance:expense. This field needs the native:nested datatype (which corresponds to Elasticsearch's nested field type).

We would also apply a value filter: we don't want to index instances of compliance:GovernmentTax. This value filter applies to the nested purchase documents. We could handle this check as a document filter instead. However, the value filter is evaluated first and rejects a nested document before its fields are computed, so putting the check in a value filter results in better performance.
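The relevant fragment of the suspiciousExpense field from the reference configuration:

```json
{
  "fieldName": "suspiciousExpense",
  "propertyChain": ["http://example.org/compliance/expense"],
  "datatype": "native:nested",
  "valueFilter": "$this -> type != <http://example.org/compliance/GovernmentTax>"
}
```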

When declaring the nested fields, note that the location field is a geo point, since Elasticsearch supports this datatype out of the box.

Finally, once we have prepared the whole nested object, we would combine a few of our new features. We want to apply a document filter to the nested purchase object that filters out small and old purchases. If a purchase doesn't cost more than a monthly income, filter it out – this involves the root-level field monthlyIncome, using the two-variable filtering capability. Also, filter out purchases made in 2021 or earlier, but do not perform the date check if no date is given.
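The corresponding document filter on the nested suspiciousExpense field, as it appears in the reference configuration (shown here unescaped; in the reference it carries extra backslashes because the JSON is embedded in a SPARQL string literal):

```sparql
?amount > $outer.monthlyIncome and (?date >= "2022-01-01"^^xsd:date || !bound(?date))
```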


Note that we use the $outer keyword to access the scope of the root level. You can chain $outer keywords if you have deeply nested objects.

In the end, we have a clean index containing all of Snidely’s suspicious purchases and, where applicable, their times and locations.

Reference Data

One major feature of experiments is repeatability. If you want to follow along with our example, you can use our sample data and connector creation commands.

If we were to ask Elasticsearch about the contents of the compliance index, we would get the following JSON data:

{
    "income": "12000",
    "name": "Snidely",
    "suspiciousExpense": [
        {
            "date": "2022-02-22",
            "amount": "5000",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase1"
        },
        {
            "amount": "1200",
            "id": "http://example.org/compliance/expense1"
        }
    ],
    "expense": [
        {
            "date": "2022-02-22",
            "amount": "5000",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase1"
        },
        {
            "date": "2021-02-22",
            "amount": "4500",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase2"
        },
        {
            "date": "2022-03-22",
            "amount": "20",
            "location": "Point (52.9259503034234 -82.42871206672606)",
            "id": "http://example.org/compliance/purchase3"
        },
        {
            "amount": "1200",
            "id": "http://example.org/compliance/expense1"
        },
        {
            "amount": "7600",
            "id": "http://example.org/compliance/expense2"
        }
    ],
    "monthlyIncome": "1000"
}

And the sample connector creation request:

PREFIX :<http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst:<http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
    inst:compliance :createConnector '''
{
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": [
        "http://xmlns.com/foaf/0.1/name"
      ],
      "multivalued": false
    },
    {
      "fieldName": "income",
      "propertyChain": [
        "http://example.org/compliance/income"
      ],
      "multivalued": false
    },
    {
      "fieldName": "monthlyIncome",
      "propertyChain": [
        "http://example.org/compliance/monthlyIncome"
      ],
      "multivalued": false
    },
    {
      "fieldName": "suspiciousExpense",
      "propertyChain": [
        "http://example.org/compliance/expense"
      ],
      "datatype": "native:nested",
      "objectFields": [
        {
          "fieldName": "id",
          "propertyChain": [
            "$self"
          ]
        },
        {
          "fieldName": "amount",
          "propertyChain": [
            "http://example.org/compliance/amount"
          ]
        },
        {
          "fieldName": "date",
          "propertyChain": [
            "http://example.org/compliance/date"
          ]
        },
        {
          "fieldName": "location",
          "propertyChain": [
            "http://example.org/compliance/location"
          ],
         "datatype": "native:geo_point"
        }
      ],
      "documentFilter": "?amount > $outer.monthlyIncome and (?date >= \\\"2022-01-01\\\"^^xsd:date || !bound(?date))",
      "valueFilter": "$this -> type != <http://example.org/compliance/GovernmentTax>"
    },
    {
      "fieldName": "expense",
      "propertyChain": [
        "http://example.org/compliance/expense"
      ],
      "datatype": "native:nested",
      "objectFields": [
        {
          "fieldName": "id",
          "propertyChain": [
            "$self"
          ]
        },
        {
          "fieldName": "amount",
          "propertyChain": [
            "http://example.org/compliance/amount"
          ]
        },
        {
          "fieldName": "date",
          "propertyChain": [
            "http://example.org/compliance/date"
          ]
        },
        {
          "fieldName": "location",
          "propertyChain": [
            "http://example.org/compliance/location"
          ],
          "datatype": "native:geo_point"
        }
      ]
    }
  ],
  "languages": [],
  "types": [
    "http://xmlns.com/foaf/0.1/Person"
  ],
  "valueFilter": "$this -> <http://example.org/compliance/isCompliant> not in (\\\"true\\\"^^xsd:boolean)",
  "readonly": false,
  "detectFields": false,
  "importGraph": false,
  "skipInitialIndexing": false,
  "elasticsearchNode": "http://localhost:9200",
  "elasticsearchClusterSniff": true,
  "manageIndex": true,
  "manageMapping": true,
  "bulkUpdateBatchSize": 5000,
  "bulkUpdateRequestSize": 5242880
}
''' .
}

What’s Next?

Now that we have some data in Elasticsearch, how do we visualize it and integrate it with our other systems? That's beyond the scope of this blog post, unfortunately. However, we already have blog posts on creating knowledge graphs, including visualizations. Or, perhaps, you would like to view our Elasticsearch-based demonstrator, the Transparency Energy Knowledge Graph? If you are keener on learning by doing, you can start out with Lucene, which is packaged with every edition of GraphDB, including Free and Standard. If Elasticsearch, Solr or Kafka are your target secondary indexes, get in touch – trial and educational licenses are also available.


Solution/System Architect at Ontotext

Radostin Nanov has a MEng in Computer Systems and Software Engineering from the University of York. He joined Ontotext in 2017 and progressed through many of the company's teams as a software engineer working on the Ontotext Cognitive Cloud, GraphDB and finally Ontotext Platform before settling into his current role as a Solution Architect in the Knowledge Graph Solutions team.
