Data quality is one of the core problems with any database. SHACL, a W3C standard, is the current industry leader for RDF data validation. It lets users define their constraints with relatively simple RDF statements. Unfortunately, the recommendation leaves many implementation details open, so every database on the market behaves a little differently.
The GraphDB SHACL engine works on materialized data. When triples are inserted, they follow a simple two-step process. First, they are used for inference. Then they, together with all statements inferred from them, undergo SHACL validation. And, yes, inserting a SHACL schema is treated the same as inserting triples, so it triggers a validation too.
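The two-step process can be sketched with a small example. This is only an illustration: it assumes a repository whose ruleset applies `rdfs:domain` inference (an RDFS-style ruleset, for instance), and every `ex:` name other than `ex:building` is hypothetical. Shapes and data are shown in one listing for brevity, although in practice they would be loaded separately.

```turtle
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .

# Schema already in the repository (hypothetical):
# anything that has an ex:building is an ex:Person.
ex:building rdfs:domain ex:Person .

# Shape already loaded (hypothetical):
# every ex:Person needs at least one ex:name.
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
    ] .

# The update: a single explicit triple.
ex:bob ex:building ex:house1 .

# Step 1 (inference) would add:  ex:bob a ex:Person .
# Step 2 (validation) would then check ex:bob against ex:PersonShape
# and report a violation, because ex:bob has no ex:name.
```

Under these assumptions the explicit triple never mentions `ex:Person`, yet the update would still be rejected, because validation sees the inferred statement as well.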
The SHACL validation step works by comparing the data to the SHACL model. The contents of the update serve as the initial set to be tested. However, this isn’t always enough. For example, consider the following constraint:

    sh:property [
        sh:path ex:building ;
        # Check that the range of ex:building conforms to some SHACL shape.
        sh:node ex:BuildingShape ;
    ] .

This constraint requires every object of the `ex:building` predicate to conform to `ex:BuildingShape`, that is, to be a valid Building. But what if the update only inserts a single person who lives somewhere?
The inserted triples alone aren’t enough to validate the constraint, because the description of the building is already stored in the repository and is not part of the update. To handle this, the SHACL engine pulls the relevant data, and only that data, into the “validation context” and performs all necessary checks.
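A concrete version of this scenario might look like the sketch below. Only `ex:building` and `ex:BuildingShape` come from the constraint above. The remaining names (`ex:PersonShape`, `ex:address`, `ex:alice`, `ex:house1`) are hypothetical and chosen purely for illustration.

```turtle
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Shapes: a person's building must conform to ex:BuildingShape,
# and a valid building must have at least one ex:address.
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:building ;
        sh:node ex:BuildingShape ;
    ] .

ex:BuildingShape a sh:NodeShape ;
    sh:property [
        sh:path ex:address ;
        sh:minCount 1 ;
    ] .

# Already stored in the repository before the update.
ex:house1 a ex:Building ;
    ex:address "1 Example Street" .

# The update itself: only the person is inserted.
ex:alice a ex:Person ;
    ex:building ex:house1 .
```

To decide whether `ex:alice` is valid, the engine needs the `ex:address` triple of `ex:house1`, which is not in the update. In this sketch, that triple is what would be pulled into the validation context, while unrelated people and buildings stay out of it.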
This means that the SHACL engine is incremental. By default, every update is validated, and the engine never runs a validation over the whole dataset on its own. If you want to validate the whole dataset, or to temporarily disable validation, you can use our API to do it.
As performance is the main concern with SHACL, the engine uses a highly customized syntax tree. This means that you cannot have arbitrary SPARQL rules. However, we strongly believe that most SHACL constraints that are relevant for a production-grade environment can be expressed with the default constraints. For the few that cannot, we offer the RDF4J SHACL Extensions (RSX) and the DASH Data Shapes.
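As a rough illustration of what the extensions add, the sketch below uses `rsx:targetShape` and `dash:hasValueIn`, two constructs described in the RDF4J SHACL documentation. Whether they are available, and whether they have to be switched on in the repository configuration, depends on your GraphDB version, so treat this as an assumption to verify. The `ex:` names are hypothetical.

```turtle
@prefix ex:   <http://example.org/> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dash: <http://datashapes.org/dash#> .
@prefix rsx:  <http://rdf4j.org/shacl-extensions#> .

# RSX: target every node that conforms to an inline shape,
# here "anything that has at least one ex:building",
# and require those nodes to have a name.
ex:ResidentShape a sh:NodeShape ;
    rsx:targetShape [
        sh:path ex:building ;
        sh:minCount 1 ;
    ] ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
    ] .

# DASH: at least one value of ex:status must be one of the listed values.
ex:BuildingStatusShape a sh:NodeShape ;
    sh:targetClass ex:Building ;
    sh:property [
        sh:path ex:status ;
        dash:hasValueIn ( ex:Occupied ex:Vacant ) ;
    ] .
```

Both shapes express conditions that would otherwise call for a custom SPARQL constraint, which is the gap the extensions are meant to cover.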
Because of all of these factors, you won’t be able to validate an arbitrary file without ingesting it into your database. GraphDB SHACL is tied to the data update and depends on a customized syntax tree, so the data has to be inserted before it can be checked. Performance is also the reason why we’ve not yet implemented “warning”-level SHACL violations. We don’t want to ingest invalid data, as this may lead to hidden problems with your future inserts.
Keep in mind that the SHACL implementation is a work in progress that has taken a tremendous leap since it was first introduced into GraphDB. Furthermore, the RSX and DASH extensions are both moving forward, so if you come back in a few years, or even months, you may see a vastly improved SHACL implementation.