The Shapes Constraint Language (SHACL) is a widely supported W3C standard that lets us describe conditions that a dataset must meet. As with RDFS schemas and OWL ontologies, SHACL shapes are metadata about datasets, but this metadata serves a different purpose: to help validate the data instead of enabling inferencing.
Because RDFS schemas and OWL ontologies describe a dataset’s structure by listing classes, properties, and their relationships, many people have thought that these were describing a schema the same way the schema of a relational database or an object-oriented system does — by describing constraints that data must conform to if it will be used in that dataset.
But they do not. RDFS schemas and OWL ontologies describe these structures to enable inferencing. For example, if an RDFS schema says that the familyName property has a domain of Person, then we can infer that any resource with a familyName property is an instance of the class Person. Neither RDFS nor OWL gives you a way to say that familyName is a required property of Person, so that an instance of Person that lacks a familyName value is an invalid instance.
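The inference described above can be sketched in a few lines of Turtle. The ex: namespace and the resource ex:jane are assumptions made up for this illustration.

```turtle
@prefix ex:   <http://example.com/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Schema: the domain of ex:familyName is ex:Person.
ex:familyName rdfs:domain ex:Person .

# Data: ex:jane has a family name but no explicit type.
ex:jane ex:familyName "Doe" .

# With RDFS inferencing enabled, a reasoner can add this triple:
#   ex:jane a ex:Person .
```

Nothing in this schema flags an ex:Person that has no ex:familyName value at all; the domain statement only adds information, it never rejects data.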
This is where SHACL comes in. If you want to say that familyName is a required property for the Person class, or that a rating value must be an integer value between 1 and 5, then SHACL lets you do this. This makes SHACL invaluable for ensuring data quality when you manage your RDF knowledge graph using a tool that supports SHACL, such as GraphDB.
As with OWL ontologies and RDFS schemas, we describe SHACL constraints using the RDF statements known as triples. Just as we use triples to describe the details of instances of an Employee class or a Product class, with SHACL we describe shapes that each class’s instances must conform to. We’re going to look at some simple shapes and how they can be used to identify invalid data, and then we’ll learn more about the possibilities of what SHACL shapes can do.
We’ll start with some prefix declarations and a simple RDFS data model for some restaurant reviews. These reviews can have descriptions and ratings. We’d like the ratings to be required values stored as integers between 1 and 5, but there is no straightforward way to specify these conditions in RDFS or OWL.
```turtle
@prefix ex:   <http://example.com/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:Review a rdfs:Class .

ex:description rdf:type rdf:Property ;
    rdfs:domain ex:Review .

ex:rating rdf:type rdf:Property ;
    rdfs:domain ex:Review .
```
SHACL shapes make it easy to express these constraints. The first shape below is a node shape for our ex:Review class. It includes a property named sh:property whose value points to a property shape called ex:ratingShape. That property shape definition has a very simple sh:path value to show that it relates to the ex:rating property defined in the model above. (SHACL also lets you use multi-step paths, alternative paths, and boolean operators to specify much more sophisticated relationships between shapes and the values that they reference.) A shape can have multiple sh:property values. The ex:ratingShape property shape has some constraints: the ex:rating property’s value must be an integer, and that integer must be at least 1 but no more than 5. The shape’s final two constraints specify that the number of ex:rating values must be at least 1 (which is how you specify that a property is required) but no more than 1.
```turtle
ex:ReviewShape a sh:NodeShape ;
    sh:targetClass ex:Review ;
    sh:property ex:ratingShape .

ex:ratingShape a sh:PropertyShape ;
    sh:path ex:rating ;
    sh:datatype xsd:integer ;
    sh:minInclusive 1 ;
    sh:maxInclusive 5 ;
    sh:minCount 1 ;
    sh:maxCount 1 .
```
As we’ll see, you can do much, much more with node shapes, property shapes, and other features of SHACL, but the example above shows that with very little SHACL you can implement useful data quality checks.
The following sample data has four reviews: one with a rating of 5, which doesn’t violate any constraints, and then three reviews that each violate a constraint. One of those has a decimal number as its rating value, one has a value that is greater than 5, and the last ex:Review doesn’t even have an ex:rating value.
```turtle
@prefix ex: <http://example.com/ns#> .

ex:r1 a ex:Review ;
    ex:description "This restaurant was great." ;
    ex:rating 5 .

ex:r2 a ex:Review ;
    ex:description "This place was terrible." ;
    ex:rating 2.71828 .

ex:r3 a ex:Review ;
    ex:description "Best restaurant ever." ;
    ex:rating 6 .

ex:r4 a ex:Review ;
    ex:description "Wish I could give it zero stars." .
```
Any tool that supports SHACL could read the schema, constraints, and data above and then alert you to the three invalid or missing ex:rating values with messages about what constraint each violated. (You could run this validation yourself with the example above in the GraphDB Workbench by following the steps on the SHACL Validation documentation page.)
Because the constraint violation alerts are themselves RDF triples, it’s easier to include this validation step as part of an RDF pipeline during application development. For example, you could store these alerts in an RDF repository. Then, as an analysis step toward improving data quality, you could run SPARQL queries on this data to learn which constraints were violated how often and with what invalid values.
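As a rough illustration of what those alert triples look like, here is a sketch of a validation report entry for the review with a rating of 6. The property and class names come from the SHACL vocabulary, but the exact output (blank node layout, optional messages, ordering) varies by tool, so treat this as an assumed shape of the data rather than the literal output of any particular product. A full report for the sample data would contain three such results.

```turtle
@prefix ex: <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:focusNode ex:r3 ;                 # the resource that failed
        sh:resultPath ex:rating ;            # the property that was checked
        sh:value 6 ;                         # the offending value
        sh:resultSeverity sh:Violation ;
        sh:sourceConstraintComponent sh:MaxInclusiveConstraintComponent ;
        sh:sourceShape ex:ratingShape
    ] .
```

Because each result names its focus node, source shape, and constraint component, a SPARQL query that groups results by sh:sourceConstraintComponent is one way to count how often each kind of constraint was violated.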
Besides sh:targetClass, there are other ways to associate shapes with targets to validate:
With sh:targetSubjectsOf, you can target all instances that have a value for a particular property. For example, you could define a constraint for all instances that have a hireDate value but not for the ones that don’t.
sh:targetObjectsOf points a shape at all instances that are used as the value for a particular property. For example, if a company has 20 different departments but only 16 of them appear as values of the ex:employedBy predicate, you could use this to target those 16.
To specify more complex criteria for which instances a shape should point to, a SHACL-SPARQL target lets you write a SPARQL query that defines this set of instances.
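The three targeting options above could look like the following sketch. The shape names and the ex:departmentName and ex:terminationDate properties are hypothetical, invented only for this illustration. The SPARQL-based target uses sh:SPARQLTarget from the SHACL Advanced Features note, which not every SHACL tool supports, so check your tool's documentation before relying on it.

```turtle
@prefix ex:  <http://example.com/ns#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Applies to every resource that is the subject of an ex:hireDate triple.
ex:HiredShape a sh:NodeShape ;
    sh:targetSubjectsOf ex:hireDate ;
    sh:property [
        sh:path ex:hireDate ;
        sh:datatype xsd:date ;
        sh:maxCount 1
    ] .

# Applies to every resource that is the object of an ex:employedBy triple.
ex:EmployingDepartmentShape a sh:NodeShape ;
    sh:targetObjectsOf ex:employedBy ;
    sh:property [
        sh:path ex:departmentName ;   # hypothetical property
        sh:minCount 1
    ] .

# SPARQL-based target: resources with a hire date and no termination date.
ex:CurrentStaffShape a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:select """
            SELECT ?this
            WHERE {
                ?this <http://example.com/ns#hireDate> ?date .
                FILTER NOT EXISTS {
                    ?this <http://example.com/ns#terminationDate> ?end
                }
            }
        """
    ] ;
    sh:property [
        sh:path ex:employedBy ;
        sh:minCount 1
    ] .
```

The query uses full IRIs so that the sketch does not depend on SHACL's separate prefix declaration mechanism.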
To help you model which parts of a data model a target applies to, SHACL can work with inference and class hierarchies so that constraints applied to a particular class will also apply to its subclasses. For example, if Employee is a subclass of Person, a constraint that an instance of the Person class must have a familyName value would also apply to instances of the Employee class.
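A minimal sketch of that subclass behavior, assuming the hypothetical names ex:PersonShape and ex:e1:

```turtle
@prefix ex:   <http://example.com/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .

ex:Employee rdfs:subClassOf ex:Person .

# The shape targets ex:Person only.
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:familyName ;
        sh:minCount 1
    ] .

# Typed only as ex:Employee, but still checked against ex:PersonShape
# because of the rdfs:subClassOf triple; it fails the sh:minCount check.
ex:e1 a ex:Employee ;
    ex:givenName "Ana" .
```

SHACL class targets follow rdfs:subClassOf triples in the data being validated, so the subclass statement has to be visible to the validator for this to work.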
Our restaurant review example above demonstrated a type constraint to check that the rating value was an integer and a range constraint to check that this value was between 1 and 5. Type constraints are also great for checking that date and boolean values are properly represented, which can prevent some classic data processing errors further down the pipeline. Other SHACL constraint types let you check string values for specific lengths or language tags. You can also check that a string conforms to a particular pattern that you specify with a regular expression (for example, that a book’s ISBN has just the right sequence of hyphens and numeric digits).
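The string and type checks described above might be written as follows. The ex:Book class and its properties are hypothetical, and the regular expression is a deliberately simplified ISBN-13 pattern for illustration, not a complete ISBN validator (it does not verify the check digit, for instance).

```turtle
@prefix ex:  <http://example.com/ns#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:BookShape a sh:NodeShape ;
    sh:targetClass ex:Book ;
    # Hyphenated ISBN-13: 13 digits plus 4 hyphens is 17 characters.
    sh:property [
        sh:path ex:isbn ;
        sh:datatype xsd:string ;
        sh:minLength 17 ;
        sh:maxLength 17 ;
        sh:pattern "^97[89]-[0-9]+-[0-9]+-[0-9]+-[0-9]$"
    ] ;
    # Titles must carry an English or German language tag,
    # with at most one title per language.
    sh:property [
        sh:path ex:title ;
        sh:languageIn ( "en" "de" ) ;
        sh:uniqueLang true
    ] ;
    # Publication dates must be typed as xsd:date.
    sh:property [
        sh:path ex:publicationDate ;
        sh:datatype xsd:date
    ] .
```

With this shape, a value such as "978-0-306-40615-7" passes the ex:isbn checks, while the same digits without hyphens fail both the length and the pattern constraints.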
Property pair constraints let you check the relationship between two different properties for a given subject. You can check that two values are different (for example, that the givenName value for an Employee instance is different from the value of another property, such as its familyName), or that they are the same. You can check that one value is less than another (for example, that a stockItem instance has a wholesaleCost property value less than its retailCost value) or less than or equal to the other value.
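As a sketch of the property pair constraints just described, assuming hypothetical shape names and an ex:StockItem class name for the stock item example:

```turtle
@prefix ex: <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Every wholesale cost must be strictly less than every retail cost
# recorded for the same item.
ex:StockItemShape a sh:NodeShape ;
    sh:targetClass ex:StockItem ;
    sh:property [
        sh:path ex:wholesaleCost ;
        sh:lessThan ex:retailCost
    ] .

# An employee's given name must not also appear as a family name value.
ex:EmployeeNameShape a sh:NodeShape ;
    sh:targetClass ex:Employee ;
    sh:property [
        sh:path ex:givenName ;
        sh:disjoint ex:familyName
    ] .
```

The other two property pair constraints follow the same pattern: sh:equals requires the two value sets to match, and sh:lessThanOrEquals relaxes sh:lessThan to allow equal values.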
Logical constraints let you use boolean operators to combine other constraints into more complex combinations. For example, you could specify that a given value is valid if it passes either tests A and B or tests C and D.
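One way to express the "A and B, or C and D" idea is sketched below with sh:or. The ex:Contact class and ex:identifier property are hypothetical. Constraints listed together inside one shape must all hold, so each bracketed shape acts as an "and" group; sh:and can make that explicit when the groups refer to named shapes.

```turtle
@prefix ex:  <http://example.com/ns#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:ContactShape a sh:NodeShape ;
    sh:targetClass ex:Contact ;
    sh:property [
        sh:path ex:identifier ;
        sh:or (
            # Tests A and B: an integer that is at least 1.
            [ sh:datatype xsd:integer ; sh:minInclusive 1 ]
            # Tests C and D: a string of at least three characters.
            [ sh:datatype xsd:string ; sh:minLength 3 ]
        )
    ] .
```

SHACL also defines sh:not and sh:xone (exactly one of the listed shapes must match) for cases that plain "and" and "or" do not cover.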
Another useful property to assign to shapes is sh:deactivated, which essentially turns a shape off. If you are validating some newly imported data and getting alerts about hundreds of invalid values, temporarily deactivating some of your shapes makes it easier to focus on certain types of errors before you do the final cleanup of the dataset.
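Deactivating a shape takes a single triple. This sketch assumes the ex:ratingShape property shape from the restaurant review example:

```turtle
@prefix ex: <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Switches the rating checks off without deleting the shape definition.
ex:ratingShape sh:deactivated true .
```

Removing the triple, or setting its value to false, turns the shape back on.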
One more nice property for managing the relationship between shapes is sh:severity, which you can set to sh:Info, sh:Warning, or sh:Violation. The default value is sh:Violation, which you will see in your SHACL error report data even if you don’t set a sh:severity value for any shape. Setting it to one of the other two values will lead to your data being verified as conformant while still adding that severity value to the output report data about violations of those shapes, where it can be helpful in SPARQL-driven summary reports that you create from your SHACL validation output data.
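A sketch of a lower-severity shape, assuming a hypothetical ex:descriptionShape added to the restaurant review example so that a missing description is reported as a warning:

```turtle
@prefix ex: <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# Hypothetical shape: a review should have a description,
# but a missing one is reported only as a warning.
ex:descriptionShape a sh:PropertyShape ;
    sh:path ex:description ;
    sh:minCount 1 ;
    sh:severity sh:Warning .

# Attach the new property shape to the existing node shape.
ex:ReviewShape sh:property ex:descriptionShape .
```

Results produced by this shape carry sh:resultSeverity sh:Warning in the report, which gives a summary query a simple value to group or filter on.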
SHACL gives you a large choice of building blocks to define, implement, and manage data quality standards for your knowledge graphs, and we haven’t even covered all that it can do here. This is all possible without writing any custom code, and what you define is portable because SHACL is a W3C standard. You’re going to find it useful for maintaining your existing knowledge graphs, for scaling those graphs up in a reliable way, and for integrating new data from other public and private sources.