Ontotext talks to Gene Loh, Director of Software Development at Synaptica, and Vassil Momtchev, Ontotext CTO, about the RDF-star extension to the RDF graph data model,…
There are no easy answers in life or in Information Architecture. Design decisions come with tradeoffs. Relational databases (RDBMS) have been the workhorse of ICT for decades. Being able to sit down and define a complete schema, a blueprint of the database, gave everyone assurance and consistency. Sure, you have to ignore the edge cases and hope that they stay edge cases. And yeah, the real-world relationships among the entities represented in the data had to be fudged a bit to fit into the counterintuitive model of tabular data, but, in trade, you get reliability and speed. Surely, business requirements don’t change over time, right?
The simple, transactional data that relational databases handle well increasingly does not reflect the hyper-connected, dynamic needs of today’s business environment. Ironically, relational databases only imply relationships between data points through the rows and columns they occupy. With graph databases, the representation of relationships as data makes it possible to better represent data in real time, accommodating newly discovered types of data and relationships. Relational databases benefit from decades of tweaks and optimizations to deliver performance. However, when it comes to queries that involve large and highly interconnected master data, the performance advantage is solidly in favour of graph databases like GraphDB. This is why data-driven companies like the FAANGs, global pharma brands and the financial industry switched to graph databases long ago.
For instance, analyzing M&A transactions to derive investment insights requires not only the raw transaction data but also information on the relationships of the companies involved in these transactions, e.g. subsidiaries, joint ventures, investors or competitors. This is a graph of millions of edges and vertices – in enterprise data management terms, a giant piece of master/reference data. Now consider that the transaction data is dynamic (thousands of equity transactions take place daily) and, to further complicate the scenario, the reference data is dynamic too, as transactions often imply new relationships in the company graph. To handle such scenarios you need a transalytical graph database – a database engine that can deal with both frequent updates (OLTP workloads) and graph analytics (OLAP).
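To make the two workloads concrete, here is a minimal SPARQL sketch against a hypothetical M&A dataset (the ex: vocabulary and all IRIs are illustrative, not a real schema). The first request is an OLTP-style update, recording a transaction the moment it closes:

```sparql
PREFIX ex: <http://example.com/ma/>

# OLTP-style update: record a new acquisition as it happens.
INSERT DATA {
  ex:AcmeCorp ex:acquired ex:WidgetCo .
}
```

The second is an OLAP-style traversal over the same, constantly changing graph:

```sparql
PREFIX ex: <http://example.com/ma/>

# OLAP-style analytics: every company reachable from AcmeCorp
# through any chain of acquisition or subsidiary links.
SELECT DISTINCT ?company WHERE {
  ex:AcmeCorp (ex:acquired|ex:hasSubsidiary)+ ?company .
}
```

A transalytical engine has to serve both kinds of request concurrently, without the analytics falling behind the updates.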
In order to have a competitive advantage in dynamic environments, enterprises need to enhance their proprietary information using global knowledge as context for interpretation and as a source of enrichment. They should be able to continuously integrate data across multiple internal systems and link it to data from external sources. To automate these operations and maintain sufficient data quality, enterprises have started implementing so-called data fabrics, which employ diverse metadata sourced from different systems.
“The ability of the data fabric to continuously find, integrate, catalog, and share all forms of metadata: It should be able to do this across all environments, including hybrid and multicloud platforms, and at the edge. This metadata should then be represented, along with its intricate relationships, in a connected knowledge graph model that can be understood by the business teams”
Further, “ML-Augmented data integration is making active metadata analysis and semantic knowledge graphs pivotal parts of the data fabric.”
Gartner, ‘Data Fabrics Add Augmented Intelligence to Modernize Your Data Integration’, Ehtisham Zaidi, Eric Thoo, Guido De Simoni, Mark Beyer, December 17, 2019.
The vital added value of knowledge graphs is the paradigm of using ontologies – explicit, formal conceptual models – to provide consistent, unified access to data scattered across different systems. The key characteristic is that ontologies capture, integrate and operationalize knowledge across several disciplines and types of systems.
If you want to solve interesting problems beyond basic data analytics, you are going to need formal semantics, and that means schemas. Schemas are powerful. They create reliable, consistent and communicable models for representing data. They provide meaning. Contrast this with the world of relational databases, where meaning is tacit and relies on a costly database architect to define the entire model a priori, with everyone hoping that no mistakes were made and that the model never needs adaptation in production.
The advantage of knowledge graphs over a relational database is that the schema is data too. It can be queried. It can also be modified as business needs require. Knowledge graphs use ontologies as semantic schemas in order to accommodate all the above types of knowledge in a way that allows both human experts and computers to understand and interpret them unambiguously.
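Because the ontology lives in the graph alongside the instance data, an ordinary SPARQL query can introspect it. A minimal sketch, assuming a hypothetical ex:Organization class:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.com/>

# The schema is just more triples: list every declared
# subclass of ex:Organization together with its label.
SELECT ?subclass ?label WHERE {
  ?subclass rdfs:subClassOf ex:Organization ;
            rdfs:label ?label .
}
```

Adding a new subclass later is a plain INSERT of a few triples, not a schema migration.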
The addition of formal semantics to the data model has a number of advantages, starting with standardization.
The stack of Semantic Web standards (RDF, RDFS, SPARQL, OWL, SHACL) is developed through the W3C community process to make sure that the requirements of different actors are satisfied, all the way from logicians to enterprise data management professionals and system operations teams. It incorporates several interoperable schema languages so that different applications and types of data can be modelled appropriately (e.g. under open-world vs. closed-world assumptions).
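To make the open-world vs. closed-world contrast concrete, here is a small, hypothetical Turtle sketch (the ex: terms are illustrative) showing how RDFS and SHACL treat the same modelling intent differently:

```turtle
@prefix ex:   <http://example.com/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .

# RDFS, open world: a range axiom lets a reasoner INFER that
# whatever ex:hasAuthor points to is a Person, even if the
# data never says so explicitly.
ex:hasAuthor rdfs:range ex:Person .

# SHACL, closed world: a shape VALIDATES the data as it stands
# and reports a violation if an author is not explicitly typed
# as a Person.
ex:DocumentShape a sh:NodeShape ;
    sh:targetClass ex:Document ;
    sh:property [
        sh:path  ex:hasAuthor ;
        sh:class ex:Person ;
    ] .
```

The same stack accommodates both styles, which is exactly what heterogeneous enterprise data needs.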
On the other hand, the property graph model, implicitly defined by the Apache TinkerPop framework (referred to as Property Graph), is designed for efficient path traversal and similar tasks. It is not concerned with publishing or integrating data. Enterprise data management and governance require standards for schema and query languages, identification and serialization formats, and federation and management protocols, none of which is present in the Property Graph stack. Lacking any form of formal semantics, Property Graphs are certainly not a good choice for automated reasoning over data to provide insights. This is why data architects and organisations interested in the sustainability of their data prefer RDF for implementing their knowledge graphs.
At the low level of representing data in a graph, there is often a need to attach metadata to relationships, which are most naturally represented as edges in the graph. Examples include provenance (e.g. where a relationship is sourced from or who edited it last), access control and representation of context (e.g. a time span).
The ability to attach properties to relationships (i.e. the edges) as easily as to entities (i.e. the nodes) has been an advantage of Property Graph representations. It is not impossible to do in RDF, but the workarounds all come with costs. In the diagram below you can see four different ways of doing it without extending the RDF specification: reification, singleton properties, named graphs, and n-ary relations. Each approach’s advantages and disadvantages are detailed in What is RDF-Star?
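To see why these workarounds feel “hairy”, consider standard reification, sketched below with illustrative ex: IRIs. One fact about one edge balloons into an artificial statement node and a handful of bookkeeping triples:

```turtle
@prefix ex:  <http://example.com/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# The fact itself: a single triple.
ex:Mary ex:worksFor ex:AcmeCorp .

# Standard reification: an artificial node (ex:stmt1) and four
# bookkeeping triples, before the one triple (ex:source) that
# actually carries the metadata.
ex:stmt1 a rdf:Statement ;
    rdf:subject   ex:Mary ;
    rdf:predicate ex:worksFor ;
    rdf:object    ex:AcmeCorp ;
    ex:source     ex:EmployeeRegistry .
```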
Natural and intuitive modelling is key for numerous reasons! All these approaches share several important problems: they are counter-intuitive, the representation is “hairy” and it requires extra effort to comprehend. When people are faced with the challenging task of integrating data across multiple sources, many of them quite complex on their own, it is too burdensome to ask them to further complicate the representation. It turns an already difficult task into an impractical one. This unnecessary modelling complexity can impede the adoption of knowledge graphs in the large enterprises that could benefit most from their semantics.
The cost of ungainly representation was an early lesson learned by the pioneers of the Semantic Web, back when Description Logics (DL) were the prevalent approach in the field. The power of DL reasoning came at the cost of performance, but that is not what made the semantic world move away from DLs. The true problem was that DL semantics were too complex to comprehend at scale: tracing inference chains on anything non-trivial became onerous. If it is not intuitive to a KR specialist with a pocketful of PhDs, what chance does a commercial developer have? RDFS, with its simple entailment rules, solved that issue.
Recently, an extension called RDF-Star (sometimes written RDF*) has been proposed that addresses these issues by reducing document size and increasing efficiency. Most importantly, it provides a representation that is human-friendly and intuitive.
For example, suppose your data contains Person entities and Job Titles connected by a relationship called positionHeld. If you wanted to qualify positionHeld with a length of time, plain RDF would require you to ‘reify’ the statement with additional statements: an artificial identifier such as Id1, which in turn has relationships to other properties such as StartDate and EndDate. Id1 is neither intuitive nor useful in itself. RDF-Star enables you to do directly what you want, with fewer statements: annotate the relationship just as easily as you annotate entities.
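Here is a minimal Turtle-star sketch of the positionHeld example (the names and dates are illustrative). The << … >> syntax quotes the triple so it can be annotated directly, with no artificial Id1 node:

```turtle
@prefix ex:  <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The edge itself, asserted as usual.
ex:Judy ex:positionHeld ex:Chair .

# The same edge, annotated directly with its time span.
<< ex:Judy ex:positionHeld ex:Chair >>
    ex:startDate "2015-01-01"^^xsd:date ;
    ex:endDate   "2018-12-31"^^xsd:date .
```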
RDF-Star comes with a corresponding extension of the query language, SPARQL-Star; examples of modelling properties on edges in RDF-Star and querying them with SPARQL-Star are provided here.
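A short SPARQL-Star sketch over the hypothetical data above; the quoted-triple pattern matches the annotated edge directly:

```sparql
PREFIX ex: <http://example.com/>

# Every position Judy has held, with the start date that is
# attached to the edge itself.
SELECT ?position ?start WHERE {
  << ex:Judy ex:positionHeld ?position >> ex:startDate ?start .
}
```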
RDF-Star provides a representation that is closer to the clean simplicity of property graphs, but without sacrificing semantics in the bargain. In fact, RDF-Star goes beyond the expressivity of Property Graphs, where you can only attach key-value pairs to relationships: in RDF-Star you can make a statement about an edge in the graph that refers to another RDF resource, e.g. a description of context shared with other edges.
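A small, hypothetical sketch of that extra expressivity: the annotation value below is a first-class resource with its own description, shared by several edges, rather than an opaque key-value pair:

```turtle
@prefix ex: <http://example.com/> .

# A context resource that many edges can share.
ex:Census2020 a ex:DataSource ;
    ex:publisher   ex:NationalStatsOffice ;
    ex:releaseYear 2020 .

ex:Mary ex:livesIn ex:Springfield .
<< ex:Mary ex:livesIn ex:Springfield >> ex:source ex:Census2020 .

ex:John ex:livesIn ex:Shelbyville .
<< ex:John ex:livesIn ex:Shelbyville >> ex:source ex:Census2020 .
```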
Knowledge graphs have been shown to be essential for the dynamic, flexible and automated data management required today, with reusable data services, machine-readable semantic metadata and APIs that ensure the integration and orchestration of data across the organization and with third parties.
RDF-Star furthers the open-standards philosophy of RDF, ensuring interoperability and avoiding the inevitable headaches of the proprietary languages that dominate the world of Property Graphs. More than RDBMSs or Property Graphs, knowledge graphs deliver unified data access, automation of data management tasks, and meaningful data in context.