Triplestores are Proven as Operational Graph Databases

September 30, 2015 8 mins. read Atanas Kiryakov

Claiming that RDF triplestores are typically used for offline analytics suggests unfamiliarity with their most popular use cases. Triplestores are often applied in very dynamic operational database setups, such as metadata-based content management at the world’s largest media companies and publishers, including BBC, FT, Wiley, Elsevier, Oxford University Press and DK.

As new approaches to data management are gaining popularity, we start seeing more texts that compare the different NoSQL and particularly graph database engines. A recent example is “Graph Databases for Beginners: Other Graph Data Technologies”.

While such comparisons do a great job of helping developers understand “how stuff works”, they tend to be imprecise when authors comment on engines beyond their core area of expertise.

The above-mentioned post makes statements about triplestores (also called semantic graph databases) like these:

However, triple stores are not “native” graph databases because they don’t support index-free adjacency, nor are their storage engines optimized for storing property graphs.

and

… the most common use case for triplestores is offline analytics rather than for online transactions.

And so my 20+ years of piled up expertise urge me to comment.

To provide a bit of background, let’s start with:

    • Triplestores are graph database engines that, unlike engines based on property graphs, implement a set of comprehensive, vendor-independent standards: RDF (the data model), RDFS and OWL (schema languages), and SPARQL (query language).
    • Triplestores work with globally unique identifiers. Together with a few other features, this makes them very suitable for data integration, be it of the thousands of Linked Open Data (LOD) datasets or of proprietary data.

Index-free Adjacency in Triplestores

Now we’ll dive deeper into the theory of data representation and indexing, so if you want to understand how these are actually implemented, don’t skip this section.

Indeed, traversal from one node of the graph to another is not the most typical operation for triplestores, so many of them do not provide efficient support for it out of the box. Still, the leading triplestores can be configured so that such operations are efficiently supported.

To understand how triplestores work, I will provide a quick intro to the most typical designs.

Most triplestores have some sort of dictionaries, which assign each node in the graph an integer number as an internal identifier. Technically, they map the actual entity identifiers (URIs such as “http://company.com/data/person.101”) and literals (such as “Frank Lampard” and “2015-02-28T23:39:07Z”^^xsd:dateTime) to an integer number unique for the database instance.
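The dictionary idea can be illustrated with a minimal Python sketch. It is not how any particular triplestore implements it; the class name `Dictionary` and its methods are hypothetical, and real engines persist these mappings on disk rather than in memory.

```python
class Dictionary:
    """Toy two-way mapping between RDF terms (URIs, literals) and integer IDs."""

    def __init__(self):
        self._id_of = {}    # term -> integer ID
        self._term_of = []  # integer ID -> term (the list position is the ID)

    def encode(self, term):
        """Return the ID of a term, assigning the next free integer if it is new."""
        if term not in self._id_of:
            self._id_of[term] = len(self._term_of)
            self._term_of.append(term)
        return self._id_of[term]

    def decode(self, term_id):
        """Return the original term for an ID."""
        return self._term_of[term_id]


d = Dictionary()
person = d.encode("http://company.com/data/person.101")
name = d.encode('"Frank Lampard"')
print(person, name, d.decode(person))
```

Encoding the same term twice returns the same integer, which is what allows the indices to work on compact IDs instead of long strings.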

On top of the dictionaries, triplestores maintain indices over the triples. The most popular such index is PSO (Predicate Subject Object), where triples are ordered first by their predicate (the type of the relationship), then by subject (the start node) and, finally, by the object (the end node). Each element of the triple is represented in such an index by its internal integer ID for efficiency purposes.

The PSO index efficiently handles queries where the predicate and the subject are known, e.g.:

SELECT ?team WHERE { :Frank_Lampard :plays_for ?team}

and even when only the predicate is known:

SELECT ?who ?team WHERE { ?who :plays_for ?team}

Usually, triplestores maintain several such indices, to be able to efficiently deal with different triple patterns. The concrete indices to be used are easy to configure in accordance with the typical loads and the performance requirements for the database instance.

Note that triplestores do not “store” triples for the sake of storing them. Indexing triples in PSO and other similar indices is also the way to store them. Each triple is stored in each of the indices, which is not a problem because its internal representation by integer IDs is sufficiently compact.
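As a rough illustration of why a PSO-ordered index answers both query shapes above, here is a small Python sketch. It assumes a sorted in-memory list stands in for the on-disk index; the integer IDs and the `scan` helper are hypothetical, and the predicate `LOCATED_IN` is introduced only for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical integer IDs, as a dictionary might have assigned them.
LAMPARD, OTHER_PLAYER, CHELSEA, OTHER_TEAM, LONDON = 1, 2, 3, 4, 5
PLAYS_FOR, LOCATED_IN = 10, 11

triples = [  # (subject, predicate, object)
    (LAMPARD, PLAYS_FOR, CHELSEA),
    (OTHER_PLAYER, PLAYS_FOR, OTHER_TEAM),
    (CHELSEA, LOCATED_IN, LONDON),
]

# The PSO index is the same triples, reordered and sorted.
pso = sorted((p, s, o) for (s, p, o) in triples)


def scan(index, prefix):
    """Return the contiguous run of index entries that start with `prefix`."""
    lo = bisect_left(index, prefix)
    upper = prefix + (float("inf"),) * (3 - len(prefix))
    hi = bisect_right(index, upper)
    return index[lo:hi]


# Predicate and subject known: { :Frank_Lampard :plays_for ?team }
teams = [o for (_, _, o) in scan(pso, (PLAYS_FOR, LAMPARD))]

# Only the predicate known: { ?who :plays_for ?team }
pairs = [(s, o) for (_, s, o) in scan(pso, (PLAYS_FOR,))]

print(teams, pairs)
```

Because the entries are sorted, every pattern that fixes a leading part of the key maps to one contiguous range, found with two binary searches. Other orderings serve other triple patterns in the same way.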

To support efficient graph traversal from one node to another, a triplestore needs to be configured to enable its subject-object-predicate (SOP) index. With this index switched on, a triplestore becomes a de facto “index-free adjacency” engine. One can consider the SOP index as “the storage” of the graph database.

There are multiple deployments of triplestores that are tuned this way and do support efficient graph traversal.
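To make the traversal argument concrete, the following Python sketch shows how a subject-first ordering keeps all outgoing edges of a node next to each other. It is a simplification under stated assumptions: the IDs, the `neighbours` and `reachable` helpers, and the `LOCATED_IN` predicate are hypothetical, and a sorted list stands in for the real index structure.

```python
from bisect import bisect_left
from collections import deque

# Hypothetical integer IDs, as a dictionary might have assigned them.
LAMPARD, OTHER_PLAYER, CHELSEA, OTHER_TEAM, LONDON = 1, 2, 3, 4, 5
PLAYS_FOR, LOCATED_IN = 10, 11

triples = [  # (subject, predicate, object)
    (LAMPARD, PLAYS_FOR, CHELSEA),
    (OTHER_PLAYER, PLAYS_FOR, OTHER_TEAM),
    (CHELSEA, LOCATED_IN, LONDON),
]

# SOP ordering: all edges leaving the same subject are stored contiguously.
sop = sorted((s, o, p) for (s, p, o) in triples)


def neighbours(index, node):
    """Yield (object, predicate) for every outgoing edge of `node`."""
    i = bisect_left(index, (node,))
    while i < len(index) and index[i][0] == node:
        _, o, p = index[i]
        yield o, p
        i += 1


def reachable(index, start):
    """Breadth-first traversal that follows outgoing edges from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for o, _ in neighbours(index, node):
            if o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


print(sorted(reachable(sop, LAMPARD)))
```

Each hop costs one seek to the start of the node's run plus a sequential read of its edges, which is the access pattern graph traversal needs.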

Triplestores as Dynamic Operational Databases

Triplestores are often used for dynamic operational databases. Plenty of such applications can be found in publishing and media, where triplestores are used for dynamic management of content, based on rich metadata descriptions.

This usage pattern is known as dynamic semantic publishing and it is embodied in the LDBC Semantic Publishing Benchmark (SPB). In SPB news, images and other “creative works” are described with metadata: namely, Dublin Core-like attributes and links to entities and concepts that are most relevant to them. Entities and concepts are described as reference data: a huge Knowledge Graph derived from Linked Open Data datasets such as DBpedia, GeoNames and others.

Both the metadata and the reference data are stored in a triplestore that is accessed by two types of clients (agents):

    • Aggregation agents retrieve information on specific subjects. For instance, at the Sports section of the website of BBC, each topic web page (e.g., the one for Chelsea) is dynamically generated by several SPARQL queries to the underlying triplestore;
    • Editorial agents are constantly making changes to the database, either inserting metadata for the newly coming content or updating the reference data (e.g., the number of goals scored by Frank Lampard this season).
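The two kinds of agents can be sketched in Python against a toy in-memory store. This only illustrates the read/write mix, not the BBC implementation: the store, the function names, the `:about` and `:goals_this_season` predicates, the article identifiers and the goal counts are all hypothetical.

```python
# Hypothetical in-memory stand-in for a triplestore: a set of (s, p, o) tuples.
store = {
    (":article_1", ":about", ":Chelsea"),
    (":article_2", ":about", ":Chelsea"),
    (":Frank_Lampard", ":goals_this_season", 3),
}


def aggregation_read(topic):
    """Aggregation agent: list the creative works tagged with a topic."""
    return sorted(s for (s, p, o) in store if p == ":about" and o == topic)


def editorial_insert(work, topic):
    """Editorial agent: add metadata for newly arrived content."""
    store.add((work, ":about", topic))


def editorial_update(subject, predicate, new_value):
    """Editorial agent: replace the value of a single-valued property."""
    stale = {t for t in store if t[0] == subject and t[1] == predicate}
    store.difference_update(stale)
    store.add((subject, predicate, new_value))


editorial_insert(":article_3", ":Chelsea")
editorial_update(":Frank_Lampard", ":goals_this_season", 4)
print(aggregation_read(":Chelsea"))
```

In a real deployment the reads would be SPARQL queries and the writes SPARQL updates running concurrently against the same database.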

In dynamic semantic publishing scenarios, triplestores typically handle hundreds of read queries per second, while in parallel processing tens of update transactions per second for Knowledge Graphs that contain hundreds of millions of…

These are real statistics from mission-critical deployments like the triplestore behind BBC SPORT – a very dynamic operational database backing the website 24×7 since the year 2012.

How To Evaluate A Graph Database

The Linked Data Benchmark Council (LDBC) is an industry consortium governing and developing TPC-like benchmarks for graph databases and triplestores. Its members include leading vendors in both fields, such as IBM, Oracle, Neo Technology, OpenLink Software and Ontotext. At the LDBC’s website, one can find benchmarks and benchmark results alongside blog posts on related subjects and event announcements.

The Dynamic Semantic Publishing (DSP) application pattern was invented and first implemented by BBC’s team for their website for the FIFA World Cup 2010. A great blog post describing this project was published by Jem Rayfield. One can read about the dynamic semantic publishing use case and the Semantic Publishing Benchmark in the blog post that I wrote earlier this year.

On the technology side, in DSP the graph database engine needs to interplay closely with the text-mining technology used for automated metadata generation, particularly when Linked Open Data is used for text analytics and tagging purposes. This is something that triplestores have proven to do very well, which is no surprise given that LOD comes in RDF.

In April this year, Philip Howard from Bloor Research completed a report on the graph database market. One can download the full version of the “Graph and RDF databases 2015” report. Philip also refers to Property Graphs as “operational graph databases” and considers triplestores incapable of “index-free adjacency”, reflecting historical trends and attitudes.

The summary about RDF databases there is as follows:

Often semantically focused… for use in operational environments but have inferencing capabilities. Require indexes even in transactional environments. Often ACID compliance.

Back in January, Robin Bloor published “The Graph Database and the RDF Database”, which provides a number of good insights about the differences and commonalities between RDF databases and Property Graph databases:

Where the RDF databases really score is when you want to do set processing (a la SQL) at the same time that you want to do graph processing. Consider a query such as “Who are the biggest influencers on Twitter over the past six months?”

Both the RDF and property graph databases would handle such a query and return the same results quickly. But if you ask the very different question: “Which influencers have had the same pattern of influence on Twitter over the last six months?”, you are asking for both graph processing and set processing at the same time to get to the answer, and the RDF databases do both well. Not only that, but this is an area of analytics, which was virtually untapped until recently because there was no software that could easily do it.

Final Words

The popularity of graph databases is growing, based on a good track record of projects where these engines delivered on expectations.

That’s true for all types of graph database technology: property graphs, RDF, and other graph analytics. People start paying more attention to the differences between the different graph database standards in order to choose the one most appropriate for their application.

In this post, I provided some insights on how triplestores work and how they can support graph-traversal efficiently, although “index-free adjacency” is not central to their design. I also presented the “dynamic semantic publishing” pattern – a typical use case where triplestores are used as a dynamic operational database.

I also provided references to recent posts and reports that touch on the RDF vs. Property Graphs subject. The latter excel in graph analytics, although triplestores can do this, too.

I will summarize the advantages of RDF-based graph engines as follows:

CEO at Ontotext

Atanas is a leading expert in semantic databases, author of multiple signature industry publications, including chapters from the widely acclaimed Handbook of Semantic Web Technologies.
