From Disparate Data to Visualized Knowledge Part II: Scaling on Both Ends

This series of blog posts constitutes a step-by-step guide for data ingestion, inference validation and visualization with GraphDB followed by GraphQL interface setup, search and federation with Ontotext Platform.

November 26, 2021 · 8 min read · Radostin Nanov

In our previous blog post of the series, we covered how to ingest data from different sources into GraphDB, validate it and infer new knowledge from the extant facts. Today we’ll deal with the big issue of scaling, tackling it on two sides: what happens when you have more and faster sources of data? And what happens when you want more processing power and more resilient and available data?

Adding the robots – ETL of disparate data

Moore’s law says that the number of transistors in a circuit doubles about every two years. However, the same does not apply to human processing power. With the great data integration powers of GraphDB, LAZY has achieved a lot.

But now the bottleneck moves from data processing towards data gathering. Inspectors operating locally are slow. Following the mantra that everything is better with robots, inspection authorities have started looking into automation. This will be achieved with Autonomous Inspection Drones, or AID.

The trouble is that there are many drone manufacturers on the building inspection market and all have different formats and APIs. Now LAZY has to gather all that disparate data and store it in GraphDB. The good news is that we are no longer using spreadsheets. The bad news is that we now need to process different formats. There are two ways to handle this.

  • A bespoke program. This can be anything from a dozen lines of code to a very complex processing solution. It takes more time to develop but offers greater flexibility. A bespoke program can also execute arbitrary queries against the database and create aggregations over the incoming data.

Part of a Python program for data ingestion

  • Alternatively, if the incoming data is mostly static, you can use a mapping language such as RML. This is a more straightforward task, but unless paired with custom programming, it would preclude you from doing data manipulation based on what is already in the database.
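To make the mapping route concrete, here is a minimal RML sketch for a CSV export of drone reports. The file name, column names and the `ex:` vocabulary are invented for the example, not part of any real drone API.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://lazy.example.com/ontology#> .

<#DroneReportMapping>
  rml:logicalSource [
    rml:source "drone-reports.csv" ;
    rml:referenceFormulation ql:CSV
  ] ;
  # One RDF resource per CSV row, keyed by the report_id column.
  rr:subjectMap [
    rr:template "http://lazy.example.com/report/{report_id}"
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:building ;
    rr:objectMap [ rml:reference "building_id" ]
  ] .
```

An RML processor runs this mapping over the CSV and emits triples ready for loading, which is exactly why the approach suits mostly static inputs.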

Once the data is in RDF format, GraphDB’s robust API gives users many different ways to interact with it. The engineers at LAZY would have to write a server that picks up incoming data and translates it into RDF. This data would then be ingested into GraphDB and could be further modified with SPARQL queries. We’ll cover the four most popular ways to do that:

  1. The RDF4J Java API. GraphDB is a Java application based on the powerful RDF4J open-source framework. As such, using the API is the native way to interact programmatically with GraphDB, offering you the greatest flexibility. Ontotext also offers some improvements over the API with our client libraries. There is a runtime for all editions – the free runtime is available at Maven Central, while the other runtime libraries are kept at Ontotext’s Maven repository. With this, you can handle end-to-end processing of your data.
  2. The JavaScript GraphDB driver. JavaScript is very popular, so GraphDB has a driver for it as well. It covers all the basic GraphDB functionality by making HTTP API calls. It’s open source as well.
  3. Python RDFLib. RDFLib allows its users to connect to any SPARQL endpoint, GraphDB included. As a highly popular language for scripting and data science, Python is often favored for development. The example provided previously uses RDFLib.
  4. Data virtualization with Ontop. This GraphDB-native solution allows you to map relational data to RDF. If you are already processing data to create an RDF model, you probably won’t need this. However, perhaps you want to integrate relational data. Or perhaps you have a legacy program that stores drone reports in a relational database. Then Ontop is the best way to make it work with RDF data.
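As a sketch of what the scripted route might look like, the fragment below turns a parsed drone report into a SPARQL `INSERT DATA` update and posts it to GraphDB’s RDF4J-style statements endpoint. The base URIs, repository name and report fields are illustrative assumptions, not part of any real drone format.

```python
import urllib.request

# Hypothetical base URIs for the LAZY vocabulary and report resources.
EX = "http://lazy.example.com/ontology#"
REPORT = "http://lazy.example.com/report/"

def report_to_sparql(report: dict) -> str:
    """Turn a drone report (already parsed from its native format)
    into a SPARQL INSERT DATA update string."""
    subject = f"<{REPORT}{report['id']}>"
    triples = [
        f'{subject} <{EX}building> "{report["building"]}" .',
        f'{subject} <{EX}status> "{report["status"]}" .',
    ]
    return "INSERT DATA {\n  " + "\n  ".join(triples) + "\n}"

def push_to_graphdb(update: str,
                    endpoint: str = "http://localhost:7200/repositories/lazy/statements") -> None:
    """POST the update to GraphDB; the repository name 'lazy' is an assumption."""
    req = urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    update = report_to_sparql({"id": "aid-42", "building": "B-7", "status": "passed"})
    print(update)
    # push_to_graphdb(update)  # requires a running GraphDB instance
```

A real service would add error handling, batching and escaping of literal values, but the shape – translate, then POST a SPARQL update – stays the same.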

Ultimately, as a data-science driven enterprise, LAZY would perhaps prefer the Python solution. However, it is possible to go for any of the options to bring data automation into your use case.

Victims of our own success – how to scale out

There comes a moment in every successful enterprise’s life when it must scale out. LAZY is no different. Imagine a scenario where you have a successful solution integrating data from many different drone models and offering analytics on top of it. Surely, many clients would want that. And, inevitably, someone would want to deploy it at scale. Until now, we have covered only the functional part of our questions, and somewhat naively at that. It’s time to ask the hard non-functional questions.

Data resilience

For a small solution, AID may carry out inspections at a given time. The operator deploys the AID, runs the GraphDB ingestion service, then collects the AID and shuts it down. But at a large site – or if there are multiple sites, especially across the globe – the need arises for a resilient solution that is always ready to ingest data.

Fortunately, it’s all a question of deploying the correct GraphDB edition. While LAZY could have used GraphDB Free up to this point, now is definitely the moment to move to the Enterprise edition. GraphDB Enterprise comes with a battle-tested cluster mode. GraphDB’s cluster is based on the data replication principle: if one worker fails, the other workers still have an up-to-date copy of the data, and if one worker is processing a large update, the other workers can continue to provide stable and efficient query endpoints. This is powered by a transaction log replicated across multiple masters, which ensures all updates are atomic and applied in order.

What’s great about this is that it’s all completely invisible to the end user. The deployment topology of GraphDB doesn’t matter: all features of the standalone version are available in a cluster, and the SPARQL and upload endpoints remain the same.

ETL resilience

However, not even the best graph database can store data if none is available. To this end, LAZY must take care of its Python solution and make sure it’s always on. There are many ways to do this. However, since LAZY already operates with GraphDB, it’s probably a good idea to look into the tools that GraphDB already uses.

GraphDB is often deployed with Kubernetes, a system for orchestrating Docker deployments that puts much-needed tools at the user’s disposal. Kubernetes features liveness and readiness checks, an initialization pipeline, persistent storage, rolling upgrades, blue-green deployments and much, much more. Ontotext offers GraphDB as an official Docker image, so there’s no need for any preparation before diving into Kubernetes.
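As an illustration of those health checks, a pod spec for a GraphDB container might declare probes along these lines. The port is GraphDB’s default (7200); the `/protocol` path (the RDF4J protocol version endpoint) and the timings are assumptions for the sketch – consult the official chart for the exact values.

```yaml
# Sketch of liveness/readiness probes for a GraphDB container.
livenessProbe:
  httpGet:
    path: /protocol     # lightweight endpoint that answers when the server is up
    port: 7200
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /protocol
    port: 7200
  initialDelaySeconds: 30
  periodSeconds: 5
```

With probes like these, Kubernetes restarts a hung instance and keeps traffic away from a node that is still starting up.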

However, Kubernetes can be opaque, and deployments are sometimes hard to repeat. Very often you need to deploy to a different infrastructure where a couple of things have to change, but those things may live in three different files. And perhaps the namespace needs to change too, necessitating the replacement of a couple of dozen lines of configuration.

There’s a better way, and GraphDB has already embraced it: Helm, a Kubernetes templating engine that allows repeatable and easily configurable deployments.

Helm deployment for GraphDB
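A Helm-based deployment then boils down to a couple of commands. The repository URL, chart name and values file below are assumptions for the sketch – check Ontotext’s documentation for the current ones.

```shell
# Add the (assumed) Ontotext chart repository and fetch its index.
helm repo add ontotext https://maven.ontotext.com/repository/helm-public/
helm repo update

# One values file per environment keeps deployments repeatable:
# the same chart, different overrides for namespace, storage, replicas.
helm install graphdb ontotext/graphdb \
  --namespace lazy-prod \
  --create-namespace \
  -f values-prod.yaml
```

Switching infrastructure now means swapping a single values file rather than hunting through raw Kubernetes manifests.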

Even with those great features, though, the ETL process may still fail or be temporarily outpaced by the volume of incoming reports. This can be mitigated by adding a Kafka queue in front of it. In such a deployment, a Kafka queue picks up the incoming AID reports. Each AID can act as a Kafka producer, or there can be an ingestion server that serves as the entry point to the LAZY system. The Kafka queue is then consumed by a cluster of ETL services, which feed the data into the GraphDB cluster.
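The consumer side of such an ETL service could look roughly like this. The message shape, topic name and broker address are assumptions; the Kafka loop uses the third-party `kafka-python` package, so its import is kept local to the function.

```python
import json

def parse_report(message_value: bytes) -> dict:
    """Decode one AID report from a Kafka message payload.
    The JSON shape is an assumption for the example."""
    report = json.loads(message_value)
    # Reject incomplete reports early, before touching the database.
    if "id" not in report:
        raise ValueError("report without an id")
    return report

def consume_reports(topic: str = "aid-reports") -> None:
    """Sketch of the consumer loop; requires kafka-python and a
    running broker, hence the local import."""
    from kafka import KafkaConsumer  # pip install kafka-python
    consumer = KafkaConsumer(topic, bootstrap_servers="localhost:9092")
    for message in consumer:
        report = parse_report(message.value)
        # ...translate the report to RDF and POST it to the GraphDB cluster...
```

Because Kafka persists the queue, a crashed ETL pod simply resumes from its last committed offset when Kubernetes restarts it.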

This enables LAZY to buffer data until it is consumed, smoothing out usage spikes. And the great thing is that many providers offer Kafka as a service, so if a customer has doubts whether an on-premises deployment of LAZY is good enough, the queue can easily be outsourced.

LAZY now has a working, scalable system that is resistant to failure and can potentially process thousands of requests per second. And that’s quite neat.

Conclusion

The updated LAZY System architecture, now more resilient and available

In this blog post from our series, we’ve covered how to adapt and scale the basic solution outlined in From Disparate Data to Visualized Knowledge Part I: Moving from Spreadsheets to an RDF Database to a fully-fledged, scalable system. Ontotext provides the basis and, with just a small bit of configuration, your ETL ecosystem can thrive on it.

Come back next Friday for our next blog post, where we’ll look into how to interact with others and what tooling we need for it.



Solution/System Architect at Ontotext

Radostin Nanov has a MEng in Computer Systems and Software Engineering from the University of York. He joined Ontotext in 2017 and progressed through many of the company's teams as a software engineer working on the Ontotext Cognitive Cloud, GraphDB and finally Ontotext Platform before settling into his current role as a Solution Architect in the Knowledge Graph Solutions team.
