Read our first post from this series about how to turn your disparate data into visualized knowledge, starting with a step-by-step guide for data ingestion…
In our previous blog post of the series, we covered how to ingest data from different sources into GraphDB, validate it and infer new knowledge from the extant facts. Today we'll deal with the big issue of scaling, tackling it from two sides: what happens when you have more and faster sources of data? And what happens when you want more processing power and more resilient, available data?
Moore's law says that the number of transistors in a circuit doubles about every two years. However, the same does not apply to human processing power. With GraphDB's great data integration powers, LAZY has achieved a lot.
But now the bottleneck moves from data processing towards data gathering. Inspectors operating locally are slow. Following the mantra that everything is better with robots, inspection authorities have started looking into automation. This will be achieved with Autonomous Inspection Drones, or AID.
The trouble is that there are many drone manufacturers on the building inspection market, and all have different formats and APIs. Now LAZY has to gather all that disparate data and store it in GraphDB. The good news is that we are no longer using spreadsheets. The bad news is that we now need to process different formats. There are two ways to handle this.
Part of a Python program for data ingestion
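As a minimal sketch of what such an ingestion step could look like, the functions below normalize two hypothetical vendor formats into one common report dictionary that a later RDF-mapping step can consume. "Acme" and "Volar", the field names, and the JSON/XML shapes are all invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

# "Acme" and "Volar" are invented manufacturer names; each function
# normalizes one vendor format into a common report dictionary.

def from_acme_json(payload: str) -> dict:
    """Acme drones (hypothetical) report in JSON."""
    raw = json.loads(payload)
    return {"id": raw["reportId"], "site": raw["location"], "defects": raw["issues"]}

def from_volar_xml(payload: str) -> dict:
    """Volar drones (hypothetical) report in XML."""
    root = ET.fromstring(payload)
    return {
        "id": root.get("id"),
        "site": root.findtext("site"),
        "defects": int(root.findtext("defects")),
    }

report = from_acme_json('{"reportId": "aid-42", "location": "Warehouse 7", "issues": 3}')
print(report)
```

Once every vendor format funnels into the same dictionary, the mapping to RDF has to be written only once.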
Once the data is in RDF format, GraphDB's robust API gives its users many different ways to interact with it. The engineers at LAZY would have to write a server that picks up incoming data and translates it into RDF. This data would then be ingested into GraphDB, where it can be further modified by SPARQL queries. We'll cover the four most popular ways to do that:
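For a flavor of the ingestion step itself, here is a sketch that POSTs a Turtle payload to a repository's statements endpoint of GraphDB's RDF4J-compatible REST API. The base URL and the repository name "lazy-aid" are assumptions for this example:

```python
import urllib.request

def statements_url(base: str, repository: str) -> str:
    # GraphDB exposes the RDF4J REST API; a repository's statements
    # endpoint accepts RDF payloads via POST.
    return f"{base}/repositories/{repository}/statements"

def ingest_turtle(base: str, repository: str, turtle: str) -> int:
    """POST a Turtle payload into the repository; returns the HTTP
    status code (204 No Content on success)."""
    req = urllib.request.Request(
        statements_url(base, repository),
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Actually sending data requires a running GraphDB instance with a
# repository named "lazy-aid" (both are assumptions of this sketch):
# ingest_turtle("http://localhost:7200", "lazy-aid",
#               "<http://example.com/i/1> a <http://example.com/Inspection> .")
print(statements_url("http://localhost:7200", "lazy-aid"))
```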
Ultimately, as a data-science driven enterprise, LAZY would perhaps prefer the Python solution. However, it is possible to go for any of the options to bring data automation into your use case.
There comes a moment in every successful enterprise's life when it must scale out. LAZY is no different. Imagine a scenario where you have a successful solution that integrates data from many different drone models and offers analytics based on it. Surely, many clients would want that. And, inevitably, someone would want to deploy it at scale. Up till now, we have covered only the functional part of our questions, and naively at that. It's time to ask the hard non-functional questions.
For a small solution, the AIDs may carry out inspections at a given time. The operator deploys the AIDs, runs the GraphDB ingestion service, then collects the AIDs and shuts everything down. But at a large site – or with multiple sites, especially across the globe – the need arises for a resilient solution that is always ready to ingest data.
Fortunately, it's all a question of deploying the right GraphDB edition. Up till now, LAZY could make do with GraphDB Free, but now it's definitely the moment to move to the Enterprise edition. GraphDB Enterprise comes with a battle-tested cluster mode. GraphDB's cluster is based on the principle of data replication: if one worker fails, the other workers still have an up-to-date copy of the data, and if one worker is processing a large update, the other workers can still provide stable and efficient query endpoints. This is powered by a transaction log, replicated across multiple masters, which ensures that all updates are atomic and applied in order.
What's great about this is that it's all completely invisible to the end user. The deployment topology of GraphDB doesn't matter. All features of the standalone version are available in a cluster, and the SPARQL and upload endpoints remain the same.
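To make this concrete, the sketch below builds a SPARQL SELECT request against GraphDB's repository endpoint. The point is that the client code is identical whether the base URL points at a standalone instance or at a cluster's front-facing node; the host, port and repository name "lazy-aid" are assumptions:

```python
import urllib.parse
import urllib.request

def sparql_request(base: str, repository: str, query: str) -> urllib.request.Request:
    """Build a SPARQL SELECT request against a GraphDB repository.
    The URL is the same for a standalone instance and for a cluster,
    so the client never needs to know the deployment topology."""
    url = f"{base}/repositories/{repository}?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})

req = sparql_request(
    "http://localhost:7200", "lazy-aid",
    "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }",
)
# To execute against a live instance: urllib.request.urlopen(req),
# then parse the JSON result bindings.
print(req.full_url)
```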
However, not even the best graph database can store data if none is available. To this end, LAZY must take care of its Python solution and make sure it’s always on. There are many ways to do this. However, since LAZY already operates with GraphDB, it’s probably a good idea to look into the tools that GraphDB already uses.
GraphDB is often deployed with Kubernetes, a system for orchestrating container deployments that puts much-needed tools at the user's disposal: liveness and readiness checks, an initialization pipeline, persistent storage, rolling upgrades, blue-green deployments and much, much more. Ontotext offers GraphDB as an official Docker image, so there's no need for any preparation before diving into Kubernetes.
However, Kubernetes can be opaque, and it's sometimes hard to make deployments repeatable. Very often you need to deploy to a different infrastructure where a couple of things have to change, but those things may live in three different files. And perhaps the namespace needs to change too, necessitating the replacement of a couple of dozen lines of configuration.
There's a better way, and GraphDB has already embraced it: Helm, a templating engine for Kubernetes that makes deployments repeatable and easily configurable.
Helm deployment for GraphDB
Even with those great features, though, the ETL process may still fail or be temporarily outpaced by the volume of incoming reports. This can be mitigated by adding a Kafka queue in front of it. In such a deployment, a Kafka queue picks up the incoming AID reports. Each AID can be a Kafka producer, or there can be an ingestion server acting as the entry point to the LAZY system. The Kafka queue is then consumed by a cluster of ETL services, which feed the data into the GraphDB cluster.
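An ETL worker in this pipeline could look roughly like the sketch below: a pure transformation from a normalized report to Turtle, plus a consumer loop. The vocabulary under example.com, the topic name, the broker address and the use of the third-party kafka-python package are all assumptions of this sketch:

```python
import json

def report_to_turtle(report: dict) -> str:
    """Map one normalized AID report to Turtle; the example.com
    vocabulary is a placeholder."""
    subject = f"<http://example.com/lazy/inspection/{report['id']}>"
    return (
        f"{subject} a <http://example.com/lazy/Inspection> ;\n"
        f'    <http://example.com/lazy/site> "{report["site"]}" .\n'
    )

def run_etl_worker() -> None:
    # Requires the third-party kafka-python package and a reachable
    # broker; topic and address are assumptions for this sketch.
    from kafka import KafkaConsumer
    consumer = KafkaConsumer("aid-reports", bootstrap_servers="localhost:9092")
    for message in consumer:
        turtle = report_to_turtle(json.loads(message.value))
        # ...POST `turtle` to GraphDB's /statements endpoint here...

print(report_to_turtle({"id": "aid-42-0001", "site": "Warehouse 7"}))
```

Because the transformation is a pure function, it can be tested in isolation and scaled horizontally just by adding more consumers to the same Kafka consumer group.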
This enables LAZY to buffer data until it is consumed, absorbing usage spikes. And the great thing is that many providers offer Kafka as a service, so if a customer has doubts whether LAZY's on-premise deployment is good enough, it can easily be outsourced.
LAZY now has a working, scalable system that is resistant to failure and can potentially process thousands of requests per second. And that’s quite neat.
In this blog post from our series, we’ve covered how to adapt and scale the basic solution outlined in From Disparate Data to Visualized Knowledge Part I: Moving from Spreadsheets to an RDF Database to a fully-fledged, scalable system. Ontotext provides the basis and, with just a small bit of configuration, your ETL ecosystem can thrive on it.
Come back next Friday for our next blog post, where we'll look into how to interact with others and what tooling we need for it.