Loading Data in GraphDB: Best Practices and Tools

A guided tour through the alternative approaches, tools and facilities for data transformation, ingestion, updates and virtualization that can be used to turn data into a graph and manage it with GraphDB

April 1, 2022 7 mins. read Radostin Nanov

Data is just the first step on the path towards knowledge. And lest this sound like an empty proverb, let's put some context behind the claim. Imagine having the bytes that correspond to a photograph stored on your system. Viewed on their own, those bytes mean nothing. Only by putting them in the right context – an image viewing program – do we obtain the knowledge the picture conveys. And if we want to interpret those bytes differently, we need yet another program.

This is good when we are working with data that can be read by multiple tools. You may think that images are great for this – after all, there are only half a dozen major image formats. However, some tools work only with vector graphics, while others refuse to visualize them. Still others use their own proprietary formats. If that’s the case with images, imagine how bad it must be for structured data stored in databases. In fact, you probably don’t need to imagine, if you are reading this.

Data formats number in the hundreds, if not thousands, and getting data into just the right format that helps you uncover its hidden knowledge is a challenge. RDF is one of the best knowledge-building formats, but it is not mainstream yet and most data is not RDF-native. So, how do you get that data into the right format?

Enter Converters

The traditional solution to the problem of converting data is to use a third-party converter. This has a few key benefits: it is platform-independent, reduces vendor lock-in and often follows an official standard.

The most popular conversion language for RDF is R2RML. R2RML covers all the bases:

  • It follows an official standard – it doesn’t get much better than W3C.
  • It is platform-independent – two of the major R2RML tools are Java- and Python-based, with tooling for many other languages as well. Ontotext has also built on the open-source tool Ontop to bring this functionality to GraphDB.
  • It reduces vendor lock-in – you take a relational database or a file exported from it and produce an RDF file.

TARQL is another option when you are working directly with tabular files, e.g., CSV. With it, you don’t need to write a mapping – you just write a SPARQL query and execute it directly against the table. The inputs are a table and a SPARQL query; the outputs are triples. It can’t get much easier than that.
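To illustrate the idea behind this style of conversion – not TARQL itself, just a minimal Python sketch of the concept – each CSV row becomes a set of variable bindings, and a CONSTRUCT-style template stamps out one triple per row (all URIs and column names here are hypothetical):

```python
import csv
import io

# A CONSTRUCT-style template; {id} and {name} play the role of SPARQL
# variables bound from the CSV header row (illustrative, not TARQL syntax).
TEMPLATE = '<http://example.com/person/{id}> <http://xmlns.com/foaf/0.1/name> "{name}" .'

# A tiny in-memory stand-in for a CSV file.
csv_data = io.StringIO("id,name\n1,Alice\n2,Bob\n")

def construct_triples(csv_file, template):
    """Bind each row's columns to the template's variables, yielding N-Triples lines."""
    for row in csv.DictReader(csv_file):
        yield template.format(**row)

triples = list(construct_triples(csv_data, TEMPLATE))
for t in triples:
    print(t)
```

The real engine does much more (datatype handling, IRI-safe escaping, full SPARQL semantics), but the mental model – table in, triples out – is the same.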

If direct mappers aren’t good enough because you need some custom logic, you can always implement your own converter. If you are writing code specifically with GraphDB in mind, you can use the optimized Java API or the JS client. Many data scientists are fans of Python; we ourselves often use Pandas and rdflib. This can be bundled with a Kafka pipeline to make sure that the messages relayed to GraphDB are repeatable and trackable. Finally, if you are using C#, we have tested integrations with dotNetRdf.
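This is the kind of thing a custom converter earns its keep on: logic no declarative mapping expresses cleanly. A hedged, stdlib-only sketch (the field names, business rule and URIs are invented for illustration; in practice you would likely reach for rdflib rather than hand-formatting N-Triples):

```python
import json

# Hypothetical input: a JSON export where we need custom logic that a
# direct mapper can't express - skip inactive users, mint URIs from emails.
records = json.loads(
    '[{"email": "alice@example.com", "active": true},'
    ' {"email": "bob@example.com", "active": false}]'
)

def to_ntriples(records):
    """Custom converter: filter records, mint a URI per record, emit N-Triples lines."""
    triples = []
    for rec in records:
        if not rec["active"]:          # custom business rule: drop inactive users
            continue
        local = rec["email"].split("@")[0]
        subject = f'<http://example.com/user/{local}>'
        triples.append(f'{subject} <http://example.com/email> "{rec["email"]}" .')
    return triples

nt_lines = to_ntriples(records)
print("\n".join(nt_lines))
```

The resulting N-Triples file (or stream) can then be loaded into GraphDB through the import facilities or a Kafka pipeline, as described above.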

Mapping in Style with OntoRefine

TARQL and R2RML are great tools, but with them you have to write a mapping mostly by hand in a text file, and that can be laborious. We have a solution that helps you clear that hurdle. Ontotext Refine is based on the popular OpenRefine tool initiated by Google, now governed by the Apache Foundation. This means that vendor lock-in is avoided, as transformation logic can be scripted in GREL (Google Refine Expression Language). It allows quick ingestion of structured data in multiple formats (TSV, CSV, *SV, XLS, XLSX, JSON, XML, RDF as XML and Google Sheets) into GraphDB.

Ontotext Refine is now a standalone tool which can be deployed and scaled separately from GraphDB. You can even tie it to a different GraphDB instance at runtime, allowing you to run a single Refine instance for all your needs, instead of spinning up a different one for each cluster.

The main strength of Refine is its powerful UI and flexible transformation capabilities. It starts with a data preview, similar to the import facility of Microsoft Excel. You can manipulate a sample file, modify the input and create a mapping. The mapping can be translated to JSON or to SPARQL. All operations can be exported as JSON and applied to any other input. If you reuse the same patterns in other files and map them to the same RDF ontology, you can use the OntoRefine client or API to carry out the same transformations.

Refine outperforms TARQL and R2RML and can ingest data directly into GraphDB – no intermediate step needed. This adds a layer of flexibility: you can use the data from Refine for all sorts of SPARQL queries and easily combine it with data already persisted in the database or in another Refine project.

Streaming Data with Kafka

A common question is how to keep data fresh. The converters described above work very well for bulk ingestion, but they operate in memory. They are not great at discerning what’s already in the database. Even Ontotext Refine has to do that with a SPARQL FILTER clause, which isn’t efficient.

However, when update data comes in piecemeal, we can employ GraphDB’s smart updates functionality. You can create a SPARQL template, which gets executed against a specific Kafka message, conveyed by our Kafka sink add-on. Data listed within the incoming Kafka message is then stored as RDF.

In many cases, though, that won’t be enough, as the Kafka message has to be RDF-formatted. Fortunately, starting with GraphDB 9.11, we offer something more – arbitrary SPARQL execution against JSON messages.
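To make the idea concrete, here is a hypothetical sketch of pairing a SPARQL update template with an incoming JSON message. The placeholder syntax is plain Python `string.Template`, not GraphDB’s actual smart-update syntax, and all names are invented for illustration:

```python
import json
from string import Template

# Hypothetical Kafka message payload, arriving as JSON.
message = json.loads('{"id": "42", "status": "shipped"}')

# A SPARQL INSERT template; $placeholders are filled from the JSON fields.
# (Illustrative only - GraphDB's own template mechanism differs.)
sparql_template = Template(
    'INSERT DATA { <http://example.com/order/$id> '
    '<http://example.com/status> "$status" . }'
)

# Render the update that would be executed against the repository.
update = sparql_template.substitute(message)
print(update)
```

The point is that each JSON message, however it is shaped, ends up as a concrete SPARQL update against the repository, with no RDF required on the producer side.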

With this approach, provided that you don’t need to call third-party services and you have JSON data coming in, you can greatly simplify your whole ingestion pipeline, skipping the need for any sort of custom ETL step.

Keeping the Data Where It Is with Virtualization

The most radical option is to just keep the data where it is. Maybe you don’t need all the benefits of RDF, like visualizations, inference and secondary indexing. Perhaps all you need to do is the occasional query. GraphDB has the right tool ready for this case: Ontop.

Ontop is, in essence, a mapping interface. You provide it with a mapping file and some configuration for the security and location of an SQL database, and it translates SPARQL queries to SQL. The data stays in the remote SQL database, and the full expressivity of SPARQL is supported. The data is not cached, except for any caching that happens on the SQL layer, which means SQL updates can be served seamlessly, provided that the mapping is kept up-to-date. You can use R2RML or Ontop’s bespoke mapping language, OBDA.
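A toy sketch of what such a virtualization layer does conceptually: a mapping ties an RDF predicate to a table and its columns, and a SPARQL triple pattern is rewritten into SQL at query time. All names here are hypothetical, and Ontop’s real rewriting engine is of course far richer:

```python
# Hypothetical mapping: which table/columns back a given RDF predicate.
MAPPING = {
    "http://example.com/email": {
        "table": "users",
        "subject_col": "id",
        "object_col": "email",
    },
}

def triple_pattern_to_sql(predicate):
    """Rewrite the pattern `?s <predicate> ?o` into a SQL query string."""
    m = MAPPING[predicate]
    return f"SELECT {m['subject_col']}, {m['object_col']} FROM {m['table']}"

# The SPARQL pattern `?s <http://example.com/email> ?o` becomes plain SQL;
# the rows stay in the relational database and are fetched on demand.
sql = triple_pattern_to_sql("http://example.com/email")
print(sql)
```

Because the SQL runs against the live database, results always reflect the current relational data – which is exactly the appeal of virtualization.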

If you change your mind and decide you do want to store the data, Ontop can be used as a tool to ingest data from SQL into GraphDB, which greatly improves the performance of non-trivial queries over this data. It is also useful when you want to use some of GraphDB’s advanced capabilities such as the semantic similarity index, autocomplete, RDF rank, secondary indices via FTS connectors, etc.

Conclusion

The world doesn’t run on RDF – yet. But maybe you want a part of it to do so. With the tools described here – R2RML, TARQL, OntoRefine, smart updates and Ontop – you are one big step closer to that goal.

You don’t need to search for the golden needle in the haystack, nor spend time developing an ETL process from scratch. We have already done the research and are more than ready to give you a hand on your RDF journey.


Solution/System Architect at Ontotext

Radostin Nanov has a MEng in Computer Systems and Software Engineering from the University of York. He joined Ontotext in 2017 and progressed through many of the company's teams as a software engineer working on the Ontotext Cognitive Cloud, GraphDB and finally Ontotext Platform before settling into his current role as a Solution Architect in the Knowledge Graph Solutions team.
