Keeping a historical record of the database state is a common challenge. Before you pick an approach, keep in mind that each one comes with its own advantages and disadvantages.
With that in mind, GraphDB offers at least three solutions to suit your needs. If none of them quite fits your use case, we also offer a flexible plugin API that lets you write your own custom logic.
This is the most flexible approach and is highly recommended when you have full control over your ingestion. In that scenario, you can keep your data in timestamped named graphs. For example, data ingested on the 1st of May 2022 can be kept in the named graph <http://example.org/2022/05/01>. If you want finer granularity, you can use something more specific, like a precise timestamp: <http://example.org/1651408271>.
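As a minimal sketch, an update like the following (the subject, predicate, and object IRIs are hypothetical placeholders) stores everything ingested on a given date in that date's graph:

```sparql
# Keep data ingested on 2022-05-01 in a graph named after the date
INSERT DATA {
  GRAPH <http://example.org/2022/05/01> {
    <http://example.org/subject> <http://example.org/predicate> <http://example.org/object> .
  }
}
```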
Now, this might be insufficient – you may want to use named graphs for another purpose, or you may not want to place each instance in its own named graph. In that case, you can attach timestamps directly to each instance with triples such as <http://example.org/createdAt/> "2022-05-01T12:21:03Z"^^xsd:dateTime and <http://example.org/lastUpdated/> "2022-05-01T12:25:03Z"^^xsd:dateTime.
The approach above timestamps the instance as a whole. If you want to timestamp individual triples, you have to combine this with RDF-star and nested triples:
```turtle
<< <http://example.org/subject> <http://example.org/predicate> <http://example.org/object> >>
    <http://example.org/lastUpdated/> "2022-05-01T12:25:03Z"^^xsd:dateTime .
```
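Such a nested triple can then be retrieved with SPARQL-star, which GraphDB supports. For example, to find when a given statement (using the same example IRIs as above) was last updated:

```sparql
SELECT ?updated WHERE {
  << <http://example.org/subject> <http://example.org/predicate> <http://example.org/object> >>
      <http://example.org/lastUpdated/> ?updated .
}
```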
If you want to achieve your goals by modeling, you need to control the whole ingestion process.
The change tracking plugin effectively timestamps every triple as it is ingested. When you enable the plugin for a transaction, you specify an in-memory named graph. All triples persisted in that transaction are added both to this tracking graph and to the graph they were originally intended for. This sidesteps the issue of using named graphs for multiple purposes.
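A transaction that enables tracking and then performs the actual update might look roughly like the sketch below. Note that the control predicate IRI here is illustrative only, not the plugin's actual one – consult the change tracking plugin documentation for the real IRI:

```sparql
# Illustrative only: enable change tracking for this transaction
# (the real control predicate is defined by the plugin)
INSERT DATA {
  [] <http://example.org/track-changes> "tx-graph-1"
};
# The actual update, executed in the same transaction
INSERT DATA {
  <http://example.org/subject> <http://example.org/predicate> <http://example.org/object> .
}
```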
Note the use of a semicolon to separate the two SPARQL requests within the same transaction. You can then access the two special "added" and "removed" named graphs to check what happened in that transaction.
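Inspecting the tracked changes is then an ordinary graph query. The graph IRI below is an illustrative placeholder – the actual names of the "added" and "removed" graphs are defined by the change tracking plugin:

```sparql
# Illustrative graph IRI: the real "added"/"removed" graph names
# come from the change tracking plugin
SELECT ?s ?p ?o WHERE {
  GRAPH <http://example.org/track-changes/added> { ?s ?p ?o }
}
```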
The advantage of this approach is that it is easy to intercept all requests to the database and plug in a pre-commit step that enables the change tracking plugin. The disadvantage is that those graphs live in memory: if you need them to survive a shutdown, you must persist them yourself.
In cases where you don't have any control over the ingestion, or where you don't want to deal with modeling issues, you can use the history plugin instead. The history plugin allows you to keep track of specific triples and instance types. It is global – it keeps track of all transactions made by all users. Since it is fully automated, you won't have to touch anything after configuring the plugin. The downside is that it is the least flexible of the three approaches.
The history plugin creates a new index. By default, GraphDB stores data in a PSO (predicate-subject-object) and a POS index. With the history plugin enabled, a DSPOCI index is kept as well.
Naturally, such an index increases your database size. Assuming you keep historical data for all your triples and use only the default indexes, your database size would nearly double. To manage this, you can trim and compact the history index.
You can configure the history plugin to filter for specific data. It's unlikely that you want history for every triple – tracking a few key objects is usually enough. Filtering can be applied to each position: subject, predicate, object, or context. Each filter can be either:
Once you have ingested data, you can query it like this:
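As a rough sketch, such a query reads the history through a dedicated graph. The graph and predicate IRIs below are illustrative placeholders – the actual IRIs exposed by the history plugin are given in its documentation:

```sparql
# Illustrative IRIs: the real history graph and timestamp predicate
# are defined by the history plugin
SELECT ?s ?p ?o ?timestamp WHERE {
  GRAPH <http://example.org/history> {
    ?s ?p ?o .
    ?s <http://example.org/timestamp> ?timestamp .
  }
}
ORDER BY DESC(?timestamp)
```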
You can combine the three options however you like. Additionally, you can pair any of them with the audit log to track down the user who made a change.