The New Cache on the Block: A Caching Strategy in GraphDB To Better Utilize Memory

April 30, 2017 · 7 min read · Nikola Petrov

New Global Page Cache

The ability to seamlessly integrate datasets, and the speed at which this can be done, are mission critical when working with big data. The cache, the component that keeps frequently used statements in memory to eliminate disk operations and speed up the database, is crucial for the performance of your data project.

Aware of its importance, we have been working on a faster, smarter and more adaptive caching system that solves the issues of GraphDB's old caching strategy.

In this blog post you will find out what these issues were and how we worked them out to create the new cache and enable better GraphDB performance for our users.



The Old Caching Strategy and Its Problems

To understand the new page cache design and the opportunities it opens, let’s take a look at the old caching strategy and its problems. In a nutshell, some of the major challenges when using the old caching strategy were:

  • The user had to manually estimate the memory allocation for each repository collection.
  • As a result, the memory utilization was far from optimal.
  • Dealing with multiple repositories made the configuration challenge even bigger.

The old approach is illustrated in the repository configuration documentation of the old releases, where a diagram explains how to size the repository caches in order to utilize memory properly.

Old Page Cache Strategy

Memory Utilization Issues

Although it is not obvious from the diagram, when you set the tuple-index-memory parameter, it was evenly split between collections as a result of an internal design decision. This meant that if you allocated 2GB of memory to the database and enabled the context indices, each collection would still get 1/4 of that memory (512MB each for POS, PSO, PCSO and PCOS), regardless of the fact that your queries were hitting the POS collection more.

Another design decision was to split that amount between the read and write caches, so if you were ingesting a big file, it would use only 1/2 of the memory for that collection. This resulted in 256MB for the write cache and 256MB for the read cache of each collection. As you can imagine, this was not ideal memory utilization: if all your queries were hitting only the POS index, you would get just 256MB of memory for caching information from the disk, when in fact you had allocated 2GB.
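The old fixed split boils down to simple arithmetic. A minimal sketch (the class and method names here are illustrative, not GraphDB internals):

```java
public class OldCacheSplit {
    // Models the old fixed split: tuple-index-memory divided evenly across
    // the statement collections, then halved into read and write caches.
    static long perCollectionReadCacheMb(long tupleIndexMemoryMb, int collections) {
        long perCollection = tupleIndexMemoryMb / collections; // even split across indices
        return perCollection / 2;                              // half for reads, half for writes
    }

    public static void main(String[] args) {
        // 2048MB with context indices enabled: POS, PSO, PCSO, PCOS
        System.out.println(perCollectionReadCacheMb(2048, 4)
                + "MB usable read cache per collection"); // 256MB, out of 2GB allocated
    }
}
```

This makes the waste obvious: a read-heavy workload hitting only POS is capped at 1/8 of the configured memory.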

Unnecessary Complexity

Another historical quirk was that the predLists collection had a separate parameter, simply because most queries didn't hit it. This made things even more complicated for users, who had to know their use case and queries very well to be able to tune the database memory.

Hurdles with Multiple Repositories

Something else we received a lot of complaints about was that with multiple repositories, it became practically impossible to size the memory. If you had distributed your memory and then decided to add another repository, you had to resize all the rest.

Smarter Database Caching To Solve Them All

In the release notes for GraphDB 7.2 (released in 2016), we included the following as one of the most important changes in the database:

Smarter database caching: Now all server repositories share a common cache pool that adapts to the various patterns of data access. This speeds up substantially the overall read and write performance by reducing the number of I/O operations.

As a result, we saw a considerable speedup in both loading and query time on the SPB benchmark.

To quote Phil Karlton:

There are only two hard things in Computer Science: cache invalidation and naming things.

So you might imagine that we spent quite a while developing the cache to make it bug-free and to ensure that cached copies in the system reflected the updated data while keeping resource utilization as low as possible.

The newly developed caching is stress tested and we believe it to be rock solid. It is enabled by default and no manual configuration transition is required.

Better and Faster Performance with the New 7.2 Global Page Cache

The new global page cache addresses all of these problems from GraphDB 7.2 onward with the following design:

New Page Cache Strategy

As you can see, all collections in all repositories now use a single central chunk of memory. The cache object is implemented with the help of the remarkable caching Java library Caffeine. It uses the W-TinyLFU eviction algorithm, a frequency-aware refinement of LRU that we have found to suit us well.
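W-TinyLFU combines a frequency sketch with windowed LRU segments, but the basic eviction idea can be illustrated with a plain LRU page cache built on the JDK's LinkedHashMap. This is a simplified stand-in, not GraphDB's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal LRU page cache: an access-ordered LinkedHashMap evicts the
// least recently used entry once capacity is exceeded.
class LruPageCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxPages;

    LruPageCache(int maxPages) {
        super(16, 0.75f, true); // accessOrder = true -> iteration order is LRU order
        this.maxPages = maxPages;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put; evict the eldest (least recently used) page
        // when the cache grows past its budget.
        return size() > maxPages;
    }
}
```

With Caffeine itself, the equivalent object is built via `Caffeine.newBuilder().maximumSize(n).build()`, and the library handles the W-TinyLFU admission and eviction policy internally.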

By default, the new cache takes 50% of the JVM heap size. So, if you start your database with -Xmx2g (the JVM parameter for the maximum heap size), it will take 1GB for page caching. If you know that there won't be that many GROUP BY queries eating the other memory, you can easily change the parameter for the whole GraphDB instance with <amount-of-memory-for-caching>.
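The default budget is easy to reason about from inside the JVM. A hypothetical sketch (GraphDB's real sizing logic is internal; only the 50%-of-heap default comes from the source):

```java
public class PageCacheBudget {
    // Default page cache budget: half of the JVM's maximum heap.
    static long defaultCacheBytes(long maxHeapBytes) {
        return maxHeapBytes / 2;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // e.g. ~2GB when started with -Xmx2g
        System.out.println("Default page cache budget: " + defaultCacheBytes(maxHeap) + " bytes");
    }
}
```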

How the New Global Page Cache Made John’s Life Easier


Meet John Doe.

John wants two separate repositories: one for integrating his data from different sources and one that can be queried for production-ready data.

He has bought a server with 20GB of memory, so he starts GraphDB with -Xmx10g and creates two separate repositories. Note that John doesn't have to specify different memory parameters for his two repositories.

Throughout the day, John is running a bunch of queries on repository1 that are simple enough and are hitting the POS and PSO indices:

SELECT ?s where {
    # This is hitting the POS index, because we know the predicate and the object
    ?s rdf:type :CreativeWork
}

SELECT ?o where {
    # This is hitting the PSO index, because we know the predicate and the subject
    <urn:subject> rdf:type ?o
}

If a bunch of variations of these queries are run, the cache will fill with read pages from the POS and PSO collections. Everything is good: the cache keeps the most used pages hot in memory, and we may not need to touch the disk at all.

If in the middle of the day, John runs some other queries:

SELECT ?s ?o {
    # This will hit the PCOS index, because we know the predicate and the context
    GRAPH <urn:context> {
        ?s rdf:type ?o
    }
}

the cache will start to fill up with read pages from the PCOS index. They will force the cache to remove the least recently used pages from POS or PSO. Again, the system will respond to the user and will keep the most recently used pages hot in the cache.

Now at night, John runs a bunch of processes that start to import new versions of his datasets into the other repository. GraphDB will quickly throw out the read pages from repository1 and start filling the cache with dirty pages, which results in fewer flushes to the disk and much better resource utilization.

If for some reason read queries come in the middle of the night, the latest edition of GraphDB will quickly flush dirty pages to the disk to free up memory space.

Needless to say, John’s life just got easier.

The new caching strategy in GraphDB now utilizes memory much better and will lead to greater performance.

What's more, GraphDB users don't need to migrate their data; they can simply upgrade to a recent GraphDB version and enjoy the benefits of a better, faster and smarter caching system.

Want to see for yourself? Download GraphDB Free and give it a try today!


Nikola is a software architect who is open to new technologies and work methodologies. His interests are in distributed systems, exploiting new technologies to gain business value, different programming languages, semantic web, deployment and configuration management.