OWLIM References

This page provides references to OWLIM from independent evaluations and benchmarks.

Press Association Project for Image Retrieval

In the context of a Press Association project for commercial image retrieval and browsing, a team benchmarked several semantic repository products, evaluating the quality and completeness of the results as well as speed and scalability. The systems included in the evaluation were AllegroGraph, BigOWLIM, ORACLE, Sesame, Jena TDB and Virtuoso. An excerpt from the conclusion is provided below.

"In our tests, BigOWLIM provides the best average query response time and answers the maximum number of queries for both the datasets ... it is clear to see that execution speed-wise BigOWLIM outperforms AllegroGraph and Sesame for almost all of the dataset queries."

Source: A Pragmatic Approach to Semantic Repositories Benchmarking. Thakker, D., Osman, T., Gohil, S., Lakin, P. In Proc. of the 7th Extended Semantic Web Conference (ESWC 2010).

Berlin SPARQL Benchmark

The third version of the BSBM benchmark aims to provide a more comprehensive framework for measuring the all-round performance of semantic repositories. Among other improvements, the benchmark is extended with two new scenarios (query mixes): "Explore and Update" and "Business Intelligence". These are based on new features of the SPARQL 1.1 specification, which allows update queries (e.g. INSERT) and aggregate expressions (e.g. SUM and AVG).

The most recent BSBM results report on the performance of the leading semantic repositories (BigOWLIM, Virtuoso, 4store, BigData and Jena TDB) on 100M and 200M datasets. BigOWLIM was the top performer on:

  • loading time - BigOWLIM loads the 100M and the 200M datasets almost twice as fast as the next best product;
  • query performance among those repositories that can handle update and multi-client query tasks.

Because BSBM, and in particular its third version, is a comprehensive benchmark that tests various aspects of semantic repository performance, the raw results require analysis and interpretation. It is also important to understand the overall organization of the comparison.

The evaluation was organized and run by FU Berlin, the benchmark authors, in a developer-friendly manner. The repository developers were notified about the new version of the benchmark in November 2010 and were asked to prepare for evaluation in January. In the course of the evaluation runs, the evaluation team took care to communicate extensively with the repository developers, to make sure that the best configuration of each repository was used and that "minor technical problems" were solved. The developers were allowed to provide fixed and updated versions of their repositories if and when they wished, in order to address a problem or to add an extra optimization. As a result, when no useful results could be derived for a specific repository on a specific task, this indicates a real problem: application developers are very likely to fail to make use of that functionality as well.

In this context, it is unfortunate that all the repositories had problems with the Business Intelligence use case, so the organizers of the evaluation did not publish results for it. Speaking of all-round performance, some of the engines show good results in various categories but cannot handle essential requirements such as updates and multi-client loads. One possible reason is that the engines were configured to maximize the results of specific tasks, at the cost of problems with other tasks. Whatever the reasons, such partial top results are about as representative of real-world performance as the races of the Late Model series of NASCAR. Potential commercial users are likely to be influenced more by a database's ability to meet requirements across all the important usage scenarios than by top performance in any one category.

Understanding the overall performance requires consolidating the results from the different tasks, using weighting factors that reflect their importance across the most popular usage scenarios. For instance, while load performance is not expected to be crucial in the usage scenarios addressed by BSBM, multi-client loads and handling continuous updates are quite important for enterprise deployments; single-user performance, on the other hand, has lower practical importance. One such scheme for interpreting the overall results is presented here.
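
Such a consolidation can be sketched as a weighted average; the task names, weights and scores below are hypothetical, chosen only to illustrate the idea, not BSBM's published figures:

```python
# Hypothetical importance weights per benchmark task (must sum to 1.0)
weights = {
    "load": 0.1,                    # bulk loading: rarely critical in production
    "explore_multi_client": 0.4,    # concurrent query load: key for enterprise use
    "update": 0.3,                  # continuous updates: also important
    "explore_single_client": 0.2,   # single-user queries: lower practical weight
}

def overall_score(task_scores, weights):
    """Weighted average of per-task scores (0-100); a task the
    repository cannot run at all contributes a score of 0."""
    return sum(w * task_scores.get(task, 0.0) for task, w in weights.items())

# Illustrative repositories: A is solid everywhere; B tops one category
# but cannot handle updates or multi-client loads.
repo_a = {"load": 95, "explore_multi_client": 70,
          "update": 80, "explore_single_client": 60}
repo_b = {"load": 60, "explore_single_client": 98}

print(overall_score(repo_a, weights))  # 73.5
print(overall_score(repo_b, weights))  # 25.6
```

Under this (assumed) weighting, the all-rounder clearly beats the single-category champion, which is the point the NASCAR comparison above is making.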

Source: BSBM V3 Results (February 2011). Bizer, Ch., Schultz, A.

Reasoner Completeness Evaluation by the Oxford University Computing Lab

An evaluation was conducted in order to determine the completeness of inference supported by several reasoners:

  • The systems included in the evaluation were HAWK, SwiftOWLIM, Sesame and Minerva
  • A tool called SyGENIA was used to demonstrate that the LUBM benchmark setup and its queries do not provide any guarantee about the completeness of the inference capabilities of the engines
  • The results of the experiments show that SwiftOWLIM provides the most complete inference of all the evaluated systems

Source: How Incomplete Is Your Semantic Web Reasoner? Stoilos, G., Grau, B. C., Horrocks, I. In Proc. of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), 2010.