Ontotext

OWLIM Benchmarking: Linked Open Data

RDF is an excellent format for data integration, particularly when the data originates from a wide variety of sources. The Linked Open Data initiative is a truly global example of data integration, combining diverse and vast data sets, such as BBC Music, BBC Programs, DBpedia, FOAF profiles, Freebase, Geonames, MusicBrainz, PubMed, US Census Data, and many others.

BigOWLIM is an excellent choice for storing these huge datasets and, with advanced features such as RDF rank, RDF priming and full-text search, it makes an unbeatable data mining platform.

One example of such a platform is FactForge (previously known as the Linked Data Semantic Repository, LDSR). It combines a number of popular (and very large) datasets that can be explored using SPARQL or a combination of powerful full-text search with ranking, all of which is powered by BigOWLIM.

For further information about the range of datasets that are combined, their sizes and the numbers of entities they include, see the FactForge statistics page.

The datasets are loaded in the following order: schemata and ontologies, DBpedia (categories), DBpedia (sameAs), UMBEL, Lingvoj, CIA Factbook, WordNet, Geonames, DBpedia core, Freebase, MusicBrainz. The total loading time is 93 hours, of which Freebase and MusicBrainz take 36 hours each. Our experience with the FactForge datasets indicates that:

  • reasoning with linked data (and in particular with these datasets) is much more complex than with synthetic tests such as LUBM;
  • forward-chaining and materialization are nevertheless feasible for such data.

Even though these results are not really comparable to those of any other system that uses some of these Linked Open Data datasets, we nevertheless provide some information about query execution performance. The table below presents results from query performance benchmarking, organized as follows: the BSBM framework was used as a basis; the repository had to evaluate exploration queries identical to those required to explore URIs at FactForge; the task included 5000 query evaluations, using about 3000 URIs randomly selected from the set of all URIs at FactForge.
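The workload described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual benchmark driver: the query template, the function names and the example URIs are hypothetical, since the text does not give the real FactForge exploration queries or how BSBM distributed them among clients.

```python
import random

# Hypothetical exploration query: list every outgoing statement of one URI.
# The actual FactForge exploration queries are not given in the text.
EXPLORE_TEMPLATE = "SELECT ?p ?o WHERE {{ <{uri}> ?p ?o }}"


def build_workload(all_uris, n_uris=3000, n_queries=5000, seed=42):
    """Pick n_uris at random, then draw n_queries queries over that pool."""
    rng = random.Random(seed)            # fixed seed keeps runs repeatable
    pool = rng.sample(all_uris, n_uris)  # sampling without replacement
    return [EXPLORE_TEMPLATE.format(uri=rng.choice(pool))
            for _ in range(n_queries)]


def split_among_clients(workload, n_clients=8):
    """Deal the queries round-robin to the concurrent clients (an assumption)."""
    return [workload[i::n_clients] for i in range(n_clients)]
```

Calling `build_workload` with any list of at least 3000 URIs yields 5000 query strings; `split_among_clients` then gives one slice per client.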

Metric                                        Value
Number of warmup runs                         1
Number of clients                             8
Number of query mix runs (without warmups)    20000
min/max query mix runtime                     0.0000 s / 9.7567 s
Total runtime (sum)                           15407.688 s
Total actual runtime                          1932.337 s
CQET (average runtime of query mix)           0.77038 s
CQET (geometric mean runtime of query mix)    0.00000 s
AQET (arithmetic mean)                        0.770384 s
AQET (geometric mean)                         0.000000 s
Average result size                           396083.32 bytes
min/max result size                           0 / 1429497 bytes
QPS (queries per second)                      10.35
minQET/maxQET                                 0.00000000 s / 9.75666709 s
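The figures above are related by simple arithmetic, which the following sketch makes explicit. It is not BSBM code; the function names are hypothetical and the formulas are assumptions that happen to be consistent with the reported numbers: the arithmetic mean is the summed runtime divided by the number of runs, QPS is the number of runs divided by the actual (wall-clock) runtime, and a geometric mean is zero as soon as one measured runtime is zero, which would match the reported minimum of 0.0000 s.

```python
import math


def summarize(runtimes, wall_clock_seconds):
    """Derive the summary metrics from a list of per-query runtimes (seconds)."""
    n = len(runtimes)
    total = sum(runtimes)
    if min(runtimes) <= 0.0:
        geom = 0.0  # a single zero runtime forces the geometric mean to zero
    else:
        geom = math.exp(sum(math.log(t) for t in runtimes) / n)
    return {
        "total": total,
        "aqet": total / n,                # arithmetic mean runtime
        "aqet_geom": geom,                # geometric mean runtime
        "qps": n / wall_clock_seconds,    # throughput over wall-clock time
    }


def check_reported(runs=20000, total_sum=15407.688, actual=1932.337):
    """Recompute the reported figures from the totals in the table."""
    return {
        "aqet": total_sum / runs,           # about 0.7704 s
        "qps": runs / actual,               # about 10.35
        "parallelism": total_sum / actual,  # just under 8, the client count
    }
```

The ratio of summed runtime to actual runtime comes out just under 8, in line with eight clients issuing queries concurrently.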