Performing inference against multiple datasets from LOD is not a trivial task. This page provides information about the specific reasoning approach undertaken in FactForge. This includes the selected entailment semantics and optimizations of the inference processes.
The reasonability criteria for FactForge are defined with respect to OWL 2 RL. FactForge allows forward-chaining entailment and consistency checking within O(n log n) space and time. The integrated dataset of FactForge is consistent with respect to OWL 2 RL. Most of the results of the inference comply with common sense without specific assumptions about the context of interpretation.
The entailment semantics uses forward-chaining to materialize the statements which could be entailed from the explicit data in FactForge based on the ontologies used by the data publishers. Reasoning is performed in BigOWLIM with respect to ruleset owl-max of OWLIM, which delivers a combination of RDFS with incomplete OWL Lite. The approach of Herman ter Horst to support the semantics of the OWL primitives in a tractable logical fragment is used in the form of Datalog-like Horn clauses. The various dialects of OWL are described here.
The standard reasoning behavior of OWLIM is to update the deductive closure when a transaction is committed to the repository. When new statements are introduced, they are added to the repository alongside the explicit statements from previous transactions and their closure. Forward-chaining is then performed with respect to the rules from the selected rule-set: it infers and adds to the repository all statements that are inferable from the repository in its current state. This allows for efficient incremental updates of the deductive closure. Consistency checking is performed by applying the checking rules after all new statements have been added and the deductive closure has been updated. When statements are deleted, the deductive closure is updated in order to withdraw statements that can no longer be inferred from the new state of the repository.
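The commit-time behavior described above can be illustrated with a minimal sketch. It is not OWLIM code: the ToyRepository class, the ex: names and the single transitivity rule are hypothetical, consistency checking is not modeled, and deletion is handled by naively recomputing the closure from the remaining explicit statements, which is an assumption made for brevity rather than a description of how OWLIM withdraws statements.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Rule = Callable[[Set[Triple]], Iterable[Triple]]

PARENT = "ex:parentFeature"  # hypothetical transitive property


def transitive_parent(closure: Set[Triple]) -> Set[Triple]:
    """Toy rule: (a P b) and (b P c) entail (a P c)."""
    pairs = [(s, o) for s, p, o in closure if p == PARENT]
    return {(a, PARENT, d) for a, b in pairs for c, d in pairs if b == c}


class ToyRepository:
    """Hypothetical store that keeps explicit statements and their closure."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = list(rules)
        self.explicit: Set[Triple] = set()
        self.closure: Set[Triple] = set()

    def _extend_closure(self) -> None:
        # Apply every rule until no rule produces a new statement.
        changed = True
        while changed:
            changed = False
            for rule in self.rules:
                new = set(rule(self.closure)) - self.closure
                if new:
                    self.closure |= new
                    changed = True

    def commit_add(self, triples: Iterable[Triple]) -> None:
        # The existing closure is kept; only new consequences are added.
        added = set(triples)
        self.explicit |= added
        self.closure |= added
        self._extend_closure()

    def commit_delete(self, triples: Iterable[Triple]) -> None:
        # Assumption: naive recomputation from the remaining explicit data.
        self.explicit -= set(triples)
        self.closure = set(self.explicit)
        self._extend_closure()

    def implicit(self) -> Set[Triple]:
        return self.closure - self.explicit


repo = ToyRepository([transitive_parent])
repo.commit_add([("ex:Vienna", PARENT, "ex:ViennaState")])
repo.commit_add([("ex:ViennaState", PARENT, "ex:Austria")])
print(repo.implicit())  # {('ex:Vienna', 'ex:parentFeature', 'ex:Austria')}
repo.commit_delete([("ex:ViennaState", PARENT, "ex:Austria")])
print(repo.implicit())  # set()
```

The second commit reuses the closure left by the first one, which is the point of the incremental update; only the deletion path falls back to a full recomputation in this sketch.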
Below we present several optimizations which speed up the loading, inference and query evaluation in FactForge.
owl:sameAs is a predicate which is used to encode that two different URIs denote one and the same resource. Most often, it is used to align the different identifiers of one and the same real-world entity across different datasets and data sources. owl:sameAs is heavily used for linking the different datasets in the Linking Open Data (LOD) initiative, and can be considered the most important OWL predicate when it comes to merging data from different data sources. Here are its effects.
The URI of Vienna in DBpedia is http://dbpedia.org/page/Vienna, while in Geonames its URI is http://sws.geonames.org/2761369/. In DBpedia, there is a statement
(S1) dbpedia:Vienna owl:sameAs geonames:2761369
which declares that the two URIs are equivalent.
According to the formal definition of OWL 2 RL, whenever two URIs are declared to be equivalent, all statements which involve one of them should be "replicated" with the other URI as well. The inferencing process goes as follows.
The city of Vienna with URI http://sws.geonames.org/2761369/ in Geonames is defined as part of the first-order administrative division in Austria with the same name and with URI http://www.geonames.org/2761367/. This division is in turn part of the country Austria with URI http://www.geonames.org/2782113. This makes for the following RDF statements:
(S2) geonames:2761369 gno:parentFeature geonames:2761367
(S3) geonames:2761367 gno:parentFeature geonames:2782113
As gno:parentFeature is a transitive relationship, in the course of the initial inference, OWLIM will derive that the city of Vienna is also part of Austria, i.e.:
(S4) geonames:2761369 gno:parentFeature geonames:2782113
Due to the semantics of owl:sameAs, OWLIM will then infer from (S1) that statements (S2) and (S4) also hold for Vienna when it is referred to by its DBpedia URI, i.e.:
(S5) dbpedia:Vienna gno:parentFeature geonames:2761367
(S6) dbpedia:Vienna gno:parentFeature geonames:2782113
Considering that Austria also has an equivalent URI in DBpedia, i.e.
(S7) geonames:2782113 owl:sameAs dbpedia:Austria
OWLIM will also infer that:
(S8) dbpedia:Vienna gno:parentFeature dbpedia:Austria
(S9) dbpedia:Austria gno:parentFeature dbpedia:Austria
and that
(S10) geonames:2761369 gno:parentFeature dbpedia:Austria
(S11) geonames:2761367 gno:parentFeature dbpedia:Austria
All these facts are true and ensure that queries over the RDF data return the same results regardless of which of the equivalent URIs was used in the explicit statements.
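The interplay of transitivity and owl:sameAs in this walkthrough can be reproduced with a small fixpoint computation. The following is a naive sketch, not the way OWLIM evaluates its rules: the naive_closure function is hypothetical, it models only the symmetry of owl:sameAs (applied on the fly) and the replication of statements across equivalent URIs, and it does not generate the reflexive or transitive owl:sameAs statements themselves. Starting from the four explicit statements, it derives (S4), (S5), (S6), (S8), (S10) and (S11).

```python
PARENT = "gno:parentFeature"
SAME_AS = "owl:sameAs"

explicit = {
    ("dbpedia:Vienna", SAME_AS, "geonames:2761369"),    # (S1)
    ("geonames:2761369", PARENT, "geonames:2761367"),   # (S2)
    ("geonames:2761367", PARENT, "geonames:2782113"),   # (S3)
    ("geonames:2782113", SAME_AS, "dbpedia:Austria"),   # (S7)
}


def naive_closure(triples):
    """Repeat two toy rules until no new statement appears."""
    closure = set(triples)
    while True:
        # owl:sameAs pairs, made symmetric for the lookup below.
        same = {(s, o) for s, p, o in closure if p == SAME_AS}
        same |= {(o, s) for s, o in same}

        new = set()
        # Rule 1: replicate each non-sameAs statement across equivalent URIs.
        for s, p, o in closure:
            if p == SAME_AS:
                continue
            for a, b in same:
                if s == a:
                    new.add((b, p, o))
                if o == a:
                    new.add((s, p, b))

        # Rule 2: gno:parentFeature is transitive.
        pairs = [(s, o) for s, p, o in closure if p == PARENT]
        new |= {(a, PARENT, d) for a, b in pairs for c, d in pairs if b == c}

        if new <= closure:
            return closure
        closure |= new


inferred = naive_closure(explicit) - explicit
for statement in sorted(inferred):
    print(statement)
```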
It is thus clear that the equivalence operator owl:sameAs generates plenty of new statements even for equivalence declared between just two URIs in two distinct datasets. Seven new statements are generated just by the two declarations of equivalence, one between Vienna in DBpedia and Vienna in Geonames and one between Austria in DBpedia and Austria in Geonames, which makes for a 175% increase over the four explicit statements. The number of explicit and implicit statements will vastly increase as equivalences are declared between URIs in additional datasets. As an equivalence operator, owl:sameAs is transitive, reflexive, and symmetric; thus, a set of N equivalent URIs will generate N^2 owl:sameAs statements, one for each ordered pair of them. For instance, Vienna also has a URI in UMBEL, which is also declared equivalent to the URI in DBpedia. This makes for another 4 implicit statements.
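The quadratic growth of the owl:sameAs statements themselves can be checked with a few lines of code. This is only an illustration of the counting argument; the same_as_statements function is hypothetical, and umbel:Vienna is a placeholder name for the UMBEL URI, which is not given in the text.

```python
from itertools import product


def same_as_statements(uris):
    """All owl:sameAs statements implied within one equivalence class.

    Reflexivity, symmetry and transitivity together mean that every
    ordered pair of members, including (x, x), is related.
    """
    return {(a, "owl:sameAs", b) for a, b in product(uris, repeat=2)}


vienna = ["dbpedia:Vienna", "geonames:2761369"]
print(len(same_as_statements(vienna)))                     # 4
print(len(same_as_statements(vienna + ["umbel:Vienna"])))  # 9
```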
Although owl:sameAs is useful for interlinking RDF datasets, its semantics causes considerable inflation of the number of implicit facts that should be considered during inference and query evaluation. This has performance implications and requires optimization.
The loading of FactForge benefits considerably from a specific feature of the BigTRREE engine, which allows it to handle owl:sameAs statements efficiently. In its indices, each set of equivalent URIs (an equivalence class with respect to owl:sameAs) is represented by a single super-node. This way, BigTRREE can still enumerate all statements that should be inferred through the equivalence, but it does not have to inflate its indices. This approach can be considered a sort of partial materialization. BigOWLIM nevertheless takes special care to make sure that this trick does not hinder the ability to distinguish explicit from implicit statements.
This optimization allows OWLIM to efficiently handle large datasets where owl:sameAs is extensively used. In the case of FactForge, this technique allows OWLIM to deal with more than 7 billion statements at the computational costs required for 860 million statements.
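The idea of one super-node per equivalence class can be sketched with a union-find structure. This is an illustrative assumption about how such an index might look, not the BigTRREE implementation: the SameAsIndex class and its methods are hypothetical, predicates are not canonicalized, and the distinction between explicit and implicit statements is not tracked.

```python
class SameAsIndex:
    """Toy index that stores statements over class representatives only."""

    def __init__(self):
        self._parent = {}         # union-find forest over URIs
        self._statements = set()  # statements between representatives

    def _find(self, uri):
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node at the root.
        while self._parent[uri] != root:
            nxt = self._parent[uri]
            self._parent[uri] = root
            uri = nxt
        return root

    def _canonical(self, triple):
        s, p, o = triple
        return (self._find(s), p, self._find(o))

    def add_same_as(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[rb] = ra
            # Re-map stored statements onto the merged super-node.
            self._statements = {self._canonical(t) for t in self._statements}

    def add(self, s, p, o):
        self._statements.add(self._canonical((s, p, o)))

    def stored_count(self):
        return len(self._statements)

    def members(self, uri):
        root = self._find(uri)
        return {u for u in list(self._parent) if self._find(u) == root}

    def expand(self):
        """Enumerate every statement implied through the equivalences."""
        for s, p, o in self._statements:
            for s2 in self.members(s):
                for o2 in self.members(o):
                    yield (s2, p, o2)


index = SameAsIndex()
index.add_same_as("dbpedia:Vienna", "geonames:2761369")
index.add_same_as("geonames:2782113", "dbpedia:Austria")
index.add("geonames:2761369", "gno:parentFeature", "geonames:2761367")
index.add("geonames:2761367", "gno:parentFeature", "geonames:2782113")
index.add("geonames:2761369", "gno:parentFeature", "geonames:2782113")

print(index.stored_count())        # 3
print(len(set(index.expand())))    # 8
```

Three stored statements stand in for eight enumerable ones in this toy case; the indices stay small while the full set of statements can still be produced on demand.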
Several features of the standard semantics of RDFS and OWL inflate the number of statements that have to be considered during materialization and/or query evaluation. These are:
<X,rdf:type,rdfs:Resource> and <P,rdf:type,rdf:Property> should be inferred for all URIs which appear as subjects in triples and for all predicates of statements;
<X,rdf:type,owl:Thing> and <X,owl:sameAs,X> should be inferred for each URI in a subject position;
owl:Thing and owl:Nothing have to be asserted to be, respectively, the super-class and the sub-class of all classes.
These features make the semantics of these languages self-contained and to some extent better grounded, allowing some wonderful meta-reasoning capabilities. However, they "produce" statements which are of no interest for most applications and usage scenarios. Effectively, they add an extra 3 statements for each URI in an RDF graph.
Given that in FactForge there are on average 3.5 explicit statements per URI, these 3 extra statements per URI appear as an unjustified overhead, especially with their limited utility. Thus, FactForge is loaded with the "partialRDFS" parameter of OWLIM which suppresses the inference of these extra statements coming from the features of the semantics of RDFS and OWL.
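To make the overhead concrete, the sketch below generates the three per-URI statements for a toy graph and computes the relative overhead from the averages quoted above. The trivial_statements function and the ex: names are hypothetical, and the sketch does not reflect how OWLIM implements or suppresses these inferences.

```python
def trivial_statements(triples):
    """The three 'self-evident' statements entailed for each subject URI."""
    out = set()
    for x in {s for s, _, _ in triples}:
        out.add((x, "rdf:type", "rdfs:Resource"))
        out.add((x, "rdf:type", "owl:Thing"))
        out.add((x, "owl:sameAs", x))
    return out


graph = {
    ("ex:Vienna", "ex:population", "1700000"),   # hypothetical data
    ("ex:Vienna", "ex:country", "ex:Austria"),
    ("ex:Austria", "ex:capital", "ex:Vienna"),
}
print(len(trivial_statements(graph)))  # 6: three for each of two subjects

# With 3.5 explicit statements per URI on average, 3 extra statements
# per URI amount to roughly 86% additional statements.
overhead = 3 / 3.5
print(f"{overhead:.0%}")  # 86%
```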
We have also "switched off" the RDFS rules which derive types of resources based on domains and ranges of properties, because plenty of properties are used without regard for (or knowledge of) their formal definitions. For instance, foaf:img has domain foaf:Person, but it is often used to denote images of all types of resources. The effect of this change on the result set was that OWLIM inferred 446 million fewer statements.
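The effect of the domain and range rules on loosely used properties can be seen in a small sketch. The infer_types function, its use_domain_range flag and the ex: resources are hypothetical and serve only to illustrate the RDFS rules in question; they do not correspond to an OWLIM configuration option.

```python
def infer_types(triples, use_domain_range=True):
    """Derive rdf:type statements from rdfs:domain and rdfs:range."""
    if not use_domain_range:
        return set()
    domains = {(s, o) for s, p, o in triples if p == "rdfs:domain"}
    ranges = {(s, o) for s, p, o in triples if p == "rdfs:range"}
    inferred = set()
    for s, p, o in triples:
        for prop, cls in domains:
            if p == prop:
                inferred.add((s, "rdf:type", cls))  # subject gets the domain
        for prop, cls in ranges:
            if p == prop:
                inferred.add((o, "rdf:type", cls))  # object gets the range
    return inferred


data = {
    ("foaf:img", "rdfs:domain", "foaf:Person"),
    # Hypothetical publisher data: foaf:img used on a building.
    ("ex:EiffelTower", "foaf:img", "ex:eiffel.jpg"),
}

print(infer_types(data))
# {('ex:EiffelTower', 'rdf:type', 'foaf:Person')}
print(infer_types(data, use_domain_range=False))
# set()
```

With the rules enabled, the building is classified as a person, which is exactly the kind of counter-intuitive result that switching the rules off avoids.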