FactForge – Fast Track to The Center of the Data Web

FactForge

FactForge (formerly LDSR) represents a reason-able view of the web of data. It aims to let users find resources and facts based on the semantics of the data, much as web search engines index WWW pages and facilitate their usage.

FactForge enables users to easily identify resources in the Linking Open Data (LOD) cloud. It provides an efficient mechanism to query data from multiple datasets and sources, taking their semantics into account. FactForge is also designed as a use case for large-scale reasoning and data integration.

FactForge includes several of the most central datasets of LOD. The OWLIM semantic repository is used to load the data and “materialize” the facts that can be inferred from it. It is probably the largest and most heterogeneous body of general factual knowledge ever used for logical inference.

The Data

FactForge has the following characteristics:

  • Datasets included: DBpedia, New York Times, MusicBrainz, Lingvoj, Lexvo, CIA World Factbook, WordNet,
    Geonames, Freebase.
  • Ontologies: several schemata used in the datasets are also loaded into FactForge: Dublin Core, SKOS, RSS, FOAF
  • Reference Layer – PROTON
  • Size: 1.8B explicit plus 1.3B inferred statements are indexed; there are 15B different retrievable statements.
  • Inference: materialization is performed with respect to the semantics of OWL-Horst optimized.

Access: Public Service at http://factforge.net

The data is accessible through a web user interface at http://factforge.net, which allows:

  • RDF Search – retrieve a ranked list of URIs related to literals that contain specific keywords
  • Exploration – traversing the data, one resource at a time
    For instance, one can “browse” Madrid with its DBpedia URI, <http://dbpedia.org/resource/Madrid> or dbpedia:Madrid
  • Evaluation of queries in SPARQL and other languages.
    For instance, to obtain a list of politicians born in Germany one can use the following SPARQL query:
(...add prefixes here...)
SELECT * WHERE {
  ?Person dbp-ont:birthPlace [ geo-ont:parentFeature dbpedia:Germany ] ;
          rdf:type dbp-ont:Politicians ;
          om:hasRDFRank ?RR .
} ORDER BY DESC(?RR)

This is an example of a structured query whose evaluation involves data from 4 datasets and interpretation of the semantics of several schemata (i.e. reasoning). Within a few seconds it returns results ranked by their PageRank in the RDF graph.

  • Reference layer – using PROTON to access FactForge datasets
The same query can be formulated using the PROTON reference layer predicates only:
(...add prefixes here...)
SELECT * WHERE {
  ?Person pext:birthPlace [ ptop:subRegionOf dbpedia:Germany ] ;
          pext:hasProfession pext:Politician ;
          om:hasRDFRank ?RR .
} ORDER BY DESC(?RR)

Note that the conceptualization of Politician in this model is a profession, whereas in the first query, Politician is
defined as a person. Additionally, executing the second query retrieves 35% more results over the entire FactForge
dataset.

A public SPARQL end-point is available at http://factforge.net/sparql, allowing FactForge to be used as a query evaluation web service.
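
As a rough sketch of how a client might prepare a call to this endpoint, the snippet below builds (but does not send) an HTTP request. It assumes the endpoint follows the standard SPARQL protocol, with the query passed in a `query` URL parameter and the result format chosen through the Accept header. The helper name `build_sparql_request` is hypothetical, and the query is a generic placeholder rather than one of the queries above.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://factforge.net/sparql"  # endpoint named in the text above


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request in the style of the SPARQL protocol.

    Assumption: the endpoint accepts the standard `query` parameter and
    honours an Accept header asking for SPARQL JSON results.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Placeholder query, not taken from the page.
request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would actually send the query; whether the service is reachable and which result formats it supports is not something this sketch can confirm.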

Credits and References

“Linked data” represents a set of principles for publishing structured data so that it can be explored and navigated in a manner analogous to the HTML WWW. The linked data concept is an enabling factor for the realization of the Semantic Web as a global web of structured data around the Linking Open Data initiative.

FactForge was initially developed as an evaluation case in the European research project LarKC. It has been extended, improved and built into the data layer infrastructure of the RENDER FP7 European research project. The development of OWLIM, as well as other relevant technology and know-how, has been supported by several projects within the FP5, FP6, and FP7 programs of the European Commission: RASCALLI, TAO, TripCom, SEKT, On-To-Knowledge.

The Linked Life Data (LLD) service is similar to FactForge. It represents a reason-able view of the life science part of LOD, including UniProt, GeneOntology, and more than 20 other datasets. FactForge and LLD are based on the same technology: the Forest semantic web front-ends and the OWLIM semantic repository. With its 5 billion explicit statements, LLD is probably the largest body of non-synthetic knowledge ever used for inference.

Notes and Disclaimers

FactForge is an experimental project from Ontotext. Access to this demonstration service is free of charge. Ontotext does not provide any guarantees of quality, availability, or fitness for a particular purpose. FactForge is far from perfect.


Reason-able Views

Reason-able views (RAV) represent a practical approach to reasoning with the web of linked data. A reason-able view is an assembly of independent datasets, which can be used as a single body of knowledge – an integrated dataset – with respect to reasoning and query evaluation. The integrated dataset is designed to meet some criteria for “reasonability”, i.e. it has specific qualities with respect to a specific reasoning task and language. Examples are “consistent with OWL Lite” or “allows RDFS entailment within O(n) time and space”.

A linked data reason-able view can be considered a special case where:

  • All the datasets in the view represent linked data
  • A single reasonability criterion is imposed on all datasets
  • Each dataset is connected to at least one of the others

Considering the size of the LOD datasets, in order to make query evaluation and reasoning practically feasible, the integrated dataset of a linked RAV should be loaded in a single repository (even if it employs some sort of distribution internally). Such a linked RAV can be considered an index, which caches parts of the LOD cloud and provides access to the datasets included in it in a manner similar to the one in which web search engines index WWW pages and facilitate their usage.

As a final practical consideration, to allow for caching and indexing, linked RAVs should include only datasets that are more or less static; this excludes various types of wrappers or virtual datasets, where RDF is generated in answer to retrieval requests (one can make an analogy with the dynamic part of the WWW).

Standard Methods of Inference

The standard methods of sound and complete inference with respect to a relatively rich flavor of the First Order Predicate Calculus (FOPC) are practically inapplicable to a web of linked data. Some of the major obstacles are:

  • The most popular FOPC fragments, such as the Description Logics (DL), count on “closed-world” models developed under centralized control. This assumption does not hold in a web context: sound and complete inference with respect to LOD-type data is heavily prone to inconsistency, which renders the results of such inference useless.
  • The semantics of languages like DL has prohibitively high computational complexity, because it requires “satisfiability” checks. As a result, the most scalable published experiments with sound and complete DL reasoning remain below 10 million statements, which is not enough for datasets of the size of LOD.
  • Some of the datasets of LOD (or some parts of them) are unsuitable for reasoning. Some data publishers seem to use the OWL and RDFS vocabulary without regard for its formal semantics, so the result of inference for some datasets is of questionable utility. For instance, one dataset contains a subject hierarchy encoded via the relation rdfs:subClassOf, with cycles that are tens of concepts long. Any reasoner following the standard semantics of rdfs:subClassOf will infer that all the concepts in the loop are equivalent. This does not seem to be the intention of the publishers.
  • Reasoning with data distributed across different web servers is possible, but much slower than reasoning with local data. The fundamental reason is the so-called “remote join” problem known from distributed database management systems (DBMS).
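
To illustrate the rdfs:subClassOf cycle mentioned in the list above, here is a small sketch that computes the transitive closure of a subclass hierarchy and reports which classes end up mutually subsumed, that is, equivalent. The hierarchy and the `ex:` names are invented for the example; this is not how any particular reasoner is implemented.

```python
def subclass_closure(edges):
    """Return the transitive closure of a set of (sub, super) pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def equivalent_classes(edges):
    """Pairs of distinct classes that are each a subclass of the other."""
    closure = subclass_closure(edges)
    return {(a, b) for a, b in closure if a != b and (b, a) in closure}


# Hypothetical subject hierarchy with a three-concept cycle.
hierarchy = {
    ("ex:Jazz", "ex:Music"),
    ("ex:Music", "ex:Arts"),
    ("ex:Arts", "ex:Jazz"),    # closes the loop
    ("ex:Opera", "ex:Music"),  # outside the loop
}
equivalents = equivalent_classes(hierarchy)
print(sorted(equivalents))
```

Every concept on the loop becomes equivalent to every other one, while `ex:Opera`, which only points into the loop, does not.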

Linking the Linked Data

Reasoning has the potential to enhance the interlinking between linked data datasets, as long as it ensures enforcement of the semantics of the links. For instance, take the link between the identifiers for Vienna in DBpedia (dbpedia:Vienna) and in Geonames (geonames:2761369), together with the statement linking Vienna to the corresponding high-level administrative region in Austria (geonames:2761367):

dbpedia:Vienna owl:sameAs geonames:2761369
geonames:2761369 gno:parentFeature geonames:2761367

From these, simple inference derives the statement:

dbpedia:Vienna gno:parentFeature geonames:2761367

This would allow this connection between the DBpedia entry of Vienna and the Geonames description of Austria to appear when exploring dbpedia:Vienna or to be considered during query evaluation.


 Web of Linked Data

Here follows a quick introduction to the notions of Semantic Web, Linked Data and the Linking Open Data initiative.

Semantic Web

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. This fosters the opportunity of creating a next generation world-wide web of structured data which are not only understandable to humans (like the typical HTML page), but also understandable by computers.

The data on the Semantic Web have explicitly defined structure (like in the databases) and semantics (like in the ontologies). This allows the computers to perform structured queries (like those in SQL) and infer new facts.

In short, the Semantic Web is an extension of the current WWW, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Linked Data

The web of linked data represents a vision of Tim Berners-Lee of a collection of open and linked structured data on the Web. It calls upon two initiatives: (a) the Linked Data initiative and (b) the Linking Open Data initiative.

Linked Data is a set of principles for publishing of structured data in the form of RDF graphs, so that they can be explored and navigated in a manner analogous to the HTML WWW and be understandable by computers. These principles are:

1. Using URIs as names of things
2. Using HTTP URIs, so that people can look up those names
3. Providing useful information when someone looks up a URI
4. Including links to other URIs, so people can discover more things

The linked data concept is an enabling factor for the realization of the Semantic Web as a global web of structured data.

Linking Open Data (LOD)

Linking Open Data (LOD) is a W3C SWEO community project. It aims to facilitate the emergence of a web of linked data, by means of publishing and interlinking open data on the Web in RDF.

The central dataset of LOD is DBpedia – an RDF extract of Wikipedia. DBpedia is a sort of hub in the LOD graph, which guarantees a certain level of connectivity. It also provides easy entry points for finding resources of interest, first in DBpedia using their Wikipedia names, and through it in the rest of the LOD network.


 FactForge Inference

Performing inference against multiple datasets from LOD is not a trivial task. This page provides information about the specific reasoning approach undertaken in FactForge. This includes the selected entailment semantics and optimizations of the inference processes.

The reasonability criteria for FactForge are defined with respect to OWL 2 RL. FactForge supports entailment and consistency checking through forward-chaining within O(n log n) space and time. The integrated dataset of FactForge is consistent with respect to OWL 2 RL. Most of the results of the inference comply with common sense without specific assumptions about the context of interpretation.

Entailment Semantics

The entailment semantics uses forward-chaining to materialize the statements which can be entailed from the explicit data in FactForge, based on the ontologies used by the data publishers. Reasoning is performed in BigOWLIM with respect to the owl-max ruleset of OWLIM, which delivers a combination of RDFS with incomplete OWL Lite. The approach of Herman ter Horst to supporting the semantics of the OWL primitives in a tractable logical fragment is used, in the form of Datalog-like Horn clauses.

The standard reasoning behavior of OWLIM is to update the deductive closure when a transaction is committed to the repository. When new statements are introduced, they are added to the repository alongside the explicit statements from previous transactions and their closure. Forward-chaining is then performed with respect to the rules of the selected ruleset: it infers and adds to the repository all statements that are inferable from the repository in its current state. This allows for efficient incremental updates of the deductive closure. Consistency checking is performed by applying the checking rules after all new statements have been added and the deductive closure has been updated. When statements are deleted, the deductive closure is updated in order to withdraw statements that can no longer be inferred from the new state of the repository.
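
The commit behavior described above can be sketched with a toy repository that knows a single rule, transitivity. This is only an illustration under simplifying assumptions: the `Repository` class and `forward_chain` function are hypothetical names, real rulesets contain many more rules, and deletion is handled here by recomputing the closure from scratch, which is the simplest correct option and not necessarily what OWLIM does.

```python
def forward_chain(statements, transitive_predicates):
    """Materialize the closure under one rule: transitivity.

    If (a p b) and (b p c) hold and p is transitive, add (a p c).
    Repeats until nothing new can be derived (a fixpoint).
    """
    closure = set(statements)
    while True:
        derived = {
            (a, p, d)
            for (a, p, b) in closure
            for (c, q, d) in closure
            if p == q and b == c and p in transitive_predicates
        }
        new = derived - closure
        if not new:
            return closure
        closure |= new


class Repository:
    """Toy store that keeps explicit statements and their closure."""

    def __init__(self, transitive_predicates):
        self.transitive = set(transitive_predicates)
        self.explicit = set()
        self.closure = set()

    def commit(self, added=(), removed=()):
        self.explicit |= set(added)
        self.explicit -= set(removed)
        if removed:
            # Simplification: recompute everything from the explicit facts.
            self.closure = forward_chain(self.explicit, self.transitive)
        else:
            # Incremental: extend the existing closure with the new facts.
            self.closure = forward_chain(self.closure | set(added), self.transitive)

    def implicit(self):
        return self.closure - self.explicit


P = "gno:parentFeature"
repo = Repository({P})
repo.commit(added=[("geonames:2761369", P, "geonames:2761367")])
repo.commit(added=[("geonames:2761367", P, "geonames:2782113")])
after_add = repo.implicit()
repo.commit(removed=[("geonames:2761367", P, "geonames:2782113")])
after_delete = repo.implicit()
print(after_add, after_delete)
```

After the second commit the store holds one implicit statement; once its supporting explicit statement is removed, the implicit statement is withdrawn as well.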

Reasoning Optimizations

Below we present several optimizations which speed up loading, inference and query evaluation in FactForge.

owl:sameAs optimization

owl:sameAs is a predicate used to state that two different URIs denote one and the same resource. Most often, it is used to align the different identifiers of the same real-world entity across different datasets and data sources. owl:sameAs is heavily used for linking the different datasets in the Linking Open Data (LOD) initiative, and can be considered the most important OWL predicate when it comes to merging data from different data sources. Its effects are illustrated below.

The URI of Vienna in DBpedia is http://dbpedia.org/resource/Vienna, while in Geonames its URI is http://sws.geonames.org/2761369/. In DBpedia, there is a statement

(S1) dbpedia:Vienna owl:sameAs geonames:2761369

which declares that the two URIs are equivalent.

According to the formal definition of OWL 2 RL, whenever two URIs are declared to be equivalent, all statements which involve one of them, should be “replicated” with the other URI as well. The inferencing process goes as follows.

The city of Vienna, with URI http://sws.geonames.org/2761369/ in Geonames, is defined as part of the first-order administrative division in Austria with the same name and with URI http://www.geonames.org/2761367/. This division is, in turn, part of the country Austria, with URI http://www.geonames.org/2782113. This makes for the following RDF statements:

(S2) geonames:2761369 gno:parentFeature geonames:2761367
(S3) geonames:2761367 gno:parentFeature geonames:2782113

As gno:parentFeature is a transitive relationship, in the course of the initial inference OWLIM will derive that the city of Vienna is also part of Austria, i.e.:

(S4) geonames:2761369 gno:parentFeature geonames:2782113

Due to the semantics of owl:sameAs, in the subsequent inference OWLIM will infer from (S1) that statements (S2) and (S4) also hold for Vienna when it is referred to by its DBpedia URI, i.e.:

(S5) dbpedia:Vienna gno:parentFeature geonames:2761367
(S6) dbpedia:Vienna gno:parentFeature geonames:2782113

Considering that Austria also has an equivalent URI in DBpedia, i.e.

(S7) geonames:2782113 owl:sameAs dbpedia:Austria

OWLIM will also infer that:

(S8) dbpedia:Vienna gno:parentFeature dbpedia:Austria

and that

(S9) geonames:2761369 gno:parentFeature dbpedia:Austria
(S10) geonames:2761367 gno:parentFeature dbpedia:Austria

All these facts are true, and they ensure that the same results are obtained when querying RDF data, regardless of which of the equivalent URIs was used in the explicit statement.

It is thus clear that the equivalence operator owl:sameAs generates plenty of new statements, even for equivalences declared between just two URIs in two distinct datasets. With only two declarations of equivalence (between Vienna in DBpedia and Vienna in Geonames, and between Austria in DBpedia and Austria in Geonames), 6 new statements (S4 to S6 and S8 to S10) are derived from the 4 explicit ones (S1, S2, S3 and S7), which makes for a 150% increase of the dataset. The number of explicit and implicit statements will increase vastly as equivalences are declared between URIs in additional datasets. As an equivalence operator, owl:sameAs is transitive, reflexive, and symmetric; thus, a set of N equivalent URIs will generate N² owl:sameAs statements, one between each pair of them. For instance, Vienna also has a URI in UMBEL, which is also declared equivalent to the URI in DBpedia. This will make for another 4 additional implicit statements.

Although owl:sameAs is useful for interlinking RDF datasets, its semantics causes considerable inflation of the number of implicit facts that should be considered during inference and query evaluation. This has performance implications and requires optimization.
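
The inflation can be reproduced with a naive materialization of the Vienna example. The sketch below starts from the explicit statements (S1), (S2), (S3) and (S7) and applies three rules until nothing new appears: transitivity of gno:parentFeature, symmetry of owl:sameAs, and replacement of a subject or object by an equivalent URI. It is a deliberately simple quadratic loop written for illustration (the function name `naive_closure` is made up) and is not the algorithm OWLIM uses.

```python
SAME_AS = "owl:sameAs"
PARENT = "gno:parentFeature"

explicit = {
    ("dbpedia:Vienna", SAME_AS, "geonames:2761369"),    # S1
    ("geonames:2761369", PARENT, "geonames:2761367"),   # S2
    ("geonames:2761367", PARENT, "geonames:2782113"),   # S3
    ("geonames:2782113", SAME_AS, "dbpedia:Austria"),   # S7
}


def naive_closure(statements, transitive=(PARENT,)):
    """Materialize transitivity plus owl:sameAs symmetry and replacement."""
    closure = set(statements)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                # Transitivity of the listed predicates.
                if p == p2 and p in transitive and o == s2:
                    new.add((s, p, o2))
                if p2 == SAME_AS:
                    # Symmetry: (a sameAs b) entails (b sameAs a).
                    new.add((o2, SAME_AS, s2))
                    # Replacement: copy statements to the equivalent URI.
                    if s == s2:
                        new.add((o2, p, o))
                    if o == s2:
                        new.add((s, p, o2))
        if new <= closure:
            return closure
        closure |= new


closure = naive_closure(explicit)
inferred_parent = {t for t in closure - explicit if t[1] == PARENT}
for statement in sorted(inferred_parent):
    print(statement)
```

The printed statements are the gno:parentFeature facts that were not asserted explicitly; the closure additionally contains the symmetric and reflexive owl:sameAs statements produced by the same rules.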

The loading of FactForge benefits considerably from a specific feature of the BigTRREE engine, which allows it to handle owl:sameAs statements efficiently. In its indices, each set of equivalent URIs (an equivalence class with respect to owl:sameAs) is represented by a single super-node. BigTRREE can therefore still enumerate all statements that should be inferred through the equivalence, but it does not have to inflate its indices. This approach can be considered a sort of partial materialization. At the same time, BigOWLIM takes special care to make sure that this trick does not hinder the ability to distinguish explicit from implicit statements.

This optimization allows OWLIM to efficiently handle large datasets where owl:sameAs is extensively used. In the case of FactForge, this technique allows OWLIM to deal with more than 7 billion statements at the computational costs required for 860 million statements.
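
The super-node idea can be sketched with a union-find structure: each owl:sameAs class gets one representative, statements are stored once against representatives, and the equivalent variants are enumerated only on demand. The class and function names below are hypothetical, and the sketch says nothing about how BigTRREE actually lays out its indices or tracks explicit versus implicit statements.

```python
SAME_AS = "owl:sameAs"
PARENT = "gno:parentFeature"


class EquivalenceIndex:
    """Union-find over URIs: one representative ("super-node") per class."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        self.parent.setdefault(uri, uri)
        root = uri
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited URI straight at the root.
        while self.parent[uri] != root:
            next_uri = self.parent[uri]
            self.parent[uri] = root
            uri = next_uri
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_store(statements):
    """Store each non-sameAs statement once, keyed by super-nodes."""
    index = EquivalenceIndex()
    for s, p, o in statements:
        if p == SAME_AS:
            index.union(s, o)
    stored = {(index.find(s), p, index.find(o))
              for s, p, o in statements if p != SAME_AS}
    return index, stored


def expand(index, stored):
    """Enumerate, on demand, every statement implied by the equivalences."""
    classes = {}
    for uri in list(index.parent):
        classes.setdefault(index.find(uri), set()).add(uri)
    for s, p, o in stored:
        for s2 in classes.get(s, {s}):
            for o2 in classes.get(o, {o}):
                yield (s2, p, o2)


statements = [
    ("dbpedia:Vienna", SAME_AS, "geonames:2761369"),    # S1
    ("geonames:2761369", PARENT, "geonames:2761367"),   # S2
    ("geonames:2761367", PARENT, "geonames:2782113"),   # S3
    ("geonames:2761369", PARENT, "geonames:2782113"),   # S4 (inferred)
    ("geonames:2782113", SAME_AS, "dbpedia:Austria"),   # S7
]
index, stored = build_store(statements)
expanded = set(expand(index, stored))
print(len(stored), "stored,", len(expanded), "retrievable")
```

The index holds far fewer statements than a query can retrieve, which is the effect the text describes at a much larger scale.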

RDFS and OWL “performance optimization”

Several features of the standard semantics of RDFS and OWL inflate the number of statements that have to be considered during materialization and/or query evaluation. These are:

  • <X,rdf:type,rdfs:Resource> and <P,rdf:type,rdf:Property> should be inferred for all URIs which appear as subjects in triples and for all predicates of statements;
  • <X,rdf:type,owl:Thing> and <X,owl:sameAs,X> should be inferred for each URI in a subject position;
  • owl:Thing and owl:Nothing have to be asserted to be super- and sub-classes of all classes.

These features make the semantics of these languages self-contained and to some extent better grounded, allowing some wonderful meta-reasoning capabilities. However, they “produce” statements which are of no interest to most applications and usage scenarios. Effectively, they add 3 extra statements for each URI in an RDF graph.

Given that in FactForge there are on average 3.5 explicit statements per URI, these 3 extra statements per URI appear as an unjustified overhead, especially with their limited utility. Thus, FactForge is loaded with the “partialRDFS” parameter of OWLIM which suppresses the inference of these extra statements coming from the features of the semantics of RDFS and OWL.
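
To make the overhead concrete, the snippet below lists the three per-URI statements in question and relates them to the 3.5 explicit statements per URI quoted above. The helper `trivial_statements` and the `ex:anyResource` URI are illustrative only.

```python
def trivial_statements(uri):
    """The three "trivial" statements standard RDFS/OWL semantics adds per URI."""
    return [
        (uri, "rdf:type", "rdfs:Resource"),
        (uri, "rdf:type", "owl:Thing"),
        (uri, "owl:sameAs", uri),
    ]


EXPLICIT_PER_URI = 3.5  # average quoted in the text

extra = len(trivial_statements("ex:anyResource"))
overhead = extra / EXPLICIT_PER_URI
print(f"{extra} trivial statements per URI, {overhead:.0%} of the explicit ones")
```

In other words, without the suppression the trivial statements alone would come close to doubling the number of statements per URI.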

We have also “switched off” the RDFS rules which derive types of resources based on the domains and ranges of properties, because plenty of properties are used without regard for (or knowledge of) their formal definitions. For instance, foaf:img has domain foaf:Person, but it is often used to denote images of all types of resources. The effect of this change on the result set was that OWLIM inferred 446 million fewer statements.
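
The following sketch shows why the domain rule is risky on loosely published data. It implements only the RDFS domain rule (a property with a declared domain types every subject it is used with) and applies it to a made-up statement in which foaf:img is attached to a building. The `ex:` resources are invented for the example.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_types_from_domains(statements):
    """RDFS domain rule: (p rdfs:domain C) and (x p y) entail (x rdf:type C)."""
    statements = set(statements)
    domains = {}
    for s, p, o in statements:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    inferred = {
        (s, RDF_TYPE, cls)
        for s, p, o in statements
        for cls in domains.get(p, ())
    }
    return inferred - statements


data = {
    ("foaf:img", RDFS_DOMAIN, "foaf:Person"),
    # Hypothetical misuse: an image attached to something that is not a person.
    ("ex:EiffelTower", "foaf:img", "ex:tower.jpg"),
}
unwanted = infer_types_from_domains(data)
print(unwanted)
```

With the rule switched on, the building would be typed as a foaf:Person; suppressing the rule avoids conclusions of this kind.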
