Ontotext

Dataset Statistics

This page provides statistics about the loading of the FactForge datasets and the inference performed over them.

The core of FactForge is the loading of multiple datasets into a single repository and the materialization of all facts that can be logically inferred from them. The table below shows the statistics from loading the datasets and materializing the implicit facts in FactForge. The first column lists the datasets in the order in which they are loaded into the repository.

| Dataset | Named Graph | Indexed Explicit Triples ('000) | Indexed Inferred Triples ('000) | All Indexed Triples ('000) | Entities ('000 graph nodes) | Implicit/explicit ratio |
|---|---|---|---|---|---|---|
| Schemata and ontologies | | 11 | 7 | 18 | 6 | 0.6 |
| DBpedia (SKOS categories) | http://dbpedia.org/ | 2,877 | 42,587 | 45,464 | 1,144 | 14.8 |
| DBpedia (owl:sameAs) | http://dbpedia.org | 5,544 | 566 | 6,110 | 8,464 | 0.1 |
| UMBEL | http://umbel.org/umbel# | 5,162 | 42,212 | 47,374 | 500 | 8.2 |
| lingvoj | http://lingvoj.org | 20 | 863 | 883 | 18 | 43.8 |
| CIA Factbook | http://www4.wiwiss.fu-berlin.de/factbook | 76 | 4 | 80 | 25 | 0.1 |
| WordNet | http://wordnet.princeton.edu/ | 2,281 | 9,296 | 11,577 | 830 | 4.1 |
| Geonames | http://www.geonames.org/ | 91,908 | 125,025 | 216,933 | 33,382 | 1.4 |
| DBpedia core | http://dbpedia.org/ | 560,096 | 198,043 | 758,139 | 127,931 | 0.4 |
| Freebase | http://freebase.com | 463,689 | 40,840 | 504,529 | 94,810 | 0.1 |
| MusicBrainz | http://musicbrainz.org | 45,536 | 421,093 | 466,630 | 15,595 | 9.2 |
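The derived columns in the table above can be recomputed from the explicit and inferred counts. The following is a minimal Python sketch of that arithmetic, using three rows copied from the table. The names `rows`, `summarize`, and `summary` are illustrative and not part of FactForge. Because the published figures are rounded to thousands, some rows (for example lingvoj) may not reproduce the printed ratio exactly.

```python
# Figures copied from three rows of the table, in thousands of triples:
# (indexed explicit, indexed inferred)
rows = {
    "DBpedia (SKOS categories)": (2877, 42587),
    "UMBEL": (5162, 42212),
    "Geonames": (91908, 125025),
}


def summarize(explicit, inferred):
    """Return (all indexed triples, inferred/explicit ratio to one decimal)."""
    return explicit + inferred, round(inferred / explicit, 1)


# Recompute the "All Indexed Triples" and "Implicit/explicit ratio" columns.
summary = {name: summarize(e, i) for name, (e, i) in rows.items()}

for name, (total, ratio) in summary.items():
    print(f"{name}: {total:,} thousand indexed triples, ratio {ratio}")
```

For these three rows the recomputed totals and ratios match the values printed in the table.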


The statistics of the FactForge indices after loading and materialization look as follows:
 

| Total number after loading | Value (millions) |
|---|---|
| Indexed explicit statements | 1,177 |
| Indexed inferred statements | 881 |
| Indexed statements (explicit + inferred) | 2,058 |
| Entities (nodes in the RDF graph) | 283 |

Although BigOWLIM performs complete forward-chaining, it does not store every inferable triple in its indices, which improves performance and saves space. One example is the sameAs-optimization, which allows BigOWLIM to avoid deriving multiple "replicas" of a statement when one or more of its elements have owl:sameAs equivalents. In addition, some statements that can be retrieved from the repository are the result of post-processing and serve to improve the presentation of the data in FactForge. A summary follows:
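To illustrate why owl:sameAs equivalents multiply the number of retrievable statements, here is a toy Python sketch. It is not BigOWLIM's implementation. The identifiers (`ex1:Sofia`, `ex:capitalOf`, and so on), the equivalence classes, and the helper functions are all hypothetical. The sketch assumes that one statement is kept per group of equivalent nodes and that the replicas are enumerated only when needed.

```python
from itertools import product

# Hypothetical owl:sameAs equivalence classes (illustrative identifiers only).
same_as_classes = [
    {"ex1:Sofia", "ex2:Sofia", "ex3:Sofia"},
    {"ex1:Bulgaria", "ex2:Bulgaria"},
]


def equivalents(node):
    """Return every node in the same equivalence class as `node`, itself included."""
    for cls in same_as_classes:
        if node in cls:
            return sorted(cls)
    return [node]


def expand(statement):
    """Enumerate all replicas of one stored statement implied by owl:sameAs."""
    s, p, o = statement
    return list(product(equivalents(s), equivalents(p), equivalents(o)))


# One stored statement stands in for every combination of equivalent nodes.
stored = ("ex1:Sofia", "ex:capitalOf", "ex1:Bulgaria")
replicas = expand(stored)
print(len(replicas))  # 3 subjects * 1 predicate * 2 objects = 6
```

In this toy setting, a single indexed statement yields six distinct retrievable statements, which is the kind of gap between indexed and retrievable counts that the summary reports.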

 

| Number of statements after post-processing | Value |
|---|---|
| Added after post-processing (preferred labels and ranks of the nodes) | 179,812,809 |
| Indexed | 2,237,550,383 |
| "Compressed" through sameAs-optimization | 7,760,929,834 |
| Different retrievable statements (by pattern <?s, ?p, ?o, ?g>) | 9,818,667,408 |

After post-processing, the total number of entities in the RDF graph is 404,796,665.