One of the latest features of GraphDB is the MongoDB document store integration. This is an exciting feature because it increases the write scalability of RDF solutions that deal with document-centric data. It allows us to store some of our data outside of GraphDB, but still query it from GraphDB’s SPARQL editor.
Often, the data stored in an enterprise knowledge graph comes in two shapes. The first is true graph data: a complex shape describing relations between entities in arbitrary directions and with multiple predicates. Such data is represented very efficiently in GraphDB because users will most likely join it in all possible directions. The second is also an important part of the enterprise knowledge graph, but has a hierarchical structure.
For example, this is data generated by text mining algorithms, which includes document metadata attributes and annotations linking to the first type of data. It is queried in a much more predictable way, starting either from the document identifier or from the annotation links, so we don’t need to index it heavily and can use MongoDB instead.
MongoDB is a NoSQL document store for JSON (and JSON-LD) documents that doesn’t natively support joins, SPARQL or RDF-enabled linked data. Its indices are designed for fast storage and retrieval of documents and objects, and it is the document store with the biggest developer community.
The integration between GraphDB and MongoDB is done by a plugin. It sends queries to MongoDB and then transforms the extracted data into an RDF model. We can use the results the same way we would use any other RDF data stored in GraphDB, and we can query them with SPARQL.
For the following demonstration, we will use a subset of the LDBC Semantic Publishing Benchmark. The data includes articles and metadata represented as creative works, plus annotations linking the creative works to DBpedia entities with a complex graph structure, also known as reference data. In this experiment, we will demonstrate high scalability by storing the articles/creative works in MongoDB and all reference/true graph data in GraphDB.
Before we can start integrating data from both GraphDB and MongoDB in our SPARQL queries, we first need a running MongoDB instance. Then, we need to import our data into MongoDB as JSON-LD.
One important thing to keep in mind is that the MongoDB plugin is read-only. It provides the ability to query our data in MongoDB through SPARQL, but we need to manage MongoDB data outside of GraphDB.
Let’s have a quick look at how it works.
First, we need to import the provided cwork1000.json file, containing 1,000 CreativeWork documents, into the “ldbc” MongoDB database and the “creativeWorks” collection.
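One way to load the file is with MongoDB’s standard mongoimport tool (this assumes a mongod running locally on the default port; if your file holds a single JSON array, add the --jsonArray flag):

```shell
mongoimport --db ldbc --collection creativeWorks --file cwork1000.json
```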
The structure of the document is hierarchical. It starts with the key “@graph” and contains a list of entities and their hierarchical structures. For example, the entity with “@id”: “http://www.bbc.co.uk/things/3#id” has a predicate “bbc:primaryContentOf” whose object is “@id”: “http://www.bbc.co.uk/things/2#id”. We want to convert this JSON-LD document to RDF.
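A simplified illustration of that shape, trimmed to the two entities just mentioned (the real documents carry more attributes and a JSON-LD @context):

```json
{
  "@graph": [
    {
      "@id": "http://www.bbc.co.uk/things/3#id",
      "bbc:primaryContentOf": { "@id": "http://www.bbc.co.uk/things/2#id" }
    },
    {
      "@id": "http://www.bbc.co.uk/things/2#id"
    }
  ]
}
```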
We create the MongoDB connector with an INSERT query.
Here, inst:spb1000 is the name of the connection, and we have to define where the MongoDB instance we want to work with is located, which database to connect to and which MongoDB collection to use.
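A sketch of such an INSERT, based on the plugin’s vocabulary (the mongodb:// URL here is an assumption for a local instance on the default port; check the GraphDB documentation for the exact predicate names in your version):

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>

# Create a read-only connection named inst:spb1000 that points the plugin
# at the "creativeWorks" collection of the local "ldbc" database.
INSERT DATA {
  inst:spb1000 mongodb:service "mongodb://localhost:27017" ;
               mongodb:database "ldbc" ;
               mongodb:collection "creativeWorks" .
}
```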
Now that everything is set, let’s first see how we can look for some documents that interest us.
In this MongoDB connection, we are searching for the key “@graph.@id”, which has the value “http://www.bbc.co.uk/things/1#id”. The first part of the query specifies that we are searching for a document with this ID in MongoDB. The second part of the query tells the plugin what to fetch from the matched document. In our case, the plugin will fetch everything from the document with this ID, convert it to RDF and return it.
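A sketch of that lookup, assuming the plugin’s mongodb:find/mongodb:entity predicates and the connection name inst:spb1000 defined earlier (exact syntax may differ between plugin versions):

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>

SELECT ?s ?p ?o WHERE {
  # Ask MongoDB for the document whose @graph entry has this @id
  ?search a inst:spb1000 ;
          mongodb:find '{"@graph.@id" : "http://www.bbc.co.uk/things/1#id"}' ;
          mongodb:entity ?entity .
  # The matched document, converted to RDF, is exposed as a named graph
  GRAPH inst:spb1000 {
    ?s ?p ?o .
  }
}
```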
Now, let’s try to do something a little more interesting and look for a specific thing in the hierarchy.
Here, we are searching for documents that have a specific audience ID, NationalAudience, so this query will fetch data for all documents in MongoDB with that audience. Also, we are not interested in all triples, but only in the data with the dateModified predicate.
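A sketch of such a query; the cwork: namespace and the key path inside the find document are assumptions based on the SPB dataset, so verify them against your own documents:

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>

SELECT ?creativeWork ?modified WHERE {
  # Match only documents whose audience is NationalAudience
  ?search a inst:spb1000 ;
          mongodb:find '{"@graph.cwork:audience.@id" : "cwork:NationalAudience"}' ;
          mongodb:entity ?entity .
  GRAPH inst:spb1000 {
    # Keep only the dateModified triples of the matched documents
    ?creativeWork cwork:dateModified ?modified .
  }
}
```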
Now let’s see how the integration between the RDF in GraphDB and the data in MongoDB can really happen. As already mentioned, we keep some of our data in GraphDB and some of it in MongoDB. We want to fetch some of the data from MongoDB and integrate it with data in GraphDB.
For the query below to return results, import this Turtle content:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://sws.geonames.org/2862704/> <http://www.ldbcouncil.org/spb#prefLabel> "Niederwürschnitz" .
```
What this query does is fetch the mentions, via the cwork:mentions predicate, from MongoDB and then find the mentions that have the predefined label (“Niederwürschnitz”) in GraphDB. These entities come from MongoDB, but we filter them with data in GraphDB.
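A sketch of this naive version, joining everything fetched from MongoDB against the reference data in GraphDB (predicate names for the plugin and the ontologies are assumptions; the spb: prefix matches the Turtle snippet above):

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>
PREFIX spb: <http://www.ldbcouncil.org/spb#>

SELECT ?creativeWork ?mention WHERE {
  # Fetch all documents from MongoDB -- expensive, optimized below
  ?search a inst:spb1000 ;
          mongodb:find '{}' ;
          mongodb:entity ?entity .
  GRAPH inst:spb1000 {
    ?creativeWork cwork:mentions ?mention .
  }
  # Filter the mentions against the reference data stored natively in GraphDB
  ?mention spb:prefLabel "Niederwürschnitz" .
}
```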
So far so good. But in this query, we fetched all the data from MongoDB about these mentions, which is very bad in terms of performance.
Let’s optimize this query, so that we still fetch particular mentions from GraphDB, but specify only the data that we want to work with.
This query fetches mentions with the exact label “Niederwürschnitz” from GraphDB. Then (because MongoDB works with keys), we have to replace the prefix “http://sws.geonames.org” with the key “geonames:”. In other words, we fetch this mention from GraphDB, replace the prefix and then query MongoDB only for this key. As a result, the query is much faster.
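The optimized version might be sketched as follows; whether the plugin accepts a variable for mongodb:find, and the exact JSON key path, should be checked against the GraphDB documentation:

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>
PREFIX cwork: <http://www.bbc.co.uk/ontologies/creativework/>
PREFIX spb: <http://www.ldbcouncil.org/spb#>

SELECT ?creativeWork ?mention WHERE {
  # 1. Resolve the label to an IRI using the reference data in GraphDB
  ?mention spb:prefLabel "Niederwürschnitz" .
  # 2. Rewrite the IRI into the prefixed key form stored in the JSON documents
  BIND(REPLACE(STR(?mention), "http://sws.geonames.org/", "geonames:") AS ?key)
  BIND(CONCAT('{"@graph.cwork:mentions.@id" : "', ?key, '"}') AS ?query)
  # 3. Ask MongoDB only for documents mentioning that key
  ?search a inst:spb1000 ;
          mongodb:find ?query ;
          mongodb:entity ?entity .
  GRAPH inst:spb1000 {
    ?creativeWork cwork:mentions ?mention .
  }
}
```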
Let’s see what else we can do with this connector.
Another nice feature is that we can execute hierarchical queries, i.e. go through a chain of predicates.
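For instance, MongoDB’s dot notation lets the find document walk down the hierarchy; a sketch, reusing the predicate chain from the example document earlier (the bbc: namespace IRI is an assumption):

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>

SELECT ?s ?p ?o WHERE {
  ?search a inst:spb1000 ;
          # Dot notation walks the chain: entity -> bbc:primaryContentOf -> @id
          mongodb:find '{"@graph.bbc:primaryContentOf.@id" : "http://www.bbc.co.uk/things/2#id"}' ;
          mongodb:entity ?entity .
  GRAPH inst:spb1000 {
    ?s ?p ?o .
  }
}
```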
One more thing that we can do is to execute MongoDB aggregates.
Here, we can count the documents in MongoDB. Or we can do anything else we may want to do with the normal MongoDB aggregate queries (you can read more about MongoDB queries and aggregates in the MongoDB documentation).
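A sketch of a counting aggregate, assuming the plugin exposes a mongodb:aggregate predicate that passes a standard MongoDB aggregation pipeline through as JSON (how the result bindings are shaped may vary by plugin version):

```sparql
PREFIX mongodb: <http://www.ontotext.com/connectors/mongodb#>
PREFIX inst: <http://www.ontotext.com/connectors/mongodb/instance#>

SELECT * WHERE {
  ?search a inst:spb1000 ;
          # A normal MongoDB aggregation pipeline: count all documents
          mongodb:aggregate '[{"$count" : "documents"}]' ;
          mongodb:entity ?entity .
  GRAPH inst:spb1000 {
    ?s ?p ?o .
  }
}
```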
So, that’s how the GraphDB integration with MongoDB works. It allows us to store some of our data outside of GraphDB, convert it to RDF, fetch it with SPARQL and integrate it with our other data.
We have also run some benchmark tests, transforming CreativeWorks into hierarchical JSON documents and ingesting them into MongoDB. Writes have proven much faster; as for reads, it really depends on the data and how we write our queries.
Want to boost the scalability of your RDF solution for document-centric data?