A bird’s-eye view of where to start when building a knowledge graph solution to help your business excel in a data-driven market
A graph is like a map that represents real-life objects and the relationships between them. Many of us use Google, Twitter, Alexa and Siri, but most probably don’t know (or think about) the fact that they are powered by knowledge graph technology. In a social network graph, for example, the objects are people and organizations, and the relationships are ‘follows’ or ‘friends’.
The objects in knowledge graphs are called “entities” and can represent real things in the world, events, situations or even ideas. The descriptions of these entities have a specific structure and meaning (semantics). This allows both humans and machines to process them efficiently and unambiguously. These descriptions also reference other entities and their descriptions and in this way create a vast network of knowledge.
A knowledge graph is a versatile way of organizing and using data. It can act as a database, a network and a knowledge base depending on how it’s designed and used.
Like a database, a knowledge graph has a schema, and users can run complex structured queries to extract exactly the data they need. However, unlike in a relational database, the schema of a graph is flexible and doesn’t need to be defined in advance.
The data in a knowledge graph can be represented as a collection of nodes and edges and can be analyzed like any other network structure. This enables users to apply graph algorithms, optimizations, traversals and transformations.
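To make the network view concrete, here is a minimal sketch in plain Python. The people, the ‘follows’ edges and the `reachable` helper are all made up for illustration; they are not the API of any particular graph database.

```python
from collections import deque

# A tiny directed graph of hypothetical "follows" relationships.
edges = [
    ("Alice", "Bob"),
    ("Bob", "Carol"),
    ("Carol", "Dave"),
    ("Eve", "Alice"),
]

# Adjacency list: node -> set of nodes it points to.
adjacency = {}
for source, target in edges:
    adjacency.setdefault(source, set()).add(target)
    adjacency.setdefault(target, set())  # make sure every node has an entry


def reachable(start):
    """Breadth-first traversal: return every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


print(sorted(reachable("Alice")))
```

The same adjacency structure could feed other graph algorithms, such as shortest paths or centrality measures.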
Because of the formal semantics attached to the data, knowledge graphs can act as a knowledge base. This enables humans and machines to easily interpret this data and derive new information.
Formal semantics (usually defined by an ontology) establishes an agreement between the developers of a knowledge graph and its users about the context of the domain and the meaning of the data. It relies on a number of representation and modeling instruments to express and interpret the data of a knowledge graph.
A description of an entity usually includes its classification with respect to a class hierarchy. The idea is that each entity is assigned to a class, and that each class can in turn have a superclass representing a higher-level concept or subclasses representing more granular concepts. For example, in domains like general news the most common classes are Person, Organization and Location. To continue along the hierarchy, both Person and Organization can have a superclass Agent, whereas Location usually has subclasses like Country, City, etc.
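A class hierarchy like this one can be sketched in a few lines of Python. The class names come from the example above; the entities and the helper functions are hypothetical and only show how a classification can be checked against superclasses.

```python
# Class hierarchy from the example: class -> direct superclass (None for a root).
SUPERCLASS = {
    "Agent": None,
    "Person": "Agent",
    "Organization": "Agent",
    "Location": None,
    "Country": "Location",
    "City": "Location",
}

# Hypothetical entities, each assigned to its most specific class.
ENTITY_CLASS = {
    "Paris": "City",
    "France": "Country",
    "Marie Curie": "Person",
}


def ancestors(cls):
    """Return the class itself plus all of its superclasses, most specific first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = SUPERCLASS[cls]
    return chain


def is_instance_of(entity, cls):
    """True if the entity's class is `cls` or one of its subclasses."""
    return cls in ancestors(ENTITY_CLASS[entity])


print(ancestors("City"))
print(is_instance_of("Marie Curie", "Agent"))
```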
The relationships between entities, on the other hand, are usually expressed by relation types. These indicate the nature of the relationship, such as friend, relative, competitor, etc. Relation types can also have formal definitions. For instance, parent-of can be defined as the inverse relation of child-of, and both can be considered specific cases of the symmetric relation relative-of.
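The formal definitions in this example are what let a machine derive new facts. The sketch below encodes them as plain Python data and applies them until nothing new appears. The names `Ann` and `Ben` and the `infer` function are assumptions for illustration; a real system would typically delegate this work to an ontology language and a reasoner.

```python
# Formal definitions of the relation types, as plain Python data.
INVERSE_OF = {"parent-of": "child-of", "child-of": "parent-of"}
SUBRELATION_OF = {"parent-of": "relative-of", "child-of": "relative-of"}
SYMMETRIC = {"relative-of"}


def infer(facts):
    """Apply the three kinds of rule repeatedly until no new fact appears."""
    inferred = set(facts)
    while True:
        new = set()
        for subject, relation, obj in inferred:
            if relation in INVERSE_OF:
                new.add((obj, INVERSE_OF[relation], subject))
            if relation in SUBRELATION_OF:
                new.add((subject, SUBRELATION_OF[relation], obj))
            if relation in SYMMETRIC:
                new.add((obj, relation, subject))
        if new <= inferred:  # fixed point reached
            return inferred
        inferred |= new


facts = {("Ann", "parent-of", "Ben")}  # hypothetical starting fact
for fact in sorted(infer(facts)):
    print(fact)
```

From the single stated fact, the rules derive that Ben is a child of Ann and that the two are relatives of each other.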
Entities can also be associated with categories that describe specific aspects of their semantics. For example, a book can simultaneously belong to “Books about Africa”, “Bestseller”, “Books by Italian authors”, “Books for kids”, etc.
It’s also possible to include “human-friendly” free text descriptions in a knowledge graph. This helps further clarify the design intentions for an entity and offers additional context and details for enhanced search capabilities.
One of the common graph data models is the Resource Description Framework (RDF). Developed and standardized by the World Wide Web Consortium (W3C), it provides a powerful and expressive framework for representing data and metadata.
RDF is made of three-part structures called triples. An RDF triple consists of a subject, a predicate and an object. The resources that appear in a triple are each identified by a Uniform Resource Identifier (URI), which looks like a web page address. The object can also be a literal value, such as a number or a string.
Let’s consider the following example triples:
In the first triple, “Wilma hasSpouse Fred”, Wilma is the subject, hasSpouse is the predicate and Fred is the object. In the second triple, “Wilma hasAge 24”, Wilma is the subject, hasAge is the predicate and 24 is the object.
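As a rough sketch, the two triples can be written down in Python as follows. The `Triple` type and the `http://example.org/flintstones/` namespace are placeholders chosen for this illustration, and in this sketch every string is treated as a URI while other values are treated as literals.

```python
from typing import NamedTuple, Union

# Placeholder namespace; example.org is reserved for documentation use.
EX = "http://example.org/flintstones/"


class Triple(NamedTuple):
    subject: str           # URI of the thing being described
    predicate: str         # URI of the property or relationship
    obj: Union[str, int]   # URI of another resource, or a literal value


triples = [
    Triple(EX + "Wilma", EX + "hasSpouse", EX + "Fred"),
    Triple(EX + "Wilma", EX + "hasAge", 24),
]

for t in triples:
    kind = "resource" if isinstance(t.obj, str) else "literal"
    print(t.subject, t.predicate, t.obj, f"({kind})")
```

Note that the first object is another resource that can itself be described by further triples, whereas the second is a plain value.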
By connecting multiple triples together, we create an RDF graph. The following diagram illustrates the characters and relationships found in the Flintstones TV cartoon series. We see triples such as “PebbleFlintstone livesIn Bedrock” or “BamBamRubble livesIn Bedrock”. Combined with the triples that describe Bedrock itself, these tell us that the Flintstones and the Rubbles live in Bedrock and that Bedrock is part of Cobblestone County in Prehistoric America.
The other triples in the graph describe the relationships between the different characters (hasSpouse or hasChild) as well as their work association (worksFor). For example, we can see that Fred and Wilma are married, that they have a child Pebbles and that Fred works for the Rock Quarry company.
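A small Python sketch shows how such a graph can be queried by pattern matching. The triples paraphrase the description above; identifiers that are not quoted in the text (for example `FredFlintstone`, `RockQuarry` and the `partOf` predicate) are assumptions, as is the `match` helper. A real RDF store would answer the same questions with a query language such as SPARQL.

```python
# Triples paraphrased from the text; identifiers not quoted there are assumed.
graph = {
    ("FredFlintstone", "hasSpouse", "WilmaFlintstone"),
    ("WilmaFlintstone", "hasSpouse", "FredFlintstone"),
    ("FredFlintstone", "hasChild", "PebbleFlintstone"),
    ("WilmaFlintstone", "hasChild", "PebbleFlintstone"),
    ("FredFlintstone", "worksFor", "RockQuarry"),
    ("PebbleFlintstone", "livesIn", "Bedrock"),
    ("BamBamRubble", "livesIn", "Bedrock"),
    ("Bedrock", "partOf", "CobblestoneCounty"),
    ("CobblestoneCounty", "partOf", "PrehistoricAmerica"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for s, p, o in graph
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Who lives in Bedrock?
residents = [s for s, _, _ in match(predicate="livesIn", obj="Bedrock")]

# Follow partOf links upward to find every region that contains Bedrock.
regions, current = [], "Bedrock"
while found := match(subject=current, predicate="partOf"):
    current = found[0][2]
    regions.append(current)

print(residents)
print(regions)
```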
Labeled Property Graphs (LPGs) are another graph data model that offers light-weight management of graph data. Its primary motivation is not centered around semantics, data exchange or publication, but is focused on efficient storage that enables quick querying and traversal of interconnected data.
LPG technology doesn’t have standardized schema, modeling or query languages, nor does it provide formal semantics or interoperability specifications. This means that there are no established serialization formats for representing LPGs. As a result, there are also no federation protocols for integrating data from multiple sources, nor other mechanisms to ensure seamless interaction and compatibility between different LPG implementations.
So this model is most useful when data needs to be collected on-the-fly and analytics is done within the scope of a single project.
While RDF allows statements to be made only about nodes in the graph, LPGs can attach descriptions or properties to both nodes and edges. This is a major difference between the two models.
The RDF-star extension closes this gap by allowing RDF to make statements about other statements. This makes it possible to attach metadata to graph edges, such as scores, weights, temporal aspects and provenance.
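The difference can be sketched in Python. This is a simplified illustration, not actual RDF-star syntax or any LPG product’s API: the `since` and `confidence` values are invented, and the `annotations` helper is hypothetical.

```python
# LPG style: an edge is a record that carries its own key/value properties.
lpg_edge = {
    "source": "Fred",
    "label": "worksFor",
    "target": "RockQuarry",
    "properties": {"since": 1960, "confidence": 0.9},  # made-up values
}

# RDF-star style: a triple may appear as the subject of another triple,
# so metadata is expressed as ordinary statements about the statement.
base = ("Fred", "worksFor", "RockQuarry")
rdf_star = {
    base,
    (base, "since", 1960),
    (base, "confidence", 0.9),
}


def annotations(statement, triples):
    """Collect the metadata attached to `statement` as a dict."""
    return {p: o for s, p, o in triples if s == statement}


print(annotations(base, rdf_star))
```

Both forms carry the same edge metadata; in the second one it remains expressed as triples, so it can be stored and queried like the rest of the graph.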
Overall, knowledge graphs represented in RDF allow data to be easily integrated, interconnected, identified, disambiguated and reused. This is possible because of a combination of factors discussed below.
The expressivity of Semantic Web standards enables the fluent representation of diverse data and content types. This includes data schema, taxonomies, vocabularies, metadata of various kinds as well as reference and master data.
Another important aspect of RDF knowledge graphs is their formal semantics. Thanks to the precisely defined meanings, both humans and machines can interpret the model and data unambiguously.
Performance is also a critical aspect of semantic knowledge graphs. Because the RDF specifications have been thoroughly designed and proven in practical scenarios, users can efficiently manage knowledge graphs containing billions of facts and properties.
In addition, there are various specifications available in the RDF ecosystem to facilitate the interoperability of data across different systems and applications. They cover different aspects of data serialization, access, management and federation.
Finally, standardization plays an essential role in everything discussed so far. All of these specifications have been standardized through the W3C community process, which is designed to take the requirements of the various stakeholders into account.
So far, we’ve focused on the nature and characteristics of knowledge graphs. Now let’s talk about what is not a knowledge graph.
A graph-based representation of data is valuable, but there are many use cases when we don’t need to capture the semantic knowledge in the data.
For example, when statistical data like GDP for different countries is represented in RDF, this is not a knowledge graph. Here, we don’t need to define the meaning of what countries are or what the ‘Gross Domestic Product’ of a country is. It’s enough just to have the string ‘China’ associated with the string ‘GDP’ and the number ‘18.1 trillion’.
So, the essence of a knowledge graph lies in its connections and the underlying graph structure rather than the specific language used for representing the data.
Knowledge bases that lack formal structure and semantics also don’t qualify as knowledge graphs. One such example is a Q&A knowledge base about a product. Another is an expert system with data organized in a non-graph format that uses automated inference (e.g., a set of ‘if-then’ rules) to facilitate analysis.
An essential characteristic of a knowledge graph is that entity descriptions should be interlinked with one another. Each entity definition should include references to other entities, forming the basis of the graph structure.
Knowledge graphs are powerful frameworks for organizing data and metadata, designed to meet specific criteria and purposes. They are not software. Instead, the data and associated metadata of one knowledge graph can be used and reused by various independent systems to enable diverse functionalities.
A variety of software applications leverage knowledge graphs. This includes databases for structured queries, automated reasoners for inferring new relationships, full-text search engines for efficient content searches, editing and curation tools for modifying and extending knowledge graphs and countless more.
Knowledge graphs have the potential to redefine the way we organize, interpret and use our data. Their ability to interconnect diverse data sources and capture complex relationships enables organizations to gain deeper insights and make more informed decisions. It also unlocks new possibilities for collaboration and innovation. As the technology continues to evolve, we can expect knowledge graphs to become even more sophisticated and powerful, bringing greater value to enterprise data.