Natural Language Querying of GraphDB in LangChain

In a galaxy ravaged by civil wars, a brave alliance of underground freedom fighters has challenged the tyranny and oppression of the Galactic Empire. Following the thread of mysterious natural language queries, the Jedi Master Neli Hateva flies to the swamp-covered planet of Dagobah to locate a powerful weapon that can bring new hope to the future.

February 9, 2024 · 15 min read · Neli Hateva

This is part of Ontotext’s AI-in-Action initiative aimed at enabling data scientists and engineers to benefit from the AI capabilities of our products.

Natural Language Querying (NLQ) refers to the process of querying a database or an information system using natural language, such as English, instead of formal query languages such as Structured Query Language (SQL). NLQ allows users to interact with databases or systems in a more intuitive and user-friendly way. They can simply type or speak their queries, resembling how they would ask another person a question.

By enabling users to interact with data and systems using natural language, NLQ makes it easier for non-technical users to access and analyze data, leading to faster insights and better decision-making. Such type of querying has applications across various domains, including business intelligence, data analytics, customer support, and information retrieval.

Many NLQ systems nowadays leverage the potential of Large Language Models (LLMs) to translate a natural query into a query language. People use LLMs to translate text to SQL, SPARQL, Cypher, Elasticsearch DSL or GraphQL.

This blog post presents the Ontotext GraphDB QA Chain – an NLQ integration for Ontotext GraphDB in LangChain, a framework designed to simplify the creation of applications using LLMs. LangChain provides flexible abstractions and an extensive toolkit that enable developers to build context-aware, reasoning LLM applications, and it has been growing in popularity since its initial release in 2022.

Prequel

The NLQ integration for GraphDB in LangChain enables developers to quickly create Python applications that allow end users to ask questions against the data in GraphDB, without writing SPARQL queries, and to receive responses in natural language. It works by integrating with an LLM that understands the user question and translates it into a SPARQL query automatically, based on an ontology (data model) provided to the LLM as grounding context. The diagram above offers a high-level overview of how NLQ for GraphDB works in LangChain.

First, the user asks a question in natural language. The question and the ontology schema are passed to the LLM, which is prompted to generate a SPARQL query. If the generated query is a valid SPARQL query, then it is executed against GraphDB. The query results, together with the question, are passed to the LLM, which is prompted to generate an answer to the question, given the results. The LLM is instructed to only use the information from the returned results.

Sometimes the LLM may generate a SPARQL query with missing prefixes, syntactic or other errors. In such cases, we try to amend this by prompting the LLM to correct the query a certain number of times. In addition to the invalid query, we also provide the LLM with the error message and the ontology schema.
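This generate–validate–fix loop can be sketched as follows. This is an illustrative simplification, not the actual LangChain implementation; all function names here are hypothetical:

```python
# Minimal sketch of the chain's self-correction loop (illustrative only).
# generate, validate and fix stand in for the prompted LLM calls and the
# SPARQL parser; the ontology schema is omitted from this sketch.
def run_with_fixes(generate, validate, fix, max_fix_retries=3):
    """Generate a SPARQL query; if invalid, re-prompt with the error."""
    query = generate()
    for _ in range(max_fix_retries):
        error = validate(query)  # returns None if the query parses
        if error is None:
            return query  # valid SPARQL, ready to run against GraphDB
        # Re-prompt the LLM with the broken query and the error message.
        query = fix(query, error)
    raise ValueError("could not produce a valid SPARQL query")
```

The key point is that the fix prompt receives both the invalid query and the parser's error message, so the LLM has enough context to repair, for example, a missing prefix.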

The ontology schema can be configured either with a SPARQL query or with a file. If you store the ontology schema in a dedicated named graph, you can use a SPARQL CONSTRUCT query to collect the ontology schema statements, for example:

CONSTRUCT {?s ?p ?o} FROM <http://example.com/ontology/> WHERE {?s ?p ?o}

If you don’t store the ontology schema statements in a dedicated named graph, you can use the RDF file option. The supported formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, TriG, TriX and N-Quads.
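Pointing the graph at a local ontology file might look like this – a sketch based on the LangChain integration documentation; the file path is hypothetical, and you should check the current API for the exact parameter names:

```python
from langchain_community.graphs import OntotextGraphDBGraph

# Load the ontology schema from a local RDF file instead of a
# CONSTRUCT query against a named graph.
graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/ontology.ttl",
    local_file_format="turtle",  # needed if the format can't be
                                 # inferred from the file extension
)
```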

Either way, the ontology schema is fed to the LLM in Turtle format since Turtle with appropriate prefixes is the most compact and easiest for the LLM to remember. The ontology schema should include enough information about classes, properties, property attachment to classes (using rdfs:domain, schema:domainIncludes or OWL restrictions), and taxonomies (important individuals). The ontology schema should not include overly verbose and irrelevant definitions and examples that do not help with the SPARQL query construction (for example, https://schema.org definitions suffer from this problem).
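To make this concrete, a compact schema fragment in this spirit could look like the following. This is a hypothetical fragment written in the style of the SWAPI vocabulary, not the actual ontology:

```turtle
@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Planet a owl:Class ;
    rdfs:label "Planet" .

:climate a owl:DatatypeProperty ;
    rdfs:label "climate" ;
    rdfs:domain :Planet ;
    rdfs:range xsd:string .
```

Short labels, explicit domains and compact prefixes give the LLM exactly what it needs to bind question terms to classes and properties, without wasting context window on verbose definitions.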

Examples from the Star Wars Universe

To illustrate all that, let’s review some examples. We’ll use the Star Wars API (SWAPI) ontology and dataset that you can get from here.

Set up a chain we should

We have to set up the chain first:

import os

from langchain_community.graphs import OntotextGraphDBGraph
from langchain_openai import ChatOpenAI

from langchain.chains import OntotextGraphDBQAChain

os.environ["OPENAI_API_KEY"] = "sk-***"

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} "
    "FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
)

We need to provide the GraphDB repository URL, set up the ontology schema and choose an LLM (any LLM integrated in LangChain can be used). In the example above, we set up the ontology schema with a SPARQL CONSTRUCT query and choose the OpenAI ‘gpt-4-1106-preview’ model because of its bigger context window (128,000 tokens).

The light side examples

We can now ask a simple question:

chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]
> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
    ?planet rdfs:label "Tatooine" .
    ?planet :climate ?climate .
}
> Finished chain.
The climate on Tatooine is arid.

The chain execution log shows the generated SPARQL query and the natural language response to the question. The generated query is correct and we have found the right answer.

For simplicity we’ll omit the code part invoking the chain and give only the questions. We’ll also omit parts of the execution log and give only the generated SPARQL query and the answer.

Let’s ask another question:

"What is Luke Skywalker's home planet?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?homePlanet
WHERE {
    ?person rdfs:label "Luke Skywalker" .
    ?person :homeworld ?planet .
    ?planet rdfs:label ?homePlanet .
}

Luke Skywalker's home planet is Tatooine.

The query and the answer are both correct again. Great!

Now, let’s try a more complicated question:

"What is the climate on Luke Skywalker's home planet?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
  ?character rdfs:label "Luke Skywalker" .
  ?character :homeworld ?planet .
  ?planet :climate ?climate .
}

The climate on Luke Skywalker's home planet is arid.

Awesome! The SPARQL query is correct again and we’ve got the right answer.

Let’s check if the LLM can handle queries with aggregations:

"How many Star Wars movies are there?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT (COUNT(?film) AS ?numberOfFilms)
WHERE {
  ?film rdf:type :Film .
}

Based on the information provided, there are 7 Star Wars movies.

And again the generated query and the results are correct.

Let’s try another one:

"What is the average box office revenue for all the Star Wars movies?"
Invalid SPARQL query: 
SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
  ?film a :Film .
  ?film :boxOffice ?boxOffice .
}
SPARQL Query Parse Error: 
Unknown namespace prefix : None

Generated SPARQL:
SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
  ?film a <https://swapi.co/vocabulary/Film> .
  ?film <https://swapi.co/vocabulary/boxOffice> ?boxOffice .
}

The average box office revenue for all the Star Wars movies is approximately 754.15 million dollars.

Here we see that, at first, the generated query was not valid due to a missing prefix. We can see the wrong query and the error message. However, since we prompt the LLM to correct the query, the second time, the generated query is correct and we have our answer.

Now, let’s ask a question that would require the LLM to produce an answer from multiple results:

"Which are the pilots of the Millennium Falcon?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?pilotName
WHERE {
  ?starship rdfs:label "Millennium Falcon" .
  ?starship :pilot ?pilot .
  ?pilot rdfs:label ?pilotName .
}

The pilots of the Millennium Falcon include Chewbacca, Han Solo, and Lando Calrissian.

Another correct query and answer.

Let’s check if the LLM can generate a SPARQL query with multiple bindings in the results:

"Which is the shortest starship and which is its manufacturer?"
Invalid SPARQL query: 
SELECT ?starship ?manufacturer ?length
WHERE {
  ?starship a :Starship ;
            :manufacturer ?manufacturer ;
            :length ?length .
}
ORDER BY ?length
LIMIT 1
SPARQL Query Parse Error: 
Unknown namespace prefix : None

Generated SPARQL:
SELECT ?starship ?manufacturer ?length
WHERE {
  ?starship a <https://swapi.co/vocabulary/Starship> ;
            <https://swapi.co/vocabulary/manufacturer> ?manufacturer ;
            <https://swapi.co/vocabulary/length> ?length .
}
ORDER BY ?length
LIMIT 1

The shortest starship is manufactured by Kuat Systems Engineering and has a length of 5.47 units.

Although the first generated query was invalid, in the end we still got the right answer. We didn’t ask for the exact length of the starship, but the query selected it anyway, and it’s included in the answer.

The dark side examples

Now, let’s review a few examples that result in wrongly generated SPARQL queries:

"Which is taller - the tallest human or the shortest ewok?"
PREFIX : <https://swapi.co/vocabulary/>

SELECT ?tallestHumanHeight ?shortestEwokHeight
WHERE {
  {
    SELECT (MAX(?humanHeight) AS ?tallestHumanHeight)
    WHERE {
      ?human a :Human .
      ?human :averageHeight ?humanHeight .
    }
  }
  {
    SELECT (MIN(?ewokHeight) AS ?shortestEwokHeight)
    WHERE {
      ?ewok a :Ewok .
      ?ewok :averageHeight ?ewokHeight .
    }
  }
}

Based on the information provided, I do not have the data to determine the height of the tallest human or the shortest Ewok. Therefore, I cannot answer which is taller without additional information.

Here, in the generated SPARQL query, the LLM wrongly used the class property :averageHeight instead of the instance property :height. Both predicates are included in the ontology schema, so this confuses the LLM.

Let’s check this question:

"Are there more men than women characters?"
PREFIX : <https://swapi.co/vocabulary/>

SELECT (COUNT(?male) AS ?maleCount) (COUNT(?female) AS ?femaleCount)
WHERE {
  ?male a :Character;
        :gender "male".
  ?female a :Character;
          :gender "female".
}

Based on the information provided from the SPARQL query results, it appears that the number of male characters is equal to the number of female characters, with both counts being 1178. Therefore, there are not more male characters than female characters; they are equal in number.

Although we received an answer, it looks suspicious that the number of women is exactly the same as the number of men. The generated SPARQL query is wrong in this case, because the WHERE clause results in the Cartesian product between the male and the female characters.
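A correct formulation would count each gender separately, for example by grouping. This is a suggested query, not one verified against the dataset:

```sparql
PREFIX : <https://swapi.co/vocabulary/>

SELECT ?gender (COUNT(?character) AS ?count)
WHERE {
  ?character a :Character ;
             :gender ?gender .
}
GROUP BY ?gender
```

Grouping avoids the Cartesian product, because each character is matched once and the aggregation happens per gender value.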

The LLM might fail to generate a correct SPARQL query even for a simple question like this one:

"Who is the oldest character?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?character ?name ?birthYear
WHERE {
  ?character a :Character ;
             rdfs:label ?name ;
             :birthYear ?birthYear .
  FILTER(xsd:integer(?birthYear))
}
ORDER BY xsd:integer(?birthYear)
LIMIT 1

Based on the information provided, I am unable to determine who the oldest character is. There is no data available from the SPARQL query results to answer your question. If you can provide more details or context, I may be able to assist you further.

Actually, the data is available in the ontology, but this case is a bit tricky. There is a :birthYear predicate, but its values follow the timeline of galactic history in the Star Wars Universe. For example, Luke Skywalker’s birth year is “19BBY”. The generated query uses the SPARQL function xsd:integer in the FILTER clause, which is wrong: this function casts the value to xsd:integer, while the FILTER clause expects a boolean condition. Note that the LLM generates correct SPARQL queries for the questions “Who is the tallest character?” and “Who is the heaviest character?”.

Another example:

"Which movie has received an Academy Award nomination?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?movie
WHERE {
  ?award a :Award .
  ?award :film ?movie .
}

Based on the information provided, I do not have any details about movies that have received an Academy Award nomination. If you have specific details or a list of movies from a SPARQL query, I could certainly help you identify which ones have been nominated. Without such information, I'm unable to provide an answer to your question.

The LLM wrongly uses the class :Award instead of :AwardRecognition.

And another one with negation this time:

"Which movie didn't receive an Academy Award?"
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?film
WHERE {
  ?film a :Film .
  MINUS { ?film :award ?award .
          ?award a :Award .
        }
}

Based on the information provided from the SPARQL query results, I do not have the details regarding Academy Award receptions for the movies. The query results only list the URIs for seven films from the "Star Wars" franchise, but do not include any data about their award history. Therefore, I cannot determine which movie did not receive an Academy Award.

The generated query is wrong. The correct query in this case would be:

PREFIX : <https://swapi.co/vocabulary/>

SELECT DISTINCT ?film
WHERE {
    ?film a :Film.
    MINUS {
        ?awardRecognition a :AwardRecognition.
        ?awardRecognition :film ?film.
        ?awardRecognition :awardStatus "awarded".
    }
}

This time the LLM not only confused the two classes :Award and :AwardRecognition, but also failed to filter the awards by status.

The NLQ integration for GraphDB in LangChain is flexible and allows you to refine the LLM prompts to further improve your QA chain and the overall user experience of your app.

This concludes our demonstration. If you want to further play with the Star Wars ontology and NLQ, you can follow the instructions from the LangChain documentation on how to run the demo.

Change the prompts we might

Before we conclude this blog post, let’s review the default prompt templates we use:

GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
Write a SPARQL SELECT query for querying a graph database.
The ontology schema delimited by triple backticks in Turtle format is:
```
{schema}
```
Use only the classes and properties provided in the schema to construct the SPARQL query.
Do not use any classes or properties that are not explicitly provided in the SPARQL query.
Include all necessary prefixes.
Do not include any explanations or apologies in your responses.
Do not wrap the query in backticks.
Do not include any text except the SPARQL query generated.
The question delimited by triple backticks is:
```
{prompt}
```
"""

GRAPHDB_SPARQL_FIX_TEMPLATE = """
This following SPARQL query delimited by triple backticks
```
{generated_sparql}
```
is not valid.
The error delimited by triple backticks is
```
{error_message}
```
Give me a correct version of the SPARQL query.
Do not change the logic of the query.
Do not include any explanations or apologies in your responses.
Do not wrap the query in backticks.
Do not include any text except the SPARQL query generated.
The ontology schema delimited by triple backticks in Turtle format is:
```
{schema}
```
"""

GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
You are an assistant that creates well-written and human understandable answers.
The information part contains the information provided, which you can use to construct an answer.
The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
Make your response sound like the information is coming from an AI assistant, but don't add any information.
Don't use internal knowledge to answer the question, just say you don't know if no information is available.
Information:
{context}

Question: {prompt}
Helpful Answer:"""

As we can see from the SPARQL generation prompt, we don’t provide any examples and an arbitrary ontology schema can be used.
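If the defaults don’t suit your ontology, you can pass your own templates when constructing the chain. The following is a sketch based on the integration’s parameters; check the current LangChain API for the exact names (sparql_generation_prompt, sparql_fix_prompt, qa_prompt):

```python
from langchain.chains import OntotextGraphDBQAChain
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# A custom SPARQL generation prompt; "schema" and "prompt" are the
# input variables the chain fills in at runtime.
my_sparql_prompt = PromptTemplate(
    input_variables=["schema", "prompt"],
    template=(
        "Write a SPARQL SELECT query for querying a graph database.\n"
        "The ontology schema in Turtle format is:\n{schema}\n"
        "Return only the SPARQL query, with all necessary prefixes.\n"
        "The question is: {prompt}\n"
    ),
)

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,  # the OntotextGraphDBGraph from the setup above
    sparql_generation_prompt=my_sparql_prompt,
    verbose=True,
)
```

Adding a few question-to-query examples to the generation template is a simple way to steer the LLM around the pitfalls shown in the dark side examples.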

Epilogue

The integration of NLQ and knowledge graphs marks an important step towards simplifying data access, fostering deeper insights, and enhancing decision-making processes across diverse users. GraphDB already provides various functionalities such as Talk to your Graph, ChatGPT Retrieval Connector and GPT Querying.

In this blog post, we’ve reviewed another approach – the NLQ integration for GraphDB in LangChain. It’s an additional way to benefit from the power of LLMs and knowledge graphs. We can help you bootstrap your needs and discuss how this can add value to your business!

The End

Want to leverage the capabilities of LLMs for querying your knowledge graph without writing SPARQL queries?

Get GraphDB Free, load the sample data, and give it a try!


Software Engineer at Ontotext

Neli joined Ontotext in 2016 as a padawan and is now a Jedi Master, experienced in ML, NLP, Java, Python and other Force powers.