Scaling Understanding with the Help of Feedback Loops, Knowledge Graphs and NLP

This article was originally published in Data Science Central.

April 19, 2024 5 mins. read Alan Morrison


Assisted annotation and other automation methods that can be used for knowledge graph creation and natural language understanding should be more flexible and iterative than they often are. Much depends on the architecture and tooling choices your organization makes.

If you make sound choices, you avail yourself of a range of capabilities. Choosing wisely could be the difference between success and failure. Consider these capabilities, for example:

  • You can compare and contrast the effectiveness and efficiency of different approaches with benchmarking techniques that are an integrated part of your process.
  • You can swap out or add assisted tagging capabilities based on the benchmarking results, so you’re able to optimize the outcome.
  • You can harness the power of graph-linked, more curated sources that can make a substantial difference in the accuracy, utility, currency and reuse possibilities of the outputs you’re seeking.
  • You can empower subject matter experts who aren’t programmers in new ways, which allows better scaling and boosts the impact of the data you’re contextualizing and making reusable.

Recently I had the opportunity to talk with Ivaylo (Ivo for short) Kabakov, Head of Solutions at Ontotext. We mainly talked about the company’s Metadata Studio and the types of features it has that give users the options I’ve listed above. Ivo’s been at Ontotext for over 14 years.

Ontotext is best known for GraphDB, the company’s semantic graph database management system. But Ontotext has been investing for over 20 years in the area of natural language processing (NLP) as well.

This depth of understanding of both statistical modeling and the description and relationship logic of ontological modeling makes the company well suited to navigate the thicket of hybrid or neurosymbolic AI, in which statistical methods such as NLP and LLMs complement the symbolic logic in well-constructed knowledge graphs.

The knowledge graph as a reusable, growing resource

Of course the vast majority of enterprise data (Ontotext puts the estimate at 80 percent) is unstructured. If that data stays unstructured, it will lack reusability and stay siloed because it’s not explicitly contextualized and therefore not discoverable. In essence, the enterprise has buried it.

Unstructured enterprise data consists primarily of content in document form, and if it’s not discoverable or reusable, it’s useless. Ikea’s content strategy principal Timi Stoop-Acala puts it bluntly: “Unconnected content is a liability.”

A well-designed knowledge graph describes the contexts or domains of the data it contains. It’s that context, Ivo says, that not only leads to discoverability, but also to understanding, whether human or machine understanding. Contextualization leads to sensemaking because the human or the machine finds these representations of people, places, things and ideas related and interacting together in logical, systematized ways. It’s the relations and how the interactions take place that make a given subgraph under analysis meaningful.

The importance of well-described relationships to either human or machine understanding is most evident when it comes to human connection. Take the example of author Truman Capote. Try to imagine understanding the last two decades of Capote’s life without the back story of the trauma he went through while researching his wildly popular and important book, a true crime thriller called In Cold Blood.

Or without his connection since childhood to author Harper Lee, who helped him do the research for In Cold Blood. Or his closeness to socialite Babe Paley or the other “swans” depicted in the current FX series Feud: Capote vs. the Swans and how they ended up being the subjects of his final book Answered Prayers.

If you lacked an awareness of these connections, you’d miss the key causal factors behind Capote’s agonizing decline from 1965 until his death in 1984.
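To make the idea concrete, here is a minimal sketch of how a few of the connections mentioned above could be held as subject-predicate-object statements and queried for context. The predicate names (`wrote`, `friendOf`, `depicts`) and the helper functions are illustrative assumptions, not a published ontology or any vendor's data model.

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) statements drawn from the example above.
# Predicate names are hypothetical, chosen only for readability.
TRIPLES = [
    ("Truman Capote", "wrote", "In Cold Blood"),
    ("Truman Capote", "wrote", "Answered Prayers"),
    ("Truman Capote", "friendOf", "Babe Paley"),
    ("Answered Prayers", "depicts", "Babe Paley"),
]


def build_index(triples):
    """Index every statement under both of its endpoints so that an
    entity's context can be read in either direction."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
        index[obj].append((f"inverse:{predicate}", subject))
    return index


def context_of(entity, index):
    """Return the sorted list of (relation, neighbor) pairs for an entity."""
    return sorted(index.get(entity, []))


index = build_index(TRIPLES)
for relation, neighbor in context_of("Babe Paley", index):
    print(relation, neighbor)
```

Even in this toy form, the entity "Babe Paley" only becomes meaningful through its relations: a friendship with Capote and a depiction in one of his books. A stored document that mentions her name without those links carries none of that context.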

And yet the typical enterprise never captures or preserves many of the causal relationships that may be critical to its continued business success, because documents, databases and other files are all stored without contextualizing the data they contain.

NLP and the role of metadata-oriented feedback loops

Text analysis or mining can add to that meaning and contribute even more to the knowledge in the graph. With the help of a strategically designed NLP process, the graph becomes the vehicle for a symbiotic feedback loop: intelligent extraction and tagging disambiguate the content, and in the process improve search and recommendations.

“While reinforcement learning from human feedback (RLHF) is valid for chatbots and LLMs, it’s just as valid for traditional machine learning and has been for decades,” Ivo says. In order to classify and accurately extract entities and relationships from text, “you need to provide a lot of human feedback.”

A gold standard content corpus for benchmarking

Once you have an enriched, annotated corpus, “that’s what you train your model against,” Ivo points out. Evaluating the model gives you the means to iteratively refine the model.
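As a rough illustration of what evaluating against a gold standard can look like, the sketch below scores two hypothetical taggers by precision, recall and F1 using exact-match comparison. The annotation tuple layout, the sample data and the `score` function are assumptions made for this example; they are not taken from Metadata Studio.

```python
def score(predicted, gold):
    """Compare predicted annotations with gold-standard ones by exact match
    and return (precision, recall, f1)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Each annotation is (document id, start offset, end offset, label).
# All values here are made up for illustration.
gold = {
    ("doc1", 0, 13, "Person"),
    ("doc1", 20, 33, "Work"),
    ("doc2", 5, 15, "Person"),
}
tagger_a = {("doc1", 0, 13, "Person"), ("doc1", 20, 33, "Work")}
tagger_b = {
    ("doc1", 0, 13, "Person"),
    ("doc2", 5, 15, "Organization"),  # right span, wrong label
    ("doc2", 40, 44, "Place"),        # not in the gold standard
}

for name, output in [("tagger_a", tagger_a), ("tagger_b", tagger_b)]:
    precision, recall, f1 = score(output, gold)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Rerunning the same comparison after each round of human corrections is one simple way to see whether a model is actually improving, and whether a different tagger would serve the corpus better.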

“We developed Metadata Studio in a way that it’s agnostic to the technology that’s used, as long as it’s exposed as a web service,” according to Ivo. Users can take advantage of Ontotext’s own Common English Entity Linking (CEEL) and other services alongside NLP tools from Google, IBM and Amazon, selecting and orchestrating a best-of-breed toolchain.
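One way to picture this kind of technology-agnostic design is a registry of interchangeable annotators that all share the same call signature. The sketch below is only an assumption about the general pattern, not Metadata Studio's actual interface: the names `Annotator`, `REGISTRY`, `service_a` and `service_b` are hypothetical, and simple keyword matchers stand in for what would in practice be calls to remote web services.

```python
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, label)
Annotator = Callable[[str], List[Annotation]]


def keyword_annotator(keywords: Dict[str, str]) -> Annotator:
    """Build a stand-in annotator that tags the first literal match of each
    keyword. A real annotator would send the text to a remote service instead."""
    def annotate(text: str) -> List[Annotation]:
        found = []
        for keyword, label in keywords.items():
            start = text.find(keyword)
            if start != -1:
                found.append((start, start + len(keyword), label))
        return sorted(found)
    return annotate


# Hypothetical registry: any callable with the Annotator signature can be
# plugged in or swapped out without touching the calling code.
REGISTRY: Dict[str, Annotator] = {
    "service_a": keyword_annotator({"Ontotext": "Organization"}),
    "service_b": keyword_annotator(
        {"Ontotext": "Organization", "GraphDB": "Product"}
    ),
}


def annotate_with(name: str, text: str) -> List[Annotation]:
    """Run the named annotator on the text."""
    return REGISTRY[name](text)


text = "Ontotext is best known for GraphDB."
for name in REGISTRY:
    print(name, annotate_with(name, text))
```

Because every annotator returns the same shape of output, their results can be compared against the same annotated corpus and the best-performing one chosen for each task.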

Because Ontotext has focused on semantic standards-based knowledge management together with NLP for over 20 years, the company has a rare perspective on what’s most important about the current phase of AI and how to address the problems associated with LLMs in an enterprise context.

 


Freelance writer

Alan Morrison is an independent consultant and freelance writer on data tech and enterprise transformation. He is a contributor to Data Science Central with over 35 years of experience as an analyst, researcher, writer, editor and technology trends forecaster, including 20 years in emerging tech R&D at PwC.