Breakthrough Moments in Enterprise Taxonomy Management

This post is republished by permission of Synaptica and originally appeared at their website. Synaptica is part of our partner ecosystem that powers the next generation of content and data management applications for many of the world’s most knowledge-intensive enterprises – all the way from Life Sciences and Financial Services to Publishing, Government and Industry.

February 2, 2024 13 mins. read David Clarke

 

Earlier this month at KMWorld in Washington DC I got to meet many of Synaptica’s clients, technology partners, as well as some new faces from the community. It’s always a pleasure to attend KMWorld because of the opportunity to connect in person, and this year was no exception.

I think it’s fair to say that LLMs dominated the discussions. We as a taxonomy community are facing rapid technological change and grappling with where our discipline fits into a world captivated by ChatGPT and other generative models. In this moment of technological change, I think it’s helpful to reflect on the challenges our professional community has already faced and solved together. I was fortunate to present a keynote talk on this topic, and I will summarise my key themes from that presentation: Breakthrough Moments in Enterprise Taxonomy Management

In my presentation, and in this blog, I outlined the challenges our community has successfully addressed, and continues to address, and highlighted a few solutions the Synaptica team has been developing. Specifically, I explore the challenges of complexity, scale, explainability, and trust.

The challenge of complexity

Knowledge models for taxonomy have responded to an increasing need for sophistication and expressiveness. The breakthroughs outlined in this section are the effort of the entire community, especially those individuals and companies who have contributed to industry standards and specifications like ANSI-NISO Z39.19, ISO 25964, and W3C RDF, SKOS, and OWL.

Presentation Slide | Complexity Breakthroughs

Terms to concepts

The journey starts with the migration from term-based taxonomies to concept-based taxonomies enabled by SKOS.

This simple step-change separated the identifier for each concept from its lexical labels. With RDF, these identifiers are represented with HTTP-URIs.

The community gained many advantages from this breakthrough. For example, the method for managing multilingual taxonomies became massively simpler (one concept could now hold multiple language labels), resulting in huge cost savings compared with the previous data model.
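To make the separation of identifier and label concrete, here is a minimal Python sketch. The class name `Concept`, the example URI, and the labels are hypothetical illustrations; they are not taken from any real vocabulary or from Graphite's data model.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Concept:
    """A SKOS-style concept: one stable identifier, many lexical labels."""
    uri: str
    # Maps a language tag to the preferred label in that language.
    pref_labels: Dict[str, str] = field(default_factory=dict)

    def label(self, lang: str, fallback: str = "en") -> str:
        """Return the preferred label for lang, or the fallback language."""
        return self.pref_labels.get(lang, self.pref_labels[fallback])


# Hypothetical concept URI; the identifier never changes when labels do.
apple = Concept(
    uri="http://example.org/concept/0001",
    pref_labels={"en": "apple", "fr": "pomme", "de": "Apfel"},
)

# Adding a language is one more label on the same concept,
# not a second copy of the taxonomy.
apple.pref_labels["es"] = "manzana"
```

Because every language hangs off the same URI, relationships and tagging results attached to the concept do not need to be duplicated per language.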

Taxonomy to ontology

Next came the migration from taxonomy to ontology.

The widespread adoption of RDF as a semantic data model transformed simple taxonomies into extensible and expressive ontologies. Once the semantics of the knowledge organization system was no longer confined to hierarchical and associative relationships, people could define any named relationship between things, and define classes of things with distinct sets of properties. Not only are these ontologies capable of representing any domain of knowledge, but the structure of the data model is also transparent and intelligible to both humans and machines.

Ontologies in RDF also support machine inferencing, allowing new knowledge to be derived from existing data.
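As a rough illustration of what inferencing means in practice, the sketch below applies two RDFS-style rules to a handful of triples held in plain Python tuples. The class and instance names are hypothetical, and a production system would use a triple store or reasoner rather than this loop.

```python
# Triples are (subject, predicate, object) tuples; all names are hypothetical.
triples = {
    ("Novel", "subClassOf", "Book"),
    ("Book", "subClassOf", "CreativeWork"),
    ("TheShining", "type", "Novel"),
}


def infer(facts):
    """Apply two RDFS-style rules until no new triples appear."""
    inferred = set(facts)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if o != s2 or p2 != "subClassOf":
                    continue
                if p == "subClassOf":
                    # Rule 1: subClassOf is transitive.
                    new.add((s, "subClassOf", o2))
                elif p == "type":
                    # Rule 2: an instance of a class is also an
                    # instance of that class's superclasses.
                    new.add((s, "type", o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Only the triples that were not stated explicitly.
derived = infer(triples) - triples
```

Here `derived` holds statements nobody entered by hand, such as the fact that the novel is also a creative work.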

Labels as things

Another step-change came when labels became things that can have their own properties. The data model evolved to support SKOS-XL.

A concrete example helps to explain the breakthrough. The person with the birth name ‘Stephen King’ wrote a novel called The Shining. The same person also wrote a novel called The Running Man, but he did so under the pen name ‘Richard Bachman’. With SKOS-XL, Stephen King and Richard Bachman are bound together as one concept, but each label also has a unique URI, making it a thing rather than a string. This enables each name to have independent properties, such as ‘Stephen King’ authorOf The Shining, and ‘Richard Bachman’ authorOf The Running Man.
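The Stephen King example can be sketched in a few lines of Python. The URIs and the `author_of` property name are hypothetical; `literal_form` mirrors the SKOS-XL idea that a label resource carries its text as a property.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """A SKOS-XL-style label: it has its own URI, so it can carry properties."""
    uri: str
    literal_form: str
    properties: dict = field(default_factory=dict)


@dataclass
class Concept:
    uri: str
    labels: list = field(default_factory=list)


# All URIs and the author_of property name are hypothetical.
king = Label("http://example.org/label/stephen-king", "Stephen King",
             {"author_of": ["The Shining"]})
bachman = Label("http://example.org/label/richard-bachman", "Richard Bachman",
                {"author_of": ["The Running Man"]})

# One concept (the person) binds both names together.
author = Concept("http://example.org/concept/author-001", [king, bachman])


def name_used_for(concept, title):
    """Return the name under which the given title was written, if any."""
    for label in concept.labels:
        if title in label.properties.get("author_of", []):
            return label.literal_form
    return None
```

With plain string labels, the question "which name was this book published under?" has nowhere to live; once labels are things, it is an ordinary property lookup.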

Relationships as things

It’s perhaps not surprising what came next. With RDF-Star, relationships became things. Using RDF-Star, the relationship joining two concepts (in graph speak, the edge between two nodes) can carry its own set of properties.

Again, a concrete example will help explain why this was a powerful breakthrough. Standard RDF lets us express the following statement: ‘Stephen King’ authorOf The Shining. With RDF-Star we can add a date to the relationship between the person and the book, and state that ‘Stephen King’ published The Shining in 1977.
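The same idea can be shown with nested tuples in Python, where a whole statement becomes the subject of another statement. This is only a sketch of the concept; the predicate name `publishedInYear` is hypothetical.

```python
# A plain RDF-style statement as a (subject, predicate, object) tuple.
statement = ("StephenKing", "authorOf", "TheShining")

# RDF-Star idea: the statement itself becomes the subject of another triple.
# In Turtle-star this would be written roughly as:
#   << :StephenKing :authorOf :TheShining >> :publishedInYear 1977 .
# The predicate name "publishedInYear" is hypothetical.
annotated = (statement, "publishedInYear", 1977)

graph = {statement, annotated}


def annotations_of(graph, target):
    """Collect the property/value pairs attached to a given statement."""
    return {p: o for s, p, o in graph if s == target}
```

Calling `annotations_of(graph, statement)` returns the properties carried by the edge itself, leaving the two nodes unchanged.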

Knowledge graphs

The final step in model complexity covered in my talk is the advent of enterprise knowledge graphs, and, more specifically, what Synaptica calls content-aware knowledge graphs.

What do we mean by content-aware knowledge graphs? When our auto-categorization engine, Graphite Knowledge Studio, tags content with metadata derived from taxonomies, we also capture and store this content metadata inside the knowledge graph, thereby linking concepts to the content they describe. As more content gets tagged, the knowledge graph expands, providing a powerful resource for business insights and analytics as well as powerful business functions like similarity indexing and recommendations.
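A minimal sketch of the idea, assuming tagging results arrive as simple (document, concept) pairs: each tag is stored as a link in both directions, and document similarity is then computed from shared concepts. The Jaccard measure used here is an assumption chosen for brevity, not a description of how Graphite Knowledge Studio computes similarity, and all document ids and tags are hypothetical.

```python
from collections import defaultdict

# concept -> set of document ids tagged with it (the "content-aware" links).
concept_to_docs = defaultdict(set)
# document id -> set of concepts, kept in step for similarity lookups.
doc_to_concepts = defaultdict(set)


def record_tag(doc_id, concept):
    """Store a tagging result in the graph as a concept-document link."""
    concept_to_docs[concept].add(doc_id)
    doc_to_concepts[doc_id].add(concept)


def similarity(doc_a, doc_b):
    """Jaccard overlap of the concepts tagged on two documents."""
    a, b = doc_to_concepts[doc_a], doc_to_concepts[doc_b]
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical documents and tags.
for doc, concept in [("doc1", "fruit"), ("doc1", "apple"),
                     ("doc2", "fruit"), ("doc2", "pear"),
                     ("doc3", "finance")]:
    record_tag(doc, concept)
```

Every new tag adds an edge, so analytics such as "which documents are about this concept" or "which documents resemble this one" become lookups over the graph.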

With great solutions, come new challenges . . .

As ontological schemas have successfully evolved to manage increasingly complex knowledge models, the user experience challenge has increased. How can tooling and interfaces make what is inherently complex appear simple and comprehensible?

Synaptica’s Graphite application provides a collaboration space for knowledge engineers and content managers to create and manage enterprise taxonomies and ontologies. Graphite’s intuitive drag-and-drop user interface provides a simplification layer on top of complex semantic data models, enabling non-experts to rapidly design and build standards-compliant knowledge organization systems.

As I will explore below, we have brought this same investment in usability to our newest product, Graphite Knowledge Studio, to embrace the same diverse user community we have long served through Graphite.

Presentation Slide | Complexity Breakthrough: UI to make the Complex Comprehensible

The challenge of scale

Some organizations will also face the challenge of scale.

Extreme scale

Many, if not most, enterprise taxonomies have under 10,000 concepts. This is depicted by the barely visible red bar on the left below. Synaptica has a few clients with large taxonomies, on the order of 100,000 concepts. This is represented by the small bar in the middle below. This year Synaptica on-boarded clients with extreme scale taxonomies, in the order of 10 million concepts. This is represented by the tall bar on the right below.

Responding to this challenge, Synaptica’s engineers and design team set about a massive re-engineering project. It involved refactoring queries to deliver performant search and browse response, but it also required us to rethink the user experience and workflow. What happens if a user is clicking through a hierarchy and lands on a concept with a million related concepts? What happens when a concept can participate in hundreds of different relationship types? Our engineers developed innovative solutions to each of these challenges.

Adaptive navigation

Graphite now features Adaptive Navigation when exploring both hierarchical and associative links. Click by click, Adaptive Navigation dynamically assesses the scale of data beneath each link.

If it detects scale that cannot be meaningfully rendered as a browse experience, then it adapts by switching the user to a search-inside control. Search-inside searches inside the set of concepts that are beneath a concept in a hierarchy or via an associative relationship.

Presentation Slide | Scale Breakthrough: Adaptive Navigation

In the example below a concept is related to 1,058,231 other concepts. Entering a keyword into the search-inside control filters the results down to a manageable 95 related concepts.

Presentation Slide | Scale Breakthrough: Adaptive Navigation
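The switch from browsing to search-inside can be sketched as a simple threshold check. The threshold value and function names below are assumptions for illustration; the post does not state the cut-off Graphite actually uses.

```python
BROWSE_LIMIT = 500  # Hypothetical threshold; the real cut-off is not stated.


def navigate(related, keyword=None, limit=BROWSE_LIMIT):
    """Return ("browse", items) for small sets, else a search-inside result."""
    if len(related) <= limit:
        return "browse", sorted(related)
    if keyword is None:
        # Too many items to render: ask the user for a keyword instead.
        return "search-inside", []
    needle = keyword.lower()
    return "search-inside", sorted(c for c in related if needle in c.lower())


# A small set is browsed directly; a large one falls back to search-inside.
small = {"apple", "pear", "orange"}
large = {f"concept {i}" for i in range(10_000)}
```

For example, `navigate(small)` returns the full sorted list in browse mode, while `navigate(large, "999")` returns only the related concepts whose labels contain the keyword.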

The user interface also adapts to the scale of relational predicates. SKOS has 12 relationship types, and though most enterprises have some additional relationships, the scale of these is still generally manageable for the UI to display all types and provide drag-and-drop functionality. One of the extreme scale taxonomies used by a Graphite client has 662 relational predicates. Instead of attempting to display all when only a few are needed, the UI adapts to display only populated predicates while allowing the user to quickly add additional ones.

Presentation Slide | Scale Breakthrough: Adaptive Navigation

Search performance

The last breakthrough for the challenge of scale is search performance. Even though extreme scale taxonomies are thousands of times larger than normal enterprise taxonomies, taxonomists still require performant response times on these taxonomies for the tooling to be usable.

Synaptica engineers responded to this need with optimized search indexes and queries. In benchmark tests against taxonomies with over 7 million concepts, sub-second response times are delivered where the number of concepts returned is under 100, and result sets containing thousands of concepts are returned in under 4 seconds.

The challenge of explainability

With increased reliance on machine learning algorithms, the community faces the newest challenge: that of explainability.

As we extend our products to harness the power of machine learning at scale, Synaptica teams have considered carefully how to best marry the powers of information science with the advantages of data science without falling prey to the “black box.” This thinking is evident in our newest product, Graphite Knowledge Studio – an extension to Graphite, which enables users to tag content with enterprise taxonomies at scale.

Synaptica’s design principle for successful autocategorization systems is based on three pillars:

      1. explainable results
      2. transparent rules and
      3. rapid iteration

Explainable results

Within Graphite Knowledge Studio, we have achieved explainable results by giving the taxonomist the ability to inspect inline annotations, relate these inline annotations to inferred document-level classifications (displayed with confidence scores), and relate all tagging results to specific taxonomy concepts and metadata.

Presentation Slide | Explainability Breakthrough: Explainable Results

This transparency and explainability is key to extending the user community beyond engineers, data scientists, or others with coding and scripting skills. This expansion allows enterprises to better leverage the deep content knowledge of non-technical subject matter specialists while freeing enterprise technical resources (engineers, data scientists) for other projects.

Transparent rules

The configuration of enterprise taxonomies to power these explainable algorithms relies on SKOS and its extensibility through custom properties.

SKOS provides a widely adopted ontology schema for creating taxonomies. Some of its properties and relationships can be used to facilitate autocategorization. For example, SKOS prefLabels and altLabels can be used to find concept matches within document texts. The SKOS hierarchical structure can be used to infer generalized aboutness classifications (e.g., a document that specifically mentions ‘apples’, ‘pears’, and ‘oranges’, could be inferred to be about the more general broader concept of ‘fruit’, even if ‘fruit’ is not specifically found in the document).
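The two mechanisms described above, label matching and hierarchical inference, can be sketched as follows. The mini-taxonomy, the whole-word matching, and the `min_children` threshold are all assumptions made for this illustration rather than the behaviour of any particular tagger.

```python
import re
from collections import Counter

# Hypothetical mini-taxonomy: concept -> labels, and concept -> broader concept.
labels = {
    "apple": ["apple", "apples"],
    "pear": ["pear", "pears"],
    "orange": ["orange", "oranges"],
    "fruit": ["fruit"],
}
broader = {"apple": "fruit", "pear": "fruit", "orange": "fruit"}


def tag(text):
    """Return concepts whose pref/alt labels occur as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {c for c, forms in labels.items() if words & set(forms)}


def infer_aboutness(matched, min_children=2):
    """Infer a broader concept when enough of its narrower concepts match."""
    counts = Counter(broader[c] for c in matched if c in broader)
    return {parent for parent, n in counts.items() if n >= min_children}


doc = "We compared apples, pears, and oranges from three orchards."
matched = tag(doc)                  # concepts found directly in the text
inferred = infer_aboutness(matched) # broader concepts implied by them
```

The document never mentions the word for the broader concept, yet it is inferred because several of its narrower concepts were matched.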

But the SKOS specification was written for human readability and falls short when it comes to some of the machine-readable rules needed by auto-categorizers. For example, SKOS doesn’t support positive or negative contexts (essential for disambiguation), or textual patterns for novel entity extraction, or proximity rules and relevance ranking.

Synaptica’s Graphite Knowledge Studio responds to this challenge by extending the SKOS ontology with additional predicates to support the needs of autocategorizers.
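To show what a positive and negative context rule might look like, here is a small Python sketch. The keys `positive_context` and `negative_context`, the concepts, and the matching logic are hypothetical; they are not the predicates Graphite Knowledge Studio actually defines.

```python
import re

# Hypothetical rule set for an ambiguous label. The keys "positive_context"
# and "negative_context" are illustrative names, not Graphite predicates.
rules = {
    "Apple Inc.": {
        "label": "apple",
        "positive_context": {"iphone", "mac", "shares"},
        "negative_context": {"orchard", "pie", "cider"},
    },
    "apple (fruit)": {
        "label": "apple",
        "positive_context": {"orchard", "pie", "cider"},
        "negative_context": {"iphone", "mac", "shares"},
    },
}


def disambiguate(text):
    """Keep a concept only if its label and a positive context word appear
    in the text and no negative context word does."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = []
    for concept, rule in rules.items():
        if rule["label"] not in words:
            continue
        if words & rule["negative_context"]:
            continue
        if words & rule["positive_context"]:
            hits.append(concept)
    return hits
```

Because each rule is a short, readable set of words attached to a concept, a taxonomist can see exactly why a tag fired or did not fire and adjust the rule accordingly.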

The rules are simple and transparent, which empowers the taxonomist with direct control over how the tagger works.

Presentation Slide | Explainability Breakthrough: Transparent Rules

Rapid iteration

The third pillar for successful autocategorization is the ability to support rapid iteration. With explainable results and transparent rules, the taxonomist can quickly jump back from reviewing inline annotations to modifying the tagging rules in the taxonomy management system. As soon as the taxonomy is modified the changes are immediately available to the autocategorization tagger, enabling the taxonomist to see the results and make further refinements if required.

Presentation Slide | Explainability Breakthrough: Rapid Iteration

You can read more about the experience of a Taxonomist managing this iterative process in a recent blog post by Sarah Downs, our Director of Client Solutions: Graphite Knowledge Studio: Putting Taxonomists in the Driver’s Seat

The challenge of trust

Our last challenge is the challenge of trust, which has come to the forefront as people adopt Large Language Models (LLMs). Generative AI in general, and LLMs specifically, pose challenges around data privacy, the opaqueness of data sources, and the veracity of results.

Opaqueness of data sources

Synaptica’s technology partner Squirro tackled opaqueness of sources head on when it introduced its LLM implementation called SquirroGPT. Synaptica have embraced this technology and are currently developing a chat-based interface to Synaptica’s user guides, tech docs, and public facing blogs and website articles. In the example below, we asked SynapticaGPT to explain the difference between when to use SKOS and when to use OWL. The short summary it provides is pretty accurate, but what we’ve highlighted here is how the GPT cites the sources it used to generate the answer. Linking answers to sources goes a long way toward establishing trust with Generative AI (GenAI) technology.

Presentation Slide | Trust Breakthrough: Cited Sources ref: Squirro

Veracity of Results

The veracity of GenAI output exposes another trust challenge to the adoption of tools like ChatGPT.

In the example below I asked the Microsoft Bing Image Creator to create images of “Dean Allemang explaining LLMs” (Dean delivered a brilliant keynote on ontologies and LLMs on day one of Taxonomy Bootcamp).

Presentation Slide | Generative AI: Veracity - Images of Unusual Suspects

The image generator did not have an image of Dean Allemang so it created images it thought were appropriate (i.e., it made them up). This phenomenon has been widely discussed this year. As millions of people started using ChatGPT, we quickly discovered that while ChatGPT’s general knowledge is incredible, it also makes up answers that are not evidence-based and delivers them with convincing confidence, creating distrust as to whether any particular answer is true or made up.

One way to improve the veracity and relevance of LLMs is to combine them with taxonomies and ontologies. This was one of the key themes in Dean’s keynote (he has also published a research paper with benchmarks which is publicly available at https://arxiv.org/abs/2311.07509).

Three memorable takeaways from Dean’s Taxonomy Bootcamp keynote (my paraphrase):

Who understands why we need taxonomy? ChatGPT does

Combining LLMs with an ontology can massively boost accuracy (37%+ in a specific test case)

You won’t lose your job to an AI; you’ll lose it to a person who knows how to use AI . . . a call to action for all taxonomists.

Balancing trust with the power of LLMs

Synaptica’s Graphite Knowledge Studio enables enterprises to both perform taxonomy-based categorization and access LLMs to support novel entity extraction by framing discovery questions that the LLM will use to identify entities not yet known to the taxonomy.

Presentation Slide | Trust Breakthrough: Combining LLMs and Taxonomy

Enabling breakthroughs with tech stack diversity

As an important (and adorable) aside: I opened my keynote with a photo of my two beautiful dogs (Maximus and Huxley), together with their ‘Heinz 57’ genetic make-up.

Presentation Slide | Diversity DNA of dogs, Maximus and Huxley

I assure you this is an actual photograph. I received many questions at KMWorld about whether this was AI-generated art.

More importantly, this image serves as a metaphor for tech stack diversity. An important aspect of Synaptica’s technology strategy is to provide a platform of diverse tooling that is flexible and open to newcomers. NLP and AI have been around for decades, but the pace of development has accelerated to warp-speed. Our Graphite taxonomy management system is designed to connect to an expanding ecosystem of NLP and LLM tools such as spaCy and OpenAI’s ChatGPT.

Presentation Slide | Tech Stack Diversity

This tech stack diversity and the flexibility it enables have been key to our ability to respond to the challenges and breakthrough moments for enterprise taxonomies:

The challenge of complexity

The challenge of scale

The challenge of explainability

The challenge of trust

Synaptica has been collaborating with businesses and organizations around the world for over twenty-five years to solve the evolving challenges of enterprise taxonomy management and semantic categorization.

It is an ongoing journey of innovation. If your organization is embarking on or managing taxonomy and categorization projects, then please reach out to the Synaptica team. We will be happy to share our experience and would like to show you the solutions we have developed.

Co-founder and CEO at Synaptica

David served on the authoring committee of the 2005 version of the US national standard for controlled vocabularies, ANSI/NISO Z39.19. He leads research and development at Synaptica, including software solutions for taxonomy and ontology management, text analytics and auto-categorization, image annotation and indexing, and Linked Data management.