Jem Rayfield, Chief Solution Architect at Ontotext, provides technical insights into the Ontotext Platform and its design choices.
The quality of semantic annotations has a direct impact on the applications that use them. This blog post focuses on the platform’s Curation Tool and its role in improving annotation quality.
As discussed in my previous post, the Ontotext Platform is often required to process and reprocess millions of unstructured content items using the platform’s text analytics (TA) components.
An unstructured content archive may need to be processed or re-processed to discover and add additional knowledge or to train a machine learning model. In these scenarios, Ontotext’s text analytics components may well create tens of billions of annotations that need to be processed, re-processed and stored quickly, with little or no impact on a live, running knowledge graph.
Text analytics annotation metadata adds information to unstructured content at some level of granularity: a word or phrase, a paragraph or section, an entire document, a polygon within a Scalable Vector Graphic or perhaps a time-code within a video. Annotations connect unstructured content to concepts and background knowledge stored within a GraphDB knowledge graph, an ontology or a gazetteer (a text analytics dictionary).
The TA services represent machine suggestions using an extended version of the W3C Web Annotation (WA) Data Model, serialized as JSON-LD.
Annotations capture the semantic fingerprint of unstructured content: the structured knowledge contained within fragments of unstructured content, expressed as URI references into a GraphDB knowledge graph. Annotations also include quantitative attribution, such as confidence or relevance scores, to support comparison.
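A minimal sketch of what such an extended suggestion might look like, using standard Web Annotation properties plus hypothetical confidence and relevance extension fields (the example URIs and the exact extension vocabulary are assumptions, not the platform’s actual schema):

```python
import json

# A hedged sketch of a TA suggestion expressed as a W3C Web Annotation (JSON-LD).
# The URIs are placeholders and the "confidence"/"relevance" fields are
# hypothetical extensions to the standard model.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "classifying",
    # The body points at a knowledge graph entity by URI (placeholder URI).
    "body": "http://example.org/resource/Federal_Court_of_Australia",
    "target": {
        # The annotated unstructured content item (placeholder URI).
        "source": "http://example.org/content/article-123",
        "selector": {"type": "TextPositionSelector", "start": 62, "end": 89},
    },
    # Hypothetical quantitative attribution used for comparison and curation routing.
    "confidence": 0.83,
    "relevance": 0.77,
}

print(json.dumps(annotation, indent=2))
```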
TA annotation suggestions are published as events to an event queue to allow processing to be performed asynchronously. Suggestion events are consumed and in some cases moderated by a team of curators. Curators moderate suggestions using the Ontotext Platform Curation Tool.
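As a rough illustration of this asynchronous flow, and assuming a generic in-memory queue in place of the platform’s actual event broker (which is not named here), publishing and consuming suggestion events might look like this:

```python
import json
import queue

# Stand-in for the real event broker; an in-memory queue is used purely for illustration.
suggestion_queue: "queue.Queue[str]" = queue.Queue()

def publish_suggestion(annotation: dict) -> None:
    """Publish a TA annotation suggestion as an event for asynchronous curation."""
    suggestion_queue.put(json.dumps(annotation))

def consume_suggestions():
    """Drain pending suggestion events, e.g. to feed the Curation Tool."""
    while not suggestion_queue.empty():
        yield json.loads(suggestion_queue.get())

# Placeholder suggestion event.
publish_suggestion({"body": "http://example.org/resource/Australia", "confidence": 0.66})
for suggestion in consume_suggestions():
    print(suggestion["body"], suggestion["confidence"])
```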
Curation moderation aims to improve annotation quality, vocabulary depth and breadth and ultimately text analytics precision and recall.
The Curation process has a small set of general principles:
Annotation concepts may represent single entities, such as a Person, Location or Organization, or a more complex relationship, such as “a Person’s Role within an Organization” or “a Company’s Merger with another Company”.
Annotations may require curation when they are ambiguous or “fuzzy”, or when their confidence scores fall below a configurable threshold. There may also be word-sense disagreement between isolated curators.
It is important that ambiguous edge cases are fixed and rationalized to keep annotation precision as high as possible.
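A minimal sketch of routing low-confidence suggestions to curation, as mentioned above, assuming a hypothetical confidence field and an illustrative threshold value:

```python
CONFIDENCE_THRESHOLD = 0.75  # configurable; the value here is illustrative

def needs_curation(annotation: dict, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Route a suggestion to human curation when the machine is not confident enough."""
    return annotation.get("confidence", 0.0) < threshold

# Placeholder suggestions for the surface form "Amazon".
suggestions = [
    {"body": "http://example.org/resource/Amazon.com", "confidence": 0.91},
    {"body": "http://example.org/resource/Amazon_River", "confidence": 0.42},
]
to_curate = [s for s in suggestions if needs_curation(s)]
auto_candidates = [s for s in suggestions if not needs_curation(s)]
print(len(to_curate), "suggestion(s) routed to curators")
```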
When dealing with sentiment annotations, the following text could be annotated with a “sad” or perhaps a “repulse” sentiment. Annotation guidelines may also deem “repulse” an inappropriate sentiment when associated with text covering sensitive subjects, in which case “repulse” would need to be moderated and removed. The Curation Tool allows a curator to modify such “fuzzy” text analytics suggestions to keep quality high.
"An Indian woman was allegedly set on fire and killed by her husband’s family because she was too dark-skinned and they wanted him to remarry a fairer bride."
Text fragments can also belong to multiple concepts or categorizations simultaneously. Overlapping annotations increase flexibility and allow the most knowledge to be captured. Curators must be able to visualize the different overlaps so that each annotation can be reviewed and moderated carefully.
For example, the following text includes the Federal Court of Australia (the Organization) and Australia (the Location).
..... ....."Access to this website has been disabled by an order of the Federal Court of Australia because it infringes or facilitates the infringement of copyright," Telstra's landing message reads...... .....
The Curation Tool supports the moderation of these overlapping suggested annotations. It is possible to moderate both the annotation referencing the Organization Federal Court of Australia and the annotation referencing the Location Australia.
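As an illustration, the two overlapping annotations over the Telstra passage might be represented with text-position selectors along these lines (the character offsets and URIs are illustrative, not taken from the platform):

```python
# Character offsets and URIs are illustrative only.
text = ('"Access to this website has been disabled by an order of the '
        'Federal Court of Australia because it infringes or facilitates '
        'the infringement of copyright," Telstra\'s landing message reads.')

fca = "Federal Court of Australia"
fca_start = text.index(fca)
aus_start = text.index("Australia")  # first occurrence sits inside the court-name span

overlapping = [
    {   # Organization annotation over the full court name
        "body": "http://example.org/resource/Federal_Court_of_Australia",
        "selector": {"type": "TextPositionSelector",
                     "start": fca_start, "end": fca_start + len(fca)},
    },
    {   # Location annotation over the nested "Australia" span
        "body": "http://example.org/resource/Australia",
        "selector": {"type": "TextPositionSelector",
                     "start": aus_start, "end": aus_start + len("Australia")},
    },
]

def overlaps(a: dict, b: dict) -> bool:
    """True when two annotations' character spans intersect."""
    return (a["selector"]["start"] < b["selector"]["end"]
            and b["selector"]["start"] < a["selector"]["end"])

assert overlaps(overlapping[0], overlapping[1])  # the Location span nests inside the Organization span
```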
The system supports a team of curators who moderate a stream of annotated content to eliminate annotation errors. Curators are usually subject matter experts (SMEs) working to an agreed annotation guideline to support data quality service level agreements.
The following screenshot depicts a stream of unstructured content items ready to be pulled from the curation queue. These content items have been annotated by the “machine” and are selected and ready for the human curation process.
The Curation Tool allows curators to accept or reject novel concepts, and also allows them to add missing annotations or remove incorrect ones.
The following Curation Tool screenshot depicts how a particular piece of text – “Amazon” – includes multiple suggested overlapping candidate annotations. In this particular case, the candidate annotation is set as the Organization Amazon.com, which references a GraphDB knowledge graph entity by its URI. The Curation Tool suggests alternative annotations (ordered by relevance/confidence); in this case, four “Amazon” Locations, six “Amazon” Organizations and two “Amazon” People annotations.
Curators are able to moderate a candidate annotation by replacing it. In this example, the Organization Amazon.com annotation is replaced by one of the other suggested annotations, such as the Person Amanda Knox. Curators are also able to select a different candidate from the supplied list (classified by type) or search for a missing candidate across the entire knowledge graph.
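To make the candidate list concrete, here is a rough sketch of how alternative candidates might be grouped by entity type and ordered by descending confidence before being presented to a curator (the candidates and field names are illustrative):

```python
from collections import defaultdict

# Illustrative candidate suggestions for the surface form "Amazon".
candidates = [
    {"uri": "http://example.org/resource/Amazon.com", "type": "Organization", "confidence": 0.88},
    {"uri": "http://example.org/resource/Amazon_River", "type": "Location", "confidence": 0.61},
    {"uri": "http://example.org/resource/Amazon_Rainforest", "type": "Location", "confidence": 0.54},
    {"uri": "http://example.org/resource/Amazon_Web_Services", "type": "Organization", "confidence": 0.47},
]

def group_and_rank(candidates: list) -> dict:
    """Group candidates by entity type and order each group by descending confidence."""
    grouped = defaultdict(list)
    for candidate in candidates:
        grouped[candidate["type"]].append(candidate)
    return {entity_type: sorted(group, key=lambda c: c["confidence"], reverse=True)
            for entity_type, group in grouped.items()}

for entity_type, ranked in group_and_rank(candidates).items():
    print(entity_type, [c["uri"] for c in ranked])
```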
When independent curators reach a level of agreement (inter-annotator agreement), annotations are automatically accepted or rejected. If curators disagree (detected by configurable consensus rules), an administrator/supervisor can override the team’s decisions and ensure that annotations are confirmed as accepted or rejected. Disagreements can occur due to word-sense ambiguity or the characteristics of the curators, who may differ in their familiarity with the material, amount of training, motivation, interest or fatigue. An administrator/supervisor must be an SME with a full understanding of the annotation guidelines and the domain, so that their overrides are of the highest quality.
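A minimal sketch of a consensus rule of this kind, assuming a simple majority-agreement threshold (the threshold value and vote labels are assumptions):

```python
from collections import Counter

def consensus_decision(votes: list, min_agreement: float = 0.75) -> str:
    """
    Apply a simple consensus rule to independent curator votes on one annotation.
    `votes` is a list of "accept"/"reject" strings; `min_agreement` is a configurable
    threshold. Returns the agreed decision, or "escalate" so that an
    administrator/supervisor can resolve the conflict.
    """
    decision, count = Counter(votes).most_common(1)[0]
    return decision if count / len(votes) >= min_agreement else "escalate"

print(consensus_decision(["accept", "accept", "accept", "reject"]))  # accept
print(consensus_decision(["accept", "reject"]))                      # escalate
```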
The following Curation Tool screenshot depicts how an administrator can override conflicting curator annotations; in this particular example, by removing the Person Amanda Knox annotation, which conflicts with the Organization Amazon.com annotation.
Annotation conflict resolution is normally only required for a small number of edge cases but is an important catch for exceptions.
Automatically accepted, refined and moderated annotations are fed back into the “machine” to improve knowledge graph and text analytics vocabularies, corpora and statistical models. The following diagram describes this cyclic continuous improvement flow:
The cycle continuously adapts machine-learned statistical models to improve F1 scores.
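For reference, the F1 score is the harmonic mean of the precision and recall mentioned earlier. A minimal calculation from curation outcomes might look like this (the counts are illustrative):

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts derived from curator decisions: accepted machine suggestions
# count as true positives, rejected ones as false positives, and curator-added
# annotations (missed by the machine) as false negatives.
print(round(f1_score(true_positives=850, false_positives=90, false_negatives=60), 3))
```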
Ontotext’s text analytics platform not only discovers known entities within a knowledge graph but it is also able to detect and classify novel unknown entities and relationships. Novel entities are published and fed back into the GraphDB knowledge graph. When discovered by the platform’s text analytics components, they can subsequently go through the moderation process. The Curation Tool detects that the entities are not present within the knowledge graph and allows curators to accept and augment the knowledge graph vocabulary. Ontotext’s text analytics components are continuously updated with new entities. The text analytics architecture includes a configurable Dynamic Gazetteer that allows domain vocabularies (including URIs) to be synchronized in near real-time from the knowledge graph. Synchronization ensures that new entities that are included within the knowledge graph are also present and detected within unstructured content processed by the text analytics components.
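As a rough sketch of the kind of synchronization involved, and assuming an illustrative GraphDB endpoint, repository name and labelling property (the platform’s actual Dynamic Gazetteer configuration is not shown here), new entity labels and URIs could be pulled from the knowledge graph with a SPARQL query such as:

```python
import requests

# Hypothetical GraphDB SPARQL endpoint and repository name.
GRAPHDB_SPARQL = "http://localhost:7200/repositories/knowledge-graph"

# Illustrative query: in a real deployment the labelling property and any
# filtering of newly added entities would follow the domain vocabulary.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label
WHERE { ?entity rdfs:label ?label }
"""

def fetch_gazetteer_entries():
    """Return (label, URI) pairs to refresh the gazetteer from the knowledge graph."""
    response = requests.get(
        GRAPHDB_SPARQL,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [(b["label"]["value"], b["entity"]["value"]) for b in bindings]

for label, uri in fetch_gazetteer_entries():
    print(label, uri)
```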
As I stated at the beginning of this post, the quality of semantic annotations has a direct impact on the applications that use them. For example, machine learning algorithms will learn how to make mistakes if their models are trained on a poor-quality golden corpus that includes ambiguity and errors. Linguistic analysis will be misled if annotations are incorrect, and the text analytics results will be poor. Search and discovery applications that query annotations directly will return false positives if quality is not maintained via curation moderation.
The Ontotext Platform Curation Tool supports the analysis of several curation and text analytics metrics.
Curation separates the text analytics wheat from the chaff: it preserves quality information and removes noise to produce quality annotations that accurately capture the knowledge locked within unstructured content.
Ontotext’s Curation Tool increases text analytics transparency and accountability and provides key management metrics. It offers insights into curator teams’ usage patterns. It also makes it possible to analyze how the Ontotext text analytics platform and its algorithms are performing and what steps are required to improve precision and recall.
Ontotext’s Curation Tool is integrated with Ontotext’s text analytics and knowledge graph instance management tools in such a way that the platform can automatically learn from user feedback. It also provides mechanisms to automate the addition of new concepts or edit existing concepts so that they’re suggested automatically.
The ability to take data and to understand, process, moderate, clean, visualize and extract value from it ensures that the Ontotext Platform captures your business knowledge and value accurately.