Semantic Biomedical Tagger is an information extraction system designed to create semantic annotations in biomedical texts using more than 100 semantic types.
Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc., to a whole document or to document snippets. It provides additional information (metadata) about an existing piece of data. Compared to tagging, which adds relevance and precision to the retrieved information, semantic annotation goes one level deeper:
Semantic annotation helps to bridge the gap between the ambiguity of natural language when expressing notions and their computational representation in a formal language. By telling a computer how data items are related and how these relations can be evaluated automatically, it becomes possible to process complex filter and search operations.
Figure 1 shows a PubMed article after it was annotated by Semantic Biomedical Tagger and presented in Linked Life Data (LLD). Structured data, such as the author of the article and the journal that published it, are taken from the PubMed source. In the text, some words are automatically recognized by Semantic Biomedical Tagger as mentions of entities from the LLD data. For instance, COPD and umls:C000496 are part of the LLD ontology and data. Such metadata for a particular article allows complex processing, search and filtering operations in LLD.
Semantic Biomedical Tagger has a built-in capability to recognize 135 biomedical entity types and semantically link them to a knowledge base, in this case LLD. Semantic Biomedical Tagger can load entity names from the LLD service or from any other RDF database with a SPARQL endpoint.
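As a rough illustration of loading entity names from an RDF database, the sketch below builds a SPARQL query for the labelled instances of one class and reads the names out of a response in the standard SPARQL JSON results format. The use of `rdfs:label`, the function names, the `example.org` URIs and the sample response are assumptions made for illustration; they are not taken from Semantic Biomedical Tagger or from the LLD schema.

```python
import json


def build_label_query(class_uri: str, limit: int = 1000) -> str:
    """Return a SPARQL query selecting instances of class_uri and their labels."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?instance ?label WHERE {\n"
        f"  ?instance a <{class_uri}> ;\n"
        "            rdfs:label ?label .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def parse_labels(sparql_json: str) -> dict:
    """Map each label to its instance URI, given a SPARQL JSON results document."""
    bindings = json.loads(sparql_json)["results"]["bindings"]
    return {b["label"]["value"]: b["instance"]["value"] for b in bindings}


# Hypothetical response body, shaped like the SPARQL JSON results format.
sample_response = json.dumps({
    "head": {"vars": ["instance", "label"]},
    "results": {"bindings": [
        {
            "instance": {"type": "uri", "value": "http://example.org/instance/1"},
            "label": {"type": "literal", "value": "COPD"},
        }
    ]},
})

query = build_label_query("http://example.org/class/Disease", limit=10)
names = parse_labels(sample_response)
```

In practice the query would be sent to the endpoint over HTTP; that step is left out here so the sketch makes no assumptions about a particular endpoint URL.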
The LLD service supports content negotiation, a mechanism defined in the HTTP specification that makes it possible to serve different versions of a document (or more generally, a resource) with the same URI, so that user agents can specify which version fits their capabilities best.
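A minimal sketch of content negotiation from the client side: the same URI is requested twice, and only the HTTP `Accept` header differs. The resource URI is a placeholder, the helper name is hypothetical, and which media types the LLD service actually serves is not stated here, so the two values below are assumptions. The requests are only constructed, not sent.

```python
from urllib.request import Request


def negotiated_request(resource_uri: str, media_type: str) -> Request:
    """Build a GET request that asks for one representation of a resource."""
    return Request(resource_uri, headers={"Accept": media_type})


# Placeholder URI; a real client would use a resource URI from the service.
uri = "http://example.org/resource/entity-1"

rdf_request = negotiated_request(uri, "application/rdf+xml")  # machine-readable
html_request = negotiated_request(uri, "text/html")           # human-readable
```

Passing either request to `urllib.request.urlopen` would perform the call; the server then picks the representation that matches the `Accept` header.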
Semantic Biomedical Tagger creates semantic annotations that have names (Annotation type) and features: class (URI), instance (URI), and string (instance label). Both URIs can be further explored in the LLD service.
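The annotation structure described above can be pictured as a small record. This is only an illustrative data model: the class and field names, the character offsets, and the `example.org` URIs are assumptions, not the tagger's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticAnnotation:
    annotation_type: str  # name of the annotation (Annotation type)
    class_uri: str        # "class" feature: URI of the semantic type
    instance_uri: str     # "instance" feature: URI of the matched entity
    string: str           # "string" feature: the instance label
    start: int            # assumed: start offset of the mention in the text
    end: int              # assumed: end offset (exclusive)


text = "Patients with COPD were enrolled."
start = text.index("COPD")

annotation = SemanticAnnotation(
    annotation_type="Disease",                     # hypothetical type name
    class_uri="http://example.org/class/Disease",  # placeholder URI
    instance_uri="http://example.org/instance/1",  # placeholder URI
    string="COPD",
    start=start,
    end=start + len("COPD"),
)
```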
LifeNERC uses two strategies to achieve high precision and recall of semantic annotations:
1. Using pre-existing knowledge retrieved from the LLD service, the LD-Gazetteer analyzes texts and recognizes mentions of entities. This semantic annotation process comprises:
Due to the specific nature of biomedical texts, the LD-Gazetteer is divided into two distinct processing resources – standard-gaze and abbr-gaze.
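The sketch below illustrates dictionary-based recognition with two separate lookups. How standard-gaze and abbr-gaze actually differ is not described here; the sketch assumes, purely for illustration, that full names are matched case-insensitively while abbreviations are matched case-sensitively. The dictionaries, URIs and function name are hypothetical.

```python
import re

# Hypothetical dictionaries; a real gazetteer would be loaded from the knowledge base.
STANDARD_TERMS = {
    "chronic obstructive pulmonary disease": "http://example.org/instance/1",
}
ABBREVIATIONS = {
    "COPD": "http://example.org/instance/1",
}


def find_mentions(text: str) -> list:
    """Return (start, end, matched text, instance URI) tuples in text order."""
    mentions = []
    # Assumed behaviour: full names match regardless of letter case.
    for term, uri in STANDARD_TERMS.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((m.start(), m.end(), m.group(0), uri))
    # Assumed behaviour: abbreviations must match exactly, including case.
    for abbr, uri in ABBREVIATIONS.items():
        pattern = r"\b" + re.escape(abbr) + r"\b"
        for m in re.finditer(pattern, text):
            mentions.append((m.start(), m.end(), m.group(0), uri))
    return sorted(mentions)


sample = "Chronic obstructive pulmonary disease (COPD) is common; copd in lowercase is skipped."
found = find_mentions(sample)
```

Treating abbreviations more strictly is one common way to limit false positives from short strings, which is why it was chosen for this example.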
2. Bio-Taggers are machine learning components that identify entity mentions in text that have not yet been recognized by the LD-Gazetteer dictionary. Each tagger produces temporary annotation types, which are later post-processed into the final annotation types. Integrating Bio-Taggers into Semantic Biomedical Tagger significantly increases entity recognition precision and recall, which are among the major weaknesses of dictionary-based approaches.
The following annotation types are supported by the Bio-Taggers component:
Since the Bio-Taggers approach relies on "looser" rules than the strict matching of the LD-Gazetteer, it is hard to link the entities to a valid instance URI. Instead, entity instances get an auto-generated value that is a concatenation of the namespace and the local name.
Although Bio-Taggers significantly improve the precision and recall of Semantic Biomedical Tagger, they also make it harder to disambiguate semantic annotations whose offsets overlap. There is no single solution to this problem: two short spans can describe two completely different entities, while the same text taken as one long span can form another, equally meaningful, entity.
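The concatenation can be sketched as follows. The namespace value and the way the local name is derived from the mention text (replacing runs of non-alphanumeric characters with underscores) are assumptions for illustration only; the actual generation rule is not specified here.

```python
import re


def generated_instance_uri(namespace: str, mention: str) -> str:
    """Concatenate a namespace with a local name derived from the mention text."""
    # Assumed normalization: collapse anything that is not a letter or digit
    # into a single underscore so the local name is safe inside a URI.
    local_name = re.sub(r"[^A-Za-z0-9]+", "_", mention.strip()).strip("_")
    return namespace + local_name


# Hypothetical namespace for auto-generated instances.
uri = generated_instance_uri("http://example.org/generated/", "IL-2 receptor")
```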
The transform priority component offers a possible compromise between the number of semantic annotations and their completeness and disambiguates LD-Gazetteer and Bio-Taggers annotation types.
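One simple way to picture priority-based disambiguation is a greedy pass that keeps the highest-priority annotation in each group of overlapping spans. This is a sketch under stated assumptions, not the transform priority component itself: the numeric priorities, the tie-breaking rule (longer span first) and all names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int            # start offset in the text
    end: int              # end offset (exclusive)
    annotation_type: str
    priority: int         # assumed: a higher value wins on overlap


def resolve_overlaps(spans: list) -> list:
    """Keep a non-overlapping subset, preferring priority, then span length."""
    kept = []
    ordered = sorted(spans, key=lambda s: (-s.priority, -(s.end - s.start), s.start))
    for candidate in ordered:
        # A candidate survives only if it is disjoint from every kept span.
        if all(candidate.end <= k.start or candidate.start >= k.end for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda s: s.start)


candidates = [
    Span(0, 4, "Gene", priority=2),      # short, high-priority mention
    Span(0, 12, "Protein", priority=1),  # longer mention overlapping the first
    Span(20, 25, "Disease", priority=1), # unrelated mention elsewhere
]
resolved = resolve_overlaps(candidates)
```

Swapping the sort key so that span length comes before priority would favour completeness over the number of annotations, which is the trade-off described above.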
Semantic Biomedical Tagger can be integrated as a separate module in different applications:
LifeNERC extracts biomedical entities by means of symbolic methods (JAPE rules) and probabilistic models (machine learning). The main system components are: