Ontotext

Semantic Biomedical Tagger

Semantic Biomedical Tagger is an information extraction system, designed to create semantic annotations in biomedical texts using more than 100 different semantic types.

Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc., to a whole document or document snippets. It provides additional information (meta-data) about an existing piece of data. Compared to tagging, which adds relevance and precision to the retrieved information, semantic annotation goes one level deeper:

  • It enriches the unstructured or semi-structured data with a context that is further linked to the domain structured knowledge.
  • It makes it possible to return results that are not explicitly related to the original search.

Semantic annotation helps to bridge the gap between the ambiguity of natural language used to express notions and their computational representation in a formal language. Once a computer is told how data items are related and how these relations can be evaluated automatically, it can process complex filter and search operations.

Semantic Annotations

Figure 1 shows a PubMed article after it was annotated by Semantic Biomedical Tagger and presented in Linked Life Data (LLD). Structured data, such as the author of the article and the journal that published it, are taken from the PubMed source. In the text, some words are automatically recognized by Semantic Biomedical Tagger as mentions of entities from the LLD data. For instance, COPD and umls:C000496 are part of the LLD ontology and data. Such metadata for a particular article allows complex processing, search and filtering operations in LLD.

Semantic Biomedical Tagger has a built-in capability to recognize 135 biomedical entity types and semantically link them to a knowledge base, in this case LLD. Semantic Biomedical Tagger can load entity names from the LLD service or from any other RDF database with a SPARQL endpoint.

  • By default, LifeNERC 2.0 is initialized and tested to operate with the Linked Life Data platform.
  • All URIs used by default by Semantic Biomedical Tagger are resolvable and can be opened in a web browser or through a machine-accessible API.
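
As a rough illustration of loading entity names from a SPARQL endpoint, the sketch below builds the URL of a standard SPARQL Protocol GET request. The endpoint address, the use of skos:prefLabel as the label property, and the LIMIT value are all assumptions made for the example; the source does not state which query Semantic Biomedical Tagger actually issues. The request is only constructed, not sent.

```python
from urllib.parse import urlencode

# Hypothetical query: the label property and the LIMIT are assumptions.
LABEL_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?entity ?label WHERE {
  ?entity skos:prefLabel ?label .
} LIMIT 100
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL Protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


# Placeholder endpoint, not the real LLD address.
url = build_query_url("http://example.org/sparql", LABEL_QUERY)
print(url)
```

Each (entity, label) pair returned by such a query would become one dictionary entry: the label is what gets matched in text, and the entity URI is what the annotation links to.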

COPD

The LLD service supports content negotiation, a mechanism defined in the HTTP specification that makes it possible to serve different versions of a document (or more generally, a resource) with the same URI, so that user agents can specify which version fits their capabilities best.
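
A minimal sketch of content negotiation from the client side: the same URI is requested with different Accept headers to ask for different representations. The resource URI is a placeholder, and the media types shown are assumptions; which formats the LLD service actually serves is not stated here. The requests are built but not sent.

```python
from urllib.request import Request


def build_request(resource_uri: str, media_type: str) -> Request:
    """Build a GET request that asks for one specific representation of a resource."""
    return Request(resource_uri, headers={"Accept": media_type})


# Placeholder URI, used only for illustration.
RESOURCE = "http://example.org/resource/some-entity"

html_request = build_request(RESOURCE, "text/html")           # what a browser would ask for
rdf_request = build_request(RESOURCE, "application/rdf+xml")  # what an RDF client might ask for

print(html_request.get_header("Accept"))
print(rdf_request.get_header("Accept"))
```

Passing either object to urllib.request.urlopen would perform the actual request; the server then chooses the representation based on the Accept header.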

Annotations

Semantic Biomedical Tagger creates semantic annotations that have names (Annotation type) and features: class (URI), instance (URI), and string (instance label). Both URIs can be further explored in the LLD service.
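
The annotation structure described above could be modelled roughly as follows. The field names and the start/end character offsets are assumptions added for illustration, and the URIs are placeholders rather than real LLD identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticAnnotation:
    annotation_type: str  # the annotation's name, e.g. a disease type
    start: int            # character offset where the mention begins (assumed field)
    end: int              # character offset just past the mention (assumed field)
    class_uri: str        # feature "class": URI of the semantic class
    instance_uri: str     # feature "instance": URI of the concrete entity
    string: str           # feature "string": the instance label


text = "Patients with COPD were included."
ann = SemanticAnnotation(
    annotation_type="Disease",
    start=14,
    end=18,
    class_uri="http://example.org/class/Disease",   # placeholder
    instance_uri="http://example.org/instance/copd",  # placeholder
    string=text[14:18],
)
print(ann)
```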

Components

LifeNERC uses two strategies to achieve high precision and recall of semantic annotations:

1. By using the pre-existing knowledge, retrieved from the LLD service, the LD-Gazetteer analyzes texts and recognizes mentions of entities. This semantic annotation process comprises:

  • matching the entity mention in the text to the respective entry in the dictionary
  • assigning a semantic annotation to the entity mention according to the predefined annotation types
  • linking the entity mention (string) to its LLD instance URI
  • linking the annotation type to its LLD class URI
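
The four steps above can be sketched with a toy dictionary lookup. This is not the LD-Gazetteer implementation: the dictionary contents, the URIs, and the case-insensitive whole-word matching are assumptions chosen to keep the example small.

```python
import re

# Hypothetical dictionary: label -> (annotation type, instance URI).
DICTIONARY = {
    "copd": ("Disease", "http://example.org/instance/copd"),
    "asthma": ("Disease", "http://example.org/instance/asthma"),
}
# Hypothetical mapping: annotation type -> class URI.
CLASS_URIS = {"Disease": "http://example.org/class/Disease"}


def annotate(text: str) -> list:
    """Match dictionary labels in text and attach type, instance URI and class URI."""
    annotations = []
    for label, (ann_type, instance_uri) in DICTIONARY.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append({
                "type": ann_type,                # step 2: assign the annotation type
                "start": match.start(),
                "end": match.end(),
                "string": match.group(0),        # step 1: the matched mention
                "instance": instance_uri,        # step 3: link mention to instance URI
                "class": CLASS_URIS[ann_type],   # step 4: link type to class URI
            })
    return sorted(annotations, key=lambda a: a["start"])


result = annotate("Patients with COPD and asthma.")
for annotation in result:
    print(annotation)
```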

Due to the specific nature of biomedical texts, the LD-Gazetteer is divided into two distinct processing resources – standard-gaze and abbr-gaze.

2. Bio-Taggers are machine learning components that identify entity mentions in text that have not been recognized by the LD-Gazetteer dictionary. Each tagger produces temporary annotation types, which are later post-processed into the final annotation types. Integrating Bio-Taggers in Semantic Biomedical Tagger significantly increases entity recognition precision and recall, which are among the major issues in dictionary-based approaches.

The following annotation types are supported by the Bio-Taggers component:

  • Genes
  • Malignancies/Neoplasms
  • Sequence variants
  • Cell types
  • Cell lines
  • DNA sequences
  • RNA sequences

Since the Bio-Taggers approach relies on “looser” rules than the strict matching of the LD-Gazetteer, it is hard to link the entities to a valid instance URI. Instead, entity instances get an auto-generated value that is a concatenation of the namespace and the local name.

Although Bio-Taggers in Semantic Biomedical Tagger significantly improve precision and recall, they also make it harder to disambiguate semantic annotations whose offsets overlap. There is no single solution to this problem: two short spans can describe two completely different entities, while the same text taken as one long span can form another, equally meaningful, entity.
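
A minimal sketch of such an auto-generated value, assuming the local name is derived from the mention text by replacing spaces with underscores and percent-encoding the rest. The namespace is a placeholder and the derivation rule is an assumption; the source only states that the value is the namespace concatenated with the local name.

```python
from urllib.parse import quote


def generated_instance_uri(namespace: str, mention: str) -> str:
    """Concatenate a namespace with a local name derived from the mention text."""
    # Assumed rule: trim, replace spaces with underscores, percent-encode the rest.
    local_name = quote(mention.strip().replace(" ", "_"), safe="")
    return namespace + local_name


# Placeholder namespace, used only for illustration.
uri = generated_instance_uri("http://example.org/generated/", "BRCA1 gene")
print(uri)
```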

The transform priority component offers a possible compromise between the number of semantic annotations and their completeness and disambiguates LD-Gazetteer and Bio-Taggers annotation types.
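
To make the overlap problem concrete, the sketch below applies one simple policy: the longest span wins and any shorter span overlapping it is dropped. This is only an illustrative policy under that assumption; the actual rules of the transform priority component are not described here, and the spans and labels are invented.

```python
def resolve_overlaps(spans: list) -> list:
    """Keep the longest spans and drop any shorter span that overlaps a kept one.

    Each span is a (start, end, label) tuple with an exclusive end offset.
    """
    kept = []
    # Visit longer spans first; ties are broken by earlier start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        overlaps_kept = any(span[0] < k[1] and k[0] < span[1] for k in kept)
        if not overlaps_kept:
            kept.append(span)
    return sorted(kept)


# Invented example: one short span nested inside a longer one, plus a separate span.
candidates = [(0, 4, "A"), (0, 11, "B"), (12, 16, "C")]
resolved = resolve_overlaps(candidates)
print(resolved)
```

A longest-match policy favours completeness of each annotation at the cost of discarding the shorter, possibly also valid, readings, which is exactly the trade-off described above.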

Integration

Semantic Biomedical Tagger can be integrated as a separate module in different applications:

  • GATE – as a stand-alone application
  • KIM – as an information extraction component
  • Teamware – as an annotation service
  • As a distributed service and REST architecture for large-scale machine information extraction

LifeNERC extracts biomedical entities by means of symbolic methods (JAPE rules) and probabilistic models (machine learning). The main system components are:

  • Large-scale linked data gazetteer (LD-Gazetteer) – applies the traditional dictionary-based approach for biomedical entity recognition
  • Bio-medical taggers (Bio-Taggers) – use probabilistic models and grammars to discover new biomedical entities that do not exist in the LD-Gazetteer list
  • Various pre-processing and post-processing rules