What is Semantic Annotation?

Semantic annotation is the process of tagging documents with relevant concepts. The documents are enriched with metadata: references that link the content to concepts, described in a knowledge graph. This makes unstructured content easier to find, interpret and reuse.

Semantic annotation, or tagging, is the process of attaching metadata about relevant concepts (e.g., people, places, organizations, products or topics) to a text document or other unstructured content. Unlike classic text annotations, which are for the reader’s reference, semantic annotations can also be used by machines. Semantically tagged documents are easier to find, interpret, combine and reuse.

The result of the semantic annotation process is metadata that describes the document via references to concepts and entities mentioned in the text or relevant to it. These references link the content to the formal descriptions of these concepts in a knowledge graph. Typically, such metadata is represented as a set of tags or annotations that enrich the document, or specific fragments of it, with identifiers of concepts.

Semantic metadata can be stored in a knowledge graph, rather than embedded in a document. One modelling approach, which enables a broad range of analytics, is to store each annotation as an individual object that refers to the document, which is itself a node in the graph. This way documents and annotations become first-class citizens of the knowledge graph and can be indexed and queried alongside the other types of data there: ontologies, schemas, reference and master data. This approach is implemented in Ontotext Platform.
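The modelling approach described above can be sketched in a few lines of Python. This is only an illustration, not how Ontotext Platform is implemented: the identifiers (doc:1, ann:1, ex:Aristotle) and the predicate names (annotates, refersTo) are hypothetical, and a plain set of tuples stands in for the knowledge graph.

```python
from dataclasses import dataclass

# A triple is (subject, predicate, object). All identifiers and predicate
# names in this sketch are hypothetical and chosen only for illustration.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Annotation:
    ann_id: str      # identifier of the annotation node itself
    doc_id: str      # document node the annotation refers to
    concept_id: str  # concept node in the knowledge graph
    start: int       # character offset where the mention starts
    end: int         # character offset just past the mention

    def to_triples(self) -> list[Triple]:
        """Express the annotation as graph edges, one fact per triple."""
        return [
            (self.ann_id, "type", "Annotation"),
            (self.ann_id, "annotates", self.doc_id),
            (self.ann_id, "refersTo", self.concept_id),
            (self.ann_id, "start", str(self.start)),
            (self.ann_id, "end", str(self.end)),
        ]


text = "Aristotle, the author of Politics, established the Lyceum"
mention = "Aristotle"
offset = text.index(mention)

# The document is a node in the same graph as its annotations.
graph: set[Triple] = {("doc:1", "type", "Document")}
ann = Annotation("ann:1", "doc:1", "ex:Aristotle", offset, offset + len(mention))
graph.update(ann.to_triples())


def documents_mentioning(concept_id: str) -> set[str]:
    """Follow concept -> annotation -> document edges."""
    ann_ids = {s for s, p, o in graph if p == "refersTo" and o == concept_id}
    return {o for s, p, o in graph if p == "annotates" and s in ann_ids}
```

Because the annotation is its own node, the same graph can answer both "which documents mention this concept?" and "where exactly in the document is it mentioned?".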


Create Smart Content with Machine-Processable Marginalia

Think of semantic annotations as a sort of highly structured digital marginalia (notes made in the margins of a book or other document), usually invisible in the human-readable part of the content. Written in the machine-interpretable formal language of data, these notes let computers perform operations such as classifying, linking, inferencing, searching and filtering.

For instance, to semantically annotate chosen concepts in the sentence “Aristotle, the author of Politics, established the Lyceum” means to identify Aristotle as a person and Politics as a written work of political philosophy, and to further index, classify and interlink the identified concepts in a semantic graph database, also known as a triplestore. In this case, Aristotle can be linked to his date of birth, his teachers, his works, etc. Politics can be linked to its subject, its date of creation, etc. Given the semantic metadata about the above sentence and its links to other (external or internal) formal knowledge, algorithms will be able to automatically:

  • find out who tutored Alexander the Great;
  • answer which of Plato’s pupils established the Lyceum;
  • retrieve a list of political thinkers who lived between 380 BC and 310 BC;
  • render a list of Greek philosophers, which includes Aristotle.
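As a rough illustration of how such questions become lookups once the facts are in a graph, here is a toy in-memory triple set in Python. The predicate names and the small fact set are assumptions made for this sketch; a real triplestore would be queried with a language such as SPARQL rather than with set comprehensions.

```python
# Toy facts; identifiers and predicate names are illustrative assumptions.
FACTS = {
    ("Aristotle", "type", "Philosopher"),
    ("Aristotle", "nationality", "Greek"),
    ("Aristotle", "studentOf", "Plato"),
    ("Aristotle", "tutorOf", "Alexander the Great"),
    ("Aristotle", "established", "Lyceum"),
    ("Aristotle", "authorOf", "Politics"),
    ("Politics", "subject", "political philosophy"),
    ("Plato", "type", "Philosopher"),
    ("Plato", "nationality", "Greek"),
}

# Years BC are stored as negative integers: (birth, death).
LIFESPAN = {"Aristotle": (-384, -322)}


def subjects(predicate, obj):
    """All subjects that have the given predicate pointing at obj."""
    return {s for s, p, o in FACTS if p == predicate and o == obj}


# Who tutored Alexander the Great?
tutors = subjects("tutorOf", "Alexander the Great")

# Which of Plato's pupils established the Lyceum?
founders = subjects("studentOf", "Plato") & subjects("established", "Lyceum")

# Political thinkers whose lifespan overlaps 380 BC to 310 BC.
political = {
    s for s, p, o in FACTS
    if p == "authorOf" and (o, "subject", "political philosophy") in FACTS
}
alive = {
    name for name, (born, died) in LIFESPAN.items()
    if born <= -310 and died >= -380
}
thinkers = political & alive

# Greek philosophers.
greek_philosophers = subjects("type", "Philosopher") & subjects("nationality", "Greek")
```

Each answer combines facts that come from the annotated sentence (authorship, the Lyceum) with background knowledge already in the graph (teachers, dates).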

How Does Semantic Annotation Work?

Semantic annotation enriches content with machine-processable information by linking background information to extracted concepts. These concepts, found in a document or another piece of content, are unambiguously defined and related to each other within and outside the content. This turns the content into a more manageable data source.

A typical process of semantic enrichment includes:

Text Identification

Step 1: We remove the boilerplate from the unstructured textual content.


Text is extracted from sources that are not plain text, such as PDF files, videos, documents, voice recordings, etc.
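For HTML input, boilerplate removal can be approximated with Python's standard html.parser module. The sketch below is a minimal illustration, not the pipeline described here: the set of tags treated as boilerplate is an assumption, and production systems typically use more robust content-extraction heuristics.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips elements assumed to be boilerplate."""

    # Assumption for this sketch: these elements never hold main content.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text outside boilerplate elements, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><nav>Home | About</nav>"
    "<p>Aristotle established the Lyceum.</p>"
    "<script>track()</script><footer>Newsletter</footer></html>"
)
clean = extract_text(page)
```

Extracting text from PDFs, audio or video needs dedicated tools (PDF parsers, speech recognition) and is outside the scope of this sketch.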

Text Analysis

Step 2: We perform a set of standard Natural Language Processing operations over the content, such as Sentence Splitting, Part-of-Speech Tagging and Named Entity Recognition.

 


Algorithms split sentences and identify concepts such as people, things, places, events, numbers, etc.
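To make this step concrete, the following Python sketch splits text into sentences and spots candidate mentions using regular expressions only. It is a deliberately naive stand-in: real pipelines rely on trained NLP models, and the two rules below (break after sentence punctuation, treat runs of capitalised words as candidates) are assumptions of this example.

```python
import re


def split_sentences(text):
    """Break after '.', '!' or '?' when followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def candidate_mentions(sentence):
    """Runs of capitalised words, as a crude stand-in for entity recognition."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", sentence)


sample = (
    "Aristotle, the author of Politics, established the Lyceum. "
    "He also tutored Alexander."
)
sentences = split_sentences(sample)
mentions = [candidate_mentions(s) for s in sentences]
```

On the second sentence the capitalisation rule also picks up "He", which shows why part-of-speech tagging and a proper named entity recognizer are needed to filter candidates.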

Concept Extraction

Step 3: We classify and disambiguate the identified entities.


All recognized concepts are classified, which means that they are defined as people, organizations, numbers, etc. Next, they are disambiguated, that is, they are unambiguously identified according to a domain-specific knowledge base. For example, Rome is classified as a city and further disambiguated as Rome, Italy, and not Rome, Iowa. The process of entity recognition and disambiguation is known as entity linking.

This is the most important stage of semantic annotation. It recognizes text chunks and turns them into machine-processable and understandable data pieces by linking them to the broader context of already existing data.
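The Rome example can be sketched as a lookup against a small knowledge base followed by a context-overlap score. Everything in this Python snippet is illustrative: the identifiers (ex:Rome_Italy, ex:Rome_Iowa), the context keywords and the scoring rule are assumptions, and real entity linkers use much richer signals.

```python
import re

# Hypothetical knowledge base: each surface form maps to candidate entities,
# each with a type and a few context words that hint at the right reading.
CANDIDATES = {
    "Rome": [
        {"id": "ex:Rome_Italy", "type": "City",
         "context": {"italy", "colosseum", "tiber", "vatican"}},
        {"id": "ex:Rome_Iowa", "type": "City",
         "context": {"iowa", "usa", "midwest"}},
    ],
}


def link_entity(mention, sentence):
    """Return the candidate id whose context words best overlap the sentence.

    Returns None for mentions that are not in the knowledge base. On a tie,
    the first candidate listed wins.
    """
    candidates = CANDIDATES.get(mention, [])
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["id"]


italy = link_entity("Rome", "The Colosseum in Rome stands near the Tiber.")
iowa = link_entity("Rome", "Rome is a small city in Iowa.")
```

Both candidates are classified as a city; it is the surrounding context that selects one identifier over the other.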

Relationship Extraction

Step 4: We also identify the relationships between known and newly recognized entities.


The relationships between the extracted concepts are identified and further interlinked with related external or internal domain knowledge.
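A simple way to picture relationship extraction is pattern matching over the sentence used earlier. The two regular expressions and the predicate names (authorOf, established) in this Python sketch are assumptions written for that one sentence shape; real systems learn or curate far more general extractors.

```python
import re

# Hand-written patterns, each paired with the predicate it produces.
# Both are assumptions tailored to the example sentence.
PATTERNS = [
    ("authorOf", re.compile(
        r"(?P<subj>[A-Z]\w+), the author of (?P<obj>[A-Z]\w+)")),
    ("established", re.compile(
        r"(?P<subj>[A-Z]\w+)(?:,[^,]*,)?\s+established the (?P<obj>[A-Z]\w+)")),
]


def extract_relations(sentence):
    """Return (subject, predicate, object) triples matched by the patterns."""
    triples = []
    for predicate, pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            triples.append((match.group("subj"), predicate, match.group("obj")))
    return triples


relations = extract_relations(
    "Aristotle, the author of Politics, established the Lyceum"
)
```

The resulting triples have the same shape as the facts already in the graph, so they can be interlinked with existing domain knowledge once the subjects and objects have been resolved to concept identifiers.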

Indexing and storing in a semantic graph database

Step 5: Finally, the extracted knowledge, represented as a graph, is stored in our semantic database GraphDB, which can also create full-text indices in a search engine like Elasticsearch.


All mentions of people, things, etc. and the relationships between them that have been recognized and enriched with machine-readable data are then indexed and stored in a semantic graph database for further reference and use.
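The idea of indexing annotations for later retrieval can be illustrated with a small inverted index from concepts to documents. This Python class is a hypothetical stand-in written for this example; it does not use or mirror the GraphDB or Elasticsearch APIs.

```python
from collections import defaultdict


class AnnotationIndex:
    """Maps each concept identifier to the documents annotated with it."""

    def __init__(self):
        self._by_concept = defaultdict(set)

    def add(self, doc_id, concept_id):
        """Record that doc_id carries an annotation for concept_id."""
        self._by_concept[concept_id].add(doc_id)

    def search(self, *concept_ids):
        """Documents annotated with all of the given concepts."""
        sets = [self._by_concept.get(c, set()) for c in concept_ids]
        return set.intersection(*sets) if sets else set()


# Document and concept identifiers are illustrative.
index = AnnotationIndex()
index.add("doc:1", "ex:Aristotle")
index.add("doc:1", "ex:Lyceum")
index.add("doc:2", "ex:Aristotle")

about_aristotle = index.search("ex:Aristotle")
about_both = index.search("ex:Aristotle", "ex:Lyceum")
```

Searching by concept identifier rather than by string means a document is found even when it uses a different name or spelling for the same entity.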

Where is Semantic Annotation Used?


Semantic annotation produces smart pieces of data: content that carries highly structured, informative notes for machines to refer to. Solutions that include semantic annotation are widely used for risk analysis, content recommendation, content discovery, checking regulatory compliance and much more.

Semantically Annotated Content Opens Up Cost-Effective Opportunities:

Semantic Annotation Makes it Easy to:

  • Find relevant information in heaps of documents with the help of machines doing the legwork;
  • Extract knowledge from disparate sources;
  • Provide personalized content, based on machine-understandable context;
  • Automatically interconnect content.