Semantic annotation, or tagging, is the process of attaching metadata about relevant concepts (e.g., people, places, organizations, products or topics) to a text document or other unstructured content. Unlike classic text annotations, which are for the reader’s reference, semantic annotations can also be used by machines. Semantically tagged documents are easier to find, interpret, combine and reuse.
The result of the semantic annotation process is metadata that describes the document via references to concepts and entities mentioned in the text or relevant to it. These references link the content to the formal descriptions of these concepts in a knowledge graph. Typically, such metadata is represented as a set of tags or annotations that enrich the document, or specific fragments of it, with identifiers of concepts.
Semantic metadata can be stored in a knowledge graph rather than embedded in a document. One modelling approach, which enables a broad range of analytics, is to store each annotation as an individual object that refers to the document, which is itself a node in the graph. This way, documents and annotations become first-class citizens of the knowledge graph and can be indexed and queried alongside the other types of data there: ontologies, schemas, reference and master data. This approach is implemented in Ontotext Platform.
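The modelling approach described above can be illustrated with a minimal Python sketch. It is not the Ontotext Platform data model: the predicate names (`ex:annotates`, `ex:refersTo`, and so on), the identifiers, and the use of a plain set of tuples in place of a real triplestore are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation stored as its own node (hypothetical schema)."""
    annotation_id: str
    document_id: str
    concept_id: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


def annotation_to_triples(a):
    """Express the annotation as (subject, predicate, object) triples."""
    return [
        (a.annotation_id, "rdf:type", "ex:Annotation"),
        (a.annotation_id, "ex:annotates", a.document_id),
        (a.annotation_id, "ex:refersTo", a.concept_id),
        (a.annotation_id, "ex:start", a.start),
        (a.annotation_id, "ex:end", a.end),
    ]


def documents_mentioning(graph, concept_id):
    """Find documents that have at least one annotation pointing at the concept."""
    anns = {s for (s, p, o) in graph if p == "ex:refersTo" and o == concept_id}
    return sorted({o for (s, p, o) in graph if p == "ex:annotates" and s in anns})


text = "Aristotle, the author of Politics, established the Lyceum"

# The document and the annotation are both nodes in the same graph.
graph = {("doc:1", "rdf:type", "ex:Document")}
ann = Annotation("ann:1", "doc:1", "concept:Aristotle", 0, 9)
graph.update(annotation_to_triples(ann))

print(text[ann.start:ann.end])
print(documents_mentioning(graph, "concept:Aristotle"))
```

Because the annotation is a separate node, it can carry its own properties (offsets here, but equally a confidence score or provenance) and can be queried without opening the document itself.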
Think of semantic annotations as a sort of highly structured digital marginalia (notes made in the margins of a book or other document), usually invisible in the human-readable part of the content. Written in the machine-interpretable formal language of data, these notes serve computers to perform operations such as classifying, linking, inferencing, searching, filtering, etc.
For instance, to semantically annotate chosen concepts in the sentence “Aristotle, the author of Politics, established the Lyceum” means to identify Aristotle as a person and Politics as a written work of political philosophy, and to further index, classify and interlink the identified concepts in a semantic graph database, also known as a triplestore. In this case, Aristotle can be linked to his date of birth, his teachers, his works, etc. Politics can be linked to its subject, its date of creation, etc. Given the semantic metadata about the above sentence and its links to other (external or internal) formal knowledge, algorithms will be able to automatically:
Semantic annotation enriches content with machine-processable information by linking background information to extracted concepts. These concepts, found in a document or another piece of content, are unambiguously defined and related to each other within and outside the content. This turns the content into a more manageable data source.
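As a rough illustration of how linked concepts become queryable, the Aristotle sentence can be written as a handful of triples and searched by pattern. This is a sketch only: the predicate names and the `match` helper are hypothetical, and a real triplestore would be queried with a language such as SPARQL rather than a Python loop.

```python
# Facts about the example sentence plus a little background knowledge,
# written as (subject, predicate, object) triples. Predicate names are
# invented for this sketch.
triples = {
    ("Aristotle", "type", "Person"),
    ("Politics", "type", "WrittenWork"),
    ("Politics", "author", "Aristotle"),
    ("Politics", "subject", "political philosophy"),
    ("Lyceum", "founder", "Aristotle"),
    ("Aristotle", "teacher", "Plato"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which works were written by Aristotle?
works = [s for (s, _, _) in match(triples, p="author", o="Aristotle")]
print(works)

# Who taught Aristotle?
print(match(triples, s="Aristotle", p="teacher"))
```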
A typical process of semantic enrichment includes:
Text is extracted from non-textual sources such as PDF files, videos, documents, voice recordings, etc.
Algorithms split sentences and identify concepts such as people, things, places, events, numbers, etc.
All recognized concepts are classified, which means that they are defined as people, organizations, numbers, etc. Next, they are disambiguated, that is, they are unambiguously identified according to a domain-specific knowledge base. For example, Rome is classified as a city and further disambiguated as Rome, Italy, and not Rome, Iowa.
This classification and disambiguation step is the most important stage of semantic annotation. It recognizes text chunks and turns them into machine-processable data by linking them to the broader context of already existing data.
The relationships between the extracted concepts are identified and further interlinked with related external or internal domain knowledge.
All mentions of people, things, etc. and the relationships between them that have been recognized and enriched with machine-readable data are then indexed and stored in a semantic graph database for further reference and use.
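The steps above can be sketched end to end in a few lines of Python. This is a toy illustration under several assumptions: the input is already plain text (extraction from PDFs or recordings is skipped), concepts are recognized by exact label lookup in a small hypothetical knowledge base, disambiguation simply counts context words shared with the sentence, and the "graph database" is a list of triples. Production systems use trained models and a real triplestore for each of these stages.

```python
import re

# Hypothetical domain knowledge base: identifiers, types and context words
# are invented for this sketch.
KNOWLEDGE_BASE = {
    "kb:Rome_Italy": {"label": "Rome", "type": "City", "context": {"italy", "colosseum"}},
    "kb:Rome_Iowa": {"label": "Rome", "type": "City", "context": {"iowa"}},
    "kb:Aristotle": {"label": "Aristotle", "type": "Person", "context": {"lyceum"}},
}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence):
    """Recognize known labels in a sentence and return (label, start, end)."""
    labels = sorted({entry["label"] for entry in KNOWLEDGE_BASE.values()})
    mentions = []
    for label in labels:
        for m in re.finditer(r"\b" + re.escape(label) + r"\b", sentence):
            mentions.append((label, m.start(), m.end()))
    return mentions


def disambiguate(label, sentence):
    """Pick the candidate whose context words overlap most with the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    candidates = [
        (len(entry["context"] & words), concept_id)
        for concept_id, entry in KNOWLEDGE_BASE.items()
        if entry["label"] == label
    ]
    _, best = max(candidates)
    return best


def annotate(doc_id, text):
    """Run the pipeline and return triples ready to be stored and indexed."""
    triples = []
    for i, sentence in enumerate(split_sentences(text)):
        for label, start, _end in find_mentions(sentence):
            concept = disambiguate(label, sentence)
            ann = f"{doc_id}/s{i}/{start}"
            triples.append((ann, "annotates", doc_id))
            triples.append((ann, "refersTo", concept))
            triples.append((concept, "type", KNOWLEDGE_BASE[concept]["type"]))
    return triples


for triple in annotate("doc:1", "The Colosseum is in Rome. Rome is a city in Iowa."):
    print(triple)
```

In this sketch the first mention of Rome resolves to the Italian city because the sentence shares the word "colosseum" with that entry's context, while the second resolves to the Iowa entry.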
What semantic annotation brings to the table are smart data pieces containing highly structured and informative notes for machines to refer to. Solutions that include semantic annotation are widely used for risk analysis, content recommendation, content discovery, regulatory compliance checking and much more.