The KIM Platform: Semantic Annotation

Here is what we consider semantic annotation of text:

Semantic annotation is information about which entities (or, more generally, semantic features) appear in a text and where they appear. Formally, semantic annotations represent a specific sort of metadata, which provides references to entities in the form of URIs or other types of unique identifiers.

As discussed below, a text can be semantically annotated in many different ways. The definition above describes the basic annotation schema used by the KIM platform. When combined with a semantic repository that contains descriptions of these entities, semantic annotations enable several improvements in document management: semantic indexing and retrieval, hyperlinking, advanced visualisation, and navigation.

Semantic annotations are not only relevant to text documents. For instance, SAWSDL is a specification for annotation of web services (namely, WSDL descriptions). Such annotations can be created using WSMO Studio.

Annotation

'Annotation', in contemporary English, according to WordNet, has two meanings:

In linguistics (and particularly in computational linguistics) an annotation is considered a formal note added to a specific part of the text. There are a number of alternative approaches to the organization, structuring, and preservation of annotations. For instance, all the markup languages (HTML, SGML, XML, etc.) can be considered schemata for embedded or in-line annotation. In contrast, open hypermedia systems use stand-off annotation models, where annotations are kept detached from the content, i.e. not embedded in it.

As shown in the figure below, one can also have metadata about the document as a whole. This kind of metadata can be called document-level annotation, as opposed to character-level annotations, which refer to just a specific part of the text.

KIM is based on a stand-off model: the annotations are kept and managed separately from the content.
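
To make the stand-off idea concrete, here is a minimal sketch in Python. It is not KIM's data model; the `Annotation` class, its fields, and the sample sentence are assumptions made up for illustration. The text is left untouched, character-level annotations point into it by offsets, and a document-level annotation carries no offsets at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation (hypothetical): stored apart from the text."""
    label: str
    start: Optional[int] = None  # character offsets; None means document-level
    end: Optional[int] = None

    def is_document_level(self) -> bool:
        return self.start is None and self.end is None

    def covered_text(self, text: str) -> str:
        """Return the span pointed at, or the whole text if document-level."""
        if self.is_document_level():
            return text
        return text[self.start:self.end]


# The content itself contains no markup.
text = "Sofia is the capital of Bulgaria."

annotations = [
    Annotation(label="language=en"),                 # document-level
    Annotation(label="Location", start=0, end=5),    # character-level
    Annotation(label="Location", start=24, end=32),  # character-level
]

# Resolve the character-level annotations against the unchanged text.
spans = [a.covered_text(text) for a in annotations if not a.is_document_level()]
```

Because the annotations live outside the content, several overlapping or competing annotation sets can describe the same text without modifying it, which in-line markup cannot easily do.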

Semantic Annotations

We use the term semantic annotation for both (i) a sort of metadata and (ii) the process of generating such metadata.

While the name could be debated (it could well be "Entity annotation"), its nature is quite unambiguous: the named entities in the text are recognized and identified. The result is formally recorded and associated with the place in the text where the entity has been mentioned. The identity of the entity is "verbalized" via URIs, which means that annotations can be easily linked to the entity descriptions within a semantic repository, as demonstrated below.

Although redundant, in accordance with the good NE recognition tradition in the IE community, the types of the entities are also explicitly indicated via URIs of the respective (most specific) classes in the ontology.
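
As a rough illustration of such a record, the Python sketch below pairs a text span with an entity URI and a class URI, and resolves the entity against a toy in-memory "repository". Everything here is hypothetical: the `http://example.org/kb#` namespace, the `SemanticAnnotation` class, and the dictionary standing in for a semantic repository are assumptions, not KIM's actual identifiers or API.

```python
from dataclasses import dataclass

# Hypothetical namespace, invented for this illustration.
EX = "http://example.org/kb#"


@dataclass(frozen=True)
class SemanticAnnotation:
    start: int       # offset of the first character of the mention
    end: int         # offset just past the last character of the mention
    entity_uri: str  # identity of the entity
    class_uri: str   # most specific class (redundant, but stated explicitly)


# A toy stand-in for a semantic repository: entity URI -> description.
REPOSITORY = {
    EX + "Sofia": {"label": "Sofia", "type": EX + "City"},
    EX + "Bulgaria": {"label": "Bulgaria", "type": EX + "Country"},
}


def describe(annotation, text, repository):
    """Join an annotation with the entity description it points to."""
    description = repository.get(annotation.entity_uri, {})
    return {
        "mention": text[annotation.start:annotation.end],
        "entity": annotation.entity_uri,
        "class": annotation.class_uri,
        "label": description.get("label"),
    }


text = "Sofia is the capital of Bulgaria."
annotation = SemanticAnnotation(0, 5, EX + "Sofia", EX + "City")
result = describe(annotation, text, REPOSITORY)
```

The class URI duplicates what the repository already knows about the entity, which mirrors the redundancy noted above: a consumer can read the type directly from the annotation without a lookup.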

KIM also annotates "key phrases" in the documents. Such phrases are general terms (i.e. universals, rather than particulars or entities), which were found to be characteristic for the documents, based on statistical analysis. Taken together, the named entities and the key phrases are considered "semantic features" of the document. They form a feature space of reduced dimensionality, which is used by KIM for semantic indexing and retrieval, co-occurrence, and popularity trend analysis.
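
The following Python sketch shows one simple way such a feature space could support retrieval and co-occurrence counting. It is an assumption-laden toy, not KIM's implementation: the document ids, the `ex:` and `term:` feature identifiers, and both helper functions are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Each document is reduced to its set of semantic features:
# entity identifiers plus key phrases (all names here are hypothetical).
documents = {
    "doc1": {"ex:Sofia", "ex:Bulgaria", "term:tourism"},
    "doc2": {"ex:Sofia", "term:tourism"},
    "doc3": {"ex:Bulgaria", "term:economy"},
}


def build_index(docs):
    """Inverted index: feature -> set of documents that contain it."""
    index = {}
    for doc_id, features in docs.items():
        for feature in features:
            index.setdefault(feature, set()).add(doc_id)
    return index


def cooccurrence(docs):
    """Count how many documents mention each pair of features together."""
    counts = Counter()
    for features in docs.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(features), 2):
            counts[pair] += 1
    return counts


index = build_index(documents)
pairs = cooccurrence(documents)
```

Here `index["ex:Sofia"]` yields the documents that mention that entity, and `pairs[("ex:Sofia", "term:tourism")]` counts the documents in which the entity and the key phrase appear together.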

Named Entities

Named entities (NE) are people, organizations, locations, and other things referred to by name. Apples and bicycles are not considered NE, because they are not typically referred to by name.

Under a wider interpretation, some scalar values (numbers, amounts of money, dates) and addresses can also be considered NE.

A couple of general remarks:

What about words?

Words can also be formally marked up. A typical approach is to annotate a given word with a designator of the word sense used in the specific case. For instance, a designator could be "link-v2", meaning that the word "link" is used as a verb (it could equally serve as a noun) in its second sense according to some register.
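
A designator like "link-v2" can be split mechanically into its parts. The Python sketch below assumes the format is a lemma, a hyphen, a one-letter part-of-speech code, and a sense number; that format is inferred from the single example above, and the `WordSense` type and `parse_designator` function are hypothetical names.

```python
import re
from typing import NamedTuple


class WordSense(NamedTuple):
    lemma: str  # the word itself, e.g. "link"
    pos: str    # one-letter part-of-speech code, e.g. "v" for verb
    sense: int  # sense number in some register of word meanings


# Assumed format: <lemma>-<pos letter><sense number>, e.g. "link-v2".
_DESIGNATOR = re.compile(r"^(?P<lemma>.+)-(?P<pos>[a-z])(?P<sense>\d+)$")


def parse_designator(designator: str) -> WordSense:
    """Split a word-sense designator into lemma, part of speech, and sense."""
    match = _DESIGNATOR.match(designator)
    if match is None:
        raise ValueError(f"not a word-sense designator: {designator!r}")
    return WordSense(match["lemma"], match["pos"], int(match["sense"]))


parsed = parse_designator("link-v2")
```

The lemma group is greedy, so a hyphenated lemma such as "well-known-a1" is still split at the last hyphen.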

There are a number of complex issues related to the meanings of words:

We respect the above-mentioned questions and the complexity of their answers. For readers who want to examine these phenomena further, we recommend the following related work: the WordNet and Cyc projects, and conferences such as SENSEVAL, FOIS, and OntoLex.

However, at present this is not our prime objective. At this stage our focus is on a much simpler problem: the basic semantics of named entities. We do believe that the general semantic annotation approach proposed and implemented in KIM can serve word-level semantics as well. Meanwhile, we have adopted a more limited approach, in which we identify and annotate only the key phrases in the documents. These are handled as entities of class GeneralTerm, without regard to their lexical semantics.