An annotation is a form of meta-data attached to a particular section of document content. The section may be a single word, a sentence, or even a series of paragraphs. An annotation must have a type (or a name), which is used to create classes of similar annotations, usually linked together by their semantics. For more information, see semantic annotation.
A controlled vocabulary is a closed list of terms, which can be used for classification. These terms are names for particular concepts. Controlled vocabularies can vary from simple alphabetical lists of terms to complex annotated thesauri. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.
The term "description logic" refers to a logic that focuses on descriptions as its principal means for expressing logical expressions. A description logic system emphasizes the use of classification and subsumption reasoning as its primary mode of inference.
Today description logic has become a cornerstone of the Semantic Web for its use in the design of ontologies. The OWL-DL and OWL-Lite sub-languages of the W3C-endorsed Web Ontology Language (OWL) are based on a description logic.
Document Repository is a KIM Platform component for storing, retrieving, and indexing annotated documents, with support for semantic, full-text, and co-occurrence queries. To achieve that, the KIM Platform integrates and adapts different storage engines, such as Oracle™ DB, Apache Lucene, and OWLIM.
An entity is something that has a distinct, separate existence, for example a particular person (Barack Obama) or a particular object (Air Force One). An entity does not need to exist materially. In particular, abstractions and legal fictions are usually regarded as entities. In the semantic web, entities have unique and persistent URIs.
Faceted classification is a classification in which documents are classified along different axes, called facets. Each facet contains a number of terms, usually with a thesaurus classification, and each term usually belongs to only one facet. A document is classified by selecting one term from each facet.
A folksonomy (from folk and taxonomy) is a system of classification derived from social tagging. It is a decentralized practice in which people create, manage, and share tags to annotate and categorize content in an online social environment. Examples of folksonomies are Flickr and Delicious.
Formal knowledge representation is about building models of the world, of a particular domain or problem, which allow automatic reasoning and interpretation. Such formal models are called ontologies and they can be used to provide formal semantics (i.e. machine-interpretable meaning) to any sort of information: databases, catalogs, documents, web pages, etc.
Full-text search (FTS) refers to a technique for searching a computer-stored document or database. In a full-text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage scans the text of all the documents and builds a list of search terms, often called an index, but more correctly named a concordance. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents.
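The split into an indexing stage and a search stage can be sketched roughly as follows. This is a minimal illustration, not the way any particular search engine works: the sample documents, the function names, and the simple "all query words must match" rule are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexing stage: map each word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Search stage: consult only the index and return ids matching every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
documents = {
    1: "Semantic annotation of documents",
    2: "Full-text search over annotated documents",
    3: "Ontology design",
}
index = build_index(documents)
matches = search(index, "documents search")
```

Once the index is built, each query touches only the word-to-document map, so the original texts never need to be rescanned.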
A gazetteer consists of a set of lists containing names of things such as cities, organizations, days of the week, etc. These lists are typically used to assist with the task of Named Entity Recognition (NER), although they may be used for any purpose. When the gazetteer is run on a document, annotations will be created for each matching string in the text.
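As a rough sketch of how running a gazetteer over a text produces annotations, the snippet below matches every entry of each list against the text and records the offsets, the list name as the annotation type, and the matched string. The lists, the sample sentence, and the `annotate` function are hypothetical; this is not the GATE gazetteer API.

```python
import re

# Hypothetical gazetteer lists; real gazetteers hold many more entries.
GAZETTEER = {
    "city": ["London", "New York", "Sofia"],
    "day": ["Monday", "Tuesday"],
}


def annotate(text, gazetteer):
    """Return (start, end, type, matched string) for each gazetteer match."""
    annotations = []
    for ann_type, names in gazetteer.items():
        for name in names:
            # \b keeps an entry from matching inside a longer word.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                annotations.append(
                    (match.start(), match.end(), ann_type, match.group())
                )
    return sorted(annotations)


text = "The meeting in New York is on Monday."
annotations = annotate(text, GAZETTEER)
```

Each resulting tuple plays the role of one annotation: a typed span over the document content.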
Below is a small section from a list for units of currency:
Search engine indexing collects, parses, and stores data to facilitate fast and accurate searching. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. Popular search engines focus on the full-text indexing of online, natural language documents. Media types such as video and audio and graphics are also searchable.
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.
See Reasoning
Information Extraction is a process that takes unseen texts as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, stored in a database or spreadsheet for later analysis, or used for indexing purposes in Information Retrieval (IR) applications. For more information, see the GATE User Guide.
Information Retrieval simply finds texts and presents them to the user, while the typical Information Extraction application analyzes texts and presents only the specific information from them that the user is interested in.
Knowledge and Information Management (in the Ontotext sense) is the process of capturing, semantically annotating, indexing, and storing unstructured, semi-structured, and structured data from different sources. By collecting these artifacts in a central or distributed electronic environment (in a database called a knowledge base), it provides different search paradigms on top of this semantic index.
Knowledge management comprises a range of strategies and practices used in an organization to identify, create, represent, distribute, and enable adoption of insights and experiences. Such insights and experiences comprise knowledge, either embodied in individuals or embedded in organizational processes or practice.
Knowledge Management System refers to a (generally IT-based) system for managing knowledge in organizations, supporting the creation, capture, storage, and dissemination of information.
Knowledge Base is a kind of database that stores the knowledge of a particular domain. It consists of a set of data (entities, entity properties, descriptions, and aliases), a conceptual model (ontology), and rules for reasoning over this data. The knowledge base uses the ontology to specify its structure (entity types and relationships) and classification scheme. In other words, the ontology, together with the set of instances of its classes, constitutes the knowledge base.
Knowledge Domain is the content of a particular field of knowledge such as life science, finance, tourism, etc. Knowledge that is applicable in every domain is called domain-independent knowledge, for example logic and mathematics.
Language Resource refers to data-only resources such as lexicons, corpora, thesauri or ontologies. Some LRs come with software (e.g. WordNet has both a user query interface and C and Prolog APIs), but where this is only a means of accessing the underlying data we will still define such resources as LRs.
Meta-data is data about data. Meta-data is information (authorship, classification, date, URL, etc.) about an informational resource. It could be a document (such as a webpage), an image, a dataset, or another resource. Meta-data is valuable in the storage and retrieval of information. Resources supported by good-quality, structured meta-data are more easily discoverable.
For instance, most websites contain meta-data to tell the computer how to lay the words out on the screen.
Namespace is the part of a URI that defines a set of resources with a common source, location, or purpose. Together with the local name of the URI, the namespace guarantees the uniqueness of the uniform resource identifier.
For example, in the URI http://proton.semanticweb.org/2006/05/protons#Entity, http://proton.semanticweb.org/2006/05/protons# is the namespace and Entity is the local name. The namespace shows that Entity is one of the resources in the PROTON System ontology, version 2006/05.
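The namespace and local name split can be sketched in a few lines. The helper below is hypothetical and assumes the common convention that the local name starts after the last "#" or "/" in the URI; other conventions exist.

```python
def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'.

    Assumption: the local name is whatever follows the last '#' or '/'.
    """
    cut = max(uri.rfind("#"), uri.rfind("/"))
    if cut == -1:
        return "", uri
    return uri[: cut + 1], uri[cut + 1:]


namespace, local_name = split_uri(
    "http://proton.semanticweb.org/2006/05/protons#Entity"
)
```

Concatenating `namespace` and `local_name` gives back the original URI, which is why the pair identifies the resource uniquely.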
(also known as entity identification (EI) and entity extraction)
NER is the simplest and most reliable IE technology. NE systems identify all the names of people, places, organizations, dates, and amounts of money.
For example, a NER system producing MUC-style output might tag the sentence "Jim bought 300 shares of Acme Corp. in 2006".

NER systems have been created that use linguistic grammar-based techniques as well as statistical models. Hand-crafted grammar-based systems typically obtain better results, but at the cost of months of work by experienced linguists. Statistical NER systems typically require a large amount of manually annotated training data.
Ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain. For more information, see Wikipedia.
Ontology population is a knowledge acquisition technique where instances of ontologically defined concepts and relations are extracted and classified from an information resource.
OEM is an ambiguous and abstruse phrase used in relation to the manufacturing and marketing of products. Usage of the phrase is not consistent, but it typically relates to a situation in which one company uses a component made by a second company in its own product, or sells the product of the second company under its own brand. For more information, see Wikipedia.
OWL is a semantic markup language for publishing and sharing ontologies on the World Wide Web. OWL is developed as a vocabulary extension of RDF (Resource Description Framework) and is derived from the DAML+OIL Web Ontology Language.
"Platform as a Service (PaaS)" deliver a computing platform and/or solution stack as a service, often consuming cloud infrastructure and sustaining cloud applications. It facilitates deployment of applications without the cost and complexity of buying and managing the underlying hardware and software layers.
A processing resource is a plug-in to GATE whose character is principally programmatic or algorithmic, such as lemmatizers, generators, translators, parsers, or speech recognizers. For example, a part-of-speech tagger is best characterized by reference to the process it performs on text. PRs typically include language resources (LRs), e.g. a tagger often has a lexicon; a word sense disambiguator uses a dictionary or thesaurus. For more information, see the GATE documentation.
The full triples notation (in RDF) requires that URI references be written out completely, in angle brackets, which can result in very long lines on a page. For convenience, a shorthand way of writing triples is sometimes used. This shorthand substitutes an XML qualified name (or QName) without angle brackets as an abbreviation for a full URI reference. A QName contains a prefix that has been assigned to a namespace URI, followed by a colon, and then a local name. The full URIref is formed from the QName by appending the local name to the namespace URI assigned to the prefix.
For example, if the QName prefix foo is assigned to the namespace URI http://example.org/somewhere/, then the QName foo:bar is shorthand for the URIref http://example.org/somewhere/bar.
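A minimal sketch of QName expansion, assuming a hypothetical prefix table and helper name. The namespace URI is written here with a trailing slash so that plain concatenation of namespace and local name yields http://example.org/somewhere/bar.

```python
# Hypothetical prefix-to-namespace assignments.
PREFIXES = {"foo": "http://example.org/somewhere/"}


def expand_qname(qname, prefixes):
    """Form the full URIref by appending the local name to the prefix's namespace URI."""
    prefix, colon, local_name = qname.partition(":")
    if not colon:
        raise ValueError("not a QName (no colon): " + qname)
    if prefix not in prefixes:
        raise KeyError("unknown prefix: " + prefix)
    return prefixes[prefix] + local_name


uriref = expand_qname("foo:bar", PREFIXES)
```

Because expansion is pure string concatenation, the namespace URI must already end with its separator ("/" or "#").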
Reasoning is the ability to infer logical consequences from a set of asserted facts or axioms. In our terms, the inference rules are commonly specified by means of an ontology language and, often, a description logic. Inference commonly proceeds by forward chaining and backward chaining.
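Forward chaining can be illustrated with a small sketch that keeps applying rules to the known facts until nothing new can be derived. The two rules (transitivity of subClassOf and propagation of type along subClassOf), the shortened property names, and the sample facts are assumptions chosen for the example; real reasoners use far richer rule sets.

```python
def forward_chain(facts):
    """Derive new facts until a fixpoint is reached.

    Rule 1: (A subClassOf B), (B subClassOf C) -> (A subClassOf C)
    Rule 2: (x type A),       (A subClassOf B) -> (x type B)
    """
    inferred = set(facts)
    while True:
        new = set()
        for (s1, p1, o1) in inferred:
            for (s2, p2, o2) in inferred:
                # Both rules chain a fact with a subClassOf fact about its object.
                if p2 == "subClassOf" and o1 == s2 and p1 in ("subClassOf", "type"):
                    new.add((s1, p1, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical asserted facts as (subject, predicate, object) triples.
asserted = {
    ("Man", "subClassOf", "Person"),
    ("Person", "subClassOf", "Agent"),
    ("john", "type", "Man"),
}
closure = forward_chain(asserted)
```

Backward chaining would instead start from a query such as "is john an Agent?" and work back through the rules to the asserted facts.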
A relational database management system (RDBMS) is a program that lets you create, update, and administer a relational database. An RDBMS takes Structured Query Language (SQL) statements entered by a user or contained in an application program and creates, updates, or provides access to the database.
Relational database example
Your company needs a better way of keeping track of customers, products, and orders because its paper-based system is no longer adequate. One way of setting this up using the relational model is to create three tables: Customers, Products, and Orders.
The Customers table holds no information about orders or products, which keeps it focused on its objective: customers. Likewise, the Products table describes only products. The Orders table uses the 'CustomerID' and the 'ProductID' to relate a product to a customer based on an order.
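The three-table layout can be tried out with Python's built-in sqlite3 module. This is only a sketch: apart from 'CustomerID' and 'ProductID', the column names and all sample rows are hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products  (ProductID  INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Orders (
    OrderID    INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    ProductID  INTEGER REFERENCES Products(ProductID)
);
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Products  VALUES (10, 'Keyboard'), (20, 'Monitor');
INSERT INTO Orders    VALUES (100, 1, 20), (101, 2, 10);
""")

# The Orders table relates a product to a customer through the two ids.
rows = conn.execute("""
SELECT c.Name, p.Title
FROM Orders o
JOIN Customers c ON c.CustomerID = o.CustomerID
JOIN Products  p ON p.ProductID  = o.ProductID
ORDER BY o.OrderID
""").fetchall()
```

Customer and product details are stored once each; an order only references them by id.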
A resource is a common term for "anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., 'today's weather report for Los Angeles'), as well as a collection of other resources. Not all resources are network 'retrievable'; e.g., human beings, corporations, and bound books in a library can also be considered resources."
For example, a web page, a collection of web pages, a service that provides information from a database, an e-mail message, Java classes, etc.
RDF is a language for representing information about resources in the World Wide Web. It is particularly intended for representing meta-data about Web resources, such as the title, author, and modification date of a Web page, copyright and licensing information about a Web document, or the availability schedule for some shared resource.
RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers or URIs), and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, and their properties and values.
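The idea of statements forming a graph can be sketched with plain tuples, without any RDF library. Each statement is a (resource, property, value) triple. The page URI and the values below are hypothetical; the two property URIs come from the Dublin Core element set.

```python
# Each statement is a (subject, property, value) triple.
PAGE = "http://example.org/index.html"  # hypothetical resource
DC = "http://purl.org/dc/elements/1.1/"

triples = [
    (PAGE, DC + "title", "Home page"),
    (PAGE, DC + "creator", "http://example.org/staff/85740"),
    ("http://example.org/staff/85740", DC + "title", "Staff profile"),
]


def describe(graph, subject):
    """Collect the properties and values stated about one resource."""
    return {prop: value for (subj, prop, value) in graph if subj == subject}


page_description = describe(triples, PAGE)
```

In graph terms, subjects and values are the nodes and each property is a labelled arc from a subject to a value; a value that is itself a URI can be the subject of further statements.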
RDFS is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. The first version was published by W3C in April 1998, and the final W3C recommendation was released in February 2004. Main RDFS components are included in the more expressive language OWL.
In KIM Platform semantic annotation is used both as:
Semantic repositories are engines similar to the database management systems (DBMS) - they allow for storage, querying, and management of structured data. The major differences with the DBMS can be summarized as follows:
"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
"The goal of the Semantic Web initiative is to create a universal medium for the exchange of data where data can be shared and processed by automated tools as well as by people. The Semantic Web is designed to smoothly interconnect personal information management, enterprise application integration, and the global sharing of commercial, scientific and cultural data." Tim Berners-Lee
Software as a service (SaaS) is software that is deployed over the internet and/or is deployed to run behind a firewall in your local area network or personal computer. With SaaS, a provider licenses an application to customers as a service on demand, through a subscription or a “pay-as-you-go” model. SaaS is also called “software on demand.”
Structured content refers to information or content that has been broken down and classified using meta-data. Structured content often refers to information that has been classified using XML, but can also relate to information classified using other standard or proprietary forms of meta-data.
Structured queries are queries that process structural text elements (meta-data or schema) instead of simple keywords.
Taxonomy is a classification that arranges the terms in the controlled vocabulary into a hierarchy. It allows related terms to be grouped together in a parent-child relationship. An example of a taxonomy is the Linnaean taxonomy (a biological classification).
Thesaurus is a classification that extends taxonomies. It contains hierarchically arranged terms, grouped together according to similarity of meaning (containing synonyms and sometimes antonyms). A typical example is WordNet.
Text mining is the process of discovering and presenting knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing. A typical application is to scan a set of documents and to populate a search index with the extracted information.
Typical subtasks are:
URI is a compact string of characters used to identify or name a resource. The main purpose of this identification is to enable interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols. URIs are defined in schemes defining a specific syntax and associated protocols.
Here's a URI example: http://en.wikipedia.org/wiki/Uniform_Resource_Identifier. A URI may be classified as a locator (URL) or a name (URN) or both.
Uniform Resource Name (URN) is like a person's name, while a Uniform Resource Locator (URL) is like their street address. The URN defines the identity of something, while the URL provides a method for finding something. Essentially, "what" vs. "where". For more information, see http://www.w3.org/TR/cooluris/