What is Information Extraction?

Information extraction is the process of pulling structured information out of unstructured textual sources so that entities can be found, classified and stored in a database. Semantically enhanced information extraction (also known as semantic annotation) couples those entities with their semantic descriptions and connections from a knowledge graph. By adding metadata to the extracted concepts, this technology addresses many challenges in enterprise content management and knowledge discovery.

Semantic Information Extraction

Information extraction is the process of extracting pre-specified information from textual sources. One of the simplest examples is when your email client extracts only the relevant data from a message so that you can add it to your calendar.

Other free-flowing textual sources from which information extraction can distill structured information include legal acts, medical records, social media interactions and streams, online news, government documents, corporate reports and more.

By gathering detailed structured data from texts, information extraction enables:

  • The automation of tasks such as smart content classification, integrated search, management and delivery;
  • Data-driven activities such as mining for patterns and trends, uncovering hidden relationships, etc.

How Does Information Extraction Work?

There are many subtleties and complex techniques involved in the process of information extraction, but a good start for a beginner is to remember:

Text in & Data out

To elaborate on this minimalist description, the process transforms an unstructured text or a collection of texts into sets of facts (i.e., formal, machine-readable statements of the type “Bukowski is the author of Post Office”) that are then used to populate a database (such as an American literature database).
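As a rough illustration of "text in, data out", the fact from the sentence above could be stored as a subject, predicate and object row. This is only a minimal sketch: the `facts` table, the `author_of` predicate name and the helper functions are assumptions made for the example, not part of any particular product.

```python
import sqlite3

def store_facts(conn, facts):
    """Store (subject, predicate, object) facts in a hypothetical 'facts' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", facts)
    conn.commit()

def works_by(conn, author):
    """Return every object linked to the author by the assumed 'author_of' predicate."""
    rows = conn.execute(
        "SELECT object FROM facts WHERE subject = ? AND predicate = 'author_of'",
        (author,),
    )
    return [row[0] for row in rows]

# An in-memory database stands in for the "American literature database".
conn = sqlite3.connect(":memory:")
store_facts(conn, [("Bukowski", "author_of", "Post Office")])
print(works_by(conn, "Bukowski"))  # ['Post Office']
```

Once the statement is a row rather than a sentence, it can be queried, joined and counted like any other data.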

Typically, for structured information to be extracted from unstructured texts, the following main subtasks are involved:

  • Pre-processing of the text – this is where the text is prepared for processing with the help of computational linguistics tools such as tokenization, sentence splitting, morphological analysis, etc.
  • Finding and classifying concepts – this is where mentions of people, things, locations, events and other pre-specified types of concepts are detected and classified.
  • Connecting the concepts – this is the task of identifying relationships between the extracted concepts.
  • Unifying – this subtask is about presenting the extracted data in a standard form.
  • Getting rid of the noise – this subtask involves eliminating duplicate data.
  • Enriching your knowledge base – this is where the extracted knowledge is ingested in your database for further use.
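The subtasks above can be sketched as a toy pipeline. Everything here is a simplifying assumption: pre-processing is reduced to naive sentence splitting, concepts are found with a small hand-written gazetteer (`GAZETTEER`), and relationships are detected with two fixed phrases. Real systems typically rely on trained models for these steps.

```python
import re

# Hypothetical gazetteer: surface form -> (canonical name, concept type).
GAZETTEER = {
    "Bukowski": ("Charles Bukowski", "Person"),
    "Charles Bukowski": ("Charles Bukowski", "Person"),
    "Post Office": ("Post Office", "Book"),
}

def split_sentences(text):
    """Pre-processing: naive sentence splitting on end punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_concepts(sentence):
    """Finding and classifying: longest-first gazetteer lookup."""
    found = []
    for surface in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", sentence):
            # Skip a match nested inside a longer mention that was already found.
            if any(m.start() >= s and m.end() <= e for s, e, _ in found):
                continue
            found.append((m.start(), m.end(), surface))
    return sorted(found)

def connect(sentence, mentions):
    """Connecting: one hand-written pattern for the assumed 'author_of' relation."""
    facts = []
    for _, end_a, a in mentions:
        for start_b, _, b in mentions:
            between = sentence[end_a:start_b].strip()
            if end_a <= start_b and between in ("is the author of", "wrote"):
                facts.append((a, "author_of", b))
    return facts

def unify(fact):
    """Unifying: replace surface forms with their canonical names."""
    subj, pred, obj = fact
    return (GAZETTEER[subj][0], pred, GAZETTEER[obj][0])

def extract(text, knowledge_base):
    """Run all steps; the set removes duplicates as facts are ingested."""
    for sentence in split_sentences(text):
        mentions = find_concepts(sentence)
        for fact in connect(sentence, mentions):
            knowledge_base.add(unify(fact))
    return knowledge_base

text = "Bukowski is the author of Post Office. Charles Bukowski wrote Post Office."
print(extract(text, set()))
```

Both sentences express the same fact with different wording, so after unification and de-duplication the knowledge base holds a single entry.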

Information extraction can be entirely automated or performed with the help of human input.

Typically, the best information extraction solutions are a combination of automated methods and human processing.

An Example of Information Extraction

Consider the paragraph below (an excerpt from a news article about Valencia MotoGP and Marc Marquez):

Marc Marquez was fastest in the final MotoGP warm-up session of the 2016 season at Valencia, heading Maverick Vinales by just over a tenth of a second.

After qualifying second on Saturday behind a rampant Jorge Lorenzo, Marquez took charge of the 20-minute session from the start, eventually setting a best time of 1m31.095s at half-distance.

Through information extraction, the following basic facts can be pulled out of the free-flowing text and organized in a structured, machine-readable form:

Person: Marc Marquez
Location: Valencia
Event: MotoGP
Related mentions: Maverick Vinales, Yamaha, Jorge Lorenzo
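A record like this one could be produced roughly as follows. How the NOW platform actually does it is not described here; this sketch assumes a hand-built gazetteer (`GAZETTEER`) and the hypothetical rule that the most frequently mentioned person is the main one. It works only on the excerpt, so "Yamaha", which does not appear in the two quoted paragraphs, is not among its results.

```python
import re
from collections import Counter

TEXT = (
    "Marc Marquez was fastest in the final MotoGP warm-up session of the 2016 "
    "season at Valencia, heading Maverick Vinales by just over a tenth of a second. "
    "After qualifying second on Saturday behind a rampant Jorge Lorenzo, Marquez "
    "took charge of the 20-minute session from the start, eventually setting a "
    "best time of 1m31.095s at half-distance."
)

# Hypothetical gazetteer: alias -> (canonical name, type).
GAZETTEER = {
    "Marc Marquez": ("Marc Marquez", "Person"),
    "Marquez": ("Marc Marquez", "Person"),
    "Maverick Vinales": ("Maverick Vinales", "Person"),
    "Jorge Lorenzo": ("Jorge Lorenzo", "Person"),
    "Valencia": ("Valencia", "Location"),
    "MotoGP": ("MotoGP", "Event"),
}

def count_mentions(text):
    """Count mentions per canonical entity, matching longer aliases first."""
    counts = Counter()
    remaining = text
    for alias in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"\b" + re.escape(alias) + r"\b"
        counts[GAZETTEER[alias]] += len(re.findall(pattern, remaining))
        # Blank out matched text so a shorter alias does not count it again.
        remaining = re.sub(pattern, " ", remaining)
    return counts

def first_of(counts, kind):
    """Return the first mentioned entity of the given type, or None."""
    names = [name for (name, k), n in counts.items() if k == kind and n > 0]
    return names[0] if names else None

def build_record(text):
    counts = count_mentions(text)
    people = [(name, n) for (name, k), n in counts.items() if k == "Person" and n > 0]
    people.sort(key=lambda item: -item[1])  # most mentioned person first
    return {
        "Person": people[0][0] if people else None,
        "Location": first_of(counts, "Location"),
        "Event": first_of(counts, "Event"),
        "Related mentions": [name for name, _ in people[1:]],
    }

print(build_record(TEXT))
```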

[NOW screenshot. Image source: NOW]

This is a very basic example of how facts are distilled from a textual source. You can try this yourself by testing other scenarios live on the NOW platform.

To get better acquainted with the platform and how it works, we recommend the following article: 4 Things NOW Lets You Do With Content.

Typical Information Extraction Applications

Information extraction can be applied to a wide range of textual sources: from emails and Web pages to reports, presentations, legal documents and scientific papers. The technology successfully solves challenges related to content management and knowledge discovery in the areas of:

  • Business intelligence (for enabling analysts to gather structured information from multiple sources);
  • Financial investigation (for analysis and discovery of hidden relationships);
  • Scientific research (for automated reference discovery or suggestion of relevant papers);
  • Media monitoring (for mentions of companies, brands, people);
  • Healthcare records management (for structuring and summarizing patient records);
  • Pharma research (for drug discovery, adverse effects discovery and automated analysis of clinical trials).

Adding Semantics to the Information Extraction Process

While information extraction helps with finding entities, classifying them and storing them in a database, semantically enhanced information extraction couples those entities with their semantic descriptions and connections from a knowledge graph. The latter is also known as semantic annotation. Technically, semantic annotation adds metadata to the extracted concepts, providing both class and instance information about them.
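To make "class and instance information" more concrete, here is a minimal sketch of semantic annotation against a tiny in-memory knowledge graph. The graph, its `http://example.org/...` identifiers and the `competesIn` link are illustrative assumptions, not entries from a real knowledge graph or the output of a specific tagging service.

```python
import re

# Hypothetical mini knowledge graph keyed by instance identifier.
KNOWLEDGE_GRAPH = {
    "http://example.org/person/Marc_Marquez": {
        "class": "Person",
        "labels": ["Marc Marquez", "Marquez"],
        "links": {"competesIn": "http://example.org/event/MotoGP"},
    },
    "http://example.org/event/MotoGP": {
        "class": "Event",
        "labels": ["MotoGP"],
        "links": {},
    },
}

def annotate(text):
    """Attach class, instance and link metadata to each mention found in text."""
    candidates = []
    for instance, entry in KNOWLEDGE_GRAPH.items():
        for label in entry["labels"]:
            for m in re.finditer(r"\b" + re.escape(label) + r"\b", text):
                candidates.append({
                    "start": m.start(),
                    "end": m.end(),
                    "text": m.group(),
                    "class": entry["class"],   # class information
                    "instance": instance,      # instance information
                    "links": entry["links"],   # connections in the graph
                })
    # Prefer the longest mention at each position and drop overlapping ones.
    candidates.sort(key=lambda a: (a["start"], -(a["end"] - a["start"])))
    annotations, last_end = [], 0
    for a in candidates:
        if a["start"] >= last_end:
            annotations.append(a)
            last_end = a["end"]
    return annotations

for a in annotate("Marc Marquez was fastest in the final MotoGP warm-up session."):
    print(a["text"], "->", a["class"], a["instance"])
```

The difference from plain extraction is that each mention now points to an identified node in the graph, so the text can be traversed along that node's links to related knowledge.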

Semantic annotation is applicable to any sort of text – web pages, regular (non-web) documents, text fields in databases, etc. Further knowledge acquisition can be performed by extracting more complex dependencies – analysis of relationships between entities, event and situation descriptions, etc.

Extending the existing practices of information extraction, semantic information extraction enables new types of applications such as:

  • highlighting, indexing and retrieval;
  • categorization and generation of more advanced metadata;
  • smooth traversal between unstructured text and available relevant knowledge.

To see how semantic information extraction works and to get a real feel for how free-flowing, unstructured text is turned into data facts stored as interlinked database entities, you can try Ontotext’s Tagging Service.
