Information extraction is the process of extracting specific (pre-specified) information from textual sources. One of the simplest examples is when your email client extracts the details of an event from a message so you can add it to your calendar.
Other free-flowing textual sources from which information extraction can distill structured information include legal acts, medical records, social media interactions and streams, online news, government documents, corporate reports and more.
By gathering detailed structured data from texts, information extraction enables:
There are many subtleties and complex techniques involved in the process of information extraction, but a good start for a beginner is to remember:
To elaborate a bit on this minimalist way of describing information extraction, the process involves transforming an unstructured text or a collection of texts into sets of facts (i.e., formal, machine-readable statements of the type "Bukowski is the author of Post Office") that are then used to populate a database (such as an American Literature database).
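Such a fact can be represented as a simple subject–predicate–object triple. The sketch below (a minimal illustration; the predicate name and the in-memory list standing in for a database are our own assumptions, not any particular product's format) shows how extracted facts might populate a database:

```python
# A "fact" as a machine-readable subject-predicate-object triple.
# The in-memory list below stands in for a real database; production
# systems would typically use a relational database or a triple store.
database = []

def add_fact(subject, predicate, obj):
    """Populate the database with one extracted fact."""
    database.append((subject, predicate, obj))

# The example fact from the text, in structured form
# (the predicate name "is_author_of" is an illustrative choice).
add_fact("Bukowski", "is_author_of", "Post Office")
```

Once facts are stored in this shape, they can be queried like any other structured data, e.g. filtering the list for all triples whose predicate is `is_author_of`.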
Typically, for structured information to be extracted from unstructured texts, the following main subtasks are involved:
Information extraction can be entirely automated or performed with the help of human input.
Typically, the best information extraction solutions are a combination of automated methods and human processing.
Consider the paragraph below (an excerpt from a news article about Valencia MotoGP and Marc Marquez):
Marc Marquez was fastest in the final MotoGP warm-up session of the 2016 season at Valencia, heading Maverick Vinales by just over a tenth of a second.
After qualifying second on Saturday behind a rampant Jorge Lorenzo, Marquez took charge of the 20-minute session from the start, eventually setting a best time of 1m31.095s at half-distance.
Through information extraction, the following basic facts can be pulled out of the free-flowing text and organized in a structured, machine-readable form:
Person: Marc Marquez
Related mentions: Maverick Vinales, Yamaha, Jorge Lorenzo
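The extracted facts above could be stored in a machine-readable record like the following (an illustrative shape of our own choosing, not the actual output format of the NOW platform):

```python
# One structured record distilled from the free-flowing news excerpt.
# Field names mirror the facts listed above; the dict layout itself
# is an illustrative assumption.
record = {
    "Person": "Marc Marquez",
    "Related mentions": ["Maverick Vinales", "Yamaha", "Jorge Lorenzo"],
}
```

In this form the facts can be filtered, aggregated, or loaded into a database alongside records extracted from other articles.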
This is a very basic example of how facts are distilled from a textual source. You can try this yourself by testing other scenarios live on the NOW platform.
To get better acquainted with what the platform is and how it works, we recommend the following article: 4 Things NOW Lets You Do With Content.
Information extraction can be applied to a wide range of textual sources: from emails and Web pages to reports, presentations, legal documents and scientific papers. The technology successfully solves challenges related to content management and knowledge discovery in the areas of:
While information extraction helps find entities, classify them and store them in a database, semantically enhanced information extraction couples those entities with their semantic descriptions and connections from a knowledge graph. This is also known as semantic annotation. Technically, semantic annotation adds metadata to the extracted concepts, providing both class and instance information about them.
Semantic annotation is applicable to any sort of text: web pages, regular (non-web) documents, text fields in databases, etc. Further knowledge acquisition can be performed on the basis of extracting more complex dependencies: analysis of relationships between entities, event and situation descriptions, etc.
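The class-and-instance metadata that semantic annotation adds can be sketched as a lookup against a knowledge graph. In the minimal example below, the tiny dictionary stands in for a knowledge graph, and the URI is a made-up placeholder; real deployments would resolve mentions against an actual graph (e.g. via SPARQL queries against a triple store).

```python
# A tiny stand-in for a knowledge graph, mapping an entity mention to
# its class and instance identifier. The URI is a hypothetical example.
KNOWLEDGE_GRAPH = {
    "Marc Marquez": {
        "class": "Person",
        "instance_uri": "http://example.org/resource/Marc_Marquez",
    },
}

def annotate(mention):
    """Attach class and instance metadata to an extracted mention."""
    meta = KNOWLEDGE_GRAPH.get(mention)
    if meta is None:
        # Unknown mentions get no semantic metadata.
        return {"mention": mention, "class": None, "instance_uri": None}
    return {"mention": mention, **meta}

annotation = annotate("Marc Marquez")
```

The annotated mention now carries both its class ("Person") and a link to a specific instance in the graph, which is what allows applications to follow connections from the text into the knowledge graph.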
Extending the existing practices of information extraction, semantic information extraction enables new types of applications such as:
To see how semantic information extraction works, and to get a feel for how free-flowing, unstructured text is turned into facts stored as interlinked database entities, you can try Ontotext's Tagging Service.
White Paper: Text Analysis for Content Management