Text Analysis is about parsing texts in order to extract machine-readable facts from them. The purpose of Text Analysis is to create structured data out of free text content. The process can be thought of as slicing and dicing heaps of unstructured, heterogeneous documents into easy-to-manage and easy-to-interpret data pieces. Text Analysis is closely related to terms like Text Mining, Text Analytics and Information Extraction – see the discussion below.
The central challenge in Text Analysis is the ambiguity of human languages. Most people in the USA will easily understand that “Red Sox Tame Bulls” refers to a baseball game. Lacking this background knowledge, a computer will generate several linguistically valid interpretations that are very far from the intended meaning of the headline. People not interested in baseball will have trouble understanding it, too.
Achieving high accuracy for a specific domain and set of document types requires the development of a customized text mining pipeline that incorporates or reflects these specifics.
Modern Text Analysis technology interacts extensively with knowledge graphs (KGs).
Ontotext Platform implements all flavors of this interplay, linking text to big knowledge graphs to enable solutions for content tagging, classification and recommendation.
Examples of the typical steps of Text Analysis, as well as intermediate and final results, are presented in the fundamental What is Semantic Annotation?, which also features a short video. Ontotext’s NOW public news service demonstrates semantic tagging of news against a big knowledge graph developed around DBpedia.
Text Analysis and Text Mining are used as synonyms. Information Extraction is the name of the scientific discipline behind text mining. The article What is Information Extraction? provides a list of typical Text Analysis tasks.
All these terms refer to partial Natural Language Processing (NLP), where the final goal is not to fully understand the text, but rather to retrieve specific information from it in the most practical manner. This means striking a good balance between the effort needed to develop and maintain the analytical pipeline, its computational cost and performance (e.g., how much memory it needs and how long it takes to process one document), and its accuracy. The latter is measured with recall (extraction completeness), precision (quality of the extracted information) and combined measures such as the F-score.
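These accuracy measures can be computed by comparing a pipeline's output against a gold-standard set of annotations. A minimal sketch, using made-up entity sets for illustration (the names below are hypothetical, not the output of a real pipeline):

```python
# Score an extraction pipeline's output against a gold-standard set.
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1-score for two sets of extracted facts."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: the pipeline extracted three entities,
# two of which appear in a four-entity gold standard.
predicted = {"Red Sox", "Bulls", "Chicago"}
gold = {"Red Sox", "Bulls", "Fenway Park", "Boston"}
p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Precision here is 2/3 (two of three extractions are correct) and recall is 2/4 (two of four gold entities were found); the F1-score is their harmonic mean.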
You will often find Text Analysis used interchangeably with Text Analytics. And while to the untrained mind these might sound like synonyms, from the point of view of practice and experience, there is a subtle difference worth mentioning.
Text Analysis is the term describing the process of computational analysis of texts.
Text Analytics involves a set of techniques and approaches towards bringing textual content to a point where it is represented as data and then mined for insights/trends/patterns.
In other words, Text Analysis translates a text into the language of data. Once Text Analysis has “prepared” the content, Text Analytics kicks in to help make sense of these data.
Text Analysis is what you do in order to transform a sentence into data and present to computers what the text is about: Rome, the Roman Empire. Then, once represented in the universal language of data, the sentence can easily enter many analytical processes, Text Analytics included. With Text Analytics, you will be able to determine, for instance, what percentage of texts mention Rome in the context of the Roman Empire rather than in the context of vacations in Europe.
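A real semantic annotation pipeline links mentions in text to knowledge graph identifiers. As a minimal sketch of the idea, the toy gazetteer lookup below stands in for a full pipeline; the DBpedia-style URIs are illustrative, not the output of a real system:

```python
import re

# Toy gazetteer mapping surface forms to (illustrative) entity URIs.
GAZETTEER = {
    "Rome": "http://dbpedia.org/resource/Rome",
    "Roman Empire": "http://dbpedia.org/resource/Roman_Empire",
}

def annotate(text):
    """Return (surface form, start offset, linked entity) triples."""
    annotations, covered = [], set()
    # Match longer names first, so "Roman Empire" wins over a bare "Rome".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for match in re.finditer(re.escape(name), text):
            span = range(match.start(), match.end())
            if covered.isdisjoint(span):  # skip matches inside a longer one
                annotations.append((name, match.start(), GAZETTEER[name]))
                covered.update(span)
    return sorted(annotations, key=lambda a: a[1])

text = "Rome was the capital of the Roman Empire."
for surface, offset, uri in annotate(text):
    print(surface, offset, uri)
```

The output is exactly the kind of structured record – a mention, its position, and a knowledge graph identifier – that downstream Text Analytics can aggregate across a whole corpus.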
Companies use Text Analysis to set the stage for a data-driven approach towards managing content. The moment textual sources are sliced into easy-to-automate data pieces, a whole new set of opportunities opens for processes like decision making, product development, marketing optimization, business intelligence and more.
In a business context, analyzing texts to capture data from them supports a range of broader organizational tasks.
When turned into data, textual sources can be further used for deriving valuable information, discovering patterns, automatically managing, using and reusing content, searching beyond keywords and more.
Using Text Analysis is one of the first steps in many data-driven approaches, as the process extracts machine-readable facts from large bodies of text and allows these facts to be entered automatically into a database or a spreadsheet. The database or spreadsheet is then used to analyze the data for trends, to generate a natural language summary, or for indexing purposes in Information Retrieval applications.
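Once facts are in a structured store, standard query tools take over. A minimal sketch with SQLite, assuming the pipeline emits (document id, entity, mention count) tuples – the table layout and sample data are illustrative, not a fixed schema:

```python
import sqlite3

def load_and_summarize(facts):
    """Load extracted facts into SQLite and total mentions per entity."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE extracted_facts (doc_id TEXT, entity TEXT, mentions INTEGER)"
    )
    conn.executemany("INSERT INTO extracted_facts VALUES (?, ?, ?)", facts)
    # The structured store can now answer analytical questions the raw
    # text could not, e.g. how often each entity occurs overall.
    rows = conn.execute(
        "SELECT entity, SUM(mentions) FROM extracted_facts "
        "GROUP BY entity ORDER BY entity"
    ).fetchall()
    conn.close()
    return rows

# Hypothetical pipeline output across two documents.
facts = [("doc-1", "Rome", 3), ("doc-1", "Roman Empire", 2), ("doc-2", "Rome", 1)]
for entity, total in load_and_summarize(facts):
    print(entity, total)
```

The same table could equally feed a trend dashboard or a search index; the point is that once text becomes rows, any conventional data tooling applies.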
White Paper: Text Analysis for Content Management