The Gold Standard – The Key to Information Extraction and Data Quality Control

How a human-curated body of data is used in AI to train algorithms for search, extraction and classification, and to measure their accuracy

May 27, 2021 8 mins. read Gergana Petkova

We often want computers to do the tasks we give them the way people do. But, as Ontotext’s CEO Atanas Kiryakov often says, we forget that nobody nurtures Artificial Intelligence (AI) systems for 7 years so that they learn to walk, talk, count, read and write and, even more important, not to touch hot stoves, tilt full glasses or ask: “Can I get 3 more cookies?” Nor do they go through 12 grades of formal schooling to learn the basic concepts of all sciences and the most important geographical objects, people, companies, etc.

Without all this background knowledge, computers need a machine-readable point of reference that represents “the ground truth” before they can perform like humans. And this is where the Gold Standard comes in.

Originally, the Gold Standard was a monetary system that required countries to fix the value of their currencies to a certain amount of gold, aiming to replace unreliable human control with a fixed measurement that could be used by everyone. In much the same way, in the context of AI systems, the Gold Standard refers to a set of data that has been manually prepared or verified and that represents “the objective truth” as closely as possible.

One of the main uses of the Gold Standard is to train AI systems to identify the patterns in various types of data with the help of machine learning (ML) algorithms. In other words, by giving them plenty of data to learn from, we can “teach” AI systems to automatically identify such patterns. We can also use the Gold Standard to evaluate the performance of such algorithms.

What do AI systems need to learn?

For AI systems to recognize patterns accurately, the data they are fed needs to be unambiguous; otherwise they can’t make any sense of it. And this is a challenge, as today’s data comes in huge volumes and from various sources. Each source covers different aspects of the same real-world phenomena or uses different terms for relatively similar things.

Let’s take as an example some data about company transactions and let’s say that we are interested in who buys whom and for how much. Our first data source says that on June 13, 2016 Microsoft bought LinkedIn, while the second says that Microsoft bought LinkedIn for $26.2 billion. Here, we see that the first source doesn’t specify the sum and the second has no mention of the date. But following real-world business logic, if we have Microsoft and LinkedIn in both data sources, it follows that these two records refer to the same transaction and so they should be merged into one.

However, this is not always so straightforward. Consider an example in which our first data source says that Microsoft invested $240 million in Facebook and the second says that on October 24, 2007 Microsoft invested in Facebook. Here, we cannot merge these two records solely because Microsoft and Facebook appear in both sources: while an acquisition is usually a one-time affair, a company can invest in another company more than once.

So, to help AI systems make sense of all this ambiguity, we use data linking techniques. They identify, match and merge data records referring to the same or a similar entity in multiple datasets and also identify entities that seem to be the same or similar but are not.
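The acquisition and investment examples above can be read as a small linking rule. The following Python sketch is only an illustration, not Ontotext's implementation: the `Transaction` type, the `same_transaction` and `merge` functions and the sample records (company names, dates and amounts) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    kind: str                          # "acquisition" or "investment"
    buyer: str
    target: str
    on: Optional[date] = None          # None when the source omits the date
    amount_usd: Optional[int] = None   # None when the source omits the sum


def compatible(x, y) -> bool:
    """Two field values are compatible if either is missing or they are equal."""
    return x is None or y is None or x == y


def same_transaction(a: Transaction, b: Transaction) -> bool:
    """Hypothetical linking rule for two records from different sources."""
    if (a.kind, a.buyer, a.target) != (b.kind, b.buyer, b.target):
        return False
    if not (compatible(a.on, b.on) and compatible(a.amount_usd, b.amount_usd)):
        return False
    if a.kind == "acquisition":
        # An acquisition is usually a one-time event, so matching parties suffice.
        return True
    # Investments can repeat, so require positive evidence: at least one
    # field that both records state and agree on.
    shares_date = a.on is not None and a.on == b.on
    shares_amount = a.amount_usd is not None and a.amount_usd == b.amount_usd
    return shares_date or shares_amount


def merge(a: Transaction, b: Transaction) -> Transaction:
    """Combine two linked records, filling each gap from the other record."""
    return Transaction(
        a.kind,
        a.buyer,
        a.target,
        a.on if a.on is not None else b.on,
        a.amount_usd if a.amount_usd is not None else b.amount_usd,
    )


# Made-up records: each source knows only part of the story.
acq_a = Transaction("acquisition", "Acme Corp", "Widget Ltd", on=date(2020, 1, 15))
acq_b = Transaction("acquisition", "Acme Corp", "Widget Ltd", amount_usd=5_000_000)
inv_a = Transaction("investment", "Acme Corp", "Gadget Inc", amount_usd=2_000_000)
inv_b = Transaction("investment", "Acme Corp", "Gadget Inc", on=date(2021, 3, 1))

print(same_transaction(acq_a, acq_b))  # True: same parties, no conflicting fields
print(same_transaction(inv_a, inv_b))  # False: could be two separate investments
```

The asymmetry between the two transaction kinds is the point of the sketch: the same pair of company names is enough evidence for an acquisition but not for an investment.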

In the above case of merging information about companies from different data sources, data linking helps us encode the real-world business logic into data linking rules. But, before we can have any larger scale implementation of these rules, we have to test their validity. Simply put, we need to be able to measure and evaluate our results against clearly set criteria.
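One common way to measure linking results against a manually verified reference is to compute precision, recall and F1. The article does not say which measures were used, so the choice here is an assumption, and the record identifiers in the sketch are made up.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (record id in source A, record id in source B)


def precision_recall_f1(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Compare the links proposed by a rule set with the Gold Standard links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: four links verified by human raters, three proposed by the rules.
gold_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b7")}

precision, recall, f1 = precision_recall_f1(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three proposed links are correct (precision 2/3) and two of the four true links are found (recall 1/2), which makes it easy to see whether a rule change trades one for the other.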

How does the Gold Standard help data linking?

To evaluate the performance of our data linking algorithms and calibrate them for higher accuracy, we need to use a set of Gold Standard data that has been manually verified. This verification has to come from at least two independent raters to make sure we avoid any biased points of view. Even more important, there should be a sufficient level of agreement between the human experts, which is the only way to prove that the task is well defined and that there are clear guidelines for rating, coding or annotation. Once we are satisfied with the results, we can train our AI system to automatically match other data.
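Agreement between two raters is often quantified with Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The article does not name a specific measure, so kappa is an assumption here, and the labels below are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty list of items")
    n = len(rater_a)
    # Share of items on which the two raters chose the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater assigned labels at their own base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments on ten candidate record pairs.
rater_a = ["match"] * 5 + ["no match"] * 5
rater_b = ["match"] * 4 + ["no match"] * 5 + ["match"]

print(round(cohen_kappa(rater_a, rater_b), 2))  # 0.6
```

In this example the raters agree on 8 of 10 items, but since half of that agreement could arise by chance, kappa comes out at about 0.6 rather than 0.8. What counts as "sufficient" agreement depends on the task and is not specified in the article.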

Now, let’s consider some specific examples!

In the EU-funded research project CIMA (Intelligent Matching and Linking of Organization Data from Different Sources), the aim was to link and harmonize company data from various sources.

One of the tasks connected to the project was to classify companies by industry sectors. Ontotext used DBpedia, a structured version of Wikipedia, as a Gold Standard. This made sense because the thousands of organizations in this giant dataset had already been classified into industry sectors by more than one human participant. Wikipedia is not only the biggest encyclopedia; its contributors also adhere to a strict editorial process, so each industry tag is usually assigned by one person and reviewed by another. For the purposes of this task, Ontotext structured the data into an industry taxonomy where organizations were classified according to their attributes and relationships from DBpedia as well as their text descriptions. Then, ML algorithms were trained to classify other organizations under the 32 top-level industry sectors, e.g., Automotive, Healthcare and Retail.
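To make the idea of training on labeled descriptions concrete, here is a minimal sketch of a text classifier. It uses a multinomial naive Bayes model with add-one smoothing, which is a stand-in chosen for brevity; the article does not say which algorithms were used in CIMA. The training descriptions are invented, and only the three sector names come from the text above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """examples: list of (description, sector) pairs from a labeled sample set."""
    word_counts = defaultdict(Counter)   # sector -> word frequencies
    sector_counts = Counter()            # sector -> number of descriptions
    vocabulary = set()
    for text, sector in examples:
        tokens = tokenize(text)
        word_counts[sector].update(tokens)
        sector_counts[sector] += 1
        vocabulary.update(tokens)
    return word_counts, sector_counts, vocabulary


def classify(text, model):
    """Return the sector with the highest log-probability for the description."""
    word_counts, sector_counts, vocabulary = model
    total_docs = sum(sector_counts.values())
    known_tokens = [t for t in tokenize(text) if t in vocabulary]
    best_sector, best_score = None, -math.inf
    for sector, doc_count in sector_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[sector].values())
        for token in known_tokens:
            # Add-one smoothing keeps unseen words from zeroing out a sector.
            score += math.log(
                (word_counts[sector][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_sector, best_score = sector, score
    return best_sector


# Invented training data standing in for human-verified sector labels.
gold_examples = [
    ("manufacturer of cars and trucks", "Automotive"),
    ("maker of electric cars and batteries", "Automotive"),
    ("operates hospitals and clinics", "Healthcare"),
    ("develops medicines and runs clinics", "Healthcare"),
    ("chain of grocery stores", "Retail"),
    ("online stores selling clothing", "Retail"),
]

model = train(gold_examples)
print(classify("company that builds electric trucks", model))  # Automotive
print(classify("runs hospitals", model))                       # Healthcare
```

A real system would also draw on the structured attributes and relationships mentioned above, not just on words in a description.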

Another task was to make it possible to find companies similar to a particular company (e.g., big tech companies like Google, Facebook and Apple) as well as to specify the degree of similarity between them. For this purpose, Ontotext prepared a Gold Standard dataset of thousands of company pairs, each labeled with a specific degree of similarity: very similar, little similar, has some similarity or has no similarity. This sample set was then used to fine-tune the ML algorithms even further.

How does the Gold Standard help Information Extraction from text?

Gold Standards are also needed to create training and evaluation sets for AI systems when we want to extract information from free text. In natural language processing (NLP) and computational linguistics, the Gold Standard is typically a corpus of text or a set of documents, annotated or tagged with the desired results of the analysis, be it parts of speech, syntactic parses, concepts or relationships.

When “reading” unstructured text, AI systems first need to transform it into machine-readable sets of facts. This happens through the process of semantic annotation, where documents are tagged with relevant concepts and enriched with metadata, i.e., references that link the content to concepts described in a knowledge graph.

But here again ambiguity is a stumbling block. For example, while most of the time people can easily tell from the context whether Apple refers to Apple Inc. (the technology company) or Apple Records (the record label founded by the Beatles in 1968), AI systems lack the background knowledge to make such a distinction.

To help AI systems “tell” whether “Apple” refers to Apple Inc. or Apple Records, we again need to prepare a Gold Standard set of documents, where the accuracy of the results has been verified by more than one person. In the same way as with data linking, we have to adjust our ML algorithms by giving them plenty of documents to learn from.
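As a rough illustration of both steps, the sketch below disambiguates “Apple” by counting the context words a sentence shares with a keyword profile for each candidate, and then scores the result against a handful of hand-labeled sentences. The keyword profiles, the sentences and the `disambiguate` function are all hypothetical, and real systems use far richer context and trained models rather than a fixed keyword list.

```python
# Hypothetical keyword profiles, standing in for descriptions in a knowledge graph.
CANDIDATES = {
    "Apple Inc.": {"technology", "iphone", "mac", "software", "computer"},
    "Apple Records": {"record", "label", "beatles", "album", "music"},
}


def disambiguate(sentence, candidates=CANDIDATES):
    """Pick the candidate whose keywords overlap most with the sentence."""
    words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
    scores = {name: len(words & keywords) for name, keywords in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None when there is no evidence


# Invented sentences with the label a human annotator would assign.
gold_annotations = [
    ("Apple unveiled a new iPhone and updated its Mac software.", "Apple Inc."),
    ("The Beatles released the album on their own label, Apple.", "Apple Records"),
    ("Apple signed a new artist to record an album.", "Apple Records"),
    ("Apple reported record revenue from music streaming software.", "Apple Inc."),
]

correct = sum(disambiguate(text) == label for text, label in gold_annotations)
accuracy = correct / len(gold_annotations)
print(accuracy)  # 0.75
```

The last sentence is misclassified because “record” and “music” outweigh “software”. Without the human-verified labels, that kind of error would go unnoticed.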

Once developed and trained, these algorithms become the building blocks of systems that can automatically interpret data. And by gathering detailed structured data from unstructured texts, they can enable the automation of tasks such as smart content classification, relevant content recommendation, intelligent semantic search, mining for patterns and trends, and many more.

Gold Standard takeaways

Trustworthy sample sets are indispensable for the training of machine learning algorithms that are used in various AI systems for pattern recognition. If these sample sets are not of high quality, clean and representative, we cannot hope to train the algorithms to get useful results.

Beyond training, each and every AI system, be it based on symbolic rules or statistical models, has to be evaluated, in the same way children take exams to graduate. Evaluation as well requires a trusted reference, which is again the Gold Standard. Evaluation is for AI systems what quality assurance (QA) is for software systems.

Atanas Kiryakov, CEO at Ontotext

That is why it is vital to manually verify that the rules we have set represent the “objective truth” as closely as possible and that they are interpreted consistently and unambiguously.

Creating a Gold Standard set is a laborious and time-consuming process. But it lays the groundwork for many tailor-made solutions that work with content and data from multiple sources and solve complex business challenges.

Content Manager at Ontotext

Gergana Petkova is a philologist and has more than 15 years of experience at Ontotext, working on technical documentation, Gold Standard corpus curation and preparing content about Semantic Technology and Ontotext's offerings.
