Gaining Global Insights with Multilingual Entity Linking

State-of-the-art AI models for multilingual entity linking enable working with concepts in 100+ languages and help connect and analyze linguistically diverse content

March 8, 2024 · 9 min read · By Ivelina Bozhinova, Andrey Tagarev, and Eneya Georgieva

This is part of Ontotext’s AI-in-Action initiative aimed at enabling data scientists and engineers to benefit from the AI capabilities of our products.

In our interconnected world, information travels across the globe at unprecedented speed, while disinformation spreads even faster and is reused across geographies. To help in the battle against disinformation, Ontotext is tackling the challenge of identifying narratives and disinformation campaigns. This covers a range of tasks, from analyzing textual content in multiple languages to detecting and connecting separate pieces of related manipulative stories.

Similar challenges are faced by international organizations that monitor relevant content in multiple languages, whether they operate in an inherently multilingual field or deal with trade and business across borders and between continents. The difficulty lies in harmonizing content across these linguistic landscapes, especially considering that not everyone has immediate access to a multilingual expert, let alone one who can process tens or even thousands of content pieces in mere minutes.

Linking entities in text to knowledge bases

A necessary step in analyzing multilingual content is linking mentions of entities in text (general concepts or named entities) in different languages to a common knowledge base. The ideal knowledge base would comprise broad information, be dynamically updated on a regular basis, and be multilingual. According to recent publications for entity linking, Wikipedia and Wikidata are among the most popular ones. Since they meet our requirements, we’ve chosen them as our primary target knowledge base. 

Wikidata is the biggest public knowledge graph, covering over 100 million entities. Wikidata entities are connected to Wikipedia articles where such articles exist (Wikipedia has about 7 million articles). We experimented with different models that enable linking to Wikidata and evaluated their performance on several datasets to select the one best suited for multilingual content.

How does multilingual entity linking work?

A multilingual entity linking (MEL) model is a natural language processing (NLP) system designed to detect, disambiguate, and link named entities mentioned in text to a common knowledge base across different languages. Entities in this context can be specific objects such as people, organizations, locations, dates, as well as general concepts, such as “global warming”. 

Connecting words across different languages, however, is no easy feat. Take the example of the “World Health Organization (WHO)”. In German it is “Weltgesundheitsorganisation”, in Italian “Organizzazione Mondiale della Sanità”, and in Polish “Światowa Organizacja Zdrowia”. MEL overcomes this hurdle by connecting all of these language-specific names to the same Wikidata entity: World Health Organization.
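To make the idea concrete, here is a minimal sketch that treats linking as a lookup from language-specific names to a single identifier. The alias table is hand-written for illustration and the function name `link` is our own; a real MEL model predicts the identifier from context rather than from an exact string match. Q7817 is the Wikidata identifier of the World Health Organization.

```python
from typing import Optional

# Hand-written alias table for illustration only; a real MEL model infers
# the identifier from context instead of relying on exact string matches.
WHO_QID = "Q7817"  # Wikidata identifier of the World Health Organization

ALIASES = {
    "world health organization": WHO_QID,
    "who": WHO_QID,
    "weltgesundheitsorganisation": WHO_QID,
    "organizzazione mondiale della sanità": WHO_QID,
    "światowa organizacja zdrowia": WHO_QID,
}


def link(mention: str) -> Optional[str]:
    """Return the Wikidata identifier for a mention, or None if unknown."""
    return ALIASES.get(mention.strip().casefold())


mentions = [
    "Weltgesundheitsorganisation",           # German
    "Organizzazione Mondiale della Sanità",  # Italian
    "Światowa Organizacja Zdrowia",          # Polish
]
linked = {mention: link(mention) for mention in mentions}
```

All three names resolve to the same identifier, which is what lets downstream analysis treat them as one concept regardless of the language of the source text.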

Downstream applications can also fetch and correlate additional information about the mentioned entities available in Wikidata, for example, headquarters address, relations to people and other organizations, and more. Wikidata entries can be used to link not only named entity mentions in different languages, but also general concepts such as “hospital” or “prize”, thus providing common ground for insights over content in different languages.
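As a hedged illustration of such a lookup, the sketch below builds a SPARQL query for the public Wikidata Query Service that asks for the headquarters location (Wikidata property P159) of a linked entity. The helper name `build_headquarters_query` is hypothetical, and the query is only constructed here, not sent over the network.

```python
def build_headquarters_query(qid: str, language: str = "en") -> str:
    """Build a SPARQL query for the headquarters location (P159) of an item.

    The wd:, wdt:, wikibase: and bd: prefixes are predefined by the
    Wikidata Query Service, so they are not declared in the query.
    """
    # Accept only item identifiers such as "Q7817" to keep the query well-formed.
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return (
        "SELECT ?hq ?hqLabel WHERE {\n"
        f"  wd:{qid} wdt:P159 ?hq .\n"
        "  SERVICE wikibase:label "
        f'{{ bd:serviceParam wikibase:language "{language}" . }}\n'
        "}"
    )


# Q7817 is the World Health Organization.
query = build_headquarters_query("Q7817")
```

The same pattern extends to other properties; only the property identifier in the triple pattern changes.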

The image above shows the word cloud of the text in this post. Without technologies like MEL, these words can create beautiful visualizations but cannot be utilized to provide value for the business.

The required MEL model for this task should be capable of performing end-to-end entity linking. It should take unstructured text as input and return annotations, each consisting of an entity mention extracted from the input text and the identifier of the corresponding concept in the target knowledge base. Unfortunately, our research showed that there are only a few such models or systems. For that reason, we also experimented with a two-step approach. The first step is performed either by a multilingual named entity recognition model or by a multilingual entity boundary detection algorithm (identifying where a named entity is mentioned in the text). The second step is performed by a state-of-the-art multilingual entity disambiguation algorithm or system.
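The two-step approach can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation of any of the systems discussed here: the gazetteer-based `detect_mentions` stands in for a trained NER or boundary detection model, the dictionary-based `disambiguate` stands in for a disambiguation model, and all names are our own. Q7817 and Q7184 are the Wikidata identifiers of the World Health Organization and NATO.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the mention begins
    end: int      # character offset just past the mention
    mention: str  # surface form as it appears in the text
    kb_id: str    # identifier in the target knowledge base


# Toy stand-ins for trained models (assumptions for illustration only).
GAZETTEER = ["Weltgesundheitsorganisation", "NATO"]
CANDIDATES = {"weltgesundheitsorganisation": "Q7817", "nato": "Q7184"}


def detect_mentions(text: str) -> List[Tuple[int, int]]:
    """Step 1: find (start, end) spans of entity mentions in the text."""
    spans = []
    for name in GAZETTEER:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def disambiguate(text: str, span: Tuple[int, int]) -> Optional[str]:
    """Step 2: map one detected span to a knowledge base identifier."""
    start, end = span
    return CANDIDATES.get(text[start:end].casefold())


def link_entities(text: str) -> List[Annotation]:
    """Run both steps and return annotations in order of appearance."""
    annotations = []
    for start, end in detect_mentions(text):
        kb_id = disambiguate(text, (start, end))
        if kb_id is not None:
            annotations.append(Annotation(start, end, text[start:end], kb_id))
    return annotations


text = "Die Weltgesundheitsorganisation und die NATO tagten in Genf."
annotations = link_entities(text)
```

Keeping the two steps behind separate functions mirrors the setup described above, where the boundary detection component can be swapped while the disambiguation component stays the same.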

We experimented with different MEL systems and selected three for evaluation:

  • IXA+MGENRE: we combined IXA, a transformer-based multilingual masked language model, for the entity boundary detection step with multilingual GENRE (mGENRE), the state-of-the-art method for entity disambiguation
  • MultiNERD+MGENRE – an alternative combination of the MultiNERD model for the entity boundary detection, again with mGENRE for disambiguation
  • BELA – an end-to-end MEL model, based on bi-encoder architecture

Comparison of different entity linking systems

To select the most suitable entity linking approach, we conducted a thorough comparison based on criteria unrelated to the quality of annotations generated by each system. Our analysis revealed that the choice of system depends on the specific use case. For commercial applications, BELA emerged as the exclusive option, while for non-commercial purposes, any of the three systems could be viable, contingent on the hardware availability.

| System | Hardware requirements (CPU/GPU) | License | Wikidata update | Speed of linking |
|---|---|---|---|---|
| IXA+MGENRE | CPU | CC BY-SA-NC 4.0 | yes | slow |
| MultiNERD+MGENRE | Both (on CPU practically not usable) | CC BY-SA-NC 4.0 | yes | very slow |
| BELA | GPU | MIT | work in progress | fast |
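The selection logic described above can be written down as a small filter. This is a sketch under assumptions: we read the non-commercial license clause as ruling out commercial use, treat MultiNERD+MGENRE as needing a GPU because it is practically unusable on CPU, and assume IXA+MGENRE can be used on any machine since it only needs a CPU. The function name `viable_systems` is hypothetical.

```python
from typing import List

# (name, usable without a GPU, license permits commercial use)
# Values follow the comparison table; see the assumptions in the text above.
SYSTEMS = [
    ("IXA+MGENRE", True, False),
    ("MultiNERD+MGENRE", False, False),
    ("BELA", False, True),
]


def viable_systems(commercial: bool, has_gpu: bool) -> List[str]:
    """Return the systems that fit the license and hardware constraints."""
    result = []
    for name, cpu_ok, commercial_ok in SYSTEMS:
        hardware_fits = cpu_ok or has_gpu
        license_fits = commercial_ok or not commercial
        if hardware_fits and license_fits:
            result.append(name)
    return result


commercial_choice = viable_systems(commercial=True, has_gpu=True)
research_choice = viable_systems(commercial=False, has_gpu=True)
```

Note that under these assumptions a commercial deployment without a GPU has no viable option among the three systems.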

To evaluate the systems’ annotation quality, we used two distinct approaches: English benchmark datasets and the multilingual dataset MultiNERD. Because the MultiNERD NER model is trained on that same dataset, evaluating MultiNERD+MGENRE on it would not be a fair test, so here we focus on IXA+MGENRE and BELA. Since the annotations in MultiNERD are incomplete, we prioritized recall as the pivotal metric. BELA achieved higher recall than IXA+MGENRE in every language, although IXA+MGENRE had higher precision in all languages except English.

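| Language | System | End-to-end EL Precision | End-to-end EL Recall | End-to-end EL F1 |
|---|---|---|---|---|
| German (de) | BELA | 0.61 | 0.89 | 0.72 |
| German (de) | IXA+MGENRE | 0.79 | 0.72 | 0.75 |
| English (en) | BELA | 0.72 | 0.83 | 0.77 |
| English (en) | IXA+MGENRE | 0.71 | 0.66 | 0.68 |
| Spanish (es) | BELA | 0.64 | 0.82 | 0.72 |
| Spanish (es) | IXA+MGENRE | 0.71 | 0.59 | 0.64 |
| Dutch (nl) | BELA | 0.61 | 0.84 | 0.71 |
| Dutch (nl) | IXA+MGENRE | 0.73 | 0.59 | 0.65 |
| Polish (pl) | BELA | 0.56 | 0.82 | 0.67 |
| Polish (pl) | IXA+MGENRE | 0.71 | 0.59 | 0.64 |
| Portuguese (pt) | BELA | 0.59 | 0.83 | 0.69 |
| Portuguese (pt) | IXA+MGENRE | 0.67 | 0.60 | 0.63 |
| Italian (it) | BELA | 0.66 | 0.85 | 0.74 |
| Italian (it) | IXA+MGENRE | 0.68 | 0.54 | 0.60 |
| French (fr) | BELA | 0.61 | 0.79 | 0.69 |
| French (fr) | IXA+MGENRE | 0.69 | 0.63 | 0.66 |
| Average | BELA | 0.62 | 0.83 | 0.71 |
| Average | IXA+MGENRE | 0.71 | 0.61 | 0.65 |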
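For readers who want to reproduce this kind of scoring, here is a minimal sketch of how precision, recall and F1 are typically computed for end-to-end entity linking. It assumes an annotation counts as correct only if both its span and its identifier match the gold annotation exactly; the exact matching rules used in our evaluation may differ, and the sample annotations are invented for illustration.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, Wikidata ID)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold: Set[Annotation], predicted: Set[Annotation]):
    """Score predictions against gold annotations using exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented example: the gold standard is missing one annotation (Q183,
# Germany) that the system found, so the system is penalized on precision.
gold = {(0, 4, "Q7184"), (20, 26, "Q7817")}
predicted = {(0, 4, "Q7184"), (20, 26, "Q7817"), (30, 41, "Q183")}
precision, recall, f1 = precision_recall_f1(gold, predicted)

# Consistency check against the German BELA row of the table.
german_bela_f1 = round(f1_score(0.61, 0.89), 2)
```

The example also shows why recall is the more trustworthy metric when the gold annotations are incomplete: a correct prediction that is missing from the gold standard lowers precision but leaves recall untouched.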

Shifting our attention to the English benchmark datasets, we considered all three multilingual entity linking systems alongside CEEL, Ontotext’s recently developed entity linking system for English. This allowed us to understand how the performance of MEL systems compares to that of English-only systems. The results showed that BELA outperformed IXA+MGENRE on all datasets, while the other three systems (CEEL, MultiNERD+MGENRE and BELA) exhibited comparable end-to-end results on the tested datasets.

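| Dataset | System | MD F1 | End-to-end EL F1 |
|---|---|---|---|
| TWEEKI Gold | CEEL | 0.87 | 0.60 |
| TWEEKI Gold | IXA+MGENRE | 0.72 | 0.57 |
| TWEEKI Gold | MultiNERD+MGENRE | 0.81 | 0.67 |
| TWEEKI Gold | BELA | 0.86 | 0.65 |
| KORE 50 | CEEL | 0.65 | 0.32 |
| KORE 50 | IXA+MGENRE | 0.34 | 0.24 |
| KORE 50 | MultiNERD+MGENRE | 0.47 | 0.35 |
| KORE 50 | BELA | 0.55 | 0.35 |
| AIDA | CEEL | 0.96 | 0.76 |
| AIDA | IXA+MGENRE | 0.65 | 0.53 |
| AIDA | MultiNERD+MGENRE | 0.74 | 0.66 |
| AIDA | BELA | 0.88 | 0.74 |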

The AIDA benchmark is the most directly relevant to the task of interlinking narratives across news articles. It uses the CoNLL dataset, which includes 1393 Thomson Reuters news articles with almost 35 thousand entity mentions. In contrast, the KORE 50 benchmark uses a much smaller set of 50 documents, intentionally selected to include complex disambiguation cases, while TWEEKI Gold uses a corpus of Tweets.

BELA and CEEL share a similar architecture, so it is not surprising that they demonstrate similar performance. Both models generate Wikidata identifiers at a relatively high speed, which makes them suitable for a wide range of applications. Still, there are several important differences between the two models:

  • BELA is multilingual, while CEEL works only for English texts
  • BELA is limited only to the 7 million entities covered in Wikipedia
  • CEEL covers about 40 million people, organizations, and locations, but does not link entities of other types
  • BELA needs a GPU, while a CPU is sufficient for CEEL

Concepts in debunking articles

So what results does the system produce when applied to real world data? We ran it over a collection of over 100 thousand disinformation claims and journalistic articles debunking them. The data is strongly multilingual with articles in over 50 languages with 30 of them having at least 1000 articles. Over this data the system discovered over 5 million mentions of 280 thousand unique Wikidata concepts within the texts.

How can we use these annotations to search and analyze our large collection? We picked “NATO” as an example of a large international organization that is discussed in many languages and has very different names and acronyms in these languages.

The chart above shows the number of debunking articles mentioning NATO over the last two years broken down by language. Unsurprisingly, English has a strong lead, since about 50% of the content is in English, but there is a long tail of the concept being discussed in many European languages. What other concepts are mentioned in connection with NATO might be even more interesting than just the language of the content.
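A breakdown like this is straightforward once every article carries language-independent identifiers. The sketch below assumes a hypothetical record layout (a language code plus a set of linked Wikidata identifiers per article) and invented sample data; it is not the schema of our actual collection. Q7184 is the Wikidata identifier of NATO, Q212 of Ukraine and Q159 of Russia.

```python
from collections import Counter

NATO = "Q7184"

# Invented sample records; the field names are assumptions for illustration.
articles = [
    {"id": "a1", "lang": "en", "concepts": {"Q7184", "Q212"}},
    {"id": "a2", "lang": "en", "concepts": {"Q7184", "Q159"}},
    {"id": "a3", "lang": "de", "concepts": {"Q7184"}},
    {"id": "a4", "lang": "pl", "concepts": {"Q212"}},
]


def language_breakdown(records, qid: str) -> Counter:
    """Count the articles mentioning a concept, grouped by language."""
    return Counter(rec["lang"] for rec in records if qid in rec["concepts"])


nato_by_language = language_breakdown(articles, NATO)
```

Because the filter works on the identifier rather than on a name, no list of language-specific names or acronyms for NATO is needed at query time.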

The diagram above shows the results of applying a filter so we get the most commonly discussed person concepts within articles mentioning NATO in the last two years. As expected, these are the important political and military figures most strongly related to the conflict in Ukraine.
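Such a filter amounts to counting co-occurrences. The following sketch assumes each article stores its linked concepts together with a coarse entity type; the record layout, the function name `top_cooccurring`, and the placeholder identifiers PERSON_A and PERSON_B are all hypothetical and stand in for real Wikidata identifiers.

```python
from collections import Counter

NATO = "Q7184"  # Wikidata identifier of NATO

# Invented sample data: concept identifier -> coarse entity type.
# PERSON_A and PERSON_B are placeholders, not real Wikidata identifiers.
articles = [
    {"lang": "en", "concepts": {NATO: "organization",
                                "PERSON_A": "person",
                                "PERSON_B": "person"}},
    {"lang": "de", "concepts": {NATO: "organization",
                                "PERSON_A": "person"}},
    {"lang": "pl", "concepts": {"PERSON_B": "person"}},
]


def top_cooccurring(records, anchor: str, entity_type: str, n: int = 10):
    """Most frequent concepts of one type in articles mentioning the anchor."""
    counts = Counter()
    for record in records:
        concepts = record["concepts"]
        if anchor not in concepts:
            continue  # skip articles that do not mention the anchor concept
        counts.update(
            qid for qid, kind in concepts.items()
            if kind == entity_type and qid != anchor
        )
    return counts.most_common(n)


people_with_nato = top_cooccurring(articles, NATO, "person")
```

Each concept is counted once per article, so the result reflects the number of articles in which a person appears alongside the anchor rather than the raw number of mentions.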

This is a relatively simple analysis but it can be built upon to gain more sophisticated insight such as discovering specific documents on a very specific subject, identifying trends in co-occurring concepts, or being alerted when a known topic re-emerges or starts being discussed in a new language.

Other use cases

Within the realm of Business Intelligence, MEL can help organizations identify and swiftly analyze market trends, assess performance, and make informed decisions using multilingual data with clear connections between concepts in different languages. This capability becomes particularly important when dealing with large-scale datasets.

In the media monitoring field, MEL can be employed to enrich already extracted content to enhance discoverability by linking concepts in a variety of languages, regardless of whether the person behind the screen speaks them.

In academic research, MEL makes it possible to align and cross-reference diverse content in many languages. Because entity linking resolves mentions to precise identifiers, it saves researchers a lot of time.

To wrap it up

As we continue to traverse our interconnected world, systems for MEL will become increasingly helpful. They establish meaningful connections between terms in different languages by linking them to a common knowledge graph, such as Wikidata. These connections can themselves be stored in a knowledge graph, an indispensable tool for unlocking accurate insights.

Ontotext’s experience in applying MEL models for the needs of fact-checking professionals proves that such an approach enables the seamless integration and navigation of linguistically diverse data. It connects entities between languages, thus enabling organizations to identify and decipher global narratives. As a result of these experiments, Ontotext gathered experience and developed the necessary proprietary AI models to address fact-checking needs. This will also allow us to choose the most efficient models depending on the needs for other MEL solutions and based on the type of text, the languages, and the domains that have to be addressed.  

 

Ontotext’s work and experiments with MEL models have been carried out as part of the vera.ai project.

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No:101056973. Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Ivelina Bozhinova

Data Scientist at Ontotext

A software developer and data scientist with an interest in combining classical software development with innovative software projects focusing on artificial intelligence.

Andrey Tagarev

Researcher at Ontotext

Andrey Tagarev has an MSc degree in Computing Specialism (Machine Learning) from Imperial College London. He joined Ontotext in 2015 and has since been working on the development of machine learning algorithms for document classification, sentiment analysis, rumor detection and claim identification. Andrey also specializes in developing pipelines for natural language processing of documents in several European languages, including English, French and Dutch.

Eneya Georgieva

Product Manager at Ontotext

With her experience as an editor of a business magazine and financial journalist, Eneya Georgieva has a deep understanding of the scope of disinformation campaigns. She is passionate about tackling this global issue and works tirelessly with her team to leverage the diverse set of tools and techniques available to create greater awareness and understanding of the topic.