Read about how machines can be of great help with many tasks where fast and error-free computation over large amounts of data is required.
[Image copyright: International Tracing Service (ITS)]
The Holocaust was a bleak time in our history. It affected millions of people – not only victims and survivors, but also their relatives and friends. Around the world, many are still sifting through countless records, trying to learn the fate of their loved ones who had disappeared during the Holocaust. And they are not the only ones. The research of archivists and historians of that period is also based predominantly on what can be found in various archives and private collections of Holocaust documentation.
When exploring the vast sea of digital data related to that time, one of the main problems people struggle with is that they are not able to obtain all the documents related to a given person. The reasons can be many. The documents may be located in different databases across the globe. Some of them may be kept secret and accessible only upon authorized request. Or they may be available only on paper or as scanned images, which makes them more difficult for researchers to analyse.
To help with this, the EU has funded the European Holocaust Research Infrastructure (EHRI) project. By building a digital infrastructure and facilitating human networks, the project aims to support the wider Holocaust research community. Its goal is to provide online access to information from dispersed sources relating to the Holocaust. It also offers different tools that enable researchers and archivists to work collaboratively with such sources.
But even if all historical records were easily accessible online, that still wouldn’t be the end of the struggle, mainly because what is obvious to us humans is often incomprehensible to machines. Search engines, for example, are not always able to “tell” whether different entries relate to the same person or not.
As a key technology partner in the EHRI consortium, Ontotext has been exploring the USHMM survivors and victims database, which (as of June 2015) contained over 3 200 000 person records collected over time from heterogeneous sources.
Let’s take one example: the person record of Zoltan Grun. It contains data about him such as his place and date of birth, information about his mother, his place and date of death, etc. So far, so good.
The problem is that for some people in the database, there is more than one such record. As records usually come from different sources (lists of arrests, lists of convoys, etc.), different records may contain different information. So, even when two records relate to the same person, they are not linked to each other. In the case of Zoltan Grun, there are the following two records:
[Record images: Zoltan Grun (1) and Zoltan Grun (2)]
For a human, the similarities in these two records may be enough to conclude that they describe the same person and need to be linked. However, for a machine algorithm, this is not so easy.
[Image copyright: International Tracing Service (ITS)]
Since one of the most common tasks in Holocaust research is to explore persons’ data, it is important to find an automated way to link these records and improve the discovery of this data. We call this task record linking or record deduplication. The goal of this task is to have a single record that represents each person and that contains all references to the different records of this person available from the disparate sources.
Let’s take a closer look at how this works.
Each person record features more than 300 properties such as name, date of birth, date of death, mother’s name, etc. As you can imagine, many of them contain various transliterations and other variations, which impedes the matching of records. For example, consider the name “Schmil Zelinsky” in one record and “Schmul Zelinsky” in another.
Just like a human, a machine algorithm would classify two records as related to the same person or not by comparing the corresponding properties in each of the records. But, unlike a human, an algorithm has much more difficulty deciding whether the two Schmil Zelinsky/Schmul Zelinsky records are related to the same person.
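For illustration, here is a minimal sketch of how close those two spellings are according to a generic string-similarity measure (Python’s standard difflib module; the threshold of what counts as “close” and the measure itself are assumptions for the example, not the project’s actual scoring):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# "Schmil Zelinsky" and "Schmul Zelinsky" differ by a single character,
# so the ratio is high -- but high similarity alone does not prove identity.
print(name_similarity("Schmil Zelinsky", "Schmul Zelinsky"))   # ~0.93
print(name_similarity("Schmil Zelinsky", "Samuel Zielinski"))  # noticeably lower
```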
To help the machine find that out, we have added some heuristics. In other words, we have selected the properties that occur most often together (name, date of birth, date of death, place of birth, place of death, gender, occupation and nationality) to be used by our algorithm when deciding whether two records are duplicates or not.
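A rough sketch of what such a pairwise comparison could look like is given below. The property names, the field values and the helper functions are invented for the example and are not the actual EHRI schema:

```python
from difflib import SequenceMatcher

# Hypothetical record structure; the real data model has 300+ properties,
# and the values below are made up purely for illustration.
record_a = {"name": "Zoltan Grun", "birth_date": "1912-03-05",
            "birth_place": "Munkacs", "gender": "male"}
record_b = {"name": "Zoltan Grün", "birth_date": "1912-03-05",
            "birth_place": "Mukacevo", "gender": None}

def text_sim(a, b):
    """Similarity of two optional string fields, 0.0 if either is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(a, b):
    """Turn a pair of records into a vector of per-property similarities."""
    return [
        text_sim(a.get("name"), b.get("name")),
        1.0 if a.get("birth_date") == b.get("birth_date") else 0.0,
        text_sim(a.get("birth_place"), b.get("birth_place")),
        1.0 if a.get("gender") and a.get("gender") == b.get("gender") else 0.0,
    ]

print(pair_features(record_a, record_b))
```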
The first step in this process was to transform the names in the records into their normalized values. This involved procedures that normalized the orthography and the encoding and removed unnecessary symbols.
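The exact rules are specific to the data, but a minimal sketch of this kind of name normalization (assuming lowercasing, stripping diacritics and removing stray symbols) could look like this:

```python
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Normalize a name: fix encoding, strip diacritics and stray symbols."""
    # Decompose accented characters and drop the combining marks
    # (e.g. "Grün" becomes "Grun").
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove anything that is not a letter, space or hyphen,
    # then collapse repeated whitespace.
    cleaned = re.sub(r"[^A-Za-z\- ]+", " ", ascii_only)
    return re.sub(r"\s+", " ", cleaned).strip().lower()

print(normalize_name("  Grün,  Zoltán* "))  # -> "grun zoltan"
```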
Another property that we adjusted was gender. We trained a statistical model for automatically predicting a person’s gender based on name suffixes, the person’s first name, etc. We also used rules that helped us infer the gender, such as: “if another person with the same name has the gender property filled in, we can assign this gender to a person with the same name and a missing gender property”.
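As an illustration of the rule-based part only, here is a minimal sketch of the “borrow the gender from an identical first name” rule; the record structure is an assumption made for the example:

```python
from collections import defaultdict

def fill_missing_genders(records):
    """Assign a gender to records that lack one, based on other records
    with the same (normalized) first name that do have it."""
    # Collect the genders observed for each first name.
    seen = defaultdict(set)
    for r in records:
        if r.get("gender"):
            seen[r["first_name"].lower()].add(r["gender"])
    for r in records:
        if not r.get("gender"):
            candidates = seen.get(r["first_name"].lower(), set())
            # Only assign a gender when the evidence is unambiguous.
            if len(candidates) == 1:
                r["gender"] = next(iter(candidates))
    return records

records = [
    {"first_name": "Zoltan", "gender": "male"},
    {"first_name": "Zoltan", "gender": None},  # gets "male" filled in
]
print(fill_missing_genders(records))
```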
We also normalized the locations (place of birth, place of death, place of arrest, etc.) by resolving them to GeoNames locations. In this way, we made it easier to compute the similarity between locations without being impeded by the lexical differences in the records.
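A minimal sketch of such a resolution step is shown below, with a tiny hand-made gazetteer and placeholder identifiers standing in for the real GeoNames lookup:

```python
# A toy gazetteer mapping place-name variants to identifiers.
# The ids are placeholders, not real GeoNames ids; in practice the
# lookup would go against the GeoNames database itself.
GAZETTEER = {
    "munkacs": 1001,
    "mukacevo": 1001,   # same town, different spelling -> same id
    "budapest": 1002,
}

def resolve_place(name):
    """Return the gazetteer id for a place name, or None if unknown."""
    return GAZETTEER.get(name.strip().lower())

def same_place(a, b):
    """Two place strings count as equal if they resolve to the same id."""
    ga, gb = resolve_place(a), resolve_place(b)
    return ga is not None and ga == gb

print(same_place("Munkacs", "Mukacevo"))  # True despite the lexical difference
```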
Just like children, machines learn from examples. When we provide enough positive and negative examples of a phenomenon, a machine learning algorithm learns to “recognize” whether a new example is positive or negative.
For our task, we collected over 1500 pairs of records, which were reviewed by experts and classified into three categories: positive (duplicates), negative (different persons) and uncertain (not enough information). We supplied this “ground truth” data to the algorithm and let it “learn”.
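Any standard supervised classifier can play this role. As a purely illustrative sketch (scikit-learn is assumed here only for convenience, and the feature vectors and labels are made up), the training and prediction steps could look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Each row is the per-property similarity vector for one expert-reviewed pair
# (see pair_features above); the labels are the expert decisions.
X_train = [
    [0.95, 1.0, 0.90, 1.0],   # very similar pair
    [0.40, 0.0, 0.20, 0.0],   # clearly different persons
    [0.80, 0.0, 0.60, 1.0],   # borderline case
]
y_train = ["positive", "negative", "uncertain"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classify a new, unseen pair of records.
new_pair = [[0.93, 1.0, 0.70, 1.0]]
print(model.predict(new_pair))  # e.g. ['positive']
```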
Once we had classified the pairs as positive, negative or uncertain, we clustered them so that each cluster represented a unique person. Some clusters consisted of only one record, while others contained many records related to the same person.
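One straightforward way to turn pairwise “duplicate” decisions into such clusters is to treat positive pairs as edges of a graph and take its connected components. Here is a minimal sketch, assuming records are identified by simple ids:

```python
from collections import defaultdict

def cluster_records(record_ids, positive_pairs):
    """Group record ids into clusters: two records end up in the same
    cluster if they are connected through a chain of positive pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):                  # find the root of x's cluster
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):              # merge the clusters of a and b
        parent[find(a)] = find(b)

    for a, b in positive_pairs:
        union(a, b)

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    return list(clusters.values())

# Records 1, 2 and 3 describe the same person; record 4 is someone else.
print(cluster_records([1, 2, 3, 4], [(1, 2), (2, 3)]))  # [[1, 2, 3], [4]]
```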
Here is an example of a cluster of three records where the family name in the first record is different from that in the other two, but sounds suspiciously close. The birth date in that record is also very close to the others.
[Image copyright: International Tracing Service (ITS)]
Although we still need to do a thorough evaluation of all clusters, to perform error analysis and, if necessary, to make some adjustments to the clustering procedures, the results are very promising.
Our experiment showed that the task of record deduplication is feasible and could be performed with high accuracy. With the help of such an algorithm, trained to replicate human decisions, interlinking the Holocaust records of over 250 000 people can be significantly sped up. This, in turn, would greatly improve the accurate retrieval of documents.
With the success of this task, the EU-supported EHRI project is one step closer to overcoming the main challenge of Holocaust research – the wide dispersal of the archival source material across the globe. What’s more, by supporting the work of an extensive network of researchers, archivists and private individuals, EHRI is able to initiate new transnational and collaborative approaches to the study of the Holocaust.
Ready to discover more about how to face your data deduplication challenges with Ontotext’s Semantic Data Modeling?