Learn how more businesses use Open Data and data analytics to gain insights, predict trends and make data-driven decisions.
On March 14, 2018, Atanas Kiryakov, CEO of Ontotext, presented Graph Analytics on Company Data and News. This webinar, now available on-demand, highlighted some of the challenges in analyzing diverse data from multiple sources. Atanas demonstrated how graph analytics on top of Ontotext’s Knowledge Graph was able to provide entity awareness about People, Organizations and Locations (POL), which is part of the solution that Ontotext provides to overcome such challenges.
Ontotext’s Knowledge Graph (loaded with about 2 billion triples in Ontotext’s GraphDB) combines several open data sources. It is mapped to the FIBO ontology and its entities are interlinked to 1 million news articles.
During the webinar, Atanas showed the power of cognitive graph analytics to create links between various datasets and lead to knowledge discovery.
Some really interesting technical questions were raised by the audience about the workings of FactForge (Ontotext’s public service for free access to POL data) and some of the intricacies behind knowledge graph analytics:
Q: Can you please say a bit more about FactForge? Is FactForge a proprietary product of Ontotext? Is it a free or a paid-for service product?
A: Yes, FactForge is a proprietary product of Ontotext. It is a public demonstrator that everyone can use for free. Like every public demonstrator, it has some limitations. In this case, the limitations are on the number of requests you can make per second and the size of the results that you can get from each of the queries.
FactForge is also commercially available and you can pay for access to it through our cognitive cloud services. So, yes, you can also get the production version.
Also, we often use selected pieces of FactForge in commercial projects. FactForge provides open data and free news content. When we use it in a specific project, it is often complemented with commercial sources of company data or other data, needed to guarantee the minimum requirements for coverage and quality.
Q: Is there a comprehensive list of sources that FactForge links to?
A: Yes, you can see them on the About page of FactForge.
Q: How do you update the knowledge graph in FactForge?
A: We have an automated procedure for ingestion of the updated versions of the different datasets that we use such as DBpedia or GeoNames. But what is more interesting is that we have a constant feed of news that we use for picking up new entities and new relationships. This data provides hints about how we can enrich the knowledge graph. So, it is regularly updated through loading newer versions of all these datasets and through the information we derive from news.
Q: Can FactForge extract new relations from texts?
A: The short answer is, yes.
When you do this, you end up with plenty of candidate relationships for which you have relatively low confidence – think of 70%. So, you will need to figure out which of these relations are to be trusted. You will also need to decide how candidate relationships are consolidated, because one and the same real-world relationship is often expressed in text with plenty of variations, and above what level of importance you want to make them “first-class citizens” in the knowledge graph. It requires a bit of filtering, but, yes, we do extract relationships from texts.
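The filtering and consolidation described above can be sketched as follows. This is a minimal illustration, not Ontotext's actual pipeline: the candidate relations, field layout and the 0.8 confidence threshold are all assumptions for the example.

```python
from collections import defaultdict

# Hypothetical candidate relations extracted from text:
# (subject, predicate, object, confidence). Entity names are made up.
candidates = [
    ("Acme Corp", "acquiredBy", "Globex", 0.72),
    ("Acme Corp", "acquiredBy", "Globex", 0.91),  # same fact, stated in another sentence
    ("Acme Corp", "basedIn",    "Paris",  0.65),
    ("Globex",    "ceo",        "J. Doe", 0.88),
]

def consolidate(candidates, threshold=0.8):
    """Merge duplicate candidate relations, keeping the best confidence seen,
    then promote only those that clear the threshold to 'first-class citizens'."""
    best = defaultdict(float)
    for subj, pred, obj, conf in candidates:
        key = (subj, pred, obj)
        best[key] = max(best[key], conf)
    return {k: v for k, v in best.items() if v >= threshold}

trusted = consolidate(candidates)
# The duplicated acquisition fact survives via its 0.91 sighting;
# the low-confidence 'basedIn' candidate is filtered out.
```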
Q: Which countries are covered in FactForge and what are the news source language distributions?
A: In this demonstrator, we have news with global coverage. But these are just news in English. We have deployed this technology also for Dutch, German, Russian, Bulgarian. For quite a number of languages, actually. But the demonstrator itself is just in English.
Q: Is it possible to train based on a corpus instead of Linked Open Data?
A: Yes, we experiment with combining word embedding techniques with knowledge graph analytics. So, yes, you can train on a corpus and combine modern text-based analytic techniques with Linked Open Data.
Q: How do you decide which value is the identifier when disambiguating entities?
A: Wow. That’s the secret sauce. Basically, we use all the information that we have about the entities in the knowledge graph. For instance, if we have “Paris” as a string in the text, first we check which of the candidates such as Paris in France, or Paris in Texas, or Paris Hilton, is compatible with this reference as a type. Most of the time, the document context “tells you” whether it is a person, an organization or a location.
If the document context helps us figure out that this is a location, then it cannot be Paris Hilton. After that, we have to figure out whether it is Paris in France, or Paris in Texas, or some other Paris. We do this by comparing the semantic fingerprint of the document with the semantic fingerprint of each of the candidates. What we call a semantic fingerprint is a sort of profile that represents context; one can also call it embedding. And if this doesn’t help, we will consider popularity, importance, etc. So, these are essentially the methods that we have in our arsenal for disambiguation. For specific projects and tasks, we use different combinations.
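The two-step disambiguation described above (type compatibility first, then semantic fingerprints, with popularity as a fallback) can be sketched like this. The candidates, their toy fingerprint vectors and popularity scores are illustrative assumptions, not data from FactForge.

```python
import math

# Hypothetical candidates for the surface form "Paris".
candidates = [
    {"id": "Paris_France", "type": "Location", "fingerprint": [0.9, 0.1, 0.0], "popularity": 0.95},
    {"id": "Paris_Texas",  "type": "Location", "fingerprint": [0.2, 0.8, 0.1], "popularity": 0.30},
    {"id": "Paris_Hilton", "type": "Person",   "fingerprint": [0.1, 0.2, 0.9], "popularity": 0.60},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(mention_type, doc_fingerprint, candidates):
    # Step 1: keep only candidates compatible with the type the context implies.
    typed = [c for c in candidates if c["type"] == mention_type]
    # Step 2: rank by fingerprint similarity; fall back to popularity on ties.
    return max(typed, key=lambda c: (cosine(doc_fingerprint, c["fingerprint"]), c["popularity"]))

# A document whose fingerprint sits close to Paris, France:
best = disambiguate("Location", [0.85, 0.15, 0.05], candidates)
```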
Q: How is the relevance ratio calculated?
A: The baseline is similar to TF-IDF – term frequency-inverse document frequency. One can also augment the relevance score by taking into consideration the similarity between the document fingerprint and the fingerprint of the entity. In other words, this compares the information that we have in the knowledge graph about an entity with the information about it in the document. In this way, relevance is based not just on string statistics, but also on deeper knowledge about concepts and entities, which also helps with personalized recommendations.
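A sketch of that scheme: a plain TF-IDF baseline, plus a blended-in fingerprint similarity term. The toy corpus and the 0.5 blending weight are assumptions for illustration, not the actual formula used in FactForge.

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    ["bank", "loan", "interest"],
    ["bank", "river", "water"],
    ["loan", "credit", "bank"],
]

def tf_idf(term, doc, docs):
    """Classic TF-IDF: term frequency in the document times
    log inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

def relevance(term, doc, docs, fingerprint_sim=0.0, weight=0.5):
    """Baseline score augmented with the similarity between the
    document fingerprint and the entity fingerprint (assumed weight)."""
    return tf_idf(term, doc, docs) + weight * fingerprint_sim
```

Note that a term occurring in every document (here, "bank") gets a zero TF-IDF baseline, so for such entities the fingerprint similarity carries the whole relevance signal.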
Q: How do you fuse two entities that are the same?
A: Essentially, to find the right match for a company from one source to a company in another source, first we have to find the likely candidates in the second source. This is called pre-selection. Then we evaluate each of these candidates and score them. For example, if one of the companies is registered in the US and the other is registered in Italy, this is strong evidence that they are not the same entity. If one of them is in the phone industry and the other is a bank, again, they are probably not the same entity.
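The pre-selection plus candidate-scoring flow described above can be sketched as a tiny record-linkage example. The company records, field names and scoring weights are all hypothetical; a production matcher would use far richer evidence.

```python
# Hypothetical company records from two data sources.
source_a = {"name": "Acme Corporation", "country": "US", "industry": "telecom"}
source_b = [
    {"name": "Acme Corp.",   "country": "US", "industry": "telecom"},
    {"name": "Acme Corp.",   "country": "IT", "industry": "banking"},
    {"name": "Globex Corp.", "country": "US", "industry": "telecom"},
]

def preselect(record, candidates):
    """Pre-selection: cheap blocking on the leading name token,
    so only likely candidates reach the expensive scoring step."""
    first = record["name"].split()[0].lower()
    return [c for c in candidates if c["name"].lower().startswith(first)]

def score(a, b):
    """Score a candidate pair. Mismatched registration country or industry
    is strong negative evidence, as in the answer above; weights are assumed."""
    s = 0.0
    s += 1.0 if a["country"] == b["country"] else -2.0
    s += 1.0 if a["industry"] == b["industry"] else -2.0
    return s

shortlist = preselect(source_a, source_b)          # the two "Acme" records
best = max(shortlist, key=lambda c: score(source_a, c))
```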
The slides from this presentation are available on SlideShare and a recording of the presentation is available on demand.
Want to learn more about knowledge graphs and smart data analytics?
Ontotext’s GraphDB: Give it a try today!