InnoGraph will build a holistic knowledge graph of innovation based on Artificial Intelligence (AI), and more generally of the global “hitech” ecosystem. It is a key use case of the Horizon Europe research project enrichMyData.
InnoGraph originated from a partnership between OECD and the Jožef Stefan Institute (JSI) on the OECD.AI Policy Observatory, and prior Ontotext experience with Science KGs, e.g., the Tracking of Research Results project. Unlike OECD.AI, here we want to track AI elements not just at the summary level, but also at the individual level.
InnoGraph aims to comprehensively cover all elements of AI. As the first key step, we have built a comprehensive taxonomy of topics: AI technical topics and application areas (verticals). This post describes our approach to developing such a taxonomy by integrating and coreferencing data from numerous sources.
We have often used the pearl growing (snowballing) approach, which can be summarized as follows: start from a seed set of known relevant items, extract the topics and links that characterize them, use those to find further relevant items, and repeat until few new items appear. We also coreference topics across datasets, which lets us leverage the links between datasets.
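This loop can be sketched as follows; `get_topics` and `find_items` are hypothetical stand-ins for dataset-specific lookups (e.g., repo topics on GitHub), not actual InnoGraph code:

```python
# A minimal sketch of the pearl-growing / snowballing loop. The two
# callables are hypothetical: get_topics(item) returns an item's topics,
# find_items(topic) returns items tagged with that topic.
def snowball(seed_items, get_topics, find_items, max_rounds=3):
    items, topics = set(seed_items), set()
    for _ in range(max_rounds):
        # topics of all current items that we have not yet expanded
        new_topics = {t for it in items for t in get_topics(it)} - topics
        if not new_topics:
            break  # converged: nothing new to expand
        topics |= new_topics
        # pull in every item tagged with one of the new topics
        items |= {it for t in new_topics for it in find_items(t)}
    return items, topics
```

The round cap matters in practice: on real data each round drifts further from the seed, so later rounds need stricter filtering.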
As an illustration of this approach, consider the following problem. As of the end of 2022, GitHub had 94 million developers, 4 million organizations, and 330 million repositories. How can we find as many AI-relevant repositories on GitHub as possible? We can start with a connecting dataset like LinkedPapersWithCode, which includes only ML papers and related entities; a SPARQL query over it shows the dataset's overall statistics.
We can start with these repositories (most of them are on GitHub) and get all their topics.
For example, the famous paper Attention Is All You Need that introduced Transformers links to 559 source repos (implementations).
One of these repos has the topics: machine-learning, reinforcement-learning, deep-learning, machine-translation, tpu. Another has: nlp, search-engine, compression, sentiment-analysis, transformers, information-extraction, question-answering, llama, pretrained-models, embedding, bert, semantic-analysis, distributed-training, ernie, neural-search, uie, document-intelligence, paddlenlp, llm. Here we need to filter out non-AI topics (search-engine, compression), and we find some more recent topics (llm, llama) as well as very specialized topics that pertain only to this library.
And so the snowball rolls!
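The filtering step can be sketched with plain set operations; the topic sets below are illustrative toys, not our actual curated lists:

```python
# Illustrative triage of repo topics against a curated taxonomy and a
# blocklist of generic non-AI topics; all three sets here are toy data.
repo_topics = {"nlp", "search-engine", "compression", "bert", "llm", "llama"}
known_ai = {"nlp", "bert", "machine-learning"}       # already in the taxonomy
non_ai = {"search-engine", "compression", "python"}  # curated blocklist

# Topics that are neither known-AI nor blocked are candidates for review.
candidates = repo_topics - known_ai - non_ai
```

On each snowball round, reviewed candidates move into one of the two curated sets, so the candidate pile shrinks over time.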
We use Wikipedia articles (and categories) as the core of our AI taxonomy. Every important topic related to AI (old, new, emerging, popular, etc.) will most likely have an article on Wikipedia that can serve as a starting point.
Wikipedia categories are used to classify articles, and to form a hierarchy. They are the result of a collective curation effort to create a taxonomy of Wikipedia articles.
We use categories as a way of finding relevant articles. Consider the category tree of Artificial Intelligence: it shows the first level and lets you expand each node; for each category it shows the number of sub-categories (C) and articles (P).
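Finding articles via categories can be sketched as a bounded breadth-first expansion of the category tree. Here `subcats` and `pages` are hypothetical stand-ins for Wikipedia category data; the visited set and depth cap guard against the cycles and topic drift that real Wikipedia categories exhibit:

```python
from collections import deque

# Sketch: collect candidate articles by expanding a category tree
# breadth-first up to max_depth. subcats maps a category to its
# sub-categories, pages maps a category to its articles (toy data).
def collect_articles(root, subcats, pages, max_depth=2):
    seen, articles = {root}, set()
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        articles |= pages.get(cat, set())
        if depth < max_depth:
            for sub in subcats.get(cat, ()):
                if sub not in seen:  # category graph has cycles
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return articles
```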
The following figure shows the interactive tool we built to curate the taxonomy.
As a result, we collected 15k AI and hitech topics in a hierarchy 9 levels deep. The taxonomy is quite good but not yet perfect, since we did the filtering at the level of pages rather than categories, and built the tree based on shortest paths to the root.
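The "shortest paths to the root" step can be sketched as a breadth-first traversal over the (multi-parent, possibly cyclic) category graph, where each category keeps the parent through which it was first reached; the category names below are illustrative:

```python
from collections import deque

# Sketch: turn a multi-parent category graph into a tree by keeping, for
# each category, the parent that lies on a shortest path from the root.
def shortest_path_tree(root, children):
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in parent:  # first (shortest) path wins
                parent[child] = node
                queue.append(child)
    return parent
```

Because BFS reaches every node first along a minimum-length path, each category ends up attached as close to the root as the graph allows.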
The Cooperative Patent Classification (CPC) is a patent classification system that was jointly developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO). The CPC is substantially based on the previous European classification system (ECLA), which itself was a more specific and detailed version of the International Patent Classification (IPC).
CPC is important because it signals clear commercial applicability. On the other hand, patent applications and grants lag the newest industry trends by several years. For example, while there are hundreds of thousands of AI-related patents, a search for “OpenAI” on Lens.org shows only about 750 patents, and none are owned by OpenAI.
The following image shows counts of patents for a particular CPC Class.
The CPC version 2023.02 has 260,491 topics and is 15 levels deep, going to extreme levels of detail. For example, consider the classification term A01B33/022, “tilling implements with rotary driven tools; with tools on horizontal shaft transverse to direction of travel; with rigid tools; with helicoidal tools”. It can be decoded level by level, from the one-letter section down to the full subgroup.
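Decoding a symbol into its levels can be sketched from the documented CPC symbol format (section letter, two-digit class, subclass letter, main group, subgroup after the slash); the regex below is a simplification for illustration, not an official parser:

```python
import re

# Sketch of a CPC symbol decoder. Sections are A-H plus Y; the class is
# two digits, the subclass one letter, then an optional main group and
# an optional /subgroup. Simplified relative to the official format.
CPC_RE = re.compile(r"([A-HY])(\d{2})([A-Z])(?:(\d{1,4})(?:/(\d{2,6}))?)?")

def cpc_levels(symbol: str):
    """'A01B33/022' -> ['A', 'A01', 'A01B', 'A01B33', 'A01B33/022']"""
    m = CPC_RE.fullmatch(symbol)
    if not m:
        raise ValueError(f"not a recognized CPC symbol: {symbol}")
    sec, cls, sub, grp, sgrp = m.groups()
    levels = [sec, sec + cls, sec + cls + sub]
    if grp:
        levels.append(levels[-1] + grp)
        if sgrp:
            levels.append(levels[-1] + "/" + sgrp)
    return levels
```

The same decoder handles subclass-level symbols such as G06N (the subclass covering AI-style computing arrangements).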
CPC includes many other topics that are related to AI. We discovered them using several techniques, one of which was keyword search over topic titles for terms such as machine learning, ml, neural networks, ann, rnn, and many others.
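One pitfall of such keyword search is that short abbreviations like ml or ann match inside unrelated words. A hedged sketch using word boundaries (the keyword list here is illustrative, not our full list):

```python
import re

# Illustrative keyword matcher for CPC topic titles. Word boundaries
# keep short abbreviations like "ml" or "ann" from matching inside
# words such as "html" or "annual".
AI_KEYWORDS = ["machine learning", "neural networks", "ml", "ann", "rnn"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in AI_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def mentions_ai(title: str) -> bool:
    """True if the title contains any AI keyword as a whole word."""
    return PATTERN.search(title) is not None
```

Matches found this way still need manual review, since even bounded abbreviations produce false positives in patent prose.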
In total, we selected 208 CPC AI root topics, which expanded to a total of 2476 CPC AI topics. We are still working to match CPC AI root topics to the core Wikipedia-based taxonomy that we extracted.
Many more datasets are being considered for inclusion in the taxonomy tree.
In conclusion, we emphasize the significance of a meticulously curated AI taxonomy: it will help us select the content we’ll use in the graph, with the ultimate goal of guiding investment decisions, research policy, and further AI research. The taxonomy integrates various data sources, offering a holistic view of AI innovation. Future work includes integrating more topical datasets, leveraging topics for data collection, and applying snowballing methodologies to enhance the InnoGraph Knowledge Graph.
InnoGraph has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070284. Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.