The InnoGraph Artificial Intelligence Taxonomy

A Key to Unlocking AI-Related Entities and Content

December 15, 2023 · 8 min read · Vladimir Alexiev, Boyan Bechev, Aleksandr Ositsyn

Introduction

InnoGraph will build a holistic knowledge graph of innovation based on Artificial Intelligence (AI), and more generally of the global “hitech” ecosystem. It is a key use case of the Horizon Europe research project enrichMyData.

InnoGraph originated from a partnership between OECD and the Jožef Stefan Institute (JSI) on the OECD.AI Policy Observatory, and prior Ontotext experience with Science KGs, e.g., the Tracking of Research Results project. Unlike OECD.AI, here we want to track AI elements not just at the summary level, but also at individual level.

InnoGraph aims to comprehensively cover all elements of AI. As the first key step, we have built a comprehensive taxonomy of topics: AI technical topics and application areas (verticals). This post describes our approach to developing such a taxonomy by integrating and coreferencing data from numerous sources.

Example: GitHub Topics

We have often used the pearl growing or snowballing approach, which can be summarized as follows:

  • Start from some prominent artifact or piece of content – this first step can be accelerated by starting from all works listed in a relevant venue
  • Extract the topics assigned to that content in its dataset
  • Find more content using the same topics
  • Find the most relevant topics (e.g., by co-occurrence analysis) and repeat

We also coreference topics across datasets to leverage links between datasets. 
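The snowballing loop above can be sketched in a few lines of Python. The toy corpus, function name, and relevance heuristic (plain topic frequency) are illustrative, not our actual pipeline:

```python
# A minimal sketch of the pearl-growing / snowballing loop over a toy
# in-memory corpus. CORPUS maps each content item to its topics; all
# names and data here are illustrative.
from collections import Counter

CORPUS = {
    "paper-A": {"machine-learning", "nlp"},
    "paper-B": {"nlp", "transformers"},
    "paper-C": {"transformers", "llm"},
    "paper-D": {"databases"},
}

def pearl_growing(seed_items, rounds=2, top_k=2):
    """Alternate between collecting topics and collecting content."""
    items, topics = set(seed_items), set()
    for _ in range(rounds):
        # Step 1: extract the topics of the current content
        counts = Counter(t for item in items for t in CORPUS[item])
        # Step 2: keep the most relevant (here: most frequent) topics
        topics |= {t for t, _ in counts.most_common(top_k)}
        # Step 3: find more content sharing those topics
        items |= {i for i, ts in CORPUS.items() if ts & topics}
    return items, topics
```

Starting from a single seed paper, each round grows both the topic set and the content set, while content on unrelated topics ("paper-D") is never pulled in.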

As an illustration of this approach, let’s consider the following problem. As of the end of 2022, GitHub had 94 million developers, 4 million organizations, and 330 million repositories. So how can we find as many AI-relevant repositories on GitHub as possible? We can start with a connecting dataset like LinkedPapersWithCode. It covers only ML papers and related entities; this SPARQL query shows some statistics:

papers   tasks  models  datasets  methods  evaluations  repos
376,557  4,267  24,598  8,322     2,101    52,519       153,476

We can start with these repositories (most of them are on Github) and get all their topics.
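As a minimal sketch, topics for a given repository can be fetched from GitHub’s REST endpoint `GET /repos/{owner}/{repo}/topics`, which returns a JSON object with a `names` array. The parsing below runs against a canned response so the snippet works offline; swap in a real HTTP call as needed:

```python
# Sketch of collecting GitHub topics for a repository via the REST API.
# The canned response mimics tensorflow/tensor2tensor's topics.
import json

API = "https://api.github.com"

def topics_url(owner: str, repo: str) -> str:
    """URL of the repository-topics endpoint."""
    return f"{API}/repos/{owner}/{repo}/topics"

def parse_topics(body: str) -> list[str]:
    """The endpoint returns {"names": ["topic-1", ...]}."""
    return json.loads(body).get("names", [])

sample = '{"names": ["machine-learning", "reinforcement-learning", "deep-learning"]}'
```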

For example, the famous paper Attention Is All You Need that introduced Transformers links to 559 source repos (implementations).

  • The official (first) repo is tensorflow/tensor2tensor that has topics: machine-learning reinforcement-learning deep-learning machine-translation tpu.
  • A newer repo is PaddlePaddle/PaddleNLP that has topics: nlp search-engine compression sentiment-analysis transformers information-extraction question-answering llama pretrained-models embedding bert semantic-analysis distributed-training ernie neural-search uie document-intelligence paddlenlp llm. Here we need to filter out non-AI topics (search-engine compression), and we find some more recent topics (llm llama) as well as very specialized topics that pertain only to this library (e.g., uie paddlenlp).
    • By exploring the topic nlp, we find 26.7k repos. Many of them are duplicates (e.g., huggingface/transformers is listed under both machine-learning and nlp), but we can still quickly collect a large number of relevant repos.
    • By exploring the topic llm, we find 3.6k very recent GitHub repos.
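Since the same repo appears under several topics, deduplicating with a set union is enough. The topic-to-repo assignments below are illustrative:

```python
# Repos collected from several topics overlap; a set keeps each repo once.
topic_repos = {
    "machine-learning": ["huggingface/transformers", "scikit-learn/scikit-learn"],
    "nlp": ["huggingface/transformers", "explosion/spaCy"],
}
unique = set().union(*topic_repos.values())  # 3 distinct repos, not 4
```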

And so the snowball rolls!

Topics: Wikipedia Articles

We use Wikipedia articles (and categories) as the core of our AI taxonomy. Every important topic related to AI (old, new, emerging, popular, etc.) will most likely have an article on Wikipedia that can serve as a starting point. 

Wikipedia categories are used to classify articles, and to form a hierarchy. They are the result of a collective curation effort to create a taxonomy of Wikipedia articles.

We use Categories as a way of finding relevant articles. Consider the category tree of Artificial Intelligence: it shows the first level and allows you to expand; for each category it shows the number of sub-categories (C) and articles (P).

Category Pruning

We use the following approach:

  • Start with a few root categories (AI, Data Science, and a couple more)
  • Find the shortest paths from the root categories to all other reachable categories (similar to building a minimal spanning tree)
  • Use an interactive tool to cut out irrelevant branches, while monitoring the number of remaining categories and articles
  • Gather all articles from the final set of categories in a tree
  • If a category has “Main article” then replace it with that article

The following figure shows the interactive tool we built to curate the taxonomy.

Above we see that:

  • The original counts were 1.5M categories and 29.2M articles, distributed up to 16 levels from the roots, with a mode of 13 levels. (We set a hard limit of 16 levels, which slightly reduces the original totals.)
  • After pruning, the counts are 667 categories and 15k articles, distributed up to 9 levels from the roots, with a mode of 3 levels
  • This represents a reduction of 2250x (categories) and 1950x (articles) respectively
  • This reduction was achieved by pruning only 359 categories! 
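Recomputing those factors from the raw counts (the figures above round to 2250x and 1950x):

```python
# Reduction factors implied by the before/after counts.
cats_before, cats_after = 1_500_000, 667
arts_before, arts_after = 29_200_000, 15_000

cats_factor = round(cats_before / cats_after)  # ~2249x for categories
arts_factor = round(arts_before / arts_after)  # ~1947x for articles
```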

As a result, we collected 15k AI and hitech topics in a hierarchy of 9 levels. The taxonomy is pretty good, but still not perfect since we did the filtering at the level of pages not categories, and built the tree based on shortest paths to the root.

Cooperative Patent Classification: Application Areas

The Cooperative Patent Classification (CPC) is a patent classification system that was jointly developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO). The CPC is substantially based on the previous European classification system (ECLA), which itself was a more specific and detailed version of the International Patent Classification (IPC).

CPC is important because it relates to clear commercial applicability. On the other hand, patent applications and grants lag the newest industry trends by several years. For example, while there are hundreds of thousands of AI-related patents, a search for “OpenAI” on Lens.org shows only about 750 patents, and none are owned by OpenAI.

The following image shows counts of patents for a particular CPC Class.

The CPC version 2023.02 has 260,491 topics and is 15 levels deep. The CPC goes to some extreme levels of detail. For example, consider the classification term A01B33/022 “tilling implements with rotary driven tools; with tools on horizontal shaft transverse to direction of travel; with rigid tools; with helicoidal tools”. It can be decoded into the following levels (in some cases counts are shown in parentheses):

  • Section (9): 1 letter from A “Human Necessities” to H “Electricity” or Y “Emerging Cross-Sectional Technologies”
  • Class (137): 2 digits, e.g., A01 “Agriculture; forestry; animal husbandry; trapping; fishing”
  • Subclass (678): 1 letter, e.g., A01B “Soil working in agriculture or forestry, parts, details, or accessories of agricultural machines or implements, in general”
  • Group: 1 to 3 digits, e.g., A01B1/00 “Hand tools” (00 indicates a main group)
  • Subgroup: at least 2 digits: see the example before the bulleted list
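As a sketch (not an official CPC parser), the levels above can be extracted from a symbol with a regular expression:

```python
# Decode a CPC symbol like "A01B33/022" into its levels.
# The pattern mirrors the level structure described above; it is
# illustrative and ignores formatting corner cases of the real scheme.
import re

CPC = re.compile(
    r"^(?P<section>[A-HY])"      # 1 letter: A..H or Y
    r"(?P<cls>\d{2})"            # 2 digits: class
    r"(?P<subclass>[A-Z])"       # 1 letter: subclass
    r"(?:(?P<group>\d{1,3})/(?P<subgroup>\d{2,}))?$"  # optional group/subgroup
)

def decode(symbol: str) -> dict:
    m = CPC.fullmatch(symbol)
    if not m:
        raise ValueError(f"not a CPC symbol: {symbol}")
    return {k: v for k, v in m.groupdict().items() if v is not None}
```

For the running example, `decode("A01B33/022")` yields section A, class 01, subclass B, group 33, subgroup 022; a bare subclass like `A01B` decodes without the group part.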

Finding All CPC AI Topics

CPC includes many other topics that are related to AI. We discovered them using several techniques:

  • Searching with keywords such as machine learning, ml, neural networks, ann, rnn and many others
  • Assuming that certain branches like Robotics, UAVs, Route Searching, Image and Character Recognition are innately related to AI
  • Expanding all children of the selected topics (e.g., G06N3/02 Neural networks has 57 descendants)
  • Snowballing from topics to patents to topics
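The child-expansion step can be sketched as a simple traversal of the CPC hierarchy; the toy parent-to-children map below is illustrative, not the real scheme:

```python
# Expand selected root topics to all of their descendants,
# mirroring how 208 selected roots grew to 2476 CPC AI topics.
CHILDREN = {  # parent symbol -> child symbols (toy data)
    "G06N3/02": ["G06N3/04", "G06N3/08"],
    "G06N3/04": ["G06N3/044", "G06N3/045"],
}

def expand(roots):
    """Depth-first expansion of the roots and all their descendants."""
    selected, stack = set(roots), list(roots)
    while stack:
        for child in CHILDREN.get(stack.pop(), []):
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected
```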

In total, we selected 208 CPC AI root topics, which expanded to a total of 2476 CPC AI topics. We are still working to match CPC AI root topics to the core Wikipedia-based taxonomy that we extracted. 

We are considering many more datasets for inclusion in the taxonomy tree:

  • ACM CCS (Association for Computing Machinery’s Computing Classification System)
  • AIDA FAT (Academia Industry Dynamics KG’s Focus Areas Taxonomy)
  • AMiner KGs (Knowledge Graphs)
  • ANZSRC FOR (Australian and New Zealand Standard Research Classification, Fields of Research)
  • arXiv Areas
  • China NSFC (Natural Science Foundation of China)
  • EU CORDIS EuroSciVoc
  • Crunchbase Categories
  • CSO (Computer Science Ontology)
  • JEL (Journal of Economic Literature Classification)
  • MeSH (Medical Subject Headings)
  • MSC (Mathematics Subject Classification)
  • OpenAlex Topics
  • SemanticScholar FOS (Fields of Study)
  • StackExchange Tags

Conclusion and Future Work

In conclusion, we emphasize the significance of a meticulously curated AI taxonomy, which will help us select the content for the graph, with the ultimate goal of guiding investment decisions, research policy, and further AI research. The taxonomy integrates various data sources, offering a holistic view of AI innovation. Future work includes integrating more topical datasets, leveraging the topics for data collection, and applying snowballing methodologies to enhance the InnoGraph knowledge graph.

Are you interested in reading more about our approach to the AI taxonomy?


InnoGraph has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No:101070284. Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.


Vladimir Alexiev

Chief Data Architect at Ontotext

Vladimir’s passion is data modelling, ontologies and data representation standards. He is a member of the DBpedia and Europeana quality committees, and frequent speaker at conferences and events. His favourite topics are Linked Open Data and its application in cultural heritage and digital humanities.

Boyan Bechev

Data Engineer at Ontotext

Boyan Bechev is a data engineer on the solutions team at Ontotext. He is involved in several projects that require precise data modelling and high data quality, and is not afraid to get his hands dirty. Having an academic background in distributed computing, he is passionate about handling large datasets as efficiently as possible.

Aleksandr Ositsyn

Data Engineer at Ontotext

Aleksandr Ositsyn is a machine learning engineer with experience in Continuous Integration and Continuous Delivery (CI/CD), the Django REST Framework, Python, SQL, machine learning, and Docker.