Enable Full Control Over Data Ingestion and Updates with Ontotext’s Linked Life Data Inventory 1.4

… plus a bunch of newly added data sets, including new predicted relations based on knowledge graph embeddings!

New York, Sofia, Basel Thursday, February 1, 2024

Ontotext is happy to announce the new 1.4 release of Ontotext’s LinkedLifeData (LLD) Inventory, a knowledge graph building accelerator providing more than 200 semantics-ready biomedical datasets. It covers data in multiple modalities: genomics, proteomics, metabolomics, molecular interactions, and biological processes. In addition to the structured data sources, the inventory extends its coverage to pharmacological datasets, clinical records, medical information, and a diverse set of scientific publications and patents.

LinkedLifeData Inventory serves as a valuable resource for the scientific community, fostering multidisciplinary exploration and analysis across various facets of Life Sciences and Healthcare research. The RDF format employed ensures semantic richness and interoperability, facilitating advanced data integration, semantic querying, and insights generation in alignment with FAIR data principles.

Major highlights of the new functionalities

The latest release of LLD Inventory gives users full control of the data ingestion and update process through improved monitoring of data updates in the Data Loader tool. Data Loader is a data ingestion management component for all repositories in a single client instance of GraphDB. With the latest improvements to the tool, the often tedious task of data loading becomes a more predictable and reliable operation, with better workflow management and error handling.

The Entity Linking component has also been improved and is now proven capable of serving as a discrete step in NLP and ETL pipelines alike. The component is fully integrated into the continuous integration and development process for annotating unstructured content and normalizing it to other datasets within the Inventory. The Entity Linking training process is well documented, which ensures repeatability and allows customization based on use case needs.

LLD Inventory 1.4 features a complete overhaul of Ontotext’s approach to semantically annotating semi-structured text fields (such as those in Clinical Trials) and normalizing them to reference biomedical terminologies and ontologies. The current version of the dataset normalizes more than 1 million (1,073,197) labels describing study conditions and more than 9 million (9,036,884) labels describing reported adverse event reactions to UMLS concepts, while around 2.5 million (2,459,072) drug intervention labels are normalized to DrugCentral. The new process significantly reduces the time and hardware resources needed to complete an annotation step.

New datasets

The latest release of LLD Inventory redesigns the transformation of complex derivative datasets (like UMLS). This makes it possible to generate a semantic representation of the harmonized data (concept-level objects) while also maintaining a separate semantic representation, at atom level, of each dataset that is part of the Metathesaurus (such as MedDRA, ICD-9/10, SNOMED CT, etc.). In addition to the harmonized dataset, the 19 most frequent medical terminologies from UMLS are now available as separate datasets to mix and match with the other resources.

The following are the new datasets that have been included:

  • PubTator dataset (added as an alternative source of NLP annotations for a wide variety of biomedical classes)
  • MarkerDB — an important source of linked biomarker data
  • HUGO Gene Nomenclature Committee (HGNC) dataset
  • Cellosaurus — an important source of linked cell line data. Additional semantic mappings were generated to other referenced datasets.
  • Reactome — a manually curated and peer-reviewed pathway database
  • Genome-wide Association Studies (GWAS) — provides a consistent, searchable, visualizable, and freely available database of SNP-trait associations. These studies offer an unprecedented opportunity to investigate the impact of common variants on complex diseases.
  • RHEA — an expert-curated knowledge base of chemical and transport reactions of biological interest and the standard for enzyme and transporter annotation in UniProtKB. Additional mappings are created to reference datasets.
  • Enzyme — dataset of information relative to the nomenclature of enzymes

Several AI-generated Gene-Disease Link Prediction datasets have also been included, based on knowledge graph embedding models like HAKE, QuatE, and TorusE.

  • The Hierarchy-Aware Knowledge Graph Embedding (HAKE) method takes hierarchical ontology relationships into account by dividing the embedding into two parts: a modulus part (capturing how high or low an entity sits in the hierarchy) and a phase, or argument, part (distinguishing entities that sit at the same level of the hierarchy).
  • The Quaternion Embedding of Knowledge Graphs (QuatE) method helps understand and use large knowledge graphs by representing entities and relationships as quaternions (hypercomplex numbers with four components), which capture a lot of detail about how data is connected. This can help with tasks like link prediction (finding new relations that might not be obvious at first glance) or answering complicated questions about the data in the graph.
  • The TorusE method creates knowledge graph embeddings by mapping entities and relationships from the knowledge graph onto a mathematical structure called a “torus” (a doughnut-shaped surface). Because a torus wraps around on itself, the embeddings stay within a bounded space, while the method tries to preserve certain properties of the knowledge graph, like the distances between entities or the structure of relationships.
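For readers who want to see the three ideas side by side, the following is a minimal sketch of how each family of models scores a candidate (head, relation, tail) triple, written from the published descriptions of the methods. It is not Ontotext’s implementation: the function names, the plain-list vector layout, and the `lam` weight are assumptions made for illustration, and real models learn these vectors during training.

```python
import math

def hake_score(h_mod, h_phase, r_mod, r_phase, t_mod, t_phase, lam=0.5):
    """HAKE-style score: negative sum of a modulus distance and a phase distance.

    The modulus part compares hierarchy levels; the phase part compares
    entities on the same level. `lam` (an assumed weight) balances the two.
    """
    mod_dist = math.sqrt(sum((hm * rm - tm) ** 2
                             for hm, rm, tm in zip(h_mod, r_mod, t_mod)))
    phase_dist = sum(abs(math.sin((hp + rp - tp) / 2))
                     for hp, rp, tp in zip(h_phase, r_phase, t_phase))
    return -(mod_dist + lam * phase_dist)

def hamilton(q, p):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = p
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2)

def quate_score(h, r, t):
    """QuatE-style score: rotate each head quaternion by the unit relation
    quaternion, then take the inner product with the tail quaternion."""
    total = 0.0
    for hq, rq, tq in zip(h, r, t):
        norm = math.sqrt(sum(x * x for x in rq))
        unit_r = tuple(x / norm for x in rq)
        rotated = hamilton(hq, unit_r)
        total += sum(a * b for a, b in zip(rotated, tq))
    return total

def toruse_score(h, r, t):
    """TorusE-style score (L1 variant): negative wrap-around distance between
    head + relation and tail, with every coordinate living in [0, 1)."""
    total = 0.0
    for hi, ri, ti in zip(h, r, t):
        diff = (hi + ri - ti) % 1.0         # position on the circle
        total += min(diff, 1.0 - diff)      # shortest way around
    return -total
```

In all three functions a higher score means a more plausible link, so ranking candidate diseases for a gene amounts to scoring each candidate tail and sorting in descending order.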

The predicted new links were evaluated using Sem@K, a semantic evaluation metric that reflects the semantic validity of the top-ranked candidate entities. The top-N newly predicted links that scored above a certain threshold were selected and included as separate datasets in LLD Inventory.
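As a rough illustration of the idea behind Sem@K, the sketch below computes the share of the top-K ranked candidates whose type matches what the relation expects (for a gene-disease link, a candidate tail should be a disease). The function name, the type lookup, and the entity identifiers are hypothetical, and the exact validity criteria and thresholds used for LLD Inventory are not described in this announcement.

```python
def sem_at_k(ranked_candidates, entity_types, expected_type, k):
    """Fraction of the top-k candidates that carry the expected semantic type.

    ranked_candidates: entity ids ordered from best to worst model score
    entity_types:      dict mapping entity id -> set of type labels (assumed)
    expected_type:     type required by the relation, e.g. "Disease"
    """
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    valid = sum(1 for entity in top
                if expected_type in entity_types.get(entity, set()))
    return valid / len(top)


# Made-up example: candidate tails predicted for one gene.
types = {
    "disease:A": {"Disease"},
    "gene:B": {"Gene"},
    "disease:C": {"Disease"},
    "disease:D": {"Disease"},
}
ranking = ["disease:A", "gene:B", "disease:C", "disease:D"]
print(sem_at_k(ranking, types, "Disease", k=4))
```

A model can rank the correct answer highly and still place semantically impossible entities near the top of the list; a check of this kind makes that visible.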

The newly generated knowledge based on the embedding models can be used in a variety of use cases, ranging from deeper disease understanding and novel drug target identification to drug repurposing and many more.

Do you want to build your custom knowledge graph in a matter of days from ready-to-be-used semantic data blocks?

Contact Us For a Demo

For more information, contact Doug Kimball, Chief Marketing Officer at Ontotext