Ontotext is happy to announce release 1.4 of its LinkedLifeData (LLD) Inventory, a knowledge graph building accelerator that provides more than 200 semantics-ready biomedical datasets. It covers data in multiple modalities: genomics, proteomics, metabolomics, molecular interactions, and biological processes. In addition to the structured data sources, the inventory extends its coverage to pharmacological datasets, clinical records, medical information, and a diverse set of scientific publications and patents.
LinkedLifeData Inventory serves as a valuable resource for the scientific community, fostering multidisciplinary exploration and analysis across various facets of Life Sciences and Healthcare research. The RDF format employed ensures semantic richness and interoperability, facilitating advanced data integration, semantic querying, and insights generation in alignment with FAIR data principles.
The latest release of LLD Inventory gives users full control over the data ingestion and update process through improved monitoring of data updates in the Data Loader tool. Data Loader is a data ingestion management component for all repositories in a single client instance of GraphDB. With the latest improvements to the tool, data loading, previously a tedious task, becomes a more predictable and reliable operation, with better workflow management and error handling.
The Entity Linking component has also been improved and is now proven capable of serving as a discrete step in both NLP and ETL pipelines. The component is fully integrated into the continuous integration and development process for the annotation of unstructured content and its normalization to other datasets within the Inventory. The Entity Linking training process is well documented, which ensures repeatability and allows customization based on the needs of the use case.
LLD Inventory 1.4 features a complete overhaul of Ontotext’s approach to semantically annotating semi-structured text fields (such as those in Clinical Trials) and normalizing them to reference biomedical terminologies and ontologies. The current version of the dataset normalizes more than 1 million (1,073,197) labels describing study conditions and more than 9 million (9,036,884) labels describing reported adverse event reactions to UMLS concepts, while around 2.5 million (2,459,072) drug intervention labels are normalized to DrugCentral. The current process significantly reduces the time and hardware resources necessary to complete an annotation step.
The latest release of LLD Inventory redesigns the transformation of complex derivative datasets (such as UMLS). The new approach generates a semantic representation of the harmonized data (concept-level objects) and also maintains a separate semantic representation, at the atom level, of each dataset that is part of the Metathesaurus (such as MedDRA, ICD9/10, SNOMED CT, etc.). In addition to the harmonized dataset, the 19 most frequent medical terminologies from UMLS are now available as separate datasets to mix and match with the other resources.
The following are the new datasets that have been included:
Several AI-generated Gene-Disease Link Prediction datasets have also been included, based on knowledge graph embedding models like HAKE, QuatE, and TorusE.
The newly predicted links were evaluated with Sem@K, a semantic evaluation method that reflects the semantic validity of the top-ranked candidate entities. The top-N newly predicted links that met the criteria above a certain threshold were selected and included as separate datasets in LLD Inventory.
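As a rough illustration of this kind of selection, the sketch below computes a Sem@K-style score and then filters candidate links. It is not Ontotext's implementation. It assumes that Sem@K is the fraction of the top-K ranked candidates whose type matches the type expected by the relation, and that the threshold applies to the model's link score. All function names, identifiers, scores, and thresholds are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def sem_at_k(ranked: List[str], entity_types: Dict[str, Set[str]],
             expected_type: str, k: int) -> float:
    """Fraction of the top-k ranked candidates that carry the expected type."""
    top = ranked[:k]
    if not top:
        return 0.0
    valid = sum(1 for e in top if expected_type in entity_types.get(e, set()))
    return valid / len(top)


def select_links(scored: List[Tuple[str, str, float]],
                 entity_types: Dict[str, Set[str]],
                 expected_type: str,
                 threshold: float,
                 top_n: int) -> List[Tuple[str, str, float]]:
    """Keep the top_n highest-scoring links whose tail has the expected type
    and whose score is at or above the (assumed) threshold."""
    valid = [(head, tail, score) for head, tail, score in scored
             if score >= threshold
             and expected_type in entity_types.get(tail, set())]
    valid.sort(key=lambda link: link[2], reverse=True)
    return valid[:top_n]


# Hypothetical toy data: entity identifiers and scores are made up.
entity_types = {
    "disease:A": {"Disease"},
    "disease:B": {"Disease"},
    "gene:X": {"Gene"},
}
ranked_candidates = ["disease:A", "gene:X", "disease:B"]
scored_links = [
    ("gene:G1", "disease:A", 0.90),
    ("gene:G1", "gene:X", 0.95),    # wrong tail type, filtered out
    ("gene:G2", "disease:B", 0.40),  # below threshold, filtered out
    ("gene:G3", "disease:B", 0.70),
]

score = sem_at_k(ranked_candidates, entity_types, "Disease", k=2)
selected = select_links(scored_links, entity_types, "Disease",
                        threshold=0.5, top_n=2)
```

In this toy example only one of the top two candidates is a disease, so the Sem@2 score is 0.5, and two of the four scored links survive the filter.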
The knowledge newly generated by the embedding models supports use cases ranging from deeper disease understanding to novel drug target identification, drug repurposing, and many more.
For more information, contact Doug Kimball, Chief Marketing Officer at Ontotext.