
Building Linked Data Bridges To Fish In Data Lakes

May 5, 2016 · 5 min read · Milena Yankova

Needed Data in a Data Lake

Did you know that the Great Lakes contain 20 percent of the world’s surface fresh water and are home to some 150 species of fish? Let’s imagine for a second that the Great Lakes were data lakes. Imagine how many fish, and how big, the anglers – our data analysts – would catch if they knew the species, the locations and the right bait.

Data lakes – huge storage repositories of both structured and unstructured data in their native format – have been a trend in recent years. Data lakes differ from data warehouses, for example, in several crucial data management aspects.

What is more, data lakes managed with a semantic graph database (also called an RDF triplestore) help organizations optimize data, costs and resources by creating highly interlinked data and by mastering huge sets of heterogeneous data. Linked Data and Linked Open Data keep our fishermen constantly updated on the best spots to cast their bait, building bridges invisible to other anglers.

The Origins of The Data Lake

Still, what’s the data lake buzz all about?

In order to differentiate data lakes from data warehouses, let’s first dig into the origins of the term ‘data lake’. Pentaho CTO James Dixon is credited with coining it. In a 2010 blog post, Dixon wrote:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

Lakes vs. Warehouses

Diving or swimming in a data lake is nothing like rummaging through a warehouse full of boxes. It is more like keeping empty boxes ready in case you want to store more stuff later.

Data lakes and traditional enterprise data warehouses differ in the way they approach data storage and management. First, warehouses contain structured data designed for specific purposes, while data lakes have all the data, for all time, allowing for any data to be used in the future.

Next, data warehouses mostly hold quantitative metrics, while data lakes incorporate all types of data regardless of source, including newer channels such as mobile, social media or IoT. Warehouses may also lack some of the source data because they are built to serve a specific use case. Not so with data lakes.

Data lakes, being repositories for raw data, hold all data in its native format, ready to be accessed and used at any time.

This leads to another difference between the two management approaches: data lakes are more flexible and highly agile to configure and reconfigure, compared to the rigid structures of a traditional warehouse.

Last but surely not least, data lakes allow for a faster pace in getting actionable insights, because raw heterogeneous data in native format can feed various types of Big Data analytics and predictive models whenever needed – unlike data warehouses, which mostly keep transformed and structured data for business professionals.

Linking Data Lakes


The Great Lakes Waterway, a system of natural channels and artificially built canals, allows ships to navigate through lakes Superior, Michigan, Huron, Erie and Ontario. Although all five lakes are interconnected, water transport required civil engineering works – for example, to bypass Niagara Falls.

Leaving wildlife and civil engineering behind and diving into data lakes, we find huge repositories of structured, semi-structured and unstructured data from various sources, kept in native format, for all time. So how can one navigate and search for insights in such lakes?

The idea of data lakes revolves around having a vast repository of all enterprise data in one place, waiting to be accessed and crunched equally by all business departments and applications, without the need to specially prepare it. Therefore, tagging and linking the raw data via metadata is essential to identifying relationships among huge numbers of heterogeneous items.
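To make the idea of tagging raw data via metadata concrete, here is a minimal sketch in plain Python. Everything in it – the record identifiers, the sources and the `tag_record` helper – is hypothetical; it simply represents each metadata tag as a (subject, predicate, object) triple, the shape RDF triplestores use:

```python
# Hypothetical sketch: tagging raw records from heterogeneous sources
# with metadata triples so their relationships become explicit.

def tag_record(record_id, source, metadata):
    """Turn one raw record into (subject, predicate, object) triples."""
    triples = [(record_id, "comesFrom", source)]
    for key, value in metadata.items():
        triples.append((record_id, key, value))
    return triples

# Raw items landing in the lake from different sources, in native form.
lake = []
lake += tag_record("doc42", "crm", {"mentionsCustomer": "acme", "type": "email"})
lake += tag_record("tweet7", "social", {"mentionsCustomer": "acme", "type": "post"})

# The shared "mentionsCustomer" link now connects items across sources.
about_acme = [s for (s, p, o) in lake
              if p == "mentionsCustomer" and o == "acme"]
print(about_acme)  # ['doc42', 'tweet7']
```

The point of the sketch is that once every item carries uniform metadata triples, items from entirely different sources (a CRM email, a social media post) become discoverable through the links they share.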

Linked Data, with an RDF database, enables organizations to quickly access their critical actionable information.

The graph database, where linked data is stored in a knowledge graph structure, allows businesses to reuse data in future applications. By attributing semantic relations to the concepts in raw, disparate data, organizations build the bridges to data-driven commercial decisions whenever the business environment calls for them.
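How does a graph database resolve such semantic relations? The basic building block of a SPARQL query over an RDF store is the triple pattern. Here is a hypothetical, plain-Python sketch of that primitive, with invented data and a wildcard-based `match` function standing in for a real query engine:

```python
# Hypothetical sketch of triple-pattern matching, the core query
# primitive of an RDF graph database (None acts as a wildcard).

graph = {
    ("acme", "isA", "Customer"),
    ("acme", "boughtProduct", "widget"),
    ("widget", "isA", "Product"),
    ("order9", "orderedBy", "acme"),
}

def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What do we know about acme?" -- follow every edge from one node.
for triple in sorted(match(graph, s="acme")):
    print(triple)
```

Because the data stays in this uniform triple shape, the same query mechanism serves any future application, which is exactly why the stored graph can be reused rather than restructured for each new use case.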

Building a way to navigate through all the data keeps the lakes fresh and clean and swarming with fish. It also prevents them from becoming data swamps where data is unusable for any operational value.

Enhancing Optimization by Using Data Lakes

The use of data lakes helps organizations optimize their data, costs and resources. Data is enhanced by collecting, hosting and analyzing flexible, easily scalable sets of raw heterogeneous data. Experts agree that the costs of deploying and maintaining a data lake are lower than those of a traditional enterprise data warehouse solution.

Data lake deployment also optimizes resources by minimizing the labor costs of development and data clean-up until the organization decides how the relevant data, accessible at any time, serves its business purposes.

In its 2014 report Technology Forecast: Rethinking integration, PwC said:

Every industry has a potential data lake use case. A data lake can be a way to gain more visibility or put an end to data silos. Many companies see data lakes as an opportunity to capture a 360-degree view of their customers or to analyze social media trends.

So, data lakes have the potential to lead organizations to untapped streams of data analytics and new streams of revenue. By using Linked Data in their data lakes, enterprises build bridges to extracting more powerful and more relevant insights from their Big Data analytics.



A bright lady with a PhD in Computer Science, Milena’s path started in the role of a developer, passed through project management and quickly led her to product management. For her, a constant source of miracles is how technology supports and alters our behaviour, engagement and social connections.
