Data lakes are crucial to organizations because they can store large volumes of highly diverse data from multiple sources. Paul Leahy, Country Manager ANZ, Qlik, asks how a data lake can be effectively used to create value for your business.
There’s a common saying that data is the new water. Like water, data must be filtered before it is used, but unlike water, data is not limited by supply. That’s why companies have long been moving to data lakes, where unfiltered, fast-flowing data is stored in massive data dams for future analysis.
Put another way, data lakes are where the unfiltered water is, while analytics is the filtering and refining process, extracting value for the company.
Data lakes are important to organizations due to their ability to store large volumes of highly diverse data from multiple sources. It is this promise of cost-effective rapid storage that drove initial interest in data lakes; organizations wanted to overcome the costs and delays associated with storing data in traditional data warehouses.
So how can you effectively use a data lake to create value for your business? It starts with a cloud-based model.
The rise of the cloud-based data pond
Early on-premises installations of data lakes faced criticism for perceived challenges with security, governance and performance, as well as the cost of maintaining and managing dedicated data centers.
Yet modern cloud-based data lakes have helped overcome many of those challenges.
Big vendors including Amazon, Microsoft and Google offer managed cloud environments for data lakes, replacing the capital cost of an on-prem Hadoop environment with an elastic consumption model in which organizations pay for what they use. These environments also mitigate some of the security and management challenges, allowing businesses to focus on using data rather than maintaining the environment.
The consumption model encouraged users to avoid dumping all their data into a central lake and instead load only what they need for analytics, leading to smaller, purpose-built cloud data lakes, or cloud-based data ponds. The rise of data science and machine learning platforms, as well as the availability of SQL-based analytic services, made accessing and analyzing stored data easier. As a result, data insights reach business users faster.
The challenge of making real-time, analytics-ready data available to data consumers, however, remains. Traditional data integration approaches slow the data pipeline, making data outdated even before it is processed and ready for analysis. Then there are challenges around data trust and accessibility for data consumers.
Extracting value from data lakes: How to fill and refine your data lake
In the rush to build a data lake, it is easy to focus on hydrating the data lake and overlook how to make that data actionable for analysis. But storing data in the data lake is just the first step.
The value of data lakes comes not just from their ability to quickly and cost-effectively store all types of data, but also from processing and refining that raw data into an analytics-ready state, so the data is actionable and accessible for exploration and analysis.
When building a performant data lake, focus not only on the sources and types of data to ingest and the speed of data replication, but also on the ability to transform and refine that data and make it consumption-ready for analytics.
Three capabilities are critical to accelerating value from your data lake: universal support for a variety of source systems and target platforms, real-time incremental change data capture, and pipeline automation. That automation should extend all the way from configuring and managing data pipelines to transforming and refining raw data into curated, analytics-ready data sets.
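To make incremental change data capture concrete, here is a minimal sketch in Python. It assumes a hypothetical `ChangeEvent` shape and an in-memory dictionary standing in for a table in the lake; real products differ in how they capture and deliver changes. The point it illustrates is that only the rows that changed are touched, rather than reloading the whole source.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """One captured change from a source system (hypothetical shape)."""
    op: str                               # "insert", "update" or "delete"
    key: str                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # changed values; None for deletes


def apply_changes(
    target: Dict[str, Dict[str, Any]], events: List[ChangeEvent]
) -> Dict[str, Dict[str, Any]]:
    """Apply change events in order, touching only the rows that changed."""
    for event in events:
        if event.op in ("insert", "update"):
            # Merge so a partial update keeps the columns it did not mention.
            target[event.key] = {**target.get(event.key, {}), **(event.row or {})}
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return target


# Illustrative data only: a tiny "table" keyed by customer id.
lake_table = {
    "c1": {"name": "Ada", "city": "Sydney"},
    "c3": {"name": "Sam", "city": "Melbourne"},
}
events = [
    ChangeEvent("update", "c1", {"city": "Perth"}),
    ChangeEvent("insert", "c2", {"name": "Lin", "city": "Auckland"}),
    ChangeEvent("delete", "c3"),
]
apply_changes(lake_table, events)
```

After the three events are applied, the table holds the updated `c1` row and the new `c2` row, and `c3` is gone, without the source ever being re-read in full.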
Value can only be derived from data you trust
Data security, quality, consistency and governance are critical to data lake value. Data lakes can quickly become data swamps if data is dumped without consistent data definitions and metadata models. Check for the ability to auto-generate and augment metadata, tag and secure sensitive data and establish enterprise-wide access controls.
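As a rough illustration of auto-generating metadata and tagging sensitive data, the Python sketch below profiles each column of a small dataset and flags columns that look sensitive. The name hints, the email pattern and the output fields are assumptions made for this example, not the behaviour of any particular catalog product.

```python
import re
from typing import Any, Dict, List

# Assumed heuristics for this sketch; a real catalog would use richer rules.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
SENSITIVE_NAME_HINTS = ("email", "phone", "ssn", "dob")


def profile_column(name: str, values: List[Any]) -> Dict[str, Any]:
    """Generate basic metadata for one column and tag it if it looks sensitive."""
    non_null = [v for v in values if v is not None]
    looks_like_email = bool(non_null) and all(
        isinstance(v, str) and EMAIL_PATTERN.match(v) is not None for v in non_null
    )
    name_hint = any(hint in name.lower() for hint in SENSITIVE_NAME_HINTS)
    return {
        "column": name,
        "inferred_type": type(non_null[0]).__name__ if non_null else "unknown",
        "null_count": len(values) - len(non_null),
        "sensitive": name_hint or looks_like_email,
    }


def build_metadata(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Profile every column that appears in any row."""
    columns = sorted({key for row in rows for key in row})
    return [profile_column(c, [row.get(c) for row in rows]) for c in columns]


# Illustrative data only.
rows = [
    {"customer_id": 1, "contact": "ada@example.com", "city": "Sydney"},
    {"customer_id": 2, "contact": "lin@example.com", "city": None},
]
metadata = build_metadata(rows)
```

Here the `contact` column is tagged as sensitive because its values match the email pattern, even though its name gives no hint. That is the kind of gap consistent, automated profiling is meant to close before access controls are applied.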
Data in data lakes is of value only if data consumers can understand and use the data, verify its origin and trust its quality. An integrated catalog that covers automated data profiling and metadata generation, lineage, data security and governance is critical to building a successful data lake.
Accessible data is key to unlocking value creation
A key reason a data lake fails to unlock value is the inability to access and consume data at the speed of the market. It is not enough to just store data in the data lake; data should also be usable and accessible to create value. When data consumers cannot easily find, understand and self-provision the datasets they need, or depend on data scientists or specialized programmers to extract data, the result is delayed and dated data.
A user-friendly marketplace capability for search and evaluation, as well as self-service preparation of derivative datasets, can fast-track data lake value realization.
While the original on-prem Hadoop-based model may have outlived its usefulness, cloud migration and advances in integration technologies have given users a new way of storing, processing and refining data, putting it to use in a much more cost- and time-effective way.
But value cannot be unlocked simply by dumping data into a single pool and hoping for the best. Avoiding a data swamp involves many considerations, not least ensuring the data poured into the lake is trustworthy and accessible.