What Is The Definition Of A Data Lake?

Learning Objectives

  • Clearly define a data lake and provide a brief overview of its architecture.
  • Describe the benefits and challenges of data lakes.
  • Define important pre-planning considerations when setting up a data lake.

What Is The Definition Of A Data Lake?

A data lake is a centralized repository designed to store an enormous amount of data in its raw, native format. As opposed to traditional data warehouses, which store structured data in hierarchical files or folders, data lakes utilize a flat architecture and object storage, which can support structured, semi-structured, or unstructured data. This architecture gives organizations far more flexibility to store data “as is” directly from the source, without processing it, and without needing a defined schema or a specific plan for using the data.

However, just because a data lake is more ad hoc than a data warehouse doesn’t mean it’s disorganized. Object storage labels data with identifiers and metadata tags, ensuring that users can quickly locate and retrieve the data when they need it.
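To make the flat layout concrete, here is a minimal in-memory sketch in Python. It is not any vendor's API: the `StoredObject` and `FlatObjectStore` names, their methods, and the sample keys and tags are all hypothetical, chosen only to illustrate how a unique identifier plus metadata tags lets users find raw objects without a folder hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str                                   # unique identifier in a flat namespace
    data: bytes                                # raw payload, stored as-is
    tags: dict = field(default_factory=dict)   # descriptive metadata tags


class FlatObjectStore:
    """Hypothetical in-memory stand-in for an object store (illustration only)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **tags):
        # No schema and no directory tree: just a key, bytes, and tags.
        self._objects[key] = StoredObject(key, data, dict(tags))

    def get(self, key):
        return self._objects[key].data

    def find(self, **tags):
        """Return the keys of objects whose tags match every given pair."""
        return sorted(
            key for key, obj in self._objects.items()
            if all(obj.tags.get(k) == v for k, v in tags.items())
        )


store = FlatObjectStore()
# Structured, semi-structured, and unstructured data sit side by side.
store.put("sales/2024/orders.csv", b"id,total\n1,9.99\n", source="crm", format="csv")
store.put("clicks-0001.json", b'{"page": "/home"}', source="web", format="json")
store.put("logo.png", b"\x89PNG", source="web", format="png")

web_keys = store.find(source="web")
```

Note that the slashes in a key such as `sales/2024/orders.csv` are simply part of the name in this sketch. Lookups go through the key or the tags, not through a directory walk.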

Initially, data lakes were deployed on-premises, with storage on the Hadoop Distributed File System (HDFS) and processing managed by YARN on Hadoop clusters. However, the introduction of low-cost cloud object storage services, together with the inherent scalability and flexibility of the cloud, caused organizations to gravitate toward cloud-based deployments. Cloud vendors also offer ancillary services that simplify deploying a data lake, such as data lake development, data integration, and data management.

Benefits & Challenges of Data Lakes

Arguably, the most significant benefit of data lakes is their flexibility. By common industry estimates, 80% to 90% of data is unstructured. Cleaning, transforming, and preparing this data for analysis is a time-consuming, laborious process, one that organizations may not want to undertake if they’re not sure exactly how they’re going to use the data. Data lakes enable organizations to store data now and determine a use case for it later. By providing a central repository for data from different systems, data lakes also break down data silos.
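The “store now, decide later” approach is often described as schema-on-read: the structure is applied when the data is read, not when it is written. The following is a simplified sketch using only the Python standard library. The sample records, the `read_with_schema` helper, and the `billing_schema` mapping are hypothetical and stand in for whatever query engine an organization actually uses.

```python
import json

# Raw events land in the lake exactly as the source emitted them.
# Fields differ between records and every value arrives as text.
raw_lines = [
    '{"user": "a1", "amount": "19.50", "note": "gift"}',
    '{"user": "b2", "amount": "5", "device": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep the named fields and cast each one."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name])
            for name, cast in schema.items()
            if name in record
        })
    return rows


# A use case chosen later decides which fields matter and how to type them.
billing_schema = {"user": str, "amount": float}
billing_rows = read_with_schema(raw_lines, billing_schema)
```

A different team could read the same raw lines with a different schema, for example one that keeps only `device`, without anyone having to reload or restructure the stored data.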

Data lakes are also highly agile, scalable, and cost-effective. Built using open-source technologies such as Hadoop, Spark, and HBase and running in clusters on inexpensive commodity hardware, they’re easily configurable and expandable. Users can then funnel data into a data lake without concern for size limits.

Because data lakes are built on open formats, the data stored within them can be used by many applications and for a wide variety of use cases, particularly in data science. Data lakes are critical in artificial intelligence and machine learning, where they give data scientists easy access to a vast pool of data. This data isn’t limited to text but includes images, video, and even real-time streams that can be analyzed using SQL queries, Python, R, or any other language or application.

However, data lakes do come with a set of challenges, most of them related to the very flexibility that makes them such an attractive option in the first place. Organizations can choose from various technologies to set up their data lakes and combine them in different ways. There are so many choices that organizations can find it tricky to determine the solutions and setup that best fit their needs.

If a data lake isn’t set up and maintained properly, it can devolve into a “data swamp,” where the organization has lost track of the data stored in the lake (even as more data funnels in) and users can’t find what they need. Data swamps create security and compliance issues, especially for organizations that must comply with the GDPR and other data privacy laws that require companies to produce, and delete, consumers’ data upon request. They also suffer from data quality, consistency, and reliability problems, which lead to unsatisfactory results when the data sets are used to train machine learning models or to power smart analytics applications.

An improperly managed data lake can also leave companies with unexpected bills when usage of the lake exceeds what they planned for.

Preventing Your Data Lake from Becoming a Data Swamp

In most cases, a lack of pre-planning is the biggest reason why data lakes turn into data swamps. While the ability to store data from multiple sources is one of the selling points of a data lake, a data lake shouldn’t be a virtual “junk drawer.” It is not a place for organizations to shove all company-related data under the guise of eventually using it for some undetermined future purpose. Organizations must strike a balance between data quantity and the data’s potential value to the business.

With this in mind, before deploying a data lake, organizations must determine what they intend to achieve with it. Will the data lake be used to augment a data warehouse, either by acting as a staging area for data awaiting preparation, serving as a repository for data that isn’t easily stored in a data warehouse, or both? Will it serve as a data source for an application? Will it be used for archival and historical data storage? Will it do all of these things and more?

Other important considerations include:

  • Whether the data lake will store only raw data, or whether it will also contain data that is filtered and processed for analysis when ingested.
  • Whether the data lake will include analytics sandboxes that data scientists can use to work with data.
  • A process to control and log access to the data contained within the lake using security protections such as data masking, data encryption, and identification of sensitive data.
  • Naming conventions for folders and files, along with a data classification taxonomy to identify data type, content, possible user groups, and usage scenarios.
  • Data governance policies, including data retention and processes for avoiding duplicate data sets.
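Several of these considerations can be enforced at ingestion time instead of being left to convention. The sketch below is one possible illustration in Python, not a prescribed design. The key pattern, the zone names (`raw`, `processed`, `sandbox`), the classification labels, the retention periods, and the `Catalog` class are all assumptions made up for the example.

```python
import hashlib
import re
from datetime import date, timedelta

# Hypothetical naming convention: <zone>/<source>/<yyyy-mm-dd>/<name>.<ext>
KEY_PATTERN = re.compile(
    r"^(raw|processed|sandbox)/[a-z0-9_]+/\d{4}-\d{2}-\d{2}/[a-z0-9_]+\.[a-z0-9]+$"
)

# Hypothetical retention period, in days, for each classification label.
RETENTION_DAYS = {"public": 3650, "internal": 1825, "personal": 365}


class Catalog:
    """Illustrative catalog that checks names, labels data, and spots duplicates."""

    def __init__(self):
        self.entries = {}       # key -> metadata
        self._by_digest = {}    # content hash -> first key registered

    def register(self, key, data, classification, ingested):
        if not KEY_PATTERN.match(key):
            raise ValueError(f"key violates naming convention: {key}")
        if classification not in RETENTION_DAYS:
            raise ValueError(f"unknown classification: {classification}")
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_digest:
            # Identical bytes are already cataloged: return the existing key.
            return self._by_digest[digest]
        self._by_digest[digest] = key
        self.entries[key] = {
            "classification": classification,
            "expires": ingested + timedelta(days=RETENTION_DAYS[classification]),
        }
        return key

    def expired(self, today):
        """Keys whose retention period has passed as of the given date."""
        return sorted(k for k, m in self.entries.items() if m["expires"] < today)


catalog = Catalog()
payload = b"id,email\n1,a@example.com\n"
first = catalog.register(
    "raw/crm/2024-01-15/customers.csv", payload, "personal", date(2024, 1, 15)
)
dup = catalog.register(
    "raw/crm/2024-02-01/customers_copy.csv", payload, "personal", date(2024, 2, 1)
)
```

Here the second upload carries the same bytes as the first, so `register` returns the original key instead of cataloging a duplicate. Calling `catalog.expired(...)` with a later date lists the entries that are due for review or deletion under the assumed retention policy.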
