Banner Image

Data lakes provide a complete and authoritative data
store that can power data analytics, business intelligence,
and machine learning

Building a lakehouse with Delta Lake

To build a successful lakehouse, organizations have turned to Delta Lake, an open format data management and governance layer that combines the best of both data lakes and data warehouses. Across industries, enterprises are leveraging Delta Lake to power collaboration by providing a reliable, single source of truth. By delivering quality, reliability, security and performance on your data lake — for both streaming and batch operations — Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that eliminates data silos and provides self-serving analytics to end-users.

Data lakes vs. Data lakehouses vs. Data warehouses

  1. #

    Types of data

    Cost

    Format

    Scalability

    Intended users

    Reliability

    Ease of use

    Performance

  2. Data lake

    All types: Structured data, semi-structured data, unstructured (raw) data

    $

    Open format

    Scales to hold any amount of data at low cost, regardless of type

    Limited: Data scientists

    Low quality, data swamp

    Difficult: Exploring large amounts of raw data can be difficult without tools to organize and catalog the data

    Poor

  3. Data lakehouse

    All types: Structured data, semi-structured data, unstructured (raw) data

    $

    Open format

    Scales to hold any amount of data at low cost, regardless of type

    Unified: Data analysts, data scientists, machine learning engineers

    High quality, reliable data

    Simple: Provides simplicity and structure of a data warehouse with the broader use cases of a data lake

    High

  4. Data warehouse

    Structured data only

    $$$

    Closed, proprietary format

    Scaling up becomes exponentially more expensive due to vendor costs

    Limited: Data analysts

    High quality, reliable data

    Simple: Structure of a data warehouse enables users to quickly and easily access data for reporting and analytics

    High

Lakehouse best practices

Use the data lake as a landing zone for all of your data

Save all of your data into your data lake without transforming or aggregating it to preserve it for machine learning and data lineage purposes.

Mask data containing private information before it enters your data lake

Personally identifiable information (PII) must be pseudonymized in order to comply with GDPR and to ensure that it can be saved indefinitely

Secure your data lake with role- and view-based access controls

Adding view-based ACLs (access control levels) enables more precise tuning and control over the security of your data lake than role-based controls alone.

Build reliability and performance into your data lake by using Delta Lake

The nature of big data has made it difficult to offer the same level of reliability and performance available with databases until now. Delta Lake brings these important features to data lakes.

Catalog the data in your data lake

Use data catalog and metadata management tools at the point of ingestion to enable self-service data science and analytics

Shell has been undergoing a digital transformation as part of our ambition to deliver more and cleaner energy solutions. As part of this, we have been investing heavily in our data lake architecture. Our ambition has been to enable our data teams to rapidly query our massive data sets in the simplest possible way. The ability to execute rapid queries on petabyte scale data sets using standard BI tools is a game changer for us.

—Dan Jeavons, GM Data Science, Shell

Read the full story →

Resources

Delta Lake

Webinars

Documentation