Paper Notes -- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Key traits of the new warehouse architecture: open, directly accessible data formats; first-class support for machine learning frameworks; and strong performance.
This paper argues that the data warehouse architecture as we know it today will wane in the coming years and be replaced by a new architectural pattern, which we refer to as the Lakehouse, characterized by
(i) open direct-access data formats, such as Apache Parquet and ORC, (ii) first-class support for machine learning and data science workloads, and (iii) state-of-the-art performance.
Let's walk through how the data warehouse architecture has evolved.
First generation: structured data warehouses with schema-on-write. Drawbacks: compute and storage are coupled, and unstructured data cannot be handled at all.
The history of data warehousing started with helping business leaders get analytical insights by collecting data from operational databases into centralized warehouses,
which then could be used for decision support and business intelligence (BI).
Data in these warehouses would be written with schema-on-write, which ensured that the data model was optimized for downstream BI consumption.
We refer to this as the first generation data analytics platforms.
A decade ago, the first-generation systems started to face several challenges.
First, they typically coupled compute and storage into an on-premises appliance.
This forced enterprises to provision and pay for the peak of user load and data under management, which became very costly as datasets grew.
Second, not only were datasets growing rapidly, but more and more datasets were completely unstructured, e.g., video, audio, and text documents, which data warehouses could not store and query at all.
Second generation: raw data is loaded directly into a data lake, a schema-on-read approach that pushes the problem to the read side. Data lakes come in two flavors: traditional data lakes on HDFS, and cloud data lakes on object storage.
The second generation subsumes the first: BI workloads still require ETLing data from the lake into a warehouse, but systems such as machine learning frameworks can read the raw data directly, which is more flexible.
To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes:
low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9].
This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage.
The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand,
punted the problem of data quality and governance downstream.
In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications. The use of open formats also made data lake data directly accessible to a wide range of other analytics engines, such as machine learning systems [30, 37, 42].
From 2015 onwards, cloud data lakes, such as S3, ADLS and GCS, started replacing HDFS.
They have superior durability (often >10 nines), geo-replication, and most importantly, extremely low cost with the possibility of automatic, even cheaper, archival storage, e.g., AWS Glacier.
The rest of the architecture is largely the same in the cloud as in the second generation systems, with a downstream data warehouse such as Redshift or Snowflake.
This two-tier data lake + warehouse architecture is now dominant in the industry in our experience (used at virtually all Fortune 500 enterprises).
The second generation clearly introduces two storage tiers, which adds complexity and brings more drawbacks:
Reliability. Keeping the data lake and warehouse consistent is difficult and costly.
Data staleness. The data in the warehouse is stale compared to that of the data lake, with new data frequently taking days to load.
Limited support for advanced analytics. Despite much research on the confluence of ML and data management,
none of the leading machine learning systems, such as TensorFlow, PyTorch and XGBoost, work well on top of warehouses.
Unlike BI queries, which extract a small amount of data, these systems need to process large datasets using complex non-SQL code.
Total cost of ownership. Apart from paying for continuous ETL, users pay double the storage cost for data copied to a warehouse,
and commercial warehouses lock data into proprietary formats that increase the cost of migrating data or workloads to other systems.
The third generation is the Lakehouse described in this paper.
It tackles the problems above along three dimensions: reliable management of raw data directly on the data lake, and two access interfaces, SQL and DataFrames.
Reliable data management on data lakes:
A Lakehouse needs to be able to store raw data, similar to today’s data lakes, while simultaneously supporting ETL/ELT processes that curate this data to improve its quality for analysis.
Traditionally, data lakes have managed data as “just a bunch of files” in semi-structured formats, making it hard to offer some of the key management features that simplify ETL/ELT in data warehouses,
such as transactions, rollbacks to old table versions, and zero-copy cloning.
However, a recent family of systems such as Delta Lake [10] and Apache Iceberg [7] provide transactional views of a data lake, and enable these management features.
Of course, organizations still have to do the hard work of writing ETL/ELT logic to create curated datasets with a Lakehouse, but there are fewer ETL steps overall,
and analysts can also easily and performantly query the raw data tables if they wish to, much like in first-generation analytics platforms.
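To make these management features concrete, here is a minimal PySpark sketch using open-source Delta Lake; it is my illustration, not taken from the paper, and the table path, schema, and session settings are assumptions. Each write is an ACID transaction recorded in the table's log, and earlier versions stay queryable via time travel.

```python
# Minimal sketch, assuming the delta-spark package is installed (pip install delta-spark).
# Paths, columns, and sample data are made up for illustration.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write curated records as a Delta table; the write is an ACID transaction.
raw = spark.createDataFrame(
    [(1, "click", "2020-01-01"), (2, "view", "2020-01-02")],
    ["user_id", "event", "date"],
)
raw.write.format("delta").mode("overwrite").save("/tmp/events")

# Append more data in a second transaction.
spark.createDataFrame([(3, "click", "2020-01-03")], ["user_id", "event", "date"]) \
    .write.format("delta").mode("append").save("/tmp/events")

# Time travel: read the table as of an earlier version (rollback-style access).
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events").show()
```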
Support for machine learning and data science:
ML systems’ support for direct reads from data lake formats already places them in a good position to efficiently access a Lakehouse.
In addition, many ML systems have adopted DataFrames as the abstraction for manipulating data,
and recent systems have designed declarative DataFrame APIs [11] that enable performing query optimizations for data accesses in ML workloads.
These APIs enable ML workloads to directly benefit from many optimizations in Lakehouses.
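As a rough illustration of this access pattern (my sketch, not the paper's), the filter and column projection below stay lazy, so the engine can push them down into the Parquet files before any data reaches the ML code; the path and column names are assumptions.

```python
# Sketch: declarative DataFrame access over an open data-lake format.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ml-read-sketch").getOrCreate()

features = (
    spark.read.parquet("/data/lake/events")   # direct read of the open format
    .where(F.col("date") >= "2020-01-01")     # lazy filter, pushed down to Parquet
    .select("user_id", "event")               # column pruning
)

# Only the filtered, projected subset is materialized for the ML framework,
# e.g. handed to pandas / scikit-learn / PyTorch / XGBoost as a local table.
pdf = features.toPandas()
```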
SQL performance:
Lakehouses will need to provide state-of-the-art SQL performance on top of the massive Parquet/ORC datasets
that have been amassed over the last decade (or in the long term, some other standard format that is exposed for direct access to applications).
In contrast, classic data warehouses accept SQL and are free to optimize everything under the hood, including proprietary storage formats.
Nonetheless, we show that a variety of techniques can be used to maintain auxiliary data about Parquet/ORC datasets and to optimize data layout within these existing formats to achieve competitive performance.
We present results from a SQL engine over Parquet (the Databricks Delta Engine [19]) that outperforms leading cloud data warehouses on TPC-DS.
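As a concrete, hedged example of such techniques: Delta Lake keeps per-file min/max column statistics in its log for data skipping, and on Databricks (or recent open-source Delta releases) an OPTIMIZE ... ZORDER BY command re-clusters the Parquet files so those statistics become more selective. The table and column names below are illustrative assumptions.

```python
# Sketch: auxiliary statistics and data layout optimization on Parquet via Delta Lake.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

spark = configure_spark_with_delta_pip(
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

# Register an existing Delta table (Parquet files plus a transaction log).
spark.sql("CREATE TABLE IF NOT EXISTS events USING delta LOCATION '/tmp/events'")

# Compact small files and Z-order by the columns most often used in predicates.
spark.sql("OPTIMIZE events ZORDER BY (date, user_id)")

# Predicates on the Z-ordered columns can now skip most files using the
# per-file min/max statistics kept in the Delta log.
spark.sql("SELECT count(*) FROM events WHERE date = '2020-01-03'").show()
```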
For these three dimensions, the Databricks/Spark stack has three corresponding projects: Delta Lake, Delta Engine, and the Databricks ML Runtime.
The rest of the paper offers little that is new, so I will not cover it further here.
In this section, we sketch one possible design for Lakehouse systems, based on three recent technical ideas that have appeared in various forms throughout the industry.
We have been building towards a Lakehouse platform based on this design at Databricks through the Delta Lake, Delta Engine and Databricks ML Runtime projects [10, 19, 38].
Other designs may also be viable, however, as are other concrete technical choices in our high-level design (e.g., our stack at Databricks currently builds on the Parquet storage format,
but it is possible to design a better format). We discuss several alternatives and future directions for research.
Delta Engine
https://databricks.com/blog/2020/06/24/introducing-delta-engine.html
Delta Engine consists of three components: a query optimizer that extends Spark 3.0, a caching layer, and a native execution engine called Photon.
The improved query optimizer extends the functionality already in Spark 3.0 (cost-based optimizer, adaptive query execution, and dynamic runtime filters) with more advanced statistics to deliver up to 18x increased performance in star schema workloads.
Delta Engine’s caching layer automatically chooses which input data to cache for the user, transcoding it along the way in a more CPU-efficient format to better leverage the increased storage speeds of NVMe SSDs. This delivers up to 5x faster scan performance for virtually all workloads.
However, the biggest innovation in Delta Engine to tackle the challenges facing data teams today is the native execution engine, which we call Photon. (We know. It’s in an engine within the engine…) This completely rewritten execution engine for Databricks has been built to maximize the performance from the new changes in modern cloud hardware. It brings performance improvements to all workload types, while remaining fully compatible with open Spark APIs.
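Not from the blog post itself, but as a rough sketch of how the first two components surface to users: adaptive and cost-based optimization are toggled through standard Spark 3.0 settings, and the caching layer through a Databricks IO-cache setting, while Photon is enabled at the cluster/runtime level rather than from user code. The config names below are assumptions about a Databricks environment, where `spark` is predefined in a notebook.

```python
# Hedged sketch, assuming a Databricks notebook where `spark` already exists.
spark.conf.set("spark.sql.adaptive.enabled", "true")         # adaptive query execution (Spark 3.0)
spark.conf.set("spark.sql.cbo.enabled", "true")              # cost-based optimizer
spark.conf.set("spark.databricks.io.cache.enabled", "true")  # NVMe-backed Delta/Parquet cache

# Repeated scans of the same data are then served from the locally transcoded
# cache when possible instead of re-reading cloud object storage.
spark.sql("SELECT event, count(*) FROM events GROUP BY event").show()
```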