Data Warehouse Series - Layered Implementation of Data Warehouse

Data Warehouse Series - Layered Implementation of Data Warehouse#

Content organized from:

“Layered Implementation of Data Warehouse Based on Big Data”

Significance of Data Warehouse Layering#

Clear data structure: Each data layer has its own scope, making it easier for us to locate and understand when using tables.
Data lineage tracking: Simply put, we ultimately present a business table that can be used directly, but it has many sources. If there is an issue with a source table, we hope to quickly and accurately locate the problem and understand its impact.
Reduce duplicate development: Standardizing data layering and developing some common intermediate layer data can greatly reduce redundant calculations.
Simplify complex problems. Break a complex task into multiple steps, with each layer handling a single step, making it simpler and easier to understand. It also facilitates maintaining data accuracy; when data issues arise, we can start fixing from the problematic step without needing to repair all data.
Shield original data anomalies.
Shield business impacts; there is no need to re-integrate data every time a business change occurs.

Four-Layer Model#

ODS Layer#

ODS (Operational Data Store)

This is the layer closest to the data source. The data from the source, after extraction, cleansing, and transmission (i.e., after ETL), is loaded into this layer. Generally, the data in the ODS layer is isomorphic to the source system data, primarily aimed at simplifying subsequent data processing tasks. In terms of data granularity, the data in the ODS layer is the finest.

Tables in the ODS layer typically include two types:

One for storing the current data that needs to be loaded.
One for storing historical data after processing.

Historical data is generally retained for 3-6 months before being cleared to save space. However, different projects should be treated differently; if the source system's data volume is not large, it can be retained for a longer time or even fully preserved.

Before loading data into this layer, the following tasks need to be performed:

Noise reduction (e.g., removing obviously abnormal bank card information).
Deduplication (e.g., names of individuals are included in both bank account information and public security population information, but only one copy should be retained).
Data cleansing (e.g., if a person's bank card is fraudulently used, with two transactions occurring within ten minutes in China and Japan, this is dirty data).
Business extraction.
Unit standardization.
Field reduction (e.g., fields used to support front-end system operations but not needed in data mining).
Business discrimination.

DW#

Data Warehouse Layer (DW), the main body of the data warehouse.

Typically, various data models are established based on themes, such as star or snowflake structures.

For methods related to dimensional modeling and star, snowflake models, please refer to: A Brief Discussion on Data Modeling Methods in Data Warehouse Construction

DM Layer#

Data Mart Layer (DM), also known as DWS (Data Warehouse Service) or thematic layer, stores lightly aggregated data.

Typically divided according to business needs into traffic, orders, users, etc., generating wide tables with many fields for subsequent business queries, OLAP analysis, data distribution, etc.

In terms of data granularity, this layer contains lightly summarized data, with no detailed data remaining. In terms of time span, it is usually part of the DW layer, primarily aimed at meeting user analysis needs. From an analytical perspective, users typically only need to analyze recent years (e.g., data from the last three years). In terms of breadth, it still covers all business data.

APP Layer#

Data Product Layer (APP), this layer provides result data for data products.

In terms of data granularity, it consists of highly summarized data. In terms of breadth, it does not necessarily cover all business data. In extreme cases, a model can be built for each report at the APP layer to support it, achieving the goal of trading space for time.

The application layer is based on business needs, derived from the statistics of the previous three layers, and can directly provide query displays or be imported into MySQL for use. The report data we often refer to, or those large wide tables, are generally placed here.

A system for managing metadata information is needed, capable of providing convenient metadata operations and query operations. It mainly provides data for data products and data analysis, typically stored in systems like ES, MySQL for online system use, and may also exist in Hive or Druid for data analysis and data mining.

Three-Layer Model#

Similar to the four-layer model, but with more logic placed in the DW layer.

Buffer Data Cache Layer#

This layer is used to store the raw data provided by the interface party. The table structure of this layer remains basically consistent with the source data. The data retention time depends on the data volume and project situation. If the data volume is large, only recent data may be stored, while historical data is backed up. The purpose of this layer is for data transfer and backup.