Data Warehouse Series - Data Modeling Methods#
Content organized from:
Basic Concepts of Data Models#
A data model is a tool and method for abstractly describing the real world. It represents the interrelationships of affairs in the real world through abstract entities and the relationships between them. Here, the abstraction represented by the data model is the relationships between entities, which express specific business relationships in actual business through the definition and description of the relationships between entities. It can be divided into the following levels:
From the diagram above, we can easily see that in the entire modeling process of the data warehouse, we need to go through generally four processes:
-
Business modeling, generating business models, mainly addressing the decomposition and programmability at the business level.
-
Domain modeling, generating domain models, mainly abstracting the business model to create a domain conceptual model.
-
Logical modeling, generating logical models, mainly logicalizing the conceptual entities of the domain model and the relationships between entities at the database level.
-
Physical modeling, generating physical models, mainly addressing the physical realization of the logical model for different relational databases and specific technical issues such as performance.
Normalization Modeling#
In the design of data warehouse models, the third normal form is generally adopted, which has a strict mathematical definition. From its expressed meaning, a relationship that conforms to the third normal form must meet the following three conditions:
- Each attribute value is unique and unambiguous;
- Each non-key attribute must fully depend on the entire primary key, not just part of it;
- Each non-key attribute cannot depend on attributes from other relationships, as such attributes should belong to other relationships.
Dimensional Modeling#
The simplest description is to build a data warehouse and data mart according to fact tables and dimension tables, such as Star Schema
, Snowflake Schema
, etc.
Star Schema#
The architecture in the above diagram is a typical star schema. The star schema is a denormalized structure where each dimension of the multidimensional dataset is directly connected to the fact table, with no gradual dimensions, resulting in some data redundancy. For example, in the geographic dimension table, there are two records for city C and city D in province B of country A, meaning that the information for country A and province B is stored twice, hence redundancy.
Characteristics
-
There is only one fact table, and each dimension has a separate table.
-
Each tuple in the fact table is a foreign key pointing to the primary key of the dimension table.
-
The columns of each dimension table consist of all attributes that make up that dimension.
-
The fact table and dimension tables are associated through primary and foreign keys, with no associations between dimension tables, resembling many stars revolving around a central star, hence the name star schema.
Advantages
-
The star schema is the simplest and most commonly used model.
-
Since the star schema consists of only one large table, it is more suitable for big data processing compared to other models.
-
Other models can be transformed into a star schema through certain conversions.
Snowflake Schema#
When one or more dimension tables are not directly connected to the fact table but are connected through other dimension tables, the diagram resembles multiple snowflakes connected together, hence the name snowflake schema. The snowflake schema further hierarchizes the dimension tables of the star schema, where the original dimension tables may be expanded into smaller fact tables, forming some local "hierarchical" areas, with these decomposed tables connected to the main dimension table rather than the fact table.
As shown in the diagram above, the geographic dimension table is further decomposed into country, province, city, and other dimension tables. Its advantage is: by minimizing data storage and joining smaller dimension tables, it improves query performance and eliminates data redundancy.
Most of the time, data warehouses are more suitable for constructing underlying Hive tables using the star schema, enhancing query efficiency through significant redundancy. The star schema is quite friendly to OLAP analysis engines, which is particularly evident in Kylin. The snowflake schema is very common in relational databases like MySQL and Oracle, especially in e-commerce database tables.
Fact Constellation Model#
The constellation model is a more complex model that includes multiple fact tables, with dimension tables being shared among multiple fact tables. This pattern can be viewed as a collection of star schemas, hence referred to as galaxy schema or fact constellation.
Summary#
Advantages of Dimensional Modeling:
Dimensional modeling is very intuitive, closely revolving around the business model, and can intuitively reflect business issues within the business model. It does not require special abstract processing to complete dimensional modeling.
Disadvantages of Dimensional Modeling:
-
A significant amount of data preprocessing is required before constructing the star schema, leading to a considerable amount of data processing work.
-
When business changes occur and the definition of dimensions needs to be redone, it often requires reprocessing of dimension data. During these processing stages, significant data redundancy can occur.
-
Relying solely on dimensional modeling cannot guarantee the consistency and accuracy of data sources, and at the bottom layer of the data warehouse, it is not particularly suitable for dimensional modeling methods.
The field of dimensional modeling is mainly applicable to the data mart layer, and its greatest role is actually to address performance issues in data warehouse modeling. Dimensional modeling struggles to provide a complete abstract method for describing the complex relationships between real business entities.