Data Warehouse Series - Basic Concept Organization

Data Warehouse Series - Basic Concept Organization#

Content organized from:

Basic Concepts of Data Warehouse

I. Data Warehouse and Data Mart#

Data Warehouse (Data Warehouse)#

A Data Warehouse is a theoretical framework for information system data storage, emphasizing the use of certain special data storage methods to make the contained data particularly beneficial for analysis and processing, thereby generating valuable information for decision-making.
A data warehouse is a subject-oriented, integrated, non-updatable (stable), time-variant (different times) data collection, used to support decision support systems (DDS: Decision Support System, mainly reporting systems) in business management.
Data stored in a data warehouse must include time attributes and has the characteristic that once stored, it does not change over time. However, the data in the warehouse may continuously grow and change over time.

Data Mart (Data Market)#

A small-scale data warehouse at the departmental or workgroup level.

Comparison#

	Data Warehouse	Data Mart
Data Source	Legacy systems, OLTP systems, external data	Data Warehouse
Scope	Enterprise-level	Departmental or workgroup level
Subject	Enterprise subject	Department or specific analysis subject
Data Granularity	Finest granularity	Coarser granularity
Data Structure	Normalized structure (third normal form)	Star schema, snowflake schema, or a mix of both
Historical Data	Large amounts of historical data	Moderate historical data
Optimization	Handling massive data, data exploration	Easy access and analysis, fast queries
Indexing	Highly indexed	Highly indexed

II. Dimensions and Measures#

Dimension:
Refers to the perspective from which data is examined, usually an attribute of data records, such as time, location, etc. Generally, it is a set of discrete values, such as each independent date in the time dimension or each independent product in the product dimension. Therefore, during statistics, records with the same dimension value can be aggregated together, and then aggregation functions can be applied for summation, averaging, deduplication counting, and other aggregation calculations.

Cardinality of Dimension:
Refers to the number of different values that appear for that dimension in the dataset. For example, "country" is a dimension; if there are 200 different values, then the cardinality of this dimension is 200. Typically, the cardinality of a dimension can range from dozens to tens of thousands, with some dimensions like "user ID" having cardinalities exceeding millions or even tens of millions. Dimensions with cardinalities exceeding one million are often referred to as Ultra High Cardinality Dimensions (UHC).

Measure:
A calculated value based on data; it is generally a continuous value, such as total sales, number of different users, etc.

Example:
In an SQL query, the attributes in the GroupBy clause are usually dimensions, while the calculated values are measures. For example:

SELECT
department,
site_id,
sum(price) AS total_sales,
count(DISTINCT seller_id) AS sellers
FROM kylin_sales
GROUP BY department, site_id

In the above query, department and site_id are dimensions, while sum(price) and count(distinct seller_id) are measures.

III. Basic Concepts of Cube#

Given a data model, we can combine all dimensions on it. For N dimensions, there are a total of 2^n combinations.

For a combination of dimensions, the measures are aggregated, and the results are saved as a materialized view, referred to as a Cuboid.
All combinations of dimensions' Cuboids as a whole are referred to as a Cube. In simple terms, a Cube is a collection of many materialized views aggregated by dimensions.
A Cube Segment refers to the Cube data calculated for a specific segment of the source data. Typically, the amount of data in a data warehouse grows over time, and Cube Segments are also constructed in chronological order.

IV. Hierarchy, Level, and Member#

A Cube is like a coordinate system, where each Hierarchy represents a coordinate axis, and each Member within the Hierarchy represents a value on that axis. The following diagram illustrates the internal structure of a Dimension using the time dimension as an example:

Dimensions are used to describe the characteristics of things from different perspectives, and generally, dimensions will have multiple levels (Level), with each Level containing some common or specific attributes (Attribute). The structure and composition of dimensions can be illustrated in the following diagram:

Taking the time dimension as an example, it typically includes levels for year, quarter, month, and day. Each Level generally has common attributes such as ID, NAME, and DESCRIPTION. These common attributes apply not only to the time dimension but also to various other types of dimensions. The ID is generally regarded as a surrogate primary key (Agent), used solely as a uniqueness marker and serves as a proxy for relationships in the multidimensional model, having no business significance; NAME is generally the business primary key (Business), which restricts uniqueness at the business level and is often used as a join key during data loading (Load); DESCRIPTION records detailed descriptive information, and we typically choose to use DESCRIPTION to express specific meanings during multidimensional display and analysis.

The above structure of dimensions is not directly applicable to OLAP; OLAP requires hierarchical top-down drilling or bottom-up aggregation. Therefore, each dimension must have a Hierarchy, at least one default, but can have multiple, as shown in the following diagram:

With a Hierarchy, the Levels within the dimension have a top-down tree structure relationship, meaning each upper-level member (Member) will contain zero or more members from the lower level, representing the branches of the tree. It is important to note that the root node of each Hierarchy tree is generally set to be the summary of all members (Total). When the dimension is not used in OLAP, the default display is the summary node of that dimension, which is the aggregation of all data in that dimension (or that the dimension has not been used for segmentation). Each level in the Hierarchy contains several members (Member). For the time dimension, if we build a time dimension spanning from 2006 to 2015, the highest level node would have only one Total member, encompassing all ten years, while the year level would contain members for 2006, 2007…2015, with each year containing four quarter members, and each quarter containing three month members… This seems logical, and we can perform some OLAP operations based on the Hierarchy.

V. Data Cube#

A Data Cube is a technique commonly used for data analysis and indexing; it can establish multidimensional indexes on raw data, significantly speeding up data query efficiency through Cube analysis.

The Data Cube is merely a figurative term for a multidimensional model. A cube itself has only three dimensions, but the multidimensional model is not limited to three dimensions and can combine more dimensions. On one hand, this is for easier explanation and description, while on the other hand, it provides space for mental imaging and imagination; it also serves to distinguish it from traditional relational databases' two-dimensional tables.

Basic Operations#

Drill-down: Changes between different levels of the dimension, moving down from the upper level to the next level, or breaking down summary data into more detailed data. For example, drilling down from the total sales data for the second quarter of 2010 to view the consumption data for each month (April, May, June) of that quarter, as shown in the figure; of course, one can also drill down into provinces like Zhejiang to view sales data for cities such as Hangzhou, Wenzhou, etc.
Roll-up: The reverse operation of drilling down, aggregating from fine-grained data to higher levels, such as summarizing sales data from Jiangsu Province, Shanghai City, and Zhejiang Province to view sales data for the Jiangsu-Zhejiang-Shanghai region, as shown in the figure.
Slice: Selecting specific values within a dimension for analysis, such as only selecting sales data for electronic products or data for the second quarter of 2010.
Dice: Selecting data within a specific range or a specific set of values within a dimension for analysis, such as selecting sales data from the first quarter of 2010 to the second quarter of 2010, or sales data for electronic products and daily necessities.
Pivot: Swapping the positions of dimensions, similar to transposing rows and columns in a two-dimensional table, as shown in the figure by swapping the product dimension and geographic dimension.

VI. Fact Table and Dimension Table#

Fact Table (FactTable)#

A Fact Table is a table that stores factual records, containing specific elements of each event and what happened, such as system logs, sales records, etc.
The Fact Table does not store actual content; it is a collection of primary keys, with these IDs corresponding to a record in the Dimension Table.
Records in the Fact Table are dynamically growing, so its volume is usually much larger than other tables.
Operational events occurring in the real world generate measurable values, which are stored in the Fact Table. From the lowest granularity level, each row in the Fact Table corresponds to a measure event, and vice versa.

Dimension Table (DimensionTable)#

A Dimension Table, sometimes referred to as a Lookup Table, is descriptive information about the elements of events in the Fact Table.
It stores the attribute values of dimensions and can be associated with the Fact Table; it essentially extracts and standardizes frequently repeated attributes from the Fact Table into a single table for management.
Common Dimension Tables include: date tables (storing attributes corresponding to weeks, months, quarters, etc.), location tables (containing attributes such as country, province/state, city, etc.).
Each Dimension Table contains a single primary key column. The primary key of the Dimension Table can serve as a foreign key for any associated Fact Table, and the descriptive environment of Dimension Table rows should correspond completely to the rows of the Fact Table. Dimension Tables are usually wide, flat non-normalized tables containing a large number of low-granularity text attributes.

VII. OLAP and OLTP#

OLAP (Online Analytical Processing)#

Online Analytical Processing is a collection of various operations aimed at analysis based on a multidimensional model of the data warehouse, capable of flexibly providing Roll-up, Drill-down, and Pivot analysis operations. It is a method for presenting integrated decision-making information, commonly used in decision support systems, business intelligence, or data warehouses.
Its function is to facilitate large-scale data analysis and statistical calculations, providing references and support for decision-making.
OLAP requires a large amount of historical data as a foundation, combined with time point differences, to perform complex analyses on multidimensional and aggregated information.
OLAP requires users to have subjective definitions of information needs, thus resulting in better system efficiency.

The concept of OLAP has both broad and narrow interpretations in practical applications. The broad interpretation aligns with its literal meaning, referring to all analytical processes that do not update data. However, more often, OLAP is understood in its narrow sense, specifically related to multidimensional analysis based on Cube calculations.

Types#

MOLAP (Multidimensional), Multidimensional OLAP#

Physically stores multidimensional data used in OLAP analysis as multidimensional arrays, forming a "cube" structure, where the attribute values of dimensions are mapped to the subscript values or ranges of the multidimensional array, and summary data is stored as values in the array cells.
This structure, when highly optimized, can maximize query performance.
Due to the new storage structure implemented at the physical layer, it is also referred to as Physical OLAP; ROLAP primarily achieves this through software tools or middleware, still using relational database storage structures, thus called Virtual OLAP.
The advantage of MOLAP lies in the high efficiency of data computation during analysis due to multidimensional preprocessing, while its main drawback is a certain delay in data updates.

ROLAP (Relational), Relational OLAP#

Represents OLAP implementations based on relational databases. It centers on relational databases, using relational structures to represent and store multidimensional data. ROLAP divides the multidimensional structure of the multidimensional database into two types of tables: one is the Fact Table, used to store data and dimension keywords; the other is the Dimension Table, which uses at least one table for each dimension to store hierarchical, member category, and other descriptive information about the dimension.
Dimension Tables and Fact Tables are linked together through primary keys and foreign keys, forming a "star schema."
For complex hierarchical dimensions, to avoid excessive storage space occupied by redundant data, multiple tables can be used for description; this extension of the star schema is referred to as the "snowflake schema."
The RDBMS used as a ROLAP storage system is also optimized for OLAP, such as parallel storage, parallel querying, parallel data management, cost-based query optimization, bitmap indexing, and SQL OLAP extensions (cube, rollup), etc.

HOLAP (Hybrid) Hybrid OLAP#

Represents OLAP implementations based on hybrid data organization (Hybrid OLAP), allowing users to choose which models to use ROLAP and which to use MOLAP based on their business needs.

Comparative Analysis#

Generally, less frequently used or flexibly defined analyses are performed using ROLAP, while commonly used, standardized models are implemented using MOLAP. For example, detailed data is retained in the relational database's Fact Table, while aggregated data is stored in the Cube. Aggregation may take more time than ROLAP, but query efficiency is higher than ROLAP, but lower than MOLAP.

OLTP (Online Transaction Processing)#

Involves small data volumes, frequent DML operations, and many parallel transactions, but generally, they are very short.
Online Transaction Processing focuses more on basic, daily transaction processing, including data insertion, deletion, modification, and querying.

Differences Between OLAP and OLTP#

Similar to the distinction between databases and data warehouses:

The design goal of database systems is transaction processing. Database systems are designed for recording updates and transaction processing, characterized by primary key-based access, numerous atomic, isolated small transactions, with concurrency and recoverability as key attributes, and maximum transaction throughput as a critical metric; thus, the design of databases reflects these needs.
The design goal of data warehouses is decision support. Historical, summarized, aggregated data is much more important than raw records. Query loads primarily focus on ad-hoc queries and complex queries involving joins, aggregations, etc. Compared to database systems, query throughput and response time are much more important than transaction processing throughput.

Data Processing Type	OLTP	OLAP
Target Audience	Business developers	Analytical decision-makers
Functionality	Daily transaction processing	Analysis-oriented decision-making
Data Model	Relational model	Multidimensional model
Data Volume	A few or dozens of records	Millions to tens of millions of records
Operation Types	Query, insert, update, delete	Primarily queries

VIII. Metadata#

Metadata is data about data. When people describe phenomena in the real world, abstract information is generated, which can be considered metadata. Metadata is mainly used to describe the contextual information of data. In simple terms, if the content of each book in a library is data, then finding the index for each book is metadata. Metadata can be divided into three categories based on the objects they describe: Technical Metadata, Business Metadata, and Management Metadata:

Technical Metadata: Describes concepts, relationships, and rules related to the technical field within the data system, mainly including descriptions of data structure and data processing characteristics, covering all data processing links such as data source interfaces, data warehouse and data mart storage, ETL, OLAP, data encapsulation, and front-end presentation;
Business Metadata: Describes concepts, relationships, and rules related to the business field within the data system, mainly including business terminology, information classification, metric definitions, and business rules;
Management Metadata: Describes concepts, relationships, and rules related to the management field within the data system, mainly including personnel roles, job responsibilities, and management processes.

For specific reference, see: Let's Talk About Metadata Management Systems in Data Warehouses