Basics of Flink#
Stateful Stream Processing#
Concept#
Traditional batch processing continuously collects data, divides it into batches by time, and periodically runs a batch computation. Issues that can arise include:
- If batches are cut by the hour and we need to count conversions for a specific event, how do we produce the statistics when a conversion starts at the 58th minute and finishes at the 7th minute of the next hour, i.e. spans two batches?
- The delay between when data is generated and when it is received is not fixed; an event that occurred before event A may arrive after A. How do we handle this out-of-order arrival?
The ideal approach is:
- A `State` mechanism that can accumulate and maintain state. The accumulated state represents all historical events received so far, and it influences the output results.
- A `Time` mechanism that can guarantee data completeness, for example by emitting results only after all data within a given time period has been received.
This is what is known as stateful stream processing.
Challenges of Stateful Stream Processing#
- State Fault Tolerance
- State Management
- Event-time Processing
- Savepoints and Job Migration
State Fault Tolerance#
Exactly-once fault tolerance guarantee in a simple scenario:
Data enters as an unbounded stream and is processed by a single Process. To give this Process an exactly-once state fault-tolerance guarantee, a snapshot is taken after each record has been processed and the state has been updated; the snapshot records the current position in the input queue together with the corresponding state. Such a consistent snapshot is what guarantees exactly-once.
Distributed State Fault Tolerance#
- In a distributed scenario, multiple nodes modify their local state, yet only one `Global consistent snapshot` is produced.
- Fault-tolerant recovery is based on the `checkpoint` mechanism.
- An extension of the Chandy-Lamport algorithm implements `distributed snapshots`, and Flink can continuously produce `Global consistent snapshots` without interrupting computation. The general approach is that Flink inserts checkpoint barriers into the data stream; each operator saves its state when it receives checkpoint barrier N, so a checkpointed state is built up from the data source all the way to the end of the computation. Meanwhile, checkpoint barriers N+1 and N+2 move through the data stream at the same time, so checkpoints are generated continuously without blocking computation (a configuration sketch follows this list).
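As a concrete illustration, here is a minimal sketch of turning on checkpointing in the DataStream API, assuming Flink 1.x; the interval and timeout values are arbitrary examples:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject a checkpoint barrier into the stream every 10 seconds.
        env.enableCheckpointing(10_000);

        CheckpointConfig config = env.getCheckpointConfig();
        // Align barriers so each operator snapshots a consistent state (exactly-once).
        config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // Leave at least 5 seconds between the end of one checkpoint and the start of the next.
        config.setMinPauseBetweenCheckpoints(5_000);
        // Abort a checkpoint (not the job) if it takes longer than one minute.
        config.setCheckpointTimeout(60_000);

        // ... define sources, transformations, sinks, then env.execute(...)
    }
}
```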
State Management#
Flink currently supports the following state backends (a configuration sketch follows the list):
- `JVM Heap State Storage`: suitable for small amounts of state, since state is kept directly on the JVM heap. Reads work on Java objects directly, with no serialization; serialization is only needed when a checkpoint has to place each operator's local state into the distributed snapshot.
- `RocksDB State Storage`: an out-of-core state backend. Reads of local state in the runtime go through disk, which effectively maintains the state on disk. The cost is that every state access requires serialization and deserialization, so performance may be somewhat lower.
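A minimal sketch of selecting a state backend, assuming Flink 1.13+ (where the heap backend is `HashMapStateBackend` and the RocksDB backend is `EmbeddedRocksDBStateBackend`, the latter requiring the `flink-statebackend-rocksdb` dependency); the checkpoint path is a placeholder:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000);

        // Option 1: working state lives as objects on the JVM heap (small state, fast reads).
        env.setStateBackend(new HashMapStateBackend());

        // Option 2: working state lives in embedded RocksDB on local disk (large state,
        // every access pays serialization/deserialization).
        // env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Checkpoints (the distributed snapshots) go to durable storage; the path is a placeholder.
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
    }
}
```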
Event-time Processing#
- `Processing Time`: the local clock time at which an event arrives at an operator and is processed; it can vary with network and scheduling delays.
- `Event Time`: the time at which the event actually occurred, taken from the timestamp carried by each record; it is fixed and never changes (see the sketch after these definitions).
- `Ingestion Time`: the time at which the event enters the Flink data stream at the source operator.
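A minimal sketch of opting into event time by extracting timestamps and generating watermarks, assuming Flink 1.11+; `ClickEvent` and its `timestampMillis` field are hypothetical:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeExample {

    /** Hypothetical record that carries its own event-time timestamp. */
    public static class ClickEvent {
        public String userId = "user-1";
        public long timestampMillis = 1_700_000_000_000L; // when the event actually occurred
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<ClickEvent> clicks = env.fromElements(new ClickEvent());

        // Use event time: read the timestamp from each record and tolerate events
        // that arrive up to 5 seconds out of order before the watermark passes them.
        clicks.assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, recordTs) -> event.timestampMillis))
              .print();

        env.execute("event-time-example");
    }
}
```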
Comparison
- Processing Time is simpler to handle, while Event Time is more complicated.
- When using Processing Time, the processing results (or the internal state of the stream processing application) are uncertain. However, due to various guarantees made by Flink for Event Time, using Event Time allows for relatively consistent and reproducible results, regardless of how many times data is replayed.
- When deciding between Processing Time and Event Time, one principle helps: if your application fails and has to replay from the last checkpoint or savepoint, do you need the results to be exactly the same? If so, you must use Event Time; if different results are acceptable, Processing Time is fine. A typical use case for Processing Time is measuring the system's throughput against wall-clock time, for example counting how many records were processed in one hour of real time; that can only be done with Processing Time.
State Saving and Migration#
This is implemented with the Savepoint mechanism. Savepoints are similar to checkpoints, except that a savepoint is triggered manually to save the global state. For more details, refer to:
Flink Real-time Computing - In-depth Understanding of Checkpoints and Savepoints
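One practical detail when relying on savepoints for job migration: give every operator a stable ID so the saved state can be matched back after the job graph changes. A minimal sketch, assuming the Flink 1.x DataStream API; the uid string is an arbitrary example:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "a")
           .map(value -> value.toUpperCase())
           // Stable uid: when the job restarts from a savepoint, possibly with a
           // modified topology, Flink maps the saved operator state back by this ID.
           .uid("uppercase-map")
           .print();

        env.execute("savepoint-uid-example");
    }
}
```

A savepoint is then triggered manually (for example through the Flink CLI), and a new version of the job can be started from it.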
Basic Concepts of Flink#
- `Streams`: streams are divided into bounded and unbounded data streams. An `unbounded stream` never ends, while a `bounded stream` is a finite data set of known size. Unbounded stream data keeps growing over time, the computation runs continuously, and there is no end state; bounded stream data is fixed in size, and the computation eventually completes and reaches an end state.
- `State`: state is the data produced during computation, and it plays an important role in fault-tolerant recovery and checkpoints. Stream computation is essentially incremental processing, so state must be continuously queried and maintained. In addition, to guarantee exactly-once semantics, data must be writable into state, and persisting that state guarantees exactly-once even when the distributed system fails or crashes; this is another value of state. (A keyed-state sketch follows this list.)
- `Time`: divided into `Event time`, `Ingestion time`, and `Processing time`. Flink's unbounded data streams are continuous processes, and time is an important basis for judging whether business state is lagging behind and whether data is being processed in a timely manner.
- `API`: APIs are usually divided into three layers, from top to bottom: `SQL / Table API`, `DataStream API`, and `ProcessFunction`. The closer an API is to the SQL layer, the weaker its expressiveness and the stronger its abstraction; conversely, the ProcessFunction layer is extremely expressive and allows all kinds of flexible operations, but offers the least abstraction.
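A minimal sketch of the lowest API layer combining both ideas: a `KeyedProcessFunction` that keeps a per-key count in managed `ValueState`, assuming the Flink 1.x DataStream API; the class and state names are arbitrary examples:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Counts events per key; the running count lives in keyed state and is checkpointed. */
public class CountPerKey extends KeyedProcessFunction<String, String, Tuple2<String, Long>> {

    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        // Register managed keyed state; Flink includes it in every checkpoint/savepoint.
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value,
                               Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = countState.value();          // null on the first element for this key
        long updated = (current == null ? 0L : current) + 1;
        countState.update(updated);                 // write back into managed state
        out.collect(Tuple2.of(value, updated));
    }
}
```

A keyed stream would apply it with something like `stream.keyBy(v -> v).process(new CountPerKey())`; the same logic could be expressed far more compactly at the SQL layer (roughly `SELECT v, COUNT(*) FROM events GROUP BY v`), which illustrates the expressiveness/abstraction trade-off described above.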
Flink Window#
- Count Window, driven by the number of elements
- Session Window, driven by session gaps
- Time Window, driven by time
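A minimal sketch of the three window types above on a keyed stream, assuming the Flink 1.x DataStream API; the input tuples, key field, and window sizes are arbitrary examples:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, event-time timestamp in milliseconds) -- hypothetical input records.
        KeyedStream<Tuple2<String, Long>, String> keyed = env
                .fromElements(Tuple2.of("a", 1_000L), Tuple2.of("a", 2_000L), Tuple2.of("b", 9_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((t, ts) -> t.f1))
                .keyBy(t -> t.f0);

        // Count Window: emit a result every 2 elements per key.
        keyed.countWindow(2).sum(1).print();

        // Session Window: close a window after a 10-second event-time gap per key.
        keyed.window(EventTimeSessionWindows.withGap(Time.seconds(10))).sum(1).print();

        // Time Window: tumbling 5-second windows in event time.
        keyed.window(TumblingEventTimeWindows.of(Time.seconds(5))).sum(1).print();

        env.execute("window-examples");
    }
}
```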