
Introduction to Flume and Basic Usage#

Content reference: Official Documentation

Basic Architecture of Flume#

[Figure: basic Flume architecture: external data source → Source → Channel → Sink → storage]
External data sources send events to Flume in a specific format. When a Source receives an event, it stores it in one or more Channels, which hold the event until it is consumed by a Sink. The main function of the Sink is to read events from the Channel and store them in an external storage system or forward them to the Source of the next Agent, removing the events from the Channel upon successful processing.
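To make the Source → Channel → Sink wiring concrete, below is the minimal single-agent configuration from the official documentation: a netcat Source, a memory Channel, and a logger Sink, tied together under the agent name a1.

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-terminated text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: write events to the agent's log (useful for testing)
a1.sinks.k1.type = logger

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Started with flume-ng agent --conf conf --conf-file example.conf --name a1, the agent logs every line it receives on port 44444 as an Event.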

Basic Concepts#

  • Event
  • Source
  • Channel
  • Sink
  • Agent

Event#

An Event is the basic unit of data transmission in Flume NG, similar to a message in JMS or other messaging systems. An Event consists of a header and a body: the former is a key/value map, the latter an arbitrary byte array.

Agent#

An Agent is an independent JVM process that hosts components such as Sources, Channels, and Sinks.

Source#

The Source is the data collection component: it gathers data from external data sources and writes it to the Channel. Flume ships with dozens of built-in source types, such as Avro Source, Thrift Source, Kafka Source, and JMS Source.
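As an illustration, a Kafka Source feeding channel c1 could be sketched as follows; the broker address and topic name are placeholders:

```properties
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
# Placeholder broker list and topic
a1.sources.r1.kafka.bootstrap.servers = kafka-host:9092
a1.sources.r1.kafka.topics = weblogs
# Maximum number of messages written to the channel in one batch
a1.sources.r1.batchSize = 1000
a1.sources.r1.channels = c1
```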

Channel#

A Channel is the pipeline between the Source and the Sink, used to buffer events temporarily. It can be backed by memory or by a persistent file system:

  • Memory Channel: keeps events in memory; fast, but data may be lost (e.g., if the process crashes suddenly);
  • File Channel: uses a persistent file system; no data is lost, but it is slower.

Built-in options include Memory Channel, JDBC Channel, Kafka Channel, and File Channel.
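For the persistent option, a File Channel might be configured along these lines; the checkpoint and data directories are illustrative paths:

```properties
a1.channels.c1.type = file
# Directory for checkpoint metadata (illustrative path)
a1.channels.c1.checkpointDir = /var/flume/checkpoint
# One or more directories for the write-ahead log files (illustrative path)
a1.channels.c1.dataDirs = /var/flume/data
a1.channels.c1.capacity = 1000000
```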

Sink#

The main function of the Sink is to read Events from the Channel and store them in an external storage system or forward them to the Source of the next Agent, removing the Event from the Channel upon successful processing. Built-in types include HDFS Sink, Hive Sink, HBase Sink, and Avro Sink.
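As a sketch, an HDFS Sink might look like this; the HDFS path and the roll settings are illustrative:

```properties
a1.sinks.k1.type = hdfs
# Illustrative target path; %Y-%m-%d is expanded from the event timestamp
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Use local time for the escape sequences if events carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Roll a new file every 10 minutes rather than by size or event count
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.channel = c1
```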

Flume Transactions#

Data is usually transmitted to the next node in batches. If an exception occurs at the receiving node, such as a network failure, the whole batch is rolled back and retransmitted; retransmission means the data is not lost, although it can produce duplicates. Within the same node, if an exception occurs while the Source is writing a batch of data to the Channel, the batch is not written, and any data already received in it is discarded, relying on the previous node to retransmit the data.

source -> channel: put transaction
channel -> sink: take transaction

Steps of a put transaction:

  • doPut: first write the batch of data into a temporary buffer called the putList.
  • doCommit: check whether the Channel has free space; if so, move the data in; if not, doRollback rolls the data back to the putList.

Steps of a take transaction:

  • doTake: read the data into a temporary buffer called the takeList and send it to the destination (HDFS in this example).
  • doCommit: determine whether the data was sent successfully; if so, clear the takeList. If not (e.g., the HDFS server crashes), doRollback returns the data in the takeList to the Channel.
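In configuration terms, the size of a single put or take transaction is bounded by the channel's transactionCapacity, so source and sink batch sizes should not exceed it. A sketch with illustrative values:

```properties
# The channel holds up to 10000 events; each transaction moves at most 1000
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# The sink batch size must not exceed the channel's transactionCapacity
a1.sinks.k1.hdfs.batchSize = 1000
```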

Reference link: Flume Transaction Analysis

Reliability of Flume#

When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantees, from strong to weak:

  • end-to-end: when data is received, the agent first writes the event to disk and deletes it only after successful delivery; if transmission fails, the event can be resent.
  • Store on failure: the strategy also used by Scribe; when the receiver crashes, the data is written locally and resent after the receiver recovers.
  • Best effort: after data is sent to the receiver, no confirmation is performed.

Deployment Types of Flume#

Single Flow#

[Figure: single-agent flow from source to storage]

Multi-Agent Flow (Multiple agents connected in sequence)#

[Figure: multiple agents connected in sequence]

Multiple Agents can be connected in sequence, collecting data from the initial source and delivering it to the final storage system. This is the simplest case. In general, the number of Agents chained in sequence should be kept small, since each hop lengthens the path the data travels; if failover is not configured, the failure of any Agent breaks the collection service of the entire Flow.
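Two agents are typically chained by pointing an Avro Sink on the upstream agent at an Avro Source on the downstream agent. In this sketch the host name, port, and agent names are illustrative:

```properties
# agent1 (upstream): ship events to agent2 over Avro RPC
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-host
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# agent2 (downstream): receive events from agent1
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1
```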

Consolidation (Merging flows, multiple Agents aggregating data into one Agent)#

[Figure: multiple agents consolidating into a single agent]

This scenario is very common, for example when collecting user behavior logs from a website. To ensure availability, the website runs as a load-balanced cluster, and every node generates user behavior logs. An Agent can be configured on each node to collect its log data, and the data from all of these Agents is then aggregated by a single Agent into one storage system, such as HDFS. In configuration terms, each node's Agent simply points its Avro Sink at the same downstream Avro Source, as in the two-agent sketch above.

Multiplexing the Flow#

[Figure: one source fanning out to multiple channels and sinks]

Flume supports sending events from one Source to multiple Channels, i.e., delivering them to multiple Sinks; this operation is called fan-out. By default, fan-out replicates Events to all Channels, meaning every Channel receives the same data. Flume also supports configuring a multiplexing selector on the Source to implement custom routing rules.

For example, when a mixed stream of syslog, Java, Nginx, Tomcat, and other logs flows into one agent, the agent can separate the mixed stream and establish a dedicated transmission channel for each type of log, as sketched below.
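A multiplexing selector routes on the value of an event header. In this sketch the header name logtype and its values are illustrative; they would typically be set by the client or by an upstream interceptor:

```properties
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
# Route on the value of the (illustrative) logtype event header
a1.sources.r1.selector.header = logtype
a1.sources.r1.selector.mapping.nginx = c1
a1.sources.r1.selector.mapping.tomcat = c2
# Events with any other header value fall through to the default channel
a1.sources.r1.selector.default = c3
```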

Load Balancing Function#

[Figure: load balancing events across multiple sinks]
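Flume provides load balancing through the load-balancing sink processor: several Sinks are placed in a sink group, and the processor distributes events across them. A minimal sketch, with illustrative sink names and selector policy:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Distribute events round-robin; "random" is also supported
a1.sinkgroups.g1.processor.selector = round_robin
# Temporarily blacklist a failing sink with exponential backoff
a1.sinkgroups.g1.processor.backoff = true
```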
