Flink Basics - DataStream Programming
Evolution of Stream Processing API (Comparison with Storm)
- Storm's API is operation-oriented and has a lower level of abstraction: you construct the DAG directly in code, for example:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
- Flink's API is data-oriented: you apply a series of operators that transform the data, and Flink's underlying system constructs the DAG for you, so the level of abstraction is higher.
Basic Usage
// 1. Set up the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 2. Configure the data source to read data
DataStream<String> text = env.readTextFile("input");
// 3. Perform a series of transformations
//    (Tokenizer is a user-defined FlatMapFunction that is not shown here)
DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(value -> value.f0).sum(1);
// 4. Configure the data sink to write data
counts.writeAsText("output");
// 5. Submit the job for execution
env.execute("Streaming WordCount");
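To make the semantics of step 3 more concrete, here is a minimal plain-Java sketch that imitates what tokenizing, grouping by word, and summing would compute. It uses no Flink classes. The class `WordCountSketch`, its methods, and the tokenizing rule (lowercase, split on non-word characters) are assumptions made for illustration, since the original `Tokenizer` is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of flatMap(tokenize) -> group by word -> running sum.
// Hypothetical helper for illustration only; this is not Flink code.
final class WordCountSketch {

    // Assumed Tokenizer behavior: lowercase the line and split on non-word characters.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Keyed running sum: one updated "word=count" record is emitted per input word,
    // the way a rolling aggregation behaves on an unbounded stream.
    static List<String> runningCounts(List<String> lines) {
        Map<String, Integer> state = new LinkedHashMap<>(); // per-key state
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                int count = state.merge(word, 1, Integer::sum);
                emitted.add(word + "=" + count);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [to=1, be=1, or=1, not=1, to=2, be=2]
        System.out.println(runningCounts(Arrays.asList("to be or not", "to be")));
    }
}
```

The point of the sketch is that the count is kept per key, and every incoming word produces an updated result for its own key only. In a real job, Flink maintains that per-key state and distributes the keys across parallel subtasks.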
Overview of Operations
Basic Transformations of DataStream
Physical Grouping Methods
Aggregation operators such as reduce/sum, which compute statistics, can only be called after the stream has been grouped by key with keyBy().
Method | Description |
---|---|
keyBy() | Partitions records by key; the most commonly used method. The number of distinct keys is usually much larger than the number of parallel operator instances. |
global() | Sends all records to the first downstream subtask. |
broadcast() | Sends every record to all downstream subtasks; suitable for small data volumes. |
forward() | One-to-one sending; requires the upstream and downstream parallelism to be the same. |
shuffle() | Uniformly random distribution. |
rebalance() | Round-robin distribution across all downstream subtasks. |
rescale() | Round-robin distribution within a local subset of downstream subtasks. |
partitionCustom() | Custom partitioning: a user-defined partitioner picks exactly one downstream subtask for each record. |
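The differences between these strategies are easier to see as a rule that picks a downstream subtask index. The following is a simplified plain-Java sketch, not Flink's implementation. The class `PartitionSketch` and all of its methods are hypothetical names. Two simplifications are assumed: the key-based rule hashes directly to a subtask (Flink first maps keys to key groups), and the round-robin rules start at index 0.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of how several strategies choose downstream subtask indices.
// Illustrative only: these are not Flink classes.
final class PartitionSketch {

    // Key-based (simplified): the same key always maps to the same subtask.
    static int byKey(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // global: every record goes to subtask 0.
    static int global() {
        return 0;
    }

    // broadcast: every record goes to every downstream subtask.
    static List<Integer> broadcast(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    // rebalance: round-robin over all downstream subtasks.
    static int rebalance(long recordIndex, int parallelism) {
        return (int) (recordIndex % parallelism);
    }

    // rescale: round-robin only within the slice of downstream subtasks that
    // belongs to this upstream subtask. Assumes the downstream parallelism is
    // an exact multiple of the upstream parallelism.
    static int rescale(int upstreamIndex, int upstreamParallelism,
                       int downstreamParallelism, long recordIndex) {
        int sliceSize = downstreamParallelism / upstreamParallelism;
        int firstTarget = upstreamIndex * sliceSize;
        return firstTarget + (int) (recordIndex % sliceSize);
    }

    public static void main(String[] args) {
        // Upstream parallelism 2, downstream parallelism 6:
        // upstream subtask 1 only ever writes to downstream subtasks 3, 4, 5.
        for (long i = 0; i < 4; i++) {
            System.out.println(rescale(1, 2, 6, i)); // prints 3, 4, 5, 3
        }
    }
}
```

With an upstream parallelism of 2 and a downstream parallelism of 6, `rebalance` cycles through all six subtasks, while `rescale` keeps each upstream subtask within its own three targets. This is why rescale() is described as local round-robin.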