Flink Basics - DataStream Programming

Evolution of Stream Processing API (Comparison with Storm)

Storm's API sits at a lower level of abstraction. It is operation-oriented: you build the DAG yourself in code by registering spouts and bolts and declaring the grouping between them. For example:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

Flink's API is data-oriented: you describe a series of transformations that operators apply to the data, and Flink's runtime constructs the DAG from that description. This gives it a higher level of abstraction.

Basic Usage

// 1. Set up the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 2. Configure the data source to read data
DataStream<String> text = env.readTextFile("input");
// 3. Perform a series of transformations
//    (Tokenizer is a user-defined FlatMapFunction that emits (word, 1) pairs; it is not shown here.
//    The key selector lambda replaces keyBy(0), which is deprecated in recent Flink releases.)
DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(value -> value.f0).sum(1);
// 4. Configure the data sink to write data
counts.writeAsText("output");
// 5. Submit for execution
env.execute("Streaming WordCount");
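To make the flatMap, keyBy, and sum steps above more concrete, here is a minimal plain-Java sketch of what such a pipeline computes. It uses no Flink classes: `WordCountSketch`, `tokenize`, and `runningCounts` are hypothetical names, and the tokenizing rule (lowercase, split on non-word characters) is an assumption about what a `Tokenizer` might do. The point it illustrates is that a keyed `sum` is a rolling aggregation, so an updated total is emitted for every incoming record rather than one final count per word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; nothing here is part of the Flink API.
final class WordCountSketch {

    // Hypothetical stand-in for Tokenizer: lowercases a line and splits it on non-word characters.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Mimics keyBy(word).sum(count): keeps one running total per key and
    // emits the updated "(word,total)" pair for every incoming word.
    static List<String> runningCounts(List<String> lines) {
        Map<String, Integer> totals = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                int total = totals.merge(word, 1, Integer::sum);
                emitted.add("(" + word + "," + total + ")");
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (String record : runningCounts(Arrays.asList("to be or not to be"))) {
            System.out.println(record);
        }
    }
}
```

For the input line "to be or not to be" this sketch emits (to,1), (be,1), (or,1), (not,1), (to,2), (be,2) in that order. In a real job the records for different keys may be processed by different parallel subtasks, so only the per-key order is meaningful.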

Overview of Operations

Basic Transformations of DataStream

Physical Grouping Methods

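A stream must be grouped with keyBy() before aggregation operators such as reduce() or sum() can be called on it. The methods below control how records are physically distributed to the downstream operator instances.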

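Method	Description
keyBy()	The most commonly used method. Records are routed by key, so all records with the same key reach the same operator instance. The number of distinct keys is usually much larger than the number of parallel operator instances.
global()	Sends all records to the first downstream task.
broadcast()	Sends every record to all downstream tasks; suitable for small data volumes.
forward()	One-to-one sending; requires the upstream and downstream parallelism to be the same.
shuffle()	Random, uniform distribution.
rebalance()	Round-robin distribution across all downstream tasks.
rescale()	Local round-robin distribution: each upstream task distributes only to a subset of the downstream tasks.
partitionCustom()	Custom unicast: a user-supplied partitioner picks exactly one downstream task for each record.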
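The routing rules in the table can be sketched in plain Java. This is a simplified model and not Flink's implementation: `PartitionSketch` and its methods are hypothetical names, key routing is assumed to be a plain `hashCode` modulo the parallelism (Flink itself maps keys to key groups first), round-robin is assumed to start at task 0 for determinism, and the rescale sketch assumes the downstream parallelism is a multiple of the upstream parallelism.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of three distribution strategies; no Flink classes are used.
final class PartitionSketch {

    // keyBy-style routing: the same key always maps to the same downstream task index.
    // Assumption: hashCode modulo parallelism stands in for Flink's key-group hashing.
    static int keyBySubtask(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // rebalance-style routing: record i goes to task i mod parallelism.
    static List<Integer> roundRobin(int recordCount, int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            targets.add(i % parallelism);
        }
        return targets;
    }

    // rescale-style routing: each upstream task round-robins only over its own
    // contiguous slice of downstream tasks (assumes downstream is a multiple of upstream).
    static List<Integer> rescale(int upstreamIndex, int upstreamParallelism,
                                 int downstreamParallelism, int recordCount) {
        int sliceSize = downstreamParallelism / upstreamParallelism;
        int firstTarget = upstreamIndex * sliceSize;
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            targets.add(firstTarget + (i % sliceSize));
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println("keyBy(\"flink\") -> task " + keyBySubtask("flink", 4));
        System.out.println("rebalance -> " + roundRobin(5, 3));
        System.out.println("rescale (upstream 1 of 2, downstream 4) -> " + rescale(1, 2, 4, 4));
    }
}
```

In this model, `roundRobin(5, 3)` yields the targets 0, 1, 2, 0, 1, while `rescale(1, 2, 4, 4)` yields 2, 3, 2, 3: the second upstream task only ever touches its own two downstream tasks. That locality is the practical difference between rescale() and rebalance().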