Introduction to Livy Series: Livy, a Spark-Based REST Service#
Introduction to Livy#
Apache Livy is an open-source REST service for Apache Spark that lets clients submit code snippets or serialized binary code to a Spark cluster for execution over REST. It provides the following basic functionality:
- Submit Scala, Python, or R code snippets for execution on a remote Spark cluster.
- Submit Spark jobs written in Java, Scala, or Python for execution on a remote Spark cluster.
- Submit batch processing applications to run on the cluster.
For information on using the Livy REST API, refer to the official documentation.
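As a minimal sketch of that API, the following Python snippet creates an interactive PySpark session and submits a code snippet as a statement. It assumes a Livy server listening at http://localhost:8998 and the third-party requests package; the code only illustrates the documented /sessions and /statements routes.

```python
import json
import time

import requests

LIVY_URL = "http://localhost:8998"   # assumed Livy server address
HEADERS = {"Content-Type": "application/json"}

# Create an interactive session that accepts PySpark code snippets.
resp = requests.post(f"{LIVY_URL}/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=HEADERS)
session_url = LIVY_URL + resp.headers["Location"]

# Wait for the session to become idle before submitting code.
# (A production client would also check for error/dead states.)
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(1)

# Submit a code snippet as a statement; Livy compiles and runs it on the cluster.
stmt = requests.post(f"{session_url}/statements",
                     data=json.dumps({"code": "print(1 + 1)"}),
                     headers=HEADERS)
print("Statement submitted at:", LIVY_URL + stmt.headers["Location"])
```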
Basic Architecture of Livy#
Livy has a typical REST service architecture: on one hand, it accepts and parses users' REST requests and converts them into the corresponding operations; on the other, it manages all of the Spark clusters launched by users.
Users can start a new Spark cluster through a REST request to Livy. Livy calls each Spark cluster it starts a session; a session consists of a complete Spark cluster, and the Spark cluster communicates with the Livy server over an RPC protocol. Based on how users interact with them, Livy divides sessions into two types:
- Interactive session. Similar to Spark's interactive processing, an interactive session, once started, can receive user-submitted code snippets and compile and execute them on the remote Spark cluster.
- Batch session. Users can start Spark applications in a batch-processing manner through Livy; Livy refers to this as a batch session, similar to batch processing in Spark (a minimal submission sketch follows this list).
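A batch session is created by POSTing to the /batches endpoint with a reference to a pre-built application. The sketch below is illustrative only; the server address, jar path, class name, and arguments are placeholders.

```python
import json

import requests

LIVY_URL = "http://localhost:8998"   # assumed Livy server address
HEADERS = {"Content-Type": "application/json"}

# Submit a pre-built Spark application as a batch session.
# The jar path, class name, and arguments below are placeholders.
payload = {
    "file": "hdfs:///user/tom/spark-examples.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["100"],
}
resp = requests.post(f"{LIVY_URL}/batches",
                     data=json.dumps(payload),
                     headers=HEADERS)
print(resp.json())   # contains the batch id and its current state
```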
Enterprise Features of Livy#
Multi-user Support#
Suppose user tom sends a REST request to the Livy server to start a new session, and the Livy server itself was started by user livy: which user does the created Spark cluster run as, tom or livy? By default, it runs as user livy. This causes access-permission problems: user tom cannot access the resources he actually has permission for, yet he can access resources owned by user livy that he should not be able to touch.
To address this, Livy adopts Hadoop's proxy-user mode, which is widely used in multi-user components such as HiveServer2. In this mode, a superuser can act on behalf of a normal user and access resources with that normal user's permissions. Once proxy-user mode is enabled, a session created by user tom launches a Spark cluster that runs as tom.
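As an illustrative sketch, impersonation is requested per session via the proxyUser field of the session-creation request. This assumes impersonation has been enabled on the Livy server and that the livy user is configured as a Hadoop proxy user; the server address is a placeholder.

```python
import json

import requests

LIVY_URL = "http://localhost:8998"   # assumed Livy server address
HEADERS = {"Content-Type": "application/json"}

# Ask Livy to impersonate user "tom" when launching the Spark application.
# Requires impersonation to be enabled on the Livy server and the livy user
# to be configured as a Hadoop proxy user.
payload = {"kind": "pyspark", "proxyUser": "tom"}
resp = requests.post(f"{LIVY_URL}/sessions",
                     data=json.dumps(payload),
                     headers=HEADERS)
print(resp.json())   # the returned session metadata should reflect the proxy user
```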
End-to-End Security#
- Kerberos-based SPNEGO authentication ensures the security of client authentication.
- Standard SSL is used to encrypt the HTTP protocol, ensuring the security of HTTP transmission between the client and the Livy server.
- The RPC communication mechanism based on SASL authentication ensures the security of network communication between the Livy server and the Spark cluster.
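On the client side, a SPNEGO-authenticated HTTPS request might look like the sketch below. It assumes the third-party requests-kerberos package, a valid Kerberos ticket obtained via kinit, and placeholder values for the hostname and CA bundle path.

```python
import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

# SPNEGO (Kerberos) authentication over HTTPS.
# Assumes a valid Kerberos ticket (kinit); the hostname and the
# CA bundle path below are placeholders.
auth = HTTPKerberosAuth(mutual_authentication=REQUIRED)
resp = requests.get("https://livy.example.com:8998/sessions",
                    auth=auth,
                    verify="/path/to/ca-bundle.pem")
print(resp.status_code, resp.json())
```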
Failure Recovery Mechanism#
Livy provides a failure recovery mechanism. When a user starts a session, Livy records session-related metadata on reliable storage. Once Livy recovers from a failure, it attempts to read the relevant metadata and reconnect with the Spark cluster.
To use this feature, we need to configure Livy as follows:
# Enable the failure recovery feature
livy.server.recovery.mode: recovery
# Where session metadata is stored; currently filesystem and zookeeper are supported
livy.server.recovery.state-store
# The storage location: a file path for filesystem, or the ZooKeeper cluster URL for zookeeper
livy.server.recovery.state-store.url
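For example, a ZooKeeper-backed setup in livy.conf might look like the snippet below; the ZooKeeper hostnames are placeholders.

```
# Example livy.conf snippet for ZooKeeper-backed recovery (hostnames are placeholders)
livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = zk1:2181,zk2:2181,zk3:2181
```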
Comparison of Livy and Spark Job Server#
Advantages of Livy:
- Livy does not require any changes to the code, while SJS jobs must extend specific classes.
- Livy allows the submission of code snippets (including Python) and pre-compiled jars, while SJS only accepts jars.
- In addition to REST, Livy also has Java and Scala APIs. The Python API is under development, while SJS has a "Python binding."
Advantages of SJS:
- SJS can manage jars: you can upload and store jars and then reference them with a separate REST call when submitting a job, whereas Livy requires re-uploading these jars every time a job is redeployed.
- SJS jobs can be configured in HOCON format, and the configuration can be submitted as part of the REST call.
Official Links#
Livy: https://livy.incubator.apache.org/
spark-jobserver: https://github.com/spark-jobserver/spark-jobserver