Spark Web UI – Understanding Spark Execution

Apache Spark provides a suite of web user interfaces (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark/PySpark application, the resource consumption of the Spark cluster, and the Spark configuration.

To better understand how Spark executes Spark/PySpark jobs, this set of user interfaces comes in handy. In this article, I will run a small application and explain how Spark executes it, using the different sections of the Spark Web UI.

Before going into the Spark UI, first learn about these two concepts: transformations and actions.

Let me give a small brief on those two. Your application code is a set of instructions that tells the driver to run a Spark job, and the driver decides how to achieve it with the help of the executors.

The instructions to the driver are called transformations, and an action triggers the execution.

I have written a small application that performs a transformation and an action.

Application Code

Here we are creating a DataFrame by reading a .csv file and checking the count of the DataFrame. Let's understand how this application gets projected in the Spark UI.
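
The original listing is shown as a screenshot; a minimal sketch of the same idea (the application name and file path below are placeholders) looks like this:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[3]")              # local mode with 3 threads, matching the run in this article
         .appName("SparkUIExample")       # hypothetical application name
         .getOrCreate())

# Read a CSV file into a DataFrame; inferSchema makes Spark scan the file to guess the column types.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("path/to/sample.csv"))         # placeholder path

# Action: triggers the actual execution and returns the number of rows.
print(df.count())
```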

The Spark UI is separated into the following tabs.

  1. Spark Jobs
  2. Stages
  3. Tasks
  4. Storage
  5. Environment
  6. Executors
  7. SQL

If you are running the Spark application locally, the Spark UI can be accessed at http://localhost:4040/. The Spark UI runs on port 4040 by default, and below are some of the additional UIs that are helpful for tracking a Spark application.

Spark Web UI

Note: To access these URLs, the Spark application should be in a running state. If you want to access the Spark UI regardless of your application's status, you need to start the Spark History Server.
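
A minimal sketch of how that fits together, assuming a local event-log directory that already exists:

```python
from pyspark.sql import SparkSession

# Write event logs so the History Server can replay the UI after the application finishes.
spark = (SparkSession.builder
         .appName("HistoryServerDemo")                              # hypothetical name
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:///tmp/spark-events")  # placeholder directory
         .getOrCreate())

# Then start the History Server once from your Spark installation:
#   $SPARK_HOME/sbin/start-history-server.sh
# and browse finished applications at http://localhost:18080/ (the default port).
```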

1. Spark Jobs Tab

Jobs tab

The details I want you to be aware of under the Jobs section are the scheduling mode, the number of Spark jobs, the number of stages each job has, and the Description of your Spark job.

1.1 Scheduling Mode

We have three Scheduling modes.

  1. Standalone mode
  2. YARN mode
  3. Mesos

Spark Scheduling tab

As I was running on a local machine, I used standalone mode.
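
For reference, the mode is selected by the master URL given when the session is created; the commented-out URLs below are illustrative placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[3]")                  # local machine with 3 threads, as used here
         # .master("spark://<host>:7077")     # standalone cluster manager
         # .master("yarn")                    # YARN
         # .master("mesos://<host>:5050")     # Mesos
         .appName("SchedulingModeDemo")       # hypothetical name
         .getOrCreate())
```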

1.2 Number of Spark Jobs

Always keep in mind that the number of Spark jobs is equal to the number of actions in the application, and each Spark job has at least one stage.
In our application above, we have 3 Spark jobs (0, 1, 2):

  • Job 0: reads the CSV file.
  • Job 1: infers the schema from the file.
  • Job 2: counts the records (count check).

So if we look at the figure, it clearly shows 3 Spark jobs, the result of 3 actions.

1.3 Number of Stages

Each wide transformation results in a separate stage. In our case, Spark job 0 and Spark job 1 each have a single stage, but for Spark job 2 we can see two stages; that is because of the partitioning of the data, which is split into two partitions by default here.
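
You can check the partition count yourself, and see that adding a wide transformation introduces an extra stage; a sketch reusing the df from the application code above:

```python
# The number of partitions determines how many tasks each stage runs.
print(df.rdd.getNumPartitions())

# A wide transformation (groupBy + aggregation) forces a shuffle,
# so the triggered job is split into an additional stage in the UI.
df.groupBy(df.columns[0]).count().show()
```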

1.4 Description

The Description links to the complete details of the associated Spark job, such as the job status, the DAG visualization, and the completed stages.
I explain the description part further in the coming sections.
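
One related tip: you can set your own job description before triggering an action, which makes the entries in this column easier to identify; a sketch reusing spark and df from the application code (the label text is arbitrary):

```python
# The label below appears as the job's Description in the Jobs tab.
spark.sparkContext.setJobDescription("count check on the CSV DataFrame")
print(df.count())
```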

2. Stages Tab

Spark Stage Tab

We can navigate to the Stages tab in two ways.

  1. Select the Description of the respective Spark job (shows the stages only for the selected Spark job).
  2. At the top of the Spark Jobs tab, select the Stages option (shows all stages in the application).

In our application, we have a total of 4 Stages.

The Stages tab displays a summary page that shows the current state of all stages of all Spark jobs in the Spark application.

The number of tasks you see in each stage is the number of partitions that Spark is going to work on, and each task inside a stage is the same work that Spark performs, but on a different partition of the data.
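
So the task count per stage can be influenced by repartitioning; a short sketch, again reusing the earlier df (the partition count is an arbitrary example value):

```python
# Changing the number of partitions changes the number of tasks in the next stage.
df2 = df.repartition(8)
print(df2.rdd.getNumPartitions())   # 8
print(df2.count())
```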

Stage 0

Stage detail

The stage details showcase the Directed Acyclic Graph (DAG) of the stage, where the vertices represent the RDDs or DataFrames and the edges represent the operations to be applied.

Let us analyze the operations in these stages.
The operations in Stage 0 are:
1. FileScanRDD
2. MapPartitionsRDD

FileScanRDD

FileScan represents reading the data from a file.
It is given FilePartitions, which are custom RDD partitions with PartitionedFiles (file blocks).
In our scenario, the CSV file is read.

MapPartitionsRDD

A MapPartitionsRDD is created when you use a mapPartitions transformation.
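
For instance, a mapPartitions call on the underlying RDD (the function below is just an illustration) produces a MapPartitionsRDD; DataFrame operations create MapPartitionsRDDs internally in the same way:

```python
# Count the rows of each partition with one function call per partition.
def rows_per_partition(rows):
    yield sum(1 for _ in rows)

print(df.rdd.mapPartitions(rows_per_partition).collect())
```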

Stage 1

The operations in Stage 1 are:
1. FileScanRDD
2. MapPartitionsRDD
3. SQLExecutionRDD

As FileScanRDD and MapPartitionsRDD are already explained, let us look at SQLExecutionRDD.

SQLExecutionRDD

SQLExecutionRDD is a Spark property that is used to track multiple Spark jobs that together constitute a single structured query execution.

Stage 2

The operations in Stage 2 and Stage 3 are:
1. FileScanRDD
2. MapPartitionsRDD
3. WholeStageCodegen
4. Exchange

WholeStageCodegen

A physical query optimization in Spark SQL that fuses multiple physical operators together into a single generated function.
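
You can see which operators were fused by whole-stage code generation by printing the physical plan; a sketch using the earlier df (the mode argument to explain() requires Spark 3.0+):

```python
# Operators collapsed by whole-stage codegen are grouped together in the physical plan.
df.groupBy(df.columns[0]).count().explain()

# On Spark 3.0+ you can also inspect the generated code itself:
# df.groupBy(df.columns[0]).count().explain(mode="codegen")
```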

Exchange

The Exchange is performed because of the count() method.
As the data is divided into partitions that are spread among the executors, getting the total count requires adding up the counts from the individual partitions.

Exchange represents the shuffle, i.e., data movement across the cluster (executors).
It is the most expensive operation, and the more partitions there are, the more data is exchanged between executors.
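
The number of partitions produced by an Exchange for aggregations is governed by spark.sql.shuffle.partitions (200 by default); a sketch of lowering it for a small local run (note that adaptive query execution in Spark 3.x may coalesce it further):

```python
# Fewer shuffle partitions means fewer post-Exchange tasks for a small dataset.
spark.conf.set("spark.sql.shuffle.partitions", "4")   # 4 is an arbitrary example value
df.groupBy(df.columns[0]).count().show()
```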

3. Tasks

Spark Tasks Tab

The tasks are located at the bottom of the page for the respective stage.
Key things to look at on the task page are:
1. Input Size: the input for the stage.
2. Shuffle Write: the output the stage has written.

4. Storage

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes, and partitions of all RDDs, and the details page shows the sizes and the executors used for all partitions in an RDD or DataFrame.
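
Nothing appears in the Storage tab unless something is persisted and then materialized by an action; a minimal sketch, assuming the df from the application code:

```python
from pyspark import StorageLevel

# Persist the DataFrame so it shows up in the Storage tab once it is materialized.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # an action is needed to actually cache the data

# Release the cached data when you are done.
df.unpersist()
```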

5. Environment Tab

Spark Environment Tab

This environment page has five parts. It is a useful place to check whether your properties have been set correctly.

  1. Runtime Information: simply contains the runtime properties like versions of Java and Scala.
  2. Spark Properties: lists the application properties like ‘spark.app.name’ and ‘spark.driver.memory’.
  3. Hadoop Properties: displays properties relative to Hadoop and YARN. Note: properties like ‘spark.hadoop.*’ are shown not in this part but in ‘Spark Properties’.
  4. System Properties: shows more details about the JVM.
  5. Classpath Entries: lists the classes loaded from different sources, which is very useful to resolve class conflicts.

Spark Environment properties

The Environment tab displays the values for the different environment and configuration variables, including JVM, Spark, and system properties.
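
Anything you pass through the session builder's config() ends up under ‘Spark Properties’ here; a small sketch with illustrative values:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("EnvTabDemo")                          # visible as spark.app.name
         .config("spark.sql.shuffle.partitions", "4")    # illustrative value
         .config("spark.sql.session.timeZone", "UTC")    # illustrative value
         .getOrCreate())
```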

6. Executors Tab

Spark Executors Tab

The Executors tab displays summary information about the executors that were created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.

The Executors tab provides not only resource information like amount of memory, disk, and cores used by each executor but also performance information.

In the Executors tab:
Number of cores = 3, as I set the master to local with 3 threads.
Number of tasks = 4.
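
On a real cluster, the per-executor resources reported in this tab are driven by configuration such as the following (values are illustrative, and some settings only apply when submitting to a cluster manager rather than running locally):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ExecutorsTabDemo")                   # hypothetical name
         .config("spark.executor.instances", "2")       # number of executors (e.g., on YARN)
         .config("spark.executor.cores", "3")           # cores per executor
         .config("spark.executor.memory", "2g")         # heap memory per executor
         .getOrCreate())
```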

7. SQL Tab

Spark SQL Tab

If the application executes Spark SQL queries, the SQL tab displays information such as the duration, the Spark jobs, and the physical and logical plans for the queries.

In our application, we performed read and count operations on a file and a DataFrame, so both read and count are listed in the SQL tab.
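
An explicit SQL query shows up there as well; a short sketch reusing spark and df from the application code:

```python
# Register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("sample")
spark.sql("SELECT COUNT(*) AS cnt FROM sample").show()
# The query's duration, Spark jobs, and logical/physical plans then appear in the SQL tab.
```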

Some of the resources are gathered from the Apache Spark documentation (Apache Spark™ – Unified Engine for large-scale data analytics); thanks for the information.
