I want to use Airflow for orchestration of jobs that includes running some pig scripts, shell scripts and spark jobs.
Mainly on Spark jobs, I want to use Apache Livy but not sure whether it is good idea to use or run spark-submit.
What is best way to track Spark job using Airflow if even I submitted?
解决方案
My assumption is you an application JAR containing Java / Scala code that you want to submit to remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If your'e OK with that, you can just poll the Livy server via REST API and keep printing logs in console, those will appear on task logs in WebUI (View Logs)
Other considerations
Livy doesn't support reusing SparkSession for POST/batches request
If that's imperative, you'll have to write your application code in PySpark and use POST/session requests
References
Useful links