Add spark-submit boilerplate.

Gmodena requested to merge T296758-add-spark-job-boilerplate into main

This MR adds boilerplate to configure spark-submit for Java-based jobs.

This change is meant to simplify submitting Spark jobs, deployed in cluster mode, from Airflow tasks. It is carried out in the context of T296758.

A new SparkTask dataclass has been added to our DAG template and factory. It wraps spark-submit in an Airflow BashOperator.
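For reviewers unfamiliar with the pattern, here is a minimal sketch of how a dataclass can render a spark-submit command line. It is not the code in this MR: `SparkConfigSketch`, `SparkTaskSketch`, their fields, the defaults, and the `bash_command()` helper are all hypothetical names chosen for illustration. Only the `--master`, `--deploy-mode`, and `--class` flags are standard spark-submit options.

```python
import shlex
from dataclasses import dataclass


@dataclass
class SparkConfigSketch:
    # Hypothetical fields and defaults; the SparkConfig in this MR may differ.
    master: str = "yarn"
    deploy_mode: str = "cluster"


@dataclass
class SparkTaskSketch:
    # Hypothetical stand-in for SparkTask, mirroring the arguments used in the example.
    config: SparkConfigSketch
    main: str
    application_jar: str
    main_args: str = ""

    def bash_command(self) -> str:
        """Render the spark-submit invocation as a single shell-safe string."""
        argv = [
            "spark-submit",
            "--master", self.config.master,
            "--deploy-mode", self.config.deploy_mode,
            "--class", self.main,
            self.application_jar,
        ]
        # Application arguments follow the jar; split them like a shell would.
        argv += shlex.split(self.main_args)
        return shlex.join(argv)


task = SparkTaskSketch(
    config=SparkConfigSketch(),
    main="org.apache.spark.examples.SparkPi",
    application_jar="spark-examples_2.11-2.4.5.jar",
    main_args="5",
)
print(task.bash_command())
```

In an Airflow DAG, a string like this would typically be passed as the `bash_command` argument of a `BashOperator`. The sketch stops at the string so it runs without Airflow installed.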

Example

The spark-submit command for the canonical SparkPi demo application can be configured as follows:

config = SparkConfig()
task = SparkTask(config=config,
    main="org.apache.spark.examples.SparkPi",
    application_jar="spark-examples_2.11-2.4.5.jar",
    main_args="5")
airflow_op = task.operator()
Edited by Gmodena
