Fix AQS/hourly crashing in production when using Skein
In development, we create the Airflow instance with run_dev_instance.sh. But the launcher is not set to skein in this environment, because the env variable AIRFLOW_INSTANCE_NAME is not set in this context. So the Skein-related error did not happen in test.
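A minimal sketch of why the dev instance never exercised Skein. The helper name pick_launcher and the "default" value are hypothetical, and the real selection logic may differ; only the dependency on AIRFLOW_INSTANCE_NAME comes from the observation above.

```python
import os


def pick_launcher(env=None):
    """Hypothetical helper: choose the launcher from the environment.

    Assumption: the launcher is "skein" only when AIRFLOW_INSTANCE_NAME
    is set, which is the behaviour described for the dev instance.
    """
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the default Yarn launcher.
    return "skein" if env.get("AIRFLOW_INSTANCE_NAME") else "default"
```

Under this assumption, reproducing the production behaviour in development requires setting AIRFLOW_INSTANCE_NAME before starting the dev instance.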
It was a memory problem: Launching this job with Skein needs more memory than with the default Yarn launcher.
The memory-limit-exceeded error crashes the Spark process with a Spark-SQL error, which is a consequence, not the root cause, and is therefore misleading. The valuable debugging information was in the Yarn UI: exit code 143.
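For reference, exit code 143 is 128 + 15, meaning the process was terminated by SIGTERM. That is what one would typically expect when Yarn kills a container, for instance for exceeding its memory limit. The small helper below (describe_exit_code is an illustrative name, not part of this repository) decodes such codes:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process exit code into a readable description."""
    if code > 128:
        # By shell convention, 128 + N means "killed by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "success" if code == 0 else f"exited with status {code}"


print(describe_exit_code(143))  # prints "terminated by SIGTERM"
```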