Fix AQS/hourly crashing in production when using Skein

Aqu requested to merge aqs_skein_debugging into main

In development, we create the Airflow instance with run_dev_instance.sh, but the launcher is not set to skein in that environment because the env variable AIRFLOW_INSTANCE_NAME is not set in this context. As a result, the Skein-related error did not occur during testing.
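To illustrate why the dev instance never exercised the Skein path, here is a minimal sketch of launcher selection driven by an environment variable. The function name `select_launcher`, the return value `"default"`, and the example instance name are hypothetical and are not taken from the actual codebase; only the variable name AIRFLOW_INSTANCE_NAME and the launcher name skein come from the description above.

```python
import os


def select_launcher(environ=None):
    """Hypothetical helper: pick the launcher from the environment.

    Assumption for illustration only: the Skein launcher is used
    whenever AIRFLOW_INSTANCE_NAME is set to a non-empty value,
    and the default Yarn launcher is used otherwise.
    """
    env = os.environ if environ is None else environ
    if env.get("AIRFLOW_INSTANCE_NAME"):
        return "skein"
    return "default"


# A dev-like environment without the variable falls back to the default
# launcher, so a Skein-specific failure cannot show up there.
print(select_launcher({}))
# "some_instance" is a made-up value standing in for a production name.
print(select_launcher({"AIRFLOW_INSTANCE_NAME": "some_instance"}))
```

Under this assumption, setting AIRFLOW_INSTANCE_NAME before starting the dev instance would be one way to reproduce the production launcher locally.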

The crash was a memory problem: launching this job with Skein needs more memory than launching it with the default Yarn launcher.

When the memory limit is exceeded, the Spark process crashes with a Spark-SQL error. That error is misleading: it is a consequence, not the root cause. The useful debugging information was in the Yarn UI: exit code 143.
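Exit code 143 follows the common Unix convention of 128 plus the signal number, so it means the process received SIGTERM (signal 15), which is consistent with Yarn killing a container rather than the job failing on its own. The small helper below is only a sketch for decoding such codes; `describe_exit_code` is a hypothetical name and not part of any Airflow, Skein, or Yarn API.

```python
import signal


def describe_exit_code(code):
    """Decode a process exit code using the 128 + signal convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal %d" % signum
        return "terminated by " + name
    return "exited with status %d" % code


print(describe_exit_code(143))
```

A code above 128 suggests looking at whatever sent the signal (here, the resource manager) rather than at the application-level error message.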