Implement solution (!1) · Merge requests · repos / data-engineering / patches / WMF SparkSQLCLIDriver

Xcollazo requested to merge implement-solution into main Apr 20, 2023

Context: The SparkSqlOperator accepts HTTPS URLs in its sql parameter. This is very convenient, since we can combine it with GitLab's raw file API to construct operators like so:

SparkSqlOperator(
        task_id="do_hql",
        # To run an HQL file, simply use a GitLab raw URI that points to it.
        # See how to build such a URI here:
        # https://docs.gitlab.com/ee/api/repository_files.html#get-raw-file-from-repository
        # We strongly recommend you use an immutable URI (i.e. one that includes an SHA or a tag) for reproducibility
        sql=var_props.get(
            'hql_gitlab_raw_path',
            'https://gitlab.wikimedia.org/api/v4/projects/1261/repository/files/test%2Ftest_hql.hql/raw?ref=0e4d2a9'
        ),
        query_parameters={
            'destination_directory': f'/tmp/xcollazo_test_generic_artifact_deployment_dag/{{{{ts_nodash}}}}',
            'snapshot': '2023-01-02',
        },
        launcher='skein',
    )

The operator above pulls the HQL file from GitLab and executes it. This effectively allows users who want to run SQL to keep it in their own repositories, with no need to create or declare artifacts.
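As a rough illustration of how the raw URI passed to sql can be assembled, here is a minimal sketch using only the Python standard library. The helper name gitlab_raw_file_url and its parameters are hypothetical (they are not part of the operator or of this MR); the URL layout follows the GitLab repository files API linked in the code comment above.

```python
from urllib.parse import quote, urlencode


def gitlab_raw_file_url(base_url, project_id, file_path, ref):
    """Build a GitLab 'raw file' API URL (hypothetical helper, for illustration).

    The file path must be URL-encoded as a single path segment, so "/"
    has to become "%2F". Passing safe="" makes quote() encode it.
    """
    encoded_path = quote(file_path, safe="")
    query = urlencode({"ref": ref})
    return (
        f"{base_url}/api/v4/projects/{project_id}"
        f"/repository/files/{encoded_path}/raw?{query}"
    )


url = gitlab_raw_file_url(
    "https://gitlab.wikimedia.org",  # GitLab instance
    1261,                            # project id, as in the example operator
    "test/test_hql.hql",             # path of the HQL file inside the repo
    "0e4d2a9",                       # immutable ref (SHA or tag)
)
print(url)
```

Using a SHA or tag as ref keeps the URI immutable, which matches the reproducibility recommendation in the operator's comments.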

Unfortunately, the underlying code has a bug that mangles URLs containing URL-encoded characters, and the GitLab raw API depends on URL-encoding the file path, as seen above.
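To see why this kind of mangling matters, the sketch below decodes the example URL and compares path segments. It is only an illustration of this class of bug, written in Python under the assumption that the mangling amounts to percent-decoding; it does not reproduce the actual Spark / Hadoop code path. The helper path_segments is hypothetical.

```python
from urllib.parse import unquote, urlsplit


def path_segments(url):
    """Split a URL's path into its segments without decoding them."""
    return urlsplit(url).path.strip("/").split("/")


original = (
    "https://gitlab.wikimedia.org/api/v4/projects/1261"
    "/repository/files/test%2Ftest_hql.hql/raw?ref=0e4d2a9"
)

# Assumption for illustration: the mangling behaves like percent-decoding.
mangled = unquote(original)

original_segments = path_segments(original)
mangled_segments = path_segments(mangled)

# In the original, the file path is a single segment: "test%2Ftest_hql.hql".
# After decoding, it splits into "test" and "test_hql.hql", so the URL no
# longer has the /files/<encoded path>/raw shape the API expects.
print(original_segments)
print(mangled_segments)
```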

In this MR (which is the sole reason for this GitLab repo) we implement a subclass of org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver. This class uses WmfHttpsFileSystem, a Hadoop FileSystem implementation that avoids the code that produces the bug.

This code is based on Spark 3.1.2 and Hadoop Client 3.2.0 (the Hadoop code available at runtime when using pyspark). Adding it to our codebase means we may need to maintain it when we move to newer Spark / Hadoop versions. org.apache.hadoop.fs.FileSystem is marked as public and stable. org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver, however, does seem to be under active change.

Bug: T333001

Edited Apr 25, 2023 by Xcollazo
