Make ArtifactLocator methods work with Artifacts, rather than artifact ids, use artifact.name as default cache_key (!25) · Merge requests · repos / data-engineering / WMF Data Workflow Utils

Ottomata requested to merge artifact into main May 18, 2022

This will allow Locator implementations to be smarter about how they source or cache artifacts, because they will have richer information about the Artifact.

This is needed to work around a problem with the GitLab package registry, where the default URL to a package ends in 'download', even if that package file is a .tgz archive. We need more control over the final cached filename.

The final cached filename influences the behavior of tools such as Spark and Skein: both take 'archives', which are automatically unpacked if the filename ends in an archive file extension like .tgz. So, we need cached archive files to have the appropriate file extension!
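To illustrate why the trailing 'download' is a problem, here is a minimal sketch of an extension check. The suffix list, the helper names, and the example URL are assumptions made for illustration; the exact set of extensions Spark or Skein recognize may differ.

```python
import posixpath
from urllib.parse import urlparse

# Assumed suffixes for illustration only; not taken from Spark or Skein.
ARCHIVE_SUFFIXES = (".tgz", ".tar.gz", ".tar", ".zip")


def basename_of(url_or_path: str) -> str:
    """Return the last path component of a URL or filesystem path."""
    return posixpath.basename(urlparse(url_or_path).path)


def looks_like_archive(filename: str) -> bool:
    """True if the filename ends in one of the assumed archive suffixes."""
    return filename.lower().endswith(ARCHIVE_SUFFIXES)


# Hypothetical package registry URL whose last path component is 'download'.
url = "https://gitlab.example.org/some/package/download"

# Naming the cached file after the URL gives 'download', which is not
# recognized as an archive, so it would not be unpacked.
unpacked_from_url = looks_like_archive(basename_of(url))

# Naming the cached file after the artifact gives a recognizable extension.
unpacked_from_name = looks_like_archive("my_conda_env_artifact-0.1.0.tgz")
```

With these assumptions, only the name-based filename passes the check, which is the behavior this change is after.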

The default implementation of ArtifactCache.cache_key now uses artifact.name as the cache key, giving artifact instantiators easy control over the final cached filename. To make sure a GitLab download URL works as an FsArtifactSource, one would declare:

artifacts:
  my_conda_env_artifact-0.1.0.tgz:
    id: gitlab.wm.org/packages/..../download # gitlab package download link here
    source: fs_artifact_source

The cached filename will end up being 'my_conda_env_artifact-0.1.0.tgz'.
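A rough sketch of the idea, not the actual workflow_utils code: the classes, field names, base path, and example values below are hypothetical stand-ins. It only shows how a cache that receives the whole Artifact, rather than just its id, can derive the cached filename from artifact.name.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Artifact:
    """Hypothetical stand-in for the real Artifact class."""
    name: str    # key used in the artifacts config
    id: str      # source-specific identifier, e.g. a download URL
    source: str  # name of the configured artifact source


class SketchArtifactCache:
    """Hypothetical cache that names cached files after the artifact."""

    def __init__(self, base_uri: str):
        self.base_uri = base_uri

    def cache_key(self, artifact: Artifact) -> str:
        # Default behavior described above: use the artifact name.
        return artifact.name

    def url(self, artifact: Artifact) -> str:
        # The cached file location is the base URI plus the cache key.
        return posixpath.join(self.base_uri, self.cache_key(artifact))


artifact = Artifact(
    name="my_conda_env_artifact-0.1.0.tgz",
    id="https://gitlab.example.org/some/package/download",  # hypothetical URL
    source="some_source",  # hypothetical source name
)
cache = SketchArtifactCache("/tmp/artifact_cache")  # hypothetical base path
cached_url = cache.url(artifact)
```

Because the method takes an Artifact, a subclass could override cache_key to use any other field without changing callers.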

Another option would have been to add a new optional Artifact config property, such as cache_key, and use it for the cached filename if it is set. However, I prefer using the artifact name, since that option introduces less config.

Fixes: airflow-dags!47 (comment 6838)

Bug: T307115

Edited May 18, 2022 by Ottomata
