Refactor code into a Python package and add CI
This MR adds a Python package (differential_privacy) of PySpark DP jobs.
The MR also adds a GitLab CI pipeline for the repo (.gitlab-ci.yml). The pipeline lets us:
- Automatically run unit tests on push
- Automatically run linting (flake8) on push
- Manually run a build job that produces a conda-dist archive of dependencies, compatible with WMF airflow deployments.
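
For reviewers unfamiliar with GitLab CI, a pipeline of this shape could look roughly like the sketch below. This is an illustration only, not the contents of the .gitlab-ci.yml in this MR: the job names, the base image, the use of pytest, the pip-based install, the build_conda_dist.sh wrapper, and the dist/ output directory are all assumptions.

```yaml
# Hypothetical sketch of a three-job pipeline; every name below is an
# assumption and may differ from the actual .gitlab-ci.yml in this MR.
stages:
  - test
  - build

default:
  image: python:3.10            # assumed base image; PySpark tests also need a JVM

unit-tests:
  stage: test
  script:
    - pip install -e .          # assumes the package is pip-installable
    - pytest                    # assumes the unit tests are run with pytest

lint:
  stage: test
  script:
    - pip install flake8
    - flake8 differential_privacy

build-conda-dist:
  stage: build
  when: manual                  # job only runs when triggered by hand from the UI
  script:
    - ./build_conda_dist.sh     # hypothetical wrapper around the conda-dist build
  artifacts:
    paths:
      - dist/                   # assumed location of the produced archive
```

The two test-stage jobs run on every push by default, while `when: manual` keeps the heavier archive build out of the normal push cycle.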
Testing
This MR has been tested by running the existing tmlt pipelines notebook with the conda environment published at https://gitlab.wikimedia.org/repos/security/differential-privacy/-/packages/158
TODO
The following will be tackled in follow-up MRs.
- [ ] try to build python-flint from source
- [ ] try to reduce conda-dist size by removing pyspark deps (assuming they are available on stat/airflow nodes)
- [ ] simplify package management by using pyproject and/or poetry. This might break compatibility with our internal tooling and needs testing.