1. 17 Feb, 2022 2 commits
  2. 14 Feb, 2022 6 commits
  3. 10 Feb, 2022 3 commits
  4. 07 Feb, 2022 3 commits
  5. 23 Dec, 2021 9 commits
  6. 22 Dec, 2021 6 commits
    • Merge branch 'T295360-set-dag-owner' into 'multi-project-dags-repo' · 32dde83b
      Gmodena authored
      Set dag owner.
      
      The property is required for a DAG to be picked up by the scheduler and displayed in the DAG UI.
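
      For context, a minimal sketch of how an owner can be attached to a DAG via `default_args` (the dag_id and owner value here are illustrative, not this repo's actual configuration):
      ```python
      from datetime import datetime

      from airflow import DAG

      # Illustrative values only; the real DAGs define their own dag_id and owner.
      with DAG(
          dag_id="your_data_pipeline",
          default_args={"owner": "gmodena"},  # owner picked up by the scheduler and shown in the DAG UI
          start_date=datetime(2021, 12, 1),
          schedule_interval=None,
      ) as dag:
          ...
      ```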
      
      See merge request gmodena/platform-airflow-dags!22
    • Set dag owner. · a9fbe966
      Gmodena authored
      The property is required for a DAG to be picked up by the scheduler and displayed in the DAG UI.
    • Merge branch 'T295360-fix-pyspark-paths' into 'multi-project-dags-repo' · 29bc2cf3
      Gmodena authored
      Normalise path layout in factories and cookiecutter
      
      This MR fixes some inconsistencies between the dags boilerplate and the
      cookiecutter template:

      * the expected venv location has moved one level up in the deployed project home
      * the project dir (cookiecutter config) is not part of the pipelines home; we let the boilerplate chain the dirs together (see the sketch below).
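
      A rough sketch of the intended chaining, with hypothetical names (the real values live in the factories/boilerplate config):
      ```python
      import os

      # Hypothetical paths and names; not taken from the actual deployment.
      project_home = "/srv/airflow-platform"   # deployed project home
      project_dir = "your_data_pipeline"       # cookiecutter project dir: a name, not a full path

      pipeline_home = os.path.join(project_home, project_dir)
      venv_path = os.path.join(pipeline_home, "venv")        # venv one level up, next to the pipeline code
      pyspark_home = os.path.join(pipeline_home, "pyspark")
      ```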
      
      See merge request gmodena/platform-airflow-dags!21
    • 537a56c5
    • Merge branch 'T295360-datapipeline-scaffolding' into 'multi-project-dags-repo' · 0395fdc5
      Gmodena authored
      T295360 datapipeline scaffolding
      
      This merge request adds a cookiecutter template to scaffold new data pipelines as described in https://phabricator.wikimedia.org/T295360.
      
      This template provides:
      * Integration with our tox config (mypy/flake8/pytest)
      * A PySpark job template (a rough sketch follows this list)
      * A pytest template for PySpark code
      * An Airflow DAG template to help users get started.
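
      As a rough illustration of the shape of the PySpark job template (app name, paths, and transformation are placeholders, not the actual template contents):
      ```python
      from pyspark.sql import SparkSession


      def run(input_path: str, output_path: str) -> None:
          # Placeholder transformation; the real template plugs in pipeline-specific logic.
          spark = SparkSession.builder.appName("your_data_pipeline").getOrCreate()
          df = spark.read.parquet(input_path)
          df.write.mode("overwrite").parquet(output_path)


      if __name__ == "__main__":
          run("hdfs:///tmp/input", "hdfs:///tmp/output")  # illustrative paths only
      ```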
      
      # Structure changes
      
      The project directory largely follows `image-matching`'s structure. Notable changes are:
      * Python code has been moved under `pyspark`
      * Python code is pip installable. This lets us package dependencies at build time and eases Spark deployment (e.g. we don't need to pass each module with `--files schema.py`; imports are resolved from the `venv`). A minimal packaging sketch follows.
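
      For illustration, pip installability can be as small as a `setup.py` along these lines (name and layout are assumptions, not the project's actual metadata):
      ```python
      # setup.py -- illustrative sketch only
      from setuptools import find_packages, setup

      setup(
          name="your_data_pipeline",
          version="0.1.0",
          packages=find_packages(where="pyspark"),
          package_dir={"": "pyspark"},
      )
      ```
      Once the project is installed into the venv shipped with the job, its modules resolve from the environment instead of being passed individually with `--files`.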
      
      # How to test
      Check out the `T295360-datapipeline-scaffolding` branch and run the commands below.
      
      A new data pipeline can be created with:
      ```
      make datapipeline
      ```

      This will generate a new directory for pipeline code under:
      ```
      your_data_pipeline
      ```

      and install an Airflow DAG template under:
      ```
      dags/your_data_pipeline_dag.py
      ```
      
      From the top level directory, you can now run `make test-dags`. The command will check
      that `dags/your_data_pipeline_dag.py` is a valid Airflow DAG (a sketch of such a check
      follows the output below). The output should look like this:
      ```
      make test-dags
      
      ---------- coverage: platform linux, python 3.7.11-final-0 -----------
      Name                                    Stmts   Miss  Cover
      -----------------------------------------------------------
      dags/factory/sequence.py                   70      3    96%
      dags/ima.py                                49      5    90%
      dags/similarusers-train-and-ingest.py      20      0   100%
      dags/your_data_pipeline_dag.py             19      0   100%
      -----------------------------------------------------------
      TOTAL                                     158      8    95%
      
      =========================== 8 passed, 8 warnings in 12.75s ===========================
      ______________________________________ summary ____________
      ```
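
      For reference, DAG validity checks like this are commonly written against Airflow's `DagBag`; a minimal sketch (the folder and dag_id are assumptions, not copied from this repo's test suite):
      ```python
      from airflow.models import DagBag


      def test_dags_import_cleanly():
          # Parse every file under dags/ and fail on any import error.
          dag_bag = DagBag(dag_folder="dags", include_examples=False)
          assert not dag_bag.import_errors
          assert dag_bag.get_dag("your_data_pipeline") is not None  # illustrative dag_id
      ```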
      
      See merge request gmodena/platform-airflow-dags!16
    • T295360 datapipeline scaffolding · 54d3d4a7
      Gmodena authored
  7. 17 Dec, 2021 1 commit
  8. 16 Dec, 2021 1 commit
    • Install openjdk in Github action. · 762dfe51
      Gmodena authored
      The Conda-vendored OpenJDK shows flaky behaviour with the rest of the build pipeline.

      This change installs AdoptOpenJDK directly on the host system.
  9. 15 Dec, 2021 2 commits
  10. 13 Dec, 2021 1 commit
  11. 09 Dec, 2021 1 commit
  12. 08 Dec, 2021 2 commits
  13. 24 Nov, 2021 3 commits