Knowledge Gaps merge requests
https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests
2022-02-24T11:54:07Z

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/1
Development improvements
2022-02-24T11:54:07Z
Bmansurov
Make it easy to collaborate by using PEP8 rules.

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/2
Add metrics
2022-03-18T13:38:00Z
Bmansurov
- Add a function that computes pageviews for a given time interval
- Add revision counts metric
Issue: #1

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/3
Automate testing with tox.
2022-03-17T11:36:43Z
Gmodena
This WIP merge request addresses point 1 of issue #5: _Run unit tests_.
It provides a basic skeleton for automating code checks. It's based on the [cookiecutter template that Platform uses to bootstrap data pipelines](https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/tree/main/datapipeline-scaffold).
Gitlab's CI Pipeline status can be found at https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/pipelines
Note: _Marking as draft because I want to start a conversation to validate direction. This change is not prescriptive, and feedback is very much welcome._
I adopted the same type of code checks we use in https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines, but I relaxed the strictness a bit.
A [CI Pipeline](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/pipelines) is triggered
after each `push` and merge request and will run the following checks:
1. Unit tests (stored under `tests/`).
2. Linting, with flake8, runs against the `knowledge_gaps` module (rules are defined in the [tox.ini](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/new/diffs?merge_request%5Bsource_branch%5D=5-add-ci-integration&merge_request%5Bsource_project_id%5D=212&merge_request%5Btarget_branch%5D=main&merge_request%5Btarget_project_id%5D=212#61be067c7cf3bdbf8a6b021a2b5167eb30612d0c_0_11) file)
3. Type checking, with mypy, runs against the `knowledge_gaps` module (rules are defined in [tox.ini](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/new/diffs?merge_request%5Bsource_branch%5D=5-add-ci-integration&merge_request%5Bsource_project_id%5D=212&merge_request%5Btarget_branch%5D=main&merge_request%5Btarget_project_id%5D=212#61be067c7cf3bdbf8a6b021a2b5167eb30612d0c_0_18) file)
Only the unit tests are required to pass for a pipeline to be marked as successful; flake8 and mypy are allowed to fail (but will raise a warning).
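The pass/fail rule above can be sketched in Python as follows. This only illustrates the policy and is not the project's CI configuration (that lives in `.gitlab-ci.yml` and `tox.ini`); `CheckResult` and `pipeline_status` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    name: str            # e.g. "pytest", "flake8", "mypy"
    passed: bool
    allow_failure: bool  # True for checks that may fail without breaking the pipeline


def pipeline_status(results: List[CheckResult]) -> Tuple[bool, List[str]]:
    """Return (success, warnings) for a set of check results.

    A failing check breaks the pipeline only if it is required;
    a failing optional check is reported as a warning instead.
    """
    success = all(r.passed or r.allow_failure for r in results)
    warnings = [r.name for r in results if not r.passed and r.allow_failure]
    return success, warnings


results = [
    CheckResult("pytest", passed=True, allow_failure=False),
    CheckResult("flake8", passed=False, allow_failure=True),
    CheckResult("mypy", passed=False, allow_failure=True),
]
print(pipeline_status(results))  # (True, ['flake8', 'mypy'])
```

With this rule, a red flake8 or mypy job still shows up as a warning, so it stays visible without blocking merges.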
# Changes
- added the capability to automate pytest / mypy / flake8 via tox.
- added a Gitlab CI pipeline config (`.gitlab-ci.yml`).
- moved unit tests into a top-level `tests/` dir.
# Known issues
- Tests are failing, but I'd rather address a fix in a dedicated merge request.
- some code under `interactive/` is not valid Python. There's some notebook magic (`%%load_ext`) that will raise a syntax error when parsed by mypy/flake8. For now I'd be keen on leaving `interactive` code alone, under the assumption that it will be refactored. @aikochou @bmansurov does this track with you? How do you plan to structure this code base?

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/4
Create a conda env distribution.
2022-05-18T17:03:06Z
Gmodena
@bmansurov @aikochou
This MR introduces the capability to generate and publish relocatable conda environments.
It's a requirement needed to satisfy #5, and a prerequisite for #3.
# Changes
## CI job
A new job has been added to the CI pipeline to generate and publish a conda environment.
This job uses the `conda-dist` package from https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils.
## Docker and Makefile
I added a Makefile and a Dockerfile to help with local development. This is not strictly necessary for our pipeline, but allows for local experimentation/dev in an environment that resembles Wikimedia analytics hosts and Gitlab's CI container.
I used it to test/troubleshoot `workflow_utils`; it might come in handy for point 2 of #5.
@bmansurov @aikochou would this type of build tooling be at all useful for you?
## Versioning
I added bump2version to `requirements_dev.txt` to automate version bumps, and propagate changes to multiple affected files.
This is mostly prep work for implementing a release cycle and this point from #5: _triggering building a conda envs for main (and possibly development branches)_. @bmansurov this is something we'll need to think about together, and prepare a proposal for Fabian. I have some ideas, but nothing prescriptive. We'll also need to factor in requirements from DE/Airflow operators.
# Known issues
## conda-dist occasionally fails
I’ve experienced a few of these:
```
CondaPackError:
Files managed by conda were found to have been deleted/overwritten in the
following packages:
- ncurses 6.3:
share/terminfo/2/2621A
share/terminfo/E/Eterm
share/terminfo/E/Eterm-color
+ 1054 others
```
I narrowed down the issue to `conda-dist` pip-installing with the `--prefix=./dist/myenv` argument.
If I instead invoke pip from inside the environment (e.g. `./dist/myenv/bin/pip install .`) the dependencies do not get clobbered and `conda-pack` generates a valid tarball.
I had a patch ready for upstream... and then suddenly the issue went away by itself. I wanted to document it here in case we encounter a regression.
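The two pip invocations compared above can be written down as argument lists, for example to try the workaround from a script. This is a minimal sketch: the helper names are hypothetical, the commands are only built (not executed), and nothing here reflects how `conda-dist` is implemented internally.

```python
from pathlib import PurePosixPath
from typing import List


def pip_install_with_prefix(env_dir: str, project: str = ".") -> List[str]:
    """Invocation reported above as clobbering conda-managed files."""
    return ["pip", "install", f"--prefix={env_dir}", project]


def pip_install_from_env(env_dir: str, project: str = ".") -> List[str]:
    """Workaround: call the pip executable that lives inside the environment."""
    env_pip = PurePosixPath(env_dir) / "bin" / "pip"
    return [str(env_pip), "install", project]


print(pip_install_with_prefix("./dist/myenv"))
print(pip_install_from_env("./dist/myenv"))
```

Either list can be passed to `subprocess.run(..., check=True)` before running `conda-pack` on the environment.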
## CI failures
There are a couple of new failures in mypy/flake8. Not a blocker, but flagging just as an FYI.
- mypy: https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/jobs/11676#L58 (we just need to decorate `wmfdata` imports with `# type: ignore`).
- flake8: undefined name at https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/jobs/11675#L57

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/5
Correctly compute revision counts
2022-03-30T14:41:18Z
Bmansurov
Use wmf_raw.mediawiki_revision rather than wmf.mediawiki_history.
Issue: #1

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/6
Add aggregation and content gap pipeline
2022-04-14T11:58:53Z
AikoChou
* added aggregation pipeline (gender/sexual orientation/geographic)
* dataset generated from aggregation pipeline with the following schema: https://docs.google.com/document/d/1Z-EpXMnfzHAp-M5vdQ-NbFkXx3tQRMQpeZyYwBaK4xU/edit?usp=sharing (not prescriptive)
* added content gap pipeline (previously under `interactive/` directory)
* some function names modified in func.py and util.py
* deleted unnecessary code
* tested on a small dataset using `spark2-submit --master local` on the stat machine

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/7
Add the article quality score metric
2022-04-28T19:04:11Z
Bmansurov
Issue: #1

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/8
Use tagged version of Gitlab CI templates
2022-04-25T12:39:13Z
Gmodena
Update Gitlab CI to include tagged templates.
Fix `publish_conda_env` to match `v0.4.0` contract.
Fabian Kaelin

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/9
Update aggregation
2022-05-19T15:53:04Z
AikoChou
Only computes raw count for each metric so far.
Testing using precomputed tables:
- `aikochou.content_gap_feature_20220401`: 13657 articles from en, de, fr, ca, it wikis
- `aikochou.pageviews_2021`: monthly pageviews in 2021. Total 690,766,971 rows
- `aikochou.quality_scores`: using get_quality_scores from article_quality.app. Total 58,607,147 rows
Issues: #2

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/10
Update Gitlab CI MR
2022-05-04T02:37:33Z
Fabian Kaelin
- Merged with main branch
- Experimented with the local docker, but didn't get it to work; consistently fails packing the conda env with the `ncurses 6.3` error mentioned in the main MR description
- changed Dockerfile to resemble the one used by gitlab ci as much as possible
- moved the installation of wmf_workflow_utils from the Makefile to the Dockerfile, as it is already installed when running in CI
- Beware: after running (or attempting to run) `make env` using local docker, the resulting `dist` directory will cause the autopep8 pre-commit hook to run for a loooong time

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/11
Switch to using setup.cfg / fix tests in ci
2022-05-10T20:02:55Z
Fabian Kaelin
Switch from `setup.py` to `setup.cfg`, which allows consolidating more Python packaging configuration in a single file. Based on the [example-job-project](https://gitlab.wikimedia.org/repos/data-engineering/example-job-project).
Ongoing issues
- ~~the dependency on article quality via git+https breaks gitlab CI (which silently times out after 1 hour). Need to investigate whether that is because of the git+https dependency, or because of transitive dependencies pulled in. For now, the article quality dependency is disabled~~
- the local `make env` still fails as documented [here](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/4#conda-dist-occasionally-fails)
- the relationship between the local Dockerfile, the Makefile and gitlab-ci.yml is not well defined. At this point the Makefile only supports the local development workflow using Docker; it doesn't generalize commands between gitlab ci and local docker as originally intended.

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/12
Add missing dependency
2022-05-10T20:49:12Z
Bmansurov

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/13
Package config files / tests / use CI built conda env
2022-05-14T11:39:22Z
Fabian Kaelin
- merges https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/9 with main
- adds the configuration files as `package_data` and helper methods to load them via `import_lib`
- tests for loading configuration files
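Loading a file shipped as `package_data` can be sketched with the standard library as shown here. This is an assumption about the general approach, not the helper added in this MR: `load_config_text` is a hypothetical name, and `importlib.resources.files` requires Python 3.9 or newer.

```python
import importlib.resources


def load_config_text(package: str, filename: str) -> str:
    """Read a data file that was installed alongside `package`.

    The file is resolved through the import system rather than a path
    relative to the current working directory, so the lookup does not
    depend on where the process was started.
    """
    resource = importlib.resources.files(package).joinpath(filename)
    return resource.read_text(encoding="utf-8")
```

For this to find anything, the file has to be declared as `package_data` in the packaging configuration so that it is installed together with the module.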
I manually triggered the [conda env CI](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/jobs/17688) which produced https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/package_files/549/download, and I tested using this url directly as `archives` argument in the `spark-submit` command, which seems to work. See the commands in the readme.
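The `spark-submit` usage described above might look roughly like the argument list built here. This is a sketch only; the actual commands are in the readme and are not reproduced. The `spark_submit_command` helper, the `env` alias and the script name are hypothetical, while `--archives` with a `#alias` suffix is standard `spark-submit` syntax.

```python
from typing import List


def spark_submit_command(archive_url: str, script: str, alias: str = "env") -> List[str]:
    """Build a spark-submit argv that ships a packed conda env with the job.

    `--archives URL#alias` asks Spark to unpack the archive into a
    directory named `alias` in the working directory of each executor.
    Pointing Python at the unpacked env is a separate setting, not shown.
    """
    return [
        "spark-submit",
        "--archives", f"{archive_url}#{alias}",
        script,
    ]


cmd = spark_submit_command(
    "https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/package_files/549/download",
    "my_job.py",  # hypothetical entry point
)
print(" ".join(cmd))
```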
@aikochou, there are no functional changes in this MR; there might be a regression on the code path reading the config files in `content_features.py`, but the config loading itself works as intended. Note that this MR is against your branch, not main; feel free to merge it at your leisure.

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/14
Introduce running mode for metrics
2022-05-29T01:54:05Z
Bmansurov

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/15
Bump workflow_utils
2022-05-18T17:04:32Z
Fabian Kaelin
Bumping to a new version of workflow_utils, which fixes [this error](https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/4#conda-dist-occasionally-fails) when running `conda-dist`.

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/16
Feature metrics: add buckets
2022-06-01T17:35:21Z
Bmansurov

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/17
Generate development data for knowledge gap features
2022-06-06T13:25:39Z
Bmansurov
Generated articles are made up of people articles, location articles,
and articles with time properties.
Issue: #1

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/18
Prepare the codebase to be run by Airflow
2022-06-14T13:13:53Z
Bmansurov
Issue: #3

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/19
Parametrization
2022-06-17T02:45:00Z
Bmansurov
Before this MR we were unable to run the pipeline independently without overwriting each other's data. This patch allows us to generate different development datasets and to save the data in different databases and tables (defined by prefixes).
Issue: #3

https://gitlab.wikimedia.org/repos/research/knowledge-gaps/-/merge_requests/20
Ship development.sql when installing the package
2022-06-22T17:34:45Z
Bmansurov
This is needed for the Airflow job that creates development data.
Issue: #3