ImageMatching merge requests
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests
2022-02-11T14:28:24Z

Update documentation.
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/37
2022-02-11T14:28:24Z
Gmodena

Update version after import changes
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/36
2021-11-29T20:34:22Z
Clarakosi

Fix imports
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/34
2021-11-29T20:18:05Z
Clarakosi

Add algorithm version 2
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/33
2021-11-19T16:48:35Z
Clarakosi

* Added a new notebook for algorithm_v2
* Modified algorithm script to use algorithm_v2
* Updated setup.py to now show algorithm version
* Fixed minor typos on README
---
These changes also include changes to the UDF to make the algorithm deterministic.
Gmodena

Packaging ImageMatching as a Python wheel
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/32
2021-10-26T11:00:34Z
Gmodena

This MR is a refactoring of ImageMatching to make it more compliant with Python’s packaging tooling and practices (setuptools).
The problem I’m trying to solve is to identify a boundary between upstream (research) code and how we use it downstream in product features.
# How to test
We can install `algorunner` in a virtual environment (e.g. on the stats machines) and launch the PySpark/papermill job with:
```
$ (venv) pip install algorunner --extra-index-url https://gitlab.wikimedia.org/api/v4/projects/40/packages/pypi/simple
$ algorunner.py 2021-07-26 hywiki Output
```
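For readers unfamiliar with papermill runners, here is a minimal sketch of how the three command-line arguments above could be mapped onto a parameterized notebook run. It is not the actual `algorunner.py` code: the notebook name `algorithm.ipynb`, the parameter names `snapshot`, `wiki` and `output_dir`, and the helpers `build_job` and `run_job` are all assumptions made for illustration. Only `papermill.execute_notebook` is a real papermill call.

```python
from typing import Any, Dict


def build_job(snapshot: str, wiki: str, output_dir: str) -> Dict[str, Any]:
    """Translate CLI-style arguments into papermill inputs.

    The notebook name and parameter names are hypothetical.
    """
    return {
        "input_path": "algorithm.ipynb",
        "output_path": f"{output_dir}/{wiki}_{snapshot}.ipynb",
        "parameters": {
            "snapshot": snapshot,
            "wiki": wiki,
            "output_dir": output_dir,
        },
    }


def run_job(job: Dict[str, Any]) -> Any:
    """Execute the notebook; papermill is imported lazily (third-party)."""
    import papermill as pm

    return pm.execute_notebook(
        job["input_path"],
        job["output_path"],
        parameters=job["parameters"],
    )


# Same arguments as the shell example: snapshot, wiki, output directory.
job = build_job("2021-07-26", "hywiki", "Output")
print(job["output_path"])
```

Keeping the argument mapping separate from the papermill call makes it possible to inspect a job without a Spark session or an installed notebook.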
# Changes
* The `ImageMatching` repo now contains only notebooks, an nbconverted script & papermill runners.
* I moved all ETL and test infrastructure to the `platform-airflow-dags` repo.
* I created an `ima` package. Right now it contains notebooks. In the future it could host a library (that the notebooks can import).
* `setuptools` (`setup.py`) is configured to package notebooks and scripts in a wheel, and installs them in PYTHONPATH (e.g. ./venv/lib/python3.7/site-packages/ima and ./venv/bin/ ). Scripts are also added to PATH.
* CI builds and deploys the wheel to pypi (https://gitlab.wikimedia.org/gmodena/ImageMatching/-/pipelines/670).

Add Search table
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/27
2021-08-12T17:04:01Z
Gmodena

*Created by: clarakosi*
Add search table with a schema similar to: https://phabricator.wikimedia.org/T285816#7214472

Update spark dep in makefile
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/28
2021-08-04T19:50:01Z
Gmodena

*Created by: gmodena*
PR for https://phabricator.wikimedia.org/T288114.
Our CI points to a Spark tarball that does not exist anymore. This change points to the currently available Spark 2.x
tarball in apache mirrors, which means a bump from 2.4.7 to 2.4.8.

Add recurrent time frame to instances_to_filter list
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/25
2021-05-06T19:55:20Z
Gmodena

*Created by: clarakosi*

Modify algorithm to use wiki specific label language in query
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/24
2021-04-30T19:59:30Z
Gmodena

*Created by: clarakosi*

Remove label language clause from wikidata item query
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/23
2021-04-28T15:02:19Z
Gmodena

*Created by: clarakosi*

Move wiki and poc_wiki lists to a config file
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/22
2021-04-19T18:44:50Z
Gmodena

*Created by: gmodena*
This PR moves the `wiki` and `poc_wikis` definitions to a dedicated, authoritative, config file.

Add android dataset scripts.
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/21
2021-04-14T07:51:41Z
Gmodena

*Created by: gmodena*
This PR adds scripts to generate a variant of the ImageMatching datasets suitable for Android clients.

Filter image suggestions detected as "placeholder images"
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/20
2021-04-08T15:46:24Z
Gmodena

*Created by: clarakosi*
- [x] As a user of the Image Suggestion API, when I make a request for image suggestions, I expect that all images detected as a "placeholder image" have been filtered out
- [x] Miriam validation query has been newly applied (see https://phabricator.wikimedia.org/T277828#6957015), and results should reflect 0 "placeholder images" found for representative wikis
- [x] A static list of "placeholder images" has been generated and stored in HDFS
- [x] The algorithm notebook has been updated to filter out "placeholder images"

Fix metrics naming
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/18
2021-04-07T15:12:59Z
Gmodena

*Created by: clarakosi*

Update export_prod_data doc
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/17
2021-04-02T16:27:03Z
Gmodena

*Created by: gmodena*
This PR updates documentation.

T277776 add found on wiki
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/13
2021-04-01T17:46:41Z
Gmodena

*Created by: gmodena*
This PR adds a new `array<string> found_on` column to the production dataset generated by `transform.py`.
Hive metadata has been updated accordingly.
The dataset export script projects the list of strings as a `found_on` column that contains the list of wikis
as a csv (`,` separated).

Add a list of instances to filter
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/14
2021-03-31T15:11:43Z
Gmodena

*Created by: clarakosi*
**Acceptance criteria**
- [x] A list of "instance of" items to be filtered is available under version control
- [x] All articles that match the filter list have been filtered out
- [x] Metrics for items that have been filtered out have been added in the data quality reports

Bugfix: save all unillustrated articles.
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/15
2021-03-31T11:02:54Z
Gmodena

*Created by: gmodena*
Don't exclude any image source when building `allimages`.
Fixes https://phabricator.wikimedia.org/T278571

Add itwiki and eswiki to the PoC export list.
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/16
2021-03-31T08:02:22Z
Gmodena

*Created by: gmodena*
This PR adds two new languages to the list of production datasets we export.
* `eswiki`
* `itwiki`

Add page redirect counters
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/12
2021-03-30T15:27:05Z
Gmodena

*Created by: gmodena*
Add a check to verify that no "page redirect" article is present in the raw dataset.
Closes https://phabricator.wikimedia.org/T277560
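
A check of this kind could look roughly like the sketch below. It is an illustration in plain Python rather than the code from this MR: the `page_is_redirect` field and the helpers `count_redirects` and `check_no_redirects` are hypothetical names, and the real check presumably runs against a Spark DataFrame rather than a list of dicts.

```python
from typing import Iterable, Mapping


def count_redirects(rows: Iterable[Mapping[str, object]],
                    flag: str = "page_is_redirect") -> int:
    """Count rows whose (hypothetical) redirect flag is truthy."""
    return sum(1 for row in rows if row.get(flag))


def check_no_redirects(rows: Iterable[Mapping[str, object]],
                       flag: str = "page_is_redirect") -> None:
    """Raise if the raw dataset still contains "page redirect" articles."""
    redirects = count_redirects(rows, flag)
    if redirects:
        raise ValueError(f"{redirects} page redirect(s) found in raw dataset")


# Tiny in-memory stand-in for the raw dataset.
raw = [
    {"page_id": 1, "page_is_redirect": False},
    {"page_id": 2, "page_is_redirect": True},
]
print(count_redirects(raw))
```

Exposing the counter separately from the assertion lets the same number be reported as a data quality metric even when the check passes.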