# ImageMatching merge requests
https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests

## [Automate generation of .tsv files](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/1) (2021-02-15)
*Created by: clarakosi*
- Refactored the first part of the algorithm for use with papermill in algorithm.ipynb
- Added algorunner.py, a script for running algorithm using papermill
- Updated README with instructions for getting started

## [Production data etl](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/2) (2021-02-16)
*Created by: gmodena*
Transform raw data to the production (PoC) schema.
This PR adds a PySpark ETL that generates PoC data from
the notebook's raw output.

## [T274798 include all unillustrated articles](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/4) (2021-02-23)
*Created by: gmodena*
Raw data can contain records with NULL image_id.
The `ImageMatching` dataset should include all unillustrated articles,
with or without candidate matches.
This PR updates the algorithm code and the production
dataset ETL to account for this new behaviour.
Articles with no matches will be saved with an empty
`top_candidates` field in the raw dataset. These records will
be stored with empty (`""`) `image_id`, `source`, `confidence_rating` fields
in prod data. An example of a prod dataset with empty suggestions
can be found in `gmodena.imagerec_prod`.
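A minimal Python sketch of the behaviour described above (this is not the actual `etl/transform.py` code; record shapes and helper names are hypothetical): an unillustrated article with no candidate matches still yields one prod row, with empty-string `image_id`, `source`, and `confidence_rating` fields.

```python
import json


def _base(article):
    """Fields shared by every prod row for an article (hypothetical shape)."""
    return {"page_id": article["page_id"], "page_title": article["page_title"]}


def to_prod_rows(article):
    """Expand one raw record into prod rows.

    `article` carries a `top_candidates` JSON string that may be empty
    for unillustrated articles with no suggestions.
    """
    raw = article.get("top_candidates") or ""
    candidates = json.loads(raw) if raw else []
    if not candidates:
        # No matches: keep the article, with empty suggestion fields.
        return [{**_base(article),
                 "image_id": "", "source": "", "confidence_rating": ""}]
    return [{**_base(article),
             "image_id": c["image"],
             "source": c["source"],
             "confidence_rating": c["rating"]}
            for c in candidates]
```

With this shape, downstream counts can distinguish articles that received suggestions from those that did not.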
### Example
```
hive (gmodena)> select count(*) from gmodena.imagerec_prod where image_id is not null;
104342
hive (gmodena)> select count(*) from gmodena.imagerec_prod where image_id is null;
39518
```
### Changelog
* `algorithm.ipynb` has been modified to save all articles
that we consider unillustrated.
* `etl/transform.py` has been updated to handle raw data records
with an empty `top_candidates` field.
* `ddl/external_imagerec_prod.hql` has been modified so that empty
strings are treated as `NULL` by Hive (providing SQL NULL semantics).

## [T275162 enable spark metrics collection](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/5) (2021-03-04)
*Created by: gmodena*
This PR improves Spark session creation in notebooks and scripts,
enables control over metrics collection,
and aligns with resource utilisation patterns defined by Analytics.
Changelog:
* For notebooks, we initialise the SparkSession using the `wmfdata` library.
* For scripts, the recommended way is to specify a `spark.properties` file.
* A `metrics.properties.template` file is also provided, showing
how to plug in custom configuration files.

## [T275685 automate pytest](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/6) (2021-03-04)
*Created by: gmodena*
This PR adds a GitHub Action to lint (flake8) and test the algorunner and ETL code.
The build logic is implemented in `Makefile`, and is invoked by
`.github/workflows/build.yml`. Moving forward, we can port this logic
to gerrit/blubber with (hopefully) reasonably low overhead.
The build status can be found at
- https://github.com/mirrys/ImageMatching/actions/workflows/build.yml
- https://github.com/mirrys/ImageMatching/workflows/build/badge.svg?branch=T275685-automate-pytest
A badge with the build status of the `main` branch has been added to `README.md`.
While the `Makefile` was built for use in a CI system, it will work on any *nix host that
satisfies the following dependencies:
- Java JDK8
- Python 3.7

## [T275685 generate production datasets](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/7) (2021-03-16)
*Created by: gmodena*
This PR adds the capability to automate end-to-end generation of production datasets.
For more details see the comments in `publish.sh`. This script will:
- run the notebook with the algorunner wrapper
- copy model output to HDFS and expose it via a Hive external table (available in Superset)
- run `etl/transform.py` to generate production data
- expose production data via a Hive external table (available in Superset)
- collect production datasets locally
Datasets will be created for the following wikis:
```
enwiki arwiki kowiki cswiki viwiki frwiki fawiki ptwiki ruwiki trwiki plwiki hewiki svwiki ukwiki huwiki hywiki srwiki euwiki arzwiki cebwiki dewiki bnwiki
```
### Use
`publish.sh <snapshot>`
Each time `publish.sh` is invoked, it records the following data under `runs/<run_id>`:
- `metrics`: a set of timing metrics generated by this script
- `Output`: raw model output in tsv format
- `imagerec_prod_${snapshot}`: production datasets in tsv format
- `regular.spark.properties`: spark properties file for the `transform.py` job
Each run has an associated, unique `<run_id>`. This UUID is propagated to the ETL transforms,
and will populate the `dataset_id` field in production datasets. This allows reconciling
a given dataset with the process that generated it.
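The run bookkeeping described above can be sketched in a few lines of Python (a hypothetical helper for illustration; `publish.sh` itself is a shell script, and the directory names follow the list above):

```python
import uuid
from pathlib import Path


def new_run(snapshot, base="runs"):
    """Derive the per-run paths keyed by a unique run_id (a UUID).

    The same run_id would be passed to the ETL as `dataset_id`, so any
    production record can be traced back to the run that produced it.
    """
    run_id = str(uuid.uuid4())
    run_dir = Path(base) / run_id
    outputs = {
        "metrics": run_dir / "metrics",
        "output": run_dir / "Output",
        "prod": run_dir / f"imagerec_prod_{snapshot}",
    }
    return run_id, outputs
```

Because the UUID is minted once per invocation, every artifact of a run shares the same `runs/<run_id>` prefix.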
### Example
```
$ ./publish.sh 2021-01-25
[...]
Datasets are available at runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/imagerec_prod_2021-01-25
Export summary
22 confidence_rating source
684441
240156 high wikidata
293089 low commons
1182152 medium wikipedia
$ ls runs/dc4c9aea-4e85-475f-9626-ad0909b92fb6/imagerec_prod_2021-02-25/
prod-arwiki-2021-02-25-wd_image_candidates.tsv prod-huwiki-2021-02-25-wd_image_candidates.tsv
prod-arzwiki-2021-02-25-wd_image_candidates.tsv prod-hywiki-2021-02-25-wd_image_candidates.tsv
prod-bnwiki-2021-02-25-wd_image_candidates.tsv prod-kowiki-2021-02-25-wd_image_candidates.tsv
prod-cebwiki-2021-02-25-wd_image_candidates.tsv prod-plwiki-2021-02-25-wd_image_candidates.tsv
prod-cswiki-2021-02-25-wd_image_candidates.tsv prod-ptwiki-2021-02-25-wd_image_candidates.tsv
prod-dewiki-2021-02-25-wd_image_candidates.tsv prod-ruwiki-2021-02-25-wd_image_candidates.tsv
prod-enwiki-2021-02-25-wd_image_candidates.tsv prod-srwiki-2021-02-25-wd_image_candidates.tsv
prod-euwiki-2021-02-25-wd_image_candidates.tsv prod-svwiki-2021-02-25-wd_image_candidates.tsv
prod-fawiki-2021-02-25-wd_image_candidates.tsv prod-trwiki-2021-02-25-wd_image_candidates.tsv
prod-frwiki-2021-02-25-wd_image_candidates.tsv prod-ukwiki-2021-02-25-wd_image_candidates.tsv
prod-hewiki-2021-02-25-wd_image_candidates.tsv prod-viwiki-2021-02-25-wd_image_candidates.tsv
```
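The per-wiki files in the listing above follow a fixed naming pattern; a hypothetical helper (not part of `publish.sh`) that reproduces it:

```python
def export_filename(wiki, snapshot):
    """Per-wiki export file name, matching the run directory listing."""
    return f"prod-{wiki}-{snapshot}-wd_image_candidates.tsv"
```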

## [T275165 dataset metrics](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/8) (2021-03-17)
*Created by: clarakosi*
### Acceptance Criteria
As a PET Data Engineer, I want the ability to generate a CSV file with the following metrics, so that I can have a baseline of how the pipeline performs.
- [ ] Total number of records (per wiki)
- [ ] Total number of images per page
- [ ] Per Wiki
- [ ] Summary of population statistics
- [ ] Size and counts of intermediate and final datasets
A better look at the Python notebook here: https://github.com/mirrys/ImageMatching/blob/f34ff48e430b0e83261f45fd754ee6f351db959f/Dataset_Metrics/Dataset_metrics.ipynb

## [Implement parsing of “instance of” fields in ImageMatching production datasets](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/9) (2021-03-22)
*Created by: clarakosi*
The Spark job we use to generate production datasets needs to parse the new "instance of" fields.
Acceptance criteria
- [x] Logic to parse the "instance of" json blob is implemented
- [x] Tests for this capability have been added
- [x] The number of articles with and without valid "instance of" metadata is known (add metric)

## [T277552 project jdata store as parquet](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/10) (2021-03-22)
*Created by: gmodena*
Project `instanceof` metadata in the model output.
This PR adds changes to export metadata related to the "instance of"
property of a Q-item. This information is stored as an appended column in the model output,
and propagates to the HDFS and Hive `imagerec` datasets.
This PR also adds a Spark job to upload and convert model output to Parquet. This has
been done to facilitate interoperability with Spark, and to handle schema migrations.
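A hedged Python sketch of the appended-column idea described above (the real implementation is a Spark job; the row layout and mapping shape here are hypothetical): the "instance of" metadata for an article's Q-item is serialized as JSON and appended as one extra column on a model-output row.

```python
import json


def append_instance_of(row, qitem_metadata):
    """Append a JSON-encoded "instance of" blob to a model-output row.

    `row` is one tsv record as a list of strings; `qitem_metadata` maps
    a Wikidata Q-id to its "instance of" claims (hypothetical shape).
    """
    qid = row[0]  # assume the first column holds the Q-id
    instance_of = qitem_metadata.get(qid, [])
    return row + [json.dumps(instance_of)]
```

Keeping the blob as a single JSON column keeps the tsv schema stable while letting the downstream transform parse it on demand.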
Closes: https://phabricator.wikimedia.org/T277552

## [Trigger Github workflow on pull requests](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/11) (2021-03-18)
*Created by: clarakosi*

## [Add page redirect counters](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/12) (2021-03-30)
*Created by: gmodena*
Add a check to verify that no "page redirect" article is present in the raw dataset.
Closes https://phabricator.wikimedia.org/T277560

## [T277776 add found on wiki](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/13) (2021-04-01)
*Created by: gmodena*
This PR adds a new `array<string> found_on` column to the production dataset generated by `transform.py`.
Hive metadata has been updated accordingly.
The dataset export script projects the list of strings as a `found_on` column that contains the list of wikis
as a CSV (`,`-separated) value.

## [Add a list of instances to filter](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/14) (2021-03-31)
*Created by: clarakosi*
**Acceptance criteria**
- [x] A list of "instance of" items to be filtered is available under version control
- [x] All articles that match the filter list have been filtered out
- [x] Metrics for items that have been filtered out have been added to the data quality reports

## [Bugfix: save all unillustrated articles.](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/15) (2021-03-31)
*Created by: gmodena*
Don't exclude any image source when building `allimages`.
Fixes https://phabricator.wikimedia.org/T278571

## [Add itwiki and eswiki to the PoC export list.](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/16) (2021-03-31)
*Created by: gmodena*
This PR adds two new languages to the list of production datasets we export:
* `eswiki`
* `itwiki`

## [Update export_prod_data doc](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/17) (2021-04-02)
*Created by: gmodena*
This PR updates documentation.

## [Fix metrics naming](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/18) (2021-04-07)
*Created by: clarakosi*

## [Filter image suggestions detected as "placeholder images"](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/20) (2021-04-08)
*Created by: clarakosi*
- [X] As a user of the Image Suggestion API, when I make a request for image suggestions, I expect that all images detected as a "placeholder image" have been filtered out
- [x] Miriam's validation query has been newly applied (see https://phabricator.wikimedia.org/T277828#6957015), and results should reflect 0 "placeholder images" found for representative wikis
- [x] A static list of "placeholder images" has been generated and stored in HDFS
- [x] The algorithm notebook has been updated to filter out "placeholder images"

## [Add android dataset scripts.](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/21) (2021-04-14)
*Created by: gmodena*
This PR adds scripts to generate a variant of the ImageMatching datasets suitable for Android clients.

## [Move wiki and poc_wiki lists to a config file](https://gitlab.wikimedia.org/repos/generated-data-platform/ImageMatching/-/merge_requests/22) (2021-04-19)
*Created by: gmodena*
This PR moves the `wiki` and `poc_wikis` definitions to a dedicated, authoritative config file.