
T274798 include all unillustrated articles

Created by: gmodena

Raw data can contain records with NULL image_id.

The ImageMatching dataset should include all unillustrated articles, with or without candidate matches.

This PR updates the algorithm code and the production dataset ETL to account for this new behaviour.

Articles with no matches will be saved with an empty top_candidates field in the raw dataset. In prod data, these records will be stored with empty ("") image_id, source, and confidence_rating fields. An example of a prod dataset with empty suggestions can be found in gmodena.imagerec_prod.
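
To illustrate the intended behaviour, here is a minimal sketch of how a raw record might be flattened into prod rows. It is not the code in etl/transform.py. Only top_candidates, image_id, source and confidence_rating are named in this PR; the function name to_prod_rows, the page_id key and the shape of each candidate are assumptions made for the example.

```python
from typing import Dict, Iterator, List


def to_prod_rows(raw: Dict) -> Iterator[Dict[str, str]]:
    """Yield one prod row per candidate, or a single empty-suggestion row.

    Hypothetical record layout: `page_id` and the per-candidate keys are
    assumptions, not the actual schema used by the ETL.
    """
    # Treat a missing, None, or empty top_candidates field the same way.
    candidates: List[Dict] = raw.get("top_candidates") or []
    if not candidates:
        # Unillustrated article with no matches: keep it, with empty fields.
        yield {
            "page_id": raw["page_id"],
            "image_id": "",
            "source": "",
            "confidence_rating": "",
        }
        return
    for candidate in candidates:
        yield {
            "page_id": raw["page_id"],
            "image_id": candidate["image_id"],
            "source": candidate["source"],
            "confidence_rating": candidate["confidence_rating"],
        }
```

With this shape, every unillustrated article produces at least one row, so articles without candidates are no longer dropped from the prod dataset.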

Example

hive (gmodena)> select count(*) from gmodena.imagerec_prod where image_id is not null;
104342
hive (gmodena)> select count(*) from gmodena.imagerec_prod where image_id is null;
39518

Changelog

  • algorithm.ipynb has been modified to save all articles that we consider unillustrated.

  • etl/transform.py has been updated to handle raw data records with an empty top_candidates field.

  • ddl/external_imagerec_prod.hql has been modified so that Hive reads empty strings as NULL (and so provides SQL NULL semantics).
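
The effect of the DDL change can be emulated outside Hive. The sketch below assumes a hypothetical two-column, tab-separated export (page id, image_id); the real prod schema has more fields. In Hive, this mapping is commonly configured with the serialization.null.format table property, though this description does not state which mechanism the DDL uses.

```python
import csv
import io

# Hypothetical sample data: the second row has an empty image_id.
SAMPLE = "page_1\tExample.jpg\npage_2\t\n"


def read_rows(text):
    """Parse TSV text, mapping empty strings to None (SQL NULL-like)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [[field if field != "" else None for field in row] for row in reader]


rows = read_rows(SAMPLE)
# Analogous to `where image_id is not null` / `where image_id is null`.
with_image = sum(1 for _, image_id in rows if image_id is not None)
without_image = sum(1 for _, image_id in rows if image_id is None)
```

Mapping the empty string to a true null at read time means consumers can filter with a plain null check, as in the Hive queries in the Example section, rather than comparing against "".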
