Compute the search index delta against the `discovery.cirrus_index_without_content` Hive table (!38) · Merge requests · repos / structured-data / Image Suggestions

Marco Fossati requested to merge T338013 into main Dec 15, 2023

This MR radically changes how the search index delta is computed.

Highlights

- Use the latest production Cirrus search index dump as the previous state.
- The relevant snapshot is 1 day before our current one, so the explicit previous snapshot parameter is no longer needed.
- Sort weighted tag values lexicographically. This is only relevant to Commons, as other wikis have single boolean values.
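
To illustrate the second highlight, here is a minimal sketch of deriving the previous snapshot from the current one. `previous_day_snapshot` is a hypothetical helper written for this example, and it assumes snapshots are ISO `YYYY-MM-DD` strings; it is not the code behind `get_cirrus_index_snapshot`, which may format or resolve the snapshot differently.

```python
from datetime import date, timedelta


def previous_day_snapshot(snapshot: str) -> str:
    """Return the ISO date one day before `snapshot`.

    Hypothetical helper for illustration only: assumes `snapshot`
    is a YYYY-MM-DD string.
    """
    return (date.fromisoformat(snapshot) - timedelta(days=1)).isoformat()


print(previous_day_snapshot('2023-11-20'))  # 2023-11-19
```

Because the previous state can be derived this way, callers only need to pass the current snapshot.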

Caveat

The delta now differs from the one produced by the previous implementation. Sorting seems to affect the output, but we couldn't find a better way to compare against the new dataset: it has a different shape and therefore requires pre-processing, see `shared.prepare_cirrus_index_tags`. If we didn't sort tag values on both states, the computation would be incorrect.
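
The role of sorting can be sketched in plain Python. This is a simplified analogue under stated assumptions, not the Spark logic in `compute_search_index_delta`: each state is modelled as a dict from a made-up `(page_id, tag)` key to a list of values, and the function names and sample values are hypothetical. The assumption being illustrated is that list comparison is order-sensitive, so the same values in a different order would otherwise look like a change.

```python
def normalize(state):
    """Sort each key's values lexicographically so order does not matter."""
    return {key: sorted(values) for key, values in state.items()}


def naive_delta(previous, current):
    """Keep entries of `current` whose values differ from `previous`, as-is."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


def sorted_delta(previous, current):
    """Same comparison, but after sorting values on both states."""
    return naive_delta(normalize(previous), normalize(current))


# Made-up sample data: same values on both sides, different order.
previous = {(1, 'example.tag'): ['Q9|5', 'Q10|5']}
current = {(1, 'example.tag'): ['Q10|5', 'Q9|5']}

print(naive_delta(previous, current))   # spurious change reported
print(sorted_delta(previous, current))  # no change reported
```

Note that the sort is lexicographical on strings, so `'Q10|5'` sorts before `'Q9|5'`; what matters for the comparison is only that both states use the same ordering.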

The following code snippet runs the new workflow and compares its row count with that of the previous implementation:

```python
from wmfdata.spark import create_session
from image_suggestions.shared import (
    compute_search_index_delta,
    get_cirrus_index_snapshot,
    load_cirrus_index_tags,
    prepare_cirrus_index_tags,
)

spark = create_session(type='yarn-large', ship_python_env=True)
snapshot = '2023-11-20'

# Full search index state
full = spark.read.table(
    'analytics_platform_eng.image_suggestions_search_index_full'
).where(f'snapshot="{snapshot}"')
# Previous implementation
prod = spark.read.table(
    'analytics_platform_eng.image_suggestions_search_index_delta'
).where(f'snapshot="{snapshot}"')

# Previous state: load the Cirrus index dump and pre-process it for comparison
cirrus = load_cirrus_index_tags(spark, get_cirrus_index_snapshot(snapshot))
previous = prepare_cirrus_index_tags(cirrus)

# Current implementation
dev = compute_search_index_delta(previous, full.drop('snapshot'))

# Row counts: previous implementation vs. current implementation
prod.count(), dev.count()
# (124667, 316564)
```

Bug: T338013
