
Compute the search index delta against the `discovery.cirrus_index_without_content` Hive table

Marco Fossati requested to merge T338013 into main

This MR changes how the search index delta is computed: the previous state is now read from the `discovery.cirrus_index_without_content` Hive table.

Highlights

  • use the latest production Cirrus search index dump as the previous state
  • the relevant snapshot is 1 day before the current one, so the explicit previous snapshot parameter is no longer needed
  • sort weighted tag values lexicographically. This is only relevant to Commons, as other wikis have single boolean values
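
The last two highlights can be illustrated with a minimal sketch in plain Python. Both helpers below are hypothetical and are not the functions in `image_suggestions.shared`. The sketch assumes snapshots are ISO dates such as `2023-11-20` and uses made-up tag values of the form `value|weight`.

```python
from datetime import date, timedelta


def previous_day_snapshot(snapshot: str) -> str:
    """Return the ISO date one day before `snapshot` (hypothetical helper)."""
    return (date.fromisoformat(snapshot) - timedelta(days=1)).isoformat()


def sort_tag_values(values):
    """Sort weighted tag values lexicographically, i.e. as plain strings."""
    return sorted(values)


print(previous_day_snapshot('2023-11-20'))
# 2023-11-19

# Made-up values: strings are compared character by character,
# so 'Q1000|50' sorts before 'Q42|90' even though 1000 > 42.
print(sort_tag_values(['Q42|90', 'Q5|100', 'Q1000|50']))
# ['Q1000|50', 'Q42|90', 'Q5|100']
```

The sort order itself carries no meaning here. What matters is that it is deterministic, so the same set of values always yields the same sequence.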

Caveat

The delta now differs from the one produced by the previous implementation. Sorting seems to affect the output, but we couldn't find a better way to compare against the new dataset: it has a different shape and therefore requires pre-processing, see `shared.prepare_cirrus_index_tags`. Without sorting tag values in both states, the computation would be incorrect.
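
To see why sorting is needed in both states, here is a toy sketch in plain Python. It is not the actual `compute_search_index_delta`: `compute_delta` is a hypothetical stand-in that only reports added or changed rows, and the page ID, tag name and values are made up. Each state maps a (page ID, tag) pair to a list of values, and lists that hold the same values in a different order compare as unequal.

```python
def compute_delta(previous, current, sort_values=True):
    """Return rows of `current` whose values differ from `previous`.

    Toy stand-in: each state maps (page_id, tag) to a list of values.
    Deletions are deliberately left out to keep the sketch short.
    """
    def normalize(state):
        return {
            key: tuple(sorted(values) if sort_values else values)
            for key, values in state.items()
        }

    prev, curr = normalize(previous), normalize(current)
    return {
        key: list(values)
        for key, values in curr.items()
        if prev.get(key) != values
    }


# Same values, different order in the two states (made-up data)
previous = {(1, 'image.linked.from.wikidata.p18'): ['Q5|100', 'Q42|90']}
current = {(1, 'image.linked.from.wikidata.p18'): ['Q42|90', 'Q5|100']}

# Without sorting, the unchanged row is wrongly reported as a change
print(compute_delta(previous, current, sort_values=False))
# {(1, 'image.linked.from.wikidata.p18'): ['Q42|90', 'Q5|100']}

# With sorting on both states, the delta is empty as expected
print(compute_delta(previous, current))
# {}
```

In this toy setting, skipping the sort inflates the delta with rows that did not actually change.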

The following code snippet runs the new workflow and compares its row count with that of the previous implementation:

from wmfdata.spark import create_session
from image_suggestions.shared import (
    compute_search_index_delta,
    get_cirrus_index_snapshot,
    load_cirrus_index_tags,
    prepare_cirrus_index_tags,
)

spark = create_session(type='yarn-large', ship_python_env=True)
snapshot = '2023-11-20'

# Full search index state
full = spark.read.table('analytics_platform_eng.image_suggestions_search_index_full').where(f'snapshot="{snapshot}"')
# Previous implementation
prod = spark.read.table('analytics_platform_eng.image_suggestions_search_index_delta').where(f'snapshot="{snapshot}"')

cirrus = load_cirrus_index_tags(spark, get_cirrus_index_snapshot(snapshot))
previous = prepare_cirrus_index_tags(cirrus)

# Current implementation
dev = compute_search_index_delta(previous, full.drop('snapshot'))

# Row counts: previous vs. current implementation
prod.count(), dev.count()
# (124667, 316564)

Bug: T338013
