Compute the search index delta against the `discovery.cirrus_index_without_content` Hive table
This MR substantially changes how the search index delta is computed.
Highlights
- use the latest production Cirrus search index dump as the previous state
- the relevant snapshot is 1 day before the current one, which removes the need for an explicit previous-snapshot parameter
- sort weighted tag values lexicographically. This only matters for Commons, since other wikis have single boolean values
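As a rough illustration of the last two points, here is a minimal sketch in plain Python. The helper names `day_before` and `sort_tag_values`, and the `Q-id|weight` value format, are assumptions made for the example and are not taken from `image_suggestions.shared`.

```python
from datetime import date, timedelta


def day_before(snapshot: str) -> str:
    """Hypothetical helper: return the ISO date one day before `snapshot`."""
    return (date.fromisoformat(snapshot) - timedelta(days=1)).isoformat()


def sort_tag_values(values):
    """Hypothetical helper: sort tag values as plain strings (lexicographically)."""
    return sorted(values)


previous_snapshot = day_before("2023-11-20")
# Assumed value format: "<Q-id>|<weight>"
ordered = sort_tag_values(["Q2|10", "Q10|5", "Q1|7"])

print(previous_snapshot)  # 2023-11-19
print(ordered)            # ['Q10|5', 'Q1|7', 'Q2|10']
```

Note that the order is purely string-based: `Q10|5` sorts before `Q1|7` because `0` compares lower than `|`. That is fine here, since the goal is only a stable, comparable order on both sides.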
Caveat
The delta now differs from the one produced by the previous implementation. Sorting seems to affect the output, but we couldn't find a better way to compare against the new dataset: it has a different shape and thus requires pre-processing, see `shared.prepare_cirrus_index_tags`.
If we didn't sort tag values on both states, the computation would be incorrect.
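To make the point about sorting concrete, here is a toy sketch in plain Python rather than Spark. It is not the actual `compute_search_index_delta`: the `normalize` and `compute_delta` functions, the `(wiki, page_id, tag)` key layout, and the tag name are all assumptions for illustration.

```python
def normalize(state):
    """Return a copy of `state` with every list of tag values sorted."""
    return {key: sorted(values) for key, values in state.items()}


def compute_delta(previous, current, sort_values=True):
    """Toy delta: entries of `current` whose values differ from `previous`."""
    if sort_values:
        previous, current = normalize(previous), normalize(current)
    return {
        key: values
        for key, values in current.items()
        if previous.get(key) != values
    }


# Hypothetical key layout: (wiki, page_id, tag)
key = ("commonswiki", 1, "example.weighted_tag")
previous = {key: ["Q2|10", "Q10|5"]}
current = {key: ["Q10|5", "Q2|10"]}  # same values, different order

unsorted_delta = compute_delta(previous, current, sort_values=False)
sorted_delta = compute_delta(previous, current)

print(len(unsorted_delta))  # 1: a spurious change caused only by ordering
print(sorted_delta)         # {}
```

Without sorting, the same set of values in a different order is reported as a change. Sorting both states removes that spurious row.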
The following code snippet demonstrates the new workflow:
```python
from wmfdata.spark import create_session
from image_suggestions.shared import (
    get_cirrus_index_snapshot,
    load_cirrus_index_tags,
    prepare_cirrus_index_tags,
    compute_search_index_delta,
)

spark = create_session(type='yarn-large', ship_python_env=True)
snapshot = '2023-11-20'

# Full search index state
full = spark.read.table('analytics_platform_eng.image_suggestions_search_index_full').where(f'snapshot="{snapshot}"')
# Previous implementation
prod = spark.read.table('analytics_platform_eng.image_suggestions_search_index_delta').where(f'snapshot="{snapshot}"')

cirrus = load_cirrus_index_tags(spark, get_cirrus_index_snapshot(snapshot))
previous = prepare_cirrus_index_tags(cirrus)
# Current implementation
dev = compute_search_index_delta(previous, full.drop('snapshot'))

prod.count(), dev.count()
# (124667, 316564)
```
Bug: T338013