Run page_content_change_to_wikitext_raw DAG serially.

Xcollazo requested to merge run_merge_into_serially into main

Events in event.rc1_mediawiki_page_content_change are skewed towards recent revision.rev_dt values, so consecutive hourly partitions mostly modify the same few hours.

Iceberg's MERGE INTO commits via optimistic concurrency. Because concurrent MERGE INTOs are likely to touch the same revision.rev_dt values, they are likely to conflict at commit time. It is therefore best to run the DAG serially.

Example (two consecutive hourly partitions share four of their top five modified hours):

SELECT DISTINCT date_trunc('HOUR', revision.rev_dt) as modified_hours
FROM  event.rc1_mediawiki_page_content_change
WHERE year = 2023
  AND month = 5
  AND day = 2
  AND hour = 18
ORDER BY modified_hours DESC -- we are ORing this later so let's get it in likelihood order now
LIMIT 5

+-------------------+
|modified_hours     |
+-------------------+
|2023-04-26 18:00:00|
|2023-04-26 17:00:00|
|2023-04-26 16:00:00|
|2023-04-26 15:00:00|
|2023-04-26 14:00:00|
+-------------------+

SELECT DISTINCT date_trunc('HOUR', revision.rev_dt) as modified_hours
FROM  event.rc1_mediawiki_page_content_change
WHERE year = 2023
  AND month = 5
  AND day = 2
  AND hour = 17
ORDER BY modified_hours DESC -- we are ORing this later so let's get it in likelihood order now
LIMIT 5

+-------------------+
|modified_hours     |
+-------------------+
|2023-04-26 17:00:00|
|2023-04-26 16:00:00|
|2023-04-26 15:00:00|
|2023-04-26 14:00:00|
|2023-04-26 13:00:00|
+-------------------+
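
To make the conflict argument concrete, here is a minimal Python sketch that takes the two result sets above and runs them through a toy optimistic-concurrency check. `ToyTable`, `CommitConflict`, `run_concurrently` and `run_serially` are hypothetical names made up for illustration. This is not Iceberg's API, and the real conflict validation works on data files and snapshots rather than on hours, so treat it only as a rough model of why overlapping writes fail.

```python
from datetime import datetime

# Top five modified hours copied from the two query results above
# (source partitions hour = 18 and hour = 17).
HOUR_18 = [
    "2023-04-26 18:00:00", "2023-04-26 17:00:00", "2023-04-26 16:00:00",
    "2023-04-26 15:00:00", "2023-04-26 14:00:00",
]
HOUR_17 = [
    "2023-04-26 17:00:00", "2023-04-26 16:00:00", "2023-04-26 15:00:00",
    "2023-04-26 14:00:00", "2023-04-26 13:00:00",
]


def parse_hours(rows):
    """Turn the printed timestamps into a set of datetime objects."""
    return {datetime.strptime(r, "%Y-%m-%d %H:%M:%S") for r in rows}


def overlapping_hours(a, b):
    """Hours that both merges would modify, sorted ascending."""
    return sorted(parse_hours(a) & parse_hours(b))


class CommitConflict(Exception):
    """Raised when a commit overlaps a write that landed after it started."""


class ToyTable:
    """Hypothetical stand-in for a table with optimistic commits (assumption)."""

    def __init__(self):
        self.log = []  # one set of touched hours per committed write

    def begin(self):
        # A writer remembers the table version it started from.
        return len(self.log)

    def commit(self, base_version, touched):
        # Reject the commit if any write committed since base_version
        # touched one of the same hours.
        for later in self.log[base_version:]:
            if later & touched:
                raise CommitConflict(sorted(later & touched))
        self.log.append(set(touched))


def run_concurrently():
    """Both merges start from the same version. Returns True on conflict."""
    table = ToyTable()
    base_a, base_b = table.begin(), table.begin()
    table.commit(base_a, parse_hours(HOUR_18))
    try:
        table.commit(base_b, parse_hours(HOUR_17))
    except CommitConflict:
        return True
    return False


def run_serially():
    """The second merge starts after the first commits. Returns True on conflict."""
    table = ToyTable()
    table.commit(table.begin(), parse_hours(HOUR_18))
    try:
        table.commit(table.begin(), parse_hours(HOUR_17))
    except CommitConflict:
        return True
    return False
```

In this model `run_concurrently()` reports a conflict because the two merges share four hours, while `run_serially()` does not, which is the behavior this change relies on.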