Run page_content_change_to_wikitext_raw DAG serially.
The events in event.rc1_mediawiki_page_content_change are skewed
towards recent revision.rev_dt values.
Iceberg MERGE INTO commits using optimistic concurrency. Because
concurrent MERGE INTOs are likely to touch the same revision.rev_dt
values, they are likely to conflict, so it is best to run the DAG serially.
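The conflict can be sketched with a toy model of optimistic commits. This is a deliberately simplified illustration, not Iceberg's API or its actual validation logic: ToyTable, CommitConflict, run_concurrently, run_serially and the hour labels are all hypothetical names made up for this sketch.

```python
class CommitConflict(Exception):
    """Raised when a commit overlaps with one made since the writer's base snapshot."""


class ToyTable:
    """Hypothetical stand-in for a table with optimistic commits (not Iceberg)."""

    def __init__(self):
        self.snapshot_id = 0
        self.touched_by_snapshot = {}  # snapshot id -> hours changed by that commit

    def begin(self):
        # A writer remembers the snapshot it started from.
        return self.snapshot_id

    def commit(self, base_snapshot, touched_hours):
        # Reject the commit if any snapshot created after the writer's base
        # changed one of the hours this writer also wants to change.
        for sid in range(base_snapshot + 1, self.snapshot_id + 1):
            if self.touched_by_snapshot[sid] & touched_hours:
                raise CommitConflict(f"overlaps with snapshot {sid}")
        self.snapshot_id += 1
        self.touched_by_snapshot[self.snapshot_id] = frozenset(touched_hours)
        return self.snapshot_id


def run_concurrently(table, hours_a, hours_b):
    # Both writers start from the same snapshot, then commit one after the other.
    base_a = table.begin()
    base_b = table.begin()
    table.commit(base_a, hours_a)
    table.commit(base_b, hours_b)


def run_serially(table, hours_a, hours_b):
    # The second writer only starts after the first has committed.
    table.commit(table.begin(), hours_a)
    table.commit(table.begin(), hours_b)


# Two hypothetical runs that both touch the 17:00 hour.
hours_a = {"18:00", "17:00"}
hours_b = {"17:00", "16:00"}

try:
    run_concurrently(ToyTable(), hours_a, hours_b)
    concurrent_conflicted = False
except CommitConflict:
    concurrent_conflicted = True

serial_table = ToyTable()
run_serially(serial_table, hours_a, hours_b)
```

In this model the concurrent pair fails on the second commit, while the serial pair produces two snapshots without error.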
Example:
SELECT DISTINCT date_trunc('HOUR', revision.rev_dt) as modified_hours
FROM event.rc1_mediawiki_page_content_change
WHERE year = 2023
AND month = 5
AND day = 2
AND hour = 18
ORDER BY modified_hours DESC -- we are ORing this later so let's get it in likelihood order now
LIMIT 5
+-------------------+
|modified_hours     |
+-------------------+
|2023-04-26 18:00:00|
|2023-04-26 17:00:00|
|2023-04-26 16:00:00|
|2023-04-26 15:00:00|
|2023-04-26 14:00:00|
+-------------------+
SELECT DISTINCT date_trunc('HOUR', revision.rev_dt) as modified_hours
FROM event.rc1_mediawiki_page_content_change
WHERE year = 2023
AND month = 5
AND day = 2
AND hour = 17
ORDER BY modified_hours DESC -- we are ORing this later so let's get it in likelihood order now
LIMIT 5
+-------------------+
|modified_hours     |
+-------------------+
|2023-04-26 17:00:00|
|2023-04-26 16:00:00|
|2023-04-26 15:00:00|
|2023-04-26 14:00:00|
|2023-04-26 13:00:00|
+-------------------+
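The overlap between the two result sets above can be checked with a few lines of Python. The values are copied from the query output; the variable names are only illustrative.

```python
# modified_hours returned for the hour = 18 partition
run_hour_18 = {
    "2023-04-26 18:00:00",
    "2023-04-26 17:00:00",
    "2023-04-26 16:00:00",
    "2023-04-26 15:00:00",
    "2023-04-26 14:00:00",
}

# modified_hours returned for the hour = 17 partition
run_hour_17 = {
    "2023-04-26 17:00:00",
    "2023-04-26 16:00:00",
    "2023-04-26 15:00:00",
    "2023-04-26 14:00:00",
    "2023-04-26 13:00:00",
}

# Hours that both runs would modify, most recent first.
shared_hours = sorted(run_hour_18 & run_hour_17, reverse=True)

print(f"{len(shared_hours)} of {len(run_hour_18)} hours are shared:")
for hour in shared_hours:
    print(hour)
```

Four of the five most recent hours are touched by both runs, which is why running them concurrently is likely to conflict.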