
Add visibility on backfills via broadcast join of wmf_raw.mediawiki_revision.

Xcollazo requested to merge use-rev-deleted into main

(Depends on !10 (merged))

In this MR we cache (df.cache()) and force a broadcast of a table derived from wmf_raw.mediawiki_revision, which provides the visibility (aka suppression) data.

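The plan is as expected: the table derived from wmf_raw.mediawiki_revision is read from the cache (InMemoryTableScan) and broadcast into the left outer join against wmf.mediawiki_wikitext_history (see the lines marked <<<<<<<):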

== Physical Plan ==
ReplaceData (40)
+- AdaptiveSparkPlan (39)
   +- == Final Plan ==
      Sort (19)
      +- ShuffleQueryStage (18), Statistics(sizeInBytes=50.8 GiB, rowCount=7.81E+6)
         +- Exchange (17)
            +- * Project (16)
               +- MergeRows (15)
                  +- * Sort (14)
                     +- * Project (13)
                        +- * Project (12)
                           +- * BroadcastHashJoin LeftOuter BuildRight (11)
                              :- * Filter (2)
                              :  +- Scan hive wmf.mediawiki_wikitext_history (1)
                              +- BroadcastQueryStage (10), Statistics(sizeInBytes=32.5 MiB, rowCount=8.73E+3)  <<<<<<<
                                 +- BroadcastExchange (9)
                                    +- * Filter (8)
                                       +- InMemoryTableScan (3)
                                             +- InMemoryRelation (4)
                                                   +- * Project (7)
                                                      +- * Filter (6)
                                                         +- Scan hive wmf_raw.mediawiki_revision (5)  <<<<<<<
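
For readers less familiar with the BroadcastHashJoin LeftOuter BuildRight node above, here is a minimal plain-Python sketch of the idea. It is not the Spark code from this MR. The small right-hand side is hashed once (the "build" side) and the large left-hand side is streamed past it, so the large table needs no shuffle for the join. The function name and the column names (wiki_db, revision_id, rev_id, rev_deleted) are assumptions for illustration only.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical join key: (wiki_db, revision id).
Key = Tuple[str, int]


def broadcast_left_outer_join(
    left_rows: Iterable[dict],
    right_rows: Iterable[dict],
) -> Iterator[dict]:
    """Left outer join that hashes the small right side and streams the left."""
    # Build phase: materialize the small side once as a hash map.
    build: Dict[Key, Optional[int]] = {
        (r["wiki_db"], r["rev_id"]): r["rev_deleted"] for r in right_rows
    }
    # Probe phase: every left row is emitted; unmatched rows get None.
    for row in left_rows:
        key = (row["wiki_db"], row["revision_id"])
        yield {**row, "rev_deleted": build.get(key)}


# Toy data standing in for the two tables (values are made up).
history = [
    {"wiki_db": "simplewiki", "revision_id": 1, "text": "a"},
    {"wiki_db": "simplewiki", "revision_id": 2, "text": "b"},
]
visibility = [{"wiki_db": "simplewiki", "rev_id": 2, "rev_deleted": 1}]

joined = list(broadcast_left_outer_join(history, visibility))
```

Every row of the large side survives the join. Rows without a match simply carry no visibility value, which is the behavior a left outer join gives in the plan.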

Manual test runs on enwiki and simplewiki show no measurable difference between this code and the code from !10 (merged).

Bug: T345183

Edited by Xcollazo
