Consider user_id, user_central_id and user_text on reconcile. Fix related bugs.

In this MR we:

  • Start considering user_id and user_text for reconcile.
  • If we reconcile a row, we now also emit its user_central_id. To support this, a new Python script fetches the localuser table from CentralAuth just before the reconcile starts.
  • We also fix a long-standing bug in which we were ingesting the performer instead of the revision.editor.
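As a rough illustration of the first two points, here is a minimal sketch in plain Python, not the actual reconcile code. The reason names are taken from the results below; the function names, the row dicts, and the `central_ids` lookup (standing in for the copied localuser table) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def reconcile_reasons(source: Optional[dict], target: Optional[dict]) -> List[str]:
    """Compare one revision as seen in the source and the target.

    Hypothetical helper: only covers presence plus the user_id and
    user_text checks this MR adds.
    """
    if source is None:
        return ["missing_from_source"]
    if target is None:
        return ["missing_from_target"]
    reasons = []
    if source["user_id"] != target["user_id"]:
        reasons.append("mismatch_user_id")
    if source["user_text"] != target["user_text"]:
        reasons.append("mismatch_user_text")
    return reasons


def with_central_id(row: dict, wiki: str,
                    central_ids: Dict[Tuple[str, int], int]) -> dict:
    """Attach user_central_id to a row that is about to be emitted.

    central_ids is an assumed in-memory mapping of (wiki, local user id)
    to central id; rows with no known mapping get None.
    """
    out = dict(row)
    out["user_central_id"] = central_ids.get((wiki, row["user_id"]))
    return out


# Made-up rows, for illustration only.
src = {"user_id": 7, "user_text": "NewName"}
tgt = {"user_id": 7, "user_text": "OldName"}
print(reconcile_reasons(src, tgt))
print(with_central_id(src, "simplewiki", {("simplewiki", 7): 1001}))
```

A renamed user is the typical case this would catch: the user_id still matches, but user_text does not.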

I have tested all this code in a Jupyter notebook against simplewiki. I have also tuned the copying of localuser.

Here are the results from simplewiki:

spark.sql("""
SELECT count(1) as count,
       reasons
FROM xcollazo.inconsistent_rows_of_mediawiki_content_history_v1
WHERE computation_class = 'all-of-wiki-time'
GROUP BY reasons
ORDER BY count DESC
""").show(truncate=False)

+-----+---------------------------------------------------------------------------+
|count|reasons                                                                    |
+-----+---------------------------------------------------------------------------+
|35448|[mismatch_user_text]                                                       |
|2543 |[missing_from_target]                                                      |
|944  |[page_was_restored, mismatch_user_text]                                    |
|522  |[mismatch_user_id, mismatch_user_text]                                     |
|124  |[missing_from_source, page_was_deleted]                                    |
|42   |[mismatch_page]                                                            |
|29   |[missing_from_source, page_was_deleted, page_was_restored]                 |
|27   |[missing_from_target, page_was_restored]                                   |
|16   |[page_was_restored, mismatch_user_id, mismatch_user_text]                  |
|1    |[mismatch_user_visibility, mismatch_comment_visibility, mismatch_user_text]|
|1    |[mismatch_user_visibility, mismatch_comment_visibility]                    |
+-----+---------------------------------------------------------------------------+

Indeed, a bunch of mismatch_user_text rows that we can now detect!
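Since reasons is an array, a row can count toward several reasons at once. A quick sketch in plain Python, using only the counts from the table above, tallies rows per individual reason; the variable names are just for illustration.

```python
from collections import Counter

# (count, reasons) pairs copied from the simplewiki results above.
grouped = [
    (35448, ["mismatch_user_text"]),
    (2543, ["missing_from_target"]),
    (944, ["page_was_restored", "mismatch_user_text"]),
    (522, ["mismatch_user_id", "mismatch_user_text"]),
    (124, ["missing_from_source", "page_was_deleted"]),
    (42, ["mismatch_page"]),
    (29, ["missing_from_source", "page_was_deleted", "page_was_restored"]),
    (27, ["missing_from_target", "page_was_restored"]),
    (16, ["page_was_restored", "mismatch_user_id", "mismatch_user_text"]),
    (1, ["mismatch_user_visibility", "mismatch_comment_visibility", "mismatch_user_text"]),
    (1, ["mismatch_user_visibility", "mismatch_comment_visibility"]),
]

# Add each group's row count to every reason it contains.
per_reason = Counter()
for count, reasons in grouped:
    for reason in reasons:
        per_reason[reason] += count

total_rows = sum(count for count, _ in grouped)

print(per_reason["mismatch_user_text"])  # 36931
print(total_rows)                        # 39697
```

So 36931 of the 39697 inconsistent rows carry mismatch_user_text, alone or combined with other reasons.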

Bug: T411803

Edited by Xcollazo
