Consider user_id, user_central_id and user_text on reconcile. Fix related bugs.
In this MR we:
- Start considering user_id and user_text for reconcile.
- If we reconcile a row, we now also emit its user_central_id. For this, we now have a new Python script to fetch the localuser table from CentralAuth just before starting the reconcile.
- We also fix a long-standing bug in which we were ingesting the performer instead of the revision.editor.
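To illustrate the first bullet, here is a minimal sketch of how per-row reasons could be derived once user_id and user_text are part of the comparison. It is not the code in this MR: the function name `mismatch_reasons`, the `COMPARED_FIELDS` tuple, and the dict-based rows are assumptions made for illustration only, and the real job runs in Spark and compares more columns.

```python
from typing import Dict, List, Optional

# Hypothetical field list for illustration; the real reconcile compares more columns.
COMPARED_FIELDS = ("user_id", "user_text")


def mismatch_reasons(source: Optional[Dict], target: Optional[Dict]) -> List[str]:
    """Return reason labels for one revision, given its source and target rows."""
    if source is None and target is None:
        return []
    if target is None:
        return ["missing_from_target"]
    if source is None:
        return ["missing_from_source"]
    # One reason per compared field whose values differ.
    return [
        f"mismatch_{field}"
        for field in COMPARED_FIELDS
        if source.get(field) != target.get(field)
    ]


# A rename changes user_text but not user_id.
print(mismatch_reasons(
    {"user_id": 7, "user_text": "OldName"},
    {"user_id": 7, "user_text": "NewName"},
))
```

Keeping the compared fields in one tuple means the reasons always come out in a stable order, which matters when grouping by the reasons array.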
I have tested all this code in a Jupyter notebook against simplewiki. I have also tuned the copying of localuser.
Here are the results from simplewiki:
spark.sql("""
    SELECT count(1) as count,
           reasons
    FROM xcollazo.inconsistent_rows_of_mediawiki_content_history_v1
    WHERE computation_class = 'all-of-wiki-time'
    GROUP BY reasons
    ORDER BY count DESC
""").show(truncate=False)
+-----+---------------------------------------------------------------------------+
|count|reasons                                                                    |
+-----+---------------------------------------------------------------------------+
|35448|[mismatch_user_text]                                                       |
|2543 |[missing_from_target]                                                      |
|944  |[page_was_restored, mismatch_user_text]                                    |
|522  |[mismatch_user_id, mismatch_user_text]                                     |
|124  |[missing_from_source, page_was_deleted]                                    |
|42   |[mismatch_page]                                                            |
|29   |[missing_from_source, page_was_deleted, page_was_restored]                 |
|27   |[missing_from_target, page_was_restored]                                   |
|16   |[page_was_restored, mismatch_user_id, mismatch_user_text]                  |
|1    |[mismatch_user_visibility, mismatch_comment_visibility, mismatch_user_text]|
|1    |[mismatch_user_visibility, mismatch_comment_visibility]                    |
+-----+---------------------------------------------------------------------------+
Indeed, there are a lot of mismatch_user_text rows that we can now detect!
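Since a row can carry several reasons, the grouped counts understate how often each individual reason occurs. As a quick sanity check, this small sketch tallies per-reason totals from the simplewiki output above. The counts are copied from that table; the variable names are just for illustration.

```python
from collections import Counter

# (count, reasons) pairs copied from the simplewiki output in this description.
rows = [
    (35448, ["mismatch_user_text"]),
    (2543, ["missing_from_target"]),
    (944, ["page_was_restored", "mismatch_user_text"]),
    (522, ["mismatch_user_id", "mismatch_user_text"]),
    (124, ["missing_from_source", "page_was_deleted"]),
    (42, ["mismatch_page"]),
    (29, ["missing_from_source", "page_was_deleted", "page_was_restored"]),
    (27, ["missing_from_target", "page_was_restored"]),
    (16, ["page_was_restored", "mismatch_user_id", "mismatch_user_text"]),
    (1, ["mismatch_user_visibility", "mismatch_comment_visibility", "mismatch_user_text"]),
    (1, ["mismatch_user_visibility", "mismatch_comment_visibility"]),
]

# Add each group's row count to every reason that appears in the group.
per_reason = Counter()
for count, reasons in rows:
    for reason in reasons:
        per_reason[reason] += count

print(per_reason["mismatch_user_text"])  # 36931
print(per_reason["mismatch_user_id"])    # 538
```

So across all groups, 36931 rows have a user_text mismatch and 538 have a user_id mismatch.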
Bug: T411803