Add temp_user as a value in the ipb_user_type field
The changes include:
- Adding temp_user as a value in the ipb_user_type field in the
trust_safety_block_daily
schema. - Adding temp_user as a value in the ipb_user_type field in the
trust_safety_block_monthly
schema. - Updating COMMENT for table creation.
- Making a few minor formatting adjustments.
Test Approach
Since the temp account is only available on testWiki, while the HQL queries target non-testWiki, I performed testing in two steps:
-
Tested the change of the regular expression on testWiki using the command below.
WITH t_b AS ( SELECT TO_DATE(TO_TIMESTAMP(log_timestamp,'yyyyMMddHHmmss')) AS date_time, wiki_db, log_action, log_title, -- If sitewide is 0, it's a partial block. If log_params is empty, it's a sitewide block. CASE WHEN log_params regexp '\"sitewide\"\;b:0' THEN 'partial' ELSE 'sitewide' END AS block_type, CASE WHEN log_title regexp '.*\/.*' THEN 'ip_range' WHEN log_title regexp '.*[\.:].*' THEN 'ip_single' WHEN log_title regexp '^[~]2' THEN 'temp_user' ELSE 'register_user' END AS ipb_user_type FROM wmf_raw.mediawiki_logging WHERE snapshot ='2024-08' AND log_type = 'block' AND wiki_db='testwiki' AND log_action IN ('block','reblock','unblock') AND TO_TIMESTAMP(log_timestamp,'yyyyMMddHHmmss') >= TO_TIMESTAMP('2024-08') AND TO_TIMESTAMP(log_timestamp,'yyyyMMddHHmmss') < TO_TIMESTAMP('2024-08') + INTERVAL 1 MONTH ) SELECT t_b.date_time, t_b.wiki_db, t_b.log_action, t_b.block_type, t_b.ipb_user_type, COUNT(1) AS blocks, COUNT(DISTINCT t_b.log_title) AS user_or_ip_blocked FROM t_b GROUP BY t_b.date_time, t_b.wiki_db,t_b.log_action, t_b.block_type,t_b.ipb_user_type
-
Tested the final HQL on all non-TestWiki environments using the following commands.
spark3-sql -f create_block_daily_table.hql -d table_name=jiawang_airflow_test.trust_safety_block_daily -d base_directory=/user/hive/warehouse/jiawang_airflow_test spark3-sql --master yarn --executor-memory 16G --executor-cores 8 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 \ -f generate_block_daily.hql -d source_table=wmf_raw.mediawiki_logging -d destination_table=jiawang_airflow_test.trust_safety_block_daily -d canonical_table=canonical_data.wikis -d snapshot=2024-08 \ -d coalesce_partitions=1 spark3-sql -f create_block_monthly_table.hql -d table_name=jiawang_airflow_test.trust_safety_block_monthly -d base_directory=/user/hive/warehouse/jiawang_airflow_test spark3-sql --master yarn --executor-memory 16G --executor-cores 8 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 \ -f generate_block_monthly.hql -d source_table=wmf_raw.mediawiki_logging -d destination_table=jiawang_airflow_test.trust_safety_block_monthly -d canonical_table=canonical_data.wikis -d snapshot=2024-08 \ -d coalesce_partitions=1