Skip to content

Add temp_user as a value in the ipb_user_type field

Jenniferwang requested to merge jiawang/data-pipelines:tsp_T376080 into main

The changes include:

  • Adding temp_user as a value in the ipb_user_type field in thetrust_safety_block_daily schema.
  • Adding temp_user as a value in the ipb_user_type field in the trust_safety_block_monthly schema.
  • Updating COMMENT for table creation.
  • Making a few minor formatting adjustments.

Test Approach

Since the temp account is only available on testWiki, while the HQL queries target non-testWiki, I performed testing in two steps:

  1. Tested the change of the regular expression on testWiki using the command below.

    WITH t_b AS (
        SELECT
            TO_DATE(TO_TIMESTAMP(log_timestamp,'yyyyMMddHHmmss'))  AS date_time,
            wiki_db,
            log_action,
            log_title,
            -- If sitewide is 0, it's a partial block. If log_params is empty, it's a sitewide block.
            CASE 
                WHEN log_params regexp '\"sitewide\"\;b:0' 
                THEN 'partial'  
                ELSE 'sitewide' 
            END AS block_type,
            CASE 
                WHEN log_title regexp '.*\/.*' THEN 'ip_range' 
                WHEN log_title regexp '.*[\.:].*' THEN 'ip_single'
                WHEN log_title regexp  '^[~]2' THEN 'temp_user'
                ELSE 'register_user' 
            END AS ipb_user_type
        FROM wmf_raw.mediawiki_logging
        WHERE snapshot ='2024-08' AND log_type = 'block' 
        AND wiki_db='testwiki'
        AND log_action IN ('block','reblock','unblock')
        AND TO_TIMESTAMP(log_timestamp,'yyyyMMddHHmmss')  >= TO_TIMESTAMP('2024-08')
        AND TO_TIMESTAMP(log_timestamp,'yyyyMMddHHmmss') < TO_TIMESTAMP('2024-08') + INTERVAL 1 MONTH
    )
    
    SELECT 
        t_b.date_time,
        t_b.wiki_db,
        t_b.log_action,
        t_b.block_type,
        t_b.ipb_user_type,
        COUNT(1) AS blocks,
        COUNT(DISTINCT t_b.log_title) AS user_or_ip_blocked
    FROM t_b
    GROUP BY t_b.date_time,  t_b.wiki_db,t_b.log_action, t_b.block_type,t_b.ipb_user_type
  2. Tested the final HQL on all non-TestWiki environments using the following commands.

    spark3-sql  -f create_block_daily_table.hql   -d table_name=jiawang_airflow_test.trust_safety_block_daily -d base_directory=/user/hive/warehouse/jiawang_airflow_test
    
    spark3-sql --master yarn --executor-memory 16G --executor-cores 8 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 \
    -f generate_block_daily.hql   -d source_table=wmf_raw.mediawiki_logging   -d destination_table=jiawang_airflow_test.trust_safety_block_daily  -d canonical_table=canonical_data.wikis   -d snapshot=2024-08   \
     	-d coalesce_partitions=1
    
    spark3-sql  -f create_block_monthly_table.hql   -d table_name=jiawang_airflow_test.trust_safety_block_monthly -d base_directory=/user/hive/warehouse/jiawang_airflow_test 
    
    spark3-sql --master yarn --executor-memory 16G --executor-cores 8 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 \
    -f generate_block_monthly.hql   -d source_table=wmf_raw.mediawiki_logging   -d destination_table=jiawang_airflow_test.trust_safety_block_monthly -d canonical_table=canonical_data.wikis   -d snapshot=2024-08  \
     	-d coalesce_partitions=1

Merge request reports