Skip to content
  • Aqu's avatar
    f8c6ac6e
    Refine staging: trigger via file existence · f8c6ac6e
    Aqu authored
    * Cleanup duplicated block in artifacts.yaml
    
    * Put in a variable the name of the queue
    
    * Linting
    
    * Prepare for release by adding more DagProperties
    
    * Move refine_to_hive from analytics to main
    
    * Linting
    
    * Post merge fixes
    
    * Fix util
    
    * Switch success flag to _REFINED
    
    * Add success flags at end of Refine staging
    
    This is to allow Refine monitor to run on event_alt DB.
    
    * Refine staging: trigger via file existence
    
    Relying on Gobblin-generated _IMPORTED flags is not reliable.
    In some cases, Gobblin does not produce these flags, such as
    when (next hour) canary events arrive too late.
    
    With this change, the triggering mechanism shifts to the sensor
    with the following rules:
    * Check for files within the current hour’s source directory.
    * Check for files within the subsequent hour’s source directory.
    
    These checks ensure that Refine only begins when we are
    certain the streaming pipelines have completed successfully,
    preventing unnecessary backfills and improving upon the previous
    approach. Additionally, Refine can now start sooner—once
    data for the next hour is detected—thus providing earlier access
    to downstream consumers.
    
    These rules also better align with Gobblin’s execution:
    * We now wait for events to be imported at Gobblin’s natural pace.
    * Hours without data are no longer errors, just scenarios where
      waiting is required.
    
    As a result:
    * We adjust tasks and sensors timeouts.
    * Add extra DAG runs for Refine to accommodate potentially
      more waiting times.
    
    Kept for later:
    * Remove the redundant generation of canary events.
    * Harmonize behaviour to webrequest* Refine dags
    
    Bug: T369845
    f8c6ac6e
    Refine staging: trigger via file existence
    Aqu authored
    * Cleanup duplicated block in artifacts.yaml
    
    * Put in a variable the name of the queue
    
    * Linting
    
    * Prepare for release by adding more DagProperties
    
    * Move refine_to_hive from analytics to main
    
    * Linting
    
    * Post merge fixes
    
    * Fix util
    
    * Switch success flag to _REFINED
    
    * Add success flags at end of Refine staging
    
    This is to allow Refine monitor to run on event_alt DB.
    
    * Refine staging: trigger via file existence
    
    Relying on Gobblin-generated _IMPORTED flags is not reliable.
    In some cases, Gobblin does not produce these flags, such as
    when (next hour) canary events arrive too late.
    
    With this change, the triggering mechanism shifts to the sensor
    with the following rules:
    * Check for files within the current hour’s source directory.
    * Check for files within the subsequent hour’s source directory.
    
    These checks ensure that Refine only begins when we are
    certain the streaming pipelines have completed successfully,
    preventing unnecessary backfills and improving upon the previous
    approach. Additionally, Refine can now start sooner—once
    data for the next hour is detected—thus providing earlier access
    to downstream consumers.
    
    These rules also better align with Gobblin’s execution:
    * We now wait for events to be imported at Gobblin’s natural pace.
    * Hours without data are no longer errors, just scenarios where
      waiting is required.
    
    As a result:
    * We adjust tasks and sensors timeouts.
    * Add extra DAG runs for Refine to accommodate potentially
      more waiting times.
    
    Kept for later:
    * Remove the redundant generation of canary events.
    * Harmonize behaviour to webrequest* Refine dags
    
    Bug: T369845
Loading