Draft: Refine staging: trigger via file existence
Gobblin-generated _IMPORTED flags are not a reliable trigger. In some cases Gobblin does not produce them, for example when canary events for the next hour arrive too late.
With this change, triggering moves to the sensor, which applies the following rules:
- Check for files within the current hour’s source directory.
- Check for files within the subsequent hour’s source directory.
These checks ensure that Refine starts only once we are certain the streaming pipelines have completed successfully, which prevents unnecessary backfills and improves on the previous approach. Refine can also start sooner, as soon as data for the next hour is detected, giving downstream consumers earlier access.
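For illustration only, the two rules can be sketched as a single readiness check. This is a minimal sketch, not the sensor code in this change: it uses the local filesystem instead of HDFS, and the directory layout and the names `hour_dir`, `has_files` and `ready_to_refine` are hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path


def hour_dir(base: Path, hour: datetime) -> Path:
    # Hypothetical layout: <base>/year=YYYY/month=MM/day=DD/hour=HH
    return (
        base
        / f"year={hour:%Y}"
        / f"month={hour:%m}"
        / f"day={hour:%d}"
        / f"hour={hour:%H}"
    )


def has_files(directory: Path) -> bool:
    # True only if the directory exists and holds at least one regular file.
    return directory.is_dir() and any(p.is_file() for p in directory.iterdir())


def ready_to_refine(base: Path, hour: datetime) -> bool:
    # Rule 1: files exist in the current hour's source directory.
    # Rule 2: files exist in the subsequent hour's source directory.
    next_hour = hour + timedelta(hours=1)
    return has_files(hour_dir(base, hour)) and has_files(hour_dir(base, next_hour))
```

Requiring files in the subsequent hour is what signals that the current hour is complete, so the check returns False until both directories are populated.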
These rules also better align with Gobblin’s execution:
- We now wait for events to be imported at Gobblin’s natural pace.
- Hours without data are no longer treated as errors, just as cases where waiting is required.
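The waiting behaviour described above can be sketched as a polling loop in which an empty hour is never an error in itself; only the timeout ends the wait. This is an assumed, simplified model, loosely analogous to a sensor's poke interval and timeout. The function `wait_until_ready` and its parameters are hypothetical and not part of this change.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float,
    poke_interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    A False result from `check` (no data yet) is not an error:
    the loop simply waits and tries again.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            # Timed out: the caller decides how to handle this.
            return False
        sleep(poke_interval_s)
```

The clock and sleep functions are injectable so the loop can be exercised without real delays.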
As a result:
- Adjust task and sensor timeouts.
- Remove the redundant generation of canary events.
- Add extra DAG runs for Refine to accommodate potentially longer waiting times.
Bug: T369845