Fix monitoring-pod leak in KubernetesJobOperator by jykae · Pull Request #67333 · apache/airflow

jykae · 2026-05-22T13:06:43Z

Fix monitoring-pod leak in KubernetesJobOperator.

KubernetesJobOperator inherits from KubernetesPodOperator but
overrode execute() without ever invoking the parent's pod-cleanup path,
so the "monitoring" pods discovered via get_pods() (used to stream
logs and XCom while the Job runs) were never deleted. These pods are
created by Airflow, not by the V1Job controller, so they have no
ownerReferences — neither ttl_seconds_after_finished nor the
foreground cascade on on_kill() reaped them. Every task run leaked
one pod per Job.

This PR makes pod cleanup symmetric with KubernetesPodOperator and
honours on_finish_action / on_kill_action for the discovered pods.

Changes

operators/job.py

- execute() and execute_complete() now wrap their work in
  try/finally and call post_complete_action() for every pod
  returned by get_pods(). The inherited on_finish_action
  (delete_pod / delete_succeeded_pod / delete_active_pod /
  keep_pod) is now respected, matching KubernetesPodOperator
  semantics.
- on_kill() additionally calls pod_manager.delete_pod() for each
  monitoring pod, gated by on_kill_action. The Job's foreground
  cascade does not reach these pods because they have no
  ownerReferences. Unexpected ApiExceptions are logged instead
  of silently suppressed.
- execute_complete() resolves monitoring pods once and shares the
  lookup between the log-retrieval and cleanup paths. Resolution is
  best-effort: failures in the deferrable resume path no longer break
  cleanup.
- Per-pod cleanup errors are logged but never mask the in-flight
  exception, so Job-level failures continue to propagate unchanged.
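
The cleanup pattern described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `run_with_cleanup` and its three callables are hypothetical stand-ins for the operator's work, `get_pods()`, and `post_complete_action()`.

```python
import logging

log = logging.getLogger(__name__)


def run_with_cleanup(work, get_pods, cleanup_pod):
    """Run work(), then attempt cleanup for every pod, even if work() raised.

    All three arguments are hypothetical callables used for illustration only.
    """
    try:
        return work()
    finally:
        # Pod resolution is best-effort: a lookup failure must not break cleanup
        # or replace the exception raised by work().
        try:
            pods = list(get_pods())
        except Exception:
            log.exception("Could not resolve monitoring pods; skipping cleanup")
            pods = []
        for pod in pods:
            try:
                cleanup_pod(pod)
            except Exception:
                # Logged, not re-raised, so the in-flight exception (if any)
                # keeps propagating and the remaining pods are still processed.
                log.exception("Failed to clean up pod %s", pod)
```

Because nothing is raised from inside the `finally` block, a failure in `work()` reaches the caller unchanged, and one failing pod does not stop cleanup of the others.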

triggers/job.py

The trigger event now always includes pod_names /
pod_namespace, regardless of get_logs. This guarantees
execute_complete() can reliably clean up monitoring pods even
when log streaming is disabled.
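
As a rough sketch of the idea, the event can carry the pod identity unconditionally, so the resumed task never has to guess. Only the `pod_names` and `pod_namespace` keys come from this PR; the function names and the remaining keys are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


def build_job_event(
    job_name: str,
    namespace: str,
    status: str,
    pod_names: list[str],
    pod_namespace: str,
) -> dict[str, Any]:
    """Build a trigger event that always carries the monitoring-pod identity.

    Hypothetical helper; the "name", "namespace" and "status" keys are assumed.
    """
    return {
        "name": job_name,
        "namespace": namespace,
        "status": status,
        # Included unconditionally, not only when get_logs is enabled.
        "pod_names": list(pod_names),
        "pod_namespace": pod_namespace,
    }


def pods_to_clean_up(event: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (namespace, name) pairs the resumed task should clean up."""
    namespace = event.get("pod_namespace")
    names = event.get("pod_names") or []
    if not namespace:
        return []
    return [(namespace, name) for name in names]
```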

docs/operators.rst

New section documenting the cleanup contract: which pods are affected,
the meaning of each on_finish_action value for monitoring pods, and
the on_kill_action behaviour.
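
The decision this contract describes can be approximated by a small function. This is a simplified sketch with a local stand-in enum, not the provider's implementation. The delete_active_pod rule (delete only pods that have not reached a terminal phase) is an assumption, and the real operator may consider container state as well as pod phase.

```python
from enum import Enum


class OnFinishAction(str, Enum):
    """Local stand-in mirroring the four values named in this PR."""

    KEEP_POD = "keep_pod"
    DELETE_POD = "delete_pod"
    DELETE_SUCCEEDED_POD = "delete_succeeded_pod"
    DELETE_ACTIVE_POD = "delete_active_pod"


# Kubernetes pod phases that indicate the pod has finished.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def should_delete(action: str, phase: str) -> bool:
    """Decide whether a monitoring pod in the given phase should be deleted."""
    resolved = OnFinishAction(action)
    if resolved is OnFinishAction.DELETE_POD:
        return True
    if resolved is OnFinishAction.DELETE_SUCCEEDED_POD:
        return phase == "Succeeded"
    if resolved is OnFinishAction.DELETE_ACTIVE_POD:
        # Assumption: "active" means not yet in a terminal phase.
        return phase not in TERMINAL_PHASES
    return False  # keep_pod
```

Under these assumptions, `should_delete("delete_succeeded_pod", "Failed")` is `False`, which matches the "failing command" case in the verification steps.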

Tests

- Coverage for each on_finish_action value (delete_pod,
  delete_succeeded_pod, delete_active_pod, keep_pod) on
  both success and failure paths.
- Coverage for on_kill_action (delete_pod / keep_pod).
- Regression test for the deferrable get_logs=False path.
- New mocks use spec / autospec to catch attribute typos
  against the real kubernetes client surface.
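
To illustrate why spec'd mocks matter, the sketch below contrasts a plain `MagicMock` with `MagicMock(spec=...)` and `create_autospec`. `FakePodManager` is a hypothetical stand-in class, not part of the kubernetes client or the provider.

```python
from unittest import mock


class FakePodManager:
    """Hypothetical stand-in for a real client class."""

    def delete_pod(self, pod):
        raise NotImplementedError


def typo_is_caught(manager) -> bool:
    """Return True if calling a misspelled method raises AttributeError."""
    try:
        manager.delete_pdo("pod-a")  # deliberate typo of delete_pod
    except AttributeError:
        return True
    return False


# A plain MagicMock invents any attribute on access, so the typo goes unnoticed.
loose = mock.MagicMock()

# A spec'd mock only exposes attributes that exist on the spec class.
strict = mock.MagicMock(spec=FakePodManager)

# create_autospec additionally enforces method signatures.
autospecced = mock.create_autospec(FakePodManager, instance=True)

print(typo_is_caught(loose))        # False
print(typo_is_caught(strict))       # True
print(typo_is_caught(autospecced))  # True
```

With the autospec'd instance, calling `autospecced.delete_pod()` without the `pod` argument also raises `TypeError`, which a plain `MagicMock` would accept silently.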

Backwards compatibility

Default on_finish_action is unchanged (delete_pod), so existing
deployments will start reclaiming the leaked monitoring pods
automatically. Users who relied on monitoring pods surviving the task
(e.g. for offline log inspection) can opt in explicitly by passing
on_finish_action="keep_pod".

How to verify

- Run a DAG using KubernetesJobOperator with default settings.
  After the task finishes, both the V1Job's child pod and the
  monitoring pod (label airflow_kpo_in_cluster=True, no
  ownerReferences) should be gone.
- Repeat with on_finish_action="delete_succeeded_pod" and a
  failing command: the monitoring pod should remain for forensics.
- Repeat with on_finish_action="keep_pod": both pods should
  remain.
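
One way to check the first step is to inspect the pod list for pods that carry the label and have no owner. The helper below is a hypothetical sketch that parses the JSON printed by `kubectl get pods -o json`. It only tests for the presence of the label key, which is a simplification.

```python
from __future__ import annotations

import json


def find_leaked_monitoring_pods(
    pod_list_json: str, label: str = "airflow_kpo_in_cluster"
) -> list[str]:
    """Return names of pods that carry `label` and have no ownerReferences.

    `pod_list_json` is expected to be a Kubernetes PodList serialized as JSON.
    """
    items = json.loads(pod_list_json).get("items") or []
    leaked = []
    for pod in items:
        meta = pod.get("metadata") or {}
        labels = meta.get("labels") or {}
        has_owner = bool(meta.get("ownerReferences"))
        if label in labels and not has_owner:
            leaked.append(meta.get("name"))
    return leaked
```

After a run with default settings, the returned list should be empty for the task's namespace.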

Was generative AI tooling used to co-author this PR?

Yes — GitHub Copilot (Claude Opus 4.7)

Generated-by: GitHub Copilot (Claude Opus 4.7) following the guidelines

boring-cyborg · 2026-05-22T13:06:49Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

* fix(providers/cncf/kubernetes): clean up monitoring pods in KubernetesJobOperator

  KubernetesJobOperator inherited from KubernetesPodOperator but overrode execute() without calling post_complete_action(), so the monitoring / log-streaming pods discovered via get_pods() were never deleted. These pods have no ownerReferences to the V1Job, so ttl_seconds_after_finished and the Foreground cascade in on_kill don't reap them either.

  - execute() and execute_complete() now wrap their work in try/finally and call post_complete_action() for each pod in self.pods. on_finish_action (delete_pod / delete_succeeded_pod / keep_pod) is now honoured.
  - on_kill() additionally calls pod_manager.delete_pod() for each monitoring pod (the Job's foreground cascade doesn't reach them).
  - Per-pod cleanup errors are logged but never mask the in-flight exception, so Job-level failures keep propagating.
  - execute_complete() resolves monitoring pods once and shares the lookup between the log-retrieval path and the cleanup path.
  - Added unit tests, a bugfix newsfragment, and an operators.rst section documenting the cleanup contract.

* Address code review feedback: remove dead PodNotFoundException check, drop unused import, relax pod-deletion ordering in test, fix trailing comma

* Potential fix for pull request finding

  In _cleanup_monitoring_pods, remote_pod is resolved via find_pod(), which is designed to locate a single matching pod by task-instance labels and can invoke duplicate-pod resolution logic (process_duplicate_label_pods). For KubernetesJobOperator with parallelism > 1, this lookup can return the wrong pod (or trigger duplicate-handling side effects), so post_complete_action() may receive a mismatched remote_pod. Consider using the already-discovered pod’s name/namespace to refresh state (e.g. via hook.get_pod) or just pass remote_pod=pod when you already have the V1Pod object from get_pods().

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Use isinstance(exc, TaskDeferred) instead of brittle string comparison

* Potential fix for pull request finding

  The new unit tests add several mock.MagicMock() instances (pods, jobs, TI, etc.) without spec/autospec, and some patch() usages also create non-spec'd mocks by default. Using autospec=True on patches and create_autospec(...)/MagicMock(spec=...) for key Kubernetes objects helps catch typos/attribute mismatches in these tests and aligns with Airflow’s test hardening guidance.

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Address PR review comments: fix trigger pod_names, on_kill logging, and test assertions

  - triggers/job.py: Always include pod_names/pod_namespace in trigger event regardless of get_logs setting, so execute_complete() can reliably clean up monitoring pods even when get_logs=False
  - operators/job.py: Log unexpected ApiException in on_kill() instead of suppressing all ApiExceptions; remove unused `suppress` import
  - tests/test_job.py: Rewrite test_execute_respects_keep_pod and test_execute_deletes_pod_default to keep process_pod_deletion real and assert on pod_manager.delete_pod; stub hook.get_pod for remote_pod resolution
  - tests/test_job.py: Add regression test for get_logs=False deferrable path

* Fix orphaned test_on_kill_deletes_monitoring_pods method body after accidental deletion of method signature
* Make pod resolution best-effort in execute_complete
* Address remaining KubernetesJobOperator review comments
* Finalize review-comment fixes for KubernetesJobOperator
* Fix remaining KubernetesJobOperator review comments
* Update KubernetesJobOperator docs for action semantics
* Improve KubernetesJobOperator newsfragment readability

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Ville Jyrkkä <vjyrkka@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

jykae requested review from hussein-awala, jedcunningham and jscheffl as code owners May 22, 2026 13:06

boring-cyborg Bot added the area:providers, kind:documentation and provider:cncf-kubernetes labels May 22, 2026

jykae force-pushed the main branch from 7553081 to 1bd0b94 Compare May 22, 2026 13:07

jykae wants to merge 1 commit into apache:main from jykae:main
