Fix macOS SIGSEGV in task execution by using fork+exec #64874

Merged
kaxil merged 9 commits into apache:main from astronomer:fix/macos-fork-exec-supervisor
Apr 22, 2026

Conversation

Member

@kaxil kaxil commented Apr 7, 2026

Summary

Discovered while testing the AIP-99 LLMSQLQueryOperator example DAGs from #64824 on macOS. Tasks that make network calls (LLM API requests, HTTP calls) crash intermittently with SIGSEGV or SIGABRT when running via airflow standalone or any executor on macOS.

Root cause: WatchedSubprocess.start() in supervisor.py uses bare os.fork() to create task child processes. On macOS, the forked child inherits corrupted Objective-C runtime state from the parent. When the child later triggers ObjC class initialization -- for example via socket.getaddrinfo() -> macOS system DNS resolver -> Security.framework -> +[NSNumber initialize] -- the ObjC runtime detects the half-initialized state and deliberately crashes.

The fix: Add a use_exec: bool = False kwarg to WatchedSubprocess.start(). ActivitySubprocess.start() passes use_exec=True; DAG processor and triggerer use the default. On platforms in _FORK_EXEC_PLATFORMS (currently {"darwin"}), the fork+exec path runs os.dup2 on the socketpairs to place them on FDs 0/1/2, then os.execv a fresh Python interpreter. os.dup2 (with the default inheritable=True) clears FD_CLOEXEC on the destination FDs, so they survive execv. The structured log channel is NOT inherited; instead, the exec'd child sets _AIRFLOW_FORK_EXEC=1, and after startup() in task_runner.main the existing reinit_supervisor_comms() requests the log FD from the supervisor via ResendLoggingFD + SCM_RIGHTS (the same mechanism used by the sudo / virtualenv re-exec path).
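The FD_CLOEXEC detail can be checked in isolation. The following is a minimal sketch, assuming a POSIX system; the helper name `dup2_inheritable_flags` and the scratch destination FD are illustrative only and are not part of supervisor.py (the PR targets FDs 0/1/2 instead).

```python
import os
import socket


def dup2_inheritable_flags() -> tuple[bool, bool]:
    """Return (inheritable before dup2, inheritable after dup2)."""
    parent_end, child_end = socket.socketpair()
    # Scratch destination FD standing in for 0/1/2 (hypothetical, for illustration).
    dest = os.open(os.devnull, os.O_RDONLY)
    try:
        # Python creates sockets non-inheritable (FD_CLOEXEC set) by default.
        before = os.get_inheritable(child_end.fileno())
        # os.dup2 defaults to inheritable=True, so the destination FD
        # ends up without FD_CLOEXEC and would survive an exec.
        os.dup2(child_end.fileno(), dest)
        after = os.get_inheritable(dest)
        return before, after
    finally:
        os.close(dest)
        parent_end.close()
        child_end.close()
```

The source FD keeps its close-on-exec flag, so only the duplicated destination would be visible to an exec'd process.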

The crash chain

supervisor.py: os.fork()
  -> child runs pydantic_ai.Agent.run_sync()
    -> httpx creates ThreadPoolExecutor for DNS
      -> socket.getaddrinfo() in worker thread
        -> macOS system resolver (not glibc)
          -> _scproxy / Security.framework
            -> ObjC runtime detects fork-unsafe state
              -> SIGABRT / SIGSEGV

The faulthandler traceback that identified the root cause:

Current thread (most recent call first):
  File "socket.py", line 978 in getaddrinfo
  File "concurrent/futures/thread.py", line 59 in run
  ...
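The PR does not say how faulthandler was enabled. As a rough sketch of how a traceback in this format can be produced with the standard library, the hypothetical helper below dumps the stacks of all threads to a temporary file:

```python
import faulthandler
import tempfile


def dump_current_stack() -> str:
    """Dump the Python stacks of all threads and return them as text."""
    with tempfile.TemporaryFile(mode="w+") as fh:
        # dump_traceback writes directly to the file descriptor behind fh.
        faulthandler.dump_traceback(file=fh, all_threads=True)
        fh.seek(0)
        return fh.read()
```

For crashes such as the one described here, calling `faulthandler.enable()` early in the process (or setting `PYTHONFAULTHANDLER=1`) makes the interpreter emit the same kind of dump when it receives SIGSEGV or SIGABRT.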

Why macOS only

Apple's ObjC runtime, CoreFoundation, and libdispatch are not fork-safe. This is why CPython changed multiprocessing's default start method from fork to spawn on macOS in Python 3.8 (BPO-33725). Linux uses glibc's resolver which has no ObjC dependency, so bare fork() works fine there. The _FORK_EXEC_PLATFORMS set is the extension point if another platform needs the same treatment later.
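A minimal sketch of what such a platform gate could look like. The set contents come from the description above; the helper name `wants_fork_exec` and its signature are assumptions for illustration, not the code in this PR.

```python
import sys

# Platforms where a bare fork() is not safe for task execution.
# The PR describes this set as currently containing only "darwin".
_FORK_EXEC_PLATFORMS = frozenset({"darwin"})


def wants_fork_exec(platform: str = sys.platform) -> bool:
    """Return True when the caller should request the fork+exec path."""
    return platform in _FORK_EXEC_PLATFORMS
```

Keeping the decision in one set means another platform can opt in by adding its `sys.platform` value, without touching the fork logic itself.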

Scope: task execution only (for now)

This PR only applies fork+exec to task execution (ActivitySubprocess). DAG processor and triggerer also use WatchedSubprocess and can hit the same macOS fork-safety crash -- DAG files often have top-level network calls (secret backends, connection / variable lookups, in-process API server threads), and triggerers run user-defined async triggers that poll APIs. We observed the DAG processor crash earlier in this investigation (the InProcessExecutionAPI + a2wsgi background thread initializing httpx / ssl / Security.framework).

Extending fork+exec to them requires reworking _child_exec_main() to rehydrate arbitrary targets across execv (it currently hardcodes _subprocess_main; DAG processor passes target=_parse_file_entrypoint, triggerer passes target=TriggerRunnerSupervisor.run_in_process). Tracked as follow-up in #65691.
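Because the exec'd child always runs `_subprocess_main`, one way to make the limitation explicit is a guard that rejects `use_exec=True` with any other target; a later commit in this PR describes a `ValueError` guard of this kind. The sketch below is a simplified stand-in: `check_start_args` and the stub `_subprocess_main` are hypothetical and do not reproduce the real `start()` signature.

```python
from collections.abc import Callable


def _subprocess_main() -> None:
    """Stub standing in for the real task entry point."""


def check_start_args(
    target: Callable[[], None] = _subprocess_main,
    use_exec: bool = False,
) -> None:
    """Reject combinations the exec'd child cannot honour."""
    # A callable cannot be carried across execv, so only the default
    # target is meaningful when use_exec is requested.
    if use_exec and target is not _subprocess_main:
        raise ValueError("use_exec=True only supports the default target")
```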

Scope: all executors on macOS, not just LocalExecutor

This affects any executor running on macOS (Local, Celery worker, etc.) because the fork happens inside supervise() in the Task SDK, not in the executor itself. The executor spawns a worker process (which is safe -- multiprocessing.Process uses spawn on macOS), but that worker then calls supervise() which does the bare os.fork().

The two-fork architecture:

Executor -> multiprocessing.Process (spawn, safe)
  -> worker calls supervise()
    -> os.fork() (bare fork, UNSAFE on macOS without exec)
      -> child runs task

How the log channel is obtained after exec

_fork_main needs a log FD to set up structured logging. In the bare-fork path it's inherited. In the fork+exec path:

  1. _child_exec_main (the exec'd entry point) calls _fork_main(..., log_fd=0, ...) -- zero means skip structured-log setup. It also sets _AIRFLOW_FORK_EXEC=1.
  2. _fork_main runs through its usual setup (signals, stdio, etc.) with logging going to the stderr socket.
  3. task_runner.main calls startup() to receive StartupDetails from the supervisor.
  4. After startup(), main checks _AIRFLOW_FORK_EXEC and calls the existing reinit_supervisor_comms(), which sends ResendLoggingFD() over the requests socket and receives a fresh log FD via SCM_RIGHTS.

The timing works because the supervisor enters its event loop (process.wait()) after _on_child_started has sent the startup message. By the time the child requests the log channel, the supervisor is in the loop and ready to handle it.
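The request/response step in the list above can be sketched with the standard library's SCM_RIGHTS helpers (`socket.send_fds` / `socket.recv_fds`, Python 3.9+ on Unix). This is an illustration only: the byte-string request is a placeholder for the real ResendLoggingFD message, the function names are hypothetical, and both ends run in one process to keep the example self-contained.

```python
import os
import socket

REQUEST = b"ResendLoggingFD"  # placeholder, not the real wire format


def supervisor_handle_request(supervisor_sock: socket.socket):
    """Supervisor side: create a log channel and send one end over SCM_RIGHTS."""
    if supervisor_sock.recv(1024) != REQUEST:
        raise RuntimeError("unexpected request")
    log_read, log_write = socket.socketpair()
    socket.send_fds(supervisor_sock, [b"ok"], [log_write.fileno()])
    return log_read, log_write


def child_receive_log_fd(child_sock: socket.socket) -> int:
    """Child side: receive the fresh log FD attached to the reply."""
    _msg, fds, _flags, _addr = socket.recv_fds(child_sock, 1024, 1)
    return fds[0]


def demo() -> bytes:
    supervisor_sock, child_sock = socket.socketpair()
    try:
        child_sock.sendall(REQUEST)  # child asks for the log channel
        log_read, log_write = supervisor_handle_request(supervisor_sock)
        log_fd = child_receive_log_fd(child_sock)
        log_write.close()  # supervisor drops its copy once it is delivered
        os.write(log_fd, b'{"event": "hello"}\n')  # child logs via the new FD
        os.close(log_fd)
        with log_read:
            return log_read.recv(1024)
    finally:
        supervisor_sock.close()
        child_sock.close()
```

The received FD is a new descriptor in the receiving process that refers to the same underlying socket, which is why nothing has to be inherited across the exec.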

What we tried that didn't work

| Approach | Why it failed |
| --- | --- |
| Lazy imports (pydantic_ai, datafusion) | Crash happens at runtime during DNS resolution, not at import time |
| OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES | Undocumented Apple debug knob. Suppresses the abort but doesn't fix the underlying memory corruption. Cached at ObjC runtime init, unreliable on newer macOS |
| NO_PROXY='*' | Python-level env var; socket.getaddrinfo() is a C library call that uses the macOS system resolver directly |
| Pre-initialize _scproxy before fork | Fragile -- any new dependency that touches ObjC frameworks would break it again |
| subprocess.Popen instead of os.fork() | Loses the socketpair FDs. The child can't communicate with the supervisor |

Why fork + exec and not spawn

Python's multiprocessing with the spawn start method (for example multiprocessing.get_context("spawn").Process) would also work, but it requires pickling the target function and arguments. The supervisor's communication is built around socketpairs created before fork, with FDs inherited by the child. fork + exec preserves this design: the three socketpairs are placed at FDs 0/1/2 via dup2 (which clears FD_CLOEXEC), and the log channel is obtained after startup via the existing ResendLoggingFD protocol. No new cross-process communication path is needed.
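A self-contained sketch of the fork + dup2 + execv pattern, assuming a POSIX system. It uses a single socketpair placed on FD 0 rather than the three the supervisor uses, and the child program is an inline `-c` snippet instead of `_child_exec_main`; the names `CHILD_CODE` and `fork_exec_roundtrip` are illustrative.

```python
import os
import socket
import sys

# Program run by the freshly exec'd interpreter: wrap a duplicate of
# FD 0 as a socket and send one message back to the parent.
CHILD_CODE = (
    "import os, socket\n"
    "s = socket.socket(fileno=os.dup(0))\n"
    "s.sendall(b'hello from exec')\n"
    "s.close()\n"
)


def fork_exec_roundtrip() -> bytes:
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: do as little as possible between fork and exec.
        try:
            parent_end.close()
            os.dup2(child_end.fileno(), 0)  # clears FD_CLOEXEC on FD 0
            os.execv(sys.executable, [sys.executable, "-c", CHILD_CODE])
        finally:
            os._exit(127)  # only reached if execv failed
    # Parent: drop the child's end, then read until the child closes its side.
    child_end.close()
    chunks = []
    try:
        while chunk := parent_end.recv(1024):
            chunks.append(chunk)
    finally:
        parent_end.close()
        os.waitpid(pid, 0)
    return b"".join(chunks)
```

The exec'd interpreter starts with a fresh address space, yet it can still talk to the parent because the socket was duplicated onto a well-known FD number before the exec.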

References


Was generative AI tooling used to co-author this PR?
  • Yes -- Claude Code
@kaxil kaxil requested a review from potiuk April 7, 2026 23:26
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Member

ashb commented Apr 8, 2026

I wonder if we should also set the env var we have to not load settings in this exec'd process, to speed up the airflow import?

@kaxil kaxil force-pushed the fix/macos-fork-exec-supervisor branch from c32f2e6 to e25461a Compare April 9, 2026 15:45
@kaxil kaxil requested a review from Copilot April 10, 2026 19:55
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes intermittent macOS task crashes (SIGSEGV/SIGABRT) caused by fork-unsafe Apple Objective‑C runtime state by switching the task-execution subprocess path to fork + immediate exec.

Changes:

  • Add macOS-only fork+exec path for task execution in the supervisor (WatchedSubprocess.start) with _child_exec_main as the post-exec entrypoint.
  • Reinitialize supervisor comms/logging channel in the task runner when started via the fork+exec path.
  • Add a unit test for _child_exec_main wiring of FDs 0/1/2 and signaling task runner via _AIRFLOW_FORK_EXEC.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| task-sdk/src/airflow/sdk/execution_time/supervisor.py | Introduces macOS fork+exec execution path and _child_exec_main entrypoint. |
| task-sdk/src/airflow/sdk/execution_time/task_runner.py | Requests structured logging FD after startup when running under fork+exec. |
| task-sdk/tests/task_sdk/execution_time/test_supervisor.py | Adds test coverage for _child_exec_main FD handling and _AIRFLOW_FORK_EXEC env signaling. |
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py
Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py
Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
@kaxil kaxil added this to the Airflow 3.2.2 milestone Apr 14, 2026
@kaxil kaxil force-pushed the fix/macos-fork-exec-supervisor branch from 87883db to 1340552 Compare April 20, 2026 21:38
@kaxil kaxil marked this pull request as ready for review April 20, 2026 23:50
@kaxil kaxil requested a review from amoghrajesh as a code owner April 20, 2026 23:50
@kaxil kaxil requested a review from Copilot April 20, 2026 23:58
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/tests/task_sdk/execution_time/test_supervisor.py
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py
Comment thread task-sdk/src/airflow/sdk/execution_time/task_runner.py Outdated
kaxil added 4 commits April 22, 2026 15:56
On macOS, the task supervisor's bare os.fork() copies the parent's
Objective-C runtime state into the child process.  When the child
later triggers ObjC class initialization (e.g. socket.getaddrinfo ->
system DNS resolver -> Security.framework -> +[NSNumber initialize]),
the runtime detects the corrupted state and crashes with SIGABRT/SIGSEGV.

This is a well-documented macOS platform limitation -- Apple's ObjC
runtime, CoreFoundation, and libdispatch are not fork-safe.  CPython
changed multiprocessing's default start method to "spawn" on macOS in
3.8 for this reason, but Airflow's TaskSDK supervisor uses os.fork()
directly.

The fix: on macOS, immediately call os.execv() after os.fork() for
task execution subprocesses.  The exec replaces the child's address
space, giving it clean ObjC state.  The socketpair FDs survive across
exec (marked inheritable) and the child reads their numbers from an
environment variable.

Only task execution (target=_subprocess_main) uses fork+exec.  DAG
processor and triggerer pass different targets and keep bare fork --
they don't make network calls that trigger the macOS crash.

References:
- python/cpython#105912
- python/cpython#58037
- apache#24463

Address review feedback: instead of passing all 4 FD numbers via
JSON env var, dup2 the requests/stdout/stderr sockets onto FDs
0/1/2 before exec (inheritable by default). Only the log channel
FD needs explicit passing via _AIRFLOW_SUPERVISOR_LOG_FD.

Instead of passing the log channel FD via env var, use the existing
ResendLoggingFD protocol: the exec'd child starts with log_fd=0
(no structured logging), and after startup the task runner calls
reinit_supervisor_comms() to request the log channel from the
supervisor. This reuses the same mechanism as sudo/virtualenv
re-exec rather than introducing a new env var.

- Switch from `target is _subprocess_main` identity check to an explicit
  `use_exec: bool = False` kwarg on `start()`, set to True in
  `ActivitySubprocess.start()`.
- Combine fork+exec and bare-fork branches into a single try/except so
  the bare-fork block is no longer inside an else. Error handling is
  shared, reducing duplication and indentation.
- Fix comment about how FDs 0/1/2 survive exec: it's `os.dup2` clearing
  FD_CLOEXEC on destination FDs, not "inheritable by default".
kaxil added 4 commits April 22, 2026 15:56
- Remove inline `import socket as _socket`; use the module-level
  `from socket import socket` import in `_child_exec_main`.
- Raise `ValueError` if `use_exec=True` is passed with a non-default
  target, since the exec'd child always runs `_subprocess_main`.
- Apply `disable_capturing` fixture to `TestChildExecMain` since it
  mutates process-wide stdio FDs via `dup2` onto 0/1/2.

Tests that override `target` with a local stub to exercise the base
infrastructure were hitting the ValueError guard. Only pass
use_exec=True when the default `_subprocess_main` target is used.

Per Copilot review: check for == "1" so an externally-set env var
with a different truthy value can't accidentally trigger
reinit_supervisor_comms().
@kaxil kaxil force-pushed the fix/macos-fork-exec-supervisor branch from a6e585b to 4734445 Compare April 22, 2026 14:56
Member

@ashb ashb left a comment

Few nits, but LGTM overall.

We don't want a way to let users disable this, do we?

Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py Outdated
Per Ash: WatchedSubprocess.start now trusts use_exec directly
(`if use_exec:`); the platform gate lives in ActivitySubprocess.start
where the caller decides. Also updates the _FORK_EXEC_PLATFORMS
docstring to link to apache#65691 for the DAG processor / triggerer
follow-up (per TP's feedback).
Comment thread task-sdk/src/airflow/sdk/execution_time/supervisor.py
@kaxil kaxil merged commit a3383b7 into apache:main Apr 22, 2026
111 checks passed
@kaxil kaxil deleted the fix/macos-fork-exec-supervisor branch April 22, 2026 22:31

boring-cyborg Bot commented Apr 22, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

vatsrahul1001 added a commit that referenced this pull request May 15, 2026
…#66872)

On macOS, the task supervisor's bare os.fork() copies the parent's
Objective-C runtime state into the child process.  When the child
later triggers ObjC class initialization (e.g. socket.getaddrinfo ->
system DNS resolver -> Security.framework -> +[NSNumber initialize]),
the runtime detects the corrupted state and crashes with SIGABRT/SIGSEGV.

This is a well-documented macOS platform limitation -- Apple's ObjC
runtime, CoreFoundation, and libdispatch are not fork-safe.  CPython
changed multiprocessing's default start method to "spawn" on macOS in
3.8 for this reason, but Airflow's TaskSDK supervisor uses os.fork()
directly.

The fix: on macOS, immediately call os.execv() after os.fork() for
task execution subprocesses.  The exec replaces the child's address
space, giving it clean ObjC state.  The socketpair FDs survive across
exec (marked inheritable) and the child reads their numbers from an
environment variable.

Only task execution (target=_subprocess_main) uses fork+exec.  DAG
processor and triggerer pass different targets and keep bare fork --
they don't make network calls that trigger the macOS crash.

References:
- python/cpython#105912
- python/cpython#58037
- #24463

(cherry picked from commit a3383b7)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…#66872)

vatsrahul1001 added a commit that referenced this pull request May 20, 2026
…#66872)

vatsrahul1001 added a commit that referenced this pull request May 21, 2026
…#66872)

4 participants