Fix macOS SIGSEGV in task execution by using fork+exec #64874
I wonder if we should also set the env var we have to not load settings in this exec'd process, to speed up the airflow import?
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes intermittent macOS task crashes (SIGSEGV/SIGABRT) caused by fork-unsafe Apple Objective‑C runtime state by switching the task-execution subprocess path to fork + immediate exec.
Changes:
- Add macOS-only `fork+exec` path for task execution in the supervisor (`WatchedSubprocess.start`) with `_child_exec_main` as the post-exec entrypoint.
- Reinitialize supervisor comms/logging channel in the task runner when started via the `fork+exec` path.
- Add a unit test for `_child_exec_main` wiring of FDs 0/1/2 and signaling task runner via `_AIRFLOW_FORK_EXEC`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| task-sdk/src/airflow/sdk/execution_time/supervisor.py | Introduces macOS fork+exec execution path and _child_exec_main entrypoint. |
| task-sdk/src/airflow/sdk/execution_time/task_runner.py | Requests structured logging FD after startup when running under fork+exec. |
| task-sdk/tests/task_sdk/execution_time/test_supervisor.py | Adds test coverage for _child_exec_main FD handling and _AIRFLOW_FORK_EXEC env signaling. |
On macOS, the task supervisor's bare os.fork() copies the parent's Objective-C runtime state into the child process. When the child later triggers ObjC class initialization (e.g. socket.getaddrinfo -> system DNS resolver -> Security.framework -> +[NSNumber initialize]), the runtime detects the corrupted state and crashes with SIGABRT/SIGSEGV.

This is a well-documented macOS platform limitation -- Apple's ObjC runtime, CoreFoundation, and libdispatch are not fork-safe. CPython changed multiprocessing's default start method to "spawn" on macOS in 3.8 for this reason, but Airflow's TaskSDK supervisor uses os.fork() directly.

The fix: on macOS, immediately call os.execv() after os.fork() for task execution subprocesses. The exec replaces the child's address space, giving it clean ObjC state. The socketpair FDs survive across exec (marked inheritable) and the child reads their numbers from an environment variable.

Only task execution (target=_subprocess_main) uses fork+exec. DAG processor and triggerer pass different targets and keep bare fork -- they don't make network calls that trigger the macOS crash.

References:
- python/cpython#105912
- python/cpython#58037
- apache#24463
Address review feedback: instead of passing all 4 FD numbers via JSON env var, dup2 the requests/stdout/stderr sockets onto FDs 0/1/2 before exec (inheritable by default). Only the log channel FD needs explicit passing via _AIRFLOW_SUPERVISOR_LOG_FD.
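The fork, `dup2`, then exec sequence described in this commit can be sketched as follows. This is a minimal illustration and not the supervisor's actual code: the function name `fork_exec_with_stdio_socket` and the `CHILD_CODE` one-liner are hypothetical, and only one socketpair is wired (onto FDs 0 and 1) to keep the example short.

```python
import os
import socket
import sys

# Hypothetical child program: reads from FD 0 and echoes the bytes back on FD 1.
CHILD_CODE = "import os; os.write(1, b'child saw: ' + os.read(0, 1024))"


def fork_exec_with_stdio_socket(payload: bytes) -> bytes:
    """Fork, place one end of a socketpair on FDs 0/1, then exec a fresh interpreter."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: do as little as possible between fork and exec.
        try:
            parent_sock.close()
            # dup2 clears FD_CLOEXEC on the destination FD, so 0 and 1 survive execv.
            os.dup2(child_sock.fileno(), 0)
            os.dup2(child_sock.fileno(), 1)
            # The original child_sock FD is non-inheritable and is closed by the exec.
            os.execv(sys.executable, [sys.executable, "-c", CHILD_CODE])
        finally:
            # Only reached if execv itself failed.
            os._exit(127)

    # Parent: talk to the exec'd child over the remaining end of the socketpair.
    child_sock.close()
    parent_sock.sendall(payload)
    chunks = []
    while chunk := parent_sock.recv(1024):  # empty bytes means the child closed its end
        chunks.append(chunk)
    os.waitpid(pid, 0)
    parent_sock.close()
    return b"".join(chunks)
```

Because the child replaces its address space immediately, nothing it inherited from the parent's memory is used after the exec; only the file descriptors placed on 0 and 1 carry over.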
Instead of passing the log channel FD via env var, use the existing ResendLoggingFD protocol: the exec'd child starts with log_fd=0 (no structured logging), and after startup the task runner calls reinit_supervisor_comms() to request the log channel from the supervisor. This reuses the same mechanism as sudo/virtualenv re-exec rather than introducing a new env var.
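For readers unfamiliar with SCM_RIGHTS, the underlying mechanism of handing an open file descriptor to another process over a Unix socket can be sketched as below. This is not Airflow's `ResendLoggingFD` implementation: the function name and the use of a pipe as a stand-in log channel are assumptions for illustration, and both "sides" run in one process to keep it self-contained. It relies on `socket.send_fds` / `socket.recv_fds`, available on Unix in Python 3.9+.

```python
import os
import socket


def request_log_fd_demo() -> bytes:
    """Pass a pipe's write end over a socketpair using SCM_RIGHTS ancillary data."""
    # Stand-ins for the supervisor and child ends of the requests socket.
    supervisor_end, child_end = socket.socketpair()
    # A pipe stands in for the structured log channel (hypothetical).
    log_read, log_write = os.pipe()
    try:
        # "Supervisor": attach the log FD to a small message as ancillary data.
        socket.send_fds(supervisor_end, [b"log-fd"], [log_write])
        # "Child": receive the message plus a new FD referring to the same pipe.
        _msg, fds, _flags, _addr = socket.recv_fds(child_end, 1024, 1)
        received_fd = fds[0]
        os.write(received_fd, b"structured log line")
        os.close(received_fd)
        # Whatever the child wrote is readable from the supervisor's end of the pipe.
        return os.read(log_read, 1024)
    finally:
        os.close(log_read)
        os.close(log_write)
        supervisor_end.close()
        child_end.close()
```

The received descriptor is a new FD number in the receiving process, which is why the channel does not need to be inherited across the exec.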
- Switch from `target is _subprocess_main` identity check to an explicit `use_exec: bool = False` kwarg on `start()`, set to True in `ActivitySubprocess.start()`.
- Combine fork+exec and bare-fork branches into a single try/except so the bare-fork block is no longer inside an else. Error handling is shared, reducing duplication and indentation.
- Fix comment about how FDs 0/1/2 survive exec: it's `os.dup2` clearing FD_CLOEXEC on destination FDs, not "inheritable by default".
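The corrected comment about `os.dup2` can be checked directly with `os.get_inheritable`. The helper below is a small sketch with a hypothetical name; it uses a pipe instead of the supervisor's sockets and a scratch FD as the `dup2` destination instead of 0/1/2, so it does not disturb the calling process's stdio.

```python
import os


def dup2_inheritability():
    """Return (source before, destination after, source after) inheritable flags."""
    read_fd, write_fd = os.pipe()   # PEP 446: newly created FDs are non-inheritable
    target_fd = os.dup(write_fd)    # scratch FD number to use as the dup2 destination
    try:
        source_before = os.get_inheritable(read_fd)
        os.dup2(read_fd, target_fd)  # inheritable=True is the default
        dest_after = os.get_inheritable(target_fd)
        source_after = os.get_inheritable(read_fd)
        return source_before, dest_after, source_after
    finally:
        for fd in (read_fd, write_fd, target_fd):
            os.close(fd)
```

The source FD keeps `FD_CLOEXEC` set and is closed by the exec; only the destination FD survives.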
- Remove inline `import socket as _socket`; use the module-level `from socket import socket` import in `_child_exec_main`.
- Raise `ValueError` if `use_exec=True` is passed with a non-default target, since the exec'd child always runs `_subprocess_main`.
- Apply `disable_capturing` fixture to `TestChildExecMain` since it mutates process-wide stdio FDs via `dup2` onto 0/1/2.
Tests that override `target` with a local stub to exercise the base infrastructure were hitting the ValueError guard. Only pass use_exec=True when the default `_subprocess_main` target is used.
Per Copilot review: check for == "1" so an externally-set env var with a different truthy value can't accidentally trigger reinit_supervisor_comms().
ashb left a comment
A few nits, but LGTM overall.
We don't want a way to let users disable this, do we?
Per Ash: WatchedSubprocess.start now trusts use_exec directly (`if use_exec:`); the platform gate lives in ActivitySubprocess.start where the caller decides. Also updates the _FORK_EXEC_PLATFORMS docstring to link to apache#65691 for the DAG processor / triggerer follow-up (per TP's feedback).
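The division of responsibility described here (the caller decides per platform, the base `start` trusts the flag and rejects non-default targets) might look roughly like the sketch below. Every name other than `_FORK_EXEC_PLATFORMS`, `use_exec`, and `_subprocess_main` is hypothetical, the function bodies are placeholders, and the real methods do the actual forking instead of returning a label.

```python
import sys

# Set named in the PR description; "darwin" is its stated current content.
_FORK_EXEC_PLATFORMS = frozenset({"darwin"})


def _subprocess_main():
    """Placeholder for the default task-execution target."""


def should_use_exec(platform: str = sys.platform) -> bool:
    # Caller-side platform gate (in the PR this decision lives in ActivitySubprocess.start).
    return platform in _FORK_EXEC_PLATFORMS


def start(target=_subprocess_main, use_exec: bool = False) -> str:
    # Base start trusts use_exec directly; it does not re-check the platform.
    if use_exec and target is not _subprocess_main:
        # The exec'd child always runs the default target, so anything else is an error.
        raise ValueError("use_exec=True requires the default target")
    return "fork+exec" if use_exec else "fork"
```

With this shape, tests that pass a stub `target` simply leave `use_exec` at its default and never reach the guard.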
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.
…#66872) On macOS, the task supervisor's bare os.fork() copies the parent's Objective-C runtime state into the child process. When the child later triggers ObjC class initialization (e.g. socket.getaddrinfo -> system DNS resolver -> Security.framework -> +[NSNumber initialize]), the runtime detects the corrupted state and crashes with SIGABRT/SIGSEGV. This is a well-documented macOS platform limitation -- Apple's ObjC runtime, CoreFoundation, and libdispatch are not fork-safe. CPython changed multiprocessing's default start method to "spawn" on macOS in 3.8 for this reason, but Airflow's TaskSDK supervisor uses os.fork() directly. The fix: on macOS, immediately call os.execv() after os.fork() for task execution subprocesses. The exec replaces the child's address space, giving it clean ObjC state. The socketpair FDs survive across exec (marked inheritable) and the child reads their numbers from an environment variable. Only task execution (target=_subprocess_main) uses fork+exec. DAG processor and triggerer pass different targets and keep bare fork -- they don't make network calls that trigger the macOS crash. References: - python/cpython#105912 - python/cpython#58037 - #24463 (cherry picked from commit a3383b7) Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Summary
Discovered while testing the AIP-99 `LLMSQLQueryOperator` example DAGs from #64824 on macOS. Tasks that make network calls (LLM API requests, HTTP calls) crash intermittently with `SIGSEGV` or `SIGABRT` when running via `airflow standalone` or any executor on macOS.

Root cause: `WatchedSubprocess.start()` in `supervisor.py` uses bare `os.fork()` to create task child processes. On macOS, the forked child inherits corrupted Objective-C runtime state from the parent. When the child later triggers ObjC class initialization -- for example via `socket.getaddrinfo()` -> macOS system DNS resolver -> `Security.framework` -> `+[NSNumber initialize]` -- the ObjC runtime detects the half-initialized state and deliberately crashes.

The fix: Add a `use_exec: bool = False` kwarg to `WatchedSubprocess.start()`. `ActivitySubprocess.start()` passes `use_exec=True`; DAG processor and triggerer use the default. On platforms in `_FORK_EXEC_PLATFORMS` (currently `{"darwin"}`), the fork+exec path runs `os.dup2` on the socketpairs to place them on FDs 0/1/2, then `os.execv`s a fresh Python interpreter. `os.dup2` (with the default `inheritable=True`) clears `FD_CLOEXEC` on the destination FDs, so they survive `execv`. The structured log channel is NOT inherited; instead, the exec'd child sets `_AIRFLOW_FORK_EXEC=1`, and after `startup()` in `task_runner.main` the existing `reinit_supervisor_comms()` requests the log FD from the supervisor via `ResendLoggingFD` + SCM_RIGHTS (the same mechanism used by the `sudo` / virtualenv re-exec path).

The crash chain
The faulthandler traceback that identified the root cause:
Why macOS only
Apple's ObjC runtime, CoreFoundation, and `libdispatch` are not fork-safe. This is why CPython changed `multiprocessing`'s default start method from `fork` to `spawn` on macOS in Python 3.8 (BPO-33725). Linux uses glibc's resolver, which has no ObjC dependency, so bare `fork()` works fine there. The `_FORK_EXEC_PLATFORMS` set is the extension point if another platform needs the same treatment later.

Scope: task execution only (for now)
This PR only applies fork+exec to task execution (`ActivitySubprocess`). DAG processor and triggerer also use `WatchedSubprocess` and can hit the same macOS fork-safety crash -- DAG files often have top-level network calls (secret backends, connection / variable lookups, in-process API server threads), and triggerers run user-defined async triggers that poll APIs. We observed the DAG processor crash earlier in this investigation (the `InProcessExecutionAPI` + `a2wsgi` background thread initializing `httpx` / `ssl` / `Security.framework`).

Extending fork+exec to them requires reworking `_child_exec_main()` to rehydrate arbitrary targets across `execv` (it currently hardcodes `_subprocess_main`; DAG processor passes `target=_parse_file_entrypoint`, triggerer passes `target=TriggerRunnerSupervisor.run_in_process`). Tracked as follow-up in #65691.

Scope: all executors on macOS, not just LocalExecutor
This affects any executor running on macOS (Local, Celery worker, etc.) because the fork happens inside `supervise()` in the Task SDK, not in the executor itself. The executor spawns a worker process (which is safe -- `multiprocessing.Process` uses `spawn` on macOS), but that worker then calls `supervise()`, which does the bare `os.fork()`.

The two-fork architecture:
How the log channel is obtained after exec
`_fork_main` needs a log FD to set up structured logging. In the bare-fork path it's inherited. In the fork+exec path:

- `_child_exec_main` (the exec'd entry point) calls `_fork_main(..., log_fd=0, ...)` -- zero means skip structured-log setup. It also sets `_AIRFLOW_FORK_EXEC=1`.
- `_fork_main` runs through its usual setup (signals, stdio, etc.) with logging going to the stderr socket.
- `task_runner.main` calls `startup()` to receive `StartupDetails` from the supervisor.
- After `startup()`, `main` checks `_AIRFLOW_FORK_EXEC` and calls the existing `reinit_supervisor_comms()`, which sends `ResendLoggingFD()` over the requests socket and receives a fresh log FD via SCM_RIGHTS.

The timing works because the supervisor enters its event loop (`process.wait()`) after `_on_child_started` has sent the startup message. By the time the child requests the log channel, the supervisor is in the loop and ready to handle it.

What we tried that didn't work
- … (`pydantic_ai`, `datafusion`)
- `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`
- `NO_PROXY='*'`
- `socket.getaddrinfo()` is a C library call that uses the macOS system resolver directly
- … `_scproxy` before fork
- `subprocess.Popen` instead of `os.fork()`

Why `fork+exec` and not `spawn`

Python's `multiprocessing.Process(start_method='spawn')` would also work, but it requires pickling the target function and arguments. The supervisor's communication is built around socketpairs created before fork, with FDs inherited by the child. `fork+exec` preserves this design: the three socketpairs are placed at FDs 0/1/2 via `dup2` (which clears `FD_CLOEXEC`), and the log channel is obtained after startup via the existing `ResendLoggingFD` protocol. No new cross-process communication path is needed.

References
- `getaddrinfo` SIGSEGV after fork on macOS
- `_scproxy` crash after fork

Was generative AI tooling used to co-author this PR?