macos_run_puppet: make bootstrap reboot-survivable via LaunchDaemon #1210

Open
rcurranmoz wants to merge 2 commits into master from reboot-survivable-bootstrap

@rcurranmoz
Contributor

Summary

Fixes #1206.

Today, bootstrapping a fresh M4 worker needs a babysitter SSH/Bolt session that is kept alive or re-established across every reboot: the TCC.db reboot, an MDM-driven OS upgrade, and the post-puppet reboot. We just walked through this with macmini-m4-130..149 on 2026-05-12 and spent ~3 hours nursing it.

This PR has run-puppet.sh register a self-removing LaunchDaemon (com.mozilla.ronin-puppet-bootstrap) on first invocation. The LaunchDaemon re-runs the script on every boot until:

  1. Puppet apply succeeds cleanly (the existing while/run_puppet loop already gates this); and
  2. Puppet's regular at-boot mechanism (com.mozilla.atboot_puppet) is installed, indicating the host is fully managed.

Once both hold, the script writes /var/tmp/semaphore/run-buildbot, then unloads and removes its own LaunchDaemon. The host is in the pool, and future boots are handled by the regular puppet at-boot mechanism with no overlap.

Implementation

Two helpers in modules/macos_run_puppet/files/run-puppet.sh:

  • install_bootstrap_launchd: copies the script to /usr/local/sbin/ronin-puppet-bootstrap.sh and writes /Library/LaunchDaemons/com.mozilla.ronin-puppet-bootstrap.plist. No-op if puppet's at-boot LaunchDaemon is already in place.
  • finalize_bootstrap: writes /var/tmp/semaphore/run-buildbot, then (only if puppet's at-boot LaunchDaemon exists) unloads and removes the bootstrap LaunchDaemon.
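
A minimal sketch of how the two helpers could look, for illustration only; it is not the code from the diff. ROOT is a hypothetical prefix that allows a dry run in a scratch directory, the file name of puppet's at-boot plist is assumed from its label, and the plist keys shown (Label, ProgramArguments, RunAtLoad) are an assumed minimal set.

```shell
#!/bin/bash
# Sketch only. ROOT is a hypothetical prefix for dry runs; leave it
# empty on a real host.
ROOT="${ROOT:-}"
LABEL="com.mozilla.ronin-puppet-bootstrap"
SCRIPT_PATH="/usr/local/sbin/ronin-puppet-bootstrap.sh"
PLIST_PATH="/Library/LaunchDaemons/${LABEL}.plist"
# Assumed file name for puppet's at-boot LaunchDaemon.
ATBOOT_PATH="/Library/LaunchDaemons/com.mozilla.atboot_puppet.plist"
SEMAPHORE_PATH="/var/tmp/semaphore/run-buildbot"

install_bootstrap_launchd() {
    local src="${1:-$0}"   # script to copy; defaults to the running script
    # No-op when puppet's at-boot LaunchDaemon is already in place.
    if [ -f "${ROOT}${ATBOOT_PATH}" ]; then
        return 0
    fi
    mkdir -p "${ROOT}/usr/local/sbin" "${ROOT}/Library/LaunchDaemons"
    # Skip the copy when we are already running from the installed path.
    if ! [ "$src" -ef "${ROOT}${SCRIPT_PATH}" ]; then
        cp "$src" "${ROOT}${SCRIPT_PATH}"
    fi
    chmod 755 "${ROOT}${SCRIPT_PATH}"
    cat > "${ROOT}${PLIST_PATH}" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>${LABEL}</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/bash</string>
        <string>${SCRIPT_PATH}</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
EOF
    chmod 644 "${ROOT}${PLIST_PATH}"
}

finalize_bootstrap() {
    mkdir -p "$(dirname "${ROOT}${SEMAPHORE_PATH}")"
    touch "${ROOT}${SEMAPHORE_PATH}"
    # Keep the bootstrap daemon unless puppet's own trigger exists,
    # so the host is never left with no puppet trigger at all.
    if [ ! -f "${ROOT}${ATBOOT_PATH}" ]; then
        return 0
    fi
    # Only talk to launchd on a real host (empty ROOT).
    if [ -z "$ROOT" ] && command -v launchctl >/dev/null 2>&1; then
        launchctl unload "${PLIST_PATH}" 2>/dev/null || true
    fi
    rm -f "${ROOT}${PLIST_PATH}"
}
```

With ROOT pointed at a temporary directory, both helpers can be exercised without root privileges and without touching launchd.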

install_bootstrap_launchd is called after the existing role-file / puppet-binary preconditions, so a host without /etc/puppet_role never installs the LaunchDaemon and cannot end up in a reboot loop. finalize_bootstrap is called at the existing exit 0 after the puppet retry loop breaks.
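
The call order described above can be sketched roughly as follows. The helpers are stubs that only record that they were called, bootstrap_main is a hypothetical wrapper name, the PUPPET_BIN default is an assumption, and the loop is a stand-in for the existing while/run_puppet loop rather than a copy of it.

```shell
#!/bin/bash
# Sketch of the call order only; helpers are stubs that record calls.
# ROLE_FILE comes from the PR text; the PUPPET_BIN default is assumed.
ROLE_FILE="${ROLE_FILE:-/etc/puppet_role}"
PUPPET_BIN="${PUPPET_BIN:-/opt/puppetlabs/bin/puppet}"
CALLS=""

install_bootstrap_launchd() { CALLS="$CALLS install"; }
finalize_bootstrap()        { CALLS="$CALLS finalize"; }
run_puppet()                { CALLS="$CALLS puppet"; return 0; }

bootstrap_main() {
    # Preconditions first: a host with no role must not get the daemon.
    if [ ! -f "$ROLE_FILE" ]; then
        echo "missing $ROLE_FILE" >&2
        return 1
    fi
    if [ ! -x "$PUPPET_BIN" ]; then
        echo "missing $PUPPET_BIN" >&2
        return 1
    fi

    install_bootstrap_launchd

    # Stand-in for the existing retry loop: leave only on a clean apply.
    while true; do
        run_puppet && break
        sleep "${RETRY_DELAY:-60}"
    done

    # Reached only after a clean apply, i.e. at the existing exit 0.
    finalize_bootstrap
    return 0
}
```

Because the precondition checks return before install_bootstrap_launchd runs, a misconfigured host fails once and is not re-run on every boot.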

Operational impact

Before: orchestrator must SSH-and-wait across N reboots.
After: a single MDM script-job (or bolt run) kicks run-puppet.sh and walks away.

Test plan

  • Fresh M4 host: invoke run-puppet.sh once, observe that it hits the TCC.db reboot trigger, comes back, applies cleanly, writes run-buildbot, and removes itself
  • Fresh M4 host where MDM drives a macOS upgrade between first puppet apply and second: confirm the LaunchDaemon survives the upgrade reboot and resumes
  • Already-bootstrapped host (atboot_puppet in place): invoke the script and confirm it does NOT install the bootstrap LaunchDaemon
  • Force a puppet apply failure (rm a required file): confirm the bootstrap LaunchDaemon stays in place and retries on next boot
  • Kitchen mac suites: confirm no regression (running_in_test_kitchen fact path is unchanged)
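
For checking the outcomes above by hand, a small helper along these lines may be useful. It is a hypothetical sketch: bootstrap_state, the ROOT prefix, the state names, and the file name of puppet's at-boot plist are all assumptions, not part of this PR.

```shell
#!/bin/bash
# Hypothetical helper that reports where a host is in the bootstrap.
# ROOT is an optional prefix for dry runs; empty on a real host.
ROOT="${ROOT:-}"

bootstrap_state() {
    local daemons="${ROOT}/Library/LaunchDaemons"
    local bootstrap="$daemons/com.mozilla.ronin-puppet-bootstrap.plist"
    local atboot="$daemons/com.mozilla.atboot_puppet.plist"   # assumed name
    local semaphore="${ROOT}/var/tmp/semaphore/run-buildbot"

    if [ -f "$bootstrap" ]; then
        # Bootstrap daemon still registered: it will run again on next boot.
        echo "bootstrapping"
    elif [ -f "$atboot" ] && [ -f "$semaphore" ]; then
        # Puppet's own trigger and the semaphore exist, bootstrap daemon gone.
        echo "finalized"
    else
        echo "unknown"
    fi
}
```

Running bootstrap_state after each reboot in the first two test cases should show the host moving from bootstrapping to finalized if the PR behaves as described.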

Related

Pairs naturally with #1208 (TCC.db cltbld-session gate); together they'd eliminate the babysitter pattern entirely.

🤖 Generated with Claude Code

Today, bootstrapping a fresh M4 worker requires an external SSH session
that babysits the run across at least two reboots (TCC.db detection +
final post-puppet reboot), and a third if MDM drives an OS upgrade
mid-bootstrap.

This commit adds a self-registering LaunchDaemon
(com.mozilla.ronin-puppet-bootstrap) so run-puppet.sh fires on every
boot until two conditions are met:

  1. Puppet apply has succeeded cleanly (existing `while/run_puppet`
     loop already gates this)
  2. Puppet's regular at-boot mechanism (com.mozilla.atboot_puppet) is
     installed

When both are satisfied, the script writes /var/tmp/semaphore/run-buildbot
so generic-worker can start, then unloads and removes its own
LaunchDaemon. Future boots are handled by the regular puppet at-boot
mechanism with no overlap.

Two helpers added:

- `install_bootstrap_launchd`: copies the script to
  /usr/local/sbin/ronin-puppet-bootstrap.sh and writes
  /Library/LaunchDaemons/com.mozilla.ronin-puppet-bootstrap.plist. No-op
  if puppet's at-boot LaunchDaemon is already present.

- `finalize_bootstrap`: writes the run-buildbot semaphore and removes
  the bootstrap LaunchDaemon (only after confirming puppet's at-boot
  mechanism is in place, so we don't leave the host with no puppet
  trigger).

`install_bootstrap_launchd` is called once role/puppet/facter
preconditions are confirmed; `finalize_bootstrap` is called at the
existing `exit 0` after the puppet retry loop has broken.

Result: an MDM script-job (or one-shot `bolt run`) can kick the
bootstrap and walk away. The host finishes provisioning across however
many reboots are needed.

Fixes #1206

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>