Remote TUI can remain stale after app-server slow-websocket disconnect #18860

@laffo16

What version of Codex CLI is running?

codex-cli 0.122.0

What subscription do you have?

Pro

Which model were you using?

gpt-5.4

What platform is your computer?

Microsoft Windows NT 10.0.19045.0 x64

What terminal emulator and version are you using (if applicable)?

Windows Terminal, PowerShell 7

What issue are you seeing?

This is related to #18203, which reports the app-server outbound websocket queue disconnect trigger. This issue is specifically about the TUI stale-state/reconciliation failure after that kind of disconnect: the server-side thread can be completed/idle, while the TUI remains in Working state and routes the next prompt as turn/steer against the completed turn.

In the captured reproduction, the TUI websocket connection saw a normal turn start:

seq 53  TUI -> app-server  turn/start   id=6
seq 54  app-server -> TUI  response     id=6 result.turn.status=inProgress
seq 55  app-server -> TUI  thread/status/changed status=active
seq 56  app-server -> TUI  turn/started
seq 57-247 app-server -> TUI item/hook/output frames for that turn

Then app-server stderr reported:

WARN codex_app_server::transport
disconnecting slow connection after outbound queue filled: ConnectionId(0)

It was followed by 65 warnings of the form "dropping message for disconnected connection: ConnectionId(0)".
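The disconnect-then-drop sequence in these warnings can be modelled with a bounded channel. This is a minimal sketch, not the app-server's code: `Outbound`, `SendOutcome`, and the capacity value are hypothetical names chosen for illustration, and the real transport may differ in how it detects and reports a full queue.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// What happened to one outbound frame (hypothetical names).
#[derive(Debug, PartialEq)]
enum SendOutcome {
    /// The frame fit into the bounded queue.
    Queued,
    /// The queue was full, so the connection is now marked disconnected.
    DisconnectedSlowClient,
    /// The connection was already disconnected; the frame is discarded.
    Dropped,
}

/// Per-connection outbound queue with a fixed capacity.
struct Outbound {
    tx: SyncSender<String>,
    disconnected: bool,
}

impl Outbound {
    fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Outbound { tx, disconnected: false }, rx)
    }

    /// Non-blocking send: never waits for a slow reader.
    fn send(&mut self, frame: &str) -> SendOutcome {
        if self.disconnected {
            return SendOutcome::Dropped;
        }
        match self.tx.try_send(frame.to_string()) {
            Ok(()) => SendOutcome::Queued,
            // A closed receiver is treated like a full queue here for brevity.
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.disconnected = true;
                SendOutcome::DisconnectedSlowClient
            }
        }
    }
}
```

With a capacity of 2 and a reader that never drains, the third frame triggers the disconnect and every later frame, including a `turn/completed`, is dropped. That is the shape of the one disconnect warning followed by 65 dropped-message warnings.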

The TUI relay log never recorded the terminal lifecycle frames for that turn:

  • no turn/completed
  • no idle thread/status/changed

At the same time, a separate passive observer against the same app-server reported the authoritative thread state as:

{
  "threadStatusType": "idle",
  "turnCount": 1,
  "latestTurnStatus": "completed",
  "inProgressTurnCount": 0,
  "activeTurnId": null
}

A direct websocket turn/start against the same app-server and same thread then completed successfully:

{
  "directTurnStartStatus": "inProgress",
  "completed": true,
  "notificationCounts": {
    "turn/started": 1,
    "turn/completed": 1,
    "thread/status/changed": 2
  }
}

This suggests the app-server/thread was healthy, while the original TUI connection retained stale active-turn state.
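The mismatch between the two views can be stated as a small predicate. This is a sketch under assumptions: `ServerThreadState` mirrors two fields of the observer JSON above, while `ClientTurnState` and `is_stale` are hypothetical and do not correspond to known types in the codebase.

```rust
/// Authoritative state as reported by the passive observer.
/// Field names mirror `threadStatusType` and `activeTurnId` above.
struct ServerThreadState {
    thread_status_type: String,
    active_turn_id: Option<String>,
}

/// What the client believes about the current turn (hypothetical cache).
struct ClientTurnState {
    active_turn_id: Option<String>,
}

/// The client is stale when it still tracks an active turn that the
/// server no longer considers active.
fn is_stale(client: &ClientTurnState, server: &ServerThreadState) -> bool {
    match &client.active_turn_id {
        None => false,
        Some(cached) => {
            server.thread_status_type == "idle"
                || server.active_turn_id.as_deref() != Some(cached.as_str())
        }
    }
}
```

Applied to the captured run, the client side holds the original turn id while the server reports `idle` with `activeTurnId: null`, so the predicate returns `true`.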

When a later visible prompt was entered into the stale TUI, the TUI sent:

TUI -> app-server turn/steer id=7
expectedTurnId=<previous completed turn id>

No response to that turn/steer request was observed in the TUI relay log.

Evidence chain:

| Claim | Evidence |
| --- | --- |
| TUI started a normal turn | Stale relay metadata: turn/start, response inProgress, active status, turn/started |
| app-server hit outbound websocket backpressure | Transport analysis: one "disconnecting slow connection after outbound queue filled" warning |
| app-server dropped later messages for that disconnected connection | Transport analysis: 65 dropped-message warnings for the same connection id |
| stale TUI did not receive terminal lifecycle frames | Stale relay analysis: zero turn/completed, no idle status delivered |
| server-side thread was actually complete/idle | Passive observer summary: threadStatusType=idle, latestTurnStatus=completed, inProgressTurnCount=0, activeTurnId=null |
| app-server/thread were still capable of work | Direct websocket probe: new turn/start completed with turn/started, turn/completed, and status updates |
| same wrapper/relay path can complete normally | Clean control: two turn/start, two turn/started, two turn/completed, zero turn/steer, zero backpressure events |

What steps can reproduce the bug?

I do not have a minimal upstream-only repro script for the stale-state recovery part. The strongest reproduction used a local remote-TUI wrapper/relay and a turn that produced a burst of output frames large enough to fill the app-server outbound websocket queue.

The queue-fill disconnect trigger itself is already reported with an upstream-only reproduction in #18203. The additional observation here is that after such a disconnect, the TUI can remain stale rather than clearly exiting/reconciling.

The useful maintainer-side repro direction is likely:

  1. Start Codex TUI in remote app-server websocket mode.
  2. Put a slow/throttled websocket client or proxy between the TUI and app-server.
  3. Run a turn that emits many output delta frames.
  4. Observe whether the app-server logs the slow-connection disconnect.
  5. Check whether the TUI exits/reconciles, or instead remains in Working and routes the next prompt as turn/steer for the previous turn id.

In my captured Worker 06 reproduction, the relevant thread id was:

019db067-8e04-71e0-a0a4-e1106ee75148

The initial stale turn id was:

019db068-4be4-7063-a7dc-55d20ed439dd

The later stale turn/steer used that same completed turn id as expectedTurnId.

What is the expected behavior?

After the app-server transport disconnects a slow websocket client, the remote TUI should do one of the following:

  • receive a close/error and exit clearly;
  • reconnect/resume and reconcile with authoritative server thread state;
  • clear stale active-turn/running state before accepting the next user prompt.

It should not continue accepting prompts while still believing a completed turn is active.
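The third option can be sketched as an event handler that treats a disconnect as the end of any cached turn. This is a simplified model, not the TUI's code: `Event`, `Session`, and `accepts_prompts` are hypothetical names, and whether the real client should exit, reconnect, or only clear state is a design decision for the maintainers.

```rust
/// Events the client may observe (hypothetical, simplified).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    TurnStarted { turn_id: String },
    TurnCompleted { turn_id: String },
    Disconnected,
}

/// Minimal client-side session state.
#[derive(Debug)]
struct Session {
    connected: bool,
    active_turn_id: Option<String>,
}

impl Session {
    fn new() -> Self {
        Session { connected: true, active_turn_id: None }
    }

    fn apply(&mut self, event: Event) {
        match event {
            Event::TurnStarted { turn_id } => self.active_turn_id = Some(turn_id),
            Event::TurnCompleted { turn_id } => {
                // Only clear the turn that actually completed.
                if self.active_turn_id.as_deref() == Some(turn_id.as_str()) {
                    self.active_turn_id = None;
                }
            }
            Event::Disconnected => {
                // The terminal frames may never arrive, so stop trusting the cache.
                self.connected = false;
                self.active_turn_id = None;
            }
        }
    }

    /// Prompts are refused until the session has reconnected and reconciled.
    fn accepts_prompts(&self) -> bool {
        self.connected
    }
}
```

In this model a `Disconnected` event arriving after `TurnStarted` leaves no active turn behind and blocks further prompts, which rules out a later `turn/steer` against the old turn id.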

Additional information

Mechanical analysis for the stale run:

{
  "frameCount": 248,
  "malformedLineCount": 0,
  "turnStartCount": 1,
  "turnStartedCount": 1,
  "turnCompletedCount": 0,
  "turnSteerCount": 1,
  "staleTurnStateSuspected": true
}
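
A summary of this shape can be derived by counting frame methods in relay order. The sketch below is an assumption about how such an analysis might work, not the script that produced the numbers above: `RelaySummary` and `summarize` are hypothetical, frames without a method (such as responses) are passed as any other label, and the staleness rule (a `turn/started` with no matching `turn/completed` by the end of the capture) is a guessed heuristic.

```rust
/// Counts derived from a relay capture (field names follow the JSON above).
#[derive(Debug, Default, PartialEq)]
struct RelaySummary {
    frame_count: usize,
    turn_start_count: usize,
    turn_started_count: usize,
    turn_completed_count: usize,
    turn_steer_count: usize,
    stale_turn_state_suspected: bool,
}

/// Summarize a capture given the method label of each frame, in order.
fn summarize(methods: &[&str]) -> RelaySummary {
    let mut summary = RelaySummary::default();
    let mut open_turns: usize = 0;
    for &method in methods {
        summary.frame_count += 1;
        match method {
            "turn/start" => summary.turn_start_count += 1,
            "turn/started" => {
                summary.turn_started_count += 1;
                open_turns += 1;
            }
            "turn/completed" => {
                summary.turn_completed_count += 1;
                open_turns = open_turns.saturating_sub(1);
            }
            "turn/steer" => summary.turn_steer_count += 1,
            _ => {}
        }
    }
    // Assumed heuristic: a turn that started but never completed in the capture.
    summary.stale_turn_state_suspected = open_turns > 0;
    summary
}
```

This heuristic only makes sense for a capture that extends past the point where the server finished the turn, as the stale run's capture did.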

App-server transport analysis for the stale run:

{
  "slowConnectionDisconnectCount": 1,
  "droppedDisconnectedMessageCount": 65,
  "connectionIds": ["0"],
  "backpressureDisconnectObserved": true
}

Clean control under the same wrapper/relay instrumentation:

{
  "turnStartCount": 2,
  "turnStartedCount": 2,
  "turnCompletedCount": 2,
  "turnSteerCount": 0,
  "staleTurnStateSuspected": false,
  "backpressureDisconnectObserved": false
}

Likely source areas:

  • App-server bounded outbound queue and slow-client disconnect:
    • codex-rs/app-server/src/transport/mod.rs
  • Websocket close/EOF propagation:
    • codex-rs/app-server/src/transport/websocket.rs
    • codex-rs/app-server-client/src/remote.rs
  • TUI routing of next input as turn/steer based on cached active turn id:
    • codex-rs/tui/src/app/thread_routing.rs
  • Clearing active turn and visible Working state after turn/completed:
    • codex-rs/tui/src/app/thread_events.rs
    • codex-rs/tui/src/chatwidget.rs

More detailed source links and a claim-to-evidence map are included in the attached redacted evidence package.

Hypothesis:

The trigger is app-server websocket outbound backpressure from a burst of output frames. The app-server intentionally disconnects the slow websocket and drops later messages for that connection. The TUI then misses turn/completed and the idle status frame, leaving both client-side state machines stale:

  • ThreadEventStore.active_turn_id remains set, so the next prompt is routed as turn/steer.
  • ChatWidget.agent_turn_running remains true, so the visible UI can remain in Working / queued-input mode.
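
The hypothesized failure path can be reproduced in a toy model. Only the field names `active_turn_id` and `agent_turn_running` come from the hypothesis above. `ClientModel`, `Routed`, and the method names are hypothetical simplifications that merge both state machines into one struct, so this illustrates the reasoning rather than the actual TUI logic.

```rust
/// How the next prompt would be sent (hypothetical).
#[derive(Debug, PartialEq)]
enum Routed {
    TurnStart,
    TurnSteer { expected_turn_id: String },
}

/// Toy stand-in for the two pieces of client state named above.
#[derive(Debug, Default)]
struct ClientModel {
    active_turn_id: Option<String>,
    agent_turn_running: bool,
}

impl ClientModel {
    /// Apply a lifecycle notification received over the websocket.
    fn on_notification(&mut self, method: &str, turn_id: &str) {
        match method {
            "turn/started" => {
                self.active_turn_id = Some(turn_id.to_string());
                self.agent_turn_running = true;
            }
            "turn/completed" => {
                if self.active_turn_id.as_deref() == Some(turn_id) {
                    self.active_turn_id = None;
                    self.agent_turn_running = false;
                }
            }
            _ => {}
        }
    }

    /// Route the next prompt purely from cached state.
    fn route_prompt(&self) -> Routed {
        match &self.active_turn_id {
            Some(id) => Routed::TurnSteer { expected_turn_id: id.clone() },
            None => Routed::TurnStart,
        }
    }
}
```

If `turn/completed` is never delivered, `route_prompt` keeps returning `TurnSteer` with the old turn id and `agent_turn_running` stays `true`, matching the observed `turn/steer id=7` and the persistent Working state.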

Possible regression tests:

  • TUI/client test: simulate loss of turn/completed after turn/started, then verify the next prompt cannot be silently routed as turn/steer against a completed/non-active turn.
  • Remote-client test: force the websocket read side to receive close/error/EOF after app-server disconnect and verify AppServerEvent::Disconnected reaches the TUI fatal-exit/reconciliation path.
  • App-server transport test: fill a per-connection outbound queue and verify that the disconnect is observable by that client, or that dropped terminal lifecycle notifications cannot leave the client in a stale state.
  • Protocol test: turn/steer with expectedTurnId for a completed turn should return a response/error that allows the TUI to clear stale state and start a fresh turn.
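
For the protocol test in particular, the behavior under test could look like the following. This is a sketch of one possible contract, not the current protocol: `SteerError`, `validate_steer`, and `should_restart_as_turn_start` are hypothetical, and the issue does not establish what the app-server returns today (no response was observed).

```rust
/// Possible rejection reasons for a turn/steer request (hypothetical).
#[derive(Debug, PartialEq)]
enum SteerError {
    /// The thread is idle; there is nothing to steer.
    NoActiveTurn,
    /// A different turn is active than the one the client expected.
    TurnMismatch { active_turn_id: String },
}

/// Server-side check of `expectedTurnId` against the authoritative state.
fn validate_steer(
    expected_turn_id: &str,
    active_turn_id: Option<&str>,
) -> Result<(), SteerError> {
    match active_turn_id {
        None => Err(SteerError::NoActiveTurn),
        Some(active) if active == expected_turn_id => Ok(()),
        Some(active) => Err(SteerError::TurnMismatch {
            active_turn_id: active.to_string(),
        }),
    }
}

/// Client-side reaction: on `NoActiveTurn`, clear cached state and
/// resend the prompt as a fresh turn/start.
fn should_restart_as_turn_start(result: &Result<(), SteerError>) -> bool {
    matches!(result, Err(SteerError::NoActiveTurn))
}
```

A regression test along these lines would steer a completed turn, assert that an explicit error comes back rather than silence, and assert that the client then falls back to `turn/start`.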

Attachment package:

I am attaching codex-stale-tui-evidence-redacted.zip.

It contains:

  • triage-summary.md - one-page maintainer summary.
  • claim-to-evidence-map.md - each claim mapped to exact redacted evidence.
  • worker-run-matrix.md - all supervised worker runs and outcomes.
  • source-analysis.md - upstream source pointers and inferred failure path.
  • redaction-report.md - what was removed from raw evidence.
  • stale-run-relay-analysis.redacted.json
  • stale-run-relay-metadata.jsonl
  • stale-run-transport-events.redacted.json
  • stale-run-observer-summary.redacted.json
  • direct-turn-probe-summary.redacted.json
  • clean-control-relay-analysis.redacted.json
  • clean-control-relay-metadata.jsonl
  • clean-control-transport-analysis.redacted.json
  • screenshots-redacted/*.png

Raw logs are not attached because they contain local paths, prompt text, command transcripts, hook paths, private repository remotes, and unrelated large remote response bodies. The package contains derived redacted metadata, summaries, and aggressively redacted screenshots.

Redacted screenshots can be inlined separately if useful; the protocol evidence in the zip is the primary evidence.

codex-stale-tui-evidence-redacted.zip

Labels: TUI, app-server, bug
