net: UDS write&read deadlock #74369

@SimonCZW

Go version

go1.18.10

Output of go env in your module/workspace:

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/chenzhuowen.simon/.cache/go-build"
GOENV="/home/chenzhuowen.simon/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/chenzhuowen.simon/.gvm/pkgsets/go1.18.10/global/pkg/mod"
GONOPROXY="*.byted.org,*.everphoto.cn,git.smartisan.com"
GONOSUMDB="*.byted.org,*.everphoto.cn,git.smartisan.com"
GOOS="linux"
GOPATH="/home/chenzhuowen.simon/.gvm/pkgsets/go1.18.10/global"
GOPRIVATE="*.byted.org,*.everphoto.cn,git.smartisan.com"
GOPROXY="https://go-mod-proxy.byted.org,https://goproxy.cn,https://proxy.golang.org,direct"
GOROOT="/home/chenzhuowen.simon/.gvm/gos/go1.18.10"
GOSUMDB="sum.golang.google.cn"
GOTMPDIR=""
GOTOOLDIR="/home/chenzhuowen.simon/.gvm/gos/go1.18.10/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18.10"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1987224892=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Description
We are observing a deadlock between two Go processes that communicate over a Unix domain socket (UDS). The client (Go 1.22.12) and the server (Go 1.18.10) each run dedicated per-connection goroutines, one for sending and one for receiving.
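
To make the topology concrete, here is a minimal single-process sketch of the setup described above: one UDS connection, a dedicated write goroutine on the client side, and a dedicated read goroutine on the server side. It is not our production code. The function name `transfer`, the temporary socket path, and the 32 KiB chunk size are assumptions made only for illustration, and this sketch does not by itself reproduce the deadlock.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// transfer (hypothetical helper) sends total bytes over a fresh UDS connection
// and returns how many bytes the server-side read goroutine received.
func transfer(total int) (int64, error) {
	dir, err := os.MkdirTemp("", "uds-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "s.sock")

	ln, err := net.Listen("unix", path)
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	type result struct {
		n   int64
		err error
	}
	readDone := make(chan result, 1)

	// Server side: accept one connection and drain it in a dedicated read goroutine.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			readDone <- result{0, err}
			return
		}
		defer conn.Close()
		n, err := io.Copy(io.Discard, conn) // returns when the client closes
		readDone <- result{n, err}
	}()

	conn, err := net.Dial("unix", path)
	if err != nil {
		return 0, err
	}

	// Client side: dedicated write goroutine.
	writeDone := make(chan error, 1)
	go func() {
		buf := make([]byte, 32*1024)
		for sent := 0; sent < total; {
			chunk := buf
			if rem := total - sent; rem < len(chunk) {
				chunk = chunk[:rem]
			}
			n, err := conn.Write(chunk)
			sent += n
			if err != nil {
				writeDone <- err
				return
			}
		}
		writeDone <- conn.Close() // closing signals EOF to the reader
	}()

	if err := <-writeDone; err != nil {
		conn.Close()
		return 0, err
	}
	r := <-readDone
	return r.n, r.err
}

func main() {
	n, err := transfer(1 << 20)
	fmt.Println("server read bytes:", n, "err:", err)
}
```

In the healthy case the reader's count equals the number of bytes written. In the failure we see, the equivalent of `conn.Write` and the equivalent of the server's `Read` both stay parked.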

What did you see happen?

Unexpected Behavior
For the same connection:

  1. The client's write goroutine is parked in runtime.netpollblock, reached via internal/poll.(*pollDesc).waitWrite (the write to the fd cannot make progress)
  2. Simultaneously, the server's read goroutine is parked in runtime.netpollblock, reached via internal/poll.(*pollDesc).waitRead (the read from the fd cannot make progress)
  3. Both goroutines entered the blocked state at exactly the same timestamp
  4. The stack traces below confirm that both are suspended by Go's runtime netpoll mechanism

client traces:

(dlv) bt
 0  0x0000000000a7014e in runtime.gopark
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/runtime/proc.go:403
 1  0x0000000000a67f97 in runtime.netpollblock
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/runtime/netpoll.go:573
 2  0x0000000000aa3125 in internal/poll.runtime_pollWait
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/runtime/netpoll.go:345
 3  0x0000000000b28a47 in internal/poll.(*pollDesc).wait
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/internal/poll/fd_poll_runtime.go:84
 4  0x0000000000b2bf19 in internal/poll.(*pollDesc).waitWrite
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/internal/poll/fd_poll_runtime.go:93
 5  0x0000000000b2bf19 in internal/poll.(*FD).Write
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/internal/poll/fd_unix.go:388
 6  0x0000000000cc7ee5 in net.(*netFD).Write
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/net/fd_posix.go:96
 7  0x0000000000cda245 in net.(*conn).Write
    at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/net/net.go:197
 8  0x0000000000cef725 in net.(*UnixConn).Write
    at <autogenerated>:1

server traces:

(dlv) bt
 0  0x0000000000882996 in runtime.gopark
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/runtime/proc.go:362
 1  0x000000000087b357 in runtime.netpollblock
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/runtime/netpoll.go:522
 2  0x00000000008adde9 in internal/poll.runtime_pollWait
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/runtime/netpoll.go:302
 3  0x00000000009246f2 in internal/poll.(*pollDesc).wait
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/internal/poll/fd_poll_runtime.go:83
 4  0x0000000000925bda in internal/poll.(*pollDesc).waitRead
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/internal/poll/fd_poll_runtime.go:88
 5  0x0000000000925bda in internal/poll.(*FD).Read
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/internal/poll/fd_unix.go:167
 6  0x0000000000a43729 in net.(*netFD).Read
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/net/fd_posix.go:55
 7  0x0000000000a57605 in net.(*conn).Read
    at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/net/net.go:183

Key Observations

  • Go's netpoll parks a goroutine only after the underlying system call returns EAGAIN
  • Kernel version: 5.4 (fairly recent, which is why we suspect Go's implementation rather than the kernel)
  • Go versions: Client = 1.22, Server = 1.18
  • Suspected root cause: Go's epoll implementation (edge-triggered mode) might be:
    • a) Missing kernel callbacks, or
    • b) Mishandling EAGAIN in ET state transitions
    • Particularly concerning since server (Go 1.18) is known to have ET-related fixes missing

Additional Critical Observations

  • Consistent 255,808 Byte Read Pattern
    • Server read goroutines always block precisely after reading 255,808 bytes
    • The same byte count recurs across multiple incidents
  • Server-Initiated Connection Closure Triggers Issue. Problem manifests when:
    • ✅ Server actively closes connection → Client reconnects → Deadlock occurs
    • ❌ Client closes connection → No issue observed
    • Behavior is intermittent but reproducible during reconnection sequences

What did you expect to see?

The deadlock should not happen.

Is there a related issue or commit?
