Closed as not planned
Description
Go version
go1.18.10
Output of go env in your module/workspace:
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/chenzhuowen.simon/.cache/go-build"
GOENV="/home/chenzhuowen.simon/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/chenzhuowen.simon/.gvm/pkgsets/go1.18.10/global/pkg/mod"
GONOPROXY="*.byted.org,*.everphoto.cn,git.smartisan.com"
GONOSUMDB="*.byted.org,*.everphoto.cn,git.smartisan.com"
GOOS="linux"
GOPATH="/home/chenzhuowen.simon/.gvm/pkgsets/go1.18.10/global"
GOPRIVATE="*.byted.org,*.everphoto.cn,git.smartisan.com"
GOPROXY="https://go-mod-proxy.byted.org,https://goproxy.cn,https://proxy.golang.org,direct"
GOROOT="/home/chenzhuowen.simon/.gvm/gos/go1.18.10"
GOSUMDB="sum.golang.google.cn"
GOTMPDIR=""
GOTOOLDIR="/home/chenzhuowen.simon/.gvm/gos/go1.18.10/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18.10"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1987224892=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Description
We're observing a deadlock scenario between two Go processes communicating via Unix Domain Sockets (UDS). Both client (Go 1.22.12) and server (Go 1.18.10) have dedicated per-connection goroutines for sending and receiving.
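For readers unfamiliar with the arrangement described above, the following is a minimal sketch of a UDS server that gives each connection one dedicated read goroutine and one dedicated write goroutine. It is not the code from the affected services: the echo behavior, the channel between the two goroutines, and the names `serve` and `roundTrip` are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"path/filepath"
)

// serve accepts connections and starts two goroutines per connection:
// one that only reads and one that only writes (here, a simple echo).
func serve(ln net.Listener) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		out := make(chan []byte, 16)
		// Dedicated read goroutine: parks in netpoll when no data is available.
		go func() {
			defer close(out)
			buf := make([]byte, 4096)
			for {
				n, err := conn.Read(buf)
				if n > 0 {
					out <- append([]byte(nil), buf[:n]...)
				}
				if err != nil {
					return
				}
			}
		}()
		// Dedicated write goroutine: parks in netpoll when the peer's buffer is full.
		go func() {
			defer conn.Close()
			failed := false
			for msg := range out {
				if failed {
					continue // keep draining so the reader never blocks on the channel
				}
				if _, err := conn.Write(msg); err != nil {
					failed = true
					conn.Close() // unblock the read goroutine
				}
			}
		}()
	}
}

// roundTrip starts a throwaway server on a temporary socket path,
// sends msg from a client write goroutine, and reads the echo back.
func roundTrip(msg string) (string, error) {
	dir, err := os.MkdirTemp("", "uds")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	ln, err := net.Listen("unix", filepath.Join(dir, "s.sock"))
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go serve(ln)

	conn, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	errc := make(chan error, 1)
	go func() { _, err := conn.Write([]byte(msg)); errc <- err }()
	reply := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, reply); err != nil {
		return "", err
	}
	return string(reply), <-errc
}

func main() {
	reply, err := roundTrip("ping")
	fmt.Println(reply, err)
}
```

In this layout a blocked `conn.Write` and a blocked `conn.Read` each show up as a goroutine parked in `internal/poll.(*pollDesc).wait`, which matches the shape of the traces reported below.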
What did you see happen?
Unexpected Behavior
For the same connection:
- The client's write goroutine is blocked at runtime.netpollWait → waitWrite (unable to write to fd)
- Simultaneously, the server's read goroutine is blocked at runtime.netpollWait → waitRead (unable to read from fd)
- Both goroutines entered the blocked state at exactly the same timestamp
- Stack traces confirm they're suspended by Go's runtime netpoll mechanism
client traces:
(dlv) bt
0 0x0000000000a7014e in runtime.gopark
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/runtime/proc.go:403
1 0x0000000000a67f97 in runtime.netpollblock
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/runtime/netpoll.go:573
2 0x0000000000aa3125 in internal/poll.runtime_pollWait
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/runtime/netpoll.go:345
3 0x0000000000b28a47 in internal/poll.(*pollDesc).wait
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/internal/poll/fd_poll_runtime.go:84
4 0x0000000000b2bf19 in internal/poll.(*pollDesc).waitWrite
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/internal/poll/fd_poll_runtime.go:93
5 0x0000000000b2bf19 in internal/poll.(*FD).Write
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/internal/poll/fd_unix.go:388
6 0x0000000000cc7ee5 in net.(*netFD).Write
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/net/fd_posix.go:96
7 0x0000000000cda245 in net.(*conn).Write
at /home/chenzhuowen.simon/.gvm/gos/go1.22.12/src/net/net.go:197
8 0x0000000000cef725 in net.(*UnixConn).Write
at <autogenerated>:1
server traces:
(dlv) bt
0 0x0000000000882996 in runtime.gopark
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/runtime/proc.go:362
1 0x000000000087b357 in runtime.netpollblock
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/runtime/netpoll.go:522
2 0x00000000008adde9 in internal/poll.runtime_pollWait
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/runtime/netpoll.go:302
3 0x00000000009246f2 in internal/poll.(*pollDesc).wait
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/internal/poll/fd_poll_runtime.go:83
4 0x0000000000925bda in internal/poll.(*pollDesc).waitRead
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/internal/poll/fd_poll_runtime.go:88
5 0x0000000000925bda in internal/poll.(*FD).Read
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/internal/poll/fd_unix.go:167
6 0x0000000000a43729 in net.(*netFD).Read
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/net/fd_posix.go:55
7 0x0000000000a57605 in net.(*conn).Read
at /home/chenzhuowen.simon/.gvm/gos/go1.18.10/src/net/net.go:183
Key Observations
- Go's netpoll only suspends goroutines when system calls return EAGAIN
- Kernel version: 5.4 (recent, suggesting issue is likely in Go's implementation)
- Go versions: Client = 1.22, Server = 1.18
- Suspected root cause: Go's epoll implementation (edge-triggered mode) might be:
  - a) Missing kernel callbacks, or
  - b) Mishandling EAGAIN in ET state transitions
- This is particularly concerning since the server (Go 1.18) is known to have ET-related fixes missing
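The first observation above (a goroutine is only parked after a system call returns EAGAIN) can be checked outside the runtime's own write path. The sketch below is an assumption-laden diagnostic, not part of the original report: it uses the standard `SyscallConn` API to issue raw `write(2)` calls on the non-blocking fd of a UDS connection whose peer never reads, and reports how many bytes the kernel accepted before refusing more. The names `writeUntilEAGAIN` and `probe` are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// writeUntilEAGAIN issues raw write(2) calls on the connection's
// non-blocking fd until the kernel returns an error, and reports the
// number of bytes accepted together with that error.
func writeUntilEAGAIN(c *net.UnixConn) (int, error) {
	rc, err := c.SyscallConn()
	if err != nil {
		return 0, err
	}
	chunk := make([]byte, 4096)
	total := 0
	var werr error
	ctlErr := rc.Write(func(fd uintptr) bool {
		for {
			n, err := syscall.Write(int(fd), chunk)
			if n > 0 {
				total += n
			}
			if err == syscall.EINTR {
				continue
			}
			if err != nil {
				werr = err
				// Returning true stops here. Returning false would ask the
				// runtime to park this goroutine until the fd is writable.
				return true
			}
		}
	})
	if ctlErr != nil {
		return total, ctlErr
	}
	return total, werr
}

// probe connects a client to a throwaway server that accepts the
// connection but never reads from it, then fills the socket.
func probe() (total int, sawEAGAIN bool, err error) {
	dir, err := os.MkdirTemp("", "uds")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)
	ln, err := net.Listen("unix", filepath.Join(dir, "s.sock"))
	if err != nil {
		return 0, false, err
	}
	defer ln.Close()
	conn, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return 0, false, err
	}
	defer conn.Close()
	// A Unix stream connect completes once queued in the listen backlog,
	// so accepting after Dial in the same goroutine is sufficient here.
	peer, err := ln.Accept()
	if err != nil {
		return 0, false, err
	}
	defer peer.Close() // held open, never read

	total, werr := writeUntilEAGAIN(conn.(*net.UnixConn))
	return total, werr == syscall.EAGAIN, nil
}

func main() {
	total, sawEAGAIN, err := probe()
	fmt.Println("bytes accepted:", total, "EAGAIN:", sawEAGAIN, "err:", err)
}
```

The byte count depends on kernel socket buffer settings, so no particular value is claimed here. Running the same probe against the real server socket while the hang is occurring would show whether the kernel itself is still refusing writes.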
Additional Critical Observations
- Consistent 255,808 Byte Read Pattern
  - Server read goroutines always block precisely after reading 255,808 bytes
  - This exact byte count occurs across multiple occurrences
- Server-Initiated Connection Closure Triggers Issue. Problem manifests when:
  - ✅ Server actively closes connection → Client reconnects → Deadlock occurs
  - ❌ Client closes connection → No issue observed
  - Behavior is intermittent but reproducible during reconnection sequences
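One way to investigate the recurring 255,808-byte figure is to compare it with the socket buffer sizes the kernel reports for the connection. Whether the two are related is an assumption, not something this report establishes. The sketch below reads `SO_SNDBUF` and `SO_RCVBUF` through the standard `SyscallConn` API; the helper names `bufferSizes` and `localSizes` are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// bufferSizes returns the SO_SNDBUF and SO_RCVBUF values the kernel
// reports for the given Unix connection.
func bufferSizes(c *net.UnixConn) (snd, rcv int, err error) {
	rc, err := c.SyscallConn()
	if err != nil {
		return 0, 0, err
	}
	var serr error
	cerr := rc.Control(func(fd uintptr) {
		snd, serr = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_SNDBUF)
		if serr != nil {
			return
		}
		rcv, serr = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVBUF)
	})
	if cerr != nil {
		return 0, 0, cerr
	}
	return snd, rcv, serr
}

// localSizes opens a throwaway UDS connection and reports its buffer sizes.
func localSizes() (int, int, error) {
	dir, err := os.MkdirTemp("", "uds")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	ln, err := net.Listen("unix", filepath.Join(dir, "s.sock"))
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()
	conn, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer conn.Close()
	return bufferSizes(conn.(*net.UnixConn))
}

func main() {
	snd, rcv, err := localSizes()
	fmt.Println("SO_SNDBUF:", snd, "SO_RCVBUF:", rcv, "err:", err)
}
```

On Linux the reported values include kernel bookkeeping overhead, so they are best treated as an upper bound on payload capacity rather than an exact figure to match against 255,808.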
What did you expect to see?
No deadlock: the client's write and the server's read on the same connection should not both block indefinitely.
Is there a related issue or commit?