Goroutine leaks caused by Go http.Transport keep-alive connections

December 26, 2024, 06:09

This is a write-up of a goroutine leak recently found in a production service. Monitoring showed the service slowly leaking memory. Querying its goroutine count with pprof at http://ip:port/debug/pprof/goroutine?debug=1 revealed a goroutine leak: the leaked goroutines had grown past 20,000, and the debug page showed stacks like the following:

#	0x42f8b4	internal/poll.runtime_pollWait+0x54		/usr/local/go/src/runtime/netpoll.go:203
#	0x4cb714	internal/poll.(*pollDesc).wait+0x44		/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4cc5aa	internal/poll.(*pollDesc).waitRead+0x19a	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4cc58c	internal/poll.(*FD).Read+0x17c			/usr/local/go/src/internal/poll/fd_unix.go:169
#	0x5414be	net.(*netFD).Read+0x4e				/usr/local/go/src/net/fd_unix.go:202
#	0x553dad	net.(*conn).Read+0x8d				/usr/local/go/src/net/net.go:184
#	0x7045b4	net/http.(*persistConn).Read+0x74		/usr/local/go/src/net/http/transport.go:1825
#	0x5c9ae2	bufio.(*Reader).fill+0x102			/usr/local/go/src/bufio/bufio.go:100
#	0x5c9c4e	bufio.(*Reader).Peek+0x4e			/usr/local/go/src/bufio/bufio.go:138
#	0x705247	net/http.(*persistConn).readLoop+0x1a7		/usr/local/go/src/net/http/transport.go:1978

Switching the debug mode to 2 (http://ip:port/debug/pprof/goroutine?debug=2) shows the full call stack of each goroutine:

goroutine 2280278 [IO wait, 435 minutes]:
internal/poll.runtime_pollWait(0x7f36c5cc9aa0, 0x72, 0x18d73e0)
	/usr/local/go/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc01a65b698, 0x72, 0xc0313d2000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc01a65b680, 0xc0313d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:159 +0x1b1
net.(*netFD).Read(0xc01a65b680, 0xc0313d2000, 0x1000, 0x1000, 0xc0002015c0, 0xc020906d98, 0xc020906c20)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0205ea7d0, 0xc0313d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:182 +0x8e
net/http.(*persistConn).Read(0xc0006e9320, 0xc0313d2000, 0x1000, 0x1000, 0xc020906eb0, 0x46ca40, 0xc020906eb0)
	/usr/local/go/src/net/http/transport.go:1887 +0x77
bufio.(*Reader).fill(0xc01f7f5ce0)
	/usr/local/go/src/bufio/bufio.go:101 +0x105
bufio.(*Reader).Peek(0xc01f7f5ce0, 0x1, 0x2, 0x0, 0x0, 0x0, 0xc02691cfc0)
	/usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc0006e9320)
	/usr/local/go/src/net/http/transport.go:2040 +0x1a8
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1708 +0xcb7

What is puzzling is that these goroutines were started by low-level library code, so the likely culprit is some base library being misused. The frames are clearly in the net/http internals: pollWait, persistConn, and the "IO wait" state all stand out. To pin down which call site is responsible, we can go to the production host and inspect the connection handles the service process has open.

Find the service's pid with pidof or ps, then run lsof -p <pid> to list the process's open file descriptors. A large number of suspicious established connections show up:

srv 17987 work   30u     IPv4 30417287        0t0      TCP localhost:49788->localhost:fmtp (ESTABLISHED)
srv 17987 work   31u     IPv4 30697379        0t0      TCP localhost:61063->localhost:fmtp (ESTABLISHED)
srv 17987 work   32u     IPv4 30565197        0t0      TCP localhost:34199->localhost:fmtp (ESTABLISHED)
srv 17987 work   33u     IPv4 30574610        0t0      TCP localhost:34687->localhost:fmtp (ESTABLISHED)
srv 17987 work   34u     IPv4 29940569        0t0      TCP localhost:47346->localhost:fmtp (ESTABLISHED)
srv 17987 work   35u     IPv4 30708877        0t0      TCP localhost:61533->localhost:fmtp (ESTABLISHED)
srv 17987 work   36u     IPv4 30164366        0t0      TCP localhost:43103->localhost:fmtp (ESTABLISHED)
srv 17987 work   37u     IPv4 30563301        0t0      TCP localhost:34663->localhost:fmtp (ESTABLISHED)
srv 17987 work   38u     IPv4 30709910        0t0      TCP localhost:61559->localhost:fmtp (ESTABLISHED)
srv 17987 work   39u     IPv4 30238360        0t0      TCP localhost:58203->localhost:fmtp (ESTABLISHED)
srv 17987 work   40u     IPv4 30342583        0t0      TCP localhost:35954->localhost:fmtp (ESTABLISHED)
srv 17987 work   41u     IPv4 30838726        0t0      TCP localhost:42024->localhost:fmtp (ESTABLISHED)
srv 17987 work   42u     IPv4 30415497        0t0      TCP localhost:49098->localhost:fmtp (ESTABLISHED)
srv 17987 work   43u     IPv4 30840173        0t0      TCP localhost:42478->localhost:fmtp (ESTABLISHED)
srv 17987 work   44u     IPv4 30416471        0t0      TCP localhost:49762->localhost:fmtp (ESTABLISHED)
srv 17987 work   45u     IPv4 30839661        0t0      TCP localhost:42500->localhost:fmtp (ESTABLISHED)
srv 17987 work   46u     IPv4 31288225        0t0      TCP localhost:39302->localhost:fmtp (ESTABLISHED)
srv 17987 work   47u     IPv4 30997795        0t0      TCP localhost:27359->localhost:fmtp (ESTABLISHED)
srv 17987 work   48u     IPv4 31009992        0t0      TCP localhost:27841->localhost:fmtp (ESTABLISHED)
srv 17987 work   49u     IPv4 30559437        0t0      TCP localhost:33999->localhost:fmtp (ESTABLISHED)

Counting them with grep -c, the total almost exactly matches the leaked goroutine count, so these are the leaked connections we are looking for. Running lsof -P -p <pid> prints numeric ports instead of service names (the "fmtp" above is just /etc/services' label for port 8500), which identifies the peer: in our case, a large number of connections to Consul, whose HTTP API listens on port 8500 by default.
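The counting step can be sketched as below. The live command is shown in a comment; to keep the sketch self-contained it is demonstrated on a saved two-line sample of the output above, with port 8500 (Consul's default) substituted for the "fmtp" service name:

```shell
# Live form (pid and service name from the listing above):
#   lsof -nP -p "$(pidof srv)" | grep -c ESTABLISHED
# -n/-P keep addresses and ports numeric, so 8500 is printed instead
# of the /etc/services name "fmtp".

# Demonstrated on a saved sample of that output:
sample='srv 17987 work 30u IPv4 30417287 0t0 TCP 127.0.0.1:49788->127.0.0.1:8500 (ESTABLISHED)
srv 17987 work 31u IPv4 30697379 0t0 TCP 127.0.0.1:61063->127.0.0.1:8500 (ESTABLISHED)'
printf '%s\n' "$sample" | grep -c ESTABLISHED   # prints the connection count: 2
```

Comparing that number against the total on the pprof goroutine page is what confirms the connections and the leaked goroutines are the same population.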

Why would Transport cause a leak? Tracing into the source, Transport's doc comment says:

// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.

That is, a Transport should be reused rather than created as needed, and it is safe for concurrent use by multiple goroutines. The reason is that each Transport maintains its own pool of persistent keep-alive connections, and every pooled connection is serviced by background goroutines (the readLoop/writeLoop pair started in dialConn, exactly the frames in our stacks). If a fresh Transport is created per request and then dropped, nothing ever closes its idle connections: the readLoop goroutine keeps the connection reachable, so the fd is never released and the goroutine stays parked in IO wait forever. That is the leak we observed.

Back in the business code, searching for Transport usage at the corresponding call site showed it was being created anew on every call, and that is exactly what produced the mass of leaked goroutines. The fix is simply to initialize the Transport (or the http.Client wrapping it) once as a package-level variable and reuse it everywhere.

