Go Crash Analysis in Practice: From Coredump to Root Cause

Posted on Fri 23 January 2026 in Tech

Abstract	Go Crash Analysis in Practice: From Coredump to Root Cause
Authors	Walter Fan
Category	tech note
Status	v1.0
Updated	2026-01-23
License	CC-BY-NC-ND 4.0

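It's 3 a.m. and the alert channel blows up: "xxx-service is down and has auto-restarted 5 times."

You rub your eyes, open the logs, and find output that is as familiar as it is hopeless: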

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4a3b2c]

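And then? Nothing. The process is dead, the scene is gone, and all you can do is stare at a few lines of stack trace and guess: which variable was nil? What state was the program in?

That is the most painful part of a Go crash: it leaves very little information behind, so the post-mortem is mostly guesswork.

Go does support coredumps, though. They are just off by default and take a little setup. This post covers:

How to make a Go program write a coredump when it crashes
How to analyze the coredump with Delve and reconstruct the crash scene
How to write code that crashes less often

What you get out of it: a practical approach to crash analysis and prevention, so the next panic doesn't leave you guessing.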

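1. Why Do Go Programs Crash?

First, let's pin down what "crash" means in Go. There are several cases:

Crash type	Trigger	Recoverable?	Typical scenarios
panic	Code calls `panic()` or the runtime detects an error	Yes (defer + recover in the same goroutine)	nil pointer dereference, index out of range, failed type assertion
fatal error	Internal runtime error	No	concurrent map writes, stack overflow, segfault in cgo
signal	External signal (SIGSEGV, SIGABRT, etc.)	No	illegal memory access, cgo crash

Key differences:

A panic can be caught, as long as you use defer + recover in the same goroutine
A fatal error or a fatal signal is unrecoverable; the process is going to die

One subtlety: the SIGSEGV that a nil dereference raises in Go code is converted by the runtime into an ordinary panic, which is why the log above shows both "panic:" and "[signal SIGSEGV ...]". That case belongs in the panic row.

So when you see fatal error: concurrent map writes, don't bother with recover. It can't save you.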

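2. Generating a Coredump When a Go Program Crashes

2.1 Understanding the GOTRACEBACK environment variable

The Go runtime uses the GOTRACEBACK environment variable to control what happens on a crash:

Value	Behavior
`none`	Prints no stack trace
`single`	(Default) Prints only the goroutine that triggered the panic
`all`	Prints stack traces for all user-created goroutines
`system`	Like `all`, but also includes runtime-internal frames and goroutines
`crash`	Like `system`, then aborts in an OS-specific way (SIGABRT on Unix) so the OS can write a coredump

To get a coredump, set: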

export GOTRACEBACK=crash
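
If you cannot control the environment of the process, the same level can be requested from code through runtime/debug.SetTraceback. The sketch below is only an illustration; the wrapper name enableCrashTraceback is an assumption, and according to the package documentation the call cannot lower the level below what the environment variable already specifies.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// enableCrashTraceback asks the runtime to behave as if GOTRACEBACK=crash
// had been set. (Hypothetical wrapper around debug.SetTraceback.)
func enableCrashTraceback() {
	debug.SetTraceback("crash")
}

func main() {
	enableCrashTraceback()
	fmt.Println("traceback level requested: crash")
}
```

Calling it first thing in main keeps the behavior consistent no matter how the binary is launched. The OS-level settings in the next section are still required.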

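2.2 Enabling coredumps at the OS level

Setting GOTRACEBACK=crash is not enough on its own. Many systems default the core file size limit to 0 (dumps can be large), which disables coredumps. You need to:

# Check the current core file size limit
ulimit -c

# If it prints 0, lift the limit
ulimit -c unlimited

# Confirm it took effect
ulimit -c  # should print unlimited

Note: ulimit only affects the current shell session and the processes started from it. In production, put it in the startup script or the systemd service configuration (for example, LimitCORE=infinity in the unit file).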
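
A process can also inspect and raise its own soft limit at startup. This is a rough, Unix-only sketch using the standard syscall package; the function name raiseCoreLimit is made up, and it can only go as high as the hard limit the process inherited.

```go
package main

import (
	"fmt"
	"syscall"
)

// raiseCoreLimit raises the soft RLIMIT_CORE to the hard limit and
// returns the resulting limits. (Hypothetical helper; Unix-only.)
func raiseCoreLimit() (syscall.Rlimit, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_CORE, &lim); err != nil {
		return lim, fmt.Errorf("getrlimit: %w", err)
	}
	lim.Cur = lim.Max // the soft limit may not exceed the hard limit
	if err := syscall.Setrlimit(syscall.RLIMIT_CORE, &lim); err != nil {
		return lim, fmt.Errorf("setrlimit: %w", err)
	}
	return lim, nil
}

func main() {
	lim, err := raiseCoreLimit()
	if err != nil {
		fmt.Println("could not raise core limit:", err)
		return
	}
	fmt.Printf("core limit: soft=%d hard=%d\n", lim.Cur, lim.Max)
}
```

If the hard limit itself is 0, this changes nothing, and the limit has to be fixed in whatever launches the process.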

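2.3 Configuring the coredump file path

With the kernel's default core_pattern, the coredump is written to the program's working directory as core or core.<pid>.

On Linux you can customize the path and file name:

# Show the current setting
cat /proc/sys/kernel/core_pattern

# Customize it (requires root)
# %e = executable name, %p = PID, %t = timestamp
echo '/tmp/coredumps/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern

# Make sure the directory exists
# (777 keeps the demo simple; tighten it in production, since dumps contain process memory)
sudo mkdir -p /tmp/coredumps
sudo chmod 777 /tmp/coredumps

Ubuntu/Debian gotcha: Ubuntu uses apport to handle crashes by default, and it intercepts coredumps. Your options:

# Option 1: stop apport temporarily
sudo service apport stop

# Option 2: disable it permanently
sudo systemctl disable apport

# Option 3: have apport keep the coredump (edit /etc/apport/crashdb.conf)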
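
A quick way to tell whether something like apport is intercepting dumps is to look at the first character of core_pattern: a leading | means the kernel pipes the dump to a helper program instead of writing a file. The sketch below shows that check; describeCorePattern is a made-up name, and the handler path in the comment is only an example of what such a value can look like.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// describeCorePattern classifies a kernel core_pattern value.
// (Hypothetical helper for illustration.)
func describeCorePattern(pattern string) string {
	p := strings.TrimSpace(pattern)
	switch {
	case p == "":
		return "empty pattern"
	case strings.HasPrefix(p, "|"):
		// Example of a piped value: "|/usr/share/apport/apport -p%p"
		fields := strings.Fields(p[1:])
		if len(fields) == 0 {
			return "piped to handler (unspecified)"
		}
		return "piped to handler: " + fields[0]
	case strings.HasPrefix(p, "/"):
		return "file at absolute path"
	default:
		return "file relative to the working directory"
	}
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/core_pattern")
	if err != nil {
		fmt.Println("cannot read core_pattern:", err)
		return
	}
	fmt.Println(describeCorePattern(string(data)))
}
```

Logging this once at service startup makes it obvious where to look when a crash does happen.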

2.4 Full example: generating a coredump

Write a program that crashes:

// crash_demo.go
package main

import (
    "fmt"
    "time"
)

type User struct {
    Name string
    Age  int
}

func main() {
    fmt.Println("Starting...")

    // Simulate some business logic
    users := make(map[int]*User)
    users[1] = &User{Name: "Alice", Age: 30}

    // Trigger the panic after 3 seconds
    time.Sleep(3 * time.Second)

    // Look up a key that doesn't exist, then access a field -> nil pointer
    user := users[999]
    fmt.Println(user.Name)  // BOOM!
}

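Build and run:

# Disable optimizations and inlining so debugging is easier
go build -gcflags="all=-N -l" -o crash_demo crash_demo.go

# Set the environment and run
export GOTRACEBACK=crash
ulimit -c unlimited

./crash_demo

After the crash you will see something like this (abbreviated; with GOTRACEBACK=crash the real output also lists runtime frames and other goroutines):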

Starting...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x...]

goroutine 1 [running]:
main.main()
    /path/to/crash_demo.go:22 +0x...

Aborted (core dumped)  # <-- the key line: a coredump was written

Check for the coredump file:

ls -la core*
# Or look in the path you configured
ls -la /tmp/coredumps/

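3. Analyzing a Coredump with Delve

3.1 Installing Delve

go install github.com/go-delve/delve/cmd/dlv@latest

# Confirm the install worked
dlv version

3.2 Loading the coredump

# Syntax: dlv core <binary> <coredump file>
dlv core ./crash_demo core.12345

Once you are at the Delve prompt, you can inspect the crash scene much as you would a live process: stacks, variables, and goroutines are all there, though you cannot resume execution. Use the exact binary that produced the dump, or the symbols will not match.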

3.3 Common debugging commands

# Show the current call stack
(dlv) bt
# or
(dlv) stack

# Sample output:
#  0  0x00000000004a3b2c in main.main
#     at ./crash_demo.go:22
#  1  0x0000000000435a87 in runtime.main
#     at /usr/local/go/src/runtime/proc.go:250

# Switch to a specific stack frame
(dlv) frame 0

# Show the source around the current location
(dlv) list

# Show local variables
(dlv) locals

# Print a specific variable
(dlv) print user
# Output: *main.User nil

(dlv) print users
# Output: map[int]*main.User [
#   1: *{Name: "Alice", Age: 30},
# ]

# List all goroutines
(dlv) goroutines

# Switch to a specific goroutine
(dlv) goroutine 1

# Show that goroutine's call stack
(dlv) bt

# Show registers (advanced)
(dlv) regs

# Examine memory (advanced)
(dlv) x -fmt hex -len 64 0x12345678

# Quit
(dlv) quit

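3.4 Hands-on: tracking down the nil pointer

Back to our example. The line numbers in this session assume the source file without its comment lines. In a real dump the top frames of the crashing goroutine are often runtime frames (for example runtime.sigpanic), so you may need a frame number higher than 0 to reach main.main. Inside Delve:

(dlv) bt
0  0x00000000004a3b2c in main.main
   at ./crash_demo.go:22

(dlv) frame 0
(dlv) list
    16:     users := make(map[int]*User)
    17:     users[1] = &User{Name: "Alice", Age: 30}
    18:
    19:     time.Sleep(3 * time.Second)
    20:
    21:     user := users[999]
=>  22:     fmt.Println(user.Name)
    23: }

(dlv) print user
*main.User nil

(dlv) print users[999]
Command failed: key not found

Now the picture is clear: key 999 is not in the map, so the lookup on line 21 set user to nil, and line 22 then read the Name field through that nil pointer, which caused the crash.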

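4. Alternatives When You Can't Enable Coredumps

Sometimes you cannot enable coredumps (container environments, permission restrictions, and so on). Here are some other approaches:

4.1 Use gcore to dump a running process

# Find the process PID
ps aux | grep your_program

# Write a coredump (the process is paused briefly but not killed)
sudo gcore -o /tmp/dump <PID>

# Then analyze it with Delve
dlv core ./your_program /tmp/dump.<PID>

4.2 Remote debugging with Delve

# On the server, start a headless Delve server attached to the process
# (attaching pauses the process until a client resumes it, and the port has no authentication, so restrict access to it)
dlv attach <PID> --headless --listen=:2345 --api-version=2

# Connect from your machine
dlv connect <server_ip>:2345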

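4.3 Better logging: print the full stack trace

Add detailed crash-time logging in code:

import (
    "log"
    "runtime/debug"
)

// recoverMiddleware must be invoked via defer so that recover() can
// intercept a panic in the current goroutine.
func recoverMiddleware() {
    if r := recover(); r != nil {
        // Log the panic value
        log.Printf("Recovered from panic: %v", r)

        // Log the full call stack
        log.Printf("Stack trace:\n%s", debug.Stack())

        // Optional: report to your monitoring system
        // reportToSentry(r, debug.Stack())
    }
}

func main() {
    defer recoverMiddleware() // covers the main goroutine only
    // ... your code
}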

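4.4 Use pprof to save a snapshot before the process exits

This handler saves heap and goroutine profiles when the process receives SIGQUIT, then exits. Registering it replaces Go's default SIGQUIT behavior of dumping all goroutine stacks.

import (
    "log"
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
)

// writeProfile saves the named runtime profile to path and logs any failure.
func writeProfile(name, path string, debug int) {
    f, err := os.Create(path)
    if err != nil {
        log.Printf("create %s: %v", path, err)
        return
    }
    defer f.Close()
    if err := pprof.Lookup(name).WriteTo(f, debug); err != nil {
        log.Printf("write %s profile: %v", name, err)
    }
}

func setupSignalHandler() {
    c := make(chan os.Signal, 1)
    signal.Notify(c, syscall.SIGQUIT)  // Ctrl+\

    go func() {
        <-c
        writeProfile("heap", "/tmp/heap.prof", 0)            // heap profile
        writeProfile("goroutine", "/tmp/goroutine.prof", 1)  // goroutine profile

        os.Exit(1)
    }()
}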

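5. Best Practices for Preventing Crashes

Prevention beats post-mortems. Here are some field-tested ways to make Go programs crash less:

5.1 Check every value that might be nil

Problem code:

user := users[id]
fmt.Println(user.Name)  // if id is not in the map, user is nil

Improved code: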

user, ok := users[id]
if !ok || user == nil {
    return fmt.Errorf("user %d not found", id)
}
fmt.Println(user.Name)
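
Wrapped in a function, the check might look like the sketch below. The function name userName is an assumption for illustration; the User type mirrors the one from the crash demo.

```go
package main

import "fmt"

type User struct {
	Name string
	Age  int
}

// userName returns the name for id, or an error when the key is missing
// or maps to a nil pointer. (Hypothetical helper.)
func userName(users map[int]*User, id int) (string, error) {
	user, ok := users[id]
	if !ok || user == nil {
		return "", fmt.Errorf("user %d not found", id)
	}
	return user.Name, nil
}

func main() {
	users := map[int]*User{1: {Name: "Alice", Age: 30}}
	for _, id := range []int{1, 999} {
		name, err := userName(users, id)
		fmt.Printf("id=%d name=%q err=%v\n", id, name, err)
	}
}
```

Checking both ok and user == nil matters because a map of pointers can legitimately hold a nil value under an existing key.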

5.2 Don't read and write a map concurrently

Problem code (leads to fatal error: concurrent map writes):

var cache = make(map[string]string)

// goroutine 1
go func() {
    cache["key1"] = "value1"
}()

// goroutine 2
go func() {
    cache["key2"] = "value2"
}()

Improved code:

// Option 1: use sync.Map
var cache sync.Map

cache.Store("key1", "value1")
value, ok := cache.Load("key1")

// Option 2: guard a plain map with sync.RWMutex
type SafeMap struct {
    mu   sync.RWMutex
    data map[string]string
}

// NewSafeMap initializes the inner map; writing to a nil map would panic.
func NewSafeMap() *SafeMap {
    return &SafeMap{data: make(map[string]string)}
}

func (m *SafeMap) Set(key, value string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.data[key] = value
}

func (m *SafeMap) Get(key string) (string, bool) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    v, ok := m.data[key]
    return v, ok
}

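5.3 Add recover to every goroutine

Problem: if one goroutine panics and nothing recovers it, the whole process goes down.

Improved code:

// safeGo runs fn in a new goroutine and logs a panic instead of letting it kill the process.
func safeGo(fn func()) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("goroutine panicked: %v\n%s", r, debug.Stack())
            }
        }()
        fn()
    }()
}

// Usage
safeGo(func() {
    // your business logic
    processTask()
})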

5.4 Use the safe form of type assertions

Problem code:

value := someInterface.(string)  // panics if the value is not a string

Improved code:

value, ok := someInterface.(string)
if !ok {
    return fmt.Errorf("expected string, got %T", someInterface)
}
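
When a value can be one of several types, a type switch does the same job without a chain of assertions. A short sketch follows; the describe function and its messages are invented for the example.

```go
package main

import "fmt"

// describe reports what kind of value v holds without ever panicking.
// (Hypothetical helper for illustration.)
func describe(v any) string {
	switch x := v.(type) {
	case string:
		return fmt.Sprintf("string of length %d", len(x))
	case int:
		return fmt.Sprintf("int %d", x)
	case nil:
		return "nil"
	default:
		return fmt.Sprintf("unsupported type %T", x)
	}
}

func main() {
	for _, v := range []any{"go", 42, nil, 1.5} {
		fmt.Println(describe(v))
	}
}
```

The default branch is what replaces the panic: unknown types become a value you can log or return as an error.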

5.5 Check bounds before indexing a slice

Problem code:

item := items[index]  // panics if index < 0 or index >= len(items)

Improved code:

if index < 0 || index >= len(items) {
    return fmt.Errorf("index %d out of range [0, %d)", index, len(items))
}
item := items[index]
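
If the same check shows up all over a codebase, a small generic helper can carry it. This is one possible shape, assuming Go 1.18 or later; the name at is hypothetical.

```go
package main

import "fmt"

// at returns items[i] and true, or the zero value and false when i is
// out of range. (Hypothetical helper; needs Go 1.18+ for generics.)
func at[T any](items []T, i int) (T, bool) {
	if i < 0 || i >= len(items) {
		var zero T
		return zero, false
	}
	return items[i], true
}

func main() {
	items := []string{"a", "b", "c"}
	for _, i := range []int{1, 3, -1} {
		v, ok := at(items, i)
		fmt.Printf("index=%d value=%q ok=%v\n", i, v, ok)
	}
}
```

The comma-ok return mirrors map lookups and type assertions, so callers handle all three the same way.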

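5.6 Be careful with channels

Common panic scenarios:

close(ch)
close(ch)  // panic: close of closed channel

ch <- value  // panic: send on closed channel

Improved code:

type SafeChannel struct {
    ch     chan interface{}
    closed bool
    mu     sync.Mutex
}

// Close closes the channel at most once.
func (sc *SafeChannel) Close() {
    sc.mu.Lock()
    defer sc.mu.Unlock()
    if !sc.closed {
        close(sc.ch)
        sc.closed = true
    }
}

// Send reports false if the channel is closed or the value cannot be
// delivered right now. It never blocks, because blocking while holding
// the mutex would deadlock a concurrent Close.
func (sc *SafeChannel) Send(v interface{}) bool {
    sc.mu.Lock()
    defer sc.mu.Unlock()
    if sc.closed {
        return false
    }
    select {
    case sc.ch <- v:
        return true
    default:
        return false
    }
}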
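
For the common case of a pure "done" signal, sync.Once is a lighter way to guarantee a single close. The sketch below assumes a hypothetical Done type; it is an alternative to the mutex-based wrapper, not a replacement for channels that carry data.

```go
package main

import (
	"fmt"
	"sync"
)

// Done is a close-once signal channel. (Hypothetical type for illustration.)
type Done struct {
	ch   chan struct{}
	once sync.Once
}

// NewDone returns a Done whose channel is open.
func NewDone() *Done { return &Done{ch: make(chan struct{})} }

// Close is safe to call any number of times, from any goroutine.
func (d *Done) Close() { d.once.Do(func() { close(d.ch) }) }

// Closed reports whether Close has been called.
func (d *Done) Closed() bool {
	select {
	case <-d.ch:
		return true
	default:
		return false
	}
}

func main() {
	d := NewDone()
	d.Close()
	d.Close() // second call is a no-op instead of a panic
	fmt.Println("closed:", d.Closed()) // prints "closed: true"
}
```

Because nobody ever sends on the channel, the send-on-closed-channel panic cannot occur here at all.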

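6. Production Checklist

Item	Notes
Is GOTRACEBACK=crash set?	Enable it at least in test environments; in production as needed
Is ulimit -c large enough?	unlimited, or at least 1 GB
Is the coredump path writable?	Check /proc/sys/kernel/core_pattern
Is apport disabled (Ubuntu)?	Otherwise it intercepts the coredump
Does every goroutine have a recover?	Keeps one goroutine from taking down the whole process
Are map operations safe for concurrent use?	Use sync.Map or a lock
Do type assertions use the safe form?	`v, ok := x.(T)` rather than `v := x.(T)`
Is logging thorough?	You can see the stack trace when a crash happens
Is monitoring and alerting hooked up?	Sentry, Prometheus, or another APM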

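7. Summary

The core workflow for analyzing a Go crash:

Crash happens
    │
    ├─→ Coredump available?
    │       │
    │       ├─ Yes → dlv core <binary> <coredump> → inspect stack/variables
    │       │
    │       └─ No → read the stack trace in the logs → hypothesize + reproduce
    │
    └─→ Root cause found → fix → add defensive code/test cases

Three key actions:

Enable coredumps: GOTRACEBACK=crash + ulimit -c unlimited
Learn Delve: load the dump with dlv core, inspect state with bt/locals/print
Write defensive code: check for nil, use sync.Map, add recover to every goroutine

Next time you hear "xxx-service is down", you can calmly say: "Where's the dump? Let me take a look."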

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
