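Today is the day of the first China eBPF Conference, so here is an article to add a brick to the BPF edifice. I have also opened BCC Discussions, and everyone is welcome to join the conversation there.
Problem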
The problem can be reproduced with the following script:
docker pull nginx:1.23.2
docker run -d -p 8090:80 nginx:1.23.2
bpftrace -e "uprobe:/proc/`pidof -s nginx`/root/lib/x86_64-linux-gnu/libpthread-2.31.so:pthread_spin_lock {}"
Attaching 1 probe...
cannot attach uprobe, Invalid argument
ERROR: Error attaching probe: uprobe:/proc/1803813/root/lib/x86_64-linux-gnu/libpthread-2.31.so:pthread_spin_lock
Puzzlingly, the following command runs without error:
bpftrace -e "uprobe:/proc/`pidof -s nginx`/root/lib/x86_64-linux-gnu/libpthread-2.31.so:pthread_mutex_lock {}"
Attaching 1 probe...
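The two commands differ only in the target function.
Analysis
Let's start from the symptom. The failure reports Invalid argument; if that comes from a failed system call, the corresponding errno is EINVAL. Let's use strace to see whether any system call returns EINVAL: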
$ strace -v bpftrace -e "uprobe:/proc/`pidof -s nginx`/root/lib/x86_64-linux-gnu/libpthread-2.31.so:pthread_spin_lock {}"
...
perf_event_open({type=0x7 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER5, config=0, sample_period=1, sample_type=0, read_format=0, disabled=0, inherit=0, pinned=0, exclusive=0, exclusive_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=0, exclude_host=0, exclude_guest=0, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, wakeup_events=1, config1=0x55bdf397af90, config2=0xf8c0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOTSUPP (Unknown error 524)
openat(AT_FDCWD, "/sys/kernel/debug/tracing/uprobe_events", O_WRONLY|O_APPEND) = 13
getpid() = 1809995
write(13, "p:uprobes/p__proc_1803813_root_l"..., 155) = -1 EINVAL (Invalid argument)
...
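It does: the write returns EINVAL, and the file being written is uprobe_events, which belongs to the tracefs interface.
bpftrace uses the bpf_attach_uprobe() interface provided by libbcc. When attaching a uprobe, it first tries the perf_event-based interface and falls back to the tracefs-based interface if that fails.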
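To make the fallback concrete, here is a minimal C model of the "try one backend, fall back to the other" pattern. It is not libbcc's code: attach_fn, attach_with_fallback and the three stub_* functions are hypothetical names made up for this sketch, and KERNEL_ENOTSUPP simply spells out the kernel-internal value 524. The point it illustrates is that the caller only sees the error from the fallback path, which would explain why bpftrace reports Invalid argument rather than error 524.

```c
#include <errno.h>
#include <stdio.h>

/* ENOTSUPP is kernel-internal (include/linux/errno.h);
 * userspace headers do not define it, so we spell it out here. */
#define KERNEL_ENOTSUPP 524

/* Hypothetical attach backend: returns 0 on success or a negative errno. */
typedef int (*attach_fn)(const char *target);

/* Try the preferred backend first and fall back to the second one on any
 * failure. Only the fallback's result is reported to the caller, so the
 * first error is visible only in the log line (or in strace). */
int attach_with_fallback(attach_fn primary, attach_fn fallback,
                         const char *target)
{
    int err = primary(target);
    if (err == 0)
        return 0;
    fprintf(stderr, "primary backend failed (%d), trying fallback\n", err);
    return fallback(target);
}

/* Stubs that mimic the two failures seen in the strace output. */
int stub_perf_event(const char *target) { (void)target; return -KERNEL_ENOTSUPP; }
int stub_tracefs(const char *target)    { (void)target; return -EINVAL; }
int stub_ok(const char *target)         { (void)target; return 0; }
```

Calling attach_with_fallback(stub_perf_event, stub_tracefs, "pthread_spin_lock") returns -EINVAL in this model: the more informative -524 from the first attempt is lost unless you look at the system calls themselves, which is exactly what strace let us do.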
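The perf_event_open system call and its return value ENOTSUPP are the better lead for finding the cause, because ENOTSUPP is far less common in the kernel source than EINVAL, so we start there. Skimming the implementation of perf_event_open turns up no useful clues, so let's grep for ENOTSUPP in the kernel source directory kernel/events: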
grep -r ENOTSUPP kernel/events
kernel/events/uprobes.c: ret = -ENOTSUPP;
Good, there is only one hit:
	ret = -ENOTSUPP;
	if (is_trap_insn((uprobe_opcode_t *)&uprobe->arch.insn))
		goto out;
	ret = arch_uprobe_analyze_insn(&uprobe->arch, mm, vaddr);
	if (ret)
		goto out;
The symbol arch_uprobe_analyze_insn appears in /proc/kallsyms, so we can trace its return value. Let's boldly hypothesize that arch_uprobe_analyze_insn returns -ENOTSUPP and then carefully verify it. Run the following script, then try to trace pthread_spin_lock again:
bpftrace -e 'kretprobe:arch_uprobe_analyze_insn { printf("arch_uprobe_analyze_insn returns %d\n", retval); }'
arch_uprobe_analyze_insn returns -524
As the output shows, the return value is -524, which is -ENOTSUPP, so the hypothesis is confirmed :)
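Why does strace print "Unknown error 524" while the bpftrace error is a readable "Invalid argument"? ENOTSUPP (524) is defined only inside the kernel (include/linux/errno.h) and is not part of the userspace errno set, so the C library has no message for it. It should also not be confused with the userspace ENOTSUP/EOPNOTSUPP. The small sketch below shows the difference; describe_errno is a hypothetical helper name, and the exact text returned for an unknown value depends on the C library (glibc produces "Unknown error 524").

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Kernel-internal value; not defined by userspace <errno.h>. */
#define KERNEL_ENOTSUPP 524

/* Describe an errno value the way a userspace tool would.
 * Accepts both the positive form (errno) and the negative form
 * that kernel functions return. */
const char *describe_errno(int err)
{
    return strerror(err < 0 ? -err : err);
}

/* Print both error codes seen in this investigation. */
void print_observed_errors(void)
{
    printf("%d: %s\n", -EINVAL, describe_errno(-EINVAL));
    printf("%d: %s\n", -KERNEL_ENOTSUPP, describe_errno(-KERNEL_ENOTSUPP));
}
```

This is also why the kretprobe output is easiest to interpret as a raw number: there is no userspace name to map -524 back to, so you have to look it up in the kernel headers.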
Reading the implementation of arch_uprobe_analyze_insn, we find that the following call chain can return -ENOTSUPP:
arch_uprobe_analyze_insn
  -> uprobe_init_insn
       -> is_prefix_bad            returns -ENOTSUPP
       -> insn_masking_exception   returns -ENOTSUPP
       returns -ENOTSUPP
Let's go through them one by one, starting with is_prefix_bad:
/*
* opcodes we may need to refine support for:
*
* 0f - 2-byte instructions: For many of these instructions, the validity
* depends on the prefix and/or the reg field. On such instructions, we
* just consider the opcode combination valid if it corresponds to any
* valid instruction.
*
* 8f - Group 1 - only reg = 0 is OK
* c6-c7 - Group 11 - only reg = 0 is OK
* d9-df - fpu insns with some illegal encodings
* f2, f3 - repnz, repz prefixes. These are also the first byte for
* certain floating-point instructions, such as addsd.
*
* fe - Group 4 - only reg = 0 or 1 is OK
* ff - Group 5 - only reg = 0-6 is OK
*
* others -- Do we need to support these?
*
* 0f - (floating-point?) prefetch instructions
* 07, 17, 1f - pop es, pop ss, pop ds
* 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
* but 64 and 65 (fs: and gs:) seem to be used, so we support them
* 67 - addr16 prefix
* ce - into
* f0 - lock prefix
*/
/*
* TODO:
* - Where necessary, examine the modrm byte and allow only valid instructions
* in the different Groups and fpu instructions.
*/
static bool is_prefix_bad(struct insn *insn)
{
	insn_byte_t p;
	int i;

	for_each_insn_prefix(insn, i, p) {
		insn_attr_t attr;

		attr = inat_get_opcode_attribute(p);
		switch (attr) {
		case INAT_MAKE_PREFIX(INAT_PFX_ES):
		case INAT_MAKE_PREFIX(INAT_PFX_CS):
		case INAT_MAKE_PREFIX(INAT_PFX_DS):
		case INAT_MAKE_PREFIX(INAT_PFX_SS):
		case INAT_MAKE_PREFIX(INAT_PFX_LOCK):
			return true;
		}
	}
	return false;
}
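Roughly speaking, this function decides from the instruction's prefixes whether a uprobe can be placed on it. The comment mentions "f0 - lock prefix", which seems related to our target function pthread_spin_lock. Perhaps pthread_spin_lock hits exactly this check! Let's disassemble both functions in GDB: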
$ gdb -p `pidof -s nginx`
...
(gdb) disassemble /r pthread_spin_lock
Dump of assembler code for function pthread_spin_lock:
0x00007f7eee53e8c0 <+0>: f0 ff 0f lock decl (%rdi)
0x00007f7eee53e8c3 <+3>: 75 0b jne 0x7f7eee53e8d0 <pthread_spin_lock+16>
0x00007f7eee53e8c5 <+5>: 31 c0 xor %eax,%eax
0x00007f7eee53e8c7 <+7>: c3 retq
0x00007f7eee53e8c8 <+8>: 0f 1f 84 00 00 00 00 00 nopl 0x0(%rax,%rax,1)
0x00007f7eee53e8d0 <+16>: f3 90 pause
0x00007f7eee53e8d2 <+18>: 83 3f 00 cmpl $0x0,(%rdi)
0x00007f7eee53e8d5 <+21>: 7f e9 jg 0x7f7eee53e8c0 <pthread_spin_lock>
0x00007f7eee53e8d7 <+23>: eb f7 jmp 0x7f7eee53e8d0 <pthread_spin_lock+16>
End of assembler dump.
(gdb) disassemble /r pthread_mutex_lock
Dump of assembler code for function pthread_mutex_lock:
0x00007f7eee539760 <+0>: 8b 47 10 mov 0x10(%rdi),%eax
0x00007f7eee539763 <+3>: 89 c2 mov %eax,%edx
0x00007f7eee539765 <+5>: 81 e2 7f 01 00 00 and $0x17f,%edx
0x00007f7eee53976b <+11>: 83 e0 7c and $0x7c,%eax
0x00007f7eee53976e <+14>: 0f 85 7c 00 00 00 jne 0x7f7eee5397f0 <pthread_mutex_lock+144>
0x00007f7eee539774 <+20>: 53 push %rbx
0x00007f7eee539775 <+21>: 48 83 ec 10 sub $0x10,%rsp
0x00007f7eee539779 <+25>: 85 d2 test %edx,%edx
...
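This confirms our guess. The first instruction of pthread_spin_lock, lock decl (%rdi), begins with the f0 lock prefix and therefore hits is_prefix_bad. The first instruction of pthread_mutex_lock begins with the opcode byte 8b (a mov) and carries none of the rejected prefixes, so it can be traced normally.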
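The same check can be approximated in userspace on the raw bytes that GDB printed. The sketch below is a deliberate simplification, not the kernel's implementation: the kernel runs its full x86 instruction decoder and then iterates over the decoded prefixes, whereas this version just walks leading legacy prefix bytes. The names is_legacy_prefix, is_rejected_prefix and first_insn_has_bad_prefix are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if b is one of the x86 legacy prefix bytes. */
static bool is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x26: case 0x2e: case 0x36: case 0x3e: /* es:, cs:, ss:, ds: */
    case 0x64: case 0x65:                       /* fs:, gs: */
    case 0x66: case 0x67:                       /* operand-size, address-size */
    case 0xf0:                                  /* lock */
    case 0xf2: case 0xf3:                       /* repnz, repz */
        return true;
    default:
        return false;
    }
}

/* True if b is one of the prefixes that is_prefix_bad() rejects:
 * the es:/cs:/ss:/ds: segment overrides and the lock prefix. */
static bool is_rejected_prefix(uint8_t b)
{
    return b == 0x26 || b == 0x2e || b == 0x36 || b == 0x3e || b == 0xf0;
}

/* Walk the leading prefix bytes of the first instruction in code[] and
 * report whether any of them is rejected. Scanning stops at the first
 * byte that is not a legacy prefix (for example an opcode byte). */
bool first_insn_has_bad_prefix(const uint8_t *code, size_t len)
{
    for (size_t i = 0; i < len && is_legacy_prefix(code[i]); i++) {
        if (is_rejected_prefix(code[i]))
            return true;
    }
    return false;
}
```

Feeding it the first bytes from the disassembly gives the expected split: f0 ff 0f (pthread_spin_lock) is rejected, while 8b 47 10 (pthread_mutex_lock) is not. Note that f3 90 (pause) is not rejected by this check either, since the repz prefix is not in the rejected set.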
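Summary
Starting from a reader's question, this article reads the kernel source and uses a handful of tools to show how traditional tools (strace/GDB) and BPF tools (bpftrace) can be combined to solve a problem in a tracing scenario. Corrections from readers are welcome, and you can join the group chat to discuss.
That said, we have not directly answered why the tracefs interface returns Invalid argument when it fails. Interested readers can explore this on their own; the answer lies in the kernel function create_or_delete_trace_uprobe.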