analyze nfs-ganesha coredump file (by quqi99)

Author: Zhang Hua  Published: 2024-07-18  Copyright: This article may be reproduced freely, but please mark the original source and author with a hyperlink and keep this copyright notice (http://blog.csdn.net/quqi99)

Problem

nfs-ganesha is installed inside lxd and crashes occasionally, producing a coredump file. Writing the coredump takes about twenty minutes, during which CPU usage spikes and the service stays down for nearly the whole time, so we need to figure out why it keeps crashing.

nfs-ganesha is a FUSE-like user-space file system server (it loses some performance compared with the kernel-based nfsv4 server, but running in user space brings more interesting capabilities: it is more generic, adapts to more file systems, and can, for example, export a FUSE mount over NFS without kernel help). On top of a specific storage backend such as ceph it provides a POSIX interface, so users can operate on ceph with familiar linux commands (eg: ls, cp), and it can also serve NFS on top of that POSIX file system. The path is: nfs-client -> nfs-ganesha -> FSAL_CEPH (File System Abstraction Layer, bypassing the kernel) -> libcephfs -> rados cluster
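To make that path concrete, here is a minimal, hypothetical sketch of the libcephfs layer that FSAL_CEPH sits on (assumptions: a reachable cluster, /etc/ceph/ceph.conf and a client keyring in place; this is an illustration, not ganesha's actual FSAL code):

// build: g++ cephfs_demo.cpp -lcephfs   (assumes libcephfs-dev is installed)
#include <cephfs/libcephfs.h>
#include <fcntl.h>

int main() {
    struct ceph_mount_info *cmount;
    if (ceph_create(&cmount, NULL) != 0)      // create the client handle
        return 1;
    ceph_conf_read_file(cmount, NULL);        // read /etc/ceph/ceph.conf
    if (ceph_mount(cmount, "/") != 0) {       // attach to the cephfs root
        ceph_release(cmount);
        return 1;
    }
    int fd = ceph_open(cmount, "/hello.txt", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) {
        ceph_write(cmount, fd, "hi\n", 3, 0); // goes through the user-space client (osdc/objectcacher)
        ceph_close(cmount, fd);
    }
    ceph_unmount(cmount);
    ceph_release(cmount);
    return 0;
}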

Setting up the gdb environment

The customer runs jammy nfs-ganesha=3.5-1ubuntu1, so first create a jammy lxd instance (lxc launch ubuntu:22.04 jammy), enter it (lxc exec jammy -- sudo /bin/bash), then run the commands below to set up the gdb environment (run gdb from the source directory, or associate the source with the directory command once inside gdb):

sudo apt install cgdb nfs-ganesha nfs-ganesha-ceph -y
mkdir tmp && apport-unpack _usr_bin_ganesha.nfsd.0.crash ./tmp/ && cd tmp
cd /root && git clone https://github.com/nfs-ganesha/nfs-ganesha.git && cd nfs-ganesha && git checkout -b 3.5 V3.5
cd /root && git clone https://github.com/ceph/ceph.git && cd ceph && git checkout -b 17.2.5 v17.2.5
cd /root/nfs-ganesha #or use 'directory /root/nfs-ganesha' instead of 'cd'
ls /usr/bin/ganesha.nfsd
cgdb /usr/bin/ganesha.nfsd /tmp/ganesha-nfsd-crash/CoreDump
(gdb) directory /root/nfs-ganesha
(gdb) directory /root/ceph
(gdb) l main

How do we solve the symbol table problem? If the symbol tables are wrong, gdb shows lots of question marks. Installing just the three packages "nfs-ganesha-dbgsym nfs-ganesha-ceph-dbgsym libcephfs2-dbgsym" is not enough: you also have to walk through 'info sharedlibrary' in gdb and install the symbol table for every shared library, and confirm the version of each one by one. Even recompiling from source would produce a Build ID that does not match the customer's coredump.

Since ubuntu 22.10, debuginfod can fetch symbol tables automatically (export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com", see https://ubuntu.com/server/docs/about-debuginfod). It is only installed by default from ubuntu 22.10 on, requires gdb>=10.1 and elfutils>=0.178, and usually only the latest version in a release is indexed, so debuginfod is not considered here.

How do we install the symbol tables manually?

  • find-dbgsym-packages finds which symbol tables are missing from the CoreDump
  • eu-unstrip and readelf can verify that the Build IDs match exactly
  • the command 'pull-{lp,uca}-{ddebs,source,debs} ceph 12.3.4.4~cloud0' can download symbol packages
apt install elfutils debian-goodies debuginfod -y
#readelf -n /usr/bin/ganesha.nfsd |grep 'Build ID'
    Build ID: 71d0693e76b2f0b4516bfa255d6a60bbf042ad09
# eu-unstrip -n --core ./CoreDump |grep -i build |grep 71d0693e76b2f0b4516bfa255d6a60bbf042ad09
0x559589477000+0x8000 71d0693e76b2f0b4516bfa255d6a60bbf042ad09@0x559589477378 . /usr/lib/debug/.build-id/71/d0693e76b2f0b4516bfa255d6a60bbf042ad09.debug /usr/bin/ganesha.nfsd

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/debuginfo_debs.list
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/debuginfo_debs.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C8CAB6595FDFF622
sudo apt update
For example, 'apt-cache policy nfs-ganesha-dbgsym' finds no symbol table for version 3.5-1ubuntu1 (if one existed you could simply install it, e.g.: apt install libcephfs2-dbgsym=17.2.5-0ubuntu0.22.04.3); in that case use pull-lp-ddebs to fetch the old package manually from launchpad.
#pull-{lp,uca}-{ddebs,source,debs} ceph 12.3.4.4~cloud0
pull-lp-ddebs nfs-ganesha-dbgsym 3.5-1ubuntu1
pull-lp-ddebs nfs-ganesha-ceph-dbgsym 3.5-1ubuntu1
dpkg -i nfs-ganesha-dbgsym_3.5-1ubuntu1_amd64.ddeb
apt install libcephfs2-dbgsym=17.2.5-0ubuntu0.22.04.3

# find-dbgsym-packages ./CoreDump
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libresolv.so.2
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libresolv.so.2 (7fd7253c61aa6fce2b7e13851c15afa14a5ab160)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 (61ef896a699bb1c2e4e231642b2e1688b2f1a61e)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libm.so.6
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libm.so.6 (27e82301dba6c3f644404d504e1bb1c97894b433)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libc.so.6
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libc.so.6 (69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (09c4935b79388431a1248f6a98e00d7dc81b8513)
dpkg-query: no path found matching pattern /lib/x86_64-linux-gnu/libnfsidmap.so.1
W: Cannot find debug package for libnfsidmap.so.1 (dfe1fcd7b9f3c5d04e0e76936cf1dcca9d5442af)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libcap.so.2.44
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libcap.so.2.44 (9e11e3bca4b0a25d047cb36e933e1d727663cf8e)
krb5-admin-server-dbgsym krb5-gss-samples-dbgsym krb5-k5tls-dbgsym krb5-kdc-dbgsym krb5-kdc-ldap-dbgsym krb5-kpropd-dbgsym krb5-otp-dbgsym krb5-pkinit-dbgsym krb5-user-dbgsym libcom-err2-dbgsym libdbus-1-3-dbgsym libkrb5-dbg libnfsidmap1-dbgsym libnss-systemd-dbgsym libssl3-dbgsym libstdc++6-12-dbg libstdc++6-dbgsym libsystemd0-dbgsym libudev1-dbgsym libwbclient0-dbgsym zlib1g-dbgsym

pull-lp-ddebs libcom-err2-dbgsym 1.46.5-2ubuntu1.1
pull-lp-ddebs libdbus-1-3-dbgsym 1.12.20-2ubuntu4.1
pull-lp-ddebs libnfsidmap1-dbgsym 1:2.6.1-1ubuntu1.2
dpkg -i libcom-err2-dbgsym_1.46.5-2ubuntu1.1_amd64.ddeb
dpkg -i libdbus-1-3-dbgsym_1.12.20-2ubuntu4.1_amd64.ddeb
dpkg -i libnfsidmap1-dbgsym_2.6.1-1ubuntu1.2_amd64.ddeb
sudo apt install krb5-admin-server-dbgsym krb5-gss-samples-dbgsym krb5-k5tls-dbgsym krb5-kdc-dbgsym krb5-kdc-ldap-dbgsym krb5-kpropd-dbgsym krb5-otp-dbgsym krb5-pkinit-dbgsym krb5-user-dbgsym libcom-err2-dbgsym libdbus-1-3-dbgsym libkrb5-dbg libnfsidmap1-dbgsym libnss-systemd-dbgsym libssl3-dbgsym libstdc++6-12-dbg libstdc++6-dbgsym libsystemd0-dbgsym libudev1-dbgsym libwbclient0-dbgsym zlib1g-dbgsym -y

# find-dbgsym-packages ./CoreDump
...
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libc.so.6 (69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
# eu-unstrip -n --core ./CoreDump |grep -i build |grep 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
<empty>
# When a built-in dbg package exists, install it in preference to the dbgsym package; only fall back to dbgsym when there is no dbg package. This is critically important (note: dbg packages are built from the source package and live in debs, dbgsym packages live in ddebs)
#pull-lp-ddebs libc6-dbg 2.35-0ubuntu3.1
pull-lp-debs libc6-dbg 2.35-0ubuntu3.1
pull-lp-debs libc6 2.35-0ubuntu3.1
dpkg -i libc6_2.35-0ubuntu3.1_amd64.deb
dpkg -i libc6-dbg_2.35-0ubuntu3.1_amd64.deb
# eu-unstrip -n --core ./CoreDump |grep -i build |grep 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
0x7ffa92052000+0x227e50 69389d485a9793dbe873f0ea2c93e02efaa9aa3d@0x7ffa92052390 /lib/x86_64-linux-gnu/libc.so.6 /usr/lib/debug/.build-id/69/389d485a9793dbe873f0ea2c93e02efaa9aa3d.debug libc.so.6

# find-dbgsym-packages ./CoreDump
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libcap.so.2.44
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libcap.so.2.44 (9e11e3bca4b0a25d047cb36e933e1d727663cf8e)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (09c4935b79388431a1248f6a98e00d7dc81b8513)
krb5-admin-server-dbgsym krb5-gss-samples-dbgsym krb5-k5tls-dbgsym krb5-kdc-dbgsym krb5-kdc-ldap-dbgsym krb5-kpropd-dbgsym krb5-otp-dbgsym krb5-pkinit-dbgsym krb5-user-dbgsym libkrb5-dbg libnss-systemd-dbgsym libssl3-dbgsym libstdc++6-12-dbg libstdc++6-dbgsym libsystemd0-dbgsym libudev1-dbgsym libwbclient0-dbgsym

After the symbol tables are resolved (libc6's is the critical one), the bt looks like this:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140705682900544) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140705682900544) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140705682900544, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffa92094476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffa9207a7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffa910783c3 in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at ./src/common/assert.cc:75
#6  0x00007ffa91078525 in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:80
#7  0x00007ffa7049f602 in xlist<ObjectCacher::Object*>::size (this=0x7ffa20734638, this=0x7ffa20734638) at ./src/include/xlist.h:87
#8  operator<< (os=..., out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
...) at ./src/osdc/ObjectCacher.h:760
#9  operator<< (out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
..., in=...) at ./src/client/Inode.cc:80
#10 0x00007ffa7045545f in Client::ll_sync_inode (this=0x55958b8a5c60, in=in@entry=0x7ffa20734270, syncdataonly=syncdataonly@entry=false) at ./src/client/Client.cc:14717
#11 0x00007ffa703d0f75 in ceph_ll_sync_inode (cmount=cmount@entry=0x55958b0bd0d0, in=in@entry=0x7ffa20734270, syncdataonly=syncdataonly@entry=0) at ./src/libcephfs.cc:1865
#12 0x00007ffa9050ddc5 in fsal_ceph_ll_setattr (creds=<optimized out>, mask=<optimized out>, stx=0x7ff8983f25a0, i=<optimized out>, cmount=<optimized out>)
    at ./src/FSAL/FSAL_CEPH/statx_compat.h:209
#13 ceph_fsal_setattr2 (obj_hdl=0x7fecc8fefbe0, bypass=<optimized out>, state=<optimized out>, attrib_set=0x7ff8983f2830) at ./src/FSAL/FSAL_CEPH/handle.c:2410
#14 0x00007ffa92371da0 in mdcache_setattr2 (obj_hdl=0x7fecc9e98778, bypass=<optimized out>, state=0x7fef0d64c9b0, attrs=0x7ff8983f2830)
    at ../FSAL/Stackable_FSALs/FSAL_MDCACHE/./src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1012
#15 0x00007ffa922b2bbc in fsal_setattr (obj=0x7fecc9e98778, bypass=<optimized out>, state=0x7fef0d64c9b0, attr=0x7ff8983f2830) at ./src/FSAL/fsal_helper.c:573
#16 0x00007ffa9234c7bd in nfs4_op_setattr (op=0x7fecad7ac510, data=0x7fecac314a10, resp=0x7fecad1be200) at ../Protocols/NFS/./src/Protocols/NFS/nfs4_op_setattr.c:212
#17 0x00007ffa9232e413 in process_one_op (data=data@entry=0x7fecac314a10, status=status@entry=0x7ff8983f2a2c) at ../Protocols/NFS/./src/Protocols/NFS/nfs4_Compound.c:920
#18 0x00007ffa9232f9e0 in nfs4_Compound (arg=<optimized out>, req=0x7fecad491620, res=0x7fecac054580) at ../Protocols/NFS/./src/Protocols/NFS/nfs4_Compound.c:1327
#19 0x00007ffa922cb0ff in nfs_rpc_process_request (reqdata=0x7fecad491620) at ./src/MainNFSD/nfs_worker_thread.c:1508
#20 0x00007ffa92029be7 in svc_request (xprt=0x7fed640504d0, xdrs=<optimized out>) at ./src/svc_rqst.c:1202
#21 0x00007ffa9202df9a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at ./src/svc_rqst.c:1183
#22 0x00007ffa9203344d in svc_rqst_epoll_loop (wpe=0x559594308e60) at ./src/svc_rqst.c:1564
#23 0x00007ffa920389e1 in work_pool_thread (arg=0x7feeb802ea10) at ./src/work_pool.c:184
#24 0x00007ffa920e6b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#25 0x00007ffa92178a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

To browse the source, do the following; or, to avoid the directory commands, just cd into the source directory before running cgdb:

$ cgdb /usr/bin/ganesha.nfsd ./CoreDump
(gdb) directory /root/nfs-ganesha
Source directories searched: /root/nfs-ganesha:$cdir:$cwd
(gdb) directory /root/ceph
Source directories searched: /root/ceph:/root/nfs-ganesha:$cdir:$cwd
(gdb) l main
warning: Source file is more recent than executable.
133      * @return status to calling program by calling the exit(3C) function.
134      *
135      */
136
137     int main(int argc, char *argv[])
138     {
139             char *tempo_exec_name = NULL;
140             char localmachine[MAXHOSTNAMELEN + 1];
141             int c;
142             int dsc;

Analyzing the coredump

First, setting breakpoints cannot crack this kind of problem: the process just exits. The customer never knew what triggered the crash, so there is no reproducer (and with no reproducer we cannot build a live environment, attach gdb and trigger it); even while analyzing the crash we would not know under which conditions a breakpoint would fire.

(gdb) break ceph::__ceph_assert_fail
Breakpoint 1 at 0x7ffa703b61f0 (4 locations)
(gdb) run
Starting program: /usr/bin/ganesha.nfsd 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 12433]
[Inferior 1 (process 12430) exited normally]

The bt frames above run from the bottom up, so analyzing a coredump usually means moving to a frame with the frame command and then inspecting variables with 'info locals'; print line shows the source line, but the code is optimized and the reported line numbers are often untrustworthy. Can we rebuild the source at a lower optimization level instead? Also no: the Build ID would then differ from the one in the customer's coredump, and we would be back to a pile of question marks.

frame xx     # switch to a frame
info locals  # inspect local variables
info args    # inspect arguments
print assertion
print file
print line   # show the line number (unreliable when the code is optimized)
list         # view source
(gdb) info line *0x00007ffa922cb0ff
Line 1508 of "./src/MainNFSD/nfs_worker_thread.c" starts at address 0x7ffa922cb0f0 <nfs_rpc_process_request+3008>
   and ends at 0x7ffa922cb102 <nfs_rpc_process_request+3026>.
(gdb) info symbol 0x00007ffa922cb0ff
nfs_rpc_process_request + 3023 in section .text of /usr/lib/ganesha/libganesha_nfsd.so.3.5
info frame
info registers
x/i $pc
x/10x $esp
info threads
thread apply all bt
info target
info sharedlib

So for this class of problem, with no reproducer and optimized code, there is no other way: analyze the crash together with the source code.
A breakpoint on __ceph_assert_fail can never be hit without a reproducer. Breaking on nfs4_Compound at least suggests how to trigger it (sudo mount -t nfs <nfs_server_ip>:/exported_directory /mnt && touch /mnt/tmp), but we are only analyzing a coredump file, not a complete environment, so even that breakpoint cannot be triggered by hand, and step/next cannot be used to walk through the code. Analyzing a coredump means navigating frames and inspecting variables; there is no other method.
Walking the frames one by one with the commands above (https://paste.ubuntu.com/p/p7trjTkqBJ/), we see for instance that in frame 8 below, os.objects at line 760 is empty (do we need a tool like Valgrind to check for invalid memory accesses?).

#8  operator<< (os=..., out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
...) at ./src/osdc/ObjectCacher.h:760

(gdb) down
#8  operator<< (os=..., out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
...) at ./src/osdc/ObjectCacher.h:760
760                  << " objects " << os.objects.size()
(gdb) l
755     inline std::ostream& operator<<(std::ostream &out,
756                                     const ObjectCacher::ObjectSet &os)
757     {
758       return out << "objectset[" << os.ino
759                  << " ts " << os.truncate_seq << "/" << os.truncate_size
760                  << " objects " << os.objects.size()
761                  << " dirty_or_tx " << os.dirty_or_tx
762                  << "]";
763     }
764

$ git log --oneline --no-merges v17.2.5...master ./src/osdc/ObjectCacher.h
warning: refname 'v17.2.5' is ambiguous.
dba751ac0c0 osdc: add set_error in BufferHead, when split set_error to right
215facf5782 osdc: Build target 'common' without using namespace in headers
a54d0a90c06 crimson:common add TOPNSPC namespace for ceph and crimson
20b1ac6e095 osdc: s/Mutex/ceph::mutex/
5d4f82117ed osdc: reduce ObjectCacher's memory fragments
c33ce07fb8e mount,osdc: fix typos
c1179cd446b osdc: Use ceph_assert for asserts.

The next step would be to test the patch below, but with no reproducer, how do we test it?

$ git diff
diff --git a/src/osdc/ObjectCacher.h b/src/osdc/ObjectCacher.h
index 60f049ef55d..ebecaa532fc 100644
--- a/src/osdc/ObjectCacher.h
+++ b/src/osdc/ObjectCacher.h
@@ -748,10 +748,16 @@ inline ostream& operator<<(ostream &out, const ObjectCacher::BufferHead &bh)
 
 inline ostream& operator<<(ostream &out, const ObjectCacher::ObjectSet &os)
 {
-  return out << "objectset[" << os.ino
+         out << "objectset[" << os.ino
             << " ts " << os.truncate_seq << "/" << os.truncate_size
-            << " objects " << os.objects.size()
-            << " dirty_or_tx " << os.dirty_or_tx
+            << " objects ";
+         if (os.objects.size() > 0) {
+           out << os.objects.size();
+         } else {
+           out << "empty";
+           std::cerr << "Error: os.objects is empty!" << std::endl;
+          }
+      return out << " dirty_or_tx " << os.dirty_or_tx
             << "]";
 }

Or use valgrind? - https://blog.csdn.net/xhtchina/article/details/121187064

g++ -g -o test test.cpp
$ cat test.cpp 
#include<iostream>
using namespace std;
int main(){
	int a[5];
	int i,s=0;
	a[0]=a[1]=a[3]=a[4]=0;
	for(i=0;i<5;++i)
		s+=a[i];
	if(s==33)
		cout<<"sum is 33"<<endl;
	else
		cout<<"sum is not 33"<<endl;
	return 0;
}
# Conditional jump or move depends on uninitialised value(s)
$ valgrind --leak-check=full ./test 
==686432== Memcheck, a memory error detector
==686432== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==686432== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==686432== Command: ./test
==686432== 
==686432== Conditional jump or move depends on uninitialised value(s)
==686432==    at 0x1091E7: main (test.cpp:9)
==686432== 
sum is not 33
==686432== 
==686432== HEAP SUMMARY:
==686432==     in use at exit: 0 bytes in 0 blocks
==686432==   total heap usage: 2 allocs, 2 frees, 74,752 bytes allocated
==686432== 
==686432== All heap blocks were freed -- no leaks are possible
==686432== 
==686432== Use --track-origins=yes to see where uninitialised values come from
==686432== For lists of detected and suppressed errors, rerun with: -s
==686432== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

In the nfs-ganesha environment, valgrind can be used like this:

sudo apt install valgrind -y
sudo systemctl stop nfs-ganesha
sudo valgrind --leak-check=full --show-reachable=yes --trace-children=yes --log-file=/tmp/valgrind-logfile /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT

A possible test script:

#create and write lots of small files
for i in {1..500}; do
  touch /mnt/cephfs/mixedfile_$i
  echo "Mixed test data" > /mnt/cephfs/mixedfile_$i
done

#create a large file
dd if=/dev/zero of=/mnt/cephfs/largefile bs=1M count=512

#read files concurrently
for i in {1..500}; do
  cat /mnt/cephfs/mixedfile_$i > /dev/null &
done
wait

#delete files
rm /mnt/cephfs/mixedfile_*
rm /mnt/cephfs/largefile

#run these concurrently in several terminals or scripts
for i in {1..10}; do
  for j in {1..100}; do
    #do something
  done &
done
wait

#snapshot operations churn metadata and may therefore trigger cache problems
ceph fs snapshot create myfs my_snapshot
ceph fs snapshot rm myfs my_snapshot

How is manila used here?

./generate-bundle.sh --name manila -s jammy --num-compute 1 --manila --ceph --ceph-fs --run
./tools/vault-unseal-and-authorise.sh
./configure
source novarc
# https://gist.github.com/congto/aba6f9d5087bb8e78b6377b463c3bde5
sudo apt install python3-manilaclient -y
# Configure a share type that matches the CephFS/NFS-Ganesha backend capabilities.
manila type-create cephfsnfstype false
manila type-key cephfsnfstype set vendor_name=Ceph storage_protocol=NFS
# Create a share.
manila create --share-type cephfsnfstype --name cephnfsshare1 nfs 1
$ manila share-export-location-list cephnfsshare1 |grep vol
| eb9e928e-1e00-409f-9c34-9bb80a8226fb | 10.149.144.79:/volumes/_nogroup/7c94b263-e432-475d-ba9a-15f3f14effea/b92c7dc7-80ec-4586-9894-5c1c4770915f | False     |
# Allow access to a nova VM (eg: bastion 10.149.144.44), which can connect to the ganesha server.
manila access-allow cephnfsshare1 ip 10.149.144.44
# Try mounting the NFS share from bastion
sudo mkdir /mnt/nfs && sudo chown $USER /mnt/nfs
sudo apt install nfs-common -y
#showmount is used for nfsv3, not nfsv4, use 'nfsstat -m' instead for nfsv4
#sudo showmount -e 10.149.144.79
sudo mount -t nfs 10.149.144.79:/volumes/_nogroup/7c94b263-e432-475d-ba9a-15f3f14effea/b92c7dc7-80ec-4586-9894-5c1c4770915f /mnt/nfs
$ nfsstat -m
/mnt/nfs from 10.149.144.79:/volumes/_nogroup/7c94b263-e432-475d-ba9a-15f3f14effea/b92c7dc7-80ec-4586-9894-5c1c4770915f
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.149.144.44,local_lock=none,addr=10.149.144.79


#juju ssh ceph-mon/0 -- sudo -s
# rados lspools |grep manila
manila-ganesha

# rados -p manila-ganesha ls
ganesha-export-counter
ganesha-export-index
node0_recov
$ juju ssh manila-ganesha/0 -- sudo -s

root@juju-41cdda-manila-11:/home/ubuntu# rpcinfo -p localhost
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  44544  status
    100024    1   tcp  44403  status
    100003    4   udp   2049  nfs
    100003    4   tcp   2049  nfs

#https://gist.github.com/congto/aba6f9d5087bb8e78b6377b463c3bde5
# grep -r '\[cephfsnfs1' /etc/manila/manila.conf -A20
[cephfsnfs1]
driver_handles_share_servers = False
ganesha_rados_store_enable = True
ganesha_rados_store_pool_name = manila-ganesha
share_backend_name = CEPHFSNFS1
share_driver = manila.share.drivers.cephfs.driver.CephFSDriver
cephfs_protocol_helper_type = NFS
cephfs_conf_path = /etc/ceph/ceph.conf
cephfs_auth_id = manila-ganesha
cephfs_cluster_name = ceph
cephfs_enable_snapshots = False
cephfs_ganesha_server_is_remote = False
cephfs_ganesha_server_ip = 10.149.144.79

In fact, manila supports two modes, CephFS NFS shares and CephFS native shares. cephfs_protocol_helper_type = NFS above means NFS shares (mounted on the client via nfs-common); to let clients use ceph-fuse instead, set cephfs_protocol_helper_type = CEPHFS (see https://docs.openstack.org/manila/latest/configuration/shared-file-systems/drivers/cephfs_driver.html#configure-cephfs-nfs-share-backend-in-manila-conf and https://blog.51cto.com/u_16213461/7188633)

manila type-create cephfs_type false
manila type-key cephfs_type set vendor_name=Ceph storage_protocol=CephFS
manila create --share-type cephfs_type --name cephfs_share1 cephfs 1
manila access-allow cephfs_share1 ip 10.149.142.103
manila share-export-location-list cephfs_share1

But manila-ganesha hard-codes cephfs_protocol_helper_type:

$ grep -r 'cephfs_protocol_helper_type' manila*
manila-ganesha/src/templates/rocky/manila.conf:cephfs_protocol_helper_type = NFS

$ manila create --share-type cephfs_type --name cephfs_share1 cephfs 1
ERROR: Invalid input received: Invalid share protocol provided: CEPHFS. It is either disabled or unsupported. Available protocols: ['NFS']. (HTTP 400) (Request-ID: req-03b21a2a-7bd7-4145-bf6e-cb4d23180d7e)

objectcacher source code

osdc is a fairly low-level module on the client side (only user-space clients call osdc). It converts cephfs's one-dimensional address space (a file system manages files and directories as a tree, addresses by table lookup, and must introduce centralized metadata management) into the three-dimensional address space of objects (ceph manages data flatly and addresses by computation). It therefore keeps an object-level cache (the objectcacher), and after the conversion to the three-dimensional address space the crush algorithm is used to place the data.
The request flow of user-space nfs-ganesha is roughly as follows:

posix(open|read|write) -> system call -> vfs -> fuse kernel module -> fuse user lib -> cephfs client -> client::read|client::write
client::read -> client::_read -> client::_read_async -> file_read -> file_to_extents(address convert) -> ObjectCacher::readx -> ObjectCacher::_readx

A file is striped into many small pieces (one such piece, a stripe unit, is also called an object shard, su; three su side by side form one stripe) that are laid out in order across multiple underlying rados objects (4M by default; if, as in the figure, one rados object holds 3 stripes, the stripe unit is 4/3M). The file_to_extents function thus converts a one-dimensional coordinate into a three-dimensional one (objectset, stripeno, stripepos): which objectset, which stripe, and which su within that stripe.
(Figure: a file striped into 18 su across 6 rados objects, forming 2 objectsets.)
As the figure shows, the file being read is split into 18 pieces in total (stripe units su, assume su = 1M), stored in 6 rados objects numbered 0-5 (assume one object is 3M), occupying two objectsets. Now read the range su1-su6 (1M-7M):

offset = 1M               read offset
len = 6M                  size to read
su = 1M
object_size = 3M
stripe_count = 3          stripe width
stripes_per_object = 3    number of su per object

The address space above is thus converted from one dimension to three; for su1, for example:

1-D address: (offset, len) ==> (1M, 6M)
3-D address: (objectset, stripeno, stripepos) ==> (objectset0, stripe0, object1)
blockno = offset/su = 1M/1M = 1                       block (shard) number, i.e. su1
stripeno = blockno/stripe_count = 1/3 = 0             stripe number, i.e. stripe0
stripepos = blockno%stripe_count = 1%3 = 1            position within the stripe, i.e. the second object of the stripe
objectsetno = stripeno/stripes_per_object = 0/3 = 0   objectset number, i.e. objectset0
objectno = objectsetno*stripe_count + stripepos = 0*3+1 = 1   object number, i.e. the object holding the shard
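The same arithmetic as a self-contained sketch (my own illustration of the formulas above, not Ceph's actual Striper code):

// Sketch of the 1-D -> 3-D address conversion described above.
// Assumed layout: su = 1M, object_size = 3M, stripe_count = 3, so stripes_per_object = 3.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t su = 1 << 20, object_size = 3 * su, stripe_count = 3;
    const uint64_t stripes_per_object = object_size / su;              // 3
    const uint64_t offset = 1 << 20, len = 6 * (1 << 20);              // read su1..su6

    for (uint64_t off = offset; off < offset + len; off += su) {
        uint64_t blockno     = off / su;                               // which stripe unit
        uint64_t stripeno    = blockno / stripe_count;                 // which stripe
        uint64_t stripepos   = blockno % stripe_count;                 // slot within the stripe
        uint64_t objectsetno = stripeno / stripes_per_object;          // which objectset
        uint64_t objectno    = objectsetno * stripe_count + stripepos; // which rados object
        printf("su%llu -> objectset%llu stripe%llu object%llu\n",
               (unsigned long long)blockno, (unsigned long long)objectsetno,
               (unsigned long long)stripeno, (unsigned long long)objectno);
    }
    return 0;
}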

A stripe unit su is described by ObjectExtent(oid, objectno, offset, length, truncate_size) as its one-dimensional address; the striping result is stored in a map (map<object_t, vector<ObjectExtent>> object_extents). A map key looks like this:
10000000000.00000000__head_F0B56F30__1  #the part after the dot is objectno; the part before it is the inode number, which acts as a namespace to keep keys unique
When fuse reads and writes there are size limits (a single write is at most 4k, a single read at most 128k), so requests get fragmented here.

Data management in the objectcacher

The object_extents produced by file_to_extents is a map (these objects are not rados objects but objects in the osdc cache). The map is first flattened into a vector, then readx reads the stripe units in object_extents concurrently, consulting the objectcacher (bufferhead) on the way: a hit is served from cache, a miss has to go to rados. map_read maps an ObjectExtent onto BufferHeads. A bufferhead has several states (STATE_MISSING=0, STATE_CLEAN=1, STATE_ZERO=2, STATE_DIRTY=3, STATE_RX=4, STATE_TX=5, STATE_ERROR). Starting from an empty cache, the flow after map_read is:
1. the first read misses the cache and must fetch from the osd (bh_read issues an object-read op to the osd)
2. when the client receives the osd's reply, the callback C_ReadFinish registered by bh_read copies the data read from the OSD into the bufferhead
3. the second read hits the cache
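A heavily simplified sketch of that miss-then-hit flow (illustration only; the real logic in src/osdc/ObjectCacher.cc is far more involved):

// Simplified illustration of the objectcacher read path described above.
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

enum BufferState { STATE_MISSING, STATE_CLEAN, STATE_ZERO,
                   STATE_DIRTY, STATE_RX, STATE_TX, STATE_ERROR };

struct BufferHead { BufferState state = STATE_MISSING; std::vector<uint8_t> data; };

static std::map<uint64_t, BufferHead> cache;    // keyed by extent offset (simplified)

// Stand-in for bh_read(): sends a read op to the OSD; the completion callback
// (C_ReadFinish in the real code) copies the reply into the BufferHead.
static void bh_read(uint64_t off) {
    cache[off].data.assign(4096, 0);            // pretend the OSD returned a block
    cache[off].state = STATE_CLEAN;             // the callback marks the buffer clean
}

// map_read-style lookup: map the extent onto a BufferHead, then hit or miss.
static bool read_extent(uint64_t off) {
    BufferHead &bh = cache[off];
    if (bh.state == STATE_CLEAN || bh.state == STATE_DIRTY)
        return true;                            // cache hit
    bh.state = STATE_RX;                        // miss: a read is now in flight
    bh_read(off);
    return false;
}

int main() {
    printf("first read:  %s\n", read_extent(0) ? "hit" : "miss"); // miss -> OSD
    printf("second read: %s\n", read_extent(0) ? "hit" : "miss"); // hit from cache
    return 0;
}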

Reproducer

#https://www.findbugzero.com/operational-defect-database/vendors/rh/defects/2247762
git clone https://github.com/bengland2/smallfile.git
cd smallfile
sudo chown -R $USER /mnt/nfs/
for i in $(seq 1 10); do mkdir -p /mnt/nfs/smallfile$i; done
for i in $(seq 1 10); do python3 smallfile_cli.py --operation create --threads 4 --file-size 4194 --files 1024 --files-per-dir 10 --dirs-per-dir 2 --record-size 128 --top /mnt/nfs/smallfile$i --output-json=create.json;done

#hit the 'Disk quota exceeded' problem; the methods below did not help, because this environment does not support CephFS native shares (cephfs_protocol_helper_type=CEPHFS)
host = jammy-065702,thr = 00,elapsed = None,files = None,records = None,status = ERR: Disk quota exceeded                                                                               
WARNING: thread 00 on host jammy-065702 never completed
# https://superuser.com/questions/1787448/cant-remove-ceph-xattrs-on-cephfs-on-linux
sudo setfattr -n ceph.quota.max_bytes -v 0 /path/to/ceph/directory
sudo setfattr -n ceph.quota.max_files -v 0 /path/to/ceph/directory
#the method below did not help either
#Disk quota exceeded nfs-ganesha - https://github.com/ceph/ceph/commit/48acd4b35c860589d43e7cce7a80b5a023fd9f21
echo 'client quota = false' >> /etc/ceph/ceph.conf

Building nfs-ganesha

#https://bbs.huaweicloud.com/blogs/193848
wget https://github.com/nfs-ganesha/nfs-ganesha/archive/next.zip && unzip next.zip
cd next/next
cmake -DCMAKE_BUILD_TYPE=Release -Wno-dev -DPROXY_HANDLE_MAPPING=ON -DUSE_9P=OFF -DUSE_FSAL_CEPH=OFF -DUSE_FSAL_GLUSTER=OFF -DUSE_FSAL_LUSTRE=OFF -DUSE_FSAL_LIZARDFS=OFF -DUSE_FSAL_XFS=ON -DUSE_FSAL_RGW=OFF -DRADOS_URLS=OFF -DUSE_RADOS_RECOV=OFF -D_MSPAC_SUPPORT=OFF -DUSE_GSS=ON -DALLOCATOR=libc ../src/
make
make install

A possible workaround

(gdb) frame 7
#7  0x00007ffa7049f602 in xlist<ObjectCacher::Object*>::size (this=0x7ffa20734638, this=0x7ffa20734638) at ./src/include/xlist.h:87
87	./src/include/xlist.h: No such file or directory.
(gdb) p *this
$1 = {_front = 0x0, _back = 0x0, _size = 0}
(gdb) frame 6
#6  0x00007ffa91078525 in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:80
80	./src/common/assert.cc: No such file or directory.
(gdb) p ctx
$2 = (const ceph::assert_data &) @0x7ffa70587900: {assertion = 0x7ffa70530598 "(bool)_front == (bool)_size", file = 0x7ffa705305b4 "./src/include/xlist.h", line = 87, 
  function = 0x7ffa7053b410 "size_t xlist<T>::size() const [with T = ObjectCacher::Object*; size_t = long unsigned int]"}

_front and _size are both 0, so how can '(bool)_front == (bool)_size' not hold? Strange. Was the value read without holding a lock when the assert fired? (See the sketch below.) Lowering the debug level per the following steps may skip this logging path.
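For context, the failing assert lives in xlist<T>::size() at src/include/xlist.h:87. A minimal sketch of the invariant it checks, with my unconfirmed assumption that an unlocked concurrent reader can observe the two fields mid-update even though the settled state in the coredump is {0, 0}:

// Sketch of the xlist invariant from src/include/xlist.h (simplified; the real class is omitted).
// The assert that fired: ceph_assert((bool)_front == (bool)_size);
#include <cassert>
#include <cstddef>

struct item {};                      // stand-in for xlist<T>::item

struct xlist_like {
    item  *_front = nullptr;
    item  *_back  = nullptr;
    size_t _size  = 0;

    size_t size() const {
        // Invariant: the list is empty exactly when _front == nullptr. If another
        // thread mutates the list while we read it without a lock, we may see
        // _front already set but _size not yet incremented (or vice versa):
        // the assert fires, yet the coredump later shows the settled {0, 0}.
        assert((bool)_front == (bool)_size);
        return _size;
    }
};

int main() {
    xlist_like l;
    return (int)l.size();            // consistent state: the assert passes
}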

1. Log in to the ceph-mon unit:
#juju ssh ceph-mon/0

2. Adjust the debug client level to 0/2, which will still provide sufficient error logs:
#sudo ceph config set global debug_client 0/2

3. Verify that the configuration has been successfully set.
#sudo ceph config dump
WHO   MASK LEVEL   OPTION                 VALUE    RO
global    advanced debug_client              0/2      
mon      advanced auth_allow_insecure_global_id_reclaim false     
mgr      advanced mgr/prometheus/rbd_stats_pools           * 
osd.1     basic   osd_mclock_max_capacity_iops_hdd    275.751910   
osd.2     basic   osd_mclock_max_capacity_iops_hdd    194.949454

possible fix - https://github.com/ceph/ceph/pull/59162

