analyze nfs-ganesha coredump file (by quqi99)

Author: Zhang Hua  Published: 2024-07-18  Copyright: This article may be reproduced freely, but please mark the original source and author with a hyperlink and keep this copyright notice (http://blog.csdn.net/quqi99)

Problem

nfs-ganesha is installed inside lxd and crashes occasionally, producing a coredump file. Writing the coredump takes about twenty minutes, during which CPU usage spikes and the service stays down for nearly the whole time, so we need to figure out why it keeps crashing.

nfs-ganesha is a FUSE-like user-space file system server (it loses some performance compared with the kernel-based nfsv4 server, but running in user space brings more interesting capabilities: it is more generic, adapts to more file systems, and can, for example, export a FUSE mount over NFS without kernel help). On top of a specific storage backend such as ceph it provides a POSIX interface, so users can operate on ceph with familiar linux commands (eg: ls, cp), and it can also serve NFS on top of that POSIX file system. The path is: nfs-client -> nfs-ganesha -> FSAL_CEPH (File System Abstraction Layer, bypassing the kernel) -> libcephfs -> rados cluster
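To make that path concrete, here is a minimal, hypothetical sketch of the libcephfs layer that FSAL_CEPH sits on (assumptions: a reachable cluster, /etc/ceph/ceph.conf and a client keyring in place; this is an illustration, not ganesha's actual FSAL code):

// build: g++ cephfs_demo.cpp -lcephfs   (assumes libcephfs-dev is installed)
#include <cephfs/libcephfs.h>
#include <fcntl.h>

int main() {
    struct ceph_mount_info *cmount;
    if (ceph_create(&cmount, NULL) != 0)      // create the client handle
        return 1;
    ceph_conf_read_file(cmount, NULL);        // read /etc/ceph/ceph.conf
    if (ceph_mount(cmount, "/") != 0) {       // attach to the cephfs root
        ceph_release(cmount);
        return 1;
    }
    int fd = ceph_open(cmount, "/hello.txt", O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) {
        ceph_write(cmount, fd, "hi\n", 3, 0); // goes through the user-space client (osdc/objectcacher)
        ceph_close(cmount, fd);
    }
    ceph_unmount(cmount);
    ceph_release(cmount);
    return 0;
}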

Setting up the gdb environment

The customer runs jammy nfs-ganesha=3.5-1ubuntu1, so first create a jammy lxd instance (lxc launch ubuntu:22.04 jammy), enter it (lxc exec jammy -- sudo /bin/bash), then run the commands below to set up the gdb environment (run gdb from the source directory, or associate the source with the directory command once inside gdb):

sudo apt install cgdb nfs-ganesha nfs-ganesha-ceph -y
mkdir tmp && apport-unpack _usr_bin_ganesha.nfsd.0.crash ./tmp/ && cd tmp
cd /root && git clone https://github.com/nfs-ganesha/nfs-ganesha.git && cd nfs-ganesha && git checkout -b 3.5 V3.5
cd /root && git clone https://github.com/ceph/ceph.git && cd ceph && git checkout -b 17.2.5 v17.2.5
cd /root/nfs-ganesha #or use 'directory /root/nfs-ganesha' instead of 'cd'
ls /usr/bin/ganesha.nfsd
cgdb /usr/bin/ganesha.nfsd /tmp/ganesha-nfsd-crash/CoreDump
(gdb) directory /root/nfs-ganesha
(gdb) directory /root/ceph
(gdb) l main

How do we solve the symbol table problem? If the symbol tables are wrong, gdb shows lots of question marks. Installing just the three packages "nfs-ganesha-dbgsym nfs-ganesha-ceph-dbgsym libcephfs2-dbgsym" is not enough: you also have to walk through 'info sharedlibrary' in gdb and install the symbol table for every shared library, and confirm the version of each one by one. Even recompiling from source would produce a Build ID that does not match the customer's coredump.

Since ubuntu 22.10, debuginfod can fetch symbol tables automatically (export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com", see https://ubuntu.com/server/docs/about-debuginfod). It is only installed by default from ubuntu 22.10 on, requires gdb>=10.1 and elfutils>=0.178, and usually only the latest version in a release is indexed, so debuginfod is not considered here.

How do we install the symbol tables manually?

  • find-dbgsym-packages finds which symbol tables are missing from the CoreDump
  • eu-unstrip and readelf can verify that the Build IDs match exactly
  • the command 'pull-{lp,uca}-{ddebs,source,debs} ceph 12.3.4.4~cloud0' can download symbol packages
apt install elfutils debian-goodies debuginfod -y
#readelf -n /usr/bin/ganesha.nfsd |grep 'Build ID'
    Build ID: 71d0693e76b2f0b4516bfa255d6a60bbf042ad09
# eu-unstrip -n --core ./CoreDump |grep -i build |grep 71d0693e76b2f0b4516bfa255d6a60bbf042ad09
0x559589477000+0x8000 71d0693e76b2f0b4516bfa255d6a60bbf042ad09@0x559589477378 . /usr/lib/debug/.build-id/71/d0693e76b2f0b4516bfa255d6a60bbf042ad09.debug /usr/bin/ganesha.nfsd

echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/debuginfo_debs.list
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/debuginfo_debs.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C8CAB6595FDFF622
sudo apt update
For example, 'apt-cache policy nfs-ganesha-dbgsym' finds no symbol table for version 3.5-1ubuntu1 (if one existed you could simply install it, e.g.: apt install libcephfs2-dbgsym=17.2.5-0ubuntu0.22.04.3); in that case use pull-lp-ddebs to fetch the old package manually from launchpad.
#pull-{lp,uca}-{ddebs,source,debs} ceph 12.3.4.4~cloud0
pull-lp-ddebs nfs-ganesha-dbgsym 3.5-1ubuntu1
pull-lp-ddebs nfs-ganesha-ceph-dbgsym 3.5-1ubuntu1
dpkg -i nfs-ganesha-dbgsym_3.5-1ubuntu1_amd64.ddeb
apt install libcephfs2-dbgsym=17.2.5-0ubuntu0.22.04.3

# find-dbgsym-packages ./CoreDump
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libresolv.so.2
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libresolv.so.2 (7fd7253c61aa6fce2b7e13851c15afa14a5ab160)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 (61ef896a699bb1c2e4e231642b2e1688b2f1a61e)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libm.so.6
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libm.so.6 (27e82301dba6c3f644404d504e1bb1c97894b433)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libc.so.6
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libc.so.6 (69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (09c4935b79388431a1248f6a98e00d7dc81b8513)
dpkg-query: no path found matching pattern /lib/x86_64-linux-gnu/libnfsidmap.so.1
W: Cannot find debug package for libnfsidmap.so.1 (dfe1fcd7b9f3c5d04e0e76936cf1dcca9d5442af)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libcap.so.2.44
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libcap.so.2.44 (9e11e3bca4b0a25d047cb36e933e1d727663cf8e)
krb5-admin-server-dbgsym krb5-gss-samples-dbgsym krb5-k5tls-dbgsym krb5-kdc-dbgsym krb5-kdc-ldap-dbgsym krb5-kpropd-dbgsym krb5-otp-dbgsym krb5-pkinit-dbgsym krb5-user-dbgsym libcom-err2-dbgsym libdbus-1-3-dbgsym libkrb5-dbg libnfsidmap1-dbgsym libnss-systemd-dbgsym libssl3-dbgsym libstdc++6-12-dbg libstdc++6-dbgsym libsystemd0-dbgsym libudev1-dbgsym libwbclient0-dbgsym zlib1g-dbgsym

pull-lp-ddebs libcom-err2-dbgsym 1.46.5-2ubuntu1.1
pull-lp-ddebs libdbus-1-3-dbgsym 1.12.20-2ubuntu4.1
pull-lp-ddebs libnfsidmap1-dbgsym 1:2.6.1-1ubuntu1.2
dpkg -i libcom-err2-dbgsym_1.46.5-2ubuntu1.1_amd64.ddeb
dpkg -i libdbus-1-3-dbgsym_1.12.20-2ubuntu4.1_amd64.ddeb
dpkg -i libnfsidmap1-dbgsym_2.6.1-1ubuntu1.2_amd64.ddeb
sudo apt install krb5-admin-server-dbgsym krb5-gss-samples-dbgsym krb5-k5tls-dbgsym krb5-kdc-dbgsym krb5-kdc-ldap-dbgsym krb5-kpropd-dbgsym krb5-otp-dbgsym krb5-pkinit-dbgsym krb5-user-dbgsym libcom-err2-dbgsym libdbus-1-3-dbgsym libkrb5-dbg libnfsidmap1-dbgsym libnss-systemd-dbgsym libssl3-dbgsym libstdc++6-12-dbg libstdc++6-dbgsym libsystemd0-dbgsym libudev1-dbgsym libwbclient0-dbgsym zlib1g-dbgsym -y

# find-dbgsym-packages ./CoreDump
...
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libc.so.6 (69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
# eu-unstrip -n --core ./CoreDump |grep -i build |grep 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
<empty>
# When a built-in dbg package exists, install it in preference to the dbgsym package; only fall back to dbgsym when there is no dbg package. This is critically important (note: dbg packages are built from the source package and live in debs, dbgsym packages live in ddebs)
#pull-lp-ddebs libc6-dbg 2.35-0ubuntu3.1
pull-lp-debs libc6-dbg 2.35-0ubuntu3.1
pull-lp-debs libc6 2.35-0ubuntu3.1
dpkg -i libc6_2.35-0ubuntu3.1_amd64.deb
dpkg -i libc6-dbg_2.35-0ubuntu3.1_amd64.deb
# eu-unstrip -n --core ./CoreDump |grep -i build |grep 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
0x7ffa92052000+0x227e50 69389d485a9793dbe873f0ea2c93e02efaa9aa3d@0x7ffa92052390 /lib/x86_64-linux-gnu/libc.so.6 /usr/lib/debug/.build-id/69/389d485a9793dbe873f0ea2c93e02efaa9aa3d.debug libc.so.6

# find-dbgsym-packages ./CoreDump
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libcap.so.2.44
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libcap.so.2.44 (9e11e3bca4b0a25d047cb36e933e1d727663cf8e)
dpkg-query: no path found matching pattern /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
W: Cannot find debug package for /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (09c4935b79388431a1248f6a98e00d7dc81b8513)
krb5-admin-server-dbgsym krb5-gss-samples-dbgsym krb5-k5tls-dbgsym krb5-kdc-dbgsym krb5-kdc-ldap-dbgsym krb5-kpropd-dbgsym krb5-otp-dbgsym krb5-pkinit-dbgsym krb5-user-dbgsym libkrb5-dbg libnss-systemd-dbgsym libssl3-dbgsym libstdc++6-12-dbg libstdc++6-dbgsym libsystemd0-dbgsym libudev1-dbgsym libwbclient0-dbgsym

After the symbol tables are resolved (libc6's is the critical one), the bt looks like this:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140705682900544) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140705682900544) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140705682900544, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffa92094476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffa9207a7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffa910783c3 in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at ./src/common/assert.cc:75
#6  0x00007ffa91078525 in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:80
#7  0x00007ffa7049f602 in xlist<ObjectCacher::Object*>::size (this=0x7ffa20734638, this=0x7ffa20734638) at ./src/include/xlist.h:87
#8  operator<< (os=..., out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
...) at ./src/osdc/ObjectCacher.h:760
#9  operator<< (out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
..., in=...) at ./src/client/Inode.cc:80
#10 0x00007ffa7045545f in Client::ll_sync_inode (this=0x55958b8a5c60, in=in@entry=0x7ffa20734270, syncdataonly=syncdataonly@entry=false) at ./src/client/Client.cc:14717
#11 0x00007ffa703d0f75 in ceph_ll_sync_inode (cmount=cmount@entry=0x55958b0bd0d0, in=in@entry=0x7ffa20734270, syncdataonly=syncdataonly@entry=0) at ./src/libcephfs.cc:1865
#12 0x00007ffa9050ddc5 in fsal_ceph_ll_setattr (creds=<optimized out>, mask=<optimized out>, stx=0x7ff8983f25a0, i=<optimized out>, cmount=<optimized out>)
    at ./src/FSAL/FSAL_CEPH/statx_compat.h:209
#13 ceph_fsal_setattr2 (obj_hdl=0x7fecc8fefbe0, bypass=<optimized out>, state=<optimized out>, attrib_set=0x7ff8983f2830) at ./src/FSAL/FSAL_CEPH/handle.c:2410
#14 0x00007ffa92371da0 in mdcache_setattr2 (obj_hdl=0x7fecc9e98778, bypass=<optimized out>, state=0x7fef0d64c9b0, attrs=0x7ff8983f2830)
    at ../FSAL/Stackable_FSALs/FSAL_MDCACHE/./src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1012
#15 0x00007ffa922b2bbc in fsal_setattr (obj=0x7fecc9e98778, bypass=<optimized out>, state=0x7fef0d64c9b0, attr=0x7ff8983f2830) at ./src/FSAL/fsal_helper.c:573
#16 0x00007ffa9234c7bd in nfs4_op_setattr (op=0x7fecad7ac510, data=0x7fecac314a10, resp=0x7fecad1be200) at ../Protocols/NFS/./src/Protocols/NFS/nfs4_op_setattr.c:212
#17 0x00007ffa9232e413 in process_one_op (data=data@entry=0x7fecac314a10, status=status@entry=0x7ff8983f2a2c) at ../Protocols/NFS/./src/Protocols/NFS/nfs4_Compound.c:920
#18 0x00007ffa9232f9e0 in nfs4_Compound (arg=<optimized out>, req=0x7fecad491620, res=0x7fecac054580) at ../Protocols/NFS/./src/Protocols/NFS/nfs4_Compound.c:1327
#19 0x00007ffa922cb0ff in nfs_rpc_process_request (reqdata=0x7fecad491620) at ./src/MainNFSD/nfs_worker_thread.c:1508
#20 0x00007ffa92029be7 in svc_request (xprt=0x7fed640504d0, xdrs=<optimized out>) at ./src/svc_rqst.c:1202
#21 0x00007ffa9202df9a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at ./src/svc_rqst.c:1183
#22 0x00007ffa9203344d in svc_rqst_epoll_loop (wpe=0x559594308e60) at ./src/svc_rqst.c:1564
#23 0x00007ffa920389e1 in work_pool_thread (arg=0x7feeb802ea10) at ./src/work_pool.c:184
#24 0x00007ffa920e6b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#25 0x00007ffa92178a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

To browse the source, do the following; or, to avoid the directory commands, just cd into the source directory before running cgdb:

$ cgdb /usr/bin/ganesha.nfsd ./CoreDump
(gdb) directory /root/nfs-ganesha
Source directories searched: /root/nfs-ganesha:$cdir:$cwd
(gdb) directory /root/ceph
Source directories searched: /root/ceph:/root/nfs-ganesha:$cdir:$cwd
(gdb) l main
warning: Source file is more recent than executable.
133      * @return status to calling program by calling the exit(3C) function.
134      *
135      */
136
137     int main(int argc, char *argv[])
138     {
139             char *tempo_exec_name = NULL;
140             char localmachine[MAXHOSTNAMELEN + 1];
141             int c;
142             int dsc;

Analyzing the coredump

First, setting breakpoints cannot crack this kind of problem: the process just exits. The customer never knew what triggered the crash, so there is no reproducer (and with no reproducer we cannot build a live environment, attach gdb and trigger it); even while analyzing the crash we would not know under which conditions a breakpoint would fire.

(gdb) break ceph::__ceph_assert_fail
Breakpoint 1 at 0x7ffa703b61f0 (4 locations)
(gdb) run
Starting program: /usr/bin/ganesha.nfsd 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 12433]
[Inferior 1 (process 12430) exited normally]

The bt frames above run from the bottom up, so analyzing a coredump usually means moving to a frame with the frame command and then inspecting variables with 'info locals'; print line shows the source line, but the code is optimized and the reported line numbers are often untrustworthy. Can we rebuild the source at a lower optimization level instead? Also no: the Build ID would then differ from the one in the customer's coredump, and we would be back to a pile of question marks.

frame xx     # switch to a frame
info locals  # inspect local variables
info args    # inspect arguments
print assertion
print file
print line   # show the line number (unreliable when the code is optimized)
list         # view source
(gdb) info line *0x00007ffa922cb0ff
Line 1508 of "./src/MainNFSD/nfs_worker_thread.c" starts at address 0x7ffa922cb0f0 <nfs_rpc_process_request+3008>
   and ends at 0x7ffa922cb102 <nfs_rpc_process_request+3026>.
(gdb) info symbol 0x00007ffa922cb0ff
nfs_rpc_process_request + 3023 in section .text of /usr/lib/ganesha/libganesha_nfsd.so.3.5
info frame
info registers
x/i $pc
x/10x $esp
info threads
thread apply all bt
info target
info sharedlib

So for this class of problem, with no reproducer and optimized code, there is no other way: analyze the crash together with the source code.
A breakpoint on __ceph_assert_fail can never be hit without a reproducer. Breaking on nfs4_Compound at least suggests how to trigger it (sudo mount -t nfs <nfs_server_ip>:/exported_directory /mnt && touch /mnt/tmp), but we are only analyzing a coredump file, not a complete environment, so even that breakpoint cannot be triggered by hand, and step/next cannot be used to walk through the code. Analyzing a coredump means navigating frames and inspecting variables; there is no other method.
Walking the frames one by one with the commands above (https://paste.ubuntu.com/p/p7trjTkqBJ/), we see for instance that in frame 8 below, os.objects at line 760 is empty (do we need a tool like Valgrind to check for invalid memory accesses?).

#8  operator<< (os=..., out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
...) at ./src/osdc/ObjectCacher.h:760

(gdb) down
#8  operator<< (os=..., out=warning: RTTI symbol not found for class 'StackStringStream<4096ul>'
...) at ./src/osdc/ObjectCacher.h:760
760                  << " objects " << os.objects.size()
(gdb) l
755     inline std::ostream& operator<<(std::ostream &out,
756                                     const ObjectCacher::ObjectSet &os)
757     {
758       return out << "objectset[" << os.ino
759                  << " ts " << os.truncate_seq << "/" << os.truncate_size
760                  << " objects " << os.objects.size()
761                  << " dirty_or_tx " << os.dirty_or_tx
762                  << "]";
763     }
764

$ git log --oneline --no-merges v17.2.5...master ./src/osdc/ObjectCacher.h
warning: refname 'v17.2.5' is ambiguous.
dba751ac0c0 osdc: add set_error in BufferHead, when split set_error to right
215facf5782 osdc: Build target 'common' without using namespace in headers
a54d0a90c06 crimson:common add TOPNSPC namespace for ceph and crimson
20b1ac6e095 osdc: s/Mutex/ceph::mutex/
5d4f82117ed osdc: reduce ObjectCacher's memory fragments
c33ce07fb8e mount,osdc: fix typos
c1179cd446b osdc: Use ceph_assert for asserts.

The next step would be to test the patch below, but with no reproducer, how do we test it?

$ git diff
diff --git a/src/osdc/ObjectCacher.h b/src/osdc/ObjectCacher.h
index 60f049ef55d..ebecaa532fc 100644
--- a/src/osdc/ObjectCacher.h
+++ b/src/osdc/ObjectCacher.h
@@ -748,10 +748,16 @@ inline ostream& operator<<(ostream &out, const ObjectCacher::BufferHead &bh)
 
 inline ostream& operator<<(ostream &out, const ObjectCacher::ObjectSet &os)
 {
-  return out << "objectset[" << os.ino
+         out << "objectset[" << os.ino
             << " ts " << os.truncate_seq << "/" << os.truncate_size
-            << " objects " << os.objects.size()
-            << " dirty_or_tx " << os.dirty_or_tx
+            << " objects ";
+         if (os.objects.size() > 0) {
+           out << os.objects.size();
+         } else {
+           out << "empty";
+           std::cerr << "Error: os.objects is empty!" << std::endl;
+          }
+      return out << " dirty_or_tx " << os.dirty_or_tx
             << "]";
 }

Or use valgrind? - https://blog.csdn.net/xhtchina/article/details/121187064

g++ -g -o test test.cpp
$ cat test.cpp 
#include<iostream>
using namespace std;
int main(){
	int a[5];
	int i,s=0;
	a[0]=a[1]=a[3]=a[4]=0;
	for(i=0;i<5;++i)
		s+=a[i];
	if(s==33)
		cout<<"sum is 33"<<endl;
	else
		cout<<"sum is not 33"<<endl;
	return 0;
}
# Conditional jump or move depends on uninitialised value(s)
$ valgrind --leak-check=full ./test 
==686432== Memcheck, a memory error detector
==686432== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==686432== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==686432== Command: ./test
==686432== 
==686432== Conditional jump or move depends on uninitialised value(s)
==686432==    at 0x1091E7: main (test.cpp:9)
==686432== 
sum is not 33
==686432== 
==686432== HEAP SUMMARY:
==686432==     in use at exit: 0 bytes in 0 blocks
==686432==   total heap usage: 2 allocs, 2 frees, 74,752 bytes allocated
==686432== 
==686432== All heap blocks were freed -- no leaks are possible
==686432== 
==686432== Use --track-origins=yes to see where uninitialised values come from
==686432== For lists of detected and suppressed errors, rerun with: -s
==686432== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

In the nfs-ganesha environment, valgrind can be used like this:

sudo apt install valgrind -y
sudo systemctl stop nfs-ganesha
sudo valgrind --leak-check=full --show-reachable=yes --trace-children=yes --log-file=/tmp/valgrind-logfile /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT

A possible test script:

#create and write lots of small files
for i in {1..500}; do
  touch /mnt/cephfs/mixedfile_$i
  echo "Mixed test data" > /mnt/cephfs/mixedfile_$i
done

#create a large file
dd if=/dev/zero of=/mnt/cephfs/largefile bs=1M count=512

#read files concurrently
for i in {1..500}; do
  cat /mnt/cephfs/mixedfile_$i > /dev/null &
done
wait

#delete files
rm /mnt/cephfs/mixedfile_*
rm /mnt/cephfs/largefile

#run these concurrently in several terminals or scripts
for i in {1..10}; do
  for j in {1..100}; do
    #do something
  done &
done
wait

#snapshot operations churn metadata and may therefore trigger cache problems
ceph fs snapshot create myfs my_snapshot
ceph fs snapshot rm myfs my_snapshot

How is manila used here?

./generate-bundle.sh --name manila -s jammy --num-compute 1 --manila --ceph --ceph-fs --run
./tools/vault-unseal-and-authorise.sh
./configure
source novarc
# https://gist.github.com/congto/aba6f9d5087bb8e78b6377b463c3bde5
sudo apt install python3-manilaclient -y
# Configure a share type that matches the CephFS/NFS-Ganesha backend capabilities.
manila type-create cephfsnfstype false
manila type-key cephfsnfstype set vendor_name=Ceph storage_protocol=NFS
# Create a share.
manila create --share-type cephfsnfstype --name cephnfsshare1 nfs 1
$ manila share-export-location-list cephnfsshare1 |grep vol
| eb9e928e-1e00-409f-9c34-9bb80a8226fb | 10.149.144.79:/volumes/_nogroup/7c94b263-e432-475d-ba9a-15f3f14effea/b92c7dc7-80ec-4586-9894-5c1c4770915f | False     |
# Allow access to a nova VM (eg: bastion 10.149.144.44), which can connect to the ganesha server.
manila access-allow cephnfsshare1 ip 10.149.144.44
# Try mounting the NFS share from bastion
sudo mkdir /mnt/nfs && sudo chown $USER /mnt/nfs
sudo apt install nfs-common -y
#showmount is used for nfsv3, not nfsv4, use 'nfsstat -m' instead for nfsv4
#sudo showmount -e 10.149.144.79
sudo mount -t nfs 10.149.144.79:/volumes/_nogroup/7c94b263-e432-475d-ba9a-15f3f14effea/b92c7dc7-80ec-4586-9894-5c1c4770915f /mnt/nfs
$ nfsstat -m
/mnt/nfs from 10.149.144.79:/volumes/_nogroup/7c94b263-e432-475d-ba9a-15f3f14effea/b92c7dc7-80ec-4586-9894-5c1c4770915f
 Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.149.144.44,local_lock=none,addr=10.149.144.79


#juju ssh ceph-mon/0 -- sudo -s
# rados lspools |grep manila
manila-ganesha

# rados -p manila-ganesha ls
ganesha-export-counter
ganesha-export-index
node0_recov
$ juju ssh manila-ganesha/0 -- sudo -s

root@juju-41cdda-manila-11:/home/ubuntu# rpcinfo -p localhost
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  44544  status
    100024    1   tcp  44403  status
    100003    4   udp   2049  nfs
    100003    4   tcp   2049  nfs

#https://gist.github.com/congto/aba6f9d5087bb8e78b6377b463c3bde5
# grep -r '\[cephfsnfs1' /etc/manila/manila.conf -A20
[cephfsnfs1]
driver_handles_share_servers = False
ganesha_rados_store_enable = True
ganesha_rados_store_pool_name = manila-ganesha
share_backend_name = CEPHFSNFS1
share_driver = manila.share.drivers.cephfs.driver.CephFSDriver
cephfs_protocol_helper_type = NFS
cephfs_conf_path = /etc/ceph/ceph.conf
cephfs_auth_id = manila-ganesha
cephfs_cluster_name = ceph
cephfs_enable_snapshots = False
cephfs_ganesha_server_is_remote = False
cephfs_ganesha_server_ip = 10.149.144.79

In fact, manila supports two modes, CephFS NFS shares and CephFS native shares. cephfs_protocol_helper_type = NFS above means NFS shares (mounted on the client via nfs-common); to let clients use ceph-fuse instead, set cephfs_protocol_helper_type = CEPHFS (see https://docs.openstack.org/manila/latest/configuration/shared-file-systems/drivers/cephfs_driver.html#configure-cephfs-nfs-share-backend-in-manila-conf and https://blog.51cto.com/u_16213461/7188633)

manila type-create cephfs_type false
manila type-key cephfs_type set vendor_name=Ceph storage_protocol=CephFS
manila create --share-type cephfs_type --name cephfs_share1 cephfs 1
manila access-allow cephfs_share1 ip 10.149.142.103
manila share-export-location-list cephfs_share1

But manila-ganesha hard-codes cephfs_protocol_helper_type:

$ grep -r 'cephfs_protocol_helper_type' manila*
manila-ganesha/src/templates/rocky/manila.conf:cephfs_protocol_helper_type = NFS

$ manila create --share-type cephfs_type --name cephfs_share1 cephfs 1
ERROR: Invalid input received: Invalid share protocol provided: CEPHFS. It is either disabled or unsupported. Available protocols: ['NFS']. (HTTP 400) (Request-ID: req-03b21a2a-7bd7-4145-bf6e-cb4d23180d7e)

objectcacher source code

osdc is a fairly low-level module on the client side (only user-space clients call osdc). It converts cephfs's one-dimensional address space (a file system manages files and directories as a tree, addresses by table lookup, and must introduce centralized metadata management) into the three-dimensional address space of objects (ceph manages data flatly and addresses by computation). It therefore keeps an object-level cache (the objectcacher), and after the conversion to the three-dimensional address space the crush algorithm is used to place the data.
The request flow of user-space nfs-ganesha is roughly as follows:

posix(open|read|write) -> system call -> vfs -> fuse kernel module -> fuse user lib -> cephfs client -> client::read|client::write
client::read -> client::_read -> client::_read_async -> file_read -> file_to_extents(address convert) -> ObjectCacher::readx -> ObjectCacher::_readx

A file is striped into many small pieces (one such piece, a stripe unit, is also called an object shard, su; three su side by side form one stripe) that are laid out in order across multiple underlying rados objects (4M by default; if, as in the figure, one rados object holds 3 stripes, the stripe unit is 4/3M). The file_to_extents function thus converts a one-dimensional coordinate into a three-dimensional one (objectset, stripeno, stripepos): which objectset, which stripe, and which su within that stripe.
(Figure: a file striped into 18 su across 6 rados objects, forming 2 objectsets.)
As the figure shows, the file being read is split into 18 pieces in total (stripe units su, assume su = 1M), stored in 6 rados objects numbered 0-5 (assume one object is 3M), occupying two objectsets. Now read the range su1-su6 (1M-7M):

offset = 1M               read offset
len = 6M                  size to read
su = 1M
object_size = 3M
stripe_count = 3          stripe width
stripes_per_object = 3    number of su per object

The address space above is thus converted from one dimension to three; for su1, for example:

1-D address: (offset, len) ==> (1M, 6M)
3-D address: (objectset, stripeno, stripepos) ==> (objectset0, stripe0, object1)
blockno = offset/su = 1M/1M = 1                       block (shard) number, i.e. su1
stripeno = blockno/stripe_count = 1/3 = 0             stripe number, i.e. stripe0
stripepos = blockno%stripe_count = 1%3 = 1            position within the stripe, i.e. the second object of the stripe
objectsetno = stripeno/stripes_per_object = 0/3 = 0   objectset number, i.e. objectset0
objectno = objectsetno*stripe_count + stripepos = 0*3+1 = 1   object number, i.e. the object holding the shard
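The same arithmetic as a self-contained sketch (my own illustration of the formulas above, not Ceph's actual Striper code):

// Sketch of the 1-D -> 3-D address conversion described above.
// Assumed layout: su = 1M, object_size = 3M, stripe_count = 3, so stripes_per_object = 3.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t su = 1 << 20, object_size = 3 * su, stripe_count = 3;
    const uint64_t stripes_per_object = object_size / su;              // 3
    const uint64_t offset = 1 << 20, len = 6 * (1 << 20);              // read su1..su6

    for (uint64_t off = offset; off < offset + len; off += su) {
        uint64_t blockno     = off / su;                               // which stripe unit
        uint64_t stripeno    = blockno / stripe_count;                 // which stripe
        uint64_t stripepos   = blockno % stripe_count;                 // slot within the stripe
        uint64_t objectsetno = stripeno / stripes_per_object;          // which objectset
        uint64_t objectno    = objectsetno * stripe_count + stripepos; // which rados object
        printf("su%llu -> objectset%llu stripe%llu object%llu\n",
               (unsigned long long)blockno, (unsigned long long)objectsetno,
               (unsigned long long)stripeno, (unsigned long long)objectno);
    }
    return 0;
}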

A stripe unit su is described by ObjectExtent(oid, objectno, offset, length, truncate_size) as its one-dimensional address; the striping result is stored in a map (map<object_t, vector<ObjectExtent>> object_extents). A map key looks like this:
10000000000.00000000__head_F0B56F30__1  #the part after the dot is objectno; the part before it is the inode number, which acts as a namespace to keep keys unique
When fuse reads and writes there are size limits (a single write is at most 4k, a single read at most 128k), so requests get fragmented here.

Data management in the objectcacher

The object_extents produced by file_to_extents is a map (these objects are not rados objects but objects in the osdc cache). The map is first flattened into a vector, then readx reads the stripe units in object_extents concurrently, consulting the objectcacher (bufferhead) on the way: a hit is served from cache, a miss has to go to rados. map_read maps an ObjectExtent onto BufferHeads. A bufferhead has several states (STATE_MISSING=0, STATE_CLEAN=1, STATE_ZERO=2, STATE_DIRTY=3, STATE_RX=4, STATE_TX=5, STATE_ERROR). Starting from an empty cache, the flow after map_read is:
1. the first read misses the cache and must fetch from the osd (bh_read issues an object-read op to the osd)
2. when the client receives the osd's reply, the callback C_ReadFinish registered by bh_read copies the data read from the OSD into the bufferhead
3. the second read hits the cache
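A heavily simplified sketch of that miss-then-hit flow (illustration only; the real logic in src/osdc/ObjectCacher.cc is far more involved):

// Simplified illustration of the objectcacher read path described above.
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

enum BufferState { STATE_MISSING, STATE_CLEAN, STATE_ZERO,
                   STATE_DIRTY, STATE_RX, STATE_TX, STATE_ERROR };

struct BufferHead { BufferState state = STATE_MISSING; std::vector<uint8_t> data; };

static std::map<uint64_t, BufferHead> cache;    // keyed by extent offset (simplified)

// Stand-in for bh_read(): sends a read op to the OSD; the completion callback
// (C_ReadFinish in the real code) copies the reply into the BufferHead.
static void bh_read(uint64_t off) {
    cache[off].data.assign(4096, 0);            // pretend the OSD returned a block
    cache[off].state = STATE_CLEAN;             // the callback marks the buffer clean
}

// map_read-style lookup: map the extent onto a BufferHead, then hit or miss.
static bool read_extent(uint64_t off) {
    BufferHead &bh = cache[off];
    if (bh.state == STATE_CLEAN || bh.state == STATE_DIRTY)
        return true;                            // cache hit
    bh.state = STATE_RX;                        // miss: a read is now in flight
    bh_read(off);
    return false;
}

int main() {
    printf("first read:  %s\n", read_extent(0) ? "hit" : "miss"); // miss -> OSD
    printf("second read: %s\n", read_extent(0) ? "hit" : "miss"); // hit from cache
    return 0;
}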

Reproducer

#https://www.findbugzero.com/operational-defect-database/vendors/rh/defects/2247762
git clone https://github.com/bengland2/smallfile.git
cd smallfile
sudo chown -R $USER /mnt/nfs/
for i in $(seq 1 10); do mkdir -p /mnt/nfs/smallfile$i; done
for i in $(seq 1 10); do python3 smallfile_cli.py --operation create --threads 4 --file-size 4194 --files 1024 --files-per-dir 10 --dirs-per-dir 2 --record-size 128 --top /mnt/nfs/smallfile$i --output-json=create.json;done

#hit the 'Disk quota exceeded' problem; the methods below did not help, because this environment does not support CephFS native shares (cephfs_protocol_helper_type=CEPHFS)
host = jammy-065702,thr = 00,elapsed = None,files = None,records = None,status = ERR: Disk quota exceeded                                                                               
WARNING: thread 00 on host jammy-065702 never completed
# https://superuser.com/questions/1787448/cant-remove-ceph-xattrs-on-cephfs-on-linux
sudo setfattr -n ceph.quota.max_bytes -v 0 /path/to/ceph/directory
sudo setfattr -n ceph.quota.max_files -v 0 /path/to/ceph/directory
#the method below did not help either
#Disk quota exceeded nfs-ganesha - https://github.com/ceph/ceph/commit/48acd4b35c860589d43e7cce7a80b5a023fd9f21
echo 'client quota = false' >> /etc/ceph/ceph.conf

Building nfs-ganesha

#https://bbs.huaweicloud.com/blogs/193848
wget https://github.com/nfs-ganesha/nfs-ganesha/archive/next.zip && unzip next.zip
cd next/next
cmake -DCMAKE_BUILD_TYPE=Release -Wno-dev -DPROXY_HANDLE_MAPPING=ON -DUSE_9P=OFF -DUSE_FSAL_CEPH=OFF -DUSE_FSAL_GLUSTER=OFF -DUSE_FSAL_LUSTRE=OFF -DUSE_FSAL_LIZARDFS=OFF -DUSE_FSAL_XFS=ON -DUSE_FSAL_RGW=OFF -DRADOS_URLS=OFF -DUSE_RADOS_RECOV=OFF -D_MSPAC_SUPPORT=OFF -DUSE_GSS=ON -DALLOCATOR=libc ../src/
make
make install

A possible workaround

(gdb) frame 7
#7  0x00007ffa7049f602 in xlist<ObjectCacher::Object*>::size (this=0x7ffa20734638, this=0x7ffa20734638) at ./src/include/xlist.h:87
87	./src/include/xlist.h: No such file or directory.
(gdb) p *this
$1 = {_front = 0x0, _back = 0x0, _size = 0}
(gdb) frame 6
#6  0x00007ffa91078525 in ceph::__ceph_assert_fail (ctx=...) at ./src/common/assert.cc:80
80	./src/common/assert.cc: No such file or directory.
(gdb) p ctx
$2 = (const ceph::assert_data &) @0x7ffa70587900: {assertion = 0x7ffa70530598 "(bool)_front == (bool)_size", file = 0x7ffa705305b4 "./src/include/xlist.h", line = 87, 
  function = 0x7ffa7053b410 "size_t xlist<T>::size() const [with T = ObjectCacher::Object*; size_t = long unsigned int]"}

_front and _size are both 0, so how can '(bool)_front == (bool)_size' not hold? Strange. Was the value read without holding a lock when the assert fired? (See the sketch below.) Lowering the debug level per the following steps may skip this logging path.
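For context, the failing assert lives in xlist<T>::size() at src/include/xlist.h:87. A minimal sketch of the invariant it checks, with my unconfirmed assumption that an unlocked concurrent reader can observe the two fields mid-update even though the settled state in the coredump is {0, 0}:

// Sketch of the xlist invariant from src/include/xlist.h (simplified; the real class is omitted).
// The assert that fired: ceph_assert((bool)_front == (bool)_size);
#include <cassert>
#include <cstddef>

struct item {};                      // stand-in for xlist<T>::item

struct xlist_like {
    item  *_front = nullptr;
    item  *_back  = nullptr;
    size_t _size  = 0;

    size_t size() const {
        // Invariant: the list is empty exactly when _front == nullptr. If another
        // thread mutates the list while we read it without a lock, we may see
        // _front already set but _size not yet incremented (or vice versa):
        // the assert fires, yet the coredump later shows the settled {0, 0}.
        assert((bool)_front == (bool)_size);
        return _size;
    }
};

int main() {
    xlist_like l;
    return (int)l.size();            // consistent state: the assert passes
}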

1. Log in to the ceph-mon unit:
#juju ssh ceph-mon/0

2. Adjust the debug client level to 0/2, which will still provide sufficient error logs:
#sudo ceph config set global debug_client 0/2

3. Verify that the configuration has been successfully set.
#sudo ceph config dump
WHO   MASK LEVEL   OPTION                 VALUE    RO
global    advanced debug_client              0/2      
mon      advanced auth_allow_insecure_global_id_reclaim false     
mgr      advanced mgr/prometheus/rbd_stats_pools           * 
osd.1     basic   osd_mclock_max_capacity_iops_hdd    275.751910   
osd.2     basic   osd_mclock_max_capacity_iops_hdd    194.949454

possible fix - https://github.com/ceph/ceph/pull/59162

