Analysis of an IO Hang Problem in CephFS Testing



The problem

To verify the usability of CephFS we ran several groups of stripe-layout tests against it. During the automated runs, fio would always hang after completing part of the workload, and at that point `ceph -s` reported a very large read rate, even though the actual network traffic in the environment was nowhere near that large.

Ceph output

    cluster 16f59233-16da-4864-8c5a-1fec71d119ad
     health HEALTH_OK
     monmap e1: 3 mons at {cephcluster-server1=10.10.1.6:6789/0,cephcluster-server2=10.10.1.7:6789/0,cephcluster-server3=10.10.1.8:6789/0}
            election epoch 12, quorum 0,1,2 cephcluster-server1,cephcluster-server2,cephcluster-server3
      fsmap e17: 1/1/1 up {0=mds-daemon-38=up:active}, 1 up:standby
     osdmap e180: 30 osds: 30 up, 30 in
            flags sortbitwise,require_jewel_osds
      pgmap v55504: 1088 pgs, 3 pools, 81924 MB data, 10910 objects
            241 GB used, 108 TB / 109 TB avail
                1088 active+clean
  client io 68228 MB/s rd, 1066 op/s rd, 0 op/s wr

Analysis: the client io figure reported by Ceph is actually correct. As the later analysis shows, the cephfs client keeps looping, retrying the same object read; each attempt reads 64 MB, and the object data is cached in the OSD's memory, so the read rate accounted by Ceph is huge (1066 op/s × 64 MB ≈ 68 GB/s, which matches the 68228 MB/s shown above).

Checking the network

iftop shows only around 10 MB/s of network traffic:

# iftop -B -i eth4
              25.5MB        50.9MB        76.4MB         102MB         127MB
└──────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┴──────────────────────────
cephcluster-server2.****.com => 10.10.1.18    4.46MB   4.51MB   4.52MB
                             <=                220KB    221KB    222KB

Comparing several stripe layouts

Comparing the runs shows that reading a 64 MB, non-striped object is what triggers the bug:

# getfattr -n ceph.file.layout dir4/tstfile
# file: dir4/tstfile
ceph.file.layout="stripe_unit=67108864 stripe_count=1 object_size=67108864 pool=cephfs_data"
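
For reference, the same layout can also be queried field by field through the CephFS virtual xattrs (a small sketch; the per-field xattr names are the standard ones, the path is the test file above):

# getfattr -n ceph.file.layout.stripe_unit dir4/tstfile
# getfattr -n ceph.file.layout.object_size dir4/tstfile
# getfattr -n ceph.file.layout.stripe_count dir4/tstfile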

fio with bs=64M hangs

# fio -filename=/home/yangguanjun/cephfs/dir4/tstfile -size=20G -thread -group_reporting -direct=1 -ioengine=libaio -bs=64M -rw=read -iodepth=64 -iodepth_batch_submit=8 -iodepth_batch_complete=8 -name=write_64k_64q
write_64k_64q: (g=0): rw=read, bs=64M-64M/64M-64M/64M-64M, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 thread
^Cbs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:06h:59m:55s]
fio: terminating on signal 2
...

fio with bs=32M hangs

# fio -filename=/home/yangguanjun/cephfs/dir4/tstfile -size=20G -thread -group_reporting -direct=1 -ioengine=libaio -bs=32M -rw=read -iodepth=64 -iodepth_batch_submit=8 -iodepth_batch_complete=8 -name=write_64k_64q
write_64k_64q: (g=0): rw=read, bs=32M-32M/32M-32M/32M-32M, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 thread
^Cbs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:07h:00m:09s]

dd with bs=16M hangs

# dd if=tstfile of=/dev/null bs=16M count=200

The dd process sits in the D (uninterruptible sleep) state:

root 12763 0.0 0.0 124344 664 pts/0 D+ 14:29 0:00 dd if=tstfile of=/dev/null bs=16M count=200

While dd is hung, ceph -s again reports a large read rate:

    cluster 16f59233-16da-4864-8c5a-1fec71d119ad
     health HEALTH_OK
     monmap e1: 3 mons at {cephcluster-server1=10.10.1.6:6789/0,cephcluster-server2=10.10.1.7:6789/0,cephcluster-server3=10.10.1.8:6789/0}
            election epoch 12, quorum 0,1,2 cephcluster-server1,cephcluster-server2,cephcluster-server3
      fsmap e23: 1/1/1 up {0=mds-daemon-37=up:active}
     osdmap e183: 30 osds: 30 up, 30 in
            flags sortbitwise,require_jewel_osds
      pgmap v63806: 1088 pgs, 3 pools, 93200 MB data, 13729 objects
            274 GB used, 108 TB / 109 TB avail
                1088 active+clean
  client io 1330 MB/s rd, 41 op/s rd, 0 op/s wr

2017-07-05 14:30:11.483974 mon.0 [INF] pgmap v63806: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1330 MB/s rd, 41 op/s
2017-07-05 14:30:16.484856 mon.0 [INF] pgmap v63807: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1212 MB/s rd, 37 op/s
2017-07-05 14:30:21.485295 mon.0 [INF] pgmap v63808: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1206 MB/s rd, 37 op/s
2017-07-05 14:30:26.485953 mon.0 [INF] pgmap v63809: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1212 MB/s rd, 37 op/s

Ceph cluster node information

On one of the ceph cluster nodes we can see read traffic going to the test node:

# iftop -n -i eth4
10.10.1.7        => 10.10.1.18    12.3Mb  11.6Mb  11.6Mb
                 <=                285Kb   284Kb   284Kb

After turning on source-port display in iftop:

10.10.1.7:6810   => 10.10.1.18    12.5Mb  11.5Mb  11.5Mb
                 <=                278Kb   255Kb   255Kb

# netstat -nap | grep 6810
tcp 0 0 10.10.1.7:6810 0.0.0.0:* LISTEN 79360/ceph-osd
tcp 0 0 10.10.1.7:6810 0.0.0.0:* LISTEN 79360/ceph-osd
tcp 0 0 10.10.1.7:6810 10.10.1.7:57266 ESTABLISHED 79360/ceph-osd

# ps aux | grep -w 79360
ceph 79360 85.9 0.5 4338588 709312 ? Ssl Jul04 1556:54 /usr/bin/ceph-osd -f --cluster ceph --id 15 --setuser ceph --setgroup ceph

# df
/dev/sdg1 3905070088 9079428 3895990660 1% /var/lib/ceph/osd/ceph-15

So the traffic comes from port 6810, owned by pid 79360, which is osd 15, whose data partition is /dev/sdg1.
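
As a cross-check, an OSD id can also be resolved to its host and address directly from the monitors instead of via netstat/ps. A quick sketch (the `ceph osd find` subcommand is available in Jewel; output omitted here):

# ceph osd find 15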

No IO on sdg

iostat shows that sdg is doing essentially no IO, consistent with the object data being served from the OSD's memory rather than from disk:

# iostat -kx 2 sdg
Linux 3.10.0-514.10.2.el7.x86_64 (cephcluster-server2.****.com) 07/05/2017 _x86_64_ (32 CPU)

avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          12.47   0.00    1.64    0.03    0.00  85.86

Device: rrqm/s wrqm/s  r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg       0.01   0.07 2.48 2.01  95.78 140.98   105.44     0.08 17.79    9.69   27.79  3.44  1.54

avg-cpu:  %user  %nice %system %iowait  %steal  %idle
           0.80   0.00    2.96    0.02    0.00  96.22

Device: rrqm/s wrqm/s  r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg       0.00   0.00 0.00 0.00   0.00   0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00

avg-cpu:  %user  %nice %system %iowait  %steal  %idle
           0.86   0.00    2.99    0.00    0.00  96.15

Device: rrqm/s wrqm/s  r/s  w/s  rkB/s  wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg       0.00   0.00 0.00 0.00   0.00   0.00     0.00     0.00  0.00    0.00    0.00  0.00  0.00

Checking the file's location

The test file's location information shows that its first chunk of data sits exactly on osd 15:

# cephfs tstfile show_location
WARNING: This tool is deprecated. Use the layout.* xattrs to query and modify layouts.
location.file_offset: 0
location.object_offset:0
location.object_no: 0
location.object_size: 67108864
location.object_name: 10000000805.00000000
location.block_offset: 0
location.block_size: 67108864
location.osd: 15

# ceph osd map cephfs_data 10000000805.00000000
osdmap e188 pool 'cephfs_data' (1) object '10000000805.00000000' -> pg 1.12d00fd5 (1.1d5) -> up ([15,0,26], p15) acting ([15,0,26], p15)
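
Since the `cephfs` tool is deprecated, the object name can also be derived by hand: a CephFS file's N-th data object is named with the file's inode number in hex, a dot, then N as eight hex digits. A small sketch (run from the directory containing the file; it should print the same object name as above, which can then be fed to `ceph osd map` as shown):

# printf '%x.%08x\n' $(stat -c %i tstfile) 0
10000000805.00000000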

Stopping ceph osd 15

After a while, the node hosting osd 15 no longer has any data traffic with the test node.
Instead, traffic now appears between the test node and the osd 0 node; checking confirms it is exactly osd 0 talking to the test node.
The map of the test file's first object now looks like this:

# ceph osd map cephfs_data 10000000805.00000000
osdmap e185 pool 'cephfs_data' (1) object '10000000805.00000000' -> pg 1.12d00fd5 (1.1d5) -> up ([0,26], p0) acting ([0,26], p0)

The problem pinned down

The problem is now easy to reproduce:

Create a file with layout stripe_unit=67108864 stripe_count=1 object_size=67108864 (one way to set this up is sketched below).
dd if=/dev/zero of=foxfile bs=64M count=1 — the write succeeds.
dd if=foxfile of=/dev/null bs=64M count=1 — the read hangs.
While it hangs, there is data traffic between the test node and the OSD that holds the corresponding object.
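
One way to create such a file, as referenced in the first step above (a sketch using CephFS's per-field directory layout xattrs; the directory name is made up, and object_size is set before stripe_unit so that object_size always remains a multiple of stripe_unit):

# mkdir dir64m
# setfattr -n ceph.dir.layout.object_size  -v 67108864 dir64m
# setfattr -n ceph.dir.layout.stripe_unit  -v 67108864 dir64m
# setfattr -n ceph.dir.layout.stripe_count -v 1        dir64m
# dd if=/dev/zero of=dir64m/foxfile bs=64M count=1

Files created inside the directory inherit its layout, which can be confirmed with getfattr -n ceph.file.layout as earlier.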

strace of dd confirms that the read really does hang in the read() call:

# strace dd if=foxfile of=/dev/null bs=64M count=1
...
open("foxfile", O_RDONLY) = 3
dup2(3, 0) = 0
close(3) = 0
lseek(0, 0, SEEK_CUR) = 0
open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1) = 1
close(3) = 0
mmap(NULL, 67121152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f01712d8000
read(0,

Collecting client logs

Collecting osd logs

# ceph daemon /var/run/ceph/ceph-osd..asok config set debug_osd "20/20"
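
The client-side messages quoted below are kernel-level debug output from the ceph/libceph modules. One common way to enable it (a sketch, assuming the client kernel was built with CONFIG_DYNAMIC_DEBUG; this may not be exactly how the logs were collected here) is dynamic debug:

# echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
# echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control

The messages then show up in dmesg / /var/log/messages on the client, as in the excerpt below.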

Analyzing the logs

In the cephfs client log, the following sequence repeats in a loop:

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231008] ceph_sock_data_ready on ffff88202c987030 state = 5, queueing work

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231012] get_osd ffff88202c987000 3 -> 4

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231015] queue_con_delay ffff88202c987030 0

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231034] try_read start on ffff88202c987030 state 5

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231035] try_read tag 1 in_base_pos 0

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231036] try_read got tag 8

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231036] prepare_read_ack ffff88202c987030

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231041] ceph_sock_data_ready on ffff88202c987030 state = 5, queueing work

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231041] get_osd ffff88202c987000 4 -> 5

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231053] queue_con_delay ffff88202c987030 0

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231062] got ack for seq 1 type 42 at ffff881008668100

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231066] ceph_msg_put ffff881008668100 (was 2)

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231081] prepare_read_tag ffff88202c987030

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231083] try_read start on ffff88202c987030 state 5

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231086] try_read tag 1 in_base_pos 0

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231088] try_read got tag 7

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231088] prepare_read_message ffff88202c987030

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231089] read_partial_message con ffff88202c987030 msg (null)

Jul 6 11:30:29 cephfs-client kernel: [ 2010.231090] try_read done on ffff88202c987030 ret -5

This shows that the cephfs client keeps trying to read the object data from the OSD, but when the reply arrives it sees the msg as null and the read fails (try_read returns -5, i.e. -EIO).

In the ceph osd log, the following repeats in a loop:

2017-07-06 11:30:29.185056 7f1788eb2700 1 osd.20 188 ms_handle_reset con 0x7f1748c39d80 session 0x7f170f4f31c0

2017-07-06 11:30:29.185683 7f17719fc700 10 osd.20 188 new session 0x7f174d435380 con=0x7f1748c39f00 addr=10.10.1.18:0/144324133

2017-07-06 11:30:29.185839 7f17719fc700 20 osd.20 188 should_share_map client.24357 10.10.1.18:0/144324133 188

2017-07-06 11:30:29.185854 7f17719fc700 15 osd.20 188 enqueue_op 0x7f174d42f100 prio 127 cost 0 latency 0.000047 osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4

2017-07-06 11:30:29.185891 7f1777bff700 10 osd.20 188 dequeue_op 0x7f174d42f100 prio 127 cost 0 latency 0.000085 osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4 pg pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean]

2017-07-06 11:30:29.185937 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] handle_message: 0x7f174d42f100

2017-07-06 11:30:29.185951 7f1777bff700 20 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] op_has_sufficient_caps pool=1 (cephfs_data ) owner=0 need_read_cap=1 need_write_cap=0 need_class_read_cap=0 need_class_write_cap=0 -> yes

2017-07-06 11:30:29.185970 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_op osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4 may_read -> read-ordered flags ondisk+retry+read+known_if_redirected

2017-07-06 11:30:29.185990 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] get_object_context: found obc in cache: 0x7f171e423180

2017-07-06 11:30:29.185997 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] get_object_context: 0x7f171e423180 1:c249fbd8:::1000000095b.00000000:head rwstate(none n=0 w=0) oi: 1:c249fbd8:::1000000095b.00000000:head(188'4499 mds.0.20:66259 dirty|data_digest|omap_digest s 67108864 uv 4499 dd cdba94a2 od ffffffff) ssc: 0x7f172dcfce40 snapset: 1=[]:[]+head

2017-07-06 11:30:29.186010 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] find_object_context 1:c249fbd8:::1000000095b.00000000:head @head oi=1:c249fbd8:::1000000095b.00000000:head(188'4499 mds.0.20:66259 dirty|data_digest|omap_digest s 67108864 uv 4499 dd cdba94a2 od ffffffff)

2017-07-06 11:30:29.186026 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] execute_ctx 0x7f1776beed00

2017-07-06 11:30:29.186035 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_op 1:c249fbd8:::1000000095b.00000000:head [read 0~67108864 [1@-1]] ov 188'4499

2017-07-06 11:30:29.186041 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] taking ondisk_read_lock

2017-07-06 11:30:29.186047 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_osd_op 1:c249fbd8:::1000000095b.00000000:head [read 0~67108864 [1@-1]]

2017-07-06 11:30:29.186056 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_osd_op read 0~67108864 [1@-1]

The analysis shows that the cephfs client sends a 64 MB read request, the OSD reads the 64 MB of object data and returns it, but the cephfs client treats the reply msg as null, tears down the connection and retries (hence the ms_handle_reset and RETRY=6 seen in the osd log).
Because the same request keeps being repeated, the cephfs client's IO hangs.

Filing the bug

On the cephfs client side, the maximum size of a read message is limited to 16 MB.

In net/ceph/messenger.c:

static int read_partial_message(struct ceph_connection *con)
{
	...
	front_len = le32_to_cpu(con->in_hdr.front_len);
	if (front_len > CEPH_MSG_MAX_FRONT_LEN)
		return -EIO;
	middle_len = le32_to_cpu(con->in_hdr.middle_len);
	if (middle_len > CEPH_MSG_MAX_MIDDLE_LEN)
		return -EIO;
	data_len = le32_to_cpu(con->in_hdr.data_len);
	if (data_len > CEPH_MSG_MAX_DATA_LEN)
		return -EIO;

So when using cephfs, we must not configure a file's stripe_unit to be larger than 16 MB.
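
As a practical guard on the test side (a sketch; the directory path is made up, and 16777216 bytes = 16 MB), the layouts that new test files inherit can be capped at that limit on their parent directory:

# setfattr -n ceph.dir.layout.object_size -v 16777216 /home/yangguanjun/cephfs/dir_safe
# setfattr -n ceph.dir.layout.stripe_unit -v 16777216 /home/yangguanjun/cephfs/dir_safe
# getfattr  -n ceph.dir.layout               /home/yangguanjun/cephfs/dir_safe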
