Profiling Ceph memcpy Usage
Table of contents
- Profiling Ceph memcpy Usage
- Introduction
- Placing the uprobe on memcpy
- Simple Python Perf script to analyse memcpy sizes
- Distribution of memcpy sizes
- Steps to analyse `std::__ostream_insert<char, std::char_traits<char> >`
- Locating ceph-osd callers that lead to size 1 memcpy
- Being tricked by perf
- Flame graph of call stack for size 1 memcpy's
- Actual hot memcpy's
Introduction
This page was originally a brain dump of my investigation into Ceph memcpy size distributions. I've kept the content mostly as-is (after this introduction section).
The takeaway messages from this page are essentially:
- Yes, Ceph performs a lot of memcpy's of size 1, ... but
- the only way these really crop up is via a probe point rather than via natural PC sampling, ... meaning
- they are not likely worth worrying about :-(.
This page should hopefully serve as a warning to those who rely too heavily on probes and fail to measure actual CPU impact.
Placing the uprobe on memcpy
First find out where in the C library the memcpy resides:
$ nm -D /lib64/libc.so.6 | grep memcpy
0000000000084fc0 T memcpy
00000000000ebaf0 T __memcpy_chk
000000000008c70c W wmemcpy
00000000000ed41c T __wmemcpy_chk
Now place a uprobe:
# echo "p:memcpy /lib64/libc.so.6:0x84fc0 len=%x2" > /sys/kernel/debug/tracing/uprobe_events
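The 0x84fc0 offset in the probe definition is simply the symbol address from the nm output above. Deriving it programmatically can be sketched as follows (the parsing assumes the standard `ADDRESS TYPE NAME` format of `nm -D` output; the `%x2` register holding the length is taken from the probe definition above):

```python
# Derive the uprobe offset for a symbol from `nm -D` style output.
# Assumes the standard "ADDRESS TYPE NAME" line format shown above.
def find_symbol_offset(nm_output, symbol):
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == symbol:
            return int(parts[0], 16)
    return None

nm_output = """0000000000084fc0 T memcpy
00000000000ebaf0 T __memcpy_chk"""

offset = find_symbol_offset(nm_output, "memcpy")
# Emit the uprobe definition used above.
print("p:memcpy /lib64/libc.so.6:0x%x len=%%x2" % offset)
```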
The uprobe is recorded via:
perf record -e uprobes:memcpy -a -g sleep 20
This gives twenty seconds of recording system-wide. Processes can be analysed via a perf script.
Simple Python Perf script to analyse memcpy sizes
from __future__ import division
import os
import sys

import collections
import string

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
    '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *


sizes = dict()

sizeones = dict()

def trace_begin():
    print "in trace_begin"

def trace_end():
    print "in trace_end\n\n"

    for key in sizes.keys():
        print "\n\nTop 10 memcpy sizes of %s:" % (key)
        total = len(sizes[key])

        for pair in collections.Counter(sizes[key]).most_common(10):
            print "%s\t%d\t(%f%%)" % (pair[0], pair[1], 100.0*pair[1]/total)

    for key in sizeones.keys():
        print "\n\nTop 10 memcpy size 1 callers of %s:" % (key)

        for pair in collections.Counter(sizeones[key]).most_common(10):
            print "%d\t%s" % (pair[1], pair[0])

def uprobes__memcpy(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, __probe_ip, len):

    if (common_comm not in sizes):
        sizes[common_comm] = list()

    sizes[common_comm].append(len)

    if (len == 1):
        node = common_callchain[1]
        if ('sym' in node):
            caller = format("[%x] %s %s" % (node['ip'], node['dso'], node['sym']['name']))
        else:
            caller = format("[%x]" % (node['ip']))

        if (common_comm not in sizeones):
            sizeones[common_comm] = list()

        sizeones[common_comm].append(caller)

def trace_unhandled(event_name, context, event_fields_dict):
    print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])
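As a sanity check, the Counter-based summary the script prints can be exercised outside of perf; a standalone sketch using made-up sizes rather than real trace data:

```python
import collections

# Hypothetical recorded memcpy sizes for one process (not real data).
sizes = [1, 1, 1, 4, 4, 8, 1, 2]
total = len(sizes)
top = collections.Counter(sizes).most_common(3)

# Same "size  count  (percent)" layout the perf script emits.
for size, count in top:
    print("%d\t%d\t(%f%%)" % (size, count, 100.0 * count / total))
```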
Distribution of memcpy sizes
These were obtained via the uprobe on memcpy.
Top 10 memcpy sizes of ceph-mds:
1      3349    (30.542636%)
4      1411    (12.868217%)
8      1007    (9.183767%)
2      663     (6.046512%)
46     350     (3.191974%)
6      313     (2.854537%)
5      307     (2.799818%)
85     280     (2.553580%)
3      270     (2.462380%)
80     231     (2.106703%)

Top 10 memcpy sizes of sshd:
4      80      (30.769231%)
16     40      (15.384615%)
100    40      (15.384615%)
76     38      (14.615385%)
6      20      (7.692308%)
60     20      (7.692308%)
40     19      (7.307692%)
231    2       (0.769231%)
8      1       (0.384615%)

Top 10 memcpy sizes of perf:
16     1       (20.000000%)
4      1       (20.000000%)
30     1       (20.000000%)
6      1       (20.000000%)
5      1       (20.000000%)

Top 10 memcpy sizes of chronyd:
24     5       (50.000000%)
128    2       (20.000000%)
72     2       (20.000000%)
192    1       (10.000000%)

Top 10 memcpy sizes of systemd:
10     1       (25.000000%)
39     1       (25.000000%)
54     1       (25.000000%)
7      1       (25.000000%)

Top 10 memcpy sizes of lt-rados:
1      6138    (35.294118%)
4      3350    (19.262837%)
8      1549    (8.906906%)
2      1025    (5.893853%)
16     890     (5.117590%)
3      556     (3.197056%)
7      442     (2.541544%)
13     437     (2.512794%)
6      329     (1.891783%)
5      314     (1.805532%)

Top 10 memcpy sizes of ceph-mon:
1      56354   (28.211418%)
4      34085   (17.063317%)
8      20027   (10.025731%)
3      9962    (4.987084%)
2      9248    (4.629648%)
6      7521    (3.765093%)
5      6256    (3.131821%)
14     4419    (2.212199%)
80     4063    (2.033981%)
9      3999    (2.001942%)

Top 10 memcpy sizes of ceph-osd:
1      354007  (39.020946%)
4      115021  (12.678360%)
2      109050  (12.020198%)
8      51911   (5.721967%)
3      40176   (4.428459%)
6      36285   (3.999568%)
80     25808   (2.844725%)
5      19347   (2.132552%)
7      18729   (2.064432%)
11     15813   (1.743011%)

Top 10 memcpy sizes of sleep:
12     14      (73.684211%)
10     2       (10.526316%)
1      1       (5.263158%)
4      1       (5.263158%)
6      1       (5.263158%)

Top 10 memcpy size 1 callers of lt-rados:
2716   [98db8]  /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert<char, std::char_traits<char> >
1666   [1df68]  /usr/lib64/libsoftokn3.so sftk_AddAttributeType
522    [8ce30]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long>
245    [48984]  /home/steven/ceph/src/.libs/librados.so.2.0.0 ceph::buffer::list::append
198    [8c868]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
125    [266944] /home/steven/ceph/src/.libs/librados.so.2.0.0 Pipe::tcp_read_nonblocking
122    [8d554]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long long>
77     [1c7298] /home/steven/ceph/src/.libs/librados.so.2.0.0 request_redirect_t::decode
77     [1c72a8] /home/steven/ceph/src/.libs/librados.so.2.0.0 request_redirect_t::decode
77     [229b98] /home/steven/ceph/src/.libs/librados.so.2.0.0 pg_t::decode

Top 10 memcpy size 1 callers of ceph-osd:
231102 [98db8]  /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert<char, std::char_traits<char> >
39477  [8c868]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
35877  [8ce30]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long>
19712  [6ccbf8] /home/steven/ceph/src/ceph-osd ceph::buffer::list::append
6734   [1df68]  /usr/lib64/libsoftokn3.so sftk_AddAttributeType
4496   [8ccb0]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::do_put
1496   [8d554]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long long>
883    [68e628] /home/steven/ceph/src/ceph-osd ghobject_t::decode
883    [68e770] /home/steven/ceph/src/ceph-osd ghobject_t::decode
883    [68e6c8] /home/steven/ceph/src/ceph-osd ghobject_t::decode

Top 10 memcpy size 1 callers of sleep:
1      [2a698]  /usr/lib64/libc-2.20.so _nl_find_locale

Top 10 memcpy size 1 callers of ceph-mds:
1403   [98db8]  /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert<char, std::char_traits<char> >
352    [8c868]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
351    [8ce30]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long>
241    [557f78] /home/steven/ceph/src/ceph-mds ceph::buffer::list::append
214    [8dde0]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::do_put
211    [8dc1c]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>
196    [1df68]  /usr/lib64/libsoftokn3.so sftk_AddAttributeType
77     [8ccb0]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::do_put
70     [43989c] /home/steven/ceph/src/ceph-mds DecayCounter::decode
70     [4397fc] /home/steven/ceph/src/ceph-mds DecayCounter::decode

Top 10 memcpy size 1 callers of ceph-mon:
29454  [98db8]  /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert<char, std::char_traits<char> >
9161   [8ce30]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long>
5614   [1df68]  /usr/lib64/libsoftokn3.so sftk_AddAttributeType
5226   [8c868]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
2473   [3cc1d0] /home/steven/ceph/src/ceph-mon ceph::buffer::list::append
463    [455ef4] /home/steven/ceph/src/ceph-mon Pipe::tcp_read_nonblocking
387    [1948e8] /home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
387    [194944] /home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
387    [1948f8] /home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
385    [8ccb0]  /usr/lib64/libstdc++.so.6.0.20 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::do_put
Steps to analyse `std::__ostream_insert<char, std::char_traits<char> >`
$ addr2line -f -e /usr/lib64/libstdc++.so.6.0.20 0x98db8
__ostream_write<char, std::char_traits<char> >
/usr/src/debug/gcc-4.9.2-20141101/obj-aarch64-redhat-linux/aarch64-redhat-linux/libstdc++-v3/include/bits/ostream_insert.h:51
This leads to:
template<typename _CharT, typename _Traits>
  inline void
  __ostream_write(basic_ostream<_CharT, _Traits>& __out,
                  const _CharT* __s, streamsize __n)
  {
    typedef basic_ostream<_CharT, _Traits>       __ostream_type;
    typedef typename __ostream_type::ios_base    __ios_base;

    const streamsize __put = __out.rdbuf()->sputn(__s, __n);
    if (__put != __n)                            // <=== line 51
      __out.setstate(__ios_base::badbit);
  }
So we need to investigate sputn().
This leads to:
streamsize
sputn(const char_type* __s, streamsize __n)
{ return this->xsputn(__s, __n); }
Let's investigate `__streambuf_type::xsputn`. This leads to:
template<typename _CharT, typename _Traits>
  streamsize
  basic_streambuf<_CharT, _Traits>::
  xsputn(const char_type* __s, streamsize __n)
  {
    streamsize __ret = 0;
    while (__ret < __n)
      {
        const streamsize __buf_len = this->epptr() - this->pptr();
        if (__buf_len)
          {
            const streamsize __remaining = __n - __ret;
            const streamsize __len = std::min(__buf_len, __remaining);
            traits_type::copy(this->pptr(), __s, __len);
            __ret += __len;
            __s += __len;
            this->__safe_pbump(__len);
          }

        if (__ret < __n)
          {
            int_type __c = this->overflow(traits_type::to_int_type(*__s));
            if (!traits_type::eq_int_type(__c, traits_type::eof()))
              {
                ++__ret;
                ++__s;
              }
            else
              break;
          }
      }
    return __ret;
  }
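Note that xsputn only ever issues a traits_type::copy (and hence a memcpy) of min(buffer space, remaining) bytes, so streaming one character at a time can only ever produce size 1 copies. A rough Python model of the chunking loop (buffer capacities here are made-up numbers, and overflow() is simplified to consume one character and then empty the buffer):

```python
def xsputn_copy_sizes(n, free_space):
    """Model of basic_streambuf::xsputn: record the size of each
    traits_type::copy() call made while writing n chars into a put
    area with free_space bytes available."""
    copies = []
    written = 0
    while written < n:
        length = min(free_space, n - written)
        if length:
            copies.append(length)   # traits_type::copy -> memcpy
            written += length
            free_space -= length
        else:
            written += 1            # overflow() consumes one char...
            free_space = 64         # ...and (assumed) empties the buffer
    return copies

# An `os << ' '` style single-character insertion produces one size-1 copy:
print(xsputn_copy_sizes(1, 64))  # [1]
```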
Let's investigate traits_type::copy.
static char_type*
copy(char_type* __s1, const char_type* __s2, size_t __n)
{ return static_cast<char_type*>(__builtin_memcpy(__s1, __s2, __n)); }
Locating ceph-osd callers that lead to size 1 memcpy
Here is the output:
Top ten ceph callers leading to memcpy size 1
19712  [6ccbf8] ceph::buffer::list::append
13346  [334f04] operator<<
9571   [25e724] operator<< <int>
7426   [308410] operator<<
7426   [3083fc] operator<<
6734   [5a85f4] nss_aes_operation
6657   [68e2b0] operator<<
6647   [68e33c] operator<<
6635   [68e2d0] operator<<
6577   [68e368] operator<<
Investigating first sample:
$ addr2line -Cf -e /home/steven/ceph/src/ceph-osd 0x6ccbf4
Objecter::_session_linger_op_remove(Objecter::OSDSession*, Objecter::LingerOp*)
/home/steven/ceph/src/osdc/Objecter.cc:2570
Which gives the following function:
void Objecter::_session_linger_op_remove(OSDSession *from, LingerOp *op)
{
  assert(from == op->session);
  assert(from->lock.is_locked());

  if (from->is_homeless()) {
    num_homeless_ops.dec();
  }

  from->linger_ops.erase(op->linger_id);
  put_session(from);
  op->session = NULL;

  ldout(cct, 15) << __func__ << " " << from->osd << " " << op->linger_id << dendl;
}
Looks like this is due to debugging messages!
Being tricked by perf
The ip's need to have 0x400000 added to them for ceph-osd (a perf bug?).... Let's try again....
$ addr2line -Cf -e ./src/ceph-osd 0xaccbf8
ceph::buffer::ptr::end() const
/home/steven/ceph/src/./include/buffer.h:207
$ addr2line -Cf -e ./src/ceph-osd 0xaccbf4
ceph::buffer::list::append(char const*, unsigned int)
/home/steven/ceph/src/common/buffer.cc:1310
This leads to:
void buffer::list::append(const char *data, unsigned len)
{
  while (len > 0) {
    // put what we can into the existing append_buffer.
    unsigned gap = append_buffer.unused_tail_length();
    if (gap > 0) {
      if (gap > len) gap = len;
      //cout << "append first char is " << data[0] << ", last char is " << data[len-1] << std::endl;
      append_buffer.append(data, gap);  // <--- this is line 1310
      append(append_buffer, append_buffer.end() - gap, gap);  // add segment to the list
      len -= gap;
      data += gap;
    }
    if (len == 0)
      break;  // done!

    // make a new append_buffer!
    unsigned alen = CEPH_PAGE_SIZE * (((len-1) / CEPH_PAGE_SIZE) + 1);
    append_buffer = create_page_aligned(alen);
    append_buffer.set_length(0);   // unused, so far.
  }
}
I think the problem could be attributed to:
void buffer::ptr::append(const char *p, unsigned l)
{
  assert(_raw);
  assert(l <= unused_tail_length());
  memcpy(c_str() + _len, p, l);
  _len += l;
}
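When symbolizing addresses from the script output, the fix-up above can be applied before calling addr2line. A minimal sketch, assuming the 0x400000 load offset observed for this particular ceph-osd binary (the offset is build-specific, not a general constant):

```python
# Observed gap between perf's reported ip and the on-disk address for
# this particular ceph-osd binary (see the addr2line check above).
LOAD_OFFSET = 0x400000

def file_address(perf_ip, load_offset=LOAD_OFFSET):
    """Translate an ip reported by perf into the address addr2line expects."""
    return perf_ip + load_offset

# The top ceph::buffer::list::append sample from the caller table:
print(hex(file_address(0x6ccbf8)))  # 0xaccbf8
```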
Flame graph of call stack for size 1 memcpy's
A flame graph of size 1 memcpy call stacks can be found here: ceph-size1.svg
And the inverted graph here: ceph-size1-rev.svg
The stack data was generated via the following Python script:
import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
    '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *


def trace_begin():
    pass

def trace_end():
    for callframe in callframes.keys():
        print "%s %d" % (callframe, callframes[callframe])

callframes = dict()

def uprobes__memcpy(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, __probe_ip, len):

    if (len != 1):
        return

    callframe = common_comm

    for node in reversed(common_callchain):
        callframe += ';'

        if 'sym' in node:
            callframe += node['sym']['name']
        else:
            callframe += '[unknown]'

    if (callframe not in callframes.keys()):
        callframes[callframe] = 1
    else:
        callframes[callframe] += 1

def trace_unhandled(event_name, context, event_fields_dict):
    print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])
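Each line the script prints is a folded stack (process name plus outermost-to-innermost frames, separated by semicolons), which is the input format flamegraph.pl consumes. The folding logic in isolation, applied to a hypothetical call chain:

```python
def fold_stack(comm, callchain):
    """Build a flamegraph.pl folded-stack key: process name followed by
    the call chain from outermost to innermost frame, ';'-separated."""
    frames = [comm]
    for node in reversed(callchain):
        if 'sym' in node:
            frames.append(node['sym']['name'])
        else:
            frames.append('[unknown]')
    return ';'.join(frames)

# Hypothetical call chain as perf delivers it (innermost frame first).
chain = [{'sym': {'name': 'memcpy'}},
         {'sym': {'name': 'append'}},
         {'ip': 0x1234}]
print(fold_stack('ceph-osd', chain))  # ceph-osd;[unknown];append;memcpy
```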
Actual hot memcpy's
Analysing via a uprobe has one significant disadvantage that I overlooked: it gives us the proportion of all memcpy's that are of size 1. It does not take into account which memcpy's are expensive or cheap, i.e. which memcpy's are taking up the most time.
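A cheap way to see the difference is to weight each size bucket by bytes moved rather than by call count. Using the ceph-osd histogram from above (bytes moved is only a rough proxy for time, since per-call overhead is ignored):

```python
# Top ceph-osd size buckets from the uprobe histogram above: size -> count.
ceph_osd = {1: 354007, 4: 115021, 2: 109050, 8: 51911, 3: 40176,
            6: 36285, 80: 25808, 5: 19347, 7: 18729, 11: 15813}

by_count = max(ceph_osd, key=ceph_osd.get)               # most frequent size
by_bytes = max(ceph_osd, key=lambda s: s * ceph_osd[s])  # most bytes moved

print(by_count)  # size 1 dominates the number of calls...
print(by_bytes)  # ...but size 80 moves the most bytes
```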
If we look at the call graph for the hot memcpy calls we get:
10.55%  ceph-osd  /usr/lib64/libc-2.20.so  0x8503c  B [.] memcpy
        |
        ---memcpy
           |
           |--98.83%-- ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)
           |           FileJournal::align_bl(long, ceph::buffer::list&)
           |           FileJournal::do_write(ceph::buffer::list&)
           |           FileJournal::write_thread_entry()
           |           FileJournal::Writer::entry()
           |           Thread::entry_wrapper()
           |           start_thread
           |           thread_start
I.e. the hot memcpy's come from buffer management rather than single-character log appends.