Value of one of the thread-argument variable has changed at some place, which is only detected when the program crashes.
(gdb) p *a
$7 = {
host = 0x0,
port = 0,
h_ent = 0x7fff5fbf6f10,
request = 0x100005620 "UH??H??? ",
reqlen = 140734799803600,
timeout = 0x100004e56,
s = 0x100281000,
cnt = 1606381568,
sargs = 0x1001000e0
}
Look at the garbage.
So, watchpoint.
watch a
isn't helpful at all, since there is no write to the variable a. Have to use memory location watch, which is
(gdb) watch -location a
Hardware watchpoint 7: *(struct args **) 4297592504
(GDB shipped with OS X doesn't recognize watch -l (as specified in the help message).
Then the debugger kindly pauses my program at
(gdb) c
Continuing.
Hardware watchpoint 6 deleted because the program has left the block
in which its expression is valid.
READ: 8192, TOTAL: 8192
Hardware watchpoint 7: *(struct args **) 4297592504
Old value = (struct args *) 0x7fff5fbf6f40
New value = (struct args *) 0x7fff5fbf6f00
0x0000000100002641 in parse_response (buf=0x10027eeb8 "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"..., len=8192, a=0x7fff5fbf7000) at md.c:512
512 nextproc[len] = 0;
(gdb) bt
#0 0x0000000100002641 in parse_response (buf=0x10027eeb8 "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"..., len=8192, a=0x7fff5fbf7000) at md.c:512
#1 0x0000000100005c10 in recv_engine (p=0x7fff5fbf6f40) at engine.c:394
#2 0x00007fff88c637a2 in _pthread_start ()
#3 0x00007fff88c501e1 in thread_start ()
(gdb) info locals
readlen = 8192
sfd = 3
i = 0
n = 3
quit = 0
_i = 0
a = (struct args *) 0x7fff5fbf6f00
buf = "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"...
go_loop = 1
total = 8192
error_exit = 0
out_rwx.86 = (int (*)[1]) 0x10027ee20
(gdb) l parse_response
502
503 return bytes_processed - bufsz_orig;
504 }
505
506 int // return 0 for normal, 1 for all done, -1 for error
507 parse_response (char *buf, long len, struct session_args *a) {
508 // responsible for parsing the status line
509 // and handle the rest to different parse sub routines
510 char *nextinput = buf;
511 char *nextproc = buf;
(gdb) l
512 nextproc[len] = 0;
(gdb) l engine.c:366
361 || a->sargs[i].rtt_timer[0].tv_usec) {
362 // we only pause for the first time
363 timer_pause (a->sargs[i].rtt_timer);
364 }
365 //printf ("Read %d\n", sfd);
366 int readlen = recv (sfd, buf, sizeof(buf) - 1, 0);
367 if (readlen == 0) {
368 dprintf ("READ 0 <%d>\n", a->s[i]);
369 close (a->s[i]);
370 if (a->sargs[i].repeat_recv > 0) {
Everything is just crystal clear now. I did an recv using the whole buffer, at line 366, then passed to parse_response, which expected one more byte at end of buffer so I could append a '\0' for easier string operations (line 512), of which location overruns into &a, the variable that shouldn't have been changed.
It's weird that valgrind didn't detect this. Maybe there is a stack checking tool in experimental stage I remember but this definitely is not in the memcheck tool.
Anyway, looks like it's working again now.
A few more insights:
1. this bug didn't reveal until I cleared out some recursion code, tests the code with multiple threads, on a line that has delays and drops.
2. GCC built binary on LInux didn't gave me trouble. LLVM on OS X shows up problems 8/10 times.
3. Read the code, make asserts base on assumptions. I did this in the parsing module, not in the boundary of networking and parsing modules.
4. watch -location is useful.
5. Still didn't figure out what is "-fstack-protector" for in gcc / clang
Update 0424:
Using valgrind exp-sgcheck
==25634== Invalid read of size 1
==25634== at 0x4C2C008: index (h_intercepts.c:89)
==25634== by 0x402BB6: parse_response (md.c:552)
==25634== by 0x403A5F: test_parse (md.c:896)
==25634== by 0x403CAF: main (md.c:951)
==25634== Address 0x7ff000430 expected vs actual:
==25634== Expected: stack array "buf" of size 8,192 in frame 2 back from here
==25634== Actual: unknown
==25634== Actual: is 0 after Expected
Valgrind sgcheck found one occurrence, which is a few accesses after the one bug I found using GDB as shown above. But it could still be useful sometimes maybe.