I need to detect direct and indirect recursion in a rather large (5-15,000) set of C (not C++) files.
The files are already preprocessed.
The code is pretty "old school" for safety reasons so no fancy things like function pointers in there, only functions that pass variables about and some function-macros that do the same.
The most natural way to detect recursion is to make a directed call-graph, considering each function a node with an edge going to all the other functions that it calls. If the graph has any cycles, then we have recursion.
A regex to find function calls is trivial to make but I also need to know which function did the calling.
PyCParser was nice, but it complains about many things that are irrelevant to my use case, such as undefined variables, or typedefs whose source type is undefined or defined in a different file. The project uses a custom dependency management system, so some includes and the like are added automatically. I would need PyCParser to ignore everything other than FuncCall and FuncDef nodes, and I don't think there is a way to limit the parsing process itself to just that.
I would rather not implement a parser, as I do not have the time to learn how to do that in Python and then implement the solution.
Back to the issue: how would I go about parsing the functions in a C file? Basically, I want a dict with strings (the names of the functions defined in the file) as keys and lists of strings (the functions called by each function) as values. A regex seems to be the most natural solution.
Using Python is not optional, sadly.
Solution
Why not just use objdump on your compiled code then parse the generated assembly to build your graph?
test1.c file:
extern void test2();

void test1()
{
    test2();
}
test2.c file:
extern void test1();

void test2()
{
    test1();
}

int main()
{
    test2();
}
Now build it:
gcc -g test1.c test2.c -o myprog
Now disassemble it:
objdump -d myprog > myprog.asm
Look up all function calls with a couple of simple regexes while keeping track of the function you are currently in. A sample of the disassembly shows how easy it should be:
00401630 <_test1>:
401630: 55 push %ebp
401631: 89 e5 mov %esp,%ebp
401633: 83 ec 08 sub $0x8,%esp
401636: e8 05 00 00 00 call 401640 <_test2>
40163b: c9 leave
40163c: c3 ret
40163d: 90 nop
40163e: 90 nop
40163f: 90 nop
00401640 <_test2>:
401640: 55 push %ebp
401641: 89 e5 mov %esp,%ebp
401643: 83 ec 08 sub $0x8,%esp
401646: e8 e5 ff ff ff call 401630 <_test1>
40164b: c9 leave
40164c: c3 ret
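Then use Python to post-process the disassembly and build a dictionary mapping each function to the set of functions it calls:
import re
import collections

# Maps each function name to the set of functions it calls.
calldict = collections.defaultdict(set)
# A direct call, e.g. "401636: e8 05 00 00 00 call 401640 <_test2>"
# (the mnemonic may be "callq" on x86-64; any "+offset" suffix is dropped).
callre = re.compile(r"\scallq?\s+[0-9a-f]+\s+<([^>+]+)")
# A function header, e.g. "00401630 <_test1>:"
funcre = re.compile(r"^[0-9a-f]+\s+<([^>]+)>:")
current_function = ""

with open("myprog.asm") as f:
    for l in f:
        m = funcre.match(l)
        if m:
            # Entering a new function: remember its name.
            current_function = m.group(1)
        else:
            m = callre.search(l)
            if m:
                called = m.group(1)
                calldict[current_function].add(called)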
I didn't write the full graph search, but you can detect "ping-pong" recursion (two functions that call each other) with some simple code like:
for function, called_set in calldict.items():
    for called in called_set:
        callset = calldict.get(called)
        if callset and function in callset:
            print(function, called)
which gives me:
_test2 _test1
_test1 _test2
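The ping-pong check only catches cycles of length two. For longer chains (a calls b, b calls c, c calls a), one option is to compute the strongly connected components of the call graph. The following is a minimal sketch, not part of the original answer: it uses an iterative form of Tarjan's algorithm so that deep call chains do not hit Python's recursion limit, and it assumes a mapping shaped like calldict (function name to set of callees). The function name recursive_groups is made up for illustration.

```python
def recursive_groups(calldict):
    """Return a list of sets; each set is a group of functions that are
    mutually recursive, or a single function that calls itself.

    Hypothetical helper: expects a mapping of name -> set of callee names.
    """
    index = {}        # discovery order of each visited function
    lowlink = {}      # smallest index reachable from each function
    stack = []        # Tarjan stack of functions in the current search
    on_stack = set()
    groups = []
    counter = 0

    for root in list(calldict):
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        # Explicit work stack of (function, iterator over its callees).
        work = [(root, iter(calldict.get(root, ())))]
        while work:
            node, callees = work[-1]
            descended = False
            for callee in callees:
                if callee not in index:
                    # First visit: descend into the callee.
                    index[callee] = lowlink[callee] = counter
                    counter += 1
                    stack.append(callee)
                    on_stack.add(callee)
                    work.append((callee, iter(calldict.get(callee, ()))))
                    descended = True
                    break
                elif callee in on_stack:
                    # Back edge into the current search path.
                    lowlink[node] = min(lowlink[node], index[callee])
            if descended:
                continue
            # All callees of node processed: propagate lowlink to the caller.
            work.pop()
            if work:
                parent = work[-1][0]
                lowlink[parent] = min(lowlink[parent], lowlink[node])
            if lowlink[node] == index[node]:
                # node is the root of a strongly connected component.
                component = set()
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.add(member)
                    if member == node:
                        break
                # Keep only components that contain a cycle.
                if len(component) > 1 or node in calldict.get(node, ()):
                    groups.append(component)
    return groups


if __name__ == "__main__":
    example = {"_test1": {"_test2"}, "_test2": {"_test1"}, "_main": {"_test2"}}
    for group in recursive_groups(example):
        print(sorted(group))
```

Every function that appears in one of the returned sets takes part in direct or indirect recursion; functions that only call into such a group (like _main in the example data) are not reported.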
This symbol/asm analysis technique is also used in callcatcher to detect unused C functions (which is just as easy to do here: look for keys that do not appear in any of the sets, after filtering out compiler-generated symbols).
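A rough sketch of that unused-function check, under a few assumptions: the helper name uncalled_functions and its ignore parameter are hypothetical, the entry points to skip are guessed as main and _main, and only the keys of the mapping are examined. Functions that make no calls at all never become keys, so they would have to be collected separately from the function header lines of the disassembly.

```python
def uncalled_functions(calldict, ignore=("main", "_main")):
    """Return the sorted keys of calldict that no function calls.

    Hypothetical helper: 'ignore' lists entry points and compiler symbols
    that are expected to have no caller (assumed names, adjust as needed).
    """
    # Union of every callee set gives all functions that are called somewhere.
    called = set().union(*calldict.values())
    return sorted(
        name for name in calldict
        if name not in called and name not in ignore
    )


if __name__ == "__main__":
    example = {
        "_main": {"_test2"},
        "_test1": {"_test2"},
        "_test2": {"_test1"},
        "_helper": {"_test1"},  # made-up function that nothing calls
    }
    print(uncalled_functions(example))
```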