CScout: A Refactoring Browser for C
Athens University of Economics and Business
Department of Management Science and Technology
Patision 76, GR-104 34 Athens, Greece
Abstract
Despite its maturity and popularity, the C programming language still lacks tool support for reliably performing even simple refactoring, browsing, or analysis operations. This is primarily due to identifier scope complications introduced by the C preprocessor. The CScout refactoring browser analyses complete program families by tagging the original identifiers with their precise location and classifying them into equivalence classes orthogonal to the C language's namespace and scope extents. A web-based user interface provides programmers with an intuitive source code analysis and navigation front-end, while an SQL-based backend allows more complex source code analysis and manipulation. CScout has been successfully applied to many medium and large proprietary and open source projects, identifying thousands of modest refactoring opportunities.
Keywords: C, browser, refactoring, preprocessor
1 Introduction
C remains the language of choice for developing systems applications, such as operating systems and databases, embedded software, and the majority of open-source projects [44, p. 16]. Despite the language's popularity, tool support for performing even simple refactoring, browsing, or analysis operations is currently lacking. Programmers typically resort to using either simplistic text-based operations that fail to capture the language's semantics, or work on the results of the compilation and linking phase that, due to the effects of preprocessing, do not correctly reflect the original source code. Interestingly, many of the tools in a C programmer's arsenal were designed in the 1970s, and fail to take advantage of the CPU speed and memory capacity of a modern workstation.
In this paper we describe how the CScout refactoring browser, running on a powerful workstation, can be used to accurately analyze, browse, and refactor large program families written in C. The theory behind CScout's operation is described in detail elsewhere; this paper focuses on the tool's design, implementation, and application. CScout can process program families consisting of multiple related projects (we define a project as a collection of C source files that are linked together), correctly handling most of the complexity introduced by the C preprocessor. CScout takes advantage of modern hardware (fast processors, large address spaces, and big memory capacities) to analyze C source code beyond the level of detail and accuracy provided by current IDEs, compilers, and linkers. Specifically, CScout's analysis takes into account both the identifier scopes introduced by the C preprocessor and the C language proper scopes and namespaces.
The objective of this paper is to provide a tour of CScout by describing the domain's challenges, the operation of CScout and its interfaces, the system's design and implementation, and details of CScout's application to a number of large software projects. The main contributions of this paper are the illustration of the types of problems occurring in the analysis of real-life C source code and the types of refactorings that can be achieved, the demonstration through the application of CScout to a number of systems that accurate large-scale analysis of C code is in fact possible, and a discussion of lessons associated with the construction of browsers and refactoring tools for languages, like C and C++, that involve a preprocessing step.
2 Problem Statement
Many features of the C language hinder the precise analysis of programs written in it and complicate the design of corresponding reasoning algorithms. The most important culprits are unrestricted pointers, aliasing, arbitrary type casts, non-local jumps, an underspecified build environment, and the C preprocessor. All features but the last two limit our ability to reason about the runtime behavior of programs (see e.g. the article and the references therein). Significantly, the C preprocessor and a compilation environment based on loosely-coupled tools, like make and a language-agnostic linker, also restrict programmers from performing even supposedly trivial operations such as determining the scope of a variable, the type of an identifier, or the extent of a module.
2.1 Preprocessor Complications
In summary, preprocessor macros complicate the notion of scope and the notion of an identifier [11, 4, 45]. For one, macros and file inclusion create their own scopes. This is for example the case when a single textual macro that uses a field name incidentally shared by two otherwise unrelated structures is applied on variables of those structures. In the following example, renaming the identifier len will require changing it in all three definitions, although in C the members of each data structure belong to a different namespace.

    struct disk_block { int len; /* ... */ } db;
    struct mem_block { int len; /* ... */ } mb;
    #define get_block_len(b) ((b).len)
    int s = get_block_len(db) + get_block_len(mb);

In addition, new identifiers can be formed at compile time via the preprocessor's concatenation operator. As an example, the following code snippet defines a variable named sysctl_var_sdelay, even though this name does not appear in the source file.

    #define SYSCTL(x) static int sysctl_var_ ## x
    SYSCTL(sdelay);

An additional complication comes from the use of conditionally compiled code (see also Sections 4.1 and 7). Such code may or may not be compilable under a given compilation environment, and, often, blocks of such code may be mutually incompatible.
2.2 Code Reuse Complications
Parnas defines a program family as a set of programs that should be studied by first considering the common properties of the set and then determining individual properties of family members (see also the work by Weiss and Lai). When analyzing C source code for browsing and refactoring purposes we are interested in program families consisting of programs that through their build process reuse common elements of source code. This is a property of what has been termed the build-time software architecture view. We have identified three interesting instances of source code sharing in such families.
2.3 Problem Impact
Due to the previously described problems, programmers are currently working with methods and tools that are neither sound nor complete. The typical textual search for an identifier in a source code base may fail to locate identifier instances that are dynamically constructed, or will also locate identifiers that reside in a different scope or namespace. When working with a compiler or IDE-constructed symbol table there is another problem. Many C implementations treat preprocessing as a separate phase and fail to pass information about C macros down through the other compilation phases. Therefore, a more sophisticated search using such a symbol table database will fail to match all macro instances, while its results will be difficult to match against the original source code.
Consequently, program maintenance and evolution suffer, because programmers, unsupported by the tools they use, are reluctant to perform even a simple rename-function refactoring. Anecdotal evidence supports our observation: consider mutilated identifier names such as that of the Unix creat system call that still persist, decades after the reasons for their original names have become irrelevant [7, p. 60]. The readability of existing code slowly decays as layers of deprecated historical practice accumulate [23, pp. 4-6, 184] and even more macro definitions are used to provide compatibility bridges between legacy and modern code.
3 Related Work
Tools that aid program code analysis and transformation operations are often termed browsers [19, pp. 297-307] and refactoring browsers respectively. Related work on object-oriented design refactoring asserts that it is generally not possible to handle all problems introduced by preprocessing in large software applications. However, as we shall see in the following sections, advances in hardware capabilities are now making it possible to implement useful refactoring tools that address the complications of the C programming language. The main advantage of our approach is the correct handling of preprocessor constructs, so, although we have only tested the approach on different variants of C programs (K&R C, ANSI C, and C99 [28, 1, 25]), it is, in principle, also applicable to programs written in C++, Cyclone, PL/I and many assembly-code dialects.
An earlier study provides a complete empirical analysis of C preprocessor use, a categorization of macro bodies, and a description of common erroneous macros found in existing programs. Two theoretical approaches proposed for dealing with the problems of the C preprocessor involve the use of mathematical concept analysis for handling cases where the preprocessor is used for configuration management, and the definition of an abstract language for capturing the abstractions of the C preprocessor in a way that allows formal analysis. The two-way mapping between preprocessor tokens and C-proper identifiers used by CScout was first suggested by Livadas and Small.
|Number of supported refactorings||4||11||∞||5||150|
|Handle C namespaces||√||√||√||√||√|
|Rename preprocessor identifiers||√||√||×||√||√|
|Handle scopes introduced by the C preprocessor||√||√||×||×||×|
|Handle identifiers created by the C preprocessor||√||×||×||×||×|
|User environment||Web||Emacs||-||Eclipse||Visual Studio|
4 The CScout Refactoring Browser
To be able to map and rename identifiers across program families accurately and efficiently, CScout integrates in a single processing engine functions of a build tool (such as make or ant), a C preprocessor, a C compiler front-end, a parser of yacc files, a linker, a relational database export facility, and a web-based GUI.
- File Metrics
- Number of: statements, copies of the file, defined project-scoped functions, defined file-scoped (static) functions, defined project-scoped variables, defined file-scoped (static) variables, complete aggregate (struct/union) declarations, declared aggregate (struct/union) members, complete enumeration declarations, declared enumeration elements, directly included files
- File and Function Metrics
- Number of: characters, comment characters, space characters, line comments, block comments, lines, character strings, unprocessed lines, preprocessed tokens, compiled tokens, C preprocessor directives, processed C preprocessor conditionals (ifdef, if, elif), defined C preprocessor function-like macros (e.g. max(a, b)), defined C preprocessor object-like macros (e.g. EOF)
- Maximum number of characters in a line
- Function Metrics
- Number of: statements or declarations, operators, unique operators, numeric constants, character literals, else clauses, global namespace occupants at function's top, parameters
- Number of statements by type: if, switch, break, for, while, do, continue, goto, return
- Number of labels by type: goto, case, default
- Number of identifiers by type: project-scoped, file-scoped (static), macro, object (identifiers having a value) and object-like macros, label
- Number of unique identifiers by type: project-scoped, file-scoped, macro, object and object-like
- Maximum level of statement nesting
- Fan-in and fan-out
- Complexity: cyclomatic, extended cyclomatic, and maximum (including switch statements) cyclomatic
- annotate source code with hyperlinks to a detail page for each identifier,
- list files that would be affected by changing a specific identifier,
- determine whether a given identifier belongs to the application or to an external library, based on the accessibility and location of the header files that declare or define it,
- locate unused identifiers taking into account inter-project dependencies,
- perform sophisticated queries for identifiers, files, and functions,
- monitor and report superfluously included header files, and
- provide accurate metrics over functions and files (see Table 2).
- taking into account the namespace of each identifier: a renaming of a structure tag, member, or a statement label will not affect, for example, variables with the same name,
- respecting the scope of identifiers: a refactoring operation can affect multiple files, or variables within a single block, exactly matching the semantics the C compiler would enforce,
- across multiple projects (linkage units) when the same identifier is defined in common shared header files or even code,
- across conditionally compiled units, if an appropriate workspace (a set of interrelated linkage units) has been defined and processed.
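The namespace point in the list above can be illustrated with a small, purely hypothetical C fragment (the names are ours, not taken from any analyzed system). The spelling len appears as a structure tag, a member, a variable, and a label; these are four unrelated identifiers, so a namespace-aware rename of any one of them must leave the other three untouched, whereas a textual search-and-replace would change all of them.

```c
/* Hypothetical example: one spelling, four distinct C namespaces. */
struct len {                    /* tag namespace */
    int len;                    /* member namespace of struct len */
};

int block_length(void)
{
    struct len len = { 42 };    /* ordinary identifier namespace */

    goto len;                   /* label namespace */
len:
    return len.len;             /* the variable, then its member */
}
```

Renaming only the label, for instance, would change the goto statement and the line holding the label, and nothing else.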
4.1 Source Code Processing
|echo string||Display the string on CScout's standard output when the directive is processed.|
|ro_prefix string||Add string to the list of filename prefixes that mark read-only files. This is a global setting used for bifurcating the source code into the system's (read-only) files and the application's (writable) files.|
|project string||Set the name of the current project (linkage unit) to string. All identifiers and files processed from then on will be set to belong to the given project.|
|block_enter||Enter a nested scope block. Two blocks are supported, the first block_enter will enter the project scope (linkage unit); the second encountered nested block_enter will enter the file scope (compilation unit).|
|block_exit||Exit a nested scope block. The number of block_enter pragmas should match the number of block_exit pragmas and there should never be more than two block_enter pragmas in effect.|
|process string||Analyze (CScout's equivalent to compiling) the C source file named string.|
|pushd string||Set the current directory to string, saving the previous current directory in a stack. From that point onward, all relative file accesses will search the specified file from the given directory.|
|popd||Restore the current directory to the one in effect before a previously pushed directory. The number of pushd pragmas should match the number of popd pragmas.|
|includepath string||Add string to the list of directories used for searching included files (the include path).|
|clear_include||Clear the include path, allowing the specification of a new one from scratch.|
|clear_defines||Clear all defined macros, allowing the specification of new ones from scratch. Should normally be executed before processing a new file. Note that macros can be defined in the processing script using the normal #define C preprocessor directive.|
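To show how the directives in the table fit together, here is a sketch of a processing script for a single-file project. It is an illustration only: the project name, directory, file name, and macro are hypothetical, and we assume that string arguments are written as quoted C strings. The nesting follows the table: the first block_enter opens the project scope and the second opens the file scope.

```c
/* Hypothetical processing script; all names and paths are assumptions. */
#pragma echo "Processing project hello\n"
#pragma ro_prefix "/usr/include"     /* files under this prefix are read-only */
#pragma project "hello"
#pragma block_enter                  /* project scope (linkage unit) */

#pragma echo "Processing hello.c\n"
#pragma block_enter                  /* file scope (compilation unit) */
#pragma clear_defines
#pragma clear_include
#pragma includepath "/usr/include"
#define DEBUG 1                      /* stands in for a -DDEBUG=1 compiler flag */
#pragma pushd "src"
#pragma process "hello.c"
#pragma popd
#pragma block_exit

#pragma block_exit
```

Each block_enter is paired with a block_exit and each pushd with a popd, as the table requires.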
- A CScout companion program, csmake, can monitor compiler, archiver, and linker invocations in a make-driven build process, and thereby gather data to automatically create the processing script. This method has been used for processing all code listed in Table 4 (apart from the Solaris and Windows kernels), as well as tens of other Unix-based systems.
- A declarative specification of the source components, compiler options, and file locations required to build the members of a program family is processed by the CScout workspace compiler cswc. This method offers precise control of CScout's processing. It is also useful in cases when csmake is not compatible with the platform's compilation process; csmake currently handles the programs make, gcc, cc, ld, ar, and mv running in a POSIX shell environment. A 27-line csmake specification has been used for processing the Unix utilities illustrated in Figure 1 and a 125-line specification for processing a 350 KLOC proprietary CAD system.
- The build process can be instrumented to record the commands executed. This transcript can then be semi-automatically converted into the CScout processing script. For instance, a 74-line Perl script was used to convert the 1,149-line output of Microsoft's nmake program compiling the Windows Research Kernel into a 51,288-line CScout processing script. Similarly, a 137-line Perl script was used to convert the 26,704-line output of Sun's dmake program compiling the OpenSolaris kernel into a 140,552-line CScout processing script.
- unnecessarily included header files,
- identifiers for functions, variables, macros, labels, tags, and members that are never used across the complete workspace, and
- elements that should have been declared with file-local (static) visibility.
4.2 Web-Based Interface
5(a) Included files.
5(b) Call graph spanning functions and macros.
5(c) Control dependencies between files.
5(d) Data dependencies between files.
- Browse file and identifier names belonging to specific semantic categories (e.g. read-only files, file-spanning identifiers, or unused identifiers).
- Examine the source code of individual files, with hyperlinks providing navigation from each identifier to a separate page providing details of its use.
- Specify identifier queries based on the identifier's namespace, scope, and name, and whether the identifier is writable, crosses a file boundary, is unused, occurs in files of a given name, is used as a type definition, or is a (possibly undefined) macro, or macro argument. The file and identifier names to include or exclude can also be specified in the query as extended regular expressions (see Figure 4(a)).
- Specify simple form-based file and function queries based on the calculated metrics listed in Table 2.
- Perform queries for functions based on their callers, the functions they call, identifiers they contain, and the filenames where they reside.
- View the semantic information associated with a class of identifiers. Users can find out whether the identifier is read-only (i.e. at least one of its instances resides in a read-only file), and whether its instances are used as macros, macro arguments, structure, union, or enumeration tags, structure or union members, labels, type definitions, or as ordinary identifiers. In addition, users can see if the identifier's scope is limited to a single file, if it is unused (i.e. appears exactly once in a workspace), the files it appears in, and the projects (linkage units) that use it. Unused identifiers allow the programmer to find functions, macros, variables, type names, structure, union, or enumeration members or tags, labels, or formal function parameters that can be safely removed from the program.
- View information associated with a function or a function-like macro: the identifiers comprising its name, its declaration and definition, the callers and the called functions, and their transitive closure (see Figure 4(b)). Uniquely, CScout can calculate metrics and call graphs that take into account both functions and function-like macros; see Figure 5(b), derived while browsing the source code of awk and drawn using dot. This matches the reality of C programming, where the two are used interchangeably.
- Generate graphs of compile-time and run-time control and data-dependencies (see Figures 5(c) and 5(d)).
- Perform rename identifier and various function argument refactorings. Specifically, users can substitute all matching instances of a given identifier with a new user-specified name. In addition, in a function's web page users can specify a substitution template for the function's parameters. This can include the original parameters (denoted by placeholders for the original arguments, named @1, @2, etc.), as well as other arbitrary text, like constants and expressions. Refactorings can be specified multiple times, allowing the incremental improvement of the code, without the expensive reprocessing step. A separate operation will permanently save the modified code.
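The parameter substitution template described in the last item can be sketched in a few lines of C. The function below is our own illustration of the @1, @2 placeholder convention and is not CScout's implementation; the name expand_template and its signature are hypothetical. It copies the template to an output buffer, replacing each @N with the text of the N-th original argument.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Expand a substitution template such as "@2, @1, 0" using the text of
 * the original call arguments. Returns 0 on success, or -1 if a
 * placeholder refers to a missing argument or the output does not fit.
 */
int expand_template(const char *tmpl, const char *const args[], size_t nargs,
                    char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return -1;
    while (*tmpl != '\0') {
        if (tmpl[0] == '@' && isdigit((unsigned char)tmpl[1])) {
            size_t n = 0;

            /* Parse the (possibly multi-digit) argument number. */
            for (tmpl++; isdigit((unsigned char)*tmpl); tmpl++)
                n = n * 10 + (size_t)(*tmpl - '0');
            if (n < 1 || n > nargs)
                return -1;              /* no such original argument */

            size_t len = strlen(args[n - 1]);
            if (used + len >= outsz)
                return -1;              /* keep room for the terminator */
            memcpy(out + used, args[n - 1], len);
            used += len;
        } else {
            if (used + 1 >= outsz)
                return -1;
            out[used++] = *tmpl++;      /* copy ordinary text verbatim */
        }
    }
    out[used] = '\0';
    return 0;
}
```

With the original arguments fd and buf, the template "@2, @1, 0" swaps the two arguments and appends a constant third one, yielding the argument list buf, fd, 0.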
http://localhost:8081/fgraph.txt?gtype=C to obtain the compile-time dependencies between a project's files. Furthermore, as all web pages that CScout generates are identified by a unique URL, programmers can easily mark important pages (such as a particular identifier that must be refactored, or the result of a specialized form-based query) using their web browser's bookmarking mechanism, or even email an interesting page's URL to a coworker. In fact, many links appearing on CScout's main web page are simply canned hyperlinks to the general queries we previously outlined.
4.3 SQL Backend
+---------+-------+
| name    | nfile |
+---------+-------+
| NULL    |  3292 |
| u       |  2560 |
| printk  |  1922 |
| ...     |   ... |
5 Design and Implementation
Bringing CScout to life required careful analysis of the principles of its operation, a design that matched the software and computing resources at hand, and substantial implementation work. The major challenges can be divided into: preprocessing and parsing, the enforcement of C namespaces, the handling of C preprocessor complications, the handling of code reuse complications, testing, achieving adequate performance, and keeping the project at a manageable scale.
5.1 Preprocessing and Parsing
Preprocessing C is anything but trivial. CScout's lexical analyzer and the C preprocessor are hand-crafted; converging toward a correct preprocessor proved to be tricky. For many years CScout would be patched to fix misbehaviors occurring in obscure cases of macro invocations. The situation was becoming increasingly difficult, because often fixing one case would break another. In the end we realized that the only way to achieve correct behavior was to locate (through a personal communication with its author) and implement the so-called Prosser's macro expansion algorithm. Almost miraculously all test cases worked correctly, and after two years of use and many millions of lines of processed code, no other problems were reported in the area of macro expansion.
In contrast to C++, C is not difficult to parse, but the grammar supplied as part of the C standards is not suitable for generating yacc-based parsers, because such parsers then contain numerous rule conflicts. CScout's C grammar is based on Roskind's work, extended to support the parsing of yacc files, and many C99, gcc, and Microsoft C extensions. It comprises 144 productions and, after 149 revisions, it is 2,670 lines long. The parsing of preprocessor expressions and the C code are handled by two separate btyacc grammars. Btyacc was selected over yacc for its portability, better support for C++, superior handling of syntax errors through backtracking, and the ability to customize it in order to support the side-by-side linking of two separate grammars.
Handling the various language extension dialects hasn't proven to be difficult, probably because CScout is quite permissive in what it accepts. Therefore, currently CScout's input is the union of all possible language extensions. If in the future some extensions are found to be mutually exclusive, this can be handled by adding #pragma directives that will change the handling of the corresponding keywords.
5.2 Enforcement of C Namespaces
The separation of identifiers into C namespaces is achieved through a symbol table containing basic type information for identifiers in the current scope. Furthermore, support for the C99 initializer designators also requires the evaluation of compile-time constants. This non-trivial functionality is needed, because the array position of an initializer can be specified by a compile-time constant. When elements of nested aggregates (structures, unions, and arrays) are specified in comma-separated form without enclosing them in braces, the array position constant must be evaluated in order to determine the type of the next element.
The type checking subsystem is mainly used to identify a tag's underlying structure or union for member access and initialization, and to handle type definitions. In addition, its implementation provided us with a measure of confidence regarding the equivalence class unification operations dictated by the language's semantics.
The symbol table design follows the language's block scoping rules, with special cases handling prototype declarations and compilation and linkage unit visibility. Between the processing of two different projects (linkage units) the complete symbol table is cleared and only equivalence classes remain in memory, thus reducing CScout's memory footprint. This optimization can be performed, because if we ignore extra-linguistic facilities (such as shared libraries, debug symbols, and reflection) linked programs operate as standalone processes and do not depend on any program identifiers for their operation.
5.3 Handling C Preprocessor Complications
The basic principle of CScout's operation is to tag each identifier appearing in the original source code with its precise location (file and offset) and to follow that identifier (or its part when new identifiers are composed by concatenating original ones) across preprocessing, parsing, (partial) semantic analysis, and (notional) linking. To handle the scoping rule mix-ups generated by the C preprocessor (see Section 2.1), every identifier is set to belong to exactly one equivalence class: a set of identifiers that would have to be renamed in concert for the program family to remain semantically and syntactically correct. The notion of an equivalence class is orthogonal to the language's existing namespace and scope extents, taking into account the changes to those extents introduced by the C preprocessor.
When each identifier token is read, a new equivalence class for that token is created. Every time a symbol table lookup operation for an identifier matches an existing identifier (e.g. the use of a declared variable or the use of a parameter of a function-like macro) the two corresponding equivalence classes are unified into a single one. In total, 20 equivalence class unifications are performed by CScout. These can be broadly classified into the following categories: macro formal parameters and their use inside the macro body, macros used within the source code, macros being redefined or becoming undefined, tests for macros being defined, identifiers used in expressions, structure or union member access (direct, through a pointer indirection, or through an initializer designator), declarations using typedef types, application of the typeof operator (a gcc extension) to an identifier, use of structure, union, and enumeration tags, old-style function parameter declarations with the respective formal parameter name, multiple declarations and definitions of objects with compilation or linkage unit scope, and goto labels and targets, respectively.
By classifying all identifiers into equivalence classes, and then creating and merging the classes following the language's rules, we end up with a data structure that can identify many interesting relationships between identifiers.
- A rename operation simply involves changing the name of all identifiers belonging to the same equivalence class.
- Verifying that a renamed identifier does not clash with other identifiers means checking that no new equivalence class unifications occur when reprocessing the code. This method handles correctly all the language's scoping rules, a problem for many other refactoring tools.
- Unused identifiers are those belonging to an equivalence class with exactly one member.
- If at least one identifier in an equivalence class is located in a read-only file (for instance a system library header file) then all the identifiers of that class are considered immutable.
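The relationships listed above can be sketched with a small disjoint-set (union-find) structure in C. This is a minimal illustration under our own assumptions, not a description of CScout's internal data structures: tokens are identified by array indices, the capacity is fixed, and all names (token_classes, ec_add, ec_unify, and so on) are hypothetical. Each token starts in its own class, lookups that match an existing identifier unify two classes, and the unused and read-only properties are then read off the class.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TOKENS 64               /* arbitrary capacity for the sketch */

struct token_classes {
    size_t parent[MAX_TOKENS];      /* union-find forest */
    size_t size[MAX_TOKENS];        /* class size, valid at roots */
    bool read_only[MAX_TOKENS];     /* class touches a read-only file, valid at roots */
    size_t count;                   /* tokens registered so far */
};

/* Register a newly read token in its own class; (size_t)-1 if full. */
size_t ec_add(struct token_classes *tc, bool in_read_only_file)
{
    if (tc->count >= MAX_TOKENS)
        return (size_t)-1;
    size_t id = tc->count++;
    tc->parent[id] = id;
    tc->size[id] = 1;
    tc->read_only[id] = in_read_only_file;
    return id;
}

/* Find the representative of a token's class (with path halving). */
size_t ec_find(struct token_classes *tc, size_t id)
{
    while (tc->parent[id] != id) {
        tc->parent[id] = tc->parent[tc->parent[id]];
        id = tc->parent[id];
    }
    return id;
}

/* Unify the classes of two tokens, e.g. after a symbol table match. */
void ec_unify(struct token_classes *tc, size_t a, size_t b)
{
    size_t ra = ec_find(tc, a), rb = ec_find(tc, b);

    if (ra == rb)
        return;
    if (tc->size[ra] < tc->size[rb]) {   /* attach smaller under larger */
        size_t tmp = ra; ra = rb; rb = tmp;
    }
    tc->parent[rb] = ra;
    tc->size[ra] += tc->size[rb];
    tc->read_only[ra] = tc->read_only[ra] || tc->read_only[rb];
}

/* Two tokens must be renamed together iff they share a class. */
bool ec_same_class(struct token_classes *tc, size_t a, size_t b)
{
    return ec_find(tc, a) == ec_find(tc, b);
}

/* A class with exactly one member denotes an unused identifier. */
bool ec_is_unused(struct token_classes *tc, size_t id)
{
    return tc->size[ec_find(tc, id)] == 1;
}

/* One member in a read-only file makes the whole class immutable. */
bool ec_is_read_only(struct token_classes *tc, size_t id)
{
    return tc->read_only[ec_find(tc, id)];
}
```

Applied to the get_block_len example of Section 2.1, unifying the len token in the macro body with the len member of each structure places all three tokens in one class, which is why they have to be renamed in concert.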
5.4 Handling Code Reuse Complications
CScout handles the code reuse complications outlined in Section 2.2 by providing an integrated build system that can process multiple linkage units as a whole. An early design choice based this build system on extending the C language that CScout can process with a few #pragma directives (see Table 3). Making the build language an extension of C means that existing C facilities can be used for a number of tasks. Thus, external macro definitions that other build systems pass to the compiler as flags are simply defined through #define directives. Furthermore, internally-defined macro definitions, such as those handling gcc's built-in intrinsic functions, can be easily introduced simply by processing the file that defines them with a #include directive. Making the build language textual, rather than GUI-based as is typically the case in many IDEs, means that other more sophisticated tools can create build scripts. This is the case with cswc, the CScout workspace compiler, and csmake, the make-driven build process monitor and build script generator.
5.5 Testing
The complexity of CScout's analysis requires a framework to ensure that it remains functional and correct as the code evolves. The testing of CScout consists of stress and regression testing. Stress testing involves applying CScout to various large open-source systems. Problems in the preprocessor, the parser, or the semantic analysis quickly exhibit themselves as parsing errors or crashes. In addition, having CScout replace all identifiers of a system with mechanically-derived names and then recompiling and testing the corresponding code builds confidence in CScout's equivalence algorithms and the rename-identifier refactoring. Regression testing is currently used to verify corner cases and check for accidental errors. CScout's preprocessor is tested through 70 test cases whose output is then compared with the hand-verified output. The parser and analyzer are further tested through 42 small and large test cases whose complete analysis is stored in an RDBMS and compared with previously verified results.
5.6 Performance
With CScout processing multi-million line projects as a single entity, time and space performance have to be kept within acceptable bounds, with increases at most linearly dependent on the size of the input. Although no fancy algorithms and data structures were used to achieve CScout's scalability, extreme care was taken to adopt everywhere data structures and corresponding algorithms that would gracefully scale. This was made possible by the C++ STL library. For each data structure we simply chose a container that would handle all operations on its elements efficiently in terms of space and time. Thus, all data lookup operations are either O(1) for accessing data through a pointer indirection or at a vector's known index, or O(log N) for operations on sets and maps. These choices also allow the elegant and efficient expression of complex relationships, using STL functions like set_union, set_intersection, and equal_range. Up to now algorithmic tuning was required only once, to fix a pathological case in the implementation of the C preprocessor macro expansion.
The aggressive use of STL complicated CScout's debugging. Navigating STL data structures with gdb is almost impossible, because gdb provides a view of the data structures' implementation details, but not their high-level operations. This problem was solved by implementing a custom logging framework: a lightweight and efficient construct that allowed us to instrument the code with (currently 200) log statements. As the following example shows, writing such a debugpoint statement is trivial:

    if (DP()) cout << "Gather args returns: " << v << endl;

Each debugpoint can be easily enabled at runtime by specifying in a text file its corresponding file name and line number. The overhead of debugpoints can be completely disabled at compile time, but even when they get compiled, if none of them is enabled, their cost is only that of a compare and a jump instruction.
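To illustrate how such a debugpoint can be cheap when nothing is enabled, here is a minimal C analogue. It is a sketch under our own assumptions, not CScout's C++ implementation: the names dp_enable, dp_lookup, and DP_DISABLED are hypothetical, the table has a fixed capacity, and sites are enabled through a function call rather than by reading a text file. The DP() macro first tests a single flag, so when no site is enabled the lookup is never reached.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DP_MAX_SITES 16             /* arbitrary capacity for the sketch */

struct dp_site {
    const char *file;
    int line;
};

static struct dp_site dp_sites[DP_MAX_SITES];
static size_t dp_site_count;
static bool dp_any_enabled;         /* stays false until a site is enabled */

/* Enable the debugpoint at file:line; returns false if the table is full. */
bool dp_enable(const char *file, int line)
{
    if (dp_site_count == DP_MAX_SITES)
        return false;
    dp_sites[dp_site_count].file = file;
    dp_sites[dp_site_count].line = line;
    dp_site_count++;
    dp_any_enabled = true;
    return true;
}

/* Slow path: check whether file:line is in the table of enabled sites. */
bool dp_lookup(const char *file, int line)
{
    for (size_t i = 0; i < dp_site_count; i++)
        if (dp_sites[i].line == line && strcmp(dp_sites[i].file, file) == 0)
            return true;
    return false;
}

#ifdef DP_DISABLED
#define DP() false                  /* compiled out entirely */
#else
/* Fast path: one flag test when no debugpoint is enabled. */
#define DP() (dp_any_enabled && dp_lookup(__FILE__, __LINE__))
#endif
```

A call site would then read, for example, if (DP()) fprintf(stderr, "Gather args returns: %d\n", v); and would print only after dp_enable has been called with that statement's file name and line number.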
5.7 Project Scale
Implementing a tool of CScout's complexity proved to require considerable effort. CScout has been actively developed for five years, and currently consists of 27 KLOC. Most of the code is written in C++, with Perl being used to implement the CScout processing script generators csmake and cswc. Two more Perl scripts automatically extract from the source code the documentation for the SQL database schema and the reported error messages. Eight class hierarchies allow for some inheritance-based code reuse. Ordered by decreasing number of classes in each inheritance tree, these cover C's types, graph rendering, the handling of user options, SQL output, query processing, C's tokens, metrics, and functions.
More importantly, CScout benefits from the use of existing mature open source components and tools: the btyacc backtracking variant of the yacc parser generator, the SWILL embedded web server library, the dot graph drawing program, and the MySQL and PostgreSQL relational database systems. The main advantages of these components were their stability, efficiency, and hassle-free availability. In addition, the source code availability of btyacc and SWILL allowed us to port them to various platforms and to add some minor but essential features: a function to retrieve an HTTP request's URL in SWILL, and the ability for multiple grammars to co-exist in a program in btyacc.
6 Applying CScout
| | awk | Apache | FreeBSD | Linux | Solaris | WRK | PostgreSQL | GDB |
|---|---|---|---|---|---|---|---|---|
| Modules (linkage units) | 1 | 3 | 1,224 | 1,563 | 561 | 3 | 92 | 4 |
| C statements (thousands) | 4.3 | 17.7 | 948 | 1,772 | 1,042 | 192 | 70 | 129 |
| Unused file-scoped identifiers | 20 | 15 | 8,853 | 18,175 | 4,349 | 3,893 | 2,149 | 2,275 |
| Unused project-scoped identifiers | 8 | 8 | 1,403 | 1,767 | 4,459 | 2,628 | 2,537 | 939 |
| Variables that could be made static | 47 | 4 | 1,185 | 470 | 3,460 | 1,188 | 29 | 148 |
| Functions that could be made static | 10 | 4 | 1,971 | 1,996 | 5,152 | 3,294 | 133 | 69 |
| CPU time | 0.81" | 35" | 3h 43'40" | 7h 26'35" | 1h 18'54" | 58'53" | 3'55" | 11'13" |
| Lines / s | 8,148 | 1,711 | 194 | 155 | 634 | 235 | 2,460 | 539 |
| Required memory (MB) | 21 | 71 | 3,707 | 4,807 | 1,827 | 582 | 463 | 376 |
| Bytes / line | 3,336 | 1,243 | 1,496 | 1,215 | 639 | 736 | 840 | 1,086 |
| Item | Description |
|---|---|
| Computer | Custom-made 4U rack-mounted server |
| CPU | 4 × Dual-Core Opteron |
| CPU clock speed | 2.4 GHz |
| L2 cache | 1024 kB (per CPU) |
| RAM | 16 GB 400 MHz DDR2 SDRAM |
| System disks | 2 × 36 GB, SATA II, 8 MB cache, 10k RPM, software RAID-1 (mirroring) |
| Storage disks | 8 × 500 GB, SATA II, 16 MB cache, 7.2k RPM, hardware RAID-10 (4-striped mirrors) |
| Database disks | 4 × 300 GB, SATA II, 16 MB cache, 10k RPM, hardware RAID-10 (2-striped mirrors) |
| RAID controller | 3ware 9550sx, 12 SATA II ports, 226 MB cache |
| Operating system | Debian 5.0 stable running the 2.6.26-1-amd64 Linux kernel |
- awk: The one true awk scripting language.
- Apache httpd: The Apache project web server, version 1.3.27.
- The source code of the FreeBSD kernel HEAD branch, as of 2006-09-18, in three architecture configurations: i386, AMD64, and SPARC64.
- The Linux kernel, version 2.6.18.8-0.5, in its AMD64 configuration.
- Sun's OpenSolaris kernel, as of 2007-07-28, in three architecture configurations: Sun4v, Sun4u, and SPARC.
- The Microsoft Windows Research Kernel, version 1.2, in two architecture configurations: i386 and AMD64.
- The PostgreSQL relational database, version 8.2.5.
- The GNU debugger, version 6.7.
7 Lessons Learned
The main lessons learned from CScout's development are the value of end-to-end whole-workspace analysis of C source code, and the many practical difficulties of dealing with real-world C software. Researchers can apply these lessons by adopting a similar depth of analysis, such as the analysis already done in the LLVM compiler infrastructure project. Alternatively, researchers at the forefront of tool technology can save a lot of effort and pain by steering their energy toward more tractable languages, like Java. Furthermore, commercial tool builders should plan and budget for the difficulties we outline.
The operation of program analysis and transformation tools can be characterized as sound when the analysis will not generate any false positive results, and as complete when there are no missing elements in the results of the analysis. The analysis performed by CScout over identifier equivalence classes is in the general case sound, because it follows precisely the language's semantic rules. The incompleteness of the produced results stems from three different complications; addressing those with heuristics would result in an analysis that would no longer be sound. Predictably, these complications in our scheme arise from preprocessing features.
Unifying undefined macros. In the absence of a shared #undef directive two undefined macros with the same name can only be unified into a single identifier through a heuristic rule that considers them to be referring to the same entity. This is typically a correct assumption, because testing through undefined macros is used for configuring software through a carefully managed namespace, with identifiers such as HAS_FGETPOS and HAS_GETPPID.
Coverage of macro applications. Dealing with function-like macros whose application does not cover all possible cases needed for semantically correct refactoring can be problematic. Consider the first case in Section 2.1. If the code does not apply get_block_len on at least one element of type disk_block and one of type mem_block, CScout has no way to know that all three instances of len are semantically equivalent and should be renamed in concert.
Handling conditional compilation. In practice, this issue has caused the greatest number of problems. Conditional compilation results in code parts that are not always processed. Some of them may be mutually exclusive, defining, for example, different operating system-dependent versions of the same function. The problem can be handled with multiple passes over the code, or by ignoring conditional compilation commands. This process may need to be guided by hand, because conditionally compiled code sections are often specific to a particular compilation environment. When processing the FreeBSD kernel we used both approaches: a special predefined kernel configuration target named LINT to maximize the amount of conditionally compiled code that the configuration and processing would cover, and a separate pass for each of the three supported processor architectures. Yet, even this approach did not adequately cover the complete source code, as evidenced by an aborted attempt to remove header files that appeared to be unused.
Another problem we encountered when applying CScout in realistic situations concerned language extensions. The first version of CScout supported the 1989 version of ANSI C and a number of C99 extensions. In practice we found that CScout could not be applied on real-world source code without supporting many compiler-specific language extensions. Even programs that were written in a portable manner included platform-specific header files, which used many compiler extensions, and could therefore not be processed by a tool that did not support them. This was a significant problem for a number of reasons.
- Compiler-specific language extensions are typically far less carefully documented than the standardized language. In a number of cases we had to understand an extension's syntax and semantics by looking for examples of its use, or by reading the corresponding compiler's source code.
- Significant effort that could have been spent on improving the usefulness of CScout on all platforms was often diverted toward the support of a single proprietary and seldom-used compiler-specific extension.
- Some language extensions were mutually incompatible.
- Unintended extensions arising from a compiler's sometimes haphazard checking of a program's syntactic correctness restrict the portability of supposedly portable programs that mistakenly rely on the extension.
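To make the macro-coverage complication discussed earlier in this section more concrete, the fragment below gives a hypothetical reconstruction of the kind of code involved. The names disk_block, mem_block, get_block_len, and len follow the text; the field type, the `total_len` and `demo` helpers, and the values are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reconstruction: two unrelated structures that happen to
// share a member name.
struct disk_block { std::size_t len; };
struct mem_block  { std::size_t len; };

// The `len` in the macro body is not bound to any structure until the
// macro is expanded with a concrete argument.
#define get_block_len(b) ((b).len)

// Illustrative helper (an assumption, not from the paper). Both macro
// applications matter: the first ties the macro's `len` to
// disk_block::len, the second ties it to mem_block::len. Only then do
// all three `len` tokens belong together and need renaming in concert.
inline std::size_t total_len(const disk_block &d, const mem_block &m) {
    return get_block_len(d) + get_block_len(m);
}

// Small self-check exercising both macro applications.
inline void demo() {
    disk_block d = {512};
    mem_block m = {64};
    assert(total_len(d, m) == 576);
}
```

If the application on `m` were removed, an analysis that follows only the expansions actually present in the code would see no link between `mem_block::len` and the macro body, so renaming `len` through the macro would leave `mem_block` untouched.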
8 Conclusions
We have plans to extend CScout in a number of directions. One challenging and worthwhile avenue is support for the C++ language and object-oriented refactorings.
The web front-end is beginning to show its age. It should probably be redesigned to use AJAX technologies, communicating with the CScout engine through XML requests. This interface would also allow the implementation of a more sophisticated testing framework. Queries can be made considerably more flexible by allowing the user to specify them in an embedded scripting language, like Lua. Such a change would probably also require the provision of an asynchronous mechanism for aborting expensive queries. An alternative approach would be to provide a built-in SQL interface, perhaps through virtual tables of an embedded database, like SQLite. Currently, many URLs of the web front-end are fragile, breaking across CScout invocations or when the web front-end source code changes. These URLs can be made more robust by expressing them at a higher level of abstraction. Logging of CScout's HTTP requests can provide research data on its actual use.
Source code browsing can also be improved. The source code views can be enhanced through the use of configurable syntax coloring and easier navigation to various elements. An interface can be provided for showing identifiers shared between two files. Refactoring opportunities can be pointed out by identifying bad smells in the code. These can be located through the judicious provision of some key metric-based queries, and through the automatic detection of duplicated code.
CScout could support file names as first class citizens. This should allow the renaming of a file name, correcting all references to it in include directives. Furthermore, the web front-end should hyperlink file names appearing in include directives. On the same subject, CScout could provide a header refactoring option to support the style guideline that requires each included file to be self-sufficient (compile on its own) by including all the requisite header files [55, p. 42].
CScout's support for DSLs can be improved along a number of lines. For one, csmake should also support yacc invocations. More generally, it would probably be worthwhile to provide CScout with an option to perform best-effort identifier substitutions in files it can't parse. These substitutions would be performed simply by matching whole words; developers will enable this option when they are reasonably confident that there are no spurious matches of the identifiers they rename in DSL files. In the future, ubiquitous accurate file and offset tagging of the automatically created source code, in a way similar to the #line directives currently emitted by generators such as lex and yacc, may offer a more robust solution.
The application of CPU and memory resources toward the analysis of large program families written in C is an effective approach that yields readily exploitable refactoring opportunities in legacy code. CScout has already been successfully applied on a wide range of projects for performing modest, though not insignificant, refactoring operations. Our approach can be readily extended to cover other preprocessed languages like C++. Open issues from a research perspective are the automatic identification and implementation of more complex refactoring operations, increasing the accuracy of dependency graphs by reasoning about function pointers, the production of source code views for given macro values, and the efficient maximization of code coverage.
Acknowledgements and Tool Availability
We would like to thank the anonymous reviewers for their many excellent suggestions to improve this paper and CScout. The following people have helped the development of CScout over the years with advice, comments, and feature requests: Walter Briscoe, Wilko Bulte, Munish Chopra, Georgios Gousios, Poul-Henning Kamp, Kris Kennaway, Alexander Leidinger, Sandor Markon, Marcel Moolenaar, Richard A. O'Keefe, Igmar Palsenberg, Wes Peters, Dave Prosser, Jeroen Ruigrok van der Werven, Remco van Engelen, and Peter Wemm.
The tool, its documentation, and representative examples are available at http://www.spinellis.gr/cscout/. CScout currently runs under the FreeBSD, Linux, Mac OS X, Microsoft Windows, and Solaris operating systems under several 32 and 64-bit architectures. The freely-downloadable version of CScout can be used on open-source code; the supported commercial version is licensed for use on proprietary code, and includes the obfuscation and SQL back-ends.
- American National Standard for Information Systems - programming language - C: ANSI X3.159-1989, (Also ISO/IEC 9899:1990) (Dec. 1989).
- G. Antoniol, R. Fiutem, G. Lutteri, P. Tonella, S. Zanfei, E. Merlo, Program understanding and maintenance with the CANTO environment, in: ICSM '97: Proceedings of the International Conference on Software Maintenance, IEEE Computer Society, Washington, DC, USA, 1997.
- L. Aversano, M. D. Penta, I. D. Baxter, Handling preprocessor-conditioned declarations, in: SCAM'02: Second IEEE International Workshop on Source Code Analysis and Manipulation, IEEE Computer Society, Los Alamitos, CA, USA, 2002.
- G. J. Badros, D. Notkin, A framework for preprocessor-aware C source code analyses, Software: Practice & Experience 30 (8) (2000) 907-924.
- I. D. Baxter, M. Mehlich, Preprocessor conditional removal by simple partial evaluation, in: WCRE '01: Proceedings of the Eighth Working Conference on Reverse Engineering, IEEE Computer Society, Washington, DC, USA, 2001.
- Y.-F. Chen, M. Y. Nishimoto, C. V. Ramamoorthy, The C information abstraction system, IEEE Transactions on Software Engineering 16 (3) (1990) 325-334.
- D. Cooke, J. Urban, S. Hamilton, Unix and beyond: An interview with Ken Thompson, IEEE Computer 32 (5) (1999) 58-64.
- K. De Volder, JQuery: A generic code browser with a declarative configuration language, in: Practical Aspects of Declarative Languages, Springer Verlag, 2006, pp. 88-102, Lecture Notes in Computer Science 3819.
- Developer Express Inc., Refactoring your code with Refactor!, Online http://www.devexpress.com/Products/Visual_Studio_Add-in/Refactoring/whitepaper.xml. Accessed 2009-03-17. Archived by WebCite at http://www.webcitation.org/5fMxaOP4n, white paper (2009).
- M. D. Ernst, G. J. Badros, D. Notkin, An empirical analysis of C preprocessor use, IEEE Transactions on Software Engineering 28 (12) (2002) 1146-1170.
- J.-M. Favre, Preprocessors from an abstract point of view, in: Proceedings of the International Conference on Software Maintenance ICSM '96, IEEE Computer Society, 1996.
- R. T. Fielding, R. N. Taylor, Principled design of the modern Web architecture, ACM Transactions on Internet Technology 2 (2) (2002) 115-150.
- M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley, Boston, MA, 2000.
- A. Garrido, Program refactoring in the presence of preprocessor directives, Ph.D. thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA, adviser: Ralph Johnson (2005). URL http://www.lifia.info.unlp.edu.ar/papers/2005/Garrido2005.pdf
- A. Garrido, R. Johnson, Challenges of refactoring C programs, in: IWPSE '02: Proceedings of the International Workshop on Principles of Software Evolution, ACM, New York, NY, USA, 2002.
- A. Garrido, R. Johnson, Analyzing multiple configurations of a C program, in: ICSM '05: Proceedings of the 21st IEEE International Conference on Software Maintenance, IEEE Computer Society, Washington, DC, USA, 2005.
- E. R. Gansner, E. Koutsofios, S. C. North, K.-P. Vo, A technique for drawing directed graphs, IEEE Transactions on Software Engineering 19 (3) (1993) 214-230.
- R. Ghiya, D. Lavery, D. Sehr, On the importance of points-to analysis and other memory disambiguation methods for C programs, ACM SIGPLAN Notices 36 (5) (2001) 47-58, PLDI '01: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.
- A. Goldberg, D. Robson, Smalltalk-80: The Language, Addison-Wesley, Reading, MA, 1989.
- W. G. Griswold, D. Notkin, Automated assistance for program restructuring, ACM Transactions on Software Engineering and Methodology 2 (3) (1993) 228-269.
- R. C. Holt, A. Schürr, S. E. Sim, A. Winter, GXL: a graph-based standard exchange format for reengineering, Science of Computer Programming 60 (2) (2006) 149-170.
- Y. Hu, E. Merlo, M. Dagenais, B. Lagüe, C/C++ conditional compilation analysis using symbolic execution, in: ICSM '00: Proceedings of the International Conference on Software Maintenance, IEEE Computer Society, Washington, DC, USA, 2000.
- A. Hunt, D. Thomas, The Pragmatic Programmer: From Journeyman to Master, Addison-Wesley, Boston, MA, 2000.
- R. Ierusalimschy, Programming in Lua, 2nd ed., Lua.org, Rio de Janeiro, 2006.
- International Organization for Standardization, Programming Languages - C, ISO, Geneva, Switzerland, 1999, ISO/IEC 9899:1999.
- D. Janzen, K. D. Volder, Navigating and querying code without getting lost, in: AOSD '03: Proceedings of the 2nd International Conference on Aspect-Oriented Software Development, ACM, New York, NY, USA, 2003.
- T. Jim, G. Morrisett, D. Grossman, M. Hicks, J. Cheney, Y. Wang, Cyclone: A safe dialect of C, in: USENIX Technical Conference Proceedings, USENIX Association, Berkeley, CA, 2002.
- B. W. Kernighan, D. M. Ritchie, The C Programming Language, 1st ed., Prentice Hall, Englewood Cliffs, NJ, 1978.
- S. Lampoudi, D. M. Beazley, SWILL: A simple embedded web server library, in: USENIX Technical Conference Proceedings, USENIX Association, Berkeley, CA, 2002, FREENIX Track Technical Program.
- S. Lapierre, B. Laguë, C. Leduc, Datrix source code model and its interchange format: lessons learned and considerations for future work, SIGSOFT Softw. Eng. Notes 26 (1) (2001) 53-56.
- C. Lattner, V. Adve, LLVM: A compilation framework for lifelong program analysis & transformation, in: CGO '04: Proceedings of the 2004 International Symposium on Code Generation and Optimization, 2004.
- Z. Li, S. Lu, S. Myagmar, Y. Zhou, CP-miner: Finding copy-paste and related bugs in large-scale software code, IEEE Transactions on Software Engineering 32 (3) (2006) 176-192.
- M. A. Linton, Implementing relational views of programs, in: SDE 1: Proceedings of the First ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, ACM, New York, NY, USA, 1984.
- P. E. Livadas, D. T. Small, Understanding code containing preprocessor constructs, in: IEEE Third Workshop on Program Comprehension, 1994.
- A. Milanova, A. Rountev, B. G. Ryder, Precise call graphs for C programs with function pointers, Automated Software Engineering 11 (1) (2004) 7-26.
- G. C. Murphy, M. Kersten, L. Findlater, How are Java software developers using the Eclipse IDE?, IEEE Software 23 (4) (2006) 76-83.
- M. Owens, The Definitive Guide to SQLite, Apress, Berkeley, CA, 2006.
- D. L. Parnas, On the design and development of program families, IEEE Transactions on Software Engineering SE-2 (1) (1976) 1-9.
- D. F. Prosser, Complete macro expansion algorithm, Standardization committee memo X3J11/86-196, ANSI, New York, online http://www.spinellis.gr/blog/20060626/x3J11-86-196.pdf. Accessed 2009-03-21. Archived by WebCite at http://www.webcitation.org/5fRS2iru0 (Dec. 1986).
- D. Roberts, J. Brant, R. E. Johnson, A refactoring tool for Smalltalk, Theory and Practice of Object Systems 3 (4) (1997) 39-42.
- J. A. Roskind, Grammar file for the dpANSI C language, Available online at http://www.ccs.neu.edu/research/demeter/tools/master/doc/headers/C++Grammar/c4.y. Accessed 2009-03-13. Archived by WebCite at http://www.webcitation.org/5fF8fX28Q (Mar. 1990).
- D. Schaefer, Code analysis and refactoring with CDT, Available online at http://cdtdoug.blogspot.com/2008/11/code-analysis-and-refactoring-with-cdt.html. Accessed 2009-03-15. Archived by WebCite at http://www.webcitation.org/5fMyo3trp, Eclipse Summit Europe presentation (Nov. 2008).
- G. Snelting, Reengineering of configurations based on mathematical concept analysis, ACM Transactions on Software Engineering and Methodology 5 (2) (1996) 146-189.
- D. Spinellis, Code Reading: The Open Source Perspective, Addison-Wesley, Boston, MA, 2003.
- D. Spinellis, Global analysis and transformations in preprocessed languages, IEEE Transactions on Software Engineering 29 (11) (2003) 1019-1030.
- D. Spinellis, Code finessing, Dr. Dobb's 31 (11) (2006) 58-63.
- D. Spinellis, Debuggers and logging frameworks, IEEE Software 23 (3) (2006) 98-99.
- D. Spinellis, A tale of four kernels, in: W. Schäfer, M. B. Dwyer, V. Gruhn (eds.), ICSE '08: Proceedings of the 30th International Conference on Software Engineering, Association for Computing Machinery, New York, 2008.
- D. Spinellis, The way we program, IEEE Software 25 (4) (2008) 89-91.
- D. Spinellis, Optimizing header file include directives, Journal of Software Maintenance and Evolution: Research and Practice 21 (4) (2009) 233-251.
- R. M. Stallman, EMACS: The extensible, customizable, self-documenting display editor, in: D. R. Barstow, H. E. Shrobe, E. Sandwell (eds.), Interactive Programming Environments, McGraw-Hill, 1984, pp. 300-325.
- F. Steimann, A. Thies, From public to private to absent: Refactoring Java programs under constrained accessibility, in: S. Drossopoulou (ed.), ECOOP '09: Proceedings of the European Conference on Object-Oriented Programming, Springer-Verlag, 2009, Lecture Notes in Computer Science.
- B. Stroustrup, The C++ Programming Language, 3rd ed., Addison-Wesley, Reading, MA, 1997.
- Sun Microsystems, Inc., Santa Clara, CA, Sun Studio 12: Distributed Make (dmake), part No: 819-5273. Available online at http://docs.sun.com/app/docs/doc/819-5273. Accessed 2009-03-13 (2007).
- H. Sutter, A. Alexandrescu, C++ Coding Standards: 101 Rules, Guidelines, and Best Practices, Addison Wesley, 2004.
- L. Tokuda, D. Batory, Evolving object-oriented designs with refactorings, Automated Software Engineering 8 (2001) 89-120.
- Q. Tu, M. Godfrey, The build-time software architecture view, in: ICSM'01: Proceedings of the IEEE International Conference on Software Maintenance, 2001.
- L. Vidács, A. Beszédes, R. Ferenc, Columbus schema for C/C++ preprocessing, in: CSMR '04: Proceedings of the Eighth European Conference on Software Maintenance and Reengineering, IEEE Computer Society, 2004.
- M. Vittek, Refactoring browser with preprocessor, in: CSMR '03: Proceedings of the Seventh European Conference on Software Maintenance and Reengineering, IEEE Computer Society, 2003.
- D. G. Waddington, B. Yao, High-fidelity C/C++ code transformation, Electronic Notes in Theoretical Computer Science 141 (4) (2005) 35-56.
- D. M. Weiss, C. T. R. Lai, Software Product-Line Engineering: A Family-Based Software Development Process, Addison-Wesley, 1999.
- R. Wuyts, Declarative reasoning about the structure of object-oriented systems, in: TOOLS '98: Proceedings of the Technology of Object-Oriented Languages and Systems, IEEE Computer Society, Washington, DC, 1998.