Introduction
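Dead code elimination (DCE) is a widely used compiler optimization. As data-flow analysis has matured, many classic optimization problems have acquired well-established solutions, and dead code elimination has become increasingly efficient and reliable.
Consider the following example: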
#include <stdio.h>

void func();

void a() {
    int i = 0;
    for (; i < 1000; ++i);
    func();
    return;
}

static void b() {
    int i = 0;
    for (; i < 1000; ++i);
    return;
}

int main()
{
    int i = 0;
    for (; i < 0xfffff; ++i);
    a();
    b();
    return 0;
}
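Compiling without any of GCC's dead code elimination optimizations (gcc test.c -S -o t1.s) produces assembly that contains every code structure in the source, as shown in the figure:
The assembly produced by compiling with -O3 is as follows:
As shown above, compiling with -O3 removes the dead code in the program and greatly reduces code size. Although -O3 enables many optimizations, dead code elimination plays an important role in optimizing the example above.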
Main DCE files
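When a program is compiled with -O, GCC automatically enables a set of optimizations such as -fdce and -fauto-inc-dec. Some removals, however, need more than compiler optimization options: extra flags must also be passed to the linker, for example: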
# keep every function in separate section. This will allow linker to dump unused functions
CFLAGS += -ffunction-sections -fdata-sections
# let linker to dump unused sections
LDFLAGS := -Wl,--gc-sections
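To make the effect of these flags concrete, here is a minimal sketch of the kind of code they target. The function names (used_helper, unused_helper, entry_point) are hypothetical and chosen only for illustration. Because unused_helper has external linkage, the compiler alone cannot drop it; on ELF targets, -ffunction-sections is expected to place each function in its own section so that the linker, given --gc-sections, can discard the unreferenced one.

```c
#include <assert.h>

/* Referenced from the entry side, so its section stays reachable. */
int used_helper(int x)
{
    return x * 2;
}

/* Never referenced. It has external linkage, so the compiler must still
   emit it; only the linker can see that the final program does not use it. */
int unused_helper(int x)
{
    return x + 100;
}

/* Hypothetical caller standing in for code reachable from main(). */
int entry_point(int x)
{
    return used_helper(x) + 1;
}
```

Adding a main that calls entry_point and building with the CFLAGS and LDFLAGS above, then listing the symbols of the result (for example with nm), should show whether unused_helper survived the link.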
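Many optimizations in fact also rely on -flto. To disable a particular optimization, use the corresponding -fno-xx option (for example -fno-dce or -fno-tree-dce), or use the volatile keyword.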
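As a small illustration of the volatile approach, the sketch below contrasts two counting loops. The function names are hypothetical. The first loop has no effect other than its final value, so an optimizing build is free to fold or delete it; in the second, every access to the volatile counter counts as observable, so the loop has to stay.

```c
#include <assert.h>

/* Plain counter: the loop has no observable effect beyond its final value,
   so an optimizing build may fold or delete it. */
int spin_plain(int n)
{
    int i = 0;
    for (; i < n; ++i)
        ;
    return i;
}

/* volatile counter: each read and write of i is an observable access,
   so the compiler must keep the loop even at high optimization levels. */
int spin_volatile(int n)
{
    volatile int i = 0;
    for (; i < n; ++i)
        ;
    return i;
}
```

Both functions return the same value; the difference only shows up in the generated assembly, for example when comparing the output of gcc -O3 -S for the two.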
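GCC's dead code elimination runs mainly on two intermediate representations, SSA (Static Single Assignment) and RTL (Register Transfer Language). The dumped intermediate representations show that the optimization is invoked multiple times. In GCC 8.2.0 the relevant code lives mainly in two files, tree-ssa-dce.c and dce.c, which implement the optimization on SSA and on RTL respectively.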
1. tree-ssa-dce.c
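The dead code elimination algorithm in this file consists of three main steps:
- Mark all statements known to be necessary, e.g. most function calls and stores to memory
- Propagate necessity, e.g. to the statements that assign the operands of necessary statements
- Remove dead statements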
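The three steps above can be sketched on a toy IR. This is an illustrative model only, not GCC's implementation: struct toy_stmt, toy_dce, MAX_STMTS and NO_OPERAND are hypothetical names, each statement has at most two operands, and an operand is identified by the index of the statement that defines it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_STMTS 16
#define NO_OPERAND (-1)

/* Hypothetical statement: uses up to two values, each named by the index
   of its defining statement, and may have a side effect. */
struct toy_stmt {
    int use[2];           /* defining statement of each operand, or NO_OPERAND */
    bool has_side_effect; /* call, store to memory, return, ... */
    bool necessary;       /* result of the marking phases */
};

/* Marks necessary statements in stmts[0..n) and returns how many would be
   removed, or -1 if n exceeds the worklist capacity. */
int toy_dce(struct toy_stmt *stmts, int n)
{
    int worklist[MAX_STMTS];
    int top = 0;
    int removed = 0;

    if (n < 0 || n > MAX_STMTS)
        return -1;

    /* Step 1: statements with side effects are obviously necessary. */
    for (int i = 0; i < n; i++) {
        stmts[i].necessary = stmts[i].has_side_effect;
        if (stmts[i].necessary)
            worklist[top++] = i;
    }

    /* Step 2: propagate necessity to the definitions of their operands.
       Each statement is pushed at most once, so the worklist cannot overflow. */
    while (top > 0) {
        const struct toy_stmt *s = &stmts[worklist[--top]];
        for (int k = 0; k < 2; k++) {
            int d = s->use[k];
            if (d >= 0 && d < n && !stmts[d].necessary) {
                stmts[d].necessary = true;
                worklist[top++] = d;
            }
        }
    }

    /* Step 3: everything left unmarked is dead. */
    for (int i = 0; i < n; i++)
        if (!stmts[i].necessary)
            removed++;
    return removed;
}
```

For a four-statement program a = 1; b = a + 1; c = 5; call f(b), only the call is marked in step 1, step 2 reaches b and then a, and c = 5 is the single statement left unmarked. A real implementation would also handle PHI nodes, memory dependences and control flow, which this sketch leaves out.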
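The main routines mark statements and operands as necessary, process the operands of necessary statements, and delete or replace dead statements and PHI nodes. The individual functions are described below: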
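mark_stmt_if_obviously_necessary dispatches on the statement's GIMPLE code and calls mark_stmt_necessary. That function marks a statement as necessary; if ADD_TO_WORKLIST is true and the statement was not already marked necessary, it is also pushed onto the worklist: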
gimple_set_plf (stmt, STMT_NECESSARY, true);
if (add_to_worklist)
worklist.safe_push (stmt);
if (add_to_worklist && bb_contains_live_stmts && !is_gimple_debug (stmt))
bitmap_set_bit (bb_contains_live_stmts, gimple_bb (stmt)->index);
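The similar function mark_operand_necessary (tree op) handles operands instead. mark_last_stmt_necessary marks the last statement of a basic block (BB) as necessary, and mark_control_dependent_edges_necessary marks as necessary the control-dependent edges associated with a BB.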
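propagate_necessity propagates necessity to the operands of necessary statements and puts the resulting statements on the worklist. It is one of the more important functions: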
/* PHI nodes are somewhat special in that each PHI alternative has
data and control dependencies. All the statements feeding the
PHI node's arguments are always necessary. In aggressive mode,
we also consider the control dependent edges leading to the
predecessor block associated with each PHI alternative as
necessary. */
gphi *phi = as_a <gphi *> (stmt);
size_t k;
for (k = 0; k < gimple_phi_num_args (stmt); k++)
{
tree arg = PHI_ARG_DEF (stmt, k);
if (TREE_CODE (arg) == SSA_NAME)
mark_operand_necessary (arg);
}
......
if (aggressive && !degenerate_phi_p (stmt))
{
for (k = 0; k < gimple_phi_num_args (stmt); k++)
{
basic_block arg_bb = gimple_phi_arg_edge (phi, k)->src;
if (gimple_bb (stmt)
!= get_immediate_dominator (CDI_POST_DOMINATORS, arg_bb))
{
if (!bitmap_bit_p (last_stmt_necessary, arg_bb->index))
mark_last_stmt_necessary (arg_bb);
}
else if (arg_bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)
&& !bitmap_bit_p (visited_control_parents,
arg_bb->index))
mark_control_dependent_edges_necessary (arg_bb, true);
}
}
......
/* Calls to functions that are merely acting as barriers
or that only store to memory do not make any previous
stores necessary. */
if (callee != NULL_TREE
&& DECL_BUILT_IN_CLASS (callee) == BUILT_IN_NORMAL
&& (DECL_FUNCTION_CODE (callee) == BUILT_IN_MEMSET
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_MEMSET_CHK
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_MALLOC
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_ALIGNED_ALLOC
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_CALLOC
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_FREE
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_VA_END
|| ALLOCA_FUNCTION_CODE_P (DECL_FUNCTION_CODE (callee))
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_STACK_SAVE
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_STACK_RESTORE
|| DECL_FUNCTION_CODE (callee) == BUILT_IN_ASSUME_ALIGNED))
continue;
/* Calls implicitly load from memory, their arguments
in addition may explicitly perform memory loads. */
mark_all_reaching_defs_necessary (stmt);
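remove_dead_phis, remove_dead_stmt, eliminate_unnecessary_stmts and related functions remove dead statements, PHI nodes and so on. perform_tree_ssa_dce (bool aggressive) is the main routine for eliminating dead code, and AGGRESSIVE controls how aggressive the algorithm is. In conservative mode, control dependences are ignored and all branches except the most trivially dead ones are simply declared necessary; this mode is fast. In aggressive mode, control dependences are taken into account, which eliminates more dead code at the cost of some extra time. The source is as follows: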
static unsigned int
perform_tree_ssa_dce (bool aggressive)
{
bool something_changed = 0;
calculate_dominance_info (CDI_DOMINATORS);
/* Preheaders are needed for SCEV to work.
Simple lateches and recorded exits improve chances that loop will
proved to be finite in testcases such as in loop-15.c and loop-24.c */
bool in_loop_pipeline = scev_initialized_p ();
if (aggressive && ! in_loop_pipeline)
{
scev_initialize ();
loop_optimizer_init (LOOPS_NORMAL
| LOOPS_HAVE_RECORDED_EXITS);
}
tree_dce_init (aggressive);
if (aggressive)
{
/* Compute control dependence. */
calculate_dominance_info (CDI_POST_DOMINATORS);
cd = new control_dependences ();
visited_control_parents =
sbitmap_alloc (last_basic_block_for_fn (cfun));
bitmap_clear (visited_control_parents);
mark_dfs_back_edges ();
}
find_obviously_necessary_stmts (aggressive);
if (aggressive && ! in_loop_pipeline)
{
loop_optimizer_finalize ();
scev_finalize ();
}
longest_chain = 0;
total_chain = 0;
nr_walks = 0;
chain_ovfl = false;
visited = BITMAP_ALLOC (NULL);
propagate_necessity (aggressive);
BITMAP_FREE (visited);
something_changed |= eliminate_unnecessary_stmts ();
something_changed |= cfg_altered;
/* We do not update postdominators, so free them unconditionally. */
free_dominance_info (CDI_POST_DOMINATORS);
/* If we removed paths in the CFG, then we need to update
dominators as well. I haven't investigated the possibility
of incrementally updating dominators. */
if (cfg_altered)
free_dominance_info (CDI_DOMINATORS);
statistics_counter_event (cfun, "Statements deleted", stats.removed);
statistics_counter_event (cfun, "PHI nodes deleted", stats.removed_phis);
/* Debugging dumps. */
if (dump_file && (dump_flags & (TDF_STATS|TDF_DETAILS)))
print_stats ();
tree_dce_done (aggressive);
if (something_changed)
{
free_numbers_of_iterations_estimates (cfun);
if (in_loop_pipeline)
scev_reset ();
return TODO_update_ssa | TODO_cleanup_cfg;
}
return 0;
}
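The difference between the conservative and aggressive modes is easiest to see on a branch that guards only dead code. The sketch below is an illustrative input, with hypothetical names, rather than a statement of what any particular GCC version emits. The increments of hits are dead because hits is never read. In conservative mode the if is still treated as necessary, since branches are kept by default; in aggressive mode no necessary statement is control dependent on it, so the branch itself becomes a removal candidate, and so does the loop if it can be proved finite.

```c
#include <assert.h>

/* Returns the number of loop iterations. The `hits` counter is never
   read, so its increments and the branch guarding them do no useful work. */
int count_iterations(int n, unsigned mask)
{
    int hits = 0;

    for (int i = 0; i < n; ++i) {
        if (mask & (unsigned)i)
            ++hits;   /* dead: nothing uses hits */
    }
    return n > 0 ? n : 0;
}
```

Compiling this with options such as -O2 -fdump-tree-dce -fdump-tree-cddce and comparing the two sets of dump files is one way to observe what each pass removes.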
- This file defines two passes, pass_dce and pass_cd_dce. The entry function of pass_dce is tree_ssa_dce; the pass runs when flag_tree_dce != 0, and its pass name in GIMPLE is dce. The entry function of pass_cd_dce is tree_ssa_cd_dce; it also runs when flag_tree_dce != 0, and its pass name in GIMPLE is cddce.
2. dce.c
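On RTL, GCC works mainly with insns. An insn is more specific than a general rtx: it usually represents one instruction and is constructed from the RTL template of an instruction pattern. The main functions in dce.c are: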
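- deletable_insn_p returns true if INSN is a normal instruction that can be deleted by the DCE pass. find_call_stack_args applies when ACCUMULATE_OUTGOING_ARGS is in effect: it tries to find all stack stores of the CALL_INSN's arguments, and returns true if all of them were found, false otherwise.
- fast_dce (bool word_level) performs fast DCE once initialization is complete. If WORD_LEVEL is true it uses word-level DCE; otherwise it works at the pseudo level. The related functions are run_word_dce and run_fast_df_dce; part of the source follows: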
/* Fast byte level DCE. */
void
run_word_dce (void)
{
int old_flags;
if (!flag_dce)
return;
timevar_push (TV_DCE);
old_flags = df_clear_flags (DF_DEFER_INSN_RESCAN + DF_NO_INSN_RESCAN);
df_word_lr_add_problem ();
init_dce (true);
fast_dce (true);
fini_dce (true);
df_set_flags (old_flags);
timevar_pop (TV_DCE);
}
/* This is an internal call that is used by the df live register
problem to run fast dce as a side effect of creating the live
information. The stack is organized so that the lr problem is run,
this pass is run, which updates the live info and the df scanning
info, and then returns to allow the rest of the problems to be run.
This can be called by elsewhere but it will not update the bit
vectors for any other problems than LR. */
void
run_fast_df_dce (void)
{
if (flag_dce)
{
/* If dce is able to delete something, it has to happen
immediately. Otherwise there will be problems handling the
eq_notes. */
int old_flags =
df_clear_flags (DF_DEFER_INSN_RESCAN + DF_NO_INSN_RESCAN);
df_in_progress = true;
rest_of_handle_fast_dce ();
df_in_progress = false;
df_set_flags (old_flags);
}
}
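- delete_unmarked_insns deletes every instruction that has not been marked. init_dce and fini_dce respectively initialize the global variables for a new DCE pass and free the data allocated by init_dce.
- This file defines two passes, pass_ud_rtl_dce and pass_fast_rtl_dce. The entry function of pass_ud_rtl_dce is rest_of_handle_ud_dce; the pass runs when optimize > 1 && flag_dce && dbg_cnt (dce_ud), and its pass name in RTL is ud_dce. The entry function of pass_fast_rtl_dce is rest_of_handle_fast_dce; the pass runs when optimize > 0 && flag_dce && dbg_cnt (dce_fast), and its pass name in RTL is rtl_dce.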
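The two gate conditions differ only in the optimization level they require. The sketch below models them as plain predicates to show which RTL DCE pass would run at each level. struct toy_options and both function names are hypothetical stand-ins for GCC's global optimize and flag_dce variables, and the dbg_cnt debug-counter check is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant GCC globals. */
struct toy_options {
    int optimize;   /* 0 for -O0, 1 for -O1, 2 for -O2, ... */
    bool flag_dce;  /* true unless -fno-dce is given */
};

/* Models the pass_ud_rtl_dce gate, without the dbg_cnt check. */
bool toy_gate_ud_rtl_dce(struct toy_options o)
{
    return o.optimize > 1 && o.flag_dce;
}

/* Models the pass_fast_rtl_dce gate, without the dbg_cnt check. */
bool toy_gate_fast_rtl_dce(struct toy_options o)
{
    return o.optimize > 0 && o.flag_dce;
}
```

Under this model, -O1 runs only the fast pass, -O2 and above run both, and -O0 or -fno-dce run neither.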
References:
- https://elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
- https://stackoverflow.com/questions/8988291/disabling-specific-optimizationdead-code-elimination-in-gcc-compiler