GCC - Dead Code Elimination Optimization

Introduction

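Dead code elimination (DCE) is a widely used compiler optimization. As data-flow analysis has matured, many classic optimization problems have gained well-established solutions, and DCE has become increasingly efficient and stable.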

Consider the following example:

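#include <stdio.h>

void func(void);   /* defined in another file; this file is only compiled with -S */

void a() {
    int i = 0;

    for (; i < 1000; ++i);   /* empty loop: no observable effect */
    func();                  /* external call: must be kept */
    return;
}
static void b() {
    int i = 0;

    for (; i < 1000; ++i);   /* empty loop: b() does nothing observable */
    return;
}
int main()
{
    int i = 0;
    for (; i < 0xfffff; ++i);   /* empty loop: no observable effect */
    a();
    b();
    return 0;
}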

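Compiled without any of GCC's dead code elimination optimizations (gcc test.c -S -o t1.s, i.e. at the default -O0), the generated assembly keeps every code structure, as shown below: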
[Figure dce1: assembly generated without optimization]
The assembly generated with -O3 is:
[Figure dce2: assembly generated with -O3]
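As shown above, compiling with -O3 removes the dead code and greatly reduces code size. Although -O3 enables many optimizations, dead code elimination plays a major role for this example: the three empty loops have no observable effect, and b() is a static function with no side effects, so its call can be dropped as well.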

Main DCE source files

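When a program is compiled with -O, GCC automatically enables a series of optimizations, including -fdce and -fauto-inc-dec. Some kinds of removal, however, need more than compiler options: flags must also be passed to the linker. For example, to drop unused functions and data: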

# put every function and data object in its own section, so the linker can discard unused ones
CFLAGS += -ffunction-sections -fdata-sections

# let the linker discard unused sections
LDFLAGS += -Wl,--gc-sections
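To see what the linker flags add, consider a translation unit with one function that is called and one that is not. This is a minimal sketch; the file name util.c and both function names are hypothetical. Because unused_helper has external linkage, the compiler alone cannot delete it (another object file might call it). Only the linker, which sees the whole program, can drop it.

```c
/* util.c (hypothetical): compiled with -ffunction-sections, each function
   below is placed in its own section (typically .text.<function name>). */

/* Called from main(): survives any kind of dead code removal. */
int used_helper(int x)
{
    return 2 * x;
}

/* Never called anywhere. Its external linkage keeps the compiler from
   removing it; only ld --gc-sections can discard its section. */
int unused_helper(int x)
{
    return x * x + 1;
}
```

Compile it with gcc -O2 -ffunction-sections -fdata-sections -c util.c, link it with a main.o whose main() only calls used_helper, and pass -Wl,--gc-sections. Adding -Wl,--print-gc-sections makes GNU ld list the sections it discarded, which should include the one holding unused_helper.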

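Much of this cross-module removal actually relies on -flto (link-time optimization). To disable a particular optimization, use the corresponding -fno-<option> form (for example -fno-tree-dce or -fno-dce), or declare the variables involved volatile.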
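A minimal sketch of the volatile workaround, using the same empty-loop shape as the example at the top. The function names are hypothetical. Both functions return the same value, but only the first loop is a candidate for removal: with optimization enabled, GCC can typically compute the loop's final value directly and then delete the loop, whereas the volatile counter forces every load and store of i to be emitted.

```c
/* Same shape as the loops in a(), b() and main(): the body is empty and
   only the final value of i is observable, so an optimizing build may
   replace the whole loop with its result (n, or 0 when n <= 0). */
int count_plain(int n)
{
    int i = 0;
    for (; i < n; ++i)
        ;
    return i;
}

/* volatile makes every read and write of i an observable side effect,
   so each iteration's load and store must actually be performed. */
int count_volatile(int n)
{
    volatile int i = 0;
    for (; i < n; ++i)
        ;
    return i;
}
```

Comparing the -O2 assembly of the two functions is an easy way to check which loop survived.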

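In GCC, dead code elimination runs mainly on two intermediate representations: GIMPLE in SSA (Static Single Assignment) form and RTL (Register Transfer Language). The IR dumps show that these passes are invoked several times during a compilation. In GCC 8.2.0 the DCE code lives mainly in two files, tree-ssa-dce.c and dce.c, which implement DCE on SSA and on RTL respectively.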

1. tree-ssa-dce.c

The dead code elimination algorithm in this file has three phases:

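  • Mark every statement known to be necessary, e.g. most function calls and stores to memory
  • Propagate necessity, e.g. to the statements that define the operands of necessary statements
  • Remove the dead statements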
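The three phases can be sketched on a toy straight-line IR in which statement i defines value i and reads at most two earlier values. This is a simplified model under stated assumptions (no control flow, no PHIs, no memory dependences), not GCC's data structures; ToyStmt and toy_dce are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_USES 2

/* Hypothetical toy statement: statement i defines value i. */
typedef struct {
    int uses[TOY_MAX_USES]; /* indices of statements whose values are read; -1 = empty slot */
    bool has_side_effect;   /* call, store to memory, return, ... */
    bool necessary;         /* result of phases 1 and 2 */
    bool deleted;           /* result of phase 3 */
} ToyStmt;

/* Runs mark / propagate / sweep over n statements. Returns the number of
   deleted statements, or (size_t)-1 if the worklist cannot be allocated. */
size_t toy_dce(ToyStmt *stmts, size_t n)
{
    if (n == 0)
        return 0;

    /* Each statement is pushed at most once, so n slots are enough. */
    size_t *worklist = malloc(n * sizeof *worklist);
    if (worklist == NULL)
        return (size_t)-1;
    size_t top = 0;

    /* Phase 1: mark obviously necessary statements and queue them. */
    for (size_t i = 0; i < n; ++i) {
        stmts[i].necessary = stmts[i].has_side_effect;
        stmts[i].deleted = false;
        if (stmts[i].necessary)
            worklist[top++] = i;
    }

    /* Phase 2: a necessary statement makes the definitions of its operands necessary. */
    while (top > 0) {
        const ToyStmt *s = &stmts[worklist[--top]];
        for (int k = 0; k < TOY_MAX_USES; ++k) {
            int d = s->uses[k];
            if (d >= 0 && (size_t)d < n && !stmts[d].necessary) {
                stmts[d].necessary = true;
                worklist[top++] = (size_t)d;
            }
        }
    }

    /* Phase 3: everything that was never marked is dead. */
    size_t removed = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!stmts[i].necessary) {
            stmts[i].deleted = true;
            ++removed;
        }
    }

    free(worklist);
    return removed;
}
```

For example, in a = 1; b = a + 1; c = a * 2; return b;, only c = a * 2 is deleted: the return is obviously necessary, it makes b necessary, and b makes a necessary.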

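The main operations are: marking statements and operands as necessary, processing necessary operands, and removing or replacing dead statements and PHI nodes. The key functions are: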

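  1. mark_stmt_if_obviously_necessary dispatches on the statement's GIMPLE code and calls mark_stmt_necessary. mark_stmt_necessary marks the statement as necessary; if ADD_TO_WORKLIST is true and the statement was not already marked, it also pushes it onto the worklist and records its basic block in bb_contains_live_stmts: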
gimple_set_plf (stmt, STMT_NECESSARY, true);
if (add_to_worklist)
    worklist.safe_push (stmt);
if (add_to_worklist && bb_contains_live_stmts && !is_gimple_debug (stmt))
    bitmap_set_bit (bb_contains_live_stmts, gimple_bb (stmt)->index);

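The similar function mark_operand_necessary (tree op) does the same for operands. mark_last_stmt_necessary marks the last statement of a basic block (typically its controlling branch) as necessary. mark_control_dependent_edges_necessary marks as necessary the edges on which a basic block is control dependent.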
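How a statement is classified as "obviously necessary" can be sketched as a switch on the statement kind, in the spirit of mark_stmt_if_obviously_necessary. The kinds and rules below are a hypothetical simplification; the real function handles many more GIMPLE codes and special cases, such as allocation calls and debug statements.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified statement kinds. */
typedef enum {
    TOY_ASSIGN,       /* x_1 = y_2 + 1: register-only computation         */
    TOY_GLOBAL_STORE, /* store to global or escaped memory                */
    TOY_CALL,         /* call to a function that is neither const nor pure */
    TOY_PURE_CALL,    /* const/pure call: only its result matters         */
    TOY_COND,         /* conditional branch                               */
    TOY_RETURN,
    TOY_ASM,
    TOY_LABEL
} ToyKind;

/* Returns true if the statement must be kept even if nothing uses its
   result. `aggressive` mirrors the parameter of perform_tree_ssa_dce. */
bool toy_obviously_necessary(ToyKind kind, bool has_volatile_ops, bool aggressive)
{
    if (kind == TOY_LABEL)
        return false;            /* labels are never necessary on their own */
    if (has_volatile_ops)
        return true;             /* volatile accesses must be preserved */

    switch (kind) {
    case TOY_GLOBAL_STORE:
    case TOY_CALL:
    case TOY_RETURN:
    case TOY_ASM:
        return true;
    case TOY_COND:
        /* Conservative mode keeps every branch up front; aggressive mode
           waits until control dependence shows a live statement needs it. */
        return !aggressive;
    default:                     /* TOY_ASSIGN, TOY_PURE_CALL */
        return false;
    }
}
```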

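  2. propagate_necessity propagates necessity to the operands of necessary statements and pushes the statements that define them onto the worklist. It is one of the central functions of the pass: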
/* PHI nodes are somewhat special in that each PHI alternative has
   data and control dependencies.  All the statements feeding the
   PHI node's arguments are always necessary.  In aggressive mode,
   we also consider the control dependent edges leading to the
   predecessor block associated with each PHI alternative as
   necessary.  */
gphi *phi = as_a <gphi *> (stmt);
size_t k;
for (k = 0; k < gimple_phi_num_args (stmt); k++)
  {
    tree arg = PHI_ARG_DEF (stmt, k);
    if (TREE_CODE (arg) == SSA_NAME)
      mark_operand_necessary (arg);
  }
......
if (aggressive && !degenerate_phi_p (stmt))
  {
    for (k = 0; k < gimple_phi_num_args (stmt); k++)
      {
        basic_block arg_bb = gimple_phi_arg_edge (phi, k)->src;

        if (gimple_bb (stmt)
            != get_immediate_dominator (CDI_POST_DOMINATORS, arg_bb))
          {
            if (!bitmap_bit_p (last_stmt_necessary, arg_bb->index))
              mark_last_stmt_necessary (arg_bb);
          }
        else if (arg_bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)
                 && !bitmap_bit_p (visited_control_parents,
                                   arg_bb->index))
          mark_control_dependent_edges_necessary (arg_bb, true);
      }
  }
......
/* Calls to functions that are merely acting as barriers
   or that only store to memory do not make any previous
   stores necessary.  */
if (callee != NULL_TREE
    && DECL_BUILT_IN_CLASS (callee) == BUILT_IN_NORMAL
    && (DECL_FUNCTION_CODE (callee) == BUILT_IN_MEMSET
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_MEMSET_CHK
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_MALLOC
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_ALIGNED_ALLOC
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_CALLOC
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_FREE
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_VA_END
        || ALLOCA_FUNCTION_CODE_P (DECL_FUNCTION_CODE (callee))
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_STACK_SAVE
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_STACK_RESTORE
        || DECL_FUNCTION_CODE (callee) == BUILT_IN_ASSUME_ALIGNED))
  continue;

/* Calls implicitly load from memory, their arguments
   in addition may explicitly perform memory loads.  */
mark_all_reaching_defs_necessary (stmt);
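The last excerpt says that a necessary call normally makes every store that may reach it necessary (the callee might read that memory), except for builtins that only act as barriers or only store to memory. A toy sketch of that rule for straight-line code follows; ToyMemStmt and toy_mark_reaching_stores are hypothetical, and the sketch ignores the alias information that the real walk over reaching definitions uses.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    TOY_MEM_STORE,    /* store to memory, not necessary by itself            */
    TOY_CALL_BARRIER, /* memset, malloc, free, va_end, stack save/restore ... */
    TOY_CALL_OTHER    /* any other call: it may read memory                  */
} ToyMemKind;

typedef struct {
    ToyMemKind kind;
    bool necessary;
} ToyMemStmt;

/* Processes the necessary call at index `call` of a straight-line sequence.
   Returns how many earlier stores were newly marked necessary. */
size_t toy_mark_reaching_stores(ToyMemStmt *stmts, size_t call)
{
    /* Barrier-like or store-only builtins read none of the earlier stores. */
    if (stmts[call].kind == TOY_CALL_BARRIER)
        return 0;

    /* Any other call may load from memory, so every earlier store is
       conservatively treated as reaching it (no alias information here). */
    size_t marked = 0;
    for (size_t i = 0; i < call; ++i) {
        if (stmts[i].kind == TOY_MEM_STORE && !stmts[i].necessary) {
            stmts[i].necessary = true;
            ++marked;
        }
    }
    return marked;
}
```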
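  3. remove_dead_phis, remove_dead_stmt and eliminate_unnecessary_stmts delete dead statements and PHI nodes. perform_tree_ssa_dce (bool aggressive) is the main DCE routine, and AGGRESSIVE controls how aggressive the algorithm is. In conservative mode, control dependence is ignored and every branch except the most trivially dead ones is declared necessary; this mode is fast. In aggressive mode, control dependence is taken into account, which removes more dead code at the cost of some compile time. The source is: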
static unsigned int
perform_tree_ssa_dce (bool aggressive)
{
  bool something_changed = 0;

  calculate_dominance_info (CDI_DOMINATORS);

  /* Preheaders are needed for SCEV to work.
     Simple lateches and recorded exits improve chances that loop will
     proved to be finite in testcases such as in loop-15.c and loop-24.c  */
  bool in_loop_pipeline = scev_initialized_p ();
  if (aggressive && ! in_loop_pipeline)
    {
      scev_initialize ();
      loop_optimizer_init (LOOPS_NORMAL
			   | LOOPS_HAVE_RECORDED_EXITS);
    }

  tree_dce_init (aggressive);

  if (aggressive)
    {
      /* Compute control dependence.  */
      calculate_dominance_info (CDI_POST_DOMINATORS);
      cd = new control_dependences ();

      visited_control_parents =
	sbitmap_alloc (last_basic_block_for_fn (cfun));
      bitmap_clear (visited_control_parents);

      mark_dfs_back_edges ();
    }

  find_obviously_necessary_stmts (aggressive);

  if (aggressive && ! in_loop_pipeline)
    {
      loop_optimizer_finalize ();
      scev_finalize ();
    }

  longest_chain = 0;
  total_chain = 0;
  nr_walks = 0;
  chain_ovfl = false;
  visited = BITMAP_ALLOC (NULL);
  propagate_necessity (aggressive);
  BITMAP_FREE (visited);

  something_changed |= eliminate_unnecessary_stmts ();
  something_changed |= cfg_altered;

  /* We do not update postdominators, so free them unconditionally.  */
  free_dominance_info (CDI_POST_DOMINATORS);

  /* If we removed paths in the CFG, then we need to update
     dominators as well.  I haven't investigated the possibility
     of incrementally updating dominators.  */
  if (cfg_altered)
    free_dominance_info (CDI_DOMINATORS);

  statistics_counter_event (cfun, "Statements deleted", stats.removed);
  statistics_counter_event (cfun, "PHI nodes deleted", stats.removed_phis);

  /* Debugging dumps.  */
  if (dump_file && (dump_flags & (TDF_STATS|TDF_DETAILS)))
    print_stats ();

  tree_dce_done (aggressive);

  if (something_changed)
    {
      free_numbers_of_iterations_estimates (cfun);
      if (in_loop_pipeline)
	scev_reset ();
      return TODO_update_ssa | TODO_cleanup_cfg;
    }
  return 0;
}
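  4. This file defines two passes, pass_dce and pass_cd_dce. The entry function of pass_dce is tree_ssa_dce; the pass runs when flag_tree_dce != 0, and its GIMPLE dump name is dce.
    The entry function of pass_cd_dce is tree_ssa_cd_dce; it is gated on the same condition, flag_tree_dce != 0, and its GIMPLE dump name is cddce. tree_ssa_dce calls perform_tree_ssa_dce in conservative mode, while tree_ssa_cd_dce uses aggressive mode when optimize >= 2.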
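The difference between conservative and aggressive mode can be sketched on a toy CFG. Assume each block records whether it already contains a necessary statement and which block's branch decides whether it executes (a single control-dependence parent, which is a simplification; real control dependence is a set computed from post-dominators). Conservative mode keeps every branch; aggressive mode keeps a branch only if some live statement is control dependent on it, directly or transitively. ToyBlock and toy_needed_branches are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy basic block. */
typedef struct {
    bool ends_in_branch;  /* last statement is a conditional branch        */
    bool has_live_stmt;   /* already contains a necessary statement        */
    int control_parent;   /* block whose branch decides whether this block
                             runs; -1 if none (single-parent simplification) */
} ToyBlock;

/* Decides which branches must be kept; fills branch_needed[0..n-1] and
   returns the number of kept branches. */
size_t toy_needed_branches(const ToyBlock *bbs, size_t n, bool aggressive,
                           bool *branch_needed)
{
    /* Conservative mode: every branch is necessary up front. */
    for (size_t b = 0; b < n; ++b)
        branch_needed[b] = bbs[b].ends_in_branch && !aggressive;

    if (aggressive) {
        /* A live statement needs the branch it is control dependent on, and
           that branch in turn needs its own controlling branch. Stopping at
           an already-marked block also terminates on loops. */
        for (size_t b = 0; b < n; ++b) {
            if (!bbs[b].has_live_stmt)
                continue;
            for (int p = bbs[b].control_parent;
                 p >= 0 && (size_t)p < n && !branch_needed[p];
                 p = bbs[p].control_parent)
                branch_needed[p] = true;
        }
    }

    size_t kept = 0;
    for (size_t b = 0; b < n; ++b)
        kept += branch_needed[b];
    return kept;
}
```

In an if/else diamond whose two arms contain only dead assignments, conservative mode keeps the branch in the entry block, while aggressive mode finds no live statement depending on it, so the branch and both arms become removable.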

2. dce.c

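On RTL, GCC's DCE works mainly on insns. An insn is not the same as an arbitrary rtx: an insn usually represents one instruction and is built from the RTL template of an instruction pattern. The main functions in dce.c are: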

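  1. deletable_insn_p returns true if INSN is an ordinary instruction that the DCE pass may delete. find_call_stack_args applies when ACCUMULATE_OUTGOING_ARGS is in effect: it tries to find all the stack stores of a CALL_INSN's arguments and returns true if all of them were found (so eliminating the call is safe), false otherwise.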

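  2. fast_dce (bool word_level) performs fast DCE once initialization is done. If WORD_LEVEL is true, it tracks liveness per word (word-level DCE); otherwise it works per pseudo register (pseudo-level DCE). Related functions include run_word_dce and run_fast_df_dce; part of their source: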

/* Fast byte level DCE.  */

void
run_word_dce (void)
{
  int old_flags;

  if (!flag_dce)
    return;

  timevar_push (TV_DCE);
  old_flags = df_clear_flags (DF_DEFER_INSN_RESCAN + DF_NO_INSN_RESCAN);
  df_word_lr_add_problem ();
  init_dce (true);
  fast_dce (true);
  fini_dce (true);
  df_set_flags (old_flags);
  timevar_pop (TV_DCE);
}

/* This is an internal call that is used by the df live register
   problem to run fast dce as a side effect of creating the live
   information.  The stack is organized so that the lr problem is run,
   this pass is run, which updates the live info and the df scanning
   info, and then returns to allow the rest of the problems to be run.

   This can be called by elsewhere but it will not update the bit
   vectors for any other problems than LR.  */

void
run_fast_df_dce (void)
{
  if (flag_dce)
    {
      /* If dce is able to delete something, it has to happen
	 immediately.  Otherwise there will be problems handling the
	 eq_notes.  */
      int old_flags =
	df_clear_flags (DF_DEFER_INSN_RESCAN + DF_NO_INSN_RESCAN);

      df_in_progress = true;
      rest_of_handle_fast_dce ();
      df_in_progress = false;

      df_set_flags (old_flags);
    }
}
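The backward, liveness-driven deletion performed by fast_dce, and the difference between word level and pseudo level, can be sketched on a single basic block. Assumptions, all hypothetical: every register is two words wide, word w of register r is bit 2r + w of a uint32_t mask, and the side_effects flag stands in for the checks deletable_insn_p makes. The real pass works over the whole CFG through the df framework.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy insn. Word w of register r is bit 2*r + w,
   so up to 16 double-word registers fit in a uint32_t. */
typedef struct {
    uint32_t def_words;  /* words written (0 = no register destination) */
    uint32_t use_words;  /* words read */
    bool side_effects;   /* stands in for "not deletable_insn_p" */
    bool deleted;        /* output */
} ToyInsn;

#define TOY_LOW_WORDS 0x55555555u  /* bit 2*r of every register */

/* Pseudo level: touching either word counts as touching the whole register. */
static uint32_t toy_any_word(uint32_t words)
{
    uint32_t regs = (words | (words >> 1)) & TOY_LOW_WORDS;
    return regs | (regs << 1);
}

/* Pseudo level: only a write of both words kills the register. */
static uint32_t toy_both_words(uint32_t words)
{
    uint32_t regs = words & (words >> 1) & TOY_LOW_WORDS;
    return regs | (regs << 1);
}

/* One backward walk over a basic block; live_out holds the words live at
   the end of the block. Returns the number of deleted insns. */
size_t toy_fast_dce(ToyInsn *insns, size_t n, uint32_t live_out, bool word_level)
{
    uint32_t live = word_level ? live_out : toy_any_word(live_out);
    size_t removed = 0;

    for (size_t i = n; i-- > 0;) {
        ToyInsn *insn = &insns[i];
        uint32_t def  = word_level ? insn->def_words : toy_any_word(insn->def_words);
        uint32_t kill = word_level ? insn->def_words : toy_both_words(insn->def_words);
        uint32_t use  = word_level ? insn->use_words : toy_any_word(insn->use_words);

        insn->deleted = !insn->side_effects && def != 0 && (def & live) == 0;
        if (insn->deleted) {
            /* Dead: its operands do not become live, so whole chains die. */
            ++removed;
            continue;
        }
        live = (live & ~kill) | use;
    }
    return removed;
}
```

For the block r0.lo = ...; r0.hi = ...; store r0.lo, word-level mode deletes the write to r0.hi because that word is never read, while pseudo-level mode has to keep it because part of r0 is live.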
  3. delete_unmarked_insns deletes every instruction that has not been marked. init_dce initializes the global variables for a new DCE pass, and fini_dce frees the data allocated by init_dce.

  4. This file defines two passes, pass_ud_rtl_dce and pass_fast_rtl_dce. The entry function of pass_ud_rtl_dce is rest_of_handle_ud_dce; its gate is optimize > 1 && flag_dce && dbg_cnt (dce_ud), and its RTL dump name is ud_dce.
    The entry function of pass_fast_rtl_dce is rest_of_handle_fast_dce; its gate is optimize > 0 && flag_dce && dbg_cnt (dce_fast), and its RTL dump name is rtl_dce.
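The gate conditions of the four DCE passes can be summarized in one small function that reports which passes would be enabled for a given optimization level and set of flags. This only models the gate expressions quoted above: dbg_cnt() is assumed to always allow the pass, and whether the pass manager reaches a pass at all (for example at -O0) is not modeled. toy_dce_gates and the TOY_PASS_* constants are hypothetical names.

```c
#include <stdbool.h>

/* Bit flags for the four DCE passes discussed above. */
enum {
    TOY_PASS_DCE     = 1 << 0,  /* pass_dce,          dump name "dce"     */
    TOY_PASS_CDDCE   = 1 << 1,  /* pass_cd_dce,       dump name "cddce"   */
    TOY_PASS_UD_DCE  = 1 << 2,  /* pass_ud_rtl_dce,   dump name "ud_dce"  */
    TOY_PASS_RTL_DCE = 1 << 3   /* pass_fast_rtl_dce, dump name "rtl_dce" */
};

/* Evaluates the gates; dbg_cnt() is assumed to always return true. */
unsigned toy_dce_gates(int optimize, bool flag_tree_dce, bool flag_dce)
{
    unsigned passes = 0;
    if (flag_tree_dce)
        passes |= TOY_PASS_DCE | TOY_PASS_CDDCE;
    if (optimize > 1 && flag_dce)
        passes |= TOY_PASS_UD_DCE;
    if (optimize > 0 && flag_dce)
        passes |= TOY_PASS_RTL_DCE;
    return passes;
}
```

With both flags enabled, optimize == 1 enables dce, cddce and rtl_dce but not ud_dce, which requires optimize > 1.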


References:
  • https://elinux.org/images/2/2d/ELC2010-gc-sections_Denys_Vlasenko.pdf
  • https://stackoverflow.com/questions/8988291/disabling-specific-optimizationdead-code-elimination-in-gcc-compiler