BITCODE DEMYSTIFIED

转载 2017年09月06日 16:48:45

A few months ago Apple announced a ‘new feature,’ called ‘Bitcode.’ In this article, I will try to answer the questions like what is Bitcode, what problems it aims to solve, what issues it introduces and so on.

What is Bitcode?

To answer this question let’s look at what compilers do for us. Here is a brief overview of compilation process:

  • Lexer: takes source code as an input and translates it into a stream of tokens;
  • Parser: takes stream of tokens as an input and translates it into an AST;
  • Semantic Analysis: takes an AST as an input, checks if a program is correct (method called with correct amount of parameters, method called on object actually exists and non-private, etc.), fills in ‘missing types’ (e.g.: let x = yx has type of y) and passes AST to the next phase;
  • Code Generation: takes an AST as an input and emits some high-level IR (intermediate representation);
  • Optimization: takes IR, makes optimizations and emits IR which is potentially faster and/or smaller;
  • AsmPrinter: another code generation phase, it takes IR and emits assembly for particular CPU;
  • Assembler: takes assembly and converts it into an object code (stream of 0s and 1s);
  • Linker: usually programs refer to already compiled routines from other programs (e.g.: printf) to avoid recompilation of the same code over and over. Until this phase these links do not have correct addresses, they are just placeholders. Linker’s job is to resolve those placeholders so that they point to the correct addresses of their corresponding routines.

You can find more details here: The Compiler.

In the modern world these phases are split into two parts: compiler frontend (lexerparsersemantic analysiscode generation) and compiler backend (optimizationasm printerassemblerlinker). This separation makes much sense for both language designers and hardware manufacturers. If you want to create a new programming language you ‘just’ need to implement a frontend, and you get all available optimizations and support of different CPUs for free. On the other hand, if you created a new chip, you ‘just’ need to extend the backend and you get all the available languages (frontends) support for your CPU.

Below you can see a picture that illustrates compilation process using Clang and LLVM:

This picture clearly demonstrates how communication between frontend and backend is done using IR, LLVM has it is own format, that can be encoded using LLVM bitstream file format - Bitcode.

Just to recall it explicitly - Bitcode is a bitstream representation of LLVM IR.

What problems Apple’s Bitcode aims to solve?

Again, we need to dive a bit deeper and look at how an OS runs programs. This description is not precise and is given just to illustrate the process. For more details I can recommend reading this article: How OS X Executes Applications.

OS X and iOS can run on different CPUs (i386x86_64armarm64, etc.), if you want to run a program on any OS X/iOS setup, then the program should contain object code for each platform. Here is how a binary might look like:

When you run a program, OS reads the ‘Table Of Contents’ and looks for a slice corresponding to the OS CPU. For instance, if you run operating system on x86_64, then OS will load object code for x86_64 into a memory and run the program.

What’s happening with other slices? Nothing, they just waste your disk space.

This is the problem Apple wants to solve: currently, all the apps on the AppStore contain object code for arm and arm64 CPUs. Moreover, third-party proprietary libraries or frameworks contain object code for i386x86_64arm and arm64, so you can use them to test the app on a device or simulator. (Can you imagine how many copies of Google Analytics for i386 you have in your pocket?)

UPD: I do not know why, but I was sure that final executable contains these slices as well (i386x86_64, etc.), but it seems they are stripped during the build phase.

Apple did not give us that many details about how the Bitcode and App Thinning works, so let me assume how it may look:

When you submit an app (including Bitcode) Apple’s ‘BlackBox’ recompiles it for each supported platform and drops any ‘useless’ object code, so AppStore has a copy of the app for each CPU. When an end user wants to install the app - she installs the only version for the particular processor, without any unused stuff.

Bitcode might save up to 50% of disk space per program.

UPD: Of course, I do not take in count resources, it is just about binary itself. For instance, an app I am working on currently has size ~40 megabytes (including assets, xibs. fonts), a size of a binary itself is ~16 megabytes. I checked sizes of each slice: ~7MB for armv7 and 9MB for arm64, if we crop just one of them, it will decrease the size of the app by ~20%.

What problems do Bitcode introduce?

The idea of Bitcode and recompiling for each platform looks really great, and it is a huge improvement, though it has downsides as well: the biggest one is security.

To get the benefits of Bitcode, you should submit your app including Bitcode (surprisingly). If you use some proprietary third-party library, then it also should contain Bitcode, hence as a maintainer of a proprietary library, you should distribute the library with Bitcode.

To recall: Bitcode is just another form of LLVM IR.

LLVM IR

Let’s write some code to see LLVM IR in action.

// main.c
extern int printf(const char *fmt, ...);

int main() {
  printf("Hello World\n");
  return 0;
}

Run the following:

clang -S -emit-llvm main.c

And you’ll have main.ll containing IR:

@.str = private unnamed_addr constant [13 x i8] c"Hello World\0A\00", align 1

; Function Attrs: nounwind ssp uwtable
define i32 @main() #0 {
  %1 = alloca i32, align 4
  store i32 0, i32* %1
  %2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([13 x i8]* @.str, i32 0, i32 0))
  ret i32 0
}

declare i32 @printf(i8*, ...) #1

What can we see here? It is a bit more verbose than original C code, but it is still much more readable than assembler. Malefactors will be much happier to work with this representation, than with disassembled version of a binary (and they do not even have to pay for tools such Hopper or IDA).

How could malefactor get the IR?

iOS and OS X executables have their own format - Mach-O (read Parsing Mach-O files for more details). Mach-O file contains several segments such as Read-Only Data, Code, Symbol Table, etc. One of those sections contain xar archive with Bitcode:

It is really easy to retrieve it automatically, here I wrote a simple C program that does just that: bitcode_retriever. The workflow is pretty straightforward. Let’s assume that some_binary is a Mach-O file that contains object code for two CPUs (arm and x86_64), and each object code is built using two source files:

$ bitcode_retriever some_binary
arm.xar
x86_64.xar
$ xar -xvf arm.xar
1
2
$ llvm-dis 1 # outputs 1.ll
$ llvm-dis 2 # outputs 2.ll

Bitcode does not store any information about original filenames but uses numbers instead (123, etc.). Also, probably you do not have llvm-disinstalled/built on your machine, but you can easily obtain it, see this article for more details: Getting Started with Clang/LLVM on OS X.

Another potential issue (can’t confirm it) - Bitcode thingie works only for iOS 9, so if you submit your app to the AppStore and it includes Bitcode, then malefactor can get the whole IR from your app using iOS 78 and jailbroken device.

I know only one way to secure the IR - obfuscation. This task is not trivial itself, and it requires even much more efforts if you want to introduce this phase into your Xcode-Driven development flow.

Summary

  • Bitcode is a bitstream file format for LLVM IR
  • one of its goals is to decrease a size of an app by eliminating unused object code
  • malefactor can obtain your app or library, retrieve the IR from it and steal your ‘secret algorithm.’

TTY解密(The TTY demystified)

TTY解密(The TTY demystified)The TTY subsystem is central to the design of Linux, and UNIX in general. ...
  • astrotycoon
  • astrotycoon
  • 2014年09月25日 19:51
  • 2586

Redis persistence demystified

The OS and the disk The first thing to consider is what we can expect from a database in terms of d...
  • cuidongdong1234
  • cuidongdong1234
  • 2014年06月08日 11:15
  • 346

Redis persistence demystified - part 1

关于Redis我的一部分工作是阅读博客,论坛以及twitter时间线(time line)。对于开发者来说,能够了解用户社区,非用户社区如果理解他正在开发的产品是非常重要的。据我所知,持久化特性是最易...
  • hanhuili
  • hanhuili
  • 2013年10月19日 18:01
  • 1070

CS231n Neural Networks Part 1: Setting up the Architecture

有这么几个有意思的知识1.sigmoid,tanh,relu,prelu以及maxout,不同的激活函数2.the deeper the better? 更深层次意味着更好的抽象。假设这个世界是可以...
  • Zhaohui1995_Yang
  • Zhaohui1995_Yang
  • 2016年08月19日 21:39
  • 266

Bitcode

Bitcode 是编译的应用程序的中间一种形式.你提交到AppStore上的应用程序包含的bitcode会在AppStore上被编译和链接.在未来包含bitcode的应用程序,会让Apple重新优化你...
  • soindy
  • soindy
  • 2015年09月17日 11:14
  • 4437

如何正确理解BitCode

前言 昨天去美团吹水,聊到 BitCode 话题,朋友说他们发现美团没有开启 Bitcode,下载下来的包还是二进制文件只有一个架构的状态。我感到非常不可思议,晚上就去越狱了我的 iPad mi...
  • Gatsbyi
  • Gatsbyi
  • 2016年03月19日 12:57
  • 683

静态库支持bitcode

参考:http://stackoverflow.com/questions/31233395/ios-library-to-bitcode -fembed-bitcode 如图:...
  • cafei111
  • cafei111
  • 2016年06月15日 11:18
  • 1374

[iOS]Xcode7.0关闭Bitcode编译

今天在iOS上编译原来开发的代码,出现了以下错误xxxx.o does not contain bitcode. You must rebuild it with bitcode enabled (...
  • yctccg
  • yctccg
  • 2016年08月16日 10:44
  • 2569

xcode7中Bitcode的介绍及配置

&用Xcode 7 beta 3在真机(iOS 8.3)上运行一下工程,结果发现工程编译不过。看了下问题,报的是以下错误: ld: ‘/Users/**/Framework/SDKs/Polym...
  • helloworld183
  • helloworld183
  • 2017年10月16日 14:09
  • 987

深究Xcode的bitcode设置

前言 做iOS开发的朋友们都知道,目前最新的Xcode7,新建项目默认就打开了bitcode设置.而且大部分开发者都被这个突如其来的bitcode功能给坑过导致项目编译失败,而这些因为bitco...
  • u011363981
  • u011363981
  • 2017年05月10日 11:52
  • 629
收藏助手
不良信息举报
您举报文章:BITCODE DEMYSTIFIED
举报原因:
原因补充:

(最多只允许输入30个字)