BITCODE DEMYSTIFIED

转载 2017年09月06日 16:48:45

A few months ago Apple announced a ‘new feature,’ called ‘Bitcode.’ In this article, I will try to answer the questions like what is Bitcode, what problems it aims to solve, what issues it introduces and so on.

What is Bitcode?

To answer this question let’s look at what compilers do for us. Here is a brief overview of compilation process:

  • Lexer: takes source code as an input and translates it into a stream of tokens;
  • Parser: takes stream of tokens as an input and translates it into an AST;
  • Semantic Analysis: takes an AST as an input, checks if a program is correct (method called with correct amount of parameters, method called on object actually exists and non-private, etc.), fills in ‘missing types’ (e.g.: let x = yx has type of y) and passes AST to the next phase;
  • Code Generation: takes an AST as an input and emits some high-level IR (intermediate representation);
  • Optimization: takes IR, makes optimizations and emits IR which is potentially faster and/or smaller;
  • AsmPrinter: another code generation phase, it takes IR and emits assembly for particular CPU;
  • Assembler: takes assembly and converts it into an object code (stream of 0s and 1s);
  • Linker: usually programs refer to already compiled routines from other programs (e.g.: printf) to avoid recompilation of the same code over and over. Until this phase these links do not have correct addresses, they are just placeholders. Linker’s job is to resolve those placeholders so that they point to the correct addresses of their corresponding routines.

You can find more details here: The Compiler.

In the modern world these phases are split into two parts: compiler frontend (lexerparsersemantic analysiscode generation) and compiler backend (optimizationasm printerassemblerlinker). This separation makes much sense for both language designers and hardware manufacturers. If you want to create a new programming language you ‘just’ need to implement a frontend, and you get all available optimizations and support of different CPUs for free. On the other hand, if you created a new chip, you ‘just’ need to extend the backend and you get all the available languages (frontends) support for your CPU.

Below you can see a picture that illustrates compilation process using Clang and LLVM:

This picture clearly demonstrates how communication between frontend and backend is done using IR, LLVM has it is own format, that can be encoded using LLVM bitstream file format - Bitcode.

Just to recall it explicitly - Bitcode is a bitstream representation of LLVM IR.

What problems Apple’s Bitcode aims to solve?

Again, we need to dive a bit deeper and look at how an OS runs programs. This description is not precise and is given just to illustrate the process. For more details I can recommend reading this article: How OS X Executes Applications.

OS X and iOS can run on different CPUs (i386x86_64armarm64, etc.), if you want to run a program on any OS X/iOS setup, then the program should contain object code for each platform. Here is how a binary might look like:

When you run a program, OS reads the ‘Table Of Contents’ and looks for a slice corresponding to the OS CPU. For instance, if you run operating system on x86_64, then OS will load object code for x86_64 into a memory and run the program.

What’s happening with other slices? Nothing, they just waste your disk space.

This is the problem Apple wants to solve: currently, all the apps on the AppStore contain object code for arm and arm64 CPUs. Moreover, third-party proprietary libraries or frameworks contain object code for i386x86_64arm and arm64, so you can use them to test the app on a device or simulator. (Can you imagine how many copies of Google Analytics for i386 you have in your pocket?)

UPD: I do not know why, but I was sure that final executable contains these slices as well (i386x86_64, etc.), but it seems they are stripped during the build phase.

Apple did not give us that many details about how the Bitcode and App Thinning works, so let me assume how it may look:

When you submit an app (including Bitcode) Apple’s ‘BlackBox’ recompiles it for each supported platform and drops any ‘useless’ object code, so AppStore has a copy of the app for each CPU. When an end user wants to install the app - she installs the only version for the particular processor, without any unused stuff.

Bitcode might save up to 50% of disk space per program.

UPD: Of course, I do not take in count resources, it is just about binary itself. For instance, an app I am working on currently has size ~40 megabytes (including assets, xibs. fonts), a size of a binary itself is ~16 megabytes. I checked sizes of each slice: ~7MB for armv7 and 9MB for arm64, if we crop just one of them, it will decrease the size of the app by ~20%.

What problems do Bitcode introduce?

The idea of Bitcode and recompiling for each platform looks really great, and it is a huge improvement, though it has downsides as well: the biggest one is security.

To get the benefits of Bitcode, you should submit your app including Bitcode (surprisingly). If you use some proprietary third-party library, then it also should contain Bitcode, hence as a maintainer of a proprietary library, you should distribute the library with Bitcode.

To recall: Bitcode is just another form of LLVM IR.

LLVM IR

Let’s write some code to see LLVM IR in action.

// main.c
extern int printf(const char *fmt, ...);

int main() {
  printf("Hello World\n");
  return 0;
}

Run the following:

clang -S -emit-llvm main.c

And you’ll have main.ll containing IR:

@.str = private unnamed_addr constant [13 x i8] c"Hello World\0A\00", align 1

; Function Attrs: nounwind ssp uwtable
define i32 @main() #0 {
  %1 = alloca i32, align 4
  store i32 0, i32* %1
  %2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([13 x i8]* @.str, i32 0, i32 0))
  ret i32 0
}

declare i32 @printf(i8*, ...) #1

What can we see here? It is a bit more verbose than original C code, but it is still much more readable than assembler. Malefactors will be much happier to work with this representation, than with disassembled version of a binary (and they do not even have to pay for tools such Hopper or IDA).

How could malefactor get the IR?

iOS and OS X executables have their own format - Mach-O (read Parsing Mach-O files for more details). Mach-O file contains several segments such as Read-Only Data, Code, Symbol Table, etc. One of those sections contain xar archive with Bitcode:

It is really easy to retrieve it automatically, here I wrote a simple C program that does just that: bitcode_retriever. The workflow is pretty straightforward. Let’s assume that some_binary is a Mach-O file that contains object code for two CPUs (arm and x86_64), and each object code is built using two source files:

$ bitcode_retriever some_binary
arm.xar
x86_64.xar
$ xar -xvf arm.xar
1
2
$ llvm-dis 1 # outputs 1.ll
$ llvm-dis 2 # outputs 2.ll

Bitcode does not store any information about original filenames but uses numbers instead (123, etc.). Also, probably you do not have llvm-disinstalled/built on your machine, but you can easily obtain it, see this article for more details: Getting Started with Clang/LLVM on OS X.

Another potential issue (can’t confirm it) - Bitcode thingie works only for iOS 9, so if you submit your app to the AppStore and it includes Bitcode, then malefactor can get the whole IR from your app using iOS 78 and jailbroken device.

I know only one way to secure the IR - obfuscation. This task is not trivial itself, and it requires even much more efforts if you want to introduce this phase into your Xcode-Driven development flow.

Summary

  • Bitcode is a bitstream file format for LLVM IR
  • one of its goals is to decrease a size of an app by eliminating unused object code
  • malefactor can obtain your app or library, retrieve the IR from it and steal your ‘secret algorithm.’

相关文章推荐

Linux终端解密(The TTY demystified)

对于Unix类系统,TTY的概念至关重要,然而却很难找到合适的文档来参考。偶遇一片洋文,恰好介绍TTY,觉得不错,转载过来了。The TTY subsystem is central to the d...
  • smstong
  • smstong
  • 2016年05月30日 15:49
  • 6535

fdk-aac-0.1.3(包含bitcode)

  • 2016年06月17日 17:32
  • 30.15MB
  • 下载

Redis persistence demystified - part 1

关于Redis我的一部分工作是阅读博客,论坛以及twitter时间线(time line)。对于开发者来说,能够了解用户社区,非用户社区如果理解他正在开发的产品是非常重要的。据我所知,持久化特性是最易...

Web Servers Demystified

original adress : http://www.macronimous.com/resources/web_servers_demystifyed.asp What is a web s...
  • dingxy
  • dingxy
  • 2011年07月12日 11:25
  • 672

SOCK_RAW Demystified

// SOCK_RAW Demystified by ithilgore - ithilgore.ryu.L@gmail.com sock-raw.org / sock-raw.homeunix...

[转载]The TTY demystified

The TTY demystified Real teletypes in the 1940s. The TTY subsystem is central to the d...
  • ThinkHY
  • ThinkHY
  • 2013年03月04日 21:57
  • 2085

Video Demystified

  • 2011年07月02日 15:42
  • 4.8MB
  • 下载

Video Demystified

  • 2011年11月12日 01:31
  • 9.56MB
  • 下载

Bitcode是什么,如何配置?

今天在一个麦子学院上看到一篇关于第三方库不包含bitcode就会报错的文章,感觉剖析得很详细,分享出来,希望可以对iOS初入门者有所帮助。下面我们就一起来看看吧。 用Xcode 7 b...
内容举报
返回顶部
收藏助手
不良信息举报
您举报文章:BITCODE DEMYSTIFIED
举报原因:
原因补充:

(最多只允许输入30个字)