Working Around JNI UTF-8 Strings

SnowMonkey777

Published 2020-01-09 16:52:30

Tags: JNI UTF-8 UTF-16 Unicode

Original link: http://banachowski.com/deprogramming/2012/02/working-around-jni-utf-8-strings/

Problem background (log)

01-07 17:08:19.999 12012-12037/com.daojia.jz.testimlib A/art: art/runtime/check_jni.cc:65] JNI DETECTED ERROR IN APPLICATION: input is not valid Modified UTF-8: illegal start byte 0xf0
01-07 17:08:19.999 12012-12037/com.daojia.jz.testimlib A/art: art/runtime/check_jni.cc:65] string: '{"data":{"default_text":"","modify_members":[{"authority":1,"forbidden":false,"nick_name":"??????????????????????????????????","nick_spell":"??????????????????????????????????","ses_top":false,"slient":true,"user_id":"313408","user_source":29763708}],"op_type":"modify_group_member","operator":{},"targets":[],"type":"operator_tip"},"name":"GROUPOPERATION"}'
01-07 17:08:19.999 12012-12037/com.daojia.jz.testimlib A/art: art/runtime/check_jni.cc:65] in call to NewStringUTF

Related concepts

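Java represents strings as UTF-16, a 16-bit Unicode encoding: every character in the Basic Multilingual Plane, Chinese or English, takes 2 bytes, and supplementary characters take 4 bytes (a surrogate pair).
JNI represents strings in Modified UTF-8, a variable-length Unicode encoding: ASCII characters generally take 1 byte and Chinese characters take 3 bytes.
C/C++ work with raw bytes: an ASCII character is one byte, and Chinese text is commonly GB2312-encoded, which uses two bytes per Chinese character.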
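
To make these sizes concrete, here is a minimal Java sketch that counts the bytes a few sample strings occupy in UTF-16 and in standard UTF-8. The class and method names (EncodingSizes, utf16Bytes, utf8Bytes) are made up for illustration, and the UTF-16 figure describes the String API's view (16-bit code units), not necessarily how a particular JVM stores strings in memory.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Each Java char is one 16-bit UTF-16 code unit, so two bytes per char.
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    // Size of the same text in standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                                          // U+0041
        String han = "\u4e2d";                                       // U+4E2D, a common Chinese character
        String supplementary = new String(Character.toChars(0x2070E)); // outside the BMP

        for (String s : new String[] {ascii, han, supplementary}) {
            System.out.println(s.codePointCount(0, s.length()) + " code point, "
                + utf16Bytes(s) + " UTF-16 bytes, "
                + utf8Bytes(s) + " UTF-8 bytes");
        }
    }
}
```

The ASCII letter comes out as 2 and 1 bytes, the Chinese character as 2 and 3, and the supplementary character as 4 and 4. The last case is the one that causes trouble in JNI.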

Java Native Interface enables Java code to call functions in a C or C++ library. When passing arguments through the JNI layer, Java String objects map to C UTF-8 character arrays. This post describes how to avoid JNI’s non-standard implementation of UTF-8.

Here’s a simple example that shows how to pass a string. First, the Java code declares the “native” function type. Here, the function is also static:

class Example {
  ...
  private static native void printString(String text);
  ...
  void examplePrintString() {
    String str = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";
    System.out.println("String = " + str);
    printString(str);
  }
}

To access the string, C++ needs to retrieve the bytes of the string using a function from the JNI library, GetStringUTFChars(), like so:

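#include <jni.h>
#include <cstdio>

// extern "C" keeps the symbol name unmangled so the JVM can find it
// (not needed if the generated JNI header is included).
extern "C" JNIEXPORT void JNICALL Java_Example_printString(JNIEnv *env, jclass, jstring text) {
  const char* text_input = env->GetStringUTFChars(text, NULL);
  if (text_input == NULL) return;  // out of memory; an exception is already pending
  for (int i = 0; text_input[i] != 0; ++i) {
    printf("jni[%d] = %x\n", i, ((const unsigned char *) text_input)[i]);
  }
  env->ReleaseStringUTFChars(text, text_input);
}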

In a sample run, I get the following output:

String = AêñüC
jni[0] = 41
jni[1] = c3
jni[2] = aa
jni[3] = c3
jni[4] = b1
jni[5] = c3
jni[6] = bc
jni[7] = 43

The five character string “AêñüC” is encoded in eight bytes under UTF-8, because three of the characters occupy two bytes each.

This works fine in this example. What isn’t yet apparent is that the UTF-8 strings generated by JNI are not standard, but are instead modified UTF-8. According to the JNI spec:

There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.
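
Both differences can be observed from plain Java, without any native code. The sketch below leans on DataOutputStream.writeUTF, which the Java documentation describes as writing modified UTF-8 after a two-byte length prefix. I am assuming that this is the same format JNI hands to native code. The class and helper names (ModifiedUtf8Demo, modifiedUtf8, hex) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encode a string as modified UTF-8 by way of DataOutputStream.writeUTF.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // Skip the two-byte length prefix that writeUTF puts in front.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String fourByte = new String(Character.toChars(0x2070E));
        for (String s : new String[] {nul, fourByte}) {
            System.out.println("standard: " + hex(s.getBytes(StandardCharsets.UTF_8))
                + " | modified: " + hex(modifiedUtf8(s)));
        }
    }
}
```

The null character is "00" in standard UTF-8 but "c0 80" in the modified form, and the supplementary character is "f0 a0 9c 8e" in standard UTF-8 but "ed a1 81 ed bc 8e" in the modified form.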

If there’s a technical reason JNI does not use standard UTF-8 format, I have not seen a discussion, and I cannot fathom why. A case may be made for the non-embedded nulls, but that’s easy to work around by relying on a length variable instead of null to mark the end. The avoidance of four-byte UTF-8 characters seems more mysterious.

Here’s an example of passing a valid four-byte character. The Java routine now passes in the following string:

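import java.nio.charset.StandardCharsets;

class Example {
  ...
  void examplePrintString() {
    // Standard UTF-8 encoding of the single code point U+2070E.
    byte[] bb = new byte[4];
    bb[0] = (byte) 0xf0;
    bb[1] = (byte) 0xa0;
    bb[2] = (byte) 0x9c;
    bb[3] = (byte) 0x8e;
    // The Charset overload avoids the checked UnsupportedEncodingException.
    String str = new String(bb, StandardCharsets.UTF_8);
    System.out.println("String = " + str);
    printString(str);
  }
}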

And the output is now:

String = <unprintable>*
jni[0] = ed
jni[1] = a1
jni[2] = 81
jni[3] = ed
jni[4] = bc
jni[5] = 8e

* This blog can’t handle that character.

The Java example sets the four bytes of the character explicitly, so it is obvious that this character was converted to a six-byte sequence: the two UTF-16 surrogates of the character, each encoded in the three-byte format.
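
For anyone who wants to check the arithmetic behind that output, here is a small sketch of how I understand the conversion: the code point is first split into a UTF-16 surrogate pair, and then each surrogate is written in the ordinary three-byte UTF-8 layout. The class and method names (SurrogateMath, toSurrogates, threeByteForm) are made up for illustration.

```java
class SurrogateMath {
    // Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;              // 20 significant bits remain
        char high = (char) (0xD800 + (v >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }

    // Three-byte UTF-8 layout applied to one 16-bit unit: 1110xxxx 10xxxxxx 10xxxxxx.
    static int[] threeByteForm(char c) {
        return new int[] {
            0xE0 | (c >> 12),
            0x80 | ((c >> 6) & 0x3F),
            0x80 | (c & 0x3F)
        };
    }

    public static void main(String[] args) {
        for (char unit : toSurrogates(0x2070E)) {
            int[] b = threeByteForm(unit);
            System.out.printf("%04x -> %02x %02x %02x%n", (int) unit, b[0], b[1], b[2]);
        }
    }
}
```

For U+2070E the surrogates are d841 and df0e, which become ed a1 81 and ed bc 8e, matching the six bytes printed by the native function.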

Suppose your native function relies on a string-processing library to manipulate the strings from the Java call. Suppose also that this library expects and produces standard UTF-8 encoding, because why would it not use the standard? It may react unpredictably when faced with non-standard, or more politely, “modified” encoding. At best, it discards the characters it can’t interpret. At worst, it crashes. The problem exists in the other direction too: when passing strings from native code back to Java, JNI definitely crashes if the bytes are not correctly modified UTF-8.

Chances are you’d never encounter this lurking problem, because four-byte characters seem sufficiently rare. But I wouldn’t want to rely on the scarcity of these characters to avoid a potential bug. As I’ve learned from running code that drives popular websites, once code runs on enough data, even the unlikeliest of bugs become commonplace.

So how to work around this without needing to write a converter in native code? Well, it turns out converting to UTF-8 in Java (as opposed to JNI) produces standard encoding. Therefore, the workaround is to convert in Java, and send a byte array in lieu of a String.

Now, the Java example looks like:

import java.nio.charset.StandardCharsets;

class Example {
  ...
  private static native void printBytes(byte[] text);
  ...
  void examplePrintString() {
    byte[] bb = new byte[4];
    bb[0] = (byte) 0xf0;
    bb[1] = (byte) 0xa0;
    bb[2] = (byte) 0x9c;
    bb[3] = (byte) 0x8e;
    // The Charset overloads avoid the checked UnsupportedEncodingException.
    String str = new String(bb, StandardCharsets.UTF_8);
    System.out.println("String = " + str);
    printBytes(str.getBytes(StandardCharsets.UTF_8)); // Do the conversion here.
  }
}

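#include <jni.h>
#include <cstdio>

extern "C" JNIEXPORT void JNICALL Java_Example_printBytes(JNIEnv *env, jclass, jbyteArray text) {
  jbyte* text_input = env->GetByteArrayElements(text, NULL);
  if (text_input == NULL) return;  // allocation failed; an exception is already pending
  jsize size = env->GetArrayLength(text);
  for (int i = 0; i < size; ++i) {
    printf("bytes[%d] = %x\n", i, ((const unsigned char *) text_input)[i]);
  }
  // The third argument is a jint mode, not a pointer. JNI_ABORT frees the
  // buffer without copying it back, which is fine because we only read it.
  env->ReleaseByteArrayElements(text, text_input, JNI_ABORT);
}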

Now this prints the following expected four bytes:

String = <unprintable>
bytes[0] = f0
bytes[1] = a0
bytes[2] = 9c
bytes[3] = 8e

When using a UTF-8 library in JNI, I prefer byte array over String when passing data from Java.
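
The same idea should work in the opposite direction, which is where errors like the NewStringUTF one in the log at the top come from: instead of building a jstring from standard UTF-8 in native code, return the raw bytes and decode them in Java. The sketch below shows only the Java side. The method name fetchTextBytes is hypothetical, and a plain Java stand-in replaces the native call so the example runs without a native library. On the native side I would expect something like NewByteArray plus SetByteArrayRegion to fill the array.

```java
import java.nio.charset.StandardCharsets;

class NativeTextBridge {
    // Hypothetical native declaration this sketch stands in for:
    //   private static native byte[] fetchTextBytes();
    // The native code would return standard UTF-8 bytes in a jbyteArray
    // instead of calling NewStringUTF on them.

    // Stand-in for the native call: the standard UTF-8 bytes of U+2070E.
    static byte[] fetchTextBytes() {
        return new byte[] {(byte) 0xf0, (byte) 0xa0, (byte) 0x9c, (byte) 0x8e};
    }

    // Decode in Java, where standard four-byte UTF-8 is understood.
    static String fetchText() {
        return new String(fetchTextBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = fetchText();
        System.out.println("code point = U+"
            + Integer.toHexString(text.codePointAt(0)).toUpperCase());
    }
}
```

The cost is one extra copy of the data, which seems a fair price for never having to produce modified UTF-8 by hand in C or C++.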
