在前一篇小文中,从字符串中识别出Emoji表情后,转为了内部UTF8编码的十六进制。可客户觉得这个码在其他环境不大方便转换回去,要求转为能识别的XML格式Unicode码。
例如,对于Emoji“👨👩👦👦”
原来是转为如下形式:\F0\9F\91\A8\E2\80\8D\F0\9F\91\A9\E2\80\8D\F0\9F\91\A7\E2\80\8D\F0\9F\91\A6
现在要求转为如下形式:👨‍👩‍👧‍👦
好吧,脑壳又开始疼了,还得继续挖!
在网上翻来翻去,没有找到现成的实现方法,了解到有个函数ASCIISTR()与提供Unicode相关,试了一下。
SELECT ASCIISTR('👨👨👧👦') FROM DUAL;
--\D83D\DC68\200D\D83D\DC68\200D\D83D\DC67\200D\D83D\DC66
对1Fxxx这种5位十六进制的Unicode字符,都转成了两个4位十六进制形式的UTF16编码\D83D\DC68。这不是客户需要的,还得接着挖。
翻到一篇C语句将UTF8码转为Unicode码的文章,大受启发,链接如下: https://www.cnblogs.com/cthon/p/9297232.html
学习了UTF8码由Unicode码转换的编码规则:
我们只需要逆着规则,解析UTF8码,转回来即可。
拿爸爸Emoji字符 “👨”举例:
UTF8字符集十六进制编码为:\F0\9F\91\A8
对应二进制位为:1111 0000 1001 1111 1001 0001 1010 1000
根据上图这个编码规则,去掉红色二进制代码后,将其他二进制编码拼起来。
其Unicode编码的二进制为: 000 01 1111 0100 01 10 1000
对应十六进制为:1F468
从上面转换过程来看,不是很复杂,而且作者还提供C语言的实现方式。一想,将C语言实现方法Copy到Oracle来,不就解决了?
当然,最后也确实是借鉴了作者C语言里的实现方式,只是颇为费了几番脑筋。C语言和Oracle的实现方式的主要差别在于,C语言中提供了对内部二进制码按bit操作的函数和方法,而Oracle最小的操作单位也是字节,无法按bit去操作(去掉头部添加码,将剩余码拼起来),这就是本次实现转换的难点。
最终我想到的是用数学运算方法来实现。
同样用上面这个4字节的Emoji字符 “👨”举例,二进制运算规则如下:
第1个字节:(1111 0000 - 1111 0000) * 1 000000 000000 000000
= 000 000000 000000 000000
第2个字节:(1001 1111 - 1000 0000) * 1 000000 000000
= 01 1111 000000 000000
第3个字节:(1001 0001 - 1000 0000) * 1 000000
= 01 0001 000000
第4个字节:(1010 1000 - 1000 0000)
= 10 1000
将上面4个字节二进制运算后的结果相加,即可得到其最后Uuicode二进制码:
000 000000 000000 000000
+ 011111 000000 000000
+ 010001 000000
+ 101000
= 000 011111 010001 101000
其他字节数的转换算法可以类比完成。
最终我写一个函数来实现UTF8转Unicode,代码贴出来,如下:
CREATE OR REPLACE FUNCTION F_UTF8_TO_UNICODE(Inp_STR IN VARCHAR2) RETURN VARCHAR2 IS
lv_emojiChar VARCHAR2(500 CHAR) := TRIM(Inp_STR);
lv_emojiCharUtf8Hex VARCHAR2(500 CHAR);
lv_emojiCharLengthB INTEGER;
lv_emojiUnicodeInt INTEGER;
lv_emojiCharUnicodeHex VARCHAR2(500 CHAR);
BEGIN
--DBMS_OUTPUT.PUT_LINE('lv_emojiChar: ' || lv_emojiChar);
IF lv_emojiChar IS NULL THEN
lv_emojiCharUnicodeHex := '';
ELSE
SELECT RAWTOHEX(lv_emojiChar), LENGTHB(lv_emojiChar)
INTO lv_emojiCharUtf8Hex, lv_emojiCharLengthB
FROM DUAL;
--DBMS_OUTPUT.PUT_LINE('lv_emojiCharUtf8Hex: ' || lv_emojiCharUtf8Hex);
--DBMS_OUTPUT.PUT_LINE('lv_emojiCharLengthB: ' || lv_emojiCharLengthB);
IF lv_emojiCharLengthB=1 THEN
-- 0xxxxxxx
-- -> 0xxxxxxx
NULL;
ELSIF lv_emojiCharLengthB=2 THEN
-- 110xxxxx 10xxxxxx
-- -> 000xxxxx000000 + 00xxxxxx
SELECT /* (110xxxxx-11000000)*2^6 */
(UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64)) * 64
/* 10xxxxxx-10000000 */
+ UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128
INTO lv_emojiUnicodeInt
FROM DUAL;
ELSIF lv_emojiCharLengthB=3 THEN
-- 1110xxxx 10xxxxxx 10xxxxxx
-- -> 0000xxxx000000000000 + 00xxxxxx000000 + 00xxxxxx
SELECT /* (1110xxxx-11100000)*2^6*2^6 */
(UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32)) * 64 * 64
/* (10xxxxxx-10000000)*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64
/*10xxxxxx-10000000*/
+ UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128
INTO lv_emojiUnicodeInt
FROM DUAL;
ELSIF lv_emojiCharLengthB=4 THEN
-- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
-- -> 00000xxx000000000000000000 + 00xxxxxx000000000000 + 00xxxxxx000000 + 00xxxxxx
SELECT /* (11110xxx-11110000)*2^6*2^6*2^6 */
(UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32+16)) * 64 * 64 * 64
/* (10xxxxxx-10000000)*2^6*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64 * 64
/* (10xxxxxx-10000000)*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128) * 64
/*10xxxxxx-10000000*/
+ UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,4,1)) - 128
INTO lv_emojiUnicodeInt
FROM DUAL;
ELSIF lv_emojiCharLengthB=5 THEN
--111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
-- -> 000000xx000000000000000000000000 + 00xxxxxx000000000000000000 + 00xxxxxx000000000000 + 00xxxxxx000000 + 00xxxxxx
SELECT /* (111110xx-11111000)*2^6*2^6*2^6*2^6 */
(UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32+16+8)) * 64 * 64 * 64 * 64
/* (10xxxxxx-10000000)*2^6*2^6*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64 * 64 * 64
/* (10xxxxxx-10000000)*2^6*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128) * 64 * 64
/* (10xxxxxx-10000000)*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,4,1)) - 128) * 64
/*10xxxxxx-10000000*/
+ UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,5,1)) - 128
INTO lv_emojiUnicodeInt
FROM DUAL;
ELSIF lv_emojiCharLengthB=6 THEN
--1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
-- -> 000000xx000000000000000000000000000000 + 00xxxxxx000000000000000000000000 + 00xxxxxx000000000000000000 + 00xxxxxx000000000000 + 00xxxxxx000000 + 00xxxxxx
SELECT /* (1111110x-11111100)*2^6*2^6*2^6*2^6*2^6 */
(UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32+16+8)) * 64 * 64 * 64 * 64 * 64
/* (10xxxxxx-10000000)*2^6*2^6*2^6*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64 * 64 * 64 * 64
/* (10xxxxxx-10000000)*2^6*2^6*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128) * 64 * 64 * 64
/* (10xxxxxx-10000000)*2^6*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,4,1)) - 128) * 64 * 64
/* (10xxxxxx-10000000)*2^6 */
+ (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,5,1)) - 128) * 64
/*10xxxxxx-10000000*/
+ UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,6,1)) - 128
INTO lv_emojiUnicodeInt
FROM DUAL;
END IF;
--DBMS_OUTPUT.PUT_LINE('lv_emojiUnicodeInt: ' || lv_emojiUnicodeInt);
IF lv_emojiCharLengthB = 1 THEN
lv_emojiCharUnicodeHex := lv_emojiCharUtf8Hex;
ELSE
SELECT LTRIM(UTL_RAW.cast_from_binary_integer(lv_emojiUnicodeInt), '0')
INTO lv_emojiCharUnicodeHex
FROM DUAL;
END IF;
--DBMS_OUTPUT.PUT_LINE('lv_emojiCharUnicodeHex: ' || lv_emojiCharUnicodeHex);
END IF;
RETURN lv_emojiCharUnicodeHex;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE(SQLERRM);
--}
END F_UTF8_TO_UNICODE;
/
- -放开脚本中的put_line,测试中打印的日志如下:
SELECT F_UTF8_TO_UNICODE('1') FROM DUAL;
lv_emojiChar: 1
lv_emojiCharUTF8hex: 31
lv_emojiCharLengthB: 1
lv_emojiCharXmlHex: 31
SELECT F_UTF8_TO_UNICODE('®') FROM DUAL;
lv_emojiChar: ®
lv_emojiCharUTF8hex: C2AE
lv_emojiCharLengthB: 2
lv_emojiUnicodeInt: 174
lv_emojiCharXmlHex: AE
SELECT F_UTF8_TO_UNICODE('□') FROM DUAL;
lv_emojiChar: □
lv_emojiCharUTF8hex: E296A1
lv_emojiCharLengthB: 3
lv_emojiUnicodeInt: 9633
lv_emojiCharXmlHex: 25A1
SELECT F_UTF8_TO_UNICODE('👨') FROM DUAL;
lv_emojiChar: 👨
lv_emojiCharUTF8hex: F09F91A8
lv_emojiCharLengthB: 4
lv_emojiUnicodeInt: 128104
lv_emojiCharXmlHex: 1F468
为了学习Oralce 对UTF8编码的单个字节的处理,学习了UTL_RAW包中所有的函数,链接如下: 【ORACLE】详解oracle数据库UTL_RAW包各个函数的模拟算法_oracle utl_raw_DarkAthena的博客-CSDN博客
望大家尊重原创,在转载时注明出处,谢谢!