Oracle 中字符UTF8编码转为Unicode编码

        在前一篇小文中,从字符串中识别出Emoji表情后,转为了内部UTF8编码的十六进制。可客户觉得这个码在其他环境不大方便转换回去,要求转为能识别的XML格式Unicode码。

        例如,对于Emoji“👨‍👩‍👦‍👦”

        原来是转为如下形式:\F0\9F\91\A8\E2\80\8D\F0\9F\91\A9\E2\80\8D\F0\9F\91\A7\E2\80\8D\F0\9F\91\A6

        现在要求转为如下形式:👨‍👩‍👧‍👦

       

         好吧,脑壳又开始疼了,还得继续挖!

        在网上翻来翻去,没有找到现成的实现方法,了解到有个函数ASCIISTR()与提供Unicode相关,试了一下。

        SELECT ASCIISTR('👨‍👨‍👧‍👦') FROM DUAL;

        --\D83D\DC68\200D\D83D\DC68\200D\D83D\DC67\200D\D83D\DC66

        对1Fxxx这种5位十六进制的Unicode字符,都转成了两个4位十六进制形式的UTF16编码\D83D\DC68。这不是客户需要的,还得接着挖。

        翻到一篇C语句将UTF8码转为Unicode码的文章,大受启发,链接如下:        https://www.cnblogs.com/cthon/p/9297232.html

        学习了UTF8码由Unicode码转换的编码规则:

        我们只需要逆着规则,解析UTF8码,转回来即可。

        拿爸爸Emoji字符 “👨”举例:

        UTF8字符集十六进制编码为:\F0\9F\91\A8

        对应二进制位为:1111 0000  1001 1111  1001 0001  1010 1000             

  

        根据上图这个编码规则,去掉红色二进制代码后,将其他二进制编码拼起来。

        其Unicode编码的二进制为: 000 01   1111 0100   01 10 1000

        对应十六进制为:1F468

        从上面转换过程来看,不是很复杂,而且作者还提供C语言的实现方式。一想,将C语言实现方法Copy到Oracle来,不就解决了?

        当然,最后也确实是借鉴了作者C语言里的实现方式,只是颇为费了几番脑筋。C语言和Oracle的实现方式的主要差别在于,C语言中提供了对内部二进制码按bit操作的函数和方法,而Oracle最小的操作单位也是字节,无法按bit去操作(去掉头部添加码,将剩余码拼起来),这就是本次实现转换的难点。

        最终我想到的是用数学运算方法来实现。

        同样用上面这个4字节的Emoji字符 “👨”举例,二进制运算规则如下:

        第1个字节:(1111 0000 - 1111 0000)  * 1 000000 000000 000000

                        =  000 000000 000000 000000

        第2个字节:(1001 1111 - 1000 0000)  * 1 000000 000000

                        =  01 1111 000000 000000

        第3个字节:(1001 0001 - 1000 0000)  * 1 000000

                        =  01 0001 000000

        第4个字节:(1010 1000 - 1000 0000) 

                        =  10 1000

        将上面4个字节二进制运算后的结果相加,即可得到其最后Uuicode二进制码:

                         000 000000 000000 000000

                      +         011111 000000 000000

                      +                     010001 000000

                      +                                  101000

                      =  000 011111 010001 101000

        其他字节数的转换算法可以类比完成。

        最终我写一个函数来实现UTF8转Unicode,代码贴出来,如下:

CREATE OR REPLACE FUNCTION F_UTF8_TO_UNICODE(Inp_STR IN VARCHAR2) RETURN VARCHAR2 IS

    lv_emojiChar            VARCHAR2(500 CHAR) := TRIM(Inp_STR);

    lv_emojiCharUtf8Hex     VARCHAR2(500 CHAR);

    lv_emojiCharLengthB     INTEGER;

    lv_emojiUnicodeInt      INTEGER;

    lv_emojiCharUnicodeHex  VARCHAR2(500 CHAR);

BEGIN

    --DBMS_OUTPUT.PUT_LINE('lv_emojiChar: ' || lv_emojiChar);

                IF lv_emojiChar IS NULL THEN

                                lv_emojiCharUnicodeHex := '';

                ELSE

                                SELECT RAWTOHEX(lv_emojiChar), LENGTHB(lv_emojiChar)

                                INTO lv_emojiCharUtf8Hex, lv_emojiCharLengthB

                                FROM DUAL;

                                --DBMS_OUTPUT.PUT_LINE('lv_emojiCharUtf8Hex: ' || lv_emojiCharUtf8Hex);

                                --DBMS_OUTPUT.PUT_LINE('lv_emojiCharLengthB: ' || lv_emojiCharLengthB);

                                IF lv_emojiCharLengthB=1 THEN

                                                -- 0xxxxxxx

                                                -- -> 0xxxxxxx

                                                NULL;

                                ELSIF lv_emojiCharLengthB=2 THEN

                                                -- 110xxxxx 10xxxxxx

                                                -- -> 000xxxxx000000 + 00xxxxxx

                                                SELECT /* (110xxxxx-11000000)*2^6 */

                                                                   (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64)) * 64 

                                                                   /* 10xxxxxx-10000000 */

                                                                 + UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128

                                                INTO lv_emojiUnicodeInt

                                                FROM DUAL;

                                ELSIF lv_emojiCharLengthB=3 THEN

                                                -- 1110xxxx 10xxxxxx 10xxxxxx

                                                -- -> 0000xxxx000000000000 + 00xxxxxx000000 + 00xxxxxx

                                                SELECT /* (1110xxxx-11100000)*2^6*2^6 */

                                                                   (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32)) * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64

                                                                   /*10xxxxxx-10000000*/

                                                                 + UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128

                                                INTO lv_emojiUnicodeInt

                                                FROM DUAL;

                                ELSIF lv_emojiCharLengthB=4 THEN

                                                -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

                                                -- -> 00000xxx000000000000000000 + 00xxxxxx000000000000 + 00xxxxxx000000 + 00xxxxxx

                                                SELECT /* (11110xxx-11110000)*2^6*2^6*2^6 */

                                                                   (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32+16)) * 64 * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128) * 64

                                                                   /*10xxxxxx-10000000*/

                                                                 + UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,4,1)) - 128

                                                INTO lv_emojiUnicodeInt

                                                FROM DUAL;

                                ELSIF lv_emojiCharLengthB=5 THEN

                                                --111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

                                                -- -> 000000xx000000000000000000000000 + 00xxxxxx000000000000000000 + 00xxxxxx000000000000 + 00xxxxxx000000 + 00xxxxxx

                                                SELECT /* (111110xx-11111000)*2^6*2^6*2^6*2^6 */

                                                                   (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32+16+8)) * 64 * 64 * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6*2^6*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64 * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128) * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,4,1)) - 128) * 64

                                                                   /*10xxxxxx-10000000*/

                                                                 + UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,5,1)) - 128

                                                INTO lv_emojiUnicodeInt

                                                FROM DUAL;

                                ELSIF lv_emojiCharLengthB=6 THEN

                                                --1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

                                                -- -> 000000xx000000000000000000000000000000 + 00xxxxxx000000000000000000000000 + 00xxxxxx000000000000000000 + 00xxxxxx000000000000 + 00xxxxxx000000 + 00xxxxxx

                                                SELECT /* (1111110x-11111100)*2^6*2^6*2^6*2^6*2^6 */

                                                                   (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,1,1)) - (128+64+32+16+8)) * 64 * 64 * 64 * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6*2^6*2^6*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,2,1)) - 128) * 64 * 64 * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6*2^6*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,3,1)) - 128) * 64 * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,4,1)) - 128) * 64 * 64

                                                                   /* (10xxxxxx-10000000)*2^6 */

                                                                 + (UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,5,1)) - 128) * 64

                                                                   /*10xxxxxx-10000000*/

                                                                 + UTL_RAW.cast_to_binary_integer(UTL_RAW.substr(lv_emojiCharUtf8Hex,6,1)) - 128

                                                INTO lv_emojiUnicodeInt

                                                FROM DUAL;

                                END IF;

                                --DBMS_OUTPUT.PUT_LINE('lv_emojiUnicodeInt: ' || lv_emojiUnicodeInt);

                                IF lv_emojiCharLengthB = 1 THEN

                                                lv_emojiCharUnicodeHex := lv_emojiCharUtf8Hex;

                                ELSE

                                                SELECT LTRIM(UTL_RAW.cast_from_binary_integer(lv_emojiUnicodeInt), '0')

                                                INTO lv_emojiCharUnicodeHex

                                                FROM DUAL;

        END IF;

                                --DBMS_OUTPUT.PUT_LINE('lv_emojiCharUnicodeHex: ' || lv_emojiCharUnicodeHex);

                               

                END IF;

                RETURN lv_emojiCharUnicodeHex;

EXCEPTION

                WHEN OTHERS THEN

                                DBMS_OUTPUT.PUT_LINE(SQLERRM);

--}

END F_UTF8_TO_UNICODE;

/

- -放开脚本中的put_line,测试中打印的日志如下:

SELECT F_UTF8_TO_UNICODE('1') FROM DUAL;

lv_emojiChar: 1

lv_emojiCharUTF8hex: 31

lv_emojiCharLengthB: 1

lv_emojiCharXmlHex: 31

SELECT F_UTF8_TO_UNICODE('®') FROM DUAL;

lv_emojiChar: ®

lv_emojiCharUTF8hex: C2AE

lv_emojiCharLengthB: 2

lv_emojiUnicodeInt: 174

lv_emojiCharXmlHex: AE

SELECT F_UTF8_TO_UNICODE('□') FROM DUAL;

lv_emojiChar: □

lv_emojiCharUTF8hex: E296A1

lv_emojiCharLengthB: 3

lv_emojiUnicodeInt: 9633

lv_emojiCharXmlHex: 25A1

SELECT F_UTF8_TO_UNICODE('👨') FROM DUAL;

lv_emojiChar: 👨

lv_emojiCharUTF8hex: F09F91A8

lv_emojiCharLengthB: 4

lv_emojiUnicodeInt: 128104

lv_emojiCharXmlHex: 1F468

        为了学习Oralce 对UTF8编码的单个字节的处理,学习了UTL_RAW包中所有的函数,链接如下:                【ORACLE】详解oracle数据库UTL_RAW包各个函数的模拟算法_oracle utl_raw_DarkAthena的博客-CSDN博客

        望大家尊重原创,在转载时注明出处,谢谢!

  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值