Display Width of a String

fortunewang

Article link: https://blog.csdn.net/FortuneWang/article/details/41575397

Character width database: http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
Character width documentation: http://www.unicode.org/reports/tr11/

The documentation defines the following display width classes for Unicode characters:
A : Ambiguous
F : Fullwidth
H : Halfwidth
N : Neutral
Na : Narrow
W : Wide

In general, quoting the report:
In a broad sense, wide characters include W, F,
and A (when in East Asian context), and narrow characters include N, Na, H,
and A (when not in East Asian context).

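The hardest class to handle is A; the typical examples are the Chinese quotation marks.
What Chinese input methods call "halfwidth Chinese quotation marks" are actually class A: in some contexts they are as wide as one Latin letter,
in others as wide as one Chinese character.

For my purposes, class A is treated as being as wide as one Chinese character.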
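As a quick way to experiment with these classes, here is a minimal sketch that uses Python's standard unicodedata module rather than the table generated below. The names WIDTHS and display_width and the ambiguous_wide parameter are made up for this illustration. The sketch ignores combining and zero-width characters, so it only approximates what a terminal or GUI would show.

```python
import unicodedata

# Cell widths per East Asian Width class; 'A' is decided by the caller.
WIDTHS = {'F': 2, 'W': 2, 'H': 1, 'N': 1, 'Na': 1}

def display_width(text, ambiguous_wide=True):
    """Sum the widths of the characters in text, in halfwidth cells."""
    total = 0
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        if cls == 'A':
            # Ambiguous characters are wide only in an East Asian context
            total += 2 if ambiguous_wide else 1
        else:
            total += WIDTHS[cls]
    return total

if __name__ == '__main__':
    # Show the class of a Latin letter, a Chinese character and a Chinese quote
    for ch in 'A中“':
        print(hex(ord(ch)), unicodedata.east_asian_width(ch))
    print(display_width('“中文”'))
    print(display_width('“中文”', ambiguous_wide=False))
```

Setting ambiguous_wide to False gives the non-East-Asian interpretation, which makes it easy to see how much the choice for class A changes the result.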

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Display width assigned to each East Asian Width class
East_Asian_Width = {
    'A': 2,
    'F': 2,
    'H': 1,
    'N': 1,
    'Na': 1,
    'W': 2
}

# Cached range: its upper bound and its width
jlast = 0
ealast = 1

def eaw_operation(i, j, ea, f):
    # i is the lower bound of the range; only the upper bound j is stored
    global jlast
    global ealast
    # If the width differs from the cached range, print the cached range
    # and start a new one; otherwise merge the current range into the cache
    if ealast != East_Asian_Width[ea]:
        print('    {{{:#x}, {}}},'.format(jlast, ealast), file=f)
        ealast = East_Asian_Width[ea]
    jlast = j

if __name__ == '__main__':
    with open('EastAsianWidth.txt', 'r', encoding='utf-8') as DATABASE, \
         open('east_asian_width.table', 'w', encoding='utf-8') as f:
        print('eaw_data east_asian_width[] = {', file=f)
        for line in DATABASE:
            # Drop the trailing comment and surrounding whitespace
            line = line.split('#', maxsplit=1)[0].strip()
            # Skip comment-only and blank lines
            if not line:
                continue
            # The range comes before the first semicolon, the class after it
            # (whitespace around the semicolon is tolerated)
            # The range is a single value or two values separated by dots
            # The first value is the lower bound i of the current range
            # The last value is the upper bound j of the current range
            crange, ea = line.split(';', maxsplit=1)
            crange = crange.strip().split('.')
            ea = ea.strip()
            i = int(crange[0], base=16)
            j = int(crange[-1], base=16)
            # If the range jlast+1 .. i-1 is missing, treat it as class N
            if i > jlast + 1:
                eaw_operation(jlast + 1, i - 1, 'N', f)
            eaw_operation(i, j, ea, f)
        # Extend the cache past the last line, up to 0x10ffff
        eaw_operation(jlast + 1, 0x10ffff, 'N', f)
        # Finally print the cached range once more
        print('    {{{:#x}, {}}}'.format(jlast, ealast), file=f)
        print('};', file=f, end='')

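After that, east_asian_width.table contains the Unicode ranges and their widths: each entry holds the upper bound of a range and the width shared by every character in that range.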
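To make the table layout concrete, the following sketch looks up a width in a list of (upper bound, width) pairs. The entries in TABLE are simplified, made-up values for illustration, not a copy of the generated file, and lookup_width is a hypothetical name. The idea is to find the first entry whose upper bound is not below the code point.

```python
from bisect import bisect_left

# Hypothetical, simplified table: (upper bound of range, width).
# The real entries come from east_asian_width.table.
TABLE = [
    (0x10FF, 1),
    (0x115F, 2),
    (0x2E7F, 1),
    (0xA4CF, 2),
    (0x10FFFF, 1),
]
UPPER_BOUNDS = [upper for upper, _ in TABLE]

def lookup_width(code_point):
    """Return the width of the range that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError('code point out of range')
    # Index of the first entry whose upper bound is >= code_point
    index = bisect_left(UPPER_BOUNDS, code_point)
    return TABLE[index][1]

if __name__ == '__main__':
    for ch in 'A中':
        print(ch, lookup_width(ord(ch)))
```

Because the last entry always ends at 0x10FFFF, every valid code point falls into some range and the lookup needs no "not found" branch.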

This is how I use it in my project:

#include "windows.h"
#include <stdexcept>

namespace {

typedef struct{
    int value;
    int width;
} eaw_data;

#include "east_asian_width.table"

const int TABLESIZE = sizeof(east_asian_width) / sizeof(eaw_data);

/**
 * @brief  Display width of a single character
 * @param  value  Unicode code point of the character
 * @return display width of the character, 1 or 2
**/
int eaw(int value)
{
    if(value < 0 || value > 0x10ffff)
        throw std::domain_error("Character value out of range");
    // Binary search for the first entry whose upper bound is >= value.
    // The last entry's bound is 0x10ffff, so such an entry always exists.
    int imin = 0;
    int imax = TABLESIZE - 1;
    while(imin < imax)
    {
        int imid = imin + (imax - imin) / 2;
        if(east_asian_width[imid].value < value)
        {
            imin = imid + 1;
        }
        else
        {
            imax = imid;
        }
    }
    return east_asian_width[imin].width;
}

} // anonymous namespace

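// Forward declaration: eaw_len_a below calls eaw_len_w
int eaw_len_w(const wchar_t *s);

/**
 * @brief  Compute the display width of a string
 * @param  s  the string, must be terminated by '\0'
 * @return display width of the string, in units of one halfwidth character
 *
 * The string is converted to Unicode first and then measured.
 * In ANSI projects the macro eaw_len is defined as eaw_len_a.
**/
int eaw_len_a(const char *s)
{
    wchar_t buff[MAX_PATH];
    // The call returns 0 on failure, e.g. when the result does not fit in buff
    if(::MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED | MB_USEGLYPHCHARS, s, -1,
        buff, MAX_PATH) == 0)
        throw std::runtime_error("String conversion failed");
    return eaw_len_w(buff);
}

/**
 * @brief  Compute the display width of a string
 * @param  s  the string, must be terminated by L'\0'
 * @return display width of the string, in units of one halfwidth character
 *
 * In Unicode projects the macro eaw_len is defined as eaw_len_w.
**/
int eaw_len_w(const wchar_t *s)
{
    int len = 0;
    for(const wchar_t *i = s; *i; ++i)
    {
        int value = static_cast<int>(*i);
        // wchar_t is UTF-16 on Windows: merge a surrogate pair into one code point
        if(value >= 0xD800 && value <= 0xDBFF && i[1] >= 0xDC00 && i[1] <= 0xDFFF)
        {
            value = 0x10000 + ((value - 0xD800) << 10) + (i[1] - 0xDC00);
            ++i;
        }
        len += eaw(value);
    }
    return len;
}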

Because this is a Windows project, the conversion is done with the Windows SDK; on other platforms, libiconv should do the job.
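The "convert to Unicode first, then measure" step does not depend on the platform. As a rough illustration, the sketch below does it with Python's built-in codecs instead of the Windows SDK or libiconv. The function name display_width_of_bytes and the choice of GBK and UTF-8 as sample encodings are assumptions made for this example.

```python
import unicodedata

def display_width_of_bytes(data, encoding):
    """Decode data with the given encoding, then sum the cell widths."""
    text = data.decode(encoding)
    # 'W', 'F' and 'A' count as two cells, as chosen in this article
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F', 'A') else 1
               for ch in text)

if __name__ == '__main__':
    sample = '中文abc'
    for encoding in ('gbk', 'utf-8'):
        raw = sample.encode(encoding)
        # The byte count depends on the encoding; the display width does not
        print(encoding, len(raw), display_width_of_bytes(raw, encoding))
```

Measuring after decoding keeps the result independent of how many bytes each encoding happens to use per character.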

eaw_len is a macro: it expands to eaw_len_w when UNICODE is defined and to eaw_len_a otherwise.

Then, at the call site:

// Assume one halfwidth character is 6px wide; cap the result at 20 characters (120px)
int len_visible = eaw_len(some_str);
len_visible = len_visible > 20 ? 120 : len_visible * 6;
