Python offset tips: converting character offsets to byte offsets (in Python)

Suppose I have a bunch of UTF-8 files whose contents I send to an external API as Unicode strings. The API operates on each Unicode string and returns a list of (character_offset, substr) tuples.

The output I need is the begin and end byte offset for each found substring. If I'm lucky the input text contains only ASCII characters (making character offset and byte offset identical), but this is not always the case. How can I find the begin and end byte offsets for a known begin character offset and substring?
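To make the mismatch concrete, here is a small sketch, assuming Python 3 `str` objects indexed by code point. The sample strings and the helper `byte_offset_of` are made up for illustration and are not part of the API described above.

```python
# Illustrative strings only; not taken from the real input files.
ascii_text = "hello world"
mixed_text = "na\u00efve caf\u00e9"  # "naïve café" with precomposed ï and é

def byte_offset_of(text, char_offset):
    """Byte offset in the UTF-8 encoding of text that corresponds to char_offset."""
    # Encode everything before the character offset and count the bytes.
    return len(text[:char_offset].encode("utf-8"))

# Pure ASCII: character offset and byte offset coincide.
print(byte_offset_of(ascii_text, 6))
# Non-ASCII: "ï" takes two bytes in UTF-8, so the byte offset runs ahead.
print(byte_offset_of(mixed_text, 6))
```

In the second string the character at offset 6 is the "c" of "café", but its byte offset is larger because the preceding "ï" occupies two bytes.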

I've answered this question myself, but look forward to other solutions to this problem that are more robust, more efficient, and/or more readable.

Solution

I'd solve this using a dictionary mapping character offsets to byte offsets and then looking up the offsets in that.

def get_char_to_byte_map(unicode_string):
    """
    Generates a dictionary mapping character offsets to byte offsets for unicode_string.
    """
    response = {}
    byte_offset = 0
    for char_offset, character in enumerate(unicode_string):
        response[char_offset] = byte_offset
        byte_offset += len(character.encode('utf-8'))
    # Also map the end-of-string offset, so a substring that ends at the
    # last character can still be looked up.
    response[len(unicode_string)] = byte_offset
    return response

char_to_byte_map = get_char_to_byte_map(text)

for character_offset, substring in api_response:
    begin_offset = char_to_byte_map[character_offset]
    end_offset = char_to_byte_map[character_offset + len(substring)]
    # do something

How this solution performs compared to yours depends heavily on the size of the input and the number of substrings involved. Local micro-benchmarking suggests that encoding each character of a text individually takes about 1000 times as long as encoding the entire text at once.
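Given that per-character encoding is the expensive part, one possible middle ground is to encode only the slices between the offsets that are actually needed. The following is a rough sketch rather than a benchmarked solution: the function name `byte_spans` is hypothetical, and it assumes Python 3 and that the API's character offsets are code point indices into the same string.

```python
def byte_spans(text, matches):
    """Return a (begin_byte, end_byte) pair for each (char_offset, substring) in matches."""
    # Every character offset we need to translate, in ascending order.
    needed = sorted(
        {off for off, sub in matches} | {off + len(sub) for off, sub in matches}
    )
    char_to_byte = {}
    prev_char = 0
    prev_byte = 0
    for char_offset in needed:
        # Encode only the slice since the previous needed offset, in one call.
        prev_byte += len(text[prev_char:char_offset].encode("utf-8"))
        prev_char = char_offset
        char_to_byte[char_offset] = prev_byte
    return [(char_to_byte[off], char_to_byte[off + len(sub)]) for off, sub in matches]

# Usage with made-up data: "naïve café" and two matches returned out of order.
sample = "na\u00efve caf\u00e9"
spans = byte_spans(sample, [(6, "caf\u00e9"), (0, "na\u00efve")])
print(spans)
```

Each character is still encoded exactly once, but in a few bulk calls instead of one call per character, and the dictionary holds only the offsets that are looked up.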
