Converting scel to txt to extract a word list

keeleylee

Category: python  Tags: python

Article link: https://blog.csdn.net/sunlilan/article/details/78761659

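This article describes how to use Python to convert the Sogou SCEL dictionary format to TXT. It covers the problems encountered across Python versions: comparing bytes with hexadecimal strings, the change to unichr in Python 3, the use of struct.unpack, and the switch from xrange to range to match Python 3 syntax.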

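I recently needed a word list to improve word segmentation, and found a ready-made Python script that converts Sogou scel dictionaries to txt:
http://blog.csdn.net/zhangzhenhu/article/details/7014271
When I ran it, the conversion failed because my Python version differed from the one the script was written for. After some changes it finally worked.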

bytes and hexadecimal strings

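import sys

# Header check as written in the original (Python 2) script.
f = open(file_name, 'rb')
data = f.read()
f.close()
# On Python 3, data is bytes and the right-hand side is str, so this test is always true.
if data[0:12] != "\x40\x15\x00\x00\x44\x43\x53\x01\x01\x00\x00\x00":
    print("Are you sure you selected a Sogou (.scel) dictionary?")
    sys.exit(0)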

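At first the program exited immediately, because the two sides of the if comparison were not equal.

They really do compare unequal, even though the notepad hex editor shows that the first 12 bytes of the file are indeed
\x40\x15\x00\x00\x44\x43\x53\x01\x01\x00\x00\x00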

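The file is opened in binary mode, so data = f.read() is a bytes object. In Python 3 a bytes object never compares equal to a str, which is why the check fails.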

>>> type(data)
<class 'bytes'>
>>> data[0:12]
b'@\x15\x00\x00DCS\x01\x01\x00\x00\x00'
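One way to repair the check on Python 3 is to write the expected header as a bytes literal. The sketch below is only an illustration under that assumption: the names `SCEL_MAGIC` and `is_scel` are hypothetical and do not come from the original script.

```python
# Magic header of a Sogou .scel file, written as a bytes literal (note the b prefix).
SCEL_MAGIC = b"\x40\x15\x00\x00\x44\x43\x53\x01\x01\x00\x00\x00"


def is_scel(data: bytes) -> bool:
    """Hypothetical helper: True if data starts with the .scel magic header."""
    return data[0:12] == SCEL_MAGIC


# The same 12 bytes as shown by the interpreter above.
header = b"@\x15\x00\x00DCS\x01\x01\x00\x00\x00"

# bytes compared with str: never equal on Python 3, even with identical escapes.
str_matches = header == "\x40\x15\x00\x00\x44\x43\x53\x01\x01\x00\x00\x00"

# bytes compared with bytes: equal when the byte values match.
bytes_matches = is_scel(header + b"rest of file")

print(str_matches, bytes_matches)
```

With the bytes literal the comparison looks at byte values only, so `b'@'` and `b'\x40'` are the same thing.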

For a bytes object b, b[0] is an int, while b[0:1] is a bytes object of length 1.
A bytes literal is written b'…'.
list(b) converts a bytes object into a list of integers.
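A short sketch of these three points, using the first seven bytes of the header as sample data (the variable names are only for illustration):

```python
b = b"@\x15\x00\x00DCS"

first = b[0]          # indexing gives an int (the byte value)
first_slice = b[0:1]  # slicing gives a bytes object of length 1
as_list = list(b)     # every byte as an int

print(type(first), first)
print(type(first_slice), first_slice)
print(as_list)
```

This is why a test such as `b[0] == b'@'` is false on Python 3, while `b[0:1] == b'@'` and `b[0] == 0x40` are both true.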

Two methods worth knowing:
bytes.fromhex(string): This bytes class method returns a bytes object, decoding the given string object. The string must contain two hexadecimal digits per byte, with ASCII whitespace being ignored.

bytes.hex(): Return a string object containing two hexadecimal digits for each byte in the instance.
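As a small illustration, the two methods can round-trip the 12-byte header seen in the hex editor. The variable names here are assumptions made for the example only:

```python
# Build the header from the hex digits shown in a hex editor; spaces are ignored.
magic = bytes.fromhex("40 15 00 00 44 43 53 01 01 00 00 00")

# Go back from bytes to a plain string of hex digits.
hex_text = magic.hex()

print(magic)
print(hex_text)
```

Both calls are handy when checking a file header by eye, since `hex()` gives the same digits that a hex editor displays.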