[Sogou & Baidu lexicons] Converting .bdict and .scel files to txt

Author: 是谁在学习

Category: Master's thesis. Tags: Baidu, natural language processing, artificial intelligence, Sogou, file format conversion

Article link: https://blog.csdn.net/qq_32760017/article/details/123262352

0. Background

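My thesis involves keyword extraction, which depends on good word segmentation, and that in turn requires a domain-specific dictionary.
However, the dictionary files downloaded from Sogou and Baidu cannot be processed directly and must first be converted to txt.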
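
To illustrate why a domain dictionary matters for segmentation, here is a minimal sketch. The post does not say which segmenter is used, so forward maximum matching is swapped in purely for illustration, and the name `fmm_segment` and the sample lexicon are hypothetical.

```python
def fmm_segment(text, lexicon, max_len=8):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == '__main__':
    # With the domain term in the lexicon it is kept as one token
    print(fmm_segment('自然语言处理很有趣', {'自然语言处理'}))
    # Without it the text falls apart into single characters
    print(fmm_segment('自然语言处理很有趣', set()))
```

In practice the lexicon would be the set of lines read from the converted txt file.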

1. Sogou and Baidu lexicons

https://pinyin.sogou.com/dict/
https://shurufa.baidu.com/dict_list

2. Converting Sogou .scel lexicon files to .txt

Online tool (tested and working): http://tools.bugscaner.com/sceltotxt/
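
As an offline alternative to the online tool, the word table of a .scel file can be walked directly. The sketch below is illustrative only. The 0x2628 offset and the record layout (homophone count, pinyin index list, then length-prefixed UTF-16-LE words, each followed by an extension block) are assumptions taken from community reverse-engineering, not from an official Sogou specification, and `parse_scel_words` is a hypothetical helper name.

```python
import struct

WORDS_OFFSET = 0x2628  # assumed start of the word table (unofficial)


def parse_scel_words(data):
    """Return the words found in the raw bytes of a .scel file."""
    words = []
    pos = WORDS_OFFSET
    while pos + 4 <= len(data):
        # Homophone count, then byte length of the pinyin index list
        same, py_len = struct.unpack_from('<HH', data, pos)
        pos += 4 + py_len  # skip the pinyin indices
        for _ in range(same):
            if pos + 2 > len(data):
                return words
            (word_len,) = struct.unpack_from('<H', data, pos)
            pos += 2
            word = data[pos:pos + word_len].decode('utf-16-le', errors='replace')
            words.append(word)
            pos += word_len
            if pos + 2 > len(data):
                return words
            (ext_len,) = struct.unpack_from('<H', data, pos)
            pos += 2 + ext_len  # skip the extension block
    return words
```

Typical use would be to read the whole file in binary mode, pass the bytes to `parse_scel_words`, and write one word per line to a txt file.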

3. Converting Baidu .bdict lexicon files to .txt

Python implementation

import os
import struct


class Baidu(object):
    def __init__(self, originfile):
        self.originfile = originfile
        self.lefile = originfile + '.le'
        # Replace the extension (.bdict) with .txt
        self.txtfile = os.path.splitext(originfile)[0] + '.txt'
        self.listwords = []

    # Byte-order swap: exchange the two bytes of every 16-bit unit
    def be2le(self):
        with open(self.originfile, 'rb') as of:
            contents = of.read()
        # Pad with one zero byte so the length is even
        if len(contents) % 2:
            contents += b'\x00'
        with open(self.lefile, 'wb') as lef:
            for i in range(0, len(contents), 2):
                lef.write(struct.pack('2B', contents[i + 1], contents[i]))
        print('Byte-swapped stream written')

    def le2txt(self):
        # Word data in a Baidu dictionary starts at byte offset 0x350;
        # the offset is applied to the raw bytes, not to a hex string
        with open(self.lefile, 'rb') as lef:
            data = lef.read()[0x350:]
        with open(self.txtfile, 'w', encoding='utf-8') as txtf:
            for i in range(0, len(data) - 1, 2):
                # Decode each 16-bit unit into a Chinese character, pinyin, or symbol;
                # undecodable units (lone surrogates) become U+FFFD and are skipped
                content = data[i:i + 2].decode('utf-16-be', errors='replace')
                # Keep only Chinese characters
                if '\u4e00' <= content <= '\u9fff':
                    self.listwords.append(content)
                else:
                    if self.listwords:
                        txtf.write(''.join(self.listwords) + '\n')
                    self.listwords = []
            # Flush a word that runs up to the end of the file
            if self.listwords:
                txtf.write(''.join(self.listwords) + '\n')
                self.listwords = []
        print('txt file written')


if __name__ == '__main__':
    path = 'your .bdict file'
    bd = Baidu(path)
    bd.be2le()
    bd.le2txt()
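
Swapping each byte pair and then decoding as UTF-16-BE yields the same characters as decoding the original bytes as UTF-16-LE, so the intermediate .le file is not strictly needed. Below is a minimal in-memory sketch of the same scan, assuming the 0x350 start offset mentioned in the code above. The function name `extract_cjk_runs` is hypothetical.

```python
def extract_cjk_runs(data, offset=0x350):
    """Collect runs of consecutive Chinese characters from UTF-16-LE data."""
    words, run = [], []
    for i in range(offset, len(data) - 1, 2):
        # Lone surrogates decode to U+FFFD, which simply ends the current run
        ch = data[i:i + 2].decode('utf-16-le', errors='replace')
        if '\u4e00' <= ch <= '\u9fff':
            run.append(ch)
        else:
            if run:
                words.append(''.join(run))
            run = []
    # Flush a run that reaches the end of the data
    if run:
        words.append(''.join(run))
    return words
```

To use it, read the .bdict file in binary mode, pass the bytes to `extract_cjk_runs`, and write the returned words to a UTF-8 text file, one per line.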

4. References

https://blog.csdn.net/qiuwen_521/article/details/122056981
