mdx和mdd格式的词典解析Android JNI方式实现

最新推荐文章于 2025-01-26 08:44:10 发布

tranquility-c

最新推荐文章于 2025-01-26 08:44:10 发布

阅读量6.4k

点赞数 2

分类专栏： Android JNI 文章标签： c++

本文链接：https://blog.csdn.net/tranquality_c/article/details/127120537

版权

Android JNI 专栏收录该内容

2 篇文章

订阅专栏

写在最前：本文将简单介绍一下Mdict词典文件的解析，只是我一点浅显的理解，如有不对，请各位大佬多多指正，侵删

前言

Mdict是一种较为广泛使用的电子词典，主要包含mdx和mdd两种格式，mdx中记录的是单词和单词释义，内容以html格式存放，mdd中存放的则是与之配套的资源文件，图片、音频等等，mdx和mdd两种文件内部的结构基本相同，只存在细微的差别（下文会详细介绍），本文只针对2.0版本的词典文件解析。
参考项目
https://github.com/Tuo-ZHANG/mdict-jni-query-library
https://github.com/csarron/mdict-analysis

一、mdx词典解析

mdx文件的内部结构如下：
mdx内部结构

词典解析

解析关键逻辑为

bool Mdict::init() {
  /* indexing... */
  bool bRet = this->read_header();
  if(bRet)
  {
    this->printhead();
    this->read_key_block_header();
    this->read_key_block_info();
    this->read_record_block_header();
    //  this->decode_record_block();
    // TODO delete this  this->decode_record_block(); // very slow!!!
  }
  return bRet;
}

read_header，读取header信息，文件起始，前四个字节存放了词典的info信息的长度，根据info长度按字节读取info信息，可以得到类似这种格式的内容

"<Dictionary GeneratedByEngineVersion=\"2.0\" RequiredEngineVersion=\"2.0\" Format=\"Html\" KeyCaseSensitive=\"No\" StripKey=\"Yes\" Encrypted=\"2\" RegisterBy=\"EMail\" Description=\"&lt;font size=5 color=red&gt;1122͋test&lt;/font&gt;\" Title=\"1122͋\"xQ\"xQ\"Q\"\" Encoding=\"UTF-8\" CreationDate=\"2022-9-22\" Compact=\"No\" Compat=\"No\" Left2Right=\"Yes\" DataSourceFormat=\"107\" StyleSheet=\"\"/>\r\n"

然后再读取各字段值，为后续解析做准备，GeneratedByEngineVersion即为词典的版本值，暂时只考虑2.0版本的词典解析
read_key_block_header，读取mdx的Entries数量，keyblock info信息，keyblock数量
read_key_block_info，主要实现keyblock info和keyblock数据的解码
read_record_block_header，读取record header信息record block信息

单词搜索

单词搜索关键代码

std::string Mdict::lookup(const std::string word) {
  try {
    // search word in key block info list
    long idx = this->reduce0(_s(word), 0, this->key_block_info_list.size());
    //            std::cout << "==> lookup idx " << idx << std::endl;
    if (idx >= 0) {
      // decode key block by block id
      std::vector<key_list_item*> tlist =
          this->decode_key_block_by_block_id(idx);
      // reduce word id from key list item vector to get the word index of key
      // list
//      long word_id = reduce1(tlist, word);
      key_list_item * pItem = nullptr;
      pItem = key_item_map[word];
      if (/*word_id >= 0*/pItem != nullptr) {
        // reduce search the record block index by word record start offset
        unsigned long record_block_idx = reduce2(pItem->record_start);
        // decode recode by record index
        auto vec = decode_record_block_by_rid(record_block_idx);
        //  for(auto it= vec.begin(); it != vec.end(); ++it){
        //   std::cout<<"word: "<<(*it).first<<" \n def:
        //   "<<(*it).second<<std::endl;
        //  }
        // reduce the definition by word
        std::string def = reduce3(vec, word);
        if(m_bIsMddFile)
        {
          //若当前解析的是mdd格式的词典文件，则将查询的资源文件的内容写入本地文件
          bool bRet = createResourceFile(word, def);
          if(!bRet)
            std::cout << "create local resource file failed, the filen name is : " << word << std::endl;
        }
        //释放内存
        for(auto tItem : key_item_map)
        {
          delete tItem.second;
          tItem.second = nullptr;
        }
        key_item_map.clear();
        return def;
      }
    }
  } catch (std::exception& e) {
    std::cout << "==> lookup error" << e.what() << std::endl;
  }
  return std::string();
}

大体逻辑是在词典初始化时得到的缓存列表key_block_info_list中搜索单词，判断单词对应的key block索引idx，然后对该key block解码，然后再用解码得到的key_list_item列表用二分法查找是否存在这个单词（但是这种查找方式要求词典中单词的存放顺序是有序的，按字母排列的，所以我这里改成了key_item_map键值对的方式），最后再藉由找到的item的record_block_idx索引，进一步的解码，找到对应的单词释义并返回，至此单词查询完成。

二、mdd词典解析

mdd文件的内部结构如下：
mdd内部结构
可以看到结构和mdx基本一致，解析流程也可以基本代码复用，但需要注意的是，mdx的编码一般是utf-8，解析也是以此字段值解析，而mdd则是须以utf-16解析，或可以特殊处理一下。

bool Mdict::read_header() {
...
// ---------- encoding ------------
  if (headinfo.find("Encoding") != headinfo.end() ||
      headinfo["Encoding"] == "" || headinfo["Encoding"] == "UTF-8") {
    this->encoding = ENCODING_UTF8;
  } else if (headinfo["Encoding"] == "GBK" ||
             headinfo["Encoding"] == "GB2312") {
    this->encoding = ENCODING_GB18030;
  } else if (headinfo["Encoding"] == "Big5" || headinfo["Encoding"] == "BIG5") {
    this->encoding = ENCODING_BIG5;
  } else if (headinfo["Encoding"] == "utf16" ||
             headinfo["Encoding"] == "utf-16") {
    this->encoding = ENCODING_UTF16;
  } else {
    this->encoding = ENCODING_UTF8;
  }
  if(m_bIsMddFile)
  {
    //mdd词典文件需用UTF16编码解析
    this->encoding = ENCODING_UTF16;
  }
  ...

其他

缓存key_list时，split_key_block函数也许特殊处理一下

if (this->encoding == 1 /* ENCODING_UTF16 */) {
      // TODO
//      throw std::runtime_error("NOT SUPPORT UTF16 YET");
        if(bHandleMdd && m_bIsMddFile)
        {
          key_text = le_bin_utf16_to_utf8(
                  (const char*)key_block, (key_start_idx + this->number_width),
                  static_cast<unsigned long>(key_end_idx - key_start_idx -
                                             this->number_width));
        }
        else
          key_text = be_bin_to_utf8(
                  (const char*)key_block, (key_start_idx + this->number_width),
                  static_cast<unsigned long>(key_end_idx - key_start_idx -
                                             this->number_width));
    } else if (this->encoding == 0 /* ENCODING_UTF8 */) {
      key_text = be_bin_to_utf8(
          (const char*)key_block, (key_start_idx + this->number_width),
          static_cast<unsigned long>(key_end_idx - key_start_idx -
                                     this->number_width));
    }

key_text 单词文本的转码使用le_bin_utf16_to_utf8方法代替

std::string le_bin_utf16_to_utf8(const char* bytes, int offset, int len) {
  std::u16string u16;
  {
    std::string strByte;
    for(int i = 0; i < len * sizeof(char); i++)
    {
      const char * tChar = bytes + offset * sizeof(char) + i;
      strByte.append(tChar);
    }
    u16 = u"";
    char16_t c16str[3] = u"\0";
    mbstate_t mbs;
    for (const auto& it: strByte){
      memset (&mbs, 0, sizeof (mbs));//set shift state to the initial state
      memmove(c16str, u"\0\0\0", 3);
      mbrtoc16 (c16str, &it, 3, &mbs);
      u16.append(std::u16string(c16str));
    }//for
  }

//  std::string u8 = conv16.to_bytes(u16);
  std::string u8 = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);
//  if (len > 0) std::free(cbytes);
  return u8;
}

与mdx的单词查询不同，mdd的key_text不是单词，而是资源文件名，即，mdd的查询须以资源文件名称搜索，如上文的单词搜索代码所示，对mdd文件特殊处理，最后查询到的def不再是单词释义，而是资源文件的内容

 std::string def = reduce3(vec, word);
        if(m_bIsMddFile)
        {
          //若当前解析的是mdd格式的词典文件，则将查询的资源文件的内容写入本地文件
          bool bRet = createResourceFile(word, def);
          if(!bRet)
            std::cout << "create local resource file failed, the filen name is : " << word << std::endl;
        }

createResourceFile可自行实现

题外话

若有mdx、mdd词典加解密需求，或可考虑用AES的CFB模式（加密前后字段长度不变）加密词典的前四个字节然后覆写这四个字节的方式实现，因为前四个字节存放的是header info的长度信息，如果这个值获取不正确，后续解析应该无法继续执行了，需解密的时候再解密前四个字节（但不覆写回去）读取到正确的值即可。

void Mdict::encryptMdict()
{
  char* head_size_buf = (char*)std::calloc(4, sizeof(char));
  readfile(0, 4, head_size_buf);

  // 加密词典文件的前四个字节，以达到文件加密的效果
  // 采用AES的CFB模式加密，以保证加密前后字段长度不变化
  unsigned char tmpChar[DECRYPTBYTELEN] = {0};
  for(int i = 0; i < DECRYPTBYTELEN; i++)
  {
    unsigned char tCh = *(head_size_buf + i);
    tmpChar[i] = tCh;
  }
  AESModeOfOperation moo;
  unsigned char iv[] = { 103,35,148,239,76,213,47,119,255,222,123,176,106,134,98,92 };
  unsigned char key[] = { 143,194,34,218,145,208,230,143,124,245,96,206,145,92,255,85 };
  unsigned char encryptOutput[DECRYPTBYTELEN] = { 0 };
  moo.set_key(key);
  moo.set_mode(MODE_CFB);
  moo.set_iv(iv);
  moo.Encrypt(tmpChar, DECRYPTBYTELEN, encryptOutput);
  //加密完的字段覆写到词典文件中
  std::fstream writeFile(filename, std::ios::in | std::ios::out);
  writeFile << encryptOutput;
  writeFile.close();

  std::free(head_size_buf);
}