Using MeCab with Python (Windows) + PDF Conversion

Original post, last modified 2022-11-29 23:57:40

License: CC 4.0 BY-SA

Tags: #python #windows #pdf #Mecab #Japanese

First published 2022-11-29 23:44:24

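I recently built a translation web page, mainly to help me study Japanese. Adding readings (furigana) to Japanese text turned out to be the hard part: I tried several approaches and none of them worked. In the end I settled on Python + MeCab to implement the feature.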

1. Japanese Word Segmentation

I installed mecab-python3 with the following command:

pip3 install mecab-python3

Then I ran the following code:

import MeCab

text = "天気がいいから、散歩しましょう"
mecab_tagger = MeCab.Tagger("-Owakati")
print(mecab_tagger.parse(text))

It failed with the following error:

Traceback (most recent call last):
  File "E:\Python env projects\JaToKana\venv\lib\site-packages\MeCab\__init__.py", line 133, in __init__
    super(Tagger, self).__init__(args)
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:/Python env projects/JaToKana/main.py", line 4, in <module>
    mecab_tagger = MeCab.Tagger("-Owakati")
  File "E:\Python env projects\JaToKana\venv\lib\site-packages\MeCab\__init__.py", line 135, in __init__
    raise RuntimeError(error_info(rawargs)) from ee
RuntimeError: 
----------------------------------------------------------

Failed initializing MeCab. Please see the README for possible solutions:

    https://github.com/SamuraiT/mecab-python3#common-issues

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

    https://github.com/SamuraiT/mecab-python3/issues

issueを英語で書く必要はありません。

------------------- ERROR DETAILS ------------------------
arguments: -Owakati
[ifs] no such file or directory: c:\mecab\mecabrc
----------------------------------------------------------

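After searching online, I found that this error is caused by a missing dictionary. The fix is to install the unidic-lite dictionary package with the following command: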

pip install unidic-lite

After running the code again, the segmentation problem was solved.
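
Since the error comes down to MeCab not finding any dictionary, it can help to check which dictionary packages are importable before constructing a Tagger. The following is only a minimal sketch using the standard library; the helper name installed_dictionaries is made up for illustration, and unidic_lite, unidic and ipadic are assumed to be the import names of the pip packages mentioned in this post.

```python
import importlib.util

# Import names of the dictionary packages discussed in this post (assumed).
CANDIDATES = ("unidic_lite", "unidic", "ipadic")


def installed_dictionaries(candidates=CANDIDATES):
    """Return the candidate packages that can be imported in this environment."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    found = installed_dictionaries()
    if found:
        print("Dictionary packages found:", ", ".join(found))
    else:
        print("No dictionary package found; try: pip install unidic-lite")
```

An empty result would be consistent with the "no such file or directory: c:\mecab\mecabrc" message shown above.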

2. Japanese Readings (1)

The form you find most often online is mecab_tagger = MeCab.Tagger("-Ochasen"), but from my own testing and from the official notes, the form that worked for me was MeCab.Tagger("-chasen").

Sure enough, it then complains that the mecabrc file under xxx/dicdir/ cannot be found.

Using the -chasen argument requires installing the following package:

pip install unidic

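Unlike unidic-lite, however, installing this package does not download the dictionary automatically. You need to run the following command to download it: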

python -m unidic download

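The download did not work for me, though; either the download URL no longer exists or it cannot be reached from mainland China. After some searching, I downloaded the official dictionary from the following site instead:
https://clrd.ninjal.ac.jp/unidic/
Any of the three UniDic variants offered there will do, although they differ somewhat. They are all large, roughly 400 to 500 MB each. After downloading, extract the archive and manually copy the dictionary files into the dicdir directory of the unidic package, overwriting what is there; if the directory does not exist, create it yourself. The mecabrc file must also be present, although it may be empty.
Once this is done, the following code runs normally: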

import MeCab

text = "私は張ですが、また明日想像指定です"
mecab_tagger = MeCab.Tagger("-chasen")
print(mecab_tagger.parse(text))
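
The manual step described above (making sure dicdir exists and contains a mecabrc file, which may be empty) could be scripted roughly as follows. This is a sketch under assumptions: ensure_mecabrc is a hypothetical helper, and the path you pass in is wherever the unidic package keeps its dicdir on your machine.

```python
from pathlib import Path


def ensure_mecabrc(dicdir):
    """Create dicdir if needed and make sure it contains a mecabrc file.

    Returns the path of the mecabrc file. An existing file is left untouched.
    """
    dicdir = Path(dicdir)
    dicdir.mkdir(parents=True, exist_ok=True)
    mecabrc = dicdir / "mecabrc"
    if not mecabrc.exists():
        mecabrc.touch()  # an empty file is enough, as noted above
    return mecabrc
```

The dictionary files themselves still have to be copied into that directory by hand; the helper only takes care of the directory and the mecabrc file.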

2. Japanese Readings (2)

Later I also came across the ipadic package, which seems to achieve the same result.
It needs to be installed first as well:

pip install ipadic

The dictionary is downloaded automatically with the installation. It is about 45 MB, which is fairly compact.

After that, the following code is all you need:

import MeCab
import ipadic

text = "私は張ですが、また明日想像指定です"
mecab_tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
print(mecab_tagger.parse(text))
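
To get from this output to readings, each line has to be split into the surface form and its feature list. The sketch below assumes the IPAdic layout used later in this post (surface, a tab, comma-separated features, with the katakana reading as the second-to-last field, and "EOS" closing the sentence). SAMPLE is a hand-written illustration of that layout, not output captured from a real run, and surface_and_reading is a hypothetical helper name.

```python
# Hand-written sample in the IPAdic output layout (illustrative only).
SAMPLE = (
    "天気\t名詞,一般,*,*,*,*,天気,テンキ,テンキ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "いい\t形容詞,自立,*,*,形容詞・イイ,基本形,いい,イイ,イイ\n"
    "EOS\n"
)


def surface_and_reading(parsed):
    """Return (surface, katakana reading or None) for each token line."""
    pairs = []
    for line in parsed.splitlines():
        if line == "EOS":
            break
        if "\t" not in line:
            continue  # skip anything that is not a token line
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        # Second-to-last field is the reading; "*" means no reading is known.
        reading = fields[-2] if len(fields) >= 2 else "*"
        pairs.append((surface, None if reading == "*" else reading))
    return pairs


if __name__ == "__main__":
    for surface, reading in surface_and_reading(SAMPLE):
        print(surface, reading)
```

With a real tagger, the string returned by mecab_tagger.parse(text) would take the place of SAMPLE.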

Then, with the help of the jaconv library, convert each word's reading to hiragana and the furigana feature is complete. Install command:

pip install jaconv

For detailed usage, please refer to the official documentation.
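
In practice jaconv.kata2hira does the conversion. To show what the conversion amounts to, here is a dependency-free sketch: the katakana letters from ァ to ヶ sit exactly 0x60 code points above their hiragana counterparts. This is not jaconv's implementation, and kata_to_hira is a name made up for the example.

```python
KATAKANA_START = ord("ァ")  # U+30A1
KATAKANA_END = ord("ヶ")    # U+30F6
OFFSET = ord("ァ") - ord("ぁ")  # distance between the two blocks (0x60)


def kata_to_hira(text):
    """Shift katakana letters down to hiragana; leave everything else alone."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


if __name__ == "__main__":
    print(kata_to_hira("テンキ"))
```

Characters outside that range, such as the long vowel mark ー, kanji, or Latin letters, pass through unchanged.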

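However, once the program is packaged with pyinstaller, it again reports that the dictionary directory cannot be found, and the path in the message made no sense to me. So I edited the source of ipadic.py:
Line 9: _curdir = os.path.dirname(__file__)
changed to (now line 10): _curdir = os.getcwd()
With this change, the base path becomes the working directory the program is run from, so you only need to copy the dicdir directory next to the entry script. This also makes it easy to update the dictionary.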
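
One caveat with os.getcwd() is that it depends on where the program is launched from. An alternative that avoids patching the package is to build the MeCab arguments yourself, anchored to the executable's location. The sketch below assumes PyInstaller's sys.frozen flag and MeCab's -r (rc file) and -d (dictionary directory) options; resolve_dicdir and mecab_args are hypothetical helper names, and I have not tested this against a packaged build.

```python
import sys
from pathlib import Path


def resolve_dicdir(entry_file, dirname="dicdir"):
    """Locate dicdir next to the executable (frozen) or next to entry_file."""
    if getattr(sys, "frozen", False):
        # PyInstaller sets sys.frozen; sys.executable is then the bundled exe.
        base = Path(sys.executable).resolve().parent
    else:
        base = Path(entry_file).resolve().parent
    return base / dirname


def mecab_args(dicdir):
    """Build a MeCab argument string pointing at dicdir and its mecabrc."""
    dicdir = Path(dicdir)
    rc_file = dicdir / "mecabrc"
    return f'-r "{rc_file}" -d "{dicdir}"'
```

Usage would look like MeCab.Tagger(mecab_args(resolve_dicdir(__file__))) in the entry script, with the dicdir folder copied next to it as described above.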

3. Text to PDF

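There are plenty of PDF conversion libraries around. After trying several, I settled on pdfkit because it keeps the HTML styling when converting to PDF. What I mainly wanted was to preserve the HTML rendering of the furigana, and pdfkit appeared to be the only one that supported it.
First, install pdfkit with the following command: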

pip install pdfkit

For the actual PDF conversion, this library depends on the third-party program wkhtmltopdf. If you need it, you can download it here: https://download.csdn.net/download/qq_36991535/87213206

Extract it into the project, then run the following code:

import pdfkit

# Path to the wkhtmltopdf executable extracted into the project
path_wk = r"wkhtmltox\bin\wkhtmltopdf.exe"
pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)
text = "<ruby>天気<rt>てんき</rt></ruby>がいいから、<ruby>散歩<rt>さんぽ</rt></ruby>しましょう。"
options = {
    'encoding': "utf-8",
    'page-size': 'A4',
    # 'margin-top': '0mm',
    # 'margin-right': '0mm',
    # 'margin-bottom': '0mm',
    # 'margin-left': '0mm'
}
pdfkit.from_string(text, "test01.pdf", configuration=pdfkit_config, options=options)
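
The hard-coded relative path only works when the script is run from the project root. As a hedged sketch, one could first look for the bundled copy and then fall back to whatever wkhtmltopdf is on the PATH; find_wkhtmltopdf is a hypothetical helper, and the default path simply mirrors the one used above.

```python
import shutil
from pathlib import Path


def find_wkhtmltopdf(bundled=r"wkhtmltox\bin\wkhtmltopdf.exe"):
    """Return a usable wkhtmltopdf path, or None if none can be found."""
    bundled_path = Path(bundled)
    if bundled_path.is_file():
        return str(bundled_path)
    # Fall back to an installation that is on the PATH, if there is one.
    return shutil.which("wkhtmltopdf")


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    print(path if path else "wkhtmltopdf not found")
```

The returned path can then be passed to pdfkit.configuration(wkhtmltopdf=path) exactly as in the code above.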

This produces a PDF (test01.pdf) that keeps the ruby annotations.
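
The example above passes a bare HTML fragment. If you want control over fonts and the size of the readings, the fragment can be wrapped in a full page first. The sketch below is only an illustration: the template and its CSS values are made up, and the #content placeholder simply follows the convention of the template file used in the final code of this post.

```python
# Hypothetical page template; the CSS values are arbitrary starting points.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<style>
  body { font-size: 16pt; line-height: 2.2; }
  rt { font-size: 50%; }
</style>
</head>
<body>#content</body>
</html>
"""


def wrap_fragment(fragment, template=PAGE_TEMPLATE):
    """Insert an HTML fragment into the page template."""
    return template.replace("#content", fragment)


if __name__ == "__main__":
    print(wrap_fragment("<ruby>天気<rt>てんき</rt></ruby>がいい"))
```

The resulting string can be handed to pdfkit.from_string in place of the bare fragment.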

4. The Final Code

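Finally, here is some of the code I ended up wrapping together. Word-to-PDF conversion uses the mammoth library, and PDF-to-Word conversion uses the pdf2docx library.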

The complete program can be downloaded here: https://download.csdn.net/download/qq_36991535/87213229

import os
import traceback
import MeCab
import ipadic
import jaconv

import pdfkit

import mammoth

from pdf2docx import Converter

path_wk = r"wkhtmltox\bin\wkhtmltopdf.exe"
pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)

mecab = MeCab.Tagger(ipadic.MECAB_ARGS)


def get_kana(text):
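    """
    Convert the whole text to hiragana.
    :param text: Japanese text to convert
    :return: the text with each word replaced by its hiragana reading
    """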
    result = mecab.parse(text)
    words_arr = result.split("\n")

    kana_result = ""

    for word in words_arr:
        if word == "EOS":
            break
        key = word.split("\t")[0]
        mean = word.split("\t")[1].split(",")

        kana = mean[-2]  # the reading (katakana) is the second-to-last field
        if kana == "*" or kana == key:
            kana_result = f"{kana_result}{key}"
        else:
            kana_result = f"{kana_result}{jaconv.kata2hira(kana)}"
    return kana_result


def get_kata(text):
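    """
    Convert the whole text to katakana.
    :param text: Japanese text to convert
    :return: the text with each word replaced by its katakana reading
    """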
    result = mecab.parse(text)
    # print(result)
    words_arr = result.split("\n")

    hira_result = ""

    for word in words_arr:
        if word == "EOS":
            break
        key = word.split("\t")[0]
        mean = word.split("\t")[1].split(",")
        kana = mean[-2]  # the reading (katakana) is the second-to-last field
        if kana == "*" or kana == key:
            hira_result = f"{hira_result}{key}"
        else:
            hira_result = f"{hira_result}{jaconv.hira2kata(kana)}"
    return hira_result


def get_words_dict(text):
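    """
    Analyze the Japanese words in text and return each word's hiragana and
    katakana readings together with its part-of-speech information.
    :param text: Japanese text to analyze
    :return: dict keyed by surface form; note that "kana" holds the hiragana
             reading and "hira" holds the katakana reading reported by MeCab
    """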
    result_dict = {}
    result = mecab.parse(text)
    words_arr = result.split("\n")

    for word in words_arr:
        if word == "EOS":
            break
        key = word.split("\t")[0]
        if key not in result_dict:
            mean = word.split("\t")[1].split(",")
            hira = mean[-2]  # the reading (katakana) is the second-to-last field
            if hira == "*" or hira == key:
                result_dict[key] = {
                    "kana": "",
                    "hira": "",
                    "category": ""
                }
            else:
                kana = jaconv.kata2hira(hira)
                category = ",".join(mean[:-3]).replace(",*", "").rstrip(",")
                result_dict[key] = {
                    "kana": kana,
                    "hira": hira,
                    "category": category
                }
    return result_dict


def get_text_with_kana(text):
    """
    Return text in the form word[reading].
    :param text: Japanese text to annotate
    :return: the annotated text
    """
    result = mecab.parse(text)
    words_arr = result.split("\n")

    kana_result = ""

    for word in words_arr:
        if word == "EOS":
            break
        key = word.split("\t")[0]
        mean = word.split("\t")[1].split(",")

        hira = mean[-2]  # the reading (katakana) is the second-to-last field
        if hira == "*" or hira == key:
            kana_result = f"{kana_result}{key}"
        else:
            kana = jaconv.kata2hira(hira)
            if key == kana:
                kana_result = f"{kana_result}{key}"
            else:
                kana_result = f"{kana_result}{key}[{kana}] "
    return kana_result


def get_html_with_kana(text):
    """
    Return the text annotated with HTML ruby markup:
    <ruby>word<rt>reading</rt></ruby>
    :param text: Japanese text to annotate
    :return: the annotated HTML
    """
    result = mecab.parse(text)
    words_arr = result.split("\n")

    kana_result = ""

    for word in words_arr:
        if word == "EOS":
            break
        key = word.split("\t")[0]
        mean = word.split("\t")[1].split(",")

        hira = mean[-2]  # the reading (katakana) is the second-to-last field
        if hira == "*" or hira == key:
            kana_result = f"{kana_result}{key}"
        else:
            kana = jaconv.kata2hira(hira)
            if key == kana:
                kana_result = f"{kana_result}{key}"
            else:
                kana_result = f"{kana_result}<ruby>{key}<rt>{kana}</rt></ruby>"
    return kana_result


def get_complated_html(text_kana):
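    """
    Substitute the annotated text into the HTML template and return the
    resulting HTML.
    :param text_kana: text that already contains the ruby annotations
    :return: the complete HTML document
    """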
    with open("templte_kana.html", mode="r", encoding="utf-8") as f:
        html_kana = f.read().replace("#content", text_kana)
        return html_kana


def html_file_to_kana_html(html_file, encoding="utf-8"):
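    """
    Annotate a Japanese HTML file and write the result to a new HTML file
    with the suffix _kana.html.
    :param html_file: path of the source HTML file
    :param encoding: encoding used for reading and writing
    :return:
    """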
    with open(html_file, mode="r", encoding=encoding) as f:
        html_text = f.read()
        result_html_file = html_file.replace(".html", "_kana.html")
        with open(result_html_file, mode="w", encoding=encoding) as wf:
            wf.write(get_html_with_kana(html_text))


def text_with_kana_to_word(text="", outputfile="document.docx"):
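    """
    Convert annotated text to a Word document (via a temporary PDF).
    :param text: annotated text or HTML
    :param outputfile: path of the .docx file to write
    :return: True on success, False otherwise
    """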
    try:
        temp_pdf = "temp.pdf"
        if text_to_pdf(text=text, outputfile=temp_pdf):
            cv = Converter(temp_pdf)
            cv.convert(outputfile, start=0, end=None)
            cv.close()
            os.remove(temp_pdf)
            return True

        return False
    except Exception:
        traceback.print_exc()
        return False


def text_to_pdf(text="", outputfile="document.pdf"):
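    """
    Convert text to PDF; HTML markup is supported.
    :param text: text or HTML to convert
    :param outputfile: path of the PDF file to write
    :return: True on success, False otherwise
    """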
    try:
        options = {
            'encoding': "utf-8",
            'page-size': 'A4',
            # 'margin-top': '0mm',
            # 'margin-right': '0mm',
            # 'margin-bottom': '0mm',
            # 'margin-left': '0mm'
        }
        pdfkit.from_string(text, outputfile, configuration=pdfkit_config, options=options)
        return True
    except Exception:
        traceback.print_exc()
        return False


def html_to_pdf(html_file, outputfile="document.pdf"):
    """
    Convert an HTML file directly to a PDF file.
    :param html_file: path of the source HTML file
    :param outputfile: path of the PDF file to write
    :return:
    """
    options = {
        'encoding': "utf-8",
        'page-size': 'A4',
        # 'margin-top': '0mm',
        # 'margin-right': '0mm',
        # 'margin-bottom': '0mm',
        # 'margin-left': '0mm'
    }
    pdfkit.from_file(html_file, outputfile, configuration=pdfkit_config, options=options)


def word_to_mark_kana(word_file, output_word_file="output.docx"):
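    """
    Add kana readings to a Word document; the formatting may change slightly.
    :param word_file: path of the source .docx file
    :param output_word_file: path of the annotated .docx file to write
    :return:
    """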
    with open(word_file, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        text_kana = get_html_with_kana(result.value)
        html_code = get_complated_html(text_kana)
        text_with_kana_to_word(html_code, outputfile=output_word_file)
        # text_with_kana_to_word(text_kana, outputfile=output_word_file)


def word_to_pdf(word_file, pdf_file):
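    """
    Convert a Word document to PDF, preserving its formatting.
    :param word_file: path of the source .docx file
    :param pdf_file: path of the PDF file to write
    :return:
    """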
    with open(word_file, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        text_to_pdf(result.value, pdf_file)

try:
    text = "天気がいいから、散歩しましょう。"
    # text = open("test.html", encoding="utf-8").read()
    # print(get_kana(text))
    # print(get_kata(text))
    # print(get_text_with_kana(text))
    print(get_html_with_kana(text))
    text_to_pdf("<ruby>天気<rt>てんき</rt></ruby>がいいから、<ruby>散歩<rt>さんぽ</rt></ruby>しましょう。", "test01.pdf")
    # print(get_words_dict(text))

    # text = get_html_with_kana(text)
    # html_code = get_complated_html(text)

    # text_with_kana_to_word(text, "test.docx")

    # print(text)
    # html_with_kana_to_word(text)
    # html_with_kana_to_pdf(text)

    # html_file_to_kana_html("test.html")
    # html_to_pdf("test.html", "test.pdf")
    # word_to_mark_kana("小班美术教案.docx", output_word_file="小班美术教案.docx")
    # word_to_pdf("幼儿园组织与管理名词解释题.docx", "幼儿园组织与管理名词解释题.pdf")
    pass

except Exception:
    traceback.print_exc()

# input("Press Enter to exit...")
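
The functions above insert token text into the HTML unescaped, which is what allows html_file_to_kana_html to pass existing tags through. For plain-text input, where characters such as < or & should appear literally, the ruby markup could be built with escaping instead. This is a minimal sketch, not part of the program above: to_ruby is a hypothetical helper that takes (surface, hiragana reading or None) pairs that have already been extracted from MeCab's output.

```python
from html import escape


def to_ruby(pairs):
    """Build ruby-annotated HTML from (surface, reading) pairs.

    A token is annotated only when it has a reading that differs from its
    surface form; all text is HTML-escaped.
    """
    parts = []
    for surface, reading in pairs:
        if reading is None or reading == surface:
            parts.append(escape(surface))
        else:
            parts.append(
                f"<ruby>{escape(surface)}<rt>{escape(reading)}</rt></ruby>"
            )
    return "".join(parts)


if __name__ == "__main__":
    print(to_ruby([("天気", "てんき"), ("が", "が"), ("いい", "いい")]))
```

Do not use this variant on input that is already HTML, since the existing tags would be escaped as well.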