Converting an HTML File to a PDF Document

Forge_ahead

Last modified: 2023-06-30 13:16:42

Category: work_efficiency  Tags: html pdf python

First published: 2023-05-15 18:13:33

Article link: https://blog.csdn.net/weixin_50646402/article/details/130690031

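At work I needed to convert a WeChat Official Account article (it opens in a browser; link: https://mp.weixin.qq.com/s/M9Oz3UDaEXJoLMzSB4mJPQ) to PDF. The article is long and contains many images, and I ran into a few problems during the conversion, which I record here.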

Method 1: Use the Chrome browser's print function

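A WeChat Official Account article opened in a browser is an ordinary HTML page, so the whole page can be printed. Press Ctrl+P to open the print dialog, choose "Save as PDF" as the destination, turn off headers and footers, and save the PDF.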

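One caveat: because the article is long, the print preview opened with Ctrl+P showed most of the later pages blank, and the downloaded PDF contained only the first few pages. The fix is to scroll slowly from the top of the page to the bottom so that all content is loaded, then open the print dialog again; the complete PDF can then be saved.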
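The scroll-then-print workaround can also be scripted. The following is a minimal sketch, not the method used in this post: it assumes Selenium 4 and a local Chrome installation, and the names `scroll_offsets` and `save_page_as_pdf`, the step size, and the pause length are illustrative choices. It scrolls through the page in steps so lazy-loaded images are fetched, then asks Chrome to print through the DevTools `Page.printToPDF` command.

```python
import base64
import time


def scroll_offsets(total_height, step=800):
    """Return the y offsets to visit so the whole page is scrolled through."""
    if total_height <= 0 or step <= 0:
        return [0]
    offsets = list(range(0, total_height, step))
    offsets.append(total_height)  # finish exactly at the bottom
    return offsets


def save_page_as_pdf(url, out_path, step=800, pause=0.5):
    """Scroll through `url` to trigger lazy loading, then print it to PDF."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        height = driver.execute_script("return document.body.scrollHeight")
        for y in scroll_offsets(height, step):
            driver.execute_script("window.scrollTo(0, arguments[0]);", y)
            time.sleep(pause)  # assumed delay; give images time to load
        result = driver.execute_cdp_cmd(
            "Page.printToPDF",
            {"printBackground": True, "displayHeaderFooter": False},
        )
        # The command returns the PDF as a base64 string under "data".
        with open(out_path, "wb") as f:
            f.write(base64.b64decode(result["data"]))
    finally:
        driver.quit()
```

A call such as `save_page_as_pdf(url, "article.pdf")` would then replace the manual steps. The page height is read once before scrolling, so a page that grows while loading may need a second pass or a longer pause.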

Method 2: Use a Python program to convert the article to an HTML file or a PDF document

# coding: utf-8
import pdfkit
import os
import requests
from bs4 import BeautifulSoup

# HTML template; the page fetched from WeChat contains too much extra markup.
T_HTML = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="referrer" content="never">
    <meta name="referrer" content="no-referrer">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
    <style>{style}</style>
</head>
<body>
    {content}
</body>
</html>"""

# Options passed to wkhtmltopdf through pdfkit
PDF_OPTIONS = {
    'page-size': 'A4',
    'encoding': "UTF-8",
}


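def getHtmlContent(url, proxies=None):
    '''
    Fetch the HTML of the page at url.
    '''
    if proxies is None:
        proxies = {"http": None, "https": None}
    # proxies must be passed by keyword: the second positional
    # argument of requests.get is params, not proxies
    res = requests.get(url, proxies=proxies, timeout=30)
    res.raise_for_status()
    res.encoding = 'utf-8'
    return res.text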


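def reHtmlTags(cnt_html):
    '''
    Rewrite image src attributes, remove unwanted elements, and fill the template.
    '''
    # Lazy-loaded images keep their URL in data-src; turn it into src
    # and drop the inline style that hides the content
    cnt_html = cnt_html.replace(
        "data-src", "src").replace('style="visibility: hidden;"', "")
    soup = BeautifulSoup(cnt_html, 'html.parser')

    # Remove the iframe that holds comments and voting
    if soup.iframe:
        soup.iframe.decompose()

    # Image nodes of the article; the class may differ between articles
    comments = soup.find_all("img", {"class": "rich_pages wxw-img"})
    styles = soup.find_all('style')
    # Body node of the article; the id may differ between articles
    content = soup.find('div', id='page-content')
    if content is None:
        raise ValueError("article body not found: no div with id 'page-content'")

    # Fill the template, tolerating a missing or empty <style> element
    style = styles[0].string if styles and styles[0].string else ''
    fmt_html = T_HTML.format(style=style, content=content)
    # Blank out the URL of the first matched image, if there is one
    if comments and comments[0].get('src'):
        fmt_html = fmt_html.replace(comments[0]['src'], '')
    return fmt_html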


def outFile(data, out_type):
    '''
    Export data as a PDF when out_type is 'pdf', otherwise as an HTML file.
    '''
    if out_type == 'pdf':
        # pdfkit needs the wkhtmltopdf binary to be installed
        pdfkit.from_string(data, '大米评测_文章.pdf', options=PDF_OPTIONS)
    else:
        # os.path.join keeps the path portable across operating systems
        path = os.path.join(os.getcwd(), '大米评测_文章.html')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(data)


# Replace the URL to convert a different article
source = getHtmlContent('https://mp.weixin.qq.com/s/M9Oz3UDaEXJoLMzSB4mJPQ')
html = reHtmlTags(source)
# outFile(html, 'html')
outFile(html, 'pdf')
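
The script's `replace("data-src", "src")` call rewrites every occurrence of that string, including any that appear in scripts or visible text, and it can leave an `<img>` with two `src` attributes when the tag already has a placeholder one. As a hedged alternative, here is a small standard-library sketch that only touches `<img>` tags. The function name `promote_lazy_images` is hypothetical, and the regular expressions assume simple double-quoted attributes, which may not hold for every article.

```python
import re

# Assumption: attributes are double-quoted and tags contain no literal ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
PLAIN_SRC = re.compile(r'\ssrc="[^"]*"', re.IGNORECASE)
DATA_SRC = re.compile(r'\sdata-src=("[^"]*")', re.IGNORECASE)


def promote_lazy_images(html):
    """Turn data-src into src inside <img> tags only."""

    def fix_tag(match):
        tag = match.group(0)
        if not DATA_SRC.search(tag):
            return tag  # not a lazy-loaded image; leave it untouched
        tag = PLAIN_SRC.sub("", tag)  # drop any placeholder src first
        return DATA_SRC.sub(r" src=\1", tag, count=1)

    return IMG_TAG.sub(fix_tag, html)
```

Inside `reHtmlTags`, `promote_lazy_images(cnt_html)` could stand in for the first `replace` call; the `visibility: hidden` replacement would still be applied separately.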