Batch-Converting WeChat Official Account Articles to PDF with Python

Marita Christian

Last modified: 2023-05-31 20:54:17

Tags: chrome, python

First published: 2023-05-31 20:53:42

Article link: https://blog.csdn.net/qq_46565716/article/details/130977064

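[Preface] Every week I help a friend save exam-related WeChat Official Account articles as PDFs for later reading, so I wanted to automate the task. Until now I used the print function of Safari on an iPhone, tapping through the articles one by one, which was tedious. After some research, the tool recommended most often online is the pdfkit package, but its results were unsatisfactory: parts of the content went missing. As an example, take this article: https://mp.weixin.qq.com/s/6Izbgd8QI9LM4ecrL2Fv_Q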

1. Approach 1: pdfkit + selenium + bs4

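Because the content of a WeChat Official Account article page is loaded through requests after the page opens, selenium is needed to open the page in a simulated browser before the content is parsed. This approach has problems; the result is as follows: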

The code is as follows:

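import time

import pdfkit
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# URL of the Official Account article
url = 'https://mp.weixin.qq.com/s/6Izbgd8QI9LM4ecrL2Fv_Q'

# Chrome options, so they are passed to webdriver.Chrome (not webdriver.Edge)
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Open the page in a browser so that dynamically loaded content is rendered
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    # Wait for the page to finish loading
    time.sleep(5)
    # Get the rendered page source
    html = driver.page_source
finally:
    # Always shut the browser down, even if loading fails
    driver.quit()

# Parse the page and extract the article body
soup = BeautifulSoup(html, 'html.parser')
article = soup.find('div', {'id': 'js_article'})
if article is None:
    raise RuntimeError('Article container #js_article not found')
article_content = article.prettify()

# Install location of wkhtmltopdf
path_wk = r'E:\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wk)

# Convert the article body to a PDF file
pdfkit.from_string(
    article_content,
    'example-article.pdf',
    configuration=config,
    options={'encoding': 'utf-8', 'enable-local-file-access': None},
)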
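
One possible cause of missing images in the output is lazy loading. The sketch below assumes that the article's `<img>` tags keep their real address in a `data-src` attribute until the page is scrolled; that is an assumption about the page markup, not something verified in this post. The helper name `promote_data_src` is hypothetical. It copies `data-src` into `src` using only the standard library, so the HTML string handed to the converter already points at the real images.

```python
import re

# Assumption: lazy-loaded <img> tags store the real image URL in data-src.
_IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
_DATA_SRC = re.compile(r'\bdata-src\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)
# Match a plain src attribute, but not the "src" inside "data-src".
_SRC = re.compile(r'(?<![\w-])src\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def promote_data_src(html):
    """Return html with each img's data-src value copied into its src."""

    def fix(match):
        tag = match.group(0)
        data = _DATA_SRC.search(tag)
        if data is None:
            return tag  # nothing to promote, leave the tag untouched
        quote, real = data.group(1), data.group(2)
        new_src = f'src={quote}{real}{quote}'
        if _SRC.search(tag):
            # Overwrite the placeholder src with the real address
            return _SRC.sub(lambda m: new_src, tag, count=1)
        # No src at all: insert one right after "<img"
        return tag[:4] + ' ' + new_src + tag[4:]

    return _IMG_TAG.sub(fix, html)
```

If the assumption holds, calling `article_content = promote_data_src(article_content)` just before `pdfkit.from_string(...)` would be one place to try it. A regular expression is enough for a sketch like this, though an HTML parser would be more robust against unusual markup.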

2. Approach 2: playwright

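Since the previous approach worked poorly, I changed tack and looked for something closer to a scrolling screenshot. Searching for options turned up playwright, which appeared to work well. The result is as follows: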

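The formatting looks better. Both approaches share one problem, though: images are not fully loaded. Since I only need the text, that is acceptable. The code for this approach, wrapped in a function, is as follows: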

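import datetime
import os
import re

import bs4
import requests
from playwright.sync_api import sync_playwright
from tqdm import tqdm


def download(url, file_name):
    """Save the article at `url` as a PDF inside the folder `file_name`."""
    # Fetch the static HTML once to read the article title
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    title_elem = soup.select('#activity-name')
    # Strip surrounding whitespace instead of relying on a fixed line index
    title = title_elem[0].get_text(strip=True)
    # Remove characters that are not allowed in file names
    title = re.sub(r'[\\/:*?"<>|\n]', '', title)
    with sync_playwright() as pw:
        # Page.pdf() is only supported by headless Chromium
        browser = pw.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        page.pdf(path=f'./{file_name}/{title}.pdf')
        browser.close()


if __name__ == "__main__":
    # Use today's date as the output folder name
    today = datetime.date.today()
    file_name = f'{today.year}-{today.month}-{today.day}'
    os.makedirs(file_name, exist_ok=True)
    # urls.txt holds one article URL per line; blank lines are skipped
    with open('urls.txt', 'r', encoding='utf-8') as file:
        urls = [line.strip() for line in file if line.strip()]
    for url in tqdm(urls):
        download(url, file_name)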
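
The batch script depends on two small steps that are easy to get wrong: turning an article title into a valid file name, and reading the URL list. The following is a minimal sketch of both as standalone helpers. The names `safe_filename` and `read_urls`, the `untitled` fallback, and the de-duplication of URLs are assumptions added for illustration and are not part of the original script.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names, plus line breaks and tabs
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, fallback="untitled"):
    """Strip illegal characters from a title; fall back if nothing is left."""
    cleaned = _ILLEGAL.sub("", title).strip().rstrip(".")
    return cleaned or fallback


def read_urls(path):
    """Read one URL per line, skipping blank lines and repeated URLs."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls
```

With helpers like these, an article whose title consists only of illegal characters would still get a usable file name, and a URL pasted into `urls.txt` twice would be converted only once.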

PS

This article is my first trial run of the Zhihu On VSCode plugin.
