#! https://zhuanlan.zhihu.com/p/633738815
Batch-Converting WeChat Official Account Articles to PDF with Python
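[Preface] Every week I help a friend save exam-related WeChat Official Account articles as PDFs for later reading, so I wanted to automate this step. Until now I used the print function of Safari on an iPhone, tapping through the articles one by one, which is tedious. After some research, the package recommended most often online is pdfkit, but it does not give satisfactory results: parts of the content go missing. As an example, take the following article: https://mp.weixin.qq.com/s/6Izbgd8QI9LM4ecrL2Fv_Q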
1. Approach 1: pdfkit + selenium + bs4
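Because the content of a WeChat Official Account article page is loaded dynamically through requests, selenium is needed to open the page in a simulated browser before the content can be parsed. This approach has problems; the result is as follows:
The code is as follows: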
import time

import pdfkit
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.edge.options import Options

# URL of the Official Account article
url = 'https://mp.weixin.qq.com/s/6Izbgd8QI9LM4ecrL2Fv_Q'

# Headless browser options (the Options class must match the Edge driver used below)
edge_options = Options()
edge_options.add_argument('--headless')
edge_options.add_argument('--disable-gpu')

# Open the page in the browser
driver = webdriver.Edge(options=edge_options)
driver.get(url)
# Wait for the page to finish loading
time.sleep(5)
# Get the rendered page source, then close the browser
html = driver.page_source
driver.quit()

# Parse the page and extract the article body
soup = BeautifulSoup(html, 'html.parser')
article_content = soup.find('div', {'id': 'js_article'}).prettify()

path_wk = r'E:\wkhtmltopdf\bin\wkhtmltopdf.exe'  # wkhtmltopdf install location
config = pdfkit.configuration(wkhtmltopdf=path_wk)
# Convert the article body to a PDF file
pdfkit.from_string(
    article_content,
    'example-article.pdf',
    configuration=config,
    options={'encoding': 'utf-8', 'enable-local-file-access': None},
)
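The fixed `time.sleep(5)` above either waits longer than necessary or not long enough on a slow connection. Selenium ships `WebDriverWait` for condition-based waiting; the block below is only a dependency-free sketch of the same polling idea, so it can be tried without a browser. The name `wait_until` and its parameters are hypothetical and not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

With the driver from the script above, a call such as `wait_until(lambda: driver.find_elements('id', 'js_article'))` could replace the fixed sleep: it returns as soon as the article container is present in the page.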
2. Approach 2: playwright
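Since the previous approach worked poorly, I considered changing tack to a scrolling-screenshot style of capture and searched for related solutions. One of them, playwright, appeared to work well; the result is as follows:
The formatting looks better. Both approaches share one problem, namely that images may not load completely, but I only need the text content, so this is acceptable. The code for this approach, wrapped in a function, is as follows: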
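import os
import re

import bs4
import requests
from playwright.sync_api import sync_playwright
from tqdm import tqdm


def download(url, file_name):
    """Save the article at url as <file_name>/<title>.pdf."""
    # Fetch the static HTML only to read the article title
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    titleElem = soup.select('#activity-name')
    # strip=True drops the whitespace and newlines around the title text
    title = titleElem[0].get_text(strip=True)
    # Remove characters that are not allowed in file names
    title = re.sub(r'[\\/:*?"<>|\n]', '', title)

    # Render the page in headless Chromium and print it to PDF
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        page.pdf(path=f'./{file_name}/{title}.pdf')
        browser.close()

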
if __name__ == "__main__":
    import datetime

    # Output directory named after today's date (no zero padding)
    today = datetime.date.today()
    file_name = f'{today.year}-{today.month}-{today.day}'
    if not os.path.exists(file_name):
        os.mkdir(file_name)
    # urls.txt holds one article URL per line; blank lines are skipped
    with open('urls.txt', 'r') as file:
        urls = file.readlines()
    urls = [line.strip() for line in urls if line.strip()]
    for url in tqdm(urls):
        download(url, file_name)
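
Two details of the batch script above are easy to trip over: a title may be empty once the forbidden characters are removed, and a single failing URL stops the whole loop. The following is a minimal sketch, not the code used for this article, of helpers that address both. The names `safe_filename`, `read_urls`, and `run_batch` and the fallback name `untitled` are hypothetical; `worker` stands in for a function like `download` above.

```python
import re
from pathlib import Path


def safe_filename(title, fallback="untitled"):
    """Remove characters that are invalid in Windows file names and collapse whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # An empty name would produce a file called just ".pdf", so use a fallback
    return cleaned or fallback


def read_urls(path):
    """Read one URL per line, skipping blank lines and duplicates (order is kept)."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def run_batch(urls, worker):
    """Call worker(url) for every URL and collect failures instead of aborting."""
    failed = []
    for url in urls:
        try:
            worker(url)
        except Exception as exc:  # keep going; the caller decides what to retry
            failed.append((url, exc))
    return failed
```

Under these assumptions, the main block could call `run_batch(read_urls('urls.txt'), lambda u: download(u, file_name))` and print the returned list of failed URLs at the end, so that one bad link does not cost the rest of the batch.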
PS
This article is my first trial run of the Zhihu On VSCode plugin.