Python crawler: scraping WeChat Official Account topic-tag content and printing it to PDF

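Scraping WeChat Official Account content is a bit quirky: the request parameters, the POST parameters in particular, take time to work out. What is scraped here is the content under a topic tag, and pdfkit is used to print that content to PDF.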
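As a rough sketch of the pdfkit step, the snippet below wraps a scraped fragment in a minimal UTF-8 page and hands it to `pdfkit.from_string`. The helper names `build_html` and `save_pdf` are made up for illustration and are not taken from the original code; the sketch also assumes wkhtmltopdf is installed, since pdfkit only drives that binary.

```python
import html


def build_html(title, body_html):
    """Wrap an article fragment in a minimal HTML page that declares UTF-8."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n{body_html}\n</body></html>"
    )


def save_pdf(title, body_html, out_path, wkhtmltopdf_path=None):
    """Render the page to PDF with pdfkit (needs the wkhtmltopdf binary)."""
    import pdfkit  # imported lazily so build_html can be used without pdfkit

    config = None
    if wkhtmltopdf_path:
        # Point pdfkit at the binary explicitly when it is not on PATH
        config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    options = {"encoding": "UTF-8"}  # passed to wkhtmltopdf as --encoding
    return pdfkit.from_string(
        build_html(title, body_html), out_path,
        options=options, configuration=config,
    )
```

Declaring the charset in the page itself (and in the options) is a cheap way to avoid garbled Chinese text in the resulting PDF.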

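Two versions are implemented here. The first requests the page directly. The real address behind it, i.e. the POST URL, carries quite a few parameters of its own; I did not try it, so only part of the content is retrieved, which is far from ideal. The second version uses a headless browser to open the page, grabs the page source, and parses it to get the desired content.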
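Whichever way the page source is obtained, the parsing step can be sketched roughly as follows. The two regular expressions mirror the ones used in `get_urls` in Version 1; the function name `extract_links` and the sample markup are hypothetical, and the real page layout may differ or change.

```python
import re


def extract_links(page_source):
    """Return article links from the embedded `var data={...}` script, in page order."""
    match = re.search(r"var data=\{(.+?)if", page_source, re.S)
    if match is None:
        return []  # layout changed, or the page was only partially served
    links = re.findall(r',"link":"(.+?)",', match.group(1), re.S)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(links))


# Hypothetical sample that only imitates the shape of the embedded script
sample = '''<script>
var data={"list":[{"title":"a","link":"https://example.com/1","cover":"x"},
{"title":"b","link":"https://example.com/2","cover":"y"},
{"title":"a","link":"https://example.com/1","cover":"x"}]};
if (data) { render(data); }
</script>'''
print(extract_links(sample))
```

Unlike `set`, deduplicating with `dict.fromkeys` keeps the articles in the order they appear on the page, which is handy if the PDFs should follow the topic's ordering.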

I'm rather lazy these days: all of the code is old, ready-made code that I copied, tweaked a little, and used as-is!

Version 1:

# -*- coding: UTF-8 -*-
# Fetch WeChat Official Account content and print it to PDF
# by WeChat: huguo00289
#https://mp.weixin.qq.com/mp/homepage?__biz=MzA4NjQ3MDk4OA==&hid=5&sn=573b1b806f9ebf63171a56ee2936b883&devicetype=android-29&version=27001239&lang=zh_CN&nettype=WIFI&a=&session_us=gh_7d55ab2d943f&wx_header=1&fontScale=100&from=timeline&isappinstalled=0&scene=1&subscene=2&clicktime=1594602258&enterid=1594602258&ascene=14
import requests
from fake_useragent import UserAgent
import os,re
import pdfkit


confg = pdfkit.configuration(
    wkhtmltopdf=r'D:\wkhtmltox-0.12.5-1.mxe-cross-win64\wkhtmltox\bin\wkhtmltopdf.exe')

class Du():
    def __init__(self, furl):
        # Send a randomly chosen browser User-Agent with every request
        ua = UserAgent()
        self.headers = {
            "User-Agent": ua.random,
        }
        self.url = furl


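    def get_urls(self):
        # Fetch the topic page and pull the article links out of the embedded `var data={...}` script
        response = requests.get(self.url, headers=self.headers, timeout=8)
        html = response.content.decode('utf-8')
        req = re.findall(r'var data={(.+?)if', html, re.S)[0]
        urls = re.findall(r',"link":"(.+?)",', req, re.S)

        # Remove duplicate links
        urls = set(urls)
        print(len(urls))

        return urls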



    def get_content(self,url,category):
        response = requests.get(url, headers=self.headers, timeout=8)
        print(response.status_code)
        html = response.content.decode('utf-8')
        req = re.findall(r'

(.+?)var first_sceen__time',html,re.S)[0]
        # Get the title
        h1=re.findall(r'

`(.+?)`

',req,re.S)[0]
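        h1 = h1.strip()
        # Characters that are not allowed in Windows file names
        pattern = r"[\/\\\:\*\?\"<>\|]"
        h1 = re.sub(pattern, "_", h1)  # replace them with underscores
        print(h1)
        # Get the article details
        detail = re.findall(r'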

(.+?)