Preface
Scraping 900 articles from a WeChat official account with Fiddler and Python
First, a look at the end result.
一、Packet capture with Fiddler
First, I captured traffic from the WeChat PC client with Fiddler to find the article API and work out its pattern.
Installing and configuring Fiddler is not covered here.
1. Capturing the data
Start Fiddler to begin capturing, open the official account in WeChat, and scroll down so that more articles are loaded.
Then use Fiddler's filter to locate the data packet.
Pasting the packet body into an online JSON parser shows that it is JSON nested inside JSON: the real article list is a JSON string stored in the general_msg_list field of the outer object.
Parsing that inner string a second time gives us the actual data.
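The double parse can be reproduced in a few lines. Here resp is a hypothetical minimal stand-in with the same shape as the captured response, not the real packet:

```python
import json

# Minimal stand-in for the captured response: the outer JSON carries the
# real article list as a *string* in the general_msg_list field.
resp = '{"ret": 0, "general_msg_list": "{\\"list\\": [{\\"id\\": 1}]}"}'

outer = json.loads(resp)                       # first parse: outer JSON
inner = json.loads(outer["general_msg_list"])  # second parse: embedded JSON string
print(inner["list"])                           # → [{'id': 1}]
```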
2. Analyzing the URL
With the data format sorted out, let's analyze the URL. Careful observation shows that only one parameter changes between requests: offset. It equals (page number - 1) * 10, starting from 0, so offset 0 is page one, 10 is page two, 20 is page three:
https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzAxNzE1OTA1MA==&f=json&offset=10&count=10&is_ok=1&scene=124&uin=MjQ0OTAxMTE1OA%3D%3D&key=ddc148ae0bed3f3b789c3b6b4b2fccb615cf85c5c07072eaabd700d9135e21d2c0d00629fe28359c7c8a45064e78029fe946d6bc94fbc7333cc45b9b2c797ed8703e4c11a547f6e8a045316563a9d1b619a393f105049fa9023952bd5f339cd9845cef41c53f01d84768b5768979309f10736120c882567b988e0bc0a1ee008e&pass_ticket=dC03dEe6iiIOydQ1ju1FlPWP4OX6VLYNRzpscK7SAz2X2BnzPln81Zt%2B4oQiVadK&wxtoken=&appmsg_token=1107_CcuZtyZ%252BYr6gZ3t9QZ64DjszBGeDOYZsxPfvjg~~&x5=0&f=json
https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzAxNzE1OTA1MA==&f=json&offset=20&count=10&is_ok=1&scene=124&uin=MjQ0OTAxMTE1OA%3D%3D&key=ddc148ae0bed3f3b789c3b6b4b2fccb615cf85c5c07072eaabd700d9135e21d2c0d00629fe28359c7c8a45064e78029fe946d6bc94fbc7333cc45b9b2c797ed8703e4c11a547f6e8a045316563a9d1b619a393f105049fa9023952bd5f339cd9845cef41c53f01d84768b5768979309f10736120c882567b988e0bc0a1ee008e&pass_ticket=dC03dEe6iiIOydQ1ju1FlPWP4OX6VLYNRzpscK7SAz2X2BnzPln81Zt%2B4oQiVadK&wxtoken=&appmsg_token=1107_CcuZtyZ%252BYr6gZ3t9QZ64DjszBGeDOYZsxPfvjg~~&x5=0&f=json
https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzAxNzE1OTA1MA==&f=json&offset=30&count=10&is_ok=1&scene=124&uin=MjQ0OTAxMTE1OA%3D%3D&key=ddc148ae0bed3f3b789c3b6b4b2fccb615cf85c5c07072eaabd700d9135e21d2c0d00629fe28359c7c8a45064e78029fe946d6bc94fbc7333cc45b9b2c797ed8703e4c11a547f6e8a045316563a9d1b619a393f105049fa9023952bd5f339cd9845cef41c53f01d84768b5768979309f10736120c882567b988e0bc0a1ee008e&pass_ticket=dC03dEe6iiIOydQ1ju1FlPWP4OX6VLYNRzpscK7SAz2X2BnzPln81Zt%2B4oQiVadK&wxtoken=&appmsg_token=1107_CcuZtyZ%252BYr6gZ3t9QZ64DjszBGeDOYZsxPfvjg~~&x5=0&f=json
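The pattern boils down to templating that one query parameter. A short sketch, using a truncated base URL for illustration (the real URL keeps all the captured key/pass_ticket/appmsg_token parameters, which expire quickly):

```python
# Truncated illustration of the captured URL with offset templated out;
# the real request needs the remaining parameters from your own capture.
BASE = ("https://mp.weixin.qq.com/mp/profile_ext?action=getmsg"
        "&__biz=MzAxNzE1OTA1MA==&f=json&offset={}&count=10")

offsets = [page * 10 for page in range(3)]  # 0, 10, 20 → pages 1-3
urls = [BASE.format(off) for off in offsets]
```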
二、Code implementation
Having worked out the data packets and the URL pattern, we can now write the code.
First, a function that parses the data and extracts each article's title and URL:
# Imports used by this and the following snippets.
import csv
import json
import os

import pandas as pd
import requests
from lxml import etree
from multiprocessing.dummy import Pool as newPool  # thread pool (assumed alias)

def Parser_Data(self, url):
    # Fetch the outer JSON, then parse the JSON string embedded in general_msg_list.
    text = requests.get(url=url, headers=self.head).json()
    text = json.loads(text["general_msg_list"])
    for li in text["list"]:
        try:
            dic = {}
            dic["title"] = li["app_msg_ext_info"]["title"]
            dic["url"] = li["app_msg_ext_info"]["content_url"]
            # Append each title/url pair to a CSV file.
            with open("./gzh.csv", "a", encoding="utf-8", newline="") as f:
                writer = csv.DictWriter(f, dic.keys())
                writer.writerow(dic)
        except KeyError:
            # Some entries (e.g. text-only pushes) carry no app_msg_ext_info.
            pass
Next, a function that builds the URL list and hands each URL to the parsing function:
def Get_Data(self):
    print("-" * 30 + "Fetching article titles and links" + "-" * 30)
    urls = []
    for i in range(7):  # 7 pages: offset 0, 10, ..., 60
        url = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MzAxNzE1OTA1MA==&f=json&offset={}&count=10&is_ok=1&scene=124&uin=MjQ0OTAxMTE1OA%3D%3D&key=78be2463e59dfd165483b736ef02c10d9477cf586303df98c48e68da0d3498b337b506cb6d87a82f86c8201937f44179c90720a3b731284220046db38040f1568b7a564ec0ef9c0c9fb32f6b45dcdaadb6f5724398c7fcd74da4c862808ef684cb9e41bd7f4acb8c25ceb6d7f6fa9f5b2175fed1f2a748205d48b31184c01f2c&pass_ticket=dC03dEe6iiIOydQ1ju1FlPWP4OX6VLYNRzpscK7SAz2X2BnzPln81Zt%2B4oQiVadK&wxtoken=&appmsg_token=1107_oJGIsfFfD4hPtgQHxHkcD0699YFcNyXBnu_laA~~&x5=0&f=json".format(str(i * 10))
        urls.append(url)
    pool = newPool(3)  # 3 worker threads
    pool.map(self.Parser_Data, urls)
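The newPool used here is assumed to be multiprocessing.dummy.Pool, i.e. a thread pool, which suits I/O-bound requests. A self-contained sketch of the same map pattern, with a stand-in fetch function instead of a real request:

```python
# Sketch of the thread-pool pattern; newPool is assumed to alias
# multiprocessing.dummy.Pool (threads, not processes).
from multiprocessing.dummy import Pool as newPool

def fetch(url):
    return len(url)  # stand-in for the real HTTP request

pool = newPool(3)                              # 3 worker threads
results = pool.map(fetch, ["a", "bb", "ccc"])  # blocks until all tasks finish
pool.close()
pool.join()
print(results)  # → [1, 2, 3]
```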
With these two functions we can already collect every article's title and URL. Last, a function that fetches each article page and saves its text:
def Get_artice(self, url, title):
    try:
        text = requests.get(url=url, headers=self.head).text
        html = etree.HTML(text)
        # The article body sits in the rich_media_content div
        # (note the trailing space inside the class attribute).
        text = "\n".join(html.xpath('//div[@class="rich_media_content "]//text()')).strip()
        with open("./gzh/%s.txt" % (title), "w", encoding="utf-8") as f:
            f.write(text)
    except Exception as e:
        print(e)
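One pitfall with saving by title: article titles often contain characters that are illegal in Windows filenames (/, :, ?, and so on), which makes the open() call above fail. A small hypothetical helper, not part of the original code, that sanitizes a title first:

```python
import re

def safe_name(title):
    # Replace characters that are illegal in Windows filenames with "_".
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()
```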
All that is left is a main function to tie everything together:
def main(self):
    self.head = {
        "Host": "mp.weixin.qq.com",
        "Connection": "keep-alive",
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat QBCore/3.43.884.400 QQBrowser/9.0.2524.400",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=Mz005820472388cda7e8c12be9&devicetype=Windows+10&version=620603c8&lang=zh_CN&a8scene=7&pass_ticket=dC03dEe6iiIOydQ1ju1FlPWP4OX6VLYNRzpscK7SAz2X2BnzPln81Zt%2B4oQiVadK&winzoom=1",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-us;q=0.6,en;q=0.5;q=0.4",
        "Cookie": "wxuin=R3Z5Y1kzR3BwUXY5SDk3anJRM2U4bWFWeFlwZzNEWHpfU1VHeUF3VnlMUzdydXp4ZFdraHFQdUJDWnJqQVNBQUF+MOqUmoMGOA1AlU4="
    }
    self.Get_Data()
    df = pd.read_csv("gzh.csv", names=["title", "url"])
    print("-" * 30 + "Collected %d article links" % (len(df.url)) + "-" * 30)
    if not os.path.exists("gzh"):
        os.mkdir("gzh")
    print("-" * 30 + "Saving articles" + "-" * 30)
    pool = newPool(3)
    # Pool.map only takes a single iterable; use starmap to pass (url, title) pairs.
    pool.starmap(self.Get_artice, zip(df.url, df.title))
And that is the whole thing.
Note that the packet URL captured with Fiddler changes quickly (the key and token parameters expire), so you have to re-capture and update the URL before every run.
The full code is available on the WeChat official account "阿虚学Python" by replying "公众号".
Thanks for reading.
Please credit the source when reposting.