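A newbie's record of a failed web-scraping attempt
I recently started learning Python web scraping. After following a tutorial and scraping a short list of American TV shows from Douban, I got a bit cocky and kept looking for something else to scrape.
So I set out to scrape a paid novel from a WeChat official account.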
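First, capture the traffic.
I bought this chapter with book coins earned from daily check-ins; I have no coins left, so the next chapter has to be paid for ^_^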
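Look at the requests that were sent:
Opening the novel triggers two HTTPS requests. The response to the first one has a body of 10184, so my guess is that it returns the HTML of the novel page.
Request 1:
Request headers: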
GET https://wx25b3aa07592dbc39.taoteq.cn/index/book/chapter?book_id=11010024389&sid=47117049&ext=%7B%22mark%22%3A%221011%22%2C%22push_id%22%3A%2211010024389%22%2C%22push_idx%22%3A1%2C%22push_time%22%3A1567097313%7D HTTP/1.1
Host: wx25b3aa07592dbc39.taoteq.cn
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36 QBCore/3.53.1159.400 QQBrowser/9.0.2524.400 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.5;q=0.4
Cookie: user_id=45400321; channel_id=9808; openid=oKoR154o49UwxjOVi6vG0yPRaEMk; ext=%7B%22mark%22%3A%221011%22%2C%22push_id%22%3A%2211010024389%22%2C%22push_idx%22%3A1%2C%22push_time%22%3A1567097313%7D
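The `ext` value in both the URL and the Cookie header above is percent-encoded JSON. As a minimal sketch of the decoding step, using only the standard library, the snippet below decodes it and splits the Cookie header into a dict. The helper name `parse_cookie_header` is my own and not part of any library; the values are copied from the captured request.

```python
import json
from urllib.parse import unquote

# Percent-encoded "ext" value copied from the captured request.
ext_encoded = (
    "%7B%22mark%22%3A%221011%22%2C%22push_id%22%3A%2211010024389%22"
    "%2C%22push_idx%22%3A1%2C%22push_time%22%3A1567097313%7D"
)

# Cookie header copied from the captured request.
cookie_header = (
    "user_id=45400321; channel_id=9808; "
    "openid=oKoR154o49UwxjOVi6vG0yPRaEMk; ext=" + ext_encoded
)


def parse_cookie_header(header):
    """Hypothetical helper: split a Cookie header into a name -> value dict."""
    cookies = {}
    for pair in header.split(";"):
        name, _, value = pair.strip().partition("=")
        if name:
            cookies[name] = value
    return cookies


# unquote turns %7B into {, %22 into ", %3A into :, %2C into , and so on.
ext_decoded = unquote(ext_encoded)
ext = json.loads(ext_decoded)

cookies = parse_cookie_header(cookie_header)

print(ext_decoded)
print(cookies["user_id"], cookies["channel_id"])
```

A dict like `cookies` can be handed to `requests.get` through its `cookies=` argument, and query values can go through `params=`, which lets requests do the percent-encoding again. Whether the server accepts the request outside the WeChat client is a separate question.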
After decoding the URL and the cookie, try the request with requests.get:
import requests
url = 'https://wx25b3aa07592dbc39.taoteq.cn/index/book/chapter?book_id=11010024389&sid=47117049&ext={"mark":