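Scraping target
We want to scrape, from every listing page of cnblogs (博客园), each post's title, link, author name, publish time, and its like, comment, and view counts.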
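Analysis
Capturing the traffic shows that the URL stays the same when you click through to different pages, which suggests the list is loaded by an AJAX request. So we only need to reproduce that request the way the browser sends it.
The one thing to watch is that the request's content-type is JSON, so data has to be serialized to a JSON string before it is sent.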
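To make the content-type point concrete, here is a minimal sketch comparing the two body formats. It uses only the standard library and sends nothing over the network. The name `payload` is chosen for this illustration; its fields are the ones from the captured request. `urlencode` is used here to approximate what requests produces when a plain dict is passed via `data=`.

```python
import json
from urllib.parse import urlencode

# Payload fields copied from the request captured in the browser.
payload = {
    'CategoryId': '808',
    'CategoryType': 'SiteHome',
    'ItemListActionName': 'AggSitePostList',
    'PageIndex': '6',
    'ParentCategoryId': '0',
    'TotalPostCount': '4000',
}

# Form-encoded body: roughly what is sent when a dict is passed via data=.
form_body = urlencode(payload)

# JSON body: what a request declaring content-type application/json should carry.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

The first line printed is a `key=value&key=value` string, the second is a JSON object. Only the second matches the declared content-type, which is why the script passes `json.dumps(data)` rather than the dict itself.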
Code
# 2022-12-7
# Key point 1: the request's content-type header is JSON, so data must be serialized to JSON as well
import requests
import json
from lxml import etree
url = 'https://www.cnblogs.com/AggSite/AggSitePostList'
headers = {
'accept': 'application/json, text/javascript, */*; q=0.01',
# 'br' is left out: requests can only decode Brotli when a brotli package is installed
'accept-encoding': 'gzip, deflate',
'accept-language':'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'content-type':'application/json; charset=UTF-8',
# content-length is not set by hand: requests computes it from the body
'origin':'https://www.cnblogs.com',
'referer':'https://www.cnblogs.com/',
'sec-ch-ua':'"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'sec-fetch-dest':'empty',
'sec-fetch-mode':'cors',
'sec-fetch-site':'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'x-requested-with':'XMLHttpRequest',
}
data = {
'CategoryId': '808',
'CategoryType': 'SiteHome',
'ItemListActionName': 'AggSitePostList',
'PageIndex': '6',
'ParentCategoryId': '0',
'TotalPostCount': '4000',
}
# Send the payload as a JSON string to match the content-type header
response = requests.post(url=url, headers=headers, data=json.dumps(data))
# The response body is an HTML fragment, so parse it with lxml
tree = etree.HTML(response.text)
# Each XPath returns one list covering every post on the page
title = tree.xpath('//a[@class="post-item-title"]/text()')
href = tree.xpath('//a[@class="post-item-title"]/@href')
name = tree.xpath('//a[@class="post-item-author"]/span/text()')
time = tree.xpath('//span[@class="post-meta-item"]/span/text()')
# In the footer, links 2 to 4 hold the like, comment, and view counts
digg = tree.xpath('//footer[@class="post-item-foot"]/a[2]/span/text()')
comment = tree.xpath('//footer[@class="post-item-foot"]/a[3]/span/text()')
view = tree.xpath('//footer[@class="post-item-foot"]/a[4]/span/text()')
print(response.status_code)
print(title)
print(href)
print(name)
print(time)
print(digg)
print(comment)
print(view)
# print(response.text)
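The script above fetches a single page and prints seven parallel lists. As a possible next step, the sketch below shows one way to build the body for an arbitrary page and to merge the lists into one record per post. `build_payload` and `to_records` are hypothetical helpers introduced for this illustration, not part of the original script. The sketch assumes that only `PageIndex` needs to change between pages, which has not been checked against the site.

```python
import json

def build_payload(page_index, total_post_count=4000):
    """Return the JSON request body for one listing page (hypothetical helper)."""
    return json.dumps({
        'CategoryId': '808',
        'CategoryType': 'SiteHome',
        'ItemListActionName': 'AggSitePostList',
        'PageIndex': str(page_index),  # assumption: the only field that varies per page
        'ParentCategoryId': '0',
        'TotalPostCount': str(total_post_count),
    })

def to_records(title, href, name, time, digg, comment, view):
    """Merge the parallel XPath result lists into one dict per post."""
    fields = ('title', 'href', 'name', 'time', 'digg', 'comment', 'view')
    columns = (title, href, name, time, digg, comment, view)
    # If an XPath missed an element, the lists no longer line up; fail loudly
    # instead of silently pairing values from different posts.
    if len({len(column) for column in columns}) > 1:
        raise ValueError('field lists have different lengths')
    return [
        dict(zip(fields, (value.strip() for value in row)))
        for row in zip(*columns)
    ]
```

With these in place, a loop such as `for page in range(1, 4)` could call `requests.post(url=url, headers=headers, data=build_payload(page))`, run the same XPath queries on each response, and pass the resulting lists to `to_records`. The length check matters because the queries are independent: a post with a missing field would otherwise shift every later value by one position.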