Scraping Articles from the cnblogs (博客园) Home Page

Latest recommended article published on 2023-10-05 16:03:58

Marhoosh

Category: 技术杂谈 (Tech Miscellany). Tags: python, json, programming languages, web scraping

Article link: https://blog.csdn.net/weixin_45727188/article/details/128223129

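Scraping target

For every page of the cnblogs home page listing, we want to collect each article's title, link, author name, publish time, and its like, comment, and view counts.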

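Analysis

Capturing the network traffic shows that the URL does not change when you click through to a different page, which suggests the list is loaded by an AJAX request. The usual approach therefore applies: reproduce that request from code.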
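Because the URL stays the same, the page number has to travel in the request body. The following is a minimal sketch of building that body per page, assuming the field names seen in the captured request (`PageIndex`, `CategoryId`, and so on, the same ones used in the code section of this post). `build_payload` is a hypothetical helper name, not part of any library, and the sketch sends nothing over the network.

```python
import json


def build_payload(page_index):
    """Return the POST body fields for one listing page (hypothetical helper)."""
    return {
        'CategoryId': '808',
        'CategoryType': 'SiteHome',
        'ItemListActionName': 'AggSitePostList',
        'PageIndex': str(page_index),  # the only field that changes per page
        'ParentCategoryId': '0',
        'TotalPostCount': '4000',
    }


# One JSON body per page; the endpoint URL would be identical for all of them.
bodies = [json.dumps(build_payload(page)) for page in range(1, 4)]
for body in bodies:
    print(body)
```

Each string in `bodies` could then be passed as the `data` argument of a POST to the same endpoint, one request per page.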

The one thing to watch for is that the request's content-type is JSON, so data has to be serialized as JSON rather than sent as a form.

Code
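To see why the serialization matters, the sketch below compares two encodings of the same dictionary using only the standard library. Passing a plain dict as `data=` to `requests.post` form-encodes it, which does not match a `content-type` of `application/json`, whereas `json.dumps` produces the kind of body that header promises. The `payload` dict here is a shortened, illustrative one.

```python
import json
from urllib.parse import urlencode

# Shortened, illustrative payload (not the full set of fields).
payload = {'CategoryType': 'SiteHome', 'PageIndex': '6'}

form_body = urlencode(payload)   # what a form-encoded POST would carry
json_body = json.dumps(payload)  # what an application/json POST should carry

print(form_body)
print(json_body)
```

As an alternative to calling `json.dumps` by hand, `requests.post` also accepts `json=payload`, which serializes the dict and sets the `Content-Type` header itself.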

# 2022-12-7
# Key point: the request's content-type is JSON, so data must be serialized to JSON as well
import requests
import json
from lxml import etree
url = 'https://www.cnblogs.com/AggSite/AggSitePostList'

headers = {
    'accept': 'application/json, text/javascript, */*; q=0.01',
    # 'br' dropped: requests can only decode Brotli if a Brotli package is installed
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
    'content-type': 'application/json; charset=UTF-8',
    # 'content-length' is omitted: requests computes it from the body
    'origin': 'https://www.cnblogs.com',
    'referer': 'https://www.cnblogs.com/',
    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}

data = {
    'CategoryId': '808',
    'CategoryType': 'SiteHome',
    'ItemListActionName': 'AggSitePostList',
    'PageIndex': '6',
    'ParentCategoryId': '0',
    'TotalPostCount': '4000',
}

# Serialize the payload ourselves so the body matches the JSON content-type
response = requests.post(url=url, headers=headers, data=json.dumps(data))

# The response body is HTML, so parse it with lxml and query it with XPath
tree = etree.HTML(response.text)
title = tree.xpath('//a[@class="post-item-title"]/text()')
href = tree.xpath('//a[@class="post-item-title"]/@href')
name = tree.xpath('//a[@class="post-item-author"]/span/text()')
post_time = tree.xpath('//span[@class="post-meta-item"]/span/text()')
digg = tree.xpath('//footer[@class="post-item-foot"]/a[2]/span/text()')
comment = tree.xpath('//footer[@class="post-item-foot"]/a[3]/span/text()')
view = tree.xpath('//footer[@class="post-item-foot"]/a[4]/span/text()')
print(response.status_code)
print(title)
print(href)
print(name)
print(post_time)
print(digg)
print(comment)
print(view)

# print(response.text)
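
The XPath queries above return one flat list per field. A possible follow-up step, sketched here with made-up sample values rather than real scraped data, zips such lists into one record per article and serializes them as JSON. `to_records` is a hypothetical helper, and it assumes every list has the same length and order, which holds only if each article on the page exposes every field.

```python
import json


def to_records(columns):
    """Turn {'field': [values...]} into a list of per-article dicts (hypothetical helper)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # A missing field on one article would shift every later value.
        raise ValueError('field lists are misaligned: %s' % sorted(lengths))
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# Placeholder values standing in for the lists produced by the XPath queries.
records = to_records({
    'title': ['First post', 'Second post'],
    'href': ['https://example.com/post/1', 'https://example.com/post/2'],
    'name': ['alice', 'bob'],
})
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Checking the lengths first is a deliberate choice: parallel lists give no other signal when one article lacks a field, so failing loudly is safer than writing shifted rows.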