![](https://i-blog.csdnimg.cn/blog_migrate/f842e8842d37634fc2653babdaa5887f.png)
![](https://i-blog.csdnimg.cn/blog_migrate/5856cf72f56ee28f4ab7b20d956515b8.png)
The URL does not change after paging.
![](https://i-blog.csdnimg.cn/blog_migrate/f4cae0896bdbc416d46b0c685dcd2d9e.png)
![](https://i-blog.csdnimg.cn/blog_migrate/e5b176f63628ec04da554390572b8833.png)
On Toutiao, the URL likewise stays the same when you page through results.
![](https://i-blog.csdnimg.cn/blog_migrate/279d941104f2fea184daefa0ea4cba69.png)
Extra entries appear in the left-hand panel:
![](https://i-blog.csdnimg.cn/blog_migrate/082317466660ea9ec785dc755f55fffc.png)
The JSONView plugin in Chrome:
![](https://i-blog.csdnimg.cn/blog_migrate/07ea4d124f36cd9b296a250e0a7808ab.png)
So we add different request headers: `headers`.
http://www.zhihu.com/api/v4/people/112
From experience, deleting the `api` segment from the URL lets you open this link directly in a browser.
![](https://i-blog.csdnimg.cn/blog_migrate/a90a67195a0623a9cf7a198d4bbf2b0f.png)
The first argument to `requests.get` is always the URL and needs no keyword; the arguments after it, such as the `headers` dict, must be passed by keyword.
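As a minimal sketch of this calling convention (the header values are placeholders, and the request is only prepared, not sent, so no network access happens):

```python
import requests

# The URL is the first positional argument; headers, params, etc. are passed
# by keyword. Preparing (not sending) the request lets us inspect what is built.
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder header value
req = requests.Request('GET', 'https://www.zhihu.com/api/v4/people/112',
                       headers=headers, params={'limit': 20})
prepared = req.prepare()
print(prepared.url)                    # the query string is appended automatically
print(prepared.headers['User-Agent'])
```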
Paging is handled next:
![](https://i-blog.csdnimg.cn/blog_migrate/99578cec9c736c2e59527f9ecd08a97d.png)
Search Baidu for `python .extend` to see how it works:
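The difference between `extend` and `append` is why the script below uses `extend`: each API page returns a list of users, and `extend` merges that list into `user_data` item by item instead of nesting it:

```python
a = [1, 2]
a.extend([3, 4])  # extend unpacks the iterable into the list: [1, 2, 3, 4]
print(a)

b = [1, 2]
b.append([3, 4])  # append adds the whole list as one element: [1, 2, [3, 4]]
print(b)
```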
![](https://i-blog.csdnimg.cn/blog_migrate/dd7ec6fa7441a3a72ce5a660c3bac5c1.png)
```python
import requests
import pandas as pd
import time

headers = {
    'authorization': 'Bearer 2|1:0|10:1513832293|4:z_c0|92:Mi4xUFJOakF3QUFBQUFBa0lLVHVlN2REQ1lBQUFCZ0FsVk5aWTBvV3dBTW4yUk1XX0l2YjNhNlNSUmhmRy1GaDZsWWVR|d45ed089d0c3ca18eff8a3f5bee812db4804d2a13a92b69f124d47b5a82d0292',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3278.0 Safari/537.36',
    'X-UDID': 'AGBsMCXoEg2PTrQf77mdwRHSy0xePXc5juQ='
}
url = 'https://www.zhihu.com/api/v4/members/zhong-guo-ke-pu-bo-lan/followers?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=60&limit=20'
user_data = []

def get_user_data(page):  # number of pages to crawl
    for i in range(page):
        url = 'https://www.zhihu.com/api/v4/members/excited-vczh/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20'.format(i * 20)
        response = requests.get(url, headers=headers).json()['data']
        user_data.extend(response)  # merge this page of results into user_data
        print('Crawling page %s' % str(i + 1))
        time.sleep(1)  # pause 1 second between requests so the crawler is not detected

if __name__ == '__main__':
    get_user_data(10)
    df = pd.DataFrame.from_dict(user_data)
    df.to_csv('users.csv')
```
End of the script.
```python
'''
response = requests.get(url, headers=headers).json()['data']
df = pd.DataFrame.from_dict(response)  # from_dict converts the JSON data directly
df.to_csv('zhihu.csv')  # this raised an error; experience says it is Zhihu's anti-crawling
'''
```
JSON behaves like a dictionary. Look at what the JSON response returns: it is a dict, so you access its values the same way you would with any dict. XPath, by contrast, queries the page source.
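For example (a made-up payload shaped like the Zhihu followers response), parsed JSON is indexed exactly like nested dicts and lists:

```python
import json

# A made-up payload shaped like the Zhihu followers response
raw = '{"data": [{"name": "alice", "follower_count": 3}], "paging": {"is_end": false}}'
payload = json.loads(raw)  # JSON objects become dicts, JSON arrays become lists
print(payload['data'][0]['name'])            # alice
print(payload['data'][0]['follower_count'])  # 3
```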
Running the code produces the following output:
![](https://i-blog.csdnimg.cn/blog_migrate/2dc646a6e1d28354ed9416a4194e1f90.png)
![](https://i-blog.csdnimg.cn/blog_migrate/554dae59a0386b6a410b28ac72523ee1.png)
![](https://i-blog.csdnimg.cn/blog_migrate/ba88175a105da0da7ac0093e8bada8f6.png)
![](https://i-blog.csdnimg.cn/blog_migrate/d599620ebe688a934bf66deee495f8d1.png)
![](https://i-blog.csdnimg.cn/blog_migrate/f43a16b612a7b03202f7b4af9b61fa8b.png)
![](https://i-blog.csdnimg.cn/blog_migrate/0246b4fa8ed0d118196d5d6790af38c8.png)
![](https://i-blog.csdnimg.cn/blog_migrate/afa43d009b65abdb819a3325716de4f8.png)
![](https://i-blog.csdnimg.cn/blog_migrate/e556bd46a02dda7af06982dccd7d75b0.png)
Option A is not necessarily the current page, because sometimes the URL does not change when paging.
`x-requested-with: XMLHttpRequest`
// indicates an asynchronous AJAX request, i.e. a JSON response
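One way to sketch the server side of this (a hypothetical helper, not part of the script above): while the AJAX request carries `X-Requested-With`, the JSON body itself is usually advertised by the `Content-Type` response header:

```python
def is_json_response(response_headers):
    """Return True if the response headers advertise a JSON body (hypothetical helper)."""
    return response_headers.get('Content-Type', '').startswith('application/json')

print(is_json_response({'Content-Type': 'application/json; charset=utf-8'}))  # True
print(is_json_response({'Content-Type': 'text/html; charset=utf-8'}))         # False
```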
![](https://i-blog.csdnimg.cn/blog_migrate/15c58b778f32d6a62c9ee5434aa090ca.png)
`range(3)` yields 0, 1, 2.
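This is what drives the offsets in the crawler: because `range(page)` starts at 0, `i * 20` produces the page offsets directly:

```python
# range(3) yields 0, 1, 2, so three pages map to these API offsets
offsets = [i * 20 for i in range(3)]
print(offsets)  # [0, 20, 40]
```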
![](https://i-blog.csdnimg.cn/blog_migrate/6905ada667efee6aa571fc4d0a19a84a.png)
![](https://i-blog.csdnimg.cn/blog_migrate/b08b356673b4e8497d83febd5306da3b.png)
![](https://i-blog.csdnimg.cn/blog_migrate/337ba394845bf1fdbb045170a862f0a8.png)
Starred (bonus) question.