Python Web Scraping Example: Zaih (在行), Psychology Section

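Preface:

1. This example runs on the Alibaba Cloud Tianchi platform and scrapes the 'name', 'city', and related fields of the practitioners listed on the 28 pages of the "Psychology" (心理) section of the Zaih (在行) website.

2. Websites used in this example:

(1) Alibaba Cloud Tianchi: https://tianchi.aliyun.com/

(2) Zaih: https://www.zaih.com/

3. Software required: Postman, available from the official download page: https://www.postman.com/downloads/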

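Main text:

I. Prepare the environment (Tianchi):

1. Open the Alibaba Cloud Tianchi website (https://tianchi.aliyun.com/) and select "Tianchi Lab" (天池实验室) --> "Tianchi Notebook" (天池Notebook) to reach the page below. Register an account and log in (choose the China site). Alibaba Cloud requires a personal account and real-name verification.

(If you would rather skip real-name verification, you can switch to a local Python environment such as PyCharm, or use another hosted Python environment such as Baidu PaddlePaddle AI Studio.)

2. Click New "Notebook" to reach the page below, then cut out all of the code it shows so the cell is empty.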
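If you take the local route mentioned in the note under step 1, keep in mind that `requests` and `xlwt` (both used later in this example) are third-party packages, not part of the standard library. The sketch below is one way to check which of them still need a `pip install`; the helper name `missing_packages` is made up for illustration.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Anything listed here still needs `pip install <name>` before the later steps will run.
print(missing_packages(["requests", "xlwt"]))
```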

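II. Prepare the environment (Zaih)

1. Open the Zaih website (https://www.zaih.com/). If you use Google Chrome, right-click the page and choose "Inspect" (or open the menu at the top right and go to "More tools" > "Developer tools").

I use Edge: open the menu at the top right and click "Developer tools" to enter the developer tools:

2. In the developer tools, click "Network" and select "Fetch/XHR". Then click the "Psychology" section on the Zaih site; a request entry labeled "4" appears. Click that entry.

3. Then click "Response" on the right. The response body is data made up of lists and dictionaries.

Following the figure on the right, right-click and copy the code, then open Postman.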
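To make the "lists and dictionaries" shape from step 3 concrete, here is a minimal sketch of how such a response body turns into Python objects. The sample text is hypothetical and far shorter than the real response; only the field names `name` and `city` come from this example.

```python
import json

# Hypothetical sample shaped like the response: a JSON array of objects.
sample_text = '[{"name": "Alice", "city": "Beijing"}, {"name": "Bob", "city": "Shanghai"}]'

records = json.loads(sample_text)                    # JSON array -> Python list of dicts
pairs = [(r["name"], r["city"]) for r in records]    # pick two fields from each record
print(pairs)
```

Each element of the list is one practitioner, and each field is looked up by its key, which is how the later steps read 'name' and 'city'.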

 

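III. Postman

1. Paste the copied code into the GET field and click "Send". Then follow the steps in the figure to copy the generated Python code.

IV. Writing the code (Tianchi)

1. Paste the code copied from Postman into the Alibaba Cloud Notebook:

(This is the code I pasted this time.)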

import requests

url = "https://www.zaih.com/falcon/mentor_api/v1/tags/479/mentors?per_page=15"

payload = {}
headers = {
  'Accept': 'application/json, text/plain, */*',
  'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
  'Authorization': 'Basic d2ViOjlkZjk4dTJqa2hsaGphMjEzSExOMTJqaGtHS0dPOTMxMg==',
  'Connection': 'keep-alive',
  'Cookie': '_ga=GA1.1.1994262530.1712130413; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2218ea2eccff21036-0fb7fa764cafd7-4c657b58-2621440-18ea2eccff31dbc%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E5%BC%95%E8%8D%90%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fmp.csdn.net%2Fmp_blog%2Fcreation%2Feditor%3Fspm%3D1011.2124.3001.6192%22%7D%2C%22%24device_id%22%3A%2218ea2eccff21036-0fb7fa764cafd7-4c657b58-2621440-18ea2eccff31dbc%22%7D; _ga_TBFHLKW1ZL=GS1.1.1712305419.5.1.1712305431.0.0.0',
  'Referer': 'https://www.zaih.com/',
  'Sec-Fetch-Dest': 'empty',
  'Sec-Fetch-Mode': 'cors',
  'Sec-Fetch-Site': 'same-origin',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0',
  'sa-platform': 'PC',
  'sec-ch-ua': '"Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
  'sec-ch-ua-mobile': '?0',
  'sec-ch-ua-platform': '"Windows"'
}

response = requests.request("GET", url, headers=headers, data=payload)

print(response.text)
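The URL in this code carries a single query parameter, `per_page=15`; the later loop adds a `page` parameter to walk through the pages. As a rough sketch of what that change amounts to, the standard library can set the parameter without editing the string by hand. The helper name `with_page` is an assumption for illustration, not part of the Postman output.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = "https://www.zaih.com/falcon/mentor_api/v1/tags/479/mentors?per_page=15"

def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # parse_qs maps each key to a list of values; keep the first value of each key.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))

print(with_page(url, 2))
```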

2. Click Run:

3. Print all of the scraped information

Change the code to the form below and run it to scrape the names and other fields from every page.

import requests         # keep the import so this cell also runs on its own
current_index = 0       # running record counter across all pages
for i in range(1,29):   # the Psychology section has 28 pages in total
    url = f"https://www.zaih.com/falcon/mentor_api/v1/tags/479/mentors?page={i}&per_page=15"   # the query string is now page={i}&per_page=15

    payload = {}
    headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Authorization': 'Basic d2ViOjlkZjk4dTJqa2hsaGphMjEzSExOMTJqaGtHS0dPOTMxMg==',
    'Connection': 'keep-alive',
    'Cookie': 'sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2218ea2eccff21036-0fb7fa764cafd7-4c657b58-2621440-18ea2eccff31dbc%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fcn.bing.com%2F%22%7D%2C%22%24device_id%22%3A%2218ea2eccff21036-0fb7fa764cafd7-4c657b58-2621440-18ea2eccff31dbc%22%7D; _ga=GA1.1.1994262530.1712130413; _ga_TBFHLKW1ZL=GS1.1.1712133292.2.0.1712133292.0.0.0',
    'Referer': 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0',
    'sa-platform': 'PC',
    'sec-ch-ua': '"Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"'
    }

    response = requests.request("GET", url, headers=headers, data=payload)
    response.raise_for_status()   # stop on an HTTP error instead of parsing an error page

    # print(response.text)   # the raw print from step 1 is no longer needed

    # Added code: parse the JSON body into Python objects (a list of dicts)
    objects = response.json()

    # Print one tab-separated line per practitioner
    for obj in objects:
        current_index += 1
        print(current_index, end='\t')
        print(obj['name'], end='\t')
        print(obj['city'], end='\t')
        print(obj['occupation'], end='\t')
        print(obj['location'])

The printed output is shown in the figure:
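The loop in step 3 hardcodes 28 pages and indexes every field directly, so it breaks if the page count changes or a record lacks a field. The sketch below shows one possible variation: stop at the first empty page and read fields with `.get()`. It assumes the API returns an empty list past the last page, which is not verified here. `iter_records` and `fake_fetch_page` are hypothetical names, and `fake_fetch_page` stands in for the real HTTP request so the sketch runs offline.

```python
def iter_records(fetch_page, max_pages=28):
    """Yield (index, record) pairs page by page, stopping at the first empty page."""
    index = 0
    for page in range(1, max_pages + 1):
        records = fetch_page(page)      # expected to return a list of dicts
        if not records:                 # assumption: an empty list means no more pages
            break
        for record in records:
            index += 1
            yield index, record

def fake_fetch_page(page):
    """Stand-in for the real request: two pages of made-up data, then nothing."""
    data = {1: [{"name": "A", "city": "Beijing"}], 2: [{"name": "B"}]}
    return data.get(page, [])

for index, record in iter_records(fake_fetch_page):
    # .get() returns a default instead of raising KeyError when a field is absent
    print(index, record.get("name", ""), record.get("city", ""), sep="\t")
```

To use it against the site, `fetch_page` would wrap the `requests` call from step 3 and return the parsed list for the given page.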

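4. Write the scraped information to Excel

Writing to Excel requires installing a library. When the command succeeds, the output contains "Successfully installed xlwt":

! pip install xlwt

5. Change the code from step 3 to the following and run it: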

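import requests
import xlwt

workbook = xlwt.Workbook('utf-8')              # the first argument is the workbook encoding
worksheet = workbook.add_sheet('在行心理学')    # sheet name

current_index = 0       # running record counter across all pages
for i in range(1,29):   # 28 pages in total
    url = f"https://www.zaih.com/falcon/mentor_api/v1/tags/479/mentors?page={i}&per_page=15"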

    payload = {}
    headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Authorization': 'Basic d2ViOjlkZjk4dTJqa2hsaGphMjEzSExOMTJqaGtHS0dPOTMxMg==',
    'Connection': 'keep-alive',
    'Cookie': 'sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2218ea2eccff21036-0fb7fa764cafd7-4c657b58-2621440-18ea2eccff31dbc%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fcn.bing.com%2F%22%7D%2C%22%24device_id%22%3A%2218ea2eccff21036-0fb7fa764cafd7-4c657b58-2621440-18ea2eccff31dbc%22%7D; _ga=GA1.1.1994262530.1712130413; _ga_TBFHLKW1ZL=GS1.1.1712133292.2.0.1712133292.0.0.0',
    'Referer': 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0',
    'sa-platform': 'PC',
    'sec-ch-ua': '"Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"'
    }

    response = requests.request("GET", url, headers=headers, data=payload)
    response.raise_for_status()   # stop on an HTTP error instead of parsing an error page

    objects = response.json()     # parse the JSON body into a list of dicts
    for obj in objects:
        current_index += 1
        # worksheet.write(row, column, value); rows and columns are 0-based
        worksheet.write(current_index-1, 0, current_index)
        worksheet.write(current_index-1, 1, obj['name'])
        worksheet.write(current_index-1, 2, obj['city'])
        worksheet.write(current_index-1, 3, obj['occupation'])
        worksheet.write(current_index-1, 4, obj['location'])



workbook.save('./在行.xls')   # xlwt writes the legacy .xls format, so save with the .xls extension

Running this produces the file "在行.xls".

6. Right-click the file to download it, then open it in Excel:

Scraping complete!
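As a variation on step 5, if installing xlwt is not convenient, the standard-library `csv` module can write a file that Excel also opens. The sketch below is an alternative, not the method used above. The helper name `write_csv`, the file name `zaih_sample.csv`, and the sample record are made up for illustration; the four field names are the ones used in this example.

```python
import csv

def write_csv(path, records, fields=("name", "city", "occupation", "location")):
    """Write a header row plus one numbered row per record to a CSV file."""
    # utf-8-sig adds a byte order mark, which helps Excel detect UTF-8 text.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(("index",) + tuple(fields))
        for i, record in enumerate(records, start=1):
            # Missing fields become empty cells instead of raising KeyError.
            writer.writerow([i] + [record.get(field, "") for field in fields])

# Hypothetical sample record standing in for the scraped data.
sample = [{"name": "A", "city": "Beijing", "occupation": "Counselor", "location": "Haidian"}]
write_csv("zaih_sample.csv", sample)
```

Unlike the xlwt version, this one also writes a header row, so the columns are labeled when the file is opened.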
