Hands-on Practice
Step one: open the 51job site and search for python to get the results URL, which is the part underlined in red in the screenshot below.
Using BS4
Let's start with a simple demo that scrapes just the first page.
# imports
import requests
from bs4 import BeautifulSoup
# target URL
url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html'
# build the request headers
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3902.4 Safari/537.36',
}
# request the page
response = requests.get(url=url, headers=header)
print(response.text)
The output of print(response.text) looks like the screenshot below.
You can see that the part circled in yellow is garbled. That's because the tag attribute circled in red declares charset=gbk (in HTML, charset specifies the character encoding), so the page was decoded with the wrong codec. The fix is to set the encoding explicitly: response.encoding = 'gbk'.
# re-decode the response text; 51job serves its pages as GBK
response.encoding = 'gbk'
After re-encoding, the output looks like the screenshot below~
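As a quick offline illustration (not part of the original scrape), here's why the mojibake appears. We encode a Chinese string as GBK, the way the server sends it, then decode it with Latin-1 (a fallback similar to what requests uses when the HTTP headers don't declare a charset) versus decoding it as GBK:

```python
# bytes as the 51job server would send them (GBK-encoded)
raw = '职位'.encode('gbk')

# decoding with the wrong codec produces mojibake
wrong = raw.decode('iso-8859-1')
# decoding as GBK recovers the text; this is what
# response.encoding = 'gbk' tells requests to do
right = raw.decode('gbk')

print(wrong)   # garbled characters
print(right)   # 职位
```

So setting response.encoding simply tells requests which codec to use when turning the raw response bytes into text.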
# parse the text with BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
# use a CSS selector to grab the target blocks:
# every tag with class "el" under the element whose id is "resultList"
all_info = soup.select('#resultList .el')
# an empty list to hold the scraped records
job_list = []
# loop over the matched tags and extract each field
for item in all_info:
    job_info = {}
    # use try/except so a row with an empty job title or salary doesn't
    # raise while building the dict and abort the whole run
    try:
        job_info['职位'] = item.select('.t1 a')[0].string.strip()
        job_info['公司'] = item.select('.t2 a')[0].string.strip()
        job_info['工作地点'] = item.select('.t3')[0].string.strip()
        try:
            job_info['薪资'] = item.select('.t4')[0].string.strip()
        except IndexError:
            job_info['薪资'] = '面议'
        job_info['发布时间'] = item.select('.t5')[0].string.strip()
        job_list.append(job_info)
    except Exception as e:
        print(e)
# print the scraped records
for i in job_list:
    print(i)
The printed output:
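To see exactly what soup.select('#resultList .el') matches without hitting the network, here's a sketch using made-up sample HTML (the real 51job markup is richer, but the structure is the same idea). It also shows why the inner try/except is needed: some rows have no salary element at all.

```python
from bs4 import BeautifulSoup

# hypothetical sample mimicking the result-list structure
html = '''
<div id="resultList">
  <div class="el"><p class="t1"><a>Python工程师</a></p><span class="t4">1-1.5万/月</span></div>
  <div class="el"><p class="t1"><a>爬虫工程师</a></p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# "#resultList .el" = elements with class "el" under id "resultList"
rows = soup.select('#resultList .el')
print(len(rows))                       # 2 rows matched
print(rows[0].select('.t1 a')[0].string)
print(rows[1].select('.t4'))           # [] -- no salary, so [0] would raise IndexError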
Using XPath
import requests
from lxml import etree

url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3902.4 Safari/537.36',
}
response = requests.get(url=url, headers=header)
response.encoding = 'gbk'
# parse the text into an element tree
html = etree.HTML(response.text)
info_list = []
# every div with class "el" directly under the div whose id is "resultList"
all_div = html.xpath('//div[@id="resultList"]/div[@class="el"]')
for item in all_div:
    job_info = {}
    job_info['职位'] = item.xpath('./p/span/a/@title')[0]
    job_info['公司'] = item.xpath('./span/a/@title')[0]
    job_info['工作地点'] = item.xpath('./span[@class="t3"]/text()')[0]
    try:
        job_info['薪资'] = item.xpath('./span[@class="t4"]/text()')[0]
    except IndexError:
        # some rows have no salary element
        job_info['薪资'] = '无数据'
    job_info['发布时间'] = item.xpath('./span[@class="t5"]/text()')[0]
    info_list.append(job_info)
for i in info_list:
    print(i)
The printed result is the same as before.
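Both versions above only fetch the first page. In the 51job list URL, the last number before ".html" appears to be the page index (this is an assumption about the URL scheme, not something stated above), so crawling more pages is just a matter of formatting that slot. This sketch only builds the URLs; each one would then be fed into the requests code above:

```python
# hypothetical page template: the trailing {} is assumed to be the page number
base = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html'

# build URLs for the first three result pages
urls = [base.format(page) for page in range(1, 4)]
for u in urls:
    print(u)
```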