Hands-on Practice
Step one: open the 51job site and search for python to get the results URL, which is the part underlined in red in the screenshot below.
Using BS4
Let's start with a simple demo that scrapes just the first page.
# imports
import requests
from bs4 import BeautifulSoup
# target URL
url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html'
# build the request headers
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3902.4 Safari/537.36',
}
# request the page
response = requests.get(url=url, headers=header)
print(response.text)
The output of print(response.text) looks like the screenshot below.
You can see that the part circled in yellow is garbled. That's because the tag attribute circled in red declares charset=gbk (in HTML, charset specifies the character encoding), so the page was decoded with the wrong codec. The fix is to set the encoding explicitly: response.encoding = 'gbk'.
# re-decode the response text; 51job serves its pages as GBK
response.encoding = 'gbk'
After re-encoding, the output looks like the screenshot below~
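As a quick offline illustration (not part of the original scrape), here's why the mojibake appears. We encode a Chinese string as GBK, the way the server sends it, then decode it with Latin-1 (a fallback similar to what requests uses when the HTTP headers don't declare a charset) versus decoding it as GBK:

```python
# bytes as the 51job server would send them (GBK-encoded)
raw = '职位'.encode('gbk')

# decoding with the wrong codec produces mojibake
wrong = raw.decode('iso-8859-1')
# decoding as GBK recovers the text; this is what
# response.encoding = 'gbk' tells requests to do
right = raw.decode('gbk')

print(wrong)   # garbled characters
print(right)   # 职位
```

So setting response.encoding simply tells requests which codec to use when turning the raw response bytes into text.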
# parse the text with BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
# use a CSS selector to grab the target blocks:
# every tag with class "el" under the element whose id is "resultList"
all_info = soup.select('#resultList .el')
# an empty list to hold the scraped records
job_list = []
# loop over the matched tags and extract each field
for item in all_info:
    job_info = {}
    # use try/except so a row with an empty job title or salary doesn't
    # raise while building the dict and abort the whole run
    try:
        job_info['职位'] = item.select('.t1 a')[0].string.strip()
        job_info['公司'] = item.select('.t2 a')[0].string.strip()
        job_info['工作地点'] = item.select('.t3')[0].string.strip()
        try:
            job_info['薪资'] = item.select('.t4')[0].string.strip()
        except IndexError:
            job_info['薪资'] = '面议'
        job_info['发布时间'] = item.select('.t5')[0].string.strip()
        job_list.append(job_info)
    except Exception as e:
        print(e)
# print the scraped records
for i in job_list:
    print(i)
The printed output:
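To see exactly what soup.select('#resultList .el') matches without hitting the network, here's a sketch using made-up sample HTML (the real 51job markup is richer, but the structure is the same idea). It also shows why the inner try/except is needed: some rows have no salary element at all.

```python
from bs4 import BeautifulSoup

# hypothetical sample mimicking the result-list structure
html = '''
<div id="resultList">
  <div class="el"><p class="t1"><a>Python工程师</a></p><span class="t4">1-1.5万/月</span></div>
  <div class="el"><p class="t1"><a>爬虫工程师</a></p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# "#resultList .el" = elements with class "el" under id "resultList"
rows = soup.select('#resultList .el')
print(len(rows))                       # 2 rows matched
print(rows[0].select('.t1 a')[0].string)
print(rows[1].select('.t4'))           # [] -- no salary, so [0] would raise IndexError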
Using XPath
import requests
from lxml import etree

url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3902.4 Safari/537.36',
}
response = requests.get(url=url, headers=header)
response.encoding = 'gbk'
# parse the text into an element tree
html = etree.HTML(response.text)
info_list = []
# every div with class "el" directly under the div whose id is "resultList"
all_div = html.xpath('//div[@id="resultList"]/div[@class="el"]')
for item in all_div:
    job_info = {}
    job_info['职位'] = item.xpath('./p/span/a/@title')[0]
    job_info['公司'] = item.xpath('./span/a/@title')[0]
    job_info['工作地点'] = item.xpath('./span[@class="t3"]/text()')[0]
    try:
        job_info['薪资'] = item.xpath('./span[@class="t4"]/text()')[0]
    except IndexError:
        # some rows have no salary element
        job_info['薪资'] = '无数据'
    job_info['发布时间'] = item.xpath('./span[@class="t5"]/text()')[0]
    info_list.append(job_info)
for i in info_list:
    print(i)
The printed result is the same as before.
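Both versions above only fetch the first page. In the 51job list URL, the last number before ".html" appears to be the page index (this is an assumption about the URL scheme, not something stated above), so crawling more pages is just a matter of formatting that slot. This sketch only builds the URLs; each one would then be fed into the requests code above:

```python
# hypothetical page template: the trailing {} is assumed to be the page number
base = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html'

# build URLs for the first three result pages
urls = [base.format(page) for page in range(1, 4)]
for u in urls:
    print(u)
```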