Python Crawler: Scraping Job Postings from Lagou (拉勾网)
Platform: Windows
Python version: Python 3.6
IDE: Sublime Text
Other tools: Chrome browser
0. Open the search page
First install the selenium library: pip install selenium
Use selenium to open the search results page.
The code is as follows:
self.driver = webdriver.Chrome()
self.url = 'https://www.lagou.com/jobs/list_python?px=default&city=%E5%85%A8%E5%9B%BD#filterBox'
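The city parameter in the URL above is the percent-encoded form of 全国 (nationwide). As a minimal sketch, assuming you want to search for other keywords or cities, the same URL can be assembled with the standard library. build_search_url is a hypothetical helper, not part of the original crawler.

```python
from urllib.parse import quote


def build_search_url(keyword, city):
    """Hypothetical helper: assemble a Lagou search URL like the one above."""
    # quote() percent-encodes non-ASCII text as UTF-8,
    # e.g. 全国 becomes %E5%85%A8%E5%9B%BD
    return ('https://www.lagou.com/jobs/list_%s?px=default&city=%s#filterBox'
            % (quote(keyword), quote(city)))


url = build_search_url('python', '全国')
print(url)
```

Called with 'python' and '全国', the helper reproduces the URL used in the article.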
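1. Find the job links
To reach the detail pages, once the search has completed we need to find the link to each job's detail page, as follows.
Use XPath to locate the links: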
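def page_list_details(self, source):
    # Parse the search results page and collect every job detail link
    html = etree.HTML(source)
    links = html.xpath('//a[@class="position_link"]/@href')
    for link in links:
        self.request_details(link)
        # Pause between detail pages so requests are not sent too quickly
        time.sleep(2)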
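The selection logic can be tried offline without a browser. The sketch below is a stand-in, not the author's code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, whereas lxml's etree.HTML tolerates real-world HTML. The sample markup and its URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fragment of the search results page
SAMPLE = """
<ul>
  <li><a class="position_link" href="https://www.lagou.com/jobs/1.html">Python</a></li>
  <li><a class="other" href="https://www.lagou.com/about">About</a></li>
  <li><a class="position_link" href="https://www.lagou.com/jobs/2.html">Crawler</a></li>
</ul>
"""


def extract_links(markup):
    root = ET.fromstring(markup)
    # ElementTree cannot select attributes directly with /@href,
    # so select the <a> elements first and then read the attribute
    anchors = root.findall(".//a[@class='position_link']")
    return [a.get('href') for a in anchors]


links = extract_links(SAMPLE)
print(links)
```

Only anchors whose class is exactly position_link are kept, which mirrors the predicate in the XPath expression above.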
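2. Open the job detail page and parse the job details
2.1 Parsing the detail page
Using the URL obtained in step 1, open the detail page in the browser and locate the data we need:
In the developer tools, use an XPath tool to check whether that data can be extracted:
2.2 Implementation
This step requires some knowledge of XPath syntax and the lxml library: pip install lxml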
def request_details(self, link):
    # Open the detail page in a new tab and switch to it
    self.driver.execute_script("window.open('%s')" % link)
    self.driver.switch_to.window(self.driver.window_handles[1])
    # Wait for the job title to appear before reading the page source
    WebDriverWait(self.driver, timeout=5).until(
        EC.presence_of_element_located((By.XPATH, '//span[@class="name"]'))
    )
    source = self.driver.page_source
    html = etree.HTML(source)
    name = html.xpath('//span[@class="name"]/text()')[0].strip()
    company = html.xpath('//h2[@class="fl"]/text()')[0].strip()
    # salary = html.xpath('//span[@class="salary"]/text()')
    # job_request holds salary, city, experience and education, in that order
    job_request = html.xpath('//dd[@class="job_request"]//span/text()')
    salary = job_request[0].strip()
    # Strip whitespace and the "/" separators from each field
    city = re.sub(r'[\s/]', '', job_request[1])
    experience = job_request[2]
    experience = re.sub(r'[\s/]', '', experience)
    education = re.sub(r'[\s/]', '', job_request[3])
    job_desc = ''.join(html.xpath('//dd[@class="job_bt"]//text()')).strip()
    position_detail = {
        'name': name,
        'city': city,
        'company': company,
        'salary': salary,
        'experience': experience,
        'education': education,
        # 'job_desc': job_desc
    }
    self.write_text(job_desc)
    self.position_details.append(position_detail)
    self.write_csv_rows(self.headers, position_detail)
    # Close the detail tab and return to the search results tab
    self.driver.close()
    self.driver.switch_to.window(self.driver.window_handles[0])
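The field cleanup in the function above relies on one regular expression that removes whitespace and the "/" separators. As a small sketch of that step in isolation: the job_request list here is a hypothetical example of what the XPath query might return, and clean_field is a helper name introduced only for illustration.

```python
import re


def clean_field(text):
    # Remove all whitespace and '/' separators, as the re.sub calls above do
    return re.sub(r'[\s/]', '', text)


# Hypothetical sample of the text nodes inside dd.job_request
job_request = ['15k-25k ', '/北京 /', '经验3-5年 /', '本科及以上 /', '全职']

salary = job_request[0].strip()
city, experience, education = (clean_field(s) for s in job_request[1:4])
print(salary, city, experience, education)
```

The hyphen in a range such as 3-5 is left untouched, since the character class covers only whitespace and "/".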
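3. Write the data to CSV or TXT files
The job description is written to a TXT file to make word-frequency analysis easier; the remaining fields are written to a CSV file.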
def write_csv_headles(self, headers):
    # Write the header row of the CSV file
    with open('lagou_positions.csv', 'a', encoding='utf-8', newline='') as f:
        position_headline = csv.DictWriter(f, headers)
        position_headline.writeheader()

def write_text(self, job_desc):
    # Append one job description, preceded by a separator line
    with open('lagou_position_details.txt', 'a', encoding='utf-8') as f:
        f.write('\n------------------------------------------------' + '\n')
        f.write(job_desc)

def write_csv_rows(self, headers, position_detail):
    # Append one job record as a CSV row
    with open('lagou_positions.csv', 'a', encoding='utf-8', newline='') as f:
        position_headlines = csv.DictWriter(f, headers)
        position_headlines.writerow(position_detail)
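Because the CSV file is opened in append mode, running the crawler twice would write the header row twice. One possible way around this, sketched here under assumptions, is to write the header only when the file is new or empty. The append_row helper, the HEADERS order, and the sample record are all hypothetical; the article does not show how self.headers is defined.

```python
import csv
import os
import tempfile

# Assumed column order, matching the keys of position_detail
HEADERS = ['name', 'city', 'company', 'salary', 'experience', 'education']


def append_row(path, headers, row):
    # Write the header only when the file is new or empty, then append the row
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, headers)
        if need_header:
            writer.writeheader()
        writer.writerow(row)


# Hypothetical record, written twice to a file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'lagou_positions.csv')
row = {'name': 'Python Developer', 'city': '北京', 'company': 'Example Co.',
       'salary': '15k-25k', 'experience': '经验3-5年', 'education': '本科及以上'}
append_row(path, HEADERS, row)
append_row(path, HEADERS, row)

with open(path, 'r', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(len(rows))
```

With this column order the salary ends up in column 3, which is the index the later analysis code reads.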
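3.1 Compute the average salary
Read the CSV file to get all the salary values.
Salaries appear in a single format, ×k-×k; take the two numbers, average them, and multiply by 1000.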
def read_lagou_information(self, column):
    # Return the given column of the CSV file, header row included
    with open('lagou_positions.csv', 'r', encoding='utf-8', newline='') as f:
        salary_reader = csv.reader(f)
        return [row[column] for row in salary_reader]

sal = self.read_lagou_information(3)
# Skip the header row (index 0) and convert each "×k-×k" range to its midpoint
for i in range(len(sal)-1):
    requre_sal = sal[i+1]
    requre_sal = re.sub(r'k', '', requre_sal)
    inx = requre_sal.find('-')
    average_sal = (int(requre_sal[0:inx]) + int(requre_sal[inx+1:]))/2
    requre_sal = average_sal * 1000
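The conversion above can also be written as a standalone function, which makes it easy to check on a few values. This is a minimal sketch rather than the author's code: salary_midpoint is a hypothetical name, the sample salaries are made up, and accepting an upper-case K or returning None for non-matching text are added assumptions.

```python
import re


def salary_midpoint(text):
    """Return the midpoint of a '×k-×k' range times 1000, or None if no match."""
    # Assumption: tolerate an upper-case K and surrounding whitespace
    match = re.fullmatch(r'\s*(\d+)[kK]-(\d+)[kK]\s*', text)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2 * 1000


# Hypothetical salary column with the header row already removed
salaries = ['15k-25k', '10K-20K', '8k-12k']
midpoints = [salary_midpoint(s) for s in salaries]
overall_average = sum(midpoints) / len(midpoints)
print(midpoints, overall_average)
```

Returning None instead of raising lets the caller decide how to treat rows that do not follow the expected format.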
4. Data analysis
4.1 Salary statistics
Use a pie chart to show how the postings are distributed across salary brackets.
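A pie chart needs one count per bracket, so the salaries have to be grouped first. The sketch below shows one way to do that grouping with the standard library. The bracket boundaries, the bracket_label helper, and the sample salaries are assumptions chosen for illustration, not values from the article.

```python
from collections import Counter

# Assumed bracket boundaries: (lower bound inclusive, upper bound exclusive, label)
BRACKETS = [(0, 10000, '<10k'), (10000, 20000, '10k-20k'),
            (20000, 30000, '20k-30k'), (30000, float('inf'), '>=30k')]


def bracket_label(salary):
    # Return the label of the first bracket that contains the salary
    for low, high, label in BRACKETS:
        if low <= salary < high:
            return label


# Hypothetical midpoint salaries
salaries = [8000, 15000, 17500, 22500, 40000]
counts = Counter(bracket_label(s) for s in salaries)
labels, sizes = zip(*counts.items())
print(counts)
```

If matplotlib is installed, labels and sizes can then be handed to pyplot.pie(sizes, labels=labels) to draw the chart.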
def analyse_industry_salary(self):
    sal = self.read_lagou_information(3)
    for i in range(len(sal)-1):
        requre_sal = sal[i+1]
        requre_sal = re.sub(r'k', '', requre_sal)
        inx = requre_sal.fin