Scraping Job Postings from Lagou with a Python Crawler

Latest recommended article published on 2020-12-17 12:21:07

火狐火狐

Article link: https://blog.csdn.net/qq_39568852/article/details/84965182

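This article uses a Python crawler to collect job postings from Lagou. It covers parsing job details, salary statistics, the distribution of required work experience, and word-frequency analysis of job descriptions, and sends the analysis results by email. The libraries involved include selenium, xpath, lxml, jieba, and pandas.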

Platform: Windows
Python version: Python 3.6
IDE: Sublime Text
Other tools: Chrome browser

0. Open the search page

First install the selenium library: pip install selenium
Then use selenium to open the search page.

The search page for Python positions
The implementation is as follows:

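# Inside the crawler class's __init__ (requires: from selenium import webdriver)
self.driver = webdriver.Chrome()
# Search results for "python"; city=%E5%85%A8%E5%9B%BD is the URL-encoded form of 全国 (nationwide)
self.url = 'https://www.lagou.com/jobs/list_python?px=default&city=%E5%85%A8%E5%9B%BD#filterBox'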
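
The `city` parameter in the URL above is percent-encoded UTF-8. As a minimal sketch, the same URL can be assembled from a keyword and a city name with the standard library. The `build_search_url` helper is hypothetical, and whether this URL pattern holds for other keywords or cities is an assumption, not something the original confirms.

```python
from urllib.parse import quote

def build_search_url(keyword, city):
    """Assemble a search URL in the same shape as the one used above.

    Hypothetical helper: the pattern is copied from the article's URL and is
    only assumed to work for other keywords and cities.
    """
    return 'https://www.lagou.com/jobs/list_{}?px=default&city={}#filterBox'.format(
        quote(keyword), quote(city)
    )

# quote() percent-encodes the UTF-8 bytes of the non-ASCII city name
url = build_search_url('python', '全国')
print(url)
```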

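1. Find the job links

To reach the detail pages, we first need to find the link to each job's detail page once the search has completed, as follows.

Use XPath to locate the links: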

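    def page_list_details(self, source):
        # Parse the search-results HTML and collect the href of every job link
        html = etree.HTML(source)
        links = html.xpath('//a[@class="position_link"]/@href')
        for link in links:
            self.request_details(link)
            time.sleep(2)  # pause between detail pages to avoid hammering the site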
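
To see what the XPath expression selects without launching a browser, it can be tried on a small HTML snippet. The sketch below assumes lxml is installed; the sample markup and its URLs are made up for illustration and do not come from the real page.

```python
from lxml import etree

# Made-up sample markup: two job links and one unrelated link
SAMPLE_HTML = '''
<ul>
  <li><a class="position_link" href="https://www.lagou.com/jobs/1.html">Python Engineer</a></li>
  <li><a class="other" href="https://www.lagou.com/about.html">About</a></li>
  <li><a class="position_link" href="https://www.lagou.com/jobs/2.html">Data Analyst</a></li>
</ul>
'''

def extract_position_links(source):
    """Return the href of every <a> whose class attribute is exactly 'position_link'."""
    html = etree.HTML(source)
    return html.xpath('//a[@class="position_link"]/@href')

links = extract_position_links(SAMPLE_HTML)
print(links)
```

Note that `@class="position_link"` matches the whole attribute value, so an element carrying additional classes would not be selected.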

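2. Open the job detail page and parse the job details

2.1 Parsing the detail page

Using the page URL obtained in step 1, open the detail page in the browser and locate the data we need.
In the developer tools, use an XPath tool to check whether the data can be selected.

2.2 Implementation

This step requires some knowledge of XPath syntax. Install the lxml library: pip install lxml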

    # Requires: import re; from lxml import etree; from selenium.webdriver.common.by import By;
    # from selenium.webdriver.support.ui import WebDriverWait; from selenium.webdriver.support import expected_conditions as EC
    def request_details(self, link):
        # Open the detail page in a new tab and switch to it
        self.driver.execute_script("window.open('%s')" % link)
        self.driver.switch_to.window(self.driver.window_handles[-1])
        # Wait for the job title to be present before reading the page source
        WebDriverWait(self.driver, timeout=5).until(
            EC.presence_of_element_located((By.XPATH, '//span[@class="name"]'))
        )
        source = self.driver.page_source
        html = etree.HTML(source)
        name = html.xpath('//span[@class="name"]/text()')[0].strip()
        company = html.xpath('//h2[@class="fl"]/text()')[0].strip()
        # The job_request spans hold salary, city, experience and education, in that order
        job_request = html.xpath('//dd[@class="job_request"]//span/text()')
        salary = job_request[0].strip()
        city = re.sub(r'[\s/]', '', job_request[1])
        experience = re.sub(r'[\s/]', '', job_request[2])
        education = re.sub(r'[\s/]', '', job_request[3])
        job_desc = ''.join(html.xpath('//dd[@class="job_bt"]//text()')).strip()
        position_detail = {
            'name': name,
            'city': city,
            'company': company,
            'salary': salary,
            'experience': experience,
            'education': education,
        }
        # The description goes to the TXT file; the other fields go to the CSV file
        self.write_text(job_desc)
        self.position_details.append(position_detail)
        self.write_csv_rows(self.headers, position_detail)
        # Close the detail tab and return to the search-results tab
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])
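
The `re.sub(r'[\s/]', '', ...)` calls strip the whitespace and slash separators that surround each field. A minimal sketch of that cleaning step on its own, where the `clean_field` helper and the sample span texts are assumptions made up for illustration:

```python
import re

def clean_field(text):
    """Remove all whitespace and '/' separators from a field."""
    return re.sub(r'[\s/]', '', text)

# Made-up sample of what the job_request spans might contain
raw_spans = ['15k-25k ', '/北京 /', '经验3-5年 /', '本科及以上 /']

salary = raw_spans[0].strip()
city, experience, education = (clean_field(s) for s in raw_spans[1:4])
print(salary, city, experience, education)
```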

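3. Write the collected data to CSV or TXT files

The job description is written to a TXT file so that word frequencies can be counted later; the remaining fields are written to a CSV file.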

    def write_csv_headles(self, headers):
        # Write the header row; the file is opened in append mode, so call this only once
        with open('lagou_positions.csv', 'a', encoding='utf-8', newline='') as f:
            position_headline = csv.DictWriter(f, headers)
            position_headline.writeheader()

    def write_text(self, job_desc):
        # Append one job description, preceded by a separator line
        with open('lagou_position_details.txt', 'a', encoding='utf-8') as f:
            f.write('\n------------------------------------------------' + '\n')
            f.write(job_desc)

    def write_csv_rows(self, headers, position_detail):
        # Append one position as a CSV row, with columns in the order given by headers
        with open('lagou_positions.csv', 'a', encoding='utf-8', newline='') as f:
            position_headlines = csv.DictWriter(f, headers)
            position_headlines.writerow(position_detail)
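
Because the CSV file is opened in append mode, a header row written on every run would end up repeated in the file. One possible way around this, sketched under assumptions, is to write the header only when the file is new or empty. The `append_position` helper, the header order and the sample rows are hypothetical and not part of the original code.

```python
import csv
import os
import tempfile

# Assumed column order, matching the keys of position_detail
HEADERS = ['name', 'city', 'company', 'salary', 'experience', 'education']

def append_position(path, headers, position_detail):
    """Append one row, writing the header first only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, headers)
        if need_header:
            writer.writeheader()
        writer.writerow(position_detail)

# Demonstration in a temporary directory with made-up rows
csv_path = os.path.join(tempfile.mkdtemp(), 'lagou_positions.csv')
append_position(csv_path, HEADERS, {
    'name': 'Python Engineer', 'city': '北京', 'company': 'Example Co',
    'salary': '15k-25k', 'experience': '经验3-5年', 'education': '本科及以上',
})
append_position(csv_path, HEADERS, {
    'name': 'Data Analyst', 'city': '上海', 'company': 'Sample Ltd',
    'salary': '10k-20k', 'experience': '经验1-3年', 'education': '本科及以上',
})

with open(csv_path, encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))
print(len(rows), rows[0]['salary'])
```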

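3.1 Compute the average salary

Read the CSV file to get all the salary values.
Salaries appear in only one format, ×k-×k: extract the two numbers, take their average, and multiply by 1000.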

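    def read_lagou_information(self, column):
        # Return one column of the CSV file as a list; the first item is the header cell
        with open('lagou_positions.csv', 'r', encoding='utf-8', newline='') as f:
            salary_reader = csv.reader(f)
            return [row[column] for row in salary_reader]

        # Fragment from inside an analysis method; column 3 is the salary column
        sal = self.read_lagou_information(3)
        for i in range(len(sal)-1):
            requre_sal = sal[i+1]  # i+1 skips the header row
            requre_sal = re.sub(r'k', '', requre_sal)  # strip the 'k' suffixes
            inx = requre_sal.find('-')
            average_sal = (int(requre_sal[0:inx]) + int(requre_sal[inx+1:]))/2
            requre_sal = average_sal * 1000  # convert from thousands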
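
The same calculation can be written as a small standalone function. This is a minimal sketch assuming every salary has the ×k-×k shape; the `average_salary` name, the case-insensitive match and the `None` return for unexpected input are choices made here for illustration, not part of the original code.

```python
import re

# Matches strings such as '15k-25k'; upper-case 'K' is accepted as an assumption
SALARY_PATTERN = re.compile(r'^\s*(\d+)k-(\d+)k\s*$', re.IGNORECASE)

def average_salary(salary_text):
    """Return the midpoint of an 'Xk-Yk' range multiplied by 1000, or None if it does not match."""
    match = SALARY_PATTERN.match(salary_text)
    if match is None:
        return None
    low, high = (int(group) for group in match.groups())
    return (low + high) / 2 * 1000

print(average_salary('15k-25k'))
```

Returning `None` rather than raising lets a caller skip rows such as the CSV header without special-casing the first index.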

4. Data analysis

4.1 Salary statistics

A pie chart shows how the postings are distributed across salary bands.

    def analyse_industry_salary(self):
        sal = self.read_lagou_information(3)
        for i in range(len(sal)-1):
            requre_sal = sal[i+1]
            requre_sal = re.sub(r'k', '', requre_sal)
            inx = requre_sal.fin
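
As a separate illustration of the idea behind the pie chart, the sketch below groups average monthly salaries into bands and counts the postings in each band. The band boundaries, the `salary_bucket` helper and the sample values are assumptions made up for illustration; they are not taken from the original code.

```python
from collections import Counter

def salary_bucket(monthly_salary):
    """Map an average monthly salary to a band label (boundaries are assumed)."""
    if monthly_salary < 10000:
        return 'below 10k'
    if monthly_salary < 20000:
        return '10k-20k'
    if monthly_salary < 30000:
        return '20k-30k'
    return '30k and above'

# Made-up average salaries, as produced by the averaging step in section 3.1
averages = [8000.0, 12500.0, 20000.0, 22500.0, 35000.0]

distribution = Counter(salary_bucket(s) for s in averages)
labels = list(distribution.keys())
sizes = list(distribution.values())
print(labels, sizes)
```

The resulting `labels` and `sizes` lists could then be passed to a plotting call such as matplotlib's `plt.pie(sizes, labels=labels)`.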
