Scraping a Certain Zhipin Job Site: The Selenium Part

^三年梦^

Published 2024-08-30 22:08:07

Tags: selenium, testing tools

Article link: https://blog.csdn.net/2301_77455840/article/details/141719732

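Preface

The computer industry keeps maturing and its architectures keep getting clearer, but the workload on computer science students keeps growing too. The diligent ones are truly diligent, and the slackers are truly slacking. As graduation approaches, most students face a choice between finding a job, taking the postgraduate entrance exam, and taking the civil service exam. Graduate school means getting ready for research, the civil service exam means getting ready to study hard, and a job means getting ready for the job hunt. Today, let's take a look at the job listings on Zhipin.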

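Preparation

Download the driver that matches your browser version
Drivers for Chrome versions before 114
Drivers for Chrome version 114 and later
pip install selenium (my version is 3.3.0); if the download is slow, use a mirror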
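
A note on versions: the snippets in this post call find_element_by_xpath and its siblings. Those exist in Selenium 3.x (such as the 3.3.0 used here) but were removed in Selenium 4.3.0, where the equivalent is find_element(By.XPATH, ...). As a minimal sketch, assuming plain numeric version strings, the helper below tells you which style a given version supports. The names version_tuple and has_legacy_locators are made up for this example and are not part of Selenium.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '3.3.0' into (3, 3, 0). Assumes purely numeric parts."""
    parts = [int(part) for part in version.split(".")]
    # Pad short versions such as '4.3' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def has_legacy_locators(version: str) -> bool:
    """True if the find_element_by_* methods are still available (before 4.3.0)."""
    return version_tuple(version) < (4, 3, 0)


if __name__ == "__main__":
    for v in ("3.3.0", "4.2.0", "4.3.0"):
        style = ("find_element_by_xpath(...)" if has_legacy_locators(v)
                 else "find_element(By.XPATH, ...)")
        print(v, "->", style)
```

With the real package installed, selenium.__version__ can be passed in as the argument.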

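Scraping

I. Choosing the job and the city

Zhipin's anti-scraping measures are rather aggressive, so the whole crawl uses action chains to imitate a person clicking through the pages.

1. Job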

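Find the XPath of the search box and of the search button, then type in the job keyword and click. Since the page jump takes a moment, we add an explicit wait: once the element with the class name logo appears, the steps below can begin.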

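        # Needs: import time; from selenium.webdriver import ActionChains
        #        from selenium.webdriver.common.by import By
        #        from selenium.webdriver.support.ui import WebDriverWait
        #        from selenium.webdriver.support import expected_conditions as EC
        self.driver.get('https://www.zhipin.com/')
        # Move to the search box, click it, and type the job keyword.
        search_box = self.driver.find_element_by_xpath(
            '//*[@id="wrap"]/div[3]/div/div[1]/div[1]/form/div[2]/p/input')
        ActionChains(self.driver).move_to_element(search_box).click().send_keys(self.job).perform()
        time.sleep(2)
        # Move to the search button and click it.
        search_button = self.driver.find_element_by_xpath(
            '//*[@id="wrap"]/div[3]/div/div[1]/div[1]/form/button')
        ActionChains(self.driver).move_to_element(search_button).click().perform()

        # Explicit wait: block for up to 30 s until an element with class "logo" is in the DOM.
        WebDriverWait(self.driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "logo"))
        )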
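
To make the explicit wait less of a black box: WebDriverWait(...).until(...) keeps calling the condition until it returns something truthy or the timeout runs out. The stand-in below is only a sketch of that idea in plain Python, so it runs without a browser. wait_until is a hypothetical helper, not Selenium's implementation, and it raises the built-in TimeoutError where Selenium raises its own TimeoutException.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 30.0, poll: float = 0.5) -> T:
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # hand back whatever the condition produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)  # pause before the next check
```

Unlike time.sleep(30), this returns as soon as the condition holds, which is why the explicit wait is a better fit for a page jump of unknown length.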

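2. City

Click the city selection button and find the city you want in the panel that opens; you can then locate that city among the page elements.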

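However, neither the letter tabs nor the city entries have a convenient attribute to click on, so we select them by XPath. One caveat: while I was locating elements by XPath, the page's HTML kept changing, so be careful with XPaths copied from the browser (this applies to the scraping later on as well). An XPath that works this time may raise an error the next.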

        # Needs: from pypinyin import pinyin, Style; from lxml import etree
        # Letter groups of the city picker tabs. The two empty strings pad the list
        # so that n lines up with the li[n] index used in the XPath below.
        self.li = ['', '', 'ABCDE', 'FGHJ', 'KLMN', 'PQRST', 'WXYZ']
        pinyin_list = pinyin(self.city, style=Style.NORMAL)
        p = pinyin_list[0][0][0].upper()  # first pinyin letter, upper-cased, e.g. 'B'
        for n, t in enumerate(self.li):
            if p in t:
                p = t.index(p)  # position of the letter inside its tab
                break
        time.sleep(5)
        WebDriverWait(self.driver, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'city-label'))
        )
        ActionChains(self.driver).move_to_element(
            self.driver.find_element_by_class_name('city-label')).click().perform()
        time.sleep(5)
        # The panel's absolute path differs between loads (div[4] or div[7]), so try both.
        for root in ('/html/body/div[4]', '/html/body/div[7]'):
            try:
                tab = self.driver.find_element_by_xpath(f'{root}/div[2]/div[2]/div/ul[1]/li[{n}]')
                ActionChains(self.driver).move_to_element(tab).click().perform()
                res = etree.HTML(self.driver.page_source)
                group = f'{root}/div[2]/div[2]/div/ul[2]/li[{p + 1}]/div'
                ls = []
                for d in res.xpath(group):
                    ls = d.xpath('./a/text()')  # city names listed under this letter
                a = ls.index(self.city)  # raises ValueError if the city is not listed
                city_link = self.driver.find_element_by_xpath(f'{group}/a[{a + 1}]')
                ActionChains(self.driver).move_to_element(city_link).click().perform()
                break
            except Exception:
                if root.endswith('div[7]'):
                    raise  # both layouts failed
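
The tab lookup at the top of that snippet can be tried on its own. The sketch below takes the first pinyin letter directly (so it does not need pypinyin) and returns the tab index n together with the letter's position p inside that tab. find_tab is a name made up for this example.

```python
from typing import Sequence, Tuple

# Same padding as in the post: index n matches the li[n] used in the XPath.
TABS = ['', '', 'ABCDE', 'FGHJ', 'KLMN', 'PQRST', 'WXYZ']


def find_tab(letter: str, tabs: Sequence[str] = TABS) -> Tuple[int, int]:
    """Return (n, p): the tab index and the 0-based position of letter in that tab."""
    if len(letter) != 1:
        raise ValueError("expected a single letter")
    letter = letter.upper()
    for n, group in enumerate(tabs):
        if letter in group:
            return n, group.index(letter)
    raise ValueError(f"no tab contains {letter!r}")
```

For 北京 the first pinyin letter is b, so find_tab('b') gives tab 2 and position 1, which the post's code turns into li[2] for the tab and li[2] (that is, p + 1) for the letter group.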

II. Scraping the job information

We only scrape the job title, benefits, salary, skill requirements, experience, and education.
All of the job information sits inside the job-list-box element, so you only need to iterate over that element to get what you want.

                name, welfare, wages, experience, education, skill = [], [], [], [], [], []
                res = etree.HTML(self.driver.page_source)
                div = res.xpath('//*[@id="wrap"]/div[2]/div[2]/div/div[1]//ul[@class="job-list-box"]')
                ls = []
                for i in div:
                    ls = i.xpath('./li')  # one <li> per job card
                # NOTE: the range starts at 1, so the first card (ls[0]) is skipped;
                # use range(len(ls)) if you want to include it.
                for li in range(1, len(ls)):
                    temp = []
                    # Each xpath() call returns a list of text nodes, possibly empty.
                    name.append(ls[li].xpath('./div[1]/a/div[1]/span[1]/text()'))
                    welfare.append(ls[li].xpath('./div[2]/div[@class="info-desc"]/text()'))
                    wages.append(ls[li].xpath('.//a/div[@class="job-info clearfix"]/span/text()'))
                    experience.append(ls[li].xpath('.//a/div[@class="job-info clearfix"]/ul/li[1]/text()'))
                    education.append(ls[li].xpath('.//a/div[@class="job-info clearfix"]/ul/li[2]/text()'))
                    for li1 in ls[li].xpath('./div[2]/ul'):
                        temp = li1.xpath('./li/text()')  # skill tags of this card
                    skill.append(temp)

Note that the XPaths here are also best not copied from the browser.
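
Here is a small illustration of why copied XPaths are fragile. It uses the standard library's xml.etree.ElementTree on two made-up page snippets (not Zhipin's real markup); the second one simply has one extra div inserted. The positional path stops matching, while the path anchored on the class name still works. All names in the block are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

PAGE_V1 = """<html><body>
<div id="header"/>
<div id="wrap"><ul class="job-list-box"><li>Job A</li><li>Job B</li></ul></div>
</body></html>"""

# The same page after one extra banner <div> is inserted before the list.
PAGE_V2 = """<html><body>
<div id="header"/>
<div id="banner"/>
<div id="wrap"><ul class="job-list-box"><li>Job A</li><li>Job B</li></ul></div>
</body></html>"""

POSITIONAL = "./body/div[2]/ul/li"            # the style of path copied from dev tools
BY_CLASS = ".//ul[@class='job-list-box']/li"  # anchored on the class name instead


def job_titles(page: str, path: str) -> List[str]:
    """Return the text of every <li> matched by path in the given page."""
    root = ET.fromstring(page)
    return [li.text for li in root.findall(path)]


if __name__ == "__main__":
    for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(label, job_titles(page, POSITIONAL), job_titles(page, BY_CLASS))
```

The same reasoning is why the snippets in this post mix in conditions like @class="job-list-box" rather than relying on positions alone.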

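III. Scraping every page

First we get the total number of pages for this job search. Here get_attribute('innerHTML') returns the HTML inside the element (its children), while passing 'outerHTML' instead returns the HTML of the element itself together with its children.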

            pages = int(self.driver.find_element_by_xpath(
                '//*[@id="wrap"]/div[2]/div[2]/div/div[1]//div[@class="options-pages"]/a[last()-1]'
            ).get_attribute('innerHTML'))
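
The a[last()-1] step picks the second-to-last link in the pager, presumably the last page number sitting just before the next-page arrow. The sketch below replays that on an invented pager fragment using the standard library (the markup is an assumption, not copied from the site) and also shows the innerHTML versus outerHTML distinction for a simple text-only element.

```python
import xml.etree.ElementTree as ET

# Invented pager markup: previous arrow, page numbers, next arrow.
PAGER = ('<div class="options-pages">'
         '<a>prev</a><a>1</a><a>2</a><a>3</a><a>10</a><a>next</a>'
         '</div>')


def last_page_link(pager_html: str) -> ET.Element:
    """Return the second-to-last <a> in the pager."""
    pager = ET.fromstring(pager_html)
    return pager.find("./a[last()-1]")


link = last_page_link(PAGER)
inner = link.text                              # like innerHTML for a text-only element
outer = ET.tostring(link, encoding="unicode")  # like outerHTML: the tag plus its content
pages = int(inner)
print(inner, outer, pages)
```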

After that it is easy: add a next-page click and a page loop around the job scraping code and it is basically done.

            for page in range(pages):
                name, welfare, wages, experience, education, skill = [], [], [], [], [], []
                time.sleep(5)  # give the listing time to render
                res = etree.HTML(self.driver.page_source)
                div = res.xpath('//*[@id="wrap"]/div[2]/div[2]/div/div[1]//ul[@class="job-list-box"]')
                ls = []
                for i in div:
                    ls = i.xpath('./li')  # one <li> per job card
                # NOTE: the range starts at 1, so the first card (ls[0]) is skipped.
                for li in range(1, len(ls)):
                    temp = []
                    name.append(ls[li].xpath('./div[1]/a/div[1]/span[1]/text()'))
                    welfare.append(ls[li].xpath('./div[2]/div[@class="info-desc"]/text()'))
                    wages.append(ls[li].xpath('.//a/div[@class="job-info clearfix"]/span/text()'))
                    experience.append(ls[li].xpath('.//a/div[@class="job-info clearfix"]/ul/li[1]/text()'))
                    education.append(ls[li].xpath('.//a/div[@class="job-info clearfix"]/ul/li[2]/text()'))
                    for li1 in ls[li].xpath('./div[2]/ul'):
                        temp = li1.xpath('./li/text()')  # skill tags of this card
                    skill.append(temp)
                # Append this page's records to the CSV file.
                self.write((name, welfare, wages, experience, education, skill))

                if page < pages - 1:
                    time.sleep(2)
                    # self.driver.execute_script("arguments[0].scrollIntoView();",
                    #                            self.driver.find_element_by_class_name('ui-icon-arrow-right'))
                    # Click "next page" twice; a single click did not change the listing.
                    for _ in range(2):
                        next_button = self.driver.find_element_by_class_name('ui-icon-arrow-right')
                        ActionChains(self.driver).move_to_element(next_button).click().perform()

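Note that clicking next page has a pitfall: it only takes effect after two clicks. (When I first ran the crawl and saw the same information on every page, I thought the driver had not refreshed the page. I had to laugh at how silly that was.)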
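
Instead of hard-coding two clicks, one option is to keep clicking until the listing actually changes. This is only a sketch of that idea, not what the code in this post does: read_page and click_next are hypothetical callables standing in for the scraping step and the ActionChains click, and in a real run you would also wait between the click and the re-read.

```python
from typing import Callable, List


def crawl_pages(pages: int,
                read_page: Callable[[], List[str]],
                click_next: Callable[[], None],
                max_clicks: int = 3) -> List[List[str]]:
    """Collect the listing of every page, clicking 'next' until the listing changes."""
    results = []
    for page in range(pages):
        current = read_page()
        results.append(current)
        if page < pages - 1:
            for _ in range(max_clicks):
                click_next()
                if read_page() != current:
                    break  # the listing changed, so the click took effect
            else:
                raise RuntimeError("listing did not change after clicking next")
    return results
```

Comparing the listing before and after the click also catches the symptom described above, where every page appears to contain the same information.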

IV. Saving the data

Each page's data has already been stored in lists, so we just write those lists to a CSV file.

        # Needs: import csv
        with open("./work.csv", 'a', encoding='utf-8', newline='') as f:
            writer_ = csv.writer(f)
            for record in range(len(name)):
                try:
                    writer_.writerow([name[record][0], welfare[record][0], wages[record][0],
                                      experience[record][0], education[record][0], skill[record][0]])
                except IndexError:
                    # Some cards have no benefits text, so welfare[record] is empty.
                    writer_.writerow([name[record][0], None, wages[record][0], experience[record][0],
                                      education[record][0], skill[record][0]])
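
Because each xpath() call returns a list, every field is a list that may be empty. A small helper that returns the first item or None copes with a missing value in any column, not just the benefits. This is a minimal sketch with made-up names (first_or_none, rows_to_csv) and an in-memory buffer in place of work.csv.

```python
import csv
import io
from typing import List, Optional, Sequence


def first_or_none(values: Sequence[str]) -> Optional[str]:
    """Return the first item of an xpath() result list, or None if it is empty."""
    return values[0] if values else None


def rows_to_csv(columns: Sequence[List[List[str]]]) -> str:
    """columns holds one list per field (name, welfare, ...), one entry per job."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    # zip(*columns) regroups the per-field lists into one tuple per job.
    for record in zip(*columns):
        writer.writerow([first_or_none(field) for field in record])
    return buffer.getvalue()
```

csv.writer writes None as an empty cell, so a job without benefits simply gets a blank in that column.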
