Python Web Scraping Case Study 2: Scraping Data from the 51job (前程无忧) Website

VIV-

Last modified: 2024-03-18 15:29:44

Tags: python, web scraping, programming language

First published: 2023-10-27 13:56:10

Article link: https://blog.csdn.net/weixin_60361911/article/details/134074422

1 Introduction to the Scraping Technique

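Python offers many modules for writing scrapers. Common choices include urllib (urllib2 in Python 2), requests, and selenium. This article uses selenium. Selenium is a web automation tool that was originally developed for automated website testing. Its original core was written in JavaScript and ran inside the browser, which is why it worked in any browser that supported JavaScript; current WebDriver-based versions instead control the browser through a browser-specific driver. I chose selenium for two reasons. First, it drives a real browser and so reproduces what a browser does. Combined with random delays, this gives each page time to load fully, which keeps the scraper running reliably, and it makes the traffic look as much like a real user as possible, which lowers the chance of the site identifying it as a bot. Second, selenium can handle pages rendered by JavaScript, avoiding the missing or incorrect data that approaches such as requests can produce because they do not execute JavaScript.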

2 Scraping Strategy

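After analyzing the page structure and content, the next step is to design the scraping strategy. 51job shows 50 job postings per page, and this article scrapes 200 pages, which is up to 10,000 records. To imitate a user, a for loop visits each target page in turn, and the program scrolls the page down 50 pixels per step. Because there are many pages, the scraper moves between them by typing the page number into the jump box and clicking the jump button, which avoids the case where the pages end unexpectedly partway through the loop.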
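The scrolling part of this strategy can be sketched without a browser. The snippet below is only an illustration: `FakeDriver` and `scroll_like_a_reader` are hypothetical names, and `FakeDriver` merely records the scripts it is given so that the loop can be run and inspected. The step count of 140 matches the scraper code in section 3.

```python
import random
import time


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; it only records scripts."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script):
        self.scripts.append(script)


def scroll_like_a_reader(driver, steps=140, step_px=50, max_pause=0.2, sleep=time.sleep):
    """Scroll down in small increments, pausing a random time before each step."""
    for _ in range(steps):
        sleep(random.random() * max_pause)
        driver.execute_script(f'window.scrollBy(0,{step_px})')
    return steps * step_px  # total distance scrolled, in pixels


PAGES, JOBS_PER_PAGE = 200, 50
print(PAGES * JOBS_PER_PAGE)  # 10000 records targeted

fake = FakeDriver()
total = scroll_like_a_reader(fake, sleep=lambda s: None)  # skip real waiting in the demo
print(total, len(fake.scripts))  # 7000 140
```

With a real driver, the same function would scroll about 7,000 pixels per page, taking up to roughly 28 seconds in the worst case because each of the 140 pauses is at most 0.2 seconds.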

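The second major step is to parse the saved HTML files and write the extracted data to an Excel file for later preprocessing and visual analysis. Because the scraped data is semi-structured, it has to be parsed with a package such as BeautifulSoup or lxml. This article uses lxml, a Python parsing library that supports both HTML and XML as well as XPath (XML Path Language) queries. lxml generally parses faster than BeautifulSoup, although BeautifulSoup is easier to learn.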
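To show the idea of pulling fields out of markup with path expressions, here is a minimal sketch. It deliberately uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in so that it runs without installing anything. The article itself uses lxml, and the markup in `SAMPLE` is a hypothetical simplification of a job list, not the real 51job page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more complex.
SAMPLE = """
<div class="j_joblist">
  <div>
    <span class="jname at">E-commerce Operator</span>
    <span class="sal">8-10k</span>
  </div>
  <div>
    <span class="jname at">Store Assistant</span>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)
rows = []
for card in root.findall('./div'):
    # findtext returns the default when the path matches nothing.
    name = card.findtext('.//span[@class="jname at"]', default='')
    salary = card.findtext('.//span[@class="sal"]', default='')
    rows.append({'name': name, 'salary': salary})

print(rows)
```

The second posting has no salary element, so its `salary` value falls back to an empty string. Real-world HTML is often not well-formed XML, which is one reason a lenient HTML parser such as lxml is the better fit for scraped pages.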

3 Scraper Code

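import os
import random
from time import sleep

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By

TOTAL_PAGES = 200  # 50 postings per page, about 10,000 records


def main(driver):
    os.makedirs('html', exist_ok=True)  # make sure the output directory exists
    for p in range(1, TOTAL_PAGES + 1):
        print(f'Scraping page {p}')
        sleep(5 * random.random())
        # Scroll down 50 px at a time with short random pauses, like a reader
        for _ in range(140):
            sleep(random.random() / 5)
            driver.execute_script('window.scrollBy(0,50)')
        # Save the rendered HTML for offline parsing
        with open(f'html/{p}.html', 'w', encoding='utf-8') as f:
            f.write(driver.page_source)
        if p != TOTAL_PAGES:
            # Type the next page number into the jump box, then click the jump button
            jump_box = driver.find_element(By.ID, 'jump_page')
            jump_box.clear()
            jump_box.send_keys(str(p + 1))
            sleep(random.random())
            button2 = driver.find_element(By.CLASS_NAME, 'jumpPage')
            driver.execute_script("arguments[0].click();", button2)


if __name__ == '__main__':
    options = ChromeOptions()
    # Drop the "enable-automation" switch to make automation less obvious
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    driver = webdriver.Chrome(options=options)
    try:
        # Inject stealth.min.js (must be in the working directory) into every new document
        with open('stealth.min.js', encoding='utf-8') as f:
            js = f.read()
        driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': js})
        driver.get('https://we.51job.com/pc/search?keyword=&searchType=2&sortType=0&metro=')
        sleep(5)
        search_box = driver.find_element(By.ID, 'keywordInput')  # avoid shadowing built-in input()
        search_box.send_keys('电商')  # search keyword: "e-commerce"
        button1 = driver.find_element(By.ID, 'search_btn')
        driver.execute_script("arguments[0].click();", button1)
        sleep(5)
        main(driver)
    finally:
        driver.quit()  # close the browser even if scraping fails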

4 Parsing Code

from lxml import etree
import pandas as pd


def collect():
    resls = []
    for i in range(1, 201):
        with open(f'html/{i}.html', encoding='utf-8') as f:
            tree = etree.HTML(f.read())
        # Each child div of the job list is one posting
        for li in tree.xpath('.//div[@class="j_joblist"]/div'):
            name = li.xpath('.//span[@class="jname at"]/text()')[0]
            href = li.xpath('./a/@href')[0]
            posted = li.xpath('.//span[@class="time"]/text()')[0]
            # Optional fields: pad with '' so a missing value does not raise IndexError
            sala = (li.xpath('.//span[@class="sal"]/text()') + [''])[0]
            # Location, experience, and education are items 0, 2, and 4 of the same list
            details = li.xpath('.//span[@class="d at"]/span/text()') + [''] * 5
            addr, exp, edu = details[0], details[2], details[4]
            comp = li.xpath('.//a[@class="cname at"]/text()')[0]
            # Split the company line on '|' into company type and size
            company_info = li.xpath('.//p[@class="dc at"]/text()')[0].split('|') + ['']
            kind = company_info[0].strip()
            num = company_info[1].strip()
            ind = (li.xpath('.//p[@class="int at"]/text()') + [''])[0]
            dic = {
                '职位': name,    # job title
                '链接': href,    # link
                '时间': posted,  # posting time
                '薪资': sala,    # salary
                '地区': addr,    # location
                '经验': exp,     # experience
                '学历': edu,     # education
                '公司': comp,    # company
                '类型': kind,    # company type
                '规模': num,     # company size
                '行业': ind      # industry
            }
            print(dic)
            resls.append(dic)
    # to_excel no longer accepts an encoding argument in pandas 2.x, so it is omitted here;
    # writing .xlsx files requires the openpyxl package
    pd.DataFrame(resls).to_excel('前程无忧电商数据.xlsx', index=False)


if __name__ == '__main__':
    collect()
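
The parsing code above relies on padding a list with empty strings, as in `(result + [''])[0]`, so that a missing field yields `''` instead of raising `IndexError`. The sketch below shows that idiom next to an equivalent helper. The name `pick` and the sample values in `details` are hypothetical and only serve the illustration.

```python
def pick(values, index, default=''):
    """Return values[index] if that position exists, otherwise default."""
    return values[index] if 0 <= index < len(values) else default


# Hypothetical text() results for one posting: the education entry is missing.
details = ['Shanghai', '|', '3 years']

padded = details + [''] * 5          # the padding idiom from the parsing code
print(padded[0], repr(padded[4]))    # Shanghai ''
print(pick(details, 2), repr(pick(details, 4)))  # 3 years ''
```

Both approaches give the same result. A named helper can be easier to read when many fields are optional, while the padding idiom keeps each lookup on one line.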