Scraping Nowcoder (牛客网) Job Postings with the Scrapy Framework

This post improves on the previous crawler. The earlier version sent requests directly with Python's Requests library; here the crawl is rebuilt on the Scrapy framework, which improves crawling efficiency and simplifies the workflow.

Analyzing the Pages

The Nowcoder job listings page:

The job detail page:

The Scrapy Framework

The project structure is as follows:
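A typical layout generated by scrapy startproject is sketched below; the package name newcodeSpider is assumed here to match the middleware path configured later, and run.py is the launcher added by hand in a later step.

newcodeSpider/
├── scrapy.cfg
├── run.py                      # added later, launches the spider from PyCharm
└── newcodeSpider/
    ├── __init__.py
    ├── items.py                # EmployItem
    ├── middlewares.py          # random User-Agent middleware
    ├── pipelines.py            # CsvPipeline (CSV export)
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── newCoderSpider.py   # ExperienceSpider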

First, define a new class in items.py to hold the scraped data for pipeline output, with fields for company name, salary, city, education requirement, job category, and detailed requirements.

import scrapy


class EmployItem(scrapy.Item):
    """Container for one job posting."""
    company = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    education = scrapy.Field()
    job = scrapy.Field()
    details = scrapy.Field()

Next, create newCoderSpider.py under the spiders directory to send requests and parse the returned HTML. start_urls lists the URLs the crawl starts from, and each response is parsed in the parse() method.

import scrapy

from webSpider.items import EmployItem


class ExperienceSpider(scrapy.Spider):
    name = "espider"
    allowed_domains = ["nowcoder.com"]
    start_urls = ["https://www.nowcoder.com/feed/main/detail/3364c82e4c3b4eac9dfca2ee92100b06"]

    def parse(self, response):
        item = EmployItem()
        item["company"] = response.xpath("//div[@class='company-card card-container']/div//text()").extract()[0]
        salary = response.xpath("//div[@class='salary']//text()").extract()[0]
        if salary == '薪资面议' or salary is None:
            item["salary"] = None
        else:
            strs = salary.strip().strip("薪").split('*')
            months = int(strs[1])
            s_min, s_max = strs[0].strip().strip("K").split("-")
            item["salary"] = months * 0.5 * (int(s_min) + int(s_max)) * 1000
        infos = response.xpath("//div[@class='extra flex-row']")
        item["city"] = infos.xpath("//span[@class='el-tooltip']//text()").extract()[0]
        item["education"] = infos.xpath("//span[@class='edu-level']//text()").extract()[0]
        item["job"] = response.meta["category"]
        item["details"] = '\n'.join(response.xpath("//div[@class='job-detail-infos tw-flex-auto']//text()").extract())

        yield item
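To make the salary conversion concrete, here is how a hypothetical string such as "20-30K*14薪" would flow through the code above:

salary = "20-30K*14薪"                                  # hypothetical example
strs = salary.strip().strip("薪").split('*')            # ['20-30K', '14']
months = int(strs[1])                                   # 14
s_min, s_max = strs[0].strip().strip("K").split("-")    # ('20', '30')
# 14 months at the midpoint of 20K-30K: 14 * 25 * 1000 = 350000.0
print(months * 0.5 * (int(s_min) + int(s_max)) * 1000)  # 350000.0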

Normally a Scrapy project is started from the terminal. To make it more convenient to launch, and to allow debugging directly in PyCharm, create a run.py in the project root:

from scrapy import cmdline

# Equivalent to running `scrapy crawl espider` in the terminal
name = 'espider'
cmd = 'scrapy crawl {0}'.format(name)
cmdline.execute(cmd.split())

Running run.py directly now launches the spider.
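If you prefer not to go through cmdline, the same thing can be done programmatically with Scrapy's CrawlerProcess; a minimal sketch, assuming it is run from the project root so get_project_settings() can find settings.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Start the spider in-process with the project's settings.py applied
process = CrawlerProcess(get_project_settings())
process.crawl("espider")   # look the spider up by its name
process.start()            # blocks until the crawl is finished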

Anti-Scraping Measures

Nowcoder has a few anti-scraping measures:

Crawling Nowcoder pages with Scrapy's default settings triggers a slider CAPTCHA, which blocks normal data collection:

I suspected the problem was the User-Agent, so I set it manually to avoid being identified as a crawler.

Enable DOWNLOADER_MIDDLEWARES in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'newcodeSpider.middlewares.MyUserAgentMiddleware': 543,
}
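A small optional tweak that is often paired with this: Scrapy's built-in UserAgentMiddleware also sets this header, so it can be disabled in the same dict to make sure only the custom middleware touches it (a sketch, not strictly required for the setup above):

DOWNLOADER_MIDDLEWARES = {
    'newcodeSpider.middlewares.MyUserAgentMiddleware': 543,
    # Disable the built-in middleware so only the custom one sets User-Agent
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}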

Then, in middlewares.py, add a middleware that preprocesses each Request and picks a random User-Agent header every time:

from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUserAgentMiddleware(UserAgentMiddleware):

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random
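Note that UserAgent comes from the third-party fake_useragent package (installed with pip install fake-useragent). It loads a pool of real browser User-Agent strings, so instantiating it once in __init__ rather than on every request would avoid repeated setup, though the per-request version shown above works as well.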

This did get past the site, but another problem appeared:

Scrolling to the bottom of the page shows that Nowcoder loads listings dynamically, and the category filter buttons also load their results dynamically when clicked:


I solved this with Selenium:

The complete newCoderSpider.py:

import logging

import scrapy
from selenium.webdriver.common.by import By
from webSpider.items import EmployItem
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver import Keys
import time
from lxml import etree
from selenium.webdriver.support.ui import Select


class ExperienceSpider(scrapy.Spider):
    name = "espider"
    allowed_domains = ["nowcoder.com"]

    # start_urls = ["https://www.nowcoder.com/feed/main/detail/3364c82e4c3b4eac9dfca2ee92100b06"]

    def __init__(self):
        """
        包含链接的页面是用Ajax动态渲染的,先使用selenium模拟浏览器爬取
        """
        super(ExperienceSpider, self).__init__()
        self.urls = {}
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')

        # driver = webdriver.Chrome(options=chrome_options)
        driver = webdriver.Chrome()
        driver.get("https://www.nowcoder.com/jobs/recommend/campus")

        # Click the 'All Jobs' selector
        driver.find_element(By.XPATH, "//div[@data-v-3683ca40]").click()
        # element = driver.find_element(By.XPATH, "//div[@data-v-3683ca40]")
        # driver.execute_script("arguments[0].click();", element)
        time.sleep(1)
        logging.info("Clicked All Jobs")

        # Collect the first-level job categories
        category_elements = driver.find_elements(By.XPATH,
                                                 "//div[@class='el-scrollbar nowcoder-custom el-cascader-menu'][1]//li")
        category = {c.text: c for c in category_elements if c.text != ""}
        logging.info("Find job categories: " + category.keys().__str__())

        # Loop over the first-level categories and click each one
        # for k, v in category.items():
        for k, v in {"软件开发": category["软件开发"]}.items():
            v.click()
            # driver.execute_script("arguments[0].click();", v)
            time.sleep(1)
            logging.info("Clicked button " + k)

            # Collect the second-level job categories
            sub_category_elements = driver.find_elements(By.XPATH,
                                                         "//div[@class='el-scrollbar nowcoder-custom el-cascader-menu'][2]//li")
            sub_category = {c.text: c for c in sub_category_elements if c.text != ""}
            logging.info("Find sub_categories: " + sub_category.keys().__str__())

            # Click each second-level category
            # for s_k, s_v in sub_category.items():
            for s_k, s_v in {"后端开发": sub_category["后端开发"]}.items():
                s_v.click()
                # driver.execute_script("arguments[0].click();", s_v)
                time.sleep(1)
                logging.info("Clicked button " + s_k)

                # Collect the third-level job categories
                sub_sub_category_elements = driver.find_elements(By.XPATH,
                                                                 "//div[@class='el-scrollbar nowcoder-custom el-cascader-menu'][3]//span[@class='el-cascader-node__label']")
                sub_sub_category = {c.text: c for c in sub_sub_category_elements if c.text != ""}
                logging.info("Find sub_sub_categories: " + sub_sub_category.keys().__str__())

                # Click each third-level category
                for ss_k, ss_v in sub_sub_category.items():
                    ss_v.click()
                    # driver.execute_script("arguments[0].click();", ss_v)
                    time.sleep(2)
                    logging.info("Clicked button " + ss_k)

                    # Scroll until all listings have been loaded
                    while True:
                        # Simulate pressing End to scroll down
                        actions = ActionChains(driver)
                        actions.send_keys(Keys.END).perform()

                        # Wait for the page to load
                        time.sleep(3)

                        # Check whether the bottom of the page has been reached
                        if driver.execute_script(
                                "return window.pageYOffset + window.innerHeight >= document.body.scrollHeight"):
                            break

                    # Grab the page source and extract the detail-page links
                    content = driver.page_source
                    html = etree.HTML(content, parser=etree.HTMLParser())
                    self.urls[k + '-' + s_k + '-' + ss_k] = html.xpath(
                        '//a[@class="recruitment-job-card feed-job-card"]/@href')

                    driver.find_element(By.XPATH, "//div[@data-v-3683ca40]").click()
                    # element = driver.find_element(By.XPATH, "//div[@data-v-3683ca40]")
                    # driver.execute_script("arguments[0].click();", element)
                    time.sleep(1)

        with open("urls.txt", "w", encoding='utf-8') as f:
            f.write(self.urls.__str__())
        driver.close()

    def start_requests(self):
        for category, urls in self.urls.items():
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse, meta={"category": category}, dont_filter=True)

    def parse(self, response):
        item = EmployItem()
        item["company"] = response.xpath("//div[@class='company-card card-container']/div//text()").extract()[0]
        salary = response.xpath("//div[@class='salary']//text()").extract()[0]
        if salary == '薪资面议' or salary is None:
            item["salary"] = None
        else:
            strs = salary.strip().strip("薪").split('*')
            months = int(strs[1])
            s_min, s_max = strs[0].strip().strip("K").split("-")
            item["salary"] = months * 0.5 * (int(s_min) + int(s_max)) * 1000
        infos = response.xpath("//div[@class='extra flex-row']")
        item["city"] = infos.xpath("//span[@class='el-tooltip']//text()").extract()[0]
        item["education"] = infos.xpath("//span[@class='edu-level']//text()").extract()[0]
        item["job"] = response.meta["category"]
        item["details"] = '\n'.join(response.xpath("//div[@class='job-detail-infos tw-flex-auto']//text()").extract())

        yield item

The __init__() method uses Selenium to locate the filter buttons and simulate clicks so that each category's listing page is loaded, then scrolls to the bottom of the page until all listings have been rendered. It walks through the categories one by one, extracts the detail-page URLs from each listing page, and writes them to a file for later use.
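Since the Selenium pass is by far the slowest part, the dumped urls.txt can be reused on later runs: it is written as a Python dict literal, so a small (hypothetical) helper could read it back instead of re-crawling:

import ast

def load_saved_urls(path="urls.txt"):
    # Hypothetical helper: reload the category -> URL-list dict that
    # __init__() wrote out, to skip the Selenium step on later runs.
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())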

All of these URLs are then handed to start_requests(), which replaces the original start_urls: every request is issued from there, and once a response comes back, parse() is called to extract the page content and feed it into the item pipeline. Here the pipeline writes the data to a CSV file.

Create a new class in pipelines.py to save the data:

import codecs
import csv


class CsvPipeline(object):

    def __init__(self):
        # utf_8_sig writes a BOM so the Chinese text opens correctly in Excel
        self.file = codecs.open('data.csv', 'w', encoding='utf_8_sig')

    def process_item(self, item, spider):
        fieldnames = ['company', 'salary', 'city', 'education', 'job', 'details']
        w = csv.DictWriter(self.file, fieldnames=fieldnames)
        w.writerow(item)
        return item

    def close_spider(self, spider):
        self.file.close()
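One step the snippet above relies on: the pipeline has to be enabled in settings.py before Scrapy will call it. A minimal sketch, using the same package prefix as the DOWNLOADER_MIDDLEWARES entry above (adjust it to your actual package name):

ITEM_PIPELINES = {
    'newcodeSpider.pipelines.CsvPipeline': 300,   # lower number = runs earlier
}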

That completes the setup; run run.py to start the crawl.

Results

The scraped results look like this:
