Scraping job listings from Lagou.com with Scrapy

The full code is as follows.

I only started working with this framework today, and I haven't put much effort into countering the site's anti-scraping measures, so the crawler keeps getting detected.

# -*- coding: utf-8 -*-
import scrapy
from lagou.items import LagouItem
import re

class LagouZhaopinSpider(scrapy.Spider):
    name = 'lagou_zhaopin'
    allowed_domains = ['lagou.com']
    start_urls = ["https://www.lagou.com/zhaopin/1/"]

    def parse(self, response):
        li_list = response.xpath("//li[@class='con_list_item default_list']")
        for i in li_list:
            data_dict = LagouItem()
            data_dict["title"] = i.xpath(".//h3/text()").extract()
            data_dict["addr"] = i.xpath(".//span[@class='add']/em/text()").extract()

            # follow the link to the detail page to collect the rest of the item
            detail_url = i.xpath('.//a[@class="position_link"]/@href').extract_first()
            if detail_url is not None:  # guard: some list items may lack a link
                yield scrapy.Request(
                    detail_url,
                    callback=self.parse_detail,
                    meta={"data_dict": data_dict}
                )

        # request the next page
        # extract_first() returns a str, or None if nothing matched
        next_page_url = response.xpath("//a[text()='下一页']/@href").extract_first()
        print(next_page_url)
        # on the last page the link's href is "javascript:;" rather than a real URL
        if next_page_url is not None and next_page_url != "javascript:;":
            yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_detail(self, response):  # extract data from the detail page
        data_dict = response.meta["data_dict"]

        # strip extra characters (spaces, newlines, etc.)
        content_list = response.xpath('//*[@id="job_detail"]/dd[2]/div//text()').extract()
        content_list = [re.sub(r"\s", "", i) for i in content_list]  # remove all whitespace
        data_dict["detail"] = [i for i in content_list if len(i) > 0]  # drop now-empty strings
        yield data_dict
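
The spider assumes a LagouItem with the three fields populated above. The post doesn't show items.py, so this is a minimal sketch of what it would need to contain:

# lagou/items.py -- assumed sketch; the original post does not show this file
import scrapy

class LagouItem(scrapy.Item):
    title = scrapy.Field()   # job title text from the listing page
    addr = scrapy.Field()    # job location from the listing page
    detail = scrapy.Field()  # cleaned description lines from the detail page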


I collected a bunch of user-agent strings from around the web

and set them in settings.py:

import random

USER_AGENT_LIST = [
    'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
    'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
    'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
    'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
    'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
    'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
    'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
    'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
]

USER_AGENT = random.choice(USER_AGENT_LIST)
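
Note that random.choice here runs only once, when Scrapy loads the settings, so every request in a run shares the same user agent. To rotate per request you could add a small downloader middleware; a minimal sketch (the class and file name are my own, not part of the original project):

# lagou/middlewares.py -- hypothetical per-request rotation
import random
from lagou.settings import USER_AGENT_LIST

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # overwrite the User-Agent header on every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENT_LIST)

It would then be enabled in settings.py with DOWNLOADER_MIDDLEWARES = {"lagou.middlewares.RandomUserAgentMiddleware": 400}.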
Print the results in pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient
client = MongoClient()
db = client["lagou"]

class LagouPipeline:
    def process_item(self, item, spider):

        # this is where the data would be saved
        print(item)
        # db.zhaopin.insert_one(dict(item))
        return item
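
The pipeline only runs if it is enabled in settings.py, as the boilerplate comment above reminds us; for example:

# settings.py -- enable the pipeline; 300 is an arbitrary middle priority
ITEM_PIPELINES = {
    "lagou.pipelines.LagouPipeline": 300,
}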

The output of a run (started with scrapy crawl lagou_zhaopin) is shown below.

[screenshot of the crawler's console output]
My takeaway: the crawl is very fast, but without any counter-anti-scraping measures it gets detected easily.
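
If I wanted the crawler to be less detectable, the first step would be slowing it down. Scrapy has built-in settings for this; a minimal sketch with illustrative values:

# settings.py -- illustrative politeness/throttling settings
DOWNLOAD_DELAY = 2               # wait ~2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True      # back off automatically if the server slows down
COOKIES_ENABLED = False          # don't carry session cookies across requests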
