Python Web Crawler -- Project in Practice -- Embedding Selenium in Scrapy: Crawling Cascading Comments on Xinpianchang (6)

1. Goal

Crawl the cascading comments on a Xinpianchang (xinpianchang.com) film page.

2. Analysis

2.1 Page Analysis

Inspection shows that the comments on this page are loaded dynamically by JavaScript, so a plain Scrapy request never sees them. We therefore render the page with Selenium. In this walkthrough we only extract the data; we do not store it.

3. Complete Code

xpc.py

import scrapy


class XpcSpider(scrapy.Spider):
    name = 'xpc'
    allowed_domains = ['www.xinpianchang.com']
    start_urls = ['https://www.xinpianchang.com/a10975710?from=ArticleList']

    def parse(self, response):
        # Extract the text of every comment on the page; the response
        # body here is the HTML rendered by the Selenium middleware.
        results = response.xpath("//ul[contains(@class, 'comment-list')]/li/div/div/i[@class='text']/text()").extract()
        print(results)

middlewares.py

In this file, only the process_request method needs to be changed.
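For the middleware to intercept requests at all, it must also be enabled in settings.py. A minimal sketch, assuming the project package is named scrapyadvanced (to match the class name below) -- adjust the dotted path to your own project layout:

```python
# settings.py -- enable the custom downloader middleware.
# The package name "scrapyadvanced" is an assumption inferred from the
# middleware class name; replace it with your project's actual package.
DOWNLOADER_MIDDLEWARES = {
    "scrapyadvanced.middlewares.ScrapyadvancedDownloaderMiddleware": 543,
}
```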

from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.webdriver import Chrome as WebDriver

# Adjust this import to your project's actual package/module layout.
from .spiders.xpc import XpcSpider


class ScrapyadvancedDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        if isinstance(spider, XpcSpider):
            # This is also a convenient place to inject a random
            # User-Agent, cookies, or a proxy.
            print("Intercepting request:", request.url)

            # Drive Chrome to load the page so its JavaScript runs
            # and the dynamically loaded comments are rendered.
            driver = WebDriver()
            driver.get(request.url)
            sleep(2)  # crude wait for the comments to finish loading
            # Grab the fully rendered HTML, then release the browser.
            content = driver.page_source
            driver.quit()

            # Wrap the rendered HTML in an HtmlResponse; returning a
            # Response here short-circuits Scrapy's own download.
            return HtmlResponse(request.url, body=content.encode("utf-8"), encoding="utf-8")
        return None
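The return-value contract spelled out in the template comments (return None to continue, return a Response to short-circuit the download) can be sketched in plain Python. All names below are illustrative stand-ins, not Scrapy internals:

```python
# Minimal sketch of the downloader-middleware contract: a middleware
# that returns a Response replaces the real download entirely.
# Class and function names here are illustrative, not Scrapy's own.

class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body

def selenium_middleware(request_url):
    # Mimics process_request: intercept matching requests and hand back
    # a ready-made Response, skipping the normal downloader.
    if "xinpianchang.com" in request_url:
        return Response(request_url, "<html>rendered by selenium</html>")
    return None  # let the normal downloader handle it

def download(request_url, middlewares):
    for mw in middlewares:
        response = mw(request_url)
        if response is not None:
            # A middleware supplied the response; the real download is skipped.
            return response
    # Fallback: the "real" downloader.
    return Response(request_url, "<html>fetched by scrapy</html>")

resp = download("https://www.xinpianchang.com/a10975710", [selenium_middleware])
print(resp.body)  # → <html>rendered by selenium</html>
```

This is why the spider's `parse()` receives Selenium-rendered HTML: for matching requests, the middleware's Response is what flows back through the engine.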