Crawling WeChat Articles with CrawlSpider

Latest recommended article published on 2021-02-17 22:15:30

迷路的贝壳儿

Category: Crawlers. Tags: crawler

Article link: https://blog.csdn.net/weixin_39218107/article/details/100745471

A CrawlSpider relies on LinkExtractor and Rule; together they decide which links the spider follows and what it does with each page.

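1. How to set allow: the regular expression should match only the URLs we want and must not also match other URLs.

2. When to use follow: if links matching the rule should also be followed on the pages fetched through it, set follow to True; otherwise set it to False.

3. When to specify callback: if a page is fetched only to collect more URLs and none of its data is needed, the callback can be omitted; if data from the page is needed, specify a callback.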
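
To check point 1 before running a crawl, the allow patterns can be tried against sample URLs with the standard `re` module. This is a minimal sketch: the two patterns are the ones used in the spider later in this post, while the three URLs and the helper `matching_rules` are hypothetical and made up for illustration. It assumes the link extractor applies each pattern as a regular-expression search on the URL.

```python
import re

# Patterns copied from the spider's rules in this post.
LIST_PAGE = re.compile(r'.+mod=list&catid=2&page=\d')
ARTICLE_PAGE = re.compile(r'.+article-.+\.html')

# Hypothetical URLs, made up to resemble the site's URL scheme.
list_url = 'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3'
article_url = 'http://www.wxapp-union.com/article-1234-1.html'
other_url = 'http://www.wxapp-union.com/portal.php?mod=list&catid=5&page=1'


def matching_rules(url):
    """Return the names of the patterns that match the given URL."""
    names = []
    if LIST_PAGE.search(url):
        names.append('list')
    if ARTICLE_PAGE.search(url):
        names.append('article')
    return names


print(matching_rules(list_url))     # ['list']
print(matching_rules(article_url))  # ['article']
print(matching_rules(other_url))    # []
```

Each sample URL is matched by at most one pattern, which is what point 1 asks for: the list rule and the article rule do not overlap, and a list page from another category matches neither.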

Creating the CrawlSpider file

scrapy genspider -t crawl <spider_name> <allowed_domain>

items.py

import scrapy

class WechatItem(scrapy.Item):
    title = scrapy.Field()
    auth = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()

pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class WechatPipeline(object):

    def __init__(self):
        self.fp = open('wechat.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def open_spider(self, spider):
        print("Spider started")

    def close_spider(self, spider):
        print("Spider finished")
        self.fp.close()
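
Scrapy only calls a pipeline that is enabled in the project's settings.py, a step this post does not show. The fragment below is a sketch of that setting; the module path assumes the project package is named `wechat`, which is inferred from the `from wechat.items import WechatItem` import in the spider.

```python
# settings.py (fragment)
# Assumption: the project package is named "wechat", as suggested by
# "from wechat.items import WechatItem" in the spider.
ITEM_PIPELINES = {
    # Pipelines with lower numbers run earlier; 300 is the value
    # used in Scrapy's generated project template.
    'wechat.pipelines.WechatPipeline': 300,
}
```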

wechatspider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wechat.items import WechatItem

class WechatSpiderSpider(CrawlSpider):
  name = 'wechat_spider'
  allowed_domains = ['wxapp-union.com']
  start_urls = ['http://www.wxapp-union.com/'
                'portal.php?mod=list&catid=2&page=1']
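  # How to write the rules:
  # allow
  # pattern for the links that may be crawled; regular expressions are supported
  # callback
  # leave it out when the matched pages do not need to be parsed
  # follow
  # set to False when links on the parsed pages do not need to be followed further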
  rules = (
      Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'),  follow=True),
      Rule(LinkExtractor(allow=r'.+article-.+\.html'),  callback="parse_detail", follow=False),
  )

  def parse_detail(self, response):
      title = response.xpath("//h1[@class='ph']/text()").get()
      auth_time = response.xpath("//p[@class='authors']")
      auth = auth_time.xpath(".//a/text()").get()
      time = auth_time.xpath(".//span/text()").get()
      article = response.xpath('//td[@id="article_content"]//text()').getall()
      content = "".join(article).strip()
      # print("********************")
      # print(title, auth, time, article)
      # print("********************")

      item = WechatItem(title=title, auth=auth, time=time, content=content)
      yield item
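
In `parse_detail`, `"".join(article).strip()` trims whitespace only at the two ends of the joined text. If the extracted text nodes contain indentation or empty fragments, an optional variant such as the one below could tidy them. This is a sketch and not part of the original spider: the helper `join_text_fragments` and the sample fragments are hypothetical.

```python
def join_text_fragments(fragments):
    """Strip each text fragment, drop empty ones, and join with newlines."""
    cleaned = (fragment.strip() for fragment in fragments)
    return "\n".join(piece for piece in cleaned if piece)


# Hypothetical fragments, shaped like what .getall() on //text() might return.
sample = ["\r\n  ", "First paragraph.", "\n", "  Second paragraph.  ", "\n"]

print(repr("".join(sample).strip()))       # approach used in parse_detail
print(repr(join_text_fragments(sample)))   # variant with per-fragment cleanup
```

The first call keeps the inner indentation before the second paragraph, while the helper returns one clean line per non-empty fragment.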

If you have any suggestions, please leave a comment.