Scrapy framework: a Python web crawler

Latest recommended article published on 2024-11-08 13:43:44

chiduo3959

Tags: python, crawler, json

Original link: https://my.oschina.net/u/3744319/blog/1788254

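A friend asked me to help write a crawler, so here are my notes.

Project overview:

Scrapy framework, Anaconda (Python 3.6)

Development tool:

IDEA

Details:

Scrapy architecture diagram:

Scrapy consists mainly of the following components: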

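- Engine (Scrapy Engine)
  Handles communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler
  Receives the Requests sent by the engine, arranges them in a queue, and hands them back to the engine when the engine asks for the next one to download.
- Downloader
  Downloads every Request sent by the engine and returns the resulting Response to the engine, which passes it to the spider for processing.
- Spiders
  Handle every Response and submit the URLs that need to be followed to the engine, which sends them back to the Scheduler.
- Item Pipeline
  Post-processes the items the Spider extracts (detailed analysis, filtering, storage, and so on).
- Downloader Middlewares
  Hooks between the Scrapy engine and the Downloader; a component for customizing and extending download behavior.
- Spider Middlewares
  Hooks between the Scrapy engine and the Spiders; they mainly process the responses going into the spider and the requests coming out of it.
- Scheduler Middlewares
  Middleware between the Scrapy engine and the Scheduler that handles the requests passed between the two. Recent Scrapy documentation lists only downloader and spider middlewares, so this component appears only in older architecture descriptions.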
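
To make the division of labor above concrete, here is a toy simulation in plain Python. It does not use Scrapy at all: `fake_download`, `spider_parse`, `run_engine`, and the `example.com` URLs are hypothetical stand-ins, and the real engine is asynchronous and far more involved. It only sketches how a request travels from the scheduler to the downloader, then to the spider, and how items end up in the pipeline.

```python
from collections import deque


def fake_download(url):
    """Stand-in for the Downloader: returns a fake response."""
    return {"url": url, "body": f"<html>content of {url}</html>"}


def spider_parse(response):
    """Stand-in for a Spider callback: yields an item and maybe a follow-up URL."""
    yield {"item": response["url"]}                # goes to the pipeline
    if response["url"].endswith("page=1"):
        yield "https://example.com/list?page=2"    # goes back to the scheduler


def run_engine(start_urls):
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # simple duplicate filter
    pipeline_output = []            # Item Pipeline: collects the items
    while scheduler:                # Engine: drives the whole loop
        url = scheduler.popleft()
        response = fake_download(url)
        for result in spider_parse(response):
            if isinstance(result, dict):
                pipeline_output.append(result)
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return pipeline_output


items = run_engine(["https://example.com/list?page=1"])
print(items)
```

The loop ends when the scheduler queue is empty, which is also roughly when a real crawl finishes.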

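How to build a crawler:

1. Create the project (scrapy startproject xxx): create a new crawler project.

2. Define the target (edit items.py): declare the fields you want to scrape.

scrapy genspider <spider name> <domain> -- creates the spider file in the spiders package

scrapy crawl <spider name> -- runs the spider

3. Write the spider (spiders/xxxspider.py): write the spider that crawls the pages.

4. Store the content (pipelines.py): design a pipeline that stores the scraped content.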
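
Step 4 is not shown in the example later in this post, so here is a minimal sketch of what a pipeline in pipelines.py could look like. The class name `JsonLinesPipeline` and the file name `items.jsonl` are made up for illustration; the three method names (`open_spider`, `process_item`, `close_spider`) are the hooks Scrapy calls on a pipeline object.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes every item as one JSON line."""

    def __init__(self, path="items.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```

A pipeline only runs if it is registered in the ITEM_PIPELINES setting in settings.py, with the dotted path of the class as the key and a priority number as the value.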

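Saving data

Four output formats are shown here; the -o option writes the scraped items to a file, and the file extension selects the format.

# JSON; non-ASCII characters are written as Unicode escapes by default

scrapy crawl itcast -o teachers.json

# JSON Lines (one JSON object per line); non-ASCII characters are written as Unicode escapes by default

scrapy crawl itcast -o teachers.jsonl

# CSV (comma-separated values); can be opened in Excel

scrapy crawl itcast -o teachers.csv

# XML

scrapy crawl itcast -o teachers.xml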
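
The "Unicode by default" remark means that Chinese text shows up in the JSON file as \uXXXX escapes. The standard-library snippet below only illustrates the difference between escaped and readable output; the record is made-up sample data. In Scrapy itself, setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py is, as far as I know, the usual way to get readable JSON output.

```python
import json

record = {"name": "天津"}  # made-up sample data

escaped = json.dumps(record)                        # non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)   # characters are kept as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs.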

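Here is the crawler I wrote for my friend:

1. First, the requirements. The information we need is the part marked with a red box. The target site is

https://www.dankegongyu.com/room/tj

We crawl the detail information of every listing.

2. First we write the items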

import scrapy


class ItcastItem(scrapy.Item):
    # name
    name = scrapy.Field()
    # location
    discount = scrapy.Field()
    # title
    title = scrapy.Field()
    # price
    price = scrapy.Field()
    # details
    list_box = scrapy.Field()

Next we write the spider

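import scrapy

# assumes the default startproject layout, with items.py one level above spiders/
from ..items import ItcastItem


class ItcastSpider(scrapy.Spider):
    # spider name
    name = "itcast"
    # domains the spider may crawl (bare domain names, not URLs); other domains are filtered out
    allowed_domains = ['www.dankegongyu.com']
    # URL where the crawl starts
    start_urls = ['https://www.dankegongyu.com/room/tj?page=1']

    # Flow: crawl the start page, collect the detail-page URLs and crawl each detail page for output;
    # once the listing page is done, check for a next page and keep going until every page is crawled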
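    def parse(self, response):
        node_list = response.xpath("//div[@class='r_lbx']")
        for node in node_list:
            # extract_first() returns None instead of raising when a node has no link
            url = node.xpath("./a/@href").extract_first()
            if not url:
                continue
            # crawl the detail page; parse_detail1 is the callback that handles its response
            # dont_filter=True bypasses the duplicate filter, so a URL that was already requested is fetched again
            # yield hands one request back to Scrapy and pauses here; the method resumes from this line when the next result is requested
            yield scrapy.Request(response.urljoin(url), callback=self.parse_detail1, dont_filter=True)

        # pagination
        # 1. work out the next page number from the highlighted current-page link
        now_url = response.xpath("//div[@class='page']/a[@class='on']/text()").extract()
        if not now_url:
            return
        next_url = int(now_url[0]) + 1
        print('next_url=', next_url)
        # 2. if a link to the next page exists, request it
        url2 = 'https://www.dankegongyu.com/room/tj?page=' + str(next_url)
        if response.xpath("//div[@class='page']/a[@href='" + url2 + "']"):
            yield scrapy.Request(url2, callback=self.parse, dont_filter=True)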


    # parsing rules for the detail page
    def parse_detail1(self, response):
        item = ItcastItem()
        name = response.xpath("//div[@class='room-detail-right']/div[@class='room-name']/h1/text()").extract()
        discount = response.xpath("//div[@class='room-detail-right']/div[@class='room-name']/em/text()").extract()
        title = response.xpath("//div[@class='room-detail-right']/div[@class='room-title']/span/text()").extract()
        price = response.xpath("//div[@class='room-detail-right']//div[@class='room-price-sale']/text()").extract()
        list_box = response.xpath("//div[@class='room-detail-right']/div[@class='room-list-box']//label/text()").extract()

        # clean up the results and store them in the item
        item['name'] = name[0]
        item['discount'] = discount[0]
        item['title'] = ','.join(title)
        item['price'] = price[0].strip()
        item['list_box'] = ','.join(list_box).replace(' ', '').replace('\n', '').replace(',,,,,', ',')

        yield item
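
The comment about yield in the spider deserves a small illustration. The toy generator below is not Scrapy code; `parse_like` and the strings it yields are hypothetical stand-ins for a callback that first yields detail-page requests and then the next listing page. It shows that a function containing yield produces its results one at a time and picks up where it left off each time the caller asks for the next one.

```python
def parse_like(page, last_page=3):
    """Toy callback: yields 'detail requests' first, then the next page."""
    for room_id in (1, 2):
        yield f"detail:page{page}-room{room_id}"
    if page < last_page:
        yield f"list:page{page + 1}"


gen = parse_like(1)
first = next(gen)    # runs up to the first yield, then pauses
second = next(gen)   # resumes inside the loop, right after the first yield
rest = list(gen)     # drains whatever is left

print(first, second, rest)
```

Scrapy consumes a callback the same way: it iterates over whatever the callback yields and schedules each request or collects each item as it arrives.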

This crawler needs no custom middleware, so it is now complete. It is simple, but I learned something from writing it.
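
One possible refinement: the list_box cleanup in parse_detail1 depends on the empty fragments producing exactly five commas in a row. The sketch below is an alternative under that observation, not the code the crawler actually uses; `join_labels` and the sample fragments are hypothetical. It removes whitespace from every fragment and drops the empty ones before joining, so it does not depend on how many blank text nodes the page contains.

```python
def join_labels(fragments):
    """Remove whitespace from each text fragment and drop the empty ones."""
    cleaned = ("".join(text.split()) for text in fragments)
    return ",".join(text for text in cleaned if text)


# made-up fragments resembling what extract() returns for the label nodes
sample = ["\n  area: 10 ", "\n", "  ", "floor: 3/6\n"]
print(join_labels(sample))
```

In the spider this would replace the chained replace() calls with item['list_box'] = join_labels(list_box).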

Reposted from: https://my.oschina.net/u/3744319/blog/1788254