Scrapy Quick Start: A Personal Summary [with a Worked Example]

JW☞♡Lee

Article link: https://blog.csdn.net/weixin_43422435/article/details/108332450

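I. Installing Scrapy

1. With pip [this route is fairly tedious]

1.1 pip3 install wheel

Run this from the Scripts folder under the Python installation directory.

1.2 Download the Twisted package; it must match your Python version

1.3 Install Twisted [from the directory containing the downloaded Twisted file]

1.4 Install pywin32

1.5 Install scrapy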
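
After working through steps 1.1 to 1.5, it can help to confirm that each package is actually importable. The following is a minimal sketch using only the standard library; the helper name check_installed and the module list are my own assumptions (Twisted is imported as "twisted", and pywin32 is checked through its "win32api" module, which exists on Windows only).

```python
from importlib.util import find_spec

# Import names for the packages installed in steps 1.1 to 1.5 (assumed list).
REQUIRED_MODULES = ["wheel", "twisted", "win32api", "scrapy"]


def check_installed(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in modules}


# Print a short report, one line per package.
for name, ok in check_installed().items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Any line reported as missing points to the step that needs repeating.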

2. With PyCharm

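II. Common commands

Create a project: scrapy startproject <project_name>
Enter the project: cd <project_name>
Create a spider: scrapy genspider <spider_name> www.123.com [the domain is only a placeholder; it can be changed later in the spider file]
Export to a file: scrapy crawl xxx -o xxx.json (the output format follows the file extension)
Run a spider: scrapy crawl <spider_name>
List all spiders: scrapy list
Show settings values: scrapy settings [options]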
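
If you prefer to start a crawl from a Python script rather than typing the command each time, the crawl command above can be wrapped in a small launcher. This is only a sketch: the function names build_crawl_command and run_crawl are hypothetical, and it assumes the scrapy executable is on PATH and that the script is run from the project directory.

```python
import subprocess


def build_crawl_command(spider_name, output_file=None):
    """Build the argument list for `scrapy crawl`, optionally with `-o`."""
    command = ["scrapy", "crawl", spider_name]
    if output_file is not None:
        # -o exports the scraped items to the given file.
        command += ["-o", output_file]
    return command


def run_crawl(spider_name, output_file=None):
    """Run the crawl as a subprocess and return its exit code."""
    return subprocess.run(build_crawl_command(spider_name, output_file)).returncode
```

For example, run_crawl("jkl", "shops.json") would be equivalent to typing scrapy crawl jkl -o shops.json.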

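III. Scraping Jingkelong (京客隆) store information with Scrapy

1. Create the project

2. Open the spider file, change the start URL, and delete the domain restriction (allowed_domains)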

3. Change the configuration in settings.py
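
The screenshot for this step is not available, so the exact values that were changed are unknown. As a rough sketch, a tutorial crawl like this one often touches settings along the following lines; every value below is an assumption, not the original configuration.

```python
# settings.py (fragment): illustrative values, not taken from the original post.

# Whether to honor the site's robots.txt. Projects generated by
# `scrapy startproject` set this to True; tutorials often switch it off.
ROBOTSTXT_OBEY = False

# Browser-like User-Agent string (hypothetical value).
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# Only log warnings and errors to keep the console output short.
LOG_LEVEL = "WARNING"
```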

4. Write the spider logic

5. Define the item container

6. Set up the pipeline and save the data

7. Enable the pipeline
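
Enabling a pipeline means registering its class in the ITEM_PIPELINES setting. A minimal sketch, assuming the project package is named Ajkl (as suggested by the import in the spider file) and the class is AjklPipeline:

```python
# settings.py (fragment): register the pipeline class so Scrapy calls it.
# The dotted path assumes the project package is named "Ajkl".
ITEM_PIPELINES = {
    "Ajkl.pipelines.AjklPipeline": 300,  # lower numbers run first (range 0-1000)
}
```

Without this entry, process_item in the pipeline file is never called and nothing is saved.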

Complete code

1. Spider file

import scrapy
from Ajkl.items import AjklItem


class JklSpider(scrapy.Spider):
    name = 'jkl'
    start_urls = ['http://www.jkl.com.cn/cn/shop.aspx']

    # Start page: collect the detail-page URL of each region.
    def parse(self, response):
        li_list = response.xpath('//div[@class="infoLis"]//li')
        for i in li_list:
            url = "http://www.jkl.com.cn/cn/" + i.xpath('./a/@href').extract()[0]
            yield scrapy.Request(url, callback=self.get_page)

    # Region detail page: extract the fields we want to save for each shop.
    def get_page(self, response):
        shop_name_list = response.xpath('//div[@class="shopLis"]//dd')
        for j in shop_name_list:
            # Create a fresh item per shop so yielded items do not share state.
            item = AjklItem()
            item['店铺名称'] = (j.xpath('./a/span[1]/text()').extract()[0]).strip()  # shop name
            item['店铺地址'] = j.xpath('./a/span[2]/text()').extract()[0]  # shop address
            item['店铺电话'] = j.xpath('./a/span[3]/text()').extract()[0]  # shop phone
            item['店铺时间'] = j.xpath('./a/span[4]/text()').extract()[0]  # shop hours
            yield item
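
The spider builds each region URL by concatenating a fixed prefix with the href, which only works while every href is relative to /cn/. A slightly more robust sketch resolves the href against the page URL with the standard library; the helper name build_detail_url and the sample href values are hypothetical. Inside a spider, response.urljoin(href) serves the same purpose.

```python
from urllib.parse import urljoin

START_URL = "http://www.jkl.com.cn/cn/shop.aspx"


def build_detail_url(href, base=START_URL):
    """Resolve an href taken from the page against the page's own URL."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only.
print(build_detail_url("shopLis.aspx?id=865"))      # relative href
print(build_detail_url("/cn/shopLis.aspx?id=865"))  # root-relative href
```

Both calls resolve to the same absolute URL, whereas plain string concatenation would produce a broken link for the second form.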

2. Items file

import scrapy


class AjklItem(scrapy.Item):
    # One field per value extracted for each shop.
    店铺名称 = scrapy.Field()  # shop name
    店铺地址 = scrapy.Field()  # shop address
    店铺电话 = scrapy.Field()  # shop phone
    店铺时间 = scrapy.Field()  # shop hours

3. Pipeline file

import pandas as pd


class AjklPipeline:
    def process_item(self, item, spider):
        shop_name = item['店铺名称']
        shop_dis = item['店铺地址']
        shop_phone = item['店铺电话']
        shop_time = item['店铺时间']

        # Build a one-row DataFrame and append it to the CSV file without a header row.
        data = pd.DataFrame({'店铺名称': [shop_name], '店铺地址': [shop_dis], '店铺电话': [shop_phone], '店铺时间': [shop_time]})
        # "ANSI" is a Windows-only codec alias for the system code page.
        data.to_csv('e:/店铺信息.csv', index=False, header=False, mode='a', encoding="ANSI")

        return item
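
The pipeline above reopens the CSV file and builds a DataFrame for every single item, and its encoding only works on Windows. As an alternative sketch (not the approach used in this post), the standard csv module can keep one file open for the whole crawl through the open_spider and close_spider hooks. The class name CsvExportPipeline and the default path shops.csv are assumptions.

```python
import csv

FIELDS = ["店铺名称", "店铺地址", "店铺电话", "店铺时间"]


class CsvExportPipeline:
    """Write every item as one CSV row, keeping the file open during the crawl."""

    def __init__(self, path="shops.csv"):
        self.path = path  # hypothetical output path

    def open_spider(self, spider):
        # utf-8-sig adds a BOM so Excel detects the encoding of the Chinese text.
        self.file = open(self.path, "w", newline="", encoding="utf-8-sig")
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Copy only the known fields, in a fixed column order.
        self.writer.writerow({field: item[field] for field in FIELDS})
        return item

    def close_spider(self, spider):
        self.file.close()
```

To try it, the class would need to be registered in ITEM_PIPELINES in place of (or alongside) AjklPipeline.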
