[Scrapy] From Getting Started to Hands-On Practice in One Post

落叶阳光

Last modified: 2024-06-26 09:59:17

Category: Tools. Tags: crawler, python, scrapy

First published: 2021-02-26 15:54:35

Article link: https://blog.csdn.net/xiangxiang613/article/details/114136364

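Scrapy is a crawling framework for Python. Compared with combining Beautiful Soup and requests, it is usually more efficient, because it schedules requests asynchronously instead of fetching pages one at a time.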

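1. Scrapy introductory tutorial (recommended):

https://www.jianshu.com/p/43029ea38251 (installing Scrapy and basic usage)

Working through this post gives a first feel for Scrapy.

The remaining topics can be explored in more depth during real projects, including but not limited to:

XPath and CSS syntax:

Reference 1: https://www.cnblogs.com/youxin/p/4041917.html (explains the difference between /, // and @)

Reference 2: https://docs.scrapy.org/en/latest/topics/selectors.html (official documentation)

Regular expression syntax:

Reference: https://www.runoob.com/regexp/regexp-syntax.html

Regular expression tester: https://c.runoob.com/front-end/854 (very handy)

Link Extractors:

Reference 1: https://docs.scrapy.org/en/latest/topics/link-extractors.html (official documentation)

Small example: https://www.cnblogs.com/lei0213/p/7976280.html (basic use of rules in Scrapy spiders)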

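2. Hands-on tutorials:

Data collection for an analysis of owner reviews of popular Autohome models:

https://zhuanlan.zhihu.com/p/268117716 (Python + Scrapy + pymysql + MySQL)

Using a crawler to fetch data for every model on Autohome:

https://zhuanlan.zhihu.com/p/54996488 (Python + Scrapy)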

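3. Basic notes:

After a week away from it I forget half of this, so I am writing it down to pick it up again quickly when needed.

(1) Common commands

Create a project: scrapy startproject <project_name>
Run a spider: cd <project_name>
scrapy crawl <spider_name>
Note: the spider name is the value of the name attribute in the spider class; it is not the project name.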

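(2) Scrapy project layout and what each part does

spiders folder: this is where you write your own spider files. Each spider needs start_urls and a parse function. start_urls can hold a list of URLs, in which case Scrapy requests all of them (they are scheduled asynchronously, so responses may not arrive in list order) and parse only has to extract the content of each page. Alternatively, give a single starting URL; parse then has to extract the URL of a link such as "next page" and continue crawling with yield scrapy.Request(url=url, callback=<next parse function>). Inside parse, create item = VehicleHomeItem() and store every result in the item as you would in a dictionary. Note that the field names must match the definitions in items.py.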

# Structural sketch only: the URLs and the XPath are placeholders, so it will not run against the real site
import scrapy
from demo.items import VehicleHomeItem  # import the item class from items.py

class AutohomeSpider(scrapy.Spider):
    name = "img"  # spider name, used by `scrapy crawl img`
    allowed_domains = ["autohome.com"]
    start_urls = [
        "http://autohome.com/page1",
        "http://autohome.com/page2",
    ]  # pages to crawl

    def parse(self, response):
        # Instantiate the item
        item = VehicleHomeItem()
        # Extract data from the response
        car_id = response.url.split("/")[-1]
        img_url = response.xpath("//span[@class='scaleimg']//picture//img/@src").get()
        # Store the results
        item["car_id"] = car_id
        item["img_url"] = img_url
        # Hand the item to the pipelines
        yield item

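items: needs editing. It defines the structure of the items. As a rule, each kind of record produced by a parse function in the spider corresponds to one class in items.py; the class name is up to you.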

import scrapy

class VehicleHomeItem(scrapy.Item):
    # define the fields for your item here like:
    car_id = scrapy.Field()
    img_url = scrapy.Field()

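pipelines: needs editing. Pipelines process the items yielded by the spider. Any data handling can go here; the most common use is persistence, for example writing to a database. In my case the pipeline downloads the image behind each link.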

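import wget  # third-party package: pip install wget

class DemoPipeline:

    def process_item(self, item, spider):
        print("---------- downloading image ----------")
        car_id = item['car_id']
        # The scraped src is assumed to be protocol-relative (starts with //), so prepend the scheme
        img_url = "https:" + item['img_url']
        print("image URL:", img_url)
        # The target directory must already exist
        path = "demo/spiders/img/" + str(car_id) + ".jpg"
        wget.download(img_url, out=path)
        return item

    def close_spider(self, spider):
        print("---------- download finished ----------")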
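
For the more common persistence case mentioned above, a pipeline usually opens its resource once, writes every item, and closes the resource at the end. The sketch below writes items to a JSON Lines file using only the standard library. The class name JsonLinesPipeline and the default file name items.jl are hypothetical; open_spider, process_item and close_spider are the hook names Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # output file; the default name is an assumption
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()
```

Returning the item from process_item matters: if a pipeline returns nothing, later pipelines in the chain receive None instead of the item.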

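settings: needs editing. This holds the crawler's initial configuration, such as the download delay and the User-Agent.
middlewares: no changes needed; the defaults are fine.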
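
A settings.py fragment along these lines covers the points above. The setting names are Scrapy's own, but every value is an illustrative assumption to adapt to the target site, and the project name demo is taken from the import in the spider example. A pipeline only runs once it is listed in ITEM_PIPELINES.

```python
# settings.py (fragment); all values are illustrative assumptions
BOT_NAME = "demo"

# Identify the crawler to the server; the string is a placeholder.
USER_AGENT = "Mozilla/5.0 (compatible; demo-crawler)"

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True

# Respect robots.txt rules.
ROBOTSTXT_OBEY = True

# Enable the pipeline; lower numbers run earlier (range 0 to 1000).
ITEM_PIPELINES = {
    "demo.pipelines.DemoPipeline": 300,
}
```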

(3) Selecting elements in Scrapy

Example: reading the author

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

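In the parse function,
XPath (recommended; note that the text of a node is selected with text(), not @text, because text is not an attribute):
response.xpath(".//div[@class='quote']//small[@class='author']/text()").get()
With a regular expression:
response.xpath(".").re(".*")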

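(4) Important pitfalls

1) With and without the leading dot
A leading dot makes the query relative to the currently selected node; without it, the query searches the whole document.

divs = response.xpath("//div")

for p in divs.xpath(".//p"):  # every p anywhere inside the selected divs; note that divs.xpath("p") is narrower and matches only direct children
    ...

for p in divs.xpath("//p"):  # every p in the whole document, regardless of divs
    ...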

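2) // versus /
"//": the descendant axis. It matches nodes at any depth, so with a condition on the tag, such as //div[@class='quote'], you can usually pin down the node you want.
"/": one step down to a direct child (at the very start of an expression it means the document root). Once the path has reached the tag that holds the content, switch to / and append what you want: /@href for an attribute or /text() for the text. Finally call get() or getall() to obtain the actual values. get() returns only the first match (or None if nothing matches), while getall() returns every match as a list.
"@": selects an attribute.

e.g. response.xpath("//@href"): the href attributes of all tags in the document
node.xpath("./@href"): the href attribute of the current node only, where node is a previously selected element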

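3) The effect of parentheses

# //node[1]: for every parent, selects the node that is its first such child; the result can contain many nodes
# (//node)[1]: collects all matching nodes first, then takes the first one; the result is a single node

# e.g.
<ul>
    <li>1</li>
    <li>2</li>
</ul>
<ul>
    <li>3</li>
    <li>4</li>
</ul>

//li[1] returns: <li>1</li>
                 <li>3</li>
(//li)[1] returns: <li>1</li>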

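4. Common status codes in Scrapy

Use the status code to judge how the crawl is going.

It appears in log lines such as: [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.autohome.com.cn/4891> (referer: https://www.autohome.com.cn/grade/carhtml/J.html)

200: the request succeeded (every code in the 2xx range, 200 to 299, means success, i.e. things are working normally)

301, 302: redirect (Scrapy follows redirects by default; it needs attention when the target is not the page you asked for). Causes and fixes: https://blog.csdn.net/xiangxiang613/article/details/114136229

404: page not found

429: the request rate has exceeded the server's limit, and the server is refusing further requests. Fix:
https://blog.csdn.net/Z_suger7/article/details/134929657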
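
On a long crawl it can help to count the status codes in the log rather than read it line by line. The helper below is a standard-library sketch that assumes the "Crawled (code)" line format shown above. The function name tally_status is hypothetical, and the second sample line is made up for illustration.

```python
import re
from collections import Counter

# Matches the status code in lines such as "... Crawled (200) <GET ...>".
CRAWLED = re.compile(r"Crawled \((\d{3})\)")


def tally_status(log_lines):
    """Count how many crawled responses carried each HTTP status code."""
    counts = Counter()
    for line in log_lines:
        match = CRAWLED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = [
    "[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.autohome.com.cn/4891> "
    "(referer: https://www.autohome.com.cn/grade/carhtml/J.html)",
    # Made-up line for illustration only.
    "[scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.autohome.com.cn/missing> "
    "(referer: None)",
]
counts = tally_status(sample)
print(dict(counts))
```

A sudden rise in anything outside the 2xx range, 429 in particular, is a sign to slow the crawl down.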

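5. Debugging with scrapy shell

Run scrapy shell http://www.xxxxx.xx in the project directory to enter an interactive Python session for debugging. Run print(response.xpath('xxxxx')) to check whether an XPath expression returns what you expect.
Note: an expression that works in XPath Helper may still fail in Scrapy, typically because the browser shows the page after JavaScript has run while Scrapy sees the raw HTML. Always debug in the shell; only an expression that works there can be relied on to extract the data.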

6. Using the XPath Helper extension

https://blog.csdn.net/qq_54528857/article/details/122202572
