Python crawler: an example of scraping web page content with Scrapy

Liu610921

Category: Automated testing

Article link: https://blog.csdn.net/youran02100210/article/details/79626997

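Scrapy is a Python framework built specifically for writing crawlers. It cleanly separates fetching page content, processing the results, and running the crawl into distinct modules.

1. Installing Scrapy

Preparing the environment: be sure to install pip first. The Scrapy version installed through sudo apt-get is very old and causes many problems, and on Ubuntu 16.04 running sudo apt-get install scrapy does not seem to find a package at all. Do not cut corners: install pip first, then install Scrapy.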

sudo apt-get install python-pip
sudo apt-get install python-dev
sudo apt-get install libevent-dev
sudo apt-get install libssl-dev

Once pip is installed, run sudo pip install scrapy. (Write the package name in all lowercase, as scrapy.)

sudo pip install scrapy
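
One way to confirm that the install worked is to ask the interpreter which version it can see. The sketch below is an illustration rather than part of the original setup: the helper name installed_version is hypothetical, and it assumes a Python 3.8 or later interpreter, where importlib.metadata is available.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("scrapy is not installed for this interpreter")
    else:
        print("scrapy", version)
```

If this reports that scrapy is missing even though pip finished without errors, pip probably installed it for a different Python interpreter.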

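2. A simple project example

1. Create the Scrapy project from the command line: create a project directory, cd into it, and run: scrapy startproject project_name

If the command succeeds, the newly generated directories and files appear under the project directory.

Directory overview:

-scrapy.cfg: the configuration file for the whole project.

---spiders: the spider directory. Define your own spider classes here; they scrape content and assemble it into the data structures you define in items.py.

---items.py: wraps the content scraped from pages so that later steps can process it.

---pipelines.py: defines how the scraped content is processed, for example deduplicating it, storing it in a database, or writing it out in a particular format.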

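2. Project example

Requirement: for each inspirational quote in the 早安心语 (zaoanxinyu) section, scrape its URL and text, and download its image to local disk.

Define your own Item class in items.py. It must inherit from scrapy.Item.

Write your own spider class. It must inherit from scrapy.Spider; its parse() method extracts the page content and assembles it into the custom Item class.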
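
To make the extraction step concrete, here is a rough sketch of the same kind of traversal that parse() performs. It deliberately uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so it runs without Scrapy installed. The HTML fragment and the function name extract_entries are made up for illustration, and ElementTree only accepts well-formed markup and a small XPath subset. Unlike Scrapy's extract(), which returns lists, this sketch returns single values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the list page the spider expects.
SAMPLE = """
<html><body>
  <div class="listbox"><ul>
    <li>
      <a href="/zaoanxinyu/1.html"><img src="/uploads/allimg/180320/a.jpg"/></a>
      <h2><a href="/zaoanxinyu/1.html">Good morning</a></h2>
    </li>
    <li>
      <a href="/zaoanxinyu/2.html"></a>
    </li>
  </ul></div>
</body></html>
"""


def extract_entries(markup):
    """Collect url, title and img_src from every <li> that has a title."""
    root = ET.fromstring(markup.strip())
    entries = []
    for li in root.findall(".//div[@class='listbox']//li"):
        link = li.find("./a")
        image = li.find("./a/img")
        heading = li.find("./h2/a")
        if heading is None or not heading.text:
            continue  # like the spider, skip entries without a title
        entries.append({
            "url": link.get("href") if link is not None else None,
            "title": heading.text,
            "img_src": image.get("src") if image is not None else None,
        })
    return entries


print(extract_entries(SAMPLE))
```

Only the first list item is returned, because the second one has no title.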

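In pipelines.py, define how Item objects are processed and what output format is used. The pipeline class must also be registered in settings.py, in the ITEM_PIPELINES setting.

Each pipeline class is followed by a number (customarily in the 0-1000 range) that sets its execution order: the lower the number, the earlier the pipeline runs.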
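
As a sketch of what that registration might look like, the fragment below assumes the project package is named spiderOne and the pipeline class is SpideronePipeline, as in the source code at the end of this post. The second entry is a hypothetical pipeline added only to show how the numbers order execution; it does not exist in the project.

```python
# settings.py (fragment). The second entry is a hypothetical pipeline used
# only to illustrate ordering; it is not part of the original project.
ITEM_PIPELINES = {
    'spiderOne.pipelines.SpideronePipeline': 300,
    'spiderOne.pipelines.HypotheticalDedupPipeline': 100,
}

# Illustration only, not needed in a real settings file:
# pipelines run from the lowest number to the highest.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```

With these values the hypothetical deduplication pipeline would see each item before the JSON-writing pipeline does.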

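3. Run and check the results

cd into the project directory and run scrapy crawl spiderName (spiderName is the value of name defined in the spider class).

If the run finishes without errors, a JSON file is generated in the project and the images are downloaded into the downloads directory.

4. Source code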

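# items.py
import scrapy


class MyItem(scrapy.Item):
    # Title text of the entry.
    title = scrapy.Field()
    # Link to the entry's page.
    url = scrapy.Field()
    # src attribute of the entry's image.
    img_src = scrapy.Field()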

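# Spider module; it belongs in the project's spiders directory.
import os

import scrapy

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

from spiderOne.items import MyItem


class SpiderOne(scrapy.Spider):
    """A simple example spider."""

    # Name of the spider; used when running "scrapy crawl".
    name = 'zaoanxinyu'
    # Domains the spider is allowed to visit.
    allowed_domains = ['gxdxw.cn']
    # List of page URLs to crawl.
    start_urls = [
        'http://www.gxdxw.cn/zaoanxinyu/list_29_1.html',
        'http://www.gxdxw.cn/zaoanxinyu/list_29_2.html',
        'http://www.gxdxw.cn/zaoanxinyu/list_29_3.html',
        'http://www.gxdxw.cn/zaoanxinyu/list_29_4.html',
        'http://www.gxdxw.cn/zaoanxinyu/list_29_5.html',
    ]

    def parse(self, response):
        """Extract the URL, title and image of each entry on a list page."""
        for li in response.xpath('//div[@class="listbox"]//li'):
            url = li.xpath('./a/@href').extract()
            img_src = li.xpath('./a/img/@src').extract()
            title = li.xpath('./h2/a/text()').extract()

            # Skip list items that have no title.
            if len(title) > 0:
                item = MyItem()
                item['url'] = url
                item['title'] = title
                item['img_src'] = img_src
                if len(img_src) > 0:
                    self.downloadImg(img_src[0])

                yield item

    def downloadImg(self, img_src):
        """Download one image into downloads/uploads/allimg/ under the working directory."""
        self.logger.info('Downloading image: %s', img_src)
        filePath = os.getcwd() + '/downloads/uploads/allimg/'

        if not os.path.exists(filePath):
            os.makedirs(filePath)

        # The file name is the last path segment of img_src.
        fileName = os.path.basename(img_src)
        filePath += fileName

        # img_src is a site-relative path, so prepend the host.
        imgUrl = 'http://www.gxdxw.cn' + img_src

        # Note: urlretrieve blocks until the file has been downloaded.
        urlretrieve(imgUrl, filePath)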
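
The download helper in the spider turns an image's src attribute into an absolute URL and a local file name. Below is a minimal sketch of that step using only the Python 3 standard library. The function name image_url_and_name and the sample src value are assumptions for illustration; the page URL is the first entry of start_urls.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

# A list-page URL from start_urls and an assumed (hypothetical) src attribute.
PAGE_URL = 'http://www.gxdxw.cn/zaoanxinyu/list_29_1.html'
IMG_SRC = '/uploads/allimg/180320/example.jpg'


def image_url_and_name(page_url, img_src):
    """Resolve img_src against the page URL and return (absolute_url, file_name)."""
    absolute_url = urljoin(page_url, img_src)
    # The file name is the last segment of the URL path.
    file_name = posixpath.basename(urlsplit(absolute_url).path)
    return absolute_url, file_name


print(image_url_and_name(PAGE_URL, IMG_SRC))
```

Resolving against the page URL also handles src values that are relative to the current page or already absolute, which plain string concatenation with the host name does not.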

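# pipelines.py
import json


class SpideronePipeline(object):
    """Write every item to zaoanxinyu.json, one object per line."""

    def __init__(self):
        # Binary mode, because process_item writes UTF-8 encoded bytes.
        self.file = open('zaoanxinyu.json', 'wb')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        # Each object ends with ',\n', so the file as a whole is not one valid JSON document.
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.file.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()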
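
Because the pipeline ends every object with a comma and a newline, the output file cannot be passed to json.load() as a whole. One possible way to read it back is to parse it line by line, as sketched below. The function name load_items and the sample text are hypothetical; the sample only mimics the shape the pipeline writes.

```python
import json


def load_items(text):
    """Parse lines of the form '{...},' into a list of dicts."""
    items = []
    for line in text.splitlines():
        line = line.strip().rstrip(',')  # drop the trailing comma the pipeline adds
        if line:
            items.append(json.loads(line))
    return items


# Hypothetical sample in the shape the pipeline writes.
sample = (
    '{"url": ["/a.html"], "title": ["早安"]},\n'
    '{"url": ["/b.html"], "title": ["你好"]},\n'
)
print(load_items(sample))
```

For a real run, read the file with open('zaoanxinyu.json', encoding='utf-8') and pass its contents to load_items.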