Table of Contents
1. How Scrapy Works
In short, the Engine takes requests from the Spider, queues them through the Scheduler, sends them through the downloader middlewares to the Downloader, and passes the responses back through the spider middlewares to the Spider; the Items the Spider yields are finally handed to the Item Pipelines.
2. Installing Scrapy
Option 1: install directly with pip from the command line:
pip install scrapy
Option 2: download the package first, then install it:
pip download scrapy -d ./
Change into the download directory and run the following command to install:
pip install Scrapy-2.5.1-py2.py3-none-any.whl
3. Scrapy Commands
1. Global commands: can be used anywhere.
scrapy version # show the Scrapy version
scrapy version -v # show the versions of Scrapy and its dependencies
scrapy -h # show help (running plain scrapy does the same)
scrapy --help # show help (long form)
2. Project commands: must be run inside a Scrapy project.
scrapy startproject demo # create a new project (strictly speaking this one works anywhere)
scrapy -h # list available commands, including the project commands
scrapy bench # run a quick benchmark
scrapy list # list the spiders in the project
scrapy crawl spider_name # run a spider
scrapy view https://www.baidu.com # download the page and open it in a browser as Scrapy sees it
4. The Interactive Shell
Request Baidu and Scrapy drops you into an interactive shell with a ready-made response object.
scrapy shell https://www.baidu.com
Inspect the response:
response.url                                 # the URL that was fetched
response.status                              # HTTP status code
response.body[:50]                           # first 50 bytes of the raw body
response.css("title::text")                  # a SelectorList
response.css("title::text").extract_first()  # the extracted title text
5. Creating a Project
(1) Create a new project.
scrapy startproject demo
File | Description |
---|---|
items.py | Defines the Items, i.e. the structure of the scraped data |
middlewares.py | Defines the Spider and Downloader middleware implementations |
pipelines.py | Defines the Item Pipeline implementations, i.e. the data processing channel |
settings.py | Defines the project's global configuration |
spiders/ | Contains the Spider implementations; each Spider gets its own file |
scrapy.cfg | Scrapy deployment configuration file; defines the path to the settings file, deployment info, etc. |
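For reference, the directory layout that scrapy startproject demo generates looks roughly like this (details can vary slightly between Scrapy versions):
demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py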
(2) Generate a spider with genspider. A project can contain multiple spiders, but each name must be unique.
scrapy genspider name domain
# scrapy genspider baidu baidu.com
scrapy list # list the spiders in the current project
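The generated spider is only a skeleton; for scrapy genspider baidu baidu.com it looks roughly like this (the exact template depends on the Scrapy version):
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass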
(3) Run a spider with the runspider command: it runs a spider file directly, without loading the rest of the project.
scrapy runspider myspider.py # runspider takes the spider's .py file, not its name
(4) Run a spider with the crawl command.
scrapy crawl spider_name
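crawl can also export whatever the spider yields via the -o option; for example, with the dang spider from section 9 (the output format is inferred from the file extension):
scrapy crawl dang -o books.json
scrapy crawl dang -o books.csv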
6. Selectors
(1) Three ways to select: XPath, CSS selectors, and regular expressions.
from scrapy import Selector
context = "<html><head><title>Test html</title></head><body><h3 class='center'>Hello World!</h3></body></html>"
selector = Selector(text=context)
selector.xpath("//title/text()").extract_first()   # XPath with the older extract_first()
selector.xpath("//title/text()").get()             # XPath with get()
selector.xpath("//h3[@class='center']/text()").get()
selector.css("title::text").get()                  # CSS selector
selector.css("h3.center::text").get()
selector.re_first(r"<title>(.*?)</title>")         # regular expression
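get()/extract_first() return only the first match; to collect every match, getall() (or the older extract()) returns a list:
selector.css("h3.center::text").getall()   # list of all matching text nodes
selector.xpath("//h3/text()").extract()    # extract() is the older alias for getall()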
(2) Testing in the interactive shell.
scrapy shell https://www.baidu.com
response.url
response.selector.xpath("//title/text()").get()
response.selector.css("title::text").get()
response.selector.re_first("<title>(.*?)</title>")
alist = response.xpath("//a")
for vo in alist:
    print(vo.css("::attr(href)").get(), ":", vo.css("::text").get())
    # print(vo.xpath("./@href").get(), ":", vo.xpath("./text()").get())
Note: for regular expressions the selector cannot be skipped (there is no response.re_first shortcut); for xpath and css you can call them directly on the response.
7. Using Spiders
(1) In Scrapy, the links to crawl, the crawling logic, and the parsing logic are all configured in the Spider.
(2) A Spider has two jobs: define the actions for crawling the site, and parse the pages it downloads.
Method 1 (follow the pagination inside parse):
import scrapy
from tqdm import tqdm

class VillaSpider(scrapy.Spider):
    name = 'villa'
    allowed_domains = ['3d.qingmo.com', 'img.qingmo.com']
    start_urls = ['https://3d.qingmo.com/so/%E5%88%AB%E5%A2%85-1-6-0-0-0.html']
    page = 1
    img_count = 0

    def parse(self, response):
        imgs = response.css("div.box")
        print(f"=============== Crawling page {self.page} ===============")
        if self.page < 5:
            for img in tqdm(imgs):
                # request each image and save it in downloadImg
                yield scrapy.Request(url=img.css("img::attr(data-src)").get(), callback=self.downloadImg)
            self.page += 1
            next_url = f'so/%E5%88%AB%E5%A2%85-{self.page}-6-0-0-0.html'
            url = response.urljoin(next_url)
            yield scrapy.Request(url=url, callback=self.parse)

    def downloadImg(self, response):
        self.img_count += 1
        with open(f'./images/{self.img_count}-{self.page}.jpg', 'wb') as f:
            f.write(response.body)
Method 2 (issue the first request from start_requests):
import scrapy
from tqdm import tqdm

class Villa2Spider(scrapy.Spider):
    name = 'villa2'
    allowed_domains = ['3d.qingmo.com', 'img.qingmo.com']
    page = 1
    img_count = 0

    def start_requests(self):
        next_url = f'https://3d.qingmo.com/so/%E5%88%AB%E5%A2%85-{self.page}-6-0-0-0.html'
        yield scrapy.Request(url=next_url, callback=self.myParse)

    def myParse(self, response):
        imgs = response.css("div.box")
        for img in tqdm(imgs):
            yield scrapy.Request(url=img.css("img::attr(data-src)").get(), callback=self.downloadImg)

    def downloadImg(self, response):
        self.img_count += 1
        with open(f'./images/{self.img_count}.jpg', 'wb') as f:
            f.write(response.body)
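Note that both Method 1 and Method 2 write into ./images and assume that directory already exists; a minimal guard (not part of the original spiders) is to create it before the spider runs, e.g. at the top of the spider module:
import os

os.makedirs('./images', exist_ok=True)  # create the output directory if it does not exist yet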
Method 3 (submitting a POST request):
import scrapy
import json

class YoudaoSpider(scrapy.Spider):
    name = 'youdao'
    allowed_domains = ['fanyi.youdao.com']
    # start_urls = ['http://fanyi.youdao.com/']

    def start_requests(self):
        url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
        kw = input("Enter the word to translate: ")
        data = {'i': kw, 'doctype': 'json'}
        # FormRequest is how Scrapy sends a POST (form) request
        yield scrapy.FormRequest(
            url=url,
            formdata=data,
            callback=self.parse
        )

    def parse(self, response):
        res = json.loads(response.body)
        self.logger.info(f'result: {res["translateResult"][0][0]["tgt"]}')
        input("Press Enter to continue")
8. Proxies via Downloader Middleware
Configure a pool of proxy servers in settings.py and enable the downloader middleware.
IPPOOL = [
    {"ipaddr": 'http://0.0.0.0:00'},
    {"ipaddr": 'http://0.0.0.0:00'},
    {"ipaddr": 'http://0.0.0.0:00'},
]
DOWNLOADER_MIDDLEWARES = {
    'middletest.middlewares.MiddletestDownloaderMiddleware': 543,
}
Write the proxy middleware class in middlewares.py.
from scrapy import signals
import random
from .settings import IPPOOL

class MiddletestDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # pick a random proxy from the pool for every outgoing request
        ip = random.choice(IPPOOL)['ipaddr']
        request.meta['proxy'] = ip
        return None

    def process_response(self, request, response, spider):
        # must return the response (or a new Request) so processing continues
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
9. Using Pipelines
(1) Filtering data.
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}
from scrapy.exceptions import DropItem

class DangdangPipeline:
    def process_item(self, item, spider):
        '''Keep only items with the expected price; drop everything else.'''
        if item.get('price') == '¥45.40':
            return item
        else:
            raise DropItem("Missing price")
(2) Custom image storage: set the storage path and write the pipeline class.
IMAGES_STORE = './images'
ITEM_PIPELINES = {
    'dangdang.pipelines.MyImagesPipeline': 301,
    # 'scrapy.pipelines.images.ImagesPipeline': 303,  # Scrapy's built-in image storage pipeline
}
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule a download request for the image URL stored in the item
        yield scrapy.Request(item['pic'])

    def file_path(self, request, response=None, info=None):
        # save each image under its original file name
        url = request.url
        file_name = url.split("/")[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['pic'] = image_paths
        return item
(3) Saving data to a database: configure the connection settings and write the pipeline class.
DATABASES = {
    'default': {
        'ENGINE': 'mysql',
        'NAME': 'scrapy',
        'USER': 'root',
        'PASSWORD': '123456+',
        'HOST': 'localhost',
        'PORT': 3306,
    }
}
ITEM_PIPELINES = {
    'dangdang.pipelines.MysqlPipeline': 302,
}
import pymysql

class MysqlPipeline:
    def __init__(self, host, user, password, database, port):
        self.host = host
        self.user = user
        self.password = password
        self.database = database
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings defined in settings.py
        return cls(
            host=crawler.settings.get('DATABASES')['default']['HOST'],
            user=crawler.settings.get('DATABASES')['default']['USER'],
            password=crawler.settings.get('DATABASES')['default']['PASSWORD'],
            database=crawler.settings.get('DATABASES')['default']['NAME'],
            port=crawler.settings.get('DATABASES')['default']['PORT'],
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            passwd=self.password,
            database=self.database,
            charset='utf8',
        )
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # use a parameterized query so quotes in the data cannot break the SQL
        sql = 'insert into dangdang(title,pic,author,publish,price,comment) values (%s,%s,%s,%s,%s,%s)'
        self.cursor.execute(sql, (item["title"], item["pic"], item["author"],
                                  item["publish"], item["price"], item["comment"]))
        self.db.commit()
        return item
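The insert above assumes a dangdang table already exists in the scrapy database; a one-off helper to create it might look like this (the column types are my assumptions, adjust them to your data):
import pymysql

# one-off script: create the table that MysqlPipeline writes into (column types are assumed)
db = pymysql.connect(host='localhost', port=3306, user='root',
                     passwd='123456+', database='scrapy', charset='utf8')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS dangdang (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        pic VARCHAR(500),
        author VARCHAR(100),
        publish VARCHAR(100),
        price VARCHAR(20),
        comment VARCHAR(50)
    ) DEFAULT CHARSET = utf8
""")
db.commit()
db.close()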
(4) The Item class defining the fields.
import scrapy

class DangdangItem(scrapy.Item):
    title = scrapy.Field()
    pic = scrapy.Field()
    author = scrapy.Field()
    publish = scrapy.Field()
    price = scrapy.Field()
    comment = scrapy.Field()
    # image_urls = scrapy.Field()
(5) The spider class.
import scrapy
from dangdang.items import DangdangItem

class DangSpider(scrapy.Spider):
    name = 'dang'
    allowed_domains = ['search.dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']
    page = 1
    all_pages = 5

    def parse(self, response):
        dlist = response.selector.css("ul.bigimg li")
        for d in dlist:
            item = DangdangItem()
            # item['title'] = d.css("a::attr(title)").get()
            item['title'] = d.xpath(".//a/@title").get()
            item['pic'] = d.css("a img::attr(data-original)").get()
            item['pic'] = response.urljoin(item['pic'])
            item['author'] = d.xpath(".//p[@class='search_book_author']//span[1]//a[1]/text()").get()
            item['publish'] = d.xpath(".//p[@class='search_book_author']//span[3]/a/text()").get()
            item['price'] = d.css("p.price span::text").get()
            item['comment'] = d.css("p.search_star_line a::text").get()
            # item['image_urls'] = [item['pic']]
            yield item
        next_url = response.selector.css("li.next a::attr(href)").get()
        print("=" * 50)
        print(next_url)
        print("=" * 50)
        if next_url and (self.page < self.all_pages):
            self.page += 1
            url = response.urljoin(next_url)
            yield scrapy.Request(url=url, callback=self.parse)
10. Logging
(1) The following command writes the log output to access.log.
scrapy crawl spider_name -s LOG_FILE=access.log
(2) Logging settings. Scrapy logs through Python's logging module, and the behaviour can be configured in settings.py with the following settings:
Setting | Description |
---|---|
LOG_ENABLED | Default: True. Enables logging. |
LOG_ENCODING | Default: 'utf-8'. Encoding used for logging. |
LOG_FILE | Default: None. File name for the log output, created in the current directory. |
LOG_LEVEL | Default: 'DEBUG'. Minimum level to log. |
LOG_STDOUT | Default: False. If True, all standard output (and errors) of the process is redirected to the log. |
LOG_FILE = "access.log"
LOG_LEVEL = "INFO"
Scrapy provides five logging levels:
CRITICAL # critical errors
ERROR    # regular errors
WARNING  # warning messages
INFO     # informational messages
DEBUG    # debugging messages
(3) Logging a message at the WARNING level. (The old scrapy.log module with log.msg() has been removed; use Python's standard logging module instead.)
import logging
logging.warning("This is a warning")
(4) Using the logger inside a spider.
def my_parse(self, response):
    lilist = response.selector.css("li.gl-item")
    self.logger.info(len(lilist))