Scrapy Framework Study Notes 2

Latest recommended article published on 2023-03-29 11:10:58

闻DD

Category: python  Tags: python scrapy

Article link: https://blog.csdn.net/deepexpert_wendongdi/article/details/47003997

These notes record the problems I ran into while learning Scrapy:

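1. Our custom spider only covers extracting items from the response that comes back. If an exception occurs between sending the request and receiving the response, we need to record it:

First, note that when Scrapy hits such an exception it retries the page, provided retries are enabled. So we can define our own RetryMiddleware: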

# Custom downloader middleware that records pages for which no response came back
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.exceptions import NotConfigured


class myre(RetryMiddleware):
    def __init__(self, settings):
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')
        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))
        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST')
        # Opened in binary mode, so each record is encoded before it is written
        self.errfile = open('C:/python/errurl.txt', 'wb')

    def process_exception(self, request, exception, spider):
        # Write "url<TAB>exception text"; this method itself never schedules a retry
        self.errfile.write((request.url + '\t' + str(exception) + '\r\n').encode('utf-8'))
        return None

Here we override process_exception(): instead of retrying, it records the failing URL and the exception text.
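
To make the record format concrete, here is a minimal sketch of the line that process_exception() writes. The helper name format_error_record and the example URL are made up for illustration, and io.BytesIO stands in for the file opened in 'wb' mode, so this runs without Scrapy:

```python
import io


def format_error_record(url, exception):
    """Build one log record: URL, a tab, the exception text, CRLF, as UTF-8 bytes."""
    return (url + '\t' + str(exception) + '\r\n').encode('utf-8')


# io.BytesIO is only a stand-in for the binary-mode error file
errfile = io.BytesIO()
errfile.write(format_error_record('http://example.com/a', ValueError('connection lost')))
```

Because the file is opened in binary mode, the record has to be encoded to bytes before writing, which is why the middleware calls .encode('utf-8').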

The custom middleware also has to be registered in the settings file:

RETRY_ENABLED = True  # allow retries
RETRY_TIMES = 1  # number of retries
DOWNLOADER_MIDDLEWARES = {
    'week1.spiders.myter.myre': 345,  # custom downloader middleware; the last component must match the class name
}

2. When reading a very large file, read it line by line:

f3 = open('C:/python/dojieba.txt', 'w')
for line in done:  # iterating a file object yields one line at a time

or

for i in range(5000):  # trial run on the first 5000 pages
    url = f.readline().split('\t')[0].strip()
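
Both fragments can be combined into one lazy loop. The sketch below is only an illustration: the function name first_urls, the sample file contents and the limit of 2 are assumptions, not part of the original project. It uses itertools.islice to cap the number of lines, which also stops cleanly at end of file, whereas readline() inside range(5000) keeps returning an empty string once the file is exhausted.

```python
import itertools
import os
import tempfile


def first_urls(path, limit):
    """Return the first tab-separated column of at most `limit` lines, read lazily."""
    urls = []
    with open(path, encoding='utf-8') as f:
        # islice pulls lines one at a time and stops at `limit` or at end of file
        for line in itertools.islice(f, limit):
            urls.append(line.split('\t')[0].strip())
    return urls


# Small throwaway file standing in for the large URL list
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('http://a.example\tfoo\nhttp://b.example\tbar\nhttp://c.example\tbaz\n')
    sample_path = tmp.name

sample_urls = first_urls(sample_path, 2)
all_urls = first_urls(sample_path, 5000)  # limit larger than the file: stops at end of file
os.remove(sample_path)
```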

3. Scrapy filters out responses with some common HTTP error statuses. To receive them in the spider:

class Spider1(Spider):
    handle_httpstatus_list = [301, 302, 307, 404, 502, 504]  # common error statuses the spider handles itself
    # other code...

    def parse(self, response):
        item = Week1Item()  # fill in the item
        url = response.request.url
        if 'redirect_urls' in response.meta:  # if the page was redirected, take the original URL; request.meta is also available as response.meta
            url = response.request.meta['redirect_urls'][0]
        item['url_status'] = url + '\t' + str(response.status) + '\r\n'
        if response.status != 200:  # do not process failed pages any further
            return item

We handle these errors in our custom spider; the common ones such as 404 and 502 are written out to a file by the pipeline:

class Week1Pipeline(object):
    def __init__(self):
        self.file = open('C:/python/results.txt', 'wb')  # successful results
        self.errfile = open('C:/python/notfound_url.txt', 'wb')  # pages that returned an error status

    def process_item(self, item, spider):
        # url_status has the form 'url\tstatus\r\n'; int() ignores the trailing CRLF
        if int(item['url_status'].split('\t')[1]) != 200:
            self.errfile.write(item['url_status'].encode('utf-8'))
            return item
        line = item['url'] + '\t' + item['content'] + '\r\n'
        self.file.write(line.encode('utf-8'))
        return item

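For sites that redirect, the original URL is kept in request.meta, and Scrapy exposes the same meta on the response, so we can recover the original URL through response.meta. (meta['redirect_urls'] is a list; each redirect appends the URL that was requested before that redirect.)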
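
The URL and status handling can be tried out without Scrapy by using a plain dict in place of response.meta. The sketch below makes that assumption; the function names original_url, make_url_status and parse_status and the example URLs are made up for illustration.

```python
def original_url(request_url, meta):
    """Return the first URL in meta['redirect_urls'] if present, else the request URL."""
    redirect_urls = meta.get('redirect_urls')
    if redirect_urls:
        return redirect_urls[0]
    return request_url


def make_url_status(url, status):
    """Build the 'url<TAB>status<CRLF>' string stored in item['url_status']."""
    return url + '\t' + str(status) + '\r\n'


def parse_status(url_status):
    """Recover the integer status; int() tolerates the trailing CRLF."""
    return int(url_status.split('\t')[1])


# A plain dict stands in for response.meta after two redirects
meta = {'redirect_urls': ['http://old.example/', 'http://mid.example/']}
record = make_url_status(original_url('http://new.example/', meta), 404)
```

Index 0 of the list is the URL that was requested first, which is why the spider reads redirect_urls[0] rather than the last entry.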

4. When the part of the page we want is empty, we still need to fill in the item:

if head != []:  # guard against an empty <head>
    headtext = head[0].extract()
else:
    item['url'] = url
    item['content'] = 'EmptyHead'
    return item
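
The same branch can be written as a small standalone function. This is only a sketch under assumptions: head is a plain list of already extracted strings rather than a Scrapy selector list, a dict stands in for the item, and the name head_text_or_placeholder is made up.

```python
def head_text_or_placeholder(head, url):
    """Return (headtext, None) when <head> was found, else (None, placeholder item)."""
    if head:
        return head[0], None
    # Empty <head>: still produce a complete item so the pipeline can write a line for this URL
    return None, {'url': url, 'content': 'EmptyHead'}


found = head_text_or_placeholder(['<head><title>t</title></head>'], 'http://a.example/')
missing = head_text_or_placeholder([], 'http://b.example/')
```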

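5. Some sites declare an unusual encoding and need special handling. One example is the invalid charset name gbk2312. It makes the page text come out garbled, as a unicode string that actually holds GBK-encoded bytes (u'<GBK bytes>'). In that state the text can neither be encoded as GBK nor decoded to unicode. The following step strips the unicode wrapping so the bytes can be decoded properly: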

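import re

p_bm = re.compile('charset=.*?gbk2312')  # detect the bogus gbk2312 charset declaration
bm = re.findall(p_bm, headtext)
if bm != []:
    # Python 2 only ('string_escape' codec): turn the unicode object back into raw bytes, then decode them as GBK
    headtext = headtext.encode('unicode-escape').decode('string_escape').decode('gbk')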
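
The 'string_escape' codec exists only in Python 2. A rough Python 3 equivalent is sketched below, assuming the garbled text holds each GBK byte as one character below code point 256, which is what the round trip above implies. The function name repair_gbk and the sample markup are made up for illustration.

```python
import re

BOGUS_CHARSET = re.compile(r'charset=.*?gbk2312')


def repair_gbk(headtext):
    """Re-decode text whose characters are really GBK bytes, if the bogus charset is declared."""
    if BOGUS_CHARSET.search(headtext):
        # latin-1 maps code points 0-255 straight back to the original bytes
        return headtext.encode('latin-1').decode('gbk')
    return headtext


# Simulate the problem: GBK bytes that were mis-decoded one byte per character
expected = '<meta charset=gbk2312><title>中文</title>'
broken = expected.encode('gbk').decode('latin-1')
repaired = repair_gbk(broken)
```

If the text contains any character above code point 255, encode('latin-1') raises UnicodeEncodeError, so this only applies to the byte-per-character case described above.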