
Scrapy: rotating the User-Agent at random

First, add a pool of user agents at the end of settings.py, for example:

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
]

Then, in middlewares.py, comment out the generated boilerplate and define a custom middleware class that picks a random user agent for each request:

import random


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Choose a random user agent from the pool defined in settings.py
        ua = random.choice(spider.settings.get('USER_AGENTS'))
        request.headers['User-Agent'] = ua


class Middleware(object):
    def process_response(self, request, response, spider):
        # Pass the response through unchanged
        return response

Finally, uncomment DOWNLOADER_MIDDLEWARES in settings.py and replace the default entry with the custom middleware, as in the sketch below.
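
A minimal sketch of that setting, assuming the Scrapy project package is named myproject (substitute your own project name); the priority value 543 is the one Scrapy's template uses by default and can be adjusted:

DOWNLOADER_MIDDLEWARES = {
    # Optionally disable Scrapy's built-in user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the custom random user-agent middleware from middlewares.py
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
}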

Importing the scraped data into MongoDB

Add the database settings at the end of settings.py, for example:

MONGODB_HOST = '127.0.0.1'                 # database host, port, name, and collection
MONGODB_PORT = 27017
MONGODB_NAME = '<database name>'
MONGODB_COLLECTIONS = '<collection name>'

Then add the following code to pipelines.py:

from scrapy.utils.project import get_project_settings   # gives access to the values in settings.py
import pymongo


class Pipeline(object):
    def __init__(self):
        settings = get_project_settings()
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_NAME']
        client = pymongo.MongoClient(host=host, port=port)   # connect to MongoDB
        tdb = client[dbname]                                  # select (or lazily create) the database
        self.post = tdb[settings['MONGODB_COLLECTIONS']]      # select the collection

    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)            # insert the item as a document
        return item

Finally, uncomment ITEM_PIPELINES in settings.py and register the custom pipeline, as shown below.
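
A minimal sketch, again assuming the project package is named myproject and the class keeps the name Pipeline used above; 300 is the template's default priority:

ITEM_PIPELINES = {
    'myproject.pipelines.Pipeline': 300,
}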
