This post assumes some crawling experience — anti-crawler mechanisms are only interesting once you have actually run into them. If you are new to crawling, read my previous four hands-on posts first. The project uses Python 3.6 and the Scrapy framework, and crawls the American TV Top 100 listing at http://www.meijutt.com/new100.html.
The project's directory structure:

meiju100/
    meiju100/
        middlewares/
            __init__.py
            customMiddlewares.py
            resource.py
            userAgents.py
        spiders/
            __init__.py
            meiju100Spider.py
        __init__.py
        items.py
        pipelines.py
        settings.py
    scrapy.cfg

In the tree above, names without an extension are directories and names with an extension are files.
Follow the steps below to build the project up piece by piece; afterwards I will walk through the crawler attack-and-defense techniques it uses.
1. Decide what to crawl: items.py

# Define which fields to scrape
import scrapy

class Meiju100Item(scrapy.Item):
    storyId = scrapy.Field()
    storyName = scrapy.Field()
    storyState = scrapy.Field()
    tvStation = scrapy.Field()
    updateTime = scrapy.Field()
2. Define how to crawl: meiju100Spider.py

# Define how to crawl
import scrapy
from meiju100.items import Meiju100Item

class Meiju100Spider(scrapy.Spider):
    name = "meiju100Spider"
    allowed_domains = ['meijutt.com']  # must match the target site (meijutt.com, not meiju100.com)
    start_urls = ('http://www.meijutt.com/new100.html',)

    def parse(self, response):
        subSelector = response.xpath('//li/div[@class="lasted-num fn-left"]')
        items = []
        for sub in subSelector:
            item = Meiju100Item()
            item['storyId'] = sub.xpath('.//i/text()').extract()[0]
            item['storyName'] = sub.xpath('../h5/a/text()').extract()[0]
            item['storyState'] = sub.xpath('../span[@class="state1 new100state1"]/font/text()').extract()
            item['tvStation'] = sub.xpath('../span[@class="mjtv"]/text()').extract()
            item['updateTime'] = sub.xpath('../div[@class="lasted-time new100time fn-right"]/text()').extract()
            items.append(item)
        return items
If you are not yet comfortable with XPath selectors, see my earlier posts; I will not repeat here how each expression was derived.
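One caveat on the spider above: extract() returns a list, so indexing it with [0] raises IndexError whenever a selector matches nothing. A small helper (my own addition, not part of the original project) makes that failure mode explicit:

```python
def first_or_default(values, default=''):
    """Return the first extracted string, or a default when the
    selector matched nothing (instead of raising IndexError)."""
    return values[0] if values else default

# Inside parse() this would be used as:
#   item['storyId'] = first_or_default(sub.xpath('.//i/text()').extract())
print(first_or_default(['S01E05']))      # S01E05
print(first_or_default([], 'unknown'))   # unknown
```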
3. Save the crawled results: pipelines.py

# Save the scraped results
import time

class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y-%m-%d', time.localtime())
        fileName = today + '美剧.txt'
        with open(fileName, 'a', encoding='utf-8') as fp:
            fp.write("%s\t%s\t\t%s\t\t" % (item['storyId'], item['storyName'], item['storyState']))
            if len(item['tvStation']) == 0:
                fp.write("unknown\t\t")
            else:
                fp.write("%s\t\t" % (item['tvStation']))
            fp.write("%s\n" % (item['updateTime']))
        time.sleep(1)
        return item
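The line formatting inside process_item can be pulled out into a pure function so it is testable without running a crawl — a hypothetical refactor, not code from the original project:

```python
def format_line(item):
    """Build one tab-separated output line for a scraped item.
    Mirrors the write() calls in Meiju100Pipeline.process_item:
    tvStation falls back to 'unknown' when the selector matched nothing."""
    station = item['tvStation'] if item['tvStation'] else 'unknown'
    return "%s\t%s\t\t%s\t\t%s\t\t%s\n" % (
        item['storyId'], item['storyName'], item['storyState'],
        station, item['updateTime'])

line = format_line({'storyId': '12', 'storyName': '西部世界',
                    'storyState': ['连载中'], 'tvStation': [],
                    'updateTime': ['2017-06-25']})
```

The pipeline itself would then reduce to a single fp.write(format_line(item)).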
4. Wiring everything together: settings.py

BOT_NAME = 'meiju100'
SPIDER_MODULES = ['meiju100.spiders']
NEWSPIDER_MODULE = 'meiju100.spiders'

DOWNLOADER_MIDDLEWARES = {
    'meiju100.middlewares.customMiddlewares.CustomProxy': 10,
    'meiju100.middlewares.customMiddlewares.CustomUserAgent': 30,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 20,
}

ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 1,
}

DOWNLOAD_DELAY = 5
COOKIES_ENABLED = False
5. The middlewares: customMiddlewares.py

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from meiju100.middlewares.resource import UserAgents, PROXIES

class CustomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        ua = random.choice(UserAgents)
        request.headers.setdefault('User-Agent', ua)

class CustomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = 'http://%s' % proxy
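The rotation logic itself is plain Python and can be exercised without Scrapy. The sketch below (all names are mine, for illustration only) mimics what the two middlewares do to each outgoing request:

```python
import random

UserAgents = ['UA-1', 'UA-2', 'UA-3']        # stand-ins for real UA strings
PROXIES = ['1.2.3.4:8080', '5.6.7.8:3128']   # stand-ins for real proxies

class FakeRequest:
    """Minimal stand-in for scrapy's Request: a headers dict and a meta dict."""
    def __init__(self):
        self.headers = {}
        self.meta = {}

# Per request, CustomUserAgent sets a random UA and CustomProxy a random proxy:
req = FakeRequest()
req.headers.setdefault('User-Agent', random.choice(UserAgents))
req.meta['proxy'] = 'http://%s' % random.choice(PROXIES)
```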
6. Resources for the middlewares: resource.py
UserAgents = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
PROXIES = [
'139.224.237.33:8888',
'120.132.26.17:8080',
'120.132.26.30:8080',
'120.132.26.39:8080',
'61.184.185.68:3128',
'120.132.71.212:80'
]
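These proxies are bare host:port strings; CustomProxy prepends the scheme at request time. Note that free proxies like these go stale quickly, so the entries above will almost certainly be dead by the time you read this. A tiny validation helper (my own, hypothetical) can at least catch malformed entries before a crawl starts:

```python
def to_proxy_url(entry):
    """Validate a 'host:port' proxy entry and return a full proxy URL.
    Raises ValueError for entries without a numeric port."""
    host, sep, port = entry.rpartition(':')
    if not sep or not host or not port.isdigit():
        raise ValueError('bad proxy entry: %r' % entry)
    return 'http://%s:%s' % (host, port)

print(to_proxy_url('139.224.237.33:8888'))  # http://139.224.237.33:8888
```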
7. The configuration file: scrapy.cfg
[settings]
default=meiju100.settings
[deploy]
project=meiju100
8. How to run it
Open a command prompt and cd into the folder that contains scrapy.cfg (the top level of the directory tree above), then run: scrapy crawl meiju100Spider
Here meiju100Spider is the value of name = "meiju100Spider" in the Meiju100Spider class.
9. Crawler attack and defense
The anti-crawler measures you will run into most often:
(1) interval-based blocking
(2) Cookie-based blocking
(3) user-agent-based blocking
(4) IP-based blocking
All four are already handled in the project above. In detail:
(1) Beating interval-based blocking
DOWNLOAD_DELAY is the pause Scrapy leaves between two consecutive requests, in seconds. If anti-crawler measures were no concern, smaller would always be better — but a site administrator only needs to skim the access logs to spot requests arriving at machine speed and flag you as a crawler.
DOWNLOAD_DELAY = 5 in settings.py is what defeats interval-based blocking here.
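A perfectly regular 5-second interval is itself a fingerprint. Scrapy can jitter the delay, or manage the rate adaptively with its AutoThrottle extension; a settings.py sketch using standard Scrapy settings (these lines are my addition, not part of the original post):

```python
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True   # wait 0.5x-1.5x of DOWNLOAD_DELAY per request
AUTOTHROTTLE_ENABLED = True       # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 5     # initial delay before latency data comes in
```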
(2) Beating Cookie-based blocking
Sites identify users by their Cookies. If every request in a crawl carries the same Cookies, the site can tie all that traffic to one visitor, and your careful delay buys you nothing — it is no different from setting DOWNLOAD_DELAY to 0.1 seconds.
COOKIES_ENABLED = False in settings.py is what defeats Cookie-based blocking.
(3) Beating user-agent blocking
The user-agent is a browser's identity string; sites use it to tell browser types apart, and some reject requests whose user-agent does not look legitimate. The middleware package keeps a pool of user-agents in resource.py, and customMiddlewares.py picks one at random for every request.
(4) Beating IP blocking
Same idea as with the user-agent: the middleware package keeps a pool of proxy IPs. If you do not yet know how to harvest and validate proxy IPs, see "Scrapy爬虫实战三:获取代理" on this blog (http://blog.csdn.net/m0_37728157/article/details/72862240). customMiddlewares.py then picks a random proxy for every request.
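Free proxies die constantly, so a single random pick can still fail mid-crawl. One common refinement — my own sketch, not part of the original project, though process_request and process_exception are standard Scrapy downloader-middleware hooks — is to swap in a different proxy when a download errors out:

```python
import random

class RetryProxy(object):
    """Hypothetical downloader middleware: assign a proxy per request,
    and rotate to a different one when the download fails."""
    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        request.meta.setdefault('proxy', 'http://%s' % random.choice(self.proxies))

    def process_exception(self, request, exception, spider):
        bad = request.meta.get('proxy', '')
        candidates = ['http://%s' % p for p in self.proxies
                      if 'http://%s' % p != bad]
        if candidates:
            request.meta['proxy'] = random.choice(candidates)
        return request  # returning the request asks Scrapy to reschedule it

# Exercised with a minimal stand-in for scrapy's Request:
class FakeRequest:
    def __init__(self):
        self.meta = {}

mw = RetryProxy(['1.1.1.1:80', '2.2.2.2:80'])
req = FakeRequest()
mw.process_request(req, None)
first = req.meta['proxy']
mw.process_exception(req, Exception('timeout'), None)  # rotates the proxy
```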