It has been a long time since I last updated this blog. I have actually built quite a few things in the meantime, but I was always too lazy to sit down and organize my notes; today I finally talked myself into it. So with all that work done, what should I write about? Ever since I picked up Python, crawlers were the first thing I touched, and I have written many crawler posts before, but never one about Scrapy, the best-known crawler framework, so that is my starting point. I also studied someone's Scrapy-based crawler on GitHub (I can no longer find the repository, and that crawler is outdated and basically no longer works). The target site is excellent; I often look up linear algebra and probability videos there to study. So, standing on the shoulders of giants, I implemented downloading of the site's videos and their cover images, as follows:
Preparation
1. A VPN or proxy (the target site is not directly reachable everywhere)
2. Python 3.7 (setting up the development environment is not covered here)
3. A MongoDB instance to connect to
Background
Scrapy is an asynchronous processing framework built on Twisted: it runs in a single thread but handles many requests concurrently (16 by default). Its data flow is driven by the Engine and proceeds as follows:
1. The Engine opens a website, finds the Spider that handles it, and asks that Spider for the first URL to crawl.
2. The Engine gets the first URL from the Spider and schedules it as a Request via the Scheduler.
3. The Engine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL; the Engine forwards it through the Downloader Middlewares to the Downloader.
5. Once the page is downloaded, the Downloader generates a Response and sends it back through the Downloader Middlewares to the Engine.
6. The Engine receives the Response and sends it through the Spider Middlewares to the Spider for processing.
7. The Spider processes the Response and returns scraped Items and new Requests to the Engine.
8. The Engine passes the Items to the Item Pipeline and the new Requests to the Scheduler.
9. Steps 2-8 repeat until the Scheduler has no more Requests, then the Engine shuts down.
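The loop above can be sketched in plain Python. This is a toy simulation of the data flow, not Scrapy's actual implementation: a deque stands in for the Scheduler, a stub function for the Downloader, and a generator for the Spider callback, which yields either items or new requests.

```python
from collections import deque

# Toy downloader stub: pretend every page yields one item,
# and page1 additionally links to a next page.
def downloader(url):
    return {"url": url, "next": "page2" if url == "page1" else None}

# Toy spider callback: yields an item, and a new request when there is a next page.
def spider_callback(response):
    yield {"item_from": response["url"]}       # an Item -> goes to the Item Pipeline
    if response["next"]:
        yield {"request": response["next"]}    # a new Request -> goes to the Scheduler

def run_engine(start_url):
    scheduler, items = deque([start_url]), []
    while scheduler:                            # step 9: loop until the queue is empty
        url = scheduler.popleft()               # steps 3-4: take the next request
        response = downloader(url)              # steps 5-6: download, get a Response
        for result in spider_callback(response):  # steps 7-8: dispatch spider output
            if "request" in result:
                scheduler.append(result["request"])
            else:
                items.append(result)
    return items

print(run_engine("page1"))  # [{'item_from': 'page1'}, {'item_from': 'page2'}]
```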
Implementation
1. Create the project
# create the project folder
scrapy startproject pornhubBot
# cd into the project directory
# generate a spider; the name must differ from the project name, and the second argument is the site's domain
scrapy genspider pornhub pornhub.com
2. Define the Item
An Item is a container for scraped data and is used much like a dict. To create one, subclass scrapy.Item and declare fields of type scrapy.Field. Looking at the target site, the content we can extract includes:
import scrapy


class PornhubbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    video_title = scrapy.Field()     # video title
    image_urls = scrapy.Field()      # thumbnail download URL
    image_paths = scrapy.Field()     # local thumbnail path
    video_duration = scrapy.Field()  # video duration
    video_views = scrapy.Field()     # view count
    video_rating = scrapy.Field()    # rating / popularity rank
    link_url = scrapy.Field()        # video watch-page URL
    file_urls = scrapy.Field()       # list of segmented-video download URLs
    file_paths = scrapy.Field()      # list of local segmented-video paths
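A scrapy.Item behaves like a dict restricted to its declared fields: assigning a key that was not declared raises a KeyError. A rough stdlib stand-in to illustrate that behavior (this mimics the idea only, it is not scrapy's real implementation):

```python
class FakeItem(dict):
    """Toy stand-in mimicking scrapy.Item's declared-field restriction."""
    fields = {"video_title", "video_duration"}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s is not a declared field" % key)
        super().__setitem__(key, value)

item = FakeItem()
item["video_title"] = "Linear Algebra, Lecture 1"   # declared field: accepted
try:
    item["undeclared"] = 1                          # undeclared field: rejected
except KeyError as exc:
    print("rejected:", exc)
print(dict(item))  # items convert cleanly to a plain dict for storage
```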
3. Create the Spider
The Spider class is the core. It does two things: defines how the site is crawled, and parses the downloaded pages.
(1) Initialize Requests from the start URLs and set a callback. When a Request succeeds, a Response is generated and passed to that callback.
(2) In the callback, analyze the page content. Results come in two forms: parsed dicts or Item objects, which can be saved directly; or links to follow (e.g. the next page), from which new Requests with new callbacks are constructed and returned for later scheduling.
(3) If a dict or Item is returned, it can be written to a file via components such as Feed Exports, or processed and stored via a Pipeline if one is configured.
(4) If a Request is returned, then once it succeeds its Response is passed to the callback defined on that Request, where we can again use a selector (e.g. Selector) to analyze the new page and generate Items.
Looping through these steps crawls the entire site.
First, to build the initial URLs, we look at how the site categorizes its resources: pornhub sorts videos by hotness, total views, rating, and so on:
"""归纳PornHub资源链接"""
PH_TYPES = [
'',
'recommended',
'video?o=ht', # hot
'video?o=mv', # Most Viewed
'video?o=tr', # Top Rate
# Examples of certain categories
# 'video?c=1', # Category = Asian
# 'video?c=111', # Category = Japanese
]
Scrapy ships with its own extraction mechanism, the Selector, built on lxml; it supports XPath, CSS, and regular expressions, for example:
from scrapy import Selector

selector = Selector(response)
# XPath
title = selector.xpath('//a[@class="title"]/text()').extract_first()
# CSS
title = selector.css('a.title::text').extract_first()
# -*- coding: utf-8 -*-
import json
import logging
import re

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from pornhubBot.items import PornhubbotItem
from pornhubBot.pornhub_type import PH_TYPES


class PornhubSpider(CrawlSpider):
    name = 'pornhub'  # the spider's name, unique within the project
    allowed_domains = ['www.pornhub.com']  # domains the spider may crawl
    host = 'https://www.pornhub.com'
    start_urls = list(set(PH_TYPES))  # URL paths crawled at startup

    # silence the requests library's chatty logging and write our own log to a file
    logging.getLogger("requests").setLevel(logging.WARNING)
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%a, %d %b %Y %H:%M:%S',
        filename='cataline.log',
        filemode='w')
    # build the initial Requests and set their callback
    def start_requests(self):
        for ph_type in self.start_urls:
            yield Request(url='https://www.pornhub.com/%s' % ph_type,
                          callback=self.parse_ph_key)

    # parse a listing page: queue each video's watch page, then follow pagination
    def parse_ph_key(self, response):
        selector = Selector(response)
        logging.debug('request url:------>' + response.url)
        divs = selector.xpath('//div[@class="phimage"]')
        for div in divs:
            # href="...viewkey=******"; capture everything up to the closing quote
            viewkey = re.findall('viewkey=(.*?)"', div.extract())
            # request the watch page: the info we need is in its page source
            yield Request(url='https://www.pornhub.com/view_video.php?viewkey=%s' % viewkey[0],
                          callback=self.parse_ph_info)
        # find the Next button and extract its href, e.g. <a href="/video?o=ht&page=2" class="orangeButton">
        url_next = selector.xpath('//a[@class="orangeButton" and text()="Next "]/@href').extract()
        logging.debug(url_next)
        if url_next:
            logging.debug(' next page:---------->' + self.host + url_next[0])
            yield Request(url=self.host + url_next[0], callback=self.parse_ph_key)
    # parse a watch page into an Item
    def parse_ph_info(self, response):
        phItem = PornhubbotItem()
        selector = Selector(response)
        # the page embeds a 'var flashvars_NNN = {...};' JSON blob; capture the
        # object up to the ',' or ';' that ends the statement
        _ph_info = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', selector.extract())
        logging.debug('flashvars JSON:')
        logging.debug(_ph_info)
        _ph_info_json = json.loads(_ph_info[0])
        phItem['video_duration'] = _ph_info_json.get('video_duration')
        phItem['video_title'] = _ph_info_json.get('video_title')
        phItem['image_urls'] = _ph_info_json.get('image_url')
        phItem['link_url'] = _ph_info_json.get('link_url')
        phItem['file_urls'] = _ph_info_json.get('quality_480p')
        yield phItem
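Both regular expressions used by the spider can be checked in isolation. The HTML and JSON snippets below are fabricated stand-ins shaped like what the spider actually sees (the viewkey value and URLs are made up):

```python
import re
import json

# stand-in for one <div class="phimage"> block on a listing page
div_html = '<div class="phimage"><a href="/view_video.php?viewkey=ph5d1a2b3c4">thumb</a></div>'
viewkey = re.findall('viewkey=(.*?)"', div_html)
print(viewkey)  # ['ph5d1a2b3c4']

# stand-in for the inline 'var flashvars_NNN = {...};' blob on a watch page
page = ('var flashvars_12345 = {"video_duration": "600", "video_title": "demo", '
        '"link_url": "https://example.com/v", "image_url": "https://example.com/t.jpg", '
        '"quality_480p": "https://example.com/v.mp4"};\n')
raw = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', page)
info = json.loads(raw[0])
print(info['video_title'], info['video_duration'])  # demo 600
```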
4. Create the Item Pipelines
After the Spider parses a Response, the resulting Items are passed to the Item Pipeline, where the configured pipeline components are called in order to:
- clean HTML data
- validate the scraped data and check the fields
- detect and drop duplicates
- store the results in a database
import pymongo
from pymongo import IndexModel, ASCENDING
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline
from scrapy.pipelines.images import ImagesPipeline

from pornhubBot import items


# store items in MongoDB
class PornhubbotMongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["PornHub"]
        self.PhRes = db["PhRes"]
        # unique indexes to deduplicate records (a single index would also do)
        idx1 = IndexModel([('link_url', ASCENDING)], unique=True)
        idx2 = IndexModel([('video_title', ASCENDING)], unique=True)
        self.PhRes.create_indexes([idx1, idx2])
        # if your existing DB has duplicate records, refer to:
        # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

    # every pipeline must implement process_item
    def process_item(self, item, spider):
        print('MongoDBItem', item)
        # check the item type, then upsert into MongoDB
        if isinstance(item, items.PornhubbotItem):
            print('PornVideoItem True')
            try:
                # '$set' replaces the given fields, i.e. updates the record
                self.PhRes.update_one(
                    {'video_title': item['video_title']},
                    {'$set': dict(item)}, upsert=True)
            except Exception:
                pass
        return item
# media pipelines, see:
# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoThumbPipeline(ImagesPipeline):
    # customize the thumbnail path (and name); it is relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/thumb.jpg" % file_name  # directory named after the URL's last segment

    # after download, record the local thumbnail paths (IMAGES_STORE + relative path) on the item
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        item['image_paths'] = image_paths
        return item

    # take the thumbnail URL from the item and download it
    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'item': item})
# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoFilesPipeline(FilesPipeline):
    # take the segmented-video URLs from the item and download them
    def get_media_requests(self, item, info):
        yield Request(url=item['file_urls'], meta={'item': item})

    # customize the local video path (and name); it is relative to FILES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/%s.mp4" % (file_name, file_name)  # directory and file named after the URL's last segment

    # after download, record the local video paths (FILES_STORE + relative path) on the item
    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item
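The two file_path overrides simply derive a per-video directory from the last segment of the download URL. Extracted as plain functions they are easy to check (the URLs here are made up):

```python
def thumb_path(url):
    # mirrors VideoThumbPipeline.file_path: <last-segment>/thumb.jpg
    return "%s/thumb.jpg" % url.split('/')[-1]

def video_path(url):
    # mirrors VideoFilesPipeline.file_path: <last-segment>/<last-segment>.mp4
    name = url.split('/')[-1]
    return "%s/%s.mp4" % (name, name)

print(thumb_path("https://cdn.example.com/thumbs/abc123"))  # abc123/thumb.jpg
print(video_path("https://cdn.example.com/videos/abc123"))  # abc123/abc123.mp4
```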
Of course, the pipelines and their call order must also be registered in Settings; lower numbers are called first:
ITEM_PIPELINES = {
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
}
5. Create the Middlewares
pornhub's anti-crawling measures are not very strict, but without a proxy my IP was still banned after a few runs, so I use a proxy pool that I maintain myself. The usual crawler counter-measures, rotating the User-Agent, cookies, and proxies, all live here.
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random
import json
import logging
import requests
class UserAgentMiddleware(object):
    """Rotate the User-Agent"""
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))
class CookiesMiddleware(object):
    """Rotate the cookie"""
    cookie = {
        'platform': 'pc',
        'ss': '367701188698225489',
        'bs': '%s',
        'RNLBSERVERID': 'ded6699',
        'FastPopSessionRequestNumber': '1',
        'FPSRN': '1',
        'performance_timing': 'home',
        'RNKEY': '40859743*68067497:1190152786:3363277230:1'
    }

    def process_request(self, request, spider):
        # fill the 'bs' placeholder with a random 32-character lowercase string
        bs = ''
        for i in range(32):
            bs += chr(random.randint(97, 122))
        _cookie = json.dumps(self.cookie) % bs
        request.cookies = json.loads(_cookie)
class ProxyMiddleware(object):
    # a random working proxy is served at http://localhost:5555/random
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                return response.text
        except requests.ConnectionError:
            return False

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(proxy_url=settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        # request.meta is a plain dict; Scrapy sets 'retry_times' on retried
        # requests, so only retries are routed through the proxy
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('using proxy: ' + proxy)
                request.meta['proxy'] = uri
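The '%s' placeholder in CookiesMiddleware's cookie dict is worth a closer look: the dict is serialized to JSON, the random 32-letter session string is spliced in with %-formatting, and the result is parsed back into a dict. A standalone sketch of just that trick (the dict here is a trimmed-down version of the middleware's):

```python
import json
import random

cookie_template = {'platform': 'pc', 'bs': '%s'}

# 32 random lowercase letters, exactly as in process_request
bs = ''.join(chr(random.randint(97, 122)) for _ in range(32))
cookie = json.loads(json.dumps(cookie_template) % bs)

print(len(cookie['bs']), cookie['platform'])
```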
I keep the common User-Agents in Settings:
USER_AGENTS = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
"Mozilla/2.02E (Win95; U)",
"Mozilla/3.01Gold (Win95; I)",
"Mozilla/4.8 [en] (Windows NT 5.1; U)",
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
6. Build the Settings
A few things are configured here:
# -*- coding: utf-8 -*-
# Scrapy settings for pornhubBot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'pornhubBot'

SPIDER_MODULES = ['pornhubBot.spiders']
NEWSPIDER_MODULE = 'pornhubBot.spiders'

DOWNLOAD_DELAY = 1  # delay between requests
# LOG_LEVEL = 'INFO'  # log level
CONCURRENT_REQUESTS = 20  # default is 16
# CONCURRENT_ITEMS = 1
# CONCURRENT_REQUESTS_PER_IP = 1
REDIRECT_ENABLED = False

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pornhub (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# address serving a random proxy
PROXY_URL = 'http://localhost:5555/random'
# Scrapy's built-in Feed Exports support several serialization formats;
# utf-8 keeps Chinese readable in the output file
FEED_EXPORT_ENCODING = 'utf-8'
FEED_URI = u'/Users/chenyan/important/python_demo/pornhubBot/pornhub.csv'
FEED_FORMAT = 'csv'

# download directories for the media pipelines
IMAGES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
FILES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs
FILES_URLS_FIELD = 'file_urls'    # item field holding the file URLs
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# skip images that are too small
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
DOWNLOADER_MIDDLEWARES = {
    'pornhubBot.middlewares.UserAgentMiddleware': 401,
    'pornhubBot.middlewares.CookiesMiddleware': 402,
    'pornhubBot.middlewares.ProxyMiddleware': 403,
}
ITEM_PIPELINES = {
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
}
# By default Scrapy stores pending requests in a LIFO queue, which in short
# means depth-first order. The settings below switch to breadth-first order:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
USER_AGENTS = [
    # ... the same User-Agent list shown in section 5 ...
]
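The effect of the queue settings can be illustrated with a plain deque: popping from the tail (LIFO, Scrapy's default) crawls the most recently discovered request first, i.e. depth-first; popping from the head (FIFO, as configured above) crawls the oldest first, i.e. breadth-first:

```python
from collections import deque

urls = ["page1", "page2", "page3"]  # requests in discovery order

lifo = deque(urls)
lifo_order = [lifo.pop() for _ in range(len(urls))]       # default queues: newest first

fifo = deque(urls)
fifo_order = [fifo.popleft() for _ in range(len(urls))]   # FifoMemoryQueue: oldest first

print(lifo_order)  # ['page3', 'page2', 'page1']
print(fifo_order)  # ['page1', 'page2', 'page3']
```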
7. Add a quick-start script
from __future__ import absolute_import
from scrapy import cmdline
cmdline.execute("scrapy crawl pornhub".split())
Run this script and the crawl starts.
And with that, you can study advanced math offline!