NO.54: A Scrapy-Based Pornhub Crawler

        It has been a long while since my last post. I have actually built quite a few things in the meantime, but I kept putting off sitting down to write up my notes; today I finally talked myself into it. So what to write about? Crawlers were the first thing I touched after picking up Python, and I have written plenty of crawler posts before, but never one about Scrapy, the best-known crawling framework, so that is the starting point here. I also studied another author's Scrapy-based crawler on GitHub (I can no longer find the repository, and that crawler has long since stopped working anyway). The target site is a fine one; I often go there to find videos on linear algebra and probability theory to study. Standing on the shoulders of that giant, I implemented downloading of both the videos and their cover thumbnails, as described below.

 

 Prerequisites

1. A proxy/VPN that can reach the site

2. Python 3.7 (setting up the development environment is not covered here)

3. A running MongoDB instance

Theoretical Background

      Scrapy is an asynchronous processing framework built on Twisted. It is event-driven rather than thread-based, and by default handles up to 16 concurrent requests (the CONCURRENT_REQUESTS setting). Its data flow is controlled by the Engine and proceeds as follows:

1. The Engine opens a website, finds the Spider that handles it, and asks that Spider for the first URL to crawl.

2. The Engine gets the first URL from the Spider and schedules it as a Request via the Scheduler.

3. The Engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL; the Engine forwards it through the Downloader Middlewares to the Downloader.

5. Once the page is downloaded, the Downloader generates a Response and sends it back through the Downloader Middlewares to the Engine.

6. The Engine receives the Response and sends it through the Spider Middlewares to the Spider for processing.

7. The Spider processes the Response and returns scraped Items and new Requests to the Engine.

8. The Engine passes the returned Items to the Item Pipeline and the new Requests to the Scheduler.

9. Steps 2 to 8 repeat until the Scheduler has no more Requests, at which point the Engine shuts down.
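The loop described above can be sketched in plain Python; the downloader and spider are stubbed out here, and the page names are made up purely for illustration:

```python
from collections import deque

# Stub "spider": parse a page into items plus follow-up requests (step 7)
def parse(url):
    items = [{'page': url}]
    next_requests = ['page2'] if url == 'page1' else []
    return items, next_requests

def crawl(start_url):
    scheduler = deque([start_url])    # step 2: the first Request is scheduled
    collected = []
    while scheduler:                  # step 9: loop until the Scheduler is empty
        url = scheduler.popleft()     # steps 3-4: Scheduler hands the next Request to the Engine
        items, requests = parse(url)  # steps 5-7: download the page and let the Spider parse it
        collected.extend(items)       # step 8: Items go to the Item Pipeline
        scheduler.extend(requests)    # step 8: new Requests go back to the Scheduler
    return collected
```

The real Engine interleaves these steps asynchronously, but the bookkeeping is the same: a queue of pending Requests and a stream of Items flowing out.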

Project Walkthrough

1. Create the project

# create the project folder
scrapy startproject pornhubBot
# cd into the project directory
# create the spider; its name must not duplicate the project name,
# and the second argument is the site's domain
scrapy genspider pornhub pornhub.com

2. Define the Item

          An Item is a container for scraped data and is used much like a dict. To create one, subclass scrapy.Item and declare fields of type scrapy.Field. Looking at the target site, the information we can extract is:

import scrapy


class PornhubbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    video_title = scrapy.Field()  # video title
    image_urls = scrapy.Field()  # thumbnail download URL
    image_paths = scrapy.Field()  # local path of the downloaded thumbnail
    video_duration = scrapy.Field()  # video duration
    video_views = scrapy.Field()  # view count
    video_rating = scrapy.Field()  # rating / popularity
    link_url = scrapy.Field()  # online playback URL
    file_urls = scrapy.Field()  # download URLs of the video segments
    file_paths = scrapy.Field()  # local paths of the downloaded segments

   3. Write the Spider

          The Spider class is the core of the project. It does two things: defines how the site is crawled, and parses the pages that come back.

(1) Initialize Requests from the start URLs and set a callback. When a Request completes successfully, the resulting Response is passed to that callback.

(2) Inside the callback, parse the returned page. Results take two forms: dicts or Item objects, which can be saved directly; or follow-up links (such as the next page), from which new Requests with their own callbacks are built and returned for later scheduling.

(3) If dicts or Items are returned, they can be written out via components such as Feed Exports, or, if a Pipeline is configured, processed and stored there.

(4) If a Request is returned, then once it succeeds the Response is passed to the callback defined on that Request, where a selector (e.g. Selector) can again parse the new page and generate Items.

       Repeating these steps crawls the entire site.

       First, to build the start URLs, we look at how the site categorizes its resources: pornhub groups them by popularity, total views, rating, and so on:

"""归纳PornHub资源链接"""
PH_TYPES = [
    '',
    'recommended',
    'video?o=ht', # hot
    'video?o=mv', # Most Viewed
    'video?o=tr', # Top Rate

    # Examples of certain categories
    # 'video?c=1',  # Category = Asian
    # 'video?c=111',  # Category = Japanese
]
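Each entry is a path fragment appended to the site root, so the actual start URLs are built like this (deduplicated with a set, just as the spider below does):

```python
PH_TYPES = ['', 'recommended', 'video?o=ht', 'video?o=mv', 'video?o=tr']

# the empty string maps to the site's front page
start_urls = ['https://www.pornhub.com/%s' % t for t in set(PH_TYPES)]
```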

         Scrapy ships with its own extraction mechanism, the Selector, which is built on lxml and supports XPath, CSS, and regular expressions, e.g.:

from scrapy import Selector

selector = Selector(response)
# XPath
title = selector.xpath('//a[@href="image1.html"]/text()').extract_first()
# CSS
title = selector.css('a[href="image1.html"]::text').extract_first()

 

# -*- coding: utf-8 -*-
import logging
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from pornhubBot.items import PornhubbotItem
from pornhubBot.pornhub_type import PH_TYPES
from scrapy.http import Request
import re
import json

class PornhubSpider(CrawlSpider):
    name = 'pornhub'   # unique name within the project
    allowed_domains = ['www.pornhub.com']   # domains the spider may crawl
    host = 'https://www.pornhub.com'
    start_urls = list(set(PH_TYPES))   # URL paths crawled at startup
    logging.getLogger("requests").setLevel(logging.WARNING)  # silence requests' logs below WARNING
    logging.basicConfig(
        level=logging.DEBUG,
        format=
        '%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
        datefmt='%a, %d %b %Y %H:%M:%S',
        filename='cataline.log',
        filemode='w')

    # build the initial Requests and set their callback
    def start_requests(self):
        for ph_type in self.start_urls:
            yield Request(url='https://www.pornhub.com/%s' % ph_type, callback=self.parse_ph_key)

    # iterate over listing pages, yielding a Request per video
    def parse_ph_key(self, response):
        selector = Selector(response)
        logging.debug('request url:------>' + response.url)
        divs = selector.xpath('//div[@class="phimage"]')
        for div in divs:
            # href = "...viewkey=******"; capture everything up to the closing quote
            viewkey = re.findall('viewkey=(.*?)"', div.extract())
            # request the playback page: the info we need lives in its page source
            yield Request(url='https://www.pornhub.com/view_video.php?viewkey=%s' % viewkey[0],
                          callback=self.parse_ph_info)
        # find the Next button and extract its href, e.g. <a href="/video?o=ht&page=2" class="orangeButton">
        url_next = selector.xpath('//a[@class="orangeButton" and text()="Next "]/@href').extract()
        logging.debug(url_next)
        if url_next:
            logging.debug(' next page:---------->' + self.host + url_next[0])
            yield Request(url=self.host + url_next[0], callback=self.parse_ph_key)

    # parse a playback page into an Item
    def parse_ph_info(self, response):
        phItem = PornhubbotItem()
        selector = Selector(response)
        # the player config is embedded as: var flashvars_<id> = {...},
        # so grab the JSON lazily, up to the ',' or ';' that ends the line
        _ph_info = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', selector.extract())
        logging.debug('flashvars JSON:')
        logging.debug(_ph_info)
        _ph_info_json = json.loads(_ph_info[0])
        phItem['video_duration'] = _ph_info_json.get('video_duration')
        phItem['video_title'] = _ph_info_json.get('video_title')
        phItem['image_urls'] = _ph_info_json.get('image_url')
        phItem['link_url'] = _ph_info_json.get('link_url')
        phItem['file_urls'] = _ph_info_json.get('quality_480p')

        yield phItem
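To see what the flashvars regex in parse_ph_info extracts, here is the same pattern run against a trimmed, made-up page snippet (all field values are placeholders, not real site data):

```python
import re
import json

# Hypothetical page source shaped like the pages this spider parses
html = ('<script>\n'
        'var flashvars_12345 = {"video_duration":"600","video_title":"demo",'
        '"image_url":"https://example.com/t.jpg","link_url":"https://example.com/v",'
        '"quality_480p":"https://example.com/v.mp4"},\n'
        '</script>')

# lazy match: the commas inside the JSON are not followed by a newline,
# so the group stops exactly at the ',' that terminates the statement
matches = re.findall(r'var flashvars_\d+ =(.*?)[,;]\n', html)
info = json.loads(matches[0])
```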

4. Write the Item Pipeline

       Once the Spider has parsed a Response, the Item is handed to the Item Pipeline, where the configured pipeline components are invoked in sequence to:

  • clean HTML data
  • validate the scraped data and check that fields are present
  • detect and drop duplicates
  • store the results in a database
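The validate-and-dedupe steps can be sketched standalone in plain Python (no scrapy imports; a real pipeline's process_item takes (item, spider) and raises scrapy.exceptions.DropItem rather than ValueError):

```python
class DedupPipeline:
    """Drop items that lack a title or whose link_url was already seen."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item):
        if not item.get('video_title'):    # validate a required field
            raise ValueError('missing video_title')
        if item['link_url'] in self.seen:  # dedupe on the unique playback URL
            raise ValueError('duplicate item')
        self.seen.add(item['link_url'])
        return item
```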

       

import pymongo
from pymongo import IndexModel, ASCENDING
from pornhubBot import items
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline

# store items in MongoDB
class PornhubbotMongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["PornHub"]
        self.PhRes = db["PhRes"]
        # create unique indexes (a single one would also do)
        idx1 = IndexModel([('link_url', ASCENDING)], unique=True)
        idx2 = IndexModel([('video_title', ASCENDING)], unique=True)
        self.PhRes.create_indexes([idx1, idx2])
        # if your existing DB has duplicate records, refer to:
        # https://stackoverflow.com/questions/35707496/remove-duplicate-in-mongodb/35711737

    # process_item is the one method every pipeline must implement
    def process_item(self, item, spider):
        print('MongoDBItem', item)
        """ check the item type, then store it in MongoDB """
        if isinstance(item, items.PornhubbotItem):
            print('PornVideoItem True')
            try:
                # '$set' replaces the given fields, i.e. update/insert the record
                self.PhRes.update_one({'video_title': item['video_title']}, {'$set': dict(item)}, upsert=True)
            except Exception:
                pass
        return item

# media pipelines for downloading files, see:
# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoThumbPipeline(ImagesPipeline):

    # customize the thumbnail path (and name); the path is relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        file_name = request.url.split('/')[-1]
        return "%s/thumb.jpg" % file_name  # each thumbnail gets its own directory

    # after download, record the local path (IMAGES_STORE + relative path) in item['image_paths']
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        item['image_paths'] = image_paths
        return item

    # take the thumbnail URL from the item and request the download
    def get_media_requests(self, item, info):
        yield Request(url=item['image_urls'], meta={'item': item})

# https://doc.scrapy.org/en/latest/topics/media-pipeline.html#module-scrapy.pipelines.files
class VideoFilesPipeline(FilesPipeline):
    # take the segment URLs from the item and request the downloads
    def get_media_requests(self, item, info):
        yield Request(url=item['file_urls'], meta={'item': item})

    # customize the local path (and name) of each segment; relative to FILES_STORE
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return "%s/%s.mp4" % (file_name, file_name)  # each segment gets its own directory
        # return file_name

    # after download, record the local paths (FILES_STORE + relative path) in item['file_paths']
    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_paths'] = file_paths
        return item
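The file_path overrides above put every download in its own directory named after the last URL segment. With a hypothetical segment URL (real CDN URLs differ and may carry query strings, which would then end up in the directory name):

```python
url = 'https://cdn.example.com/videos/abc123_480p.mp4'  # hypothetical URL

file_name = url.split('/')[-1]
video_path = '%s/%s.mp4' % (file_name, file_name)  # as in VideoFilesPipeline.file_path
thumb_path = '%s/thumb.jpg' % file_name            # as in VideoThumbPipeline.file_path
```

Note the doubled extension in the video path: the last URL segment already carries ".mp4", and the pipeline appends another one. It still works, but stripping the extension first would give cleaner names.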

         Of course, each Pipeline must also be registered in Settings; the lower the value, the earlier it is called.

ITEM_PIPELINES = {
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
}
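Scrapy sorts this dict by value to decide the call order, which can be checked directly (the priorities here are illustrative, with the two download pipelines split apart so the order is unambiguous):

```python
ITEM_PIPELINES = {
    'pornhubBot.pipelines.PornhubbotMongoDBPipeline': 3,
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 2,
}

# ascending priority value = call order; keep only the class names for readability
call_order = [name.rsplit('.', 1)[-1]
              for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
```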

5. Write the Middlewares

           pornhub's anti-crawling measures are not particularly strict, but before I configured a proxy my IP got banned after just a few runs, so I use a proxy pool that I maintain myself. The usual counter-measures (rotating the User-Agent, cookies, and proxies) all live here.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import random
import json
import logging
import requests


class UserAgentMiddleware(object):
    """ Rotate the User-Agent """

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))


class CookiesMiddleware(object):
    """ Rotate the Cookie """
    cookie = {
        'platform': 'pc',
        'ss': '367701188698225489',
        'bs': '%s',
        'RNLBSERVERID': 'ded6699',
        'FastPopSessionRequestNumber': '1',
        'FPSRN': '1',
        'performance_timing': 'home',
        'RNKEY': '40859743*68067497:1190152786:3363277230:1'
    }

    def process_request(self, request, spider):
        # fill the 'bs' field with a random 32-character lowercase token
        bs = ''
        for i in range(32):
            bs += chr(random.randint(97, 122))
        _cookie = json.dumps(self.cookie) % bs
        request.cookies = json.loads(_cookie)


class ProxyMiddleware(object):
    # a random working proxy is served at http://localhost:5555/random
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                return response.text
        except requests.ConnectionError:
            return False

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(proxy_url=settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        # request.meta is a plain dict; Scrapy's retry middleware sets 'retry_times',
        # so only requests that have already failed at least once go through a proxy
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('using proxy: ' + proxy)
                request.meta['proxy'] = uri
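CookiesMiddleware relies on the '%s' placeholder in its cookie template: the dict is serialized with json.dumps, the random token is substituted with the % operator, and the result is parsed back. In isolation (with a trimmed-down template):

```python
import json
import random

cookie_template = {'platform': 'pc', 'bs': '%s'}  # trimmed version of the template above

# a random 32-character lowercase token, as in process_request
bs = ''.join(chr(random.randint(97, 122)) for _ in range(32))
cookie = json.loads(json.dumps(cookie_template) % bs)
```

This trick only works as long as no other value in the template contains a bare percent sign; otherwise the % substitution would fail.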

    I keep the list of common User-Agent strings in the Settings file:

USER_AGENTS = [
          "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
          "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
          "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
          "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
          "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
          "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
          "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
          "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
          "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
          "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
          "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
          "Mozilla/2.02E (Win95; U)",
          "Mozilla/3.01Gold (Win95; I)",
          "Mozilla/4.8 [en] (Windows NT 5.1; U)",
          "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
          "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
          "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
          "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
          "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
          "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
          "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
          ]

  6. Configure Settings

       The settings file ties everything together: concurrency, feed export, download directories, middlewares, and pipelines.

# -*- coding: utf-8 -*-

# Scrapy settings for pornhubBot project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pornhubBot'

SPIDER_MODULES = ['pornhubBot.spiders']
NEWSPIDER_MODULE = 'pornhubBot.spiders'


DOWNLOAD_DELAY = 1  # delay between requests, in seconds
# LOG_LEVEL = 'INFO'  # log level
CONCURRENT_REQUESTS = 20  # default is 16
# CONCURRENT_ITEMS = 1
# CONCURRENT_REQUESTS_PER_IP = 1
REDIRECT_ENABLED = False
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pornhub (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# endpoint of the self-hosted proxy pool that serves a random proxy
PROXY_URL = 'http://localhost:5555/random'

# Scrapy ships with Feed Export, supporting several serialization formats;
# utf-8 keeps Chinese text readable in the exported file
FEED_EXPORT_ENCODING = 'utf-8'
FEED_URI = u'/Users/chenyan/important/python_demo/pornhubBot/pornhub.csv'
FEED_FORMAT = 'csv'
# download directories (FILES_STORE / IMAGES_STORE)
IMAGES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
FILES_STORE = u'/Users/chenyan/important/python_demo/pornhubBot/Downloads'
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the image URLs
FILES_URLS_FIELD = 'file_urls'   # item field holding the file URLs
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# skip images smaller than 110x110
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

DOWNLOADER_MIDDLEWARES = {
    "pornhubBot.middlewares.UserAgentMiddleware": 401,
    "pornhubBot.middlewares.CookiesMiddleware": 402,
    "pornhubBot.middlewares.ProxyMiddleware": 403,
}
ITEM_PIPELINES = {
    "pornhubBot.pipelines.PornhubbotMongoDBPipeline": 3,
    'pornhubBot.pipelines.VideoThumbPipeline': 1,
    'pornhubBot.pipelines.VideoFilesPipeline': 1,
}
# By default Scrapy stores pending requests in a LIFO queue, i.e. crawls depth-first.
# The settings below switch to FIFO queues for a breadth-first crawl:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

USER_AGENTS = [
    # ... the full list of User-Agent strings shown in the previous section ...
]
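The effect of switching the scheduler queues from the default LIFO to FIFO (the DEPTH_PRIORITY and SCHEDULER_*_QUEUE settings above) comes down to pop order:

```python
from collections import deque

requests = ['page1', 'page2', 'page3']

lifo = list(requests)
depth_first = [lifo.pop() for _ in requests]        # default: last in, first out

fifo = deque(requests)
breadth_first = [fifo.popleft() for _ in requests]  # FifoMemoryQueue: first in, first out
```

With LIFO the crawler dives into freshly discovered pages first; with FIFO it finishes each listing level before following its children, which suits paginated sites like this one.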

7. Add a quick-launch script

from __future__ import absolute_import
from scrapy import cmdline

cmdline.execute("scrapy crawl pornhub".split())

Run this script to start the crawler.


           And with that, you can study advanced math offline!
