Scraping dynamic web pages with Scrapy + Splash

1. Environment: Windows x64, Scrapy, Splash, Python 3.6, Eclipse 4.4, PyDev 4.4.5, VirtualBox 5.2, CentOS-7-x86-64-minimal-1708

2. Download Python 3.6 from the official site and install it. During installation, be sure to check the option that adds Python to the system PATH.

3. Open a CMD window and run python -m pip install --upgrade pip

4. Scrapy depends on the Twisted networking framework. If you run pip install scrapy without installing Twisted first, pip downloads the Twisted source and tries to compile it into an installable package, and a typical Windows system has no such build environment. It is best to install Twisted from a pre-built package instead.

5. Open https://www.lfd.uci.edu/~gohlke/pythonlibs/, find Twisted-18.7.0-cp36-cp36m-win_amd64.whl, and download it. Then, in CMD, run pip install <path to the downloaded .whl> to install Twisted (e.g. pip install C:\Downloads\Twisted-18.7.0-cp36-cp36m-win_amd64.whl; the path here is just an example, adjust it to wherever you saved the file).

6. Install Scrapy: pip install scrapy installs the latest version.

7. Verify the Scrapy install with scrapy shell "http://www.baidu.com". This may fail with ModuleNotFoundError: No module named 'win32api'; if so, run pip install pypiwin32.

8. Read through Scrapy's hello-world tutorial at https://doc.scrapy.org/en/latest/intro/tutorial.html; it is very concise.

9. Open a target page: http://category.vip.com/suggest.php?keyword=%E7%94%B7%E5%A3%AB%E7%9F%AD%E8%A2%96polo

10. Open the page in a browser and press F12 to inspect its structure.

11. In a CMD window, run scrapy shell <your url> and check whether response.css("...") can select the image content you saw under F12.

12. If F12 shows the content but response.css("...") returns nothing, the page's HTML is generated dynamically by JavaScript.
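For example, in scrapy shell on this page, the selector used later in this post comes back empty (the img.goods-image-img class is taken from the spider code below, so treat it as an assumption about the page's markup):

>>> response.css("img.goods-image-img::attr(src)").extract()
[]    # empty, even though F12 shows the <img> tags: they are injected by JavaScript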

13. Install Splash, a browser simulator used to render a URL's final HTML after its JS has finished running. You will also need the scrapy-splash package on the Windows side (pip install scrapy-splash), since the spider code below imports scrapy_splash.

14. Install VirtualBox and configure a bridged network adapter. If you already have a VM, skip this step.

15. For installing Splash itself inside the VM, see https://www.cnblogs.com/zhonghuasong/p/5976003.html (Splash is typically run via Docker, i.e. docker run -p 8050:8050 scrapinghub/splash).

16. Test that Splash is up: scrapy shell "http://<your VM's IP>:8050/render.html?url=<your url>" (note the url= parameter).
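You can also query the render.html endpoint directly from Python to confirm Splash is rendering; a minimal sketch, assuming the VM IP 10.73.17.130 used in settings.py below:

import requests

# Ask Splash to load the page, give its JavaScript 5 seconds to run,
# and return the final rendered HTML.
resp = requests.get(
    'http://10.73.17.130:8050/render.html',
    params={
        'url': 'http://category.vip.com/suggest.php?keyword=%E7%94%B7%E5%A3%AB%E7%9F%AD%E8%A2%96polo',
        'wait': 5,
    },
)
print(resp.status_code)                # 200 if the render succeeded
print('goods-image-img' in resp.text)  # True once the JS-injected <img> tags exist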

17. Run response.css() again against the Splash-rendered response and compare with what F12 shows in the browser; the selectors should now match.

18. For the JDK that launches Eclipse 4.4, eclipse.ini must be configured to use JDK 7, otherwise PyDev 4.4.5 will fail to install.

19. The update-site URL currently needed to install PyDev 4.4.5 is: Pydev p2 Repository - http://dl.bintray.com/fabioz/pydev/4.5.5 (it cannot be accessed over https).

20. Generate a crawler project with Scrapy: scrapy startproject tutorial
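startproject creates the standard Scrapy layout:

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py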

21. Create a new PyDev project named tutorial in Eclipse, then copy the Scrapy-generated project wholesale into it.

22. Project code:

The new spider script, quotes_spider.py:

import scrapy
from scrapy_splash import SplashRequest

from tutorial.items import ImageItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    
    # Unused; the image output folder is configured via IMAGES_STORE in settings.py.
    localPath = "E:\\polo-www.vip.com"
    
    start_urls = [
        'http://category.vip.com/suggest.php?keyword=%E7%94%B7%E5%A3%AB%E7%9F%AD%E8%A2%96polo'
    ]
    """
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    """
    
    def start_requests(self):
        # Route every request through Splash so the page's JS runs before parsing;
        # 'wait' gives the scripts time to finish.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 5})
            
    """
    def parse(self, response):
        filename = 'vip.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
    """
    
    
    def parse(self, response):
        srcsAbsolute = []
        item = ImageItem()
        # Pull every product-image src from the Splash-rendered HTML.
        srcsRelative = response.css("img.goods-image-img::attr(src)").extract()
        for srcRelative in srcsRelative:
            srcsAbsolute.append(response.urljoin(srcRelative))

        item['image_urls'] = srcsAbsolute
        yield item

        # Follow pagination through Splash too, so every page gets JS-rendered.
        next_page = response.css("a.cat-paging-next::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield SplashRequest(next_page, callback=self.parse, args={'wait': 5})

wait: 5 is the render wait time in seconds; a reasonable value is the page's typical load-completion time, which you can read off the browser's F12 network panel.
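If a page is slow or heavy, other render.html arguments can be passed in the same args dict; a sketch of a variant start_requests (the values are illustrative, not tuned for this site):

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' lets the page's JS run; 'timeout' bounds the whole render;
            # 'images': 0 tells Splash to skip downloading image bodies -- the
            # <img src> attributes still appear in the rendered DOM, which is
            # all this spider needs.
            yield SplashRequest(url, self.parse,
                                args={'wait': 5, 'timeout': 60, 'images': 0})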

 

cmdline.py (a small launcher so the spider can be run directly from Eclipse/PyDev instead of a terminal):

'''
Created on 2015-8-28
@author: xxh
'''
import scrapy.cmdline


if __name__ == '__main__':
    # Equivalent to running "scrapy crawl quotes" from the project root.
    scrapy.cmdline.execute(argv=['scrapy', 'crawl', 'quotes'])

 

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    # Required by MyImagesPipeline below, which assigns item['image_paths'];
    # without this field that assignment raises a KeyError.
    image_paths = scrapy.Field()


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per collected image URL.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (ok, info) tuples; keep the paths of successful downloads.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item
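Note that settings.py below enables the stock scrapy.pipelines.images.ImagesPipeline, so MyImagesPipeline above never actually runs. To use the custom pipeline (and its DropItem check for empty items), point ITEM_PIPELINES at it instead:

# in settings.py: enable the custom pipeline in place of the stock one
ITEM_PIPELINES = {
    'tutorial.pipelines.MyImagesPipeline': 1,
}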
 

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Point Scrapy at the Splash instance running in the VM (use your own VM's IP).
SPLASH_URL = 'http://10.73.17.130:8050'

DOWNLOADER_MIDDLEWARES = {
  'scrapy_splash.SplashCookiesMiddleware': 723,
  'scrapy_splash.SplashMiddleware': 725,
  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
  'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'


# Added settings:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
# Enable the images pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
# Set IMAGES_STORE to a valid folder for downloaded images; otherwise the
# pipeline stays disabled, even if it is listed in ITEM_PIPELINES.
IMAGES_STORE = 'E:\\polo-www.vip.com'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
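With these settings, ImagesPipeline saves each downloaded image under IMAGES_STORE in a full/ subfolder, named by the SHA-1 hash of its URL; the hash in this example path is made up:

E:\polo-www.vip.com\full\3bb3f2dfd4d7cbf8f91310bd6dd98e2f48d38d7d.jpg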

 

Final result: polo-shirt images from the site.