1. Environment: Windows x64, Scrapy, Splash, Python 3.6, Eclipse 4.4, PyDev 4.4.5, VirtualBox 5.2, CentOS-7-x86-64-minimal-1708.
2. Download Python 3.6 from the official website and install it; during installation, make sure the option to add Python to the system PATH is checked.
3. Open a CMD window and run python -m pip install --upgrade pip.
4. Scrapy depends on the Twisted networking framework. If you run pip install scrapy without installing Twisted first, pip downloads the Twisted source and tries to compile it into an installable package, and a typical Windows system lacks the required build toolchain. It is therefore easiest to install a prebuilt Twisted package directly.
5. Open https://www.lfd.uci.edu/~gohlke/pythonlibs/, find Twisted-18.7.0-cp36-cp36m-win_amd64.whl and download it. Then, in CMD, run pip install <path to the downloaded .whl file> to install Twisted.
6. Install Scrapy: pip install scrapy installs the latest version.
7. Check that Scrapy works: scrapy shell "http://www.baidu.com". This may fail with ModuleNotFoundError: No module named 'win32api'; in that case install the missing package with pip install pypiwin32.
8. Read through Scrapy's hello-world tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html. It is very concise.
9. Open a target page, e.g. http://category.vip.com/suggest.php?keyword=%E7%94%B7%E5%A3%AB%E7%9F%AD%E8%A2%96polo
10. Open the page in a browser and press F12 to inspect the structure of the HTML.
11. In a CMD window run scrapy shell <your URL>, then check whether response.css("...") can select the image elements you saw in the F12 inspector.
12. If the images are visible under F12 but response.css("...") returns nothing, the HTML is generated dynamically by JavaScript; see the shell check sketched below.
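For example, with the vip.com page used later in this post, the check in the scrapy shell might look like the following (the goods-image-img selector is the same one the spider below uses; an empty list here, while F12 clearly shows the images, confirms that the markup is built by JavaScript):

# inside `scrapy shell <your URL>`
response.css("img.goods-image-img::attr(src)").extract()
# plain Scrapy on a JS-rendered page typically returns []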
13. Install Splash. It acts as a headless browser and is used to obtain the final HTML of a page after its JavaScript has finished running.
14. Install VirtualBox and set up a bridged network adapter. If you already have a virtual machine, skip this step.
15. For installing Splash, see https://www.cnblogs.com/zhonghuasong/p/5976003.html
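(Splash is normally run as the scrapinghub/splash Docker image with port 8050 published, e.g. docker run -p 8050:8050 scrapinghub/splash on the CentOS guest; the linked article covers the details.)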
16. Test that Splash is working: scrapy shell "http://<your VM's IP>:8050/render.html?url=<your URL>".
17. Run response.css() and compare the result with what F12 shows in the browser; a standalone check against the Splash HTTP API is sketched below.
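A quick way to verify Splash outside of Scrapy is to call its render.html endpoint directly. A minimal sketch using the requests library, assuming the VM's IP is 192.168.56.101 (replace it with yours):

import requests

splash = 'http://192.168.56.101:8050'   # hypothetical VM IP, replace with your own
target = 'http://category.vip.com/suggest.php?keyword=%E7%94%B7%E5%A3%AB%E7%9F%AD%E8%A2%96polo'

# render.html returns the page's HTML after its JavaScript has run;
# 'wait' is how many seconds Splash waits before taking the snapshot.
resp = requests.get(splash + '/render.html', params={'url': target, 'wait': 5})
print(resp.status_code, len(resp.text))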
18. Eclipse 4.4 must be started on JDK 7: configure it in eclipse.ini, otherwise PyDev 4.4.5 will fail to install.
19. The update-site URL currently needed to install PyDev 4.4.5 is: PyDev p2 Repository - http://dl.bintray.com/fabioz/pydev/4.5.5 (it cannot be accessed over https).
20. Generate the crawler project with Scrapy: scrapy startproject tutorial
21. Create a new PyDev project named tutorial in Eclipse and copy the Scrapy-generated project, in its entirety, into the Eclipse project; a possible resulting layout is sketched below.
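After copying, the PyDev project should roughly mirror the layout that scrapy startproject generates (other generated files omitted; cmdline.py is the small launcher script listed further down, and its exact location is a matter of taste, since Scrapy only needs the working directory to contain scrapy.cfg):

tutorial/                      <- PyDev project root
    scrapy.cfg                 <- generated by scrapy startproject
    tutorial/
        items.py
        pipelines.py
        settings.py
        spiders/
            quotes_spider.py
            cmdline.py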
22. Project code:
The newly created spider script, quotes_spider.py:
import scrapy
from scrapy_splash import SplashRequest
from tutorial.items import ImageItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://category.vip.com/suggest.php?keyword=%E7%94%B7%E5%A3%AB%E7%9F%AD%E8%A2%96polo'
    ]

    # Plain-Scrapy version from the official tutorial, kept for reference:
    # def start_requests(self):
    #     urls = [
    #         'http://quotes.toscrape.com/page/1/',
    #         'http://quotes.toscrape.com/page/2/',
    #     ]
    #     for url in urls:
    #         yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        # Route every request through Splash so the JavaScript-generated
        # markup is present in the response.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 5})

    # Earlier debugging version of parse() that just dumps the rendered page:
    # def parse(self, response):
    #     filename = 'vip.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)
    #     self.log('Saved file %s' % filename)

    def parse(self, response):
        # Collect the absolute URLs of all product images on this page.
        item = ImageItem()
        srcs_absolute = []
        srcs_relative = response.css("img.goods-image-img::attr(src)").extract()
        for src_relative in srcs_relative:
            srcs_absolute.append(response.urljoin(src_relative))
        item['image_urls'] = srcs_absolute
        yield item

        # Follow the "next page" link, again rendered through Splash.
        next_page = response.css("a.cat-paging-next::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield SplashRequest(next_page, callback=self.parse, args={'wait': 5})
The args value wait: 5 is the number of seconds Splash waits before returning the rendered page; a sensible value can be estimated from the average page-load time shown in the browser's F12 network panel.
cmdline.py:
'''
Created on 2015-8-28
@author: xxh
'''
import scrapy.cmdline

# Equivalent to running `scrapy crawl quotes` in a terminal; handy for
# starting the crawl from inside the IDE.
if __name__ == '__main__':
    scrapy.cmdline.execute(argv=['scrapy', 'crawl', 'quotes'])
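To launch the crawl from Eclipse, run cmdline.py as an ordinary Python script (Run As > Python Run in PyDev), with the working directory set to the folder containing scrapy.cfg so that Scrapy can locate the project; this is equivalent to typing scrapy crawl quotes in a terminal.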
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # URLs that the images pipeline will download
    images = scrapy.Field()       # filled by the pipeline with download results
    image_paths = scrapy.Field()  # filled by MyImagesPipeline.item_completed()


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue one download request per collected image URL.
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only items for which at least one image was actually stored.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item
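Note that settings.py below registers Scrapy's built-in ImagesPipeline, so MyImagesPipeline is not actually active. If you want the customized pipeline (with its image_paths bookkeeping and the DropItem check) to run instead, the ITEM_PIPELINES entry would roughly be swapped like this:

# sketch: enable the project's own pipeline instead of the built-in one
ITEM_PIPELINES = {
    'tutorial.pipelines.MyImagesPipeline': 1,
}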
settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Address of the Splash instance running in the VM.
SPLASH_URL = 'http://10.73.17.130:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Added: send a browser-like User-Agent.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'

# Enable the images pipeline.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# IMAGES_STORE must point to a valid folder for storing the downloaded images;
# otherwise the pipeline stays disabled, even if it is listed in ITEM_PIPELINES.
IMAGES_STORE = 'E:\\polo-www.vip.com'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Final result: downloaded polo-shirt images from the site.