A few words up front:
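I'm a Java programmer who has just tumbled into the big-data pit, and a crawler happens to be my first project. I'll spare you the project details. After a lot of struggling I finally gave up on writing the crawler in Java and switched to Python, and a Python crawler of course can't get around Scrapy, a real stroke of genius among frameworks!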
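Setting up the environment and installing all the toolkits is something every first-timer like me has to go through, painful but fun. In the end I dropped Python 2.7 and went with 3.5; after all, picking up the newer technology always brings a sense of achievement!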
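Listening to Douban Music while watching the leisurely crawler pull in Douban Movie data: no rush, no elation, just relaxation, the complete, whole-body kind that only a techie knows!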
The screenshot tells the story:
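About the IP proxy pool:
I had heard that Douban bans IPs, so the first thing I did was look for ProxyPool-style projects. I tried two approaches in total.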
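The first: crawl free proxies from domestic high-anonymity proxy sites, generate a proxy_list.json, copy that file into the project root, and pick a random IP from the JSON file on every Request. The idea is sound, but can free proxies really be relied on? I understood the code and then gave up on myself! After fiddling with it for a whole morning, it fizzled out.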
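The first approach can be sketched roughly as a Scrapy downloader middleware. This is only a minimal sketch under assumptions, not the code I actually tried: the class name RandomProxyMiddleware and the PROXY_LIST_FILE setting are hypothetical, and proxy_list.json is assumed to be a flat JSON list of proxy URLs such as "http://10.0.0.1:8080" (the real file may well have a different shape). The only Scrapy behavior relied on is that the built-in HttpProxyMiddleware honors request.meta['proxy'].

```python
import json
import random
from pathlib import Path


def load_proxies(path):
    """Read proxy URLs from a JSON file.

    Assumption: the file holds a flat list of URL strings.
    """
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


class RandomProxyMiddleware:
    """Hypothetical downloader middleware: one random proxy per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy list is empty")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST_FILE is a made-up setting name for this sketch;
        # it defaults to proxy_list.json in the project root.
        path = crawler.settings.get("PROXY_LIST_FILE", "proxy_list.json")
        return cls(load_proxies(path))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks up request.meta['proxy'].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy carry on processing the request
```

To try it, the class would have to be registered under DOWNLOADER_MIDDLEWARES in settings.py. None of this makes free proxies any more reliable, which was the real problem.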
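The second is similar to the first: ProxyPool-master, a modestly well-known project on GitHub. It also crawls free proxies from the major free proxy sites, stores them in Redis, and then serves them; visiting http://127.0.0.1:5000/random in a local browser returns one proxy. The part worth learning from is the scoring: every proxy enters the store with a score of 10, an asynchronous test sets it to 100 on success, and on failure the score counts down from 10 until it reaches 0, at which point the proxy is removed from the store. Even so, it can't escape the curse of free proxies, so in the end I gave up on this one too!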
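The scoring rule is easy to illustrate on its own. The sketch below is not ProxyPool-master's code: it keeps scores in a plain in-memory dict instead of Redis, the class and method names are hypothetical, and two details are assumptions on my part, namely that a failure always subtracts 1 from the current score (whatever it is) and that a random pick prefers the highest-scoring proxies.

```python
import random

INITIAL_SCORE = 10   # score given to a proxy when it enters the pool
MAX_SCORE = 100      # score set after a successful test
MIN_SCORE = 0        # a proxy is dropped once its score falls to this value


class ScoredProxyPool:
    """In-memory stand-in for the Redis-backed pool (illustration only)."""

    def __init__(self):
        self.scores = {}

    def add(self, proxy):
        # A proxy that is already known keeps its current score.
        self.scores.setdefault(proxy, INITIAL_SCORE)

    def mark_success(self, proxy):
        if proxy in self.scores:
            self.scores[proxy] = MAX_SCORE

    def mark_failure(self, proxy):
        if proxy not in self.scores:
            return
        self.scores[proxy] -= 1
        if self.scores[proxy] <= MIN_SCORE:
            del self.scores[proxy]

    def get_random(self):
        # Assumption: choose among the proxies with the best score.
        if not self.scores:
            raise LookupError("proxy pool is empty")
        best = max(self.scores.values())
        candidates = [p for p, s in self.scores.items() if s == best]
        return random.choice(candidates)
```

Under these rules a proxy that never passes a test survives ten failed checks and is then evicted, while a proxy that has passed once can absorb many failures before it disappears.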
The leisurely crawler:
Project structure and directories
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 20

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/exte