Given Python's outstanding track record in artificial intelligence, I have recently been exploring the language myself, hoping to build up a step-by-step understanding of it over time, and looking forward to discussing all kinds of Python questions with everyone. Over a restful weekend at home I wrote a small crawler program; the problems I ran into along the way are summarized here.
(1) Development Environment
System: Ubuntu 18.04
Python: 2.7.15
Scrapy: 1.5.0
lxml: 4.2.3.0
IDE: PyCharm
(2) Installing Scrapy
First, check the runtime environment:
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ python
Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ pip --version
pip 9.0.1 from /usr/lib/python2.7/dist-packages (python 2.7)
Install Scrapy:
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ pip install scrapy
Collecting scrapy
Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
100% |████████████████████████████████| 256kB 188kB/s
# ... a long installation process ...
Successfully installed Twisted-17.9.0 scrapy-1.4.0
During the installation, pip reported a failure because it lacked permission to write the files, so I reran the installation as the superuser by prefixing the pip command with sudo. (A less invasive alternative is a per-user install with `pip install --user scrapy`.)
(3) Creating a Scrapy Project
Once Scrapy is installed, a project can be created with the command-line tool `scrapy startproject projectname`. This is a global command, so it does not need to be run inside an existing project:
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ scrapy startproject SpiderDemo
New Scrapy project 'SpiderDemo', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
/home/pythoner/Desktop/douban/SpiderDemo
You can start your first spider with:
cd SpiderDemo
scrapy genspider example example.com
After the command finishes, a new project directory appears under the location where you ran it, with the following structure:
SpiderDemo/
    scrapy.cfg           # deployment configuration file
    SpiderDemo/          # the project's Python module
        __init__.py
        items.py         # item definitions (data containers)
        pipelines.py     # project pipelines file
        settings.py      # project settings file
        spiders/         # Spider classes define how to crawl a given site (or sites)
            __init__.py
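Before writing the spider, it is worth glancing at settings.py. The fragment below is a minimal sketch of the settings I tend to adjust for a small learning crawler; the specific values are my own choices, not part of the generated template:

```python
# settings.py (fragment) -- illustrative values for a polite learning crawler
BOT_NAME = 'SpiderDemo'

# Honor robots.txt and pace requests so the target site is not hammered
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1          # seconds between requests to the same site
CONCURRENT_REQUESTS = 8     # lower than Scrapy's default concurrency of 16
```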
In SpiderDemo/spiders, create the class that performs the crawl, douban_spider, as follows:
# -*- coding: utf-8 -*-
import scrapy
import urlparse


class DoubanSpider(scrapy.Spider):
    # The spider's name tells Scrapy how to locate (and instantiate) it, so it must be unique
    name = 'douban_spider'
    # URLs whose domain is not in this list will not be crawled
    allowed_domains = ['www.imooc.com']
    # List of start URLs
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url), callback=self.parse_learn)

    def parse_learn(self, response):
        title = response.xpath('//h2[@class="l"]/text()').extract_first()
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print ('标题:' + title)
        print ('地址:' + url)
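The parse method relies on urlparse.urljoin to turn the relative href values from the course cards into absolute URLs. A quick sketch of that behavior (under Python 3 the same function lives in urllib.parse; the example paths are just illustrations):

```python
try:
    from urlparse import urljoin          # Python 2
except ImportError:
    from urllib.parse import urljoin      # Python 3

base = 'http://www.imooc.com/course/list'

# A relative href such as '/learn/994' is resolved against the page URL
print(urljoin(base, '/learn/994'))        # http://www.imooc.com/learn/994

# An already-absolute URL passes through unchanged
print(urljoin(base, 'http://www.imooc.com/learn/998'))
```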
Run the crawler:
pythoner@pythoner-Lenovo-M4400s:~/Desktop/douban$ scrapy crawl douban_spider
During the run, a character-encoding error was reported:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/pythoner/Desktop/douban/douban/spiders/douban_spider.py", line 21, in parse_learn
print ('标题:' + title)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
This happens because Ubuntu's default Python interpreter is the 2.7 series. After some searching, the cause turned out to be that Python 2.x's default encoding is ascii, while the code contains utf-8 characters; the fix is to set the default encoding to utf-8.
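The root cause can be reproduced directly: the UTF-8 encoding of '标' begins with the byte 0xe6, which is outside the ASCII range, so Python 2's implicit ascii decode fails on it. A small sketch of the difference (this also runs under Python 3, where decoding is always explicit):

```python
raw = '标题'.encode('utf-8')    # the UTF-8 bytes b'\xe6\xa0\x87\xe9\xa2\x98'

print(raw.decode('utf-8'))      # works: decodes back to the original text
try:
    raw.decode('ascii')          # fails: 0xe6 is not a valid ASCII byte
except UnicodeDecodeError as e:
    print(e)                     # same class of error as in the traceback above
```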
Open the offending file and add the following lines right after the imports:
# -*- coding: utf-8 -*-
import scrapy
import urlparse
import sys

if sys.getdefaultencoding() != 'utf-8':
    reload(sys)
    sys.setdefaultencoding('utf-8')


class DoubanSpider(scrapy.Spider):
    # The spider's name tells Scrapy how to locate (and instantiate) it, so it must be unique
    name = 'douban_spider'
    # URLs whose domain is not in this list will not be crawled
    allowed_domains = ['www.imooc.com']
    # List of start URLs
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        learn_nodes = response.css('a.course-card')
        for learn_node in learn_nodes:
            learn_url = learn_node.css("::attr(href)").extract_first()
            yield scrapy.Request(url=urlparse.urljoin(response.url, learn_url), callback=self.parse_learn)

    def parse_learn(self, response):
        title = response.xpath('//h2[@class="l"]/text()').extract_first()
        content = response.xpath('//div[@class="course-brief"]/p/text()').extract_first()
        url = response.url
        print ('标题:' + title)
        print ('地址:' + url)
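parse_learn pulls the course title with the XPath //h2[@class="l"]/text(). Scrapy's selectors are backed by lxml, but the core idea of an attribute predicate can be sketched with the standard library's xml.etree.ElementTree, which supports this limited XPath subset (the HTML fragment below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A made-up, simplified stand-in for a course detail page
html = '<body><h2 class="other">ignored</h2><h2 class="l">TensorFlow与Flask结合打造手写体数字识别</h2></body>'

root = ET.fromstring(html)
# Select the <h2> element whose class attribute is exactly "l"
node = root.find(".//h2[@class='l']")
print(node.text)    # TensorFlow与Flask结合打造手写体数字识别
```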
With those lines added, run the spider again. Output like the following indicates the crawl succeeded:
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/994> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/995> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/997> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/984> (referer: http://www.imooc.com/course/list)
2018-07-08 18:57:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.imooc.com/learn/998> (referer: http://www.imooc.com/course/list)
标题:C4D地面多边形建模
地址:http://www.imooc.com/learn/987
标题:TensorFlow与Flask结合打造手写体数字识别
地址:http://www.imooc.com/learn/994
标题:C4D化妆品套装建模
地址:http://www.imooc.com/learn/995
标题:Java9之模块系统
地址:http://www.imooc.com/learn/997
标题:MAYA-贴图基础
地址:http://www.imooc.com/learn/984
标题:Unity 3D 翻牌游戏开发
地址:http://www.imooc.com/learn/998
I plan to keep digging into Scrapy, and I hope we can continue to explore it together.