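Scraping the Maoyan Movies TOP100
Reference: Jingmi (静觅), Cui Qingcai's personal blog https://cuiqingcai.com/5534.html
Goal: use Scrapy to scrape the Maoyan Movies TOP100 board and save the results to a MongoDB database
Target URL: http://maoyan.com/board/4?offset=0
Analysis / key points:
Scraping difficulty:
a. Beginner level: the page structure is simple, static HTML with a small amount of JS and no AJAX;
b. Handling pagination requires a regular expression;
MongoDB update statement usage:
a. The update statement both deduplicates and inserts new data, using title as the deduplication key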
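def process_item(self, item, spider):
    # Upsert keyed on title: update the matching document, or insert it if none exists.
    # update_one replaces the legacy update(), which was removed in PyMongo 4.
    self.db['movies'].update_one(
        {'title': item['title']},
        {'$set': dict(item)},  # convert the Scrapy item to a plain dict for MongoDB
        upsert=True)  # note upsert=True: update, or insert when there is no match
    return item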
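The pagination point noted in the analysis above can be sketched as follows. This is only one possible approach, not necessarily the one used in the referenced tutorial: a regular expression pulls the offset value out of the current URL and the next page URL is built from it. The helper name next_page_url and the constants PAGE_SIZE and LAST_OFFSET are hypothetical, and the sketch assumes each board page lists 10 movies, so that the offsets run 0, 10, ..., 90.

```python
import re

# Assumption: 10 movies per page, so the TOP100 board spans offsets 0..90.
PAGE_SIZE = 10
LAST_OFFSET = 90

# Captures the numeric value of the offset query parameter.
OFFSET_RE = re.compile(r'offset=(\d+)')


def next_page_url(url):
    """Return the URL of the next board page, or None when there is none."""
    match = OFFSET_RE.search(url)
    if match is None:
        # No offset parameter found, so there is nothing to paginate.
        return None
    offset = int(match.group(1)) + PAGE_SIZE
    if offset > LAST_OFFSET:
        # The current page was the last one.
        return None
    # Swap only the captured digits, leaving the rest of the URL untouched.
    return url[:match.start(1)] + str(offset) + url[match.end(1):]
```

In a spider's parse callback, the returned URL (when it is not None) could be passed to scrapy.Request to schedule the next page.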
Steps:
1) Create the Scrapy project and the maoyan spider
Terminal: > scrapy startproject maoyan_movie
Terminal: > scrapy genspider maoyan maoyan.com/board/4?offset=
2) Configure the settings.py file
# MongoDB configuration
MONGO_URI = 'localhost'