Use the Scrapy crawler framework to scrape the Douban Top 250 (https://movie.douban.com/top250) and store the results in MongoDB.
Create the project
scrapy startproject Douban250
Generate the spider
scrapy genspider douban douban.com
Project structure
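After running the two commands above, the generated layout looks like this (the standard skeleton Scrapy creates):

Douban250/
    scrapy.cfg
    Douban250/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban.py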
items.py
We will scrape each movie's ranking, title, score, number of ratings, one-line quote, link, and introduction, and declare these as fields in items.py:
import scrapy


class Douban250Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ranking = scrapy.Field()    # position on the list
    name = scrapy.Field()       # movie title
    score = scrapy.Field()      # rating score
    score_num = scrapy.Field()  # number of ratings
    quote = scrapy.Field()      # one-line quote (may be missing)
    cover_url = scrapy.Field()  # link to the movie's detail page
    introduce = scrapy.Field()  # brief introduction
douban.py
import scrapy

from Douban250.items import Douban250Item


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        for node in response.xpath('//ol[@class="grid_view"]/li'):
            item = Douban250Item()  # create a fresh item for each movie
            item['ranking'] = node.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['name'] = node.xpath('.//span[@class="title"][1]/text()').extract()[0]
            item['score'] = node.xpath('.//div[@class="star"]/span[2]/text()').extract()[0]
            item['score_num'] = node.xpath('.//div[@class="star"]/span[4]/text()').extract()[0]
            # one movie has no quote, so use extract_first() to avoid an IndexError
            item['quote'] = node.xpath('.//p[@class="quote"]/span/text()').extract_first()
            item['cover_url'] = node.xpath('.//div[@class="pic"]/a/@href').extract()[0]
            # the first <p> holds two text nodes: director/cast and year/country/genre
            item['introduce'] = '\n'.join(
                t.strip() for t in node.xpath('.//div[@class="bd"]/p[1]/text()').extract()
            )
            print(item)  # take a look at the output
            yield item

        # follow the "next page" link until the last page
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0].extract()
            yield scrapy.Request(url, callback=self.parse)
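As an aside, on Scrapy 1.4+ the pagination can be written more compactly with response.follow, which resolves the relative href against the current page URL for you; a minimal sketch of the equivalent logic:

# alternative pagination (Scrapy >= 1.4); equivalent to the Request above
next_page = response.xpath('//span[@class="next"]/a/@href').extract_first()
if next_page:
    yield response.follow(next_page, callback=self.parse)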
settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'  # pose as a regular browser
ROBOTSTXT_OBEY = False  # ignore robots.txt so the crawl is not blocked
Optionally, you can also set the log level by adding one line to settings.py:
LOG_LEVEL = 'WARNING'
By default Scrapy logs at the DEBUG level, which is quite noisy.
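Another optional setting worth knowing about (my addition, not required for this tutorial) is a download delay, which spaces out requests and makes the crawl gentler on the site:

DOWNLOAD_DELAY = 1  # wait 1 second between requests (optional)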
Now run the spider and see what the print output looks like:
scrapy crawl douban
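Scrapy's built-in feed exports can also dump the items straight to a file, which is handy for a quick sanity check before wiring up MongoDB:

scrapy crawl douban -o movies.json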
For convenience you can also write a launcher script: create a .py file in the project's root folder (I named mine run.py) containing:
from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())
Now the crawl starts whenever you click Run in your IDE.
Each scraped item is printed to the console, one dict-like line per movie.
Storing the data in MongoDB
pipelines.py
import pymongo


class Douban250Pipeline(object):

    def __init__(self, host, port, dbname, sheetname):
        # create the MongoDB client and select the database and collection
        self.client = pymongo.MongoClient(host=host, port=port)
        mydb = self.client[dbname]
        self.post = mydb[sheetname]

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection parameters from settings.py
        # (scrapy.conf has been removed from Scrapy, so don't import settings from there)
        return cls(
            host=crawler.settings["MONGODB_HOST"],
            port=crawler.settings["MONGODB_PORT"],
            dbname=crawler.settings["MONGODB_DBNAME"],
            sheetname=crawler.settings["MONGODB_SHEETNAME"],
        )

    def process_item(self, item, spider):
        # insert() is deprecated in pymongo 3; use insert_one()
        self.post.insert_one(dict(item))
        return item
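One optional refinement (my own addition): Scrapy calls a pipeline's close_spider() hook once when the crawl ends, which is a good place to release the connection:

    def close_spider(self, spider):
        # called once when the spider finishes; close the MongoDB connection
        self.client.close()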
settings.py
ITEM_PIPELINES = {
    'Douban250.pipelines.Douban250Pipeline': 300,
}

# MongoDB host
MONGODB_HOST = "127.0.0.1"
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = "Douban"
# collection that stores the scraped data
MONGODB_SHEETNAME = "doubanmovies"
Run
Connect with a MongoDB GUI client to inspect the results (here I use the Mongo Explorer plugin in PyCharm).
All 250 records are there. Done!
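If you prefer to verify from the command line instead of a GUI, here is a minimal pymongo check (my own sketch, reusing the connection settings above):

import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)
collection = client["Douban"]["doubanmovies"]
print(collection.count_documents({}))         # should print 250
print(collection.find_one({"ranking": "1"}))  # the top-ranked movie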