[Python] Scraping Movie Information with Scrapy and Storing It in a Database

~来了小老弟

Article link: https://blog.csdn.net/weixin_43358075/article/details/103723989

This article uses the Scrapy framework to scrape movie information and store the data in a MySQL database.

1. Install the required Python modules

Install the modules with whichever Python package manager you use. For example, with pip:

pip install scrapy
pip install pymysql

2. Create the Scrapy project

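As with other Python frameworks, create the project with the scrapy startproject projectname command.
If Scrapy prints a message confirming that the new project was created (shown in a screenshot in the original post), the project was created successfully. If you instead see an error such as command not found, Scrapy is not installed correctly and you need to reinstall it. The original post then showed the resulting project directory in a second screenshot.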

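Here is a brief overview of what some of these files do.

items.py mainly holds your models, i.e. the entities.
pipelines.py is where scraped data is processed after the spider extracts it from a page.
settings.py holds the framework configuration.
spiders/ is the folder for the spider code.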

3. Coding

items.py: first work out which pieces of information we want to scrape, then declare one field for each.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class DialogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class Movie(scrapy.Item):
    name = scrapy.Field()       # movie title
    href = scrapy.Field()       # link to the movie's detail page
    actor = scrapy.Field()      # cast
    status = scrapy.Field()     # status
    district = scrapy.Field()   # region
    director = scrapy.Field()   # director
    genre = scrapy.Field()      # genre
    intro = scrapy.Field()      # synopsis

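In the spiders folder, create the spider file MovieSpider.py and define a MovieSpider class that inherits from scrapy.Spider. The spider locates elements with XPath, which is introduced briefly below; for more usage, see the XPath tutorial on Runoob (菜鸟教程).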

import scrapy
# The project package is "dialog" (see ITEM_PIPELINES in settings.py)
from dialog.items import Movie


class MovieSpider(scrapy.Spider):
    # Spider name; this is the name used to launch the spider
    name = 'MovieSpider'
    # Domain only, without the scheme or resource path
    allowed_domains = ['88ys.com']
    # Start URL, i.e. the first address the spider requests
    start_urls = ['https://www.88ys.com/vod-type-id-14-pg-1.html']

    def parse(self, response):
        for item in response.xpath('//li[@class="p1 m1"]'):
            href = item.xpath('./a/@href').extract_first()
            if href is None:
                continue  # skip entries that have no detail link
            movie = Movie()
            movie['name'] = item.xpath('./a/span[@class="lzbz"]/p[@class="name"]/text()').extract_first()
            # Resolve the relative link against the URL of the current page
            movie['href'] = response.urljoin(href)
            request = scrapy.Request(movie['href'], callback=self.crawl_details)
            # Pass the partially filled item on to the detail-page callback
            request.meta['movie'] = movie
            yield request

    def crawl_details(self, response):
        movie = response.meta['movie']
        movie['actor'] = response.xpath('//div[@class="ct-c"]/dl/dt[2]/text()').extract_first()
        movie['status'] = response.xpath('//div[@class="ct-c"]/dl/dt[1]/text()').extract_first()
        movie['district'] = response.xpath('//div[@class="ct-c"]/dl/dd[4]/text()').extract_first()
        movie['director'] = response.xpath('//div[@class="ct-c"]/dl/dd[3]/text()').extract_first()
        movie['genre'] = response.xpath('//div[@class="ct-c"]/dl/dd[1]/text()').extract_first()
        movie['intro'] = response.xpath('//div[@class="ee"]/text()').extract_first()
        yield movie
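
The spider handles two kinds of URL: the list page, whose address carries the page number in its pg-N segment, and the detail links found on that page. Below is a minimal sketch of both steps using only the standard library. The helper names, the sample href, and the assumption that later list pages follow the same pg-N pattern are illustrative and are not confirmed by the original post.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.88ys.com'
# Assumed pattern: the page number is the trailing "pg-N" segment of the start URL.
LIST_PAGE_TEMPLATE = BASE_URL + '/vod-type-id-14-pg-{page}.html'


def list_page_urls(first_page, last_page):
    """Return list-page URLs for an inclusive page range (hypothetical helper)."""
    return [LIST_PAGE_TEMPLATE.format(page=n) for n in range(first_page, last_page + 1)]


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


pages = list_page_urls(1, 3)
# Hypothetical relative href, as it might appear in an <a> tag on the list page.
detail = absolute_url(pages[0], '/vod-detail-id-1.html')

print(pages)
print(detail)
```

Inside a spider, response.urljoin(href) performs the same resolution against the response URL, and a generated list like pages could be assigned to start_urls if the pattern holds for the site.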

Using XPath

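Syntax	Description
//	Searches recursively through the whole document
.	Selects the current node
..	Selects the parent node
text()	Selects the text inside a tag
@attribute	Selects the value of that attribute
label	A node name, i.e. an HTML tag
`div[@class="ct-c"]`	A div whose class attribute is `ct-c`
`/dl/dt[1]`	The first dt under dl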
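
To try these expressions without running a crawl, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath. It has no text() or @attribute steps, so the text and attribute values are read with .text and .get() instead. The HTML fragment is a made-up sample shaped like the list page the spider expects, not a copy of the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the list page the spider parses.
SAMPLE = """
<ul>
  <li class="p1 m1">
    <a href="/vod-detail-1.html">
      <span class="lzbz"><p class="name">Movie A</p></span>
    </a>
  </li>
  <li class="p1 m1">
    <a href="/vod-detail-2.html">
      <span class="lzbz"><p class="name">Movie B</p></span>
    </a>
  </li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# './/' searches all descendants; [@class="..."] filters on an attribute value.
items = root.findall('.//li[@class="p1 m1"]')

movies = []
for li in items:
    # './' continues from the current node, like the relative paths in the spider.
    name_node = li.find('./a/span[@class="lzbz"]/p[@class="name"]')
    link_node = li.find('./a')
    # .text stands in for text(), .get('href') stands in for @href.
    movies.append((name_node.text, link_node.get('href')))

# [1] selects the first matching child (XPath positions start at 1).
first_href = root.find('./li[1]').find('./a').get('href')

print(movies)
print(first_href)
```

Scrapy's own selectors accept the full expressions from the table, including text() and @href, so this stand-in is only meant for experimenting with path structure and predicates.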

Write pipelines.py to store the scraped data in the database

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class DialogPipeline(object):
    def __init__(self):
        # PyMySQL 1.0 and later accept keyword arguments only
        self.conn = pymysql.connect(host='localhost', user='huangwei',
                                    password='123456789', database='db_88ys')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes the values
        sql = "insert into tb_movie(name, href, actor, status, district, director, genre, intro) values(%s, %s, %s, %s, %s, %s, %s, %s)"
        self.cursor.execute(sql, (item['name'], item['href'], item['actor'], item['status'],
            item['district'], item['director'], item['genre'], item['intro']) )
        self.conn.commit()
        # Return the item so that any later pipeline also receives it
        return item

    def close_spider(self, spider):
        # Release the database resources when the spider finishes
        self.cursor.close()
        self.conn.close()
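
The pipeline assumes that a tb_movie table already exists, but the post does not show its schema. The sketch below illustrates the same create-then-insert flow with the standard library's sqlite3 module swapped in for MySQL, so that it runs without a database server. The column names are taken from the INSERT statement above; the column types, the id column, and the save_movie helper are assumptions.

```python
import sqlite3

# Column names come from the pipeline's INSERT statement; the TEXT types are an assumption.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS tb_movie (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT, href TEXT, actor TEXT, status TEXT,
    district TEXT, director TEXT, genre TEXT, intro TEXT
)
"""
COLUMNS = ('name', 'href', 'actor', 'status', 'district', 'director', 'genre', 'intro')


def save_movie(conn, item):
    """Insert one scraped item; missing fields are stored as NULL (hypothetical helper)."""
    placeholders = ', '.join('?' for _ in COLUMNS)  # sqlite3 uses ?, pymysql uses %s
    sql = f"INSERT INTO tb_movie ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    conn.execute(sql, tuple(item.get(col) for col in COLUMNS))
    conn.commit()


conn = sqlite3.connect(':memory:')  # in-memory stand-in for the db_88ys MySQL database
conn.execute(CREATE_TABLE)

# Hypothetical item with the keys the spider fills in; 'intro' is left out on purpose.
save_movie(conn, {'name': 'Movie A', 'href': 'https://www.88ys.com/a.html',
                  'actor': 'X', 'status': 'HD', 'district': 'US',
                  'director': 'Y', 'genre': 'Action'})

rows = conn.execute('SELECT name, href, intro FROM tb_movie').fetchall()
print(rows)
conn.close()
```

For the real MySQL table, the equivalent DDL would need MySQL column types and AUTO_INCREMENT syntax chosen to fit the scraped data; this sketch does not prescribe them.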

Update the relevant settings in settings.py

# Whether to obey robots.txt
ROBOTSTXT_OBEY = False

# Send browser-like request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Enable the pipeline so that scraped items are saved
ITEM_PIPELINES = {
   'dialog.pipelines.DialogPipeline': 300,
}

4. Run the spider

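Go to the project directory and run scrapy crawl MovieSpider. Logs are printed while the spider runs; add --nolog to the command to hide them. The data table must be created before you start the spider. The startup output appeared as a screenshot in the original post.
Finally, check the database: the crawl succeeded!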
