Scraping Douban Hot Movies with Scrapy: Storing the Results in MySQL

Scrapy

Installing on Windows

Taking PyCharm as an example, run pip install scrapy directly in the built-in terminal, or search for and add the package under Settings > Project Interpreter.

Installing on Linux

Install the dependencies following the official guide

  • Link: https://docs.scrapy.org/
  • Dependencies: sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Installing the dependencies may fail with an error like Unable to locate package <name>.
Don't panic; refresh the package index first: sudo apt-get update.
Once the update finishes, the install will go through.

  • sudo apt-get install python3 python3-dev

Setting up a virtual environment

  • Install:
    • virtualenv -p /usr/bin/python3 env

    env is the name of the virtual environment; a name you can recognize at a glance works best.

  • Activate:
    • source env/bin/activate
  • Exit the virtual environment:
    • deactivate
Install Scrapy inside the virtual environment:
  • pip install scrapy

This pulls in quite a few dependencies, so be patient. My install failed several times over a bad connection (repeated request timeouts); the only fix was to rerun pip until everything finally installed.
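
Once the install completes, a quick sanity check (a minimal snippet; nothing project-specific is assumed):

    # Verify that Scrapy is importable and print its version.
    import scrapy
    print(scrapy.__version__)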

Creating a project

  • scrapy startproject example (example is the project name)
  • Create a spider:
    • The command output will prompt you; enter the project directory first (cd example)
    • scrapy genspider examplespider examplespider.com
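
For reference, genspider writes a skeleton along these lines (the exact template can vary by Scrapy version):

    # example/spiders/examplespider.py (approximate generated content)
    import scrapy

    class ExamplespiderSpider(scrapy.Spider):
        name = 'examplespider'
        allowed_domains = ['examplespider.com']
        start_urls = ['http://examplespider.com/']

        def parse(self, response):
            pass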

Running the project

  • Option one, from the command line: scrapy crawl examplespider

  • Option two, create a run.py:

    from scrapy.cmdline import execute
    
    # The two lines below are equivalent; the second is just quicker to type.
    # execute(['scrapy', 'crawl', 'examplespider'])
    execute('scrapy crawl examplespider'.split())
    

    Then run run.py.
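
    If you prefer to stay in-process rather than go through cmdline, Scrapy's CrawlerProcess does the same job; a rough sketch:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project's settings.py and run the spider by name.
    process = CrawlerProcess(get_project_settings())
    process.crawl('examplespider')
    process.start()  # blocks until the crawl finishes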

Three ways to add headers

  • Global
    • Set it in settings.py; this is project-wide and affects every spider (a sketch follows this list)
  • Local
    • Option one: add custom_settings in the spider file under the spiders folder, as below

      custom_settings = {
          'DEFAULT_REQUEST_HEADERS': {
              'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
              'Accept-Language': 'en',
              'User-Agent': 'your UA string here',
          }
      }

    • Option two: set the header on the Request object itself

      seq = scrapy.Request(url=url, callback=self.parse)

      Then add the User-Agent to it:

      seq.headers['User-Agent'] = ''
      yield seq
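
For the global route, the settings.py entry might look like the sketch below (header values are placeholders to fill in):

    # settings.py: project-wide default headers (placeholder values)
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'your UA string here',
    }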

Talk is cheap: let's scrape Douban's hot movies

The target is Douban's hot movies page (https://movie.douban.com/explore).

  1. Take a quick look at the page: the "recent hot movies" section is loaded dynamically, so open the dev-tools Network panel and inspect the requests fired while it loads. And there it is: the JSON endpoint serving the list, the search_subjects URL used in the spider below (a sketch of the response shape follows this list).
  2. Create the project: scrapy startproject HotMovie (capitalized to match the HotMovie package imported in the code below)
  3. Enter the project directory: cd HotMovie
  4. Create the spider: scrapy genspider douban douban.com
  5. At this stage, the files we need to edit are douban.py, MySqlHelper.py (a helper file we create ourselves), items.py, pipelines.py, and settings.py.
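
For reference, the endpoint returns JSON shaped roughly as below; only the url key is actually used by the spider, so treat the other fields as illustrative:

import json

# Illustrative sample of the search_subjects response body.
sample = '{"subjects": [{"title": "Some Movie", "url": "https://movie.douban.com/subject/1234567/"}]}'
for subject in json.loads(sample)['subjects']:
    print(subject['url'])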

douban.py

This file's job is to generate the URLs to crawl and parse out the fields we want to collect.

# -*- coding: utf-8 -*-
import scrapy
import json
from HotMovie.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # Headers must be added to scrape Douban.
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS' : {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
            'Host': 'movie.douban.com',
            'Referer': 'https://movie.douban.com/explore',
        }
      }
    # start_urls = ['http://www.douban.com']
    
    def start_requests(self):
        base_url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start={}"
        page = 20
        for i in range(page):
            url = base_url.format(i*20)
            seq = scrapy.Request(url=url, callback=self.parse)
            # Alternatively, set the User-Agent here
            # seq.headers['User-Agent'] = ''
            yield seq

    def parse(self, response):
        json_str = response.body.decode('utf-8')
        # print('-'*50)
        # print(type(json_str))
        result = json.loads(json_str)
        items_list = result['subjects']
        for item in items_list:
            url = item['url']
            yield scrapy.Request(url=url, callback=self.get_detail)
            # print(url)
        # filename = 'result.json'
        # with open(filename, 'a') as f:
        #     f.write(json_str)
        # self.log("Finished downloading {}!".format(filename))
    
    def get_detail(self, response):
        # movie title
        name = response.xpath('//span[@property="v:itemreviewed"]/text()').extract_first()
        # release year (strip the surrounding parentheses)
        year = response.xpath('//span[@class="year"]/text()').extract_first()[1:-1]
        # director
        director = response.xpath('//a[@rel="v:directedBy"]/text()').extract_first()
        # screenwriters: a list, as there may be several
        screenwriter_list = response.xpath('//div[@id="info"]/span[2]/span[2]/a/text()').extract()
        # lead actors: also a list
        actor_list = response.xpath('//div[@id="info"]/span[3]/span[2]/a/text()').extract()
        # genres: also a list
        type_list = response.xpath('//span[@property="v:genre"]/text()').extract()
        # remaining info that isn't wrapped in its own tag
        others = response.xpath('//div[@id="info"]/text()').extract()
        # clean up: drop whitespace, newlines, and slashes
        simplify = [i for i in others if i.strip()!="\n" and i.strip()!="" and i.strip()!="/"]
        # country/region of production
        made = simplify[1]
        # language
        language = simplify[2]
        # release dates (list)
        publish_time = response.xpath('//span[@property="v:initialReleaseDate"]/text()').extract()
        # runtime
        length = response.xpath('//span[@property="v:runtime"]/text()').extract_first()
        # alternative titles
        other_name = simplify[-1]
        # rating
        score = float(response.xpath('//strong[@property="v:average"]/text()').extract_first())
        # number of raters
        rate_people = int(response.xpath('//div[@class="rating_sum"]/a/span/text()').extract_first())
        # synopsis
        summary = response.xpath('//div[@class="indent"]/span/text()').extract_first().strip()
        # poster URL
        poster = response.xpath('//div[@id="mainpic"]/a/img/@src').extract_first()
        
        item = DoubanItem()
        item['name'] = name
        item['year'] = year
        item['director'] = director
        item['screenwriter_list'] = ','.join(screenwriter_list)
        item['actor_list'] = ','.join(actor_list)
        item['type_list'] = ','.join(type_list)
        # item['simplify'] = simplify
        item['made'] = made
        item['language'] = language
        item['publish_time'] = ','.join(publish_time)
        item['length'] = length
        item['other_name'] = other_name
        item['score'] = score
        item['rate_people'] = rate_people
        item['summary'] = summary
        item['poster'] = poster

        yield item
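
Those XPath expressions are easiest to work out interactively: scrapy shell <url> is the usual tool, and offline you can feed sample markup to Scrapy's Selector. A minimal sketch with made-up HTML:

from scrapy import Selector

# Stand-in markup mimicking the title element on a detail page.
html = '<span property="v:itemreviewed">Some Movie</span>'
sel = Selector(text=html)
print(sel.xpath('//span[@property="v:itemreviewed"]/text()').extract_first())
# -> Some Movie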
    

MySqlHelper.py

A helper class that wraps the database connection.

import pymysql

class SqlHelper(object):
    def __init__(self):
        self.conn = pymysql.connect(host="localhost", port=3306, user='root', password='root',db='scrapy', charset='utf8mb4')
        self.cursor = self.conn.cursor()
    
    def insert_data(self, sql, data):
        self.cursor.execute(sql, data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

# Quick self-test; optional
if __name__ == '__main__':
    helper = SqlHelper()
    sql = "INSERT INTO txt (con) VALUES (%s)"
    data = ('This is a test row',)
    helper.insert_data(sql, data)
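
One caveat: if execute() raises (say, a duplicate key or an encoding problem), the connection is left mid-transaction. A hedged variant of insert_data that rolls back on failure:

    def insert_data(self, sql, data):
        try:
            self.cursor.execute(sql, data)
            self.conn.commit()
        except pymysql.MySQLError:
            # Undo the failed statement so the connection stays usable.
            self.conn.rollback()
            raise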

items.py

This part is fairly mechanical; just watch the field names and keep them consistent with the spider.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class HotmovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    year = scrapy.Field()
    director = scrapy.Field()
    screenwriter_list = scrapy.Field()
    actor_list = scrapy.Field()
    type_list = scrapy.Field()
    # simplify = scrapy.Field()
    made = scrapy.Field()
    language = scrapy.Field()
    publish_time = scrapy.Field()
    length = scrapy.Field()
    other_name = scrapy.Field()
    score = scrapy.Field()
    rate_people = scrapy.Field()
    summary = scrapy.Field()
    poster = scrapy.Field()

    def insert_data_sql(self):
        sql = "INSERT INTO hotmovie (`name`,`year`,director,screenwriter_list,actor_list,type_list,made,`language`,publish_time,`length`,other_name,score,rate_people,summary,poster) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        data = (self['name'],self['year'],self['director'],self['screenwriter_list'],self['actor_list'],self['type_list'],self['made'],self['language'],self['publish_time'],self['length'],self['other_name'],self['score'],self['rate_people'],self['summary'],self['poster'])
        return (sql, data)
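
A quick way to sanity-check the SQL/field pairing without running a crawl (dummy values; run from the project root so HotMovie is importable):

from HotMovie.items import DoubanItem

item = DoubanItem()
# insert_data_sql() reads every field, so fill them all with dummies.
for field in DoubanItem.fields:
    item[field] = 'test'
sql, data = item.insert_data_sql()
print(sql)
print(data)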

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from HotMovie.MySqlHelper import SqlHelper

class HotmoviePipeline(object):
    def process_item(self, item, spider):
        return item

# Once you create your own pipeline, remember to register it in settings.py (covered below)
class DoubanPipeline(object):
    def __init__(self):
        self.helper = SqlHelper()

    def process_item(self, item, spider):
        if 'insert_data_sql' in dir(item):
            (sql, data) = item.insert_data_sql()
            self.helper.insert_data(sql, data)
        return item

settings.py

Adjust the configuration. The main points:
1. robots.txt: don't obey it.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
2. Register your own pipeline:
ITEM_PIPELINES = {
    # 'HotMovie.pipelines.HotmoviePipeline': 300,
    'HotMovie.pipelines.DoubanPipeline': 300,
}
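
Given the IP ban mentioned at the end of this post, it is also worth throttling the crawl here. These are standard Scrapy settings; the values below are just a conservative guess:

# Be polite to the server; tune the numbers as needed.
DOWNLOAD_DELAY = 2                  # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed latency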

run_douban.py

You can run it with the command scrapy crawl douban,
or create this file and run that instead:

from scrapy.cmdline import execute
# execute(['scrapy', 'crawl', 'douban'])  # this form works too
execute('scrapy crawl douban'.split())

Results

(screenshot: the crawler running)
(screenshot: rows stored in the database)

Don't crawl too aggressively; keep it civil.
SQL to create the table (column names must match the INSERT statement in items.py):
CREATE TABLE hotmovie (
    id INT AUTO_INCREMENT PRIMARY KEY,
    `name` VARCHAR(255),
    `year` INT,
    director VARCHAR(255),
    screenwriter_list VARCHAR(255),
    actor_list VARCHAR(255),
    type_list VARCHAR(255),
    made VARCHAR(255),
    `language` VARCHAR(255),
    publish_time VARCHAR(255),
    `length` VARCHAR(100),
    other_name VARCHAR(255),
    score VARCHAR(11),
    rate_people VARCHAR(100),
    summary TEXT,
    poster VARCHAR(255)
) DEFAULT CHARSET=utf8mb4;

My IP has already been banned: I messed up the database today and didn't get it sorted, so my requests got a bit too frequent. Time to set up some proxies. Until next time.
