Scraping Douban Movie Reviews and Movie Details with Scrapy in Python 3

Johline

Article link: https://blog.csdn.net/Johline/article/details/80583892

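This article describes how to use the Scrapy framework in Python 3 to scrape review content and movie details from Douban. It starts with creating the project and covers defining items.py, configuring settings.py, the entry URL and data parsing in douban.py, pagination, and storing the data in a MySQL database. The spider is run with the scrapy crawl douban command, which fetches and saves the review information.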

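I have been using Scrapy for crawling lately. This post uses it to scrape Douban movie reviews together with the details of the movie being reviewed.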

First, create the Scrapy project: scrapy startproject douban

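This generates the standard project skeleton: scrapy.cfg plus a douban package containing items.py, pipelines.py, settings.py, and a spiders directory.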

1. items.py: declare the fields to scrape

import scrapy


class DoubanItem(scrapy.Item):
    # Movie details
    title = scrapy.Field()       # movie title
    othername = scrapy.Field()   # alternative titles
    url = scrapy.Field()         # movie page URL
    duration = scrapy.Field()    # runtime (the page lists it in minutes)
    date = scrapy.Field()        # release date
    director = scrapy.Field()    # director
    actors = scrapy.Field()      # cast
    style = scrapy.Field()       # genre
    area = scrapy.Field()        # country/region of production
    # Review details
    re_time = scrapy.Field()     # review publication time
    re_author = scrapy.Field()   # review author
    re_content = scrapy.Field()  # review body
    re_title = scrapy.Field()    # review title

2. settings.py

Modify and add the following settings in this file:

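ROBOTSTXT_OBEY = False   # do not fetch or honor robots.txt
DOWNLOAD_DELAY = 5       # seconds to wait between requests to the same site
# The download delay setting will honor only one of:
# (when CONCURRENT_REQUESTS_PER_IP is non-zero, the per-domain limit is ignored)
CONCURRENT_REQUESTS_PER_DOMAIN = 64
CONCURRENT_REQUESTS_PER_IP = 16

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

# Disabling the built-in UserAgentMiddleware means USER_AGENT below is not
# applied automatically, so requests need to carry their own headers
# (the spider passes headers=self.header on each request).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'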

3. The spider file, douban.py

Entry point: https://movie.douban.com/j/subject_suggest?q=羞羞的铁拳
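As a rough sketch of this step, the snippet below builds the suggest URL for a title and pulls the movie id out of the JSON the endpoint returns. The response shape (a JSON array of objects with "id" and "title" keys) and the sample payload are assumptions for illustration, not captured output, and build_suggest_url and parse_suggestions are hypothetical helper names.

```python
import json
from urllib.parse import quote

SUGGEST_ENDPOINT = "https://movie.douban.com/j/subject_suggest"


def build_suggest_url(keyword):
    """Return the suggest URL with the keyword percent-encoded as UTF-8."""
    return SUGGEST_ENDPOINT + "?q=" + quote(keyword)


def parse_suggestions(body):
    """Extract (id, title, subject URL) tuples from the suggest JSON.

    Assumes the body is a JSON array of objects carrying "id" and "title";
    entries without an id are skipped.
    """
    results = []
    for entry in json.loads(body):
        movie_id = entry.get("id")
        if not movie_id:
            continue
        # Build the subject URL from the id, matching the URL form
        # that parse_page sees (https://movie.douban.com/subject/<id>/).
        subject_url = "https://movie.douban.com/subject/%s/" % movie_id
        results.append((movie_id, entry.get("title"), subject_url))
    return results


# Hypothetical payload shaped like the assumption above; the id is the one
# that appears in the parse_page comment.
sample_body = json.dumps([{"id": "26363254", "title": "羞羞的铁拳"}])
suggestions = parse_suggestions(sample_body)
print(build_suggest_url("羞羞的铁拳"))
for movie_id, _title, subject_url in suggestions:
    print(movie_id, subject_url)
```

In the spider, each tuple would become a scrapy.Request to the subject URL with the id passed along in meta, since parse_page reads it back through response.meta['id'].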

The movie page URL is then extracted from the response and the page is crawled, as follows:

    # Needs "import re" and "import scrapy" at the top of douban.py.
    def parse_page(self, response):
        # response.url looks like https://movie.douban.com/subject/26363254/
        movie_id = response.meta['id']
        title = response.xpath('//*[@id="content"]/h1/span[1]/text()').extract_first()
        # Start every field as None so a missing row cannot cause a NameError.
        style = director = author = actor = None
        duration = area = an_title = release_date = None
        genre = response.xpath('//*[@id="info"]/span[5]/text()').extract_first()
        if genre is not None:
            style = genre
        # Each direct <span> child of #info is one labelled row of movie details.
        for node in response.xpath('//*[@id="info"]/span'):
            content = node.xpath('normalize-space(string(.))').extract_first(default='')
            content = content.replace('\xa0', '')
            if '导演' in content:
                director = content.replace('导演:', '')
            elif '编剧:' in content:
                author = content.replace('编剧:', '')
            elif '主演:' in content:
                actor = content.replace('主演:', '')
            elif '分钟' in content:
                duration = content.replace('分钟', '')
            elif '制片国家/地区:' in content:
                area = content.replace('制片国家/地区:', '')
            elif '又名:' in content:
                an_title = content.replace('又名:', '')
            # The release date appears as YYYY-MM-DD inside one of the rows.
            t = re.search(r'(\d{4}-\d{2}-\d{2})', content)
            if t is not None:
                release_date = t.group(1)
        # print(title, style, director, author, actor, duration, area, an_title, release_date)
        # Review listing URL, e.g. https://movie.douban.com/subject/26363254/reviews
        re_url = 'https://movie.douban.com/subject/' + str(movie_id) + '/reviews'
        meta = {'review_url': re_url}
        yield scrapy.Request(re_url, callback=self.review_page, meta=meta,
                             headers=self.header, dont_filter=True)

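This code scrapes the movie details from the subject page and picks up the movie id, which is used to build the URL of the review listing, the entry point for the reviews.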
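The label matching in parse_page can be tried in isolation. The sketch below restates that logic as a plain function under a few assumptions: classify_info_rows and review_url are hypothetical helpers, the sample rows are made-up strings in the shape the loop expects rather than scraped data, and labels are matched with startswith instead of the substring test used in the spider.

```python
import re

# Label prefix -> output key, mirroring the if/elif chain in parse_page.
LABELS = [
    ("导演:", "director"),
    ("编剧:", "author"),
    ("主演:", "actor"),
    ("制片国家/地区:", "area"),
    ("又名:", "an_title"),
]
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})")


def classify_info_rows(rows):
    """Map whitespace-normalised info rows to a dict of movie fields."""
    fields = {}
    for row in rows:
        content = row.replace("\xa0", "").strip()
        for label, key in LABELS:
            if content.startswith(label):
                fields[key] = content[len(label):].strip()
                break
        else:
            # No known label matched: the row may be the runtime.
            if "分钟" in content:
                fields["duration"] = content.replace("分钟", "").strip()
        # Any row may carry the release date in YYYY-MM-DD form.
        match = DATE_RE.search(content)
        if match:
            fields["time"] = match.group(1)
    return fields


def review_url(movie_id):
    """Build the review listing URL for a movie id."""
    return "https://movie.douban.com/subject/" + str(movie_id) + "/reviews"


# Made-up rows for illustration only.
sample_rows = [
    "导演: Director A",
    "主演: Actor B / Actor C",
    "100分钟",
    "2017-09-30(中国大陆)",
]
info = classify_info_rows(sample_rows)
print(info)
print(review_url(26363254))
```

Collecting the fields in a dict rather than in separate local variables also makes it straightforward to hand them to the next callback through meta if the movie details are needed alongside each review.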

    def review_page(self, response):
        review_url = response.meta['review_url']
        meta = {'review_url': review_url}
        # Select the elements whose id attribute contains a run of digits (EXSLT regex).
        resultList = response.xpath(r'//*[re:match(@id, "\d+")]')
        for res in resultList:
            self.num=self.num+1
            author=res.xpath('./header/a[2]/text()').extract_first()
            if author: