Scraping a Douban doulist: movie titles, release years, and above all the list creator's remarks (comments)

Article link: https://blog.csdn.net/qq_62573773/article/details/140036801

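My goal was to scrape the information in a Douban doulist (a user-curated film list): title, director, cast, genre, country/region of production, and year, plus, most importantly, the list creator's comment on each film and the film's Douban link. I could not find any doulist scraper on CSDN, so I wrote a small one myself. I am sharing it here for anyone with the same need, and as a short record of what I learned.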

The code has three parts: sending the request, locating the wanted information, and storing it.

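The main difficulty was the creator's comment, which sits in an awkward position in the markup. A related problem came up with the director and other details: the div with class="abstract" contains several br tags. A br tag has no separate closing tag, so it splits the surrounding text into several text nodes, and keeping only the first text() result loses the rest. For the comment, XPath's following-sibling::text() axis is needed. My first attempt was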

comment = sech.xpath( './/div[@class="ft"]//blockquote[@class="comment"]/span/text()')

but this returned nothing useful at all, whereas

comment = sech.xpath(
'.//div[@class="ft"]//blockquote[@class="comment"]/span/descendant-or-self::text()')

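did what I needed. descendant-or-self::text() selects every text node at or below the current node (children, grandchildren, and so on), whereas a bare text() step only selects the text nodes that are direct children of the selected node. Note that the full script below ends up using following-sibling::text() on the span for the comment and descendant-or-self::text() for the abstract.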
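
To make the difference between these steps concrete, here is a minimal sketch run against a made-up fragment. The markup in SAMPLE is an assumption, loosely modeled on what the XPath expressions in this post imply; the real Douban page may be structured differently, in which case the results would differ too.

```python
from lxml import etree

# Hypothetical markup, loosely modeled on one doulist item.
# The real Douban page may be structured differently.
SAMPLE = """
<div class="doulist-item">
  <div class="abstract">
    导演: Director A<br>
    主演: Actor B / Actor C<br>
    年份: 1994
  </div>
  <div class="ft">
    <blockquote class="comment"><span>评语：</span>A quiet masterpiece.</blockquote>
  </div>
</div>
"""

item = etree.HTML(SAMPLE).xpath('//div[@class="doulist-item"]')[0]
quote = './/blockquote[@class="comment"]'

# text() on the span only sees the span's own text child: the label.
label = item.xpath(quote + '/span/text()')

# following-sibling::text() selects the text nodes that come after the span
# inside the same parent: here, the comment itself.
after_span = item.xpath(quote + '/span/following-sibling::text()')

# descendant-or-self::text() on the blockquote collects every text node
# at any depth below it, so it returns both the label and the comment.
all_text = item.xpath(quote + '/descendant-or-self::text()')

# The <br> tags split the abstract into one text node per line.
abstract_lines = [
    t.strip()
    for t in item.xpath('.//div[@class="abstract"]/text()')
    if t.strip()
]

print(label)
print(after_span)
print(all_text)
print(abstract_lines)
```

In this assumed layout the comment is a sibling of the span rather than a child of it, which is why the step applied to the span decides what comes back.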

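Another pitfall: xpath() always returns a list, even when only one item matches. Later, when I wanted to fetch a film's detail page using the Douban link scraped earlier, I passed the XPath result straight to my own fetch_url function and got an error, because the argument was a list rather than a string.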
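
One way around this is to unwrap the list before using the value. The sketch below is only an illustration: the helper name first, the markup, and the example.com URL are all placeholders that do not come from the original script.

```python
from lxml import etree


def first(results, default=""):
    """Return the first XPath result as a stripped string, or default if there is none."""
    return str(results[0]).strip() if results else default


# Placeholder markup and URL, for illustration only.
doc = etree.HTML(
    '<div class="title"><a href="https://example.com/subject/1/"> Some Film </a></div>'
)

links = doc.xpath('//div[@class="title"]/a/@href')           # a list with one item
titles = doc.xpath('//div[@class="title"]/a/text()')         # a list with one item
missing = doc.xpath('//div[@class="title"]/a/@data-missing')  # an empty list

link = first(links)
title = first(titles)

print(type(links).__name__, link, title, repr(first(missing)))
```

A string obtained this way can be passed to fetch_url(link, header) directly, and the default value keeps an empty match from raising an IndexError.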

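If XPath expressions are unfamiliar, read the ones in the code closely. At first I found XPath tedious and did not want to learn it, but starting from a fairly simple case like this one made it much less off-putting.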

Here is the code:

import requests
import lxml.html
import csv
import time

etree = lxml.html.etree  # lxml's etree module, used here to parse the HTML

def fetch_url(url, headers, retries=3):
    while retries > 0:
        try:
            response = requests.get(url, headers=headers, timeout=10)  # timeout so a stalled connection cannot hang the script
            response.encoding = "utf-8"
            return response
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}, retries left: {retries - 1}")
            retries -= 1
            time.sleep(2)
    return None

with open('doubanMovie0.csv', 'w', newline='', encoding='utf-8') as f:  # newline='' is what the csv module expects
    csvwriter = csv.writer(f, dialect='excel')
    csvwriter.writerow(['title', 'year', 'country', 'comment', 'cast', 'director', 'linkURL'])

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.7 Safari/537.36'
}

for n in range(11):
    url = 'https://www.douban.com/doulist/68132/?start=%s&sub_type=' % (n * 25)  # replace 68132 with the ID of the doulist you want; each page holds 25 films, hence n * 25
    response = fetch_url(url, header)
    if response is None:
        print(f"Failed to fetch data from: {url}")
        continue

    html1 = etree.HTML(response.text)
    h = html1.xpath('//div[@class="doulist-item"]')

    with open('doubanMovie0.csv', 'a', newline='', encoding='utf-8') as f:
        csvwriter = csv.writer(f, dialect='excel')

        for sech in h:
            title = sech.xpath('.//div[@class="title"]/a/text()')
            linkURL = sech.xpath('.//div[@class="title"]/a/@href')
            abstract = sech.xpath('.//div[@class="abstract"]/descendant-or-self::text()')
            # Extract the list creator's comment; its position in the markup is unusual
            comment = sech.xpath('.//div[@class="ft"]//blockquote[@class="comment"]/span/following-sibling::text()')

            if not title or not linkURL:
                print(f"Missing title or linkURL in: {sech}")
                continue

            director, cast, year, country = "", "", "", ""
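            for line in abstract:  # scan the abstract text for director, cast, year, and country
                # Split on the label without a trailing space, so a line such as
                # '导演:X' cannot raise an IndexError
                if '导演:' in line:
                    director = line.split('导演:', 1)[1].strip()
                elif '主演:' in line:
                    cast = line.split('主演:', 1)[1].strip()
                elif '年份:' in line:
                    year = line.split('年份:', 1)[1].strip()
                elif '制片国家/地区:' in line:
                    country = line.split('制片国家/地区:', 1)[1].strip()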

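            # Pad the per-item lists to a common length so zip() keeps them aligned;
            # a list whose length differs from max_len is replaced by blanks
            max_len = max(len(title), len(linkURL))
            title = title if len(title) == max_len else [''] * max_len
            linkURL = linkURL if len(linkURL) == max_len else [''] * max_len
            comment = comment if len(comment) == max_len else [''] * max_len

            # Use a single row variable so the loop does not rebind f, the open file handle
            for row in zip(title, [year] * max_len, [country] * max_len, comment,
                           [cast] * max_len, [director] * max_len, linkURL):
                row = [field.strip() for field in row]  # drop surrounding whitespace and newlines
                print(*row)
                csvwriter.writerow(row)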

    time.sleep(2)
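
The label-splitting loop in the script can also be written as a small standalone helper, which makes it easier to test without touching the network. This is only a sketch: parse_abstract and LABELS are names introduced here, and the '类型' (genre) label is an assumption, since the script above does not extract genre.

```python
# Labels as they appear in the script, plus '类型' (genre), which is assumed here.
LABELS = {
    '导演': 'director',
    '主演': 'cast',
    '类型': 'genre',
    '制片国家/地区': 'country',
    '年份': 'year',
}


def parse_abstract(lines):
    """Map labelled abstract lines such as '导演: X' to a dict of fields."""
    info = {field: '' for field in LABELS.values()}
    for line in lines:
        # Split at the first colon only, so colons inside the value survive.
        label, sep, value = line.strip().partition(':')
        field = LABELS.get(label.strip())
        if sep and field:
            info[field] = value.strip()
    return info


# Made-up text nodes in the shape the abstract XPath would return.
sample = [
    '\n  导演: Director A',
    '\n  主演: Actor B / Actor C',
    '\n  类型: 剧情',
    '\n  制片国家/地区: 美国',
    '\n  年份: 1994\n',
    '\n',
]
info = parse_abstract(sample)
print(info)
```

Unknown labels and whitespace-only nodes are simply skipped, and every field defaults to an empty string, so the CSV row always has the same number of columns.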

The final scraped result: