Scraping the Douban Movie Top 250 with a Python Crawler

Article link: https://blog.csdn.net/z2141830440/article/details/145039721

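Task requirements

Douban Movie Top 250 is a ranking that Douban compiles from ratings and review popularity. The site lists classic films of many genres and countries, together with user ratings and reviews. URL: https://movie.douban.com/top250
Starting from that address, write a web crawler in Python that collects the following data:
1. Movie title (the Chinese and English titles, excluding "also known as" aliases)
2. Director
3. Country/region of production
4. Rating
5. Number of raters
6. Percentage share of each rating, from 5 stars down to 1 star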

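Network requests and page parsing

HTTP request library (requests):
- requests.get(url, headers=HEADERS) sends an HTTP GET request to the given URL.
- HEADERS sets a browser-like User-Agent so the site is less likely to reject the request.
HTML parsing library (BeautifulSoup):
- Create a parser: BeautifulSoup(response.text, 'html.parser').
- Look up page elements:
  - find: returns the first matching HTML element, such as a div or span.
  - find_all: returns every matching HTML element.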
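
As a minimal sketch of the difference between find and find_all, the snippet below parses a small hand-written HTML string rather than a live response, so it runs without network access. The markup and variable names are made up for illustration and are not Douban's actual page.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration, not real Douban HTML).
SAMPLE_HTML = """
<div id="list">
  <span class="name">First</span>
  <span class="name">Second</span>
  <p>Other text</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching element, or None when nothing matches.
first_span = soup.find("span")
missing = soup.find("table")

# find_all returns a list-like result containing every match.
all_spans = soup.find_all("span")
names = [span.text for span in all_spans]

print(first_span.text)  # First
print(missing)          # None
print(names)            # ['First', 'Second']
```

Because find can return None, chaining .text directly onto it raises AttributeError when the element is absent, which is worth keeping in mind when a page layout changes.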

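Parsing the HTML structure

Filtering by class name:
- soup.find_all("div", class_="item") finds every movie entry.
- item.find("span", class_="title").text.strip() gets the movie title.
Reading attribute values:
- item.find("a")["href"] gets the link to the detail page.
Handling nested elements:
- The rating breakdown and the number of raters are extracted by drilling down, level by level, to the element that holds them.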
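
The three lookups above can be tried on a simplified stand-in for one list entry. The sample markup below is an assumption that only mimics the class names used in this article; the real page contains more elements, and the link is a placeholder.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one movie entry (not the real page).
SAMPLE_ITEM = """
<div class="item">
  <a href="https://example.com/subject/1/">
    <span class="title">中文名</span>
    <span class="title">&nbsp;/&nbsp;English Title</span>
  </a>
  <div class="star">
    <span class="rating_num">9.0</span>
    <span>12345人评价</span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_ITEM, "html.parser")
item = soup.find("div", class_="item")

# Filter by class name: every title span inside this entry,
# with the leading "/" separator and surrounding whitespace removed.
titles = [s.text.strip().lstrip("/").strip()
          for s in item.find_all("span", class_="title")]

# Attribute access: tag["href"] reads the attribute value.
link = item.find("a")["href"]

# Nested lookup: narrow to the star block first, then take its last span.
votes_text = item.find("div", class_="star").find_all("span")[-1].text.strip()
votes = int(votes_text.replace("人评价", ""))

print(titles)  # ['中文名', 'English Title']
print(link)    # https://example.com/subject/1/
print(votes)   # 12345
```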

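Delay control and anti-scraping measures

Simulating a user:
- Set User-Agent so the request looks like it comes from a browser.
Delays:
- time.sleep(2) pauses for 2 seconds after each movie is scraped, which lowers the risk of being blocked.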
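
One possible variation on the fixed 2-second pause is to add a small random jitter so the requests are not perfectly evenly spaced. This is a sketch and not part of the original program: the helper name polite_sleep and its parameters are hypothetical.

```python
import random
import time


def polite_sleep(base=2.0, jitter=1.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    `sleep` is injectable so the function can be exercised without waiting.
    Returns the delay that was used.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


# Example: record the requested delay instead of actually sleeping.
recorded = []
used = polite_sleep(2.0, 1.0, sleep=recorded.append)
print(2.0 <= used <= 3.0)  # True
```

In the scraper, calling polite_sleep() where time.sleep(2) is called would keep the same minimum pause while varying the interval between requests.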

Complete code

import json
import time

import requests
from bs4 import BeautifulSoup

# User agent that makes requests look like they come from a browser
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Base URL
BASE_URL = "https://movie.douban.com/top250"

# List that accumulates one dict per movie
movie_data = []

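# Parse a movie's detail page
def parse_detail_page(url):
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # HTTP errors are raised here and handled by the caller
    soup = BeautifulSoup(response.text, 'html.parser')
    # Director and country/region of production
    info = soup.find("div", id="info").text.strip()
    director = ""
    country = ""

    for line in info.split("\n"):
        # Split on the first colon only, so a value that contains ":" stays intact
        if "导演" in line and ":" in line:
            director = line.split(":", 1)[1].strip()
        if "制片国家/地区" in line and ":" in line:
            country = line.split(":", 1)[1].strip()

    # Percentage share of each star rating
    ratings = soup.find_all("span", class_="rating_per")
    rating_percentages = [r.text.strip() for r in ratings] if ratings else ["0%"] * 5

    return director, country, rating_percentages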

# Scrape the Douban Top 250 list
def scrape_douban_top250():
    for start in range(0, 250, 25):  # 25 movies per page, 10 pages in total
        url = f"{BASE_URL}?start={start}"
        response = requests.get(url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")

        # Find every movie entry on the page
        items = soup.find_all("div", class_="item")

        for item in items:
            # Movie title: the first title span is the Chinese title; the second,
            # when present, holds the other title prefixed with "/" and a
            # non-breaking space, which are stripped here
            titles = item.find_all("span", class_="title")
            title = titles[0].text.strip()
            english_title = titles[1].text.strip().lstrip("/").strip() if len(titles) > 1 else ""
            movie_name = f"{title} / {english_title}" if english_title else title
            # Rating
            rating = item.find("span", class_="rating_num").text.strip()
            # Number of raters: the last span in the star block reads like "12345人评价"
            people = item.find("div", class_="star").find_all("span")[-1].text.strip()
            num_reviews = int(people.replace("人评价", ""))
            # Link to the detail page
            detail_link = item.find("a")["href"]

            # Visit the detail page for the director, country and rating breakdown
            try:
                director, country, rating_percentages = parse_detail_page(detail_link)
            except Exception as e:
                print(f"Error parsing detail page {detail_link}: {e}")
                # Fall back to placeholders ("暂无" means "not available")
                director, country, rating_percentages = "暂无", "暂无", ["0%"] * 5

            # Append this movie's record; the keys are the Chinese field names
            # (title, director, country/region, rating, raters, rating breakdown)
            movie_data.append({
                "电影名": movie_name,
                "导演": director,
                "制片国家/地区": country,
                "评分": rating,
                "参评人数": num_reviews,
                "评分占比": rating_percentages
            })

            # Print the current movie to the console
            print(f"Scraped: {movie_name}")
            print(f"  Director: {director}")
            print(f"  Country/region: {country}")
            print(f"  Rating: {rating}")
            print(f"  Number of raters: {num_reviews}")
            print(f"  Rating breakdown: {rating_percentages}")
            print("-" * 50)

            # Pause to reduce the risk of being blocked
            time.sleep(2)

        print(f"Page {start // 25 + 1} completed.")

# Save the collected data to a JSON file
def save_to_json():
    with open("douban_top250.json", "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the output file
        json.dump(movie_data, f, ensure_ascii=False, indent=4)
    print("Data saved to douban_top250.json.")

# Entry point
if __name__ == "__main__":
    scrape_douban_top250()
    save_to_json()
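
The program stores the rating breakdown as strings such as "80.0%". As a hedged follow-up sketch, the helpers below show one way to read the saved file back and turn those strings into numbers. The function names load_movies and percentages_to_floats are hypothetical, and the sample record is hand-written with made-up values in the same shape as the scraper's output.

```python
import json


def percentages_to_floats(percentages):
    """Convert strings such as "85.1%" into floats such as 85.1."""
    return [float(p.strip().rstrip("%")) for p in percentages]


def load_movies(path="douban_top250.json"):
    """Load the list of movie records written by save_to_json()."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hand-written record in the scraper's output shape (values are made up).
sample = {
    "电影名": "示例电影",
    "评分": "9.0",
    "评分占比": ["80.0%", "15.0%", "4.0%", "0.5%", "0.5%"],
}
shares = percentages_to_floats(sample["评分占比"])
print(shares)       # [80.0, 15.0, 4.0, 0.5, 0.5]
print(sum(shares))  # 100.0
```

After a real run, load_movies() would return the full list, and percentages_to_floats could be applied to each record's "评分占比" field.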