Douban Web Crawler

最新推荐文章于 2024-09-14 08:25:16 发布

1atte_momo

Article tags: python, crawler, programming language

Article link: https://blog.csdn.net/m0_64505923/article/details/134857627

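Implementing a Web Crawler in Python

1. Basic Principles of Web Programming

Web programming is the process of building and maintaining websites, web pages, and internet applications. It combines several languages and technologies to implement a page's design, functionality, and interactivity. The basic building blocks are as follows:

HTML (HyperText Markup Language): HTML is the basic building block of a web page and defines its structure and content. An HTML document consists of elements that define headings, paragraphs, images, links, and so on. Elements are delimited by tags written in angle brackets (<>), which mark where each element starts and ends.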
```
<html>
  <head>
    <title>标题:html_test</title>
  </head>
  <body>
    <h1>欢迎访问!</h1>
    <p>~这是一个简单的网页示例~</p>
  </body>
</html>
```
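CSS (Cascading Style Sheets): CSS defines the appearance and style of a web page. It controls how HTML elements are rendered, including fonts, colors, sizes, and layout. CSS separates a page's content from its presentation, which makes styles easier to manage and change.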
```
body {
  font-family: Arial, sans-serif;
  background-color: #f0f0f0;
}

h1 {
  color: blue;
}

p {
  font-size: 16px;
}
```
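JavaScript: JavaScript is a client-side scripting language used to implement interaction and dynamic behavior on a page. It can handle user input, respond to events (such as a button click or a form submission), manipulate the DOM (Document Object Model) to change page content, and communicate with a server to fetch or send data. JavaScript can be embedded in an HTML file or referenced as an external script file.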
```
function greet() {
  alert('Hello, World!');
}
```
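Server-side programming: Besides client-side programming (in JavaScript and similar languages), web programming also involves server-side code. The server handles requests from clients, which usually covers data storage and retrieval, user authentication, and business logic. Common server-side languages and runtimes include PHP, Python, Ruby, Java, and Node.js.
Databases: Most web applications need to store and retrieve data, which usually means talking to a database. Widely used database systems include MySQL, PostgreSQL, MongoDB, and SQLite.
API (Application Programming Interface): An API lets different applications communicate with each other. In web programming, APIs are typically used to exchange data with third-party services or data sources. REST and GraphQL are two common API styles.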

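2. How a Network Request Works

Network access is the process of reaching resources or services on a remote server over the internet. It involves several steps, including domain name resolution, connection setup, data transfer, and connection teardown. The basic sequence is:

URL (Uniform Resource Locator) parsing: The user enters an address in the browser, which parses the URL and extracts the scheme, host name (domain), port, and path.
DNS resolution: The browser sends a resolution request to the local DNS server, which looks up the IP address for the domain and returns it to the browser.
TCP connection setup: Using the IP address and port, the browser opens a connection to the server over TCP/IP. This is done with a three-way handshake, which confirms that both sides are ready to exchange data.
HTTP request: Once the TCP connection is established, the browser sends an HTTP request made up of a request method (GET, POST, etc.), request headers, and an optional request body. The headers carry information about the client, such as the browser type and the compression algorithms it supports.
Server processing: The server receives the HTTP request, processes it according to the headers and body, and prepares the response data.
HTTP response: The server returns the result as an HTTP response made up of a status code, response headers, and a response body. The headers describe the response, for example its content type, encoding, and cache control.
Receiving the response: The browser receives the HTTP response and handles it according to the headers, for example by parsing the body and rendering the page.
Closing the TCP connection: When communication is finished, the browser and server close the TCP connection and release its resources, unless the connection is kept alive for reuse.
Page rendering: The browser parses the HTML, CSS, and JavaScript it received, builds the DOM tree and the CSSOM tree, runs the scripts in its JavaScript engine, and finally renders the page the user sees.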
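
The first and fourth steps above can be sketched with Python's standard library alone. The snippet below is only an illustration: `describe_url` and `build_get_request` are hypothetical helper names, not part of any library, and the request text is a bare minimum rather than what a real browser sends.

```python
from urllib.parse import urlsplit

# Default ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into the parts a client needs before it can connect."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path or "/",
        "query": parts.query,
    }


def build_get_request(url):
    """Return the raw text of a minimal HTTP/1.1 GET request for `url`."""
    info = describe_url(url)
    target = info["path"] + ("?" + info["query"] if info["query"] else "")
    lines = [
        f"GET {target} HTTP/1.1",   # request line: method, target, version
        f"Host: {info['host']}",    # the Host header is mandatory in HTTP/1.1
        "Connection: close",        # ask the server to close the TCP connection
        "",                         # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines)
```

Calling `describe_url("https://quotes.toscrape.com/page/1/")` gives the scheme, host, port, and path that the DNS and TCP steps work with.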

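3. Basic Principles of a Crawler

A web crawler is a program that collects information from the internet automatically. Its operation can be broken down into the following key steps:

Choosing seeds: The crawler starts from one or more seed URLs. These can be a site's home page, an index page for a topic, or any other link of interest.
Sending HTTP requests: The crawler requests each seed URL over HTTP or HTTPS to obtain the page's HTML. This is usually done with an HTTP library such as Requests in Python.
Receiving responses: The server answers each request with a response consisting of headers and a body. The crawler parses the response to get at the information it needs.
Parsing HTML: The crawler parses the HTML, usually with a library such as Beautiful Soup or lxml in Python, turning the document into a data structure such as a DOM tree that is easy to traverse and query.
Extracting data: The crawler pulls the required data out of the document. This can include text, images, links, and metadata, and is usually done with selectors (CSS selectors or XPath) that locate specific HTML elements.
Storing data: The extracted data is saved to local files, a database, or another storage system for later use or analysis.
Following links: While processing a page, the crawler also extracts new links and adds them to a queue of URLs to visit. This lets it move beyond the seeds and reach more pages.
Applying crawl rules: A crawler normally follows rules that control its behavior, such as a depth limit (how many link levels to follow), a rate limit (request frequency), and exclusion rules (URLs or domains not to visit). These rules keep it from overloading the target site or breaking the law.
Looping: The crawler repeatedly takes a URL from the queue, sends the request, receives the response, parses and extracts data, and enqueues new links, until a stop condition is met (for example a maximum page count or depth).
Handling errors: The crawler has to cope with network errors, dead links, anti-crawling measures, and similar problems. Strategies include retrying failed requests and getting past CAPTCHAs or login pages.

In short, a crawler automates what a person does when visiting pages in order to fetch and extract information from the internet. Building an efficient and reliable crawler also means thinking about performance, concurrency, and data storage.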
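
The loop described above (seed, fetch, parse, follow links, stop condition) can be sketched as follows. This is a minimal illustration, not the code used later in this post: it parses links with the standard-library `html.parser` instead of Beautiful Soup, and `fetch` is a caller-supplied function (an assumption made here so the loop can be tried without network access).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, max_pages=10, max_depth=2):
    """Breadth-first crawl starting at `seed`.

    `fetch(url)` must return the page HTML as a string, or None on failure.
    Returns the list of URLs that were fetched successfully, in visit order.
    """
    queue = deque([(seed, 0)])   # URLs waiting to be visited, with their depth
    seen = {seed}                # every URL ever queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:         # failed request: skip this URL
            continue
        visited.append(url)
        if depth >= max_depth:   # depth limit: do not follow links any further
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

For a quick trial, `fetch` can be the `get` method of a dict that maps URLs to HTML strings; with real pages it would wrap an HTTP call and return None on errors.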

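4. Advanced Crawling Techniques

Advanced crawling techniques are the more sophisticated methods and strategies a crawler uses to deal with complex situations and requirements. They help it fetch data more efficiently, get past anti-crawling measures, improve data quality, and stay compliant. Some of them are:

User-agent pool: Sites often identify crawlers by their User-Agent header. To avoid being flagged, a crawler can keep a pool of user agents and rotate the User-Agent periodically to imitate different kinds of visitors.
IP proxy pool: Some sites throttle or block frequent requests from a single IP address. A proxy pool rotates requests across several IP addresses to soften this limit. Proxies must be used with care to avoid breaking the law or the site's terms of use.
CAPTCHA recognition: Some sites use CAPTCHAs to keep bots out. An advanced crawler can integrate CAPTCHA recognition to recognize and solve them automatically and continue fetching data.
Login and session management: Some sites require a login before content can be viewed. An advanced crawler can simulate the login, manage session state, and access restricted resources afterwards. This usually relies on the Cookie header of HTTP requests to maintain the session.
Data parsing and cleaning: An advanced crawler can use more elaborate parsing techniques, such as regular expressions, XPath, CSS selectors, or natural language processing (NLP), to extract and clean data so that it is accurate and usable.
Distributed crawling: To raise throughput and handle large volumes of data, the work can be spread over several crawler nodes that cooperate. A common approach is to pair Scrapy with an extension such as scrapy-redis, which shares the request queue across nodes.
Countering anti-crawling measures: Sites may use dynamically loaded content, asynchronous requests, rate limits, and other defenses. An advanced crawler needs matching strategies, such as driving a headless browser to imitate user behavior and to handle JavaScript rendering.
Scheduling: A crawler can run on a schedule to refresh data regularly or to monitor a site for changes. Tools such as Celery or APScheduler can do this.
Data storage and processing: An advanced crawler has to store and process data efficiently. This may involve databases, message queues, and distributed storage, together with backup and recovery strategies.
Compliance and ethics: Compliance and ethics become more important as a crawler grows more capable. Developers must make sure their crawler obeys laws and regulations, follows the site's terms of use, and respects the rights of the site owner.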
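
Two of the points above, the user-agent pool and the compliance check, are easy to sketch with the standard library. The snippet is an illustration under stated assumptions: the User-Agent strings are placeholders, `header_cycle` and `allowed_by_robots` are hypothetical helper names, and the robots.txt text is passed in as a string so the check can be tried offline.

```python
import itertools
from urllib.robotparser import RobotFileParser

# Placeholder User-Agent strings; a real pool would hold genuine browser values.
USER_AGENTS = [
    "example-agent/1.0",
    "example-agent/2.0",
    "example-agent/3.0",
]


def header_cycle(user_agents):
    """Yield request headers forever, rotating through the User-Agent pool."""
    for user_agent in itertools.cycle(user_agents):
        yield {"User-Agent": user_agent}


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Each call to `next()` on the generator returned by `header_cycle(USER_AGENTS)` gives the headers for the next request. Checking `allowed_by_robots` before queuing a URL is one simple way to honor a site's exclusion rules.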

5. Lab 2 Projects

5.1 Annotating the Base Code

Code

# import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Skip SSL certificate verification (insecure; acceptable only for this sandbox site)
context = ssl._create_unverified_context()

all_quotes = []

url = 'https://quotes.toscrape.com/page/1/'  # target URL
# Open the URL in a browser to inspect the page source
page = urlopen(url, context=context)  # request the page
# Build a BeautifulSoup tree from the response with the built-in HTML parser
soup = BeautifulSoup(page, 'html.parser')
# Find every quote block
quotes = soup.find_all('div', class_='quote')
# Line-by-line comments:
for quote in quotes:
    # Text of the current quote
    text = quote.find('span', class_='text').text
    # Author of the current quote
    author = quote.find('small', class_='author').text
    # Tag links of the current quote
    tags = quote.find('div', class_='tags').find_all('a')
    # List that will hold the tag names
    tags_list = []
    # For each tag element:
    for tag in tags:
        # Append the tag name to the list created above
        tags_list.append(tag.text)
    # One record holds the quote text, the author, and the tags
    single_quote = [text, author, tags_list]
    # Collect every cleaned record in all_quotes
    all_quotes.append(single_quote)
# Show the results once, after the loop has finished
print(all_quotes)

5.2 Crawling the Quotes Site

Task requirements

Extend the code to crawl 10 pages of the same site and store everything in a single CSV file.

Code

#import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

context = ssl._create_unverified_context() #表示忽略未经核实的SSL证书认证

all_quotes = []

url = f'https://quotes.toscrape.com/page/1/' # 目标网址
# 用浏览器打开网址，查看源代码
page = urlopen(url,context = context) # 请求网页信息
# 将网页信息组合为BeautifulSoup结构，利用HTML解析器
soup = BeautifulSoup(page, 'html.parser') 
# 搜寻quote标记
quotes = soup.find_all('div', class_='quote')
# 逐行注释代码：
for quote in quotes:
    # 循环每一个元素 找到每个名言的内容
    text = quote.find('span', class_='text').text 
    # 循环每一个元素 找到每个名言的作者信息
    author = quote.find('small', class_='author').text
    # 循环每一个元素 找到每个名言的标签信息 存在列表中
    tags = quote.find('div', class_='tags').find_all('a')
    # 初始化存储标签信息的列表
    tags_list = []
    # 对每个标签元素：
    for tag in tags:
        # 用append方法 把标签信息存在之前初始化的列表中
        tags_list.append(tag.text)
    # 每个引用的内容信息包括 名言文本 作者 所属标签
    single_quote = [text, author, tags_list]
    # 把所有爬取到并且整理好的信息存储在all_quotes
    all_quotes.append(single_quote)
# Show the results once, after the loop has finished
print(all_quotes)
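
The listing above fetches only page 1 and prints the result, while the task asks for 10 pages in one CSV file. A minimal sketch of the missing part could look like this. It is not the author's original solution: `page_urls`, `write_quotes_csv`, and `crawl_to_csv` are hypothetical names, tags are joined with `|` as an arbitrary choice, and the per-page parsing is passed in as a function so that the `find_all` logic shown above can be reused unchanged.

```python
import csv
import io

BASE_URL = "https://quotes.toscrape.com/page/{}/"


def page_urls(first=1, last=10):
    """Build the URL of every page from `first` to `last`, inclusive."""
    return [BASE_URL.format(number) for number in range(first, last + 1)]


def write_quotes_csv(rows, file_obj):
    """Write (text, author, tags) records to an open text file as CSV."""
    writer = csv.writer(file_obj)
    writer.writerow(["text", "author", "tags"])
    for text, author, tags in rows:
        # A CSV cell holds one string, so the tag list is joined with "|".
        writer.writerow([text, author, "|".join(tags)])


def crawl_to_csv(parse_page, file_obj, first=1, last=10):
    """Collect the records of every page, then write them all to one CSV.

    `parse_page(url)` must return a list of (text, author, tags) records,
    for example by wrapping the urlopen/BeautifulSoup code shown above.
    Returns the number of records written.
    """
    all_rows = []
    for url in page_urls(first, last):
        all_rows.extend(parse_page(url))
    write_quotes_csv(all_rows, file_obj)
    return len(all_rows)
```

With a real parser, the call would be wrapped in `with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:` and `f` passed as `file_obj`; an `io.StringIO()` works for a dry run.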

5.3 Simulating a Member Login

Code

import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # 导入By模块

# 打开Firefox浏览器
driver = webdriver.Firefox()

# 指定加载页面
driver.get("http://quotes.toscrape.com/login")

# 等待页面加载完成
time.sleep(2)

# 定位用户名和密码输入框，并输入用户名和密码
username_input = driver.find_element(By.XPATH, '//*[@id="username"]')  # 使用By.XPATH定位
password_input = driver.find_element(By.XPATH, '//*[@id="password"]')  # 使用By.XPATH定位

username_input.send_keys('yuanjianying')  # 用户名
password_input.send_keys('yjy')  # 密码

# 提交登录
login_button = driver.find_element(By.XPATH, '/html/body/div/form/input[2]')  # 使用By.XPATH定位
login_button.click()

# 停顿15s，可以看到登录后的效果
time.sleep(15)

# 关闭浏览器
driver.quit()

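6. Sandbox Site Crawl Results

6.1 Sequential Crawl: Results and Performance Analysis

Task requirements

Crawl every field for each book: cover image, price, rating, title, and so on. Download the images into a folder, and save the text fields and image paths to a CSV file.

URL: All products | Books to Scrape - Sandbox (http://books.toscrape.com)

Note: the site has 50 pages of 20 books each, and all of them must be crawled.

Code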

import os
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import time

start = time.time()
# 创建目录用于存储图片
if not os.path.exists('book_images'):
    os.makedirs('book_images')

# 打开CSV文件以写入数据
with open('book_data.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Title', 'Price', 'Rating', 'Image URL'])

    # 循环遍历不同页面
    base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
    num_pages = 50  # 总共有50页
    for page_num in range(1, num_pages + 1):
        print(f"进入第{page_num}页")
        url = base_url.format(page_num)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # 查找书籍信息
        books = soup.find_all('article', class_='product_pod')
        for book in books:
            title = book.h3.a.attrs['title']
            title = title.replace('\\', ' ').replace('/', ' ').replace(':', ' ').replace('*', ' ').replace('?', ' ').replace('"', ' ').replace('<', ' ').replace('>', ' ').replace('|', ' ')
            price = book.select('div p.price_color')[0].text
            rating = book.p.attrs['class'][1]

            # 获取书籍图片URL
            img_url_relative = book.find('img')['src']
            img_url = urljoin(url, img_url_relative)

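            # Download the image into book_images, keeping the extension from its URL
            img_ext = os.path.splitext(img_url)[1] or '.jpg'
            img_filename = os.path.join('book_images', f'{title}{img_ext}')
            img_response = requests.get(img_url)
            with open(img_filename, 'wb') as img_file:
                img_file.write(img_response.content)
            # Write the record to the CSV file
            writer.writerow([title, price, rating, img_filename])
        # Report once per page, after all of its images are saved
        print(f"第{page_num}页的图片已下载完毕")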


print("=== 已完成 ===")

end = time.time()
run = end - start
print(f"单线程执行时间是{run}秒")
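
The listing above cleans the title with a chain of `replace` calls before using it as a file name. The same idea can be sketched more compactly with a regular expression. `safe_filename` and `image_path` are hypothetical helpers introduced here for illustration, and the 100-character limit is an arbitrary assumption.

```python
import re
from pathlib import Path
from urllib.parse import urljoin, urlsplit

# Characters that are not allowed in Windows file names.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, max_len=100):
    """Replace forbidden characters with spaces and collapse repeated spaces."""
    cleaned = _FORBIDDEN.sub(" ", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[:max_len]


def image_path(page_url, img_src, title, folder="book_images"):
    """Return (absolute image URL, local path) for one book cover."""
    img_url = urljoin(page_url, img_src)            # resolve "../media/..." links
    suffix = Path(urlsplit(img_url).path).suffix or ".jpg"
    return img_url, str(Path(folder) / (safe_filename(title) + suffix))
```

Taking the extension from the image URL keeps the saved file consistent with what the server actually delivers.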

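Results

CSV file
Image folder

6.2 Advanced Crawl: Results and Performance Analysis

Task requirements

Crawl every field for each book: cover image, price, rating, title, and so on.

URL: All products | Books to Scrape - Sandbox (http://books.toscrape.com)

Note: the site has 50 pages of 20 books each, and all of them must be crawled.

Test the following techniques:

(1) Multiprocessing and multithreading

(2) Database interaction (redis/mongodb/mysql)

(3) Virtual addresses (proxy IPs)

Code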

import os
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin
import time

start = time.time()
# 创建目录用于存储图片
if not os.path.exists('book_images'):
    os.makedirs('book_images')

# 打开CSV文件以写入数据
with open('book_data.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Title', 'Price', 'Rating', 'Image URL'])

    # 循环遍历不同页面
    base_url = 'http://books.toscrape.com/catalogue/page-{}.html'
    num_pages = 50  # 总共有50页
    for page_num in range(1, num_pages + 1):
        print(f"进入第{page_num}页")
        url = base_url.format(page_num)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # 查找书籍信息
        books = soup.find_all('article', class_='product_pod')
        for book in books:
            title = book.h3.a.attrs['title']
            title = title.replace('\\', ' ').replace('/', ' ').replace(':', ' ').replace('*', ' ').replace('?', ' ').replace('"', ' ').replace('<', ' ').replace('>', ' ').replace('|', ' ')
            price = book.select('div p.price_color')[0].text
            rating = book.p.attrs['class'][1]

            # 获取书籍图片URL
            img_url_relative = book.find('img')['src']
            img_url = urljoin(url, img_url_relative)

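            # Download the image into book_images, keeping the extension from its URL
            img_ext = os.path.splitext(img_url)[1] or '.jpg'
            img_filename = os.path.join('book_images', f'{title}{img_ext}')
            img_response = requests.get(img_url)
            with open(img_filename, 'wb') as img_file:
                img_file.write(img_response.content)
            # Write the record to the CSV file
            writer.writerow([title, price, rating, img_filename])
        # Report once per page, after all of its images are saved
        print(f"第{page_num}页的图片已下载完毕")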


print("=== 已完成 ===")

end = time.time()
run = end - start
print(f"单线程执行时间是{run}秒")

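Results

The sandbox bookstore has 50 pages with 20 books per page, for a total of 1000 books.

Program output:

MySQL output:

use bookstore;
truncate table books; -- clear earlier rows but keep the table structure
-- wait for the program to finish
select * from books;

Local folder output:

Performance analysis

The crawler imports threading and uses multiple threads: with 50 pages on the sandbox bookstore, one thread is started per page, and the rest of the program runs only after every thread has finished.

ThreadPoolExecutor is used so that several images can be downloaded at the same time.

The whole run took 259 s; the figure depends on network conditions.

Run time varies, so it is worth measuring it over several runs.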
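
The ThreadPoolExecutor approach mentioned in the performance analysis can be sketched as follows. This is not the code that produced the 259 s figure: `download_all` is a hypothetical helper, the pool size of 8 is an arbitrary assumption, and `download` is supplied by the caller (in practice a function that fetches one image and writes it to disk).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, download, max_workers=8):
    """Run `download(url)` for every URL on a pool of worker threads.

    Returns two dicts: results keyed by URL, and exceptions keyed by URL,
    so that one failed download does not abort the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every job first, remembering which future belongs to which URL.
        futures = {pool.submit(download, url): url for url in urls}
        # Handle each job as soon as it finishes, in completion order.
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Threads suit this job because image downloads spend most of their time waiting on the network. Leaving the `with` block waits for all workers, which plays the same role as joining the per-page threads described above.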

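7. Douban Review Crawl Results

7.1 Sequential Crawl: Results and Performance Analysis (Long Reviews)

Task requirements

Take the top-rated Korean films from the 2022 annual list. That list contains only 5 films, so the top-rated Japanese films of 2022 are added for a total of 10 films. Crawl 10 pages of reviews for each film and save them to a CSV file.

Code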

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

def login_with_cookies(driver):

    # 添加新的Cookie
    cookies = {

    }

    # 访问网站
    driver.get("https://movie.douban.com/")
    for key, value in cookies.items():
        driver.add_cookie({'name': key, 'value': value})


    # 可以看到已登录
    driver.refresh()


start = time.time()
# 创建一个浏览器实例
driver = webdriver.Firefox()
login_with_cookies(driver)



movie_links = [
    {'url': 'https://movie.douban.com/subject/35073886/?source=2022_annual_movie', 'name': '分手的决心 헤어질 결심'},
    {'url': 'https://movie.douban.com/subject/35160926/?source=2022_annual_movie', 'name': '狩猎 헌트'},
    {'url': 'https://movie.douban.com/subject/35441582/?source=2022_annual_movie', 'name': '6/45 육사오'},
    {'url': 'https://movie.douban.com/subject/30267287/?source=2022_annual_movie', 'name': '犯罪都市2 범죄도시2'},
    {'url': 'https://movie.douban.com/subject/35743103/?source=2022_annual_movie', 'name': '小说家的电影 소설가의 영화'},
    {'url': 'https://movie.douban.com/subject/35015968/?source=2022_annual_movie', 'name': '昨日的美食 电影版'},
    {'url': 'https://movie.douban.com/subject/35597426/?source=2022_annual_movie', 'name': '稍微想起一些'},
    {'url': 'https://movie.douban.com/subject/34809360/?source=2022_annual_movie', 'name': '在街上'},
    {'url': 'https://movie.douban.com/subject/34905647/?source=2022_annual_movie', 'name': '由宇子的天平'},
    {'url': 'https://movie.douban.com/subject/35372792/?source=2022_annual_movie', 'name': '老师，您能坐在我旁边吗？'}

]

with open('Korea_movies_2022_comments.csv', 'w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['电影名称','用户名', '评论时间', '评论内容', '评分等级', '点赞数','点踩数','回复数']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for movie_info in movie_links:
        print("===",movie_info['name'],"===")
        # 访问电影的评论页面
        driver.get(movie_info['url'])
        # 使用相对XPath来查找"更多影评"链接
        more_comments_link = driver.find_element(By.XPATH, "//a[contains(text(), '更多影评')]")
        more_comments_link.click()
        print("从第一页开始")


        for i in range(1,11):
            # 等待影评加载完成
            WebDriverWait(driver,
                          10).until(EC.presence_of_element_located((By.CLASS_NAME, 'review-list')))

            print("本页评论加载完成")
            
            movie_comment_page_source = driver.page_source

            # 使用BeautifulSoup解析HTML源代码
            soup = BeautifulSoup(movie_comment_page_source, 'html.parser')

            # 找到所有的影评块
            comment_blocks = soup.find_all('div', class_='main review-item')

            # 遍历每个短评块并提取信息
            for comment_block in comment_blocks:
                # 用户名
                username = comment_block.find('a', class_='name').text.strip()

                # 评论时间
                comment_time = comment_block.find('span', class_='main-meta').text.strip()
                # 评论内容
                short_content = comment_block.find('div', id=lambda x: x and x.startswith('review_') and x.endswith('_short'))

                
                # 找到包含评分星级的<span>标签
                rating_span = comment_block.find('span', class_=lambda x: x and 'allstar' in x)
                if rating_span:
                    # 获取<span>标签的class属性值
                    class_value = rating_span['class']

                    # 从class属性值中提取评分等级
                    rating = class_value[0].replace('allstar', '')
                else:
                    rating = "未评分"

                # 点赞数
                upvote = comment_block.find('a', class_='action-btn up').find('span', id=lambda x: x and x.startswith('r-useful_count')).text.strip()

                # 反对数
                downvote = comment_block.find('a', class_='action-btn down').find('span', id=lambda x: x and x.startswith('r-useless_count')).text.strip()

                # 回应数
                reply_count = comment_block.find('a', class_='reply').text.strip().replace('回应', '')


                # 找到评论内容块
                comment_content_block = comment_block.find('div', class_='review-short')
                if comment_content_block:
                    comment_content = comment_content_block.text.strip()

                # 如果有展开按钮，点击展开以获取完整评论
                WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), '展开')]")))

                expand_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), '展开')]")))
                driver.execute_script("arguments[0].scrollIntoView();", expand_button)
                expand_button.click()
                time.sleep(1)  # 等待评论加载
                
                # 获取展开后的页面源代码
                movie_comment_page_source_after_expand = driver.page_source
                soup_after_expand = BeautifulSoup(movie_comment_page_source_after_expand, 'html.parser')

                # 获取展开后的评论内容
                full_comment_block = soup_after_expand.find('div', {'data-author': username})
                if full_comment_block:
                    full_comment = full_comment_block.text.strip()
                    comment_content = full_comment

                # 将换行符替换为空格
                comment_content = comment_content.replace('\n', ' ')


                # 将信息写入CSV文件
                writer.writerow({'电影名称': movie_info['name'],'用户名': username, '评论时间': comment_time, '评论内容': comment_content, '评分等级': rating, '点赞数': upvote,'点踩数':downvote,'回复数':reply_count})

            print(f"==={movie_info['name']}的第{i}页已经爬取完毕===")

            # 寻找下一页短评的元素，并等待它可点击
            next_page = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "next"))
            )
            # 滚动到下一页元素位置
            driver.execute_script("arguments[0].scrollIntoView();", next_page)
            time.sleep(1)            
            # 然后再点击
            next_page.click()
end = time.time()
run = end - start
print(f"===全部完成，用时{run}s===")

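Results

The results are shown below.

Performance analysis

The run took about 120.92 seconds.

7.2 Advanced Crawl: Results and Performance Analysis (Long Reviews)

Task requirements

Take the top-rated Korean films from the 2022 annual list. That list contains only 5 films, so the top-rated Japanese films of 2022 are added for a total of 10 films. Crawl 10 pages of reviews for each film and save them to a CSV file. In addition, apply the following techniques:

(1) Multiprocessing and multithreading

(2) Database interaction (redis/mongodb/mysql)

(3) Virtual addresses (proxy IPs)

Likely difficulty: anti-crawling measures

Code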

create database Douban character set utf8;
use Douban;
select * from comments;
truncate comments; -- clear the table before each test run
ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import random
import time
import pymysql
import multiprocessing
import re


def remove_emoji(text):
    # 使用正则表达式移除表情符号
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def ip_random():
    #  developer tools -> Network -> Header -> Response header [raw]
    # 修改为自己浏览器的headers
    # Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0'}
    '''
    前版本代码
    API_link = 'https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=1&encryptParam=1Pofnwjjlmawt1HAVLj6tTxUJo8Ctj43ciCg9jb8LL9ZnRSnD61wqH3N2dPDshZ9%2B3kkTAQZ2MD%2FIbJREwKEhHndyjBrABrC6w9Kgqc82yZyTTUse5IUp%2B%2Bl1F%2FVhbLSlhmBtWyY09Uxpbc64mJ2a3wBfYbfJtr8aD2UGympj%2FHBd1HPaJFym4rBQnOJ7ZU9arf5KKS7AFD8J5Z1OJJ%2FWbhiKf6bPqRShUoN9E1%2FgqE%3D'
    response = requests.get(url=API_link, headers=headers)
    # IP地址拆分
    IP_list = response.text.split("\n")
    proxy_list = []
    '''
    IP_list = [
        "114.232.109.213",
        "113.124.216.126",
        "60.170.204.30",
        "183.236.232.160",
        "182.34.101.96"
    ]
    proxy_list = []
    # prefix = 'https:' 前版本代码
    prefix = 'http:'
    for ip in IP_list:
        if ip != '':
            # 添加网页前缀
            proxy_list.append(prefix + ip)
    proxy = random.choice(proxy_list)
    return proxy

def login_with_cookies(driver):
    # 获取代理IP
    proxy = ip_random()
    print("使用代理IP:", proxy)
    # 设置代理
    webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
        "httpProxy": proxy,
        "ftpProxy": proxy,
        "sslProxy": proxy,
        "proxyType": "MANUAL"
    }

    # 访问网站
    driver.get("https://movie.douban.com/")

    # 添加新的Cookie
    cookies = {


    }

    for key, value in cookies.items():
        driver.add_cookie({'name': key, 'value': value})

    # 可以看到已登录
    driver.refresh()

# 链接 MySQL
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='20031008',
    database='Douban',
    charset='utf8mb4'
)
# 创建游标
cursor = conn.cursor()


# 将数据插入数据库
def insert_data_to_mysql(movie_info, username, comment_time, comment_content, rating, upvote,downvote, reply_count):
    try:
        # SQL插入语句
        sql = "INSERT INTO comments (movie_name, username, comment_time, comment_content, rating, upvote,downvote, reply_count) VALUES (%s, %s, %s, %s, %s, %s,%s,%s)"
        # 执行SQL插入操作
        cursor.execute(sql, (movie_info, username, comment_time, comment_content, rating, upvote,downvote, reply_count))
        # 提交更改
        conn.commit()
    except Exception as e:
        # 如果出现错误，回滚更改
        print(f"插入数据时出错: {str(e)}")
        print(username)
        conn.rollback()

# 创建代理IP池
proxy_pool = []

def init_proxy_pool():
    for _ in range(10):  # 创建10个代理IP
        proxy = ip_random()
        proxy_pool.append(proxy)

def get_next_proxy():
    if len(proxy_pool) == 0:
        init_proxy_pool()  # 如果代理池为空，重新初始化
    return proxy_pool.pop()  # 获取下一个代理IP，并从池中移除

def scrape_and_store(movie_info):
    # 创建一个浏览器实例
    driver = webdriver.Firefox()
    login_with_cookies(driver)
    try:
        # 获取代理IP
        proxy = get_next_proxy()
        print("使用代理IP:", proxy)

        # 设置代理
        webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
            "httpProxy": proxy,
            "ftpProxy": proxy,
            "sslProxy": proxy,
            "proxyType": "MANUAL"
        }

        # 获取电影评论页面的URL
        movie_url = movie_info['url']
        movie_name = movie_info['name']
        print("===", movie_name, "===")
        
        # 访问电影的评论页面
        driver.get(movie_url)
        driver.set_window_size(1920, 1080)  # 根据需要设置窗口大小
        # 使用相对XPath来查找
        more_comments_link = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, "//a[contains(text(), '更多剧评')]"))
        )
        more_comments_link.click()
        print("从第一页开始")

        for i in range(1,11):
            # 等待影评加载完成
            WebDriverWait(driver,
                          10).until(EC.presence_of_element_located((By.CLASS_NAME, 'review-list')))

            print("本页评论加载完成")
            
            movie_comment_page_source = driver.page_source

            # 使用BeautifulSoup解析HTML源代码
            soup = BeautifulSoup(movie_comment_page_source, 'html.parser')

            # 找到所有的影评块
            comment_blocks = soup.find_all('div', class_='main review-item')

            # 遍历每个短评块并提取信息
            for comment_block in comment_blocks:
                # 用户名
                username = comment_block.find('a', class_='name').text.strip()

                # 评论时间
                comment_time = comment_block.find('span', class_='main-meta').text.strip()
                # 评论内容
                short_content = comment_block.find('div', id=lambda x: x and x.startswith('review_') and x.endswith('_short'))

                
                # 找到包含评分星级的<span>标签
                rating_span = comment_block.find('span', class_=lambda x: x and 'allstar' in x)
                if rating_span:
                    # 获取<span>标签的class属性值
                    class_value = rating_span['class']

                    # 从class属性值中提取评分等级
                    rating = class_value[0].replace('allstar', '')
                else:
                    rating = "未评分"

                # 点赞数
                upvote_span = comment_block.find('a', class_='action-btn up').find('span', id=lambda x: x and x.startswith('r-useful_count'))
                try:
                    upvote = int(upvote_span.text.strip()) if upvote_span.text.strip() else 0
                except ValueError:
                    upvote = 0  

                # 反对数
                downvote_span = comment_block.find('a', class_='action-btn down').find('span', id=lambda x: x and x.startswith('r-useless_count'))
                try:
                    downvote = int(downvote_span.text.strip()) if downvote_span.text.strip() else 0
                except ValueError:
                    downvote = 0  
                

                # 回应数
                reply_count = comment_block.find('a', class_='reply').text.strip().replace('回应', '')


                # 找到评论内容块
                comment_content_block = comment_block.find('div', class_='review-short')
                if comment_content_block:
                    comment_content = comment_content_block.text.strip()
                try:
                    # 如果有展开按钮，点击展开以获取完整评论
                    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), '展开')]")))

                    expand_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), '展开')]")))
                    driver.execute_script("arguments[0].scrollIntoView();", expand_button)
                    expand_button.click()
                    time.sleep(0.3)  # 等待评论加载
                    
                    # 获取展开后的页面源代码
                    movie_comment_page_source_after_expand = driver.page_source
                    soup_after_expand = BeautifulSoup(movie_comment_page_source_after_expand, 'html.parser')

                    # 获取展开后的评论内容
                    full_comment_block = soup_after_expand.find('div', {'data-author': username})
                    if full_comment_block:
                        full_comment = full_comment_block.text.strip()
                        comment_content = full_comment
                except TimeoutException:
                    # 在超时后处理，例如记录一条消息
                    print("未找到展开按钮，跳过展开评论")
                # 将换行符替换为空格
                comment_content = comment_content.replace('\n', ' ')
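                # Drop characters that cannot be encoded, then strip emoji.
                # remove_emoji returns a new string, so its result must be assigned.
                username = username.encode('utf-8', 'ignore').decode('utf-8')
                username = remove_emoji(username)
                comment_content = comment_content.encode('utf-8', 'ignore').decode('utf-8')
                comment_content = remove_emoji(comment_content)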
                # 写入MySQL
                insert_data_to_mysql(movie_name, username, comment_time, comment_content, rating, upvote, downvote, reply_count)

            print(f"{movie_name}的第{i}页已经爬取完毕")

            # 寻找下一页短评的元素，并等待它可点击
            next_page = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "next"))
            )
            # 滚动到下一页元素位置
            driver.execute_script("arguments[0].scrollIntoView();", next_page)

            # 等待遮挡元素消失
            wait = WebDriverWait(driver, 10)
            wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, 'div.action.fixed-action')))

            # 等待下一页按钮可点击
            next_page = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "next")))
            next_page.click()
            # next_page = driver.find_element(By.CLASS_NAME, "next")
            # driver.execute_script("arguments[0].click();", next_page)


    except Exception as e:
        print(f"爬取电影评论时出错: {str(e)}")
    finally:
        # 关闭浏览器
        driver.quit()
        # 将代理IP放回池中，以便下次使用
        proxy_pool.append(proxy)


if __name__ == "__main__":
    
    start = time.time()
    # 创建多进程池
    pool = multiprocessing.Pool(processes=11)


    # 2022年度榜单
    movie_links = [
    {'url': 'https://movie.douban.com/subject/35314632/', 'name': '黑暗荣耀'},
    {'url': 'https://movie.douban.com/subject/35322421/?source=2022_annual_movie', 'name': '我的解放日志'},
    {'url': 'https://movie.douban.com/subject/35524446/?source=2022_annual_movie', 'name': '非常律师禹英禑'},
    {'url': 'https://movie.douban.com/subject/35248792/?source=2022_annual_movie', 'name': '少年法庭'},
    {'url': 'https://movie.douban.com/subject/30291070/?source=2022_annual_movie', 'name': '财阀家的小儿子'},
    
    {'url': 'https://movie.douban.com/subject/26816519/', 'name': '逃避虽可耻但有用'},
    {'url': 'https://movie.douban.com/subject/26921674/', 'name': '东京女子图鉴'},
    {'url': 'https://movie.douban.com/subject/27140017/', 'name': '非自然死亡'},
    {'url': 'https://movie.douban.com/subject/36156235/', 'name': '重启人生'},
    {'url': 'https://movie.douban.com/subject/24321344/', 'name': '面包和汤和猫咪好天气'}    ]


    # 将每个电影链接分配给不同的进程
    pool.map(scrape_and_store, movie_links)


    # 关闭游标和数据库连接
    cursor.close()
    conn.close()
    # 关闭进程池
    pool.close()
    pool.join()
    end = time.time()
    run = end - start
    print(f"===全部完成，用时{run}s===")

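Results

The MySQL query results are shown below:

Performance analysis

The crawler uses the multiprocessing library: each film is handled by its own process with its own driver window.

To cope with anti-crawling measures, the crawler forces a one-second pause with time.sleep(1) on every page of reviews. Douban blocked my IP several times, so giving up some speed in exchange for stability was necessary.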
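
A fixed pause is the simplest way to trade speed for stability. A common alternative, sketched below, is to retry a failed request with waits that grow after each failure (exponential backoff with a little random jitter). This is not part of the code above: `fetch_with_backoff` is a hypothetical helper, `fetch` is supplied by the caller, and the delay values are arbitrary assumptions.

```python
import random
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch(url)`, retrying with growing pauses when it raises.

    The wait before retry number n is base_delay * 2**n plus up to 0.5 s of
    random jitter. After `retries` failed retries the last error is re-raised.
    `sleep` can be replaced in tests to avoid real waiting.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:      # out of retries: let the error surface
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

With the defaults, a request that keeps failing is retried after roughly 1 s, 2 s, and 4 s, so a temporary block costs a few seconds instead of stopping the whole crawl.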

7.3 Sequential Crawl: Results and Performance Analysis (Short Comments)

Task requirements

Code

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

def login_with_cookies(driver):

    # 访问网站
    driver.get("https://movie.douban.com/")

    # 添加新的Cookie
    cookies = {
        '__utma': '223695111.275596860.1695101450.1695101450.1695265771.2',
        # cookies打码一下
        # ........        
        'll': '"118282"'
    }

    for key, value in cookies.items():
        driver.add_cookie({'name': key, 'value': value})

    # 可以看到已登录
    driver.refresh()

#计时开始
start = time.time()
# 创建一个浏览器实例
driver = webdriver.Firefox()
login_with_cookies(driver)

# 2022年度榜单
wait = WebDriverWait(driver, 30)
page_2022_link = wait.until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "2022年度榜单"))
)
page_2022_link.click()

movie_links = [
    {'url': 'https://movie.douban.com/subject/35073886/?source=2022_annual_movie', 'name': '分手的决心 헤어질 결심'},
    {'url': 'https://movie.douban.com/subject/35160926/?source=2022_annual_movie', 'name': '狩猎 헌트'},
    {'url': 'https://movie.douban.com/subject/35441582/?source=2022_annual_movie', 'name': '6/45 육사오'},
    {'url': 'https://movie.douban.com/subject/30267287/?source=2022_annual_movie', 'name': '犯罪都市2 범죄도시2'},
    {'url': 'https://movie.douban.com/subject/35743103/?source=2022_annual_movie', 'name': '小说家的电影 소설가의 영화'},
    {'url': 'https://movie.douban.com/subject/35015968/?source=2022_annual_movie', 'name': '昨日的美食 电影版'},
    {'url': 'https://movie.douban.com/subject/35597426/?source=2022_annual_movie', 'name': '稍微想起一些'},
    {'url': 'https://movie.douban.com/subject/34809360/?source=2022_annual_movie', 'name': '在街上'},
    {'url': 'https://movie.douban.com/subject/34905647/?source=2022_annual_movie', 'name': '由宇子的天平'},
    {'url': 'https://movie.douban.com/subject/35372792/?source=2022_annual_movie', 'name': '老师，您能坐在我旁边吗？'}

]

with open('Korea_movies_2022_comments.csv', 'w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['电影名称','用户名', '评论时间', '评论内容', '评分等级', '点赞数']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for movie_info in movie_links:
        print("===",movie_info['name'],"===")
        # Open the movie's page
        driver.get(movie_info['url'])
        # Locate the "更多短评" (more short comments) link with a relative XPath
        more_comments_link = driver.find_element(By.XPATH, "//a[contains(text(), '更多短评')]")
        more_comments_link.click()
        print("Starting from page 1")


        for i in range(1, 11):
            # Wait until the comment blocks have loaded
            comment_blocks = WebDriverWait(driver, 30).until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.comment-item'))
            )
            print("Comments on this page loaded")
            # Grab the HTML source of the short-comment page
            short_comment_page_source = driver.page_source

            # Parse the HTML source with BeautifulSoup
            soup = BeautifulSoup(short_comment_page_source, 'html.parser')

            # Find all short-comment blocks
            comment_blocks = soup.find_all('div', class_='comment-item')

            # Extract the fields from each comment block
            for comment_block in comment_blocks:
                # Username
                username = comment_block.find('span', class_='comment-info').find('a').text.strip()

                # Comment time
                comment_time = comment_block.find('span', class_='comment-time').text.strip()

                # Comment text
                comment_content = comment_block.find('span', class_='short').text.strip()
                # Replace newlines with spaces
                comment_content = comment_content.replace('\n', ' ')


                # Find the <span> whose class carries the star rating
                rating_span = comment_block.find('span', class_=lambda x: x and 'allstar' in x)
                if rating_span:
                    # The class attribute is a list of class names
                    class_value = rating_span['class']

                    # Strip the 'allstar' prefix to get the rating level
                    rating = class_value[0].replace('allstar', '')
                else:
                    # "未评分" means "not rated"
                    rating = "未评分"

                # Number of upvotes
                vote = comment_block.find('span', class_='votes').text.strip()

                # Write the row to the CSV file
                writer.writerow({'电影名称': movie_info['name'],'用户名': username, '评论时间': comment_time, '评论内容': comment_content, '评分等级': rating, '点赞数': vote})

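            print(f"=== page {i} scraped ===")

            # No page turn is needed after the last page
            if i == 10:
                break
            # Find the next-page link and wait until it is clickable
            next_page = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "next"))
            )
            # Scroll the link into view
            driver.execute_script("arguments[0].scrollIntoView();", next_page)
            # Then click it
            next_page.click()
end = time.time()
run = end - start
print(f"=== all done in {run}s ===")
# Close the browser once the timing has been reported
driver.quit()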
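
The rating step in the loop above strips the `allstar` prefix from the first class name, which works only when the rating class comes first. A minimal sketch of that step in isolation, assuming class names of the form `allstar40` and a class list that may arrive in any order; the helper name `rating_from_classes` is hypothetical and not part of the original script:

```python
import re


def rating_from_classes(class_list):
    """Return the digits that follow 'allstar' in a class list, or None if absent."""
    for name in class_list or []:
        match = re.fullmatch(r"allstar(\d+)", name)
        if match:
            return match.group(1)
    return None


# Example class lists (assumed shapes, for illustration only)
first = rating_from_classes(["allstar40", "rating"])
second = rating_from_classes(["rating", "allstar10"])
missing = rating_from_classes(["comment-time"])
print(first, second, missing)
```

Returning `None` for a missing rating leaves the choice of placeholder (the script uses "未评分") to the caller.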

Performance analysis

The run took 120.92306709289551 seconds (for reference only; the run time may be measured several more times).
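
Since a single measurement is only a rough reference, repeated runs can be summarized instead. The sketch below is one way to do that with the standard library; `time_runs` and `fake_crawl` are hypothetical names, and `fake_crawl` merely sleeps as a stand-in for the real crawl:

```python
import statistics
import time


def time_runs(func, repeats=3):
    """Call func `repeats` times and return the elapsed seconds of each call."""
    durations = []
    for _ in range(repeats):
        begin = time.perf_counter()
        func()
        durations.append(time.perf_counter() - begin)
    return durations


def fake_crawl():
    # Hypothetical stand-in for the real crawl; it only sleeps briefly
    time.sleep(0.01)


durations = time_runs(fake_crawl, repeats=3)
print(
    f"mean={statistics.mean(durations):.3f}s "
    f"min={min(durations):.3f}s max={max(durations):.3f}s"
)
```

`time.perf_counter` is used here rather than `time.time` because it is intended for measuring short intervals.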

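7.4 Advanced crawler results and performance analysis (short reviews)

Task requirements

(1) Multi-process and multi-threading techniques

(2) Database interaction (redis/mongodb/mysql)

(3) Virtual addresses (proxy IPs)

Possible difficulty: anti-crawler mechanisms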
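
One common way to cope with anti-crawler measures is to retry a blocked request after a growing, randomized pause. The following is a minimal sketch of that idea, not the approach used by the code in this section; `fetch_with_backoff` and `flaky_fetch` are hypothetical names, and the delays are illustrative values:

```python
import random
import time


def fetch_with_backoff(fetch, retries=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.2f}s")
            sleep(delay)


# Hypothetical fetch that is rejected twice before it succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("blocked")
    return "page html"


waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter keeps the example fast to run; in a real crawl the default `time.sleep` would apply the pauses.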

Program code

-- utf8mb4 matches the charset used by the Python connection
create database Douban character set utf8mb4;
use Douban;
select * from comments;
truncate comments; -- clears the table before a test run
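
The statements above assume that a `comments` table already exists; its definition is not shown. The sketch below illustrates a table with the six columns named by the INSERT statement in the Python code of this section, using the standard-library `sqlite3` as a stand-in for MySQL so that it runs without a server. The column types, the `id` column, and the sample row are assumptions.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        movie_name TEXT NOT NULL,
        username TEXT,
        comment_time TEXT,
        comment_content TEXT,
        rating TEXT,
        vote INTEGER
    )
    """
)

# Made-up sample row; sqlite3 uses "?" placeholders where pymysql uses "%s"
row = ("example movie", "example user", "2023-01-01 12:00:00", "example comment", "40", 12)
conn.execute(
    "INSERT INTO comments "
    "(movie_name, username, comment_time, comment_content, rating, vote) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    row,
)
conn.commit()

stored = conn.execute("SELECT movie_name, rating, vote FROM comments").fetchall()
print(stored)
conn.close()
```

Passing the values separately from the SQL text, as here, lets the driver handle quoting of user names and comment text.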
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import random
import time
import pymysql
import multiprocessing


def ip_random():
    # Developer tools -> Network -> Headers -> Request Headers [raw]
    # Replace with the User-Agent of your own browser, for example:
    # Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0'}
    '''
    Previous version of the code
    API_link = 'https://api.hailiangip.com:8522/api/getIpEncrypt?dataType=1&encryptParam=1Pofnwjjlmawt1HAVLj6tTxUJo8Ctj43ciCg9jb8LL9ZnRSnD61wqH3N2dPDshZ9%2B3kkTAQZ2MD%2FIbJREwKEhHndyjBrABrC6w9Kgqc82yZyTTUse5IUp%2B%2Bl1F%2FVhbLSlhmBtWyY09Uxpbc64mJ2a3wBfYbfJtr8aD2UGympj%2FHBd1HPaJFym4rBQnOJ7ZU9arf5KKS7AFD8J5Z1OJJ%2FWbhiKf6bPqRShUoN9E1%2FgqE%3D'
    response = requests.get(url=API_link, headers=headers)
    # Split the response into IP addresses
    IP_list = response.text.split("\n")
    proxy_list = []
    '''
    IP_list = [
        "114.232.109.213",
        "113.124.216.126",
        "60.170.204.30",
        "113.121.21.191",
        "61.216.185.88",
        "182.34.103.132",
        "47.97.191.179  ",
        "111.3.102.207",
        "114.231.41.224",
        "27.214.51.122",
        "39.99.54.91",
        "60.170.204.30",
        "60.205.132.71",
        "218.75.102.198",
        "113.121.47.118",
        "113.121.42.177",
        "60.170.204.30",
        "183.236.232.160",
        "182.34.101.96"
    ]
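    proxy_list = []
    # Earlier versions prepended 'https:' or 'http:' here. Selenium's manual
    # proxy settings expect "host[:port]" without a URL scheme, so the
    # addresses are used as they are.
    for ip in IP_list:
        ip = ip.strip()  # some entries carry trailing spaces
        if ip != '':
            proxy_list.append(ip)
    proxy = random.choice(proxy_list)
    return proxy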

def login_with_cookies(driver):
    # The proxy is not configured here: assigning to
    # webdriver.DesiredCapabilities after the browser has been created does
    # not change the proxy of the running driver.

    # Open the site first so the cookies can be set for its domain
    driver.get("https://movie.douban.com/")

    # Add the login cookies
    cookies = {
        # fill in your own cookies

    }

    for key, value in cookies.items():
        driver.add_cookie({'name': key, 'value': value})

    # After the refresh the page shows the logged-in state
    driver.refresh()

# Connect to MySQL
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='20031008',
    database='Douban',
    charset='utf8mb4'
)
# Create a cursor
cursor = conn.cursor()


# Insert one comment row into the database
def insert_data_to_mysql(movie_info, username, comment_time, comment_content, rating, vote):
    try:
        # Parameterized INSERT statement
        sql = "INSERT INTO comments (movie_name, username, comment_time, comment_content, rating, vote) VALUES (%s, %s, %s, %s, %s, %s)"
        # Execute the insert
        cursor.execute(sql, (movie_info, username, comment_time, comment_content, rating, vote))
        # Commit the change
        conn.commit()
    except Exception as e:
        # Report the error and roll back the change
        print(f"Error inserting data: {str(e)}")
        conn.rollback()

# Proxy IP pool
proxy_pool = []

def init_proxy_pool():
    for _ in range(10):  # Draw 10 proxy IPs
        proxy = ip_random()
        proxy_pool.append(proxy)

def get_next_proxy():
    if len(proxy_pool) == 0:
        init_proxy_pool()  # Refill the pool when it is empty
    return proxy_pool.pop()  # Take the next proxy IP and remove it from the pool

def scrape_and_store(movie_info):
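    # Imported locally so that this function is self-contained
    from selenium.webdriver.common.proxy import Proxy

    # Pick the proxy first: it has to be configured before the browser starts
    proxy = get_next_proxy()
    print("Using proxy IP:", proxy)
    # Selenium expects "host[:port]"; drop a URL scheme if one is present
    proxy_address = proxy.strip()
    if proxy_address.startswith(('http:', 'https:')):
        proxy_address = proxy_address.split(':', 1)[1].lstrip('/')
    options = webdriver.FirefoxOptions()
    options.proxy = Proxy({
        "proxyType": "MANUAL",
        "httpProxy": proxy_address,
        "sslProxy": proxy_address,
    })
    # Create a browser instance that uses the proxy
    driver = webdriver.Firefox(options=options)
    try:
        login_with_cookies(driver)

        # URL and name of the movie
        movie_url = movie_info['url']
        movie_name = movie_info['name']
        print("===", movie_name, "===")

        # Open the movie's page
        driver.get(movie_url)
        # Wait for the "更多短评" (more short comments) link, located by a relative XPath
        more_comments_link = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.XPATH, "//a[contains(text(), '更多短评')]"))
        )
        more_comments_link.click()
        print("Starting from page 1")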

        for i in range(1, 11):
            # Wait until the comment blocks have loaded
            comment_blocks = WebDriverWait(driver, 30).until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.comment-item'))
            )
            print("Comments on this page loaded")
            # Grab the HTML source of the short-comment page
            short_comment_page_source = driver.page_source

            # Parse the HTML source with BeautifulSoup
            soup = BeautifulSoup(short_comment_page_source, 'html.parser')

            # Find all short-comment blocks
            comment_blocks = soup.find_all('div', class_='comment-item')

            time.sleep(0.5)
            # Extract the fields from each comment block
            for comment_block in comment_blocks:
                # Username
                username = comment_block.find('span', class_='comment-info').find('a').text.strip()

                # Comment time
                comment_time = comment_block.find('span', class_='comment-time').text.strip()

                # Comment text
                comment_content = comment_block.find('span', class_='short').text.strip()
                # Replace newlines with spaces
                comment_content = comment_content.replace('\n', ' ')

                # Find the <span> whose class carries the star rating
                rating_span = comment_block.find('span', class_=lambda x: x and 'allstar' in x)
                if rating_span:
                    # The class attribute is a list of class names
                    class_value = rating_span['class']

                    # Strip the 'allstar' prefix to get the rating level
                    rating = class_value[0].replace('allstar', '')
                else:
                    # "未评分" means "not rated"
                    rating = "未评分"

                # Number of upvotes
                vote = comment_block.find('span', class_='votes').text.strip()

                # Drop characters that cannot be encoded as UTF-8
                username = username.encode('utf-8', 'ignore').decode('utf-8')
                comment_content = comment_content.encode('utf-8', 'ignore').decode('utf-8')

                # Write the row to MySQL
                insert_data_to_mysql(movie_name, username, comment_time, comment_content, rating, vote)

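            print(f"=== page {i} scraped ===")

            # No page turn is needed after the last page
            if i == 10:
                break
            # Find the next-page link and wait until it is clickable
            next_page = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "next"))
            )
            # Scroll the link into view
            driver.execute_script("arguments[0].scrollIntoView();", next_page)
            # Pause briefly, then click it
            time.sleep(0.5)
            next_page.click()

    except Exception as e:
        print(f"Error while scraping movie comments: {str(e)}")
    finally:
        # Close the browser
        driver.quit()
        # Return the proxy IP to the pool for later reuse
        proxy_pool.append(proxy)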


if __name__ == "__main__":
    
    start = time.time()
    # Create the process pool
    pool = multiprocessing.Pool(processes=11)


    # 2022 annual list
    movie_links = [
        {'url': 'https://movie.douban.com/subject/35073886/?source=2022_annual_movie', 'name': '分手的决心'},
        {'url': 'https://movie.douban.com/subject/35160926/?source=2022_annual_movie', 'name': '狩猎'},
        {'url': 'https://movie.douban.com/subject/35441582/?source=2022_annual_movie', 'name': '6/45'},
        {'url': 'https://movie.douban.com/subject/30267287/?source=2022_annual_movie', 'name': '犯罪都市2'},
        {'url': 'https://movie.douban.com/subject/35743103/?source=2022_annual_movie', 'name': '小说家的电影'},

        {'url': 'https://movie.douban.com/subject/35015968/?source=2022_annual_movie', 'name': '昨日的美食 电影版'},
        {'url': 'https://movie.douban.com/subject/35597426/?source=2022_annual_movie', 'name': '稍微想起一些'},
        {'url': 'https://movie.douban.com/subject/34809360/?source=2022_annual_movie', 'name': '在街上'},
        {'url': 'https://movie.douban.com/subject/34905647/?source=2022_annual_movie', 'name': '由宇子的天平'},
        {'url': 'https://movie.douban.com/subject/35372792/?source=2022_annual_movie', 'name': '老师，您能坐在我旁边吗？'}
    ]


    # Hand each movie link to a worker process
    pool.map(scrape_and_store, movie_links)


    # Close the cursor and the database connection
    cursor.close()
    conn.close()
    # Shut down the process pool
    pool.close()
    pool.join()
    end = time.time()
    run = end - start
    print(f"=== all done in {run}s ===")
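
The address list in `ip_random` contains repeated entries and one entry padded with spaces, and a random pick can return the same address several times in a row. A minimal sketch of an alternative, assuming round-robin rotation is acceptable; `build_proxy_pool` and `next_proxy` are hypothetical helpers, not part of the original code:

```python
from collections import deque


def build_proxy_pool(raw_entries):
    """Strip whitespace, drop blanks and duplicates, and keep first-seen order."""
    seen = set()
    pool = deque()
    for entry in raw_entries:
        cleaned = entry.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            pool.append(cleaned)
    return pool


def next_proxy(pool):
    """Return the proxy at the front and move it to the back (round-robin)."""
    proxy = pool[0]
    pool.rotate(-1)
    return proxy


# A few entries taken from the list above, including a padded one and a duplicate
pool = build_proxy_pool(["60.170.204.30", "47.97.191.179  ", "", "60.170.204.30"])
picks = [next_proxy(pool) for _ in range(3)]
print(picks)
```

Note that a pool built this way lives in one process; with `multiprocessing.Pool` each worker process would hold its own copy.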