Scraping Douban Books with PyQt5 (background concepts explained in detail)

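For this Douban scraper I have simplified a few things: some data cannot be extracted with BeautifulSoup4 and has to be fetched with chromedriver together with selenium, which is more involved, so I will cover that in a later post. The tool also uses multithreading; see my earlier multithreading tutorial if you need a refresher.

Here is the interface first. (It is packaged the way my earlier packaging tutorial describes; have a look at that post if you have not done this before.)

This time we mainly scrape the following data: the fields marked with red boxes, plus some of the comments.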

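The requests part

Before walking through the code, let's first see how to access a website from Python.

Accessing a website with the requests library

pip install requests

With the dependency installed, the following code uses requests to send a GET request to https://www.example.com and prints the response body.

import requests
# Send a GET request
response = requests.get('https://www.example.com')
# Print the response body
print(response.text)

Handling the response: the response object carries a lot of useful information that you can process as needed, for example the status code, the response headers, or JSON data.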

import requests

response = requests.get('https://www.example.com')

# Get the status code
status_code = response.status_code
print(f'Status Code: {status_code}')

# Get the response headers
headers = response.headers
print('Response Headers:')
for key, value in headers.items():
    print(f'{key}: {value}')

Sending a POST request: besides GET requests, you can also send POST requests.

import requests

# Send a POST request
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://www.example.com/post-endpoint', data=payload)

# Print the response body
print(response.text)

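Those were a few simple examples of accessing a site. Now let's look at what request headers are. HTTP headers are metadata that carry extra information in an HTTP request or response. They describe the request, the response, or the server, and help the client and server communicate more effectively. Headers are sent as key-value pairs, each consisting of a field name and its value.

Common HTTP request headers include:

  1. User-Agent: identifies the user agent making the request (browser, crawler, and so on).

  2. Accept: the content types the client can accept, usually given as MIME types.

  3. Authorization: the credentials the client presents to the server, for example an encoded username and password.

  4. Cookie: the HTTP cookies previously set by the server through the Set-Cookie header, used to keep the user's session state.

  5. Referer: where the request came from, that is, the URI that led the user agent to the current page.

  6. Host: the domain name and port of the server being accessed.

  7. Content-Type: the MIME type of the body in the request or response, which defines the type of data being sent.

  8. Accept-Language: the natural languages the client can accept.

  9. Connection: controls whether the connection is kept alive.

  10. Cache-Control: controls caching behavior, for example no-cache or max-age.

  11. Other standard or custom HTTP headers, such as If-Modified-Since and If-None-Match.

This is only a small sample of the available headers. Different scenarios call for different header fields; consult the HTTP specification or the API documentation of the specific service for details. Headers are important for customizing and controlling HTTP requests: by setting them appropriately the client tells the server what it needs and what it supports, and the server can handle the request according to that information.

Later in the program we set up several User-Agent strings to imitate different browsers and reduce the chance of our IP being blocked.

We also use BeautifulSoup4 later on, so here is a short introduction: Beautiful Soup is a Python library for parsing HTML and XML documents. It provides convenient ways to traverse the document tree, search it for elements, and extract data from them.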

BeautifulSoup4

Installing BeautifulSoup4

If you have not installed BeautifulSoup4 yet, install it with:

pip install beautifulsoup4

Importing BeautifulSoup

Import BeautifulSoup in your Python script:

from bs4 import BeautifulSoup

Creating a BeautifulSoup object

Use BeautifulSoup to parse an HTML or XML document. You can pass in a string or read from a file:

# Create from a string
html_string = "<html><body><p>Hello, World!</p></body></html>"
soup = BeautifulSoup(html_string, 'html.parser')

# Or read from a file
with open('example.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'html.parser')

The second argument, 'html.parser', selects Python's built-in parser. Other parsers such as 'lxml' or 'html5lib' can be used as well, provided the corresponding package is installed.

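Traversing the document tree

With a BeautifulSoup object you can traverse the document tree, find elements, and extract data. Some basic usage:

# Get all paragraph tags
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)

# Get the first paragraph tag
first_paragraph = soup.find('p')
print(first_paragraph.text)

Using CSS selectors

BeautifulSoup supports CSS-selector syntax for finding elements. For example, to find all paragraph tags with the class 'example':

# Find all paragraph tags with class="example" via a CSS selector
example_paragraphs = soup.select('p.example')
for paragraph in example_paragraphs:
    print(paragraph.text)

After this overview you should have a rough idea of what BeautifulSoup is. Next, let's look in more detail at the methods it provides for finding elements, focusing on the most common ones.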

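The find method

find returns the first element in the document that matches the given conditions, or None if nothing matches. You can use the tag name, class name, id, and so on as conditions.

# Find the first paragraph by tag name
paragraph = soup.find('p')

# Find the first paragraph with the example class
example_paragraph = soup.find('p', class_='example')

# Find the first element whose id is my_id
element_with_id = soup.find(id='my_id')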

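The select_one method

select_one is similar to find: it returns the first element that matches. The difference is that select_one takes a CSS selector:

# Find the first paragraph with a CSS selector
paragraph = soup.select_one('p')

# Find the first paragraph with the example class with a CSS selector
example_paragraph = soup.select_one('p.example')

# Find the first element whose id is my_id with a CSS selector
element_with_id = soup.select_one('#my_id')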

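The find_all method

find_all returns a list of all elements in the document that match the conditions. As before, you can use the tag name, class name, id, and so on.

# Find all paragraphs
all_paragraphs = soup.find_all('p')

# Find all paragraphs with the example class
all_example_paragraphs = soup.find_all('p', class_='example')

# Find all elements whose id is my_id
all_elements_with_id = soup.find_all(id='my_id')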

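find_parents and find_parent

find_parents returns all matching ancestor elements, while find_parent returns the nearest matching one. You can use the tag name, class name, id, and so on as conditions. Both are called on a single element, not on the list returned by find_all.

# Collect the ancestors of every paragraph
# (the list returned by find_all has no find_parents method, so call it per element)
parent_paragraphs = [p.find_parents() for p in soup.find_all('p')]

# Parent of the first paragraph with the example class
parent_example_paragraph = soup.find('p', class_='example').find_parent()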

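find_next_siblings and find_next_sibling

find_next_siblings returns all matching siblings that come after an element, while find_next_sibling returns the first matching one. You can use the tag name, class name, id, and so on as conditions.

# All following siblings of the first paragraph
next_siblings = soup.find('p').find_next_siblings()

# First following sibling of the first paragraph with the example class
next_example_sibling = soup.find('p', class_='example').find_next_sibling()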

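find_previous_siblings and find_previous_sibling

Like find_next_siblings and find_next_sibling, these two methods look for sibling elements, but for the ones that come before the element.

# All preceding siblings of the first paragraph
previous_siblings = soup.find('p').find_previous_siblings()

# First preceding sibling of the first paragraph with the example class
previous_example_sibling = soup.find('p', class_='example').find_previous_sibling()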

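descendants and find_all(recursive=False)

descendants is an attribute (not a method) that iterates over all descendants of an element, including text nodes, while find_all(recursive=False) returns only the direct child tags. This helps you target elements more precisely.

# Iterate over every descendant of the document (a generator)
all_descendants = soup.descendants

# Direct child tags of the first paragraph with the example class
direct_children = soup.find('p', class_='example').find_all(recursive=False)

I cannot cover everything here because there is simply too much. If you are interested, the official documentation describes how to use every method.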
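To tie the lookup methods above together, here is a small sketch run against a made-up HTML snippet. The markup, class names, and ids are invented for illustration and are not taken from Douban. It compares find with select_one and shows how find_parent and the sibling methods navigate from a single matched tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html = """
<div id="info">
  <p class="example" id="my_id">first</p>
  <p>second</p>
  <p class="example">third</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find and select_one both return the first match (or None)
by_find = soup.find("p", class_="example")
by_css = soup.select_one("p.example")
same_id = by_find.get("id") == by_css.get("id")

# find_all returns every match
example_texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="example")]

# Navigation starts from a single tag, not from the list returned by find_all
parent_id = by_find.find_parent("div")["id"]
next_text = by_find.find_next_sibling("p").get_text(strip=True)
previous_of_last = soup.find_all("p")[-1].find_previous_sibling("p").get_text(strip=True)

# recursive=False restricts the search to direct child tags
direct_children = soup.find("div").find_all(recursive=False)

print(same_id, example_texts, parent_id, next_text, previous_of_last, len(direct_children))
```

Because find and select_one can return None, real scraping code should check the result before calling get_text on it, as the scraper later in this post does for most fields.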

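PyQt5 signals and slots

In PyQt5, signals and slots are the core mechanism for event-driven programming. A signal is an event emitted by an object, and a slot is a method or function that responds to it. When a signal is emitted, every slot connected to it is called. The following example shows what signals and slots look like in practice: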

from PyQt5.QtWidgets import QApplication, QPushButton, QVBoxLayout, QWidget
from PyQt5.QtCore import pyqtSignal, QObject

class MyObject(QObject):
    # Define a signal that carries a string
    my_signal = pyqtSignal(str)

class MyWidget(QWidget):
    def __init__(self):
        super().__init__()

        # Create the object that owns the signal
        self.my_object = MyObject()

        # Connect the signal to the slot
        self.my_object.my_signal.connect(self.on_my_signal)

        # Create a button; clicking it triggers the signal
        self.button = QPushButton("Click me")
        self.button.clicked.connect(self.emit_signal)

        # Layout
        layout = QVBoxLayout()
        layout.addWidget(self.button)
        self.setLayout(layout)

    def emit_signal(self):
        # Emit the signal
        self.my_object.my_signal.emit("Hello from signal!")

    def on_my_signal(self, message):
        # Slot implementation: respond to the signal
        print(f"Received signal: {message}")

if __name__ == "__main__":
    app = QApplication([])

    window = MyWidget()
    window.show()

    app.exec_()

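In this example we create a custom object, MyObject, and define a signal my_signal on it. We then create a MyWidget class that inherits from QWidget and connects the signal of its MyObject instance to a slot.

When the button is clicked, emit_signal is called and emits my_signal. The slot connected to it, on_my_signal, is then called and prints a message.

In a multithreaded program we separate the UI-update code from the logic code so that long-running work does not block the main thread. Signals and slots are how the worker threads hand data to the main thread: when a method in a worker thread finishes, it emits a signal carrying the data that should be displayed, and the slot connected to that signal (a method that runs in the main thread) uses the data to update the interface.

The diagram here is not complete; it is only meant to illustrate the basic relationship between signals and slots.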

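Walkthrough of the Douban scraping code

My code can still be simplified and improved, and it may contain hidden bugs; corrections and discussion in the comments are welcome. Thank you! The full code is at the end (you can copy and use it directly).

Worker thread 1 (fetches the basic book information)

Worker thread 2 (fetches the book comments)

Worker thread 3 (fetches the book cover image)

Main thread:

        Website status check

        Extracting the book ID from the URL

        Saving the image; saving the file (saving has to be done in the main thread)

        Checkbox handling

        UI updates

Key point: I call time.sleep between starting worker thread 1 and worker thread 2 to delay the second start. The reason is that worker thread 1 scrapes more elements and is therefore slower than worker thread 2, so when the results were sent back through the signals, worker thread 2 always finished first, whereas I want the basic book information to appear before the comments. I also tried a mutex, but a mutex only grants the lock to whichever thread reaches it first, so worker thread 2 still acquired it first and the ordering problem remained. Thread queues, or having one thread wait on another, felt too cumbersome here, so in the end I simply delayed the start of worker thread 2, which solved the ordering problem.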
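As an alternative to delaying the second thread, the display order can be decoupled from the completion order: let both tasks run concurrently, then read the results back in the order they should appear. The sketch below illustrates the idea with Python's standard concurrent.futures module rather than QThread. fetch_info and fetch_comments are hypothetical stand-ins for the two workers, and the sleep calls only simulate the slower first request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_info():
    # Hypothetical stand-in for worker 1 (the slower one)
    time.sleep(0.2)
    return "book info"

def fetch_comments():
    # Hypothetical stand-in for worker 2 (the faster one)
    time.sleep(0.05)
    return "comments"

def run_in_display_order():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both tasks start immediately and run concurrently
        info_future = pool.submit(fetch_info)
        comments_future = pool.submit(fetch_comments)
        # result() blocks until that task is done, so reading the futures
        # in display order fixes the output order no matter who finishes first
        return [info_future.result(), comments_future.result()]

ordered = run_in_display_order()
print(ordered)
```

In the Qt program the same idea could presumably be applied by buffering both results in the main thread and appending them to the result box once both signals have arrived; that variant is not shown here.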

Request headers

self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36' ]

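This is another small but important point. To make it less likely that Douban blocks our IP for scraping too frequently, I prepared a list of User-Agent strings in advance and stored it in the main thread. Sending a different User-Agent makes each request look as if it came from a different browser. In the worker threads below, random.choice picks one at random, so the header differs from request to request instead of staying fixed.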
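A minimal sketch of the random selection described above. The helper name build_headers and the short two-entry list are assumptions for illustration; in the program the list lives in self.user_agents on the main window. Note that this changes only the User-Agent header sent with each request; the IP address stays the same.

```python
import random

# Two sample entries; the real program keeps a longer list in self.user_agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0",
]

def build_headers(user_agents):
    """Return a headers dict with a randomly chosen User-Agent (hypothetical helper)."""
    return {"User-Agent": random.choice(user_agents)}

headers = build_headers(USER_AGENTS)
print(headers)
# The dict can then be passed along as requests.get(url, headers=headers)
```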

Extracting the book ID from the URL

    def book_url(self):
        try:
            url = self.edit_book.text()
            # Match the book ID in the Douban book page URL with a regular expression
            pattern = re.compile(r'https://book\.douban\.com/subject/(\d+)/.*')
            match = pattern.match(url)
            if match:
                book_id = match.group(1)
                self.edit_number.setText(book_id)
                self.result.append(f"已提取书籍URL编号:{book_id}")
                return
            else:
                self.result.append("URL链接格式有误(请检查格式)")
                return
        except Exception as e:
            self.result1.setText("ERROR错误...")

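Only the book ID changes from one book URL to the next, so we extract the ID automatically from the URL the user enters and insert it into the URL template. A try-except block catches and handles any exception.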
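The extraction logic can also be tried outside the GUI. The sketch below reuses the same regular expression in a plain function; extract_book_id and build_book_url are hypothetical helper names, the strip() call is a small addition of mine, and the ID 1234567 is a made-up example value.

```python
import re

# Same pattern as in book_url, compiled once at module level
BOOK_URL_PATTERN = re.compile(r'https://book\.douban\.com/subject/(\d+)/.*')

def extract_book_id(url):
    """Return the numeric book ID as a string, or None if the URL does not match."""
    match = BOOK_URL_PATTERN.match(url.strip())
    return match.group(1) if match else None

def build_book_url(book_id):
    # URL template used by the worker threads
    return f'https://book.douban.com/subject/{book_id}/'

book_id = extract_book_id('https://book.douban.com/subject/1234567/?from=tag')
print(book_id, build_book_url(book_id))

# The pattern requires a slash after the digits, so this one is rejected
no_slash = extract_book_id('https://book.douban.com/subject/1234567')
print(no_slash)
```

As the last call shows, a URL pasted without the trailing slash does not match, which is one case where the program reports a format error.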

Fetching the basic book information

class Thread_get_book_info(QThread):
    finishedSignal = pyqtSignal(str)

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window

    def run(self):
        try:
            print('Thread-get-book-info start up...')

            book_id = self.main_window.edit_number.text()
            self.selected_info = self.main_window.selected_info
            url = f'https://book.douban.com/subject/{book_id}/'
            # Pick a random User-Agent to reduce the chance of being blocked
            headers = {
                'User-Agent': random.choice(self.main_window.user_agents)
            }
            # Request the book page
            response = requests.get(url, headers=headers, stream=True)
            code = response.status_code
            print(code)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Parse the fields out of the page
            title_element = soup.select_one('h1 span')
            title = title_element.get_text(strip=True) if title_element else '未获取到书名数据'

            score_element = soup.select_one('#interest_sectl > div > div.rating_self.clearfix > strong')
            score = score_element.get_text(strip=True) if score_element else '未获取到评分数据'

            author_element = soup.select_one('#info > span:nth-child(1) > a')
            author = author_element.get_text(strip=True) if author_element else '未获取到作者数据'

            cbs_element = soup.select_one('#info > a:nth-child(4)')
            cbs = cbs_element.get_text(strip=True) if cbs_element else '未获取到出版社数据'

            cpf_element = soup.select_one('#info > a:nth-child(7)')
            cpf = cpf_element.get_text(strip=True) if cpf_element else '未获取到出品方数据'

            fbt_element = soup.find('span', class_='pl', string='副标题:')
            fbt = fbt_element.next_sibling.get_text(strip=True) if fbt_element else '未获取到副标题数据'

            cs_element = soup.find('span', class_='pl', string='丛书:')
            cs = cs_element.next_sibling.next_sibling.get_text(strip=True) if cs_element else '未获取到丛书数据'

            yzm_element = soup.find('span', class_='pl', string='原作名:')
            yzm = yzm_element.next_sibling.get_text(strip=True) if yzm_element else '未获取到原作名数据'

            yz_element = soup.find('span', class_='pl', string=' 译者')
            yz = yz_element.next_sibling.next_sibling.get_text(strip=True) if yz_element else '未获取到译者数据'

            cbn_element = soup.find('span', class_='pl', string='出版年:')
            cbn = cbn_element.next_sibling.get_text(strip=True) if cbn_element else '未获取到出版年数据'

            ys_element = soup.find('span', class_='pl', string='页数:')
            ys = ys_element.next_sibling.get_text(strip=True) if ys_element else '未获取到页数数据'

            dj_element = soup.find('span', class_='pl', string='定价:')
            dj = dj_element.next_sibling.get_text(strip=True) if dj_element else '未获取到定价数据'

            zz_element = soup.find('span', class_='pl', string='装帧:')
            zz = zz_element.next_sibling.get_text(strip=True) if zz_element else '未获取到装帧数据'

            isbn_element = soup.find('span', class_='pl', string='ISBN:')
            isbn = isbn_element.next_sibling.get_text(strip=True) if isbn_element else '未获取到ISBN数据'

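            nrjj_element = soup.find('div', class_='intro')
            # Guard against a missing introduction block, like the other fields
            nrjj = nrjj_element.get_text(strip=True) if nrjj_element else '未获取到内容简介数据'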


            result = ""
            if 'book_name' in self.selected_info:
                result += f'书名: {title}\n'

            if 'score' in self.selected_info:
                result += f'评分: {score}\n'

            if 'author' in self.selected_info:
                result += f'作者: {author}\n'

            if 'cbs' in self.selected_info:
                result += f'出版社: {cbs}\n'

            if 'cpf' in self.selected_info:
                result += f'出品方: {cpf}\n'

            if 'fbt' in self.selected_info:
                result += f'副标题: {fbt}\n'

            if 'cs' in self.selected_info:
                result += f'丛书: {cs}\n'

            if 'yzm' in self.selected_info:
                result += f'原作名: {yzm}\n'

            if 'yz' in self.selected_info:
                result += f'译者: {yz}\n'

            if 'cbn' in self.selected_info:
                result += f'出版年: {cbn}\n'

            if 'ys' in self.selected_info:
                result += f'页数: {ys}\n'

            if 'dj' in self.selected_info:
                result += f'定价: {dj}\n'

            if 'zz' in self.selected_info:
                result += f'装帧: {zz}\n'

            if 'isbn' in self.selected_info:
                result += f'ISBN: {isbn}\n'

            if 'nrjj' in self.selected_info:
                result += f'内容简介: {nrjj}\n'


            # Send the assembled text to the main thread
            self.finishedSignal.emit(result)

            # TODO
            print(self.selected_info)
        except Exception as e:
            result = f"{e}"
            self.finishedSignal.emit(result)

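This part fetches the book information. The program works from the checkboxes the user ticks: the main thread keeps a list, self.selected_info, that stores the keys of the ticked checkboxes, and each if statement adds a field to the output only when its key is in that list. (Note that the worker thread contains no UI updates at all; everything shown in the interface travels through the signal to the slot function in the main thread, which updates the interface.) We send the request with requests, as covered earlier, and then use Beautiful Soup to extract the element contents from the page.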
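The chain of if statements could also be written as a table of keys and labels. This is only a sketch of that idea, not how the program is written: FIELD_LABELS and format_selected are hypothetical names, only four of the fields are listed, and the values are made up.

```python
# (key stored in selected_info, label shown in the result)
FIELD_LABELS = [
    ("book_name", "书名"),
    ("score", "评分"),
    ("author", "作者"),
    ("cbs", "出版社"),
]

def format_selected(values, selected_info):
    """Build the result text for the ticked fields only, in table order."""
    lines = [
        f"{label}: {values[key]}"
        for key, label in FIELD_LABELS
        if key in selected_info
    ]
    return "\n".join(lines)

# Made-up values standing in for the scraped data
values = {"book_name": "示例书名", "score": "9.0", "author": "某作者", "cbs": "某出版社"}
text = format_selected(values, selected_info=["score", "book_name"])
print(text)
```

The output follows the order of the table rather than the order in which the boxes were ticked, which matches the fixed field order of the original if chain.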

Fetching the book comments

class Thread_get_book_com(QThread):
    finishedSignal = pyqtSignal(str)

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window

    def run(self):
        try:
            book_id = self.main_window.edit_number.text()
            self.selected_info = self.main_window.selected_info

            url = f'https://book.douban.com/subject/{book_id}/comments/'
            # Pick a random User-Agent
            headers = {
                'User-Agent': random.choice(self.main_window.user_agents)
            }
            # Request the comments page
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.text, 'html.parser')

            hqpl = soup.select('.comment-item .short')

            results = ""
            if 'hqpl' in self.selected_info:
                for comment in hqpl:
                    # One comment per line so they do not run together
                    results += f'{comment.get_text(strip=True)}\n'

            self.finishedSignal.emit(results)
        except Exception as e:
            results = 'ERROR错误...'
            self.finishedSignal.emit(results)

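Here we again send the request with requests (covered earlier) and use Beautiful Soup to extract the elements from the page. A for loop goes over every comment and appends it to results, which is then sent to the slot function in the main thread to update the interface.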
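When several comments are shown together, it helps to keep each one on its own line. A small sketch, assuming the comment texts have already been extracted into a list of strings; format_comments is a hypothetical helper and the sample comments are made up.

```python
def format_comments(comments):
    """Number each non-empty comment and put it on its own line (hypothetical helper)."""
    cleaned = [c.strip() for c in comments if c and c.strip()]
    return "\n".join(f"{i}. {text}" for i, text in enumerate(cleaned, start=1))

# Made-up sample data, including an empty entry that gets skipped
sample = ["  很好看 ", "", "值得一读"]
formatted = format_comments(sample)
print(formatted)
```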

Downloading the book cover

class CoverDownloaderThread(QThread):
    finished = pyqtSignal(bytes)

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window

    def run(self):
        try:
            book_id = self.main_window.edit_number.text()
            url = f'https://book.douban.com/subject/{book_id}/'
            headers = {'User-Agent': random.choice(self.main_window.user_agents)}

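            # Fetch the book page
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find the cover image tag
            img_tag = soup.select_one('.nbg img[src]')

            # Get the cover image URL
            img_url = img_tag['src']

            # Download the cover image
            img_data = requests.get(img_url, headers=headers).content

            self.finished.emit(img_data)
        except Exception as e:
            # The signal is declared as pyqtSignal(bytes), so emit empty bytes on failure
            # instead of the exception object
            print("Cover download failed:", e)
            self.finished.emit(b'')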

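Here we send the request with requests (covered earlier), use Beautiful Soup to locate the cover image element in the page, and download the image. Once the download finishes, the image data is passed to the main thread through the signal and saved there.

Each of the worker threads above stores a reference to the main window (self.main_window = main_window), because the worker needs the book ID that the main thread extracted from the URL in order to fetch the data.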

Main thread

class MainWindow(QMainWindow, Ui_MainWindow):
    def __init__(self):
        super(MainWindow, self).__init__()
        self.setupUi(self)
        # List of User-Agent strings used to vary the request headers
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36' ]
        self.selected_info = []
        # Create the worker threads and connect their signals to the main-thread slots
        self.thread_get_book_info = Thread_get_book_info(self)
        self.thread_get_book_com = Thread_get_book_com(self)
        self.cover_thread = CoverDownloaderThread(self)
        self.thread_get_book_info.finishedSignal.connect(self.handle_finished_signal)
        self.thread_get_book_com.finishedSignal.connect(self.handle_finished_signal)
        self.cover_thread.finished.connect(self.show_cover)



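        # Slots run in connection order, so book_url is connected first
        # to make sure the book ID is extracted before a thread starts
        self.start.clicked.connect(self.book_url)
        self.start.clicked.connect(self.start_thread)
        self.stop.clicked.connect(self.save_to_file)
        self.img.clicked.connect(self.book_url)
        self.img.clicked.connect(self.download_cover)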

        # Connect the checkboxes to their slot
        self.c3.stateChanged.connect(self.on_checkbox_changed)
        self.c4.stateChanged.connect(self.on_checkbox_changed)
        self.c5.stateChanged.connect(self.on_checkbox_changed)
        self.c6.stateChanged.connect(self.on_checkbox_changed)
        self.c7.stateChanged.connect(self.on_checkbox_changed)
        self.c8.stateChanged.connect(self.on_checkbox_changed)
        self.c9.stateChanged.connect(self.on_checkbox_changed)
        self.c10.stateChanged.connect(self.on_checkbox_changed)
        self.c11.stateChanged.connect(self.on_checkbox_changed)
        self.c12.stateChanged.connect(self.on_checkbox_changed)
        self.c13.stateChanged.connect(self.on_checkbox_changed)
        self.c14.stateChanged.connect(self.on_checkbox_changed)
        self.c15.stateChanged.connect(self.on_checkbox_changed)
        self.c16.stateChanged.connect(self.on_checkbox_changed)
        self.c17.stateChanged.connect(self.on_checkbox_changed)
        self.c18.stateChanged.connect(self.on_checkbox_changed)

    # Extract the book ID from the Douban URL
    def book_url(self):
        try:
            url = self.edit_book.text()
            # Match the book ID in the Douban book page URL with a regular expression
            pattern = re.compile(r'https://book\.douban\.com/subject/(\d+)/.*')
            match = pattern.match(url)
            if match:
                book_id = match.group(1)
                self.edit_number.setText(book_id)
                self.result.append(f"已提取书籍URL编号:{book_id}")
                return
            else:
                self.result.append("URL链接格式有误(请检查格式)")
                return
        except Exception as e:
            self.result1.setText("ERROR错误...")

    # Track which data types are ticked for scraping
    def on_checkbox_changed(self):
        self.selected_info = []

        if self.c3.isChecked():
            self.selected_info.append('book_name')

        if self.c4.isChecked():
            self.selected_info.append('score')

        if self.c5.isChecked():
            self.selected_info.append('author')

        if self.c6.isChecked():
            self.selected_info.append('cbs')

        if self.c7.isChecked():
            self.selected_info.append('cpf')

        if self.c8.isChecked():
            self.selected_info.append('fbt')

        if self.c9.isChecked():
            self.selected_info.append('cs')

        if self.c10.isChecked():
            self.selected_info.append('yzm')

        if self.c11.isChecked():
            self.selected_info.append('yz')

        if self.c12.isChecked():
            self.selected_info.append('cbn')

        if self.c13.isChecked():
            self.selected_info.append('ys')

        if self.c14.isChecked():
            self.selected_info.append('dj')

        if self.c15.isChecked():
            self.selected_info.append('zz')

        if self.c16.isChecked():
            self.selected_info.append('isbn')

        if self.c17.isChecked():
            self.selected_info.append('hqpl')

        if self.c18.isChecked():
            self.selected_info.append('nrjj')


    # Start the worker threads
    def start_thread(self):
        self.result.clear()
        code = self.get_exception()
        if code == 200:
            self.result.append("Thread-GetBookInfo Start-Up")
            self.result1.setText("数据抓取中...")
            self.thread_get_book_info.start()
            # Delay the comment thread so that the book information is shown first
            # (note: this sleep runs on the main thread)
            time.sleep(1)
            self.thread_get_book_com.start()
        else:
            self.result.append("URL连接错误-请检查HTTP连接...")

    # Called when a worker thread emits its finished signal
    def handle_finished_signal(self, result):
        # Show the data passed over from the worker thread
        self.result.append(result)
        self.result1.setText("数据抓取完成!")

    def get_exception(self):
        try:
            book_id = self.edit_number.text()
            url = f'https://book.douban.com/subject/{book_id}/'
            headers = {
                'User-Agent': random.choice(self.user_agents)
            }
            # Request the book page; the timeout lets the Timeout handler below actually fire
            response = requests.get(url, headers=headers, stream=True, timeout=10)
            code = response.status_code
            response.raise_for_status()  # Raise HTTPError if the request failed
            return code
        except requests.exceptions.HTTPError as errh:
            print("HTTP Error:", errh)
            return
        except requests.exceptions.ConnectionError as errc:
            print("Error Connecting:", errc)
            return
        except requests.exceptions.Timeout as errt:
            print("Timeout Error:", errt)
            return
        except requests.exceptions.RequestException as err:
            print("Error:", err)
            return

    from PyQt5.QtWidgets import QFileDialog
    from PyQt5.QtCore import Qt

    def save_to_file(self):
        text_data = self.result.toPlainText()
        try:
            if text_data:
                options = QFileDialog.Options()
                options |= QFileDialog.DontUseNativeDialog  # Use the Qt dialog instead of the native one

                file_name, _ = QFileDialog.getSaveFileName(self, "Save File", "", "Text Files (*.txt);;All Files (*)",
                                                           options=options)

                if file_name:
                    # Ensure the file has a .txt extension
                    if not file_name.lower().endswith(".txt"):
                        file_name += ".txt"

                    # Save as plain text
                    with open(file_name, 'w', encoding='utf-8') as file:
                        file.write(text_data)

                    self.result1.setText("文件保存成功!")
        except Exception as e:
            self.result1.setText("ERROR错误...")

    def download_cover(self):
        self.cover_thread.start()
        self.result1.setText("书籍封面正在下载...")

    def show_cover(self, img_data):
        try:
            # Nothing to save if no image data arrived
            if not img_data:
                self.result1.setText("ERROR错误...")
                return

            # Open the save-file dialog
            file_path, _ = QFileDialog.getSaveFileName(self, '保存封面图片', 'book_cover.jpg', 'Images (*.jpg *.png)')

            # Save the image only if the user chose a path
            if file_path:
                with open(file_path, 'wb') as f:
                    f.write(img_data)
                self.result1.setText("书籍封面保存成功!")
        except Exception as e:
            self.result1.setText("ERROR错误...")

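The main thread contains a long list of User-Agent strings (you could move them into a separate file to make the code more readable; I kept them in the main thread here to make the example easier to follow). handle_finished_signal processes the data sent by a worker thread when it finishes and updates the interface. start_thread starts the two worker threads; as explained above, the delay between them fixes the order in which the data appears in the interface.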

self.thread_get_book_info = Thread_get_book_info(self)
self.thread_get_book_com = Thread_get_book_com(self)
self.cover_thread = CoverDownloaderThread(self)
self.thread_get_book_info.finishedSignal.connect(self.handle_finished_signal)
self.thread_get_book_com.finishedSignal.connect(self.handle_finished_signal)
self.cover_thread.finished.connect(self.show_cover)

These lines create the thread instances and connect their signals to slot functions. One signal can be connected to several slots, and one slot can be connected to several signals.

Startup code

if __name__ == '__main__':
    QCoreApplication.setAttribute(Qt.AA_EnableHighDpiScaling)
    QCoreApplication.setAttribute(Qt.AA_UseHighDpiPixmaps)
    app = QApplication(sys.argv)
    window = MainWindow()
    window.show()
    sys.exit(app.exec_())

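There is one detail to note here: two extra lines were added to the usual startup code, because I package this program as a Mac pkg installer:

QCoreApplication.setAttribute(Qt.AA_EnableHighDpiScaling)
QCoreApplication.setAttribute(Qt.AA_UseHighDpiPixmaps)

This is a pitfall worth knowing about: without these two lines, the interface of the packaged program looks very blurry. Both attributes are set before the QApplication is created, as in the code above.

  1. QCoreApplication.setAttribute(Qt.AA_EnableHighDpiScaling)

    This tells PyQt5 to enable high-DPI scaling for the application. When the application runs on a high-resolution screen, this setting makes sure that its elements (windows, fonts, and so on) are scaled according to the screen density so that they display properly.

  2. QCoreApplication.setAttribute(Qt.AA_UseHighDpiPixmaps)

    This tells PyQt5 to use high-DPI image resources in the application. On high-resolution screens, high-DPI images look sharper. With both attributes set, PyQt5 scales the application's display elements and image resources according to the screen's DPI.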

Full code

Logic code

import re
import sys
import time
import random
import requests
from PyQt5.QtCore import QThread, pyqtSignal, QMutex, QSize, Qt, QCoreApplication
from PyQt5.QtGui import QPixmap
from PyQt5.QtWidgets import QMainWindow, QApplication, QFileDialog
from bs4 import BeautifulSoup
from Main_Ui import Ui_MainWindow

class Thread_get_book_info(QThread):
    finishedSignal = pyqtSignal(str)

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window
        self.mutex = QMutex()  # Mutex

    def run(self):
        try:
            print('Thread-get-book-info start up...')

            book_id = self.main_window.edit_number.text()
            self.selected_info = self.main_window.selected_info
            url = f'https://book.douban.com/subject/{book_id}/'
            # Pick a random User-Agent to reduce the chance of being blocked
            headers = {
                'User-Agent': random.choice(self.main_window.user_agents)
            }
            # Request the book page
            response = requests.get(url, headers=headers, stream=True)
            code = response.status_code
            print(code)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Parse the fields out of the page
            title_element = soup.select_one('h1 span')
            title = title_element.get_text(strip=True) if title_element else '未获取到书名数据'

            score_element = soup.select_one('#interest_sectl > div > div.rating_self.clearfix > strong')
            score = score_element.get_text(strip=True) if score_element else '未获取到评分数据'

            author_element = soup.select_one('#info > span:nth-child(1) > a')
            author = author_element.get_text(strip=True) if author_element else '未获取到作者数据'

            cbs_element = soup.select_one('#info > a:nth-child(4)')
            cbs = cbs_element.get_text(strip=True) if cbs_element else '未获取到出版社数据'

            cpf_element = soup.select_one('#info > a:nth-child(7)')
            cpf = cpf_element.get_text(strip=True) if cpf_element else '未获取到出品方数据'

            fbt_element = soup.find('span', class_='pl', string='副标题:')
            fbt = fbt_element.next_sibling.get_text(strip=True) if fbt_element else '未获取到副标题数据'

            cs_element = soup.find('span', class_='pl', string='丛书:')
            cs = cs_element.next_sibling.next_sibling.get_text(strip=True) if cs_element else '未获取到丛书数据'

            yzm_element = soup.find('span', class_='pl', string='原作名:')
            yzm = yzm_element.next_sibling.get_text(strip=True) if yzm_element else '未获取到原作名数据'

            yz_element = soup.find('span', class_='pl', string=' 译者')
            yz = yz_element.next_sibling.next_sibling.get_text(strip=True) if yz_element else '未获取到译者数据'

            cbn_element = soup.find('span', class_='pl', string='出版年:')
            cbn = cbn_element.next_sibling.get_text(strip=True) if cbn_element else '未获取到出版年数据'

            ys_element = soup.find('span', class_='pl', string='页数:')
            ys = ys_element.next_sibling.get_text(strip=True) if ys_element else '未获取到页数数据'

            dj_element = soup.find('span', class_='pl', string='定价:')
            dj = dj_element.next_sibling.get_text(strip=True) if dj_element else '未获取到定价数据'

            zz_element = soup.find('span', class_='pl', string='装帧:')
            zz = zz_element.next_sibling.get_text(strip=True) if zz_element else '未获取到装帧数据'

            isbn_element = soup.find('span', class_='pl', string='ISBN:')
            isbn = isbn_element.next_sibling.get_text(strip=True) if isbn_element else '未获取到ISBN数据'

            nrjj_element = soup.find('div', class_='intro')
            nrjj = nrjj_element.get_text(strip=True) if nrjj_element else '未获取到内容简介数据'


            result = ""
            if 'book_name' in self.selected_info:
                result += f'书名: {title}\n'

            if 'score' in self.selected_info:
                result += f'评分: {score}\n'

            if 'author' in self.selected_info:
                result += f'作者: {author}\n'

            if 'cbs' in self.selected_info:
                result += f'出版社: {cbs}\n'

            if 'cpf' in self.selected_info:
                result += f'出品方: {cpf}\n'

            if 'fbt' in self.selected_info:
                result += f'副标题: {fbt}\n'

            if 'cs' in self.selected_info:
                result += f'丛书: {cs}\n'

            if 'yzm' in self.selected_info:
                result += f'原作名: {yzm}\n'

            if 'yz' in self.selected_info:
                result += f'译者: {yz}\n'

            if 'cbn' in self.selected_info:
                result += f'出版年: {cbn}\n'

            if 'ys' in self.selected_info:
                result += f'页数: {ys}\n'

            if 'dj' in self.selected_info:
                result += f'定价: {dj}\n'

            if 'zz' in self.selected_info:
                result += f'装帧: {zz}\n'

            if 'isbn' in self.selected_info:
                result += f'ISBN: {isbn}\n'

            if 'nrjj' in self.selected_info:
                result += f'内容简介: {nrjj}\n'


            # Send the assembled text to the main thread
            self.finishedSignal.emit(result)

            # TODO
            print(self.selected_info)
        except Exception as e:
            result = f"{e}"
            self.finishedSignal.emit(result)

class Thread_get_book_com(QThread):
    finishedSignal = pyqtSignal(str)

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window
        self.mutex = QMutex()

    def run(self):
        try:
            book_id = self.main_window.edit_number.text()
            self.selected_info = self.main_window.selected_info

            url = f'https://book.douban.com/subject/{book_id}/comments/'
            # Pick a random User-Agent to reduce the chance of being blocked
            headers = {
                'User-Agent': random.choice(self.main_window.user_agents)
            }
            # Request the comments page
            response = requests.get(url, headers=headers, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')

            hqpl = soup.select('.comment-item .short')

            results = ""
            if 'hqpl' in self.selected_info:
                for comment in hqpl:
                    # One comment per line, so they do not run together in the output
                    results += f'{comment.get_text(strip=True)}\n'

            self.finishedSignal.emit(results)
        except Exception as e:
            results = 'ERROR错误...'
            self.finishedSignal.emit(results)


class CoverDownloaderThread(QThread):
    finished = pyqtSignal(bytes)

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window

    def run(self):
        try:
            book_id = self.main_window.edit_number.text()
            url = f'https://book.douban.com/subject/{book_id}/'
            headers = {'User-Agent': random.choice(self.main_window.user_agents)}

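            # Fetch the book page
            response = requests.get(url, headers=headers, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Locate the cover image tag
            img_tag = soup.select_one('.nbg img[src]')

            # Read the cover image URL (raises TypeError if no tag was found; handled below)
            img_url = img_tag['src']

            # Download the cover image
            img_data = requests.get(img_url, timeout=10).content

            self.finished.emit(img_data)
        except Exception as e:
            # The signal is declared as pyqtSignal(bytes), so emit empty bytes on
            # failure instead of the exception object
            print(f'Cover download failed: {e}')
            self.finished.emit(b'')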


class MainWindow(QMainWindow, Ui_MainWindow):
    def __init__(self):
        super(MainWindow, self).__init__()
        self.setupUi(self)
        # Pool of User-Agent strings used to mimic browsers and reduce the chance of being blocked
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
            'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.16 Safari/537.36' ]
        self.selected_info = []
        # Create the worker threads and connect their signals to slots in the main window
        self.thread_get_book_info = Thread_get_book_info(self)
        self.thread_get_book_com = Thread_get_book_com(self)
        self.cover_thread = CoverDownloaderThread(self)
        self.thread_get_book_info.finishedSignal.connect(self.handle_finished_signal)
        self.thread_get_book_com.finishedSignal.connect(self.handle_finished_signal)
        self.cover_thread.finished.connect(self.show_cover)



        self.start.clicked.connect(self.book_url)
        self.start.clicked.connect(self.start_thread)
        self.stop.clicked.connect(self.save_to_file)
        # Extract the book ID first so that it is set before the cover thread starts
        self.img.clicked.connect(self.book_url)
        self.img.clicked.connect(self.download_cover)

        # Connect every checkbox to the same slot
        self.c3.stateChanged.connect(self.on_checkbox_changed)
        self.c4.stateChanged.connect(self.on_checkbox_changed)
        self.c5.stateChanged.connect(self.on_checkbox_changed)
        self.c6.stateChanged.connect(self.on_checkbox_changed)
        self.c7.stateChanged.connect(self.on_checkbox_changed)
        self.c8.stateChanged.connect(self.on_checkbox_changed)
        self.c9.stateChanged.connect(self.on_checkbox_changed)
        self.c10.stateChanged.connect(self.on_checkbox_changed)
        self.c11.stateChanged.connect(self.on_checkbox_changed)
        self.c12.stateChanged.connect(self.on_checkbox_changed)
        self.c13.stateChanged.connect(self.on_checkbox_changed)
        self.c14.stateChanged.connect(self.on_checkbox_changed)
        self.c15.stateChanged.connect(self.on_checkbox_changed)
        self.c16.stateChanged.connect(self.on_checkbox_changed)
        self.c17.stateChanged.connect(self.on_checkbox_changed)
        self.c18.stateChanged.connect(self.on_checkbox_changed)

    # Extract the book ID from a Douban book URL
    def book_url(self):
        try:
            url = self.edit_book.text()
            # Match the book ID in a Douban book page URL with a regular expression
            pattern = re.compile(r'https://book\.douban\.com/subject/(\d+)/.*')
            match = pattern.match(url)
            if match:
                book_id = match.group(1)
                self.edit_number.setText(book_id)
                self.result.append(f"已提取书籍URL编号:{book_id}")
                return
            else:
                self.result.append("URL链接格式有误(请检查格式)")
                return
        except Exception as e:
            self.result1.setText("ERROR错误...")

    # Rebuild the list of fields selected via the checkboxes
    def on_checkbox_changed(self):
        self.selected_info = []

        if self.c3.isChecked():
            self.selected_info.append('book_name')

        if self.c4.isChecked():
            self.selected_info.append('score')

        if self.c5.isChecked():
            self.selected_info.append('author')

        if self.c6.isChecked():
            self.selected_info.append('cbs')

        if self.c7.isChecked():
            self.selected_info.append('cpf')

        if self.c8.isChecked():
            self.selected_info.append('fbt')

        if self.c9.isChecked():
            self.selected_info.append('cs')

        if self.c10.isChecked():
            self.selected_info.append('yzm')

        if self.c11.isChecked():
            self.selected_info.append('yz')

        if self.c12.isChecked():
            self.selected_info.append('cbn')

        if self.c13.isChecked():
            self.selected_info.append('ys')

        if self.c14.isChecked():
            self.selected_info.append('dj')

        if self.c15.isChecked():
            self.selected_info.append('zz')

        if self.c16.isChecked():
            self.selected_info.append('isbn')

        if self.c17.isChecked():
            self.selected_info.append('hqpl')

        if self.c18.isChecked():
            self.selected_info.append('nrjj')


    # Start the get_book_info and get_book_com threads
    def start_thread(self):
        self.result.clear()
        code = self.get_exception()
        if code==200:
            self.result.append("Thread-GetBookInfo Start-Up")
            self.result1.setText("数据抓取中...")
            self.thread_get_book_info.start()
            time.sleep(1)
            self.thread_get_book_com.start()
        else:
            self.result.append("URL连接错误-请检查HTTP连接...")

    # Slot called when a worker thread emits finishedSignal
    def handle_finished_signal(self, result):
        # Show the data passed back from the worker thread
        self.result.append(result)
        self.result1.setText("数据抓取完成!")

    def get_exception(self):
        try:
            book_id = self.edit_number.text()
            url = f'https://book.douban.com/subject/{book_id}/'
            headers = {
                'User-Agent': random.choice(self.user_agents)
            }
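            # Request the book page to check that it is reachable; without a
            # timeout the Timeout handler below would never be triggered
            response = requests.get(url, headers=headers, timeout=10)
            code = response.status_code
            response.raise_for_status()  # Raise HTTPError if the request failed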
            return code
        except requests.exceptions.HTTPError as errh:
            print("HTTP Error:", errh)
            return
        except requests.exceptions.ConnectionError as errc:
            print("Error Connecting:", errc)
            return
        except requests.exceptions.Timeout as errt:
            print("Timeout Error:", errt)
            return
        except requests.exceptions.RequestException as err:
            print("Error:", err)
            return

    # Note: QFileDialog and Qt must be imported at module level (top of the file);
    # names imported inside a class body are not visible inside its methods.

    def save_to_file(self):
        text_data = self.result.toPlainText()
        try:
            if text_data:
                options = QFileDialog.Options()
                options |= QFileDialog.DontUseNativeDialog  # Use the Qt dialog instead of the native one

                file_name, _ = QFileDialog.getSaveFileName(self, "Save File", "", "Text Files (*.txt);;All Files (*)",
                                                           options=options)

                if file_name:
                    # Ensure the file has a .txt extension
                    if not file_name.lower().endswith(".txt"):
                        file_name += ".txt"

                    # Save as plain text
                    with open(file_name, 'w', encoding='utf-8') as file:
                        file.write(text_data)

                    self.result1.setText("文件保存成功!")
        except Exception as e:
            self.result1.setText("ERROR错误...")

    def download_cover(self):
        self.cover_thread.start()
        self.result1.setText("书籍封面正在下载...")

    def show_cover(self, img_data):
        try:
            # Nothing to save if the download failed
            if not img_data:
                self.result1.setText("ERROR错误...")
                return

            # Open a save-file dialog (getSaveFileName is a static method)
            file_path, _ = QFileDialog.getSaveFileName(self, '保存封面图片', 'book_cover.jpg', 'Images (*.jpg *.png)')

            # Save the image if the user chose a path
            if file_path:
                with open(file_path, 'wb') as f:
                    f.write(img_data)
                self.result1.setText("书籍封面已保存成功!")
        except Exception as e:
            self.result1.setText("ERROR错误...")



if __name__ == '__main__':
    QCoreApplication.setAttribute(Qt.AA_EnableHighDpiScaling)
    QCoreApplication.setAttribute(Qt.AA_UseHighDpiPixmaps)
    app = QApplication(sys.argv)
    window = MainWindow()
    window.show()
    sys.exit(app.exec_())
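
The book_url method above only accepts URLs that start with https:// and have a slash after the book ID. As a minimal sketch, assuming you also want to accept http:// links, links without the trailing slash, and links pasted with surrounding spaces, the extraction could be written as a standalone function. The name extract_book_id and the relaxed pattern are my own illustration, not part of the original program:

```python
import re

# Hypothetical helper: the name and the relaxed pattern are assumptions for
# illustration, not part of the original program.
# After the digits we require '/', '?', '#' or the end of the string, so that
# something like ".../subject/123abc" is rejected.
_BOOK_URL = re.compile(r'https?://book\.douban\.com/subject/(\d+)(?:[/?#]|$)')


def extract_book_id(url):
    """Return the numeric book ID from a Douban book URL, or None if it does not match."""
    match = _BOOK_URL.match(url.strip())
    return match.group(1) if match else None


if __name__ == '__main__':
    print(extract_book_id('https://book.douban.com/subject/1084336/'))
    print(extract_book_id('https://book.douban.com/subject/1084336'))
    print(extract_book_id('https://movie.douban.com/subject/1084336/'))
```

Because the function has no dependency on the window, it can be tried out in a plain Python shell before wiring it into book_url.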

UI code

Note: if you plan to package the program, remember to add

import pathlib
folder = pathlib.Path(__file__).parent.resolve()

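This snippet resolves resources relative to the script's own folder, which fixes the problem of image files not being found after packaging. Here I load my logo image: self.label_logo.setPixmap(QtGui.QPixmap(f"{folder}/AYAOBOOM.png"))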
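
If the program is packaged with PyInstaller (an assumption; the packaging tool is not named in this part of the post), bundled files may be unpacked into a temporary folder that PyInstaller exposes as sys._MEIPASS. The following is a rough sketch of a helper that checks for that attribute first and otherwise falls back to the script's folder, as the snippet above does. The name resource_path and its base parameter are hypothetical:

```python
import sys
from pathlib import Path


def resource_path(name, base=None):
    """Return the path of a bundled resource.

    Hypothetical helper (not from the original program). Lookup order:
    1. an explicit `base` folder, if given;
    2. sys._MEIPASS, which PyInstaller sets inside a packaged app;
    3. the folder of this script, or the current directory if __file__ is unset.
    """
    if base is None:
        bundle_dir = getattr(sys, '_MEIPASS', None)
        if bundle_dir is not None:
            base = bundle_dir
        elif '__file__' in globals():
            base = Path(__file__).resolve().parent
        else:
            base = Path.cwd()
    return Path(base) / name


if __name__ == '__main__':
    # In the UI code this could replace f"{folder}/AYAOBOOM.png"
    print(resource_path('AYAOBOOM.png'))
```

The image still has to be included in the bundle when packaging; the helper only decides where to look for it.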


from PyQt5 import QtCore, QtGui, QtWidgets
import pathlib
folder = pathlib.Path(__file__).parent.resolve()



class Ui_MainWindow(object):
    def setupUi(self, MainWindow):
        MainWindow.setObjectName("MainWindow")
        MainWindow.setEnabled(True)
        MainWindow.resize(793, 529)
        font = QtGui.QFont()
        font.setFamily("Hannotate SC")
        font.setBold(True)
        font.setWeight(75)
        MainWindow.setFont(font)
        self.centralwidget = QtWidgets.QWidget(MainWindow)
        self.centralwidget.setObjectName("centralwidget")
        self.label_logo = QtWidgets.QLabel(self.centralwidget)
        self.label_logo.setGeometry(QtCore.QRect(80, 10, 631, 111))
        self.label_logo.setText("")
        self.label_logo.setPixmap(QtGui.QPixmap(f"{folder}/AYAOBOOM.png"))
        self.label_logo.setObjectName("label_logo")
        self.label_book = QtWidgets.QLabel(self.centralwidget)
        self.label_book.setGeometry(QtCore.QRect(60, 140, 61, 20))
        self.label_book.setObjectName("label_book")
        self.edit_book = QtWidgets.QLineEdit(self.centralwidget)
        self.edit_book.setGeometry(QtCore.QRect(130, 140, 371, 21))
        self.edit_book.setPlaceholderText("请输入图书链接...")
        self.edit_book.setObjectName("edit_book")
        self.label_number = QtWidgets.QLabel(self.centralwidget)
        self.label_number.setGeometry(QtCore.QRect(540, 140, 61, 20))
        self.label_number.setObjectName("label_number")
        self.edit_number = QtWidgets.QLineEdit(self.centralwidget)
        self.edit_number.setGeometry(QtCore.QRect(610, 140, 91, 21))
        self.edit_number.setReadOnly(True)
        self.edit_number.setPlaceholderText("自动处理...")
        self.edit_number.setObjectName("edit_number")
        self.c3 = QtWidgets.QCheckBox(self.centralwidget)
        self.c3.setGeometry(QtCore.QRect(60, 190, 87, 20))
        self.c3.setObjectName("c3")
        self.c5 = QtWidgets.QCheckBox(self.centralwidget)
        self.c5.setGeometry(QtCore.QRect(260, 190, 87, 20))
        self.c5.setObjectName("c5")
        self.c6 = QtWidgets.QCheckBox(self.centralwidget)
        self.c6.setGeometry(QtCore.QRect(360, 190, 87, 20))
        self.c6.setObjectName("c6")
        self.c7 = QtWidgets.QCheckBox(self.centralwidget)
        self.c7.setGeometry(QtCore.QRect(460, 190, 87, 20))
        self.c7.setObjectName("c7")
        self.c8 = QtWidgets.QCheckBox(self.centralwidget)
        self.c8.setGeometry(QtCore.QRect(560, 190, 87, 20))
        self.c8.setObjectName("c8")
        self.c10 = QtWidgets.QCheckBox(self.centralwidget)
        self.c10.setGeometry(QtCore.QRect(60, 230, 87, 20))
        self.c10.setObjectName("c10")
        self.c9 = QtWidgets.QCheckBox(self.centralwidget)
        self.c9.setGeometry(QtCore.QRect(660, 190, 87, 20))
        self.c9.setObjectName("c9")
        self.c14 = QtWidgets.QCheckBox(self.centralwidget)
        self.c14.setGeometry(QtCore.QRect(460, 230, 87, 20))
        self.c14.setObjectName("c14")
        self.c12 = QtWidgets.QCheckBox(self.centralwidget)
        self.c12.setGeometry(QtCore.QRect(260, 230, 87, 20))
        self.c12.setObjectName("c12")
        self.c13 = QtWidgets.QCheckBox(self.centralwidget)
        self.c13.setGeometry(QtCore.QRect(360, 230, 87, 20))
        self.c13.setObjectName("c13")
        self.c11 = QtWidgets.QCheckBox(self.centralwidget)
        self.c11.setGeometry(QtCore.QRect(160, 230, 87, 20))
        self.c11.setObjectName("c11")
        self.c15 = QtWidgets.QCheckBox(self.centralwidget)
        self.c15.setGeometry(QtCore.QRect(560, 230, 87, 20))
        self.c15.setObjectName("c15")
        self.c4 = QtWidgets.QCheckBox(self.centralwidget)
        self.c4.setGeometry(QtCore.QRect(160, 190, 87, 20))
        self.c4.setObjectName("c4")
        self.c16 = QtWidgets.QCheckBox(self.centralwidget)
        self.c16.setGeometry(QtCore.QRect(660, 230, 87, 20))
        self.c16.setObjectName("c16")
        self.result1 = QtWidgets.QLabel(self.centralwidget)
        self.result1.setGeometry(QtCore.QRect(30, 480, 181, 16))
        self.result1.setObjectName("result1")
        self.c17 = QtWidgets.QCheckBox(self.centralwidget)
        self.c17.setGeometry(QtCore.QRect(60, 270, 87, 20))
        self.c17.setObjectName("c17")
        self.c18 = QtWidgets.QCheckBox(self.centralwidget)
        self.c18.setGeometry(QtCore.QRect(160, 270, 101, 20))
        self.c18.setObjectName("c18")
        self.start = QtWidgets.QPushButton(self.centralwidget)
        self.start.setGeometry(QtCore.QRect(60, 350, 121, 51))
        self.start.setObjectName("start")
        self.stop = QtWidgets.QPushButton(self.centralwidget)
        self.stop.setGeometry(QtCore.QRect(190, 350, 121, 51))
        self.stop.setObjectName("stop")
        self.line = QtWidgets.QFrame(self.centralwidget)
        self.line.setGeometry(QtCore.QRect(60, 310, 671, 20))
        self.line.setFrameShape(QtWidgets.QFrame.HLine)
        self.line.setFrameShadow(QtWidgets.QFrame.Sunken)
        self.line.setObjectName("line")
        self.result = QtWidgets.QTextEdit(self.centralwidget)
        self.result.setGeometry(QtCore.QRect(350, 337, 411, 151))
        self.result.setObjectName("result")
        self.label = QtWidgets.QLabel(self.centralwidget)
        self.label.setGeometry(QtCore.QRect(130, 420, 101, 16))
        self.label.setObjectName("label")
        self.label_2 = QtWidgets.QLabel(self.centralwidget)
        self.label_2.setGeometry(QtCore.QRect(90, 440, 181, 16))
        self.label_2.setObjectName("label_2")
        self.img = QtWidgets.QPushButton(self.centralwidget)
        self.img.setGeometry(QtCore.QRect(260, 265, 111, 41))
        self.img.setObjectName("img")
        MainWindow.setCentralWidget(self.centralwidget)
        self.statusbar = QtWidgets.QStatusBar(MainWindow)
        self.statusbar.setObjectName("statusbar")
        MainWindow.setStatusBar(self.statusbar)

        self.retranslateUi(MainWindow)
        QtCore.QMetaObject.connectSlotsByName(MainWindow)

    def retranslateUi(self, MainWindow):
        _translate = QtCore.QCoreApplication.translate
        MainWindow.setWindowTitle(_translate("MainWindow", "AYAOBOOM"))
        self.label_book.setText(_translate("MainWindow", "书籍链接:"))
        self.edit_book.setText(_translate("MainWindow", ""))
        self.label_number.setText(_translate("MainWindow", "书籍编号:"))
        self.edit_number.setText(_translate("MainWindow", ""))
        self.c3.setText(_translate("MainWindow", "书籍名称"))
        self.c5.setText(_translate("MainWindow", "作者"))
        self.c6.setText(_translate("MainWindow", "出版社"))
        self.c7.setText(_translate("MainWindow", "出品方"))
        self.c8.setText(_translate("MainWindow", "副标题"))
        self.c10.setText(_translate("MainWindow", "原作名"))
        self.c9.setText(_translate("MainWindow", "丛书"))
        self.c14.setText(_translate("MainWindow", "定价"))
        self.c12.setText(_translate("MainWindow", "出版年"))
        self.c13.setText(_translate("MainWindow", "页数"))
        self.c11.setText(_translate("MainWindow", "译者"))
        self.c15.setText(_translate("MainWindow", "装帧"))
        self.c4.setText(_translate("MainWindow", "评分"))
        self.c16.setText(_translate("MainWindow", "ISBN"))
        self.result1.setText(_translate("MainWindow", "用户提示状态"))
        self.c17.setText(_translate("MainWindow", "获取评论"))
        self.c18.setText(_translate("MainWindow", "内容简介"))
        self.start.setText(_translate("MainWindow", "开始抓取数据"))
        self.stop.setText(_translate("MainWindow", "保存抓取数据"))
        self.label.setText(_translate("MainWindow", "IP防封功能已开启"))
        self.label_2.setText(_translate("MainWindow", "请关闭VPN或其他网络代理工具"))
        self.img.setText(_translate("MainWindow", "保存书籍照片"))
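
The checkboxes c3 to c18 defined above correspond one-to-one to the field keys used in on_checkbox_changed. As a sketch only, the sixteen if-statements and sixteen connect calls could be driven by one mapping, which makes it harder to forget a checkbox. The mapping is copied from on_checkbox_changed; the function names and the FakeCheckBox stand-in (used so the example runs without PyQt5) are my own assumptions:

```python
# Mapping copied from on_checkbox_changed in the main program
FIELD_BY_CHECKBOX = {
    'c3': 'book_name', 'c4': 'score', 'c5': 'author', 'c6': 'cbs',
    'c7': 'cpf', 'c8': 'fbt', 'c9': 'cs', 'c10': 'yzm',
    'c11': 'yz', 'c12': 'cbn', 'c13': 'ys', 'c14': 'dj',
    'c15': 'zz', 'c16': 'isbn', 'c17': 'hqpl', 'c18': 'nrjj',
}


def collect_selected(window):
    """Return the field keys whose checkbox on `window` is checked."""
    return [field for name, field in FIELD_BY_CHECKBOX.items()
            if getattr(window, name).isChecked()]


def connect_checkboxes(window, slot):
    """Connect the stateChanged signal of every mapped checkbox to `slot`."""
    for name in FIELD_BY_CHECKBOX:
        getattr(window, name).stateChanged.connect(slot)


class FakeCheckBox:
    """Stand-in for QCheckBox so the sketch runs without PyQt5."""

    def __init__(self, checked):
        self.checked = checked

    def isChecked(self):
        return self.checked


def make_fake_window(checked_names):
    """Build an object with attributes c3..c18, checking only `checked_names`."""
    window = type('FakeWindow', (), {})()
    for name in FIELD_BY_CHECKBOX:
        setattr(window, name, FakeCheckBox(name in checked_names))
    return window


if __name__ == '__main__':
    print(collect_selected(make_fake_window({'c3', 'c16'})))
```

In the real window, on_checkbox_changed could then reduce to self.selected_info = collect_selected(self).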

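Finally, package the program into an installer and you're done!

Source code download link:

Link: https://pan.baidu.com/s/1rOoB_RLSVt_aUYuBHxHYHw?pwd=8cea

Extraction code: 8cea

That wraps up this 40,000-character walkthrough and tutorial! If it helped your learning, please follow, like, and bookmark. Thank you for your support! Discussion in the comments is welcome, and so are reports of any problems or shortcomings in the program.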
