Learning Python Web Scraping: Notes (2)

Latest recommended article published on 2023-09-27 17:41:19

青邃

Tags: python, web scraping

Article link: https://blog.csdn.net/qq_60852702/article/details/119630517

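Continued from the previous post: Learning Python Web Scraping: Notes (青邃's blog, CSDN)

In the previous post we wrote a simple scraper that fetches a single chapter. Now I want to scrape the entire novel, so let's keep going.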

This post has three parts:

1. Approach

2. Scraping the whole novel with a single thread

3. Scraping the novel with multiple threads

I. Approach

import requests as req
from bs4 import BeautifulSoup as bs
html='https://www.bqkan8.com/1_1496/450365.html'  # the chapter URL
txt=req.get(url=html)  # fetch the page
txt=bs(txt.text,'html.parser') 
txt=txt.find_all('div',id='content')
txt=(str(txt).replace('<br/><br/>',''))
print(txt.replace('        ','\n\n'))

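This is the code we wrote in the previous post. Next I want to scrape the whole novel. Fetching a page requires its URL, so the first step is to collect the URLs of all chapters and store them in a list, then pull each URL out of the list when it is needed.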

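II. Scraping the whole novel with a single thread

With the approach settled, it's time to write code.

First we write a scraper that fetches the novel's table of contents and filters out everything we don't need: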

import requests as req
from bs4 import BeautifulSoup as ds
url1='https://www.bqkan8.com/1_1496'
text1=req.get(url=url1)
bf=ds(text1.content,'html.parser')
text2=bf.find_all('div',class_="listmain")
print(text2)

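Running it prints the raw HTML of the listmain block, which holds the chapter list.

Looking more closely at this output: since we want the whole novel, the first 12 entries are the "latest chapters", and including them would scramble the order, so we drop them. We also need to pull out the chapter titles and URLs separately.

For convenience I store the URLs and chapter titles in lists, using the .string attribute to get each title and the .get() method to get each URL.

Enough talk; the code makes it clearer: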

import requests as req
from bs4 import BeautifulSoup as ds
jsq=0
url1='https://www.bqkan8.com/1_1496'
encoding='utf-8'
text1=req.get(url=url1)
bf=ds(text1.content,'html.parser')
text2=bf.find_all('div',class_="listmain")
text2=text2[0]
a=text2.find_all('a')
list1=[]
url1='https://www.bqkan8.com'
for i in a:
    if jsq>11:
        wz=url1+i.get('href')
        list1.append(wz)
    else:
        jsq=jsq+1
print(list1)

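The counter inside the for loop is there to skip the 12 "latest chapter" links, so the URLs we collect start from the first chapter of the novel. The program prints the resulting list of chapter URLs.

With the URLs sorted out, we can write the program that fetches the text of the novel. We only need to adapt the code above slightly: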
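The skip-and-join step can also be sketched without a counter, using a list slice and urllib.parse.urljoin from the standard library. This is only an illustration under assumptions: chapter_urls, BASE_URL, SKIP and the sample hrefs are hypothetical names and made-up data standing in for the values returned by i.get('href').

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.bqkan8.com'  # site root used in the post
SKIP = 12                            # number of "latest chapter" links to drop

def chapter_urls(hrefs, base=BASE_URL, skip=SKIP):
    """Return absolute URLs for every href after the first `skip` entries."""
    return [urljoin(base, href) for href in hrefs[skip:]]

# Made-up hrefs standing in for the values of i.get('href')
sample_hrefs = [f'/1_1496/latest_{n}.html' for n in range(12)] + [
    '/1_1496/450365.html',
    '/1_1496/450366.html',
]
print(chapter_urls(sample_hrefs))
```

The slice keeps the intent (drop the first 12 links) in one place, and urljoin avoids building the address by string concatenation.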

import requests as req
from bs4 import BeautifulSoup as ds
jsq=0
url1='https://www.bqkan8.com/1_1496'
encoding='utf-8'
text1=req.get(url=url1)
bf=ds(text1.content,'html.parser')
text2=bf.find_all('div',class_="listmain")
text2=text2[0]
a=text2.find_all('a')
r=open('E:/暑期爬虫学习/小说.txt','a',encoding='utf-8')
list1=[];zjjs=0
url1='https://www.bqkan8.com'
for i in a:
    if jsq>11:
        wz=url1+i.get('href')
        list1.append(wz)
        zjjs+=1
    else:
        jsq=jsq+1
jsq=1
for i in list1:
    nr=req.get(url=i)
    text3=ds(nr.content,'html.parser')
    text3=text3.find_all('div',id='content')
    text3=text3[0].text.replace("        ",'\n\n')
    r.write(text3)
    jd=jsq/zjjs*100;jd=round(jd,2)
    print(f'Download progress: {jd}%')
    jsq+=1
r.close()
print('Download complete!\nExiting')

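Here a for loop fetches each URL in turn, and open() saves the text to a local file. Then you run it and discover...

Yikes, a 13 MB novel takes more than half an hour to download. That is far too slow!

So let's move on to the core of this post, and the most important milestone in my scraping studies so far: Thread + queue, using threads together with a queue.

III. Scraping the novel with multiple threads

I won't go into Python's "fake multithreading" here; there are plenty of write-ups about it online. First, let's get to know the tool we need this time, the threading module: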

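import threading  # import the threading module
import time
def work(worker):
    print(f"Worker {worker} is working")  # the task each thread will run
    time.sleep(5)
thread1=threading.Thread(target=work,args=(1,))  # target is the function the thread runs; args holds its arguments as a tuple, so a single argument needs a trailing comma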
thread2=threading.Thread(target=work,args=(2,))
thread1.start()
thread2.start()

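Note that each task sleeps for five seconds, yet when we run the program the two threads start almost back to back: the second one does not wait for the first to finish sleeping.

Next comes the queue module. Of its three queue types (FIFO queue.Queue, LIFO queue.LifoQueue, and queue.PriorityQueue) I use the priority queue, queue.PriorityQueue(). Put simply, every task added to the queue carries an identifier, and the task with the smallest identifier comes out first.

Now for the code. It is long but not complicated; read the comments slowly and it should make sense: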
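To make the overlap measurable, here is a small sketch that times two sleeping threads. The function run_two_workers and the 0.5 second sleep are assumptions chosen for illustration; the sleep stands in for a slow network request.

```python
import threading
import time

def work(worker, seconds):
    """Pretend to work by sleeping; stands in for a slow network request."""
    time.sleep(seconds)

def run_two_workers(seconds=0.5):
    """Start two threads, wait for both, and return the elapsed wall-clock time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=work, args=(n, seconds)) for n in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until this thread has finished
    return time.perf_counter() - start

elapsed = run_two_workers()
print(f'two workers sleeping 0.5 s each took {elapsed:.2f} s in total')
```

If the two sleeps ran one after the other the total would be about one second; because they overlap, the elapsed time stays close to a single sleep.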
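A minimal sketch of that ordering behaviour, assuming made-up chapter labels: items go into the queue in scrambled order and come out sorted by identifier. The helper drain_in_priority_order is a hypothetical name used only here.

```python
import queue

def drain_in_priority_order(items):
    """Put (identifier, value) pairs into a PriorityQueue and return them in the order they come out."""
    q = queue.PriorityQueue()
    for identifier, value in items:
        q.put((identifier, value))
    out = []
    while not q.empty():
        out.append(q.get())
    return out

scrambled = [(2, 'chapter 3'), (0, 'chapter 1'), (1, 'chapter 2')]
ordered = drain_in_priority_order(scrambled)
print(ordered)
```

Each entry is compared as a whole, so the identifiers should be unique; otherwise the queue falls back to comparing the second element.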

import requests as req
from bs4 import BeautifulSoup as bf
import threading
import queue
address=input('Enter the output file path (e.g. E:/xxx/xxx.txt): ')
url1=input("Enter the URL of the novel's table-of-contents page (pick a novel on https://www.bqkan8.com/). Note: it must be the full URL!\nURL: ")
def nove(url1):
    nove_directory_url=req.get(url=url1)
    nove_directory_url=bf(nove_directory_url.content,'html.parser')
    nove_directory_url=nove_directory_url.find_all('div',class_='listmain')
    nove_directory_url=nove_directory_url[0]
    nove_directory_url=list(nove_directory_url.find_all('a'))
    return nove_directory_url       # the <a> tags holding each chapter's URL and title

def nove_directory_url_get(nove_directory_url1):
    nove_directory_url_list=[];jsq=0
    for i in nove_directory_url1:
        if jsq>11:
            nove_directory='https://www.bqkan8.com'+i.get('href')
            nove_directory_url_list.append(nove_directory)
        else:
            jsq=jsq+1
    return nove_directory_url_list     # the URL of every chapter

def nove_directory_get(nove_directory):
    nove_directory_list=[];jsq=0
    for i in nove_directory :
        if jsq>11:
            nove_directory1=i.string
            nove_directory_list.append(nove_directory1)
        else:
            jsq=jsq+1
    return nove_directory_list  # the title of every chapter

def nove_text(url2):
    nove_text=req.get(url=url2)
    nove_text=bf(nove_text.content,'html.parser')
    nove_text=nove_text.find_all('div',id='content')
    nove_text=nove_text[0].text.replace('        ','\n\n')
    return nove_text       # the text of one chapter

nove_url_and_directory=nove(url1)  # get the list of chapter links (titles and URLs)
nove_directory_url_list=nove_directory_url_get(nove_url_and_directory)  # list of chapter URLs
nove_directory_list=nove_directory_get(nove_url_and_directory)  # list of chapter titles


text3={}  # shared dict: key = queue identifier, value = chapter text
def nove_text_(nove_text_url,nove_title):
    text1=(nove_title+'\n\n')
    text2=nove_text_url
    nove_text1='\n\n'+text1+nove_text(text2[1])
    text3[text2[0]]=nove_text1  # store the text under its queue identifier so the batch can be sorted later

long=len(nove_directory_list)  # number of chapters
nove_directory_url_queue=queue.PriorityQueue()  # queue of chapter URLs
nove_directory_list_queue=queue.PriorityQueue()  # queue of chapter titles
z=0
for i in nove_directory_list:
    nove_directory_list_queue.put([z,i])
    z+=1  # add each title to the title queue with its identifier

x=0
for i in nove_directory_url_list:
    nove_directory_url_queue.put([x,i])
    x+=1  # add each URL to the URL queue with its identifier
print('Download queue loaded')

long2=0
while not nove_directory_url_queue.empty():  # keep going until the URL queue is empty
    threads=[]
    for _ in range(10):  # start up to 10 threads per batch
        if nove_directory_url_queue.empty():
            break  # fewer than 10 chapters left
        thread=threading.Thread(target=nove_text_,args=(nove_directory_url_queue.get(),nove_directory_list_queue.get()[1]))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()  # block the main thread until this thread has finished
    t=open(f'{address}','a',encoding='utf-8')
    for i in sorted(text3):  # sort the dict keys so the chapters are written in order
        txt=text3[i]
        t.write(txt)  # write to the file
    t.close()
    long2=long2+len(threads)
    text3.clear()  # empty the dict before the next batch
    f=round(long2/long*100,2)
    print('Download progress:',f,'%')
print('Download complete!')

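The key trick here is sorting the dictionary by key: writing the chapters in key order means the file comes out in chapter order instead of jumbled.

This program is far from perfect, for example the part at the end where the threads pick up their tasks......

But my energy is limited; after all, the gaokao is coming up :)

Finally, to everyone who read this far: may the power never go out while your code is unsaved, haha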
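The idea can be shown in isolation with a small sketch. Everything here is made up for illustration: fake_download sleeps for a random moment instead of making a request, so the threads finish in an unpredictable order, and the lock is an extra precaution that the code above does not use.

```python
import random
import threading
import time

results = {}                      # index -> text, filled in by the threads
results_lock = threading.Lock()   # guards the shared dict

def fake_download(index):
    """Stand-in for fetching a chapter; finishes after a random short delay."""
    time.sleep(random.uniform(0, 0.05))
    with results_lock:
        results[index] = f'text of chapter {index + 1}\n'

threads = [threading.Thread(target=fake_download, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The dict may have been filled in any order; iterating over sorted keys restores chapter order
book = ''.join(results[i] for i in sorted(results))
print(book)
```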
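One possible way to smooth out how the threads pick up tasks, sketched here under assumptions rather than taken from the program above, is to let every worker keep pulling from the queue until it is empty. Then the number of chapters no longer has to be a multiple of the thread count. The names run_workers and handle, and the made-up task list, are hypothetical; handle stands in for the function that downloads one chapter.

```python
import queue
import threading

def run_workers(tasks, handle, num_workers=10):
    """Process (index, item) tasks with worker threads; return results ordered by index."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                index, item = task_queue.get_nowait()  # raises queue.Empty instead of blocking
            except queue.Empty:
                return  # nothing left, so this worker stops
            value = handle(item)
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]

# 23 made-up tasks: not a multiple of 10, so a fixed batch of 10 would come up short at the end
tasks = [(i, f'url-{i}') for i in range(23)]
fetched = run_workers(tasks, handle=str.upper)
print(len(fetched), fetched[0], fetched[-1])
```

Because get_nowait() raises queue.Empty rather than waiting, a worker that finds no task simply exits, and a fast worker can take more tasks while a slow one is still busy.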