Writing a Web Crawler in Python (2): Extracting the Next Chapter of a Novel

Article link: https://blog.csdn.net/majackfeng/article/details/50738658

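Continuing from the previous post.
3. Getting the chapter title and the URL of the next chapter
Since we already have the page's HTML, we only need to parse it to get each chapter's title and the URL of the next chapter.
Regular expressions are used for this: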

def getTitle(DData): # get the chapter title
    zb=r'<div id="title">(.*?)</div>' # regular expression that captures the title text
    titleL=re.findall(zb,DData)
    print(titleL[0])
    return titleL[0]

def getNext(DData): # get the next chapter's URL; returns 0 on the last chapter
    zb=r'录</a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="(.*?)">下一'
    nextL=re.findall(zb,DData)
    #print(nextL[0])
    if nextL:
        return nextL[0]
    return 0

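When the current page is the last chapter, getNext returns 0, which makes it easy to break out of the loop and end the program.
4. The loop
The program can now get the URL of the next chapter, so wrapping the steps in a loop should be enough to make it run: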
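To see how the two patterns behave, here is a minimal sketch run against hand-written HTML fragments. The fragments are assumptions modeled on the patterns above, not the site's real markup, and the names get_title, get_next, sample_middle and sample_last are hypothetical. Unlike getTitle, this get_title returns None instead of raising IndexError when nothing matches.

```python
import re

NBSP_RUN = '&nbsp;' * 6  # the six &nbsp; entities that appear in the pattern above

TITLE_PATTERN = r'<div id="title">(.*?)</div>'
NEXT_PATTERN = '录</a> ' + NBSP_RUN + ' <a href="(.*?)">下一'


def get_title(page):
    """Return the chapter title, or None if the title div is missing."""
    found = re.findall(TITLE_PATTERN, page)
    return found[0] if found else None


def get_next(page):
    """Return the next chapter's relative URL, or 0 on the last chapter."""
    found = re.findall(NEXT_PATTERN, page)
    return found[0] if found else 0


# Hypothetical page fragments (assumed layout, for illustration only).
sample_middle = (
    '<div id="title">Chapter 1</div>'
    '<a href="index.html">目录</a> ' + NBSP_RUN +
    ' <a href="5789597.html">下一章</a>'
)
sample_last = '<div id="title">Final Chapter</div><a href="index.html">目录</a>'

print(get_title(sample_middle))
print(get_next(sample_middle))
print(get_next(sample_last))
```

Because the next-chapter pattern depends on the exact spacing and the run of &nbsp; entities, any change to the site's navigation bar would make it stop matching, and the function would then report the last chapter too early.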

    while True:
        html=getDate(weburl,webheaders)
        file=targetDir+'dp1.txt'
        #f=open(file,'w',encoding = LXBM)
        #f.write(html)
        #f.close()
        saveDataFileN(html,file)
        #a = openFileN(file)
        #print(a)
        title = getTitle(html) # title of this chapter
        Nurl=getNext(html) # URL of the next chapter
        if Nurl == 0:
            break
        weburl=weburlT+Nurl
        print(weburl)
        time.sleep(0.05) # sleep 50 ms to avoid getting the IP banned

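Note: the 50 ms sleep is there because some sources say that requesting pages too frequently can get your IP banned, so the sleep was added as a precaution.
In practice, the program fails from time to time with the following error, which stops it from running normally: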
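The loop above can also be sketched as a function that takes the page-fetching step as a parameter, which makes it possible to try the logic without touching the network. This is only a sketch under assumptions: crawl, fetch, max_pages and the example.com pages are hypothetical names introduced for illustration, and urljoin from the standard library replaces the string concatenation weburlT+Nurl.

```python
import re
import time
from urllib.parse import urljoin

# Same next-chapter pattern as getNext, written with six repeated &nbsp; entities.
NEXT_PATTERN = '录</a> ' + '&nbsp;' * 6 + ' <a href="(.*?)">下一'


def crawl(start_url, fetch, delay=0.05, max_pages=10000):
    """Follow next-chapter links from start_url and return the visited URLs.

    fetch is any callable that maps a URL to decoded HTML text
    (for example, a wrapper around getDate).
    """
    visited = []
    url = start_url
    while len(visited) < max_pages:  # guard against endless link cycles
        html = fetch(url)
        visited.append(url)
        found = re.findall(NEXT_PATTERN, html)
        if not found:                # last chapter: no next link
            break
        url = urljoin(url, found[0]) # resolve the relative link
        time.sleep(delay)            # pause between requests
    return visited


# Hypothetical in-memory "site" used instead of real HTTP requests.
nav = '<a href="index.html">目录</a> ' + '&nbsp;' * 6 + ' <a href="{}">下一章</a>'
fake_pages = {
    'http://example.com/b/1.html': nav.format('2.html'),
    'http://example.com/b/2.html': nav.format('3.html'),
    'http://example.com/b/3.html': 'the end',
}

visited = crawl('http://example.com/b/1.html', fake_pages.__getitem__, delay=0)
print(len(visited))
```

With the real site, something like lambda u: getDate(u, webheaders) could be passed as fetch, keeping the 50 ms delay.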

UnicodeDecodeError: 'gbk' codec can't decode byte 0xab in position 3301: illegal multibyte sequence

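A search showed that the HTML contains some illegal bytes, so decoding raises an exception. This kind of error is nasty: a single illegal byte is enough to make decoding of the whole string fail. The fix used here is to pass ignore so that illegal bytes are skipped: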

s.decode('gbk', 'ignore').encode('utf-8')
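As a small illustration of what the ignore handler does, the sketch below decodes a hand-made byte string in Python 3, where decode is a method of bytes. The sample bytes are an assumption: two GBK-encoded characters followed by a lone 0xab byte, which is not a complete GBK sequence. It also shows the standard 'replace' handler, which marks the bad byte instead of silently dropping it.

```python
# Two GBK-encoded characters followed by a lone lead byte (0xab)
# with no trail byte, which is not a valid GBK sequence.
raw = '小说'.encode('gbk') + b'\xab'

try:
    raw.decode('gbk')                     # strict error handling is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode('gbk', 'ignore')     # the bad byte is dropped
replaced = raw.decode('gbk', 'replace')   # the bad byte becomes U+FFFD

print(strict_failed)
print(ignored == '小说')
print('\ufffd' in replaced)
```

Note that ignore discards data without any warning, so a page with many bad bytes may lose characters; 'replace' at least leaves a visible marker where that happened.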

Here is the code completed so far:

#getStory
# Crawl one novel from a novel website.
import urllib.request
import re
import time

targetDir='Y://A/' # directory where files are saved

LXBM='gbk' # encoding of the pages; must be set for each site
weburlT='http://www.bxwx.org/b/8/8823/'
weburlN='5789596.html'
weburl=weburlT + weburlN
webheaders={
    'Connection':'keep-alive',
    'Accept':'text/html,application/xhtml+xml,*/*',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.154 Safari/537.36 LBBROWSER',
    } # request headers that imitate a browser

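def getDate(Burl,Bheaders):
    # fetch the text content of the page at the given URL
    req = urllib.request.Request(url=Burl,headers=Bheaders) # build the request with the headers
    data = urllib.request.urlopen(req).read() # read the raw bytes
    data1 = data.decode(LXBM,'ignore') # decode the bytes to str using the page encoding, skipping illegal bytes
    return data1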

def getTitle(DData): # get the chapter title
    zb=r'<div id="title">(.*?)</div>' # regular expression that captures the title text
    titleL=re.findall(zb,DData)
    print(titleL[0])
    return titleL[0]

def getNext(DData): # get the next chapter's URL; returns 0 if there is none
    zb=r'录</a> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="(.*?)">下一'
    nextL=re.findall(zb,DData)
    #print(nextL[0])
    if nextL:
        return nextL[0]
    return 0

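if __name__ == '__main__': # program entry point
    while True:
        html=getDate(weburl,webheaders)
        title = getTitle(html) # title of this chapter
        Nurl=getNext(html) # URL of the next chapter
        if Nurl == 0 or Nurl == '5839455.html': # no next link, or the hard-coded URL of the last chapter
            break
        weburl=weburlT+Nurl
        print(weburl)
        time.sleep(0.05) # sleep 50 ms to avoid getting the IP banned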