Novel scraping functions

Posierd

Category: 文本信息 (Text Information)

Article link: https://blog.csdn.net/qq_44779863/article/details/104696821


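'''
On the novel's index page, take the novel's title from the h1 tag in the HTML (do the same for each chapter's title).
Chapter URLs:
the href attribute values of the a tags inside the 4 li tags under the ul tag; remember to join them with the site root.
Once you have each chapter's URL, send another request and extract the data from that response.
The extracted content needs cleaning: .replace("text to remove", "") replaces the unwanted text with an empty string.
When saving, open the file in append mode, using a or a+.

'''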


import requests
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Fetch the HTML source of a page
def get_html(url):
    resp = requests.get(url, headers=headers)
    resp.encoding = resp.apparent_encoding  # let requests detect the encoding from the response body
    html = resp.text
    return html

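# Extract the book title and the relative chapter links (these still need to be joined with the site root)
def get_url(html):
    # Book title
    name = re.findall('<h1>(.*?)</h1>', html)[0]  # re.findall returns a list, so take the first match by index
    # All relative chapter links; the caller loops over them to build the full URLs
    urllist = re.findall('<li><a href="(.*?)">.*?<span></span></a></li>', html, re.S)
    return name, urllist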



# Request each chapter page and extract the chapter title and body
def get_content(urllist):
    # For each link: build the full URL, send a new request, then extract the title and the body (which needs cleaning)
    for url in urllist:
        new_url = 'http://www.39shubao.com' + url
        html = get_html(new_url)  # same request and encoding handling as the index page
        # Data extraction
        title = re.findall('<h1>(.*?)</h1>', html, re.S)[0]
        content = re.findall('<div id="book_text">.*?<br/><br/>(.*?)<div id="gt1">', html, re.S)[0]
        # Data cleaning
        content = content.replace(" ","")
        content = content.replace("<br/><br/>","")
        content = content.replace("</div>","")
        yield title,content

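# Save a chapter
def save(title, content, name):
    print("Saving chapter:", title)
    path = r'C:\Users\DELL\Desktop\python_wd\{}.txt'.format(name)
    with open(path, 'a', encoding='utf-8') as f:  # 'a' appends, so each chapter is added to the end of the same file
        f.write("\n\t\t\t\t")  # \n is a newline and \t is a tab; this indents the chapter title so the output reads better
        f.write(title)
        f.write("\n\n\t")
        f.write(content)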
# Main entry point
def main():
    url = 'http://www.39shubao.com/files/article/html/109/109012/'
    html = get_html(url)
    name, urllist = get_url(html)
    mess = get_content(urllist)
    for title, content in mess:
        save(title, content, name)




if __name__ == '__main__':
    main()
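
The notes at the top mention two steps that are easy to get wrong: joining the chapter links into full URLs, and cleaning the extracted text. Below is a minimal sketch of both as small standalone helpers that can be tried without touching the network. The helper names `build_chapter_url` and `clean_content` and the sample hrefs are hypothetical and not part of the script above. The sketch also swaps the plain string concatenation for `urljoin` from the standard library, on the assumption that a site might serve either root-relative or page-relative links.

```python
from urllib.parse import urljoin

# Index page URL taken from main() in the script above
INDEX_URL = 'http://www.39shubao.com/files/article/html/109/109012/'


def build_chapter_url(href, base=INDEX_URL):
    """Hypothetical helper: resolve a chapter href against the index page URL.

    urljoin handles root-relative hrefs ('/files/...'), page-relative hrefs
    ('1.html') and already absolute URLs, so no manual concatenation is needed.
    """
    return urljoin(base, href)


def clean_content(raw, junk=("<br/><br/>", "</div>")):
    """Hypothetical helper: remove leftover markup from a chapter body.

    Each token in `junk` is replaced with an empty string, the same
    .replace(..., "") idea used in get_content() above.
    """
    for token in junk:
        raw = raw.replace(token, "")
    return raw.strip()


if __name__ == '__main__':
    # Sample hrefs are made up for illustration only
    print(build_chapter_url('/files/article/html/109/109012/1.html'))
    print(build_chapter_url('2.html'))
    print(clean_content('first line<br/><br/>second line</div>'))
```

Resolving against the index page URL rather than the bare site root is a design choice: it keeps working if the chapter links turn out to be relative to the current directory, which plain concatenation with the site root would not.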