Scraping a Novel: Fixing Garbled Text (with Source Code)

Latest recommended article published on 2023-11-25 15:51:56

BRYTLEVSON


Categories: Web scraping, Python

Article link: https://blog.csdn.net/brytlevson/article/details/104021922


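Fixing garbled text when scraping a novel:
Today I scraped a novel for a friend. The site has no real anti-scraping measures, but the text I got back was always garbled.

The fix: re-encode the fetched text, using neither 'gbk' nor 'utf-8' but 'iso-8859-1'. This presumably works because the server sends no charset in its Content-Type header, in which case requests decodes .text as ISO-8859-1. Encoding with that same codec restores the original bytes, and lxml can then read the charset the page declares for itself.
response = requests.get(url, headers=headers).text.encode('iso-8859-1')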
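To see why the ISO-8859-1 round trip is lossless, here is a minimal offline sketch. The helper names `simulate_requests_fallback` and `recover_bytes` are made up for illustration, and the sample string and the assumption that the page is really UTF-8 (or GBK) are mine, not taken from the site.

```python
def simulate_requests_fallback(page_text: str, real_encoding: str = "utf-8") -> str:
    """Mimic the garbling: bytes in real_encoding get decoded as ISO-8859-1."""
    raw_bytes = page_text.encode(real_encoding)
    # ISO-8859-1 maps every byte 0-255 to one character, so this never fails,
    # it just produces nonsense for multi-byte encodings.
    return raw_bytes.decode("iso-8859-1")


def recover_bytes(garbled: str) -> bytes:
    """Undo the wrong decode and return the bytes the server actually sent."""
    return garbled.encode("iso-8859-1")


if __name__ == "__main__":
    original = "圣墟 第一章"  # hypothetical sample text
    garbled = simulate_requests_fallback(original)
    print(repr(garbled))                           # unreadable mojibake
    print(recover_bytes(garbled).decode("utf-8"))  # readable again
```

Because the recovered bytes are identical to what the server sent, the same trick works whether the page is actually UTF-8 or GBK. It only helps when the text was decoded as ISO-8859-1 in the first place; text decoded any other way may raise `UnicodeEncodeError` on the re-encode.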
With that fixed, the scraping can begin.
Source code:

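import os
import time

import requests
from lxml import etree

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'Host': 'www.xbiquge.la',
    'Referer': 'http://www.xbiquge.la/13/13959/',
}

output_dir = './小说 圣墟/'
# open() does not create missing folders, so create the output folder first.
os.makedirs(output_dir, exist_ok=True)

url = "http://www.xbiquge.la/13/13959/"
# Pass headers by keyword: the second positional argument of requests.get()
# is params (the query string), not headers.
response = requests.get(url, headers=headers).text
# print(response)

res = etree.HTML(response)
# Collect the chapter links from the table of contents.
hrefs_list = res.xpath('//dl//dd/a/@href')
# print(hrefs_list)

for href in hrefs_list:
    url = "http://www.xbiquge.la" + href
    print(url)

    # Re-encode as ISO-8859-1 to get the original bytes back, then let lxml
    # parse those bytes instead of the mis-decoded string.
    response = requests.get(url, headers=headers).text.encode('iso-8859-1')
    res = etree.HTML(response)
    title = res.xpath('//div[@class="bookname"]/h1/text()')[0]
    # Join all text nodes of the chapter body into one string.
    content = ''.join(res.xpath('//div[@id="content"]//text()'))
    with open(output_dir + title + '.txt', 'w', encoding='utf-8') as f:
        print('Saving ' + title)
        f.write(content)
    time.sleep(0.3)  # short pause between requests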
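An alternative to the re-encoding trick is to start from the raw bytes (with requests, that would be `response.content`) and decode them using the charset the page declares. The sketch below is a simplified stand-in for that idea, not how lxml does it internally. `sniff_charset` and `decode_html` are hypothetical helper names, and scanning only the first 2048 bytes is an arbitrary assumption.

```python
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def sniff_charset(raw_html: bytes, default: str = "utf-8") -> str:
    """Return the charset declared near the top of the page, or the default."""
    match = _CHARSET_RE.search(raw_html[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(raw_html: bytes, default: str = "utf-8") -> str:
    """Decode raw page bytes with the declared charset, falling back to default."""
    charset = sniff_charset(raw_html, default)
    try:
        return raw_html.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: decode leniently rather than crash.
        return raw_html.decode(default, errors="replace")
```

Working from bytes avoids depending on which encoding requests happened to guess, at the cost of trusting the page's own declaration.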

The code does nothing to get around anti-scraping measures. It is meant only as an example for beginners.
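One more beginner pitfall in the script: the chapter title goes straight into the file name, so a title containing characters such as `?`, `:` or `/` can make `open()` fail on some systems. Below is a minimal sketch of a sanitizer. `safe_filename` is a hypothetical helper, and the set of replaced characters and the 100-character limit are assumptions, not rules from the original post.

```python
import re

# Characters that are invalid in Windows file names, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title: str, replacement: str = "_", max_length: int = 100) -> str:
    """Replace unsafe characters and trim the result to a reasonable length."""
    cleaned = _INVALID_CHARS.sub(replacement, title).strip(" .")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned[:max_length] or "untitled"
```

In the script, this would wrap the title when building the path, for example `safe_filename(title) + '.txt'`.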
