Python Crawler Study 3 - Simple Scraping of a Novel Site / Garbled Chinese Text in Python CSV Output

Latest recommended article published on 2023-10-29 23:09:29

dieyan7275

Tags: crawler, python

Original link: http://www.cnblogs.com/IrisLee/p/9157084.html

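Novel site: https://www.qu.la/paihangbang/

Goal: scrape the novel titles and their links from every ranking list, then write them into a spreadsheet file (a CSV file opened in Excel).

Press F12 and inspect the page elements to find the class of the information you want; that class is then used to locate it.

See the code walkthrough below for the details.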
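Before the full script, the class-based lookup can be tried on a small hand-written HTML fragment instead of the live page. This is only a sketch: the fragment, the sample titles, and the helper name `extract_books` are made up for illustration. Only the class names are taken from this post, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the class names used in this post;
# the markup of the real page may differ.
SAMPLE_HTML = """
<div class="index_toplist mright mbottom">
  <div class="toptab"><span>Fantasy ranking</span></div>
  <div class="topbooks">
    <ul>
      <li><a href="/book/1/" title="Sample Novel One">Sample Novel One</a></li>
      <li><a href="/book/2/" title="Sample Novel Two">Sample Novel Two</a></li>
    </ul>
  </div>
</div>
"""

def extract_books(html):
    """Return (category, title, href) tuples found in the fragment."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # A multi-word class_ string is matched against the whole class attribute.
    for cate in soup.find_all('div', class_='index_toplist mright mbottom'):
        name = cate.find('div', class_='toptab').span.text
        for li in cate.find('div', class_='topbooks').find_all('li'):
            results.append((name, li.a['title'], li.a['href']))
    return results

books = extract_books(SAMPLE_HTML)
for category, title, href in books:
    print(category, title, href)
```

Keeping the parsing in a function that takes an HTML string makes it possible to test the selectors without any network access.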

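# coding: utf-8  (source encoding declaration; required on Python 2, optional on Python 3)
import codecs  # used below to open the CSV file with an explicit encoding

__author__ = 'Administrator'
import requests
from bs4 import BeautifulSoup
"""
get_html fetches the HTML page for the given url and returns it.
Everything could go into a single function, but that function would become bloated.
Writing shared helpers like this one separately makes them easier to reuse later.
"""
def get_html(url):
    try:
        r = requests.get(url, timeout=30)  # timeout is given in seconds
        r.raise_for_status()  # must be called, otherwise HTTP errors are ignored
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return None  # the caller has to handle a failed request
"""
get_content extracts the information you need and writes it to the spreadsheet file.

"""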
def get_content(url):
    url_list = []
    html = get_html(url)
    if html is None:  # the request failed, so there is nothing to parse
        return url_list
    soup = BeautifulSoup(html, "html.parser")
    category_list = soup.find_all('div', class_='index_toplist mright mbottom')
    history_list = soup.find_all('div', class_='index_toplist mbottom')
    # Keep name and title as text: codecs.open encodes to UTF-8 on write.
    # Calling .encode('utf-8') on them first would, on Python 3, write a
    # b'...' literal into the file instead of the characters.
    with codecs.open('novel_list.csv', 'a', 'utf-8') as f:
        for cate in category_list:
            name = cate.find('div', class_='toptab').span.text
            f.write('\n小说种类:{}\n'.format(name))  # category heading

            book_list = cate.find('div', class_='topbooks').find_all('li')
            for book in book_list:
                link = 'http://www.qu.la/' + book.a['href']
                title = book.a['title']
                url_list.append(link)
                f.write('小说名：{} \t 小说地址:{}\n'.format(title, link))  # title and URL

        for cate in history_list:
            name = cate.find('div', class_='toptab').span.string
            f.write("\n小说种类：{} \n".format(name))

            general_list = cate.find(style='display: block;')  # the overall ranking list
            book_list = general_list.find_all('li')
            for book in book_list:
                link = 'http://www.qu.la/' + book.a['href']
                title = book.a['title']
                url_list.append(link)
                f.write("小说名：{:<} \t 小说地址：{:<} \n".format(title, link))
    return url_list



def main():
    # URL of the ranking page
    base_url = 'http://www.qu.la/paihangbang/'
    # Collect the links of all novels on the ranking page
    url_list = get_content(base_url)


if __name__=='__main__':
    main()

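This post is mainly a record of an encoding problem.

After I finished the code and ran it, the result was a spreadsheet full of garbled text.

So I started debugging.

I set a breakpoint at every step and checked which encoding variables such as name and title were actually using.

At that point I had added .encode('utf-8') to almost every variable, and each one came out as a string, but the output was still garbled.

Then I tried writing the information to a .txt file, and that worked.

So the problem lay in writing the file for Excel, which could not decode it correctly. I changed the file opening to codecs.open('novel_list.csv','a','utf-8').

That finally solved the problem.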
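Garbled text of this kind generally means that bytes were written with one codec and read back with another. The sketch below reproduces such a mismatch in memory using UTF-8 and GBK. It is an illustration only, not a diagnosis of the exact failure in this post: which codec Excel assumes depends on the system, and the sample string and variable names are chosen just for the demonstration.

```python
# The same text encoded with two different codecs.
text = '小说名'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# Decoding with the codec that was used for writing restores the text.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

# Decoding UTF-8 bytes as GBK yields different characters (mojibake);
# errors='replace' keeps the demonstration from raising on invalid pairs.
garbled = utf8_bytes.decode('gbk', errors='replace')

# The byte sequences differ, so the reader has to know the codec.
print(len(utf8_bytes), len(gbk_bytes))
print(garbled == text)
```

A text editor usually guesses UTF-8 correctly, which may be why the same content looked fine in a .txt file.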

This approach is offered for reference only. Encoding problems never really go away, and they are a real headache.
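One commonly used alternative when the file is meant to be opened in Excel is to keep UTF-8 but write a byte order mark at the start of the file, which Excel generally uses to recognize UTF-8, and to let the standard csv module handle delimiters and quoting. The sketch below is a minimal illustration under that assumption, not the method used in this post. The function name `write_rows`, the column names, and the sample rows (including their URLs) are hypothetical.

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write (category, title, link) rows as a UTF-8 CSV with a BOM."""
    # 'utf-8-sig' writes the BOM; newline='' lets csv control line endings.
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['category', 'title', 'link'])  # header row
        writer.writerows(rows)

# Hypothetical sample data standing in for the scraped results.
sample_rows = [
    ('玄幻', '示例小说一', 'http://www.qu.la/book/1/'),
    ('玄幻', '示例小说二', 'http://www.qu.la/book/2/'),
]

csv_path = os.path.join(tempfile.mkdtemp(), 'novel_list.csv')
write_rows(csv_path, sample_rows)

# Check that the file starts with the three BOM bytes EF BB BF.
with open(csv_path, 'rb') as f:
    head = f.read(3)
print(head == b'\xef\xbb\xbf')
```

The file is opened once in write mode here, so the BOM is written exactly once at the start of the file.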

Solution 2:

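# coding: utf-8

__author__ = 'Administrator'
import requests
from bs4 import BeautifulSoup


def get_html(url):
    try:
        r = requests.get(url, timeout=30)  # timeout is given in seconds
        r.raise_for_status()  # must be called, otherwise HTTP errors are ignored
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return None  # the caller has to handle a failed request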

def get_content(url):
    url_list = []
    html = get_html(url)
    if html is None:  # the request failed, so there is nothing to parse
        return url_list
    soup = BeautifulSoup(html, "html.parser")
    category_list = soup.find_all('div', class_='index_toplist mright mbottom')
    history_list = soup.find_all('div', class_='index_toplist mbottom')
    # Open once in binary append mode and write GB2312-encoded bytes explicitly;
    # the with block also closes the file, which the repeated open() calls never did.
    with open('novel_list.csv', 'ab') as f:
        for cate in category_list:
            name = cate.find('div', class_='toptab').span.text
            f.write('\n小说种类:{}\n'.format(name).encode('GB2312'))  # category heading

            book_list = cate.find('div', class_='topbooks').find_all('li')
            for book in book_list:
                link = 'http://www.qu.la/' + book.a['href']
                title = book.a['title']
                url_list.append(link)
                f.write('小说名：{} \t 小说地址:{}\n'.format(title, link).encode('GB2312'))  # title and URL

        for cate in history_list:
            name = cate.find('div', class_='toptab').span.string
            f.write("\n小说种类：{} \n".format(name).encode('GB2312'))

            general_list = cate.find(style='display: block;')  # the overall ranking list
            book_list = general_list.find_all('li')
            for book in book_list:
                link = 'http://www.qu.la/' + book.a['href']
                title = book.a['title']
                url_list.append(link)
                f.write("小说名：{:<} \t 小说地址：{:<} \n".format(title, link).encode('GB2312'))
    return url_list



def main():
    # URL of the ranking page
    base_url = 'http://www.qu.la/paihangbang/'
    # Collect the links of all novels on the ranking page
    url_list = get_content(base_url)


if __name__=='__main__':
    main()

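The changed parts (highlighted in red in the original post) are the lines that open the file and write to it.

The CSV format itself does not fix an encoding; it is Excel, here on a Chinese-locale system, that expects 'GB2312', so everything written to the file is encoded as GB2312.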
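One caveat with encoding everything as GB2312 is that the codec covers only a limited, mostly simplified, character set, so a title containing anything else makes .encode('GB2312') raise UnicodeEncodeError. The sketch below shows one way to guard against that. It assumes it is acceptable to use the GBK superset and to substitute '?' for characters that still cannot be encoded. The helper name `encode_for_excel` is hypothetical.

```python
def encode_for_excel(text, encoding='gbk'):
    """Encode text for a legacy code page, replacing unencodable characters."""
    # errors='replace' substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors='replace')

simplified = '龙'    # simplified character, present in GB2312
traditional = '龍'   # traditional character, absent from GB2312 but present in GBK

# Strict GB2312 encoding fails on the traditional character.
try:
    traditional.encode('gb2312')
    gb2312_ok = True
except UnicodeEncodeError:
    gb2312_ok = False

# GBK handles both characters and round-trips them unchanged.
assert encode_for_excel(simplified).decode('gbk') == simplified
assert encode_for_excel(traditional).decode('gbk') == traditional

# A character outside GBK (here an emoji) is replaced rather than raising.
fallback = encode_for_excel('A\U0001F600')
print(gb2312_ok, fallback)
```

Whether a silent '?' is preferable to a hard failure depends on the use case. For a ranking list, losing one rare character is usually less disruptive than aborting the whole run.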

Questions are welcome~

Reposted from: https://www.cnblogs.com/IrisLee/p/9157084.html
