Scraping novel titles, authors, and tags from the Jinjiang Literature City (晋江文学城) overall points ranking with Python

Latest recommended article published on 2025-03-21 13:39:35

小陈会长很多头发

Article tags: python, programming languages, web scraping

Article link: https://blog.csdn.net/weixin_74021639/article/details/138772911

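My database assignment required scraping data from Jinjiang, so I figured I might as well write the process up and share it.

1. Import the required libraries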

import requests
import pandas as pd
from lxml import etree
import openpyxl

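requests fetches the page: requests.get() returns a Response object whose content holds the page's HTML. etree parses that HTML so it can be queried, and pandas is used later to format and save the data. openpyxl is presumably imported because pandas relies on it when writing .xlsx files.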
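
To make the division of labor between the libraries concrete, here is a minimal sketch that skips the network step and parses a small inline HTML string instead. The markup, the column names, and the XPath expressions are all made up for illustration; they are not the real structure of the Jinjiang page.

```python
import pandas as pd
from lxml import etree

# Hypothetical HTML standing in for a downloaded page (NOT the real site's markup).
html = """
<table>
  <tr><td>Title A</td><td>Author A</td><td>tag1 tag2</td></tr>
  <tr><td>Title B</td><td>Author B</td><td>tag3</td></tr>
</table>
"""

# etree.HTML parses the string into an element tree that supports XPath queries.
tree = etree.HTML(html)

rows = []
for tr in tree.xpath("//tr"):
    # string(.) collects all the text inside a cell, including nested tags.
    cells = [td.xpath("string(.)").strip() for td in tr.xpath("./td")]
    rows.append(cells)

# pandas turns the list of rows into a table that can be saved later.
df = pd.DataFrame(rows, columns=["title", "author", "tags"])
print(df)
```

In the real scraper the string passed to etree.HTML would come from the downloaded page rather than from a literal.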

2. Fetch the page with requests.get()

Here I scrape the overall points ranking; the page is https://www.jjwxc.net/topten.php?orderstr=7&t=2

url = 'https://www.jjwxc.net/topten.php?orderstr=7&t=2'
r = requests.get(url)
print(r.status_code)
# A status code of 200 means the server returned the page successfully
# The page is encoded as gb18030 rather than UTF-8, so decode the raw bytes explicitly
rt = r.content.decode('gb18030')
et = etree.HTML(rt)  # parse the decoded HTML with etree
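
The decoding step is easy to get wrong, so here is a small standalone sketch of why the codec matters. It needs no network access: the bytes are produced locally as a stand-in for r.content, and decode_html is a hypothetical helper name, not part of requests or lxml.

```python
def decode_html(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw page bytes, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


# Stand-in for r.content: a tiny page encoded as gb18030 (assumed site encoding).
sample_html = "<html><body><a>晋江文学城</a></body></html>"
raw = sample_html.encode("gb18030")

# Decoding with the matching codec recovers the original text.
decoded = decode_html(raw)
print(decoded == sample_html)

# Decoding the same bytes as UTF-8 raises, because they are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

With requests, an alternative to decoding r.content by hand is to set r.encoding = 'gb18030' and then read r.text, which should give the same string.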