Looking back, a big part of why I taught myself Python was wanting to scrape data on my own, and today I finally learned how to download a novel. So I went and grabbed 《球状闪电》 (Ball Lightning).
Two libraries are needed: requests and BeautifulSoup, both installable with pip (the PyPI packages are requests and beautifulsoup4; the script below also uses the lxml parser, so install lxml as well).
The main steps are:
- Use requests.get(url) to fetch the page. If Chinese text comes back garbled, set the response encoding (req.encoding = '*'), where the asterisk stands for the page's encoding, which you can usually find in the charset inside the html head (see the sketch after this list).
- Find the path to the content you want by inspecting the element (right-click the body text and choose Inspect).
- Use find_all() to grab the useful parts and filter them.
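A minimal sketch of those steps; the URL and the div id here are placeholders, so swap in whatever Inspect shows for the real page. If you don't want to read charset by hand, requests can also guess the encoding via req.apparent_encoding:

import requests
from bs4 import BeautifulSoup

req = requests.get('http://example.com/chapter1.html')   # placeholder URL
req.encoding = 'GB2312'                # or req.apparent_encoding to let requests guess
soup = BeautifulSoup(req.text, features="lxml")
paragraphs = soup.find_all('div', id='content')   # path taken from Inspect; adjust per site
if paragraphs:
    print(paragraphs[0].text)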
But the end result has a few too many line breaks and the formatting is a bit messy. text.replace() didn't fix it either; maybe the newline characters aren't quite what I expected (a possible cleanup is sketched after the full script below).
Comrades, further effort is still required.
import requests, sys
from bs4 import BeautifulSoup

def get_contents(target):  # fetch the text of one chapter
    req = requests.get(url=target)
    req.encoding = 'GB2312'
    html = req.text
    bf = BeautifulSoup(html, features="lxml")
    texts = bf.find_all('div', id='content')
    texts = texts[0].text.replace('\n\n', '\n')  # still can't get rid of the extra line breaks?
    return texts

def writer(name, path, text):  # append one chapter to the file at path
    write_flag = True
    with open(path, 'a', encoding='utf-8') as f:
        f.write(name + '\n')
        f.writelines(text)
        f.write('\n\n')

if __name__ == "__main__":
    # fetch the table of contents
    names, urls = [], []
    req = requests.get(url='http://book.sbkk8.com/xiandai/liucixinzuopinji/qiuzhuangshandian')
    req.encoding = 'GB2312'
    html = req.text
    bf = BeautifulSoup(html, features="lxml")
    content = bf.find_all('div', class_='mulu')
    atmp = BeautifulSoup(str(content[0]), features="lxml")
    a = atmp.find_all('a')  # returns a list of <a> tags
    num = len(a)
    for u in a:  # collect each chapter's name and link
        names.append(u.string)
        urls.append('http://book.sbkk8.com/' + u.get('href'))
    print("Downloading...")
    for i in range(num):
        writer(names[i], 'Ball-lightning.txt', get_contents(urls[i]))
        print("%.2f%% has been downloaded" % float(100.0 * i / num), end='\r')
    print("100.00% has been downloaded\nFinish")