A few friends recently ran into some small problems when scraping novels, such as garbled text or failed requests, so this morning I wrote a simple crawler for the Dingdian novel site. This is only a quick demonstration, so it fetches just the first few chapters; if you need more, simply change the parameters inside range().
The code is as follows:
import requests
from lxml import etree

# Chapter pages to fetch; widen the range() arguments for more chapters.
urls = ["https://www.23us.us/html/14/14593/{}.html".format(i) for i in range(5374091, 5374095)]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36"
}

def get_novel(url):
    response = requests.get(url, headers=headers)
    # Use the detected encoding to avoid garbled (mojibake) text.
    response.encoding = response.apparent_encoding
    text = etree.HTML(response.text)
    title = text.xpath("//div[@class='content']/h1/text()")[0]
    contents = text.xpath("//div[@id='content']/text()")
    # One file per chapter, named after the chapter title.
    with open(title + ".txt", "w", encoding="utf-8") as f:
        for content in contents:
            f.write(content)

if __name__ == '__main__':
    for url in urls:
        get_novel(url)
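As for the garbled-text problem some readers hit: it usually happens because requests guesses the page encoding from the HTTP headers, which is often wrong for Chinese novel sites that actually serve GBK pages. A minimal offline sketch of the mechanism (the sample string here is illustrative, not taken from the real site):

```python
# Simulate a page body served in GBK, as many Chinese sites do.
raw = "第一章 开始".encode("gbk")

# Decoding with the wrong charset produces mojibake:
wrong = raw.decode("latin-1")

# Decoding with the page's real encoding recovers the text:
right = raw.decode("gbk")
print(right)  # 第一章 开始
```

In the crawler itself, setting response.encoding = response.apparent_encoding (as in the code above) tells requests to detect the real charset from the response body before response.text is decoded.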