Python Crawler: Scraping a Novel from Biquge (笔趣阁)
Libraries used by the crawler
import requests
from lxml import etree
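Novel table-of-contents page
Taking the novel "我有百万技能点" as an example, search for it on Biquge, open its table-of-contents page, and copy that page's URL: https://www.1biquge.com/32/32014/
Next, collect the URL of every chapter listed on the table-of-contents page. Inspecting the page shows that an XPath expression can locate each chapter link; the extracted href values are then prefixed with the site's domain to rebuild full URLs.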
# URL of every chapter in the table of contents
href = html_ele.xpath('//dd/a/@href')
for i in href:
    url = 'https://www.1biquge.com' + i
    print(url)
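Plain string concatenation assumes every href is a site-relative path beginning with a slash. As a hedged alternative, the standard library's urllib.parse.urljoin resolves a link against the page URL whether the link is site-relative, page-relative, or already absolute. The href values below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# URL of the table-of-contents page (from the article)
base_url = 'https://www.1biquge.com/32/32014/'

# Hypothetical href values, standing in for the result of
# html_ele.xpath('//dd/a/@href'); the real chapter paths will differ.
sample_hrefs = ['/32/32014/1001.html', '1002.html']

# Resolve each href against the page it was found on.
chapter_urls = [urljoin(base_url, h) for h in sample_hrefs]
for chapter_url in chapter_urls:
    print(chapter_url)
```

Both sample links resolve to full URLs under the same novel directory, so the loop does not need to know which form the site uses.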
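Scraping the chapter text
Using the chapter URLs collected above, request each chapter page in turn and extract two things: the chapter title and the chapter body.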
# Chapter title
href1 = html_ele.xpath('//div[@class="box_con"]/div[@class="bookname"]/h1/text()')
# Chapter body
href2 = html_ele.xpath('//div[@class="box_con"]/div[@id="content"]/text()')
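To see what these two XPath expressions return, here is a minimal sketch that runs them against an inline HTML fragment instead of a live page. The markup is hypothetical, written only to match the selectors above, and the names titles, fragments, and paragraphs are introduced for illustration. It also assumes the body text is indented with non-breaking spaces, which the sketch strips.

```python
from lxml import etree

# Hypothetical chapter markup, shaped to match the XPath expressions in the
# article; a real chapter page contains much more than this.
sample_html = '''
<html><body>
<div class="box_con">
  <div class="bookname"><h1>Chapter 1 Sample Title</h1></div>
  <div id="content">&nbsp;&nbsp;First paragraph.<br/><br/>&nbsp;&nbsp;Second paragraph.</div>
</div>
</body></html>
'''

html_ele = etree.HTML(sample_html)
titles = html_ele.xpath('//div[@class="box_con"]/div[@class="bookname"]/h1/text()')
fragments = html_ele.xpath('//div[@class="box_con"]/div[@id="content"]/text()')

# xpath() returns a list, so guard against an empty result before indexing.
title = titles[0].strip() if titles else ''
# str.strip() also removes the non-breaking spaces (\xa0) produced by &nbsp;.
paragraphs = [t.strip() for t in fragments if t.strip()]

print(title)
print(paragraphs)
```

Checking for an empty list before taking titles[0] avoids an IndexError when a page does not match the expected layout.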
Saving the scraped content to a TXT file
The output file is named "我有百万技能点.txt", and every scraped chapter is appended to it.
file_name = '我有百万技能点.txt'
with open(file_name, "ab") as f:
    # Write the chapter title
    f.write(href1[0].encode('utf-8'))
    # Write a newline
    f.write('\n'.encode('utf-8'))
    # The scraped body is a list with one entry per paragraph, so loop over href2 and write each entry to the TXT file
    for i in href2:
        print(i)
        # Write the body text
        f.write(i.encode('utf-8'))
        # Write a newline
        f.write('\n'.encode('utf-8'))
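The same append step can also be written in text mode, letting open() handle the encoding instead of calling .encode() on every string. The following is a sketch of that variant, not the article's method: the helper name append_chapter is hypothetical, and the demonstration writes made-up data to a temporary directory.

```python
import os
import tempfile


def append_chapter(path, title, paragraphs):
    """Append one chapter (title line, then one line per paragraph) to path."""
    # Text mode with an explicit encoding replaces the manual .encode() calls.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        for paragraph in paragraphs:
            f.write(paragraph + '\n')


# Demonstration with made-up data, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
append_chapter(demo_path, 'Chapter 1', ['First paragraph.', 'Second paragraph.'])
append_chapter(demo_path, 'Chapter 2', ['Third paragraph.'])

with open(demo_path, encoding='utf-8') as f:
    demo_text = f.read()
print(demo_text)
```

One difference from binary mode: in text mode, Python translates '\n' to the platform's line separator when writing, so on Windows the file will contain '\r\n' line endings.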
That completes the crawler for "我有百万技能点".
The complete code is as follows:
import requests
from lxml import etree

url = 'https://www.1biquge.com/32/32014/'
# Set the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
}
response = requests.get(url, headers=headers)
response.encoding = 'gbk'
html_ele = etree.HTML(response.text)
# URL of every chapter in the table of contents
href = html_ele.xpath('//dd/a/@href')
for i in href:
    url = 'https://www.1biquge.com' + i
    print(url)
    # Set the request headers, with the table-of-contents page as the Referer
    headers = {
        "Referer": "https://www.1biquge.com/32/32014/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    html_ele = etree.HTML(response.text)
    # Chapter title
    href1 = html_ele.xpath('//div[@class="box_con"]/div[@class="bookname"]/h1/text()')
    # Chapter body
    href2 = html_ele.xpath('//div[@class="box_con"]/div[@id="content"]/text()')
    file_name = '我有百万技能点.txt'
    # Append this chapter to the output file
    with open(file_name, "ab") as f:
        f.write(href1[0].encode('utf-8'))
        f.write('\n'.encode('utf-8'))
        for i in href2:
            print(i)
            f.write(i.encode('utf-8'))
            f.write('\n'.encode('utf-8'))
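The complete code sets response.encoding = 'gbk' before reading response.text, which tells requests which codec to use when decoding the raw response bytes. The sketch below illustrates the effect with plain bytes and no network access. The sample string is made up, and the sketch assumes, as the code above does, that the site serves GBK-encoded pages.

```python
# Made-up sample text standing in for a page body.
original = '第一章 示例'

# Bytes as a GBK-encoded server would send them.
raw = original.encode('gbk')

# Decoding with the matching codec recovers the text.
decoded_ok = raw.decode('gbk')

# Decoding with the wrong codec yields replacement characters instead.
decoded_wrong = raw.decode('utf-8', errors='replace')

print(decoded_ok == original)
print(decoded_wrong == original)
```

If the scraped text comes out garbled, a mismatch between the page's actual encoding and the value assigned to response.encoding is a likely place to look.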