Scraping Dingdian novels with Python (simple version)
To scrape web resources you first need the requests library.
Since we also extract and parse data, we need the etree module from lxml and the built-in re module (re ships with Python, so only requests and lxml need installing).
A library is installed with: pip install <library name>
For example: pip install requests
To run the install:
Press Win+R, type cmd to open a command prompt, then run pip install requests (and likewise pip install lxml).
Without further ado, here is the code:
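Once both packages are installed, a quick sanity check confirms the imports work (the printed versions will of course differ on your machine):

```python
# Sanity check: confirm requests and lxml are importable
import requests
from lxml import etree

print(requests.__version__)   # your installed requests version
print(etree.LXML_VERSION)     # your installed lxml version tuple
```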
import requests
import time
import re
from lxml import etree
# https://www.xxbooktxt.com/0_688/
if __name__ == '__main__':
    url_ = 'https://www.xxbooktxt.com/0_688/'
    # Request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
        'Cookie': 'Hm_lvt_c325ff0643138838669c8eb16bd10b29=1662429679,1662430518,1662443406,1662443891; Hm_lpvt_c325ff0643138838669c8eb16bd10b29=1662444163',
        'Referer': 'https://www.baidu.com/link?url=IiDx_njks98nNwhgyKX9mX6b6t1I-7ByCCPzdmmCDLRd4VrgFZQmB6GfRrFDhJNF&wd=&eqid=baeaca5f00029999000000066316e16d',
    }
    # Fetch the novel's index page and collect the chapter URLs
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.content
    html_obj = etree.HTML(data_)
    url_list = html_obj.xpath('//div[@id="list"]//@href')
    # Visit each chapter's detail page
    for i in range(len(url_list)):
        url_1 = 'https://www.xxbooktxt.com' + url_list[i]
        # Request the chapter page
        response_1 = requests.get(url_1, headers=headers_)
        data_1 = response_1.content
        html_obj_1 = etree.HTML(data_1)
        # Extract the chapter name from the <title> tag
        res_ = html_obj_1.xpath('//title/text()')[0]
        name_ = re.findall(r'完美世界_辰东_(.*?)- 顶点小说', res_)[0]
        # Extract the body text
        zw = html_obj_1.xpath('//div[@id="content"]//text()')
        # Write the whole chapter to its own file
        with open(f'D:\\项目\\完美世界\\{name_}.txt', 'a', encoding='utf-8') as f:
            for line in zw:
                f.write(line)
        print(f'{name_} downloaded')
        # Be polite to the server: wait two seconds between requests
        time.sleep(2)
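The XPath and regex steps used above can be tried offline on a small HTML snippet. The fragment below is made up for illustration; the real site's markup may differ:

```python
import re
from lxml import etree

# A made-up page fragment mimicking the index list and <title> structure
html = '''
<html>
  <head><title>完美世界_辰东_第一章 朝气蓬勃- 顶点小说</title></head>
  <body>
    <div id="list">
      <a href="/0_688/1.html">Chapter 1</a>
      <a href="/0_688/2.html">Chapter 2</a>
    </div>
  </body>
</html>
'''
tree = etree.HTML(html)

# //div[@id="list"]//@href collects every href attribute under the list div
hrefs = tree.xpath('//div[@id="list"]//@href')
print(hrefs)  # ['/0_688/1.html', '/0_688/2.html']

# The non-greedy capture group pulls just the chapter name out of the title
title = tree.xpath('//title/text()')[0]
name = re.findall(r'完美世界_辰东_(.*?)- 顶点小说', title)[0]
print(name)  # 第一章 朝气蓬勃
```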
As responsible netizens, getting the data is enough: use time.sleep(2) to wait two seconds between requests. Aggressive scraping can bring a small site down, so please scrape responsibly. Thank you.
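A minimal sketch of that throttling pattern (the URL list here is hypothetical; in the full script the sleep sits at the end of each loop iteration, after the chapter is saved):

```python
import time

# Hypothetical chapter URLs; in the real script these come from the index page
urls = ['/0_688/1.html', '/0_688/2.html', '/0_688/3.html']
fetched = []

for url in urls:
    # ... requests.get and file writing would go here ...
    fetched.append(url)
    time.sleep(2)  # two-second pause so we don't hammer a small site
```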