Web Scraping
1. Import the modules

```python
import requests
from lxml import etree
```
2. Find the URL to scrape

```python
# Declare the URL variable
url = "https://www.12zw.com/2/2671/"  # biquge URL for the novel "成神"
# Add a browser User-Agent header to mimic a real browser
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3620.400"}
# Note: requests(url, header) is not callable; use requests.get and pass headers by keyword
result = requests.get(url, headers=header)
resultHtml = etree.HTML(result.text)  # auto-completes missing HTML tags
context = resultHtml.xpath("//div[@id='context']")
```
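The comment above notes that `etree.HTML` auto-completes missing tags. A minimal self-contained sketch of that behavior (the fragment below is made up for illustration):

```python
from lxml import etree

# A deliberately broken fragment: unclosed <p> tags, no <html>/<body> wrapper.
fragment = "<div id='context'><p>first line<p>second line</div>"
tree = etree.HTML(fragment)  # the parser wraps it in <html><body>...</body></html>
texts = tree.xpath("//div[@id='context']/p/text()")
print(texts)  # ['first line', 'second line']
```

Because the parser is this forgiving, real-world pages with sloppy markup can still be queried with xpath.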
Web Page Analysis
1. Using xpath
| Expression | Meaning |
| --- | --- |
| `/` | select the direct children of the current node |
| `//` | select all descendant nodes of the current node |
| `.` | select the current node |
| `..` | select the parent of the current node |
| `@` | select an attribute |
- Getting text: `/text()`, e.g. `result = html.xpath('//li[@class="item-0"]/text()')`
- Getting an attribute: `/@id`, e.g. `result = html.xpath('//li[@class="item-0"]/@id')`
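The selectors above can be tried end to end on a small snippet. The HTML below is a hypothetical document written to mirror the `li[@class="item-0"]` examples:

```python
from lxml import etree

# Hypothetical document mirroring the examples above.
html_doc = """
<ul>
  <li class="item-0" id="first"><a href="link1.html">one</a></li>
  <li class="item-1"><a href="link2.html">two</a></li>
</ul>
"""
tree = etree.HTML(html_doc)
print(tree.xpath('//li[@class="item-0"]/a/text()'))  # ['one']
print(tree.xpath('//li[@class="item-0"]/@id'))       # ['first']
print(tree.xpath('//a/@href'))                       # ['link1.html', 'link2.html']
```

Note that xpath always returns a list, even for a single match, so results are usually indexed or iterated.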
Example: scraping the web novel "仙王的日常生活" (The Daily Life of the Immortal King)
```python
import requests
from lxml import etree

url = "https://www.biqukan.com/38_38269/"
header = {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"
}
# The second positional argument of requests.get is params, so headers must be passed by keyword
result = requests.get(url, headers=header)
result.encoding = "gb2312"
html = result.text
htmlContext = etree.HTML(html)
nameList = htmlContext.xpath("//div[@class='listmain']//dd/a/text()")  # chapter titles
hrefList = htmlContext.xpath("//div[@class='listmain']//dd/a/@href")   # chapter links
print(hrefList)
url2 = "https://www.biqukan.com"
for i in range(len(hrefList)):
    context = requests.get(url2 + hrefList[i], headers=header)
    context.encoding = "gb2312"
    context = context.text
    contextHtml = etree.HTML(context)
    con = contextHtml.xpath("//div[@id='content']//text()")  # chapter body
    contitle = nameList[i]  # chapter title
    con.insert(0, contitle + "\n")
    file = open("仙王的日常生活.txt", "a", encoding="utf-8")
    file.writelines(con)
    file.close()
    print(contitle)
```
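One fragile spot in the loop above is `url2 + hrefList[i]`: plain string concatenation only works while every `href` is a site-absolute path. The standard library's `urllib.parse.urljoin` resolves both absolute and relative links against a base URL; a small sketch (the chapter path below is a made-up example, not a real link from the site):

```python
from urllib.parse import urljoin

base = "https://www.biqukan.com/38_38269/"
# A site-absolute path, like the hrefs in the chapter list:
print(urljoin(base, "/38_38269/286945.html"))  # https://www.biqukan.com/38_38269/286945.html
# A relative path resolves against the base URL too:
print(urljoin(base, "286945.html"))            # https://www.biqukan.com/38_38269/286945.html
```

Swapping `url2 + hrefList[i]` for `urljoin(url2, hrefList[i])` would keep the loop working even if the site changed its link format.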