Scraping book information from the Douban essay tag page with XPath
Python offers a choice of libraries for scraping web pages, such as bs4 and lxml.
Today we will use etree from lxml for a simple hands-on exercise.
Goal: learn to locate HTML tags with XPath.
1. Spoof the request headers (covered in the previous post, so not repeated here)
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62'
}
2. Fetch the page source with requests.get and parse it with etree.HTML
# Fetch the page source
url = 'https://book.douban.com/tag/%E6%95%A3%E6%96%87'
page_text = requests.get(url=url, headers=headers).text
# Parse the HTML into an element tree
tree_book = etree.HTML(page_text)
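Optionally (this is not in the original snippet), you can fail fast on HTTP errors and pin down the encoding before parsing; raise_for_status() and apparent_encoding are standard parts of requests:
# Optional sketch: check the response and fix the encoding before parsing
resp = requests.get(url=url, headers=headers, timeout=10)
resp.raise_for_status()                 # raise an exception on 4xx/5xx responses
resp.encoding = resp.apparent_encoding  # guard against a mis-detected encoding
tree_book = etree.HTML(resp.text)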
The page source looks like this (focus on how the tags are nested):
# Source: Douban
<div id="wrapper">
<div id="content">
<h1>豆瓣图书标签: 散文</h1>
<div class="grid-16-8 clearfix">
<div class="article">
<div id="subject_list">
<div class="clearfix">
<span class="rr greyinput">
综合排序
/
<a href="/tag/%E6%95%A3%E6%96%87?type=R">按出版日期排序</a>
/
<a href="/tag/%E6%95%A3%E6%96%87?type=S">按评价排序</a>
</span>
</div>
<ul class="subject-list">
<li class="subject-item">
<div class="pic">
<a class="nbg" href="https://book.douban.com/subject/1060068/" onclick="moreurl(this,{i:'0',query:'',subject_id:'1060068',from:'book_subject_search'})">
<img class="" src="https://img3.doubanio.com/view/subject/s/public/s1066570.jpg" width="90">
</a>
</div>
<div class="info">
<h2 class="">
<a href="https://book.douban.com/subject/1060068/" title="撒哈拉的故事" onclick="moreurl(this,{i:'0',query:'',subject_id:'1060068',from:'book_subject_search'})">
撒哈拉的故事
</a>
</h2>
<div class="pub">
三毛 / 哈尔滨出版社 / 2003-8 / 15.80元
</div>
<div class="star clearfix">
<span class="allstar45"></span>
<span class="rating_nums">9.2</span>
<span class="pl">
(116916人评价)
</span>
</div>
<p>三毛作品中最脍炙人口的《撒哈拉的故事》,由12篇精彩动人的散文结合而成,其中《沙漠中的饭店》,是三毛适应荒凉单调的沙漠生活后,重新拾笔的第一篇文字,自此之后... </p>
<div class="ft">
<div class="collect-info">
</div>
<div class="cart-actions">
<span class="buy-info">
<a href="https://book.douban.com/subject/1060068/buylinks">
纸质版 134.60元
</a>
</span>
</div>
</div>
</div>
</li>
<li class="subject-item">
<div class="pic">
<a class="nbg" href="https://book.douban.com/subject/1023045/" onclick="moreurl(this,{i:'1',query:'',subject_id:'1023045',from:'book_subject_search'})">
<img class="" src="https://img2.doubanio.com/view/subject/s/public/s1015872.jpg" width="90">
</a>
</div>
<div class="info">
<h2 class="">
<a href="https://book.douban.com/subject/1023045/" title="我们仨" onclick="moreurl(this,{i:'1',query:'',subject_id:'1023045',from:'book_subject_search'})">
我们仨
</a>
</h2>
<div class="pub">
杨绛 / 生活·读书·新知三联书店 / 2003-7 / 18.80元
</div>
<div class="star clearfix">
<span class="allstar45"></span>
<span class="rating_nums">8.7</span>
<span class="pl">
(268157人评价)
</span>
</div>
<p>《我们仨》是钱钟书夫人杨绛撰写的家庭生活回忆录。1998年,钱钟书逝世,而他和杨绛唯一的女儿钱瑗已于此前(1997年)先他们而去。在人生的伴侣离去四年后,杨... </p>
<div class="ft">
<div class="collect-info">
</div>
<div class="cart-actions">
</div>
</div>
</div>
</li>
3. Locate the relevant tags and extract the data, cross-referencing the page source above
bookList = []
li_list = tree_book.xpath('//ul[@class="subject-list"]/li')
for li in li_list:
    bookAuthor = li.xpath('normalize-space(./div[2]/div/text())')
    bookName = li.xpath('normalize-space(./div[2]/h2/a/text())')
    bookBrief = li.xpath('normalize-space(./div[2]/p/text())')
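Positional paths such as ./div[2] break if Douban ever reorders the columns. As a hedged alternative (my own variation, not from the original post), the same three fields can be selected by class name and by the title attribute visible in the source above:
# Alternative sketch: select by class/attribute instead of position
for li in li_list:
    bookName = li.xpath('normalize-space(.//h2/a/@title)')
    bookAuthor = li.xpath('normalize-space(.//div[@class="pub"]/text())')
    bookBrief = li.xpath('normalize-space(.//div[@class="info"]/p/text())')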
Note how //, /, and ./ are used; the comments in the full source at the end explain each form.
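Here is a tiny self-contained illustration of the three path forms (the toy markup below is made up for the example, not taken from the Douban page):
# Minimal sketch of //, / and ./ on a toy document
from lxml import etree

doc = etree.HTML('<div><ul class="a"><li><p>x</p></li></ul></div>')
ul = doc.xpath('//ul[@class="a"]')[0]   # // matches the ul anywhere in the document
lis = ul.xpath('./li')                  # ./ is relative to the node xpath() is called on
ps = doc.xpath('//ul/li/p')             # / steps from a node to its direct children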
The normalize-space() function trims the leading and trailing whitespace (including newlines) of the extracted text and collapses internal runs of whitespace into single spaces.
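For example, assuming the li_list from step 3 is in scope, the raw text() of the pub div still carries its surrounding newline and indentation, while normalize-space() returns the clean string:
# Raw text() keeps the newlines and indentation from the source HTML
raw = li_list[0].xpath('./div[2]/div/text()')[0]
# normalize-space() trims it and collapses internal whitespace runs
clean = li_list[0].xpath('normalize-space(./div[2]/div/text())')
print(repr(raw), '->', repr(clean))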
That wraps up this post; the full source code is attached below.
# Author: void_zk
# Date: 2021.8.4
# Scrape book details from the Douban essay tag page with XPath
import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62'
    }
    # Fetch the page source
    url = 'https://book.douban.com/tag/%E6%95%A3%E6%96%87'
    page_text = requests.get(url=url, headers=headers).text
    # Parse the HTML into an element tree
    tree_book = etree.HTML(page_text)
    # Collect one record per book
    bookList = []
    # // matches the ul anywhere in the document (it is not a direct child of the root), hence the double slash
    li_list = tree_book.xpath('//ul[@class="subject-list"]/li')
    for li in li_list:
        # ./ makes the path relative to the current li, since xpath() is called on li
        # the pub div holds "author / publisher / date / price"
        bookAuthor = li.xpath('normalize-space(./div[2]/div/text())')
        # the title text sits inside the <a> within the <h2>
        bookName = li.xpath('normalize-space(./div[2]/h2/a/text())')
        bookBrief = li.xpath('normalize-space(./div[2]/p/text())')
        # / works like a path separator in a file system: it selects direct children
        # Join the fields into one record
        bookList.append(bookName + '/' + bookAuthor + '/' + bookBrief)
    print('\nScraping finished!\n')
    print(bookList)
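As a small follow-up sketch (not part of the original script; the filename is just an example), the collected records can be written out to a UTF-8 text file, one per line:
# Follow-up sketch: persist the scraped records, one per line
# 'douban_essays.txt' is an example filename, not from the original post
with open('douban_essays.txt', 'w', encoding='utf-8') as f:
    for record in bookList:
        f.write(record + '\n')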