[Learning Python] XPath parsing

XPath parsing

XPath, short for XML Path Language, is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document.

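How XPath parsing works:

  1. Instantiate an etree object and load the source of the page to be parsed into it
  2. Call the object's xpath method with an XPath expression to locate tags and capture their content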
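
The two steps above can be sketched as follows. This is only a minimal illustration: `page_source` is a hand-written string invented for the example, not the source of any real page.

```python
from lxml import etree

# Hypothetical page source, used only for illustration.
page_source = "<html><body><h1 id='top'>Hello</h1><p>World</p></body></html>"

# Step 1: load the page source into an etree object.
tree = etree.HTML(page_source)

# Step 2: locate tags and capture content with XPath expressions.
heading_text = tree.xpath("//h1/text()")
heading_id = tree.xpath("//h1/@id")

print(heading_text)  # ['Hello']
print(heading_id)    # ['top']
```

Note that `xpath` always returns a list here, even when only one node matches.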

Installation

pip install lxml

Import

from lxml import etree

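How do you instantiate an etree object?

  1. Load the data from a local HTML file into the object. When parsing a local file it is best to pass the second argument: without it, etree.parse uses the XML parser, which may raise an error on HTML that is not well-formed XML.
etree.parse(filePath, etree.HTMLParser())
  2. Alternatively, load page source fetched from the internet into the object, where page_text is the source as a string:
etree.HTML(page_text)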
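
A short sketch comparing the two ways of building the object. The sample markup and the temporary file are assumptions made so that the example runs on its own; in practice `file_path` would point to a page saved earlier.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page source, used only for illustration.
html_source = "<html><body><p class='msg'>saved page</p></body></html>"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as f:
    f.write(html_source)
    file_path = f.name

try:
    # Local file: etree.parse returns an ElementTree; HTMLParser tolerates HTML.
    tree_from_file = etree.parse(file_path, etree.HTMLParser())
finally:
    os.remove(file_path)

# In-memory string: etree.HTML returns the root Element directly.
root_from_string = etree.HTML(html_source)

file_result = tree_from_file.xpath("//p[@class='msg']/text()")
string_result = root_from_string.xpath("//p[@class='msg']/text()")
print(file_result, string_result)
```

Both objects expose the same `xpath` method, so the rest of the code does not need to care where the markup came from.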

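XPath syntax
The most commonly used expressions are listed below.

Expression    Description
nodename      Selects the child elements of the current node that are named nodename
/             Selects from the root node
//            Selects matching nodes anywhere in the document below the current node, regardless of their position
.             Selects the current node
..            Selects the parent of the current node
@             Selects attributes
text()        Selects text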
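
The `.`, `..`, `@` and `text()` expressions are easiest to see when evaluated relative to a specific element. The sketch below assumes a small hand-written bookstore document; the variable names are illustrative.

```python
from lxml import etree

# Hand-written sample document, used only for illustration.
xml = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
</bookstore>"""
root = etree.fromstring(xml)

# Take the first title element as the context node for relative expressions.
title = root.xpath("//title")[0]

current_tag = title.xpath(".")[0].tag          # "." is the title itself
parent_tag = title.xpath("..")[0].tag          # ".." is its parent, the book
lang = title.xpath("@lang")                    # attribute of the current node
text = title.xpath("text()")                   # text of the current node
sibling_price = title.xpath("../price/text()") # up to book, then down to price

print(current_tag, parent_tag, lang, text, sibling_price)
```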

Example

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>

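Path expression       Description
bookstore             Selects bookstore elements (relative to the current node).
/bookstore            Selects the root element bookstore. Note: a path that starts with a slash ( / ) is always an absolute path to an element.
bookstore/book        Selects all book elements that are children of bookstore.
//book                Selects all book elements, wherever they are in the document.
bookstore//book       Selects all book elements that are descendants of bookstore, wherever they sit below it.
//book/title/@lang    Selects the value of the lang attribute of every title under a book.
//book/title/text()   Selects the text of every title under a book.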
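
These expressions can be tried against the bookstore document above. This is a sketch, not the only way to run them: the XML declaration is left out because lxml refuses a `str` that carries an encoding declaration, and relative paths are evaluated from the `bookstore` root element, so `book` is used instead of `bookstore/book`.

```python
from lxml import etree

# The bookstore document from above, without the XML declaration.
xml = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""
root = etree.fromstring(xml)

root_elems = root.xpath("/bookstore")           # absolute path to the root
child_books = root.xpath("book")                # relative: children of root
all_books = root.xpath("//book")                # anywhere in the document
nested_books = root.xpath("/bookstore//book")   # descendants of bookstore
langs = root.xpath("//book/title/@lang")
titles = root.xpath("//book/title/text()")

print(len(root_elems), len(child_books), len(all_books), len(nested_books))
print(langs)   # ['eng', 'eng']
print(titles)  # ['Harry Potter', 'Learning XML']
```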

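Predicates: finding specific nodes

Path expression                       Description
/bookstore/book[1]                    Selects the first book element that is a child of bookstore.
/bookstore/book[last()]               Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]             Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]         Selects the first two book elements that are children of bookstore.
//title[@lang]                        Selects all title elements that have an attribute named lang.
/bookstore/book[price>35.00]          Selects all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title    Selects the title elements of all book elements under bookstore whose price element has a value greater than 35.00.

Note: in XPath, positions start at 1. The first element is at position 1, the last is at last(), and the second-to-last is at last()-1.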
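
A sketch of the predicates applied to the bookstore document above (declaration omitted, as lxml rejects a `str` with an encoding declaration). With only two books, `last()-1` and `[1]` happen to select the same element.

```python
from lxml import etree

# The bookstore document from above, without the XML declaration.
xml = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""
root = etree.fromstring(xml)

first_title = root.xpath("/bookstore/book[1]/title/text()")
last_title = root.xpath("/bookstore/book[last()]/title/text()")
second_last_title = root.xpath("/bookstore/book[last()-1]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]")
with_lang = root.xpath("//title[@lang]")
expensive_titles = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first_title)        # ['Harry Potter']
print(last_title)         # ['Learning XML']
print(second_last_title)  # ['Harry Potter']
print(len(first_two), len(with_lang))
print(expensive_titles)   # ['Learning XML']
```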

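Selecting unknown nodes

Wildcard    Description
*           Matches any element node.
@*          Matches any attribute node.
node()      Matches any node of any kind.

Examples:

Path expression    Result
/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.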
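
The wildcard expressions can be checked against the bookstore document above. A minimal sketch, with the XML declaration again omitted for lxml:

```python
from lxml import etree

# The bookstore document from above, without the XML declaration.
xml = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""
root = etree.fromstring(xml)

# All child elements of bookstore.
child_tags = [el.tag for el in root.xpath("/bookstore/*")]
# Every element in the document: bookstore, plus book/title/price twice.
all_elements = root.xpath("//*")
# title elements that carry any attribute at all.
titles_with_attrs = root.xpath("//title[@*]")
# The values of every attribute on the title elements.
title_attr_values = root.xpath("//title/@*")

print(child_tags)              # ['book', 'book']
print(len(all_elements))       # 7
print(len(titles_with_attrs))  # 2
print(title_attr_values)       # ['eng', 'eng']
```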

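Selecting several paths

The "|" operator in a path expression selects several paths at once.

Path expression                    Result
//book/title | //book/price        Selects all title and price elements of book elements.
//title | //price                  Selects all title and price elements in the document.
/bookstore/book/title | //price    Selects all title elements of the book elements under bookstore, plus all price elements in the document.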
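
A sketch of the "|" operator on the bookstore document above. The tags are sorted before printing so that the example does not rely on the order in which the union is returned.

```python
from lxml import etree

# The bookstore document from above, without the XML declaration.
xml = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""
root = etree.fromstring(xml)

# Union of two node sets: titles and prices of all books.
union_nodes = root.xpath("//book/title | //book/price")
union_tags = sorted(el.tag for el in union_nodes)

# A node matched by both sides appears only once in the result.
deduplicated = root.xpath("//title | /bookstore/book/title")

print(union_tags)         # ['price', 'price', 'title', 'title']
print(len(deduplicated))  # 2
```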


Practical example: a Douban movie crawler

HTML code

<div id="screening" class="s" data-dstat-areaid="70" data-dstat-mode="click,expose">
	<div class="screening-hd">
		<div class="ui-slide-control">
			<span class="prev-btn">
				<a class="btn-prev" href="javascript:void(0)"></a>
			</span>
			<span class="next-btn">
				<a class="btn-next" href="javascript:void(0)"></a>
			</span>
		</div>
		<div class="slide-tip">
			<span class="ui-slide-index">1</span> / <span class="ui-slide-max">4</span>
		</div>
		<h2>正在热映<span>
				<a onclick="moreurl(this, {from:'mv_l_a'})" href="/cinema/nowplaying/">全部正在热映&raquo;</a>
			</span>
			<span>
				<a onclick="moreurl(this, {from:'mv_l_w'})" href="./later/">即将上映&raquo;</a>
			</span>
		</h2>
	</div>
	<div class="screening-bd">
		<ul class="ui-slide-content" data-slide-index="1" data-index-max="4">


			<li class="ui-slide-item s" data-dstat-areaid=70_1 data-dstat-mode=click,expose data-dstat-watch=.ui-slide-content data-dstat-viewport=.screening-bd data-title="万里归途" data-release="2022" data-rate="7.5" data-star="40" data-trailer="https://movie.douban.com/subject/26654184/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=26654184" data-duration="137分钟" data-region="中国大陆" data-director="饶晓志" data-actors="张译 / 王俊凯 / 殷桃" data-intro="" data-enough="true" data-rater="163672">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/26654184/?from=showing">
							<img src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2881176356.jpg" alt="万里归途" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/26654184/?from=showing" class="">万里归途</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar40"></span>
						<span class="subject-rate">7.5</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=26654184" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="平凡英雄" data-release="2022" data-rate="" data-star="00" data-trailer="https://movie.douban.com/subject/35554292/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35554292" data-duration="97分钟" data-region="中国大陆" data-director="陈国辉" data-actors="李冰冰 / 冯绍峰 / 黄晓明" data-intro="" data-enough="false" data-rater="13273">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35554292/?from=showing">
							<img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2880994870.jpg" alt="平凡英雄" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35554292/?from=showing" class="">平凡英雄</a>
					</li>
					<li class="rating">


						<span class="text-tip">暂无评分</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35554292" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="还是觉得你最好 飯戲攻心" data-release="2022" data-rate="7.6" data-star="40" data-trailer="https://movie.douban.com/subject/35503125/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35503125" data-duration="116分钟(中国大陆)" data-region="中国香港" data-director="陈咏燊" data-actors="黄子华 / 邓丽欣 / 张继聪" data-intro="" data-enough="true" data-rater="60133">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35503125/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2879301401.jpg" alt="还是觉得你最好" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35503125/?from=showing" class="">还是觉得你最...</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar40"></span>
						<span class="subject-rate">7.6</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35503125" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="搜救" data-release="2022" data-rate="" data-star="00" data-trailer="https://movie.douban.com/subject/35237993/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35237993" data-duration="108分钟" data-region="中国大陆" data-director="罗志良" data-actors="甄子丹 / 韩雪 / 贾冰" data-intro="" data-enough="false" data-rater="5247">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35237993/?from=showing">
							<img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2881182580.jpg" alt="搜救" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35237993/?from=showing" class="">搜救</a>
					</li>
					<li class="rating">


						<span class="text-tip">暂无评分</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35237993" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="狼群" data-release="2022" data-rate="5.8" data-star="30" data-trailer="https://movie.douban.com/subject/33429617/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=33429617" data-duration="105分钟" data-region="中国大陆" data-director="蒋丛" data-actors="张晋 / 李治廷 / 蒋璐霞" data-intro="" data-enough="true" data-rater="8500">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/33429617/?from=showing">
							<img src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2879572474.jpg" alt="狼群" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/33429617/?from=showing" class="">狼群</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar30"></span>
						<span class="subject-rate">5.8</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=33429617" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item s" data-dstat-areaid=70_2 data-dstat-mode=click,expose data-dstat-watch=.ui-slide-content data-dstat-viewport=.screening-bd data-title="哥,你好" data-release="2022" data-rate="5.3" data-star="30" data-trailer="https://movie.douban.com/subject/35102469/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35102469" data-duration="111分钟" data-region="中国大陆" data-director="张栾" data-actors="马丽 / 常远 / 魏翔" data-intro="" data-enough="true" data-rater="42222">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35102469/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2880026973.jpg" alt="哥,你好" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35102469/?from=showing" class="">哥,你好</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar30"></span>
						<span class="subject-rate">5.3</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35102469" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="钢铁意志" data-release="2022" data-rate="" data-star="00" data-trailer="https://movie.douban.com/subject/35517351/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35517351" data-duration="105分钟" data-region="中国大陆" data-director="宁海强" data-actors="刘烨 / 韩雪 / 林永健" data-intro="" data-enough="false" data-rater="2372">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35517351/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2880944813.jpg" alt="钢铁意志" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35517351/?from=showing" class="">钢铁意志</a>
					</li>
					<li class="rating">


						<span class="text-tip">暂无评分</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35517351" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="海的尽头是草原" data-release="2022" data-rate="7.2" data-star="40" data-trailer="https://movie.douban.com/subject/35346312/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35346312" data-duration="124分钟" data-region="中国大陆" data-director="尔冬升" data-actors="陈宝国 / 马苏 / 阿云嘎" data-intro="" data-enough="true" data-rater="22315">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35346312/?from=showing">
							<img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2877827228.jpg" alt="海的尽头是草原" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35346312/?from=showing" class="">海的尽头是草...</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar40"></span>
						<span class="subject-rate">7.2</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35346312" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="我是霸王龙" data-release="2022" data-rate="" data-star="00" data-trailer="https://movie.douban.com/subject/34979319/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=34979319" data-duration="80分钟" data-region="中国大陆" data-director="尚铭" data-actors="依诺 / 林浪 / 温池禹" data-intro="" data-enough="false" data-rater="571">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/34979319/?from=showing">
							<img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2880723500.jpg" alt="我是霸王龙" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/34979319/?from=showing" class="">我是霸王龙</a>
					</li>
					<li class="rating">


						<span class="text-tip">暂无评分</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=34979319" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="新灰姑娘2 Cinderella and the Spellbinder" data-release="2022" data-rate="" data-star="00" data-trailer="https://movie.douban.com/subject/35791622/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35791622" data-duration="90分钟" data-region="中国大陆" data-director="爱丽丝·布莱哈特" data-actors="蒋丽 / 邵敏佳 / 赵路" data-intro="" data-enough="false" data-rater="636">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35791622/?from=showing">
							<img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2880657159.jpg" alt="新灰姑娘2" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35791622/?from=showing" class="">新灰姑娘2</a>
					</li>
					<li class="rating">


						<span class="text-tip">暂无评分</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35791622" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item s" data-dstat-areaid=70_3 data-dstat-mode=click,expose data-dstat-watch=.ui-slide-content data-dstat-viewport=.screening-bd data-title="新大头儿子和小头爸爸5:我的外星朋友" data-release="2022" data-rate="" data-star="00" data-trailer="https://movie.douban.com/subject/36035677/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=36035677" data-duration="81分钟" data-region="中国大陆" data-director="" data-actors="董浩 / 鞠萍 / 陈怡" data-intro="" data-enough="false" data-rater="1418">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/36035677/?from=showing">
							<img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2881054838.jpg" alt="新大头儿子和小头爸爸5:我的外星朋友" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/36035677/?from=showing" class="">新大头儿子和...</a>
					</li>
					<li class="rating">


						<span class="text-tip">暂无评分</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=36035677" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="新神榜:杨戬" data-release="2022" data-rate="7.1" data-star="35" data-trailer="https://movie.douban.com/subject/35360684/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35360684" data-duration="127分钟" data-region="中国大陆" data-director="赵霁" data-actors="王凯 / 季冠霖 / 李立宏" data-intro="" data-enough="true" data-rater="150785">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35360684/?from=showing">
							<img src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2878224156.jpg" alt="新神榜:杨戬" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35360684/?from=showing" class="">新神榜:杨戬...</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar35"></span>
						<span class="subject-rate">7.1</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35360684" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="妈妈!" data-release="2022" data-rate="7.4" data-star="40" data-trailer="https://movie.douban.com/subject/34954093/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=34954093" data-duration="109分钟" data-region="中国大陆" data-director="杨荔钠" data-actors="吴彦姝 / 奚美娟 / 文淇" data-intro="" data-enough="true" data-rater="30917">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/34954093/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2879572001.jpg" alt="妈妈!" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/34954093/?from=showing" class="">妈妈!</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar40"></span>
						<span class="subject-rate">7.4</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=34954093" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="我要和你在一起" data-release="2022" data-rate="5.1" data-star="25" data-trailer="https://movie.douban.com/subject/26602901/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=26602901" data-duration="94分钟" data-region="中国大陆" data-director="曾晋为" data-actors="尹昉 / 李梦 / 李俊贤" data-intro="" data-enough="true" data-rater="2015">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/26602901/?from=showing">
							<img src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2879792636.jpg" alt="我要和你在一起" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/26602901/?from=showing" class="">我要和你在一...</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar25"></span>
						<span class="subject-rate">5.1</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=26602901" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="独行月球" data-release="2022" data-rate="6.8" data-star="35" data-trailer="https://movie.douban.com/subject/35183042/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35183042" data-duration="122分钟" data-region="中国大陆" data-director="张吃鱼" data-actors="沈腾 / 马丽 / 常远" data-intro="" data-enough="true" data-rater="459477">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35183042/?from=showing">
							<img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2876409008.jpg" alt="独行月球" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35183042/?from=showing" class="">独行月球</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar35"></span>
						<span class="subject-rate">6.8</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35183042" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item s" data-dstat-areaid=70_4 data-dstat-mode=click,expose data-dstat-watch=.ui-slide-content data-dstat-viewport=.screening-bd data-title="世间有她" data-release="2022" data-rate="5.5" data-star="30" data-trailer="https://movie.douban.com/subject/35187381/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=35187381" data-duration="116分钟" data-region="中国大陆" data-director="李少红" data-actors="周迅 / 郑秀文 / 易烊千玺" data-intro="" data-enough="true" data-rater="22641">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/35187381/?from=showing">
							<img src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2878770125.jpg" alt="世间有她" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/35187381/?from=showing" class="">世间有她</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar30"></span>
						<span class="subject-rate">5.5</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=35187381" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="流浪地球" data-release="2019" data-rate="7.9" data-star="40" data-trailer="https://movie.douban.com/subject/26266893/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=26266893" data-duration="125分钟" data-region="中国大陆" data-director="郭帆" data-actors="吴京 / 屈楚萧 / 李光洁" data-intro="" data-enough="true" data-rater="1797785">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/26266893/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2545472803.jpg" alt="流浪地球" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/26266893/?from=showing" class="">流浪地球</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar40"></span>
						<span class="subject-rate">7.9</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=26266893" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="明日战记 明日戰記" data-release="2022" data-rate="6.4" data-star="35" data-trailer="https://movie.douban.com/subject/26353671/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=26353671" data-duration="99分钟" data-region="中国香港" data-director="吴炫辉" data-actors="古天乐 / 刘青云 / 刘嘉玲" data-intro="" data-enough="true" data-rater="93645">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/26353671/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2876734663.jpg" alt="明日战记" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/26353671/?from=showing" class="">明日战记</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar35"></span>
						<span class="subject-rate">6.4</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=26353671" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="追光万里" data-release="2022" data-rate="6.8" data-star="35" data-trailer="https://movie.douban.com/subject/34936382/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=34936382" data-duration="86分钟" data-region="中国大陆" data-director="张同道" data-actors="卢燕 / 赖声川 / 白先勇" data-intro="" data-enough="true" data-rater="1494">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/34936382/?from=showing">
							<img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2877762319.jpg" alt="追光万里" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/34936382/?from=showing" class="">追光万里</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar35"></span>
						<span class="subject-rate">6.8</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=34936382" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>


			<li class="ui-slide-item" data-title="中国机长" data-release="2019" data-rate="6.6" data-star="35" data-trailer="https://movie.douban.com/subject/30295905/trailer" data-ticket="https://movie.douban.com/ticket/redirect/?movie_id=30295905" data-duration="111分钟" data-region="中国大陆" data-director="刘伟强" data-actors="张涵予 / 欧豪 / 杜江" data-intro="" data-enough="true" data-rater="706911">
				<ul class="">
					<li class="poster">
						<a onclick="moreurl(this, {from:'mv_a_pst'})" href="https://movie.douban.com/subject/30295905/?from=showing">
							<img src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2568258113.jpg" alt="中国机长" rel="nofollow" class="" />
						</a>
					</li>
					<li class="title">
						<a onclick="moreurl(this, {from:'mv_a_tl'})" href="https://movie.douban.com/subject/30295905/?from=showing" class="">中国机长</a>
					</li>
					<li class="rating">
						<span class="rating-star allstar35"></span>
						<span class="subject-rate">6.6</span>
					</li>
					<li class="ticket_btn">
						<span>
							<a onclick="moreurl(this, {from:'mv_b_tc'})" href="https://movie.douban.com/ticket/redirect/?movie_id=30295905" target="_blank">选座购票</a>
						</span>
					</li>
				</ul>
		</ul>
	</div>
</div>

Python code

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.douban.com/',
}
response = requests.get('https://movie.douban.com/', headers=headers)
text = response.text

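# Optionally save the page locally
# with open('douban.html','w',encoding='utf-8') as f:
#     f.write(text)

# Optionally read the saved HTML back
# with open('douban.html','r',encoding='utf-8') as f:
#     text = f.read()

# Parse the page and locate the "now showing" list
html = etree.HTML(text)
ul = html.xpath("//ul[@class='ui-slide-content']")[0]
# print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
# Each movie is an <li> that carries its details in data-* attributes
lis = ul.xpath('./li[@data-title]')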
movies = []
for li in lis:
    title = li.xpath('@data-title')[0]
    score = li.xpath('@data-rate')[0]
    duration = li.xpath('@data-duration')[0]
    region = li.xpath('@data-region')[0]
    director = li.xpath('@data-director')[0]
    actors = li.xpath('@data-actors')[0]
    thumbnail = li.xpath('.//img/@src')[0]
    movie = {
        'title': title,
        'score': score,
        'duration': duration,
        'region': region,
        'director': director,
        'actors': actors,
        'thumbnail': thumbnail
    }
    movies.append(movie)
print(movies)

Output

[{'title': '万里归途', 'score': '7.5', 'duration': '137分钟', 'region': '中国大陆', 'director': '饶晓志', 'actors': '张译 / 王俊凯 / 殷桃', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2881176356.jpg'}, {'title': '平凡英雄', 'score': '', 'duration': '97分钟', 'region': '中国大陆', 'director': '陈国辉', 'actors': '李冰冰 / 冯绍峰 / 黄晓明', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2880994870.jpg'}, {'title': '还是觉得你最好 飯戲攻心', 'score': '7.6', 'duration': '116分钟(中国大陆)', 'region': '中国香港', 'director': '陈咏燊', 'actors': '黄子华 / 邓丽欣 / 张继聪', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2879301401.jpg'}, {'title': '搜救', 'score': '', 'duration': '108分钟', 'region': '中国大陆', 'director': '罗志良', 'actors': '甄子丹 / 韩雪 / 贾冰', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2881182580.jpg'}, {'title': '狼群', 'score': '5.8', 'duration': '105分钟', 'region': '中国大陆', 'director': '蒋丛', 'actors': '张晋 / 李治廷 / 蒋璐霞', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2879572474.jpg'}, {'title': '哥,你好', 'score': '5.3', 'duration': '111分钟', 'region': '中国大陆', 'director': '张栾', 'actors': '马丽 / 常远 / 魏翔', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2880026973.jpg'}, {'title': '钢铁意志', 'score': '', 'duration': '105分钟', 'region': '中国大陆', 'director': '宁海强', 'actors': '刘烨 / 韩雪 / 林永健', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2880944813.jpg'}, {'title': '海的尽头是草原', 'score': '7.2', 'duration': '124分钟', 'region': '中国大陆', 'director': '尔冬升', 'actors': '陈宝国 / 马苏 / 阿云嘎', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2877827228.jpg'}, {'title': '我是霸王龙', 'score': '', 'duration': '80分钟', 'region': '中国大陆', 'director': '尚铭', 'actors': '依诺 / 林浪 / 温池禹', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2880723500.jpg'}, {'title': '新灰姑娘2 Cinderella and the Spellbinder', 'score': '', 'duration': '90分钟', 'region': 
'中国大陆', 'director': '爱丽丝·布莱哈特', 'actors': '蒋丽 / 邵敏佳 / 赵路', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2880657159.jpg'}, {'title': '新大头儿子和小头爸爸5:我的外星朋友', 'score': '', 'duration': '81分钟', 'region': '中国大陆', 'director': '', 'actors': '董浩 / 鞠萍 / 陈怡', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2881054838.jpg'}, {'title': '新神榜:杨戬', 'score': '7.1', 'duration': '127分钟', 'region': '中国大陆', 'director': '赵霁', 'actors': '王凯 / 季冠霖 / 李立宏', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2878224156.jpg'}, {'title': '妈妈!', 'score': '7.4', 'duration': '109分钟', 'region': '中国大陆', 'director': '杨荔钠', 'actors': '吴彦姝 / 奚美娟 / 文淇', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2879572001.jpg'}, {'title': '我要和你在一起', 'score': '5.1', 'duration': '94分钟', 'region': '中国大陆', 'director': '曾晋为', 'actors': '尹昉 / 李梦 / 李俊贤', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2879792636.jpg'}, {'title': '独行月球', 'score': '6.8', 'duration': '122分钟', 'region': '中国大陆', 'director': '张吃鱼', 'actors': '沈腾 / 马丽 / 常远', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2876409008.jpg'}, {'title': '世间有她', 'score': '5.5', 'duration': '116分钟', 'region': '中国大陆', 'director': '李少红', 'actors': '周迅 / 郑秀文 / 易烊千玺', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2878770125.jpg'}, {'title': '流浪地球', 'score': '7.9', 'duration': '125分钟', 'region': '中国大陆', 'director': '郭帆', 'actors': '吴京 / 屈楚萧 / 李光洁', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2545472803.jpg'}, {'title': '明日战记 明日戰記', 'score': '6.4', 'duration': '99分钟', 'region': '中国香港', 'director': '吴炫辉', 'actors': '古天乐 / 刘青云 / 刘嘉玲', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2876734663.jpg'}, {'title': '追光万里', 'score': '6.8', 'duration': '86分钟', 'region': '中国大陆', 'director': '张同道', 'actors': '卢燕 / 赖声川 / 白先勇', 'thumbnail': 
'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2877762319.jpg'}, {'title': '中国机长', 'score': '6.6', 'duration': '111分钟', 'region': '中国大陆', 'director': '刘伟强', 'actors': '张涵予 / 欧豪 / 杜江', 'thumbnail': 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p2568258113.jpg'}]
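
The crawler above indexes `[0]` on every `xpath` result, which raises an IndexError as soon as an attribute or image is missing, and it leaves unrated movies with an empty score string. One possible defensive variant is sketched below. The helper `first` and the trimmed two-item `sample` are assumptions made for illustration; they are not part of the original script or of lxml.

```python
from lxml import etree

# Trimmed, hand-written sample that mirrors the attributes used above.
sample = """
<ul class="ui-slide-content">
  <li class="ui-slide-item" data-title="A" data-rate="7.5"><img src="a.jpg"></li>
  <li class="ui-slide-item" data-title="B" data-rate=""></li>
</ul>
"""

def first(node, expr, default=None):
    """Return the first XPath result for expr, or default if nothing matches."""
    results = node.xpath(expr)
    return results[0] if results else default

root = etree.HTML(sample)
movies = []
for li in root.xpath("//ul[@class='ui-slide-content']/li[@data-title]"):
    rate = first(li, "@data-rate", "")
    movies.append({
        "title": first(li, "@data-title"),
        # An empty data-rate means the movie has no rating yet.
        "score": float(rate) if rate else None,
        "thumbnail": first(li, ".//img/@src"),
    })
print(movies)
```

Here the second item has no rating and no poster, so both fields come back as `None` instead of stopping the loop.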

lxml module usage example

from lxml import etree
text = ''' 
<div> 
  <ul> 
    <li class="item-1">
      <a href="link1.html">first item</a>
    </li> 
    <li class="item-1">
      <a href="link2.html">second item</a>
    </li> 
    <li class="item-inactive">
      <a href="link3.html">third item</a>
    </li> 
    <li class="item-1">
      <a href="link4.html">fourth item</a>
    </li> 
    <li class="item-0">
      <a href="link5.html">fifth item</a>
  </ul> 
</div>
'''

html = etree.HTML(text)

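# Get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

# Pair each href with its title and build a dict
# (zip avoids the repeated list.index lookup, which breaks on duplicate hrefs)
for href, title in zip(href_list, title_list):
    item = {"href": href, "title": title}
    print(item)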

Using the etree.tostring function in the lxml module

Run the code below and compare the original HTML string with the printed output.

from lxml import etree
html_str = ''' <div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''

html = etree.HTML(html_str)

handled_html_str = etree.tostring(html).decode()
print(handled_html_str)

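Observations and conclusions

Compared with the original string, the printed result shows that:

  1. The missing closing li tag is filled in automatically
  2. The html, body and other enclosing tags are added automatically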
<html><body><div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </li></ul> </div> </body></html>

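Conclusions:

  • lxml.etree.HTML(html_str) automatically completes missing tags
  • lxml.etree.tostring converts an Element object back into an HTML string
  • Because lxml may alter the markup, a crawler that extracts data with lxml should write its XPath expressions against the output of lxml.etree.tostring, not against the raw page source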
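
The last point can be illustrated with a short sketch. The fragment `raw` is a hand-written assumption, and `encoding="unicode"` is used here as an alternative to calling `.decode()` on the bytes result.

```python
from lxml import etree

# Hand-written fragment with a missing </li>, used only for illustration.
raw = '<ul><li class="item-0"><a href="link5.html">fifth item</a></ul>'
root = etree.HTML(raw)

# encoding="unicode" makes tostring return a str instead of bytes.
normalized = etree.tostring(root, encoding="unicode")
print(normalized)

# An absolute path only works against the normalized structure,
# because the html and body wrappers exist only after parsing.
link_text = root.xpath("/html/body/ul/li/a/text()")
print(link_text)  # ['fifth item']
```

Printing the normalized markup first makes it clear which tags an absolute XPath has to pass through.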