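Python: Understanding XPath
How XPath parsing works:
- Create an etree object and load the source of the page to be parsed into it
- Call the etree object's xpath method with an XPath expression to locate tags and capture their content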
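The two steps above can be sketched as follows. This is a minimal illustration, assuming lxml is installed; the HTML string is made up for the example and is not taken from any real page.

```python
from lxml import etree

# Hypothetical page source, used only for illustration.
html = "<html><body><div><a href='https://example.com'>Example</a></div></body></html>"

# Step 1: load the page source into an etree object.
tree = etree.HTML(html)

# Step 2: call xpath() with an expression to locate tags.
links = tree.xpath("/html/body/div/a")
print(links[0].tag, links[0].text)  # a Example
```

etree.HTML parses a string that is already in memory; for a file on disk, etree.parse with an HTMLParser does the same job.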
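Start with three questions: what can XPath do, what does its syntax look like, and how do you use it?
What XPath can do
An XPath expression locates tags and captures their content.
XPath syntax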
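m = tree.xpath("/html/body/div/a")
- / steps down exactly one level; at the start of an expression it anchors the path at the HTML root node
- // spans any number of levels, so it matches descendants at any depth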
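To make the difference between / and // concrete, here is a small sketch. The markup is invented for illustration: one link is a direct child of the div, the other is nested one level deeper.

```python
from lxml import etree

# Hypothetical markup: one <a> directly under <div>, one nested inside <p>.
html = """
<html><body>
  <div><a>direct</a><p><a>nested</a></p></div>
</body></html>
"""
tree = etree.HTML(html)

# Each / moves down exactly one level, so only the direct child matches.
direct = tree.xpath("/html/body/div/a")
# // matches at any depth, so the nested link is found as well.
anywhere = tree.xpath("//div//a")

print([a.text for a in direct])    # ['direct']
print([a.text for a in anywhere])  # ['direct', 'nested']
```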
m = tree.xpath("//div[@class='kuozhan']/a")
- Attribute matching: [@class='class-name'], or more generally [@attrName='attribute-value']
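A short sketch of attribute matching, using invented markup. Note that [@class='kuozhan'] compares the whole attribute value, so an element with several classes does not match; the standard XPath contains() function is shown as one possible, looser alternative (it is a substring test, so use it with care).

```python
from lxml import etree

# Hypothetical markup for illustration only.
html = """<html><body>
<div class="kuozhan"><a>in kuozhan</a></div>
<div class="other"><a>elsewhere</a></div>
<div class="kuozhan extra"><a>two classes</a></div>
</body></html>"""
tree = etree.HTML(html)

# Exact comparison against the full class attribute value.
exact = tree.xpath("//div[@class='kuozhan']/a")
# Substring comparison; also matches class="kuozhan extra".
loose = tree.xpath("//div[contains(@class, 'kuozhan')]/a")

print([a.text for a in exact])  # ['in kuozhan']
print([a.text for a in loose])  # ['in kuozhan', 'two classes']
```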
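# Get the text inside a tag
m = tree.xpath("//span[@class='g']/text()")
- /text() returns only the text that sits directly inside the tag
- //text() returns all text under the tag, including text inside its descendants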
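The difference between /text() and //text() shows up as soon as a tag contains child tags. A minimal sketch with made-up markup:

```python
from lxml import etree

# Hypothetical markup: the <div> has its own text plus a child <span>.
html = "<html><body><div>outer<span>inner</span>tail</div></body></html>"
tree = etree.HTML(html)

# Only text nodes that are direct children of <div>.
own_text = tree.xpath("//div/text()")
# Text nodes of <div> and of everything nested inside it.
all_text = tree.xpath("//div//text()")

print(own_text)  # ['outer', 'tail']
print(all_text)  # ['outer', 'inner', 'tail']
```

Both calls return a list of strings, one per text node, rather than a single joined string.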
# Select by position
m = tree.xpath("//span[@class='g'][2]/text()")
- Use [index] to select by position; XPath positions start at 1, not 0
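One detail worth seeing in action: in //span[@class='g'][2] the position is counted among the siblings under each parent, so the expression can return several results. Wrapping the path in parentheses counts across the whole document instead. The markup below is invented for the illustration.

```python
from lxml import etree

# Hypothetical markup: two parents, each with two matching spans.
html = """<html><body>
<p><span class="g">a1</span><span class="g">a2</span></p>
<p><span class="g">b1</span><span class="g">b2</span></p>
</body></html>"""
tree = etree.HTML(html)

# Position is evaluated per parent: first / second span inside each <p>.
first = tree.xpath("//span[@class='g'][1]/text()")
second = tree.xpath("//span[@class='g'][2]/text()")
# Parentheses make the position apply to the whole result set.
second_overall = tree.xpath("(//span[@class='g'])[2]/text()")

print(first)           # ['a1', 'b1']
print(second)          # ['a2', 'b2']
print(second_overall)  # ['a2']
```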
# Get the value of an attribute
m = tree.xpath("//a[@class='g']/@href")
- tag/@attribute-name
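A small sketch of reading attribute values. The markup and link texts are made up; only the expression form tag/@attribute-name comes from the text above.

```python
from lxml import etree

# Hypothetical markup for illustration.
html = """<html><body>
<a class="g" href="http://corp.sogou.com/">About</a>
<a class="g" href="http://ir.sogou.com/">IR</a>
<a class="other" href="http://b.sogou.com/">B</a>
</body></html>"""
tree = etree.HTML(html)

# /@href yields the attribute values as strings, not the <a> elements.
hrefs = tree.xpath("//a[@class='g']/@href")
print(hrefs)  # ['http://corp.sogou.com/', 'http://ir.sogou.com/']
```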
How to use XPath
Example:
from lxml import etree
from pprint import pprint

if __name__ == "__main__":
    # Workaround for HTML that is not well-formed:
    # create parser = etree.HTMLParser(encoding="utf-8"),
    # then pass it to etree.parse as parser=parser
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse("sogou.html", parser=parser)
    # Basic syntax first
    m = tree.xpath("/html/body/div/a")
    print(m)
    m = tree.xpath('//body/div/a')
    # / steps down one level; a leading / starts at the HTML root node
    # // spans any number of levels
    # Attribute matching: [@class='class-name'] [@attrName='attribute-value']
    m = tree.xpath("//div[@class='kuozhan']/a")
    # Get the text inside a tag
    m = tree.xpath("//span[@class='g']/text()")
    # /text() returns only the direct text of the tag
    print(type(m))
    print(m)
    # Get the text inside a tag
    mx = set(tree.xpath("//div/text()"))
    print("/text() returns only the direct text of the tag:", mx)
    # //text() returns all text under the tag
    ml = set(tree.xpath("//div//text()"))
    print("//text() returns all text under the tag:", ml)
    # Select by position (XPath positions start at 1)
    m = tree.xpath("//span[@class='g'][2]/text()")
    print(m)
    m = tree.xpath("//span[@class='g'][1]/text()")
    print(m)
    # Get the value of an attribute
    m = tree.xpath("//a[@class='g']/@href")
    pprint(m)
Output:
[<Element a at 0x219f6651f48>]
<class 'list'>
['京网文 (2016) 6432-852号', '京ICP证050897号', '(京)-经营性-2016-0019', '京ICP备11001839号-1', '京网文 (2016) 6432-852号', '(京)-经营性-2016-0019', '京ICP证050897号', '京ICP备11001839号-1']
/text() returns only the direct text of the tag: {'\xa0-\xa0\r\n ', '\r\n ', '©\xa02004-2020\xa0Sogou.com\xa0/\xa0\r\n ', '\r\n ', '\r\n ', '\r\n ', '\r\n ', '\xa0/\xa0\r\n ', '\xa0/\xa0\r\n ', '\r\n ', '©\xa02004-2020\xa0Sogou.com\xa0/\xa0\r\n ', '\r\n '}
//text() returns all text under the tag: {'百科', '登录', '\r\n ', '党建', '(京)-经营性-2016-0019', '\r\n ', '免责声明', '明医', '视频', '”后面的文字被忽略,搜狗的查询限制在40个汉字以内。', '\r\n ', '搜狗输入法', '意见反馈及投诉', '购物', '©\xa02004-2020\xa0Sogou.com\xa0/\xa0\r\n ', '©\xa02004-2020\xa0Sogou.com\xa0/\xa0\r\n ', '\r\n ', '汉语', '关于搜狗', '\r\n ', '知识', '京网文 (2016) 6432-852号', '企业推广', '下载搜狗搜索APP', '地图', '隐私政策', '\r\n ', '\xa0-\xa0\r\n ', '全部', '京ICP备11001839号-1', '\r\n ', '图片', 'About Sogou', '微信', '学术', '京ICP证050897号', '369', '网上有害信息举报专区', '新闻', '英文', '显示卡片', '科学', '网址导航', '更多', '知乎', '网页', '翻译', '指数', '\xa0/\xa0\r\n ', '京公网安备11000002000025号', '\xa0/\xa0\r\n ', '“', '滚动查看更多', '浏览器', '问问', '应用'}
['京ICP证050897号', '(京)-经营性-2016-0019']
['京网文 (2016) 6432-852号', '京网文 (2016) 6432-852号']
['http://www.12377.cn',
 'http://corp.sogou.com/',
 'http://ir.sogou.com/',
 'http://b.sogou.com/',
 'http://www.sogou.com/docs/terms.htm?v=1',
 'http://fankui.help.sogou.com/index.php/web/web/index/type/4',
 'http://corp.sogou.com/private.html',
 ,
 'http://fankui.help.sogou.com/index.php/web/web/index/type/4',
 'http://corp.sogou.com/private.html',
 'http://www.12377.cn']
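The text() results above are full of whitespace-only strings and non-breaking spaces (\xa0). One possible cleanup, shown here as a sketch rather than as part of the original example, is to strip each string and drop the empty ones. The markup is made up to mimic the kind of whitespace seen in the output.

```python
from lxml import etree

# Hypothetical markup with line breaks and non-breaking spaces between tags.
html = "<html><body><div>\r\n  <a>Login</a>\xa0-\xa0\r\n  <a>Web</a>\r\n</div></body></html>"
tree = etree.HTML(html)

raw = tree.xpath("//div//text()")
# str.strip() removes surrounding whitespace, including \xa0;
# strings that end up empty are discarded.
cleaned = [s.strip() for s in raw if s.strip()]
print(cleaned)  # ['Login', '-', 'Web']
```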