Using XPath (a must-read for beginners)

Latest recommended article published on 2024-08-04 18:17:11

彡倾灬染|

Tags: python xpath web scraping

Article link: https://blog.csdn.net/qq_45830025/article/details/107558940


Steps:
1. Import etree from lxml (two ways)
First way:

from lxml import etree

Note: with the first way, your IDE may underline etree in red, but this does not affect usage.

Second way:

from lxml import html
etree = html.etree

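# Sample document used in the following examples (note: the name str shadows Python's built-in str)
str = '<html>' \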
            '<bookstore>' \
                '<book>' \
                    '<title lang="eng">Harry Potter</title>' \
                    '<price>29.99</price>' \
                '</book>' \
                '<book>' \
                    '<title lang="eng">Learning XML</title>' \
                    '<price>39.95</price>' \
                '</book>' \
            '</bookstore>' \
       '</html>'

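2. etree.HTML() parses the string into an HTML element object (this step is required before querying). It also fills in missing tags automatically; for example, it wraps the content in a body element.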

html = etree.HTML(str)
print(html)  # <Element html at 0x11524df0108>
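To see the tag completion in action, here is a minimal sketch. The fragment is hypothetical (it is not the article's sample): it has no html or body wrapper, and its book and bookstore tags are never closed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper, <book> and <bookstore> never closed.
fragment = '<bookstore><book><title>Harry Potter</title>'

root = etree.HTML(fragment)

# Serialize the repaired tree to inspect which tags the parser added.
repaired = etree.tostring(root, encoding='unicode')
print(repaired)

# The wrapper elements now exist, so an absolute path through body works.
titles = root.xpath('/html/body/bookstore/book/title/text()')
print(titles)
```

The printed markup should contain the html and body wrappers plus the closing tags that the fragment left out.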

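3. Extract data with XPath
XPath locates elements by a path expression, and xpath() returns the matches as a list.
If nothing matches, it returns an empty list.
A path expression reads much like a file system path, for example: F:\Python夏令营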
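Because an unmatched query returns an empty list, indexing the result directly with [0] can raise IndexError. The sketch below shows one way to guard against that; first_or_default is a hypothetical helper written for this illustration, not part of lxml.

```python
from lxml import etree

root = etree.HTML(
    '<html><body><bookstore><book><title>Harry Potter</title></book>'
    '</bookstore></body></html>'
)

titles = root.xpath('//title/text()')    # one match
authors = root.xpath('//author/text()')  # no <author> element exists


def first_or_default(results, default=None):
    """Return the first XPath match, or default when the list is empty."""
    return results[0] if results else default


print(first_or_default(titles))
print(first_or_default(authors, 'unknown'))
```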

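There are two kinds of XPath paths:
First: /. At the start of an expression it denotes the root of the document; inside a path it descends exactly one level at a time, from a parent to its direct children.
Find the bookstore element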

bookstore = html.xpath('/html/bookstore')
print(bookstore)  # [] -> not found, because etree.HTML() inserted a body element
bookstore = html.xpath('/html/body/bookstore')
print(bookstore)  # [<Element bookstore at 0x...>]
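When an absolute path unexpectedly returns an empty list, it can help to ask lxml where the element actually ended up. The following is a small sketch using getroottree() and getpath(); the one-book document is a shortened stand-in for the sample above.

```python
from lxml import etree

root = etree.HTML(
    '<html><bookstore><book><title>Harry Potter</title></book></bookstore></html>'
)

# Find the element by tag name, without writing any path by hand.
bookstore = next(root.iter('bookstore'))

# Ask the tree for that element's absolute path.
tree = root.getroottree()
absolute_path = tree.getpath(bookstore)
print(absolute_path)

# The reported path can be passed back to xpath() as an absolute expression.
same = root.xpath(absolute_path)
```

The reported path includes the body step that the parser added, which is why the shorter path without it finds nothing.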

Second: //. It matches nodes at any depth, regardless of where they sit in the document.
Find the bookstore element

bookstore = html.xpath('//body/bookstore')  
print(bookstore)  # [<Element bookstore at 0x198c6960348>]

For example: get the book elements

book = html.xpath('//book')
print(book)  # [<Element book at 0x1f397370348>, <Element book at 0x1f397370388>]
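One pitfall worth illustrating: xpath() can also be called on an element returned by an earlier query, but an expression that starts with // still searches from the document root. Starting the expression with .// restricts the search to that element's descendants. A minimal sketch, using a shortened two-book document:

```python
from lxml import etree

doc = (
    '<html><body><bookstore>'
    '<book><title>Harry Potter</title></book>'
    '<book><title>Learning XML</title></book>'
    '</bookstore></body></html>'
)
root = etree.HTML(doc)

first_book = root.xpath('//book')[0]

# '//title' starts from the document root even when called on first_book.
whole_document = first_book.xpath('//title')

# './/title' starts from the context element (first_book) instead.
within_book = first_book.xpath('.//title')

print(len(whole_document), len(within_book))
```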

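text() gets the text content between a tag's opening and closing tags
For example: get the content of the title tags
Steps:

1. First select the title tags
2. Then select their text content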

title = html.xpath('//book/title')
print(title)  # [<Element title at 0x1e3b17b0448>, <Element title at 0x1e3b17b0488>]
title1 = html.xpath('//book/title/text()')
print(title1)  # ['Harry Potter', 'Learning XML']
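In scraping code, text() is often applied per element so that related fields stay together. The sketch below builds one record per book; the record layout and the float conversion of the price are assumptions made for illustration, and ./ means "direct child of the current element".

```python
from lxml import etree

doc = (
    '<html><body><bookstore>'
    '<book><title>Harry Potter</title><price>29.99</price></book>'
    '<book><title>Learning XML</title><price>39.95</price></book>'
    '</bookstore></body></html>'
)
root = etree.HTML(doc)

records = []
for book in root.xpath('//book'):
    # Each query is relative to the current book, so fields cannot get mixed up.
    title = book.xpath('./title/text()')
    price = book.xpath('./price/text()')
    records.append({
        'title': str(title[0]) if title else None,
        'price': float(price[0]) if price else None,
    })

print(records)
```

Querying per book avoids the misalignment that can occur when two separate document-wide lists are zipped together and one book is missing a field.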

str = 	'<html>' \
            '<bookstore>' \
                '<book>' \
                    '<title lang="ang">Harry Potter</title>' \
                    '<price>29.99</price>' \
                '</book>' \
                '<book>' \
                    '<title lang="eng">Learning XML</title>' \
                    '<price>39.95</price>' \
                '</book>' \
            '</bookstore>' \
       '</html>'
html = etree.HTML(str)

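Predicates: []
If the condition tests an attribute, prefix the attribute name with @
For example: get the text of the title tags whose lang attribute is eng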

title = html.xpath('//book/title[@lang="eng"]/text()')
print(title)  # ['Learning XML']
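Predicates are not limited to attribute tests. As a sketch of other standard XPath 1.0 predicates, the example below filters by position and by the numeric value of a child element; the threshold 35 is an arbitrary value chosen for illustration.

```python
from lxml import etree

doc = (
    '<html><body><bookstore>'
    '<book><title lang="ang">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="eng">Learning XML</title><price>39.95</price></book>'
    '</bookstore></body></html>'
)
root = etree.HTML(doc)

# Position predicates: XPath positions start at 1, not 0.
first_title = root.xpath('//book[1]/title/text()')
last_title = root.xpath('//book[last()]/title/text()')

# Value predicate: keep books whose price child is greater than 35.
expensive = root.xpath('//book[price>35]/title/text()')

print(first_title, last_title, expensive)
```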

str = 	'<html>' \
            '<bookstore>' \
                '<book>' \
                    '<title lang="ang">Harry Potter</title>' \
                    '<price>29.99</price>' \
                '</book>' \
                '<book>' \
                    '<title lang="eng" href="http://www.baidu.com">Learning XML</title>' \
                    '<price>39.95</price>' \
                '</book>' \
            '</bookstore>' \
       '</html>'
html = etree.HTML(str)

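Getting an attribute's value: /@ followed by the attribute name
For example: get the href attribute value of every title tag that has one
Steps:

1. First select the title tags
2. Then select the href attribute value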

title = html.xpath('//book/title/@href')
print(title)  # ['http://www.baidu.com']
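Note that /@href silently skips elements that lack the attribute, so the result list can be shorter than the list of elements. If you need one value per element, a possible alternative is the element's get() method, which returns None for a missing attribute. A minimal sketch:

```python
from lxml import etree

doc = (
    '<html><body><bookstore>'
    '<book><title lang="ang">Harry Potter</title></book>'
    '<book><title lang="eng" href="http://www.baidu.com">Learning XML</title></book>'
    '</bookstore></body></html>'
)
root = etree.HTML(doc)

# XPath attribute step: only elements that carry href contribute a value.
hrefs_xpath = root.xpath('//book/title/@href')

# Element.get(): one entry per title, None where the attribute is absent.
hrefs_get = [title.get('href') for title in root.xpath('//book/title')]

# A bare [@href] predicate selects elements that have the attribute at all.
linked_titles = root.xpath('//title[@href]/text()')

print(hrefs_xpath, hrefs_get, linked_titles)
```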
