Common XPath Functions and Tips

Latest recommended article published on 2023-07-13 19:49:35

songhao8080


Article link: https://blog.csdn.net/songhao8080/article/details/103670331


Python extraction speed comparison

Install lxml

pip install lxml

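For installation on Windows, see: http://www.168seo.cn/python/2796.html

Also worth sharing, an XPath extension for Chrome: http://www.168seo.cn/python/23615.html

First we import the etree module from lxml, then initialize the document with etree.HTML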


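from lxml import etree

html = etree.HTML(html)  # parse the HTML string into an element tree
for i in html.xpath(u"XPath expression"):  # placeholder: put the XPath expression here
    print(i)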
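As a concrete illustration of the initialization step, here is a minimal sketch. The `sample` markup and the variable names are made up for this example and are not from the original post.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
sample = """
<html><body>
  <a href="a.html">first</a>
  <a href="b.html">second</a>
</body></html>
"""

tree = etree.HTML(sample)        # parse the string into an element tree
links = tree.xpath("//a/@href")  # evaluate an XPath expression against it
for link in links:
    print(link)
```

`xpath()` returns a list, so the result can be iterated directly or indexed like any other Python list.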

Find all a links (the href attribute of each a): //a/@href

Find the text of all links: //a/text()

Find the div whose id is 'txt': //div[@id='txt']

Find td elements whose class attribute contains the string 'GridItem': //td[contains(@class, 'GridItem')]. This helps when class holds several values, e.g. <td class="GridItem td1">

Find font elements whose color attribute is '#0000ff' or 'blue': //font[(@color="#0000ff" or @color="blue")]

Find font elements with color='#0000ff' or span elements with style="COLOR: blue":

//font[@color="#0000ff"] | //span[@style="COLOR: blue"]

Use xpath to extract src / href links

html.xpath('//img/@src')
html.xpath('//img/@href')

Extract text

//h3[@class="t"]/a/text()
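The expressions listed above can be tried out together. The following is a rough sketch; the `sample` document is invented for this example so that each expression has something to match.

```python
from lxml import etree

# Hypothetical markup built to exercise the expressions listed above.
sample = (
    '<html><body>'
    '<div id="txt">'
    '<font color="blue">b</font>'
    '<font color="#0000ff">h</font>'
    '<span style="COLOR: blue">s</span>'
    '</div>'
    '<table><tr>'
    '<td class="GridItem td1">cell</td>'
    '<td class="other">skip</td>'
    '</tr></table>'
    '<h3 class="t"><a href="x.html">Title</a></h3>'
    '</body></html>'
)
tree = etree.HTML(sample)

# div selected by its id attribute
div_txt = tree.xpath("//div[@id='txt']")

# contains() matches even though class holds two values
grid_cells = tree.xpath("//td[contains(@class, 'GridItem')]/text()")

# "or" inside one predicate: either color value is accepted
fonts = tree.xpath('//font[(@color="#0000ff" or @color="blue")]/text()')

# "|" unions two independent node sets
union = tree.xpath('//font[@color="#0000ff"] | //span[@style="COLOR: blue"]')
union_tags = sorted(el.tag for el in union)

# text of the link inside the h3
title = tree.xpath('//h3[@class="t"]/a/text()')

print(grid_cells, fonts, union_tags, title)
```

Note the difference between `or` and `|`: `or` combines conditions on the same element, while `|` merges the results of two separate paths.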

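Extract all the text under multiple nested tags

html.xpath('string(.)')

With lxml this call already returns a plain string; when working with Scrapy selectors instead, append .extract()[0] to get the string.

See: http://www.168seo.cn/python/23613.html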
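To make the difference between `string(.)` and `text()` more concrete, here is a small sketch using lxml. The `sample` fragment and variable names are assumptions made for this example.

```python
from lxml import etree

# Hypothetical fragment with text spread across nested tags.
sample = '<div id="post"><p>Hello <b>big</b> world</p></div>'
tree = etree.HTML(sample)

node = tree.xpath('//div[@id="post"]')[0]

# string(.) concatenates every descendant text node into one string
full_text = node.xpath("string(.)")

# .//text() returns the same text nodes as a list of separate pieces
pieces = node.xpath(".//text()")

print(full_text)
print(pieces)
```

`string(.)` is convenient when the markup inside an element is irregular and only the combined text matters.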

2) Get the class of every li tag

result = html.xpath('//li/@class')
print(result)
# Output
# ['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3) Get the a tag under li whose href is link1.html

result = html.xpath('//li/a[@href="link1.html"]')
print(result)
# Output
# [<Element a at 0xc522a88>]

4) Get all span tags under li

A single / selects only direct children, and span is not a direct child of li, so a double slash is needed.

result = html.xpath('//li//span')
print(result)
# [<Element span at 0x10d698e18>]

5) Get every class attribute under li, excluding the class of li itself

result = html.xpath('//li/a//@class')
print(result)
# ['bold']

6) Get the href of the a in the last li

result = html.xpath('//li[last()]/a/@href')
print(result)
# ['link5.html']

7) Get the content of the second-to-last element

result = html.xpath('//li[last()-1]/a')
print(result[0].text)
# fourth item

8) Get the tag name of the element whose class is bold

result = html.xpath('//*[@class="bold"]')
print(result[0].tag)
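The numbered examples above do not include the HTML they were run against. The sketch below uses a reconstructed document that is consistent with the outputs shown. The markup in `text` is an assumption, not the author's original input.

```python
from lxml import etree

# Assumed input: reconstructed to match the outputs in the examples above.
text = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""
html = etree.HTML(text)

classes = html.xpath('//li/@class')                 # example 2
link1 = html.xpath('//li/a[@href="link1.html"]')    # example 3
spans = html.xpath('//li//span')                    # example 4
inner_classes = html.xpath('//li/a//@class')        # example 5
last_href = html.xpath('//li[last()]/a/@href')      # example 6
second_last = html.xpath('//li[last()-1]/a')[0].text  # example 7
bold_tag = html.xpath('//*[@class="bold"]')[0].tag  # example 8

print(classes)
print(inner_classes, last_href, second_last, bold_tag)
```

Element results such as `link1` and `spans` print as `<Element ... at 0x...>` with a memory address that varies between runs, so only their count and tag are stable.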

Some commonly used XPath functions

1. contains(): //div[contains(@id, 'in')] selects div nodes whose id contains 'in'.

2. text(): the text value of a node is not an attribute, for example <a class="baidu" href="http://www.baidu.com">baidu</a>, so use the text() function to match the node: //a[text()='baidu']

3. last(): returns the position number of the last node in the current context.

4. starts-with(): //div[starts-with(@id, 'in')] selects div nodes whose id attribute starts with 'in'.

5. not(): expresses negation. //input[@name='identity' and not(contains(@class, 'a'))] matches input nodes whose name is identity and whose class value does not contain a. not() is usually combined with functions that return true or false, such as contains() and starts-with(). One special case is worth noting: to match input nodes that have an id attribute, write //input[@id]; to match input nodes that do not have an id attribute, write //input[not(@id)].
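The five functions can be checked side by side with a short script. This is only a sketch; the `sample` markup, ids, and variable names are invented for this example.

```python
from lxml import etree

# Hypothetical markup chosen so each function matches a different subset.
sample = (
    '<html><body>'
    '<div id="intro">one</div>'
    '<div id="main">two</div>'
    '<div id="footer">three</div>'
    '<a class="baidu" href="http://www.baidu.com">baidu</a>'
    '<form>'
    '<input id="u" name="identity" class="a b">'
    '<input name="identity" class="c">'
    '<input name="other">'
    '</form>'
    '</body></html>'
)
tree = etree.HTML(sample)

# contains(): 'in' appears in both "intro" and "main"
with_in = tree.xpath("//div[contains(@id, 'in')]/@id")

# starts-with(): only "intro" begins with 'in'
starts_in = tree.xpath("//div[starts-with(@id, 'in')]/@id")

# text(): match a node by its text value
baidu_href = tree.xpath("//a[text()='baidu']/@href")

# last(): the last div among its siblings
last_div = tree.xpath("//div[last()]/@id")

# not() combined with contains()
no_a = tree.xpath("//input[@name='identity' and not(contains(@class, 'a'))]/@class")

# attribute present vs. attribute absent
has_id = tree.xpath("//input[@id]")
lacks_id = tree.xpath("//input[not(@id)]")

print(with_in, starts_in, baidu_href, last_div, no_a, len(has_id), len(lacks_id))
```

Because `contains()` is a plain substring test, it also matches "main" here; `starts-with()` is the stricter choice when only a prefix should match.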