lxml: Extracting the text content of HTML tags

  lxml offers several ways to extract the content of HTML tags. This post focuses on how those methods differ.

from lxml import etree

doc='''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <h5>test</h5>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
   <p>hello world hello world <strong> hello world,hello world</strong>你好啊,李银河</p>
  </div>
 </body>
</html>
'''
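# etree.HTML parses the string and returns the root <html> element.
html = etree.HTML(doc)
tree = html.getroottree()
# Collect every element and its absolute XPath, in document order.
all_nodes = html.xpath('//*')
xpath = [tree.getpath(node) for node in all_nodes]

print('==============node.text method=====================')
for node, path in zip(all_nodes, xpath):
    # Only the text before the node's first child element.
    print('{}:  {}'.format(path, node.text))
print('==============node.itertext method=====================')
for node, path in zip(all_nodes, xpath):
    # All text fragments inside the node, in document order.
    print('{}:  {}'.format(path, ''.join(node.itertext())))
print('==============xpath method=====================')
for node, path in zip(all_nodes, xpath):
    # All descendant text nodes, selected with an XPath expression.
    print('{}:  {}'.format(path, ''.join(html.xpath(path + '//text()'))))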

Output of node.text:

==============node.text method=====================
/html:  
 
/html/head:  
  
/html/head/base:  None
/html/head/title:  Example website
/html/body:  
  
/html/body/div:  
   
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  None
/html/body/div/a[1]/img:  None
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  None
/html/body/div/a[2]/img:  None
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  None
/html/body/div/a[3]/img:  None
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  None
/html/body/div/a[4]/img:  None
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  None
/html/body/div/a[5]/img:  None
/html/body/div/a[6]:  None
/html/body/div/a[6]/span:  None
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  None
/html/body/div/a[6]/img:  None
/html/body/div/p:  hello world hello world 
/html/body/div/p/strong:   hello world,hello world
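
In the output above, the line for /html/body/div/p stops before the trailing "你好啊,李银河". The minimal sketch below, which reuses only the <p> element from the sample document (the variable names are illustrative), shows where that text goes: lxml stores text that follows a child element on the child's .tail attribute rather than on the parent's .text.

```python
from lxml import etree

# The same <p> element as in the sample document, parsed on its own.
root = etree.HTML(
    "<p>hello world hello world <strong> hello world,hello world</strong>你好啊,李银河</p>"
)
p = root.xpath('//p')[0]
strong = p[0]  # first (and only) child element of <p>

p_text = p.text            # text before the first child element
strong_text = strong.text  # text inside <strong>
strong_tail = strong.tail  # text after </strong>, still inside <p>

print(repr(p_text))
print(repr(strong_text))
print(repr(strong_tail))
```

Concatenating the three pieces in this order gives the same string as ''.join(p.itertext()) for this element.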

Output of node.itertext:

==============node.itertext method=====================
/html:  
 
  
  Example website
 
 
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李银河
  
 

/html/head:  
  
  Example website
 
/html/head/base:  
/html/head/title:  Example website
/html/body:  
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李银河
  
 
/html/body/div:  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李银河
  
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  
/html/body/div/a[1]/img:  
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  
/html/body/div/a[2]/img:  
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  
/html/body/div/a[3]/img:  
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  
/html/body/div/a[4]/img:  
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  
/html/body/div/a[5]/img:  
/html/body/div/a[6]:  testName: My image 6 
/html/body/div/a[6]/span:  test
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  
/html/body/div/a[6]/img:  
/html/body/div/p:  hello world hello world  hello world,hello world你好啊,李银河
/html/body/div/p/strong:   hello world,hello world
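
The itertext output above keeps all the indentation and newlines from the source HTML. If that whitespace is unwanted, one option is to strip each fragment before joining. The sketch below is only an illustration on a shortened fragment of the sample document; clean_text is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# A shortened fragment of the sample document, including its line breaks.
root = etree.HTML("<div>\n  <a>Name: My image 1 <br/></a>\n  <h5>test</h5>\n</div>")
div = root.xpath('//div')[0]


def clean_text(node):
    """Join the non-blank text fragments of node with single spaces."""
    parts = (fragment.strip() for fragment in node.itertext())
    return ' '.join(part for part in parts if part)


raw = ''.join(div.itertext())  # keeps indentation and newlines
cleaned = clean_text(div)      # whitespace-only fragments dropped

print(repr(raw))
print(repr(cleaned))
```

Joining with a single space is a design choice: it stops adjacent fragments from running together, as "testName: My image 6" does in the output above, at the cost of inserting a space where the source had none.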

Output of the xpath method:

==============xpath method=====================
/html:  
 
  
  Example website
 
 
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李银河
  
 

/html/head:  
  
  Example website
 
/html/head/base:  
/html/head/title:  Example website
/html/body:  
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李银河
  
 
/html/body/div:  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李银河
  
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  
/html/body/div/a[1]/img:  
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  
/html/body/div/a[2]/img:  
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  
/html/body/div/a[3]/img:  
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  
/html/body/div/a[4]/img:  
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  
/html/body/div/a[5]/img:  
/html/body/div/a[6]:  testName: My image 6 
/html/body/div/a[6]/span:  test
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  
/html/body/div/a[6]/img:  
/html/body/div/p:  hello world hello world  hello world,hello world你好啊,李银河
/html/body/div/p/strong:   hello world,hello world
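
The script builds an absolute path for each node and queries from the root. As a hedged alternative, the same text can be obtained by calling xpath on the node itself with a relative expression, or with the XPath string() function, which returns a single concatenated string instead of a list. The small <p> fragment below is made up for illustration.

```python
from lxml import etree

root = etree.HTML("<p>hello <strong>big</strong> world</p>")
p = root.xpath('//p')[0]

# Relative query: a list of text nodes under <p>, in document order.
fragments = p.xpath('.//text()')
joined = ''.join(fragments)

# XPath string-value of the node: already one string, no join needed.
as_string = p.xpath('string(.)')

# Each result is a "smart string" that knows whether it came from .text or .tail.
origins = [(str(f), f.is_tail) for f in fragments]

print(fragments)
print(as_string)
print(origins)
```

The is_tail flag makes the text/tail split visible: here ' world' is reported as the tail of <strong>, not as text of <p>.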

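Summary:

  1. node.text returns only the text that precedes the node's first child element. It includes neither the text inside child elements nor the text that follows them (lxml stores that on the child's .tail), which is why the node.text line for /html/body/div/p is missing "你好啊,李银河".
  2. node.itertext and the xpath method (path + '//text()') both include the text of all descendant nodes, and for this document the two methods return identical text.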
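
The second point can be checked mechanically rather than by comparing the two dumps by eye. The sketch below does this on a small made-up fragment that contains only elements and text. It is a sanity check for this kind of input, not a proof that the two methods agree on every document.

```python
from lxml import etree

# Hypothetical sample with nested elements, empty elements and tail text.
sample = "<div><a>one <br/><img/></a><p>two <strong>three</strong> four</p></div>"
root = etree.HTML(sample)

mismatches = []
for node in root.iter():
    via_itertext = ''.join(node.itertext())
    via_xpath = ''.join(node.xpath('.//text()'))
    if via_itertext != via_xpath:
        mismatches.append(node.tag)

print(mismatches)
```

An empty list means both methods produced the same text for every element in the sample.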