lxml: Extracting the Content of HTML Tags


Sun_Sherry


Article link: https://blog.csdn.net/yeshang_lady/article/details/122370152


lxml offers several ways to extract the text content of HTML tags. This post focuses on how those methods differ.

# Only etree is needed; the original `import lxml` and `import collections` were unused.
from lxml import etree

doc='''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <h5>test</h5>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
   <p>hello world hello world <strong> hello world,hello world</strong>你好啊，李银河</p>
  </div>
 </body>
</html>
'''
# Parse the document; etree.HTML returns the root <html> element.
html=etree.HTML(doc)
tree=html.getroottree()
# Collect every element together with its absolute XPath.
all_nodes=html.xpath('//*')
xpath=[tree.getpath(node) for node in all_nodes]

# 1) node.text: only the text before the node's first child element.
print('==============node.text=====================')
for node,path in zip(all_nodes,xpath):
    print('{}:  {}'.format(path,node.text))
# 2) node.itertext(): all text inside the node, descendants included.
print('==============node.itertext=====================')
for node,path in zip(all_nodes,xpath):
    print('{}:  {}'.format(path,''.join(node.itertext())))
# 3) XPath //text(): every descendant text node under the node's path.
print('==============xpath=====================')
for node,path in zip(all_nodes,xpath):
    print('{}:  {}'.format(path,''.join(html.xpath(path+'//text()'))))

Output of node.text:

==============node.text=====================
/html:  
 
/html/head:  
  
/html/head/base:  None
/html/head/title:  Example website
/html/body:  
  
/html/body/div:  
   
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  None
/html/body/div/a[1]/img:  None
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  None
/html/body/div/a[2]/img:  None
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  None
/html/body/div/a[3]/img:  None
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  None
/html/body/div/a[4]/img:  None
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  None
/html/body/div/a[5]/img:  None
/html/body/div/a[6]:  None
/html/body/div/a[6]/span:  None
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  None
/html/body/div/a[6]/img:  None
/html/body/div/p:  hello world hello world 
/html/body/div/p/strong:   hello world,hello world
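
The output above never shows the text 你好啊，李银河 that follows </strong>, because lxml stores text that comes after a child element on that child's tail attribute rather than on the parent's text. The following is a minimal sketch on a shortened, made-up fragment (the variable names p, strong and own_text are illustrative and not from the original post):

```python
from lxml import etree

# A small fragment modeled on the <p> element of the sample document.
p = etree.HTML('<p>hello <strong>world</strong> again</p>').xpath('//p')[0]
strong = p[0]  # first child element of <p>

print(repr(p.text))       # 'hello '  -> text before the first child
print(repr(strong.text))  # 'world'   -> text inside <strong>
print(repr(strong.tail))  # ' again'  -> text after </strong>, kept on the child

# Text that belongs directly to <p>, excluding text inside its children:
own_text = (p.text or '') + ''.join(child.tail or '' for child in p)
print(repr(own_text))     # 'hello  again'
```

The `or ''` guards matter because both text and tail are None when no text is present, as the None entries in the output above show.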

Output of node.itertext:

==============node.itertext=====================
/html:  
 
  
  Example website
 
 
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊，李银河
  
 

/html/head:  
  
  Example website
 
/html/head/base:  
/html/head/title:  Example website
/html/body:  
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊，李银河
  
 
/html/body/div:  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊，李银河
  
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  
/html/body/div/a[1]/img:  
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  
/html/body/div/a[2]/img:  
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  
/html/body/div/a[3]/img:  
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  
/html/body/div/a[4]/img:  
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  
/html/body/div/a[5]/img:  
/html/body/div/a[6]:  testName: My image 6 
/html/body/div/a[6]/span:  test
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  
/html/body/div/a[6]/img:  
/html/body/div/p:  hello world hello world  hello world,hello world你好啊，李银河
/html/body/div/p/strong:   hello world,hello world
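
As the output above shows, joining itertext() keeps the newlines and indentation that sit between tags. One possible way to tidy the result, sketched here on a made-up fragment (the names raw, clean and chunks are illustrative), is to collapse whitespace after joining or to strip each piece separately:

```python
from lxml import etree

root = etree.HTML('<div>\n <a>Name: My image 1 <br/></a>\n <h5>test</h5>\n</div>')
div = root.xpath('//div')[0]

# Raw result: includes the whitespace between the tags.
raw = ''.join(div.itertext())

# Option 1: collapse every run of whitespace into a single space.
clean = ' '.join(raw.split())
print(clean)   # Name: My image 1 test

# Option 2: keep one entry per non-empty piece of text.
chunks = [t.strip() for t in div.itertext() if t.strip()]
print(chunks)  # ['Name: My image 1', 'test']
```

Option 2 keeps the boundaries between pieces of text, which avoids results such as testName: My image 6 where adjacent texts run together.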

Output of the XPath method:

==============xpath=====================
/html:  
 
  
  Example website
 
 
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊，李银河
  
 

/html/head:  
  
  Example website
 
/html/head/base:  
/html/head/title:  Example website
/html/body:  
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊，李银河
  
 
/html/body/div:  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊，李银河
  
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  
/html/body/div/a[1]/img:  
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  
/html/body/div/a[2]/img:  
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  
/html/body/div/a[3]/img:  
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  
/html/body/div/a[4]/img:  
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  
/html/body/div/a[5]/img:  
/html/body/div/a[6]:  testName: My image 6 
/html/body/div/a[6]/span:  test
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  
/html/body/div/a[6]/img:  
/html/body/div/p:  hello world hello world  hello world,hello world你好啊，李银河
/html/body/div/p/strong:   hello world,hello world
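
The XPath query does not have to be built from an absolute path. As a sketch on a made-up fragment (variable names are illustrative), an expression can also be evaluated relative to a node, and the XPath string() function returns the concatenated text directly without a join:

```python
from lxml import etree

root = etree.HTML('<p>hello <strong>world</strong> again</p>')
p = root.xpath('//p')[0]

# All descendant text nodes, relative to this node.
parts = p.xpath('.//text()')
print(parts)    # ['hello ', 'world', ' again']

# Only the text nodes that are direct children of <p>.
direct = p.xpath('./text()')
print(direct)   # ['hello ', ' again']

# string(.) concatenates all descendant text into one string.
joined = p.xpath('string(.)')
print(joined)   # hello world again
```

Note that ./text() includes the text after </strong> but not the text inside it, which is different from what node.text returns.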

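Summary:

node.text does not include the content of the node's child nodes: it returns only the text that precedes the first child element, or None if there is no such text. Text that follows a child element, such as 你好啊，李银河 after </strong>, is stored in that child's tail, so it appears in neither node's text.
node.itertext and the XPath //text() query both include the text of all descendant nodes, and for this document the two return identical text.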
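
The agreement between the two methods can also be checked programmatically instead of by comparing printed output. This is a minimal sketch on a shortened fragment modeled on the sample document (the variable names and the fragment itself are assumptions for illustration):

```python
from lxml import etree

fragment = ("<div><a><span><h5>test</h5></span>Name: My image 6 <br/></a>"
            "<p>a<strong>b</strong>c</p></div>")
html = etree.HTML(fragment)
tree = html.getroottree()

checked = 0
for node in html.iter():
    path = tree.getpath(node)
    via_itertext = ''.join(node.itertext())
    via_xpath = ''.join(html.xpath(path + '//text()'))
    # Both approaches should yield the same text for every element here.
    assert via_itertext == via_xpath, path
    checked += 1

print('checked', checked, 'nodes')
```

The check only covers a fragment made of elements and text, so it supports the conclusion for documents of that kind.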
