lxml offers several ways to extract the text inside HTML tags. This post focuses on how those methods differ.
from lxml import etree
doc='''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<h5>test</h5>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
<a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
<p>hello world hello world <strong> hello world,hello world</strong>你好啊,李银河</p>
</div>
</body>
</html>
'''
# Parse the HTML string; etree.HTML returns the root <html> element.
html=etree.HTML(doc)
tree=html.getroottree()
# Collect every element together with its absolute XPath.
all_nodes=html.xpath('//*')
xpath=[]
for node in all_nodes:
    xpath.append(tree.getpath(node))
print('==============node.text method=====================')
for node,path in zip(all_nodes,xpath):
    print('{}: {}'.format(path,node.text))
print('==============node.itertext method=====================')
for node,path in zip(all_nodes,xpath):
    print('{}: {}'.format(path,''.join(node.itertext())))
print('==============xpath method=====================')
for node,path in zip(all_nodes,xpath):
    print('{}: {}'.format(path,''.join(html.xpath(path+'//text()'))))
The node.text output is:
==============node.text method=====================
/html:
/html/head:
/html/head/base: None
/html/head/title: Example website
/html/body:
/html/body/div:
/html/body/div/a[1]: Name: My image 1
/html/body/div/a[1]/br: None
/html/body/div/a[1]/img: None
/html/body/div/h5: test
/html/body/div/a[2]: Name: My image 2
/html/body/div/a[2]/br: None
/html/body/div/a[2]/img: None
/html/body/div/a[3]: Name: My image 3
/html/body/div/a[3]/br: None
/html/body/div/a[3]/img: None
/html/body/div/a[4]: Name: My image 4
/html/body/div/a[4]/br: None
/html/body/div/a[4]/img: None
/html/body/div/a[5]: Name: My image 5
/html/body/div/a[5]/br: None
/html/body/div/a[5]/img: None
/html/body/div/a[6]: None
/html/body/div/a[6]/span: None
/html/body/div/a[6]/span/h5: test
/html/body/div/a[6]/br: None
/html/body/div/a[6]/img: None
/html/body/div/p: hello world hello world
/html/body/div/p/strong: hello world,hello world
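The None for a[6] and the missing trailing text in p both come from how lxml stores text: node.text holds only the text before the first child, while text that follows a child element is kept on that child's tail attribute. The following is a minimal sketch on a smaller fragment than the document above; the fragment and the variable names are illustrative assumptions, not part of the original example.

```python
from lxml import etree

# A small fragment modelled on the <p> and the sixth <a> of the example document.
root = etree.HTML("<div><p>hello <strong>world</strong>你好</p>"
                  "<a><span>test</span>Name</a></div>")
p = root.xpath('//p')[0]
strong = p[0]
a = root.xpath('//a')[0]

p_text = p.text            # only the text before the first child
strong_tail = strong.tail  # text after </strong> is stored on the child's tail
a_text = a.text            # <a> starts with a child element, so no leading text

print(repr(p_text), repr(strong_tail), repr(a_text))
```

Reading text and tail separately is useful when the position of each text fragment matters; when only the combined text is needed, the two methods shown next are usually more convenient.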
The node.itertext output is:
==============node.itertext method=====================
/html:
Example website
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李银河
/html/head:
Example website
/html/head/base:
/html/head/title: Example website
/html/body:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李银河
/html/body/div:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李银河
/html/body/div/a[1]: Name: My image 1
/html/body/div/a[1]/br:
/html/body/div/a[1]/img:
/html/body/div/h5: test
/html/body/div/a[2]: Name: My image 2
/html/body/div/a[2]/br:
/html/body/div/a[2]/img:
/html/body/div/a[3]: Name: My image 3
/html/body/div/a[3]/br:
/html/body/div/a[3]/img:
/html/body/div/a[4]: Name: My image 4
/html/body/div/a[4]/br:
/html/body/div/a[4]/img:
/html/body/div/a[5]: Name: My image 5
/html/body/div/a[5]/br:
/html/body/div/a[5]/img:
/html/body/div/a[6]: testName: My image 6
/html/body/div/a[6]/span: test
/html/body/div/a[6]/span/h5: test
/html/body/div/a[6]/br:
/html/body/div/a[6]/img:
/html/body/div/p: hello world hello world hello world,hello world你好啊,李银河
/html/body/div/p/strong: hello world,hello world
The xpath output is:
==============xpath method=====================
/html:
Example website
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李银河
/html/head:
Example website
/html/head/base:
/html/head/title: Example website
/html/body:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李银河
/html/body/div:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李银河
/html/body/div/a[1]: Name: My image 1
/html/body/div/a[1]/br:
/html/body/div/a[1]/img:
/html/body/div/h5: test
/html/body/div/a[2]: Name: My image 2
/html/body/div/a[2]/br:
/html/body/div/a[2]/img:
/html/body/div/a[3]: Name: My image 3
/html/body/div/a[3]/br:
/html/body/div/a[3]/img:
/html/body/div/a[4]: Name: My image 4
/html/body/div/a[4]/br:
/html/body/div/a[4]/img:
/html/body/div/a[5]: Name: My image 5
/html/body/div/a[5]/br:
/html/body/div/a[5]/img:
/html/body/div/a[6]: testName: My image 6
/html/body/div/a[6]/span: test
/html/body/div/a[6]/span/h5: test
/html/body/div/a[6]/br:
/html/body/div/a[6]/img:
/html/body/div/p: hello world hello world hello world,hello world你好啊,李银河
/html/body/div/p/strong: hello world,hello world
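Summary:
- node.text returns only the text that precedes the element's first child. It does not include text inside child elements, and it is None when the element has no leading text (for example a[6] and br above).
- node.itertext() and the XPath //text() query both include the text of all descendant elements. For this document the two methods return identical text for every node.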
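The second point can be checked programmatically instead of comparing the printed output by eye. The sketch below assumes a shortened stand-in for doc (the sample string is illustrative, not the full document) and compares both results for every element. It also uses the relative path './/text()' on each node, which avoids rebuilding an absolute path with getpath.

```python
from lxml import etree

# Shortened stand-in for the doc string used in this post (an assumption).
sample = '''
<div id='images'>
<a href='image6.html'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
<p>hello world <strong> hello world,hello world</strong>你好啊,李银河</p>
</div>
'''
root = etree.HTML(sample)

mismatches = []
for node in root.iter():
    via_itertext = ''.join(node.itertext())
    # './/text()' selects every text node below the current element.
    via_xpath = ''.join(node.xpath('.//text()'))
    if via_itertext != via_xpath:
        mismatches.append(node.tag)

print('mismatches:', mismatches)
```

An empty list here only confirms the equivalence for this sample; documents containing other node types, such as comments, are worth checking separately.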