I. XPath basics
1. Common rules
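Expression | Description |
---|---|
nodename (a node name) | Selects all child nodes of the named node |
/ | Selects direct children of the current node |
// | Selects descendants of the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |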
A quick tip:
//nodename[@attribute='value']
selects every node with the given name whose attribute equals the given value.
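To make the tip concrete, here is a minimal sketch using lxml. The sample markup and the variable names are made up for illustration and are not part of the original notes.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# Every li node whose class attribute is exactly "item-0".
matched = html.xpath('//li[@class="item-0"]')
print(len(matched))  # 2

# Ending the path with @href returns attribute values instead of elements.
hrefs = html.xpath('//li[@class="item-0"]/a/@href')
print(hrefs)  # ['link1.html', 'link3.html']
```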
2. All nodes
An XPath rule that starts with // is normally used to select every matching node in the document.
//nodename
3. Child and descendant nodes
Child nodes: /
Descendant nodes: //
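The difference between / and // is easiest to see side by side. The sketch below assumes lxml and a small made-up document in which each a node sits inside an li, so it is a descendant of ul but not a direct child.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = ('<div><ul>'
        '<li><a href="link1.html">first</a></li>'
        '<li><a href="link2.html">second</a></li>'
        '</ul></div>')
html = etree.HTML(text)

children = html.xpath('//ul/li')  # li nodes are direct children of ul
direct = html.xpath('//ul/a')     # a is not a direct child of ul
nested = html.xpath('//ul//a')    # a is a descendant of ul

print(len(children), len(direct), len(nested))  # 2 0 2
```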
4. Parent node
Example: select the class attribute of the parent of the a node whose href attribute is "zz"
html.xpath('//a[@href="zz"]/../@class')
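A runnable variant of the same idea, as a sketch: the markup and the href value below are invented for the example. The parent::* axis is the long form of .. in XPath 1.0, shown here only for comparison.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
text = '<ul><li class="item-1"><a href="link2.html">second item</a></li></ul>'
html = etree.HTML(text)

# ".." steps from the a node up to its parent li, then @class reads the attribute.
via_dots = html.xpath('//a[@href="link2.html"]/../@class')

# The same step written with the explicit parent axis.
via_axis = html.xpath('//a[@href="link2.html"]/parent::*/@class')

print(via_dots)  # ['item-1']
print(via_axis)  # ['item-1']
```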
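5. Getting text
/text(): returns the text directly inside the node
//text(): returns the text inside the node and, if the node contains other nodes, the text of those nested nodes as well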
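A small sketch of the two forms, assuming lxml and a single li that holds both direct text and a nested span (the markup is a trimmed copy of one list item used later in these notes):

```python
from lxml import etree

text = ('<ul><li class="item-0">wangqian'
        '<a href="link3.html"><span class="bold">third item</span></a>'
        '</li></ul>')
html = etree.HTML(text)

# /text() only sees text that is a direct child of li.
direct_text = html.xpath('//li/text()')

# //text() also walks into the nested a and span nodes.
all_text = html.xpath('//li//text()')

print(direct_text)  # ['wangqian']
print(all_text)     # ['wangqian', 'third item']
```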
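6. Multi-valued attributes and matching on several attributes
Attribute with several values (use contains()): .xpath('//li[contains(@class, "li")]/a/text()')
Several attributes at once (join conditions with and): .xpath('//li[contains(@class, "li") and @name = "item"]/a/text()')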
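The sketch below shows why contains() is needed when an attribute holds more than one value. It assumes lxml; the li with class "li li-first" and name "item" is a made-up element for illustration.

```python
from lxml import etree

# Hypothetical sample markup: the class attribute holds two values.
text = '<ul><li class="li li-first" name="item"><a href="link.html">first item</a></li></ul>'
html = etree.HTML(text)

# An exact comparison fails because the full attribute value is "li li-first".
exact = html.xpath('//li[@class="li"]/a/text()')

# contains() matches as soon as "li" appears somewhere in the value.
partial = html.xpath('//li[contains(@class, "li")]/a/text()')

# "and" combines the class test with a test on a second attribute.
both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')

print(exact)    # []
print(partial)  # ['first item']
print(both)     # ['first item']
```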
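II. parsel
1. Extracting text
get(): returns the content of the first Selector in the iterable result (a SelectorList)
getall(): returns the content of every Selector in the result, as a list
- CSS syntax
*: matches every element node
*::text: selects all text nodes under the matched context, including text nested in child nodes
nodename::text: selects the text directly inside the named node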
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active">wangqian<a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html)
# css() returns an iterable SelectorList; get() returns the content of its first Selector
items = selector.css('.item-0 *::text').get()
# items = selector.css('.item-0 *::text').getall()
print(items)
get(): first item
getall(): ['first item', 'wangqian', 'third item', 'fifth item']
- XPath syntax
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active">wangqian<a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html)
items = selector.xpath('//li[contains(@class, "item-0")]//text()').getall()
print(items)
['first item', 'wangqian', 'third item', 'fifth item']
2. Extracting attributes
- With CSS
nodename::attr(attribute-name)
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active">wangqian<a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html)
items = selector.css(".item-0.active a::attr(href)").get()
print(items)
link3.html
- With XPath
/@attribute-name
html = '''
<div>
<ul>
<li class="item-0 item-1">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active">wangqian<a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html)
items = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get()
print(items)
link3.html
3. Extracting with regular expressions
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html)
result = selector.css('.item-0').re('link.*')
print(result)
['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from parsel import Selector
selector = Selector(text=html)
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result)
third item
Reference: 《Python3网络爬虫开发实战 第二版》 (Python 3 Web Crawler Development in Practice, 2nd edition)