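1. Using CSS selectors
Parsel can locate HTML elements with CSS selectors through the `css` method of the `Selector` class. The `::text` pseudo-element selects an element's text node, and `get()` returns the first match as a string.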
```python
from parsel import Selector
html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""
selector = Selector(text=html)
h1_text = selector.css('h1::text').get()
p_text = selector.css('p::text').get()
print(h1_text) # Hello World!
print(p_text) # This is an example paragraph.
```
2. Using XPath expressions
Besides CSS selectors, Parsel can locate HTML elements with XPath expressions through the `xpath` method of the `Selector` class.
```python
from parsel import Selector
html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""
selector = Selector(text=html)
h1_text = selector.xpath('//h1/text()').get()
p_text = selector.xpath('//p/text()').get()
print(h1_text) # Hello World!
print(p_text) # This is an example paragraph.
```
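3. Using regular expressions
Parsel can also extract content with regular expressions through the `re` method, which is available on both `Selector` and `SelectorList`. It returns a list of strings; when the pattern contains a capture group, only the captured text is returned.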
```python
from parsel import Selector
html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""
selector = Selector(text=html)
h1_text = selector.css('h1::text').re(r'Hello\s(.*)!')[0]
p_text = selector.css('p::text').re(r'This\s(.*)\.')[0]
print(h1_text) # World
print(p_text) # is an example paragraph
```
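4. Handling multiple pages
Crawlers often process several pages. `SelectorList` is the list type that `css` and `xpath` return; it can also be built by hand from one `Selector` per page, so that a single query runs against all of them.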
```python
from parsel import Selector, SelectorList
html1 = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""
html2 = """
<html>
<body>
<h1>Goodbye World!</h1>
<p>This is another example paragraph.</p>
</body>
</html>
"""
selector_list = SelectorList([Selector(text=html1), Selector(text=html2)])
h1_texts = selector_list.css('h1::text').getall()
p_texts = selector_list.css('p::text').getall()
print(h1_texts) # ['Hello World!', 'Goodbye World!']
print(p_texts) # ['This is an example paragraph.', 'This is another example paragraph.']
```
5. Post-processing results with a function
Parsel does not provide a callback hook: `get()` accepts only an optional `default` argument. To post-process extracted values, pass them to an ordinary function.
```python
from parsel import Selector
html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""
selector = Selector(text=html)
def process_text(text):
    # Upper-case the extracted string.
    return text.upper()
# Apply the function to the value returned by get().
h1_text = process_text(selector.css('h1::text').get())
p_text = process_text(selector.css('p::text').get())
print(h1_text) # HELLO WORLD!
print(p_text) # THIS IS AN EXAMPLE PARAGRAPH.
```
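6. Using ItemLoader
Crawlers often need to save extracted data to a database or a file. The `ItemLoader` class, which comes from Scrapy rather than Parsel, collects extracted values into an item object that can then be stored.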
```python
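from dataclasses import dataclass
from typing import Optional
# In recent Scrapy versions the processors live in the itemloaders package.
from itemloaders.processors import MapCompose, TakeFirst
from parsel import Selector
from scrapy.loader import ItemLoader
@dataclass
class MyItem:
    # Default values let the loader start from an empty item.
    h1_text: Optional[str] = None
    p_text: Optional[str] = None
html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""
selector = Selector(text=html)
item_loader = ItemLoader(item=MyItem(), selector=selector)
# Collected values are stored as lists; TakeFirst as the output
# processor makes each loaded field a single string.
item_loader.default_output_processor = TakeFirst()
item_loader.add_css('h1_text', 'h1::text', MapCompose(str.strip), TakeFirst())
item_loader.add_css('p_text', 'p::text', MapCompose(str.strip), TakeFirst())
my_item = item_loader.load_item()
print(my_item.h1_text) # Hello World!
print(my_item.p_text) # This is an example paragraph.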
```
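Here, the `MyItem` class holds the extracted data, and `ItemLoader` populates a `MyItem` instance. The `add_css` method declares how each field is extracted from the HTML, `MapCompose` applies a function to every extracted value, and `TakeFirst` returns the first non-empty value. Finally, `load_item` returns the populated `MyItem` instance.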