Advanced Usage of parsel

微剑

Category: Python web scraping. Tags: python, html, web scraping

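Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.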

Article link: https://blog.csdn.net/weixin_59246157/article/details/129870840

1. Using CSS selectors

Parsel supports locating HTML elements with CSS selectors, through the `css` method of the `Selector` class.

```python
from parsel import Selector

html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""

selector = Selector(text=html)

# ::text selects the text node of the matched element
h1_text = selector.css('h1::text').get()
p_text = selector.css('p::text').get()

print(h1_text) # Hello World!
print(p_text) # This is an example paragraph.
```

2. Using XPath expressions

In addition to CSS selectors, Parsel supports XPath expressions for locating HTML elements, through the `xpath` method of the `Selector` class.

```python
from parsel import Selector

html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""

selector = Selector(text=html)

# text() selects the text node of the matched element
h1_text = selector.xpath('//h1/text()').get()
p_text = selector.xpath('//p/text()').get()

print(h1_text) # Hello World!
print(p_text) # This is an example paragraph.
```

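3. Using regular expressions

Parsel can also extract content with regular expressions, through the `re` method. It is available on `Selector` and on the `SelectorList` returned by `css` and `xpath`. When the pattern contains a capture group, `re` returns the captured text rather than the whole match.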

```python
from parsel import Selector

html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""

selector = Selector(text=html)

# re() returns a list of strings; [0] takes the first captured group
h1_text = selector.css('h1::text').re(r'Hello\s(.*)!')[0]
p_text = selector.css('p::text').re(r'This\s(.*)\.')[0]

print(h1_text) # World
print(p_text) # is an example paragraph
```

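4. Handling multiple pages

A crawler often has to process several pages. `SelectorList` is the list type that `css` and `xpath` return, but it can also be built directly from the `Selector` of each page, so that a single query runs against all of them.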

```python
# Selector must be imported as well, since it is used to build the list
from parsel import Selector, SelectorList

html1 = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""

html2 = """
<html>
<body>
<h1>Goodbye World!</h1>
<p>This is another example paragraph.</p>
</body>
</html>
"""

selector_list = SelectorList([Selector(text=html1), Selector(text=html2)])

# The query runs on every selector in the list and the results are merged
h1_texts = selector_list.css('h1::text').getall()
p_texts = selector_list.css('p::text').getall()

print(h1_texts) # ['Hello World!', 'Goodbye World!']
print(p_texts) # ['This is an example paragraph.', 'This is another example paragraph.']
```

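5. Post-processing results with a function

Parsel does not provide a `callback` method, and `get()` does not accept a `callback` argument. To post-process extracted results, pass the value returned by `get()` to an ordinary function.

```python
from parsel import Selector

html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""

selector = Selector(text=html)

def process_text(text):
    # Any post-processing can go here; this example upper-cases the text
    return text.upper()

# Apply the function to the extracted value
h1_text = process_text(selector.css('h1::text').get())
p_text = process_text(selector.css('p::text').get())

print(h1_text) # HELLO WORLD!
print(p_text) # THIS IS AN EXAMPLE PARAGRAPH.
```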

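6. Using ItemLoader

A crawler usually needs to collect the extracted data into a structured item before saving it to a database or a file. The `ItemLoader` class, which belongs to Scrapy rather than Parsel, can populate such an item from a Parsel `Selector`.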

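```python
from dataclasses import dataclass
from typing import Optional

from itemloaders.processors import MapCompose, TakeFirst
from parsel import Selector
from scrapy.loader import ItemLoader

@dataclass
class MyItem:
    # Defaults allow MyItem() to be created empty and filled by the loader
    h1_text: Optional[str] = None
    p_text: Optional[str] = None

class MyItemLoader(ItemLoader):
    # Return a single value per field instead of a list
    default_output_processor = TakeFirst()

html = """
<html>
<body>
<h1>Hello World!</h1>
<p>This is an example paragraph.</p>
</body>
</html>
"""

selector = Selector(text=html)

item_loader = MyItemLoader(item=MyItem(), selector=selector)
item_loader.add_css('h1_text', 'h1::text', MapCompose(str.strip))
item_loader.add_css('p_text', 'p::text', MapCompose(str.strip))

my_item = item_loader.load_item()

print(my_item.h1_text) # Hello World!
print(my_item.p_text) # This is an example paragraph.
```

Here `MyItem` is a dataclass that holds the extracted data; its fields have defaults so that an empty instance can be passed to the loader. `MyItemLoader` fills that `MyItem` instance. The `add_css` method defines how each field is extracted from the HTML, `MapCompose` applies processing functions to every extracted value, and `TakeFirst`, set as the default output processor, keeps the first non-empty result for each field. Finally, `load_item` returns the populated `MyItem` instance.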
