This is a consolidated reference for the data-parsing step that follows data scraping, meant to be easy to look up and to keep the different tools from getting mixed up.
The techniques covered are XPath, BeautifulSoup, PyQuery, and re (regular expressions).
First, two sample HTML snippets are given as running examples for the comparisons below.
Before parsing, the HTML source has to be converted into the corresponding parser object; each library does this as follows:
XPath:
In [7]: from lxml import etree
In [8]: text = etree.HTML(html)
BeautifulSoup:
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(html, 'lxml')
PyQuery:
In [10]: from pyquery import PyQuery as pq
In [11]: doc = pq(html)
re: no object is needed; the regular expression is matched directly against the raw string (a combined setup sketch follows the Example 1 HTML below).
Example 1
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''
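To make the console snippets below easy to reproduce, here is a minimal combined setup sketch that builds the three parser objects from the html string above; re needs no object and works on the raw string:

from lxml import etree
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
import re

text = etree.HTML(html)             # lxml tree, queried with XPath
soup = BeautifulSoup(html, 'lxml')  # BeautifulSoup document
doc = pq(html)                      # PyQuery document
# re.findall(pattern, html) is applied directly to the raw html string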
Next, let's analyze this sample HTML with each of the parsing methods.
Extract the title text:
XPath:
In [16]: text.xpath('//title/text()')[0]
Out[16]: "The Dormouse's story"
BeautifulSoup:
In [18]: soup.title.string
Out[18]: "The Dormouse's story"
PyQuery:
In [20]: doc('title').text()
Out[20]: "The Dormouse's story"
re:
In [11]: re.findall(r'<title>(.*?)</title>', html)[0]
Out[11]: "The Dormouse's story"
Extract the href attribute of the third a tag:
XPath: # recommended
In [36]: text.xpath('//a[@id="link3"]/@href')[0]
Out[36]: 'http://example.com/tillie'
BeautifulSoup:
In [27]: soup.find_all(attrs={'id':'link3'})
Out[27]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [33]: soup.find_all(attrs={'id':'link3'})[0].attrs['href']
Out[33]: 'http://example.com/tillie'
PyQuery: # recommended
In [45]: doc("#link3").attr.href
Out[45]: 'http://example.com/tillie'
re:
In [46]: re.findall(r'<a href="(.*?)" class="sister" id="link3">Tillie</a>;', html)[0]
Out[46]: 'http://example.com/tillie'
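As a small aside (not part of the original snippets), both object-based libraries also offer a one-liner for this lookup; a minimal sketch, assuming the soup and doc objects from the setup above:

soup.find(id='link3')['href']    # a BeautifulSoup Tag supports dict-style attribute access
doc('#link3').attr('href')       # callable form, equivalent to doc('#link3').attr.href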
Extract the full text of the story p tag:
XPath:
In [48]: text.xpath('string(//p[@class="story"])').strip()
Out[48]: 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
In [51]: ' '.join(text.xpath('string(//p[@class="story"])').split('\n'))
Out[51]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'
BeautifulSoup:
In [89]: ' '.join(list(soup.body.stripped_strings)).replace('\n', '')
Out[89]: "The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie,Lacie and Tillie; and they lived at the bottom of a well. ..."
PyQuery:
In [99]: doc('.story').text()
Out[99]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...'
re: not recommended here, it is far too cumbersome
In [101]: re.findall(r'<p class="story">(.*?)<a href="http://example.com/elsie" class="sister" id="link1">(.*?)</a>(.*?)<a href="http://example.com/lacie" class="sister" id="link2">(.*?)</a>(.*?)<a href="http://example.com/tillie" class="sister" id="link3">(.*?)</a>;(.*?)</p>', html, re.S)[0]
Out[101]: ('Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', '\nand they lived at the bottom of a well.')
Example 2
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
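The Example 2 snippets below assume the parser objects have been rebuilt from this new html string, with the same calls as before; a minimal sketch (imports as in the setup for Example 1):

text = etree.HTML(html)             # rebuild the lxml tree
soup = BeautifulSoup(html, 'lxml')  # rebuild the BeautifulSoup document
doc = pq(html)                      # rebuild the PyQuery document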
Extract the text "second item":
XPath:
In [14]: text.xpath('//li[2]/a/text()')[0]
Out[14]: 'second item'
BeautifulSoup:
In [23]: soup.find_all(attrs={'class': 'item-1'})[0].string
Out[23]: 'second item'
PyQuery:
In [34]: doc('.item-1>a')[0].text
Out[34]: 'second item'
re:
In [35]: re.findall(r'<li class="item-1"><a href="link2.html">(.*?)</a>', html)[0]
Out[35]: 'second item'
Extract the href attribute inside the fifth li tag:
XPath:
In [36]: text.xpath('//li[@class="item-0"]/a/@href')[0]
Out[36]: 'link5.html'
BeautifulSoup:
In [52]: soup.find_all(attrs={'class': 'item-0'})
Out[52]:
[<li class="item-0">first item</li>,
 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]
In [53]: soup.find_all(attrs={'class': 'item-0'})[-1].a.attrs['href']
Out[53]: 'link5.html'
PyQuery:
In [75]: [i.attr.href for i in doc('.item-0 a').items()][1]
Out[75]: 'link5.html'
re:
In [95]: re.findall(r'<li class="item-0"><a href="(.*?)">fifth item</a></li>', html)[0]
Out[95]: 'link5.html'
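To wrap up this snippet (a small addition, not from the original console session), all four tools can also collect every href in one expression; a minimal sketch, assuming the objects rebuilt above:

text.xpath('//li/a/@href')                     # ['link2.html', 'link3.html', 'link4.html', 'link5.html']
[a['href'] for a in soup.find_all('a')]        # same list with BeautifulSoup
[a.attr('href') for a in doc('li a').items()]  # same list with PyQuery
re.findall(r'href="(.*?)"', html)              # same list with a plain regex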
Example 3
html = '''
<li><span class="label">房屋用途</span>普通住宅</li>
'''
Extract 房屋用途 and 普通住宅 separately:
XPath:
In [47]: text.xpath('//li/span/text()')[0]
Out[47]: '房屋用途'
In [49]: text.xpath('//li/text()')[0]
Out[49]: '普通住宅'
BeautifulSoup:
In [65]: soup.span.string
Out[65]: '房屋用途'
In [69]: soup.li.contents[1]  # contents returns the tag's direct child nodes
Out[69]: '普通住宅'
PyQuery:
In [70]: doc('li span').text()
Out[70]: '房屋用途'
In [75]: doc('li .label')[0].tail
Out[75]: '普通住宅'
re: omitted
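The .tail trick above relies on how lxml models mixed content: indexing a PyQuery selection returns the underlying lxml element, whose .text is the text inside the tag and whose .tail is the text that follows its closing tag but still belongs to the parent. A minimal self-contained sketch:

from lxml import etree

span = etree.HTML('<li><span class="label">房屋用途</span>普通住宅</li>').xpath('//span')[0]
print(span.text)   # 房屋用途 -> the text inside <span>
print(span.tail)   # 普通住宅 -> the text after </span>, i.e. the rest of the <li>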
Example 4
26667元/平米