Beautiful Soup Methods

wangmengli960109

Article link: https://blog.csdn.net/wangmengli980109/article/details/102569843

Import BeautifulSoup
from bs4 import BeautifulSoup

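Create a BeautifulSoup object
The first argument is the HTML to parse, given as a string or an open file object (not a URL; download the page first). The second argument names the parser, for example 'lxml'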
bs = BeautifulSoup(html,'lxml')
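As a minimal sketch of the step above, the snippet below builds a soup from a small hand-written HTML string. The markup is hypothetical sample data, and 'html.parser' (the parser bundled with Python) is used only so the example runs without installing lxml; passing 'lxml' works the same way when it is available.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; in practice this would be the text of a downloaded page.
html = """
<table>
  <tr class="h"><td><a href="index.html" target="_blank">Home</a></td></tr>
  <tr class="even" id="feng"><td><a href="about.html">About</a></td></tr>
</table>
"""

# The second argument selects the parser; 'html.parser' needs no extra install.
bs = BeautifulSoup(html, "html.parser")

print(type(bs).__name__)  # the class name of the parsed document object
print(bs.a["href"])       # href of the first <a> tag in the document
```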

Pretty-print the page (prettify() returns the indented markup as a string)
bs.prettify()

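Match every tr tag on the page
With no filter, find_all returns all tags with the given name
bs.find_all('tr')   # 'tr' can be replaced with any tag name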

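limit caps the result at the first n matching tags
bs.find_all('tr',limit=2)
A related idea is to index the result list; note that [0] returns the first match as a single Tag, not a list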
bs.find_all('tr')[0]
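The difference between limit and indexing can be sketched as follows. The three-row table is made-up sample data and 'html.parser' is an assumption chosen so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical table with three rows.
html = "<table><tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr></table>"
bs = BeautifulSoup(html, "html.parser")

first_two = bs.find_all("tr", limit=2)  # a list holding at most two Tag objects
first_one = bs.find_all("tr")[0]        # a single Tag taken out of the full list

print(len(first_two))
print(first_one.get_text())
```

limit stops the search early, while indexing collects every match first and then picks one, so limit can be the cheaper choice on large pages.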


Select all tr tags with a given class. class is a reserved word in Python, so the keyword argument is written class_ = 'name'
bs.find_all('tr',class_ = 'h')

Select the tr tags that have class=even and id=feng at the same time
bs.find_all('tr',class_ = 'even',id = 'feng')
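A small sketch of these filters on made-up rows is shown below. The class and id values mirror the ones used above; the attrs dictionary form is an equivalent spelling offered by find_all, and 'html.parser' is an assumption for portability.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: two share class "even", only one also has id "feng".
html = """
<table>
  <tr class="h"><td>header</td></tr>
  <tr class="even" id="feng"><td>row A</td></tr>
  <tr class="even"><td>row B</td></tr>
</table>
"""
bs = BeautifulSoup(html, "html.parser")

by_class = bs.find_all("tr", class_="even")            # filter on class only
by_both = bs.find_all("tr", class_="even", id="feng")  # both conditions must hold
# The same filter written as a dictionary, which avoids the class_ spelling.
by_attrs = bs.find_all("tr", attrs={"class": "even", "id": "feng"})

print(len(by_class), len(by_both), len(by_attrs))
```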


get_text() returns all the text inside a tag, including the text of its descendant tags
t = bs.find_all('tr')[2]   # pick one tag by index first; calling get_text() on the whole list raises AttributeError
print(t.get_text())


A string argument is used as a separator between the text pieces, which makes the result easier to read
t = bs.find_all('tr')[2]
print(t.get_text('====='))
Adding strip=True also removes leading and trailing whitespace from each text piece
t = bs.find_all('tr')[2]
print(t.get_text('=====',strip = True))
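The three get_text() variants can be compared on one made-up row, as in the sketch below. The cell values are hypothetical and 'html.parser' is an assumption so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical row whose cells carry surrounding spaces.
html = "<table><tr><td> Alice </td><td> 30 </td></tr></table>"
bs = BeautifulSoup(html, "html.parser")
row = bs.find_all("tr")[0]

plain = row.get_text()                     # pieces concatenated as they are
joined = row.get_text("=====")             # separator placed between pieces
clean = row.get_text("=====", strip=True)  # each piece stripped before joining

print(repr(plain))
print(repr(joined))
print(repr(clean))
```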


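To read an attribute of a node, for example the href value of <a href="www.baidu.com"></a>, or attributes such as name, class or id on other elements, index the tag with the attribute name
If the tag does not have that attribute, or the name is misspelled, a KeyError is raised, e.g. KeyError: 'traget'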
t = bs.find_all('a')
for i in t:
    print(i['target'])
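One way to avoid the KeyError when some tags lack the attribute is sketched below, using Tag.get() and Tag.has_attr(). The two links are invented sample data and 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first link has a target attribute.
html = '<a href="a.html" target="_blank">A</a><a href="b.html">B</a>'
bs = BeautifulSoup(html, "html.parser")
links = bs.find_all("a")

# get() returns None instead of raising KeyError when the attribute is absent.
targets = [link.get("target") for link in links]

# has_attr() lets the loop skip tags that lack the attribute.
with_target = [link["href"] for link in links if link.has_attr("target")]

print(targets)
print(with_target)
```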

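find() returns only the first matching tag, or None when nothing matches. The result is a single Tag rather than a list, so indexing it with a number such as [1] is treated as an attribute lookup and raises KeyError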
bs.find('tr')
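The behaviour of find() can be illustrated with a short sketch, assuming a one-row table as sample input and 'html.parser' as the parser. Checking for None before using the result is a common precaution, not something the original text prescribes.

```python
from bs4 import BeautifulSoup

# Hypothetical page with exactly one row and no div.
html = "<table><tr><td>only row</td></tr></table>"
bs = BeautifulSoup(html, "html.parser")

row = bs.find("tr")      # first matching Tag
missing = bs.find("div")  # no match, so None is returned

if row is not None:
    print(row.get_text())
print(missing)
```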

contents attribute: returns all direct children of a node, both child tags and text nodes, as a list
t = bs.find('tr')
print(t.contents)   # a list
The list can be iterated
for i in t.contents:
    print(i.string)
    
children attribute: used the same way as contents, but it returns an iterator instead of a list
t = bs.find('tr')
The children include whitespace-only text nodes
for i in t.children:
    print(i.string)
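The sketch below contrasts contents and children on a made-up row that contains line breaks between cells, and shows one possible way to keep only the child tags. The markup is hypothetical and 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical row with newlines between the cells.
html = "<table><tr>\n<td>1</td>\n<td>2</td>\n</tr></table>"
bs = BeautifulSoup(html, "html.parser")
row = bs.find("tr")

# contents is a list, so it supports len() and indexing.
print(len(row.contents))

# children is consumed by iteration; filtering on Tag drops the whitespace text nodes.
cells = [child for child in row.children if isinstance(child, Tag)]
print([cell.get_text() for cell in cells])
```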

Get all the text with get_text()
t = bs.select('a')
for i in t:
    print(i.get_text())
The string attribute is another way to get the text, but it returns None when the tag has more than one child
t = bs.select('a')
for i in t:
    print(i.string)
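The difference between string and get_text() shows up when a tag has more than one child. The following sketch uses two invented links and assumes 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical links: the first holds plain text, the second mixes a tag and text.
html = '<a href="x.html">plain</a><a href="y.html"><b>bold</b> and more</a>'
bs = BeautifulSoup(html, "html.parser")
links = bs.select("a")

simple = links[0].string      # single text child, so the text is returned
mixed = links[1].string       # two children, so string gives None
full = links[1].get_text()    # get_text() still collects everything

print(simple, mixed, full)
```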

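select() finds elements with CSS selectors: by tag name, by class, by id, or by attributes such as name and href. It returns a list of the matching tags, which can be traversed with a for...in... loop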


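A descendant selector reaches tags nested inside other tags: 'td a' matches every a inside a td. Iterate over the result with for and index each item i to get the href you want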
t = bs.select('td a')
for i in t:
    print(i['href'])
    
    
# Find tags by attribute, e.g. the a nodes whose href equals index.html
a = bs.select("a[href='index.html']")

# Select the img elements that are direct children of a div
img = bs.select("div > img")
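The selectors mentioned above can be tried together on one small document, as in this sketch. The ids, classes and file names are hypothetical, and 'html.parser' is assumed in place of lxml so the snippet runs with a plain bs4 install.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one img directly inside the div, one nested in a <p>.
html = """
<div id="main" class="box"><img src="a.png"><p><img src="b.png"></p></div>
<table><tr><td><a href="index.html">Home</a></td></tr></table>
"""
bs = BeautifulSoup(html, "html.parser")

hrefs = [a["href"] for a in bs.select("td a")]         # descendant selector
by_attr = bs.select("a[href='index.html']")            # attribute selector
direct = [i["src"] for i in bs.select("div > img")]    # direct children only
nested = [i["src"] for i in bs.select("div img")]      # any depth below div
by_id = bs.select("#main")                             # id selector
by_class = bs.select(".box")                           # class selector

print(hrefs, direct, nested)
```

select() always returns a list, even for a single match; select_one() is available when only the first match is wanted.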

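The string attribute only works when a node contains a single text node or a single child node. The strings attribute instead yields the text of every text node inside the tag, including those of its descendants; the results may include whitespace-only text such as line breaks and spaces
o = bs.select('div')[1]   # pick one tag by index first; otherwise an AttributeError is raised because the list has no strings attribute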

for i in o.strings:
    print(i)
    
    
To skip line breaks and spaces, use the stripped_strings attribute instead
o = bs.select('div')[0]
for i in o.stripped_strings:
    print(i)
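To see what strings and stripped_strings each produce, the sketch below runs both on the same made-up div. The markup is hypothetical and 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical div with indentation and padded paragraph text.
html = "<div>\n  <p> first </p>\n  <p> second </p>\n</div>"
bs = BeautifulSoup(html, "html.parser")
div = bs.select("div")[0]

raw = list(div.strings)             # every text node, whitespace included
clean = list(div.stripped_strings)  # whitespace-only nodes dropped, the rest stripped

print(raw)
print(clean)
```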
