Learning the BeautifulSoup module

Installation: pip3 install beautifulsoup4

from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<a href="dfghrt.cc">Chat on the internet</a>
<p>I have received a e-mail from an old friend yesterday. she asked how my summer holiday is.had I ever gone to
    somewhere for pleasure ! and also ,told me that she wanted to go around for fun ,but ,unfortunately,there is no
    time! <a href="dref.com">what a pity !</a>so ,if I would get to work, how will my life be? what type of jobs should
    I choose? uh, maybe I think a lot! how time flies! I have reached school for almost half a month.came to read
    articles day by day.</p>
<hr>
<div class="story"><p id="we11">Homely &comely appearance! want to make a beating plan! buy some herbs, face-mask and so on . surfing the
    internet ,I chance meet a stranger ,he has a beatiful net-mane,it give me a good sence,so I made him into my
    friends-list,and then i found ,that he has doctor degree. and he is of great knowlege! uh ,and a confident man! but
    unluckly ,he divorced .maybe in our country ,divorcement is normal .but I feel unsafety. I must get to listen "4+1"
    oral english now ,continue it later!</p></div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc,features = 'html.parser')

1. name: the tag name

tag = soup.find(name='a')
print(tag,tag.name)

Output: <a href="dfghrt.cc">Chat on the internet</a> a

2. attrs: the tag's attributes

print(tag.attrs)
tag.attrs = {'k1': '123'}   # replace all attributes at once
tag.attrs['k2'] = '456'     # add a new attribute
print(soup)

Partial output:

<body>
<a k1="123" k2="456">Chat on the internet</a>
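
Attributes can also be changed one at a time by indexing the tag like a dictionary. The sketch below uses a hypothetical one-tag document (not html_doc) to keep the output short.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, for illustration only.
demo = BeautifulSoup("<a href='x' class='btn big'>t</a>", features="html.parser")
a = demo.find("a")

original = dict(a.attrs)   # copy of the attributes; class is stored as a list
a["href"] = "y"            # update one attribute
a["target"] = "_blank"     # add one attribute
del a["class"]             # delete one attribute
rendered = str(a)

print(original)
print(rendered)
```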

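3. children: all direct children

body = soup.find('body')
l1 = body.children            # an iterator
print(list(l1))

The children include the newline text nodes between tags, not only the tags themselves.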
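
If only the tags are wanted, one option is to filter the children by type. A minimal sketch on a hypothetical snippet (not html_doc):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with newlines between the tags.
snippet = "<body>\n<a href='x'>link</a>\n<p>text</p>\n</body>"
demo = BeautifulSoup(snippet, features="html.parser")
body = demo.find("body")

all_children = list(body.children)   # tags and newline strings
tag_children = [c for c in body.children if isinstance(c, Tag)]
tag_names = [c.name for c in tag_children]

print(len(all_children), tag_names)
```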

4. descendants: all descendants (children, grandchildren, and so on)

l2 = body.descendants
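
To make the difference from children concrete, the following sketch compares the two on a hypothetical nested snippet (not html_doc):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested snippet: <b> is a grandchild of <div>.
demo = BeautifulSoup("<div><p>one <b>two</b></p></div>", features="html.parser")
div = demo.find("div")

child_names = [c.name for c in div.children if isinstance(c, Tag)]
descendant_names = [d.name for d in div.descendants if isinstance(d, Tag)]

print(child_names)        # direct children only
print(descendant_names)   # every nested tag, in document order
```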

5. clear: remove the tag's contents but keep the tag itself

tag = soup.find('body')
tag.clear()
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>
<body></body>
</html> 

6. decompose: recursively destroy everything inside the tag and remove the selected tag as well

body = soup.find('body')
body.decompose()
print(soup)

 

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>

</html>

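7. extract: remove the tag and everything inside it, and return the removed tag (the tree ends up the same as with decompose; the operation works like "cut")

body = soup.find('body')
v = body.extract()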
print(v)
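
The three removal methods can be compared side by side. This is a sketch on a hypothetical snippet with three paragraphs, one per method:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, one paragraph for each method.
snippet = "<div><p id='a'>A</p><p id='b'>B</p><p id='c'>C</p></div>"
demo = BeautifulSoup(snippet, features="html.parser")

demo.find("p", attrs={"id": "a"}).clear()               # empties the tag, keeps it
demo.find("p", attrs={"id": "b"}).decompose()           # removes the tag, returns None
removed = demo.find("p", attrs={"id": "c"}).extract()   # removes the tag, returns it

result = str(demo)
print(result)
print(removed)
```

The extracted tag is still a usable Tag object, so it can be inserted somewhere else in the tree afterwards.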

8. decode: convert to a string (including the current tag); decode_contents: the same, without the current tag

body = soup.find('body')
val = body.decode()
print(type(body),type(val))

<class 'bs4.element.Tag'> <class 'str'>

9. encode: convert to bytes (including the current tag); encode_contents: the same, without the current tag

body = soup.find('body')
val = body.encode()
print(type(body), type(val))

<class 'bs4.element.Tag'> <class 'bytes'>
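
A short sketch of all four conversions on a hypothetical snippet, showing what "with" and "without the current tag" mean in practice:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
demo = BeautifulSoup("<p><b>hi</b></p>", features="html.parser")
p = demo.find("p")

with_tag = p.decode()               # str, includes <p>...</p>
inner_only = p.decode_contents()    # str, children only
as_bytes = p.encode()               # bytes, includes <p>...</p>
inner_bytes = p.encode_contents()   # bytes, children only

print(with_tag, inner_only, as_bytes, inner_bytes)
```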

10. find: return the first matching tag

tag = soup.find(name='p', attrs={'id': 'we11'}, recursive=True)

get: read one attribute of a tag

tag = soup.find('a')
val = tag.get('href')
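
Both find and get return None when nothing is found, which is worth handling before chaining further calls. A sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href.
snippet = "<div><a href='dref.com'>x</a><a>no link</a></div>"
demo = BeautifulSoup(snippet, features="html.parser")

first = demo.find("a")
second = first.find_next_sibling("a")

href = first.get("href")             # attribute value
missing = second.get("href")         # None, the attribute is absent
fallback = second.get("href", "#")   # explicit default instead of None
nothing = demo.find("table")         # None, no such tag

print(href, missing, fallback, nothing)
```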

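11. find_all: return all matching tags as a list

tags = soup.find_all(name='a', limit=1)        # limit the number of results
tags = soup.find_all(name=['a', 'div'])        # match any tag name in the list
import re
rep = re.compile('^H')
tags = soup.find_all(string=rep, limit=1)      # match on node text (older code passes this as text=)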
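
The same filters, run on a hypothetical snippet small enough to check the results by eye:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
snippet = "<div><a href='1'>Home</a><p>Hello</p><a href='2'>About</a></div>"
demo = BeautifulSoup(snippet, features="html.parser")

limited = demo.find_all("a", limit=1)                # stop after the first match
mixed = demo.find_all(["a", "p"])                    # any of several tag names
by_text = demo.find_all(string=re.compile("^H"))     # text nodes starting with "H"
by_attr = demo.find_all("a", attrs={"href": "2"})    # filter on an attribute value

print(len(limited), [t.name for t in mixed], by_text, by_attr)
```

Note that filtering on string alone returns the matching text nodes, not the tags that contain them.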

12. has_attr: check whether the tag has the given attribute

tag = soup.find('a')
val = tag.has_attr('href')
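
has_attr returns a plain bool, so it works well in list comprehensions. The sketch below (hypothetical snippet) also shows the related href=True filter of find_all, which selects tags that carry the attribute at all:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: only the first <a> has an href.
demo = BeautifulSoup("<a href='x'>one</a><a>two</a>", features="html.parser")

flags = [a.has_attr("href") for a in demo.find_all("a")]
with_href = demo.find_all("a", href=True)   # only tags that have an href

print(flags, with_href)
```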

13. get_text: return the text inside the tag

tag = soup.find('a')
val = tag.get_text()
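
get_text joins the text of all nested nodes; its separator and strip parameters control how the pieces are joined. A sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with surrounding whitespace and a nested tag.
demo = BeautifulSoup("<p> Hello <b>big</b> world </p>", features="html.parser")
p = demo.find("p")

raw = p.get_text()                               # whitespace kept as written
joined = p.get_text(separator="|", strip=True)   # pieces stripped, then joined

print(repr(raw))
print(joined)
```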

14. index: position of a child within the tag's contents

tag = soup.find('body')
val = tag.index(tag.find('p'))
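
index only looks at direct children and raises ValueError for anything else. A sketch on a hypothetical list:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet without whitespace, so positions are easy to read.
demo = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", features="html.parser")
ul = demo.find("ul")
items = ul.find_all("li")

position = ul.index(items[1])   # position among ul.contents

stranger = demo.new_tag("li")   # a new tag that is not attached to <ul>
try:
    ul.index(stranger)
    found_stranger = True
except ValueError:
    found_stranger = False

print(position, found_stranger)
```

In a document with newlines between tags, the newline strings count as children too, so the positions shift accordingly.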

15. is_empty_element: whether the tag is an empty (void, self-closing) element such as hr, br, input, img, meta, link, frame

tag = soup.find('hr')
val = tag.is_empty_element
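
A tag with no content is not automatically an "empty element": the flag is about void tags. A sketch on a hypothetical snippet, parsed with html.parser as in the rest of these notes:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two void tags and one ordinary tag with no content.
demo = BeautifulSoup("<div><hr><br/><p></p></div>", features="html.parser")

flags = {tag.name: tag.is_empty_element for tag in demo.find("div").children}
rendered = str(demo)

print(flags)
print(rendered)
```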

16. Related nodes (navigation attributes)

# tag.next
# tag.next_element
# tag.next_elements
# tag.next_sibling
# tag.next_siblings

# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

# tag.parent
# tag.parents
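
A sketch of the most common navigation attributes on a hypothetical snippet. The snippet has no whitespace between tags; with whitespace, the siblings of a tag are often newline strings rather than tags.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet without whitespace between the tags.
demo = BeautifulSoup("<div><h1>T</h1><p>one</p><p>two</p></div>", features="html.parser")
first_p = demo.find("p")

next_name = first_p.next_sibling.name        # the following <p>
prev_name = first_p.previous_sibling.name    # the preceding <h1>
parent_name = first_p.parent.name            # the enclosing <div>
next_node = first_p.next_element             # the text node inside first_p
ancestors = [p.name for p in first_p.parents]

print(next_name, prev_name, parent_name, repr(next_node), ancestors)
```

next_sibling stays on the same level of the tree, while next_element follows parse order and therefore steps into the tag first.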

17. Searching among a tag's related nodes

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)
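
These methods accept the same filters as find and find_all but search relative to a starting tag. A sketch on a hypothetical snippet with two headings:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: paragraphs grouped under two headings.
snippet = "<div><h2>A</h2><p>a1</p><p>a2</p><h2>B</h2><p>b1</p></div>"
demo = BeautifulSoup(snippet, features="html.parser")
second_h2 = demo.find_all("h2")[1]

after = second_h2.find_next_sibling("p").get_text()
before = second_h2.find_previous_sibling("p").get_text()
all_before = [p.get_text() for p in second_h2.find_previous_siblings("p")]
container = second_h2.find_parent("div").name

print(after, before, all_before, container)
```

The "previous" variants return results nearest first, so the list runs backwards through the document.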

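18. select, select_one: CSS selectors

tag = soup.select("div p")

select matches using CSS selector syntax and returns a list of all matches; select_one returns only the first match, or None if nothing matches.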
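
A sketch of a few selector forms on a hypothetical snippet that reuses the class and id names from html_doc:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; "story" and "we11" mirror the names used in html_doc.
snippet = "<div class='story'><p id='we11'>x</p><p>y</p></div><p>z</p>"
demo = BeautifulSoup(snippet, features="html.parser")

inside_div = demo.select("div p")                   # descendant selector
all_p = demo.select("p")                            # every <p> in the document
exact = demo.select_one("div.story > p#we11")       # class, child and id combined
absent = demo.select_one("table")                   # no match

print(len(inside_div), len(all_p), exact, absent)
```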

19. wrap: wrap the current tag inside the given tag

from bs4.element import Tag
tag1 = Tag(name='div',attrs={'color':'red'})
tag1.string = 'newtag'
tag2 = soup.find('a')
val = tag2.wrap(tag1)
print(soup)

20. unwrap: remove the current tag but keep whatever it contains

tag = soup.find('div')
val = tag.unwrap()
print(soup)

val is the tag that was removed.
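
wrap and unwrap are inverses of each other, which the round trip below illustrates. This sketch uses a hypothetical snippet and creates the wrapper with soup.new_tag rather than the Tag constructor; the "box" class name is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
demo = BeautifulSoup("<p><a href='x'>link</a></p>", features="html.parser")
a = demo.find("a")

# "box" is an arbitrary class name chosen for this example.
wrapper = a.wrap(demo.new_tag("div", attrs={"class": "box"}))
after_wrap = str(demo)

removed = demo.find("div").unwrap()   # returns the now-empty <div>
after_unwrap = str(demo)

print(after_wrap)
print(after_unwrap)
print(removed)
```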

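21. Tag content: string, stripped_strings

tag = soup.find('p')
print(tag.string)            # None here: this <p> has several children (text and an <a>)
tag.string = 'newcontents'   # replaces everything inside the tag
print(soup)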

# tag = soup.find('body')
# val = tag.stripped_strings
# print(next(val))
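
string is only set when a tag has exactly one child; otherwise strings or stripped_strings give access to every text node. A sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one simple paragraph and one with a nested tag.
snippet = "<div><p>only text</p><p>mixed <b>bold</b> text</p></div>"
demo = BeautifulSoup(snippet, features="html.parser")
first, second = demo.find_all("p")

single = first.string                      # the one text node
ambiguous = second.string                  # None, several children
pieces = list(second.strings)              # every text node, whitespace kept
clean = list(second.stripped_strings)      # the same, stripped

print(single, ambiguous, pieces, clean)
```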

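22. append: add a tag at the end of the current tag's contents

insert: insert a tag at a given position inside the current tag

insert_after, insert_before: insert a tag immediately after or before the current tag

replace_with: replace the current tag with another one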
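
The methods above can be sketched together on a hypothetical list. make_li is a helper defined only for this example; it is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a list with a single item.
demo = BeautifulSoup("<ul><li>b</li></ul>", features="html.parser")
ul = demo.find("ul")

def make_li(text):
    """Example-only helper: build a new <li> holding the given text."""
    li = demo.new_tag("li")
    li.string = text
    return li

ul.append(make_li("d"))                            # b, d
ul.insert(0, make_li("a"))                         # a, b, d
ul.find_all("li")[2].insert_before(make_li("c"))   # a, b, c, d
ul.find_all("li")[-1].insert_after(make_li("e"))   # a, b, c, d, e
ul.find_all("li")[0].replace_with(make_li("A"))    # A, b, c, d, e

items = [li.get_text() for li in ul.find_all("li")]
print(items)
```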

23. CSS selectors

soup.select('ul li')
