3.21 (Learning Python with a Senior Student)

禾太阳

Published 2022-03-22 00:05:12

Tags: python, programming languages, pycharm

Article link: https://blog.csdn.net/qq_58181376/article/details/123649033

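II. Traversing the document

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
# print(bs.head.contents)   # list of all direct children of <head>
print(bs.head.contents[1])  # second child (index 1); index 0 is usually the newline after <head>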

Result: <meta content="text/html;charest=utf-8" http-equiv="content-type"/>
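To see why index 1 rather than index 0 holds the first tag, here is a minimal sketch on a small inline page. The HTML string is hypothetical and stands in for baidu.html; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML used in place of baidu.html.
html = """<html><head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<title>Demo</title>
</head><body></body></html>"""

bs = BeautifulSoup(html, "html.parser")

# .contents is a list of direct children; the whitespace between tags
# appears as its own string node.
children = bs.head.contents
first_child = children[0]    # the newline right after <head>
second_child = children[1]   # the <meta> tag

# .children iterates over the same nodes; keep only real tags.
tag_names = [child.name for child in bs.head.children if isinstance(child, Tag)]

print(repr(first_child))
print(second_child)
print(tag_names)
```

Filtering with isinstance(child, Tag) is one way to skip the whitespace strings when only the tags matter.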
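III. Searching the document

1. (1) find_all() (used frequently)

# String filter: finds tags whose name exactly matches the string

from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
t_list = bs.find_all("a")  # every tag named exactly "a"
print(t_list)

Result: a list of every <a> tag, each shown from its opening <a ...> to its closing </a>.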

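(2) # Regular-expression search: tag names are matched with the pattern's search() method

import re
from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
t_list = bs.find_all(re.compile("a"))  # any tag whose name contains the letter a
print(t_list)

Result: every tag whose name contains the letter a (for example a, head, meta), together with its contents.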
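The difference between the string filter and the regular-expression filter is easier to see when only the tag names are printed. The sketch below assumes a small hypothetical HTML string rather than baidu.html.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML; not the page used in the article.
html = ("<html><head><title>t</title></head>"
        "<body><a href='#'>x</a><span>y</span><p>z</p></body></html>")
bs = BeautifulSoup(html, "html.parser")

# String filter: the tag name must equal "a".
exact = [tag.name for tag in bs.find_all("a")]

# Regex filter: the pattern is applied with search(), so any tag name
# that contains an "a" somewhere qualifies.
fuzzy = [tag.name for tag in bs.find_all(re.compile("a"))]

# Anchoring the pattern makes the regex behave like the string filter.
anchored = [tag.name for tag in bs.find_all(re.compile("^a$"))]

print(exact)
print(fuzzy)
print(anchored)
```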

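(3) # Function filter: pass in a function and search according to the condition it tests (for awareness only)

from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser

def name_is_exists(tag):
    # True when the tag carries a name attribute (attr is short for attribute)
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)

for item in t_list:
    print(item)      # print one item per line, which is easier to read than the whole list

Result: every tag that has a name attribute, printed one per line.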
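A function filter can combine several conditions, which the other filters cannot easily express. This is a minimal sketch on hypothetical HTML; the function name has_name_but_no_href is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML; not the page used in the article.
html = """<form>
<input name="wd" type="text"/>
<input type="submit" value="Go"/>
<a name="tj_login" href="/login">Log in</a>
<a href="/about">About</a>
</form>"""
bs = BeautifulSoup(html, "html.parser")

def has_name_but_no_href(tag):
    # find_all calls this once per tag and keeps the tags that return True.
    return tag.has_attr("name") and not tag.has_attr("href")

matches = bs.find_all(has_name_but_no_href)

# A lambda works the same way for short, one-off conditions.
named_links = bs.find_all(lambda tag: tag.name == "a" and tag.has_attr("name"))

for item in matches + named_links:
    print(item)
```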

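(2) kwargs parameters

from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
t_list = bs.find_all(id="head")  # filter on the id attribute
for item in t_list:
    print(item)

Result: the tag whose id is head, including everything nested inside it.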

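from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
t_list = bs.find_all(class_=True)
# class is a reserved keyword in Python, so writing class=True is a syntax error.
# Beautiful Soup therefore uses class_ (with a trailing underscore) for the HTML class attribute.
for item in t_list:
    print(item)

Result: every tag that has a class attribute, including everything nested inside it.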

from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
t_list = bs.find_all(href="http://news.baidu.com")  # href must equal this string exactly
for item in t_list:
    print(item)

Result: every tag whose href attribute is exactly "http://news.baidu.com".
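The keyword filters can be combined with each other and with a tag name. The sketch below uses hypothetical HTML and a made-up data-kind attribute; it also shows the attrs dictionary, which is needed for attribute names that are not valid Python identifiers.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML; data-kind is an invented attribute.
html = """<div id="head">
<a class="mnav" href="http://news.baidu.com">News</a>
<a class="mnav extra" href="http://map.baidu.com">Map</a>
<a href="http://news.baidu.com/sports" data-kind="link">Sports</a>
</div>"""
bs = BeautifulSoup(html, "html.parser")

by_id = bs.find_all(id="head")
# class_ matches when any one of the tag's classes equals "mnav".
by_class = bs.find_all(class_="mnav")
# A plain string must equal the whole attribute value.
exact_href = bs.find_all(href="http://news.baidu.com")
# data-kind contains a hyphen, so it cannot be a keyword argument.
by_attrs = bs.find_all(attrs={"data-kind": "link"})
# Tag name plus several keyword filters: all conditions must hold.
combined = bs.find_all("a", class_="mnav", href="http://map.baidu.com")

for item in combined:
    print(item.get_text())
```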

(3) text parameter
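A minimal sketch of searching by text, assuming Beautiful Soup 4.4.0 or later, where this argument is spelled string (text is the older name for the same argument). The HTML is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML; not the page used in the article.
html = "<div><a>hao123</a><a>News</a><a>Maps 123</a><span>News</span></div>"
bs = BeautifulSoup(html, "html.parser")

# With only string=..., the results are the matching text nodes, not tags.
exact_strings = bs.find_all(string="News")

# A compiled pattern is applied with search(), as in the tag-name filter.
digit_strings = bs.find_all(string=re.compile(r"\d"))

# Combined with a tag name, the results are tags whose text matches.
news_links = bs.find_all("a", string="News")

print(exact_strings)
print(digit_strings)
print(news_links)
```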

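(4) limit parameter: caps the number of results returned

from bs4 import BeautifulSoup

with open("./baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # html.parser is Python's built-in HTML parser
t_list = bs.find_all("a", limit=3)  # stop after the first three matches
for item in t_list:
    print(item)

Result: the first three <a> tags.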
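When only one result is wanted, find() is the usual shortcut for find_all() with limit=1. The sketch below assumes a hypothetical generated list of links.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: five links generated in a loop.
items = "".join(f'<li><a href="/p{i}">Page {i}</a></li>' for i in range(1, 6))
bs = BeautifulSoup("<ul>" + items + "</ul>", "html.parser")

# limit stops the search once three matches have been collected.
first_three = bs.find_all("a", limit=3)

# find() returns the first match itself instead of a list...
first = bs.find("a")
# ...and None when nothing matches, so check before using the result.
missing = bs.find("table")

print([tag["href"] for tag in first_three])
print(first["href"], missing)
```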

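(5) CSS selectors (important)

# bs is the BeautifulSoup object built in the earlier examples.
# Each assignment below overwrites t_list, so keep only the selector you want to try.
t_list = bs.select('title')            # select by tag name
t_list = bs.select('.mnav')            # select by class name
t_list = bs.select('#u1')              # in CSS selectors, # means id: select by id
t_list = bs.select('a[class="bri"]')   # select by attribute value
t_list = bs.select('head > title')     # select a direct child: <title> directly under <head>
for item in t_list:
    print(item)

t_list = bs.select('.mnav ~ .bri')     # sibling selector: .bri tags that follow a .mnav tag under the same parent
print(t_list[0].get_text())            # get the text only

Result: only the text of the first match, without any tags.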
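The same selectors can be tried side by side on a small page. This is a sketch under the assumption of hypothetical HTML that reuses the class and id names from the snippet above; select() relies on the soupsieve package that is installed alongside recent Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML reusing the mnav / bri / u1 names.
html = """<html><head><title>Demo</title></head><body>
<div id="u1">
<a class="mnav" href="/news">News</a>
<a class="mnav" href="/map">Map</a>
<a class="bri" href="/more">More</a>
</div>
</body></html>"""
bs = BeautifulSoup(html, "html.parser")

selectors = ["title", ".mnav", "#u1", 'a[class="bri"]', "head > title", ".mnav ~ .bri"]
# Count how many tags each selector matches.
counts = {sel: len(bs.select(sel)) for sel in selectors}

# select_one returns the first match (or None) instead of a list.
first_mnav = bs.select_one(".mnav")
sibling_text = bs.select(".mnav ~ .bri")[0].get_text()

for sel, n in counts.items():
    print(sel, n)
print(first_mnav.get_text(), sibling_text)
```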

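Supplement: regular expressions are string patterns (used to test whether a string meets a given criterion)

import re
# Create a pattern object

1. search

pat = re.compile("AA")    # "AA" is the regular expression used to check other strings
m = pat.search("CBA")     # the argument to search is the string being checked
print(m)

Result: None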

pat = re.compile("AA")  # "AA" is the regular expression used to check other strings
m = pat.search("ABCAA")
print(m)

Result: <re.Match object; span=(3, 5), match='AA'>

pat = re.compile("AA")
m = pat.search("AABCAADDCCAAA")  # search stops at the first match, even if there are more
print(m)

Result: <re.Match object; span=(0, 2), match='AA'>

m = re.search("asd", "Aasd")
# Module-level form, no compiled pattern needed:
# the first string is the rule (pattern), the second is the string being checked
print(m)

Result: <re.Match object; span=(1, 4), match='asd'>
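Printing the match object is mainly useful for inspection; in practice the matched text and its position are read from it. A minimal sketch, with illustrative variable names:

```python
import re

pat = re.compile("AA")

m = pat.search("ABCAA")
if m is not None:
    matched_text = m.group()   # the text that matched
    start, end = m.span()      # start index and one-past-the-end index

# search returns None when nothing matches, so test before calling group().
no_match = pat.search("CBA")
found = no_match is not None

print(matched_text, start, end, found)
```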

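2. findall: finds every matching substring

print(re.findall("[A-Z]","ASDaDFGAa"))
# The first string is the rule (regular expression), the second is the string being checked

print(re.findall("[A-Z]+","ASDaDFGAa"))   # + matches a run of one or more uppercase letters

Result:

['A', 'S', 'D', 'D', 'F', 'G', 'A']
['ASD', 'DFGA']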
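findall returns only the matched strings. When the positions are needed as well, re.finditer yields a match object per hit. A short sketch on the same input string:

```python
import re

text = "ASDaDFGAa"

# findall: just the matched substrings.
runs = re.findall("[A-Z]+", text)

# finditer: one match object per hit, so span() is available too.
positions = [(m.group(), m.span()) for m in re.finditer("[A-Z]+", text)]

# A different character class picks out the lowercase letters instead.
lower = re.findall("[a-z]", text)

print(runs)
print(positions)
print(lower)
```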

3. sub

print(re.sub("a","A","abcdcasd"))
# Find every a in the third string and replace it with A

Result: AbcdcAsd
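re.sub replaces every match by default. The sketch below shows three standard variations on the same input: limiting the number of replacements with count, getting the replacement count back with re.subn, and computing the replacement with a function.

```python
import re

text = "abcdcasd"

all_replaced = re.sub("a", "A", text)          # every a is replaced
first_only = re.sub("a", "A", text, count=1)   # only the first a is replaced

# subn returns the new string together with the number of replacements.
result, n = re.subn("a", "A", text)

# The replacement can be a function that receives each match object.
upper_ac = re.sub("[ac]", lambda m: m.group().upper(), text)

print(all_replaced, first_only, n, upper_ac)
```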

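# Tip: put r in front of strings used with regular expressions (raw strings),
# so backslashes are kept as-is and escape sequences are not a concern
a = r"\aabd-\'"
print(a)

Result: \aabd-\'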
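The effect of the r prefix shows up as soon as a pattern contains a backslash. A minimal sketch with made-up sample strings:

```python
import re

# Without r, a regex backslash has to be doubled; both spell the same pattern.
plain = "\\d+"
raw = r"\d+"
same_pattern = plain == raw

# "\n" is one newline character; r"\n" is a backslash followed by n.
lengths = (len("\n"), len(r"\n"))

# "\b" in a normal string is a backspace character, so the pattern silently
# fails to match; in a raw string it is the regex word boundary.
without_raw = re.findall("\bcat\b", "a cat sat")
with_raw = re.findall(r"\bcat\b", "a cat sat")

numbers = re.findall(r"\d+", "id 42, page 7")

print(same_pattern, lengths, without_raw, with_raw, numbers)
```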
