Python and Web Scraping 02: HTML-Related Content

Article link: https://blog.csdn.net/weixin_45799003/article/details/124046677

1. Regular expressions

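1. regex
Example: building a regular expression for a family of strings
Rule: a appears at least once; b appears exactly 5 times; c appears an even number of times (possibly zero); the string ends with d or e.
Expression: aa*bbbbb(cc)*(d|e)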
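The rule above can be checked quickly with Python's re module. This is only a small sketch: the helper name matches_rule and the sample strings are made up for illustration, and fullmatch is used so that the whole string has to follow the rule.

```python
import re

# Pattern from the text: a at least once, b exactly five times,
# c an even number of times (zero allowed), then d or e.
pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

def matches_rule(s):
    """Return True if the whole string s follows the rule."""
    return pattern.fullmatch(s) is not None

# Hypothetical sample strings, two that follow the rule and two that do not
samples = ['aaaabbbbbccccd', 'abbbbbe', 'abbbbbcd', 'bbbbbccd']
for s in samples:
    print(s, matches_rule(s))
```

The third sample fails because c appears an odd number of times, and the fourth because there is no leading a.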
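2. Common regular expression symbols

Symbol	Meaning
*	Matches the preceding element 0 or more times
+	Matches the preceding element 1 or more times
[]	Matches any one character listed inside the brackets
()	Groups a subexpression
{m,n}	Matches the preceding element between m and n times (inclusive)
[^]	Matches any one character not listed inside the brackets
|	Matches either of the alternatives separated by the vertical bar
.	Matches any single character (by default, except a newline)
^	Anchors the match at the start of the string
\	Escape character
$	Anchors the match at the end of the string
?!	Negative lookahead, written (?!...): the text that follows must not match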
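A few of the symbols in the table can be tried out directly in Python. The sample strings below are invented for illustration, and the snippet only covers character sets, {m,n}, the anchors, and the negative lookahead.

```python
import re

# [abc] matches one character from the set; [^abc] matches one character outside it.
in_set = re.findall(r'[abc]', 'cabbage')
not_in_set = re.findall(r'[^abc]', 'cabbage')

# {m,n}: the preceding element repeated between m and n times.
two_to_three = bool(re.fullmatch(r'b{2,3}', 'bbb'))
too_many = bool(re.fullmatch(r'b{2,3}', 'bbbb'))

# ^ and $ anchor the pattern to the start and end of the string.
anchored = bool(re.search(r'^cat$', 'concat'))

# (?!...) negative lookahead: 'foo' only when it is not followed by 'bar'.
lookahead = [m.start() for m in re.finditer(r'foo(?!bar)', 'foobar foobaz')]

print(in_set, not_in_set, two_to_three, too_many, anchored, lookahead)
```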

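3. Other notes
Regular expression syntax is not universal! The flavors used by Python and Java, for example, differ in some details.
Example: [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net) is a (simplified) regular expression for email addresses.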
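To see what the email pattern accepts and rejects, it can be wrapped in a small helper. The function name looks_like_email and the test addresses are hypothetical; the pattern itself is copied from the text.

```python
import re

# Email pattern from the text, compiled once as a raw string
email_pattern = re.compile(r'[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)')

def looks_like_email(s):
    """Return True if the whole string s matches the simplified email pattern."""
    return email_pattern.fullmatch(s) is not None

# Hypothetical addresses for illustration
for address in ['ryan.mitchell+test@example.com',
                'user@sub.example.com',
                'user@example.io']:
    print(address, looks_like_email(address))
```

The second and third addresses are rejected, which shows how restrictive the pattern is: it allows only a single domain label made of letters and only four top-level domains.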

2. Using BeautifulSoup

1. Example code

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('https://pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# Find every <img> whose src matches the product-image path pattern.
# A raw string keeps the backslashes intact for the regex engine.
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])

Output:

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg

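Explanation: the product images are found directly by their file path. The re module and re.compile are the two key pieces here: passing a compiled pattern as an attribute value tells find_all to keep only the tags whose attribute matches it.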
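The same idea can be tried without a network connection. The sketch below assumes a small hand-written HTML fragment that only imitates the structure of the example page; it is not the real page content.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the example page's structure
html = '''
<html><body>
<img src="../img/gifts/logo.jpg" style="float:left;">
<table>
<tr><td><img src="../img/gifts/img1.jpg"></td></tr>
<tr><td><img src="../img/gifts/img2.jpg"></td></tr>
</table>
</body></html>
'''

bs = BeautifulSoup(html, 'html.parser')
# Compiled pattern: only paths of the form ../img/gifts/img<something>.jpg
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')
srcs = [img['src'] for img in bs.find_all('img', {'src': pattern})]
print(srcs)
```

The logo image is skipped because its file name does not start with img, so only the two product images remain.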

3. Getting attributes

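To get all attributes of a tag, run print(bs.html.body.img.attrs); the result is {'src': '../img/gifts/logo.jpg', 'style': 'float:left;'}
To get the value of one particular attribute, run print(bs.html.body.img.attrs['src']); the result is ../img/gifts/logo.jpg
The general forms are myTag.attrs and myTag.attrs['shuxing'] (shuxing, "attribute", stands for the attribute name): the former returns a dictionary object, and the latter extracts a single value from it.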
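Here is a minimal sketch of attribute access on a single hand-written tag (the HTML string is an assumption modeled on the logo image above). It also shows Tag.get, which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document modeled on the logo image in the example
html = '<img src="../img/gifts/logo.jpg" style="float:left;">'
tag = BeautifulSoup(html, 'html.parser').img

attrs = tag.attrs          # all attributes as a dictionary
src = tag.attrs['src']     # one attribute value; raises KeyError if absent
alt = tag.get('alt')       # safe lookup; None because there is no alt attribute

print(attrs)
print(src)
print(alt)
```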

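4. Lambda expressions

A lambda expression is essentially an anonymous function, and it can be passed as an argument to another function.
Example code:

print(bs.find_all(lambda tag: len(tag.attrs) == 2))
# Gets all tags that have exactly two attributes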

Output:

[<img src="../img/gifts/logo.jpg" style="float:left;"/>, <tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>, <tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>, <tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>, <tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>, <tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>]
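The two-attribute filter can be reproduced on a much smaller document. The HTML below is a hand-written fragment in the style of the page, and has_two_attrs is a hypothetical named function that does the same job as the lambda.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one logo image and one gift row
html = '''
<img src="../img/gifts/logo.jpg" style="float:left;">
<table>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td><img src="../img/gifts/img1.jpg"></td></tr>
</table>
'''
bs = BeautifulSoup(html, 'html.parser')

def has_two_attrs(tag):
    """Filter function: takes a tag, returns True when it has exactly two attributes."""
    return len(tag.attrs) == 2

by_function = bs.find_all(has_two_attrs)
by_lambda = bs.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in by_function]
print(names)
```

A named function and a lambda are interchangeable here; the lambda is just shorter when the condition fits on one line.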

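Run: print(bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?'))
Output: [<span class="excitingNote">Or maybe he's only resting?</span>]
Run: print(bs.find_all('', text='Or maybe he\'s only resting?'))
Output: ["Or maybe he's only resting?"]
The book says these two calls have the same effect, but the results are actually different: the lambda version returns the matching tag, while the text= version returns only the matching string. (In recent BeautifulSoup versions the text argument is also available under the name string.)
The book's explanation: BeautifulSoup lets you pass certain kinds of functions as arguments to find_all. The only restriction is that the function must take a tag object as its argument and return a Boolean. In other words, each tag is fed to the function; tags for which it returns True are kept, and the rest are discarded.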
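The difference between the two calls is easier to see by checking the types of the results. The sketch below assumes a hand-written table cell modeled on the page, and uses the string= argument name, which newer BeautifulSoup versions accept for the same purpose as text=.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical cell modeled on the "Dead Parrot" row of the example page
html = ('<td>This is an ex-parrot! '
        '<span class="excitingNote">Or maybe he\'s only resting?</span></td>')
bs = BeautifulSoup(html, 'html.parser')
target = "Or maybe he's only resting?"

# Function filter: returns the tags whose full text equals the target
by_lambda = bs.find_all(lambda tag: tag.get_text() == target)
# String filter: returns the matching text nodes, not the tags
by_string = bs.find_all(string=target)

print(type(by_lambda[0]).__name__, type(by_string[0]).__name__)
# The enclosing tag of a matched string is reachable through .parent
print(by_string[0].parent.name)
```

If the tag is what you need, either use the function filter or take .parent of each matched string.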

PS: That's all for today. I've been on a diet lately, so wish me luck! And wish me a smooth graduation!