Chapter 2: Advanced HTML Parsing

code_mryxj

Categories: Web Scraping Techniques and Tools, Python Syntactic Sugar

Article link: https://blog.csdn.net/yexiaohhjk/article/details/64245256

With a BeautifulSoup object, we can use the findAll() function to extract information from tags.
For example, to extract only the text inside <span class="green"></span> tags on this page:

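from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bs0bj = BeautifulSoup(html, "html.parser")
# Collect every <span class="green"> tag and print only its text
nameList = bs0bj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())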

get_text(): strips every tag from the HTML you are working with and returns a string containing only the text.
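
To make the effect of get_text() concrete, here is a minimal sketch that runs offline. The HTML string is a made-up stand-in for a downloaded page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page
html = '<p>Hello, <span class="green">Anna <b>Pavlovna</b></span>!</p>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", {"class": "green"})
print(span)              # the tag with its markup intact
span_text = span.get_text()
print(span_text)         # only the text; the nested <b> tag is gone
page_text = soup.get_text()
print(page_text)         # text of the whole document
```

Because get_text() throws away the tag structure, it is usually best to call it last, after all searching is done.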

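find() and findAll():

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

The limit argument applies only to findAll(); find() behaves like findAll() with limit set to 1.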

tag: you can pass a single tag name, or a Python list of tag names.
e.g.:

bs0bj.findAll(["h1", "h2", "h3", "h4", "h5", "h6"])

returns a list of all the heading tags in the HTML document.
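
A small offline sketch of passing a list of tag names. The HTML is invented for illustration, and find_all is used here as the snake_case spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical document with three headings and one paragraph
html = "<h1>Title</h1><p>Intro</p><h2>Section</h2><h3>Subsection</h3>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any tag whose name is in the list
headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
heading_names = [tag.name for tag in headings]
print(heading_names)
```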

attributes: a Python dictionary of a tag's attributes and their corresponding values.
e.g.:

bs0bj.findAll("span", {"class": {"green", "red"}})

returns every span tag in the HTML document whose class is green or red.
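
The attribute filter can be tried on a tiny made-up document. This sketch passes the accepted class values as a list (an assumption chosen for the example; any of the listed values counts as a match).

```python
from bs4 import BeautifulSoup

# Hypothetical spans in three colours
html = (
    '<span class="green">Anna</span>'
    '<span class="red">Well, Prince</span>'
    '<span class="blue">Genoa</span>'
)
soup = BeautifulSoup(html, "html.parser")

# Keep spans whose class is green or red; the blue one is skipped
spans = soup.find_all("span", {"class": ["green", "red"]})
span_texts = [s.get_text() for s in spans]
print(span_texts)
```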

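recursive: a Boolean. If True (the default), findAll() searches all descendants of the tag it is called on: its children, their children, and so on. If False, it looks only at the direct children, that is, the top-level tags when called on the document. You rarely need to set this argument.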
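
The difference is easiest to see on a nested structure. A minimal sketch with an invented HTML fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> directly under <div>, one nested deeper
html = "<div><p>top</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

all_p = div.find_all("p")                      # searches every descendant
direct_p = div.find_all("p", recursive=False)  # direct children only
print(len(all_p), len(direct_p))
```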

text: matches on the text content of tags rather than on their attributes.
For example, to count how many times "the prince" appears as tag text on the earlier page:

nameList = bs0bj.findAll(text="the prince")
print(len(nameList))
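
An offline sketch of text matching, using an invented fragment. It uses the string argument, which is the name Beautiful Soup 4.4 and later give to text. A plain string must match the whole text node, while a compiled regex matches anywhere inside it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two exact matches and one near miss
html = (
    "<p><span>the prince</span> met <span>the prince</span>"
    " and <span>the princess</span></p>"
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="the prince")          # whole-string equality
partial = soup.find_all(string=re.compile("prince"))  # substring via regex
print(len(exact), len(partial))
```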

keyword: lets you select tags that have a specified attribute.
e.g.:

allText = bs0bj.findAll(id="text")
print(allText[0].get_text())

A note on the keyword argument:
it is technically a redundant BeautifulSoup feature.
For example, the following two lines are equivalent:

bs0bj.findAll(id="text")
bs0bj.findAll("", {"id": "text"})
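
The equivalence can be checked on a small made-up document. This sketch passes the dictionary through the attrs parameter rather than an empty tag name, and also shows class_, the keyword spelling needed because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one matching id
html = (
    '<div id="text"><span class="green">Anna</span></div>'
    '<div id="other"></div>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(id="text")
by_dict = soup.find_all(attrs={"id": "text"})
print(by_keyword == by_dict)  # both searches return the same tags

# "class" is reserved in Python, so the keyword form is class_
by_class = soup.find_all(class_="green")
print(by_class[0].get_text())
```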

Working with child tags:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs0bj = BeautifulSoup(html, "html.parser")
# .children yields only the direct children of the giftList table
for child in bs0bj.find("table", {"id": "giftList"}).children:
    print(child)
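
For comparison, a minimal offline sketch contrasting .children with .descendants. The table below is an invented, whitespace-free stand-in for the gift list.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table with no whitespace between tags
html = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Vegetable Basket</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children only: the two rows
child_names = [c.name for c in table.children]
# Every nested tag; text nodes have name None and are filtered out
descendant_names = [d.name for d in table.descendants if d.name]
print(child_names, descendant_names)
```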

Working with sibling tags:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs0bj = BeautifulSoup(html, "html.parser")
# Start at the first row (the header) and walk the rows that follow it
for sibling in bs0bj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
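
A small offline sketch of the same idea, with an invented table. Note that next_siblings never includes the tag it starts from, so the header row is skipped. On real pages the whitespace between tags also shows up as sibling text nodes, which this compact example avoids.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row followed by two gift rows
html = (
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("table", {"id": "giftList"}).tr
following_ids = [row["id"] for row in first_row.next_siblings]
print(following_ids)
```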

Working with parent tags:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs0bj = BeautifulSoup(html, "html.parser")
# From the image, go up to its parent cell, then back one cell to the price
print(bs0bj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
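
The navigation chain can be followed step by step on a made-up row that mimics the layout assumed above (price cell immediately before the image cell):

```python
from bs4 import BeautifulSoup

# Hypothetical row: name, price, then a cell holding the image
html = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = img.parent                    # the <td> that wraps the image
price_cell = image_cell.previous_sibling   # the <td> just before it
price = price_cell.get_text()
print(price)
```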

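Regular expressions:

You can test regular expressions online on sites such as Regexpal.
A classic practical use of regular expressions is recognizing email addresses.
e.g.: [A-Za-z0-9\.+]+@[A-Za-z]+\.(com|org|edu|net)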

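* - repeats any number of times, including zero
| - alternation (or)
+ - repeats at least once
[] - matches any one of the characters inside
() - grouping; in regex rules, grouped expressions are evaluated first
{m,n} - repeats between m and n times
[^] - matches any single character not in the brackets
. - matches any single character
^ - marks the start of the string
\ - escape character
$ - marks the end of the string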
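
The email pattern can be tried with Python's re module. This is a rough sketch, assuming the domain part is meant to be [A-Za-z]+ and using invented sample addresses; it illustrates the symbols listed above and is not a complete email validator.

```python
import re

# Local part, "@", a letters-only domain, then one of four top-level domains
email_pattern = re.compile(r"[A-Za-z0-9\.+]+@[A-Za-z]+\.(com|org|edu|net)")

# Hypothetical candidates: one valid, one with an unlisted TLD, one with no "@"
candidates = ["first.last+tag@example.com", "user@example.io", "no-at-sign.com"]

# fullmatch requires the whole string to fit the pattern
results = [bool(email_pattern.fullmatch(c)) for c in candidates]
print(results)
```
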
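Regular expressions can be used inside BeautifulSoup searches to make them more efficient, e.g. images = bs0bj.findAll("img", {"src": re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")})
A regex can be passed as any argument of a BeautifulSoup expression.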
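
As a hedged illustration of passing a regex in different argument positions, the sketch below uses one as the tag name and one as an attribute value. The HTML is invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with headings and two images
html = (
    "<h1>Gifts</h1><h2>Baskets</h2><p>text</p>"
    '<img src="../img/gifts/img1.jpg"/>'
    '<img src="../img/logo.jpg"/>'
)
soup = BeautifulSoup(html, "html.parser")

# Regex as the tag name: any of h1..h6
heading_names = [t.name for t in soup.find_all(re.compile(r"^h[1-6]$"))]

# Regex as an attribute value: only product images under ../img/gifts/
gift_pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
gift_srcs = [img["src"] for img in soup.find_all("img", {"src": gift_pattern})]
print(heading_names, gift_srcs)
```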

The following code prints the relative paths of the images, all of which start with ../img/gifts/img.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs0bj = BeautifulSoup(html, "html.parser")
# A raw string keeps the backslashes in the pattern intact
images = bs0bj.findAll("img", {"src": re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")})
for image in images:
    print(image["src"])

.attrs

For a tag object, .attrs gives you all of its attributes as a dictionary.
imgTag.attrs["src"] therefore returns the image's source location.
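
A minimal offline sketch of attribute access. The tag and its attribute values are made up, and img_tag plays the role of imgTag above.

```python
from bs4 import BeautifulSoup

# Hypothetical image tag with three attributes
html = '<img src="../img/gifts/img1.jpg" alt="basket" class="gift big"/>'
soup = BeautifulSoup(html, "html.parser")
img_tag = soup.find("img")

print(img_tag.attrs)           # all attributes as a dictionary
src = img_tag.attrs["src"]     # dictionary lookup
same_src = img_tag["src"]      # shorthand for the same lookup
classes = img_tag.attrs["class"]  # class is multi-valued, so it is a list
missing = img_tag.get("title")    # .get() returns None instead of raising
```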

Other HTML parsing modules:

lxml - low-level; most of its source code is written in C, so it is very fast
html.parser - Python's built-in parser
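
BeautifulSoup itself can sit on top of either parser. The sketch below is one possible way to prefer lxml when it is installed and fall back to the built-in parser otherwise; make_soup is a hypothetical helper name, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Parse with lxml if available, else with Python's built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")

# Hypothetical, deliberately unclosed markup
soup = make_soup("<p>Unclosed paragraph<b>bold")
bold_text = soup.find("b").get_text()
print(bold_text)
```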
