Study notes on "Web Scraping with Python" (《python网络数据采集》), part 1: BeautifulSoup installation and HTML parsing

xiaotong_cloud

Category: python学习 (Python study)

Article link: https://blog.csdn.net/huxiaotong_exp/article/details/82730760

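1. Installation and deployment

BeautifulSoup is published on PyPI as the beautifulsoup4 package (for example, pip install beautifulsoup4) and is imported in code as bs4.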

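2. Exception handling

urlopen() raises HTTPError when the server returns an HTTP error, and URLError when the server cannot be reached.
Accessing a tag that does not exist returns None; accessing a child or attribute of that None then raises AttributeError.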

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Network-level failures: HTTP error status or unreachable server
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    # Parsing failures: bsObj.body is None when the page has no <body>,
    # so bsObj.body.h1 raises AttributeError
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
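
The AttributeError case can be reproduced without any network access. The following is a minimal sketch, assuming a made-up inline HTML string (not one of the book's pages) that has a body but no h1:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document used only for illustration
html = "<html><body><p>No heading here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Accessing a tag that does not exist returns None rather than raising
missing = soup.body.h1

# Calling anything on that None is what raises AttributeError
try:
    soup.body.h1.get_text()
    raised = False
except AttributeError:
    raised = True

print(missing, raised)
```

This is why getTitle() above can still return None even when no exception is raised, and why the caller checks the result before using it.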

3. HTML parsing

1 find() and findAll()

.findAll(tag, attributes, recursive, text, limit, keywords)
.find(tag, attributes, recursive, text, keywords)

tag: a single tag name, or a collection of several tag names

.findAll({"h1","h2","h3","h4","h5","h6"})

attributes: a Python dictionary holding a tag's attributes and the values to match for each

.findAll("span",{"class":{"red","green"}})
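
As a rough illustration of the tag and attributes arguments, the sketch below runs both kinds of filter against a small made-up HTML string. It uses find_all, the current bs4 spelling of the same method as findAll; the document content and variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document used only for illustration
html = """
<html><body>
<h1>Title</h1>
<h2>Subtitle</h2>
<span class="red">Anna</span>
<span class="green">the prince</span>
<span class="blue">narrator</span>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# tag: a list of names matches any of them, in document order
headings = soup.find_all(["h1", "h2", "h3"])
heading_names = [tag.name for tag in headings]

# attributes: a list of values matches a tag carrying any one of them
colored = soup.find_all("span", {"class": ["red", "green"]})
colored_text = [tag.get_text() for tag in colored]

print(heading_names)
print(colored_text)
```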

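recursive: a boolean that controls whether the search recurses. When it is False, only the direct children of the object being searched (for the document object, its top-level tags) are examined. The default is True.
text: matches on the text content of tags rather than on their attributes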


nameList = bsObj.findAll(text="the prince")
print(len(nameList))
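
The effect of recursive and of text matching is easier to see on a tiny tree. This is a sketch under assumptions: the HTML is invented, and the text filter is written as string=, which is the name newer bs4 releases (4.4.0 and later) use for the same argument.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document used only for illustration
html = "<div id='root'><p>top</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", {"id": "root"})

# Default (recursive=True): every <p> below root, at any depth
all_p = root.find_all("p")

# recursive=False: only <p> tags that are direct children of root
direct_p = root.find_all("p", recursive=False)

# Text matching returns the matching strings, not the enclosing tags
matches = soup.find_all(string="top")

print(len(all_p), len(direct_p), len(matches))
```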

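limit: keeps only the first limit results; find() is effectively findAll() with limit=1
keywords: selects tags that have a given attribute value. It is technically redundant, because the same filter can be written with the attributes dictionary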

# The following calls return the same result
allText1 = bsObj.findAll(id="text")
allText2 = bsObj.findAll("",{"id":"text"})
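
A small offline sketch of limit and keyword arguments follows. The HTML is made up, and the dictionary form is passed through the attrs= parameter here instead of an empty tag name. Because class is a reserved word in Python, bs4 accepts class_ as the keyword for that attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document used only for illustration
html = '<ul><li id="text">a</li><li class="item">b</li><li class="item">c</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# limit: stop after the first two matches
first_two = soup.find_all("li", limit=2)

# find() returns the same element as the first result of find_all(limit=1)
first = soup.find("li")
same_element = first is soup.find_all("li", limit=1)[0]

# Keyword form and dictionary form select the same tags
by_keyword = soup.find_all(id="text")
by_dict = soup.find_all(attrs={"id": "text"})

# class is reserved in Python, so the keyword is spelled class_
items = soup.find_all(class_="item")

print(len(first_two), same_element, by_keyword == by_dict, len(items))
```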

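2 BeautifulSoup objects

BeautifulSoup object: the parsed document as a whole
Tag object: a single tag
NavigableString object: the text inside a tag
Comment object: a comment in the HTML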
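
The four object types can be inspected directly. The sketch below parses a one-line made-up snippet and checks the type of each node; the snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical inline snippet: a paragraph holding text and a comment
html = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p                               # a Tag
text_node, comment_node = p.contents     # a NavigableString and a Comment

print(type(soup).__name__)          # the BeautifulSoup document object
print(type(p).__name__)
print(type(text_node).__name__)
print(type(comment_node).__name__)
```

Comment is a subclass of NavigableString, so code that needs only real text usually has to exclude comments explicitly.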

3 Navigating the tree

Child tags: .children

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)
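
To see what .children yields without fetching the page, here is a sketch on an invented table. It also contrasts .children with .descendants, which is not covered in these notes but is the bs4 attribute for walking every level below a tag.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical inline table, written without whitespace between tags
html = "<table id='giftList'><tr><th>Item</th></tr><tr><td>Vase</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: only the direct children of the table
children = [child.name for child in table.children]

# .descendants: every node below the table; keep tags, skip text nodes
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

print(children)
print(descendants)
```

On a real page, the whitespace between tags also shows up as text nodes among the children, which is why the loop above prints blank-looking entries.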

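Sibling tags

next_siblings: all siblings after the tag
previous_siblings: all siblings before the tag
next_sibling: the single sibling immediately after the tag
previous_sibling: the single sibling immediately before the tag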

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)
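
The sibling attributes can be tried on a small invented table. In this sketch the row ids are hypothetical, and the HTML has no whitespace between rows, so every sibling is a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical inline table; ids are made up for the example
html = (
    "<table id='giftList'>"
    "<tr id='header'><th>Item</th></tr>"
    "<tr id='gift1'><td>Vase</td></tr>"
    "<tr id='gift2'><td>Lamp</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
header = soup.find("table", {"id": "giftList"}).tr

# next_siblings excludes the tag itself, so the header row is skipped
following = [row["id"] for row in header.next_siblings]

# The singular forms return one node, or None at the edge of the list
last = soup.find("tr", {"id": "gift2"})
before_last = last.previous_sibling["id"]
before_header = header.previous_sibling

print(following, before_last, before_header)
```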

Parent tags

parent: the direct parent of a tag
parents: an iterator over all of the tag's ancestors

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")

print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
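
The chained call above goes from the image to its table cell, then to the cell before it, and reads that cell's text. A sketch of the same walk on an invented single-row table (the price value is made up):

```python
from bs4 import BeautifulSoup

# Hypothetical inline row: name cell, price cell, image cell
html = (
    "<table><tr>"
    "<td>Vase</td><td>$15.00</td>"
    "<td><img src='../img/gifts/img1.jpg'></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

cell = img.parent                          # the <td> holding the image
price = cell.previous_sibling.get_text()   # text of the cell before it

# .parents walks upward from the direct parent to the document object
ancestors = [a.name for a in img.parents]

print(cell.name, price, ancestors)
```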

4 Using regular expressions

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html,"html.parser")
imgs = bsObj.findAll("img",{"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})
print(imgs)
for img in imgs:
    print(img["src"])
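
The same pattern can be checked offline. This sketch assumes a made-up snippet with three images, one of which (logo.png, a hypothetical file name) should not match:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline snippet used only for illustration
html = (
    "<div>"
    "<img src='../img/gifts/img1.jpg'>"
    "<img src='../img/gifts/logo.png'>"
    "<img src='../img/gifts/img2.jpg'>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# A compiled pattern as an attribute value keeps tags whose src matches it
srcs = [img["src"] for img in soup.find_all("img", {"src": pattern})]
print(srcs)
```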

5 Getting attributes

myImgTag.attrs["src"]
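
As a brief sketch of attribute access, the example below reads attributes from an invented img tag. The attrs dictionary, subscript access on the tag, and .get() are all part of the bs4 Tag API; the tag content is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical inline tag used only for illustration
html = '<img src="../img/gifts/img1.jpg" alt="Vase" class="thumb large">'
soup = BeautifulSoup(html, "html.parser")
img = soup.img

src_from_attrs = img.attrs["src"]   # attrs is a plain dictionary
src_from_tag = img["src"]           # shorthand for the same lookup
title = img.get("title")            # None when the attribute is absent
classes = img.attrs["class"]        # class is multi-valued, so it is a list

print(src_from_attrs, title, classes)
```

Subscript access raises KeyError for a missing attribute, so .get() is the safer choice when an attribute may be absent.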

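6 Lambda expressions

A lambda expression can be passed as an argument to findAll(). BeautifulSoup evaluates it against every tag object and keeps the tags for which it returns a true value.
The lambda expression must satisfy two conditions:

It takes a tag object as its only argument
It returns a boolean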
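
A minimal sketch of such a filter, assuming an invented snippet: the lambda keeps every tag that has exactly two attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet used only for illustration
html = (
    "<div>"
    '<span class="red" id="a">x</span>'
    "<p>y</p>"
    '<a href="/" title="home">z</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# The lambda receives each tag and returns True for the ones to keep
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attrs]
print(names)
```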
