Python 3 Web Crawler: urlopen, BeautifulSoup, regex, and lambda

一勺秋水

Article link: https://blog.csdn.net/hush_quiet/article/details/124417590

Preface

This post still has unresolved questions. Remaining: 2

Always wrap crawler code in try-except.

When in doubt, read the docs!

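Common error types

from urllib.error import HTTPError: the server responded with an error status, for example the page was not found
from urllib.error import URLError: the server could not be reached, for example it does not exist

HTTPError is a subclass of URLError, so put except HTTPError first; otherwise the except URLError clause would catch both.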
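The ordering rule can be checked without touching the network. The sketch below is only an illustration: classify is a hypothetical helper, and the two exceptions are built by hand (the URL and messages are made up) rather than raised by urlopen.

```python
import io
from urllib.error import HTTPError, URLError


def classify(exc):
    """Raise exc and report which except clause catches it."""
    try:
        raise exc
    except HTTPError:
        # Must come first: an HTTPError is also a URLError
        return "HTTPError"
    except URLError:
        return "URLError"


is_subclass = issubclass(HTTPError, URLError)

# Hand-built exceptions; nothing is fetched
http_err = HTTPError("http://example.com/missing", 404, "Not Found", None, io.BytesIO(b""))
url_err = URLError("name resolution failed")

http_label = classify(http_err)
url_label = classify(url_err)
print(is_subclass, http_label, url_label)
```

If the two except clauses were swapped, both exceptions would be reported as URLError.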

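Basic pattern

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("https://baidu.com")
except HTTPError as e:
    # the server answered with an error status
    print(e)
    # in a function, return None here; in a loop, break
except URLError as e:
    # the server could not be reached
    print(e)
else:
    print('work well')
# the else clause runs only when the try block raised no exception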

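When a BeautifulSoup tag is None

Accessing a tag that does not exist, such as bs.nonExistTag, returns None, so chaining another access onto it, as in bs.nonExistTag.tag, raises AttributeError. Check for a None tag before going further.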

from bs4 import BeautifulSoup

bs = BeautifulSoup(html, 'html.parser')

try:
    badContent = bs.tag.othertag
except AttributeError as e:
    # bs.tag was None, so .othertag could not be looked up
    print("Tag not found!")
else:
    if badContent is None:
        # bs.tag exists but has no othertag child
        print('Subtag not found')
    else:
        print('work well')

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        # bs.body was None
        print(e)
        return None
    return title
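The same idea can be tried on an inline HTML string, with no network access. This is a minimal sketch under stated assumptions: the HTML fragment is made up, and safe_child_text is a hypothetical helper that uses explicit None checks instead of catching AttributeError.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1></body></html>"
bs = BeautifulSoup(html, "html.parser")


def safe_child_text(soup, parent_name, child_name):
    """Return the child's text, or None if the parent or the child is missing."""
    parent = soup.find(parent_name)
    if parent is None:
        return None
    child = parent.find(child_name)
    if child is None:
        return None
    return child.get_text()


# Chained attribute access on a missing tag fails, because bs.table is None
try:
    bs.table.tr
    chained_ok = True
except AttributeError:
    chained_ok = False

print(safe_child_text(bs, "body", "h1"))   # parent and child both exist
print(safe_child_text(bs, "table", "tr"))  # parent missing
print(safe_child_text(bs, "body", "h2"))   # child missing
print(chained_ok)
```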

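Basic BeautifulSoup syntax

find_all

find_all (alias: findAll) searches for every tag that matches the criteria and returns a ResultSet, which can be treated roughly as a list. find, by contrast, returns a single tag.
Difference between find_all and find:
find_all takes a limit parameter that caps the number of results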

find_all(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

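nameList = bs.findAll('div', {'class': 'home'})
for name in nameList:
    print(name.get_text())
    # print(name.text)  # .text is a property, not a method

bs.find_all(['h1', 'h2', 'h3'], {'class': ['home', 'outside']})
# keyword form of the same filter: class_=['home', 'outside']
# a list matches any one of its items

# recursive: boolean, whether to search recursively; defaults to True, so descendants are searched too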
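To see these arguments side by side, here is a small sketch on a made-up HTML fragment (the class names and text are assumptions chosen for illustration). It exercises find versus find_all, limit, list filters, and recursive=False.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="home"><span>Anna</span></div>'
    '<div class="home">Bob</div>'
    '<div class="outside">Carl</div>'
    '<h1 class="home">Heading</h1>'
)
bs = BeautifulSoup(html, "html.parser")

# find_all returns every match; find returns only the first one
home_divs = bs.find_all("div", {"class": "home"})
names = [div.get_text() for div in home_divs]
first = bs.find("div", {"class": "home"})

# limit caps the number of results
limited = bs.find_all("div", limit=1)

# a list of names or values matches any one of its items
either = bs.find_all(["div", "h1"], class_=["home", "outside"])

# recursive=False looks only at direct children, so the nested <span> is skipped
top_level_spans = bs.find_all("span", recursive=False)
all_spans = bs.find_all("span")

print(names, first.get_text(), len(limited), len(either))
print(len(top_level_spans), len(all_spans))
```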

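bs.find_all(id='text') and bs.find_all('', {'id': 'text'}) should be equivalent.

Question: I don't understand why an empty string is passed in the second form. When I tried it, it returned an empty result.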
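One way to sidestep the empty-string question is to leave the tag name out entirely. The sketch below, on a made-up HTML fragment, relies on find_all accepting attribute filters either as keyword arguments or through the attrs dictionary; it does not try to explain what the empty string does.

```python
from bs4 import BeautifulSoup

html = '<p id="text">first</p><div id="other">second</div><span id="text">third</span>'
bs = BeautifulSoup(html, "html.parser")

# Two spellings of the same filter, neither of which names a tag
by_keyword = bs.find_all(id="text")
by_attrs = bs.find_all(attrs={"id": "text"})

keyword_names = [tag.name for tag in by_keyword]
attrs_names = [tag.name for tag in by_attrs]
print(keyword_names, attrs_names)
```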

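Searching by text

nameList = bs.find_all(text='the prince')
# returns a list of the matching text nodes, here ['the prince', 'the prince']
# (newer Beautiful Soup versions call this argument string)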
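A short sketch of the text search on a made-up sentence. It uses the string argument, which is the newer name for text, and shows that only text nodes equal to the whole search string are matched.

```python
from bs4 import BeautifulSoup

html = (
    '<p><span class="green">the prince</span> spoke, then '
    '<span class="green">the prince</span> left.</p>'
)
bs = BeautifulSoup(html, "html.parser")

# Matches text nodes whose entire content equals the given string
matches = bs.find_all(string="the prince")
as_text = [str(node) for node in matches]

# A partial string is not enough for a match
partial = bs.find_all(string="prince")

print(as_text, len(partial))
```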

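Descendants and siblings

Every child is a descendant, but not every descendant is a child.

Use .children on the result of find(), not find_all(), because the two return different types: find() returns a single Tag, while find_all() returns a ResultSet.

for item in bs.find(...).children:
    pass  # .children is an iterator, so walk it with a for loop

.next_siblings and .previous_siblings likewise work only on the result of find(); they produce iterators that do not include the tag itself.
.next_sibling and .previous_sibling give only the single node immediately after or before the tag (they are attributes, not functions).

bs.find(...).next_siblings and bs.find(...).next_sibling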
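The navigation attributes above can be compared on a tiny table. This is a sketch on made-up HTML; the fragment is written without whitespace between tags so that every child and sibling is a Tag (in real pages, whitespace text nodes usually appear in between).

```python
from bs4 import BeautifulSoup, Tag

html = (
    "<table>"
    "<tr id='h'><th>Item</th></tr>"
    "<tr id='r1'><td>Vase</td></tr>"
    "<tr id='r2'><td><b>Lamp</b></td></tr>"
    "</table>"
)
bs = BeautifulSoup(html, "html.parser")
table = bs.find("table")

# Children: only the direct <tr> rows
child_ids = [child["id"] for child in table.children]

# Descendants: every nested tag (text nodes filtered out here)
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

header = bs.find("tr", id="h")
last_row = bs.find("tr", id="r2")

# Plural forms are iterators that exclude the tag itself
after_header = [row["id"] for row in header.next_siblings]
before_last = [row["id"] for row in last_row.previous_siblings]

# Singular forms are attributes holding one node, or None at the edge
next_id = header.next_sibling["id"]
nothing_before = header.previous_sibling

print(child_ids, descendant_names)
print(after_header, before_last, next_id, nothing_before)
```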

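More on regular expressions

(?!...) means "does not contain" (negative lookahead). It is not used very often; knowing it exists is enough.

| has the lowest precedence of all operators and matches either its left side or its right side. It can be chained, as in apple|banana|carrot, which matches any of the three substrings.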
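Both constructs can be tried with the standard re module alone. The patterns and sample strings below are made up for illustration.

```python
import re

# Alternation: any one of the three substrings
fruit = re.compile(r"apple|banana|carrot")
found = fruit.findall("one apple, two carrots, a banana split")

# | binds loosest: gra|ey means "gra" or "ey", while gr(a|e)y means "gray" or "grey"
loose = re.fullmatch(r"gra|ey", "grey")
grouped = re.fullmatch(r"gr(a|e)y", "grey")

# Negative lookahead: accept a string only if no digit appears anywhere in it
no_digit = re.compile(r"^(?!.*\d).*$")
accepts_letters = no_digit.match("abc") is not None
accepts_digit = no_digit.match("abc1") is not None

print(found, loose, grouped is not None, accepts_letters, accepts_digit)
```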

Combining regular expressions with BeautifulSoup

Remember to import re first.

bs.find_all('img', {'src': re.compile(r'\.\./img/img\d?\.jpg')})
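Here is the same pattern applied to a handful of made-up img tags, as a sketch of which src values it keeps and which it skips.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<img src="../img/img1.jpg">'
    '<img src="../img/logo.png">'
    '<img src="../img/img.jpg">'
    '<img src="/static/img2.jpg">'
)
bs = BeautifulSoup(html, "html.parser")

# \d? allows zero or one digit after "img"
pattern = re.compile(r"\.\./img/img\d?\.jpg")
images = bs.find_all("img", {"src": pattern})
sources = [img["src"] for img in images]

print(sources)
```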

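Tag attributes

Tag.attrs returns a dictionary whose keys are the attribute names.

Tag.attrs['src'] or Tag['src']: make sure the key exists before using either form, otherwise a KeyError is raised.

Question: why don't Tag.attrs.get['class'] and Tag.get['class'] work? Is it because class is a keyword?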
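One possible reading of the question above is that get is a method, so it is called with parentheses rather than indexed with square brackets. The sketch below, on a made-up img tag, shows the attribute accessors side by side; with html.parser, class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

html = '<img src="a.jpg" class="thumb round">'
bs = BeautifulSoup(html, "html.parser")
tag = bs.find("img")

attr_names = sorted(tag.attrs.keys())

# Indexing works for keys that exist
src = tag["src"]

# Indexing a missing key raises KeyError
try:
    tag["alt"]
    missing_raises = False
except KeyError:
    missing_raises = True

# get() is called with parentheses and returns None for a missing key
alt = tag.get("alt")
classes = tag.get("class")
classes_via_attrs = tag.attrs.get("class")

print(attr_names, src, missing_raises, alt, classes)
```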

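lambda expressions: take a tag, return a boolean

bs.findAll(lambda tag: tag.get_text() == "I'm fine")

Invalid usage: bs.findAll(tag, lambda tag: tag.get('src') == 'listen')

The tag object is already received as the lambda's parameter, so it must not be passed as a separate argument.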
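A working form of both filters, sketched on a made-up fragment. To restrict the second filter to a particular tag name, the check goes inside the lambda instead of being passed as a separate argument.

```python
from bs4 import BeautifulSoup

html = "<p>I'm fine</p><img src='listen'><img src='other'>"
bs = BeautifulSoup(html, "html.parser")

# The lambda receives each tag and returns True to keep it
fine = bs.find_all(lambda tag: tag.get_text() == "I'm fine")
fine_names = [tag.name for tag in fine]

# Tag name and attribute are both tested inside the lambda
listen = bs.find_all(lambda tag: tag.name == "img" and tag.get("src") == "listen")
listen_sources = [tag["src"] for tag in listen]

print(fine_names, listen_sources)
```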
