Python Data Analysis: Parsing Web Pages with BeautifulSoup

Sweeney Chen

Article link: https://blog.csdn.net/weixin_41792682/article/details/89509456

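BeautifulSoup

Used to parse HTML or XML
Steps
1. Create a BeautifulSoup object
2. Query nodes
  
  find returns the first node that matches the conditions
  
  find_all returns all nodes that match the conditions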
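
As a minimal sketch of the two steps above, the snippet below builds an object from a small made-up HTML string (the markup and variable names are illustrative assumptions) and compares find with find_all, including what each returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Step 1: create the BeautifulSoup object from a small, made-up document
soup = BeautifulSoup("<ul><li>first</li><li>second</li></ul>", "html.parser")

# Step 2: query nodes
first_item = soup.find("li")          # only the first matching node
all_items = soup.find_all("li")       # every matching node, as a list-like result

# When nothing matches, find returns None and find_all returns an empty list
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(first_item.get_text())
print([item.get_text() for item in all_items])
print(missing_one, len(missing_all))
```

Because find can return None, it is usually worth checking the result before reading attributes from it.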

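Creating the object

Create a BeautifulSoup object
bs = BeautifulSoup(
    markup,                 # the document itself (a string, bytes, or an open file-like object), not a URL
    'html.parser',          # which parser to use
    from_encoding='utf-8'   # encoding of the document; must match the page's actual encoding
)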
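
The constructor call can be tried without any network access. The sketch below assumes a small UTF-8 encoded byte string stands in for a downloaded page (the sample markup is made up for illustration); from_encoding only matters when the input is bytes rather than an already-decoded string.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from a server; the content is a made-up sample
raw_bytes = "<html><head><title>Café menu</title></head><body></body></html>".encode("utf-8")

# Tell BeautifulSoup which encoding to use when decoding the bytes
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")

title_text = soup.title.get_text()
print(title_text)
```

If the declared encoding does not match the real one, non-ASCII characters such as the "é" here are the first to come out garbled.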

Finding nodes

<a href='a.html' class='a_link'>next page</a>

Nodes can be looked up by tag type, by attribute, or by text content
Find nodes by type
- bs.find_all('a')
Find nodes by attribute
- bs.find_all('a', href='a.html')
- bs.find_all('a', href='a.html', string='next page')
- bs.find_all('a', class_='a_link')
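
The queries listed above can be run against the sample link. In this sketch a second link (b.html, an assumption added only so the filters have something to exclude) is placed next to it.

```python
from bs4 import BeautifulSoup

# The sample link from the text plus one extra, hypothetical link
html = (
    "<a href='a.html' class='a_link'>next page</a>"
    "<a href='b.html' class='b_link'>previous page</a>"
)
bs = BeautifulSoup(html, "html.parser")

by_type = bs.find_all("a")                                     # every <a> tag
by_href = bs.find_all("a", href="a.html")                      # filter on an attribute
by_text = bs.find_all("a", href="a.html", string="next page")  # attribute plus exact text
by_class = bs.find_all("a", class_="a_link")                   # class_ avoids the reserved word "class"

print(len(by_type), len(by_href), len(by_text), len(by_class))
```

Note that string matches the tag's text exactly, so 'next page' (with a space) matches while 'next_page' would not.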

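Getting node information

node is a node that has already been found
node.name
- returns the node's tag name
node['href']
- returns the node's href attribute
node.get_text()
- returns the node's text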
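
A short sketch of these three accessors on the sample link. The node.get(...) call at the end is an addition not mentioned in the text: it returns None for a missing attribute, whereas node['...'] raises KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='a.html' class='a_link'>next page</a>", "html.parser")
node = soup.find("a")

tag_name = node.name           # tag name
href = node["href"]            # attribute value; raises KeyError if the attribute is absent
text = node.get_text()         # text inside the tag
missing = node.get("title")    # None, because the tag has no title attribute

print(tag_name, href, text, missing)
```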

Exception handling

Network resources and URLs change often,
so exceptions need to be handled

from bs4 import BeautifulSoup
from urllib import request

html = request.urlopen("http://www.baidu.com")
# Create the BeautifulSoup object
bs_obj = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
print("title tag: ", bs_obj.title)


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

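# Create the object
bs_obj = BeautifulSoup(html_doc, 'html.parser')

# Extract all links
print('Extract all links')
link_list = bs_obj.find_all('a')
for link in link_list:
    print(link.name, link['href'], link.get_text())

Output:
Extract all links
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie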

# Extract a single link, selected by id
print('Extract one link')
link = bs_obj.find('a', id='link2')
print(link.name, link['href'], link.get_text())

Output:
Extract one link
a http://example.com/lacie Lacie

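# A complete function that fetches a page's title

def get_html_title(url):
    """
        Return the <title> tag of the page at url, or None on failure.
    """
    try:
        # request comes from the earlier "from urllib import request"
        html = request.urlopen(url)
    except Exception:
        # The URL could not be opened (network error, bad URL, HTTP error, ...)
        return None
    
    try:
        bs_obj = BeautifulSoup(html.read(), 'html.parser')
        title = bs_obj.title
    except Exception:
        # The response could not be read or parsed
        return None
    
    return title

title = get_html_title("http://www.jd.com")
if title is not None:
    print(title)
else:
    print("Failed to get the title!")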

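
Catching the bare Exception class works, but it can also hide unrelated bugs. The following is one possible variant, not the author's code: the function name fetch_title_text, the timeout parameter, and the choice of exception classes are all assumptions made for illustration.

```python
from urllib import request
from bs4 import BeautifulSoup

def fetch_title_text(url, timeout=10):
    """Return the page title as a string, or None if it cannot be obtained."""
    try:
        # The timeout keeps a stalled server from blocking the script forever
        with request.urlopen(url, timeout=timeout) as response:
            markup = response.read()
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError is raised for malformed URLs such as "not-a-url"
        return None

    soup = BeautifulSoup(markup, "html.parser")
    if soup.title is None:
        # The page was fetched but contains no <title> tag
        return None
    return soup.title.get_text(strip=True)

# A malformed URL fails before any network access and yields None
bad_result = fetch_title_text("not-a-url")
print(bad_result)
```

Returning the title text rather than the tag object means callers always deal with either a plain string or None.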

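Advanced BeautifulSoup

Find nodes with CSS selectors or regular expressions
Save the parsed content
DOM tree structure
- children yields only the direct child nodes
- descendants yields all descendant nodes (children, grandchildren, and so on)
- next_siblings yields the sibling nodes that follow the node
- previous_siblings yields the sibling nodes that precede the node
- parent returns the parent node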
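
The navigation attributes above, together with CSS-style lookup through select, can be sketched on a small made-up fragment (the markup and variable names are assumptions). Text nodes also show up in these iterators, so the sketch keeps only Tag objects.

```python
from bs4 import BeautifulSoup, Tag

html = (
    "<div id='box'>"
    "<p class='story'>one <b>bold</b></p>"
    "<p class='story'>two</p>"
    "<span>three</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <p> whose class is "story"
css_texts = [p.get_text() for p in soup.select("p.story")]

box = soup.find("div", id="box")
first_p = box.find("p")
span = box.find("span")

# Direct children versus all descendants (tags only)
child_names = [c.name for c in box.children if isinstance(c, Tag)]
descendant_names = [d.name for d in box.descendants if isinstance(d, Tag)]

# Siblings after the first <p>, and siblings before the <span>
next_names = [s.name for s in first_p.next_siblings if isinstance(s, Tag)]
previous_names = [s.name for s in span.previous_siblings if isinstance(s, Tag)]

# Parent of the first <p>
parent_id = first_p.parent["id"]

print(css_texts, child_names, descendant_names, next_names, previous_names, parent_id)
```

The nested b tag appears among the descendants but not among the children, which is the practical difference between the two attributes.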

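Regular expressions

Simple string matching can be done with string methods; complex or fuzzy matching calls for regular expressions
A regular expression is a single string that describes, and matches, a whole set of strings following some syntactic rule
It acts as a logical formula for string operations
Commonly used to process text data
Matching process: characters from the expression and from the text are compared in turn; if every character matches, the match succeeds, otherwise it fails.
import re
pattern = re.compile('str') returns a pattern object; writing the pattern as a raw string, r'str', is recommended so that backslashes need no extra escaping
pattern.match() tries to match at the beginning of the string
Regular expression syntax: https://docs.microsoft.com/zh-cn/previous-versions/visualstudio/visual-studio-2008/ae5bf541(v=vs.90)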
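
A brief sketch of compiling a pattern, and of passing a compiled pattern to find_all as an attribute filter. The date pattern, sample strings, and sample links are assumptions chosen only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Raw string: backslashes reach the regex engine unchanged
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

at_start = date_pattern.match("2019-04-25 posted")     # match() only looks at the start
not_at_start = date_pattern.match("posted 2019-04-25")  # None: the date is not at the start
anywhere = date_pattern.search("posted 2019-04-25")     # search() scans the whole string

html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="/about" id="about">About</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern can be used wherever find_all accepts an attribute value
absolute_links = soup.find_all("a", href=re.compile(r"^http://example\.com/"))
numbered_ids = soup.find_all(id=re.compile(r"^link\d+$"))

print(at_start.group(), not_at_start, anywhere.group())
print([a["href"] for a in absolute_links], [a["id"] for a in numbered_ids])
```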
