Web Scraping Notes Series, Part 9: BeautifulSoup

Latest recommended article published on 2024-04-24 08:26:26

想offer的第n天

Category: Python web scraping. Tags: web scraping, beautifulsoup, python

Article link: https://blog.csdn.net/h91er/article/details/127398354

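Overview
1. About BeautifulSoup
  The package is imported as bs4 (BeautifulSoup 4).
2. What is BeautifulSoup?
  BeautifulSoup, like lxml, is an HTML parser. Its main job is to parse a document and extract data from it.
3. Pros and cons
  Con: it is slower than lxml.
  Pro: the API is designed to be friendly and is easy to use.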
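Installation and creating a soup object
1. Install
  pip install bs4
  (bs4 on PyPI is a thin wrapper package that pulls in beautifulsoup4; the 'lxml' parser used below also needs pip install lxml)
Import
from bs4 import BeautifulSoup
Create the object
From a server response
soup=BeautifulSoup(response.read().decode(),'lxml')
From a local file
soup=BeautifulSoup(open('1.html',encoding='utf-8'),'lxml')
Note: open() uses the platform default encoding (gbk on Chinese-locale Windows), so pass the file's encoding explicitly when opening it.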
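
The two ways of creating a soup object described above can be tried without a server or an existing file. The following is a minimal sketch, not the article's own code: it writes a small made-up page to a temporary file, and it uses Python's built-in 'html.parser' instead of 'lxml' only so that it runs without installing lxml. The file name sample.html and the variable names are assumptions for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# A tiny made-up page (assumption: stands in for a server response or 1.html)
html = (
    "<html><head><title>Page Title</title></head>"
    "<body><p id='p1'>呵呵呵</p></body></html>"
)

# 1) Build a soup directly from a string, as with a decoded server response
soup_from_str = BeautifulSoup(html, "html.parser")

# 2) Build a soup from a local file, stating the encoding explicitly
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

with open(path, encoding="utf-8") as f:  # do not rely on the platform default
    soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.title.string)
print(soup_from_file.p.get_text())
```

Opening the file in a with block also closes it once parsing is done, which the one-line open(...) form does not guarantee.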

Local practice code

<!DOCTYPE html>
<html>
<head>
    <meta charset='utf-8'>
    <meta http-equiv='X-UA-Compatible' content='IE=edge'>
    <title>Page Title</title>
    <meta name='viewport' content='width=device-width, initial-scale=1'>
    <link rel='stylesheet' type='text/css' media='screen' href='main.css'>
    <script src='main.js'></script>
</head>
<body>
    <div>
        <ul>
            <li id="l1">张三</li>
            <li id="l2">李四</li>
            <li>王五</li>
            <a href="" id="" class="a1">张张张</a>
            <span>hhh</span>
        </ul>
    </div>

    <a href="" title="a2">百度</a>
    
    <div id="d1">
        <span>
            hhh
        </span>
    </div>
    <p id="p1" class='p1'>呵呵呵·</p>
    
</body>
</html>

from bs4 import BeautifulSoup

# Walk through the basic bs4 syntax by parsing a local file

# Load the local file
soup=BeautifulSoup(open('C:/Users/86177/Desktop/tese/python/爬虫再学/bs4.html',encoding='utf-8'),'lxml')

# Look up a node by tag name

print(soup.a)        # returns only the first matching node
print(soup.a.attrs)  # the tag's attributes and their values, as a dict
print(soup.a.name)   # the tag's name, here 'a'


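# Some bs4 functions
# (1) find returns the first match
print(soup.find('a'))
print(soup.find('a',class_='a1'))  # filter by attribute; class_ has a trailing underscore because class is a Python keyword

# (2) find_all returns every match
# To match several tag names at once, pass them to find_all as a list
print(soup.find_all('a'))
print(soup.find_all(['a','span']))
print(soup.find_all('li',limit=2))  # limit=2 returns only the first two matches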
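
To make the difference between find and find_all concrete, here is a small self-contained sketch. The HTML string is a cut-down copy of the practice page, and 'html.parser' is used only so the snippet runs without lxml; the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Cut-down version of the practice page
html = """
<ul>
  <li id="l1">张三</li>
  <li id="l2">李四</li>
  <li>王五</li>
  <a href="" class="a1">张张张</a>
  <span>hhh</span>
</ul>
<a href="" title="a2">百度</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find: a single Tag, or None when nothing matches
first_a = soup.find("a")
by_class = soup.find("a", class_="a1")
by_attrs = soup.find("a", attrs={"title": "a2"})  # any attribute via a dict
missing = soup.find("table")

# find_all: always a list, possibly empty
all_a = soup.find_all("a")
a_and_span = soup.find_all(["a", "span"])    # several tag names at once
first_two_li = soup.find_all("li", limit=2)  # stop after two matches

print(first_a.get_text(), by_attrs.get_text(), missing)
print([t.name for t in a_and_span])
print([li.get_text() for li in first_two_li])
```

Because find can return None, it is worth checking the result before reading attributes or text from it.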

# (3) select (commonly used)
# 1. element
# select takes a CSS selector and returns a list containing every match
print(soup.select('a'))
# 2. .class
#    e.g. .firstname
#    a leading . selects by class (class selector)
print(soup.select('.a1'))
# 3. #id
#    e.g. #firstname
#    a leading # selects by id
print(soup.select('#l1'))
# 4. Attribute selectors
#    [attribute]
#        e.g. li=soup.select('li[class]')
#        find the li tags that have an id attribute
print(soup.select('li[id]'))
#    [attribute=value]
#        e.g. li=soup.select('li[class="hengheng1"]')
#        find the li tag whose id is l2
print(soup.select('li[id="l2"]'))
# 5. Hierarchy selectors
#    element element
#        e.g. div p
#        descendant selector
#        find every li anywhere under a div
print(soup.select('div li'))
#    element>element
#        e.g. div>p
#        child selector
print(soup.select('div > ul > li'))  # bs4 also finds these without the spaces, though in some setups the spaces are required
#    element,element
#        e.g. div,p
#            e.g. result=soup.select('a,span')
#        find every a tag and every li tag
print(soup.select('a,li'))
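
The selector forms listed above can be checked side by side on a small document. This is a minimal sketch under the same assumptions as before: a cut-down copy of the practice page and 'html.parser' in place of 'lxml'. It also uses select_one, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <ul>
    <li id="l1">张三</li>
    <li id="l2">李四</li>
    <li>王五</li>
    <a href="" class="a1">张张张</a>
    <span>hhh</span>
  </ul>
</div>
<a href="" title="a2">百度</a>
<div id="d1"><span>hhh</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".a1")             # class selector
by_id = soup.select_one("#l1")            # id selector, single Tag or None
with_id = soup.select("li[id]")           # attribute present
id_l2 = soup.select('li[id="l2"]')        # attribute equals value
descendants = soup.select("div li")       # li anywhere below a div
children = soup.select("div > ul > li")   # li that is a direct child of ul
grouped = soup.select("a, li")            # union of both selectors

print(len(with_id), id_l2[0].get_text(), by_id.get_text())
print(len(descendants), len(children))
print([t.name for t in grouped])
```

Since select always returns a list, index it or loop over it; select_one is convenient when exactly one node is expected.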
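# 4. Node information
# (1) Getting a node's content, including tags nested inside tags
#     obj.string
#     obj.get_text()
# Difference: if the tag holds only text, both return it. If the tag holds text together with child tags, string returns None and only get_text() returns the text

# Get the node's content
obj=soup.select('#d1')[0]

print(obj.string)      # None here: #d1 holds a span plus the whitespace around it
print(obj.get_text())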
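
The string versus get_text() distinction is easy to get wrong, so here is a small sketch of the cases. The markup is made up for illustration (only #d1 mirrors the practice page), and 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<p id="p1">only text</p>'
    '<div id="d1">\n  <span>\n    hhh\n  </span>\n</div>'
    '<div id="d2"><span>nested</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

p1 = soup.select_one("#p1")
d1 = soup.select_one("#d1")
d2 = soup.select_one("#d2")

# A tag whose only child is a string: both approaches work
print(repr(p1.string), repr(p1.get_text()))

# #d1 has three children (whitespace, <span>, whitespace), so .string is None
print(repr(d1.string))
print(repr(d1.get_text(strip=True)))  # strip=True trims the whitespace

# A tag whose only child is another tag: .string looks into that child
print(repr(d2.string))
```

In practice get_text(strip=True) is the safer default when the markup may contain nested tags or layout whitespace.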

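# (2) Node attributes
# tag.name returns the tag name
# e.g.
tag=soup.find('li')
print(tag.name)
# tag.attrs returns the attributes and their values as a dict
obj=soup.select('#p1')[0]
print(obj)
print(obj.name)  # name is the tag's name
# the attribute names and values come back as a dict
print(obj.attrs)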

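# (3) Reading a single attribute, three forms:
# obj.attrs.get('title')
# obj.get('title')
# obj['title']    raises KeyError when the attribute is missing; the two get forms return None
# Read attributes of the node
obj=soup.select('#p1')[0]
print(obj.attrs.get('class'))
print(obj.get('class'))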
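
The three ways of reading an attribute behave differently when the attribute is missing. The sketch below shows this on a single made-up p tag (an assumption for illustration, with a second class added to show the list form), again with 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="p1" class="p1 intro">呵呵呵</p>', "html.parser")
p = soup.find("p")

print(p.attrs)          # all attributes as a dict
print(p["id"])          # dictionary-style access
print(p.get("class"))   # class is multi-valued, so it comes back as a list

# Missing attribute: get returns None (or a default), [] raises KeyError
print(p.get("title"))
print(p.get("title", "n/a"))
try:
    p["title"]
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using get with a default avoids a try/except when an attribute is optional.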

bs4 hands-on exercise: scraping the Starbucks menu data

import urllib.request
from bs4 import BeautifulSoup

url='https://www.starbucks.com.cn/menu/'

# Fetch the page and decode the response body as UTF-8
response=urllib.request.urlopen(url)

content=response.read().decode('utf-8')

# print(content)

soup=BeautifulSoup(content,'lxml')

# Select the <strong> tags inside the product list and print each name
name_list=soup.select('ul[class="grid padded-3 product"] strong')
for i in name_list:
    print(i.string)
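
The extraction step can be tested offline, which helps when the live page changes or blocks the request. The following is a sketch under stated assumptions: sample_html is hypothetical markup written only to match the selector used above (it is not copied from the Starbucks site), extract_names is a made-up helper name, and 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to fit the selector; the real page may differ
sample_html = """
<ul class="grid padded-3 product">
  <li><a href="#"><strong>Americano</strong></a></li>
  <li><a href="#"><strong> Latte </strong></a></li>
</ul>
<ul class="grid other">
  <li><strong>Not a product</strong></li>
</ul>
"""


def extract_names(html):
    """Return the text of every <strong> inside the product list."""
    soup = BeautifulSoup(html, "html.parser")
    # Class selectors match regardless of the order of the class names
    tags = soup.select("ul.grid.padded-3.product strong")
    return [t.get_text(strip=True) for t in tags]


names = extract_names(sample_html)

# The exact-string attribute selector from the exercise, for comparison
soup = BeautifulSoup(sample_html, "html.parser")
exact = [t.get_text(strip=True)
         for t in soup.select('ul[class="grid padded-3 product"] strong')]

print(names)
print(exact)
```

The attribute form matches only when the class attribute is exactly that string, while the class-selector form keeps working if the site reorders or adds class names; which one suits the real page is a judgment call.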