Using bs4 and lxml

Latest recommended article published on 2021-03-08 00:23:37

lly2234317974


Category: Python third-party libraries. Tags: bs4, lxml, document tree

Article link: https://blog.csdn.net/lly2234317974/article/details/79234372


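# coding:utf-8
# Beautiful Soup 4: a third-party Python library for quickly extracting data from HTML pages.
# bs4: short for Beautiful Soup 4.
# lxml: a third-party parsing library. bs4 can also use Python's built-in html.parser, but lxml parses faster and is more capable; its core is written in C.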

from bs4 import BeautifulSoup

index.html:

	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8"><title>测试bs4的网页</title>
	</head>
	<body>
	<a href="http://www.baidu.com" class="first second third" id="one">百度一下</a>
	<div>
	这是一个块标签
	<a href="http://www.tencent.com" class="second">腾讯</a>
	<a href="http://www.taobao.com" id="two">淘宝</a>
	<a href="http://www.qq.com">QQ邮箱</a>
	</div>
	</body>
	</html>

bs=BeautifulSoup(open('index.html'),'lxml')
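A minimal sketch of the loading step, assuming Python 3. It parses an HTML string instead of a file and uses the built-in `html.parser` so that it runs without lxml installed; pass `'lxml'` instead when lxml is available. The HTML string is a shortened stand-in for index.html, not the full file.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for index.html (assumption: same kind of tags, fewer links).
html = (
    '<html lang="en"><head><meta charset="UTF-8">'
    '<title>测试bs4的网页</title></head>'
    '<body><a href="http://www.baidu.com" id="one">百度一下</a></body></html>'
)

# 'html.parser' ships with Python; swap in 'lxml' when it is installed.
bs = BeautifulSoup(html, 'html.parser')

title_text = bs.title.string   # text inside <title>
first_href = bs.a['href']      # href of the first <a> tag
print(title_text)
print(first_href)
```

When reading from disk in Python 3, opening the file in a `with` block with an explicit `encoding='utf-8'` closes the handle afterwards and avoids platform-dependent decoding.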
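# print(type(bs))
# Get the <title> tag of the HTML document
# print(bs.title)
# print('------------')
# print(bs.head)
# print('++++++++++++++')
# print(bs.body)
# The name attribute returns the name of the current tag. The bs object itself is not an HTML tag, so bs.name returns the special value '[document]' instead of a tag name. For a tag, name returns whatever tag precedes .name in the expression, e.g. bs.head.name is 'head'.

# print(bs.head.name)
# print(bs.body.name)
# print(bs.html.name)
# print(bs.p.name)  # error: there is no <p> tag, so bs.p is None and .name raises AttributeError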
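The behaviour of `name` described above can be checked with a short sketch. The HTML string is a made-up fixture, and `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fixture, not the article's index.html.
html = '<html><head><title>demo</title></head><body><a href="#">link</a></body></html>'
bs = BeautifulSoup(html, 'html.parser')

doc_name = bs.name         # the BeautifulSoup object is not a tag
head_name = bs.head.name   # a real tag reports its own name
missing = bs.p             # there is no <p> tag, so this is None

print(doc_name)
print(head_name)
print(missing)
```

Because `bs.p` is `None` here, checking for `None` before touching `.name` avoids the `AttributeError`.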

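### attrs: returns all attributes of a tag as a dict, i.e. the key/value pairs inside the start tag. Start tag: <...>, end tag: </...>
# print(bs.a.attrs)
# print(bs.html.attrs)

# Get a single attribute of a tag

### A tag's class attribute can hold several values, so its value is returned as a list.

# print(bs.head.meta['charset'])
# print(bs.html['lang'])
# print(bs.a['class'])

# Delete an attribute from a tag
# del bs.a['id']
# print(bs.a.attrs)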
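A small sketch of the attribute operations above, run on the first link from index.html. `html.parser` is an assumption made so the example needs no extra install; `Tag.get()` is shown as a way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.baidu.com" class="first second third" id="one">百度一下</a>'
bs = BeautifulSoup(html, 'html.parser')
link = bs.a

classes = link['class']          # multi-valued attribute, returned as a list
href = link['href']              # single-valued attribute, returned as a string
title_attr = link.get('title')   # missing attribute: get() returns None, no KeyError

del link['id']                   # remove the id attribute from the tag
remaining = sorted(link.attrs)   # attribute names that are left

print(classes, href, title_attr, remaining)
```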

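# The BeautifulSoup class converts an HTML document into a complex, hierarchical tree structure, which establishes parent/child relationships between nodes. Each node corresponds to a Python object.

# print(bs.a.string)
# print(type(bs.a.string))  # <class 'bs4.element.NavigableString'>
# print(bs.title.string)
# print(bs.body.string)  # None: body has more than one child

# print(type(bs.title))  # <class 'bs4.element.Tag'>
# print(bs.title)
# Python object types in the tree:
# Tag: an HTML tag (start tag, end tag, and everything in between). Two important attributes: name and attrs.
# NavigableString: the text inside a tag (not including the tag itself)
# print(bs.div.string)  # None: div contains text plus several <a> tags
# BeautifulSoup: the whole HTML document
# Comment: a special kind of NavigableString that holds the text of an HTML comment (<!-- ... -->), without the comment markers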
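The four object types can be told apart with `isinstance`. The sketch below uses a made-up fragment that contains an HTML comment, since index.html has none; `html.parser` is again an assumption for portability.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment with one text node and one comment.
html = '<div><p>text</p><!-- a note --></div>'
bs = BeautifulSoup(html, 'html.parser')

p_text = bs.p.string           # NavigableString holding 'text'
comment = bs.div.contents[1]   # second child of <div> is the comment

is_tag = isinstance(bs.p, Tag)
is_soup = isinstance(bs, BeautifulSoup)
text_is_string = isinstance(p_text, NavigableString)
comment_is_comment = isinstance(comment, Comment)
# Comment subclasses NavigableString, so this is True as well.
comment_is_string = isinstance(comment, NavigableString)

print(repr(p_text), repr(comment))
```

Since a `Comment` also passes the `NavigableString` check, code that wants only visible text would test for `Comment` first.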

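# Traversing the document tree

######### Inside <body>, <a></a> and <div></div> are its child nodes. The <a></a> tags inside the div are descendants of <body>.
# 1. Iterating over the direct children of a parent node
# contents attribute: returns the direct children of a tag as a list. Nodes nested inside those children do not appear as separate list items.

# print(bs.head.contents)
# Output (Python 2 repr): [u'\n', <meta charset="unicode-escape"/>, <title>\u6d4b\u8bd5bs4\u7684\u7f51\u9875</title>, u'\n']
# 2. children attribute: returns an iterator over the direct children instead of a list
# res = bs.body.children
# print(type(res))
# for x in res:
#     print(x)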
'''
<a class="first second third" href="http://www.baidu.com" id="one">百度一下</a>

<div>
这是一个块标签
<a class="second" href="http://www.tencent.com">腾讯</a>
<a href="http://www.taobao.com" id="two">淘宝</a>
<a href="http://www.qq.com">QQ邮箱</a>
</div>
'''
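To make the difference between `contents` and `children` concrete, here is a hedged sketch on a whitespace-free fragment (a hypothetical reduction of index.html, so that no newline text nodes appear among the children).

```python
from bs4 import BeautifulSoup

# Hypothetical compact fragment: <body> has exactly two direct children.
html = '<body><a id="one">A</a><div><a id="two">B</a></div></body>'
bs = BeautifulSoup(html, 'html.parser')

direct = bs.body.contents                                  # a real list
child_names = [child.name for child in bs.body.children]   # iterate lazily

print(type(direct).__name__, len(direct))
print(child_names)
```

The nested `<a id="two">` is not a separate item in either result; it is only reachable through its parent `<div>`.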
# 3. descendants attribute: yields all descendant nodes (the direct children and, recursively, their children). The result is a generator object.

# res = bs.body.descendants
# for x in res:
#     print(x)
'''
Output:
<a class="first second third" href="http://www.baidu.com" id="one">百度一下</a>
百度一下

<div>
这是一个块标签
<a class="second" href="http://www.tencent.com">腾讯</a>
<a href="http://www.taobao.com" id="two">淘宝</a>
<a href="http://www.qq.com">QQ邮箱</a>
</div>

这是一个块标签

<a class="second" href="http://www.tencent.com">腾讯</a>
腾讯

<a href="http://www.taobao.com" id="two">淘宝</a>
淘宝

<a href="http://www.qq.com">QQ邮箱</a>
QQ邮箱

'''
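The depth-first order of `descendants` is easier to see on a tiny fragment. This sketch uses a hypothetical whitespace-free fixture and separates tags from text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<body><a id="one">A</a><div><a id="two">B</a></div></body>'
bs = BeautifulSoup(html, 'html.parser')

# Each tag is followed by its own subtree before the next sibling is visited.
nodes = list(bs.body.descendants)
tag_names = [n.name for n in nodes if isinstance(n, Tag)]
texts = [str(n) for n in nodes if isinstance(n, NavigableString)]

print(len(nodes), tag_names, texts)
```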

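# 4. string attribute: returns the text of a tag. If the tag contains nothing but text, that text is returned; if it contains several child nodes, the result is None.
# 5. parent attribute: returns the parent node
# head = bs.title.parent
# print(head.name)
# print(type(head))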
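A short sketch of `string` and `parent` on a made-up document. It also shows `get_text()`, which is one way to collect all text when `string` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical fixture: <div> mixes text and a tag, <p> wraps a single tag.
html = (
    '<html><head><title>demo</title></head>'
    '<body><div>text<a>link</a></div><p><b>bold</b></p></body></html>'
)
bs = BeautifulSoup(html, 'html.parser')

title_text = bs.title.string    # only text inside, so the text is returned
div_string = bs.div.string      # two children, so the result is None
p_string = bs.p.string          # a single child tag: its string is passed up
div_text = bs.div.get_text()    # concatenates all text below <div>
parent_name = bs.title.parent.name

print(title_text, div_string, p_string, div_text, parent_name)
```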
# 6. next_sibling attribute: returns the next sibling of the current node, or None if there is none.
# title = bs.meta.next_sibling
# print(title)
'''
<title>测试bs4的网页</title>
'''
# 7. previous_sibling attribute: returns the previous sibling of the current node, or None if there is none.
# meta = bs.title.previous_sibling
# print(meta)
'''
<meta charset="utf-8"/>
'''
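Sibling navigation works as shown above only when the two tags are directly adjacent in the source. The sketch below, on a hypothetical fixture, also shows the common case where a newline between tags becomes the sibling, and uses `find_next_sibling()` to skip over it.

```python
from bs4 import BeautifulSoup

# Hypothetical fixture: <meta> and <title> touch; the <li> tags are split by a newline.
html = (
    '<html><head><meta charset="UTF-8"><title>demo</title></head>'
    '<body><ul><li>a</li>\n<li>b</li></ul></body></html>'
)
bs = BeautifulSoup(html, 'html.parser')

after_meta = bs.meta.next_sibling.name         # the adjacent <title>
before_title = bs.title.previous_sibling.name  # the adjacent <meta>
after_title = bs.title.next_sibling            # <title> is the last child of <head>

raw_sibling = bs.li.next_sibling               # the newline text node, not a tag
next_li = bs.li.find_next_sibling('li')        # skips text, returns the next <li>

print(after_meta, before_title, after_title, repr(raw_sibling), next_li.string)
```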

# Searching the document
# 1. find_all(): searches all descendants of the current tag and returns every match as a list.
# res = bs.find_all('a')
# for a in res:
#     print(a)
'''
<a class="first second third" href="http://www.baidu.com" id="one">百度一下</a>
<a class="second" href="http://www.tencent.com">腾讯</a>
<a href="http://www.taobao.com" id="two">淘宝</a>
<a href="http://www.qq.com">QQ邮箱</a>

'''
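A hedged sketch of how the scope of `find_all()` can be narrowed. The fixture is a hypothetical reduction of index.html; `recursive` and `limit` are existing `find_all()` parameters.

```python
from bs4 import BeautifulSoup

html = '<body><a id="one">A</a><div><a id="two">B</a><a>C</a></div></body>'
bs = BeautifulSoup(html, 'html.parser')

all_links = bs.find_all('a')                           # every <a> in the document
in_div = bs.div.find_all('a')                          # only <a> tags below <div>
direct_only = bs.body.find_all('a', recursive=False)   # direct children of <body> only
first_two = bs.find_all('a', limit=2)                  # stop after two matches

print(len(all_links), len(in_div), len(direct_only), len(first_two))
```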
# Search by attribute. An id value should be unique within a document, while class values can repeat.
# print(bs.find(id='one'))
# class is a Python keyword, so use class_ to search by class
# print(bs.find_all(class_='second'))
'''
Returns a list
[<a class="first second third" href="http://www.baidu.com" id="one">\u767e\u5ea6\u4e00\u4e0b</a>, <a class="second" href="http://www.tencent.com">\u817e\u8baf</a>]

'''
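The attribute searches above can be sketched as follows, using three links modelled on index.html. It illustrates that `class_='second'` matches any tag whose class list contains `second`, and that `find()` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="first second third" id="one">百度一下</a>'
    '<a class="second">腾讯</a>'
    '<a id="two">淘宝</a>'
)
bs = BeautifulSoup(html, 'html.parser')

by_id = bs.find(id='one')                    # first tag with id="one"
no_match = bs.find(id='nope')                # nothing found: None
by_class = bs.find_all(class_='second')      # matches one class out of several
by_attrs = bs.find_all('a', attrs={'class': 'second'})  # equivalent dict form

class_texts = [a.get_text() for a in by_class]
print(by_id.get_text(), no_match, class_texts)
```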
# #### 2. Finding tags with CSS selectors
# 1. Select by tag name
# print(bs.select('title'))
'''
[<title>\u6d4b\u8bd5bs4\u7684\u7f51\u9875</title>]
'''
# 2. Select by class value; the leading . is the fixed syntax for matching a class
print(bs.select('.first'))
'''
[<a class="first second third" href="http://www.baidu.com" id="one">\u767e\u5ea6\u4e00\u4e0b</a>]
'''
# 3. Select by id value; the leading # is the fixed syntax for matching an id
print(bs.select('#one'))
'''
[<a class="first second third" href="http://www.baidu.com" id="one">\u767e\u5ea6\u4e00\u4e0b</a>]
'''
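Selectors can also be combined. The sketch below assumes a bs4 release that bundles CSS selector support (the soupsieve package installed with recent versions) and uses a hypothetical reduction of index.html; `select_one()` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

html = (
    '<body><a class="first second" id="one" href="http://www.baidu.com">百度一下</a>'
    '<div><a class="second" href="http://www.tencent.com">腾讯</a></div></body>'
)
bs = BeautifulSoup(html, 'html.parser')

by_class = bs.select('.first')            # class selector
by_id = bs.select('#one')                 # id selector
child_links = bs.select('div > a')        # <a> tags that are direct children of <div>
hrefs = [a['href'] for a in bs.select('a.second')]  # tag name plus class
missing = bs.select_one('#missing')       # no match: None instead of an empty list

print(len(by_class), len(by_id), [a.get_text() for a in child_links], hrefs, missing)
```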
