#11 Advancing Your Python Web Scraping: BeautifulSoup

lrzbupt

Published 2020-04-09 17:36:23

Article link: https://blog.csdn.net/lrzbupt/article/details/105404055

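Installing Python packages from a mirror

When installing libraries with pip or conda, downloads from overseas servers can be slow, so a mirror site in mainland China is a practical alternative. This post uses the Tsinghua mirror as an example.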

# Use the mirror for a single install
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package
# Make the mirror the default
# First upgrade pip to version 10.0.0 or later
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U
# Then set the default index
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

Source: https://mirrors.tuna.tsinghua.edu.cn/help/pypi/

# Installing with conda: change the channel to the mirror
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes
# Install a package into a specific environment
conda install -n env_name package_name
# Remove a package
conda remove -n env_name package_name
# List installed packages
conda list -n env_name

Sample HTML

from bs4 import BeautifulSoup as BS

html_str = """
<html><head><title>The Dormouse's Story</title></head>
<body>
<p class="title"><b>The Dormouse's Story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class='sister' id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class='sister' id="link2"><!--Lacie--></a> and
<a href="http://example.com/tillie" class='sister' id="link3"><!--Tillie--></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

The examples below all use this string.

BeautifulSoup objects

Creating a BS object

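# From a string (from_encoding only applies to bytes input, so it is omitted here)
soup = BS(html_str, 'lxml')
# From a file (name the parser explicitly and close the file when done)
with open('zhihu.html', encoding='utf-8') as f:
    soup = BS(f, 'lxml')
print(soup.prettify())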

BS converts the HTML into a tree in which every node is an object. The objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

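A Tag corresponds to an HTML tag, including the markup itself and whatever it contains, such as strings.
Access one directly as soup. followed by the tag name, which returns the first tag with that name. Its two most important properties are name and attributes.
Read the name with soup.tag_name.name; assigning to it renames the tag.
Read an attribute value with soup.tag_name['attr_name'] or soup.tag_name.get('attr_name'), or read all attributes as a dictionary with soup.tag_name.attrs. Assigning to them modifies the tag.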
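As a minimal sketch of the Tag operations described above: the one-line snippet and its id value are made up for illustration, and the built-in "html.parser" is used instead of lxml so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet modeled on the sample document.
html = '<p class="title" id="intro"><b>The Dormouse\'s Story</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                      # first <p> in the document
tag_name = tag.name               # 'p'
tag_class = tag["class"]          # ['title'] (class is multi-valued, so a list)
tag_id = tag.get("id")            # 'intro'
missing = tag.get("href")         # None; tag["href"] would raise KeyError
all_attrs = dict(tag.attrs)       # copy of the attribute dictionary

# Assignment modifies the tree in place.
tag["id"] = "heading"
tag.name = "div"
rendered = str(soup)
print(rendered)
```

Using .get() is the safer choice when an attribute may be absent, since indexing raises KeyError.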

NavigableString

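To get the content inside a tag, use .string: soup.tag_name.string returns the inner text as a bs4.element.NavigableString. It is essentially the same as a Python Unicode string and can be converted with str(soup.p.string) in Python 3 (unicode(soup.p.string) in Python 2).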
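A short sketch of the conversion, assuming Python 3 and the built-in "html.parser"; the snippet is a cut-down version of the sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>The Dormouse's Story</b></p>", "html.parser")

text = soup.b.string                          # NavigableString, still tied to the tree
is_nav = isinstance(text, NavigableString)    # True
plain = str(text)                             # plain str, detached from the tree
print(type(text).__name__, type(plain).__name__)
```

Converting to a plain str is useful when the text is kept after the soup is discarded, because a NavigableString holds a reference to the whole parse tree.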

BeautifulSoup

Represents the whole document and is the root node. It can be treated as a special Tag object whose name is [document] and whose attrs is {}.
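These properties can be checked directly. The sketch below uses a throwaway snippet and the built-in "html.parser"; both are choices made for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>hello</p>", "html.parser")

root_name = soup.name            # '[document]'
root_attrs = soup.attrs          # {}
is_tag = isinstance(soup, Tag)   # True: BeautifulSoup subclasses Tag
print(root_name, root_attrs, is_tag)
```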

Comment

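HTML marks comments with <!-- ... -->. Calling .string on a tag that contains only a comment, such as soup.a.string in the sample, drops the comment delimiters and shows just the text, but the type is still Comment. To avoid mixing comments into extracted data, check the type before using the value returned by .string, for example type(soup.a.string) == bs4.element.Comment.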
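One way to apply that check is sketched below. It uses isinstance rather than a type() comparison, which is a stylistic substitution and not something the original prescribes; the two-link snippet is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical snippet: one link holds a comment, the other real text.
html = '<a id="link1"><!--Elsie--></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a.string                       # 'Elsie', but it is a Comment
first_is_comment = isinstance(first, Comment)

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Keep only real text; skip comments and tags without a single string.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))
print(visible)
```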

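Nodes and traversal

Child nodes

A Tag exposes its children through .contents and .children.
.contents returns the direct children as a list, so you can take its length and index into it. Strings have no children and therefore no .contents.
.children returns an iterator over the direct children, for use in a for loop.
.descendants returns a generator over all descendants, for use in a for loop.
.string returns the innermost string when the tag contains only a string or only a single child tag; otherwise it is None.
.strings iterates over every string when there is more than one.
.stripped_strings does the same but strips surrounding whitespace and skips whitespace-only strings.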
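The child-node properties above can be compared side by side. This is a sketch on a shortened, hypothetical version of the story paragraph, parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p class="story">Once <a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> end.</p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

n_children = len(p.contents)                  # strings and tags both count
child_ids = [c["id"] for c in p.children if isinstance(c, Tag)]
n_descendants = len(list(p.descendants))      # also includes text inside <a>
p_string = p.string                           # None: more than one child
a_string = soup.a.string                      # 'Elsie'
raw = list(p.strings)
clean = list(p.stripped_strings)
print(n_children, n_descendants, clean)
```

The whitespace around the text nodes is the reason .stripped_strings is usually more convenient than .strings for extraction.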

Parent nodes

.parent is the element's parent node.
.parents iterates over all of the element's ancestors, up to [document].
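A brief sketch of walking upward from a nested tag; the document here is a made-up minimal one.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>",
                     "html.parser")
b = soup.b

parent_name = b.parent.name               # 'p'
chain = [t.name for t in b.parents]       # innermost ancestor first
print(chain)
```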

Sibling nodes

.next_sibling is the next sibling node.
.previous_sibling is the previous sibling node.
.next_siblings iterates, in a for loop, over all the siblings that follow.
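Siblings include text nodes, which is easy to overlook. The sketch below uses a hypothetical three-link paragraph and also shows find_next_sibling, a Beautiful Soup method not mentioned above, as one way to skip over the text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
        '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

after_first = first.next_sibling              # the text ', ', not the next <a>
before_first = first.previous_sibling         # None: nothing precedes it
next_tag_id = first.find_next_sibling("a")["id"]
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
before_last = soup.find(id="link3").previous_sibling
print(repr(after_first), following_ids)
```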

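Previous and next nodes

.next_element is the node that was parsed immediately after this one.
.previous_element is the node that was parsed immediately before this one.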
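The difference from sibling navigation shows up as soon as a tag has content. A sketch on a made-up two-tag paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b = soup.b

next_elem = b.next_element               # the string 'one' inside <b>
next_sib = b.next_sibling                # the <i> tag
prev_of_i = soup.i.previous_element      # again the string 'one'
print(repr(next_elem), next_sib.name, repr(prev_of_i))
```

.next_element follows parse order and so descends into the tag first, while .next_sibling stays on the same level.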

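Searching and CSS selectors

This section covers find_all(name, attrs, recursive, text, **kwargs). Beautiful Soup 4.4.0 and later call the text argument string; the old name is still accepted.
It searches the tags beneath the current Tag (all descendants by default, or only direct children with recursive=False) and returns those that match.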

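The name parameter

Finds tags whose name matches name, which can be a string, a regular expression, a list, or a function.
A string must match the tag name exactly.
A regular expression is tested against the tag name with its search() method.
A list matches if any one of its items matches.
A function takes a single tag as its argument and returns True or False.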
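The four kinds of name filter can be sketched as follows. The small document and the has_href helper are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head><title>T</title></head><body>'
        '<p><b>x</b></p><a href="#">y</a></body></html>')
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Return the tag names of a result list, in document order."""
    return [t.name for t in tags]

by_string = names(soup.find_all("b"))                 # exact name
by_regex = names(soup.find_all(re.compile("^b")))     # names starting with b
by_search = names(soup.find_all(re.compile("t")))     # 't' anywhere in the name
by_list = names(soup.find_all(["a", "b"]))            # either name

def has_href(tag):
    # Hypothetical filter: keep tags that carry an href attribute.
    return tag.has_attr("href")

by_func = names(soup.find_all(has_href))
print(by_string, by_regex, by_search, by_list, by_func)
```

The by_search result shows that an unanchored pattern matches anywhere in the tag name, so anchor it with ^ or $ when a prefix or full match is intended.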

kwargs

If a keyword argument is not one of the built-in search parameter names, it is treated as a tag attribute to filter on.
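A sketch of attribute filtering through keyword arguments. The data-role attribute is invented for the example; class_ and the attrs dictionary are Beautiful Soup's ways of handling attribute names that are not valid Python keywords.

```python
import re
from bs4 import BeautifulSoup

html = ('<a class="sister" id="link1" href="http://example.com/elsie" '
        'data-role="first">E</a>'
        '<a class="sister" id="link2" href="http://example.com/lacie">L</a>')
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Return the id attribute of each tag in a result list."""
    return [t["id"] for t in tags]

by_id = ids(soup.find_all(id="link2"))
by_class = ids(soup.find_all(class_="sister"))     # class is a reserved word
by_href = ids(soup.find_all(href=re.compile("elsie")))
# Names such as data-role cannot be keyword arguments; use attrs instead.
by_data = ids(soup.find_all(attrs={"data-role": "first"}))
print(by_id, by_class, by_href, by_data)
```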

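text

Searches the string content of the document instead of the tags. In Beautiful Soup 4.4.0 and later this argument is named string.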
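A sketch of string searching, written with the newer string= spelling on the assumption that Beautiful Soup 4.4.0 or later is installed; the snippet is made up.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Elsie and Lacie</p><b>Elsie</b>", "html.parser")

# A plain string must equal the whole text node.
exact = [str(s) for s in soup.find_all(string="Elsie")]
# A regular expression may match part of it.
partial = [str(s) for s in soup.find_all(string=re.compile("Lacie"))]
# Combined with a tag name, the result is tags whose .string matches.
tags = [t.name for t in soup.find_all("b", string="Elsie")]
print(exact, partial, tags)
```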

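CSS selectors

soup.select takes a CSS selector string and returns a list of matching tags.
It can search by tag name, class name, id, the presence of an attribute, or an attribute's value.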
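Each kind of lookup listed above maps to a standard CSS selector. This sketch uses a shortened two-link version of the sample paragraph; select_one, which returns only the first match, is included as a convenience not covered in the text.

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a> '
        '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
        '</p>')
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Return the id attribute of each tag in a result list."""
    return [t["id"] for t in tags]

by_tag = ids(soup.select("a"))                    # tag name
by_class = ids(soup.select(".sister"))            # class name
by_id = ids(soup.select("#link1"))                # id
with_attr = ids(soup.select("a[href]"))           # attribute present
by_value = ids(soup.select('a[href="http://example.com/lacie"]'))
nested = ids(soup.select("p.story > a"))          # direct children of p.story
first_text = soup.select_one("a.sister").get_text()
print(by_tag, by_id, by_value, first_text)
```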

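XPath parsing with lxml

# Extract URLs with XPath
from lxml import etree
# From a file (an HTML parser is passed because the default parser expects well-formed XML)
html = etree.parse('xxx.html', etree.HTMLParser())
# From a string
html = etree.HTML(html_str)
urls = html.xpath(".//*[@class='sister']/@href")
print(urls)

For details on XPath expressions, see post #7.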
