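Event: CSDN 21-Day Learning Challenge
Study log
1. Topics studied
The BeautifulSoup4 parser for web scraping
2. Problems encountered
The material is fairly complex.
3. What I learned
BeautifulSoup4
4. Hands-on practice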
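I. Introduction to the BeautifulSoup4 library
1. Introduction
Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work.
BeautifulSoup4 turns a web page into a tree of nodes, much like a DOM tree.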
2. Installing the module
1. On Windows, press Win + R and enter: cmd
2. Install beautifulsoup4 by running the pip command: pip install beautifulsoup4. I had already installed it; if pip reports a version, the installation succeeded.
3. Import the package:
<code class="language-python">
from bs4 import BeautifulSoup
</code>
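To confirm the installation from inside Python as well, a quick check like the following should work. It is only a sketch and assumes nothing beyond the bs4 package itself:

```python
import bs4

# bs4 exposes its version as a string; printing it confirms the import works
print(bs4.__version__)
```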
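3. Parsers
Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports several third-party parsers such as lxml: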
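<table>
<tr><th>Parser</th><th>Usage</th><th>Advantages</th><th>Disadvantages</th></tr>
<tr><td>Python standard library</td><td>BeautifulSoup(html, 'html.parser')</td><td>Built into Python; moderate speed; tolerant of malformed documents</td><td>Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2</td></tr>
<tr><td>lxml HTML parser</td><td>BeautifulSoup(html, 'lxml')</td><td>Fast; tolerant of malformed documents</td><td>Requires a C library to be installed</td></tr>
<tr><td>lxml XML parser</td><td>BeautifulSoup(html, 'xml')</td><td>Fast; the only parser that supports XML</td><td>Requires a C library to be installed</td></tr>
<tr><td>html5lib parser</td><td>BeautifulSoup(html, 'html5lib')</td><td>Best tolerance of malformed documents; parses the way a browser does; produces HTML5</td><td>Slow; requires the separate html5lib Python package</td></tr>
</table>
In practice, the parser we use most often is the lxml HTML parser, followed by html5lib.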
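Since lxml is a third-party package, a script may run on a machine where it is missing. One possible way to handle that is sketched here: try the preferred parser and fall back to the built-in one. The helper name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser, else html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser is not installed, so use the standard-library one
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Keep in mind that different parsers can build slightly different trees from malformed markup, so a fallback like this suits well-formed input best.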
II. Getting started
1. Basic operations
1. Reading an HTML string:
<code class="language-python">
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel_body">
        <ul class="list" id="list-1" name="element">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <a href="https://www.baidu.com">百度官网</a>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
</code>
2. Reading an HTML file:
<code class="language-python">
from bs4 import BeautifulSoup

# Open the file in a with block so it is closed after parsing
# (the utf-8 encoding is an assumption about index.html)
with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
</code>
3. Basic methods
<code class="language-python">
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel_body">
        <ul class="list" id="list-1" name="element">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <a href="https://www.baidu.com">百度官网</a>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# Pretty-print the tree with indentation
print(soup.prettify())
# This sample has no title or head tag, so both accessors return None
print(soup.title)
print(soup.head)
# Get the first h4 tag, then its name, then its text
print(soup.h4)
print(soup.h4.name)
print(soup.h4.string)
# Get the first div tag and everything inside it
print(soup.div)
# Get the class of the first div (it has no id, so soup.div["id"] would raise KeyError)
print(soup.div["class"])
# Get the first a tag
print(soup.a)
# Get every a tag
print(soup.find_all("a"))
# Get the tag with id="list-1"
print(soup.find(id="list-1"))
# Loop over every a tag and print its href value
for item in soup.find_all("a"):
    print(item.get("href"))
# Loop over every a tag and print its text
for item in soup.find_all("a"):
    print(item.get_text())
</code>
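The methods above include both .string and get_text(), which behave differently once a tag has more than one child. The short sketch below uses made-up markup and the built-in 'html.parser' (chosen only so it runs without lxml) to show the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>Foo</li><li>Bar</li></ul>', 'html.parser')

# .string works when a tag has exactly one child
single = soup.li.string
# <ul> has two children, so .string cannot choose one and returns None
multiple = soup.ul.string
# get_text() concatenates every string below the tag
all_text = soup.ul.get_text()
# A separator plus strip=True gives cleaner output
spaced = soup.ul.get_text(' ', strip=True)

print(single, multiple, all_text, spaced)
```

When the goal is "all the text under this tag", get_text() is usually the safer choice.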
2. Kinds of objects
Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:
Tag, NavigableString, BeautifulSoup, and Comment.
(1) Tag: put simply, a Tag corresponds to one tag in the HTML, for example:
<code class="language-python">
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag)
print(type(tag))
</code>
Output:
<code class="language-python">
<b class="boldest">Extremely bold</b>
<class 'bs4.element.Tag'>
</code>
A Tag has many methods and attributes, which are explained in detail under "navigating the tree" and "searching the tree". For now, here are the two most important attributes of a tag: name and attributes.
The name attribute:
<code class="language-python">
print(tag.name)
# Output: b
# Changing a tag's name affects all HTML generated from the current Beautiful Soup object:
tag.name = "b1"
print(tag)
# Output: <b1 class="boldest">Extremely bold</b1>
</code>
Attributes:
<code class="language-python">
# Read the class attribute; class can hold several values, so the result is a list
print(tag['class'])
# Get all attributes as a dictionary with dot notation: .attrs
print(tag.attrs)
</code>
A tag's attributes can be added, modified, and deleted:
<code class="language-python">
# Add an id attribute
tag['id'] = 1
# Modify the class attribute
tag['class'] = 'tl1'
# Delete the class attribute
del tag['class']
</code>
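To see what these three operations do to the markup, here is a small self-contained sketch. It reuses the same b tag, parses it with 'html.parser' so that lxml is not required, and uses the string '1' as the id value; both choices are mine for illustration.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser').b

tag['id'] = '1'        # add an attribute
tag['class'] = 'tl1'   # modify an attribute
after_edit = str(tag)

del tag['class']       # delete an attribute
after_delete = str(tag)

print(after_edit)
print(after_delete)
# .get() returns None for a missing attribute instead of raising KeyError
print(tag.get('class'))
```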
(2) NavigableString: use .string to get the text inside a tag:
<code class="language-python">
# Use the tag variable from above: once the tag is renamed to b1, soup.b no longer finds it
print(tag.string)
print(type(tag.string))
</code>
(3) BeautifulSoup: represents the document as a whole. You can get its type, name, and attributes:
<code class="language-python">
print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (the document itself has no attributes)
</code>
(4) Comment: a special type of NavigableString whose output does not include the comment delimiters.
<code class="language-python">
# Illustrative markup (not from the original post): a b tag that contains only a comment
comment_soup = BeautifulSoup('<b><!--Hey, buddy--></b>', 'lxml')
print(comment_soup.b)
print(comment_soup.b.string)        # Hey, buddy
print(type(comment_soup.b.string))  # <class 'bs4.element.Comment'>
</code>
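The four kinds can also be seen side by side in one small document. The sketch below uses made-up markup and 'html.parser' (so that it runs without lxml), and it checks that a Comment really is a NavigableString:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<b>bold<!-- a note --></b>', 'html.parser')
# The b tag has two children: a plain string and a comment
text, comment = soup.b.contents

kinds = [type(x).__name__ for x in (soup, soup.b, text, comment)]
print(kinds)

# Comment subclasses NavigableString, so string handling applies to it as well
print(isinstance(comment, NavigableString))
# The BeautifulSoup object itself behaves like a Tag
print(isinstance(soup, Tag))
```

This matters when extracting text: code that walks over strings may want to skip Comment objects explicitly.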
3. Searching the tree
1. find_all(name, attrs, recursive, text, **kwargs)
(1) The name parameter: finds every tag whose name is name; string objects in the document are ignored automatically.
Matching a string: finds tags whose name matches the string exactly. The following finds all <a> tags in the document:
<code class="language-python">
a_list = soup.find_all("a")
print(a_list)
</code>
Matching a regular expression: if you pass a compiled regular expression, Beautiful Soup filters tag names with the pattern's search() method.
<code class="language-python">
import re

# Find every tag whose name starts with "b", such as <body> and <b>
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
</code>
Matching a list: if you pass a list, Beautiful Soup returns everything that matches any item in the list.
<code class="language-python">
# Find all <p> tags and all <a> tags
soup.find_all(["p", "a"])
</code>
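Here are the three kinds of name filter run against one small document. This is a sketch with made-up markup, parsed with 'html.parser' so no extra parser is needed:

```python
import re
from bs4 import BeautifulSoup

html = ('<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul>'
        '<a href="https://www.baidu.com">Baidu</a></div>')
soup = BeautifulSoup(html, 'html.parser')

# A string matches the tag name exactly
by_string = [t.name for t in soup.find_all('li')]
# A compiled pattern is tested against each tag name
by_regex = [t.name for t in soup.find_all(re.compile('^l'))]
# A list matches any of its items; results come back in document order
by_list = [t.name for t in soup.find_all(['a', 'ul'])]

print(by_string, by_regex, by_list)
```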
(2) Keyword arguments (**kwargs): a keyword argument that is not one of the named parameters is treated as a filter on the tag attribute of the same name.
<code class="language-python">
# Find every tag whose id attribute is "link2"
soup.find_all(id='link2')
</code>
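Two attributes need special handling as keyword filters: class is a reserved word in Python, and name is already the first parameter of find_all. A minimal sketch, using a cut-down version of the earlier sample markup and 'html.parser':

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1" name="element">
  <li class="element">Foo</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Ordinary attributes can be passed directly as keyword arguments
by_id = [t['id'] for t in soup.find_all(id='list-2')]
# class is a Python keyword, so bs4 accepts class_ instead
by_class = [t['id'] for t in soup.find_all(class_='list-small')]
# The HTML attribute "name" clashes with find_all's name parameter, so pass it through attrs
by_name_attr = [t['id'] for t in soup.find_all(attrs={'name': 'element'})]

print(by_id, by_class, by_name_attr)
```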
(3) The text parameter: searches the string content of the document. Like name, it accepts a string, a regular expression, or a list. Beautiful Soup 4.4.0 and later call this argument string; text is the older name.
<code class="language-python">
import re

# Match a string
soup.find_all(text="a")
# Match a regular expression
soup.find_all(text=re.compile("^b"))
# Match a list
soup.find_all(text=["p", "a"])
</code>
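The sketch below runs the same three kinds of filter on a small list. It assumes a Beautiful Soup release new enough to accept the string argument (the newer name for text), and it uses made-up markup with 'html.parser':

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>', 'html.parser')

# Each call returns the matching strings themselves, not the tags around them
exact = soup.find_all(string='Foo')
pattern = soup.find_all(string=re.compile('^B'))
either = soup.find_all(string=['Foo', 'Jay'])

print(exact, pattern, either)
```

To get from a matched string to its tag, use the string's .parent attribute.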
4. CSS selectors
When parsing with BeautifulSoup, we often use CSS selectors to extract data.
Note: the CSS selectors below only select tags; getting attribute values and text content is covered later.
1. Select by tag name: writing li selects every li tag. We rarely use this on its own, because we usually narrow the selection down to specific tags before extracting data.
<code class="language-python">
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel_body">
        <ul class="list" id="list-1" name="element">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <a href="https://www.baidu.com">百度官网</a>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# 1. Select by tag name: find the li tags
print(soup.select("li"))
</code>
Output:
<code class="language-python">
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
</code>
2. Select by class name. .line, a dot followed by line, selects every tag with class="line"; the "." stands for class.
<code class="language-python">
print(soup.select(".panel_body"))
</code>
Output:
<code class="language-python">
[<div class="panel_body">
<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<a href="https://www.baidu.com">百度官网</a>
<li class="element">Bar</li>
</ul>
</div>]
</code>
3. Select by id. #box, a # followed by box, selects every tag with id="box"; the "#" stands for id.
<code class="language-python">
print(soup.select("#list-1"))
</code>
Output:
<code class="language-python">
[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
</code>
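Because an id normally identifies a single element, select_one can be more convenient than select here: it returns the first match directly, or None if nothing matches. A minimal sketch with made-up markup and 'html.parser':

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li></ul><ul id="list-2"><li>Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# select always returns a list, even for a single match
as_list = soup.select('#list-1')
# select_one returns the first matching Tag
first = soup.select_one('#list-1')
# No element has this id, so the result is None
missing = soup.select_one('#list-9')

print(len(as_list), first['id'], missing)
```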
4. Select by attribute. The class and id attributes are special, which is why they have their own shorthand, "." and "#". Any other attribute goes in square brackets: for example, input[name="username"] finds input tags with name="username". Note how this differs from XPath syntax.
<code class="language-python">
print(soup.select('ul[name="element"]'))
</code>
Output:
<code class="language-python">
[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
</code>
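Attribute selectors are not limited to exact matches. The sketch below shows three standard CSS forms: attribute present, attribute equal to a value, and attribute starting with a value. The markup is made up, and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = ('<a href="https://www.baidu.com">Baidu</a>'
        '<a href="/local">Local</a>'
        '<a>No link</a>')
soup = BeautifulSoup(html, 'html.parser')

# [href] matches any a tag that has an href attribute at all
has_href = [a.get_text() for a in soup.select('a[href]')]
# [href="..."] matches an exact value
exact = [a.get_text() for a in soup.select('a[href="/local"]')]
# [href^="..."] matches values that start with the given prefix
https_only = [a.get_text() for a in soup.select('a[href^="https"]')]

print(has_href, exact, https_only)
```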
5. Combine a tag name with a class or an id.
<code class="language-python">
# Find the ul tag with id list-1
print(soup.select('ul#list-1'))
print("-" * 20)
# Find the ul tags with class list
print(soup.select('ul.list'))
</code>
Output:
<code class="language-python">
[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
--------------------
[<ul class="list" id="list-1" name="element">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<a href="https://www.baidu.com">百度官网</a>
<li class="element">Bar</li>
</ul>]
</code>
6. Find direct child elements
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#5c6370"># Find the li tags that are direct children of the tag with id="list-1"</span>
<span style="color:#c678dd">print</span><span style="color:#999999">(</span>soup<span style="color:#999999">.</span>select<span style="color:#999999">(</span><span style="color:#669900">'#list-1>li'</span><span style="color:#999999">)</span><span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#999999">[</span><span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Foo<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">,</span> <span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Bar<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">,</span> <span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Jay<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">]</span> </code></span></span>
7. Find descendant tags
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#5c6370"># The space between .panel_body and li is the descendant combinator: this selector finds every li nested at any depth under the tag with class="panel_body"</span>
<span style="color:#c678dd">print</span><span style="color:#999999">(</span>soup<span style="color:#999999">.</span>select<span style="color:#999999">(</span><span style="color:#669900">'.panel_body li'</span><span style="color:#999999">)</span><span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#999999">[</span><span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Foo<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">,</span> <span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Bar<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">,</span> <span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Jay<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">,</span> <span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Foo<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">,</span> <span style="color:#669900"><</span>li <span style="color:#c678dd">class</span><span style="color:#669900">=</span><span style="color:#669900">"element"</span><span style="color:#669900">></span>Bar<span style="color:#669900"><</span><span style="color:#669900">/</span>li<span style="color:#669900">></span><span style="color:#999999">]</span> </code></span></span>
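The sample HTML in this post has no nested lists, so the two selectors above return the same li tags for list-1. To see the difference between `>` and a space, here is a minimal sketch on a made-up snippet (the HTML, the variable names, and the choice of the built-in 'html.parser' are my own assumptions, picked so the example runs without lxml installed):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML (not the one used in this post): one li sits a level deeper.
html = '''
<div class="panel_body">
  <ul id="list-1">
    <li>Foo</li>
    <li>Bar
      <ol><li>Nested</li></ol>
    </li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# '>' matches only li tags whose parent is #list-1.
direct = soup.select('#list-1 > li')

# A space matches li tags at any depth under .panel_body.
descendants = soup.select('.panel_body li')

print(len(direct))       # 2
print(len(descendants))  # 3
```

The nested li is skipped by the child selector because its parent is the ol, not the ul.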
8. Get an attribute of a tag
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#5c6370"># 1. First select the div with class="panel_body"</span>
div <span style="color:#669900">=</span> soup<span style="color:#999999">.</span>select<span style="color:#999999">(</span><span style="color:#669900">".panel_body"</span><span style="color:#999999">)</span><span style="color:#999999">[</span><span style="color:#98c379">0</span><span style="color:#999999">]</span>
<span style="color:#5c6370"># 2. Then read the href attribute of the first a tag inside it</span>
<span style="color:#c678dd">print</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>select<span style="color:#999999">(</span><span style="color:#669900">'a'</span><span style="color:#999999">)</span><span style="color:#999999">[</span><span style="color:#98c379">0</span><span style="color:#999999">]</span><span style="color:#999999">[</span><span style="color:#669900">"href"</span><span style="color:#999999">]</span><span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python">https<span style="color:#999999">:</span><span style="color:#669900">//</span>www<span style="color:#999999">.</span>baidu<span style="color:#999999">.</span>com </code></span></span>
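Indexing with `[0]` and `["href"]` raises an error when the tag or the attribute is missing. Below is a minimal sketch of some more forgiving alternatives. The one-line HTML, the extra class values, and the variable names are made up for illustration, and 'html.parser' is used only so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML, only for illustration.
html = '<div class="panel_body"><a href="https://www.baidu.com" class="link ext">Baidu</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first match, or None when nothing matches.
a = soup.select_one('.panel_body a')
missing = soup.select_one('.no-such-class')

href = a['href']        # raises KeyError if the attribute does not exist
title = a.get('title')  # returns None instead of raising
classes = a['class']    # class is multi-valued, so this is a list
all_attrs = a.attrs     # dict of every attribute on the tag

print(href, title, classes, missing)
```

Checking for None before reading attributes is usually safer when scraping real pages, since the markup may change.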
9. There are four ways to get text content:
(a) string: gets the text content of a tag. It applies to a single tag with no nested tags and returns one string; if the tag has more than one child, it is None.
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#5c6370"># 1. First select the div with class="panel_body"</span>
div <span style="color:#669900">=</span> soup<span style="color:#999999">.</span>select<span style="color:#999999">(</span><span style="color:#669900">".panel_body"</span><span style="color:#999999">)</span><span style="color:#999999">[</span><span style="color:#98c379">0</span><span style="color:#999999">]</span>
<span style="color:#5c6370"># 2. Then take the first a tag inside it and read its text</span>
<span style="color:#c678dd">print</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>select<span style="color:#999999">(</span><span style="color:#669900">'a'</span><span style="color:#999999">)</span><span style="color:#999999">[</span><span style="color:#98c379">0</span><span style="color:#999999">]</span><span style="color:#999999">.</span>string<span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python">百度官网 </code></span></span>
(b) strings: gets all text content under a tag, including text in nested tags. It returns a generator, which can be converted to a list with list(generator).
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#c678dd">print</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>strings<span style="color:#999999">)</span>
<span style="color:#c678dd">print</span><span style="color:#999999">(</span><span style="color:#669900">list</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>strings<span style="color:#999999">)</span><span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#669900"><</span>generator <span style="color:#669900">object</span> Tag<span style="color:#999999">.</span>_all_strings at <span style="color:#98c379">0x000001AA58E525F0</span><span style="color:#669900">></span> <span style="color:#999999">[</span><span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'Foo'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'Bar'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'Jay'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'Foo'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'百度官网'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'Bar'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">,</span> <span style="color:#669900">'\n'</span><span style="color:#999999">]</span> </code></span></span>
(c) stripped_strings: similar to (b), except that it strips leading and trailing spaces and newlines from each string and skips strings that contain only whitespace.
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#c678dd">print</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>stripped_strings<span style="color:#999999">)</span>
<span style="color:#c678dd">print</span><span style="color:#999999">(</span><span style="color:#669900">list</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>stripped_strings<span style="color:#999999">)</span><span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#669900"><</span>generator <span style="color:#669900">object</span> PageElement<span style="color:#999999">.</span>stripped_strings at <span style="color:#98c379">0x000001F9995525F0</span><span style="color:#669900">></span> <span style="color:#999999">[</span><span style="color:#669900">'Foo'</span><span style="color:#999999">,</span> <span style="color:#669900">'Bar'</span><span style="color:#999999">,</span> <span style="color:#669900">'Jay'</span><span style="color:#999999">,</span> <span style="color:#669900">'Foo'</span><span style="color:#999999">,</span> <span style="color:#669900">'百度官网'</span><span style="color:#999999">,</span> <span style="color:#669900">'Bar'</span><span style="color:#999999">]</span> </code></span></span>
(d) get_text(): gets all strings, including those in nested tags, but joins them into a single string and returns it.
Note 2: the first three are attributes, so they take no parentheses; the last one is a method, so it needs parentheses.
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"><span style="color:#c678dd">print</span><span style="color:#999999">(</span>div<span style="color:#999999">.</span>get_text<span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">)</span> </code></span></span>
Output:
<span style="color:#000000"><span style="background-color:#282c34"><code class="language-python"> Foo Bar Jay Foo 百度官网 Bar </code></span></span>
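To compare the four approaches side by side, here is a minimal sketch on a small made-up snippet (the HTML and variable names are my own, and 'html.parser' is used only so it runs without lxml). It also uses the `separator` and `strip` arguments of get_text(), which this post does not cover, to produce cleaner output:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, only for illustration.
html = '<div><a href="#">Baidu</a><ul>\n<li>Foo</li>\n<li>Bar</li>\n</ul></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# .string works when the tag has exactly one child.
a_text = soup.a.string

# The div has two children (a and ul), so .string is None here.
div_string = div.string

# stripped_strings drops whitespace-only strings and trims the rest.
parts = list(div.stripped_strings)

# get_text can join the trimmed pieces with a separator of your choice.
joined = div.get_text(separator=' ', strip=True)

print(a_text)      # Baidu
print(div_string)  # None
print(parts)       # ['Baidu', 'Foo', 'Bar']
print(joined)      # Baidu Foo Bar
```

If you only need one readable string, get_text() with a separator is usually the most convenient; if you need each piece separately, stripped_strings is the better fit.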