"Data Collection and Analysis" Final Exam: Review of Key Web Crawler Topics (Detailed Edition)

Latest recommended article published on 2024-12-16 17:28:00

-北天-


Column: Python Big Data Analysis and Mining. Tags: web crawler, python, beautifulsoup

Article link: https://blog.csdn.net/qq_52417436/article/details/130545061


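This article reviews in detail the key web crawler topics from the "Data Collection and Analysis" course: how the web works, the Robots protocol, requesting pages with the requests library, parsing HTML with the BeautifulSoup library, and getting element attribute values and text. It is intended to help students prepare for the final exam.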


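I. Preliminaries

1. How the web works

The web is built on the client-server model. When a user enters a URL in a web browser, the browser sends a request to the server, asking it to return the requested resource. This request is normally sent over HTTP (Hypertext Transfer Protocol).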

The simplest web service has a two-tier architecture:

[Figure: two-tier web service architecture]

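After receiving the request, the server uses the requested URL and other information in the request to decide which resource to return, and sends that resource back to the browser as an HTTP response. An HTTP response consists of a status code (for example, 200 means success and 404 means the requested resource was not found) and the response content, such as an HTML document, an image, or a video.

After receiving the response, the browser parses its content and builds the visual representation of the page from the markup in the HTML document and the CSS styles. If the page contains JavaScript code, the browser executes it to provide interactivity and dynamic effects.

The web works on open standards, which are developed and maintained by the W3C (World Wide Web Consortium) and other organizations. These standards ensure interoperability between different browsers and servers, which makes the web an open, free, and innovative platform.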
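The request and response cycle described above can be tried out locally without touching a real website. The following is a minimal sketch that uses only the Python standard library; the DemoHandler class, the fetch() helper, and the page content are made up for illustration and are not part of the course material.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one HTML page at '/', answers 404 elsewhere."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><body><h1>Hello</h1></body></html>"
            self.send_response(200)  # status code: success
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)  # response content
        else:
            self.send_error(404)  # status code: resource not found

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(host, port, path):
    """Send a GET request and return (status_code, body_text)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read().decode("utf-8")
    finally:
        conn.close()


# Port 0 lets the operating system pick a free port for the demo server.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

ok_status, ok_body = fetch("127.0.0.1", port, "/")
missing_status, _ = fetch("127.0.0.1", port, "/no-such-page")

server.shutdown()
server.server_close()

print(ok_status, ok_body)
print(missing_status)
```

The client plays the role of the browser here: it sends a request for a path and receives a status code plus content, which is exactly what a crawler works with.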

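2. The Robots protocol for web crawlers

The Robots protocol (Robots Exclusion Protocol) is a standard that tells web crawlers how they may access a website. Its rules are defined in a robots.txt file in the site's root directory.

The protocol specifies which pages crawlers may visit and which they may not, and it can also ask crawlers to limit how often they visit. This lets site administrators control the behavior of search-engine crawlers in order to protect the site's security and privacy.

In robots.txt, administrators use directives to control crawler access. For example, the "User-agent" directive names the crawler that the rules after it apply to; the "Disallow" directive lists the paths that crawlers should not visit; and the "Crawl-delay" directive asks a crawler to wait a given interval between requests (Crawl-delay is a non-standard extension, so not every crawler honors it).

Note that the Robots protocol is only an advisory standard for crawlers, not an enforced rule. Badly behaved crawlers may ignore it, so administrators who want to protect a site's security and privacy as far as possible need additional measures, such as CAPTCHAs and IP blocking.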
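A well-behaved crawler can check these rules before fetching a page. The sketch below uses urllib.robotparser from the Python standard library; the robots.txt content, the crawler name "MyCrawler", and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is kept out of /private/
# and asked to wait 5 seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse rules from text instead of downloading them

allowed_public = parser.can_fetch("MyCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
delay = parser.crawl_delay("MyCrawler")

print(allowed_public, allowed_private, delay)
```

For a real site, set_url() followed by read() downloads the robots.txt file instead of parsing a local string.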

II. Fetching web pages

1. Requesting a page from the server

This section mainly covers the requests library:

import requests

url = 'http://httpbin.org/'
# timeout keeps the call from waiting forever if the server does not respond
response = requests.get(url, timeout=10)

2. Checking the status code of the server's response

response.status_code     # 200 means the request succeeded and the server returned the page
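Status codes fall into classes by their first digit, which is often more useful to a crawler than the exact number. As a small sketch, the hypothetical helper describe_status() below turns a code into a readable label using http.HTTPStatus from the standard library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '404 Not Found (client error)' for a standard status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for codes that are not standard
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirection"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With requests, the same check is usually written as describe_status(response.status_code), or simply response.raise_for_status(), which raises an exception for 4xx and 5xx responses.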

3. Printing the page content

print(response.text)

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <title>httpbin.org</title>
    <link href="https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+Code+Pro:300,600|Titillium+Web:400,600,700"
        rel="stylesheet">
    <link rel="stylesheet" type="text/css" href="/flasgger_static/swagger-ui.css">
    <link rel="icon" type="image/png" href="/static/favicon.ico" sizes="64x64 32x32 16x16" />
    <style>
        html {
            box-sizing: border-box;
            overflow: -moz-scrollbars-vertical;
            overflow-y: scroll;
        }

        *,
        *:before,
        *:after {
            box-sizing: inherit;
        }

        body {
            margin: 0;
...
</div>
</body>

</html>

The page source can also be printed as follows:

print(response.content.decode('utf-8')) # decode() turns the raw response bytes into a str by interpreting them as UTF-8
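Why the codec passed to decode() matters can be shown without any network access. In this sketch the byte strings stand in for what response.content would hold for a UTF-8 page and a GBK page; the sample title is only illustrative.

```python
page_title = "爱丽丝梦游仙境"

utf8_bytes = page_title.encode("utf-8")  # bytes a UTF-8 encoded page would contain
gbk_bytes = page_title.encode("gbk")     # bytes a GBK encoded page would contain

# Decoding with the matching codec recovers the original text.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")

# Decoding with the wrong codec fails here.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# errors="replace" avoids the exception but leaves U+FFFD replacement characters.
garbled = gbk_bytes.decode("utf-8", errors="replace")

print(from_utf8, from_gbk, wrong_codec_failed)
print(garbled)
```

With requests, response.text performs this decoding using response.encoding. When a page comes out garbled, setting response.encoding = response.apparent_encoding before reading response.text is a common remedy.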

III. Locating page elements with BeautifulSoup

The following page fragment is used to demonstrate how to find the content you need with BeautifulSoup:

html='''
<html>
    <head>
        <title>
    The Dormouse's story
        </title>
    </head>
    <body>
        <p class="title">
            <b>
            The Dormouse's story
            </b>
        </p>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            Elsie
        </a>
            ,
        <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
            and
        <a class="sister" href="http://example.com/tillie" id="link2">
            Tillie
        </a>
            ; and they lived at the bottom of a well.
        </p>
        <p class="story">爱丽丝梦游仙境</p>
    </body>
</html>
'''

1. Importing the BeautifulSoup library

# Arguments: html is the HTML document string above; 'html.parser' selects Python's built-in HTML parser to parse it
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'html.parser')

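BeautifulSoup is a commonly used Python library for parsing HTML and XML documents. It extracts data from them and locates elements through search methods such as find() and find_all(), or through CSS selectors with select(). It does not itself support XPath expressions.

The commonly used parameters of the BeautifulSoup() constructor are:

markup: the HTML or XML document to parse, given as a string or as an open file object.
features: the parser to use. Common choices are html.parser from the Python standard library and the third-party lxml and html5lib. If no parser is specified, the best one installed is chosen automatically (and a warning is issued).
builder: a TreeBuilder class or instance to use instead of looking one up from features; it is rarely needed.
parse_only: a SoupStrainer object, so that only the parts of the document matching it are parsed.
from_encoding: the encoding of the document; if it is not specified, the encoding is detected automatically. The constructor has no to_encoding parameter: the output encoding is chosen when the tree is serialized, for example with encode().
exclude_encodings: a list of encodings to rule out during automatic encoding detection.
element_classes: a dictionary that maps BeautifulSoup classes such as Tag and NavigableString to the custom subclasses that should be instantiated in their place.
**kwargs: other keyword arguments, which are passed on to the tree builder.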
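To make two of these parameters concrete, here is a small sketch that parses a GBK-encoded byte string and keeps only the <a> tags. The document content is invented for the example; SoupStrainer is the bs4 class that parse_only expects.

```python
from bs4 import BeautifulSoup, SoupStrainer

# A made-up page, encoded as GBK the way some Chinese sites serve their HTML.
html_bytes = (
    '<html><body><p>爱丽丝梦游仙境</p>'
    '<a href="http://example.com/elsie">爱丽丝</a></body></html>'
).encode("gbk")

only_links = SoupStrainer("a")  # keep only <a> tags while parsing

soup = BeautifulSoup(
    html_bytes,
    "html.parser",
    from_encoding="gbk",    # tell the parser how the bytes are encoded
    parse_only=only_links,  # everything that is not an <a> tag is skipped
)

links = [a["href"] for a in soup.find_all("a")]
link_text = soup.a.string
skipped_paragraph = soup.find("p")  # None, because <p> was never parsed

print(links, link_text, skipped_paragraph)
```

from_encoding only matters when markup is a bytes object; a str has already been decoded.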

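The basic elements of the BeautifulSoup class are:

Tag: a tag in an HTML or XML document, such as <html>, <head>, or <body>; obtained by calling find() or find_all() on a BeautifulSoup object.
NavigableString: a string inside a tag, such as "This is a paragraph" in <p>This is a paragraph</p>; obtained through the string attribute of a Tag object.
Comment: a comment in an HTML or XML document, such as <!-- this is a comment -->. It is a special kind of NavigableString and can be found with find_all(string=lambda text: isinstance(text, Comment)).
BeautifulSoup object: the parse tree of the whole document, containing all tags, comments, and strings; created by calling the BeautifulSoup() constructor.
ResultSet: a list-like collection of the Tag objects that match a search; returned by find_all().
CSS selectors: a syntax for locating elements in an HTML or XML document; used through the select() method, which also returns a ResultSet.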
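The object types listed above can be inspected directly. The following sketch uses a tiny invented document; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

doc = '<p class="note">Hello<!-- a hidden remark --></p><p>World</p>'
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.find("p")         # a Tag
text_node = first_p.contents[0]  # a NavigableString: "Hello"
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
all_p = soup.find_all("p")       # a ResultSet holding two Tag objects
selected = soup.select("p.note") # CSS selector: <p> tags with class "note"

print(type(first_p).__name__, type(text_node).__name__)
print([str(c) for c in comments])
print(len(all_p), len(selected))
```

Because the first <p> holds both a text node and a comment, its string attribute would be None here, which is why the sketch reads contents[0] instead.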

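2. Using find/find_all to locate the tag elements you need

Before searching for tag elements, it helps to know what the common HTML tags mean:

<html>: defines the root element of an HTML document.
<head>: defines the document head, which holds metadata such as the title, styles, and scripts.
<title>: defines the document title, usually shown in the browser's title bar.
<body>: defines the document body, which holds all the content of the page.
<h1>-<h6>: define headings of different levels.
<p>: defines a paragraph, used to split text into sections.
<a>: defines a hyperlink to another page or document.
<img>: defines an image to display in the page.
<ul> and <li>: define an unordered list and its items.
<ol> and <li>: define an ordered list and its items.
<table>, <tr>, <td>: define a table, a table row, and a table cell, used to present tabular data.
<form>, <input>, <button>: define a form, an input field, and a button, used for user interaction and data submission.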

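With the basic HTML tags covered, here are the commonly used parameters of find():

name: the tag name; it can be a string, a regular expression, or a list. For example, soup.find('div') finds the first <div> tag, and soup.find(['div', 'p']) finds the first tag that is either <div> or <p>.
attrs: tag attributes, given as a dictionary; attributes can also be passed as keyword arguments. For example, soup.find(attrs={'class': 'test'}) finds the first tag whose class attribute is test, and soup.find(id='test') finds the first tag whose id attribute is test.
text: the older name of the string parameter; it can be a string or a regular expression. Used on its own it returns the matching text node (a NavigableString), not a tag. For example, soup.find(text='Hello') returns the first string that is exactly Hello, and soup.find(text=re.compile('Hello.*')) returns the first string that contains Hello (use re.compile('^Hello') to require that the string starts with Hello).
limit: the maximum number of results, an integer or None. It applies to find_all(): soup.find_all('div', limit=3) returns the first three <div> tags. find() itself behaves like find_all() with limit=1.
recursive: whether to search all descendants, True or False. The default is True; when set to False, only direct children are searched.

Note that find() returns a Tag object representing the first matching tag (or a NavigableString when searching by text alone). If nothing matches, it returns None.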
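A short sketch of these find() behaviors on an invented document follows. It uses string, the current name of the text parameter, and all variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

doc = """
<div id="main">
  <p class="intro">Hello world</p>
  <p class="body">Goodbye</p>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

by_name = soup.find("p")                                     # first <p> tag
by_attrs = soup.find(attrs={"class": "body"})                # first tag with class "body"
by_id = soup.find(id="main")                                 # attribute as keyword argument
by_string = soup.find(string=re.compile("^Hello"))           # a text node, not a tag
tag_by_string = soup.find("p", string=re.compile("^Hello"))  # the <p> whose string matches
missing = soup.find("table")                                 # nothing matches: None
direct_only = soup.find("p", recursive=False)                # <p> is not a direct child of the document

print(by_name, by_attrs, by_id.name)
print(repr(by_string), tag_by_string)
print(missing, direct_only)
```

Since find() may return None, check the result before accessing attributes on it.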

For example, to find the first <p> element in the document:

first_p=soup.find("p")
first_p

<p class="title">
<b>
     The Dormouse's story
    </b>
</p>

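To inspect the type and the attributes of the element that was found:

# Print the type of the element found: bs4.element.Tag
print(type(first_p))
# Show the attributes of the element found: a dictionary
first_p.attrs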

<class 'bs4.element.Tag'>
{'class': ['title']}

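Next, the commonly used parameters of find_all() in the BeautifulSoup library:

name: the tag name; it can be a string, a regular expression, or a list. For example, soup.find_all('div') finds all <div> tags, and soup.find_all(['div', 'p']) finds all <div> and <p> tags.
attrs: tag attributes, given as a dictionary; attributes can also be passed as keyword arguments. For example, soup.find_all(attrs={'class': 'test'}) finds all tags whose class attribute is test, and soup.find_all(id='test') finds all tags whose id attribute is test.
text: the older name of the string parameter described below; new code should prefer string.
limit: the maximum number of results, an integer or None. For example, soup.find_all('div', limit=3) returns the first three <div> tags.
recursive: whether to search all descendants, True or False. The default is True; when set to False, only direct children are searched.
string: the text to match; it can be a string or a regular expression. Used on its own it returns the matching text nodes rather than tags: soup.find_all(string='Hello') returns all strings that are exactly Hello, and soup.find_all(string=re.compile('Hello.*')) returns all strings that contain Hello. Combined with a tag name, as in soup.find_all('p', string='Hello'), it returns the tags whose string matches.
class_: the class attribute of the tag. The trailing underscore is required because class is a reserved word in Python. For example, soup.find_all(class_='test') finds all tags whose class attribute is test.

Note that find_all() returns a ResultSet object containing every match. If nothing matches, it returns an empty ResultSet. Loop over the ResultSet to process the tags one by one.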
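The find_all() parameters can be compared side by side on a small invented list document. This is only a sketch and the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

doc = """
<ul>
  <li class="fruit">apple</li>
  <li class="fruit">banana</li>
  <li class="veg">carrot</li>
</ul>
<ol><li class="fruit">cherry</li></ol>
"""
soup = BeautifulSoup(doc, "html.parser")

fruits = [li.get_text() for li in soup.find_all("li", class_="fruit")]
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]
lists = [tag.name for tag in soup.find_all(["ul", "ol"])]
starts_with_c = [li.get_text() for li in soup.find_all("li", string=re.compile("^c"))]
none_found = soup.find_all("table")  # empty ResultSet, not None

print(fruits)
print(first_two)
print(lists)
print(starts_with_c)
print(len(none_found))
```

An empty ResultSet is falsy, so "if not soup.find_all('table'):" is a convenient way to test for no matches.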

For example, to find all <a> elements in the document:

a_ls=soup.find_all('a')
for a in a_ls:
    print(a)

<a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
<a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
<a class="sister" href="http://example.com/tillie" id="link2">
     Tillie
    </a>

To find the p elements with class="story" in the document, use the attrs parameter:

p_story1=soup.find_all('p',attrs={"class":"story"})

You can also use the class_ parameter:

p_story2 = soup.find_all('p',class_='story')

Both approaches produce the same result:

[<p class="story">
     Once upon a time there were three little sisters; and their names were
     <a class="sister" href="http://example.com/elsie" id="link1">
      Elsie
     </a>
     ,
     <a class="sister" href="http://example.com/lacie" id="link2">
      Lacie
     </a>
     and
     <a class="sister" href="http://example.com/tillie" id="link2">
      Tillie
     </a>
     ; and they lived at the bottom of a well.
    </p>,
 <p class="story">爱丽丝梦游仙境</p>]

As another example, suppose we want the elements with class="sister". Looking at the document, the tags that carry this attribute are a tags, so we can write:

sister=soup.find_all('a',class_='sister')

Or:

sister=soup.find_all('a',attrs={"class":"sister"})

As before, both calls return the same result:

[<a class="sister" href="http://example.com/elsie" id="link1">
      Elsie
     </a>,
 <a class="sister" href="http://example.com/lacie" id="link2">
      Lacie
     </a>,
 <a class="sister" href="http://example.com/tillie" id="link2">
      Tillie
     </a>]

However, to find all tags whose href attribute equals "http://example.com/elsie":

href = soup.find_all('a',attrs={"href":"http://example.com/elsie"})

[<a class="sister" href="http://example.com/elsie" id="link1">
      Elsie
     </a>]

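The class_ parameter cannot be used here because it matches only the class attribute (see the description of class_ in the find_all() parameter list above). For other attributes, use attrs or pass the attribute as a keyword argument, for example soup.find_all('a', href='http://example.com/elsie').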
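For comparison, here are three ways of filtering on an attribute other than class, shown on a shortened two-link version of the sample document. This is only a sketch; the select() call assumes the CSS selector support that ships with current bs4 releases.

```python
from bs4 import BeautifulSoup

doc = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
"""
soup = BeautifulSoup(doc, "html.parser")
target = "http://example.com/elsie"

via_attrs = soup.find_all("a", attrs={"href": target})        # dictionary filter
via_keyword = soup.find_all("a", href=target)                 # keyword argument filter
via_css = soup.select('a[href="http://example.com/elsie"]')   # CSS attribute selector

print([a["id"] for a in via_attrs])
print([a["id"] for a in via_keyword])
print([a["id"] for a in via_css])
```

The attrs dictionary is the most general form, since it also works for attribute names that are not valid Python identifiers, such as data-id.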

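IV. Getting an element's attribute values

1. Checking whether an element has a given attribute

For example, to check whether the first <p> element in the document has a class attribute: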

first_p.has_attr("class")

True

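Or, to check whether the first <a> element in the document has an id attribute:

first_a = soup.find("a")    # first_a must be defined before it is used
first_a.has_attr("id")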

True

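2. Getting the value of an attribute

Because an element's attribute names and values form a dictionary, attribute values are read with dictionary-style access. For example, to print the href attribute value of every <a> element in the document: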

a_ls=soup.find_all('a')
for a in a_ls:
    print(a["href"])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
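Dictionary-style access raises KeyError when the attribute is missing, so crawlers that process untidy pages often use get() instead. A small sketch on an invented one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="sister external" href="http://example.com/elsie">Elsie</a>',
    "html.parser",
)
a = soup.find("a")

href = a["href"]                               # existing attribute: dictionary access works
title = a.get("title")                         # missing attribute: get() returns None
title_or_default = a.get("title", "no title")  # or a default of your choice
classes = a["class"]                           # class is multi-valued, so this is a list

try:
    a["title"]  # missing attribute with [] access
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(href, title, title_or_default, classes, raised_key_error)
```

The class attribute comes back as a list because an element can carry several classes at once.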

V. Getting the text contained in an element

First, find the first p element with class="story":

p_story_fst=soup.find('p',attrs={"class":"story"})
p_story_fst

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link2">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

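1. Accessing get_text without calling it

get_text is a method, not an attribute. Printing it without parentheses, as in the line below, shows only the bound method object, whose representation happens to embed the element's HTML. Call p_story_fst.get_text() to obtain the text itself.

print(p_story_fst.get_text)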

<bound method PageElement.get_text of <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link2">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>>

2. Using the text attribute to view the text of the element and its descendants (may include whitespace characters)

p_story_fst.text

'\n    Once upon a time there were three little sisters; and their names were\n    \n     Elsie\n    \n    ,\n    \n     Lacie\n    \n    and\n    \n     Tillie\n    \n    ; and they lived at the bottom of a well.\n   '

3. Using the stripped_strings attribute to view the text of the element and its descendants with surrounding whitespace removed

list(p_story_fst.stripped_strings)

['Once upon a time there were three little sisters; and their names were',
 'Elsie',
 ',',
 'Lacie',
 'and',
 'Tillie',
 '; and they lived at the bottom of a well.']
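When a single clean string is wanted rather than a list, get_text() accepts a separator and a strip flag. The sketch below uses a shortened, invented variant of the story paragraph.

```python
from bs4 import BeautifulSoup

doc = """<p class="story">
  Once upon a time
  <a href="http://example.com/elsie">
   Elsie
  </a>
  lived in a well.
</p>"""
soup = BeautifulSoup(doc, "html.parser")
p = soup.find("p")

raw_text = p.get_text()                   # same content as p.text, whitespace included
same_as_text = raw_text == p.text
clean_text = p.get_text(" ", strip=True)  # strip each piece, join the pieces with one space
joined = " ".join(p.stripped_strings)     # equivalent result built from stripped_strings

print(repr(raw_text))
print(clean_text)
```

The first positional argument of get_text() is the separator placed between the pieces of text.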

VI. Traversing document elements

First, find the first p element with class="story":

p_story_fst=soup.find('p',attrs={"class":"story"})
p_story_fst

1. Traversing downward to find the child nodes

for child in p_story_fst.children:
    print(child)

Once upon a time there were three little sisters; and their names were
    
<a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>

    ,
    
<a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>

    and
    
<a class="sister" href="http://example.com/tillie" id="link2">
     Tillie
    </a>

    ; and they lived at the bottom of a well.
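As the output shows, children yields text nodes as well as tags, and it covers only the direct children. The sketch below, on an invented paragraph, separates tags from strings and contrasts children with descendants.

```python
from bs4 import BeautifulSoup, Tag

doc = (
    '<p>Names: <a href="http://example.com/elsie"><b>Elsie</b></a>'
    ' and <a href="http://example.com/lacie">Lacie</a>.</p>'
)
soup = BeautifulSoup(doc, "html.parser")
p = soup.find("p")

# Direct children only: text nodes and the two <a> tags.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
child_strings = [str(c) for c in p.children if not isinstance(c, Tag)]

# Descendants also include nodes nested deeper, such as the <b> inside the first <a>.
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(child_tags)
print(child_strings)
print(descendant_tags)
```

Filtering with isinstance(node, Tag) is a simple way to skip the whitespace and text nodes when only elements are of interest.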

2. Traversing upward to find the parent element

parnt=p_story_fst.parent
parnt.name

'body'

3. Traversing sideways to find the preceding sibling nodes

list(p_story_fst.previous_siblings)

['\n',
 <p class="title">
 <b>
      The Dormouse's story
     </b>
 </p>,
 '\n']

4. Traversing sideways to find the following sibling nodes

list(p_story_fst.next_siblings)

['\n', <p class="story">爱丽丝梦游仙境</p>, '\n']
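The '\n' entries in these sibling lists are whitespace text nodes between the tags. When only the neighboring tags matter, find_next_sibling() and find_previous_sibling() skip them. A sketch on an invented three-paragraph document:

```python
from bs4 import BeautifulSoup

doc = """<body>
<p class="title">Title</p>
<p class="story">First</p>
<p class="story">Second</p>
</body>"""
soup = BeautifulSoup(doc, "html.parser")
first_story = soup.find("p", class_="story")

raw_next = first_story.next_sibling                # the newline text node right after the tag
next_tag = first_story.find_next_sibling("p")      # the next <p> tag, skipping text nodes
prev_tag = first_story.find_previous_sibling("p")  # the previous <p> tag
following = [t.get_text() for t in first_story.find_next_siblings()]
parent_names = [t.name for t in first_story.parents]  # every ancestor, nearest first

print(repr(raw_next))
print(next_tag, prev_tag)
print(following, parent_names)
```

The last entry of parent_names, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.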

VII. Practice exercises

Here is a simple HTML document:

test='''<html><head></head><body><span>1234 
<a href="www.test.edu.cn">This is a test!<b>abc</b></a></span> 
</body></html>'''

Write the code that imports the BeautifulSoup library and creates a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(test, 'html.parser')

Define pos so that it locates (points to) the span element node in the HTML code above:

pos=soup.find('span')    
pos

<span>1234 
<a href="www.test.edu.cn">This is a test!<b>abc</b></a></span>

Print all the text contained in the span element (including the text of its descendants):

print(pos.get_text())

1234 
This is a test!abc

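Print the text directly contained in the span element (excluding the text of its descendants):

# Join only the text nodes that are direct children of span, then trim the surrounding whitespace
print(''.join(pos.find_all(string=True, recursive=False)).strip())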

Find the names of the a element's child and parent nodes:

a_tag = soup.find('a')
children = [child.name for child in a_tag.children if child.name is not None]
parent = a_tag.parent.name
print(children)
print(parent)

['b']
span

Find the hyperlink stored in the a element:

print(a_tag['href'])

www.test.edu.cn

Find the siblings of the a element:

previous_sibling = a_tag.previous_sibling  # the sibling node just before the a element
next_sibling = a_tag.next_sibling  # the sibling node just after the a element

print(previous_sibling)
print(next_sibling)
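To check the sibling results, here is a self-contained sketch that rebuilds the exercise document. Note one assumption: the string below is a close variant of the test document, written without the trailing spaces at the ends of its lines.

```python
from bs4 import BeautifulSoup

# Close variant of the exercise document (no trailing spaces at line ends).
test = (
    "<html><head></head><body><span>1234\n"
    '<a href="www.test.edu.cn">This is a test!<b>abc</b></a></span>\n'
    "</body></html>"
)
soup = BeautifulSoup(test, "html.parser")
a_tag = soup.find("a")

previous_sibling = a_tag.previous_sibling  # the text node in front of <a>
next_sibling = a_tag.next_sibling          # None: <a> is the last child of <span>

# Text that belongs to <span> itself, ignoring the text inside <a> and <b>.
span_own_text = "".join(soup.find("span").find_all(string=True, recursive=False)).strip()

print(repr(previous_sibling))
print(next_sibling)
print(span_own_text)
```

The previous sibling is a text node rather than a tag, and there is no next sibling at all, which is why the second print shows None.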