不良人1李星云 - CSDN Blog

Article link: https://blog.csdn.net/2301_77213258/article/details/140105015

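1. The big data processing pipeline

Collection, cleaning, processing, analysis, visualization

Data collection and preprocessing → data storage and management → data processing and analysis → data visualization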

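2. Data collection: role, formats, tasks, and data structures

Comprehensiveness, multidimensionality, efficiency

Role: use various technical means to collect, in real time or not, the data produced by external data sources, and put that data to use.

Formats: structured data, semi-structured data, unstructured data

Tasks (collection methods): system log collection, distributed message publish/subscribe, ETL, web data collection

Web scraping: use a web crawler to extract information from web pages.

API integration: obtain data by calling APIs.

File import: read data from CSV, Excel, and similar files.

Real-time monitoring: continuously collect real-time data produced by devices or sensors.

Data structures:

Arrays, stacks, queues, hash tables/dictionaries, linked lists, trees, and graphs (this answer is not from the textbook, so use it with caution)

Alternatively, answer with data types (text, images, audio, video)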
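The "file import" method listed above can be sketched with the standard library alone. This is a minimal sketch, not a method from the course material: the sample data, its column names, and the helper `load_rows` are all hypothetical, and an in-memory string stands in for a real CSV file.

```python
import csv
import io

# Hypothetical sample data standing in for a real CSV file on disk.
SAMPLE_CSV = "name,city,age\nAlice,Changsha,21\nBob,Xiangtan,\n"


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = load_rows(SAMPLE_CSV)
for row in rows:
    print(row)
```

Every value comes back as a string, and Bob's missing age arrives as an empty string, which is why collected data usually needs a cleaning step before analysis.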

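3. The concept of data

Data are physical symbols, or combinations of such symbols, that record the properties, states, and interrelationships of objective things. These symbols are abstract and recognizable.

4. How data is organized

Files, databases

5. The role of data cleaning (missing-value handling, outlier handling, data type conversion, duplicate handling)

Data cleaning "washes out" the "dirty" data from large volumes of raw data. It is the last procedure for finding and correcting identifiable errors in data files, and it includes checking data consistency and handling invalid and missing values.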
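The four cleaning tasks named in the heading can be illustrated on a tiny data set. This is a minimal sketch under stated assumptions: the records, the valid age range of 0 to 120, and the choice to fill gaps with the rounded mean are all hypothetical, and real projects often use a library such as pandas instead.

```python
# Hypothetical raw records: one duplicate id, one missing age, one impossible age.
raw = [
    {"id": "1", "age": "21"},
    {"id": "2", "age": ""},
    {"id": "2", "age": ""},
    {"id": "3", "age": "230"},
    {"id": "4", "age": "19"},
]


def clean(records, min_age=0, max_age=120):
    """Deduplicate by id, convert types, drop outliers, then fill gaps."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:  # duplicate handling
            continue
        seen.add(rec["id"])
        # Type conversion; an empty string becomes a missing value (None).
        age = int(rec["age"]) if rec["age"] else None
        # Outlier handling: values outside the assumed range count as missing.
        if age is not None and not (min_age <= age <= max_age):
            age = None
        cleaned.append({"id": int(rec["id"]), "age": age})

    # Missing-value handling: fill with the rounded mean of the valid ages.
    valid = [r["age"] for r in cleaned if r["age"] is not None]
    mean_age = round(sum(valid) / len(valid)) if valid else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = mean_age
    return cleaned


result = clean(raw)
print(result)
```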

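6. The role of data masking

It protects sensitive data.

Data masking is a technique that transforms or modifies sensitive data according to given rules and policies. It largely solves the problem of using sensitive data in untrusted environments.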
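One common masking rule can be sketched as follows. The rule itself (keep the first three and last four digits of an 11-digit phone number and replace the middle four with asterisks) and the function name `mask_phone` are assumptions chosen for illustration, not a rule given in the notes.

```python
import re


def mask_phone(text):
    """Mask the middle four digits of every standalone 11-digit number."""
    # The lookarounds make sure we do not match inside a longer digit run.
    pattern = r"(?<!\d)(\d{3})\d{4}(\d{4})(?!\d)"
    return re.sub(pattern, r"\1****\2", text)


masked = mask_phone("Contact: 13812345678")
print(masked)  # Contact: 138****5678
```

The masked value is still recognizable as a phone number, so it remains usable in test or analytics environments without exposing the original.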

7. The structure of an HTML page

<html>
<head>
</head>
<body>
content
</body>
</html>

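html tag: the root tag (root element). All content must be written inside the html tag, and a page has only one html tag.

head tag: the header tag. It helps the browser interpret the page; its content is not rendered as part of the visible page.

title tag: the title of the page.

body tag: holds the main content of the page. Its content is rendered, which means everything the user sees is written inside the body tag.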
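The split between head and body described above can be checked with Python's built-in `html.parser`. This is a minimal sketch: the sample page and the class name `StructureParser` are hypothetical, and the parser only tracks which tags are currently open.

```python
from html.parser import HTMLParser

# Hypothetical page that follows the skeleton above.
PAGE = """<html>
<head><title>Demo page</title></head>
<body><p>Hello</p></body>
</html>"""


class StructureParser(HTMLParser):
    """Collect the title text and the visible text inside body."""

    def __init__(self):
        super().__init__()
        self.stack = []       # names of the tags that are currently open
        self.title = ""
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self.stack:
            self.title = text
        elif "body" in self.stack:
            self.body_text.append(text)


parser = StructureParser()
parser.feed(PAGE)
print(parser.title, parser.body_text)
```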

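8. What BeautifulSoup does

BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

9. The four BeautifulSoup object types

Tag objects (the individual tags in the HTML), NavigableString objects (the text inside a tag, used to work with strings), the BeautifulSoup object (the entire content of a document, a special Tag object), and Comment objects (comments, a special NavigableString object)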
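The four object types can be seen side by side on a one-line document. This is a small sketch that assumes the `bs4` package is installed; the HTML snippet is made up, and the built-in "html.parser" backend is used here instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical snippet: one tag holding a text node and a comment.
HTML = '<p class="txt">Hello<!--a note--></p>'
soup = BeautifulSoup(HTML, "html.parser")

p = soup.p                    # Tag
text_node = p.contents[0]     # NavigableString
comment_node = p.contents[1]  # Comment

for obj in (soup, p, text_node, comment_node):
    print(type(obj).__name__, repr(str(obj)))
```

The `isinstance` relationships match the descriptions above: the BeautifulSoup object is also a Tag, and a Comment is also a NavigableString.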

import requests
from bs4 import BeautifulSoup

# Fetch the page and get its HTML source
url = "http://www.xtit.edu.cn/"
response = requests.get(url)
html = response.content.decode("utf-8")
soup = BeautifulSoup(html, "lxml")

# Show information about the first a tag
print("Full tag:\n", soup.a)
print("Tag name:\t", soup.a.name)
print("Attributes:\t", soup.a.attrs)
print("href attribute:\t", soup.a["href"])
print("class attribute:\t", soup.a.get("class"))

(3) Run exam10.py to display information about the a tag.

Full tag:
</a>
Tag name: a
Attributes: {'href': '/', 'class': ['logo']}
href attribute: /
class attribute: ['logo']

5.3 Getting a NavigableString object

A NavigableString object is used to work with strings. Once you have a tag, its string attribute returns the text inside the tag as a NavigableString object.

(1) Create exam11.py in the Exam03 folder.

(2) Edit exam11.py:

import requests
from bs4 import BeautifulSoup

# Fetch the page and get its HTML source
url = "http://www.xtit.edu.cn/"
response = requests.get(url)
html = response.content.decode("utf-8")
soup = BeautifulSoup(html, "lxml")

# Show information about the title tag
print("Full tag:\n", soup.title)
print("Tag text:\t", soup.title.string)
print("Object type:\t", type(soup.title.string))

6.1 Iterating over direct child nodes

(1) Create exam12.py in the Exam03 folder.

(2) Edit exam12.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over all direct child nodes of the body tag
for elem in soup.body.contents:
    print(elem)

(3) Run exam12.py to display all direct child nodes of the body tag.

网站列表

百度 - Baidu

…

</a>

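Note that the loop over the direct child nodes of the body tag can also be written as follows. contents returns a list, while children returns an iterator over the same nodes:

# Iterate over all direct child nodes of the body tag
for elem in soup.body.children:
    print(elem)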

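6.2 Iterating over descendant nodes

(1) Create exam13.py in the Exam03 folder.

(2) Edit exam13.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over all descendant nodes of the body tag
for elem in soup.body.descendants:
    print(elem)

(3) Run exam13.py to display all descendants of the body tag. The traversal is recursive, so each tag is printed first and its text is then printed again as a separate node.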

网站列表

网站列表

百度 - Baidu

百度 - Baidu

…

</a>
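The difference between children and descendants is easier to see on a tiny document. This is a small sketch that assumes `bs4` is installed; the HTML string is hypothetical and the built-in "html.parser" backend is used instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: body has two direct children, one of them nested.
HTML = "<body><p>one <b>two</b></p><a>three</a></body>"
soup = BeautifulSoup(HTML, "html.parser")

# Direct children only: the p tag and the a tag.
children = [child.name for child in soup.body.children]

# All descendants in document order; text nodes have no tag name,
# so they are shown as their string value instead.
descendants = [d.name if d.name else str(d) for d in soup.body.descendants]

print(children)
print(descendants)
```

children stops one level down, while descendants also yields the b tag and every text node, which is why the exam13.py output above lists the same text more than once.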

6.3 Iterating over node text

(1) Create exam14.py in the Exam03 folder.

(2) Edit exam14.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over all text inside the body tag
for elem in soup.body.strings:
    print(elem)

(3) Run exam14.py to display all text inside the body tag.

网站列表

百度 - Baidu

…

湘潭理工学院

(4) Modify exam14.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over all text inside the body tag, skipping whitespace-only strings
for elem in soup.body.stripped_strings:
    print(elem)

(5) Run exam14.py to display all text inside the body tag without the blank lines.

网站列表

百度 - Baidu

百度

腾讯 - Tencent

腾讯

搜狐 - Sohu

搜狐

湘潭理工学院

6.4 Getting the direct parent node

(1) Create exam15.py in the Exam03 folder.

(2) Edit exam15.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Get the first img tag and print its parent
img = soup.img
print("Parent node:\n", img.parent)

(3) Run exam15.py to display the direct parent of the img tag.

Parent node:
</a>

6.5 Iterating over ancestor nodes

(1) Create exam16.py in the Exam03 folder.

(2) Edit exam16.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over the ancestors of the first img tag
for elem in soup.img.parents:
    print("Ancestor:\t", elem.name)

(3) Run exam16.py to display every node from the direct parent of the img tag up to the root.

Ancestor: a
Ancestor: body
Ancestor: html
Ancestor: [document]

6.6 Iterating over sibling nodes

(1) Create exam17.py in the Exam03 folder.

(2) Edit exam17.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over the siblings that follow the first a tag
for elem in soup.a.next_siblings:
    print("Sibling:\t", elem)

(3) Run exam17.py to display the siblings that follow the first a tag. Whitespace between tags counts as a text node, which is why some lines are empty.

Sibling:
Sibling:
Sibling: 腾讯 - Tencent
Sibling:
Sibling: <a class="site" href="http://www.tencent.com">腾讯</a>
Sibling:
Sibling:
…

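6.7 Iterating over previous and next elements

(1) Create exam18.py in the Exam03 folder.

(2) Edit exam18.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Iterate over the elements that come after the first a tag in parse order
for elem in soup.a.next_elements:
    print("Next element:\t", elem)

(3) Run exam18.py to display the elements that follow the first a tag. Unlike next_siblings, next_elements also descends into nested nodes, starting with the text inside the a tag itself.

Next element: 百度
Next element:
Next element:
Next element: 腾讯 - Tencent
Next element: 腾讯 - Tencent
Next element:
…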

7.1 Searching child nodes by tag name

(1) Create exam19.py in the Exam03 folder.

(2) Edit exam19.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Search child nodes by tag name
for elem in soup.find_all("a"):
    print("Child node:\t", elem)

(3) Run exam19.py to display all a tags.

Child node: <a class="site" href="http://www.baidu.com">百度</a>
Child node: <a class="site" href="http://www.tencent.com">腾讯</a>
Child node: <a class="site" href="http://www.sohu.com">搜狐</a>
Child node: <a class="home" href="http://www.xtit.edu.cn">
</a>

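7.2 Searching child nodes by regular expression

(1) Create exam20.py in the Exam03 folder.

(2) Edit exam20.py:

import re

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Search child nodes whose tag name matches a regular expression
for elem in soup.find_all(re.compile("^p")):
    print("Child node:\t", elem)

(3) Run exam20.py to display all tags whose name starts with the character "p".

Child node: 网站列表
Child node: 百度 - Baidu
Child node: 腾讯 - Tencent
Child node: 搜狐 - Sohu
Child node: 湘潭理工学院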

7.3 Searching child nodes by list

(1) Create exam21.py in the Exam03 folder.

(2) Edit exam21.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Search child nodes whose tag name is any item of the list
for elem in soup.find_all(["a", "p"]):
    print("Child node:\t", elem)

(3) Run exam21.py to display all a and p tags.

Child node: 网站列表
Child node: 百度 - Baidu
Child node: <a class="site" href="http://www.baidu.com">百度</a>
Child node: 腾讯 - Tencent
Child node: <a class="site" href="http://www.tencent.com">腾讯</a>
Child node: 搜狐 - Sohu
Child node: <a class="site" href="http://www.sohu.com">搜狐</a>
Child node: 湘潭理工学院
Child node: <a class="home" href="http://www.xtit.edu.cn">
</a>

7.4 Searching child nodes with True

(1) Create exam22.py in the Exam03 folder.

(2) Edit exam22.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Passing True matches every tag that has the attribute at all
for elem in soup.find_all(id=True):
    print("Child node:\t", elem)

(3) Run exam22.py to display all tags that have an id attribute, whatever its value.

Child node: 网站列表

7.5 Searching child nodes with a function

(1) Create exam23.py in the Exam03 folder.

(2) Edit exam23.py:

from bs4 import BeautifulSoup


# Return True for tags that have a class attribute but no href attribute
def func(tag):
    return tag.has_attr("class") and not tag.has_attr("href")


# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Search child nodes with the custom filter function
for elem in soup.find_all(func):
    print("Child node:\t", elem)

(3) Run exam23.py to display all tags that have a class attribute but no href attribute.

Child node: 百度 - Baidu
Child node: 腾讯 - Tencent
Child node: 搜狐 - Sohu
Child node: 湘潭理工学院

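7.6 Searching child nodes by keyword and text

The name parameter searches by tag name, and several keyword arguments can be combined to filter on several attributes of a tag at once. If an attribute name is a Python reserved word, append an underscore to it, as in "class_".

The string parameter (called text in BeautifulSoup versions before 4.4.0) searches the string content of tags, excluding comments, and supports regular expressions.

(1) Create exam24.py in the Exam03 folder.

(2) Edit exam24.py:

import re

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Filter on several keywords at once; the raw string keeps \S intact
for elem in soup.find_all(class_=["site", "txt"], string=re.compile(r"\S*理工")):
    print("Child node:\t", elem)

(3) Run exam24.py to display all tags whose class is site or txt and whose text contains "理工".

Child node: 湘潭理工学院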
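The combined keyword and text filter can be tried without the course file. This is a small sketch that assumes `bs4` is installed; the HTML snippet and its English text are hypothetical stand-ins for index.html, and the built-in "html.parser" backend is used instead of lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the course page.
HTML = """
<p class="txt">Xiangtan Institute of Technology</p>
<p class="txt">Baidu</p>
<a class="site" href="http://www.baidu.com">Baidu</a>
"""
soup = BeautifulSoup(HTML, "html.parser")

# A tag must satisfy both filters: its class is one of the listed values
# AND its string matches the regular expression.
matches = soup.find_all(class_=["site", "txt"], string=re.compile(r"Institute"))
names = [tag.get_text() for tag in matches]
print(names)
```

The second p tag and the a tag both pass the class filter but fail the text filter, so only one tag is returned.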

8.1 Searching child nodes by tag name

(1) Create exam25.py in the Exam03 folder.

(2) Edit exam25.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find tags by tag name with a CSS selector
for elem in soup.select('p'):
    print("Child node:\t", elem)

(3) Run exam25.py to display all p tags.

Child node: 网站列表
Child node: 百度 - Baidu
Child node: 腾讯 - Tencent
Child node: 搜狐 - Sohu
Child node: 湘潭理工学院

8.2 Searching child nodes by class name

(1) Create exam26.py in the Exam03 folder.

(2) Edit exam26.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find tags by class name with a CSS selector
for elem in soup.select('.site'):
    print("Child node:\t", elem)

(3) Run exam26.py to display all tags whose class is "site" (selector ".site").

Child node: <a class="site" href="http://www.baidu.com">百度</a>
Child node: <a class="site" href="http://www.tencent.com">腾讯</a>
Child node: <a class="site" href="http://www.sohu.com">搜狐</a>

8.3 Searching child nodes by id

(1) Create exam27.py in the Exam03 folder.

(2) Edit exam27.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find tags by id with a CSS selector
for elem in soup.select('#list'):
    print("Child node:\t", elem)

(3) Run exam27.py to display the tag whose id is "list".

Child node: 网站列表

8.4 Searching child nodes by attribute value

(1) Create exam28.py in the Exam03 folder.

(2) Edit exam28.py:

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find tags by the value of a given attribute
for elem in soup.select('a[class="home"]'):
    print("Child node:\t", elem)

(3) Run exam28.py to display the a tag whose class is "home".

Child node: <a class="home" href="http://www.xtit.edu.cn">
</a>

(1) Create exam29.py in the Exam03 folder.

(2) Edit exam29.py:

from lxml import etree

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
html = etree.HTML(text)

# Search for p tags level by level, starting from the root
for elem in html.xpath("/html/body/p"):
    print(etree.tostring(elem))

(3) Run exam29.py to search for p tags level by level.

b'网站列表\n '

b'百度 - Baidu\n '

b'腾讯 - Tencent\n '

b'搜狐 - Sohu\n '

b'湘潭理工学院\n

(4) Create exam30.py in the Exam03 folder.

(5) Edit exam30.py:

from lxml import etree

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
html = etree.HTML(text)

# Search for p tags at any depth, skipping the intermediate levels
for elem in html.xpath("//p"):
    print(etree.tostring(elem))

(6) Run exam30.py to search for p tags at any depth.

b'网站列表\n '

b'百度 - Baidu\n '

b'腾讯 - Tencent\n '

b'搜狐 - Sohu\n '

b'湘潭理工学院\n

(7) Create exam31.py in the Exam03 folder.

(8) Edit exam31.py:

from lxml import etree

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
html = etree.HTML(text)

# Search for a tags by attribute value
for elem in html.xpath('//a[@class="site"]'):
    print(etree.tostring(elem))

(9) Run exam31.py to search for a tags by attribute value.

b'<a href="http://www.baidu.com" class="site">百度</a>'

b'<a href="http://www.tencent.com" class="site">腾讯</a>'

b'<a href="http://www.sohu.com" class="site">搜狐</a>'

(1) Create exam32.py in the Exam03 folder.

(2) Edit exam32.py:

from lxml import etree

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
html = etree.HTML(text)

# Use a predicate to select the first two a tags
# (position() counts the a tags under the same parent)
for elem in html.xpath('//a[position()<3]'):
    print(etree.tostring(elem))

(3) Run exam32.py to select the first two a tags with a predicate.

b'<a href="http://www.baidu.com" class="site">百度</a>'

b'<a href="http://www.tencent.com" class="site">腾讯</a>'
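The path, attribute, and position lookups above can also be tried with the standard library. Note the swap: this sketch uses `xml.etree.ElementTree` rather than lxml, and ElementTree implements only a subset of XPath (for example, it accepts a numeric index such as `[1]` but not `position()<3`). It also needs well-formed markup, so the document below is a hypothetical, simplified stand-in for index.html.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the course page.
DOC = """<html><body>
<p>Site list</p>
<a class="site" href="http://www.baidu.com">Baidu</a>
<a class="site" href="http://www.tencent.com">Tencent</a>
<a class="home" href="http://www.xtit.edu.cn">Home</a>
</body></html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# Level-by-level path, relative to the root element.
by_path = [p.text for p in root.findall("./body/p")]

# Any depth, filtered by attribute value.
by_attr = [a.text for a in root.findall(".//a[@class='site']")]

# Position predicate: the first a element under its parent.
first_a = root.find(".//a[1]").text

print(by_path, by_attr, first_a)
```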

9.3 Functions

(1) Create exam33.py in the Exam03 folder.

(2) Edit exam33.py:

from lxml import etree

# Read the local page and get its HTML source
path = "D:/2023学年第2学期/数据采集与预处理/源代码/Exam03/index.html"
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
html = etree.HTML(text)

# Use the contains() function to get the text of a tags whose class contains "site"
for elem in html.xpath('//a[contains(@class, "site")]/text()'):
    print(elem)

李星云

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/可视化源代码/brl/index.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find all a tags
a_tags = soup.find_all('a')

# Print the href attribute of every a tag
for tag in a_tags:
    print(tag.get('href'))

from lxml import etree

# Read the local page and get its HTML source
path = "D:/可视化源代码/brl/index.html"
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
html = etree.HTML(text)

# Search for p tags level by level and print their text
for elem in html.xpath("/html/body/p/text()"):
    print(elem)

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/可视化源代码/brl/index1.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find all div tags with class "text"; they hold the news titles
news_titles = soup.find_all('div', class_='text')

# Find all div tags with class "data"; they hold the publication dates
news_dates = soup.find_all('div', class_='data')

# Pair titles and dates in order and print them
for title, date in zip(news_titles, news_dates):
    # Use the text of the <h3> tag as the title, stripping extra whitespace
    title = title.h3.text.strip()
    # Use the text of the <h3> tag as the date, stripping extra whitespace
    date = date.h3.text.strip()
    # Print title and date separated by a tab
    print(f'{title}\t{date}')

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/可视化源代码/brl/index2.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find all div tags with class "text"; each holds a news title and its date
news_titles = soup.find_all('div', class_='text')

# Print the title and date found in each div
for i in news_titles:
    # Use the text of the <h3> tag as the title, stripping extra whitespace
    title = i.h3.text.strip()
    # Use the text of the <p> tag as the date, stripping extra whitespace
    date = i.p.text.strip()
    # Print title and date separated by a tab
    print(f'{title}\t{date}')

from bs4 import BeautifulSoup

# Read the local page and get its HTML source
path = "D:/可视化源代码/brl/index2.html"
with open(path, "r", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")

# Find all div tags with class "text"; each holds a news title and its date
news_titles = soup.find_all('div', class_='text')

# Find all a tags; they hold the news links
news_urls = soup.find_all('a')

# Pair each title div with a link in order and print them
for i, j in zip(news_titles, news_urls):
    # Use the text of the <h3> tag as the title, stripping extra whitespace
    title = i.h3.text.strip()
    # Use the text of the <p> tag as the date, stripping extra whitespace
    date = i.p.text.strip()
    # Get the link target
    link = j.get('href')
    # Print date, title, and link separated by tabs
    print(f'{date}\t{title}\t{link}')