Lab 3: Data Collection and Preprocessing Lab

Article link: https://blog.csdn.net/weixin_41176179/article/details/138868914

Data Collection and Preprocessing Lab

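Contents

Data Collection and Preprocessing Lab
Requirements
1. Crawling Baidu News, plus a free-form extended crawl
2. Crawling Baidu News again with bs4
3. Crawling Faloo Novels (飞卢小说网) and storing the data in a database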

Requirements

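Submit write-ups for five small labs. The first four are individual assignments and the fifth is a group assignment. Each lab needs an accompanying document that follows this template:
1. Crawl Baidu News, plus a free-form extended crawl
2. Crawl Baidu News again with bs4
3. Crawl Faloo Novels (飞卢小说网) and store the data in a database
4. Design a database system with at least 5 tables; draw the diagram in Workbench, with complete attributes and all necessary relationships between tables
5. Install and run Pyspider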

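Note: the download location for all of the lab reports is given below, for reference.
(Labs 4 and 5 have reports only, no code.) Lab report download: Baidu Netdisk (百度网盘)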

1. Crawling Baidu News, plus a free-form extended crawl

The code is as follows (example):

# Import the required modules
import urllib.request  # fetch data from a URL
import re  # regular-expression matching
import datetime  # dates and times

# Two URLs: a news site and a video site
url1 = 'https://news.baidu.com/'
url2 = 'https://v.xiaodutv.com/'

# Download each page and decode the bytes as UTF-8
content1 = urllib.request.urlopen(url1).read().decode('utf-8')
content2 = urllib.request.urlopen(url2).read().decode('utf-8')

# Two regular expressions, one per site.
# re.S lets "." match newlines, so a pattern can span several lines of HTML.
# pattern1 captures the headline titles on the news site
pattern1 = re.compile(r'<li class="hdline.*?<strong>.*?<a.*?>(.*?)</a>.*?strong>', re.S)
# pattern2 captures the popular video titles on the video site
pattern2 = re.compile(r"<li class='poste.*?<a.*?<img.*?<p.*?>(.*?)</p>", re.S)

# Extract every match from the downloaded pages
hotNews1 = pattern1.findall(content1)
hotNews2 = pattern2.findall(content2)

# Print the extracted news headlines
for title in hotNews1:
    print("Top news:", title)
print("")
# Print the extracted video titles
for title in hotNews2:
    print("Top video:", title)
# Print the current time
print(datetime.datetime.now())
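
Because the live pages change over time, it can help to check the headline pattern offline first. The sketch below runs the same regular expression as `pattern1` against a small hand-written HTML fragment; the fragment, its URLs, and the headline text are made up for illustration and only imitate the markup the pattern expects.

```python
import re

# Hypothetical fragment imitating the headline markup the pattern expects;
# the real page may be structured differently.
sample_html = """
<ul>
  <li class="hdline0"><i class="dot"></i>
    <strong><a href="https://example.com/a" target="_blank">First headline</a></strong>
  </li>
  <li class="hdline1"><i class="dot"></i>
    <strong><a href="https://example.com/b" target="_blank">Second headline</a></strong>
  </li>
</ul>
"""

# Same expression as pattern1 in the listing above.
# re.S makes "." match newlines so one match can span several lines.
pattern = re.compile(r'<li class="hdline.*?<strong>.*?<a.*?>(.*?)</a>.*?strong>', re.S)

# findall returns only the captured group, i.e. the link text
headlines = pattern.findall(sample_html)
print(headlines)
```

If the list comes back empty on the real page, the markup has probably changed and the pattern needs adjusting.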

2. Crawling Baidu News again with bs4

The code is as follows (example):

from bs4 import BeautifulSoup  # BeautifulSoup parses the HTML
import requests  # requests sends the HTTP request

url = 'https://news.baidu.com'  # page to crawl
# Send a GET request to fetch the page
res = requests.get(url)
# Parse the page with BeautifulSoup (the 'lxml' parser must be installed)
soup = BeautifulSoup(res.text, 'lxml')

# Remove empty and whitespace-only lines from a string
def remove_empty_lines(s):
    return '\n'.join([line for line in s.splitlines() if line.strip()])

print('Top news:\n')  # heading for the headline section
# The headline items use the class names 'hdline0' to 'hdline5', so try each one
for i in range(6):
    news_list = soup.find_all(class_='hdline{}'.format(i))
    for news in news_list:  # every element found for this class name
        print(remove_empty_lines(news.get_text()))  # print its text without blank lines
print('\nOther news:\n')  # heading for the remaining news
# Elements with the class 'ulist focuslistnews' usually hold the other news items
news_list1 = soup.find_all(class_='ulist focuslistnews')
for news in news_list1:
    print(remove_empty_lines(news.get_text()))

print('Trending search terms:\n')  # heading for the trending search terms
# Elements with the class 'bd' usually hold the trending search terms
news_list2 = soup.find_all(class_='bd')
for news in news_list2:
    print(remove_empty_lines(news.get_text()))
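
The text returned by `get_text()` for a nested list usually carries the blank lines and indentation of the original HTML, which is why the listing passes it through `remove_empty_lines`. The sketch below shows what that helper does on a made-up string, and adds a hypothetical variant, `strip_lines`, that also trims each remaining line.

```python
def remove_empty_lines(s):
    """Drop lines that are empty or contain only whitespace."""
    return '\n'.join(line for line in s.splitlines() if line.strip())


def strip_lines(s):
    """Hypothetical variant: also trim whitespace around each kept line."""
    return '\n'.join(line.strip() for line in s.splitlines() if line.strip())


# Made-up text resembling what get_text() might return for a nested list
raw_text = "\n\nHeadline one\n   \n\tHeadline two\n\n"

cleaned = remove_empty_lines(raw_text)
trimmed = strip_lines(raw_text)

print(repr(cleaned))
print(repr(trimmed))
```

The original helper keeps the leading tab on the second line; the variant removes it, which may be preferable when the text is stored or compared later.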

3. Crawling Faloo Novels (飞卢小说网) and storing the data in a database

The code is as follows (example):

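from bs4 import BeautifulSoup
import requests
import pymysql

# URLs of the 10 novels
urls = [ 'https://b.faloo.com/1409514.html',
         'https://b.faloo.com/1406411.html',
         'https://b.faloo.com/1378270.html',
         'https://b.faloo.com/1406675.html',
         'https://b.faloo.com/1408380.html',
         'https://b.faloo.com/1236995.html',
         'https://b.faloo.com/1405654.html',
         'https://b.faloo.com/1376693.html',
         'https://b.faloo.com/1410749.html',
         'https://b.faloo.com/671060.html'
         ]

db = pymysql.connect(
    host='127.0.0.1',      # server address (public or private network)
    user='root',           # account
    password='123456789',  # password
    db='xiaoshuo',         # database name
    charset='utf8'         # character encoding
)

# Get a cursor
cursor = db.cursor()

def get_url(url):
    """Fetch a novel's index page and return (chapter divs, book title)."""
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Failed to fetch the page: {e}")
        return [], None  # always return two values so the caller can unpack
    soup = BeautifulSoup(response.text, 'html.parser')
    title_tag = soup.find('h1', class_='fs23')
    topic = title_tag.text if title_tag else None
    divs = soup.find_all('div', class_='DivTd3')
    return divs, topic

def get_content(divs, path, a):
    """Download each chapter linked from divs and insert it into xiao_shuo."""
    for div in divs:
        if div.a is None:  # skip cells that contain no chapter link
            continue
        url = 'https:' + div.a['href']
        chapter_name = div.text
        try:
            rq = requests.get(url)
            rq.raise_for_status()
        except requests.RequestException as e:
            print(f"Failed to open the chapter page: '{e}'")
            continue
        soup1 = BeautifulSoup(rq.text, 'html.parser')
        content_div = soup1.find(name='div', class_=a)
        if content_div is None:  # page has no element with the expected class
            print(f"No chapter text found: {url}")
            continue
        # Store in the database; the columns 书名, 章节, 内容 are
        # book title, chapter, and content
        sql = "INSERT INTO xiao_shuo (书名,章节,内容) VALUES (%s,%s,%s)"
        val = (path, chapter_name, content_div.text)
        cursor.execute(sql, val)
        db.commit()
    print(f"{path}------------------all chapters have been saved to the database!")

for novel_url in urls:
    divs, topic = get_url(novel_url)
    if topic is None:  # index page could not be fetched or parsed
        continue
    get_content(divs, topic, 'noveContent')

# Release the database resources
cursor.close()
db.close()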
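
The script above inserts into a table named `xiao_shuo` but does not show how that table is defined. The sketch below is one possible layout, tried out with Python's built-in `sqlite3` module as a stand-in for MySQL so it runs without a server. The English column names (`book`, `chapter`, `content`), the `id` column, and the sample rows are assumptions for illustration; the original table uses the columns 书名, 章节, and 内容, and its real definition is not given in this post.

```python
import sqlite3

# In-memory SQLite database used here as a stand-in for the MySQL server
conn = sqlite3.connect(":memory:")

# Assumed table layout: one row per chapter
conn.execute(
    """
    CREATE TABLE xiao_shuo (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        book    TEXT NOT NULL,
        chapter TEXT NOT NULL,
        content TEXT
    )
    """
)

# Made-up rows standing in for scraped chapters
rows = [
    ("Sample Book", "Chapter 1", "Text of chapter one."),
    ("Sample Book", "Chapter 2", "Text of chapter two."),
]

# Parameterized insert: sqlite3 uses "?" placeholders,
# whereas pymysql uses "%s" for the same purpose.
# "with conn" commits on success and rolls back on an exception.
with conn:
    conn.executemany(
        "INSERT INTO xiao_shuo (book, chapter, content) VALUES (?, ?, ?)",
        rows,
    )

# Read the rows back to confirm they were stored
stored = conn.execute(
    "SELECT book, chapter FROM xiao_shuo ORDER BY id"
).fetchall()
print(stored)

conn.close()
```

Passing the values separately from the SQL string, as both this sketch and the original `cursor.execute(sql, val)` call do, lets the driver handle quoting, so chapter text containing quotes does not break the statement.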