【Python3爬虫】BeautifulSoup库爬取奇客Solidot科技咨询：采集10天内的咨询信息

本文链接：https://blog.csdn.net/m0_46360532/article/details/121737022

前言：本文爬取奇客网数据仅作为学习所用，也是因为已经一个多月没写博客了（因为近期所学的内容都不方便写成公开博客发出来）。
本文爬取技术用的是 BeautifulSoup库，关于其具体用法可自行在网上查阅相关资料，我当时参考的是《Python3 网络爬虫开发实战》这本书（不是打广告）

文章目录

1 网站介绍

本文所爬取的网站叫做奇客Solidot。

奇客Solidot：是至顶网下的科技资讯网站，主要面对开源自由软件和关心科技资讯读者群，包括众多中国开源软件的开发者，爱好者和布道者。

2 需求介绍

爬取本网站主页中 10 天内的新闻资讯，新闻资讯包含类别、标题、内容。

例如：

# 格式
# ===【类别】标题===
# 内容

======【Linux】CentOS Stream 9 释出======
CentOS 项目释出了新版的滚动更新发行版 CentOS Stream 9 ...
...

3 引入库介绍

此处只在本文的背景下介绍这些库在其中的作用。

requests：请求并响应奇客Solidot网的新闻资讯页面信息，即获取网页的 Html；

BeautifulSoup：解析 requests 库获取到的 Html 信息，从中拆分并收集出我们需要的信息；

time：因为需求中有“采集10天内的咨询信息”，因此此库用来进行时间日期的划分和推算（当前日期和前一天日期采用时间戳 - 一天占用的时间戳数量实现，并进行日期格式 YYMMdd 的转换）。

4 具体编码与说明

此次所爬内容，并无反爬机制的阻碍，也没有异步请求响应的出现，十分友好，因此本文就不再比对着 Html 代码来讲了，没了这些阻碍因素后，更适合练习各种爬虫请求库与解析库的使用，因此此网站是一个不错的爬虫练习网站，当然我们还是不要恶意去刷人家服务器流量为好。

而另一个不得不说明的就是本文的主要技术之一 BeautifulSoup 库，Beautiful Soup 是一款强大的解析工具，我们可以不再去写一些复杂的正则表达式，若假设 soup 是 BeautifulSoup 创建的对象，我们就可以利用像是 soup.标签名的方式来获取标签节点的信息，再利用 text 属性来获取标签中的文本信息，亦或是利用 soup.标签名.attrs={‘属性名’:‘属性值’} 配合 find_all 函数来获取相应匹配到的标签节点等等，总之，此库十分方便且强大，读者可自行去搜集相关资料进行学习，本文不再过多赘述其用法。

（1）首先定义一个用于此爬虫的爬虫工具类 MySpider。

class MySpider():
	# ...

（2）在其中定义初始化的必要信息，对于爬虫类来说，可以初始化网页地址，请求头信息（主要根据网页的反爬机制来自行设置）。

def __init__(self):
		'''初始化'''
		self.url = 'https://www.solidot.org/'
		self.headers = {
			'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
			'Accept-Encoding': 'gzip, deflate, br',
			'Accept-Language': 'zh-CN,zh;q=0.9'
		}

（3）用于请求并获取网页的响应信息，此处为了方便后续处理，最终将响应信息转为了 BeautifulSoup 库中的 BeautifulSoup 对象。

def loadPage(self, url):
		'''可以复用的页面请求方法'''
		response = requests.get(url, timeout = 10, headers =self.headers)
		if response.status_code == 200:
			return BeautifulSoup(response.content.decode('utf-8'), 'lxml')
		else:
			raise ValueError('status_code is:',response.status_code)

（4）解析 loadPage 方法返回的 BeautifulSoup 对象，从此对象中解析出我们需要的信息，即新闻的标题与内容，此处由于有的新闻有类别，而有的新闻无类别，因此要做一个判断，这里我令没有类别的新闻属于“无分类”这一类别，最后我们返回一个列表对象，列表中是标题和内容的交替出现。

def getContent(self, soup):
		'''根据网页内容，同时匹配标题和内容'''
		
		items = soup.find_all(attrs={'class':'block_m'})

		result = []
		for item in items:
			title = '' # 类别+标题
			content = '' # 内容
			# print(item.find_all(attrs={'class':'ct_tittle'})[0].div)

			# 包含标题的部分
			divTag = item.find_all(attrs={'class':'ct_tittle'})[0].div

			# print(divTag)
			if(divTag.span != None):
				title = '【' + divTag.span.a.text + '】' + divTag.find_all(name = 'a')[1].text
			else:
				title = '【无分类】' + divTag.find_all(name = 'a')[0].text
			# print(title)

			# 包含内容的部分
			divTag2 = item.find_all(attrs={'class':'p_content'})[0].div
			content = divTag2.text

			result.append('======' + title.strip() + '======' + '\n')
			result.append(content.strip() + '\n')
		
		return result

（5）将上一个函数执行后返回的结果列表持久化存入 txt 文件中（当然实际应用中也可存入 excel 表或是数据库中，此处只是举出一例），每天的新闻咨询结果单独存入一个文件夹，且当天日期需要来参与命名。

def saveContent(self, resultList, nowDate):
		'''数据持久化存入本地文件'''
		flag = 0
		nowDate = str(nowDate)
		fo = open('result-' + nowDate + '.txt', 'w', encoding='utf-8')
		fo.write('日期：' + nowDate[0:4] + '-' + nowDate[4:6] + '-' + nowDate[6:8] + '\n')
		for item in resultList:
			fo.write(item)
			flag += 1
			if (flag % 2 == 0):
				fo.write('\n')

（6）主函数如下，下面代码的迭代部分也不用过多解释了，就是对应需求中的采集 10 天的数据，而对于如何定位当前时间，以及如何回退当前时间至 10 天前，并计入这 10 天中每天的日期，此处我用的是当前时间戳来减去一天的时间戳数量，之后将时间戳转换为日期形式，来进行处理的，而奇客网又比较友好，不同时间的文章是直接在 url 中划分出来的，本着这样的思路，自然得到了如下的代码。
在这里插入图片描述

if __name__ == '__main__':
	spider = MySpider()

	# 获得当前时间时间戳
	nowTickets = int(time.time())

	for _ in range(10):

		#转换为其他日期格式,如:'%Y-%m-%d %H:%M:%S'
		timeArray = time.localtime(nowTickets)
		timeFormat = time.strftime('%Y-%m-%d %H:%M:%S', timeArray)
		nowDate = timeFormat[0:4] + timeFormat[5:7] + timeFormat[8:10]
		nowDate = eval(nowDate)

		soup = None

		if (_ == 0):
			soup = spider.loadPage(spider.url)
		else:
			soup = spider.loadPage(spider.url + '?issue=' + str(nowDate))

		
		resultList = spider.getContent(soup)
		spider.saveContent(resultList, nowDate)

		nowTickets -= 86400

5 最终结果

得到了 10 个新生成的 txt 文件，每个文件保存着当天的新闻资讯，爬取成功。
在这里插入图片描述

6 完整代码

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import time

class MySpider():
	def __init__(self):
		'''初始化'''
		self.url = 'https://www.solidot.org/'
		self.headers = {
			'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
			'Accept-Encoding': 'gzip, deflate, br',
			'Accept-Language': 'zh-CN,zh;q=0.9'
		}

	def loadPage(self, url):
		'''可以复用的页面请求方法'''
		response = requests.get(url, timeout = 10, headers = self.headers)
		if response.status_code == 200:
			return BeautifulSoup(response.content.decode('utf-8'), 'lxml')
		else:
			raise ValueError('status_code is:',response.status_code)

	def getContent(self, soup):
		'''根据网页内容，同时匹配标题和内容'''
		
		items = soup.find_all(attrs={'class':'block_m'})

		result = []
		for item in items:
			title = '' # 类别+标题
			content = '' # 内容
			# print(item.find_all(attrs={'class':'ct_tittle'})[0].div)

			# 包含标题的部分
			divTag = item.find_all(attrs={'class':'ct_tittle'})[0].div

			# print(divTag)
			if(divTag.span != None):
				title = '【' + divTag.span.a.text + '】' + divTag.find_all(name = 'a')[1].text
			else:
				title = '【无分类】' + divTag.find_all(name = 'a')[0].text
			# print(title)

			# 包含内容的部分
			divTag2 = item.find_all(attrs={'class':'p_content'})[0].div
			content = divTag2.text

			result.append('======' + title.strip() + '======' + '\n')
			result.append(content.strip() + '\n')
		
		return result

	def saveContent(self, resultList, nowDate):
		'''数据持久化存入本地文件'''
		flag = 0
		nowDate = str(nowDate)
		fo = open('result-' + nowDate + '.txt', 'w', encoding='utf-8')
		fo.write('日期：' + nowDate[0:4] + '-' + nowDate[4:6] + '-' + nowDate[6:8] + '\n')
		for item in resultList:
			fo.write(item)
			flag += 1
			if (flag % 2 == 0):
				fo.write('\n')

if __name__ == '__main__':
	spider = MySpider()

	# 获得当前时间时间戳
	nowTickets = int(time.time())

	for _ in range(10):

		#转换为其他日期格式,如:'%Y-%m-%d %H:%M:%S'
		timeArray = time.localtime(nowTickets)
		timeFormat = time.strftime('%Y-%m-%d %H:%M:%S', timeArray)
		nowDate = timeFormat[0:4] + timeFormat[5:7] + timeFormat[8:10]
		nowDate = eval(nowDate)

		soup = None

		if (_ == 0):
			soup = spider.loadPage(spider.url)
		else:
			soup = spider.loadPage(spider.url + '?issue=' + str(nowDate))

		
		resultList = spider.getContent(soup)
		spider.saveContent(resultList, nowDate)

		nowTickets -= 86400