[class6] A Web Scraping Example

Latest recommended article published on 2024-10-18 14:25:40

fmc121104


Tags: web scraping

Article link: https://blog.csdn.net/fmc121104/article/details/137447364


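Task: Collect five pages of civil-service job listings open to computer science majors and generate an Excel document from them. Query URL: https://nocturne-spider.baicizhan.com/practise/60/PAGE/1.html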

[Requirements]
1. Excel file save path: /Users/公务员职位信息.xlsx
2. Worksheet name: 计算机科学与技术
3. Column order: region (地区), department (部门), hiring bureau (用人司局), and position name (职位名称)
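The query URL in the task statement ends in a page number, so the five pages can presumably be enumerated by substituting 1 through 5 at that position. Here is a minimal sketch, assuming pages 2 to 5 follow the same pattern as page 1; `BASE_URL` and `build_page_urls` are hypothetical names introduced for illustration:

```python
# URL pattern taken from the task statement, with the page number as a placeholder
BASE_URL = "https://nocturne-spider.baicizhan.com/practise/60/PAGE/{page}.html"


def build_page_urls(first_page=1, last_page=5):
    """Return the listing URLs for first_page..last_page, inclusive."""
    # range() excludes its stop value, hence last_page + 1
    return [BASE_URL.format(page=n) for n in range(first_page, last_page + 1)]


urls = build_page_urls()
print(len(urls))  # 5
print(urls[0])    # https://nocturne-spider.baicizhan.com/practise/60/PAGE/1.html
```

Building the list up front makes it easy to check the page range before sending any requests.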

The following sample code scrapes the listings from the civil-service site and writes them to an Excel file:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import random

# List of User-Agent strings to rotate through, to mimic different browsers
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    # Add more User-Agent strings if needed
]

headers = {"User-Agent": random.choice(user_agents)}

areaList = []
departmentList = []
companyList = []
positionList = []

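for page in range(1, 6):
    url = f"https://nocturne-spider.baicizhan.com/practise/60/PAGE/{page}.html"
    try:
        # A timeout keeps the script from hanging on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise HTTPError on a 4XX/5XX status code
        html = response.text
        soup = BeautifulSoup(html, "lxml")

        table = soup.find(class_="table fsk01")
        # Drop the first row, which is presumably the table header
        content_all = table.find_all("tr")[1:]

        for item in content_all:
            contents = item.find_all("td")
            if len(contents) < 4:
                continue  # skip rows that lack the four expected cells
            contentList = contents[:4]

            # get_text() also works when a cell contains nested tags,
            # where .string would return None
            areaList.append(contentList[0].get_text(strip=True))
            departmentList.append(contentList[1].get_text(strip=True))
            companyList.append(contentList[2].get_text(strip=True))
            positionList.append(contentList[3].get_text(strip=True))

    except requests.RequestException as e:
        print(f"Error requesting page {page}: {e}")
    except AttributeError as e:
        # Raised when the table is not found and soup.find() returns None
        print(f"Error parsing page {page}: {e}")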

total = {
    "地区": areaList,
    "部门": departmentList,
    "用人司局": companyList,
    "职位": positionList,
}

info = pd.DataFrame(total)

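# Keeping the file path in a variable makes it easier to change
file_path = "C:\\Users\\DELL\\公务员职位信息.xlsx"
# The context manager saves and closes the workbook on exit,
# so the private writer._save() call is not needed
with pd.ExcelWriter(file_path) as writer:
    info.to_excel(excel_writer=writer, sheet_name="计算机科学与技术")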

Run result: (screenshot not included)
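As a side note on the parsing loop in the sample code, it keeps four parallel lists in step by hand. One alternative is to collect the rows first and transpose them into the column dictionary afterwards. The sketch below assumes each table row has already been reduced to a list of cell strings; `COLUMNS`, `rows_to_columns`, and the sample rows are hypothetical and not taken from the site:

```python
# Column headers in the order the task asks for
COLUMNS = ["地区", "部门", "用人司局", "职位"]


def rows_to_columns(rows, columns=COLUMNS):
    """Transpose a list of rows into a {column name: list of values} dict.

    Rows with fewer cells than there are columns are skipped; cells beyond
    the named columns are ignored.
    """
    table = {name: [] for name in columns}
    for row in rows:
        if len(row) < len(columns):
            continue  # incomplete row, e.g. a spacer or footer line
        # zip() stops at the shorter sequence, so extra cells are dropped
        for name, cell in zip(columns, row):
            table[name].append(cell.strip())
    return table


# Made-up sample rows, for illustration only
sample_rows = [
    ["Beijing", "Dept A", "Bureau X", "Engineer", "extra cell"],
    ["Shanghai", "Dept B"],  # too short, skipped
    [" Tianjin ", "Dept C", "Bureau Y", "Analyst"],
]
table_data = rows_to_columns(sample_rows)
print(table_data["地区"])  # ['Beijing', 'Tianjin']
```

Because dictionaries preserve insertion order, the result can be passed to `pd.DataFrame` in the same way as `total` in the sample code, and the columns come out in the order listed in `COLUMNS`.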