Python Web Scraping Practice: Batch-Scraping Advisor Profiles and Photos

yshln

Published 2024-06-07 10:44:02

Tags: python, web scraping, experience sharing

Article link: https://blog.csdn.net/yshln/article/details/139519772

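In the era of large models, data has become critically important in information science. The internet today is effectively one enormous database.

A crawler lets users draw on the data resources of the internet far more easily and cheaply.

Written from the perspective of someone who has been learning Python for two days, this post introduces the basic idea behind web crawlers and records a simple Python crawler I wrote to scrape the photos and personal profiles of advisors at the School of Finance and Trade at LNU (Liaoning University).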

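A simple reference for crawler beginners (a very concise Python crawler tutorial)

[Python + Web Scraping] Two months of all-out effort! Please like, coin, and favorite! This is absolutely the most carefully made (bar none) open Python + web scraping course on all of Bilibili, from getting started to (not) going to jail! _bilibili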

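General procedure of a simple crawler

Content to scrape: advisors' photos and profiles

Faculty - School of Finance and Trade, Liaoning University (lnu.edu.cn)

Inspect the site's source code to locate the information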

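To scrape the advisor pages, first scrape the parent (listing) page, store the target sub-links it contains, then request each sub-link in turn and scrape its information.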
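
The link-collection step described above can be sketched offline. This is only an illustration under stated assumptions: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without installing anything, the `LinkCollector` class is a hypothetical helper, and `sample_html` is an invented fragment, so the real listing page's markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect {link text: absolute URL} for <a> tags whose href starts with a prefix."""

    def __init__(self, base_url, prefix):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix
        self.links = {}
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""  # some <a> tags have no href
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            # urljoin resolves the leading '../' against the listing page URL
            self.links[name] = urljoin(self.base_url, self._href)
            self._href = None


# Invented listing-page fragment (assumption, not the real page source)
sample_html = """
<ul>
  <li><a href="../info/1091/1001.htm">Teacher A</a></li>
  <li><a href="../index.htm">Home</a></li>
  <li><a>no href here</a></li>
  <li><a href="../info/1091/1002.htm">Teacher B</a></li>
</ul>
"""

collector = LinkCollector("https://jmxy.lnu.edu.cn/xygk/szdw.htm", "../info/109")
collector.feed(sample_html)
sub_links = collector.links
for name, link in sub_links.items():
    print(name, link)
```

Keeping the result as a name-to-URL dictionary means the later loop can use the name both as a heading in the text file and as the photo's file name.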

From each teacher's page, what we need is the image URL and the accompanying profile text.
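
As a rough sketch of that extraction step, the snippet below pulls the photo URL and the paragraph text out of a page fragment. The assumptions: `ProfileParser` is a hypothetical helper built on the standard library's `html.parser` (not BeautifulSoup), `sample_page` is invented, and the `/__local` prefix is taken from the script in this post as the marker for teacher photos.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProfileParser(HTMLParser):
    """Collect photo URLs (src starting with /__local) and non-empty <p> text."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.image_urls = []
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            src = attrs.get("src") or ""  # some <img> tags have no src
            if src.startswith("/__local"):
                self.image_urls.append(urljoin(self.page_url, src))
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:  # skip paragraphs that contain only whitespace
                self.paragraphs.append(text)
            self._in_p = False


# Invented teacher-page fragment (assumption, not the real page source)
sample_page = """
<div>
  <img src="/images/logo.png">
  <img src="/__local/A/B/photo.jpg">
  <p>Professor, doctoral advisor.</p>
  <p>   </p>
  <p>Research interests: international trade.</p>
</div>
"""

profile = ProfileParser("https://jmxy.lnu.edu.cn/info/1091/1001.htm")
profile.feed(sample_page)
print(profile.image_urls)
print(profile.paragraphs)
```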

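Code (before running it, create a folder named pic in the directory you run the script from, i.e. the current working directory, to hold the scraped images)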
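
If creating the folder by hand is inconvenient, one possible alternative is to let Python create it. The sketch below assumes a hypothetical helper named `ensure_pic_dir` and demonstrates it in a temporary directory rather than the real working directory.

```python
import os
import tempfile


def ensure_pic_dir(base_dir):
    """Create base_dir/pic if it does not exist yet and return its path."""
    pic_dir = os.path.join(base_dir, "pic")
    os.makedirs(pic_dir, exist_ok=True)  # no error if the folder already exists
    return pic_dir


# Demonstrated in a throwaway directory; in the script, base_dir would be os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_pic_dir(tmp)
    second = ensure_pic_dir(tmp)  # calling it again is harmless
    created = os.path.isdir(first)
```

In the script, calling `ensure_pic_dir(os.getcwd())` once before the download loop would give the same `path` value the code computes.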

import os  # file and directory handling
from urllib.parse import urljoin  # resolve relative links into absolute URLs

import requests  # send HTTP requests
from bs4 import BeautifulSoup  # parse HTML

url = 'https://jmxy.lnu.edu.cn/xygk/szdw.htm'  # faculty listing page to scrape
url_2 = "https://jmxy.lnu.edu.cn/"  # site root, used to build image URLs

# Spoofed request headers; not every site needs them
header = {
    'User-Agent': 'look this up in your own browser',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Host': 'jmxy.lnu.edu.cn'
}


# Send the HTTP request
response = requests.get(url, headers=header)
response.encoding = 'utf-8'

# Maps each teacher's name (the link text) to the URL of their page
href_list = {}


# Parse the response and extract the target sub-links
if response.status_code == 200:
    soup_result = BeautifulSoup(response.text, 'html.parser')
    links = soup_result.find_all('a')

    for link in links:
        href = link.get('href') or ''  # some <a> tags have no href
        if href.startswith('../info/109'):
            # urljoin resolves the leading '../' against the listing page URL
            href_list[link.get_text().strip()] = urljoin(url, href)
else:
    print('Request failed')


# The photos go into a pic folder under the current working directory
path = os.path.join(os.getcwd(), "pic")

# Request each sub-link in turn
for key, value in href_list.items():
    response_sub = requests.get(value, headers=header)
    response_sub.encoding = 'utf-8'

    # Start this teacher's entry in text.txt with their name
    with open('text.txt', 'a', encoding='utf-8') as f:
        f.write('\n\n' + key)

    if response_sub.status_code == 200:
        soup_result_sub = BeautifulSoup(response_sub.text, 'html.parser')
        pic_link = soup_result_sub.find_all('img')

        # Scrape the photo
        for item in pic_link:
            src = item.get('src') or ''  # some <img> tags have no src
            if src.startswith('/__local'):
                pic_link_deal = urljoin(url_2, src)
                pic = requests.get(pic_link_deal, headers=header)
                if pic.status_code == 200:
                    name = os.path.join(path, key + '.jpg')
                    with open(name, 'wb') as f:
                        f.write(pic.content)

        # Scrape the text
        link_sub = soup_result_sub.find_all('p')
        for item in link_sub:
            text = item.get_text()
            if text.startswith('作者'):  # skip the "作者" (author) byline
                continue
            # Strip the site's copyright footer text
            text_sub = text.replace('版权所有©辽宁大学金融与贸易学院', '')
            with open('text.txt', 'a', encoding='utf-8') as f:
                f.write(text_sub)

Parameters for the spoofed request headers
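
A minimal sketch of how such a header dictionary might be assembled is given here. `build_headers` is a hypothetical helper, the `Accept` value is shortened for readability, and the User-Agent string is a placeholder that should be replaced with the one copied from your own browser's developer tools. In general, `User-Agent` identifies the client software, `Accept` lists the content types the client is willing to receive, and `Host` names the server being addressed.

```python
from urllib.parse import urlparse


def build_headers(url, user_agent):
    """Build a browser-like header dict for the given URL (illustrative only)."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Host": urlparse(url).netloc,  # derive the host from the URL itself
    }


headers = build_headers(
    "https://jmxy.lnu.edu.cn/xygk/szdw.htm",
    "Mozilla/5.0 (placeholder; copy the real string from your browser)",
)
for field, value in headers.items():
    print(f"{field}: {value}")
```

The resulting dictionary can be passed as the `headers` argument of `requests.get`. Deriving `Host` from the URL avoids a mismatch when the same helper is reused for another site.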