Scraping photos of the 创造营2021 trainees with Python (Python web-scraping basics)

Latest recommended article published 2022-06-23 19:25:17

Nefelibat

Category: Web scraping. Tags: Python scraping basics

Article link: https://blog.csdn.net/qq_41821067/article/details/115696279

Analyzing the page markup

The 创造营2021 main page

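Open https://baike.baidu.com/item/%E5%88%9B%E9%80%A0%E8%90%A52021/53105386?fr=aladdin
Scrolling down the page, you will find a table titled 学员评级 (trainee ratings). Each trainee's name in the table is a hyperlink; clicking it opens that trainee's personal page, which contains their photos.
[Screenshot: the 学员评级 table]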

We locate the table we want by its title, 学员评级, and then collect the links to the trainees' personal pages from it.
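As a minimal sketch of this lookup, the snippet below applies the same idea to a tiny hand-written HTML fragment instead of the live page. The fragment, the names in it, and the helpers `find_table_by_title` and `extract_links` are made up for illustration; the real Baidu Baike markup may differ and can change over time.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; its structure is an assumption.
SAMPLE_HTML = """
<div><h3>其他表格</h3></div>
<table><tr><th>姓名</th></tr><tr><td><a href="/item/x">X</a></td></tr></table>
<div><h3>学员评级</h3></div>
<table>
  <tr><th>姓名</th></tr>
  <tr><td><a href="/item/a">学员A</a></td></tr>
  <tr><td>学员B</td></tr>
</table>
"""


def find_table_by_title(html, title):
    """Return the first <table> whose nearest preceding <div> has an <h3> containing title."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        block = table.find_previous("div")
        if block is None:
            continue
        if any(title in h3.get_text() for h3 in block.find_all("h3")):
            return table
    return None


def extract_links(table, base="https://baike.baidu.com"):
    """Return (name, absolute link) pairs, skipping the header row and rows without a link."""
    links = []
    for row in table.find_all("tr")[1:]:
        anchor = row.find("a")
        if anchor is not None and anchor.get("href"):
            links.append((anchor.get_text(strip=True), base + anchor["href"]))
    return links


table = find_table_by_title(SAMPLE_HTML, "学员评级")
print(extract_links(table))
```

The row for 学员B has no link, so it is skipped rather than causing an error.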

Trainee personal page

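We take 周柯宇 as an example.
[Screenshot: 周柯宇's personal page, with the photo link outlined in red]
As the screenshot shows, what we need is the link inside the red box. Following that link leads to the URLs of the trainee's photos, which we then download.
Press F12 to open the browser's developer tools.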

On the trainee's personal page, we only need to find the links held in the img tags.
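To illustrate what "finding the links in the img tags" means, here is a small sketch that collects every `src` attribute from a hand-written fragment. It uses only the standard library's `html.parser` so that it runs without extra packages; the article's own code uses BeautifulSoup selectors instead. The class name `ImgSrcCollector` and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Made-up fragment standing in for a photo list on a personal page.
SAMPLE = (
    '<div class="pic-list">'
    '<a><img src="https://example.com/1.jpg"></a>'
    '<img alt="no src">'
    '<img src="https://example.com/2.jpg"/>'
    "</div>"
)

collector = ImgSrcCollector()
collector.feed(SAMPLE)
print(collector.srcs)
```

The img tag without a `src` attribute is ignored, which mirrors the need to guard against missing attributes on a real page.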

Python code

import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os
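# Get today's date as YYYYMMDD; used later to name the JSON file
today = datetime.date.today().strftime('%Y%m%d')
# Output folder named 创造营, created under the current working directory
output_path = './创造营/'
# Fetch the 创造营2021 page from Baidu Baike and return the trainee table (or None on failure)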
def crawl_data():
    url = 'https://baike.baidu.com/item/%E5%88%9B%E9%80%A0%E8%90%A52021/53105386?fr=aladdin'
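    # Pretend to be a browser. To find your own User-Agent: press F12, open the Network tab,
    # select All, load any page and inspect a request's headers. The value differs between browsers.
    headers = {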
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
    }
    try:
        # Fetch the page with requests
        response = requests.get(url, headers=headers)
        # Print the HTTP status code
        print('code = {}'.format(response.status_code))
        # print('response.text:\n{}'.format(response.text))
        # response.text is the HTML source of the page
        # Parse it with BeautifulSoup, using lxml as the parser
        soup = BeautifulSoup(response.text, 'lxml')
        # Collect every table that has no class attribute
        tables = soup.find_all('table')
        tables = [table for table in tables if not table.has_attr('class')]
        # print(tables)
        crawl_table_title = "学员评级"
        # Several tables share the same style, so identify ours by the heading before it: "学员评级"
        for table in tables:
            # Look for h3 headings in the nearest div that precedes this table
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                # Compare against the heading's text rather than the Tag object itself
                if crawl_table_title in title.get_text():
                    return table
        # print(table)
    except Exception as e:
        print(e)
    return None
# Parse the trainee table returned by crawl_data and save the results as a JSON file named after today's date
def parse_data(table_html):
    bs = BeautifulSoup(str(table_html), 'lxml')
    all_trs = bs.find_all('tr')
    print(all_trs)
    stars = []
    # all_trs[0] is the header row, so start from index 1 (the second row of the table)
    for tr in all_trs[1:]:
        # Skip rows without an <a> tag: some trainees have no link, and indexing all_a[0]
        # on an empty list would raise IndexError and stop the crawl
        all_a = tr.find_all('a')
        if all_a:
            star = {}
            # Name
            # print(all_a[0])
            star["name"] = all_a[0].text
            # print(star['name'])
            # Link to the trainee's own Baidu Baike page
            star["link"] = 'https://baike.baidu.com' + all_a[0].get('href')
            stars.append(star)
    print(stars)
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    # stars is already a list of dicts, so dump it directly instead of round-tripping through str()
    with open(os.path.join(output_path, today + '.json'), 'w', encoding='utf-8') as f:
        json.dump(stars, f, ensure_ascii=False)


# Visit each trainee's Baidu Baike page, collect the image URLs, and save the images
def crawl_pic_urls():
    with open(os.path.join(output_path, today + '.json'), 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
    }
    for star in json_array:
        name = star['name']
        link = star['link']
        # Collect this trainee's image URLs into the list pic_urls
        response = requests.get(link, headers=headers)
        bs = BeautifulSoup(response.text, 'lxml')
        try:
            pic_list_url = bs.select('.summary-pic a')[0].get('href')
            pic_list_url = 'https://baike.baidu.com' + pic_list_url
        except Exception as e:
            print('Error: {}'.format(str(e)))
            continue
        pic_list_response = requests.get(pic_list_url, headers=headers)
        bs = BeautifulSoup(pic_list_response.text, 'lxml')
        pic_list_html = bs.select('.pic-list img')
        pic_urls = []
        for pic_html in pic_list_html:
            pic_url = pic_html.get('src')
            pic_urls.append(pic_url)
        # Download every URL in pic_urls into a folder named after the trainee
        down_pic(name, pic_urls)


# Download every URL in pic_urls and save the images in a folder named after name
def down_pic(name, pic_urls):
    path = os.path.join(output_path, 'pictures', name)
    if not os.path.exists(path):
        os.makedirs(path)
    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            with open(os.path.join(path, str(i + 1) + '.jpg'), 'wb') as f:
                f.write(pic.content)
                print('%s, image %s: %s' % (name, i + 1, pic_url))
        except Exception as e:
            print('Failed to download image %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue


# Walk the given folder, print the path of every file found, and report the total
def show_pic_path(path):
    pic_num = 0
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            pic_num += 1
            print("Photo %d: %s" % (pic_num, os.path.join(dirpath, filename)))
    print("Downloaded %d photos of 创造营2021 trainees in total" % pic_num)


if __name__ == '__main__':
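    # Fetch the page and locate the table we care about
    html = crawl_data()
    print(html)
    # Extract names and links from the table and save them as JSON (an online JSON viewer can be used to inspect the file)
    parse_data(html)
    # Download the images found on each trainee's page
    crawl_pic_urls()
    # Print the paths of the downloaded files
    show_pic_path(output_path)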
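One detail in the script above: `parse_data` and `crawl_pic_urls` build absolute URLs by concatenating 'https://baike.baidu.com' with an `href`. That works while every `href` is a site-relative path. As a hedged alternative, not what the article's code does, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute or protocol-relative. The helper name `absolute_link` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

BASE = "https://baike.baidu.com"


def absolute_link(href):
    """Resolve href against the site root, whatever form it comes in."""
    return urljoin(BASE, href)


# Site-relative path, as the script assumes.
print(absolute_link("/item/abc/123"))
# Already absolute: returned unchanged instead of being prefixed a second time.
print(absolute_link("https://baike.baidu.com/item/abc/123"))
# Protocol-relative: inherits the https scheme from BASE.
print(absolute_link("//img.example.com/pic/x.jpg"))
```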

Note

Screenshot showing where to get the headers

[Screenshot: request headers in the browser's developer tools]
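The script repeats the same `headers` dictionary in two functions, and the image downloads in `down_pic` send no custom headers at all. One possible way to set the User-Agent once is a shared `requests.Session`; this is a sketch of that option, not the article's approach, and `make_session` is a hypothetical helper. No request is sent here.

```python
import requests

# The User-Agent string used in the article; replace it with the one your browser shows.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)


def make_session():
    """Create a session whose requests all carry the same User-Agent header."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


session = make_session()
# Every later session.get(url, timeout=15) would send this header automatically.
print(session.headers["User-Agent"])
```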

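Output

Console output

We can see that 168 photos were downloaded in total.
[Screenshot: console output]

Scraping results

[Screenshot: scraping results]

JSON file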

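Opening the file in an online JSON viewer shows that each entry holds a trainee's name and the link to that trainee's Baidu Baike page.
[Screenshot: the JSON file in an online viewer]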
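For reference, here is a sketch of the shape of that file and of why it is safer to let the `json` module produce it. The two entries are made-up placeholders, not real scraped data. Building JSON text by swapping quote characters in `str(list)` breaks as soon as a value contains a quote, while `json.dumps` escapes it correctly.

```python
import json

# Placeholder entries in the same shape as the saved file: a list of name/link objects.
stars = [
    {"name": "学员A", "link": "https://baike.baidu.com/item/a"},
    {"name": "O'Neil", "link": "https://baike.baidu.com/item/b"},
]

# Serializing with the json module round-trips cleanly, including the apostrophe.
text = json.dumps(stars, ensure_ascii=False)
restored = json.loads(text)
print(restored == stars)

# Swapping quotes in the Python repr yields invalid JSON for the second entry.
try:
    json.loads(str(stars).replace("'", '"'))
    quote_swap_ok = True
except json.JSONDecodeError:
    quote_swap_ok = False
print(quote_swap_ok)
```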

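pictures folder

The script creates a pictures folder, as shown below.
It contains one subfolder per trainee, holding the photos downloaded for that trainee.
[Screenshot: the pictures folder]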
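The layout can be reproduced in a temporary directory to see how the files are counted. This is a sketch with made-up folder names and empty placeholder files. Note that the script passes `output_path` itself to `show_pic_path`, so the dated JSON file in that folder is included in its count; the hypothetical `count_images` helper below counts only `.jpg` files.

```python
import os
import tempfile


def count_images(root, suffix=".jpg"):
    """Count files under root whose name ends with suffix."""
    return sum(
        1
        for _, _, filenames in os.walk(root)
        for filename in filenames
        if filename.lower().endswith(suffix)
    )


def count_files(root):
    """Count every file under root, as show_pic_path does."""
    return sum(len(filenames) for _, _, filenames in os.walk(root))


with tempfile.TemporaryDirectory() as root:
    # A JSON file next to the pictures folder, as in the script's output folder.
    with open(os.path.join(root, "20210101.json"), "w", encoding="utf-8") as f:
        f.write("[]")
    # One subfolder per trainee, with numbered .jpg files inside.
    for name, n in (("trainee_a", 2), ("trainee_b", 3)):
        folder = os.path.join(root, "pictures", name)
        os.makedirs(folder, exist_ok=True)
        for i in range(1, n + 1):
            with open(os.path.join(folder, f"{i}.jpg"), "wb") as f:
                f.write(b"")
    images = count_images(root)
    everything = count_files(root)

print(images, everything)
```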
Opening one of them at random shows the downloaded photo.
