Getting Started with PaddlePaddle (飞桨), Part 2


—— Notes on Baidu's "Python小白逆袭大神" ("Python Novice to Master") online course, starting April 13, 2020

It was the best of times, it was the worst of times.

It is the best of times: with the rise of the internet, learning resources are everywhere, and questions posted to forums and communities quickly get answers. Anyone with the will can teach themselves.

It is the worst of times: that same internet brings endless entertainment and temptation; videos and games devour attention, and with the slightest lapse in discipline, time slips away for good.

In such a time, self-discipline has become the hardest thing; the goal set this morning often becomes the reason for tonight's insomnia.

Fortunately, in this age of information overload, there are people who not only work diligently to build a capable home-grown deep learning framework, but also run a series of carefully designed courses that teach it for free, supervise everyone's progress, foster a group-learning atmosphere, and offer generous prizes to keep everyone motivated. That is what the Baidu AI team is doing.

The previous CV camp was hard going, so when I heard the name "Python小白逆袭" ("Python novice's comeback"), my first impression was that this one would not be too difficult. After taking it, I realized I had been naive: how could a course from a big tech company not be challenging!

The five-day online course has one hour of lecture per day, covering: Python basics, advanced Python, common Python libraries, PaddleHub hands-on, and EasyDL hands-on. The lectures themselves are simple, but the daily assignments are genuinely challenging. Several large WeChat groups were set up where everyone discussed problems and worked through the assignments together, which quietly leveled up everyone's programming skills.

Day 1 assignment:

A simple formatted-output exercise, followed by file-tree traversal with Python's os module, to get everyone familiar with the AI Studio platform. A minimal sketch of both techniques follows.
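Neither snippet survived in my notes, so here is a minimal sketch of the two techniques, with hypothetical data and paths:

import os

# Formatted output: align columns with old-style % formatting.
scores = {'player_a': 91.5, 'player_b': 87.0}   # hypothetical data
for name, score in scores.items():
    print('%-10s %6.2f' % (name, score))

# File traversal: recursively list every file under a directory.
# 'work/' is the default working directory on AI Studio.
for root, dirs, files in os.walk('work/'):
    for filename in files:
        print(os.path.join(root, filename))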

Day 2 assignment:

Scrape the contestant profiles of Youth With You 2 (《青春有你2》) with a crawler, practicing requests and bs4. Deep learning is data-hungry, and crawlers are an effective way to gather data. Eight pictures in 王姝慧's gallery are loaded dynamically, so their real addresses have to be fetched with an extra getPhoto request. Pressed for time, I simply hard-coded the eight URLs; a better solution is described at: https://blog.csdn.net/yinyiyu/article/details/105722144

import os
import requests

def down_pic(name, pic_urls):
    '''
    Download every image in pic_urls into a folder named after the contestant.
    '''
    path = 'work/' + 'pics/' + name + '/'

    if not os.path.exists(path):
        os.makedirs(path)

    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path + string, 'wb') as f:
                f.write(pic.content)
                print('Downloaded image %s: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('Failed to download image %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue

# Hand-collected URLs for the 8 dynamically loaded photos in 王姝慧's gallery
special_shuhui = [
    'https://bkimg.cdn.bcebos.com/pic/0dd7912397dda144610bc8ebbdb7d0a20cf4866d',
    'https://bkimg.cdn.bcebos.com/pic/4034970a304e251f829d40a1a886c9177f3e5304',
    'https://bkimg.cdn.bcebos.com/pic/96dda144ad3459827c8841cb03f431adcbef846d',
    'https://bkimg.cdn.bcebos.com/pic/ac345982b2b7d0a2dccba0c4c4ef76094b369a6d',
    'https://bkimg.cdn.bcebos.com/pic/7c1ed21b0ef41bd598164cad5eda81cb39db3d25',
    'https://bkimg.cdn.bcebos.com/pic/d31b0ef41bd5ad6eebbcc0b38ecb39dbb6fd3c25',
    'https://bkimg.cdn.bcebos.com/pic/0df431adcbef76099f093dc621dda3cc7cd99e6d',
    'https://bkimg.cdn.bcebos.com/pic/eaf81a4c510fd9f9d72af9b0b965c32a2834359b0e9b'
]


import json
import re
from bs4 import BeautifulSoup


def crawl_pic_urls():
    '''
    Crawl each contestant's Baidu Baike photo gallery and download the images.
    '''
    # `today` (a date string) is defined earlier in the notebook.
    with open('work/' + today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }

    for star in json_array:
        pic_urls = []
        name = star['name']
        link = star['link']

        # Collect every image URL for this contestant into pic_urls.
        try:
            response = requests.get(link, headers=headers)
            soup = BeautifulSoup(response.text, 'lxml')

            # The <div class="summary-pic"> holds the link to the photo gallery.
            summary_pic = soup.find(name='div', attrs={'class': 'summary-pic'})
            if summary_pic is None:
                print(link)
                continue
            pic_link = summary_pic.find('a')['href']

            # Strip the query string and build the absolute gallery URL.
            match_obj = re.match(r'(.*)?\?.*', pic_link)
            pic_link = 'https://baike.baidu.com' + match_obj.group(1)

            response = requests.get(pic_link, headers=headers)
            soup = BeautifulSoup(response.text, 'lxml')
            pic_list = soup.find(name='div', attrs={'class': 'pic-list'}).find_all('a')
            for idx in range(len(pic_list)):
                if pic_list[idx].find('img') is None:
                    # The dynamically loaded entries (starting at index 30) carry
                    # no <img>; substitute the hand-collected URLs instead.
                    print(pic_link, idx)
                    pic_urls.append(special_shuhui[idx - 30])
                    continue
                pic_addr = pic_list[idx].find('img').attrs['src']
                match_obj = re.match(r'(.*)?\?.*', pic_addr)
                pic_urls.append(match_obj.group(1))

        except Exception as e:
            print(e)

        # Download all images in pic_urls into a folder named after the contestant.
        down_pic(name, pic_urls)
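For reference, a minimal driver for the two functions above. The date string here is an assumption (it matches the Day 3 data file); the original notebook defined `today` elsewhere:

today = '20200423'  # assumed value; the real notebook sets this earlier

if __name__ == '__main__':
    crawl_pic_urls()  # reads work/<today>.json, downloads into work/pics/<name>/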

Day 3 assignment:

Analyze the contestants' weight data from Youth With You 2 and draw a pie chart. The sample solution used matplotlib; I used pyecharts (Baidu's charting library) instead, figuring that using Baidu's own library might earn bonus points. It turned out the plain matplotlib submissions scored higher (a matplotlib version is sketched after the code below).

import pandas as pd

from pyecharts.charts import Pie
from pyecharts import options as opts

df = pd.read_json('data/20200423.json')
# Strip non-numeric characters such as 'kg' before converting to float.
weights = df['weight'].apply(lambda x: float(''.join(filter(lambda ch: ch in '0123456789.', x))))

# Bin the weights into four segments.
weight_seg = pd.cut(weights, bins=[0, 45, 50, 55, weights.max() + 1])
# sort_index() keeps the counts in bin order rather than sorted by frequency.
weight_count = list(weight_seg.value_counts().sort_index())

seg_names = ['<=45kg', '45~50kg', '50~55kg', '>55kg']
# Cast to plain int so pyecharts can serialize the values to JSON.
weight_data = [(name, int(count)) for name, count in zip(seg_names, weight_count)]
pie_show = Pie(init_opts=opts.InitOpts(width="900px", height="1000px"))
pie_show.add(
    series_name = '青春有你2',
    data_pair = weight_data,
    radius=["40%", "55%"],
    center=["50%", "32%"],
    label_opts=opts.LabelOpts(
        position="outside",
        formatter="{a|{a}}{abg|}\n{hr|}\n {b|{b}: }{c}  {per|{d}%}  ",
        background_color="#eee",
        border_color="#aaa",
        border_width=1,
        border_radius=4,
        color='auto',
        rich={
            "a": {"color": "#999", "lineHeight": 22, "align": "center"},
            "abg": {
                "backgroundColor": "#e3e3e3",
                "width": "100%",
                "align": "right",
                "height": 22,
                "borderRadius": [4, 4, 0, 0],
            },
            "hr": {
                "borderColor": "#aaa",
                "width": "100%",
                "borderWidth": 0.5,
                "height": 0,
            },
            "b": {"fontSize": 16, "lineHeight": 33},
            "per": {
                "color": "#eee",
                "backgroundColor": "#334455",
                "padding": [2, 4],
                "borderRadius": 2,
            },
        },
    )
) \
.set_global_opts(title_opts=opts.TitleOpts(title="青春有你2体重分布")) \
.render("weights.html")
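For comparison, a minimal matplotlib sketch of the same donut chart, reusing seg_names and weight_count from above:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
# A wedge width < 1 hollows the pie into a donut, matching the pyecharts radius pair.
plt.pie(weight_count, labels=seg_names, autopct='%1.1f%%',
        wedgeprops={'width': 0.4})
plt.title('Youth With You 2 weight distribution')
plt.savefig('weights.png')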

Day 4 assignment:

Recognize the Top 5 contestants of Youth With You 2. Fine-tune a pretrained PaddleHub image-classification model on a self-collected dataset to identify the contestants (a sketch of the fine-tune step follows the crawling code below).

Crawling training images from Baidu Image Search:

import re
import requests
from bs4 import BeautifulSoup
import os
import sys
import glob

import PIL.Image as Image

import paddlehub as hub

import cv2

num = 0          # images downloaded so far for the current keyword
numPicture = 0   # download quota per keyword
List = []        # objURL lists collected per results page

def get_progressbar_str(progress):
    '''Render a textual progress bar for a value in [0, 1].'''
    MAX_LEN = 30
    BAR_LEN = int(MAX_LEN * progress)
    return ('Progress:[' + '=' * BAR_LEN +
            ('>' if BAR_LEN < MAX_LEN else '') +
            ' ' * (MAX_LEN - BAR_LEN) +
            '] %.1f%%' % (progress * 100.))

def Find(url):
    '''Count how many images Baidu Image Search returns for the keyword.'''
    global List
    print('Counting available images, please wait...')
    t = 0
    s = 0
    while t < 1000:
        Url = url + str(t)
        try:
            Result = requests.get(Url, timeout=7)
        except BaseException:
            t = t + 60
            continue
        else:
            result = Result.text
            # Extract the image URLs with a regex over the page source.
            pic_url = re.findall('"objURL":"(.*?)",', result, re.S)
            s += len(pic_url)
            if len(pic_url) == 0:
                break
            else:
                List.append(pic_url)
                t = t + 60
    return s


def recommend(url):
    '''Collect Baidu's related-search suggestions (the topRS block).'''
    Re = []
    try:
        html = requests.get(url)
    except requests.exceptions.RequestException:
        return Re
    else:
        html.encoding = 'utf-8'
        bsObj = BeautifulSoup(html.text, 'html.parser')
        div = bsObj.find('div', id='topRS')
        if div is not None:
            listA = div.findAll('a')
            for i in listA:
                if i is not None:
                    Re.append(i.get_text())
        return Re


def downloadPicture(path_prefix, html, keyword, label=''):
    '''Download every objURL found in one results page.'''
    global numPicture
    global num
    # Extract the image URLs with a regex over the page source.
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    for each in pic_url:
        try:
            if each is not None:
                pic = requests.get(each, timeout=7)
            else:
                continue
        except BaseException:
            # Skip images that cannot be downloaded.
            continue
        else:
            string = os.path.join(path_prefix, label + '_' + str(num) + '.jpg')
            with open(string, 'wb') as fp:
                fp.write(pic.content)
            num += 1
            sys.stderr.write('\r\033[K' + get_progressbar_str(num / numPicture))
            sys.stderr.flush()
        if num >= numPicture:
            return

def difference(hist1, hist2):
    '''Histogram overlap similarity in [0, 1]; 1 means identical histograms.'''
    sum1 = 0
    for i in range(len(hist1)):
        if hist1[i] == hist2[i]:
            sum1 += 1
        else:
            sum1 += 1 - float(abs(hist1[i] - hist2[i])) / max(hist1[i], hist2[i])
    return sum1 / len(hist1)

def similary_calculate(path1, path2):
    '''Compare two images by their RGB histograms at a fixed 256x256 size.'''
    try:
        img1 = Image.open(path1).resize((256, 256)).convert('RGB')
        img2 = Image.open(path2).resize((256, 256)).convert('RGB')
        return difference(img1.histogram(), img2.histogram())
    except Exception as e:
        print(e)
        return None

def check_folder(folder, pic):
    '''Clean a class folder: drop images too similar to the test image,
    unreadable images, and images without exactly one face.'''
    #module = hub.Module(name='pyramidbox_face_detection')
    module = hub.Module(name='ultra_light_fast_generic_face_detector_1mb_640')

    num_similar = 0
    num_bad = 0
    num_multi_face = 0
    pic_number = len(glob.glob(pathname=os.path.join(folder, '*.jpg')))
    processed_num = 0
    for root, directors, files in os.walk(folder):
        for filename in files:
            processed_num += 1
            sys.stderr.write('\r\033[K' + get_progressbar_str(processed_num / pic_number))
            sys.stderr.flush()

            filepath = os.path.join(root, filename)
            if filepath.endswith(".png") or filepath.endswith(".jpg"):
                # Near-duplicates of the held-out test image would leak into training.
                sim_prop = similary_calculate(pic, filepath)
                if sim_prop is None or sim_prop > 0.9:
                    os.remove(filepath)
                    num_similar += 1
                    continue

                # Drop files OpenCV cannot decode.
                if cv2.imread(filepath) is None:
                    os.remove(filepath)
                    num_bad += 1
                    continue

                # Keep only single-face images so each sample shows one contestant.
                res = module.face_detection(data={'image': [filepath]})
                if len(res[0]['data']) != 1:
                    os.remove(filepath)
                    num_multi_face += 1
                    continue
    print('Removed {} images similar to the test image, {} unreadable images, '
          '{} multi-face images'.format(num_similar, num_bad, num_multi_face))

def download_pics():
    global numPicture
    global num
    tm = 500            # download quota per contestant
    numPicture = tm
    line_list = []
    # Each line of name.txt holds a contestant name and a class label.
    with open('./name.txt', encoding='utf-8') as f:
        line_list = [k.strip().split(sep=' ', maxsplit=2) for k in f.readlines()]

    for item in line_list:
        word = item[0]
        label = item[1]
        keyword = word + '+青春有你'
        url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + keyword + '&pn='
        tot = Find(url)
        Recommend = recommend(url)  # related-search suggestions (unused here)
        print('Found {} images for {}; will download {}'.format(tot, word, numPicture))
        filedir = label
        if os.path.exists(label):
            print('Directory already exists!')
            filedir = filedir + '_bk'
        os.mkdir(filedir)
        t = 0
        tmp = url
        num = 0
        while t < numPicture:
            try:
                url = tmp + str(t)
                result = requests.get(url, timeout=10)
            except requests.exceptions.RequestException:
                print('Network connection error!')
                t = t + 60
            else:
                downloadPicture(filedir, result.text, word, label)
                t = t + 60

        print('{} picture download finished!'.format(word))

if __name__ == '__main__':
    download_pics()

    # Clean each class folder against its held-out test image.
    for idx in range(5):
        label = str(idx)
        check_folder(os.path.join(os.getcwd(), label), os.path.join('test', label + '.jpg'))

Baidu Image Search still returns plenty of irrelevant pictures; keyword filtering alone does not work well. Crawling the contestants' Weibo pages instead would yield higher-quality images.
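The fine-tune step itself is not reproduced above. Below is a sketch of what it looked like with the PaddleHub 1.x fine-tune API the course used at the time; the dataset class name and the list-file layout under dataset/ are assumptions, and PaddleHub 2.x has since replaced this API:

import paddlehub as hub
from paddlehub.dataset.base_cv_dataset import BaseCVDataset

class StarDataset(BaseCVDataset):  # hypothetical name for the Top 5 dataset
    def __init__(self):
        # dataset/ is assumed to hold the crawled images plus the list files.
        self.dataset_dir = 'dataset'
        super(StarDataset, self).__init__(
            base_path=self.dataset_dir,
            train_list_file='train_list.txt',
            validate_list_file='validate_list.txt',
            test_list_file='test_list.txt',
            label_list_file='label_list.txt',
        )

module = hub.Module(name='resnet_v2_50_imagenet')  # pretrained backbone
dataset = StarDataset()

# Reader that resizes and normalizes images the way the backbone expects.
data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)

config = hub.RunConfig(
    use_cuda=True,
    num_epoch=3,
    checkpoint_dir='finetune_ckpt',
    batch_size=32,
    eval_interval=10,
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())

# Reuse the backbone's feature map and attach a new classification head.
input_dict, output_dict, program = module.context(trainable=True)
task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=[input_dict['image'].name],
    feature=output_dict['feature_map'],
    num_classes=dataset.num_labels,
    config=config)

task.finetune_and_eval()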

Thanks again to Baidu for the free compute credits and the free course. In this AI era, it is good to have a strong home-grown framework!
