Baidu PaddlePaddle 7-Day Check-in Camp, Python + AI: A Summary

勿語念千

Categories: AI, Python, web crawlers. Tags: python, garbled text, visualization, data analysis, natural language processing

Article link: https://blog.csdn.net/ZGEwen/article/details/105800186

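Full course name: Baidu Deep Learning 7-Day Check-in Camp, Session 6: Python from Novice to Expert.
Course overview: entering the field of artificial intelligence by way of Python
Practice platform: Baidu AI Studio Notebook
Takeaways:

Theory:
- A quick, systematic review of Python's basic syntax, data structures, and related libraries
- Trying out and applying PaddleHub
Practice:
- Hands-on coding alongside each Python topic
- Printing the multiplication table and finding files with a specific name
- Writing two simple crawlers, one scraping Baidu Baike and one scraping iQiyi comments
- Data analysis and visualization of the Baike data
- Multi-class classification of Baike photos with PaddleHub
- Word-frequency statistics, visualization, and a word cloud for the iQiyi comment data
- Content moderation of the comments with PaddleHub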

Python Topics Studied

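String slicing:

Left-closed, right-open

s[a:b]  # yields the characters whose left-to-right indices fall in [a, b)

s = "Hello"
# Python indexes in two directions: left to right starts at 0 and increases by one; right to left starts at -1 and decreases by one
print(s[1:4])  # indices in [1, 4), i.e. the characters at index 1 through index 3. Result: ell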
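To make the two indexing directions concrete, here is a small sketch of a few slices taken from the same string. The variable names are only for illustration.

```python
s = "Hello"

# Left-to-right indices:  H=0,  e=1,  l=2,  l=3,  o=4
# Right-to-left indices:  H=-5, e=-4, l=-3, l=-2, o=-1
middle = s[1:4]         # indices 1, 2, 3
same_middle = s[-4:-1]  # the same characters, addressed from the right
head = s[:2]            # an omitted start means "from the beginning"
tail = s[3:]            # an omitted end means "through the last character"

print(middle, same_middle, head, tail)
```

In every case the end index is excluded, which is the left-closed, right-open rule described above.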

Modifying a specific element of a list

Incorrect approach
Reason: it only rebinds the loop variable fruit; the fruits list itself is never modified

'''
Replace '香蕉' in the fruits list with 'banana'
'''
fruits = ['apple','pear','香蕉','pineapple','草莓']
for fruit in fruits:
    if '香蕉' in fruit:
        fruit = 'banana'
print(fruits)

Correct approach

'''
Replace '香蕉' in the fruits list with 'banana'
'''
fruits = ['apple','pear','香蕉','pineapple','草莓']
for i in range(len(fruits)):
    if '香蕉' in fruits[i]:
        fruits[i] = 'banana'
        break
print(fruits)
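As an alternative to range(len(...)), the same replacement can be sketched with enumerate, or with a list comprehension that builds a new list. This is not from the course material, and unlike the version above it compares for exact equality and replaces every match rather than stopping at the first one.

```python
fruits = ['apple', 'pear', '香蕉', 'pineapple', '草莓']

# In-place replacement: enumerate yields (index, value) pairs,
# so the list can be assigned through its index.
for i, fruit in enumerate(fruits):
    if fruit == '香蕉':
        fruits[i] = 'banana'

# Building a new list and leaving the original untouched.
original = ['apple', 'pear', '香蕉', 'pineapple', '草莓']
replaced = ['banana' if fruit == '香蕉' else fruit for fruit in original]

print(fruits)
print(replaced)
```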

Printing the multiplication table and finding files with a specific name

Printing the multiplication table

The output is:

1*1=1   
1*2=2   2*2=4   
1*3=3   2*3=6   3*3=9   
1*4=4   2*4=8   3*4=12  4*4=16  
1*5=5   2*5=10  3*5=15  4*5=20  5*5=25  
1*6=6   2*6=12  3*6=18  4*6=24  5*6=30  6*6=36  
1*7=7   2*7=14  3*7=21  4*7=28  5*7=35  6*7=42  7*7=49  
1*8=8   2*8=16  3*8=24  4*8=32  5*8=40  6*8=48  7*8=56  8*8=64  
1*9=9   2*9=18  3*9=27  4*9=36  5*9=45  6*9=54  7*9=63  8*9=72  9*9=81

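Padding and alignment with the format method¹: {:<3} is an alignment spec that pads with spaces, left-aligns, and uses a field width of 3

def table():
    # Write your multiplication table code here
    for row in range(1,10):
        for column in range(1,row+1):
            # {:<3}: pad with spaces, left-align, field width 3
            print("{}*{}={:<3}".format(column,row,row*column),end=" ")
        print()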
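For comparison, here is a short sketch of the alignment flags in the format spec next to each other. The value 6 is arbitrary, and repr is used only to make the padding visible.

```python
value = 6

left = "{:<3}".format(value)      # text at the left, padding on the right
right = "{:>3}".format(value)     # text at the right, padding on the left
center = "{:^3}".format(value)    # padding on both sides
starred = "{:*<3}".format(value)  # a fill character may precede the alignment flag

print(repr(left), repr(right), repr(center), repr(starred))
```

The table above uses `<`, so each product is followed by its padding, which keeps the columns lined up.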

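Finding files with a specific name

filename: the file name, or part of one, to look for
The os.walk method traverses every subdirectory and file under a directory. path is the directory to traverse, and each step of the traversal yields a 3-tuple (root, dirs, files).

root is the path of the folder currently being visited
dirs is a list of the names of the directories directly inside that folder (deeper levels are not included)
files is likewise a list of the names of the files directly inside that folder (files in subdirectories are not included)

import os

def findfiles(filename, path):
    # Collect the full path of every file under path whose name contains filename
    result = []
    for root, dirs, files in os.walk(path):
        for fi in files:
            if filename in fi:
                # Add the matching file's full path to the list
                result.append(os.path.join(root, fi))
    return result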
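To see what the (root, dirs, files) tuples look like, the sketch below walks a small temporary directory. The layout (a.txt plus sub/b.txt) is made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: top/a.txt and top/sub/b.txt
    os.mkdir(os.path.join(top, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, relative), "w") as handle:
            handle.write("demo")

    # Record each (root, dirs, files) tuple, with root made relative to top
    visited = [
        (os.path.relpath(root, top), sorted(dirs), sorted(files))
        for root, dirs, files in os.walk(top)
    ]

print(visited)
```

The top folder is visited first and lists only its direct children; b.txt shows up only once the walk reaches sub.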

Simple crawlers

Scraping the Baidu Baike photo galleries

import re
p = re.compile('item')
for star in json_array:
    name = star['name']
    link = star['link']
    # Scrape each contestant's pictures and store all image URLs in a list, pic_urls
    # Replace item in link with pic: https://baike.baidu.com/item/段艺璇/19429153 becomes https://baike.baidu.com/pic/%E6%AE%B5%E8%89%BA%E7%92%87/19429153
    pic_link = p.sub('pic',link)
    # The scraping logic based on pic_link follows
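The substitution itself does not percent-encode the name; the encoded form in the comment is what an HTTP client or browser ends up sending. Here is a self-contained sketch of both steps, using the sample link from the comment. Anchoring on /item/ and limiting the count to 1 is my own tightening, not part of the original code.

```python
import re
from urllib.parse import quote

link = 'https://baike.baidu.com/item/段艺璇/19429153'

# Anchor on the path segment and replace only its first occurrence,
# so an 'item' elsewhere in the URL is left alone.
pic_link = re.sub(r'/item/', '/pic/', link, count=1)

# Percent-encode the non-ASCII characters; ':' and '/' stay unescaped.
encoded_link = quote(pic_link, safe=':/')

print(pic_link)
print(encoded_link)
```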

Scraping iQiyi comments

import requests

# Must be updated on every request
lastId = -1
# url
url = "https://sns-comment.iqiyi.com/v3/comment/get_comments.action?"
# Request params
params = {
    "types":"time",
    "business_type":"17",
    "agent_type":"118",
    "agent_version":"9.11.5",
    "page_size": "20",
    "authcookie":"null",
    "content_id": "15068699100"
}
# Update last_id on every request
if lastId != "":
    params["last_id"] = lastId
session = requests.Session()
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    "Accept": "application/json",
    "Referer": "https://www.iqiyi.com/v_19ryfkiv8w.html",
}
response = session.get(url, headers=headers, params=params)

A key may be missing from the returned JSON data. First way to obtain lastId:

for comment in comments:
    cid = comment["id"]
    try:
        content = comment["content"]
    except KeyError:
        # Some comments carry no content field
        content = ''
    # Keep the id of the most recent comment seen; after the loop it is the last one
    lastId = cid

lastId is the id of the last comment in the batch. Second way to obtain it:

lastId = comments[-1]["id"]
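Putting the pieces together, the paging loop can be sketched as follows. To keep the example runnable offline, fetch_page is a hypothetical stand-in for the HTTP request, and FAKE_COMMENTS is made-up data in which every third comment lacks a content key.

```python
def fetch_all(fetch_page, max_pages=100):
    """Collect comments page by page, using the last comment's id as the cursor."""
    collected = []
    last_id = None
    for _ in range(max_pages):
        comments = fetch_page(last_id)
        if not comments:
            break  # an empty page means there is nothing left
        for comment in comments:
            # dict.get covers comments that have no "content" key
            collected.append({"id": comment["id"],
                              "content": comment.get("content", "")})
        last_id = comments[-1]["id"]
    return collected

# Hypothetical in-memory data standing in for the remote service
FAKE_COMMENTS = [{"id": i, "content": f"comment {i}"} if i % 3 else {"id": i}
                 for i in range(1, 8)]

def fake_fetch_page(last_id, page_size=3):
    """Return the page that follows last_id (or the first page when it is None)."""
    if last_id is None:
        start = 0
    else:
        start = next(i for i, c in enumerate(FAKE_COMMENTS) if c["id"] == last_id) + 1
    return FAKE_COMMENTS[start:start + page_size]

all_comments = fetch_all(fake_fetch_page)
print(len(all_comments))
```

With a real endpoint, fetch_page would wrap the session.get call and pull the comment list out of the response. The exact response layout is not shown in this post, so it is left out here.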

Data analysis and visualization of the Baike data

import re
import pandas as pd

p = re.compile('kg')
def de_weight(x):
    # Strip the kg unit from a weight string
    return p.sub('',x)
df = pd.read_json('data/data31557/20200422.json')
# Add a labels column to df holding the weight without its unit
df['labels'] = df['weight'].map(de_weight)
# Bin edges
listBins = [0, 45, 50, 55, 1000]
# Label for each bin
listLabels = ['<=45kg','45~50kg','50~55kg','>55kg']
df['labels'] = pd.cut(pd.to_numeric(df['labels']), bins=listBins, labels=listLabels, include_lowest=True)
# Group by labels and count
grouped=df['labels'].groupby(df['labels'])
s1 = grouped.count()
labels = s1.index
count_list = s1.values
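By default the bins used above are closed on the right, so a weight of exactly 50 lands in 45~50kg. The sketch below reproduces that rule with the standard library only (bisect and Counter rather than pandas), to show which bin each value falls into. The sample weights are made up.

```python
import re
from bisect import bisect_left
from collections import Counter

# Hypothetical weights in the same "NNkg" form as the weight column
weights = ['46kg', '45kg', '50kg', '52kg', '58kg']

inner_edges = [45, 50, 55]  # boundaries between the four bins
bin_labels = ['<=45kg', '45~50kg', '50~55kg', '>55kg']

def label_for(weight_text):
    """Strip the unit and map the number to its right-closed bin."""
    value = float(re.sub('kg', '', weight_text))
    # bisect_left puts a value equal to an edge into the lower bin,
    # which mirrors right-closed intervals such as (45, 50].
    return bin_labels[bisect_left(inner_edges, value)]

counts = Counter(label_for(w) for w in weights)
print(counts)
```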

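Handling garbled Chinese text

import matplotlib.pyplot as plt

# Enable Chinese text rendering
plt.rcParams['font.sans-serif'] = ['SimHei'] # set the default font
plt.figure(figsize=(20,15))
plt.pie(count_list,labels=labels,autopct='%1.1f%%',shadow=False,startangle=90,textprops={'fontsize':20})
plt.axis('equal')
plt.legend(fontsize = 20)
# The title reads: weight distribution of the "Youth With You 2" contestants
plt.title('''《青春有你2》参赛选手体重分布''',fontsize = 24)
plt.savefig('bar_result03.jpg')
plt.show()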

Multi-class classification of Baike photos with PaddleHub

Path issues in the parameter settings

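Analyzing the iQiyi comments

Garbled Chinese text in the word-frequency visualization

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager

Fix 1 for garbled Chinese text: use the settings shown in the Baike data analysis and visualization section
Fix 2 for Chinese text rendered as empty boxes:

Use this when fix 1 does not work.
Fix 2 loads an external font file directly; you only need to specify where the font file is located.

font_manager.FontProperties sets the display font
fname='/home/SimHei.ttf' is the location of the font file
Pass the resulting fonts object wherever text is drawn:
plt.title('''词频统计结果TOP10''',fontsize = 24,fontproperties=fonts)
plt.xticks(rotation=45,fontsize=20,fontproperties=fonts)
plt.legend(prop=fonts)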

# Enable Chinese text rendering
fonts = font_manager.FontProperties(fname='/home/SimHei.ttf', size=23)

plt.figure(figsize=(20,15))
# The title reads: word-frequency results, TOP10
plt.title('''词频统计结果TOP10''',fontsize = 24,fontproperties=fonts)
plt.bar(range(len(count_list)), count_list,color='r',tick_label=word_list,facecolor='#9999ff',edgecolor='white')

# Tilt the x-axis tick labels (rotation is in degrees), set the tick font size, and apply the Chinese font
plt.xticks(rotation=45,fontsize=20,fontproperties=fonts)
plt.yticks(fontsize=20)
# Apply the Chinese font to the legend
plt.legend(prop=fonts)
plt.savefig('bar_top_10.jpg')
plt.show()
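The plotting code above takes word_list and count_list as given. One way they might be produced is sketched below with collections.Counter. The words list is a made-up stand-in for the output of a word segmenter, and TOP 2 is used instead of TOP 10 to keep the sample short.

```python
from collections import Counter

# Hypothetical output of a word segmenter run over the comments
words = ['加油', '喜欢', '加油', '可爱', '加油', '喜欢']

# most_common returns (word, count) pairs, most frequent first
top = Counter(words).most_common(2)

word_list = [word for word, _ in top]
count_list = [count for _, count in top]

print(word_list, count_list)
```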

Drawing the word cloud

When drawing a word cloud with WordCloud, the image used for mask should have a white background

Content moderation of the comments with PaddleHub

Sample prediction code for text moderation with porn_detection_lstm.

import json
import six
import paddlehub as hub

def text_detection():
    '''
    Analyze the comments with PaddleHub
    return: the analysis results
    '''
    porn_detection_lstm = hub.Module(name="porn_detection_lstm")
    # Read one comment per line; the with block closes the file afterwards
    with open('work/comment_new.txt', 'r', encoding='utf-8') as f:
        test_text = [line.strip() for line in f]
    input_dict = {"text": test_text}
    results = porn_detection_lstm.detection(data=input_dict,use_gpu=True, batch_size=1)

    # Attach the original comment text to each result
    for index, text in enumerate(test_text):
        results[index]["text"] = text
    for result in results:
        if six.PY2:
            print(
                json.dumps(result, encoding="utf8", ensure_ascii=False))
        else:
            print(result)
    return results

¹ https://www.cnblogs.com/lvcm/p/8859225.html
