Summary of the Baidu AI 7-Day Check-in Training Camp (CSDN blog)

Article link: https://blog.csdn.net/qq_44707928/article/details/105882997


Training camp summary
- Day1
- Day2
- Day3
- Day4
- Day5
- Reflections

Training camp summary

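The camp consisted of 7 lectures and 5 assignments.
Day1: simple formatted output and basic use of the os library
Day2: scraping contestant information (the image-scraping part is my own code)
Day3: data analysis of the 《青春有你2》 (Youth With You 2) contestants
Day4: recognising 《青春有你2》 contestants
Day5: final comprehensive assignment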

Day1

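Assignment 1:
1. Print the 9*9 multiplication table (mind the format)
2. Find files whose names contain a given string

1. Print the 9*9 multiplication table (mind the format):
This one is dessert-level easy;
formatted output is all it takes.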

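def table():
    for i in range(1, 10):          # i is the row number (second factor)
        for j in range(1, i + 1):   # j runs from 1 up to i
            print("{}*{}={}".format(j, i, j*i),"\t", end="")
        print("\n")                 # ends the row and leaves a blank line after it

table()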

1*1=1 	

1*2=2 	2*2=4 	

1*3=3 	2*3=6 	3*3=9 	

1*4=4 	2*4=8 	3*4=12 	4*4=16 	

1*5=5 	2*5=10 	3*5=15 	4*5=20 	5*5=25 	

1*6=6 	2*6=12 	3*6=18 	4*6=24 	5*6=30 	6*6=36 	

1*7=7 	2*7=14 	3*7=21 	4*7=28 	5*7=35 	6*7=42 	7*7=49 	

1*8=8 	2*8=16 	3*8=24 	4*8=32 	5*8=40 	6*8=48 	7*8=56 	8*8=64 	

1*9=9 	2*9=18 	3*9=27 	4*9=36 	5*9=45 	6*9=54 	7*9=63 	8*9=72 	9*9=81
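
If the trailing tab and the blank line between rows are not wanted, the rows can be built as strings first and printed afterwards. This is only a minimal sketch; `table_rows` is a name introduced here for illustration and is not part of the assignment.

```python
def table_rows(n=9):
    """Return the rows of an n*n multiplication table as tab-separated strings."""
    rows = []
    for i in range(1, n + 1):
        # One cell per column j <= i, e.g. "2*3=6"
        cells = ["{}*{}={}".format(j, i, j * i) for j in range(1, i + 1)]
        rows.append("\t".join(cells))
    return rows


for row in table_rows():
    print(row)
```

Joining the cells with "\t" puts a tab between cells only, so no row ends with stray whitespace.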

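2. Find files whose names contain a given string:
This task uses the walk function from the os module.
os.walk() yields a three-element tuple for each directory, unpacked here into dp (directory path), dn (subdirectory names) and fn (file names).
A for-in loop then iterates over the file names in fn.
Inside the loop, x.find() returns a value >= 0 when the name contains the target string,
and the match is appended to the result list.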

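import os

def findfiles():
    # result, path and filename are module-level variables defined elsewhere in the notebook
    global result, path
    for dp, dn, fn in os.walk(path):
        for x in fn:
            if x.find(filename) >= 0:   # the file name contains the target string
                result.append(x)
    for i, name in enumerate(result, start=1):
        print("{} : {}".format(i, name))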

1 : 04:22:2020.txt
2 : 182020.doc
3 : new2020.txt

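However, the required output format includes the path starting from the root directory,
so this version scored only 85.
I changed it to: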

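def findfiles():
    # result, path and filename are module-level variables defined elsewhere in the notebook
    global result, path
    for dp, dn, fn in os.walk(path):
        for x in fn:
            if x.find(filename) >= 0:
                # keep the directory part so the full relative path is printed
                result.append(os.path.join(dp, x))
    for i, name in enumerate(result, start=1):
        print("{} : {}".format(i, name))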

Final output:

1 : Day1-homework/18/182020.doc
2 : Day1-homework/4/22/04:22:2020.txt
3 : Day1-homework/26/26/new2020.txt
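
To try the same idea without the course data, here is a self-contained sketch that builds a small throwaway directory tree and searches it. The function name `find_files` and the file names in the tree are made up for this example; passing the root and keyword as arguments instead of using globals is a design choice of the sketch, not of the assignment.

```python
import os
import tempfile


def find_files(root, keyword):
    """Return paths under root whose file name contains keyword, sorted for stable output."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if keyword in name:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Build a small temporary tree (hypothetical names) and search it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("a/new2020.txt", "a/b/182020.doc", "a/b/other.txt"):
        with open(os.path.join(root, *rel.split("/")), "w") as f:
            f.write("")
    found = [os.path.relpath(p, root) for p in find_files(root, "2020")]
    for i, p in enumerate(found, start=1):
        print("{} : {}".format(i, p))
```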

Day2

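Assignment 2:
Scrape information about the 《青春有你2》 contestants.
Apart from the module that scrapes contestant pictures, all code blocks were provided by the camp.
For a scraping beginner like me, this assignment was fairly difficult;
see step 3 for details.

1. Scrape the information of all 《青春有你2》 contestants from Baidu Baike and return the page data
This is a static Baidu Baike page, so there is not much to say: the requests and bs4 libraries are enough.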

import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os
#Get today's date and format it for use in file names later, e.g. 20200420
today = datetime.date.today().strftime('%Y%m%d')    
def crawl_wiki_data():
    """
    Scrape the 《青春有你2》 contestant information from Baidu Baike and return the html table
    """
    headers = { 
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url='https://baike.baidu.com/item/青春有你第二季'                         

    try:
        response = requests.get(url,headers=headers)
        soup = BeautifulSoup(response.text,'lxml')
        tables = soup.find_all('table',{'class':'table-view log-set-param'})
        crawl_table_title = "参赛学员"  # heading text of the contestants table on the page
        for table in  tables:           
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if(crawl_table_title in title):
                    return table       
    except Exception as e:
        print(e)

2. Parse the scraped page data and save it as a JSON file:
Extract every column of the table and save the result to a JSON file.

def parse_wiki_data(table_html=None):
    # Scrape the page only when no table was passed in
    if table_html is None:
        table_html = crawl_wiki_data()
    bs = BeautifulSoup(str(table_html),'lxml')
    all_trs = bs.find_all('tr')
    error_list = ['\'','\"']
    stars = []

    for tr in all_trs[1:]:
        all_tds = tr.find_all('td')
        star = {}
        # Name
        star["name"]=all_tds[0].text
        # Link to the contestant's own Baidu Baike page
        star["link"]= 'https://baike.baidu.com' + all_tds[0].find('a').get('href')
        # Place of origin
        star["zone"]=all_tds[1].text
        # Zodiac sign
        star["constellation"]=all_tds[2].text
        # Height
        star["height"]=all_tds[3].text
        # Weight
        star["weight"]= all_tds[4].text
        # Flower language; strip single and double quotes from it
        flower_word = all_tds[5].text
        for c in flower_word:
            if  c in error_list:
                flower_word=flower_word.replace(c,'')
        star["flower_word"]=flower_word 
        # Company
        if all_tds[6].find('a') is not None:
            star["company"]= all_tds[6].find('a').text
        else:
            star["company"]= all_tds[6].text  
        # print(star)
        stars.append(star)

    # stars is already a list of plain dicts, so it can be dumped directly
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        # ensure_ascii=False is required to keep Chinese characters readable
        json.dump(stars, f, ensure_ascii=False)
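
A short standard-library sketch of why serialising the list directly is safer than converting its string representation by swapping quote characters. The record below is invented for the example; the apostrophe in it is deliberate.

```python
import json

# Hypothetical record; the apostrophe in flower_word is deliberate.
stars = [{"name": "示例选手", "flower_word": "it's a test"}]

# Fragile: turning the Python repr into JSON by swapping quote characters.
# The apostrophe becomes a double quote and the result is no longer valid JSON.
try:
    json.loads(str(stars).replace("'", '"'))
    round_trip_ok = True
except json.JSONDecodeError:
    round_trip_ok = False

# Robust: serialise the list directly; ensure_ascii=False keeps Chinese readable.
text = json.dumps(stars, ensure_ascii=False)
print(round_trip_ok)
print(text)
```

This is also why the parsing code strips quote characters from the flower language field: with direct serialisation that workaround is no longer needed for correctness.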

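3. Scrape each contestant's Baidu Baike pictures and save them (my own code):
At first I had not analysed the page properly and scraped the src of the img tags directly,
so every picture I got was a thumbnail.
After rethinking the approach:
a) Open the address of the first picture in the contestant's album
b) For the other pictures in the album, take the whole a tag rather than the img tag
c) Read the href of each a tag;
only this href is the address of each picture page in the album
d) There is another pitfall here:
the Baidu album dynamically loads only 30 pictures at a time,
and one contestant has more than 30 pictures, so not all of them can be scraped;
for those entries, building the URL from get("href") raises an error,
which has to be caught with try/except followed by continue.
I still have no solution for scraping all of the pictures.
e) Finally, use find('img', {'id': 'imgPicture'}).get("src") to get the high-resolution image URL, then download and save it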

def crawl_pic_urls():
    with open('work/'+ today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())
    headers = { 
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
    }
    for star in json_array:
        pic_urls = []
        name = star['name']
        link = star['link']
        try:
            response = requests.get(link, headers = headers)
            soup = BeautifulSoup(response.text, 'lxml')
            tables = soup.find_all('div',{'class':'summary-pic'})
            for table in tables:
                summary_pic_url = "https://baike.baidu.com" + table.find("a").get("href")
                response = requests.get(summary_pic_url, headers = headers)
                soup = BeautifulSoup(response.text, 'lxml')
                img_div = soup.find('div', {'class':'pic-list'})
                # every <a> tag in the picture list
                img_page_list = img_div.find_all("a")
                for img_page_list_elm in img_page_list:
                    try:
                        # get("href") returns None when the tag has no href; the concatenation then raises TypeError
                        img_page_url = 'https://baike.baidu.com' + img_page_list_elm.get("href")
                    except TypeError:
                        continue
                    response = requests.get(img_page_url, headers=headers)
                    soup = BeautifulSoup(response.text, 'lxml')
                    img_url = soup.find('img', {'id': 'imgPicture'}).get("src")
                    pic_urls.append(img_url)
        except Exception as e:
            print(e)
        down_pic(name,pic_urls)

def down_pic(name,pic_urls):
    # Save every picture of one contestant under work/pics/<name>/
    path = 'work/' + 'pics/' + name + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path+string, 'wb') as f:
                f.write(pic.content)
                print('Downloaded picture %s: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('Failed to download picture %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue
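
The idea in steps b) to d), collecting the href of every a tag and skipping anchors that have none, can be tried offline. The sketch below deliberately swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra packages; the HTML snippet and the class name `HrefCollector` are invented for illustration and do not reproduce the real Baidu Baike markup.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags, skipping anchors that have no href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # anchors without href are ignored instead of raising
            self.hrefs.append("https://baike.baidu.com" + href)


# Invented markup standing in for the picture-list div.
html = ('<div class="pic-list">'
        '<a href="/pic/1"><img src="thumb1.jpg"></a>'
        '<a><img src="thumb2.jpg"></a>'
        '</div>')
collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
```

Checking for a missing href explicitly avoids relying on an exception to skip such tags.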

4. Print the output

def show_pic_path(path):
    pic_num = 0
    for (dirpath,dirnames,filenames) in os.walk(path):
        for filename in filenames:
            pic_num += 1
            print("Picture %d: %s" % (pic_num,os.path.join(dirpath,filename)))
    print("Scraped %d pictures of 《青春有你2》 contestants in total" % pic_num)

Running it again later, I found that the Baidu album had gained one picture:
the total is no longer 482 but 483.
Picture 479: /home/aistudio/work/pics/戴燕妮/13.jpg
Picture 480: /home/aistudio/work/pics/戴燕妮/14.jpg
Picture 481: /home/aistudio/work/pics/戴燕妮/6.jpg
Picture 482: /home/aistudio/work/pics/戴燕妮/3.jpg
Picture 483: /home/aistudio/work/pics/魏奇奇/1.jpg
Scraped 483 pictures of 《青春有你2》 contestants in total

Day3

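Assignment 3:
Data analysis of the 《青春有你2》 contestants
Process the JSON data with both json and pandas
Visualise the data with matplotlib
Easier than the previous assignment

1. The output folder is not created in advance for this assignment, so you have to create it yourself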

import os
def mkjpg(name):
    # Create the output directory if needed, then create an empty placeholder file
    path = '/home/aistudio/work/result/'
    os.makedirs(path, exist_ok=True)
    with open(path + name, 'wb'):
        pass

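2. Draw a bar chart of contestants by region
The data is processed with the json library
Chinese font issues:
1. plt.rcParams['font.sans-serif'] = ['SimHei']
2. Do not forget to copy the font file into the corresponding font directory
3. The notebook may not have loaded the font yet and may need a restart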

import matplotlib.pyplot as plt
import numpy as np 
import json
import matplotlib.font_manager as font_manager
import os
# Show matplotlib figures inline (IPython magic function)
%matplotlib inline
with open('data/data31557/20200422.json', 'r', encoding='UTF-8') as file:
    json_array = json.loads(file.read())
# Bar chart of contestants by region: x axis is the region, y axis is the number of contestants from it
# print(json_array)
zones = []
for star in json_array:
    zone = star['zone']
    zones.append(zone)
print(len(zones))
print(zones)
zone_list = []
count_list = []
for zone in zones:
    if zone not in zone_list:
        count = zones.count(zone)   # number of contestants from this region
        zone_list.append(zone)
        count_list.append(count)
print(zone_list)
print(count_list)
# Enable Chinese text in the figure
plt.rcParams['font.sans-serif'] = ['SimHei']  # default font
plt.figure(figsize=(20,15))  # figure size
plt.bar(range(len(count_list)), count_list,color='r',tick_label=zone_list,facecolor='#9999ff',edgecolor='white')
# Rotate the x tick labels (rotation is in degrees) and set the tick font size
plt.xticks(rotation=45,fontsize=20)
plt.yticks(fontsize=20)
plt.legend()    # legend in the corner; no labels are set, so it stays empty
plt.title('''《青春有你2》参赛选手''',fontsize = 24)
plt.savefig('/home/aistudio/work/result/bar_result.jpg')
plt.show()

[Figure: the bar chart saved as bar_result.jpg]
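
The per-region counting can also be done in a single pass with `collections.Counter` instead of calling `zones.count()` once per region. A minimal sketch with a made-up sample of the zone field:

```python
from collections import Counter

# Hypothetical sample of the "zone" field.
zones = ["中国湖北", "中国四川", "中国湖北", "中国山东", "中国四川", "中国湖北"]

counts = Counter(zones)             # counts every region in one pass
zone_list = list(counts.keys())     # regions in order of first appearance
count_list = list(counts.values())  # matching counts, element by element
print(zone_list)
print(count_list)
```

The two lists keep the one-to-one correspondence that the plotting code expects for `tick_label` and the bar heights.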

3. Process the data with pandas
pandas usage is not explained in detail here

import matplotlib.pyplot as plt
import numpy as np 
import json
import matplotlib.font_manager as font_manager
import pandas as pd
# Show matplotlib figures inline
%matplotlib inline
df = pd.read_json('data/data31557/20200422.json')
# Like GROUP BY in SQL: group the name column by zone; count() then gives the size of each group
grouped=df['name'].groupby(df['zone']) 
s = grouped.count()
# print(type(s))
zone_list = s.index
count_list = s.values
# Enable Chinese text in the figure
plt.rcParams['font.sans-serif'] = ['SimHei'] # default font
plt.figure(figsize=(20,15))
plt.bar(range(len(count_list)), count_list,color='r',tick_label=zone_list,facecolor='#9999ff',edgecolor='white')
# Rotate the x tick labels (rotation is in degrees) and set the tick font size
plt.xticks(rotation=45,fontsize=20)
plt.yticks(fontsize=20)
plt.legend()
plt.title('''《青春有你2》参赛选手''',fontsize = 24)
plt.savefig('/home/aistudio/work/result/bar_result02.jpg')
plt.show()

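4. Visualise the contestants' weight distribution as a pie chart
The data is processed with pandas;
the kg suffix is stripped from each weight string, which is then converted to float.
Two one-dimensional arrays represent the mapping between weight and count:
weight_list is the list of distinct weights
count_list is the number of contestants at each weight
(the elements of the two lists correspond one to one)
weight_labels holds the labels of the weight ranges
weight_count holds the count for each weight range
(again corresponding element by element)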

import matplotlib.pyplot as plt
import numpy as np 
import json
import pandas as pd
%matplotlib inline
df = pd.read_json('data/data31557/20200422.json')
grouped=df['name'].groupby(df['weight'])
s = grouped.count()
# print(s)
weight_list = s.index
count_list = s.values
weight_count = [0, 0, 0, 0]
# The last range is strictly above 55, since 55 itself falls into "50-55kg"
weight_labels = ["<=45kg", "45-50kg", "50-55kg", ">55kg"]
for weight, count in zip(weight_list, count_list):
    weight = float(weight[:-2])   # strip the trailing "kg"
    if weight <= 45:
        weight_count[0] += count
    elif weight <= 50:
        weight_count[1] += count
    elif weight <= 55:
        weight_count[2] += count
    else:
        weight_count[3] += count
plt.rcParams['font.sans-serif']=['SimHei']  # enable Chinese text
plt.figure(figsize=(8,8))
plt.title("选手体重信息")
explode = [0, 0.1, 0.1, 0.1]
plt.pie(weight_count,labels=weight_labels, explode=explode, autopct='%1.1f%%', pctdistance=0.8, shadow=True)
mkjpg("pie.jpg")
plt.savefig("/home/aistudio/work/result/pie.jpg")
plt.show()
plt.close()

[Figure: the pie chart saved as pie.jpg]
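
The binning logic is easier to check in isolation when it is pulled out into a small function. This is a sketch with made-up weights; `weight_bucket` is a hypothetical helper name, and the boundaries follow the ranges used above (45, 50 and 55 belong to the lower range).

```python
def weight_bucket(weight_text):
    """Map a string such as '48kg' to one of four bucket indices."""
    weight = float(weight_text[:-2])  # drop the trailing 'kg'
    if weight <= 45:
        return 0
    elif weight <= 50:
        return 1
    elif weight <= 55:
        return 2
    return 3


labels = ["<=45kg", "45-50kg", "50-55kg", ">55kg"]
sample = ["45kg", "46.5kg", "50kg", "52kg", "58kg"]  # made-up values
bucket_counts = [0, 0, 0, 0]
for w in sample:
    bucket_counts[weight_bucket(w)] += 1
print(dict(zip(labels, bucket_counts)))
```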

Day4

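Assignment 4:
Recognising 《青春有你2》 contestants

This assignment looks hard,
because the face recognition uses the paddlehub framework.
If you only want to get the homework done,
you can import the scraped pictures and run everything without thinking.
It is still better to read through the actual implementation rather than just going through the motions.
Simple as it is, small problems keep cropping up,
but they all boil down to path issues.
For example, the path problem I ran into:
during Finetune
the list files have to be read,
so I set dirpath to ~/dataset/,
which made the test and validation sets be looked up under dataset/dataset/test.
The test folder therefore has to be copied to dataset/dataset/,
and the test folder under dataset must not be deleted:
if it is deleted, the main function can no longer find dataset/test/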
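
One plausible reading of this path problem, stated here as an assumption rather than as PaddleHub's documented behaviour: if the loader joins its base directory with each path in the list file, entries that already start with "dataset/" end up with the directory name doubled. The sketch below only demonstrates the string join with `os.path.join`; the file name is taken from the prediction output later in this post.

```python
import os

base_path = "dataset"  # the dataset root passed to the dataset class

# Case 1: list entries are relative to the dataset root.
print(os.path.join(base_path, "test/yushuxin.jpg"))

# Case 2: list entries already start with "dataset/", so the root is doubled.
doubled = os.path.join(base_path, "dataset/test/yushuxin.jpg")
print(doubled)
```

If that reading is right, rewriting the list entries so they are relative to the dataset root would avoid keeping two copies of the test folder.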

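1. Prepare the data and labels
You have to scrape the pictures yourself and write them to train_list.txt;
the validation set, test set and labels are already written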

#When running in a CPU environment, be sure to execute this command
%set_env CPU_NUM=1 
import paddlehub as hub
import os
path = 'dataset/train'
star_list = ['yushuxin', 'xujiaqi', 'zhaoxiaotang', 'anqi', 'wangchengxuan']
for i in range(5):
    for a, b, c in os.walk(path + '/' + star_list[i]):
        for name in c:
            with open('dataset/train_list.txt', 'a') as f:
                f.write('train/' + star_list[i] + '/' + name + ' ' + str(i) + '\n')
                print('train/' + star_list[i] + '/' + name + ' ' + str(i))
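
The list-file generation can be rehearsed on a temporary directory. In this sketch the label list is shortened, the pictures are empty placeholder files, and `build_list_lines` is a name introduced for illustration. It writes the file once in "w" mode, so re-running the cell does not append duplicate lines the way "a" mode would.

```python
import os
import tempfile

labels = ["yushuxin", "xujiaqi"]  # shortened, hypothetical label list


def build_list_lines(train_dir, labels):
    """Return 'train/<label>/<file> <index>' lines for every file under each label folder."""
    lines = []
    for index, label in enumerate(labels):
        for dirpath, dirnames, filenames in os.walk(os.path.join(train_dir, label)):
            for name in sorted(filenames):
                lines.append("train/{}/{} {}".format(label, name, index))
    return lines


with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, "train")
    for label in labels:
        os.makedirs(os.path.join(train_dir, label))
        # Empty placeholder standing in for a scraped picture
        open(os.path.join(train_dir, label, "1.jpg"), "wb").close()
    lines = build_list_lines(train_dir, labels)
    # Write once, in "w" mode, so re-running does not append duplicates.
    with open(os.path.join(root, "train_list.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
print(lines)
```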

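2. Load the pretrained model
When running this statement, the model may not be found and may fail to download automatically;
in that case, switch to a terminal and install it manually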

module = hub.Module(name="resnet_v2_50_imagenet")

3. Prepare the dataset
Next, the image dataset has to be loaded

from paddlehub.dataset.base_cv_dataset import BaseCVDataset
class DemoDataset(BaseCVDataset):	
   def __init__(self):	
       self.dataset_dir = "dataset"
       super(DemoDataset, self).__init__(
           base_path=self.dataset_dir,
           train_list_file="train_list.txt",
           validate_list_file="validate_list.txt",
           test_list_file="test_list.txt",
           label_list_file="label_list.txt",
           )
dataset = DemoDataset()

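4. Create the data reader
Next, create a reader for image classification. The reader preprocesses the data in the dataset,
then organises it in a specific format and feeds it to the model for training.
When creating an image classification reader, the input image size must be specified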

data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)

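5. Configure the run
Before fine-tuning, some runtime options can be set. In the code below they mean:
use_cuda: False means training on CPU. If your machine has a GPU and the GPU version of PaddlePaddle is installed, setting it to True is recommended;
num_epoch: number of training epochs (5 here);
batch_size: the size of each batch fed to the model during training (10 here). The model processes a batch in parallel, so a larger batch_size trains more efficiently, but it also uses more memory; an overly large batch_size can exhaust memory and make training impossible, so choosing a suitable batch_size matters;
log_interval: how often the training log is printed; it is not set in the code below, and the log further down shows a training line every 10 steps;
eval_interval: how often the model is evaluated on the validation set (every 10 steps here);
checkpoint_dir: the trained parameters and data are saved in the cv_finetune_turtorial_demo directory;
strategy: DefaultFinetuneStrategy is used for fine-tuning;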

config = hub.RunConfig(
    use_cuda=False,                              #whether to train on GPU; False means CPU
    num_epoch=5,                                #number of fine-tuning epochs
    checkpoint_dir="cv_finetune_turtorial_demo",#checkpoint directory; generated automatically if not specified
    batch_size=10,                              #training batch size; adjust it to your hardware when using a GPU
    eval_interval=10,                           #evaluation interval in steps; the default is every 100 steps
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())  #fine-tuning optimisation strategy

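6. Build the Finetune Task
With a suitable pretrained model and the dataset to transfer to, we can build a Task.
This dataset defines a five-class task (one class per contestant),
while the downloaded classification module is a 1000-class model trained on ImageNet,
so the model needs a small adaptation to turn it into a five-class model:
get the module's context, including the input and output variables and the Paddle Program;
find the feature-extraction layer feature_map among the output variables;
attach a fully connected layer after feature_map to create the Task;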

input_dict, output_dict, program = module.context(trainable=True)
img = input_dict["image"]
feature_map = output_dict["feature_map"]
feed_list = [img.name]

task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=feed_list,
    feature=feature_map,
    num_classes=dataset.num_labels,
    config=config)

7. Start fine-tuning

run_states = task.finetune_and_eval()

	[2020-04-27 15:51:18,248] [    INFO] - Strategy with slanted triangle learning rate, L2 regularization, 
	[2020-04-27 15:51:18,282] [    INFO] - Try loading checkpoint from cv_finetune_turtorial_demo/ckpt.meta
	[2020-04-27 15:51:18,283] [    INFO] - PaddleHub model checkpoint not found, start from scratch...
	[2020-04-27 15:51:18,353] [    INFO] - PaddleHub finetune start
	[2020-04-27 15:51:57,303] [   TRAIN] - step 10 / 24: loss=0.19678 acc=0.98000 [step/sec: 0.26]
	[2020-04-27 15:51:57,304] [    INFO] - Evaluation on dev dataset start
	share_vars_from is set, scope is ignored.
	[2020-04-27 15:51:58,758] [    EVAL] - [dev dataset evaluation result] loss=0.74762 acc=0.80000 [step/sec: 1.58]
	[2020-04-27 15:51:58,759] [    EVAL] - best model saved to cv_finetune_turtorial_demo/best_model [best acc=0.80000]
	[2020-04-27 15:52:38,392] [   TRAIN] - step 20 / 24: loss=0.01303 acc=1.00000 [step/sec: 0.26]
	[2020-04-27 15:52:38,394] [    INFO] - Evaluation on dev dataset start
	[2020-04-27 15:52:39,285] [    EVAL] - [dev dataset evaluation result] loss=0.50611 acc=0.80000 [step/sec: 1.81]
	[2020-04-27 15:52:58,969] [    INFO] - Evaluation on dev dataset start
	[2020-04-27 15:52:59,872] [    EVAL] - [dev dataset evaluation result] loss=0.21779 acc=1.00000 [step/sec: 1.75]
	[2020-04-27 15:52:59,873] [    EVAL] - best model saved to cv_finetune_turtorial_demo/best_model [best acc=1.00000]
	[2020-04-27 15:53:00,624] [    INFO] - Load the best model from cv_finetune_turtorial_demo/best_model
	[2020-04-27 15:53:00,913] [    INFO] - Evaluation on test dataset start
	[2020-04-27 15:53:01,814] [    EVAL] - [test dataset evaluation result] loss=0.21779 acc=1.00000 [step/sec: 1.75]
	[2020-04-27 15:53:01,815] [    INFO] - Saving model checkpoint to cv_finetune_turtorial_demo/step_25
	[2020-04-27 15:53:02,668] [    INFO] - PaddleHub finetune finished.

8. Prediction

import numpy as np
import matplotlib.pyplot as plt 
import matplotlib.image as mpimg

with open("dataset/test_list.txt","r") as f:
    filepath = f.readlines()

# Take the image path (the part before the space) from the first five lines
data = [line.split(" ")[0] for line in filepath[:5]]

label_map = dataset.label_dict()
index = 0
run_states = task.predict(data=data)
results = [run_state.run_results for run_state in run_states]

for batch_result in results:
    print(batch_result)
    batch_result = np.argmax(batch_result, axis=2)[0]
    print(batch_result)
    for result in batch_result:
        index += 1
        result = label_map[result]
        print("input %i is %s, and the predict result is %s" %
              (index, data[index - 1], result))

	input 1 is dataset/test/yushuxin.jpg, and the predict result is 虞书欣
	input 2 is dataset/test/xujiaqi.jpg, and the predict result is 许佳琪
	input 3 is dataset/test/zhaoxiaotang.jpg, and the predict result is 赵小棠
	input 4 is dataset/test/anqi.jpg, and the predict result is 安崎
	input 5 is dataset/test/wangchengxuan.jpg, and the predict result is 王承渲

Day5

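Assignment 5:
Final comprehensive assignment
Step 1: scrape the comments on 《青春有你2》 from iQIYI
(reference link: https://www.iqiyi.com/v_19ryfkiv8w.html#curid=15068699100_9f9bab7e0d1e30c494622af777f4ba39)
Scrape the comments under any one full episode
At least 1000 comments

Step 2: word frequency statistics and visualisation
Preprocessing: clean special characters out of the comments (e.g. @#￥%, emoji)
Store the cleaned result as a txt file
Chinese word segmentation: add new words (e.g. 青你, 奥利给, 冲鸭)
Remove stop words (e.g. 哦, 因此, 不然, 也好, 但是)
Count the top 10 most frequent words
Visualise the high-frequency words

Step 3: draw a word cloud
Generate the word cloud from the word frequencies
Optional: add a background image and shape the word cloud to its outline

Step 4: review the comment content with PaddleHub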

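Scraping the comments:
For someone new to scraping who has not fully mastered even static pages,
dynamic pages look like hieroglyphics. After reading up on how dynamic pages are scraped
and analysing the site, though, the iQIYI comments turned out to be one of the easier cases.
1. Open the developer tools, go to the Network tab and clear it
2. After more comments are loaded dynamically, look at the newly generated request
3. Note the url of that request and the parameters it submits
4. Studying the parameters shows a pattern:
the id of the last comment on each page is the last_id of the next page
5. Build the scraping loop on this pattern; num is the number of pages to scrape
6. After scraping, use regular expressions (the re module) to remove non-Chinese characters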
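
The last_id pattern from steps 4 and 5 is a form of cursor-based pagination. The sketch below shows the loop shape without touching the network: `fetch_page` is a stand-in that serves fake comments from a list, and its ids and page size are invented, so it says nothing about how the real API orders or numbers comments.

```python
def fetch_page(last_id, page_size=3):
    """Stand-in for the HTTP request: return up to page_size fake comments after last_id."""
    fake_db = [{"id": i, "content": "comment {}".format(i)} for i in range(1, 8)]
    newer = [c for c in fake_db if c["id"] > last_id]
    return newer[:page_size]


def crawl(num_pages):
    """Follow the cursor: the id of the last comment on a page seeds the next request."""
    last_id = 0
    collected = []
    for _ in range(num_pages):
        page = fetch_page(last_id)
        if not page:  # nothing left, stop early
            break
        collected.extend(c["content"] for c in page)
        last_id = page[-1]["id"]
    return collected


comments = crawl(num_pages=5)
print(len(comments))
```

Stopping as soon as a page comes back empty keeps the loop from repeating the same request when fewer pages exist than requested.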
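Word frequency statistics and the word cloud:
1. Create a new-word list and a stop-word list, each saved in a txt file
2. Use the jieba library with the new-word list and the stop-word list to segment the text,
returning a list
3. Build two lists that correspond element by element to count word frequencies,
sort them, and return the ten most frequent words
4. Plot the word frequency chart from the two returned lists
5. Use the WordCloud library with suitable parameters to draw the word cloud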
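
The cleaning and counting steps can be tried on their own with the standard library. In this sketch the word list is hand-written to stand in for the output of a segmenter, `clean` and `top_words` are hypothetical helper names, and `collections.Counter` replaces the two parallel lists; it is an alternative technique, not the code used in this post.

```python
import re
from collections import Counter

NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")


def clean(text):
    """Drop every character outside the CJK range \\u4e00-\\u9fa5."""
    return NON_CHINESE.sub("", text)


def top_words(words, stopwords, n=10):
    """Return the n most frequent words that are not stop words."""
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


print(clean("加油!!! @user 冲鸭~ 666"))

# Hypothetical, already segmented words (a segmenter would produce a list like this).
words = ["加油", "喜欢", "加油", "但是", "冲鸭", "加油", "喜欢", "但是"]
print(top_words(words, stopwords={"但是"}, n=2))
```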
Content review:
Use the porn_detection_lstm module of Baidu's paddlehub to screen the comments

from __future__ import print_function
import requests
import json
import re #regular expressions
import time #time handling
import jieba #Chinese word segmentation
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from PIL import Image
from wordcloud import WordCloud  #word cloud drawing
import paddlehub as hub

#Request the iQIYI comment API and return the response
def getMovieinfo(params):
    '''
    Request the iQIYI comment API and return the response
    params: query parameters of the comment request
    :return: response text
    '''
    url = "https://sns-comment.iqiyi.com/v3/comment/get_comments.action"
    headers = { 
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    
    response = requests.get(url, headers=headers, params=params)
    
    return response.text 

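#Parse the JSON data and extract the comments
def saveMovieInfoToFile(lastID):
    '''
    Parse the JSON data and extract the comments
    lastID: ID of the last comment on the previous page
    :return: the new lastID and the number of comments saved
    '''
    count = 0
    # request parameters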
    params = {
    'agent_type': '118',
    'agent_version': '9.11.5',
    'authcookie': 'null',
    'business_type': '17',
    'content_id': '14722978600',
    'hot_size': '0',
    'last_id': lastID,
    'page':'',
    'page_size': '20',
    'types': 'time'
    }
    text = getMovieinfo(params)
    result = json.loads(text)
    with open("work/result.json", "w", encoding='UTF-8') as f:
        json.dump(result, f, ensure_ascii=False)  
    result = result['data']["comments"]
    for i in result:
        # some entries have no usable "content" field, so skip them
        try:
            comment = i["content"]
            comment = clear_special_char(comment)
            with open("work/comments.txt", 'a') as f:
                f.write(comment)
            # written as UTF-8 because text_detection() reads this file as UTF-8
            with open("work/comments_detection.txt", 'a', encoding='UTF-8') as f:
                f.write(comment + '\n')
            count += 1
            lastID = i["id"]
        except (KeyError, TypeError):
            continue
    return lastID, count
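#Remove special characters from the text
def clear_special_char(content):
    '''
    Strip special characters with a regular expression
    content: original text
    return: cleaned text (only Chinese characters are kept)
    '''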
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    s = re.sub(pattern, "", content)
    return s
def fenci(text):
    '''
    Segment text with jieba
    text: sentence or text to segment
    return: list of words
    '''
    jieba.load_userdict("work/add_word.txt")
    seg_list = jieba.lcut(text, cut_all=False)
    return seg_list
def movestopwords(file_path, seg_list):
    '''
    Remove stop words from seg_list in place
    file_path: path of the stop-word file
    seg_list: list of words to filter
    return: none
    '''
    with open(file_path, encoding='UTF-8') as f:
        stopwords = set(line.strip() for line in f)
    # slice assignment keeps the same list object, so the caller sees the filtered words
    seg_list[:] = [word for word in seg_list if word not in stopwords]
def drawcounts(text):
    '''
    Count word frequencies and plot the top 10
    text: list of words
    return: all words and their counts, sorted by ascending count
    '''
    %matplotlib inline
    Word_List = []
    Word_Count = []
    for word in text:
        if word not in Word_List:
            Word_List.append(word)
            Word_Count.append(1)
        else:
            # index() gives the position of the word seen earlier
            Word_Count[Word_List.index(word)] += 1
    Word_Count_all, Word_List_all = (list(t) for t in zip(*sorted(zip(Word_Count, Word_List))))
    Word_Count = Word_Count_all[-10:]
    Word_List = Word_List_all[-10:]
    Word_Count = Word_Count[::-1]
    Word_List = Word_List[::-1]
    print(Word_List)
    print(Word_Count)
    plt.figure(figsize=(10,7))
    plt.rcParams['font.sans-serif']=['SimHei']
    print(plt.rcParams['font.sans-serif'])
    plt.bar(range(len(Word_List)), Word_Count,color='r',tick_label=Word_List,facecolor='#9999ff',edgecolor='white')
    plt.xticks(rotation=45,fontsize=20)
    plt.yticks(fontsize=20)
    plt.title('《青春有你2》评论词频统计表',fontsize = 24)
    plt.show()
    return Word_List_all, Word_Count_all
def drawcloud(text):
    '''
    Draw a word cloud from word frequencies
    text: dict mapping each word to its frequency
    return: none
    '''
    wc = WordCloud(
    font_path = 'simhei.ttf',
    background_color = 'white',
    random_state = 42,
    width = 1000,
    height = 860,
    )
    wc.fit_words(text)
    wc.to_file('work/WordCloud.png')
def text_detection():
    '''
    Review the comments with PaddleHub
    return: none (flagged comments are printed)
    '''
    porn_detection_lstm = hub.Module(name="porn_detection_lstm")
    f = open("work/comments_detection.txt", 'r', encoding='UTF-8')
    text_text = []
    for line in f:
        if len(line.strip()) == 1:
            continue
        else:
            text_text.append(line)
    f.close()
    input_dict = {"text":text_text}
    results = porn_detection_lstm.detection(data = input_dict, use_gpu=False, batch_size=1)
    for index, item in enumerate(results):
        if item['porn_detection_key'] == 'porn':
            print(item['text'], ':', item['porn_probs'])
#Comments are paginated, so the iQIYI comment API has to be requested repeatedly; some comments contain emoji and special characters
#num is the number of pages; with page_size 20, scraping 1000 comments needs at least num=50
if __name__ == "__main__":
    lastID = 0
    num = 60
    count_sum = 0
    for i in range(num):
        lastID, count = saveMovieInfoToFile(lastID)
        count_sum += count
    print("Fetched {} comments in total".format(count_sum))
    with open("work/comments.txt", 'r') as f:
        text = f.read()
    Cut_Word = fenci(text)
    movestopwords("work/stop_word.txt", Cut_Word)
    Word_List, Word_Count =  drawcounts(Cut_Word)
    Word_dic = dict(zip(Word_List, Word_Count))
    drawcloud(Word_dic)
    display(Image.open('work/WordCloud.png'))
    text_detection()

Fetched 1187 comments in total
['加油', '赵小棠', '虞书欣', '谢可寅', '喜欢', '许佳琪', '哈哈哈', '刘雨昕', '冲冲', '安崎']
[1155, 1090, 412, 156, 142, 119, 115, 102, 93, 83]
['SimHei']

[Figure: output images of the run above (word frequency bar chart and WordCloud.png)]

王欣宇太棒了棒棒棒
 : 0.9326
王欣宇在组内第一我觉得是真是最迷惑的了甜到腻的人设真的不行笑起来眯眯眼确实并不好看
 : 0.9878
虞书欣妈妈爱你色色
 : 0.8055
敲爱欣欣子欣欣子冲鸭色色色色色色色
 : 0.9229
等背景音乐等了好久色色色
 : 0.9955
许佳琪敲级棒的吖色色色偷笑偷笑偷笑
 : 0.998
从训练生到青春制作人代表蔡徐坤用认真对待每一个音乐作品用努力对待每一次舞台我的宝贝你实在太甜了吧青春有你虞书欣佳琪放心飞黑琪永相随一起去看最高最美的风景啊宝贝你和棒棒糖我觉得你更甜
 : 0.6781
可怜可怜可怜可怜可怜可怜可怜可怜可怜可怜可怜可怜可怜色色色色色色色色色色色色色
 : 0.9996
啊啊啊王欣宇小姐姐主唱第一太棒了我超喜欢她色色
 : 0.9993
没大
 : 0.6133
虞书欣真他妈是个猪
 : 0.6786
丽莎要不要那么美丽天哪色色
 : 0.9976
不管怎么样我不喜欢乃万一点点都不喜欢第一眼就不喜欢微笑不知道怎么回事就是不喜欢爱死上官喜爱了还有虞书欣和安琪色色爱死了爱死了
 : 0.8494
怎么没声音呀
 : 0.7912
我是安崎的粉但是我觉得其他的小姐姐也挺好的比如刘雨昕实力也很棒上官喜爱也棒棒的喻言等等都是实力派既然你们有喜欢的练习生就去为她们投票没你要在这里说那些招黑的话完毕
 : 0.9373
我找不到自己的评论了总以为没有发出去呢我曾是少年中我真的很喜欢符雅凝艾依依和程曼鑫没打错名字吧符雅凝的蛮不错的相信大家初舞台就可以感觉到艾依依真的是宝藏女孩吧话很少但一直在微笑很可爱很亲切而且她的也很棒声音很有特点我蛮喜欢她的程曼鑫也棒棒的这三个女孩
 : 0.5917