豆瓣电影250数据分析精简版

最新推荐文章于 2022-12-21 14:44:09 发布

高中不复，大学纷飞

最新推荐文章于 2022-12-21 14:44:09 发布

阅读量758

点赞数

本文链接：https://blog.csdn.net/qq_44760912/article/details/111404991

版权

以前的文章链接

今天针对以前的豆瓣电影250数据分析代码进行一下总结，精简

时间：2020/12/19
通过接近一个月的学习，使用for循环来解决数据清洗、分析问题并不明智。一方面代码冗余度高，不够简洁；另一方面for循环间接增加了理解难度。使用pandas自带的一些方法以及函数思维对于数据分析来说才是上选。

测试数据集：链接：https://pan.baidu.com/s/13irC3frSixvhU6GmZVX16w
提取码：love

以下代码可以实现这些可视化：

条形图：电影名—评价人数top10
柱形图：国家计数评分计数年份计数
折线图：年份——国家计数（中外对比）
箱图：评分——国家计数（中外对比）
词云图：电影名词云图
散点图：评分——评价人数

import pandas as pd
import numpy as np
import time

#读取数据
df = pd.read_csv('douban_HTML/data/douban250_2.csv')

#将info的有效数据分离
def get_info(df,i):
    datalist = []
    if i == 0:
        datalist = df['info'].str.split('/',expand=True)[i].str.replace('(中国大陆)','').str.strip()
    elif i == 1:
        #取第一个国家
        datalist = df['info'].str.split('/',expand=True)[i].str.strip().replace('[1964()]','').str.split(' ',expand=True)[0]
    else:
        #取第一个类型作为主类型
        df = df['info'].str.split('/',expand=True)[i].str.split(' ')
        for data in df:
            datalist.append(' '.join(data).strip().replace('\xa0',''))
    return  datalist

#将info的有效数据精确分离
def get_info1(df,i):
    datalist = []
    if i == 1:
        #所有参与国家和地区，地区后面还可以进一步处理
        df = df['info'].str.split('/',expand=True)[i].str.strip().replace('1964(中国大陆)','中国大陆').str.split(' ')#.str.replace('[1964()]','')   进行替换操作需要strip()去除多余的空格
        for data in df:
            for x in data:
                datalist.append(x.strip().replace('\xa0',''))
    else:
        #所有电影类型
        df = df['info'].str.split('/',expand=True)[i].str.split(' ')
        for data in df:
            for x in data:
                datalist.append(x.strip().replace('\xa0',''))
    return  datalist
#函数调用
df['year'] = get_info(df,i=0)
df['host_country'] = get_info(df,i=1)
df['host_category'] = get_info(df,i=2)

df1 = pd.DataFrame()
df1['all_country'] = get_info1(df,i=1)
df2 = pd.DataFrame()
df2['all_category'] = get_info1(df,i=2)
#join横向连接,数据行不一致，所以选择取并outer
df = df.join(df1,how='outer').join(df2,how='outer')
#
# df.to_excel('qx.xlsx',index=None)

#统计单一分组
def get_groupby(df,data,i):
    data = str(data)
    datalist = []
    data_index = df.groupby(data)[data].count().index.tolist()#[data].count() == value_counts()
    data_value = df.groupby(data)[data].count().values.tolist()
    if i == 0:
        return data_index,data_value
    #如果需要构建饼图和地图    Pie Map     构建zip
    elif i == 1:
        for data in zip(data_index,data_value):
            datalist.append(data)
    return datalist

#清洗people
df['ev_people'] = df['people'].str.replace('[人评价]','')
dfp = df.sort_values('ev_people',ascending=False).head(10)

#对国家打标分类
def get_sys(x):
    if '中国' in str(x):
        return '中国'
    else:
        return '外国'
df['country'] = df['host_country'].apply(get_sys)
#即str.contains只能对Series使用,下面这一句会报错
# df['country1'] = df['host_country'].apply(lambda x: '中国' if x.str.contains('中国') == True else '外国')
# df['country1'] = np.where(df['host_country'].str.contains('中国'),'中国','外国')

#对电影类型进行列表展示
def get_lx(x,i):
    if i in str(x):
        return 1
    else:
        return 0
# datalist = ['剧情','犯罪']可以自己写入到列表
for i in list(df['all_category'].unique()):#datalist
    df[i] = df['host_category'].apply(get_lx,i=f'{i}')

#中外对比——年份，评分
xg_year = ['中国','外国']
#年份
df_y = df.groupby(['year'])['country'].value_counts().unstack().fillna(0).astype('int')
#评分
c_s = df.loc[df.country==xg_year[0],'score']
w_s = df.loc[df.country==xg_year[1],'score']

#对单一变量统计
x_year,y_year = get_groupby(df,'year',0)#0代表False，1代表True
x_country,y_country = get_groupby(df,'host_country',0)
x_score,y_score = get_groupby(df,'score',0)
x_category,y_category = get_groupby(df,'all_category',0)
yc_year,yw_year = list(df_y[xg_year[0]]),list(df_y[xg_year[1]])
yc_score,yw_score = list(c_s),list(w_s)
y_title,x_people = list(dfp.title),list(dfp.ev_people)

#对所有列进行重新排列
col_name = ['title','info','year','host_country','host_category','country','score','people','ev_people','all_country','all_category']
col_name = col_name.extend(list(df['all_category'].unique()))
df = df.reindex(columns=col_name)

#保存到新的excel
df.to_excel('qx.xlsx',index=False)

#测试输出
df.head(10)
# print(df.head(20))

高中不复，大学纷飞

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
打赏
1
评论
豆瓣电影250数据分析精简版

以前的文章链接今天针对以前的豆瓣电影250数据分析代码进行一下总结，精简时间：2020/12/19通过接近一个月的学习，使用for循环来解决数据清洗、分析问题并不明智。一方面代码冗余度高，不够简洁；另一方面for循环间接增加了理解难度。使用pandas自带的一些方法以及函数思维对于数据分析来说才是上选。测试数据集：链接：https://pan.baidu.com/s/13irC3frSixvhU6GmZVX16w提取码：love以下代码可以实现这些可视化：条形图：电影名—评价人数top10
复制链接

扫一扫