4 ---- Merging, Grouping, and Aggregating Data (pandas)

qq_44647559

Published on 2021-05-10 09:34:11

Category column: 3 The three major Python libraries (completed)

Article link: https://blog.csdn.net/qq_44647559/article/details/116586454

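This article takes an in-depth look at merging data (merge, concat, and join) and at grouping and aggregation (groupby and agg) with Python's pandas library. Worked examples show how to combine and analyze datasets effectively.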

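【Problem 1】A case study in string discretization
Case:
Given this set of movie data, how should we process it to count how many movies fall into each genre?
For example, the number of comedies, adventure films, romance films, and so on.


Approach:
(1) Build a new array filled with 0, with one column per genre.
(2) For each movie, change the 0 to 1 in the column of every genre it belongs to.
(3) Finally, sum each genre column: the number of 1s is the number of movies in that genre.


Notes:
(1) The new array has the same number of rows as the original data.
(2) The new array has one column for each distinct genre.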
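
The three steps above can be tried on a tiny hand-made table before touching the real dataset. The sketch below is only an illustration: the titles and genres in `toy` are made up, and the names `toy`, `flags`, and `toy_counts` are not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three made-up movies with comma-separated genres.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genre": ["Action,Comedy", "Comedy", "Action,Drama,Comedy"],
})

# Split each genre string into a list, then collect the distinct genres.
genre_lists = toy["Genre"].str.split(",").tolist()
unique_genres = sorted({g for row in genre_lists for g in row})

# Step 1: an all-zero table, one row per movie and one column per genre.
flags = pd.DataFrame(
    np.zeros((len(toy), len(unique_genres)), dtype=int),
    columns=unique_genres,
)

# Step 2: set 1 wherever a movie belongs to a genre.
for row_idx, genres in enumerate(genre_lists):
    flags.loc[row_idx, genres] = 1

# Step 3: column sums give the number of movies per genre.
toy_counts = flags.sum(axis=0)
print(flags)
print(toy_counts)
```

Using `dtype=int` keeps the flags and counts as integers; the article's version uses the default float dtype, which is why its counts print as `81.0`, `16.0`, and so on.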

# case1 ---- Background

import pandas as pd

df = pd.read_csv('./code2/datasets_IMDB-Movie-Data.csv')






# （1）
print('\n【df.info()】')
print(df.info())

print('\n【df.head(1)】')
print(df.head(1))





# （2）
print('\n【df["Genre"]】')
print(df['Genre'])

print('\n【df["Genre"].str】')
print(df['Genre'].str)

print('\n【df["Genre"].str.split(",")】')
print(df['Genre'].str.split(','))





# （3）
# Build the list of genres
# A nested (double) loop is needed to flatten the list of lists
temp_list = df['Genre'].str.split(',').tolist()  # list of lists: [['Action', 'Adventure', 'Sci-Fi'], ['Adventure', 'Mystery', 'Sci-Fi'], ...]

genre_list = [i for j in temp_list for i in j]  # flat list: ['Action', 'Adventure', 'Sci-Fi', 'Adventure', 'Mystery', 'Sci-Fi', ...]

【df.info()】
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
None

【df.head(1)】
   Rank                    Title                    Genre  \
0     1  Guardians of the Galaxy  Action,Adventure,Sci-Fi   

                                         Description    Director  \
0  A group of intergalactic criminals are forced ...  James Gunn   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121   

   Rating   Votes  Revenue (Millions)  Metascore  
0     8.1  757074              333.13       76.0  

【df["Genre"]】
0       Action,Adventure,Sci-Fi
1      Adventure,Mystery,Sci-Fi
2               Horror,Thriller
3       Animation,Comedy,Family
4      Action,Adventure,Fantasy
                 ...           
995         Crime,Drama,Mystery
996                      Horror
997         Drama,Music,Romance
998            Adventure,Comedy
999       Comedy,Family,Fantasy
Name: Genre, Length: 1000, dtype: object

【df["Genre"].str】
<pandas.core.strings.StringMethods object at 0x0000020818FBC910>

【df["Genre"].str.split(",")】
0       [Action, Adventure, Sci-Fi]
1      [Adventure, Mystery, Sci-Fi]
2                [Horror, Thriller]
3       [Animation, Comedy, Family]
4      [Action, Adventure, Fantasy]
                   ...             
995         [Crime, Drama, Mystery]
996                        [Horror]
997         [Drama, Music, Romance]
998             [Adventure, Comedy]
999       [Comedy, Family, Fantasy]
Name: Genre, Length: 1000, dtype: object
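
As a cross-check on the flattening step, the same result can probably be reached without an explicit nested loop by using `Series.explode()`. This is a sketch rather than the article's method; the Series `genre` below simply reuses the first three `Genre` values shown in the output above so that it runs without the CSV file.

```python
import pandas as pd

# Stand-in for df['Genre']: the first three values from the output above.
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
])

# Flattening with a nested list comprehension, as in case1.
nested = genre.str.split(",").tolist()
flat = [g for row in nested for g in row]

# explode() puts each list element on its own row, repeating the original index.
exploded = genre.str.split(",").explode()

# Both approaches should yield the same sequence of genres.
same = flat == exploded.tolist()
print(same)

# Distinct genres, and how often each one occurs.
unique_genres = sorted(set(flat))
counts = exploded.value_counts()
print(unique_genres)
print(counts)
```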

# case2 ---- Counting movies per genre

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('./code2/datasets_IMDB-Movie-Data.csv')
print("\n【df['Genre'].head(3)】")
print(df['Genre'].head(3))      # check the first 3 rows


# Build the list of genres
temp_list = df['Genre'].str.split(',').tolist()           # list of lists, [[...], [...], ...], one per movie (1000 in total)
temp_genre_list = [i for j in temp_list for i in j]       # nested loop: flatten every genre into a single list
genre_list = list(set(temp_genre_list))                   # set() removes duplicates, leaving a list of distinct genres





#  step1 --- build an all-zero array
'''
(1) np.zeros((rows, columns)) ------- build an array filled with 0
    np.ones((rows, columns))  ------- build an array filled with 1
(2) df.shape     ------ returns (rows, columns) as a tuple
    df.shape[0]  ------ number of rows
    df.shape[1]  ------ number of columns
(3) index:    row labels
    columns:  column labels
'''
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)  # 1000 rows of zeros, one column per genre
# print(zeros_df)





# step2 ---- set 1 in the genre columns of each movie
'''
Indexing a DataFrame
(1) df.loc selects by label (ordinary slices exclude the end point, but loc slices include it)
    t = pd.DataFrame(np.arange(12).reshape(3,4), index=list('abc'), columns=list('WXYZ'))

    t1 = t.loc['a','Z']               # element at row 'a', column 'Z' ----- a single cell ------- a NumPy scalar
    t2 = t.loc['a']                   # row 'a' ---- a whole row ------ a Series
    t3 = t.loc['a',:]                 # same as above

(2) df.iloc selects by integer position
    t1 = t.iloc[1:]               # from position 1 to the last row -------- rows
    t2 = t.iloc[:,2]              # column at position 2 -------- a column
    t3 = t.iloc[:,[2,1]]          # columns at positions 2 and 1 ----- several (non-adjacent) columns
    
    t.iloc[1:,:2] = 30            # set the selected cells to 30, leaving the rest unchanged ----- assignment
    print(t)
'''
for i in range(df.shape[0]):     # df.shape[0] is the number of rows (1000); df.shape is the tuple (rows, columns)
    # e.g. zeros_df.loc[0, ['Action','Adventure','Sci-Fi']] = 1     (1) select one row and several columns     (2) assign to all of them at once
    zeros_df.loc[i, temp_list[i]] = 1

print('\n【zeros_df.head(3)】')
print(zeros_df.head(3))           # check the first 3 rows




# step3 ---- count the movies in each genre
'''
df.sum(axis = 1)   # sum across the columns: one total per row
df.sum(axis = 0)   # sum down the rows: one total per column
'''
genre_count = zeros_df.sum(axis=0)
print('\n【genre_count】')
print(genre_count)



# step4 ----- sort
genre_count = genre_count.sort_values()           #  sort_values() sorts in ascending order by default; pass ascending=False for descending order.
print('\n【type(genre_count)】')
print(type(genre_count))
print('\n【genre_count: genre_count after sorting】')
print(genre_count)





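# step5 --------- plot (bar chart, suited to discrete data)
plt.figure(figsize=(20,8), dpi=80)


_x = genre_count.index         # note: index is an attribute, so no parentheses
_y = genre_count.values
plt.bar(range(len(_x)),_y)     # pass in the x positions and the bar heights

plt.xticks(range(len(_x)), _x)  # x axis: map each numeric position to its genre name
plt.show()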






# step6 ------ Supplement
print('\n【genre_count】')
print(genre_count)

print('\n【genre_count.index】')
print(genre_count.index)

print('\n【genre_count.values】')
print(genre_count.values)

【df['Genre'].head(3)】
0     Action,Adventure,Sci-Fi
1    Adventure,Mystery,Sci-Fi
2             Horror,Thriller
Name: Genre, dtype: object

【zeros_df.head(3)】
   Biography  Music  Comedy  Crime  Fantasy  Sport  Family  Horror  Romance  \
0        0.0    0.0     0.0    0.0      0.0    0.0     0.0     0.0      0.0   
1        0.0    0.0     0.0    0.0      0.0    0.0     0.0     0.0      0.0   
2        0.0    0.0     0.0    0.0      0.0    0.0     0.0     1.0      0.0   

   Action  Mystery  Sci-Fi  Western  Animation  Adventure  Drama  War  \
0     1.0      0.0     1.0      0.0        0.0        1.0    0.0  0.0   
1     0.0      1.0     1.0      0.0        0.0        1.0    0.0  0.0   
2     0.0      0.0     0.0      0.0        0.0        0.0    0.0  0.0   

   Musical  Thriller  History  
0      0.0       0.0      0.0  
1      0.0       0.0      0.0  
2      0.0       1.0      0.0  

【genre_count】
Biography     81.0
Music         16.0
Comedy       279.0
Crime        150.0
Fantasy      101.0
Sport         18.0
Family        51.0
Horror       119.0
Romance      141.0
Action       303.0
Mystery      106.0
Sci-Fi       120.0
Western        7.0
Animation     49.0
Adventure    259.0
Drama        513.0
War           13.0
Musical        5.0
Thriller     195.0
History       29.0
dtype: float64

【type(genre_count)】
<class 'pandas.core.series.Series'>

【genre_count: genre_count after sorting】
Musical        5.0
Western        7.0
War           13.0
Music         16.0
Sport         18.0
History       29.0
Animation     49.0
Family        51.0
Biography     81.0
Fantasy      101.0
Mystery      106.0
Horror       119.0
Sci-Fi       120.0
Romance      141.0
Crime        150.0
Thriller     195.0
Adventure    259.0
Comedy       279.0
Action       303.0
Drama        513.0
dtype: float64

[Figure: bar chart of the number of movies per genre, in ascending order]

【genre_count】
Musical        5.0
Western        7.0
War           13.0
Music         16.0
Sport         18.0
History       29.0
Animation     49.0
Family        51.0
Biography     81.0
Fantasy      101.0
Mystery      106.0
Horror       119.0
Sci-Fi       120.0
Romance      141.0
Crime        150.0
Thriller     195.0
Adventure    259.0
Comedy       279.0
Action       303.0
Drama        513.0
dtype: float64

【genre_count.index】
Index(['Musical', 'Western', 'War', 'Music', 'Sport', 'History', 'Animation',
       'Family', 'Biography', 'Fantasy', 'Mystery', 'Horror', 'Sci-Fi',
       'Romance', 'Crime', 'Thriller', 'Adventure', 'Comedy', 'Action',
       'Drama'],
      dtype='object')

【genre_count.values】
[  5.   7.  13.  16.  18.  29.  49.  51.  81. 101. 106. 119. 120. 141.
 150. 195. 259. 279. 303. 513.]
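
For comparison, pandas also offers `Series.str.get_dummies`, which builds the same kind of 0/1 table in one call and so could replace steps 1 and 2 of case2. The sketch below is an alternative, not the article's method; it reuses the first three `Genre` values printed earlier as a small stand-in for the full column.

```python
import pandas as pd

# Stand-in for df['Genre']: the first three values from the dataset.
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
])

# One 0/1 column per genre, built in a single vectorized call.
dummies = genre.str.get_dummies(sep=",")

# Column sums, sorted in ascending order (the default for sort_values).
genre_count = dummies.sum(axis=0).sort_values()
print(dummies)
print(genre_count)
```

On the full 1000-row column this avoids the Python-level `for` loop over rows, and the counts come out as integers rather than floats.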

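【Problem 2】Merging data

(1) Merging with join
join: by default, combines rows that share the same row index (it merges on the row index).

(2) Merging with merge
merge: combines data on the specified column(s) using the chosen join method (it merges on column values).

(3) merge when the key column has the same name in both frames:
1. The default is how='inner' (intersection of the keys).
2. how='outer' gives the union of the keys; missing values are filled with NaN.

(4) merge when the key columns have different names
1. left_on='O' names the key column in the left frame and right_on='X' names the key column in the right frame; the two are passed together.
2. how='left' keeps every row of the left frame and how='right' keeps every row of the right frame; where one side has a key and the other does not, the missing values are filled with NaN.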
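
A small sketch may help separate the two ideas in (4): `left_on`/`right_on` only name the key columns, while `how` decides which rows survive. The column names 'O' and 'X' come from the notes above; the frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names ('O' and 'X').
left = pd.DataFrame({"O": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"X": ["a", "b", "d"], "right_val": [10, 20, 30]})

# Keys present in both frames: a, b.
inner = pd.merge(left, right, left_on="O", right_on="X")
# Every row of the left frame; 'c' has no match, so its right-hand columns are NaN.
left_join = pd.merge(left, right, left_on="O", right_on="X", how="left")
# Every row of the right frame; 'd' has no match, so its left-hand columns are NaN.
right_join = pd.merge(left, right, left_on="O", right_on="X", how="right")
# Union of the keys from both frames.
outer = pd.merge(left, right, left_on="O", right_on="X", how="outer")

for name, result in [("inner", inner), ("left", left_join),
                     ("right", right_join), ("outer", outer)]:
    print(f"\n[{name}]")
    print(result)
```

Because the key columns have different names, both 'O' and 'X' appear in the result.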

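# case1 Merging with join
'''
Note: join merges on the row index
(1) join: by default, combines rows that share the same row index.
(2) Build an all-zero array ---- np.zeros((rows, columns)); np.ones works the same way for 1
(3) a.shape    --- returns (rows, columns) as a tuple
    a.shape[0] --- number of rows
    a.shape[1] --- number of columns
'''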
import pandas as pd
import numpy as np

df1 = pd.DataFrame( np.ones((2,4)), index=['A','B'], columns=list('abcd') )
df2 = pd.DataFrame( np.ones((3,3)), index=['A','B','C'], columns=list('xyz'))
print('\n【df1】')
print(df1)
print('\n【df2】')
print(df2)







t1 = df1.join(df2)              # df1 is the base: keep the rows of df1 (left join on the index)
t2 = df2.join(df1)              # df2 is the base: keep the rows of df2
print('\n【df1.join(df2)】')
print(t1)
print('\n【df2.join(df1)】')
print(t2)

【df1】
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

【df2】
     x    y    z
A  1.0  1.0  1.0
B  1.0  1.0  1.0
C  1.0  1.0  1.0

【df1.join(df2)】
     a    b    c    d    x    y    z
A  1.0  1.0  1.0  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0  1.0  1.0  1.0

【df2.join(df1)】
     x    y    z    a    b    c    d
A  1.0  1.0  1.0  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0  1.0  1.0  1.0
C  1.0  1.0  1.0  NaN  NaN  NaN  NaN
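
The output above reflects the default `how='left'` of `DataFrame.join`. As a rough sketch of the other options, the snippet below rebuilds the same two frames and tries `how='outer'` and `how='inner'`, then writes the default join as an index-based `merge`. The variable names `left_default`, `outer`, `inner`, and `via_merge` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
df2 = pd.DataFrame(np.ones((3, 3)), index=["A", "B", "C"], columns=list("xyz"))

# join defaults to how='left': the caller's index decides which rows are kept.
left_default = df1.join(df2)
# Union of both indexes (A, B, C); row C gets NaN in the df1 columns.
outer = df1.join(df2, how="outer")
# Intersection of both indexes (A, B), even though df2 is the caller.
inner = df2.join(df1, how="inner")

# The default join written as a merge on the row index.
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how="left")

print(outer)
print(inner)
print(left_default.equals(via_merge))
```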

# case2  Background ------- merge
import pandas as pd

df1 = pd.DataFrame( {
   'key':['b','b','a','c','a','a','b'], 'data1':range(7)} )
df2 = pd.DataFrame( {
   'key':['a','b','d'], 'data2':range(3)} )
print('\n【df1】')
print(df1)
print('\n【df2】')
print(df2)


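# （1）merge when the key column has the same name
t1 = pd.merge(df1, df2)       # inner join ---- with no on=, the columns common to both DataFrames are used as the join key, here 'key'. Default: how='inner'
t2 = pd.merge(df1, df2, on='key')                 # inner join ---- default how='inner': only keys present in both frames
t3 = pd.merge(df1, df2, on='key', how='outer')    # outer join ---- how='outer': keys present in either frame, NaN where one side is missing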
print('\n【t1】')
print(t1)
print('\n【t2】')
print(t2)
print('\n【t3】')
print(t3)



t4 = pd.merge(df1, df2, on='key', how='left')    # left join: keep every row of df1
t5 = pd.merge(df1, df2, on='key', how='right')   # right join: keep every row of df2
print('\n【t4】')
print(t4)
print('\n【t5】')
print(t5)
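
To compare the four join types side by side, one option is to count the rows each produces and to pass `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below rebuilds `df1` and `df2` from case2; `row_counts` and `outer` are illustrative names, not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(df1, df2, on="key", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)
print(outer)
```

Key 'c' appears only in `df1` and key 'd' only in `df2`, so those two rows are the ones an inner join drops and an outer join fills with NaN.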
