python数据分析(五)——pandas数据的合并和分组聚合

kuluomi111

已于 2023-12-22 17:00:07 修改

阅读量264

点赞数

分类专栏： python数据分析——pandas 文章标签： python 数据分析 pandas

于 2023-12-22 16:58:35 首次发布

本文链接：https://blog.csdn.net/Vapus/article/details/126266462

版权

python数据分析——pandas 专栏收录该内容

7 篇文章 1 订阅

订阅专栏

系列文章目录：
python数据分析(一)——series和读取外部数据
 python数据分析(二)——DataFrame
python数据分析(三)——pandas缺失值处理
 python数据分析(四)——pandas常用统计方法

前言

对于这一组电影数据，希望统计电影分类(genre)的情况，应该如何处理数据？

思路：重新构造一个全为0的数组，列名为分类，如果某一条数据中分类出现过，就让0变为1

一、字符串离散化实例

代码如下（示例）：

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from matplotlib.font_manager import FontProperties

font = FontProperties(fname="/System/Library/Fonts/Supplemental/Songti.ttc", size=14)

df = pd.read_csv("./IMDB-Movie-Data.csv")

temp_list = list(df["Genre"].str.split(","))  # [[], [], []]
# print(temp_list)
genre_list = list(set([i for j in temp_list for i in j]))

# 构造全为0的数组
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)

# print(zeros_df)

# 给每个电影出现分类的位置赋值1
for i in range(df.shape[0]):
    # zeros_df.loc[0, ["Sci-fi", "Musical"] = 1
    # loc[index, columns]第一个参数为索引，第二个参数为列
    # iloc[index]只接受索引，不接受列参数
    zeros_df.loc[i, temp_list[i]] = 1

# 统计每个分类的电影的数量和
genre_count = zeros_df.sum(axis=0)
# print(zeros_df)

# 排序
genre_count = genre_count.sort_values(ascending=False)

# 画图
plt.figure(figsize=(20, 8), dpi=80)
_x = genre_count.index
_y = genre_count.values
plt.bar(range(len(_x)), _y, width=0.5)

plt.xticks(range(len(_x)), _x, rotation=45, fontproperties=font)
plt.yticks(range(0, 520, 100), fontproperties=font)

plt.xlabel("种类", fontproperties=font)
plt.ylabel("数量", fontproperties=font)

plt.show()

二、数据合并join：行索引

join：默认情况下是把行行索引相同的数据合并在一起，按照对应的行索引合并

数据合并行索引

In [18]: t1 = pd.DataFrame(np.ones((2, 4)), index=list("AB"),columns=list("abcd"))
In [19]: t1
Out[19]: 
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [20]: t2 = pd.DataFrame(np.zeros((3, 3)), index=list("ABC"),columns=list("xyz"))
In [21]: t2
Out[21]: 
     x    y    z
A  0.0  0.0  0.0
B  0.0  0.0  0.0
C  0.0  0.0  0.0

In [22]: t1.join(t2) # 行数以t1为准
Out[22]: 
     a    b    c    d    x    y    z
A  1.0  1.0  1.0  1.0  0.0  0.0  0.0
B  1.0  1.0  1.0  1.0  0.0  0.0  0.0

In [23]: t2.join(t1) # 行数以t2为准
Out[23]: 
     x    y    z    a    b    c    d
A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
C  0.0  0.0  0.0  NaN  NaN  NaN  NaN

三、数据合并merge：列索引

merge：按照指定的列把数据按照一定的方式合并到一起

数据合并列索引

In [18]: t1 = pd.DataFrame(np.ones((2, 4)), index=list("AB"),columns=list("abcd"))
In [19]: t1
Out[19]: 
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [27]: t3 = pd.DataFrame(np.arange(9).reshape(3,3),columns=list("fax"))

In [28]: t3
Out[28]: 
   f  a  x
0  0  1  2
1  3  4  5
2  6  7  8

# 默认的合并方式是交集，inner
In [29]: t1.merge(t3,on="a")
Out[29]: 
     a    b    c    d  f  x
0  1.0  1.0  1.0  1.0  0  2
1  1.0  1.0  1.0  1.0  0  2

# outer合并方式是并集
In [30]: t1.merge(t3,on="a",how="outer")
Out[30]: 
     a    b    c    d  f  x
0  1.0  1.0  1.0  1.0  0  2
1  1.0  1.0  1.0  1.0  0  2
2  4.0  NaN  NaN  NaN  3  5
3  7.0  NaN  NaN  NaN  6  8

In [31]: t1.loc["A","a"]=100
In [32]: t1
Out[32]: 
       a    b    c    d
A  100.0  1.0  1.0  1.0
B    1.0  1.0  1.0  1.0

In [33]: t1.merge(t3,on="a",how="outer")
Out[33]: 
       a    b    c    d    f    x
0  100.0  1.0  1.0  1.0  NaN  NaN
1    1.0  1.0  1.0  1.0  0.0  2.0
2    4.0  NaN  NaN  NaN  3.0  5.0
3    7.0  NaN  NaN  NaN  6.0  8.0

In [34]: t3.merge(t1,on="a",how="outer")
Out[34]: 
     f      a    x    b    c    d
0  0.0    1.0  2.0  1.0  1.0  1.0
1  3.0    4.0  5.0  NaN  NaN  NaN
2  6.0    7.0  8.0  NaN  NaN  NaN
3  NaN  100.0  NaN  1.0  1.0  1.0

# left左边为准，NaN补全
In [35]: t3.merge(t1,on="a",how="left")
Out[35]: 
   f  a  x    b    c    d
0  0  1  2  1.0  1.0  1.0
1  3  4  5  NaN  NaN  NaN
2  6  7  8  NaN  NaN  NaN

# right右边为准，NaN补全
In [36]: t3.merge(t1,on="a",how="right")
Out[36]: 
     f      a    x    b    c    d
0  NaN  100.0  NaN  1.0  1.0  1.0
1  0.0    1.0  2.0  1.0  1.0  1.0

如果列索引不相同，无法使用on参数，则可以使用left_on和right_on参数

数据合并列索引

In [37]: t1 = pd.DataFrame(np.ones((3, 4)),index=list("ABC"), columns=list("MNOP"))
In [38]: t1["O"]=list("abc")

In [39]: t1
Out[39]: 
     M    N  O    P
A  1.0  1.0  a  1.0
B  1.0  1.0  b  1.0
C  1.0  1.0  c  1.0

In [53]: t2 = pd.DataFrame(np.zeros((2, 5)),index=list("AB"), columns=list("YWXYZ"))
In [54]: t2["X"]=list("cd")

In [55]: t2
Out[55]: 
     Y    W  X    Y    Z
A  0.0  0.0  c  0.0  0.0
B  0.0  0.0  d  0.0  0.0

# 默认是交集
In [56]: t1.merge(t2,left_on="O",right_on="X")
Out[56]: 
     M    N  O    P    Y    W  X    Y    Z
0  1.0  1.0  c  1.0  0.0  0.0  c  0.0  0.0

In [57]: t1.merge(t2,left_on="O",right_on="X",how="outer")
Out[57]: 
     M    N    O    P    Y    W    X    Y    Z
0  1.0  1.0    a  1.0  NaN  NaN  NaN  NaN  NaN
1  1.0  1.0    b  1.0  NaN  NaN  NaN  NaN  NaN
2  1.0  1.0    c  1.0  0.0  0.0    c  0.0  0.0
3  NaN  NaN  NaN  NaN  0.0  0.0    d  0.0  0.0

In [58]: t1.merge(t2,left_on="O",right_on="X",how="left")
Out[58]: 
     M    N  O    P    Y    W    X    Y    Z
0  1.0  1.0  a  1.0  NaN  NaN  NaN  NaN  NaN
1  1.0  1.0  b  1.0  NaN  NaN  NaN  NaN  NaN
2  1.0  1.0  c  1.0  0.0  0.0    c  0.0  0.0

In [59]: t1.merge(t2,left_on="O",right_on="X",how="right")
Out[59]: 
     M    N    O    P    Y    W  X    Y    Z
0  1.0  1.0    c  1.0  0.0  0.0  c  0.0  0.0
1  NaN  NaN  NaN  NaN  0.0  0.0  d  0.0  0.0

四、数据分组聚合

#返回类型
#DataFrameGroupBy
#可以进行遍历
#调用聚合方法

df.groupby(by='columns_name')
grouped是一个DataFrameGroupBy对象，是可迭代的
grouped中的每一个元素是一个元组
元组里面是(索引(分组的值)，分组之后的DataFrame)

# 对国家和省份进行分组统计
grouped = df.groupby(by=[df['Country'],df['State/Province'])
# series类型

# 获取分组之后的某一部分数据
df.groupby(by=['Country','State/Province'])['Country'].count()
# series类型

# 对某几列数据进行分组
df['Country'].grouby(by=[df['Country'],df['State/Province']]).count()
# series类型

t1 = df[['Country']].grouby(by=[df['Country'],df['State/Province']]).count()
t2 = df.groupby(by=['Country','State/Province'])[['Country']].count()
以上两条命令结果一样，但与之前结果的区别是返回的是DataFrame类型

在这里插入图片描述

美国和中国的星巴克数量

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from matplotlib.font_manager import FontProperties

font = FontProperties(fname="/System/Library/Fonts/Supplemental/Songti.ttc", size=14)

df = pd.read_csv("starbucks_store_worldwide.csv")
# print(df.info())
# print(df.head(1))
grouped = df.groupby(by='Country')
# print(grouped)
# 返回类型
# DataFrameGroupBy
# 可以进行遍历
# df[df['Country']='US']
# 调用聚合方法
country_count = grouped['Brand'].count()
print(country_count['US'])
print(country_count['CN'])

13608
2734

中国各个省份的星巴克数量

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from matplotlib.font_manager import FontProperties

font = FontProperties(fname="/System/Library/Fonts/Supplemental/Songti.ttc", size=14)

df = pd.read_csv("starbucks_store_worldwide.csv")

china_data = df[df['Country'] == 'CN']

grouped = china_data.groupby(by='State/Province').count()['Brand']

五、索引和复合索引

获取index：df.index
指定index：df.index = [‘x’,‘y’]
重新设置index：df.reindex(list(‘abcdef’))
指定某一列作为index：df.set_index(‘country’,drop=False)
返回index的唯一值：df.set_index(‘country’).index.unique()

指定某一列作为索引，但是不希望删掉，设置drop=False保留
在这里插入图片描述

取内部索引时，交换索引：d.swaplevel()

对于series来说，直接方括号[]取，对于dataframe数据来说，加上loc[]

series复合索引

在这里插入图片描述

dataframe复合索引

在这里插入图片描述

六、索引练习

店铺总数排名前10的国家

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.font_manager import FontProperties

font = FontProperties(fname="/System/Library/Fonts/Supplemental/Songti.ttc", size=14)


df = pd.read_csv("starbucks_store_worldwide.csv")

# 使用matplotlib呈现出店铺总数排名前10的国家
# 准备数据
data1 = df.groupby(by='Country').count()['Brand'].sort_values(ascending=False)[:10]

_x = data1.index
_y = data1.values

plt.figure(figsize=(20, 8), dpi=80)

plt.bar(range(len(_x)), _y)

plt.xticks(range(len(_x)), _x)
plt.show()

在这里插入图片描述

中国排名前10的城市

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.font_manager import FontProperties

font = FontProperties(fname="/System/Library/Fonts/Supplemental/Songti.ttc", size=14)


df = pd.read_csv("starbucks_store_worldwide.csv")
df = df[df['Country'] == 'CN']

# 使用matplotlib呈现中国排名前10城市的星巴克数据
data1 = df.groupby(by='City').count()['Brand'].sort_values(ascending=False)[:15]
data1 = data1.rename(index={'Hong Kong': '香港'})

_x = data1.index
_y = data1.values

plt.figure(figsize=(20, 8), dpi=80)

plt.barh(range(len(_x)), sorted(_y, reverse=False), height=0.4, color='orange')

plt.yticks(range(len(_x)), _x, fontproperties=font)
plt.xticks(fontproperties=font)
plt.show()

在这里插入图片描述

不同年份书的数量

df = pd.read_csv("books.csv")

# print(df.info())

data1 = df[pd.notnull(df['original_publication_year'])]

grouped = data1.groupby(by='original_publication_year').count()['title'][:10]

不同年份书的平均评分情况

data1 = df[pd.notnull(df['original_publication_year'])]

grouped = data1['average_rating'].groupby(by=data1['original_publication_year']).mean()

_x = grouped.index
_y = grouped.values

plt.figure(figsize=(20, 8), dpi=80)

plt.plot(range(len(_x)), _y)

plt.xticks(list(range(len(_x)))[::10], _x[::10].astype(int), fontproperties=font, rotation=45)
plt.yticks(fontproperties=font)

plt.show()

在这里插入图片描述

七、总结

把字符串离散化

获取字符串的去重后列表
构造全为0的数组，columns为字符串的列表
给全为0的数组赋值，遍历

join

按照index进行分组
t1.join(t2)

merge

按照列数据进行分组
t1.merge(t2, on=‘a’, how=‘inner’) 内连接
t1.merge(t2, on=‘a’, how=‘outter’) 外连接
t1.merge(t2, on=‘a’, how=‘left’) 以t1为准
t1.merge(t2, on=‘a’, how=‘right’) 以t2为准
t1.merge(t2, left_on=‘a’, right_on = ‘b’, how=‘right’) 没有相同的列名时

分组聚合

df.groupby(by=‘xx’)得到groupby对象，能够调用聚合方法，能够遍历
df.groupby(by=‘xx’).count()
df.groupby(by=‘xx’).mean()
df.groupby(by=‘xx’, ‘xx’).sum() 带复合索引的dataframe

索引

df.index
df.index=[‘a’, ‘c’] 设置索引的值
df.set_index(‘a’) 设置某一列为索引
df.set_index([‘a’, ‘c’])
从复合索引中取值 series: s1[‘a’][‘b’] s1[‘a’, ‘b’] dataframe df.loc[‘a’].loc[‘b’]
从里层索引开始选择 df.swaplevel() 交换里外层索引的位置

kuluomi111

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python数据分析(五)——pandas数据的合并和分组聚合

对于这一组电影数据，希望统计电影分类(genre)的情况，应该如何处理数据？思路：重新构造一个全为0的数组，列名为分类，如果某一条数据中分类出现过，就让0变为1把字符串离散化获取字符串的去重后列表构造全为0的数组，columns为字符串的列表给全为0的数组赋值，遍历join按照index进行分组merge按照列数据进行分组t1.merge(t2, on=‘a’, how=‘inner’) 内连接t1.merge(t2, on=‘a’, how=‘outter’) 外连接。
复制链接

扫一扫