http://blog.csdn.net/pipisorry/article/details/39506169
A large share of the programming work in data analysis and modeling goes into data preparation: loading, cleaning, transforming, and reshaping. Sometimes the data stored in files or databases is simply not in the form the data-processing application needs.
pandas, together with the Python standard library, provides a set of high-level, flexible, and efficient core functions and algorithms that make it easy to wrangle data into the right shape.
Data normalization
Min-max scaling (applied per column)
df_norm = (df - df.min()) / (df.max() - df.min())
or equivalently: df_norm2 = df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
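A minimal runnable sketch of the column-wise min-max formula above (the frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame to illustrate column-wise min-max scaling.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})

# Each column is mapped linearly onto [0, 1]:
# 'a' becomes [0, 0.5, 1], 'b' becomes [0, 1/3, 1].
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```

Because min and max are computed per column, each column ends up spanning exactly [0, 1], regardless of its original scale.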
Z-score standardization (applied per column)
1. pandas DataFrame implementation
df_norm = (df - df.mean()) / df.std()
# df_norm.fillna(df.min(), inplace=True)
# df_norm.fillna(args.nan_sub, inplace=True)
Note: df_norm preserves NaN values.
2. NumPy implementation
df_norm2 = df.apply(lambda x: (x - np.mean(x)) / np.std(x))
Note that df.std() and np.std() above give different values: np.std() computes the population standard deviation (ddof=0), while df.std() computes the sample standard deviation (ddof=1), which is the unbiased estimate.
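The ddof difference can be verified on a small series (values are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas defaults to the sample standard deviation (ddof=1) ...
sample_std = s.std()                  # sqrt(5/3) ~ 1.291
# ... while NumPy defaults to the population standard deviation (ddof=0).
population_std = np.std(s.values)     # sqrt(5/4) ~ 1.118

# They agree once ddof is made explicit.
assert abs(s.std(ddof=0) - population_std) < 1e-12
assert abs(np.std(s.values, ddof=1) - sample_std) < 1e-12
```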
3. sklearn implementation
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(dense_features_df)
dense_index = dense_features_df.index
scaled = scaler.transform(dense_features_df)  # returns a NumPy array
dense_features_df = pd.DataFrame(scaled, index=dense_index, columns=dense_cols)
Comparing methods 1 and 3: method 3 is recommended (in one comparison the measured runtime of 3 was over 100 times that of 1). Method 1 leaves NaN values untouched, but after normalizing you usually need to fill them, e.g. with the column minimum; method 3 raises an error if NaN values are present, so NaN data must be handled beforehand.
Example:
df = pd.DataFrame([[1,2,3], [4,5,np.nan], [6,np.nan, 7]])
df
Out[27]:
0 1 2
0 1 2.0 3.0
1 4 5.0 NaN
2 6 NaN 7.0
df.std()
Out[28]:
0 2.516611
1 2.121320
2 2.828427
dtype: float64
(df - df.mean())/df.std()
Out[29]:
0 1 2
0 -1.059626 -0.707107 -0.707107
1 0.132453 0.707107 NaN
2 0.927173 NaN 0.707107
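A minimal sketch of the fill-then-scale workflow for method 3, reusing the frame above (the fill-with-column-min strategy is the one suggested earlier; older scikit-learn versions reject NaN input outright):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan], [6, np.nan, 7]], dtype=float)

# Fill NaN first (here with each column's min) before handing to the scaler.
filled = df.fillna(df.min())

scaler = StandardScaler().fit(filled)
scaled = pd.DataFrame(scaler.transform(filled),
                      index=filled.index, columns=filled.columns)

# StandardScaler uses the population std (ddof=0), so each column now has
# mean 0 and population std 1.
print(scaled.mean().round(6))
```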
Subtracting the row mean from each row of a DataFrame
Use DataFrame's sub method and specify that the subtraction should happen row-wise (axis=0), as opposed to the default column-wise behaviour:
df.sub(df.mean(axis=1), axis=0)
which is equivalent to
norm_ui_array = df.values
row_means = np.nanmean(norm_ui_array, axis=1)  # per-row mean, ignoring NaN
for i in range(len(norm_ui_array)):
    norm_ui_array[i] -= row_means[i]  # NaN entries remain NaN after subtraction
[Pandas: Subtract row mean from each element in row]
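A tiny runnable example of the sub(..., axis=0) idiom (the frame is hypothetical):

```python
import pandas as pd

# Hypothetical ratings frame; subtract each row's mean from that row.
df = pd.DataFrame({'a': [1.0, 4.0], 'b': [3.0, 8.0]})

centered = df.sub(df.mean(axis=1), axis=0)
# row 0: mean 2 -> [-1, 1]; row 1: mean 6 -> [-2, 2]
print(centered)
```

After centering, every row's mean is exactly zero.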
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Note: grouping produces a GroupBy object; taking an attribute of it or aggregating it yields a DataFrame, and selecting a single column then yields a Series.
The index after grouping
After grouping, the index is still the original one; it is only after aggregating the groups that the index changes.
Changing the index produced by grouping
After a group-then-aggregate, the group names become the new index: the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option.
To change this default behaviour, use df.groupby('A', as_index=False).sum() or df.groupby(['A', 'B']).sum().reset_index().
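The two ways of keeping group keys as ordinary columns can be sketched on a toy frame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': ['p', 'q', 'p'],
                   'v': [1, 2, 3]})

# Default: the group key becomes the index.
g1 = df.groupby('A')['v'].sum()                 # index: 'x', 'y'

# as_index=False keeps the key as an ordinary column.
g2 = df.groupby('A', as_index=False)['v'].sum()

# Multiple keys produce a MultiIndex; reset_index flattens it back to columns.
g3 = df.groupby(['A', 'B'])['v'].sum().reset_index()
```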
Grouping example
In [86]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
....: 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
....: 'C' : np.random.randn(8),
....: 'D' : np.random.randn(8)})
In [87]: df
A B C D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
Grouping and then applying the sum function to the resulting groups:
In [88]: df.groupby('A').sum()
Out[88]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
Grouping by multiple columns forms a hierarchical index, to which we then apply the function.
In [89]: df.groupby(['A','B']).sum()
Out[89]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
Nested indexes
Each sid maps to some timestamps, and each timestamp maps to some uids; the goal is to count the uids.
shop_ids = user_pay_df['sid'].unique()
s_group = user_pay_df.groupby(by=['sid', 'time']).count() print(s_group) for sid in shop_ids: print(s_group.ix[sid])
When indexing, you can first select on the outermost level (with .loc; older pandas used .ix), which lets you apply operations to each sid separately.
uid
sid time
90 2015-07-06 3
2015-07-07 10
2015-07-08 7
2015-07-09 4
uid
time
2015-07-02 4
2015-07-03 1
2015-07-04 3
Nested indexing can also be achieved in the following way
group_process = lambda g: g.groupby(by=['time']).count()
s_group = user_pay_df.groupby(by=['sid']).apply(group_process)
print(s_group.loc[90])
An extra sid column appears, however: count() counts every non-grouped column, and inside apply the sid column is still part of each sub-frame.
uid sid
time
2015-07-06 3 3
2015-07-07 10 10
2015-07-08 7 7
2015-07-09 4 4
2015-07-10 23 23
[Need for speed: Slow nested groupbys and applys in Pandas]
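The shop/time counting above can be reproduced on a small hypothetical frame (the column names follow user_pay_df; the values are made up):

```python
import pandas as pd

user_pay_df = pd.DataFrame({
    'uid':  [1, 2, 3, 4, 5],
    'sid':  [90, 90, 90, 91, 91],
    'time': ['2015-07-06', '2015-07-06', '2015-07-07',
             '2015-07-06', '2015-07-06'],
})

# Count uids per (sid, time); selecting on the outer level with .loc
# yields the per-shop sub-series.
s_group = user_pay_df.groupby(['sid', 'time'])['uid'].count()
print(s_group.loc[90])

# size() counts rows directly and avoids the duplicated column that
# count() produces in the groupby-apply variant.
s_sizes = user_pay_df.groupby(['sid', 'time']).size()
```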
GroupBy object attributes
Note: these attributes are essentially the aggregation methods available after grouping.
The groups attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group.
gb.<TAB>
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height
gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist
gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices
gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
[GroupBy]
In [19]: df.groupby('A').groups
Out[19]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
In [21]: gb = df.groupby(['A', 'B'])
In [22]: gb.groups
Out[22]:
{('bar', 'one'): [1],
('bar', 'three'): [3], ...
('foo', 'two'): [2, 4]}
In [23]: len(gb)
Out[23]: 6
Grouping without a reduce/aggregation step
If you only want to split into groups, without aggregating or reducing, you can rebuild the index with set_index (iterating over the groups, described below, also avoids aggregation).
ut_index = lines.set_index(['user', 'location_id'])
user check-in_time location_id
0 0 201
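A small runnable sketch of the set_index approach (the frame is hypothetical, shaped like the lines table above):

```python
import pandas as pd

lines = pd.DataFrame({
    'user': [0, 0, 1],
    'check-in_time': ['t1', 't2', 't3'],
    'location_id': [201, 202, 303],
})

# set_index groups the rows under a MultiIndex without aggregating anything.
ut_index = lines.set_index(['user', 'location_id'])

# All check-ins for user 0, still one row per record.
print(ut_index.loc[0])
```

Unlike a groupby-aggregate, every original row survives; the MultiIndex merely makes per-user selection cheap.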