分组(Grouping),适用于有重复数据可统计、可分组的DataFrame
“group by” 指的是涵盖下列一项或多项步骤的处理流程:
分割:按条件把数据分割成多组;
应用:为每组单独应用函数;
组合:将处理结果组合成一个数据结构。
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.433992 | -1.333293 |
1 | bar | one | 0.194936 | -0.391367 |
2 | foo | two | 1.199633 | 0.119350 |
3 | bar | three | -0.786895 | 0.729469 |
4 | foo | two | 0.450172 | 0.418308 |
5 | bar | two | -0.256733 | -0.802991 |
6 | foo | one | 0.202326 | 0.758224 |
7 | foo | three | -1.381662 | 0.517538 |
df1=df.groupby('A').sum()#计算
df1#统计A组中所有bar、foo的和
C | D | |
---|---|---|
A | ||
bar | -0.848692 | -0.464888 |
foo | 0.904460 | 0.480128 |
也可进行多列分组:
df.groupby(['A', 'B']).sum()#同时统计A、B列可分组数据和
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.194936 | -0.391367 |
three | -0.786895 | 0.729469 | |
two | -0.256733 | -0.802991 | |
foo | one | 0.636317 | -0.575069 |
three | -1.381662 | 0.517538 | |
two | 1.649805 | 0.537658 |
多层索引与重塑:
有时候为了更好的查看DataFrame数据,我们可以使用多层索引,并会使用到重塑方法。
首先需要创建一个MultiIndex(层次索引)对象
MultiIndex可以从阵列(使用的列表来创建 MultiIndex.from_arrays()),
元组(使用的阵列 MultiIndex.from_tuples()),
一个交叉组iterables(使用的 MultiIndex.from_product()),
或者一个DataFrame(使用 MultiIndex.from_frame())创建。
下列实例使用一个元组列表和Index构造函数创建
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=['A', 'B','C','D'])
df
A | B | C | D | ||
---|---|---|---|---|---|
first | second | ||||
bar | one | 0.271837 | -0.997848 | -2.000626 | -0.034044 |
two | -0.779253 | -1.503962 | -1.244141 | 0.337168 | |
baz | one | 1.060027 | -1.522924 | -0.400959 | 0.813439 |
two | 1.896187 | -0.525100 | 0.341734 | 0.120456 | |
foo | one | -0.549181 | 0.474325 | 0.605349 | 0.161852 |
two | -0.732263 | 0.756273 | -0.986397 | 0.810937 | |
qux | one | -0.755619 | -1.457063 | 0.626315 | 0.405198 |
two | 0.377285 | 0.494047 | -1.161364 | -0.772654 |
stack()方法把 DataFrame 多列压缩至一层:
df1=df.stack()
df1
first second
bar one A 0.271837
B -0.997848
C -2.000626
D -0.034044
two A -0.779253
B -1.503962
C -1.244141
D 0.337168
baz one A 1.060027
B -1.522924
C -0.400959
D 0.813439
two A 1.896187
B -0.525100
C 0.341734
D 0.120456
foo one A -0.549181
B 0.474325
C 0.605349
D 0.161852
two A -0.732263
B 0.756273
C -0.986397
D 0.810937
qux one A -0.755619
B -1.457063
C 0.626315
D 0.405198
two A 0.377285
B 0.494047
C -1.161364
D -0.772654
dtype: float64
stack() 的逆操作是 unstack()
df1.unstack()
A | B | C | D | ||
---|---|---|---|---|---|
first | second | ||||
bar | one | 0.271837 | -0.997848 | -2.000626 | -0.034044 |
two | -0.779253 | -1.503962 | -1.244141 | 0.337168 | |
baz | one | 1.060027 | -1.522924 | -0.400959 | 0.813439 |
two | 1.896187 | -0.525100 | 0.341734 | 0.120456 | |
foo | one | -0.549181 | 0.474325 | 0.605349 | 0.161852 |
two | -0.732263 | 0.756273 | -0.986397 | 0.810937 | |
qux | one | -0.755619 | -1.457063 | 0.626315 | 0.405198 |
two | 0.377285 | 0.494047 | -1.161364 | -0.772654 |
数据输入 / 输出
文本文件 读 写
CSV read_csv to_csv
Fixed-WidthTextFile read_fwf
JSON read_json to_json
HTML read_html to_html
Local clipboard read_clipboard to_clipboard
MS Excel read_excel to_excel
二进制文件
OpenDocument read_excel
HDF5 Format read_hdf to_hdf
Feather Format read_feather to_feather
Parquet Format read_parquet to_parquet
ORC Format read_orc
Msgpack read_msgpack to_msgpack
Stata read_stata to_stata
SAS read_sas
SPSS read_spss
Python Pickle Format read_pickle to_pickle
SQL
SQL read_sql to_sql
Google Big Queryread_gbq to_gbq
df
A | B | C | D | ||
---|---|---|---|---|---|
first | second | ||||
bar | one | 0.271837 | -0.997848 | -2.000626 | -0.034044 |
two | -0.779253 | -1.503962 | -1.244141 | 0.337168 | |
baz | one | 1.060027 | -1.522924 | -0.400959 | 0.813439 |
two | 1.896187 | -0.525100 | 0.341734 | 0.120456 | |
foo | one | -0.549181 | 0.474325 | 0.605349 | 0.161852 |
two | -0.732263 | 0.756273 | -0.986397 | 0.810937 | |
qux | one | -0.755619 | -1.457063 | 0.626315 | 0.405198 |
two | 0.377285 | 0.494047 | -1.161364 | -0.772654 |
#将df写入csv文件。
df.to_csv('file.csv')
#打开创建好的csv文件夹
pd.read_csv('file.csv')
first | second | A | B | C | D | |
---|---|---|---|---|---|---|
0 | bar | one | 0.271837 | -0.997848 | -2.000626 | -0.034044 |
1 | bar | two | -0.779253 | -1.503962 | -1.244141 | 0.337168 |
2 | baz | one | 1.060027 | -1.522924 | -0.400959 | 0.813439 |
3 | baz | two | 1.896187 | -0.525100 | 0.341734 | 0.120456 |
4 | foo | one | -0.549181 | 0.474325 | 0.605349 | 0.161852 |
5 | foo | two | -0.732263 | 0.756273 | -0.986397 | 0.810937 |
6 | qux | one | -0.755619 | -1.457063 | 0.626315 | 0.405198 |
7 | qux | two | 0.377285 | 0.494047 | -1.161364 | -0.772654 |