1. Hierarchical Indexing
Hierarchical indexing is an important pandas feature: it lets you have multiple (two or more) index levels on a single axis.
```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
```

```
Out[6]:
a  1   -2.842857
   2    0.376199
   3   -0.512978
b  1    0.225243
   2   -1.242407
   3   -0.663188
c  1   -0.149269
   2   -1.079174
d  2   -0.952380
   3   -1.113689
dtype: float64
```
This is the formatted display of a Series with a MultiIndex. The "gaps" in the index column mean "reuse the label directly above".
```python
data.index
```

```
Out[7]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3],
                   [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
```
Hierarchical indexing plays an important role in reshaping data and in group-based operations.
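A quick sketch of what a MultiIndex enables (the values here are arbitrary, chosen so the result is reproducible): partial indexing on the outer level, selection on the inner level, and reshaping with `unstack`.

```python
import numpy as np
import pandas as pd

# A MultiIndex Series shaped like the one above, with deterministic values.
data = pd.Series(np.arange(10, dtype=float),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

print(data['b'])        # partial indexing: all rows under outer label 'b'
print(data.loc[:, 2])   # all rows whose inner label is 2
print(data.unstack())   # pivot: outer level -> rows, inner level -> columns
```

`unstack` produces a 4x3 DataFrame here, with NaN where a (outer, inner) pair is missing, e.g. ('c', 3).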
2. pd.read_csv vs. DataFrame Parameters
```python
help(pd.DataFrame)
```

```
Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
```
```python
help(pd.read_csv)
```

```
read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
         names=None, index_col=None, usecols=None, squeeze=False,
         prefix=None, mangle_dupe_cols=True, dtype=None, engine=None,
         converters=None, true_values=None, false_values=None,
         skipinitialspace=False, skiprows=None, nrows=None, na_values=None,
         keep_default_na=True, na_filter=True, verbose=False,
         skip_blank_lines=True, parse_dates=False,
         infer_datetime_format=False, keep_date_col=False, date_parser=None,
         dayfirst=False, iterator=False, chunksize=None, compression='infer',
         thousands=None, decimal=b'.', lineterminator=None, quotechar='"',
         quoting=0, escapechar=None, comment=None, encoding=None,
         dialect=None, tupleize_cols=None, error_bad_lines=True,
         warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True,
         delim_whitespace=False, as_recarray=None, compact_ints=None,
         use_unsigned=None, low_memory=True, buffer_lines=None,
         memory_map=False, float_precision=None)
```
Comparison: the two differ in how column names are set. When reading a file without a header row, pass `header=None` to have integer column names assigned automatically, or supply them manually with `names=['a', ..., 'z']`.
```python
df = pd.read_csv('data/ex2.csv',
                 names=['a', 'b', 'c', 'd', 'message'],
                 index_col='message')
df
```

Out[8]:

| message | a | b | c | d |
|---|---|---|---|---|
| hello | 1 | 2 | 3 | 4 |
| world | 5 | 6 | 7 | 8 |
| foo | 9 | 10 | 11 | 12 |
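The two naming options can be sketched without a file on disk by feeding `read_csv` an in-memory buffer (the CSV text below is a stand-in for the contents of `data/ex2.csv`):

```python
import io
import pandas as pd

# Self-contained stand-in for data/ex2.csv: a file with no header row.
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas assigns default integer column names 0..4.
df1 = pd.read_csv(io.StringIO(csv_text), header=None)

# names=...: supply the column names yourself, and optionally promote
# one of them to the row index with index_col.
df2 = pd.read_csv(io.StringIO(csv_text),
                  names=['a', 'b', 'c', 'd', 'message'],
                  index_col='message')

print(df1.columns.tolist())   # [0, 1, 2, 3, 4]
print(df2.loc['world', 'b'])  # 6
```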
3. Data Binning
```python
bins = [18, 25, 35, 60, 100]
ages = [20, 22, 25, 30, 31, 19, 33, 28, 44, 51, 66, 34]
cats = pd.cut(ages, bins)
cats
```

```
Out[2]:
[(18, 25], (18, 25], (18, 25], (25, 35], (25, 35], ..., (25, 35], (35, 60], (35, 60], (60, 100], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
```

```python
cats.labels  # deprecated; in current pandas use cats.codes
```

```
Out[3]:
array([0, 0, 0, 1, 1, 0, 1, 1, 2, 2, 3, 1], dtype=int8)
```
```python
group_names = ['Youth', 'youngAdult', 'MiddleAged', 'Senior']
cats = pd.cut(ages, bins, labels=group_names)
cats
```

```
Out[18]:
[Youth, Youth, Youth, youngAdult, youngAdult, ..., youngAdult, MiddleAged, MiddleAged, Senior, youngAdult]
Length: 12
Categories (4, object): [Youth < youngAdult < MiddleAged < Senior]
```
4. Data Aggregation and Group-wise Computation in pandas: groupby
```python
df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df
```

Out[17]:

| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 1.165397 | -0.923362 | a | one |
| 1 | 0.849728 | -0.937067 | a | two |
| 2 | -0.751545 | 0.576415 | b | one |
| 3 | -0.270348 | -0.458194 | b | two |
| 4 | -0.225201 | -0.076616 | a | one |
```python
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
```

```
Out[19]:
key1
a    0.596641
b   -0.510946
Name: data1, dtype: float64
```

```python
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
```

```
Out[32]:
key1  key2
a     one     0.470098
      two     0.849728
b     one    -0.751545
      two    -0.270348
Name: data1, dtype: float64
```

```python
df.groupby([df['key1'], df['key2']]).size()
```

```
Out[26]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
```
Besides grouping by columns, you can also supply a custom mapping (or function) to define the groups.
```python
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['AA', 'BB', 'CC', 'DD', 'EE'])
people
```

Out[64]:

| | a | b | c | d | e |
|---|---|---|---|---|---|
| AA | -1.969581 | -0.467297 | 1.003785 | 0.708328 | -0.045470 |
| BB | 0.783007 | -0.097895 | 2.508619 | 0.392152 | -0.647674 |
| CC | 0.744150 | 0.150627 | -2.206021 | 3.002937 | -0.127511 |
| DD | 0.575631 | -1.202379 | 1.169723 | 0.502523 | 0.889531 |
| EE | -0.573331 | -0.023822 | -1.461885 | 0.763456 | 0.763352 |
```python
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}
people.groupby(mapping, axis=1).sum()
```

Out[66]:

| | blue | red |
|---|---|---|
| AA | 1.712113 | -2.482348 |
| BB | 2.900771 | 0.037438 |
| CC | 0.796916 | 0.767267 |
| DD | 1.672246 | 0.262783 |
| EE | -0.698429 | 0.166199 |
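A function works the same way as a dict: it is called once per index label and its return value becomes the group key. The sketch below uses deterministic data; note also that the `axis=1` form is deprecated in recent pandas versions, and the same column-wise grouping can be written as transpose, group, transpose back.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(25.0).reshape(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['AA', 'BB', 'CC', 'DD', 'EEE'])

# Group rows by the length of their index label (2 vs 3 characters).
by_len = people.groupby(len).sum()

# Equivalent of the axis=1 mapping group above, without axis=1.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}
colors = people.T.groupby(mapping).sum().T

print(by_len)
print(colors)
```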
5. Data Aggregation
```python
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
df_obj5
```

Out[5]:

| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 9 | 6 | a | one |
| 1 | 1 | 8 | b | one |
| 2 | 9 | 8 | a | two |
| 3 | 8 | 6 | b | three |
| 4 | 5 | 6 | a | two |
| 5 | 6 | 7 | b | two |
| 6 | 4 | 6 | a | one |
| 7 | 7 | 7 | a | three |
```python
df_obj5.groupby(['key1', 'key2']).sum()
```

Out[6]:

| key1 | key2 | data1 | data2 |
|---|---|---|---|
| a | one | 13 | 12 |
| | three | 7 | 7 |
| | two | 14 | 14 |
| b | one | 1 | 8 |
| | three | 8 | 6 |
| | two | 6 | 7 |
```python
def peak_range(df):
    return df.max() - df.min()

print(df_obj5.groupby(['key1', 'key2']).agg(peak_range))
print(df_obj5.groupby(['key1', 'key2']).agg(lambda df: df.max() - df.min()))
```

```
            data1  data2
key1 key2
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
            data1  data2
key1 key2
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
```
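`agg` also accepts a list of functions, applying all of them at once; built-in aggregations can be named as strings, and each function becomes a column of the result. A small sketch on deterministic data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
                   'data1': [9, 1, 9, 8, 5, 6, 4, 7]})

def peak_range(s):
    return s.max() - s.min()

# One groupby, three aggregations; the custom function's column is named
# after the function itself ('peak_range').
result = df.groupby('key1')['data1'].agg(['sum', 'mean', peak_range])
print(result)
```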