Python for Data Analysis: pandas

1. Hierarchical indexing

Hierarchical indexing is an important pandas feature that lets you have multiple (two or more) index levels on a single axis.

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# The outer level comes from the first list, the inner level from the second
data = Series(np.random.randn(10),
              index=[['a','a','a','b','b','b','c','c','d','d'],
                     [1,2,3,1,2,3,1,2,2,3]])

data
Out[6]: 
a  1   -2.842857
   2    0.376199
   3   -0.512978
b  1    0.225243
   2   -1.242407
   3   -0.663188
c  1   -0.149269
   2   -1.079174
d  2   -0.952380
   3   -1.113689
dtype: float64

This is the formatted output of a Series with a MultiIndex. The "gaps" in the index display mean "use the label directly above".

data.index
Out[7]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

Hierarchical indexing plays an important role in reshaping data and in group-based operations.
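To make the reshaping role concrete, here is a minimal sketch of partial indexing and unstack() on a two-level Series. It uses fixed values rather than random ones (an assumption made only so the results are reproducible); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn so the results are reproducible
data = pd.Series(
    np.arange(10, dtype=float),
    index=[list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]],
)

# Partial indexing: everything under the outer label 'b'
b_part = data["b"]

# Inner label 2 across all outer labels
inner_2 = data.loc[:, 2]

# Reshape: move the inner level into columns.
# Missing (outer, inner) pairs such as ('c', 3) become NaN.
wide = data.unstack()
print(wide)
```

The result of unstack() is a DataFrame with the outer level as rows and the inner level as columns, which is often the easiest way to inspect a hierarchically indexed Series.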

2. Comparing the parameters of pd.read_csv and DataFrame

help(pd.DataFrame)

 |  Parameters
 |  ----------
 |  data : numpy ndarray (structured or homogeneous), dict, or DataFrame
 |      Dict can contain Series, arrays, constants, or list-like objects
 |  index : Index or array-like
 |      Index to use for resulting frame. Will default to np.arange(n) if
 |      no indexing information part of input data and no index provided
 |  columns : Index or array-like
 |      Column labels to use for resulting frame. Will default to
 |      np.arange(n) if no column labels are provided
 |  dtype : dtype, default None
 |      Data type to force. Only a single dtype is allowed. If None, infer
 |  copy : boolean, default False
 |      Copy data from inputs. Only affects DataFrame / 2d ndarray input

help(pd.read_csv)

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

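Comparison:

The two set column names differently. When reading a file that has no header row, you can pass header=None (the None object, not the string 'None') to have pandas assign column names automatically, or set them by hand with names=['a',...'z'].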

df =pd.read_csv('data/ex2.csv',names=['a','b','c','d','message'],index_col='message')
df

Out[8]:

         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
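The difference between header=None and names can be tried without a file on disk. The sketch below assumes the file contents match the output shown above (the actual data/ex2.csv is not reproduced in this post) and reads them from an in-memory buffer. It then builds the same frame with pd.DataFrame, where the labels go through the columns and index parameters instead.

```python
import io
import pandas as pd

# Assumed stand-in for a header-less file such as data/ex2.csv
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas assigns integer column labels 0..4
auto = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: supply the labels yourself; index_col picks the row index
named = pd.read_csv(
    io.StringIO(raw),
    names=["a", "b", "c", "d", "message"],
    index_col="message",
)

# pd.DataFrame receives labels through `columns` and `index`
frame = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    columns=["a", "b", "c", "d"],
    index=pd.Index(["hello", "world", "foo"], name="message"),
)
print(auto.columns.tolist())
print(named)
```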

3. Splitting data into bins

bins = [18,25,35,60,100]
ages =[20,22,25,30,31,19,33,28,44,51,66,34]
cats = pd.cut(ages,bins)
cats

Out[2]:

[(18, 25], (18, 25], (18, 25], (25, 35], (25, 35], ..., (25, 35], (35, 60], (35, 60], (60, 100], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes  # called cats.labels in older pandas versions

Out[3]:

array([0, 0, 0, 1, 1, 0, 1, 1, 2, 2, 3, 1], dtype=int8)

group_names = ['Youth','youngAdult','MiddleAged','Senior']
cats = pd.cut(ages, bins, labels=group_names)
cats

Out[18]:

[Youth, Youth, Youth, youngAdult, youngAdult, ..., youngAdult, MiddleAged, MiddleAged, Senior, youngAdult]
Length: 12
Categories (4, object): [Youth < youngAdult < MiddleAged < Senior]
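Two follow-ups are often useful after binning: counting how many values land in each bin, and changing which side of the interval is closed. The sketch below reuses the ages and bins from above; the variable names counts and left_closed are illustrative.

```python
import pandas as pd

ages = [20, 22, 25, 30, 31, 19, 33, 28, 44, 51, 66, 34]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)

# Integer bin code for each age
codes = cats.codes

# Number of ages per bin, ordered by bin
counts = pd.Series(cats).value_counts().sort_index()
print(counts)

# right=False closes the left edge instead: [18, 25), [25, 35), ...
# so the age 25 moves from the first bin to the second
left_closed = pd.cut(ages, bins, right=False)
```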

4. Data aggregation and group operations in pandas: groupby

df = DataFrame({'key1': ['a','a','b','b','a'],
                'key2': ['one','two','one','two','one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df

Out[17]:

      data1     data2 key1 key2
0  1.165397 -0.923362    a  one
1  0.849728 -0.937067    a  two
2 -0.751545  0.576415    b  one
3 -0.270348 -0.458194    b  two
4 -0.225201 -0.076616    a  one

grouped = df['data1'].groupby(df['key1'])
grouped.mean()

Out[19]:

key1
a    0.596641
b   -0.510946
Name: data1, dtype: float64

means = df['data1'].groupby([df['key1'],df['key2']]).mean()
means

Out[32]:

key1  key2
a     one     0.470098
      two     0.849728
b     one    -0.751545
      two    -0.270348
Name: data1, dtype: float64

df.groupby([df['key1'],df['key2']]).size()

Out[26]:

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
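When the grouping keys are columns of the same DataFrame, passing the column names is usually shorter than passing the Series themselves. A minimal sketch with fixed values (chosen here so the results can be checked by hand, not taken from the output above):

```python
import pandas as pd

# Fixed values instead of random ones, for reproducibility
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Group by column name, then pick the column to aggregate
means = df.groupby("key1")["data1"].mean()

# Two keys produce a Series with a two-level MultiIndex
pair_means = df.groupby(["key1", "key2"])["data1"].mean()

# unstack() turns the inner key into columns
wide = pair_means.unstack()

# size() counts the rows in each group
sizes = df.groupby(["key1", "key2"]).size()
print(wide)
```

The two-key result has the same hierarchical index discussed in section 1, which is why unstack() applies to it directly.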

Besides the built-in functions, you can also group with your own functions or mappings, such as a dict:

people=DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['AA','BB','CC','DD','EE'])
people

Out[64]:

           a         b         c         d         e
AA -1.969581 -0.467297  1.003785  0.708328 -0.045470
BB  0.783007 -0.097895  2.508619  0.392152 -0.647674
CC  0.744150  0.150627 -2.206021  3.002937 -0.127511
DD  0.575631 -1.202379  1.169723  0.502523  0.889531
EE -0.573331 -0.023822 -1.461885  0.763456  0.763352

mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red'}
people.groupby(mapping,axis=1).sum()

Out[66]:

        blue       red
AA  1.712113 -2.482348
BB  2.900771  0.037438
CC  0.796916  0.767267
DD  1.672246  0.262783
EE -0.698429  0.166199
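Newer pandas releases deprecate the axis=1 argument of groupby, so the call above may warn or fail depending on the installed version. One alternative, sketched here with fixed values rather than the random ones above, is to transpose, group the rows by the same mapping, and transpose back.

```python
import numpy as np
import pandas as pd

# Fixed values instead of np.random.randn, for reproducibility
people = pd.DataFrame(
    np.arange(25, dtype=float).reshape(5, 5),
    columns=list("abcde"),
    index=["AA", "BB", "CC", "DD", "EE"],
)
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red"}

# After .T the column labels become row labels, so the dict maps
# each row to its group; the final .T restores the orientation
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```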


5. Data aggregation

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
                 'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                 'data1': np.random.randint(1,10, 8),
                 'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)

Out[5]:

   data1  data2 key1   key2
0      9      6    a    one
1      1      8    b    one
2      9      8    a    two
3      8      6    b  three
4      5      6    a    two
5      6      7    b    two
6      4      6    a    one
7      7      7    a  three

df_obj5.groupby(['key1','key2']).sum()

Out[6]:

            data1  data2
key1 key2               
a    one       13     12
     three      7      7
     two       14     14
b    one        1      8
     three      8      6
     two        6      7

def peak_range(df):
    return df.max()-df.min()
print(df_obj5.groupby(['key1','key2']).agg(peak_range))
print(df_obj5.groupby(['key1','key2']).agg(lambda df:df.max()-df.min()))

            data1  data2
key1 key2               
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
            data1  data2
key1 key2               
a    one        5      0
     three      0      0
     two        4      2
b    one        0      0
     three      0      0
     two        0      0
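agg also accepts several aggregations at once. The sketch below rebuilds the frame from the values printed above (so it does not depend on np.random.randint) and shows two variants: a list of functions applied to one column, and named aggregation, where each output column is given as a (source column, function) pair. The output column names data1_range and data2_total are made up for this example.

```python
import pandas as pd

# Values copied from the df_obj5 output shown earlier
df_obj5 = pd.DataFrame({
    "key1": ["a", "b", "a", "b", "a", "b", "a", "a"],
    "key2": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "data1": [9, 1, 9, 8, 5, 6, 4, 7],
    "data2": [6, 8, 8, 6, 6, 7, 6, 7],
})

def peak_range(values):
    """Spread between the largest and smallest value in a group."""
    return values.max() - values.min()

grouped = df_obj5.groupby(["key1", "key2"])

# Several aggregations on one column; a custom function's column
# is labelled with the function's name
summary = grouped["data1"].agg(["sum", "mean", peak_range])

# Named aggregation: output_name=(source_column, function)
named = grouped.agg(
    data1_range=("data1", peak_range),
    data2_total=("data2", "sum"),
)
print(summary)
print(named)
```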
