python数据科学包第二天

pandas入门

首先,下面所有的例子均需要导入包:

# 设置为 inline 风格
%matplotlib inline
# 包导入
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

一维数组:Series:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

输出:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

二维数组创建:

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

输出

ABCD
2016-03-01-0.859805-0.069692-0.905092-0.553213
2016-03-02-0.3537850.031793-0.785213-0.212337
2016-03-031.7199760.9251450.241639-0.490166
2016-03-04-1.207854-0.001647-0.468976-0.781144
2016-03-050.4520341.3712081.1527291.470498
2016-03-061.3782270.246941-1.186630-0.411647
df.values
    array([[-8.59804740e-01, -6.96922374e-02, -9.05091676e-01,
            -5.53212518e-01],
           [-3.53785450e-01,  3.17933613e-02, -7.85212585e-01,
            -2.12337229e-01],
           [ 1.71997643e+00,  9.25144720e-01,  2.41639347e-01,
            -4.90166361e-01],
           [-1.20785422e+00, -1.64720630e-03, -4.68976209e-01,
            -7.81144372e-01],
           [ 4.52033577e-01,  1.37120779e+00,  1.15272905e+00,
             1.47049771e+00],
           [ 1.37822701e+00,  2.46941166e-01, -1.18662963e+00,
            -4.11647030e-01]])

用字典来构建DataFrame,如下:

# 使用字典来创建:key 为 DataFrame 的列;value 为对应列下的值
df1 = pd.DataFrame({
                  'A': 1,
                  'B': pd.Timestamp('20160301'),
                  'C': range(4),
                  'D': np.arange(5, 9),
                  'E': 'text',
                  'F': ['AA', 'BB', 'CC', 'DD']})
df1
ABCDEF
012016-03-0105textAA
112016-03-0116textBB
212016-03-0127textCC
312016-03-0138textDD
各个列可以通过DataFrame的属性来访问,如
df.A、df.B
type(df.A)
pandas.core.series.Series

可以看出,每一列或者每一行,都是Series数据

df.types  #查看每一列的数据类型
df.shape  #查看数据区的矩阵行列数
df.head() #查看矩阵的前五行,在函数内指定数字,可以查看指定行数
df.tail() #查看矩阵的后五行
df.index  #矩阵的行索引
df.columns#矩阵的列索引
df.values #查看矩阵的值,返回的是一个ndarray类型矩阵
df.describe()#查看矩阵的统计信息,默认是axis=0,及每一列的各行数据参与运算,生成每一列的统计信息,包括(每一列有多少值,平均值,标准差,最小值,四分位,中位数,四分之三分位,最大值),其中四分位的计算方法是减1计算法,具体可以查看百度百科
df.T      #矩阵的转置,或者使用transpose,类似numpy的方法
df.sort_index()#将行索引排序,参数axis可以指定为1,对列索引排序,ascending表示降序还是升序
df.sort_values(by='A')#依据每一列的值进行排序
df.['A']  #选择某列
df.A
df[2:4]	  #选择行,数字也可以换成行索引,布尔值索引,但当行索引也是数字时,默认使用数字,使用索引的切片,是包含最后一个元素所在的行的
df.loc[]  #使用索引来查找数据
df.iloc[] #使用数字来查找数据
#例子
df.loc[:,['B','C']],也可以访问某个特定的值
#
df.at[,]     #根据行列标签(索引)选择数据
df.iat[,]    #根据数字选择数据

df[df.A>0]   #根据布尔索引,找出A列大于0的行
df[df>0]     #非零的数据为NaN

矩阵运算

df['tag'] = ['a'] * 2 + ['b'] * 2 + ['c'] * 2
ABCDtag
2016-03-010.0332282.307123-0.585367-1.671832a
2016-03-02-1.9672990.727670-0.190863-0.163514a
2016-03-030.0653590.696804-0.5500400.717347b
2016-03-040.2348500.289520-1.0871731.534277b
2016-03-05-1.4596201.0409870.220130-0.068131c
2016-03-060.8654022.650889-0.015460-0.111889c
df[df.tag.isin(['a', 'c'])]
ABCDtag
2016-03-010.0332282.307123-0.585367-1.671832a
2016-03-02-1.9672990.727670-0.190863-0.163514a
2016-03-05-1.4596201.0409870.220130-0.068131c
2016-03-060.8654022.650889-0.015460-0.111889c
### 修改元素
s = pd.Series(np.arange(6), index=pd.date_range('20160301', periods=6))
df['E'] = s   # 修改一列
df.at[pd.Timestamp('20160301'), 'A'] = 0.4  #  修改某个元素
df.B = 200  #将一列的值都修改成一个值
df.iloc[:,2:5] = 1000 #修改子表

要结束网页的jupyter notebook,在控制台两次ctrl+c

重建索引

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(data=np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
ABCD
2016-03-01-0.9856660.2400580.7167210.352009
2016-03-02-1.5636440.0917661.0817640.951541
2016-03-030.279760-0.3161361.198073-0.562947
2016-03-041.174777-0.225305-0.280256-0.074768
2016-03-052.1733660.907038-1.104678-0.921779
2016-03-060.2004220.4426191.970330-0.609867
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
ABCDE
2016-03-01-0.9856660.2400580.7167210.352009NaN
2016-03-02-1.5636440.0917661.0817640.951541NaN
2016-03-030.279760-0.3161361.198073-0.562947NaN
2016-03-041.174777-0.225305-0.280256-0.074768NaN
### 处理丢失数据
df1.loc[dates[1:3], 'E'] = 1
df1
ABCDE
2016-03-01-0.9856660.2400580.7167210.352009NaN
2016-03-02-1.5636440.0917661.0817640.9515411
2016-03-030.279760-0.3161361.198073-0.5629471
2016-03-041.174777-0.225305-0.280256-0.074768NaN

此时的E列有两个空值

df1.dropna()  #去除了空值所在的行
df1.fillna(value=5) #将空值替换成5
pd.isnull(df1) #判断是否有空数据
pd.isnull(df1).any()#列中是否有空数据
pd.isnull(df1).any().any()#表中是否有空数据

统计

空数据是不参与统计计算的

df1.mean()  #各列的平均值
df.sum(axis='columns') #计算加和
df.cumsum() #计算累加值
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)  #shift表示数据段整体往后移动两个位置,前面空出来的位置为NaN
df.sub(s, axis='index')#每个列都减处理,控制行变成空值,一般都是对空值做预处理在减
df.apply(np.cumsum) #效果同df.cumsum(),注意这里的sumsum属于np模块,不是pd模块,apply是吧一个列作为参数传给函数,applymap是把每个元素做处理
df.apply(lambda x: x.max() - x.min()) #得出每一列最大值减最小值
s = pd.Series(np.random.randint(0, 7, size=10)) ¥不包括最后一个数,但python自带函数是包括的
s.value_counts() #统计各种数字的数量
s.mode()         #最多的数是哪些

数据合并

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
ABCD
01.098103-0.843356-0.3791350.419353
1-0.177702-0.225926-0.363542-0.153022
21.9382310.1548810.2913820.152774
3-0.460645-0.268697-1.5094690.698776
4-0.397048-0.9582230.212833-0.435485
50.525406-0.1775950.453216-0.093792
60.531912-0.8326670.2007210.943878
7-0.7408450.0986340.2740201.671997
82.1823791.7290101.3062690.580677
9-0.0315380.1597140.736667-0.122326
df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]])
(df == df1).all().all()  #判断两个矩阵是否相等

# SQL 样式的联合查询
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
# SELECT * FROM left INNER JOIN right ON left.key = right.key;
pd.merge(left, right, on='key')

s = pd.Series(np.random.randint(1, 5, size=4), index=list('ABCD'))
df.append(s, ignore_index=True)

如果s有五列,在插入后,多出来新的一列,除了最后一行以外,其他都是NaN

分组统计

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                           'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
ABCD
0fooone-0.580320-1.460149
1barone1.471201-1.079598
2footwo0.0948361.513204
3barthree-1.4988100.754968
4footwo0.1807090.415266
5bartwo0.358515-0.341988
6fooone-0.121082-0.408148
7foothree0.404648-0.320882
df.groupby('A').sum()
CD
A
bar0.330906-0.666618
foo-0.021208-0.260709
df.groupby(['A', 'B']).sum()
CD
AB
barone1.471201-1.079598
three-1.4988100.754968
two0.358515-0.341988
fooone-0.701402-1.868297
three0.404648-0.320882
two0.2755451.928470

形成了双索引的结构

数据整形

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
AB
firstsecond
barone0.0720260.422077
two-1.099181-0.354796
bazone1.285500-1.185525
two0.645316-0.660115
fooone0.696443-1.664527
two0.718399-0.154125
quxone-0.7400520.713089
two-0.672748-1.346843
stacked = df.stack()
first  second   
bar    one     A    0.072026
               B    0.422077
       two     A   -1.099181
               B   -0.354796
baz    one     A    1.285500
               B   -1.185525
       two     A    0.645316
               B   -0.660115
foo    one     A    0.696443
               B   -1.664527
       two     A    0.718399
               B   -0.154125
qux    one     A   -0.740052
               B    0.713089
       two     A   -0.672748
               B   -1.346843
dtype: float64
stacked.unstack()  #装换成以前的样子了
stacked.unstack().unstack()
AB
secondonetwoonetwo
first
bar0.072026-1.0991810.422077-0.354796
baz1.2855000.645316-1.185525-0.660115
foo0.6964430.718399-1.664527-0.154125
qux-0.740052-0.6727480.713089-1.346843
#可以指明选择哪一列作为列索引
stacked.unstack(1)
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>second</th>
      <th>one</th>
      <th>two</th>
    </tr>
    <tr>
      <th>first</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="2" valign="top">bar</th>
      <th>A</th>
      <td>0.072026</td>
      <td>-1.099181</td>
    </tr>
    <tr>
      <th>B</th>
      <td>0.422077</td>
      <td>-0.354796</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">baz</th>
      <th>A</th>
      <td>1.285500</td>
      <td>0.645316</td>
    </tr>
    <tr>
      <th>B</th>
      <td>-1.185525</td>
      <td>-0.660115</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">foo</th>
      <th>A</th>
      <td>0.696443</td>
      <td>0.718399</td>
    </tr>
    <tr>
      <th>B</th>
      <td>-1.664527</td>
      <td>-0.154125</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">qux</th>
      <th>A</th>
      <td>-0.740052</td>
      <td>-0.672748</td>
    </tr>
    <tr>
      <th>B</th>
      <td>0.713089</td>
      <td>-1.346843</td>
    </tr>
  </tbody>
</table>
</div>

数据透视表

数据透视是指查看里面的一部分数据

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D' : np.random.randn(12),
                    'E' : np.random.randn(12)})
df
ABCDE
0oneAfoo1.4775331.557713
1oneBfoo0.0195282.483014
2twoCfoo-0.9124520.409732
3threeAbar0.502807-0.462401
4oneBbar1.709597-1.739413
5oneCbar-0.6581551.302735
6twoAfoo0.0078060.782926
7threeBfoo-0.067922-0.193820
8oneCfoo0.8067130.383870
9oneAbar0.7940170.749756
10twoBbar-0.532554-0.811900
11threeCbar0.4647311.168423
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Cbarfoo
AB
oneA0.7940171.477533
B1.7095970.019528
C-0.6581550.806713
threeA0.502807NaN
BNaN-0.067922
C0.464731NaN
twoANaN0.007806
B-0.532554NaN
CNaN-0.912452
**当这个数据透视表对应多个值时,回去取平均值**

时间序列

rng = pd.date_range('20160301', periods=600, freq='s')
rng
DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01',
               '2016-03-01 00:00:02', '2016-03-01 00:00:03',
               '2016-03-01 00:00:04', '2016-03-01 00:00:05',
               '2016-03-01 00:00:06', '2016-03-01 00:00:07',
               '2016-03-01 00:00:08', '2016-03-01 00:00:09',
               ...
               '2016-03-01 00:09:50', '2016-03-01 00:09:51',
               '2016-03-01 00:09:52', '2016-03-01 00:09:53',
               '2016-03-01 00:09:54', '2016-03-01 00:09:55',
               '2016-03-01 00:09:56', '2016-03-01 00:09:57',
               '2016-03-01 00:09:58', '2016-03-01 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2016-03-01 00:00:00     34
2016-03-01 00:00:01      4
2016-03-01 00:00:02    382
2016-03-01 00:00:03    164
2016-03-01 00:00:04    178
2016-03-01 00:00:05    421
2016-03-01 00:00:06     34
2016-03-01 00:00:07     71
2016-03-01 00:00:08    316
2016-03-01 00:00:09    201
2016-03-01 00:00:10    214
2016-03-01 00:00:11    443
2016-03-01 00:00:12    185
2016-03-01 00:00:13     79
2016-03-01 00:00:14     38
2016-03-01 00:00:15    465
2016-03-01 00:00:16    309
2016-03-01 00:00:17     93
2016-03-01 00:00:18     20
2016-03-01 00:00:19    338
2016-03-01 00:00:20    149
2016-03-01 00:00:21     34
2016-03-01 00:00:22    257
2016-03-01 00:00:23    462
2016-03-01 00:00:24     41
2016-03-01 00:00:25    471
2016-03-01 00:00:26    313
2016-03-01 00:00:27    224
2016-03-01 00:00:28     78
2016-03-01 00:00:29    498
                      ... 
2016-03-01 00:09:30     61
2016-03-01 00:09:31    315
2016-03-01 00:09:32    388
2016-03-01 00:09:33    391
2016-03-01 00:09:34    263
2016-03-01 00:09:35     11
2016-03-01 00:09:36     61
2016-03-01 00:09:37    400
2016-03-01 00:09:38    109
2016-03-01 00:09:39    135
2016-03-01 00:09:40    267
2016-03-01 00:09:41    248
2016-03-01 00:09:42    469
2016-03-01 00:09:43    155
2016-03-01 00:09:44    284
2016-03-01 00:09:45    168
2016-03-01 00:09:46    228
2016-03-01 00:09:47    244
2016-03-01 00:09:48    442
2016-03-01 00:09:49    450
2016-03-01 00:09:50    226
2016-03-01 00:09:51    370
2016-03-01 00:09:52    192
2016-03-01 00:09:53    325
2016-03-01 00:09:54     82
2016-03-01 00:09:55    154
2016-03-01 00:09:56    285
2016-03-01 00:09:57     22
2016-03-01 00:09:58     48
2016-03-01 00:09:59    171
Freq: S, dtype: int32
#重新采样
ts.resample('2Min', how='sum')
2016-03-01 00:00:00    28595
2016-03-01 00:02:00    29339
2016-03-01 00:04:00    28991
2016-03-01 00:06:00    30789
2016-03-01 00:08:00    30131
Freq: 2T, dtype: int32
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='int64', freq='Q-NOV')
prng.to_timestamp()   #转成标准格式数据 年-月-日
pd.Timestamp('20160301')-pd.Timestamp('2016') #计算时间差

类别数据

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
idraw_grade
01a
12b
23b
34a
45a
56e
df["grade"] = df["raw_grade"].astype("category")
df
idraw_gradegrade
01aa
12bb
23bb
34aa
45aa
56ee
df["grade"].cat.categories
Index([u'a', u'b', u'e'], dtype='object')
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
idraw_gradegrade
01avery good
12bgood
23bgood
34avery good
45avery good
56every bad
df.sort_values(by='grade', ascending=True)
idraw_gradegrade
01avery good
34avery good
45avery good
12bgood
23bgood
56every bad

画图

ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts
2000-01-01     0.416424
2000-01-02     0.603304
2000-01-03    -0.237965
2000-01-04     0.317450
2000-01-05     0.665045
2000-01-06     2.468087
2000-01-07     2.758852
2000-01-08     2.271343
2000-01-09     3.129609
2000-01-10     5.171241
2000-01-11     5.049896
2000-01-12     5.185316
2000-01-13     4.169058
2000-01-14     2.862306
2000-01-15     4.018617
2000-01-16     4.456694
2000-01-17     5.824236
2000-01-18     6.094983
2000-01-19     5.880954
2000-01-20     5.875111
2000-01-21     6.008481
2000-01-22     6.835501
2000-01-23     7.480405
2000-01-24     6.849335
2000-01-25     7.608887
2000-01-26     9.029474
2000-01-27     8.859222
2000-01-28     7.162806
2000-01-29     7.398013
2000-01-30     7.391844
                ...    
2002-08-28    21.728409
2002-08-29    21.757852
2002-08-30    21.047643
2002-08-31    20.114996
2002-09-01    18.769902
2002-09-02    17.417680
2002-09-03    17.917688
2002-09-04    18.064786
2002-09-05    19.312356
2002-09-06    18.633479
2002-09-07    17.711879
2002-09-08    19.162369
2002-09-09    19.697896
2002-09-10    18.895018
2002-09-11    18.590989
2002-09-12    17.278925
2002-09-13    17.730168
2002-09-14    19.058526
2002-09-15    18.898382
2002-09-16    17.048621
2002-09-17    16.443233
2002-09-18    16.842284
2002-09-19    14.627031
2002-09-20    15.500982
2002-09-21    14.640444
2002-09-22    13.183795
2002-09-23    13.383657
2002-09-24    13.006229
2002-09-25    12.311008
2002-09-26    11.674804
Freq: D, dtype: float64
ts.plot()

在这里插入图片描述

数据读写

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
ABCD
0-1.052421-0.1649923.098604-0.966960
11.1941770.0868800.4960950.265308
20.2977241.284297-0.130855-0.229570
3-0.7870630.5536800.546853-0.322599
40.033174-1.2222810.320090-1.749333
50.1095750.3106841.620296-0.928869
60.761408-0.0276300.458341-0.785370
7-1.150479-0.7185841.0288660.419026
8-2.906881-0.295700-0.342306-0.765172
90.916363-1.181429-1.559657-1.171191
100.5786590.8047261.2994960.176843
110.150659-0.162833-1.0860551.240432
12-0.8192191.6682340.217604-0.779170
13-0.550658-0.672640-0.674157-0.637602
140.9015840.0460230.2443700.374293
150.971181-0.4426180.1790830.086095
16-0.570786-1.0192391.6848330.539140
17-1.4323141.3695882.0913000.733526
18-1.115526-0.1158842.636074-0.788859
191.6015541.2261820.169308-0.616585
200.5713160.5424320.3065950.780939
21-0.5404141.0366560.683224-0.116963
221.319110-1.2652071.3719240.881560
231.584346-1.719633-1.365020-0.617224
24-0.440420-0.7992650.376128-0.654581
25-0.261730-0.046325-0.2890090.505634
260.3850470.1127230.428345-0.008455
27-0.9216681.6098481.592532-0.623103
280.280799-0.231821-1.589829-1.791286
290.6615620.6213050.921586-0.312834
...............
700.0643850.669585-1.3470730.941348
71-1.534420-1.2277360.459771-1.150254
720.0107410.062820-1.0983011.268482
73-1.1835861.159889-0.186617-0.847210
74-0.705815-0.3718960.3130200.035314
75-2.945315-0.421227-0.4034791.387825
76-0.1223830.474282-2.039155-0.155960
770.921353-0.430436-0.5992530.911030
780.0184440.0986110.3204800.001282
79-0.188301-2.015690-0.427172-0.146939
80-0.0060220.2134211.358382-0.414890
810.5965460.0427081.325342-0.800222
82-1.736245-0.056213-0.415892-0.360570
830.463591-0.4042020.5771910.336023
84-1.3975570.4420120.007915-1.305628
85-0.137766-0.7717130.200956-0.365344
860.988833-0.165965-0.893573-0.318324
871.0937991.694406-0.8684200.100202
88-0.2406280.539268-1.0948411.737569
891.850923-0.472270-2.317345-0.544395
900.6172841.224130-1.7223660.236574
911.2829670.7385701.748848-0.106646
920.775707-0.494293-1.0984660.372206
93-0.8464660.7351441.4565201.622817
94-0.8609991.146650-1.0640131.400919
95-0.095498-1.8495182.3035320.688425
96-0.017921-0.558700-1.0616050.781250
97-1.0690701.106837-1.936800-0.782616
980.4362670.4635370.614982-0.123774
99-1.440635-1.506836-0.3868241.118260

100 rows × 4 columns

df.to_csv('data.csv')
%ls
%more data.csv
# pd.read_csv('data.csv')
pd.read_csv('data.csv', index_col=0)#以第一列作为索引

电影数据分析

准备工作

从网站 grouplens.org/datasets/movielens 下载 MovieLens 1M Dataset 数据。

数据说明

参阅数据介绍文件 README.txt

利用 Pandas 分析电影评分数据

  • 数据读取
  • 数据合并
  • 统计电影平均得分
  • 统计活跃电影 -> 获得评分的次数越多说明电影越活跃
  • 女生最喜欢的电影排行榜
  • 男生最喜欢的电影排行榜
  • 男女生评分差距最大的电影 -> 某类电影女生喜欢,但男生不喜欢
  • 最具争议的电影排行榜 -> 评分的方差最大
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=user_names, engine='python')

rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')

movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')

pandas内部有两种实现方式,一种是C语言,一种是python,这里指定python,因为C语言功能不如python多。

print len(users)
users.head(5)
6040
user_idgenderageoccupationzip
01F11048067
12M561670072
23M251555117
34M45702460
45M252055455

数据合并

data = pd.merge(pd.merge(users, ratings), movies)
len(data)
data.head(5)
user_idgenderageoccupationzipmovie_idratingtimestamptitlegenres
01F1104806711935978300760One Flew Over the Cuckoo's Nest (1975)Drama
12M56167007211935978298413One Flew Over the Cuckoo's Nest (1975)Drama
212M25123279311934978220179One Flew Over the Cuckoo's Nest (1975)Drama
315M2572290311934978199279One Flew Over the Cuckoo's Nest (1975)Drama
417M5019535011935978158471One Flew Over the Cuckoo's Nest (1975)Drama
data[data.user_id == 1]  #查看你用户1对所有电影的评分情况
# 按性别查看各个电影的平均评分
mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings_gender.head(5)
genderFM
title
$1,000,000 Duck (1971)3.3750002.761905
'Night Mother (1986)3.3888893.352941
'Til There Was You (1997)2.6756762.733333
'burbs, The (1989)2.7934782.962085
...And Justice for All (1979)3.8285713.689024
# 男女意见想差最大的电影 -> 价值观/品味冲突
mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M
mean_ratings_gender.head(5)
genderFMdiff
title
$1,000,000 Duck (1971)3.3750002.7619050.613095
'Night Mother (1986)3.3888893.3529410.035948
'Til There Was You (1997)2.6756762.733333-0.057658
'burbs, The (1989)2.7934782.962085-0.168607
...And Justice for All (1979)3.8285713.6890240.139547
mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)
genderFMdiff
title
Tigrero: A Film That Was Never Made (1994)14.333333-3.333333
Neon Bible, The (1995)14.000000-3.000000
Enfer, L' (1994)13.750000-2.750000
Stalingrad (1993)13.593750-2.593750
Killer: A Journal of Murder (1995)13.428571-2.428571
Dangerous Ground (1997)13.333333-2.333333
In God's Hands (1998)13.333333-2.333333
Rosie (1998)13.333333-2.333333
Flying Saucer, The (1950)13.300000-2.300000
Jamaica Inn (1939)13.142857-2.142857
# 活跃电影排行榜
ratings_by_movie_title = data.groupby('title').size()
ratings_by_movie_title.head(5)
title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64
# 前二十大高分电影 -> 平均评分最高的电影
mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
top_20_mean_ratings = mean_ratings.sort_values(ascending=False).head(20)
top_20_mean_ratings
title
Gate of Heavenly Peace, The (1995)                                     5.000000
Lured (1947)                                                           5.000000
Ulysses (Ulisse) (1954)                                                5.000000
Smashing Time (1967)                                                   5.000000
Follow the Bitch (1998)                                                5.000000
Song of Freedom (1936)                                                 5.000000
Bittersweet Motel (2000)                                               5.000000
Baby, The (1973)                                                       5.000000
One Little Indian (1973)                                               5.000000
Schlafes Bruder (Brother of Sleep) (1995)                              5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                    4.800000
Lamerica (1994)                                                        4.750000
Apple, The (Sib) (1998)                                                4.666667
Sanjuro (1962)                                                         4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)    4.560510
Shawshank Redemption, The (1994)                                       4.554558
Godfather, The (1972)                                                  4.524966
Close Shave, A (1995)                                                  4.520548
Usual Suspects, The (1995)                                             4.517106
Schindler's List (1993)                                                4.510417
Name: rating, dtype: float64
# 前十大热闹电影的平均评分 -> 不一定越热闹的电影,评分越高
mean_ratings[top_10_ratings.index]
title
American Beauty (1999)                                   4.317386
Star Wars: Episode IV - A New Hope (1977)                4.453694
Star Wars: Episode V - The Empire Strikes Back (1980)    4.292977
Star Wars: Episode VI - Return of the Jedi (1983)        4.022893
Jurassic Park (1993)                                     3.763847
Saving Private Ryan (1998)                               4.337354
Terminator 2: Judgment Day (1991)                        4.058513
Matrix, The (1999)                                       4.315830
Back to the Future (1985)                                3.990321
Silence of the Lambs, The (1991)                         4.351823
Name: rating, dtype: float64
# 前二十大高分电影的热闹程度 -> 不一定评分越高的电影越热闹,可能某个很小众的电影看得人少,但评分很高
ratings_by_movie_title[top_20_mean_ratings.index]
title
Gate of Heavenly Peace, The (1995)                                        3
Lured (1947)                                                              1
Ulysses (Ulisse) (1954)                                                   1
Smashing Time (1967)                                                      2
Follow the Bitch (1998)                                                   1
Song of Freedom (1936)                                                    1
Bittersweet Motel (2000)                                                  1
Baby, The (1973)                                                          1
One Little Indian (1973)                                                  1
Schlafes Bruder (Brother of Sleep) (1995)                                 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                       5
Lamerica (1994)                                                           8
Apple, The (Sib) (1998)                                                   9
Sanjuro (1962)                                                           69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)     628
Shawshank Redemption, The (1994)                                       2227
Godfather, The (1972)                                                  2223
Close Shave, A (1995)                                                   657
Usual Suspects, The (1995)                                             1783
Schindler's List (1993)                                                2304
dtype: int64
# 十大好电影 -> 活跃度超过 1000 的高分电影
top_10_movies = mean_ratings[top_ratings.index].sort_values(ascending=False).head(10)
top_10_movies
title
Shawshank Redemption, The (1994)                                               4.554558
Godfather, The (1972)                                                          4.524966
Usual Suspects, The (1995)                                                     4.517106
Schindler's List (1993)                                                        4.510417
Raiders of the Lost Ark (1981)                                                 4.477725
Rear Window (1954)                                                             4.476190
Star Wars: Episode IV - A New Hope (1977)                                      4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    4.449890
Casablanca (1942)                                                              4.412822
Sixth Sense, The (1999)                                                        4.406263
Name: rating, dtype: float64

pandas核心数据结构

import pandas as pd
import numpy as np

Series

Series 是一维带标签的数组,数组里可以放任意的数据(整数,浮点数,字符串,Python Object)。其基本的创建函数是:

s = pd.Series(data, index=index)

其中 index 是一个列表,用来作为数据的标签。data 可以是不同的数据类型:

  • Python 字典
  • ndarray 对象
  • 一个标量值,如 5
从 ndaray 创建
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a    0.747292
b   -1.120276
c   -0.132692
d   -0.267813
e   -0.590904
dtype: float64
s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
s = pd.Series(np.random.randn(5))
s
0    0.324214
1   -0.183776
2   -0.518808
3    0.866421
4   -0.601668
dtype: float64
s.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')
从字典创建
# 空值的默认处理
d = {'a' : 0., 'b' : 1., 'd' : 3}
s = pd.Series(d, index=list('abcd'))
s
a     0
b     1
c   NaN
d     3
dtype: float64
从标量创建
pd.Series(3, index=list('abcde'))
a    3
b    3
c    3
d    3
e    3
dtype: int64

Series特性

Series 是类 ndarray 对象

熟悉 numpy 的同学对下面的操作应该不会陌生。我们在 numpy 简介里也介绍过下面的索引方式。

s = pd.Series(np.random.randn(5))
s
0    0.882069
1   -0.134360
2   -0.925088
3    0.191072
4    2.546704
dtype: float64
s[0]
0.88206876023157332
s[:3]
0    0.882069
1   -0.134360
2   -0.925088
dtype: float64
s[[1, 3, 4]]
1   -0.134360
3    0.191072
4    2.546704
dtype: float64
np.exp(s)
0     2.415892
1     0.874275
2     0.396497
3     1.210546
4    12.764963
dtype: float64
np.sin(s)
0    0.772055
1   -0.133957
2   -0.798673
3    0.189911
4    0.560416
dtype: float64
Series 是类字典对象
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e   -0.462794
dtype: float64
s['a']
-2.1498403551053218
s['e'] = 5
s
a   -2.149840
b   -0.924115
c    0.481231
d    1.033813
e    5.000000
dtype: float64
s['g'] = 100
s
a     -2.149840
b     -0.924115
c      0.481231
d      1.033813
e      5.000000
g    100.000000
dtype: float64
'e' in s
True
'f' in s
False
# s['f']
print s.get('f')
None
print s.get('f', np.nan)
nan
标签对齐操作
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print '{0}\n\n{1}'.format(s1, s2)
a   -0.917905
c   -0.744616
e    0.114522
dtype: float64

a    0.721087
d   -0.471575
e    0.796093
dtype: float64
s1 + s2
name 属性
s = pd.Series(np.random.randn(5), name='Some Thing')
s
0    0.623787
1    0.517239
2    1.551314
3    1.414463
4   -1.224611
Name: Some Thing, dtype: float64
s.name
'Some Thing'



a   -0.196818
c         NaN
d         NaN
e    0.910615
dtype: float64

DataFrame

DataFrame 是二维带行标签和列标签的数组。可以把 DataFrame 想你成一个 Excel 表格或一个 SQL 数据库的表格,还可以相像成是一个 Series 对象字典。它是 Pandas 里最常用的数据结构。

创建 DataFrame 的基本格式是:

df = pd.DataFrame(data, index=index, columns=columns)

其中 index 是行标签,columns 是列标签,data 可以是下面的数据:

  • 由一维 numpy 数组,list,Series 构成的字典
  • 二维 numpy 数组
  • 一个 Series
  • 另外的 DataFrame 对象
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
onetwo
a11
b22
c33
dNaN4
pd.DataFrame(d, index=['d', 'b', 'a'])
onetwo
dNaN4
b22
a11
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
twothree
d4NaN
b2NaN
a1NaN

如果是列表,必须元素个数一样,否则会报错

d = {'one' : [1, 2, 3, 4],
     'two' : [21, 22, 23, 24]}
pd.DataFrame(d)
onetwo
0121
1222
2323
3424
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
onetwo
a121
b222
c323
d424
从结构化数据中创建
data = [(1, 2.2, 'Hello'), (2, 3., "World")]
pd.DataFrame(data)
012
012.2Hello
123.0World
pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
ABC
first12.2Hello
second23.0World
从字典列表创建
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data)
abc
012NaN
151020
```python pd.DataFrame(data, index=['first', 'second']) ```
abc
first12NaN
second51020
pd.DataFrame(data, columns=['a', 'b'])
ab
012
1510
从元组字典创建

了解其创建的原理,实际应用中,会通过数据清洗的方式,把数据整理成方便 Pandas 导入且可读性好的格式。最后再通过 reindex/groupby 等方式转换成复杂数据结构。

d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
     ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
# 多级标签
pd.DataFrame(d)
ab
abcab
AB415810
C3267NaN
DNaNNaNNaNNaN9
从 Series 创建
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)
0
a-0.789343
b0.127384
c1.084005
d-0.755011
e-0.963299
pd.DataFrame(s, index=['a', 'c', 'd'])
0
a-0.789343
c1.084005
d-0.755011

如果这里指定多列,就会报错,不同于字典作为参数进行创建

pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])
A
a-0.789343
c1.084005
d-0.755011
列选择/增加/删除
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df
onetwothreefour
02.045300-0.981722-0.656081-0.639517
1-0.5507800.248781-0.1464240.217392
21.7027750.103998-0.662138-0.534071
3-2.0356810.0150251.3682090.178378
4-1.0922080.091108-0.892496-0.611198
50.0935020.2674281.189654-0.258723
df['one']
0    2.045300
1   -0.550780
2    1.702775
3   -2.035681
4   -1.092208
5    0.093502
Name: one, dtype: float64
df['three'] = df['one'] + df['two']
df
onetwothreefour
02.045300-0.9817221.063578-0.639517
1-0.5507800.248781-0.3019990.217392
21.7027750.1039981.806773-0.534071
3-2.0356810.015025-2.0206560.178378
4-1.0922080.091108-1.001100-0.611198
50.0935020.2674280.360931-0.258723
df['flag'] = df['one'] > 0
df
onetwothreefourflag
02.045300-0.9817221.063578-0.639517True
1-0.5507800.248781-0.3019990.217392False
21.7027750.1039981.806773-0.534071True
3-2.0356810.015025-2.0206560.178378False
4-1.0922080.091108-1.001100-0.611198False
50.0935020.2674280.360931-0.258723True
del df['three']
df
onetwofourflag
02.045300-0.981722-0.639517True
1-0.5507800.2487810.217392False
21.7027750.103998-0.534071True
3-2.0356810.0150250.178378False
4-1.0922080.091108-0.611198False
50.0935020.267428-0.258723True
four = df.pop('four')
four
0   -0.639517
1    0.217392
2   -0.534071
3    0.178378
4   -0.611198
5   -0.258723
Name: four, dtype: float64
df
onetwoflag
02.045300-0.981722True
1-0.5507800.248781False
21.7027750.103998True
3-2.0356810.015025False
4-1.0922080.091108False
50.0935020.267428True
df['five'] = 5
df
onetwoflagfive
02.045300-0.981722True5
1-0.5507800.248781False5
21.7027750.103998True5
3-2.0356810.015025False5
4-1.0922080.091108False5
50.0935020.267428True5
df['one_trunc'] = df['one'][:2]
df
onetwoflagfiveone_trunc
02.045300-0.981722True52.04530
1-0.5507800.248781False5-0.55078
21.7027750.103998True5NaN
3-2.0356810.015025False5NaN
4-1.0922080.091108False5NaN
50.0935020.267428True5NaN
# 指定插入位置
df.insert(1, 'bar', df['one'])
df
onebartwoflagfiveone_trunc
02.0453002.045300-0.981722True52.04530
1-0.550780-0.5507800.248781False5-0.55078
21.7027751.7027750.103998True5NaN
3-2.035681-2.0356810.015025False5NaN
4-1.092208-1.0922080.091108False5NaN
50.0935020.0935020.267428True5NaN
使用 assign() 方法来插入新列

更方便地使用 methd chains 的方法来实现

df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
df
ABCD
04334
11442
21443
32443
41242
53414
df.assign(Ratio = df['A'] / df['B'])
ABCDRatio
043341.333333
114420.250000
214430.250000
324430.500000
412420.500000
534140.750000

insert是直接改变原DataFrame,assign是生成一个新的,assign的另一个特性是可以传入函数进行计算。

df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)
ABCDAB_RatioCD_Ratio
043341.333333-1
114420.2500002
214430.2500001
324430.5000001
412420.5000002
534140.750000-3
df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)
ABCDAB_RatioABD_Ratio
043341.3333335.333333
114420.2500000.500000
214430.2500000.750000
324430.5000001.500000
412420.5000001.000000
534140.7500003.000000
索引和选择

对应的操作,语法和返回结果

  • 选择一列 -> df[col] -> Series
  • 根据行标签选择一行 -> df.loc[label] -> Series
  • 根据行位置选择一行 -> df.iloc[label] -> Series
  • 选择多行 -> df[5:10] -> DataFrame
  • 根据布尔向量选择多行 -> df[bool_vector] -> DataFrame
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df
ABCD
a2266
b8357
c4683
d7839
e8442
f4243
df['A']
a    2
b    8
c    4
d    7
e    8
f    4
Name: A, dtype: int32
df.loc['a']
A    2
B    2
C    6
D    6
Name: a, dtype: int32
df.iloc[0]
A    2
B    2
C    6
D    6
Name: a, dtype: int32
df[1:4]
ABCD
b8357
c4683
d7839
使用iloc比这样直接访问效率要高df.iloc[1:4]
df[[False, True, True, False, True, False]]
ABCD
b8357
c4683
e8442
数据对齐

DataFrame 在进行数据计算时,会自动按行和列进行数据对齐。最终的计算结果会合并两个 DataFrame。

df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1
ABCD
a0.576428-0.037913-0.329787-1.752916
b0.406743-1.044561-0.7244470.374599
c0.0735780.423914-1.499770-0.488374
d-0.3776091.137422-1.951169-0.814306
e-2.171648-2.364502-0.8335940.168636
f-1.134800-0.9274690.8868890.542603
g0.6251040.115953-1.2826091.031292
h0.4035090.2632070.403614-0.177888
i0.148494-2.0342530.134859-0.960650
j0.094200-1.8032880.057472-0.338958
df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2
ABC
c0.8845180.337344-1.072027
d0.264036-0.152542-0.225544
e1.048813-1.4964421.022348
f0.895314-0.8902361.230465
g-0.588162-0.492354-0.739563
h-2.5803221.104810-0.167137
i-0.8427380.1717350.847714
df1 + df2
ABCD
aNaNNaNNaNNaN
bNaNNaNNaNNaN
c0.9580960.761259-2.571797NaN
d-0.1135730.984880-2.176713NaN
e-1.122834-3.8609440.188754NaN
f-0.239486-1.8177052.117354NaN
g0.036942-0.376401-2.022171NaN
h-2.1768131.3680160.236476NaN
i-0.694245-1.8625170.982573NaN
jNaNNaNNaNNaN
df1 - df1.iloc[0]
ABCD
a0.0000000.0000000.0000000.000000
b-0.169685-1.006648-0.3946602.127515
c-0.5028500.461827-1.1699831.264541
d-0.9540371.175335-1.6213820.938610
e-2.748076-2.326589-0.5038071.921551
f-1.711228-0.8895561.2166762.295518
g0.0486760.153866-0.9528222.784208
h-0.1729190.3011190.7334001.575028
i-0.427934-1.9963400.4646460.792265
j-0.482228-1.7653750.3872591.413957
这里两个运算对象不是同类型,用到广播原理,生成新的DataFrame,是原来DataFrame的每行减去第0行。
使用 numpy 函数

Pandas 与 numpy 在核心数据结构上是完全兼容的

df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df
onetwothreefour
0-1.1218181.2336860.681618-0.502204
11.469664-0.060555-0.0448570.725021
21.2196700.1087091.8060630.332685
3-0.1906151.244102-0.8638501.795335
4-0.133109-0.1015910.8187241.246230
50.7298040.7165932.472841-0.078224
60.0101361.725441-1.0711941.602945
71.002507-1.122593-0.147411-1.678843
8-0.5500770.230777-0.658470-1.680395
91.0062710.455683-2.279833-0.823792
np.exp(df)
onetwothreefour
00.3256873.4338641.9770730.605196
14.3477740.9412420.9561342.064774
23.3860691.1148386.0864401.394708
30.8264503.4698170.4215366.021490
40.8753690.9033992.2676043.477210
52.0746752.04744611.8560820.924757
61.0101875.6149950.3425994.967641
72.7251050.3254350.8629390.186590
80.5769051.2595780.5176430.186300
92.7353821.5772500.1023010.438765
np.asarray(df) == df.values
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)
type(np.asarray(df))
numpy.ndarray
np.asarray(df) == df
onetwothreefour
0TrueTrueTrueTrue
1TrueTrueTrueTrue
2TrueTrueTrueTrue
3TrueTrueTrueTrue
4TrueTrueTrueTrue
5TrueTrueTrueTrue
6TrueTrueTrueTrue
7TrueTrueTrueTrue
8TrueTrueTrueTrue
9TrueTrueTrueTrue

Panel

Panel 是三维带标签的数组。实际上,Pandas 的名称由来就是由 Panel 演进的,即 pan(el)-da(ta)-s。Panel 比较少用,但依然是最重要的基础数据结构之一。

  • items: 坐标轴 0,索引对应的元素是一个 DataFrame
  • major_axis: 坐标轴 1, DataFrame 里的行标签
  • minor_axis: 坐标轴 2, DataFrame 里的列标签
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
pn['Item1']
012
00.638298-1.6008223.112210
10.3940990.1841290.438450
20.427692-0.2945560.039430
31.5550460.9337490.218616
pn.items
Index([u'Item1', u'Item2'], dtype='object')
pn.major_axis
Int64Index([0, 1, 2, 3], dtype='int64')
pn.minor_axis
Int64Index([0, 1, 2], dtype='int64')
# 函数调用
pn.major_xs(pn.major_axis[0])
Item1Item2
00.638298-1.427579
1-1.600822-0.778090
23.112210NaN
# 函数调用
pn.minor_xs(pn.major_axis[1])
Item1Item2
0-1.600822-0.778090
10.1841290.698347
2-0.294556-0.167423
30.9337490.205092
pn.to_frame()
Item1Item2
majorminor
000.638298-1.427579
1-1.600822-0.778090
100.394099-0.999929
10.1841290.698347
200.4276920.559905
1-0.294556-0.167423
301.555046-1.992102
10.9337490.205092

自动去除了NaN一列

基础运算

import pandas as pd
import numpy as np

重新索引

Series

s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))
s
a    1
c    3
e    5
f    6
h    8
dtype: int64
s.reindex(list('abcdefgh'))
a     1
b   NaN
c     3
d   NaN
e     5
f     6
g   NaN
h     8
dtype: float64
s.reindex(list('abcdefgh'), fill_value=0)
a    1
b    0
c    3
d    0
e    5
f    6
g    0
h    8
dtype: int64
# method='bfill'
s.reindex(list('abcdefgh'), method='ffill')
a    1
b    1
c    3
d    3
e    5
f    6
g    6
h    8
dtype: int64

DataFrame

df = pd.DataFrame(np.random.randn(4, 6), index=list('ADFH'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df
onetwothreefourfivesix
A-0.049437-0.5264991.7806621.1547472.434957-1.579278
D-0.0752260.552163-0.462732-0.936051-0.5900410.484505
F1.4861680.7259070.598127-0.704809-2.815687-0.062462
H-0.900819-0.177751-0.2327960.234088-1.7585741.255955
df2 = df.reindex(index=list('ABCDEFGH'))
df2

默认也可以不写index,但columns得写

onetwothreefourfivesix
A-0.049437-0.5264991.7806621.1547472.434957-1.579278
BNaNNaNNaNNaNNaNNaN
CNaNNaNNaNNaNNaNNaN
D-0.0752260.552163-0.462732-0.936051-0.5900410.484505
ENaNNaNNaNNaNNaNNaN
F1.4861680.7259070.598127-0.704809-2.815687-0.062462
GNaNNaNNaNNaNNaNNaN
H-0.900819-0.177751-0.2327960.234088-1.7585741.255955

reindex是拷贝一份数据出来

df.reindex(columns=['one', 'three', 'five', 'seven'])
onethreefiveseven
A100.0000001.7806622.434957NaN
D-0.075226-0.462732-0.590041NaN
F1.4861680.598127-2.815687NaN
H-0.900819-0.232796-1.758574NaN
df.reindex(columns=['one', 'three', 'five', 'seven'], fill_value=0)
onethreefiveseven
A100.0000001.7806622.4349570
D-0.075226-0.462732-0.5900410
F1.4861680.598127-2.8156870
H-0.900819-0.232796-1.7585740
# fill method 只对行有效
df.reindex(columns=['one', 'three', 'five', 'seven'], method='ffill')
onethreefiveseven
A100.0000001.7806622.434957NaN
D-0.075226-0.462732-0.590041NaN
F1.4861680.598127-2.815687NaN
H-0.900819-0.232796-1.758574NaN
df.reindex(index=list('ABCDEFGH'), method='ffill')
onetwothreefourfivesix
A100.000000-0.5264991.7806621.1547472.434957-1.579278
B100.000000-0.5264991.7806621.1547472.434957-1.579278
C100.000000-0.5264991.7806621.1547472.434957-1.579278
D-0.0752260.552163-0.462732-0.936051-0.5900410.484505
E-0.0752260.552163-0.462732-0.936051-0.5900410.484505
F1.4861680.7259070.598127-0.704809-2.815687-0.062462
G1.4861680.7259070.598127-0.704809-2.815687-0.062462
H-0.900819-0.177751-0.2327960.234088-1.7585741.255955

丢弃部分数据

df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df
onetwothreefourfivesix
A-0.665415-0.0613670.0750580.626415-1.748458-0.608540
B-1.4551861.8466910.2342760.660298-2.169835-1.476485
C0.3222810.5053780.198458-0.831919-0.6307890.762524
D0.703684-0.8275970.1780630.108453-0.4189920.242912
df.drop('A')
onetwothreefourfivesix
B-1.4551861.8466910.2342760.660298-2.169835-1.476485
C0.3222810.5053780.198458-0.831919-0.6307890.762524
D0.703684-0.8275970.1780630.108453-0.4189920.242912
df2 = df.drop(['two', 'four'], axis=1)
df2
onethreefivesix
A-0.6654150.075058-1.748458-0.608540
B-1.4551860.234276-2.169835-1.476485
C0.3222810.198458-0.6307890.762524
D0.7036840.178063-0.4189920.242912

drop是是拷贝一份数据出来的

广播运算

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
ABC
one012
two345
three678
four91011
df.loc['one']
A    0
B    1
C    2
Name: one, dtype: int64
df - df.loc['one']
ABC
one000
two333
three666
four999

函数应用

  • apply: 将数据按行或列进行计算
  • applymap: 将数据按元素为进行计算
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
ABC
one012
two345
three678
four91011
# 每一列作为一个 Series 作为参数传递给 lambda 函数
df.apply(lambda x: x.max() - x.min())
A    9
B    9
C    9
dtype: int64
# 每一行作为一个 Series 作为参数传递给 lambda 函数
df.apply(lambda x: x.max() - x.min(), axis=1)
one      2
two      2
three    2
four     2
dtype: int64
# 返回多个值组成的 Series
def min_max(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
df.apply(min_max, axis=1)
minmax
one02
two35
three68
four911
# applymap: 逐元素运算
df = pd.DataFrame(np.random.randn(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df
ABC
one-1.126089-0.2865841.538841
two1.804348-0.709293-0.400643
three-1.008037-0.7916480.388505
four-0.0718270.659098-0.505030
formater = '{0:.02f}'.format
# formater = lambda x: '%.02f' % x
df.applymap(formater)

0表示第0个参数,formater的写法也是一种函数

ABC
one-1.13-0.291.54
two1.80-0.71-0.40
three-1.01-0.790.39
four-0.070.66-0.51

排序和排名

df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), index=list('ABCD'), columns=['one', 'two', 'three'])
df
onetwothree
A785
B264
C454
D621
df.sort_values(by='one')
onetwothree
B264
C454
D621
A785
s = pd.Series([3, 6, 2, 6, 4])
s.rank()
0    2.0
1    4.5
2    1.0
3    4.5
4    3.0
dtype: float64
s.rank(method='first', ascending=False)

method='first’表示并列时,先出现排前面
默认是average,还有一种是last

0    4
1    1
2    5
3    2
4    3
dtype: float64

DataFrame默认是按照每列进行排名,与sort_values不同

数据唯一性及成员资格

适用于 Series

s = pd.Series(list('abbcdabacad'))
s
0     a
1     b
2     b
3     c
4     d
5     a
6     b
7     a
8     c
9     a
10    d
dtype: object
s.unique()
array(['a', 'b', 'c', 'd'], dtype=object)
s.value_counts()
a    4
b    3
d    2
c    2
dtype: int64
s.isin(['a', 'b', 'c'])
0      True
1      True
2      True
3      True
4     False
5      True
6      True
7      True
8      True
9      True
10    False
dtype: bool
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值