A First Look at Python pandas

pandas is generally used for data analysis.

0. Importing pandas

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

1. Series

1.1 Creating a Series
obj = Series([4,7,-5,3])
> output:
> 0  4
> 1  7
> 2  -5
> 3  3

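# The index is on the left, the values on the right
print(obj.values)  # [ 4  7 -5  3]
print(obj.index)   # RangeIndex(start=0, stop=4, step=1); older pandas versions show Int64Index([0, 1, 2, 3])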

obj2 = Series([4,7,-5,3], index=['d','b','a','c'])

Create a Series directly from a Python dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)

# 'California' is not a key in sdata, so its value is NaN
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
print(obj4)
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
1.2 Detecting missing data
# The pandas functions isnull and notnull detect missing data;
# Series also has an isnull() instance method
pd.isnull(obj4)
pd.notnull(obj4)
obj4.isnull()
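
As a minimal sketch of how these checks are typically used, the boolean result can be summed to count missing entries or used as a mask to keep only the present ones. The data is the same dict as above; the variable names missing_mask, missing_count, and present are illustrative only.

```python
import pandas as pd

# Same population data as in the text; California has no entry in the dict.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

missing_mask = pd.isnull(obj4)            # boolean Series, True where the value is NaN
missing_count = int(missing_mask.sum())   # each True counts as 1
present = obj4[pd.notnull(obj4)]          # keep only the labels that have a value

print(missing_count)
print(list(present.index))
```
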
1.3 Series automatically aligns differently indexed data in arithmetic operations
print(obj3 + obj4)
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
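
The alignment rule can be checked with a short sketch: labels present in only one operand produce NaN, while Series.add with fill_value treats a label missing from one side as 0. The names total and filled are illustrative, and reindex is used here only to rebuild a Series with a missing California entry.

```python
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
# Reindexing drops Utah and introduces California as NaN.
obj4 = obj3.reindex(['California', 'Ohio', 'Oregon', 'Texas'])

# Plain addition: the result index is the union of both indexes,
# and any label missing on either side becomes NaN.
total = obj3 + obj4

# add() with fill_value substitutes 0 where only one side is missing;
# a label missing on both sides (California) still yields NaN.
filled = obj3.add(obj4, fill_value=0)

print(total)
print(filled)
```
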

obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)
> output:
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

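2. DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that all share the same index.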
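
One way to see the "dict of Series" view is to build a small DataFrame from two Series and pull a column back out. This is only a sketch; the two-row data and the names year, state, and col are made up for illustration.

```python
import pandas as pd

# Two Series with the same index, holding different value types.
year = pd.Series([2000, 2001], index=['one', 'two'])
state = pd.Series(['Ohio', 'Nevada'], index=['one', 'two'])

frame = pd.DataFrame({'year': year, 'state': state})

# Selecting a column returns a Series that shares the DataFrame's row index.
col = frame['state']
print(type(col).__name__)
print(col.index.equals(frame.index))
```
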

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
# If a sequence of columns is specified, the DataFrame's columns are arranged in that order.
DataFrame(data, columns=['year', 'state', 'pop'])

# If a requested column is not found in the data, it is filled with NA values
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
print(frame2)

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

# Select a row by its label with .loc
print(frame2.loc['three'])
#output 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
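
Rows can be selected either by label or by integer position. The sketch below uses the same data as above (without the debt column) and compares .loc with .iloc; the names by_label and by_position are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']   # label-based: the row whose index label is 'three'
by_position = frame2.iloc[2]     # position-based: the third row (0-indexed)

# Both selections return the same row as a Series named after the row label.
print(by_label.equals(by_position))
```
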

frame2['debt'] = 16.5
print(frame2)
# output
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

frame2['debt'] = np.arange(5.)
print(frame2)
# output
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4

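# When assigning a list or array to a column, its length must match the length of the DataFrame. If a Series is assigned, it is aligned exactly to the DataFrame's index, and any holes are filled with missing values.
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)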
# output
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
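
The two assignment rules can be contrasted in a small sketch: a list of the wrong length is rejected, while a Series is aligned by label. The three-row frame, the label 'zzz', and the flag length_error are hypothetical and exist only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A plain list must have exactly one value per row, otherwise pandas raises ValueError.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = False
except ValueError:
    length_error = True

# A Series is matched on index labels instead: 'two' receives its value,
# rows without a matching label get NaN, and the extra label 'zzz' is ignored.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'zzz'])
print(frame)
```
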

# Assigning to a column that does not exist creates a new column
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)
# output
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

# The del keyword removes a column
del frame2['eastern']
print(frame2.columns)
# output 
Index([year, state, pop, debt], dtype = object)
2. Nested dicts

If a nested dict is passed to DataFrame, the outer dict keys are interpreted as the columns and the inner keys as the row index.

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
print(frame3)
# output 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

# Transpose the result
print(frame3.T)
# output
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
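
As a sketch of how the inner keys become the row index, the same nested dict can be passed with an explicit index, in which case only those row labels are used and labels missing from the data become NaN. The row order produced without an explicit index may differ between pandas versions, so sort_index is called here to make it deterministic. The names populations, sorted_frame, and explicit are illustrative.

```python
import pandas as pd

populations = {'Nevada': {2001: 2.4, 2002: 2.9},
               'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# The inner keys are combined into the row index; sort it for a stable order.
sorted_frame = pd.DataFrame(populations).sort_index()

# With an explicit index, only the listed row labels are kept;
# 2003 appears in neither inner dict, so that row is all NaN.
explicit = pd.DataFrame(populations, index=[2001, 2002, 2003])

print(sorted_frame)
print(explicit)
```
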

3. Basic functionality

1. Reindexing with reindex
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)

# output
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

# reindex rearranges the data to match the new index; labels not present before get NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)

# output 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

# fill_value replaces the NaN that would otherwise be used for new labels
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
print(obj2)

# output
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

# method='ffill' forward-fills values for the new labels when reindexing ordered data
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3 = obj3.reindex(range(6), method='ffill')
print(obj3)

# output
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
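
The same method applies to a DataFrame, where either the rows or the columns can be reindexed. The following is a sketch with made-up data; the frame and the names rows_reindexed and cols_reindexed do not come from the text above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Passing a list reindexes the rows; the new label 'b' becomes a row of NaN.
rows_reindexed = frame.reindex(['a', 'b', 'c', 'd'])

# The columns keyword reindexes the columns; 'Utah' is new, so it is all NaN,
# and 'Ohio' is dropped because it is not in the requested list.
cols_reindexed = frame.reindex(columns=['Texas', 'Utah', 'California'])

print(rows_reindexed)
print(cols_reindexed)
```
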
2. Dropping entries from an axis
# drop returns a new object with the specified entries removed
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
print(new_obj)

# output
a    0
b    1
d    3
e    4
dtype: float64
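
For a DataFrame, entries can be dropped from either axis. The sketch below uses made-up data; the frame and the names rows_dropped and cols_dropped are illustrative. In both cases drop returns a new object and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# By default, drop removes row labels.
rows_dropped = frame.drop(['Colorado', 'Ohio'])

# The columns keyword removes column labels instead.
cols_dropped = frame.drop(columns=['two', 'four'])

print(rows_dropped)
print(cols_dropped)
```
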