Python Study Notes (4): Powerful Array Computing with Pandas

(First published: 2018-01-12 14:35:47; last updated: 2018-01-12 20:36:51)

[Reference links]:

Getting Started with Pandas

1. Getting started with pandas

1.1. Series

from pandas import Series, DataFrame
import pandas as pd
obj = Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

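In its interactive string representation, a Series shows the index on the left and the values on the right. Since we did not specify an index for the data, a default index of the integers 0 through N-1 (where N is the length of the data) is created. You can get the array representation and the index object of a Series through its values and index attributes, respectively.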

	obj.values
	obj.index
	obj[2]
array([ 4,  7, -5,  3])
RangeIndex(start=0, stop=4, step=1)
-5

The example above relied on the default index. More commonly, you create a Series with an explicit index that labels each data point:

	obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
	obj2
	obj2[2]
	obj2['a']
d    4
b    7
a   -5
c    3
dtype: int64
-5
-5

As you can see, both the default positional index [2] and the explicit label ['a'] retrieve the correct value.
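Recent pandas releases warn about using a bare integer such as obj2[2] for positional access on a label-indexed Series, so the two lookups above can be written more explicitly with iloc (position) and loc (label). A minimal sketch, reusing the obj2 values from above; the variable names by_position and by_label are only illustrative:

```python
import pandas as pd

# Same data as obj2 above
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_position = obj2.iloc[2]   # third element, regardless of the labels
by_label = obj2.loc['a']     # element whose label is 'a'

print(by_position, by_label)
```

Both expressions pick the same element here because the label 'a' happens to sit at position 2.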

obj2[obj2>0]
d    4
b    7
c    3
dtype: int64
'b' in obj2
True

If you have data in a Python dict, you can create a Series from it by passing the dict:

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

    Ohio      35000
    Oregon    16000
    Texas     71000
    Utah       5000
    dtype: int64
states = ['California','Ohio','Oregon','Texas'] 
obj4 = Series(sdata, index=states)
obj4
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64

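The three values found in sdata were placed in the appropriate positions, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. I use "missing" or "NA" to refer to missing data.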

	pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
	pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
	obj3 + obj4  # an important feature of Series: arithmetic automatically aligns data by index label
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
	obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
	obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
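The aligned sum obj3 + obj4 above yields NaN for every label that is missing from either operand. If that is not what you want, one option is the add method with fill_value. The sketch below rebuilds obj3 and obj4 from the same sdata; the names plain and filled are illustrative only:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Plain + leaves NaN wherever a label is missing on either side
plain = obj3 + obj4

# add() with fill_value treats a value missing on ONE side as 0.
# A label missing on BOTH sides (California) still ends up as NaN.
filled = obj3.add(obj4, fill_value=0)

print(plain)
print(filled)
```

With fill_value=0, 'Utah' keeps its value from obj3 instead of turning into NaN.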

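4.2.1.2 DataFrame
[Note]: Sorry about this: tables that display well in Jupyter came out garbled here, with a lot of stray characters, so what follows is harder to read. I am leaving it as is for now and will update it once I find the cause.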

	data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
	        'year': [2000, 2001, 2002, 2001, 2002],
	        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
	frame = DataFrame(data)
	frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

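As you can see, the resulting DataFrame, like a Series, gets its index assigned automatically, and the columns are placed in sorted order. If you specify a column order, however, the DataFrame's columns are arranged exactly in the order you pass: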

DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

As with Series, if you pass a column that is not contained in data, it appears as NA values in the result:

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
frame2
frame2.columns
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column of a DataFrame can be retrieved, in the same way as a Series value, either with dict-like notation or as an attribute:

frame2['state']
frame2.state
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

Rows can be retrieved by label or by position. The ix indexer originally used for this is deprecated; pandas asks you to use .loc for label-based indexing and .iloc for positional indexing (see http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated):

frame2
frame2.loc[2]
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object
import numpy as np
frame2['debt']=12
frame2
frame2['debt']=np.arange(5)
frame2
   year   state  pop  debt
0  2000    Ohio  1.5    12
1  2001    Ohio  1.7    12
2  2002    Ohio  3.6    12
3  2001  Nevada  2.4    12
4  2002  Nevada  2.9    12
   year   state  pop  debt
0  2000    Ohio  1.5     0
1  2001    Ohio  1.7     1
2  2002    Ohio  3.6     2
3  2001  Nevada  2.4     3
4  2002  Nevada  2.9     4
frame2.loc[2] = np.arange(4)  # overwrite an entire row
frame2
frame2.loc[2] = 2
frame2
   year   state  pop  debt
0  2000    Ohio  1.5     0
1  2001    Ohio  1.7     1
2     0       1  2.0     3
3  2001  Nevada  2.4     3
4  2002  Nevada  2.9     4
   year   state  pop  debt
0  2000    Ohio  1.5     0
1  2001    Ohio  1.7     1
2     2       2  2.0     2
3  2001  Nevada  2.4     3
4  2002  Nevada  2.9     4

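When you assign a list or an array to a column, its length must match the length of the DataFrame. If you assign a Series, its values are aligned to the DataFrame's index by exact label match, and missing values are inserted wherever there is no match: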

val = Series([-1.2, -1.5, -1.7], index=[1, 3, 4])
frame2
frame2['debt'] = val
frame2
   year   state  pop  debt
0  2000    Ohio  1.5     0
1  2001    Ohio  1.7     1
2     2       2  2.0     2
3  2001  Nevada  2.4     3
4  2002  Nevada  2.9     4
   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7  -1.2
2     2       2  2.0   NaN
3  2001  Nevada  2.4  -1.5
4  2002  Nevada  2.9  -1.7

Assigning to a column that does not exist creates a new column. As with a dict, the del keyword deletes a column:

frame2['eastern'] = frame2.state == 'Ohio'
frame2
   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7  -1.2     True
2     2       2  2.0   NaN    False
3  2001  Nevada  2.4  -1.5    False
4  2002  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2
   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7  -1.2
2     2       2  2.0   NaN
3  2001  Nevada  2.4  -1.5
4  2002  Nevada  2.9  -1.7
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
# When this nested dict is passed to DataFrame, the outer keys become the column labels and the inner keys become the row labels:
pop
frame3 = DataFrame(pop)
frame3
{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
frame3
frame3.index.name = 'year'; frame3.columns.name = 'state'  # name the index and the columns so the names are displayed
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

The values attribute returns the data contained in a DataFrame as a two-dimensional ndarray:

frame3.values
frame2
frame2.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7  -1.2
2     2       2  2.0   NaN
3  2001  Nevada  2.4  -1.5
4  2002  Nevada  2.9  -1.7
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2, 2, 2.0, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

4.2.1.3 Index objects

When you construct a Series or DataFrame, any array or other sequence of labels you pass is converted internally into an Index:

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index = pd.Index(np.arange(3))
index
obj2 = Series([1.5, -2.5, 0], index=['a', 'b', 'c'])
obj2.index 
index
obj2 = Series([1.5, -2.5, 0], index=[np.arange(3)])
obj2.index
obj2.index is index
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index
index
obj2.index is index
Int64Index([0, 1, 2], dtype='int64')
Index(['a', 'b', 'c'], dtype='object')
Int64Index([0, 1, 2], dtype='int64')
Int64Index([0, 1, 2], dtype='int64')
False
Int64Index([0, 1, 2], dtype='int64')
Int64Index([0, 1, 2], dtype='int64')
True

Besides being array-like, an Index also behaves like a fixed-size set:

frame3
'Ohio' in frame3.columns
'2002' in frame3.index
2002 in frame3.index
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
True
False
True
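The set-like behavior goes beyond the in test shown above: an Index also offers set operations. The sketch below uses two made-up label sets (left and right are hypothetical names chosen for illustration, not taken from the examples above):

```python
import pandas as pd

# Two hypothetical label sets, chosen only for illustration
left = pd.Index(['Ohio', 'Texas', 'Utah'])
right = pd.Index(['Oregon', 'Texas', 'Utah'])

has_ohio = 'Ohio' in left             # membership test, as with a set
common = left.intersection(right)     # labels present in both
combined = left.union(right)          # labels present in either
only_left = left.difference(right)    # labels in left but not in right

print(has_ohio)
print(list(common), list(combined), list(only_left))
```

Each operation returns a new Index rather than modifying left or right.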

4.2.2 Essential functionality

4.2.2.1 Reindexing

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on a Series rearranges the data to conform to the new index, introducing missing values for any index label that was not already present:

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

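For ordered data such as time series, you may want to interpolate or fill values when reindexing. The method option lets you do this, for example with ffill, which fills values forward: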

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')  # forward fill: [1] takes its value from [0], [3] from [2]
obj3.reindex(range(6), method='bfill')  # backward fill: [1] takes its value from [2], [3] from [4]
obj3
0      blue
2    purple
4    yellow
dtype: object

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

0      blue
2    purple
4    yellow
dtype: object

For a DataFrame, reindex can change the (row) index, the columns, or both. When you pass only a single sequence, it is the rows of the result that are reindexed:

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
frame2 = frame.reindex(columns=['Ohio1', 'Texas', 'California'])
frame2
states
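frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)  # current pandas raises KeyError if .loc is given labels that do not exist, so reindex is used here
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
   Ohio1  Texas  California
a    NaN      1           2
c    NaN      4           5
d    NaN      7           8
['California', 'Ohio', 'Oregon', 'Texas']
   California  Ohio  Oregon  Texas
a         2.0   0.0     NaN    1.0
b         NaN   NaN     NaN    NaN
c         5.0   3.0     NaN    4.0
d         8.0   6.0     NaN    7.0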

4.2.3 Dropping entries from an axis

The drop method returns a new object with the specified value or values removed from an axis:

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
new_obj = obj.drop('c')
new_obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
# For a DataFrame, index values can be dropped from either axis:
data = DataFrame(np.arange(16).reshape((4, 4)),index=['Ohio', 'Colorado', 'Utah', 'New York'],columns=['one', 'two', 'three', 'four'])
data
data.drop(['Colorado', 'Ohio'])
data
data.drop('two', axis=1)
data.drop(['two','four'], axis=1)
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
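Because drop returns a new object, data itself is unchanged after each call above, which is why it prints in full again between the drops. A small sketch that makes this explicit; without_two and without_rows are illustrative names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# drop() returns a new object; data itself is left untouched
without_two = data.drop('two', axis=1)           # drop a column
without_rows = data.drop(['Colorado', 'Ohio'])   # drop rows (default axis=0)

print(list(without_two.columns))
print(list(without_rows.index))
print(data.shape)   # still 4 rows by 4 columns
```

To keep the result, assign it to a name as done here.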

4.2.3.1 Indexing, selection, and filtering

Besides its index labels, a Series can also be indexed with plain integers:

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b'] 
obj[1]
obj[2:4]
obj[['b', 'a', 'd']]
obj[[1, 3]]
obj[obj < 2]

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

1.0

1.0

c    2.0
d    3.0
dtype: float64

b    1.0
a    0.0
d    3.0
dtype: float64

b    1.0
d    3.0
dtype: float64

a    0.0
b    1.0
dtype: float64
obj['b':'c']  # slicing with labels differs from normal Python slicing: the endpoint is included
b    1.0
c    2.0
dtype: float64
obj['b':'c'] = 5  # assign to the slice
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
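The difference between label slices and positional slices noted above is easy to check side by side. A minimal sketch using loc and iloc on the same Series values; by_label and by_position are illustrative names:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]    # positional slice: the endpoint 2 is excluded

print(list(by_label.index))     # ['b', 'c']
print(list(by_position.index))  # ['b']
```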
data = DataFrame(np.arange(16).reshape((4, 4)),['Ohio', 'Colorado', 'Utah', 'New York'],columns=['one', 'two', 'three', 'four'])
data
# Indexing into a DataFrame retrieves one or more columns, using either a single value or a sequence:
data['two'] 
data[['three', 'one']]
data['Ohio':'New York']
#data['Ohio','New York']
#data[['Ohio','New York'],axis=1]
data[:2]

data[data['three'] > 5]


          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data <5
data
data[data <5]
data[data <5]=0
data
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Ohio      0.0  1.0    2.0   3.0
Colorado  4.0  NaN    NaN   NaN
Utah      NaN  NaN    NaN   NaN
New York  NaN  NaN    NaN   NaN
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data
# .ix is deprecated: use .loc for label-based indexing and .iloc for positional indexing
# (see http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated)
data.loc['Colorado']
data.loc['Colorado', ['two', 'four']]
data.loc[['Colorado', 'Utah'], data.columns[[2, 1, 3]]]
data.iloc[2]
data.loc[:'Utah', 'one']
data.loc[data.three > 5, data.columns[:3]]

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64
two     5
four    7
Name: Colorado, dtype: int64
          three  two  four
Colorado      6    5     7
Utah         10    9    11
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
Ohio        0
Colorado    0
Utah        8
Name: one, dtype: int64
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

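4.2.1.5 Arithmetic and data alignment

One of the most important features of pandas is how arithmetic behaves between objects with different indexes. When you add objects together and any index pairs differ, the index of the result is the union of the two indexes. Let's look at a simple example: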

dataframe1 = DataFrame(np.arange(9).reshape((3, 3)),index=['Ohio', 'Colorado', 'Utah'],columns=list('bcd'))
dataframe1
dataframe2 = DataFrame(np.arange(12).reshape((4, 3)),index=['beijing', 'Colorado', 'shanghai','Utah'],columns=list('abd'))
dataframe2
          b  c  d
Ohio      0  1  2
Colorado  3  4  5
Utah      6  7  8
          a   b   d
beijing   0   1   2
Colorado  3   4   5
shanghai  6   7   8
Utah      9  10  11
dataframe=dataframe1+dataframe2
dataframe
dataframe.reindex(columns=dataframe.columns, fill_value=0)  # fill_value had no effect here; why?

           a     b   c     d
Colorado NaN   7.0 NaN  10.0
Ohio     NaN   NaN NaN   NaN
Utah     NaN  16.0 NaN  19.0
beijing  NaN   NaN NaN   NaN
shanghai NaN   NaN NaN   NaN
           a     b   c     d
Colorado NaN   7.0 NaN  10.0
Ohio     NaN   NaN NaN   NaN
Utah     NaN  16.0 NaN  19.0
beijing  NaN   NaN NaN   NaN
shanghai NaN   NaN NaN   NaN
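A likely answer to the fill_value question above: reindex applies fill_value only to labels that the reindexing newly introduces. Here the requested columns are exactly the ones dataframe already has, so nothing is new and the existing NaN values are left alone. Two possible ways to get zeros instead are sketched below; summed and zeroed are illustrative names, and which option fits depends on what the NaN values should mean:

```python
import numpy as np
import pandas as pd

dataframe1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                          index=['Ohio', 'Colorado', 'Utah'],
                          columns=list('bcd'))
dataframe2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                          index=['beijing', 'Colorado', 'shanghai', 'Utah'],
                          columns=list('abd'))

# Option 1: treat a value missing from ONE operand as 0 during the addition.
# Cells missing from BOTH operands (for example Ohio/a) still come out as NaN.
summed = dataframe1.add(dataframe2, fill_value=0)

# Option 2: add first, then replace every NaN in the result with 0.
zeroed = (dataframe1 + dataframe2).fillna(0)

print(summed)
print(zeroed)
```

Option 1 keeps the values that exist in only one frame, while option 2 discards them and writes 0 into every cell that was not present in both.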
# Broadcasting
arr= np.arange(12).reshape(3,4)
arr
arr[0]
arr-arr[0]
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

array([0, 1, 2, 3])

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])
dataframe2 = DataFrame(np.arange(12).reshape((4, 3)),index=['beijing', 'Colorado', 'shanghai','Utah'],columns=list('abd'))
dataframe2
series = dataframe2.loc['beijing']
series
#series = dataframe2.loc[0]  # raises an error; why?
#series
dataframe2 - series  # if an index value is found in neither the DataFrame's columns nor the Series's index, the objects are reindexed to form the union
          a   b   d
beijing   0   1   2
Colorado  3   4   5
shanghai  6   7   8
Utah      9  10  11
a    0
b    1
d    2
Name: beijing, dtype: int64
          a  b  d
beijing   0  0  0
Colorado  3  3  3
shanghai  6  6  6
Utah      9  9  9
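As for the commented-out dataframe2.loc[0] above: loc is label-based, and this frame's row labels are strings, so there is no row labeled 0. Selecting the first row by position is what iloc is for. A minimal sketch; loc_failed and first_row are illustrative names, and the exact exception type is assumed to be KeyError or TypeError depending on the pandas version:

```python
import numpy as np
import pandas as pd

dataframe2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                          index=['beijing', 'Colorado', 'shanghai', 'Utah'],
                          columns=list('abd'))

# loc looks up the LABEL 0, which does not exist in the string index
try:
    dataframe2.loc[0]
    loc_failed = False
except (KeyError, TypeError):
    loc_failed = True

# iloc looks up the POSITION 0, i.e. the first row ('beijing')
first_row = dataframe2.iloc[0]

print(loc_failed)
print(first_row.name, list(first_row))
```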
series2 = Series(range(3),index=list('bde'))
series2
dataframe2
dataframe2 -series2
b    0
d    1
e    2
dtype: int64
          a   b   d
beijing   0   1   2
Colorado  3   4   5
shanghai  6   7   8
Utah      9  10  11
            a     b     d   e
beijing   NaN   1.0   1.0 NaN
Colorado  NaN   4.0   4.0 NaN
shanghai  NaN   7.0   7.0 NaN
Utah      NaN  10.0  10.0 NaN
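# Above, the Series was matched against the DataFrame's columns and applied to each row. To match on the rows and broadcast across the columns instead:
series3 = dataframe2['d']
series3
dataframe2
dataframe2 - series3
dataframe2.sub(series3, axis=0)  # the Series is broadcast across the columns: the column (2, 5, 8, 11) is subtracted from every column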
beijing      2
Colorado     5
shanghai     8
Utah        11
Name: d, dtype: int64
          a   b   d
beijing   0   1   2
Colorado  3   4   5
shanghai  6   7   8
Utah      9  10  11
          Colorado  Utah   a   b  beijing   d  shanghai
beijing        NaN   NaN NaN NaN      NaN NaN       NaN
Colorado       NaN   NaN NaN NaN      NaN NaN       NaN
shanghai       NaN   NaN NaN NaN      NaN NaN       NaN
Utah           NaN   NaN NaN NaN      NaN NaN       NaN
          a  b  d
beijing  -2 -1  0
Colorado -2 -1  0
shanghai -2 -1  0
Utah     -2 -1  0
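The all-NaN result of dataframe2 - series3 above comes from matching the Series's row labels against the DataFrame's column labels, which have nothing in common. Passing axis=0 to sub matches on the row labels instead. The sketch below checks that this is the same as plain NumPy broadcasting with the column reshaped to (4, 1); by_rows and expected are illustrative names:

```python
import numpy as np
import pandas as pd

dataframe2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                          index=['beijing', 'Colorado', 'shanghai', 'Utah'],
                          columns=list('abd'))
series3 = dataframe2['d']

# Match the Series on the row labels (axis=0) and broadcast across the columns
by_rows = dataframe2.sub(series3, axis=0)

# Equivalent NumPy broadcasting: reshape the column to (4, 1) before subtracting
expected = dataframe2.values - series3.values[:, np.newaxis]

print(np.array_equal(by_rows.values, expected))
print(by_rows)
```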