
# 1. Series

Series: the spear of pandas (a single column or row of a data table, an observation vector, a one-dimensional array...)


import numpy as np

import pandas as pd

Series1 = pd.Series(np.random.randn(4))

print Series1,type(Series1)

print Series1.index

print Series1.values



0   -0.676256

1    0.533014

2   -0.935212

3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Int64Index([0, 1, 2, 3], dtype='int64')

[-0.67625578  0.53301431 -0.93521212 -0.94082195]


## Series supports filtering on the same principle as NumPy


print Series1>0

print Series1[Series1>0]



0     True

1     True

2    False

3    False

dtype: bool

0    0.030480

1    0.072746

dtype: float64
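To make the filtering idea concrete, here is a minimal sketch, assuming Python 3 and a recent pandas. It uses the fixed values printed in this section instead of random draws, so the result is reproducible; the variable names are only illustrative.

```python
import pandas as pd

# Fixed values copied from the printed output in this section.
s = pd.Series([0.030480, 0.072746, -0.186607, -1.412244])

mask = s > 0           # element-wise comparison gives a boolean Series
positive = s[mask]     # keep only the rows where the mask is True

# Conditions combine with & and |, each wrapped in parentheses.
near_zero = s[(s > -1) & (s < 0.05)]

print(mask.tolist())             # [True, True, False, False]
print(positive.index.tolist())   # [0, 1]
print(near_zero.index.tolist())  # [0, 2]
```

The original row labels survive the filter, which is the main difference from filtering a bare NumPy array.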



print Series1*2

print Series1+5


0    0.060961

1    0.145492

2   -0.373215

3   -2.824489

dtype: float64

0 5.030480

1 5.072746

2 4.813393

3 3.587756

dtype: float64


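## And universal functions

numpy.frompyfunc(func, nin, nout) returns a universal function (ufunc) that wraps the Python function func; nin is the number of input arguments and nout is the number of objects the function returns.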
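A small sketch of both cases, assuming Python 3 and a recent pandas: a built-in ufunc applied directly to a Series, and a plain Python function wrapped with frompyfunc. The helper `clip_to_five` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A built-in ufunc applied to a Series returns a Series with the same index.
roots = np.sqrt(s)

# Hypothetical helper: a plain Python function with 1 input and 1 output.
def clip_to_five(x):
    return min(x, 5.0)

# frompyfunc(func, nin, nout) turns it into a ufunc.
clip_ufunc = np.frompyfunc(clip_to_five, 1, 1)

# Applied to the underlying array; the result has dtype object.
clipped = clip_ufunc(s.to_numpy())

print(roots.tolist())                  # [1.0, 2.0, 3.0]
print(clipped.astype(float).tolist())  # [1.0, 4.0, 5.0]
```

Because frompyfunc always produces object-dtype output, converting back with `astype` is usually needed before further numeric work.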

## Use row labels on the Series itself instead of building a two-column table, so it is easy to tell which part is data and which is metadata


Series2 = pd.Series(Series1.values,index=['norm_'+unicode(i) for i in xrange(4)])

print Series2,type(Series2)

print Series2.index

print type(Series2.index)

print Series2.values



norm_0   -0.676256

norm_1    0.533014

norm_2   -0.935212

norm_3   -0.940822

dtype: float64 <class 'pandas.core.series.Series'>

Index([u'norm_0', u'norm_1', u'norm_2', u'norm_3'], dtype='object')

<class 'pandas.core.index.Index'>

[-0.67625578  0.53301431 -0.93521212 -0.94082195]


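(Of course, a Series is not exactly like an Ordered Dict, because row labels may even repeat. Duplicate row labels are discouraged, but that does not mean they cannot be used.)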


print Series2[['norm_0','norm_3']]



norm_0   -0.676256

norm_3   -0.940822

dtype: float64



print 'norm_0' in Series2

print 'norm_6' in Series2



True

False


## Define a Series from an Ordered Dict with unique keys, or from a plain Dict, and you need not worry about duplicate row labels:


Series3_Dict = {"Japan":"Tokyo","S.Korea":"Seoul","China":"Beijing"}

Series3_pdSeries = pd.Series(Series3_Dict)

print Series3_pdSeries

print Series3_pdSeries.values

print Series3_pdSeries.index



China Beijing

Japan Tokyo

S.Korea Seoul

dtype: object

['Beijing' 'Tokyo' 'Seoul']

Index([u'China', u'Japan', u'S.Korea'], dtype='object')



Series4_IndexList = ["Japan","China","Singapore","S.Korea"]

Series4_pdSeries = pd.Series( Series3_Dict ,index = Series4_IndexList)

print Series4_pdSeries

print Series4_pdSeries.values

print Series4_pdSeries.index

print Series4_pdSeries.isnull()

print Series4_pdSeries.notnull()



print Series4_pdSeries.name

print Series4_pdSeries.index.name



Series4_pdSeries.name = "Capital Series"

Series4_pdSeries.index.name = "Nation"

print Series4_pdSeries



Nation

Japan Tokyo

China Beijing

Singapore NaN

S.Korea Seoul

Name: Capital Series, dtype: object
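The isnull() and notnull() results are not printed above. The following sketch, assuming Python 3 and a recent pandas, shows what they detect when the requested index contains a label that the dict lacks. The variable names are illustrative.

```python
import pandas as pd

capitals = {"Japan": "Tokyo", "S.Korea": "Seoul", "China": "Beijing"}
wanted = ["Japan", "China", "Singapore", "S.Korea"]

# "Singapore" is not a key of the dict, so its value becomes NaN.
s = pd.Series(capitals, index=wanted)

is_missing = s.isnull()
missing_labels = s[is_missing].index.tolist()

print(is_missing.tolist())   # [False, False, True, False]
print(missing_labels)        # ['Singapore']
```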


"A dictionary"? Not quite: row index labels may repeat, although this is not recommended.


Series5_IndexList = ['A','B','B','C']

Series5 = pd.Series(Series1.values,index = Series5_IndexList)

print Series5

print Series5[['B','A']]



A 0.030480

B 0.072746

B -0.186607

C -1.412244

dtype: float64

B 0.072746

B -0.186607

A 0.030480

dtype: float64
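One practical consequence of duplicate labels is that the return type of a lookup depends on the label. A minimal sketch, assuming Python 3 and a recent pandas, with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["A", "B", "B", "C"])

one = s["A"]     # unique label: a scalar
many = s["B"]    # duplicated label: a Series holding both rows

print(s.index.is_unique)              # False
print(s.index.duplicated().tolist())  # [False, False, True, False]
print(type(many).__name__)            # Series
```

Checking `is_unique` before label-based lookups is a cheap way to avoid this surprise.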


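# 2. DataFrame

DataFrame: the warhammer of pandas (a data table, a two-dimensional array)

An ordered collection of Series, as convenient as R's data.frame.

## Define from a NumPy 2-D array, from a file, or from a database: the data may be good, but do not forget the column names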


dataNumPy = np.asarray([('Japan','Tokyo',4900),('S.Korea','Seoul',1300),('China','Beijing',9100)])

DF1 = pd.DataFrame(dataNumPy,columns=['nation','capital','GDP'])

DF1


## Equal-length column data stored in a dictionary (JSON-like): unfortunately, dictionary keys are unordered



DF2


    GDP  capital   nation

0  4900    Tokyo    Japan

1  1300    Seoul  S.Korea

2  9100  Beijing    China

PS: I was too lazy to paste screenshots here, so the table has no border lines.
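The statement that builds DF2 does not appear in this section. Below is a hypothetical reconstruction that is consistent with the printed table, assuming Python 3.7+ and a recent pandas. The names `data_dict`, `df2` and `df2_ordered` are my own. Note that in such an environment the dict keeps its insertion order and pandas follows it, so the alphabetical column order shown above reflects an older setup.

```python
import pandas as pd

# Hypothetical reconstruction: equal-length columns stored in a dict.
data_dict = {
    "nation": ["Japan", "S.Korea", "China"],
    "capital": ["Tokyo", "Seoul", "Beijing"],
    "GDP": [4900, 1300, 9100],
}

df2 = pd.DataFrame(data_dict)

# Passing columns= fixes the order explicitly, whatever the dict ordering is.
df2_ordered = pd.DataFrame(data_dict, columns=["nation", "capital", "GDP"])

print(df2_ordered)
```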

## Define a DataFrame from another DataFrame: ah, my OCD is acting up!


DF21 = pd.DataFrame(DF2,columns=['nation','capital','GDP'])

DF21



DF22 = pd.DataFrame(DF2,columns=['nation','capital','GDP'],index = [2,0,1])

DF22



nation capital GDP

2 China Beijing 9100

0 Japan Tokyo 4900

1 S.Korea Seoul 1300


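## Selecting a column from a DataFrame? Two ways (exactly the same as JavaScript!)

OMG, how embarrassing: I had almost forgotten JS syntax, but now I remember that an object's property can be accessed either as obj.x or as obj['x'].

• The '.' form easily clashes with reserved keywords and with existing attribute or method names

• The '[ ]' form is the safest.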
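The clash mentioned in the bullets is easy to reproduce. A minimal sketch, assuming Python 3 and a recent pandas, with a made-up frame whose second column is deliberately named like a DataFrame method:

```python
import pandas as pd

df = pd.DataFrame({"nation": ["Japan", "China"], "count": [1, 2]})

by_bracket = df["nation"]   # always works
by_dot = df.nation          # works because "nation" is not otherwise in use

# "count" is also a DataFrame method, so the dot form returns the method,
# not the column.
print(callable(df.count))     # True
print(df["count"].tolist())   # [1, 2]
```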

## Selecting rows from a DataFrame? (At least) two ways:

• Methods 1 and 2:


print DF22[0:1] # a slice actually returns a DataFrame

print DF22.ix[0] # returns the row whose index label is 0; **ix** feels great.



nation  capital   GDP

2  China  Beijing  9100

nation     Japan

capital    Tokyo

GDP         4900

Name: 0, dtype: object

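• Method 3, the ultimate move, works just like NumPy slicing: iloc


print DF22.iloc[0,:]    # the first argument selects rows, the second selects columns; here: row 0, all columns

print DF22.iloc[:,0]    # likewise, here: all rows, column 0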



nation       China

capital    Beijing

GDP           9100

Name: 2, dtype: object

2      China

0      Japan

1    S.Korea

Name: nation, dtype: object
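For readers on current pandas, where `.ix` has been removed (since version 1.0), the same three lookups can be written with `.loc` for labels and `.iloc` for positions. This is a sketch assuming Python 3 and a recent pandas; the frame is rebuilt here with the values from the output above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "nation": ["China", "Japan", "S.Korea"],
        "capital": ["Beijing", "Tokyo", "Seoul"],
        "GDP": [9100, 4900, 1300],
    },
    index=[2, 0, 1],
)

by_label = df.loc[0]        # row whose index label is 0
by_position = df.iloc[0]    # first row by position
first_as_frame = df[0:1]    # positional slice, still a DataFrame

print(by_label["nation"])     # Japan
print(by_position["nation"])  # China
```

Keeping label-based and position-based access in separate accessors removes the ambiguity that `.ix` had with integer indexes.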


## Add columns dynamically; this cannot be done with ".", only with "[]"


DF22['population'] = [1600,130,55]

DF22



nation    capital    GDP    population

2    China    Beijing    9100    1600

0    Japan    Tokyo    4900    130

1    S.Korea    Seoul    1300    55
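A short sketch of why only the bracket form works for new columns, assuming Python 3 and a recent pandas. The attribute name `continent` is hypothetical and chosen only for the demonstration.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"nation": ["China", "Japan", "S.Korea"]}, index=[2, 0, 1])

# Bracket assignment creates a real column (values from the example above).
df["population"] = [1600, 130, 55]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    # Dot assignment only sets a plain Python attribute, not a column.
    df.continent = ["Asia", "Asia", "Asia"]

print(list(df.columns))   # ['nation', 'population']
```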


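# 3. Index: row-level index

Index: the joker in the pandas data-manipulation deck (row-level index)

A row-level index is:

• metadata

• possibly derived from real data, so it can also be treated as data

• possibly a MultiIndex, that is, a combination of several columns

• interchangeable with column names, and can be stacked and unstacked to achieve the effect of an Excel pivot table

There are four kinds of Index... oh no, many kinds. Some important index types include:

• pd.Index (plain)

• Int64Index (numeric index)

• MultiIndex (multi-level index, described in more detail under data manipulation)

• DatetimeIndex (timestamps as the index)

• PeriodIndex (time periods with a frequency as the index)

## Define a plain Index directly; it looks just like an ordinary Series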


index_names = ['a','b','c']

Series_for_Index = pd.Series(index_names)

print pd.Index(index_names)

print pd.Index(Series_for_Index)



Index([u'a', u'b', u'c'], dtype='object')

Index([u'a', u'b', u'c'], dtype='object')



index_names = ['a','b','c']

index0 = pd.Index(index_names)

print index0.get_values()

index0[2] = 'd'



['a' 'b' 'c']

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-36-f34da0a8623c> in <module>()

2 index0 = pd.Index(index_names)

3 print index0.get_values()

----> 4 index0[2] = 'd'

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in __setitem__(self, key, value)

1055

1056     def __setitem__(self, key, value):

-> 1057         raise TypeError("Indexes does not support mutable operations")

1058

1059     def __getitem__(self, key):

TypeError: Indexes does not support mutable operations
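The traceback shows that an Index is immutable. The usual workaround is to build a new Index rather than change the existing one. A minimal sketch, assuming Python 3 and a recent pandas:

```python
import pandas as pd

index0 = pd.Index(["a", "b", "c"])

raised = False
try:
    index0[2] = "d"          # in-place assignment is rejected
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Build a new Index instead of mutating the old one.
index1 = index0[:2].append(pd.Index(["d"]))

print(index0.tolist())   # ['a', 'b', 'c']
print(index1.tolist())   # ['a', 'b', 'd']
```

Immutability is what lets several Series or DataFrames share one Index object safely.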


## Pass in a list of tuples and you get a MultiIndex


multi1 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(4) for y in xrange(4)])

multi1.name = ['index1','index2']

print multi1



MultiIndex(levels=[[u'Row_1', u'Row_2', u'Row_3', u'Row_4'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])
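The printed MultiIndex above carries no level names. If named levels are wanted, one option is the explicit constructor with its `names` argument (plural). This is a sketch assuming Python 3 and a recent pandas, reduced to a 2 x 2 grid to keep the output short.

```python
import pandas as pd

tuples = [
    ("Row_%d" % (x + 1), "Col_%d" % (y + 1))
    for x in range(2)
    for y in range(2)
]

# from_tuples makes the intent explicit and names both levels at once.
multi = pd.MultiIndex.from_tuples(tuples, names=["index1", "index2"])

print(list(multi.names))   # ['index1', 'index2']
print(multi.nlevels)       # 2
print(multi[0])            # ('Row_1', 'Col_1')
```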


## For a Series, once it has a MultiIndex: data, transform!

• A Series with a two-level MultiIndex can be unstack()-ed into a DataFrame

• A DataFrame can be stack()-ed into a Series with a two-level MultiIndex


data_for_multi1 = pd.Series(xrange(0,16),index=multi1)

data_for_multi1



Row_1  Col_1     0

Col_2     1

Col_3     2

Col_4     3

Row_2  Col_1     4

Col_2     5

Col_3     6

Col_4     7

Row_3  Col_1     8

Col_2     9

Col_3    10

Col_4    11

Row_4  Col_1    12

Col_2    13

Col_3    14

Col_4    15

dtype: int32


### A Series with a two-level MultiIndex can be unstack()-ed into a DataFrame


data_for_multi1.unstack()


### A DataFrame can be stack()-ed into a Series with a two-level MultiIndex


data_for_multi1.unstack().stack()



Row_1  Col_1     0

Col_2     1

Col_3     2

Col_4     3

Row_2  Col_1     4

Col_2     5

Col_3     6

Col_4     7

Row_3  Col_1     8

Col_2     9

Col_3    10

Col_4    11

Row_4  Col_1    12

Col_2    13

Col_3    14

Col_4    15

dtype: int32
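Since the intermediate DataFrame is not printed above, here is a reduced round trip on a 2 x 2 grid, as a sketch assuming Python 3 and a recent pandas. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

multi = pd.MultiIndex.from_product([["Row_1", "Row_2"], ["Col_1", "Col_2"]])
s = pd.Series(np.arange(4), index=multi)

wide = s.unstack()          # inner level becomes the columns: a 2 x 2 frame
long_again = wide.stack()   # columns move back into the inner index level

print(wide.shape)                    # (2, 2)
print(wide.loc["Row_2", "Col_1"])    # 2
print(long_again.tolist())           # [0, 1, 2, 3]
```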


## An example with unbalanced data:


multi2 = pd.Index([('Row_'+str(x+1),'Col_'+str(y+1)) for x in xrange(5) for y in xrange(x)])

multi2



MultiIndex(levels=[[u'Row_2', u'Row_3', u'Row_4', u'Row_5'], [u'Col_1', u'Col_2', u'Col_3', u'Col_4']],

labels=[[0, 1, 1, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]])


data_for_multi2 = pd.Series(np.arange(10),index = multi2)

data_for_multi2



Row_2  Col_1    0

Row_3  Col_1    1

Col_2    2

Row_4  Col_1    3

Col_2    4

Col_3    5

Row_5  Col_1    6

Col_2    7

Col_3    8

Col_4    9

dtype: int32
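What makes unbalanced data interesting is what happens on unstack(): combinations that do not exist are filled with NaN. A smaller sketch of the same construction, assuming Python 3 and a recent pandas:

```python
import numpy as np
import pandas as pd

# Row k has k-1 columns, so the grid is triangular (6 entries in total).
tuples = [
    ("Row_%d" % (x + 1), "Col_%d" % (y + 1))
    for x in range(4)
    for y in range(x)
]
s = pd.Series(np.arange(len(tuples)), index=pd.MultiIndex.from_tuples(tuples))

wide = s.unstack()   # missing (row, column) pairs become NaN

print(wide.shape)                        # (3, 3)
print(int(wide.isnull().sum().sum()))    # 3
```

Because NaN is a float, the unstacked frame is no longer integer-typed.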


## The datetime standard library is so handy, you deserve it


import datetime

dates = [datetime.datetime(2015,1,1),datetime.datetime(2015,1,8),datetime.datetime(2015,1,30)]

pd.DatetimeIndex(dates)



DatetimeIndex(['2015-01-01', '2015-01-08', '2015-01-30'], dtype='datetime64[ns]', freq=None, tz=None)

### If you need not only a uniform time format but also a uniform time frequency


periodindex1 = pd.period_range('2015-01','2015-04',freq='M')

print periodindex1



PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04'], dtype='int64', freq='M')


### How to convert between monthly and daily precision?


print periodindex1.asfreq('D',how='start')

print periodindex1.asfreq('D',how='end')



PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30'], dtype='int64', freq='D')


### Finally, how do I actually align the time precision of the two frequencies?


periodindex_mon = pd.period_range('2015-01','2015-03',freq='M').asfreq('D',how='start')

periodindex_day = pd.period_range('2015-01-01','2015-03-31',freq='D')

print periodindex_mon

print periodindex_day



PeriodIndex(['2015-01-01', '2015-02-01', '2015-03-01'], dtype='int64', freq='D')

PeriodIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',

'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',

'2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',

'2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',

'2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',

'2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',

'2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',

'2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',

'2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',

'2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',

'2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',

'2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',

'2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',

'2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',

'2015-02-26', '2015-02-27', '2015-02-28', '2015-03-01',

'2015-03-02', '2015-03-03', '2015-03-04', '2015-03-05',

'2015-03-06', '2015-03-07', '2015-03-08', '2015-03-09',

'2015-03-10', '2015-03-11', '2015-03-12', '2015-03-13',

'2015-03-14', '2015-03-15', '2015-03-16', '2015-03-17',

'2015-03-18', '2015-03-19', '2015-03-20', '2015-03-21',

'2015-03-22', '2015-03-23', '2015-03-24', '2015-03-25',

'2015-03-26', '2015-03-27', '2015-03-28', '2015-03-29',

'2015-03-30', '2015-03-31'],

dtype='int64', freq='D')


### Coarse-grained data + reindex + ffill/bfill


full_ts = pd.Series(periodindex_mon,index=periodindex_mon).reindex(periodindex_day,method='ffill')

full_ts
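The result of the reindex is not shown above. The sketch below illustrates the same idea under stated assumptions: Python 3, a recent pandas, and a DatetimeIndex swapped in for the PeriodIndex because its forward-fill behaviour is easy to verify. The monthly values 10, 20, 30 are made up.

```python
import pandas as pd

# Coarse data: one value per month, stamped at the first day of the month.
monthly = pd.Series(
    [10, 20, 30],
    index=pd.date_range("2015-01-01", periods=3, freq="MS"),
)

# Fine-grained target index: every day of the quarter.
daily_index = pd.date_range("2015-01-01", "2015-03-31", freq="D")

# Each day takes the most recent monthly value (forward fill).
daily = monthly.reindex(daily_index, method="ffill")

print(len(daily))                             # 90
print(daily.loc[pd.Timestamp("2015-02-15")])  # 20
```

Using `method="bfill"` instead would take the next available value, which leaves NaN for the days after the last monthly stamp.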


## What convenient operations does an Index offer?


index1 = pd.Index(['A','B','B','C','C'])

index2 = pd.Index(['C','D','E','E','F'])

index3 = pd.Index(['B','C','A'])

print index1.append(index2)

print index1.difference(index2)

print index1.intersection(index2)

print index1.union(index2) # Support unique-value Index well

print index1.isin(index2)

print index1.delete(2)

print index1.insert(0,'K') # Not suggested

print index3.drop('A') # Support unique-value Index well

print index1.is_monotonic,index2.is_monotonic,index3.is_monotonic

print index1.is_unique,index2.is_unique,index3.is_unique


Index([u'A', u'B', u'B', u'C', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

Index([u'A', u'B'], dtype='object')

Index([u'C', u'C'], dtype='object')

Index([u'A', u'B', u'B', u'C', u'C', u'D', u'E', u'E', u'F'], dtype='object')

[False False False  True  True]

Index([u'A', u'B', u'C', u'C'], dtype='object')

Index([u'K', u'A', u'B', u'B', u'C', u'C'], dtype='object')

Index([u'B', u'C'], dtype='object')

True True False

False False True
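As the comments in the code hint, the set-style operations are most predictable on indexes with unique values, and their treatment of duplicates has varied across pandas versions. A minimal sketch on unique labels, assuming Python 3 and a recent pandas:

```python
import pandas as pd

a = pd.Index(["A", "B", "C"])
b = pd.Index(["C", "D"])

print(a.union(b).tolist())          # ['A', 'B', 'C', 'D']
print(a.intersection(b).tolist())   # ['C']
print(a.difference(b).tolist())     # ['A', 'B']
print(a.isin(b).tolist())           # [False, False, True]
print(a.is_unique, a.is_monotonic_increasing)   # True True
```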