Class study notes
Machine Learning: Data Science Packages, Part 2
Learning pandas
Structured data analysis
First look at IPython
In [8]: pwd
Out[8]: '/Users/huanying/pandas_tutor'
In [9]: !echo "print('hello pandas')" > hello.py
In [10]: ls
hello.py
In [11]: %run hello.py
hello pandas
In [12]: more hello.py
print('hello pandas')
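Introduction to pandas, Part 1
pd.Series: a Series stores one row or one column of data together with its associated index (similar to a list, but with an index)
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1,3,5,np.nan,8])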
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 8.0
dtype: float64
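To make the "list with an index" idea concrete, here is a minimal sketch that gives the same values explicit labels. The labels a to e and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as above, but with hypothetical string labels as the index.
s = pd.Series([1, 3, 5, np.nan, 8], index=list("abcde"))

value_at_c = s["c"]               # look a value up by its label
n_missing = int(s.isna().sum())   # count the NaN entries
print(value_at_c, n_missing)
```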
**pd.date_range**
>>> dates = pd.date_range('20160301',periods=6)
>>> dates
DatetimeIndex(['2016-03-01', '2016-03-02', '2016-03-03', '2016-03-04',
'2016-03-05', '2016-03-06'],
dtype='datetime64[ns]', freq='D')
pd.DataFrame
>>> data = pd.DataFrame(np.random.randn(6,4), index=dates,columns=list('ABCD'))
>>> data
A B C D
2016-03-01 0.583543 0.504762 -0.371658 0.196402
2016-03-02 -0.489371 0.282419 -0.069540 0.737287
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623
2016-03-06 -0.059509 1.049200 0.151439 -0.863817
Creating a pd.DataFrame from a dictionary
>>> d = {'A':1, 'B':pd.Timestamp('20130301'),'C':range(4),'D':np.arange(4)}
>>> d
{'A': 1, 'B': Timestamp('2013-03-01 00:00:00'), 'C': range(0, 4), 'D': array([0, 1, 2, 3])}
>>> df=pd.DataFrame(d)
>>> df
A B C D
0 1 2013-03-01 0 0
1 1 2013-03-01 1 1
2 1 2013-03-01 2 2
3 1 2013-03-01 3 3
>>> df.B
0 2013-03-01
1 2013-03-01
2 2013-03-01
3 2013-03-01
Name: B, dtype: datetime64[ns]
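Basic operations on a pandas DataFrame
data = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
data.head()   # first five rows (the default)
data.head(2)  # first two rows
data.tail()   # last five rows (the default)
Sorting
data.sort_index(axis=1)                  # sort by column labels, ascending
data.sort_index(axis=0,ascending=False)  # sort by row index, descending
>>> data.sort_values(by='B')  # sort rows by the values in column B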
A B C D
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-02 -0.489371 0.282419 -0.069540 0.737287
2016-03-01 0.583543 0.504762 -0.371658 0.196402
2016-03-06 -0.059509 1.049200 0.151439 -0.863817
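Because the random data above makes the effect of axis hard to see, the following sketch repeats the three sorts on a tiny hand-written frame. The frame and its labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Deliberately unsorted column labels (B, A) and row labels (y, x).
frame = pd.DataFrame({"B": [1, 2], "A": [4, 3]}, index=["y", "x"])

by_columns = frame.sort_index(axis=1)                    # columns become A, B
by_rows_desc = frame.sort_index(axis=0, ascending=False) # rows stay y, x
by_value = frame.sort_values(by="A")                     # row x (A=3) comes first
```

None of these calls changes frame itself; each returns a new, sorted DataFrame.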
Selection
>>> data['2016-03-03':'2016-03-05']
A B C D
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623
>>> data[2:4]
A B C D
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
>>> data
A B C D
2016-03-01 0.583543 0.504762 -0.371658 0.196402
2016-03-02 -0.489371 0.282419 -0.069540 0.737287
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623
2016-03-06 -0.059509 1.049200 0.151439 -0.863817
More efficient index-based lookup with loc[] (by label) and iloc[] (by position)
>>> data.loc['2016-03-03':'2016-03-05']
A B C D
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623
>>> data.iloc[2:4]
A B C D
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
>>> data.loc[:,['B','C']]
B C
2016-03-01 0.504762 -0.371658
2016-03-02 0.282419 -0.069540
2016-03-03 -0.356941 -0.440889
2016-03-04 -0.399150 -0.105229
2016-03-05 -0.936059 0.390769
2016-03-06 1.049200 0.151439
>>> data.loc['20160302':'20160305',['B','C']]
B C
2016-03-02 0.282419 -0.069540
2016-03-03 -0.356941 -0.440889
2016-03-04 -0.399150 -0.105229
2016-03-05 -0.936059 0.390769
>>> data.loc['20160302','B']
0.2824190431994151
>>> data.at[pd.Timestamp('20160302'),'B']
0.2824190431994151
>>> data.iloc[1:2,2:3]
C
2016-03-02 -0.06954
>>> data.iloc[1,1]
0.2824190431994151
>>> data.iat[1,1]
0.2824190431994151
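One detail worth noting in the outputs above: a label slice with loc includes both endpoints, while a position slice with iloc excludes the end. A small sketch on made-up data (the column A holding 0 to 5 is an assumption for illustration):

```python
import pandas as pd

dates = pd.date_range("20160301", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)  # illustrative data

by_label = frame.loc["2016-03-03":"2016-03-05"]   # label slice: both ends included
by_position = frame.iloc[2:4]                     # position slice: end excluded
scalar = frame.at[pd.Timestamp("20160302"), "A"]  # single value by label
```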
Filtering by keyword (data2 is the tagged copy of data built under "Modification" below)
>>> data2[data2.TAG.isin(['a','c'])]
A B C D TAG
2016-03-01 0.583543 0.504762 -0.371658 0.196402 a
2016-03-02 -0.489371 0.282419 -0.069540 0.737287 a
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623 c
2016-03-06 -0.059509 1.049200 0.151439 -0.863817 c
Modification
>>> data2 = data.copy()
>>> tag = ['a']*2+['b']*2+['c']*2
>>> data2['TAG']=tag
>>> data2
A B C D TAG
2016-03-01 0.583543 0.504762 -0.371658 0.196402 a
2016-03-02 -0.489371 0.282419 -0.069540 0.737287 a
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903 b
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159 b
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623 c
2016-03-06 -0.059509 1.049200 0.151439 -0.863817 c
>>> data2[data2.TAG.isin(['a','c'])]
A B C D TAG
2016-03-01 0.583543 0.504762 -0.371658 0.196402 a
2016-03-02 -0.489371 0.282419 -0.069540 0.737287 a
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623 c
2016-03-06 -0.059509 1.049200 0.151439 -0.863817 c
>>> data2['TAG']
2016-03-01 a
2016-03-02 a
2016-03-03 b
2016-03-04 b
2016-03-05 c
2016-03-06 c
Freq: D, Name: TAG, dtype: object
>>> data.iat[0,0]=100
>>> data
A B C D
2016-03-01 100.000000 0.504762 -0.371658 0.196402
2016-03-02 -0.489371 0.282419 -0.069540 0.737287
2016-03-03 0.553235 -0.356941 -0.440889 -0.247903
2016-03-04 0.911576 -0.399150 -0.105229 -0.797159
2016-03-05 -0.577667 -0.936059 0.390769 -0.463623
2016-03-06 -0.059509 1.049200 0.151439 -0.863817
>>> data.A=range(6)
>>> data
A B C D
2016-03-01 0 0.504762 -0.371658 0.196402
2016-03-02 1 0.282419 -0.069540 0.737287
2016-03-03 2 -0.356941 -0.440889 -0.247903
2016-03-04 3 -0.399150 -0.105229 -0.797159
2016-03-05 4 -0.936059 0.390769 -0.463623
2016-03-06 5 1.049200 0.151439 -0.863817
Introduction to pandas, Part 2
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates=pd.date_range('20160301',periods=6)
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
output:
A B C D
2016-03-01 -0.402800 0.852554 -0.769964 -0.605106
2016-03-02 -0.162305 0.590158 0.254271 -0.384514
2016-03-03 -0.658019 1.842996 0.598671 -0.248724
2016-03-04 0.129554 -0.894538 -0.601339 -1.054243
2016-03-05 0.605167 0.027548 -0.047656 -1.113462
2016-03-06 -1.258484 -0.257318 0.121183 -0.856894
df1=df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
df1
output:
A B C D E
2016-03-01 -0.402800 0.852554 -0.769964 -0.605106 NaN
2016-03-02 -0.162305 0.590158 0.254271 -0.384514 NaN
2016-03-03 -0.658019 1.842996 0.598671 -0.248724 NaN
2016-03-04 0.129554 -0.894538 -0.601339 -1.054243 NaN
df1.loc[dates[1:3],'E']=2
df1
output:
A B C D E
2016-03-01 -0.402800 0.852554 -0.769964 -0.605106 NaN
2016-03-02 -0.162305 0.590158 0.254271 -0.384514 2.0
2016-03-03 -0.658019 1.842996 0.598671 -0.248724 2.0
2016-03-04 0.129554 -0.894538 -0.601339 -1.054243 NaN
df1.dropna()
output:
A B C D E
2016-03-02 -0.162305 0.590158 0.254271 -0.384514 2.0
2016-03-03 -0.658019 1.842996 0.598671 -0.248724 2.0
df1.fillna(value=5)
output:
A B C D E
2016-03-01 -0.402800 0.852554 -0.769964 -0.605106 5.0
2016-03-02 -0.162305 0.590158 0.254271 -0.384514 2.0
2016-03-03 -0.658019 1.842996 0.598671 -0.248724 2.0
2016-03-04 0.129554 -0.894538 -0.601339 -1.054243 5.0
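Both dropna() and fillna() return a new DataFrame and leave the original untouched, which the outputs above do not make obvious. A minimal sketch with hand-written values (the frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [np.nan, 2.0, np.nan]})

dropped = frame.dropna()         # keeps only rows without any NaN
filled = frame.fillna(value=5)   # replaces every NaN with 5
still_missing = int(frame["E"].isna().sum())  # original is unchanged
```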
Computing the mean with mean() and the cumulative sum with cumsum()
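The notes give no example here, so the following is a sketch on a small made-up frame showing mean() along each axis and cumsum() down the rows.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})  # illustrative data

col_means = frame.mean()         # one mean per column
row_means = frame.mean(axis=1)   # one mean per row
running = frame.cumsum()         # running total down each column
```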
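pandas Series: a Series stores one row or one column of data together with its associated index (similar to a list, but with an index)
DataFrame.apply(): the function passed to apply() does not receive individual values. It receives each column (or, with axis=1, each row) of the DataFrame as a Series, and what it returns becomes the corresponding entry, column, or row of the result: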
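A short sketch of that behaviour, using a made-up frame and two illustrative lambdas: the first is called once per column, the second once per row.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})  # illustrative data

# Default axis=0: the lambda receives each column as a Series.
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives each row as a Series.
row_sum = frame.apply(lambda row: row.sum(), axis=1)
```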
Counting operations on a Series
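The notes do not say which call was used in class; value_counts() is the usual choice, so this sketch assumes it. The sample values are made up.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])  # illustrative data

counts = s.value_counts()   # occurrences per distinct value, most frequent first
most_common = counts.index[0]
```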
iloc[] and concat()
Merging
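As a sketch of the two operations named here: pd.concat() stacks frames on top of each other, while pd.merge() joins them on a shared key column. The frames and the column names key, lval, and rval are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

stacked = pd.concat([left, left], ignore_index=True)  # 4 rows, fresh 0..3 index
merged = pd.merge(left, right, on="key")              # SQL-style join on "key"
```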
Grouping
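Grouping splits the rows by the values of a key column and then aggregates each group. A minimal sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"key": ["a", "b", "a", "b"],
                      "value": [1, 2, 3, 4]})  # illustrative data

totals = frame.groupby("key")["value"].sum()   # one total per distinct key
```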
Introduction to pandas, Part 3
Reshaping data
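The notes leave this topic empty. One common reshaping step is unstack(), which moves the inner level of a row MultiIndex into the columns (stack() goes the other way). The sketch below assumes that is what was meant; the level names row and col are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["x", "y"], ["A", "B"]],
                                   names=["row", "col"])
long = pd.Series([1, 3, 2, 4], index=index)  # "long" format: one value per (row, col)

wide = long.unstack()  # inner level "col" becomes the columns A and B
```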
Pivot tables
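A pivot table turns the distinct values of one column into row labels and those of another into column labels, aggregating a third column in each cell. A sketch with made-up sales data (all column names are assumptions):

```python
import pandas as pd

frame = pd.DataFrame({"city": ["X", "X", "Y", "Y"],
                      "year": [2015, 2016, 2015, 2016],
                      "sales": [1, 2, 3, 4]})  # illustrative data

table = frame.pivot_table(values="sales", index="city",
                          columns="year", aggfunc="sum")
```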
Time series
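A typical time-series operation is resampling to a coarser frequency. The sketch below, on made-up daily values, sums them into two-day bins; the choice of resample() as the example is an assumption, since the notes do not name a specific call.

```python
import pandas as pd

rng = pd.date_range("20160301", periods=4, freq="D")
ts = pd.Series([1, 2, 3, 4], index=rng)  # illustrative daily data

two_day = ts.resample("2D").sum()  # bins start on 2016-03-01 and 2016-03-03
```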
cat.categories: re-encode the values of a column as categories
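A sketch of the idea: convert a column to the category dtype, inspect cat.categories, and give the categories new names. The column names and the replacement labels are made up; rename_categories() is used here because it returns a new Series instead of modifying the column.

```python
import pandas as pd

frame = pd.DataFrame({"raw_grade": ["a", "b", "b", "a", "e"]})  # illustrative data
frame["grade"] = frame["raw_grade"].astype("category")

categories = list(frame["grade"].cat.categories)  # distinct values, sorted
codes = frame["grade"].cat.codes.tolist()         # integer code per row

# Map the categories a, b, e to hypothetical descriptive names.
renamed = frame["grade"].cat.rename_categories(["very good", "good", "very bad"])
```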
Data visualization
Loading and saving data
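A minimal CSV round trip, written against an in-memory buffer so the sketch runs without touching the disk. With a real file you would pass a path instead of the buffer; the frame is made up.

```python
import io

import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]})  # illustrative data

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # save without writing the row index
buffer.seek(0)                     # rewind before reading back
loaded = pd.read_csv(buffer)
```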
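Example: MovieLens movie data analysis, Part 1
Example: MovieLens movie data analysis, Part 2
pandas core data structures, Part 1
Series
Properties of a Series object: it is ndarray-like, dict-like, and its operations align on labels.
ndarray-like
dict-like
Label alignment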
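The three properties can be sketched on one small Series. The values and labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# ndarray-like: vectorized arithmetic and boolean masks.
doubled = s * 2
above_mean = s[s > s.mean()]

# dict-like: membership tests and get() work on the labels.
has_a = "a" in s
missing = s.get("z")  # None, because "z" is not a label

# Label alignment: values are matched by label, not by position.
# Labels present on only one side ("a" and "c") produce NaN.
aligned = s.iloc[1:] + s.iloc[:-1]
```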
DataFrame
A DataFrame is a labeled two-dimensional array
Creating a DataFrame from a dictionary
Creating a DataFrame from a list
Features: adding and deleting columns
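A sketch of the items above: the same frame built from a dict of columns and from a list of row dicts, followed by adding and deleting a column. The column names one, two, and three are hypothetical.

```python
import pandas as pd

from_dict = pd.DataFrame({"one": [1, 2], "two": [3, 4]})          # dict of columns
from_list = pd.DataFrame([{"one": 1, "two": 3},
                          {"one": 2, "two": 4}])                  # list of row dicts
same = from_dict.equals(from_list)

from_dict["three"] = from_dict["one"] + from_dict["two"]  # add a column
del from_dict["two"]                                      # delete a column
```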
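Panel (deprecated in pandas 0.20 and removed in 0.25)
A labeled three-dimensional array
items act as the category labels.
major_axis corresponds to the index of a DataFrame.
minor_axis corresponds to the columns of a DataFrame.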
Basic pandas operations
Reindexing
Dropping data
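A short sketch of both operations on a made-up frame: reindex() conforms the frame to a new set of labels (filling unknown labels with NaN), and drop() removes rows or columns by label.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]},
                     index=["x", "y", "z"])  # illustrative data

reindexed = frame.reindex(index=["x", "z", "w"])  # "w" is new, so its row is NaN
without_row = frame.drop("y")                     # drop a row by label
without_col = frame.drop(columns=["B"])           # drop a column by label
```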
Mapping functions: apply/applymap
applymap
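Unlike apply(), applymap() calls the function on every single element. pandas 2.1 renamed this method to DataFrame.map and deprecated the old name, so the sketch below picks whichever one the installed version provides. The data and the formatting lambda are made up.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.234, 2.345], "B": [3.456, 4.567]})  # illustrative data

# Use DataFrame.map on pandas >= 2.1, otherwise fall back to applymap.
elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap

# The lambda receives one scalar at a time and returns a formatted string.
formatted = elementwise(lambda x: f"{x:.1f}")
```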
Sorting and ranking
Sorting a DataFrame
Ranking:
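rank() assigns each value its position in sorted order. By default ties share the average of their positions; method="first" breaks ties by order of appearance. A sketch on made-up values:

```python
import pandas as pd

s = pd.Series([3, 6, 2, 6, 4])  # illustrative data with one tie (the two 6s)

avg_rank = s.rank()                   # the tied 6s both get (4 + 5) / 2 = 4.5
first_rank = s.rank(method="first")   # the earlier 6 gets 4, the later one 5
```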