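I. pandas Data Structures
1. Series
A Series is a one-dimensional array-like object that holds a sequence of values together with an associated set of data labels, called the index.
It can store any data type, such as strings, integers, floats, and Python objects.
A Series describes its values through its name and index attributes.
A Series is a one-dimensional data structure; its number of dimensions cannot be changed.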
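import numpy as np
import pandas as pd
obj = pd.Series([1,2,3,4])  # no index given, so the default integer index 0..3 is used
obj.values  # the underlying NumPy array
obj2 = pd.Series([5,6,7,8],index=['a','b','c','d'])  # explicit string labels
obj2[obj2 > 0]  # boolean filtering; the labels of matching values are kept
obj2 * 2  # element-wise arithmetic
'b' in obj2  # membership is tested against the index labels, not the values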
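The name and index attributes mentioned above can be illustrated with a minimal sketch. The Series name 'score' and the index name 'key' are made up for this example; they do not come from the notes.

```python
import pandas as pd

# Hypothetical labels and names, chosen only for illustration.
s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'], name='score')
s.index.name = 'key'

labels = list(s.index)   # the index labels
values = s.values        # the underlying NumPy array
has_b = 'b' in s         # `in` looks at the index labels ...
has_5 = 5 in s           # ... not at the values
big = s[s > 6]           # boolean filtering keeps the matching labels
```

Because `in` checks labels, testing whether a value is present would need something like `5 in s.values` instead.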
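Create an empty Series object
s = pd.Series(dtype='float64')  # dtype passed explicitly, since the default dtype of an empty Series differs between pandas versions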
Create a Series object from an ndarray
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[1,2,3,4])  # the index must have the same length as the array
Create a Series object from a dict
data = {'a':0,'b':1,'c':2}
s = pd.Series(data)  # the dict keys become the index labels
Create a Series object from a scalar; an index must be provided
s = pd.Series(5,index = [0,1,2,3])  # the scalar is repeated once per index label
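As a small sketch of how the index argument interacts with these constructors: when a dict is combined with an explicit index, the labels are looked up in the dict, and labels without an entry become NaN. The label 'd' below is an assumption added for illustration.

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2}

# The index fixes the order; 'd' has no entry in data, so it becomes NaN.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

# A scalar is repeated once for every index label.
const = pd.Series(5, index=[0, 1, 2, 3])
```

Note that the NaN forces the values of s to a floating-point dtype.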
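2. DataFrame
A DataFrame is a two-dimensional, tabular data structure with both a row index and a column index: the row index is index and the column index is columns. Both can be specified when the object is created.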
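data = {'state':['ohio','ohio','ohio','nada','nada','nada'],
        'year':[2000,2001,2002,2001,2002,2003],
        'pop':[1.1,1.2,1.3,2.1,2.2,2.3]}
frame = pd.DataFrame(data)
frame.head()  # first five rows
pd.DataFrame(data,columns=['year','state','pop'])  # columns arranged in the given order
frame['state']  # select a column as a Series
frame.year  # attribute access works only for names that are valid identifiers and not DataFrame methods (frame.pop is a method, so use frame['pop'])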
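The following sketch shows, under stated assumptions, how the columns and index arguments behave at creation time. The column 'debt' and the row labels are hypothetical and do not appear in the data dict; a requested column that is missing from the data is filled with NaN.

```python
import pandas as pd

data = {'state': ['ohio', 'ohio', 'nada'],
        'year': [2000, 2001, 2001],
        'pop': [1.1, 1.2, 2.1]}

# 'debt' is not a key of data, so that column is created and filled with NaN.
frame = pd.DataFrame(data,
                     columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three'])

state = frame['state']        # bracket access works for any column name
population = frame['pop']     # frame.pop would return the DataFrame.pop method instead
```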
Create an empty DataFrame object
df = pd.DataFrame()
Create a DataFrame object from a list
data = [1,2,3,4]
df = pd.DataFrame(data)  # a flat list gives a single column labelled 0
data = [['alex',10],['bob',12],['clark',14]]
df = pd.DataFrame(data,columns=['Name','Age'])  # each inner list is one row
Create a DataFrame object from a dict of lists
data = {'Name':['tom','jack','steve','ricky'],'Age':[23,24,25,26]}
df = pd.DataFrame(data)  # keys become column names; all lists must have the same length
Create a DataFrame object from a list of dicts
data = [{'a':1,'b':2},{'a':5,'b':10,'c':20}]
df = pd.DataFrame(data)  # each dict is one row; the first row has no 'c', so that cell is NaN
Create a DataFrame object from a dict of Series
d = {'one':pd.Series([1,2,3],index=['a','b','c']),
     'two':pd.Series([1,2,3,4],index=['a','b','c','d'])}
df = pd.DataFrame(d)  # rows are the union of both indexes; 'one' is NaN at 'd'
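A short sketch of the alignment that happens in the last two constructors: rows or keys that are missing from one input are filled with NaN rather than raising an error. It reuses the data from the notes and only adds checks.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Label 'd' exists only in 'two', so 'one' holds NaN there and becomes float64.
missing = df.loc['d', 'one']

# With a list of dicts, a key absent from one record becomes NaN in that row.
records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
```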
II. Basic Functionality
1. Reindexing
obj = pd.Series([1,2,3,4],index=['a','b','c','d'])
obj2 = obj.reindex(['a','b','c','d','e'])  # returns a new object; 'e' is not in obj, so its value is NaN
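As a minimal sketch of what reindex does with labels that were not present before: by default they receive NaN, and the fill_value parameter of reindex can supply a different placeholder. The choice of 0 as the fill value is only an example.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

with_nan = obj.reindex(['a', 'b', 'c', 'd', 'e'])                 # 'e' becomes NaN
with_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)  # 'e' becomes 0
```

The original obj is left unchanged in both cases, since reindex returns a new object.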
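2. Dropping entries from an axis
obj = pd.Series(np.arange(5),index=['a','b','c','d','e'])
obj2 = obj.drop('c')  # returns a new Series without 'c'; obj itself is unchanged
obj.drop(['d','c'])  # several labels can be dropped at once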
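The notes only drop from a Series; the sketch below extends the same idea to a DataFrame, where drop removes rows by default and columns when the columns parameter is used. The small frame and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
smaller = obj.drop(['d', 'c'])        # new Series; obj still has 5 entries

# Hypothetical frame used only to show the two axes.
frame = pd.DataFrame(np.arange(6).reshape((3, 2)),
                     index=['x', 'y', 'z'], columns=['one', 'two'])
no_row = frame.drop('y')              # labels refer to rows by default
no_col = frame.drop(columns='two')    # drop a column instead
```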
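3. Indexing, selection, and filtering
obj = pd.Series(np.arange(4),index=['a','b','c','d'])
obj['b']  # select by label
obj[2:4]  # positional slice; the end point is excluded (obj.iloc[2:4] is the explicit form)
obj[['b','c','d']]  # select several labels with a list
obj[obj<2]  # boolean filtering
obj['b':'c'] = 5  # label slice; unlike a positional slice, the end label is included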
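The difference between label slices and positional slices is easy to miss, so here is a small sketch that spells it out with loc and iloc. The results are copied to lists before the assignment so that the later write cannot affect them.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c'].tolist()    # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2].tolist()    # positional slice: position 2 is excluded

obj.loc['b':'c'] = 5                    # assignment through a label slice
after = obj.tolist()
```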
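data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['ohio','colo','utah','new'],
                    columns=['one','two','three','four'])
data['two']  # a single column as a Series
data[['three','one']]  # several columns, in the given order
data[:2]  # a slice selects rows: the first two
data[data['three']>5]  # rows where column 'three' is greater than 5
data < 5  # element-wise comparison; returns a boolean DataFrame
data[data < 5] = 0  # set every value below 5 to 0 in place
data.loc['colo',['two','three']]  # loc selects by label: row 'colo', two columns
data.iloc[2,[3,0,1]]  # iloc selects by integer position: third row, columns 3, 0, 1
data.iloc[2]  # third row as a Series
data.iloc[[1,2],[3,0,1]]  # rows 1 and 2, columns 3, 0, 1
data.loc[:'utah','two']  # label slice up to and including 'utah'
data.iloc[:,:3][data.three>5]  # first three columns, then rows where 'three' > 5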
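To make the loc and iloc lines above concrete, this sketch selects the same cells once by label and once by position, and also combines a boolean row mask with a column label in a single loc call. It rebuilds the frame from the notes without the in-place assignment.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['ohio', 'colo', 'utah', 'new'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['colo', ['two', 'three']]   # row 'colo', columns by name
by_position = data.iloc[1, [1, 2]]              # the same cells by integer position

# A boolean mask for the rows and a label for the column, in one loc call.
filtered = data.loc[data['three'] > 5, 'one']
```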
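III. Summarizing and Computing Descriptive Statistics
df = pd.DataFrame([[1,np.nan],[3,-1],[np.nan,np.nan],[0,3]],
                  index=['a','b','c','d'],
                  columns=['one','two'])
df.sum()  # column sums; NaN values are skipped
df.sum(axis='columns')  # row sums
df.mean()  # column means; NaN values are skipped
df.describe()  # several summary statistics at once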
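Since the frame contains missing values, a brief sketch of how the reductions treat NaN may help. By default NaN is skipped; passing skipna=False makes any NaN in a row or column propagate to the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [3, -1], [np.nan, np.nan], [0, 3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

col_sums = df.sum()                             # NaN skipped in each column
row_sums = df.sum(axis='columns')               # the all-NaN row 'c' sums to 0
strict = df.sum(axis='columns', skipna=False)   # any NaN in a row gives NaN
row_means = df.mean(axis='columns')             # the all-NaN row 'c' gives NaN
```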
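1. Correlation and covariance
# price is not defined in these notes; it is assumed to be a DataFrame of prices (one column per instrument, one row per date)
returns = price.pct_change()  # fractional change from the previous row
returns.tail()  # last five rows
returns.corr()  # pairwise correlation matrix of the columns
returns.cov()  # pairwise covariance matrix of the columns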
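Because price is not defined in the notes, the following sketch substitutes a small made-up table so the calls can be run end to end. The tickers 'AAA' and 'BBB' and all price values are hypothetical.

```python
import pandas as pd

# Hypothetical closing prices; tickers and values are invented for illustration.
price = pd.DataFrame({'AAA': [10.0, 11.0, 9.9, 10.89],
                      'BBB': [20.0, 21.0, 21.0, 23.1]})

returns = price.pct_change()     # first row is NaN: there is no previous price
corr = returns.corr()            # symmetric matrix with a diagonal of 1
cov = returns.cov()

# Correlation between two individual columns.
pair = returns['AAA'].corr(returns['BBB'])
```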
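2. Unique values, value counts, and membership
obj = pd.Series(['c','a','d','a','a','b','b','c','c'])
uniques = obj.unique()  # distinct values, in order of first appearance
obj.value_counts()  # how often each value occurs
mask = obj.isin(['b','c'])  # boolean Series: True where the value is 'b' or 'c'
obj[mask]  # keep only the matching entries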
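The three methods above can be checked against each other with a short sketch that reuses the same Series and stores each result under its own name.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = list(obj.unique())     # distinct values in order of first appearance
counts = obj.value_counts()      # value -> number of occurrences
mask = obj.isin(['b', 'c'])      # True where the value is 'b' or 'c'
subset = obj[mask]               # original positions are kept as the index
```

The counts of 'b' and 'c' add up to the number of True entries in mask.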