Lesson 1
1.1 Create data: df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
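zip() usage: takes any number of sequences as arguments and pairs their elements into tuples; in Python 3 it returns an iterator, so wrap it in list() to get a list of tuples, as follows: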
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list(zip(names,births))
Out:[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
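A minimal sketch of step 1.1 end to end. It assumes pandas is imported as pd, which these notes never show explicitly:

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# zip() pairs the two lists element by element; list() materializes the pairs.
BabyDataSet = list(zip(names, births))

# Each tuple becomes one row; columns= labels the two fields.
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
print(df)
```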
1.2 Get data: df = pd.read_csv(Location, names=['Names','Births'])
Delete the CSV file after it has been read:
import os
os.remove(Location)
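A possible round trip for step 1.2, as a sketch only. The file name births.csv and the temporary directory are made up for illustration, and writing with to_csv(index=False, header=False) is an assumption about how the file was produced:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

# Hypothetical location: a file inside a fresh temporary directory.
Location = os.path.join(tempfile.mkdtemp(), 'births.csv')

# Write without index or header so the file holds only the raw rows.
df.to_csv(Location, index=False, header=False)

# The file has no header row, so names= supplies the column labels.
df_read = pd.read_csv(Location, names=['Names', 'Births'])

# Remove the file once its contents are in memory.
os.remove(Location)
```

Without names=, read_csv would treat the first record ('Bob', 968) as the header row.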
1.3 Prepare data:
Check the data type of each DataFrame column: df.dtypes
transform() vs. apply():
data.groupby(key).transform(np.mean)  # result has the same shape as data
data.groupby(key).apply(np.mean)  # result is reduced to one entry per group
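A small sketch of the shape difference. The column names 'key' and 'value' and the numbers are invented for illustration, and the string 'mean' and a lambda stand in for np.mean here because how pandas handles NumPy callables in groupby has changed between versions:

```python
import pandas as pd

data = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                     'value': [1.0, 3.0, 2.0, 4.0, 6.0]})

grouped = data.groupby('key')['value']

# transform broadcasts each group's mean back onto every row of that group,
# so the result lines up with the original rows.
per_row = grouped.transform('mean')

# apply with a reducing function collapses each group to a single value,
# giving one entry per key.
per_group = grouped.apply(lambda s: s.mean())

print(per_row)
print(per_group)
```

Because per_row is aligned with data, it can be assigned straight back as a new column, e.g. data['group_mean'] = per_row.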
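Generate a fixed-frequency date range: pd.date_range(start='12/31/2011', end='12/31/2013', periods=None, freq='D')  # freq sets the step unit: 'D' means one calendar day, and e.g. '5H' means one timestamp every 5 hours (written '5h' in recent pandas versions)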
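A short sketch of two ways to call date_range. The second call, with periods=7, is an added illustration and not part of the original notes:

```python
import pandas as pd

# Daily timestamps from 2011-12-31 through 2013-12-31, both ends included.
days = pd.date_range(start='12/31/2011', end='12/31/2013', freq='D')

# With periods instead of end, the length is fixed and the end is derived.
week = pd.date_range(start='12/31/2011', periods=7, freq='D')

print(len(days))
print(week[-1])
```

Of start, end, periods and freq, only three should be specified; the fourth follows from the others.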
stack() vs. unstack():
df = pd.DataFrame(data = d, index = i)
stack = df.stack()#print(stack)
print(stack.index)
unstack = df.unstack()#print(unstack)
print(unstack.index)
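The notes leave d and i undefined, so the sketch below fills them with made-up values to show what the two index printouts look like:

```python
import pandas as pd

d = {'one': [1, 1], 'two': [2, 2]}  # hypothetical data
i = ['a', 'b']                      # hypothetical index
df = pd.DataFrame(data=d, index=i)

# stack(): column labels become the inner level of the row index.
stacked = df.stack()

# unstack() on a single-level index: the result is a Series whose
# outer index level is the column label and inner level the row label.
unstacked = df.unstack()

print(stacked.index.tolist())
print(unstacked.index.tolist())
```

In both cases the 2x2 frame becomes a 4-element Series; only the order of the two index levels differs.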
Swap row labels and column labels with .T (transpose):
transpose = df.T
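A tiny sketch of the transpose, using an invented 2x2 frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 1], 'two': [2, 2]}, index=['a', 'b'])

# .T flips the frame: old column labels become the row index and vice versa.
transpose = df.T

print(transpose)
```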
1.4 Analyze data:
Sort by a column: df.sort_values(['Births'], ascending=False)  # descending order
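A sketch of the sort on the lesson's own data. The variable name sorted_df is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

# sort_values returns a new DataFrame; df itself keeps its original order.
sorted_df = df.sort_values(['Births'], ascending=False)

# The first row after a descending sort holds the largest value.
print(sorted_df.head(1))
```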
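Select part of a DataFrame: df.loc['name']  # row with index label 'name'
df.loc[inclusive:inclusive]  # label slice, both endpoints included
df.loc[df.index[5:],'col']  # row labels first, then the column name
df.iloc[inclusive:exclusive]  # integer-position slice, start included and stop excluded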
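The selection forms above, sketched on a small frame. Using the names as the index is an assumption made here so that label-based and position-based slicing can be compared side by side:

```python
import pandas as pd

df = pd.DataFrame({'Births': [968, 155, 77, 578, 973]},
                  index=['Bob', 'Jessica', 'Mary', 'John', 'Mel'])

one_row = df.loc['Mary']                   # Series for the row labeled 'Mary'
by_label = df.loc['Jessica':'John']        # label slice: both ends included
by_position = df.iloc[1:3]                 # position slice: stop excluded
tail_col = df.loc[df.index[3:], 'Births']  # rows first, then the column name

print(by_label)
print(by_position)
```

Both slices start at 'Jessica' (position 1), but the label slice returns three rows while the position slice returns two.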
Show the first 5 rows: df.head()
Show the last 5 rows: df.tail()
Get a column's max, min, mean, or median: df['Births'].max()  # likewise .min(), .mean(), .median()
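A sketch of the summary calls on the lesson's data. The idxmax lookup at the end is an extra step, added here to show how the maximum can be tied back to a name:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]})

births = df['Births']
summary = {
    'max': births.max(),
    'min': births.min(),
    'mean': births.mean(),
    'median': births.median(),
}
print(summary)

# idxmax gives the index label of the largest value, which loc can look up.
top_name = df.loc[births.idxmax(), 'Names']
print(top_name)
```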