1. Time operations
Creating a datetime:
import datetime
dt = datetime.datetime(year=2017,month=11,day=24,hour=10,minute=30)
dt
>datetime.datetime(2017, 11, 24, 10, 30)
print (dt)
>2017-11-24 10:30:00
Convert the time to a pandas Timestamp so its fields can be read:
import pandas as pd
ts = pd.Timestamp('2017-11-24')
ts
Timestamp('2017-11-24 00:00:00')
A Timestamp then supports a range of operations:
ts.month
>11
ts.day
>24
ts + pd.Timedelta('5 days')
>Timestamp('2017-11-29 00:00:00')
pd.to_datetime parses date strings into Timestamps:
pd.to_datetime('2017-11-24')
>Timestamp('2017-11-24 00:00:00')
pd.to_datetime('24/11/2017')
>Timestamp('2017-11-24 00:00:00')
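Date strings like '24/11/2017' are ambiguous; a short sketch (not from the original notes) of two ways to make the parse explicit with pd.to_datetime:

```python
import pandas as pd

# dayfirst=True tells pandas to read '24/11/2017' as day/month/year;
# an explicit format string avoids the guessing entirely.
ts1 = pd.to_datetime('24/11/2017', dayfirst=True)
ts2 = pd.to_datetime('2017-11-24 10:30', format='%Y-%m-%d %H:%M')
```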
2. Time series
Creating a time series:
pd.Series(pd.date_range(start='2017-11-24',periods = 10,freq = '12H'))
>0 2017-11-24 00:00:00
1 2017-11-24 12:00:00
2 2017-11-25 00:00:00
3 2017-11-25 12:00:00
4 2017-11-26 00:00:00
5 2017-11-26 12:00:00
6 2017-11-27 00:00:00
7 2017-11-27 12:00:00
8 2017-11-28 00:00:00
9 2017-11-28 12:00:00
dtype: datetime64[ns]
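Besides '12H', date_range accepts other offset aliases. A quick sketch (assumption: the standard pandas aliases 'H' for hourly and 'M' for month-end):

```python
import pandas as pd

# Four hourly stamps, and three month-end stamps.
hourly = pd.date_range(start='2017-11-24', periods=4, freq='H')
monthly = pd.date_range(start='2017-01-31', periods=3, freq='M')
```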
A worked example:
data = pd.read_csv('./data/flowdata.csv')
data.head()
> Time L06_347 LS06_347 LS06_348
0 2009-01-01 00:00:00 0.137417 0.097500 0.016833
1 2009-01-01 03:00:00 0.131250 0.088833 0.016417
2 2009-01-01 06:00:00 0.113500 0.091250 0.016750
3 2009-01-01 09:00:00 0.135750 0.091500 0.016250
4 2009-01-01 12:00:00 0.140917 0.096167 0.017000
For this dataset, first convert the Time column to datetime; it can then be set as the index:
data['Time'] = pd.to_datetime(data['Time'])
data = data.set_index('Time') #use time as the index
data
> L06_347 LS06_347 LS06_348
Time
2009-01-01 00:00:00 0.137417 0.097500 0.016833
2009-01-01 03:00:00 0.131250 0.088833 0.016417
2009-01-01 06:00:00 0.113500 0.091250 0.016750
The time column can also be set as the index directly when reading the file:
data = pd.read_csv('./data/flowdata.csv',index_col = 0,parse_dates = True)
data.head()
> L06_347 LS06_347 LS06_348
Time
2009-01-01 00:00:00 0.137417 0.097500 0.016833
2009-01-01 03:00:00 0.131250 0.088833 0.016417
Data can then be selected by time:
data[pd.Timestamp('2012-01-01 09:00'):pd.Timestamp('2012-01-01 19:00')] #slice between two timestamps
data[data.index.month == 1] #all January rows
data[(data.index.hour > 8) & (data.index.hour < 12)] #rows between 09:00 and 11:00
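Since flowdata.csv may not be at hand, the same time-based selection on a self-contained toy frame (24 hourly readings with made-up values):

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2012-01-01', periods=24, freq='H')
df = pd.DataFrame({'value': np.arange(24)}, index=idx)

# Slicing a DatetimeIndex is inclusive of both endpoints.
window = df[pd.Timestamp('2012-01-01 09:00'):pd.Timestamp('2012-01-01 19:00')]
# Boolean mask on the index components: hours 09:00-11:00.
morning = df[(df.index.hour > 8) & (df.index.hour < 12)]
```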
3. Resampling:
Given this dataset:
data.head()
> L06_347 LS06_347 LS06_348 month
Time
2009-01-01 00:00:00 0.137417 0.097500 0.016833 1
2009-01-01 03:00:00 0.131250 0.088833 0.016417 1
2009-01-01 06:00:00 0.113500 0.091250 0.016750 1
2009-01-01 09:00:00 0.135750 0.091500 0.016250 1
2009-01-01 12:00:00 0.140917 0.096167 0.017000 1
data.resample('D').mean().head() #'D' resamples to daily frequency
> L06_347 LS06_347 LS06_348 month
Time
2009-01-01 0.125010 0.092281 0.016635 1
2009-01-02 0.124146 0.095781 0.016406 1
2009-01-03 0.113562 0.085542 0.016094 1
2009-01-04 0.140198 0.102708 0.017323 1
2009-01-05 0.128812 0.104490 0.018167 1
Similar operations (note that the old data.resample('D', how='mean') form was removed in modern pandas; chain the aggregation method instead):
data.resample('D').max().head()
data.resample('3D').mean().head() #aggregate every 3 days
data.resample('M').mean().head() #monthly statistics
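A self-contained resampling sketch, since flowdata.csv may be unavailable: two days of 3-hourly readings (values made up) aggregated down to daily frequency.

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2009-01-01', periods=16, freq='3H')  # 8 readings per day
df = pd.DataFrame({'flow': np.arange(16.0)}, index=idx)

daily_mean = df.resample('D').mean()  # one row per day
daily_max = df.resample('D').max()
```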
4. Sorting
data = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
                     'data':[4,3,2,1,12,3,4,5,7]})
data.sort_values(by=['group','data'],ascending = [False,True],inplace=True)
#group descending first, then data ascending within each group
data.sort_values(by='data') #sort by a single column
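Re-running the two-key sort above confirms the result: group is ordered descending first, and data ascending inside each group.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a','a','a','b','b','b','c','c','c'],
                     'data': [4, 3, 2, 1, 12, 3, 4, 5, 7]})
# Per-column sort directions via the ascending list.
data.sort_values(by=['group','data'], ascending=[False, True], inplace=True)
```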
5. Deduplication
data.drop_duplicates() #drop rows that are identical in every column
data.drop_duplicates(subset='group') #keep only the first row for each value of group
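A toy frame makes the two drop_duplicates variants concrete (the column names k1/k2 here are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one', 'one', 'two', 'two'],
                   'k2': [1, 1, 2, 3]})
full = df.drop_duplicates()              # rows 0 and 1 are identical, one is dropped
by_k1 = df.drop_duplicates(subset='k1')  # first row kept per k1 value
```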
6. apply: run the same operation on every sample
Given this data:
data = pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],'data':[1,2,3,4,5,6,7]})
data
> data food
0 1 A1
1 2 A2
2 3 B1
3 4 B2
4 5 B3
5 6 C1
6 7 C2
To merge these categories, e.g. 'A1' and 'A2' into A (and likewise for B and C), apply can replace an explicit for loop:
def food_map(series):
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'
data['food_map'] = data.apply(food_map,axis = 'columns')
data
> data food food_map
0 1 A1 A
1 2 A2 A
2 3 B1 B
3 4 B2 B
4 5 B3 B
5 6 C1 C
6 7 C2 C
The same result is simpler with a dict and .map:
food2Upper = {
'A1':'A',
'A2':'A',
'B1':'B',
'B2':'B',
'B3':'B',
'C1':'C',
'C2':'C'
}
data['upper'] = data['food'].map(food2Upper) #dictionary lookup per element
>data food food_map upper
0 1 A1 A A
1 2 A2 A A
2 3 B1 B B
3 4 B2 B B
4 5 B3 B B
5 6 C1 C C
6 7 C2 C C
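Since every label here happens to be the first character of the food code, a vectorized one-liner gives the same mapping (an observation, not part of the original notes):

```python
import pandas as pd

data = pd.DataFrame({'food': ['A1','A2','B1','B2','B3','C1','C2'],
                     'data': [1, 2, 3, 4, 5, 6, 7]})
# .str[0] takes the first character of each string, elementwise.
data['upper'] = data['food'].str[0]
```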
Another example: counting missing values per column:
import pandas as pd
import numpy as np
titanic = pd.read_csv('titanic_train.csv')
titanic.head()
def not_null_count(columns):
    columns_null = pd.isnull(columns)
    null = columns[columns_null]
    return len(null)
columns_null_count = titanic.apply(not_null_count)
columns_null_count
>PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
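The same per-column null count needs no apply at all: isnull().sum() sums the boolean mask column-wise. A toy frame is used here since titanic_train.csv may not be available:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                   'Fare': [7.25, 71.28, 8.05, 53.1]})
# True counts as 1 when summed, so this yields nulls per column.
null_counts = df.isnull().sum()
```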
7. cut: discretizing continuous values
ages = [15,18,20,21,22,34,41,52,63,79]
bins = [10,40,80] #bin edges; passing bins=5 instead would cut into 5 equal-width bins
bins_res = pd.cut(ages,bins)
bins_res
>[(10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (40, 80], (40, 80], (40, 80], (40, 80]]
Categories (2, interval[int64]): [(10, 40] < (40, 80]]
bins_res.codes #integer bin index of each value (.labels in very old pandas)
>array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)
The intervals can also be given names:
pd.cut(ages,[10,30,50,80])
group_names = ['Youth','Middle','Old']
#pd.cut(ages,[10,20,50,80],labels=group_names)
pd.value_counts(pd.cut(ages,[10,20,50,80],labels=group_names))
>Middle 4
Old 3
Youth 3
dtype: int64
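A related tool not covered in the notes above: pd.qcut cuts by quantiles instead of fixed edges, so each bin holds roughly the same number of samples.

```python
import pandas as pd

ages = [15, 18, 20, 21, 22, 34, 41, 52, 63, 79]
quartiles = pd.qcut(ages, 4)                  # 4 equal-frequency bins
counts = pd.Series(quartiles).value_counts()  # samples per bin
```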
8. Missing values:
df = pd.DataFrame([range(3),[0, np.nan,0],[0,0,np.nan],range(3)]) #build a df containing NaN
df.isnull() #elementwise missing-value test, returns True/False
df.isnull().any() #which columns contain a missing value
df.isnull().any(axis = 1) #which rows contain a missing value
df.fillna(5) #fill missing values with 5
df[df.isnull().any(axis = 1)] #select the rows that contain missing values
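dropna complements fillna: instead of filling the gaps, it drops the rows (or, with axis=1, the columns) that contain them.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])
row_clean = df.dropna()   # drop every row containing a NaN
filled = df.fillna(5)     # or fill the NaNs instead
```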
9. String handling
s = pd.Series(['A','b','B','gaer','AGER',np.nan])
s
>0 A
1 b
2 B
3 gaer
4 AGER
5 NaN
dtype: object
Strings support the usual basic operations:
s.str.lower() #lowercase
s.str.upper() #uppercase
s.str.len() #length of each string
s.str.strip() #strip whitespace from both ends
s.str.lstrip() #strip leading whitespace
s.str.rstrip() #strip trailing whitespace
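Note that strip/lstrip/rstrip only trim leading and trailing whitespace; spaces inside the string are untouched:

```python
import pandas as pd

s = pd.Series(['  left', 'right  ', '  a b  '])
stripped = s.str.strip()  # inner space in 'a b' survives
```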
Substrings can be replaced, e.g. in column names:
df = pd.DataFrame(np.random.randn(3,2),columns = ['A a','B b'],index = range(3))
df
> A a B b
0 -1.392628 1.020082
1 0.866707 0.654731
2 -0.320871 1.360513
df.columns = df.columns.str.replace(' ','_')
df
> A_a B_b
0 -1.392628 1.020082
1 0.866707 0.654731
2 -0.320871 1.360513
String splitting with .split:
s = pd.Series(['a_b_C','c_d_e','f_g_h'])
s
>0 a_b_C
1 c_d_e
2 f_g_h
dtype: object
s.str.split('_')
>0 [a, b, C]
1 [c, d, e]
2 [f, g, h]
dtype: object
s.str.split('_',expand = True) #expand=True returns a DataFrame directly
0 1 2
0 a b C
1 c d e
2 f g h
Testing whether each string contains a substring:
s = pd.Series(['A','Aas','Afgew','Ager','Agre','Ager'])
s.str.contains('Ag')
>0 False
1 False
2 False
3 True
4 True
5 True
dtype: bool
Association analysis sometimes uses .get_dummies, whose effect is similar to .split('_',expand = True):
s = pd.Series(['a','a|b','a|c'])
s.str.get_dummies(sep = '|')
a b c
0 1 0 0
1 1 1 0
2 1 0 1