NumPy and pandas summary

BigDong305

Article link: https://blog.csdn.net/qq_39965716/article/details/85089202

Numpy

NumPy ndarray: the multidimensional array object
1>>np.zeros(10)
out: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


2>>np.zeros((3,6))
Out: 
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])


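3>> np.empty((2,3,2))  # empty() leaves the memory uninitialized; shape (2,3,2) means 2 blocks of 3 rows x 2 columns, and (2,3,4) would mean 2 blocks of 3 rows x 4 columns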
Out[18]: 
array([[[7.411e-322, 7.411e-322],
        [7.411e-322, 1.324e-321],
        [6.917e-322, 6.670e-322]],
        
       [[7.411e-322, 1.364e-321],
        [5.929e-322, 6.225e-322],
        [6.620e-322, 1.102e-321]]])



NumPy fancy indexing:

>>arr=np.arange(32).reshape((8,4))

res>

([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
       
       
       
>>arr[[1,5,7,2]][:,[0,3,1,2]]
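First take rows 1, 5, 7 and 2 (the 2nd, 6th, 8th and 3rd rows), then reorder the columns of that result as 0, 3, 1, 2 (the 1st, 4th, 2nd and 3rd columns).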


([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
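
The two-step selection above can also be written in one step. A minimal sketch, assuming the same 8x4 array; the variable names (rows, cols, two_step, one_step, paired) are only for illustration:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# Two-step form from the notes: pick rows, then reorder columns.
two_step = arr[rows][:, cols]

# np.ix_ builds an open mesh so the same rectangle is selected in one step.
one_step = arr[np.ix_(rows, cols)]

# Passing both lists directly pairs them up element by element instead,
# giving arr[1, 0], arr[5, 3], arr[7, 1], arr[2, 2].
paired = arr[rows, cols]
```

Note that arr[rows, cols] returns a 1-D array of four elements, not the 4x4 block, which is an easy mistake to make.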



arr=np.arange(15)
arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])


arr=np.arange(15).reshape((3,5))
# reshape takes the target shape; the tuple form reshape((3,5)) and the plain form reshape(3,5) are equivalent


     ([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])


arr.T


     ([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])



For a 3-D reshape((z,x,y)) the axes are numbered z = axis 0, x = axis 1, y = axis 2.

>>np.arange(16).reshape((4,4,1))  # 4 blocks of 4 rows x 1 column

([[[ 0],
        [ 1],
        [ 2],
        [ 3]],
       [[ 4],
        [ 5],
        [ 6],
        [ 7]],
       [[ 8],
        [ 9],
        [10],
        [11]],
       [[12],
        [13],
        [14],
        [15]]])


>>np.arange(16).reshape((2,4,2))  # 2 blocks of 4 rows x 2 columns

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7]],
       [[ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15]]])


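Transposing a 3-D array:

Start from an array of shape (2,3,5), i.e. axes (z,x,y).
arr.transpose(1,0,2)
swaps axes 0 and 1, so the result has shape (3,2,5), i.e. axis order (x,z,y).
This is not the same as reshape(3,2,5): transpose moves the elements along with their axes, while reshape keeps the flat element order.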
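
A small check of the axis swap, as a sketch on an assumed 2x3x5 array (the names swapped and reshaped are illustrative):

```python
import numpy as np

arr = np.arange(30).reshape((2, 3, 5))   # axes (z, x, y)
swapped = arr.transpose(1, 0, 2)          # axes (x, z, y)
reshaped = arr.reshape((3, 2, 5))         # same shape, different layout

print(swapped.shape)                      # (3, 2, 5)
# Element [i, j, k] of the transposed array is element [j, i, k] of arr.
print(swapped[2, 1, 4] == arr[1, 2, 4])   # True
print(np.array_equal(swapped, reshaped))  # False
```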



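Using np.where
np.where(cond,xarr,yarr)
Where cond is True the result takes the element from xarr, otherwise the element from yarr.
cond is an elementwise boolean test on the values, e.g. a>5 is True wherever a value of a is greater than 5.

Example 1: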
arr=np.random.randn(4,4)

array([[ 0.58362199, -1.22005289,  0.94003037,  0.85728213],
       [-0.36807289,  1.2904295 ,  0.51543205, -1.38009279],
       [ 0.71112867, -0.31929579,  1.28679074,  1.00649122],
       [-0.12385735, -0.72882532,  0.75112539,  1.04397347]])
       
       
>>np.where(arr>0,2,-2)    # elementwise if/else: 2 where arr>0, -2 elsewhere

array([[ 2, -2,  2,  2],
       [-2,  2,  2, -2],
       [ 2, -2,  2,  2],
       [-2, -2,  2,  2]])


Example 2:

>>aa = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> np.where(aa,1,-1)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])  # 0 is falsy, so the first element is -1


>>> np.where(aa > 5,1,-1)  # True only for the values 6, 7, 8, 9
array([-1, -1, -1, -1, -1, -1,  1,  1,  1,  1])


>>arr=np.array([1,1,1,1,1,1,1,1,1,1])

 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

>>np.where(arr>5,1,-1)  # no element of arr exceeds 5, so every position is False

 array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1])


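Example 3: when only the condition is given, np.where(condition) without x and y returns the indices where the condition holds (a tuple with one index array per dimension)

>>arr=np.array([2,5,4,6,8])

array([2, 5, 4, 6, 8])

>>np.where(arr>=5)
 (array([1, 3, 4]),)  # the indices of 5, 6 and 8

>>np.where(arr%2==1)  # indices where arr is odd
 (array([1]),)  # only 5 is odd, at index 1


>>arr=np.array([[1,0,3],[4,5,0]])
array([[1, 0, 3],
       [4, 5, 0]])

>>np.where(arr,[30,40,50],[60,70,80])
Each row of arr is matched position by position against the two lists: a nonzero (truthy) entry picks the value at the same position in [30,40,50], a zero picks the value at the same position in [60,70,80].
Row [1,0,3]: 1 is truthy -> 30, 0 is falsy -> 70, 3 is truthy -> 50
Row [4,5,0]: 4 is truthy -> 30, 5 is truthy -> 40, 0 is falsy -> 80
Result: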

array([[30, 70, 50],
       [30, 40, 80]])

A similar example:
np.where([[False, True], [True, True]],   
        [[1, 2], [3, 4]], [[9, 8], [7, 6]])
 
Here cond = [[False, True], [True, True]]
        x = [[1, 2], [3, 4]]
        y = [[9, 8], [7, 6]]
      
      cond[0][0] = False -> y gives 9
      cond[0][1] = True  -> x gives 2
      cond[1][0] = True  -> x gives 3
      cond[1][1] = True  -> x gives 4
      
res>>
array([[9, 2],
       [3, 4]])
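
Both forms of np.where side by side, as a short sketch on the 2x3 array used above (the names values, picked, rows and cols are illustrative):

```python
import numpy as np

values = np.array([[1, 0, 3], [4, 5, 0]])

# Three-argument form: the 1-D choices broadcast across each row of values.
picked = np.where(values, [30, 40, 50], [60, 70, 80])

# One-argument form: one index array per dimension.
rows, cols = np.where(values == 0)
```

Zipping rows and cols gives the coordinates of the zeros, here (0, 1) and (1, 2).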

np.dot: dot product



arr=np.arange(20).reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

>>arr2=np.ones(5)
array([1.,1.,1.,1.,1.])


np.dot(arr,arr2) or arr.dot(arr2): each row of arr is multiplied elementwise by arr2 and the products are summed

0 x 1 + 1 x 1 + 2 x 1 + 3 x 1 + 4 x 1 = 10
5 x 1 + 6 x 1 + 7 x 1 + 8 x 1 + 9 x 1 = 35
...

res>>
array([10., 35., 60., 85.])

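Using pandas

The examples below assume these imports:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Series
A Series is like a one-dimensional array with an index attached.
1. Index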
>>obj=Series([1,2,3,4,5,6],index=['c','d','a','b','e','f'])

c    1
d    2
a    3
b    4
e    5
f    6

>>obj.reindex(['a','b','c','d','e','f','j'])  # 'j' is deliberately not in the original index

a   3.00
b   4.00
c   1.00
d   2.00
e   5.00
f   6.00
j    nan

>>obj.reindex(['a','b','c','d','e','f','j'],fill_value=1)  # fill the missing entry with 1 instead of NaN

a    3
b    4
c    1
d    2
e    5
f    6
j    1

>>obj.drop(['a','b'])  # drop the labels together with their data

c    1
d    2
e    5
f    6


data=DataFrame(np.arange(16).reshape((4,4)),columns=['one','two','three','four'],index=['a','b','c','d'])


   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15


>>data[1:3]=0  # assignment: sets the 2nd and 3rd rows (positions 1 and 2) to 0

    one  two  three  four
a    0    1      2     3
b    0    0      0     0
c    0    0      0     0
d   12   13     14    15


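data[data<5]=0   # sets every value below 5 to 0


>>data[data<5]   # on the original data (before the assignments above): keeps values below 5, everything else becomes NaN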

   one  two  three  four
a 0.00 1.00   2.00  3.00
b 4.00  nan    nan   nan
c  nan  nan    nan   nan
d  nan  nan    nan   nan


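Sorting:
data.sort_index(by='one')   # old spelling, removed from later pandas versions
is equivalent to
data.sort_values('one')


data.sort_index(by=['one','two'])   # old spelling
is equivalent to
data.sort_values(['one','two'])

In practice, always use sort_values().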

Handling NaN data:

>>data=DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6,3]])

     0    1    2
0 1.00 6.50 3.00
1 1.00  nan  nan
2  nan  nan  nan
3  nan 6.00 3.00

>>data.dropna()  # drop every row that contains at least one NaN
 
     0    1    2
0 1.00 6.50 3.00


>>data.dropna(how='all')  # drop only the rows where every value is NaN
Out: 
     0    1    2
0 1.00 6.50 3.00
1 1.00  nan  nan
3  nan 6.00 3.00


Filling missing values:
>>data.fillna(1)
     
     0    1    2
0 1.00 6.50 3.00
1 1.00 1.00 1.00
2 1.00 1.00 1.00
3 1.00 6.00 3.00
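
The dropna and fillna calls above have a few more useful variants. A minimal sketch on the same 4x3 frame; the thresh value and the per-column fill dictionary are illustrative choices, not part of the original notes:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3],
                     [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6, 3]])

any_nan_dropped = data.dropna()            # keeps only row 0
all_nan_dropped = data.dropna(how="all")   # drops only row 2
at_least_two = data.dropna(thresh=2)       # keeps rows with >= 2 non-NaN values
filled = data.fillna({0: 0, 1: data[1].mean()})  # per-column fill values
```

Column 2 is not listed in the fill dictionary, so its NaN values are left as they are.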


Setting an index:
>>data=DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})


   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


The set_index function turns one or more columns into the row index.
>>data.set_index(['c','d'])  # use the values of columns 'c' and 'd' as a two-level row index
 
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

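Reading text-format data

data=pd.read_csv('xb.csv',encoding='utf-8')
is equivalent to
data=pd.read_table('xb.csv',sep=',',encoding='utf-8')  # read_table defaults to a tab separator, so it must be given here


Reading a file in chunks: iterate piece by piece according to chunksize
data=pd.read_csv('xb.csv',chunksize=1000)
tot=Series([],dtype=float)
for piece in data:
    tot=tot.add(piece['key'].value_counts(),fill_value=0)
tot=tot.sort_values(ascending=False)   # sort once after the loop; Series.order() was removed from pandas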
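
The chunked count can be tried without a file on disk. A minimal sketch, assuming a tiny in-memory CSV (csv_text is made up for illustration) in place of xb.csv and a chunk size of 2:

```python
import io
import pandas as pd

# Hypothetical stand-in for xb.csv: six rows with a 'key' column.
csv_text = "key,value\na,1\nb,2\na,3\nc,4\na,5\nb,6\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
tot = pd.Series(dtype=float)
for piece in reader:
    # Add this chunk's counts to the running total; missing keys count as 0.
    tot = tot.add(piece["key"].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
```

The totals come out as floats because fill_value=0 arithmetic passes through NaN alignment.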



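HDF5: an industrial-grade library
Data analysis workloads are mostly IO-bound rather than CPU-bound; HDF5 suits 'write once, read many' data.

Reading HDF5
hd5_read=pd.HDFStore('mydata.h5')
h5 = pd.HDFStore(path, 'r', complevel=4, complib='blosc')  # open for reading
h5 = pd.HDFStore(path, 'a', complevel=4, complib='blosc')  # open for appending/writing
h5.close() is required when you are done; the original note warns that memory can overflow otherwise.

TODO: look up the details
Interacting with MongoDB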

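Data wrangling: cleaning, transforming, merging, reshaping

Merging data:
merge parameters:
on: column name(s) to join on; must exist in both DataFrames. If not given, the intersection of the left and right column names is used as the join key.
left_on: column(s) in the left DataFrame to use as the join key
right_on: column(s) in the right DataFrame to use as the join key

left_index: use the left row index as the join key
right_index: use the right row index as the join key



pandas.merge(df1,df2)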


>>left1=DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5



>>right1=DataFrame({'group_val':[3.5,7]},index=['a','b'])


   group_val
a       3.50
b       7.00


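pd.merge(left1,right1,left_on='key',right_index=True)  # match left1's 'key' column against right1's index
This is an inner join (the default): keys with no match on the other side are dropped, so 'c' does not appear.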

  key  value  group_val
0   a      0       3.50
2   a      2       3.50
3   a      3       3.50
1   b      1       7.00
4   b      4       7.00

pd.merge(left1,right1,left_on='key',right_index=True,how='outer')
This is an outer join: keys that appear on only one side are kept, and the missing values are shown as NaN.

  key  value  group_val
0   a      0       3.50
2   a      2       3.50
3   a      3       3.50
1   b      1       7.00
4   b      4       7.00
5   c      5        nan
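
To see which rows had no partner in an outer merge, pandas can add a marker column. A sketch on the same two frames; indicator=True is an addition to the notes, and the name unmatched is illustrative:

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)})
right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

inner = pd.merge(left1, right1, left_on="key", right_index=True)
outer = pd.merge(left1, right1, left_on="key", right_index=True,
                 how="outer", indicator=True)

# Rows that exist only in left1 are flagged "left_only" in the _merge column.
unmatched = outer[outer["_merge"] == "left_only"]
```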



Using join

left=DataFrame([[1,2],[3,4],[5,6]],index=['a','c','e'],columns=['oh','new'])



Out[134]: 
   oh  new
a   1    2
c   3    4
e   5    6


right=DataFrame([[7,8],[9,10],[11,12],[13,14]],index=['b','c','d','e'],columns=['mis','alb'])

   mis  alb
b    7    8
c    9   10
d   11   12
e   13   14


>>left.join(right)  # left join by default: every index label of left is kept, with NaN where right has no matching label

   oh  new   mis   alb
a   1    2   nan   nan
c   3    4  9.00 10.00
e   5    6 13.00 14.00

>>left.join(right,how='outer')  # outer join: the union of both indexes, in sorted order

    oh  new   mis   alb
a 1.00 2.00   nan   nan
b  nan  nan  7.00  8.00
c 3.00 4.00  9.00 10.00
d  nan  nan 11.00 12.00
e 5.00 6.00 13.00 14.00
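
The three join modes compared on the same frames, as a short sketch (how='inner' is not shown in the notes above and is added here for contrast):

```python
import pandas as pd

left = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                    index=["a", "c", "e"], columns=["oh", "new"])
right = pd.DataFrame([[7, 8], [9, 10], [11, 12], [13, 14]],
                     index=["b", "c", "d", "e"], columns=["mis", "alb"])

left_join = left.join(right)                # index: a, c, e
inner_join = left.join(right, how="inner")  # index: c, e
outer_join = left.join(right, how="outer")  # index: a, b, c, d, e
```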


concat: concatenating Series
pandas.concat()

>>s=Series([0,1],index=['a','b'])

>>s1=Series([2,3,4],index=['c','d','e'])

>>s2=Series([5,6],index=['f','g'])
>>pd.concat([s,s1,s2])  # vertical concatenation (axis=0, the default)

a    0
b    1
c    2
d    3
e    4
f    5
g    6

>>pd.concat([s,s1,s2],axis=1)  # horizontal concatenation: each Series becomes a column


     0    1    2
a 0.00  nan  nan
b 1.00  nan  nan
c  nan 2.00  nan
d  nan 3.00  nan
e  nan 4.00  nan
f  nan  nan 5.00
g  nan  nan 6.00


s4=pd.concat([s*5])  # the values of s multiplied by 5




>>pd.concat([s,s4],axis=1,join='inner')  # inner keeps only the labels common to both; the default is outer

   0  1
a  0  0
b  1  5


Usage of where:

Reshaping and pivoting

stack: rotates the columns of the data into rows
unstack: rotates the rows of the data into columns

>>data=DataFrame(np.arange(6).reshape(2,3),index=pd.Index(['oh','col'],name='state'),columns=pd.Index(['one','two','three'],name='number'))


number  one  two  three
state                  
oh        0    1      2
col       3    4      5


>>data.stack()    # columns to rows

state  number
oh     one       0
       two       1
       three     2
col    one       3
       two       4
       three     5


>>data.unstack()   # rows to columns


number  state
one     oh       0
        col      3
two     oh       1
        col      4
three   oh       2
        col      5
dtype: int32
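
stack and unstack undo each other, which this sketch checks on the same small frame (the names stacked, restored and by_state are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=pd.Index(["oh", "col"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))

stacked = data.stack()               # Series indexed by (state, number)
restored = stacked.unstack()         # back to a state x number DataFrame
by_state = stacked.unstack("state")  # pivot the other level into columns
```

By default unstack pivots the innermost index level; naming a level, as in unstack("state"), pivots that one instead.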

Data transformation 1. Removing duplicate data

data=DataFrame({'k1':['one']*3 + ['two']*4,'k2':[1,1,2,3,3,4,4]})

    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4


>>data.duplicated()  # flags rows that repeat an earlier row; returns booleans

 
0    False
1     True    # row 1 repeats row 0, row 4 repeats row 3, row 6 repeats row 5
2    False
3    False
4     True
5    False
6     True


>>data.drop_duplicates()  # drop the duplicated rows

    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4



data['v1'] =range(7)

    k1  k2  v1
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6

>>data.drop_duplicates(['k1'])  # deduplicate on the given column(s); the first occurrence is kept by default

    k1  k2  v1
0  one   1   0
3  two   3   3

data.drop_duplicates(['k1','k2'],keep='last')  # keep chooses which duplicate survives: 'first' (default), 'last', or False to drop every row that has a duplicate

    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6
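
The three keep options compared on the same data, as a minimal sketch (the result names first, last and none are illustrative):

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one"] * 3 + ["two"] * 4,
                     "k2": [1, 1, 2, 3, 3, 4, 4],
                     "v1": range(7)})

first = data.drop_duplicates(["k1", "k2"])               # rows 0, 2, 3, 5
last = data.drop_duplicates(["k1", "k2"], keep="last")   # rows 1, 2, 4, 6
none = data.drop_duplicates(["k1", "k2"], keep=False)    # row 2 only
```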


Replacing values with replace:

>>data.replace([0,1],[1,2])  # lists replace several values at once: 0 -> 1 and 1 -> 2
is equivalent to
>>data.replace({0:1,1:2})


    k1  k2  v1
0  one   2   1
1  one   2   2
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6





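Renaming index and column labels:
Rename row label a to apple
and column one to zero:
data.rename(index={'a':'apple'},columns={"one":"zero"}) 

data.rename(index={'a':'apple'},columns={"one":"zero"},inplace=True)  # modifies data in place: evaluating data afterwards shows the renamed labels, not the earlier ones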


data

    k1  k2  v1
0  one   1   0

data.rename(columns={'k1':'kk'},inplace=True)

    kk  k2  v1
0  one   1   0

>>data  # data itself now shows kk; without inplace=True it would still show the old name

    kk  k2  v1
0  one   1   0


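pd.cut(data,bins): binning

Splits data into intervals according to bins.
Split ages into 18 to 25, 25 to 35, 35 to 60, and 60 to 100

>>ages=[20,30,25,26,35,36,65,66,100]

>>bins=[18,25,35,60,100]

>>cat=pd.cut(ages,bins)   # intervals are open on the left and closed on the right by default; pass right=False for closed-left, open-right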

Out[]: 
[(18, 25], (25, 35], (18, 25], (25, 35], (25, 35], (35, 60], (60, 100], (60, 100], (60, 100]]
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
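
How the boundary values 25, 35 and 100 fall depends on which side is closed. A sketch on the same ages; the label names passed to labels= are made up for illustration:

```python
import pandas as pd

ages = [20, 30, 25, 26, 35, 36, 65, 66, 100]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)              # (18, 25], (25, 35], ...
left_closed = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...
# Hypothetical bin names, one per interval.
named = pd.cut(ages, bins, labels=["youth", "young_adult", "middle_aged", "senior"])
```

The codes attribute gives the bin number of each value; with right=False the age 100 falls outside the last bin and gets code -1 (NaN).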


data=np.random.randn(20)


>>pd.cut(data,2,precision=2)  # split the data into 2 equal-width bins; precision sets the number of decimals in the bin labels (default 3)




Inspecting data:
>>data=DataFrame(np.random.randn(1000,4))

>>data.describe()   # summary statistics for each column
  
            0       1       2       3
count 1000.00 1000.00 1000.00 1000.00
mean    -0.04    0.05   -0.02    0.01
std      1.04    0.98    1.03    1.03
min     -3.30   -2.93   -3.52   -3.19
25%     -0.74   -0.57   -0.73   -0.71
50%     -0.05    0.06   -0.02    0.02
75%      0.72    0.71    0.68    0.72
max      3.00    3.50    3.36    3.57

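String operations:

String methods:
strip, rstrip, lstrip: remove leading and/or trailing whitespace

split(): split a string on the given separator

ljust, rjust: pad the opposite side of a string with spaces (or another character) to reach a minimum width

Visualization with matplotlib: see the blog post

Data aggregation and grouping

Removing levels of a hierarchical index: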

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11



Drop the outer level of the row index:

frame.index=frame.index.droplevel()
frame

   Ohio     Colorado
  Green Red    Green
1     0   1        2
2     3   4        5
1     6   7        8
2     9  10       11


# drop the outer level of the column index
frame.columns=frame.columns.droplevel()
frame
Out[255]: 
   Green  Red  Green
1      0    1      2
2      3    4      5
1      6    7      8
2      9   10     11


Using groupby:

df=DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})


   data1  data2 key1 key2
0   1.98  -1.79    a  one
1  -2.16   0.57    a  two
2   1.52   0.50    b  one
3  -0.46   0.69    b  two
4  -0.60   0.04    a  one

Group the data1 column by key1

d=df['data1'].groupby(df['key1'])  # d is a lazy GroupBy object; printing it shows no data until you run an aggregation
d.mean()  # mean of each group

key1
a   -0.26
b    0.53
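
The same grouping with fixed numbers, so the results can be checked by hand. A sketch; the data1 values are made up and replace the random ones above:

```python
import pandas as pd

df = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": [1.0, 2.0, 3.0, 4.0, 6.0]})

grouped = df.groupby("key1")["data1"]
means = grouped.mean()                        # a -> 3.0, b -> 3.5
stats = grouped.agg(["count", "min", "max"])  # several aggregations at once
by_two_keys = df.groupby(["key1", "key2"])["data1"].sum()
```

Grouping by two keys returns a Series with a two-level index, e.g. by_two_keys.loc[("a", "one")].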

Dates and times

from datetime import datetime

>>now=datetime.now()
out:
datetime.datetime(2018, 12, 11, 11, 19, 54, 575000)

now.year
Out[]: 2018

now.hour
Out[]: 11

now.year,now.hour,now.day
Out[]: (2018, 11, 11)

delta=datetime(2018,1,7)-datetime(2008,6,25,8,15)
Out[]: datetime.timedelta(3482, 56700)

delta.days
Out[]: 3482

delta.seconds
Out[]: 56700

Parse a time string such as '15:30' with '%H:%M'; for a string like '15-30' use '%H-%M' instead

data='15:30'
a=datetime.strptime(data,'%H:%M').time()

Out[]: 
    datetime.time(15, 30)



Datetime to string: use strftime with a format string (format codes, not a regular expression)
>>stamp=datetime(2011,12,11)

Out:
    datetime.datetime(2011, 12, 11, 0, 0)

>>str(stamp)
Out[]: '2011-12-11 00:00:00'

>> stamp.strftime('%Y_%m_%d')
Out[37]: '2011_12_11'


String to datetime:
The format string must mirror the input string exactly.
'2018-12-11' is matched by '%Y-%m-%d'
'2018/12/1' is matched by '%Y/%m/%d'

Example:
value='2018-12-11'
>>datetime.strptime(value,'%Y-%m-%d')
This returns a datetime
Out[40]: datetime.datetime(2018, 12, 11, 0, 0)

Date only: date()
datetime.strptime(value,'%Y-%m-%d').date()
Out[]: datetime.date(2018, 12, 11)

Time only: time()
datetime.strptime(value,'%Y-%m-%d').time()
Out[42]: datetime.time(0, 0)


Converting several strings to datetimes

datestr=['7/12/2018','5/11/2018']
[datetime.strptime(x,'%d/%m/%Y')for x in datestr]
Out[44]: [datetime.datetime(2018, 12, 7, 0, 0), datetime.datetime(2018, 11, 5, 0, 0)]

[datetime.strptime(x,'%d/%m/%Y').date()for x in datestr]
Out[45]: [datetime.date(2018, 12, 7), datetime.date(2018, 11, 5)]

[datetime.strptime(x,'%d/%m/%Y').time()for x in datestr]
Out[46]: [datetime.time(0, 0), datetime.time(0, 0)]




A shortcut for string-to-datetime conversion:

from dateutil.parser import parse

parse('2018-12-11')
Out[49]: datetime.datetime(2018, 12, 11, 0, 0)

parse('2018-12-11').time()
Out[50]: datetime.time(0, 0)

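Same result as the strptime version above

parse('jan 31,1997 10:45 PM')
Out[51]: datetime.datetime(2018, 1, 31, 22, 45)
Note the year in this output is 2018, not 1997; written with a space after the comma, parse('Jan 31, 1997 10:45 PM') returns datetime.datetime(1997, 1, 31, 22, 45)

dayfirst=True says the day comes before the month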
parse('6/12/2018',dayfirst=True)
Out[52]: datetime.datetime(2018, 12, 6, 0, 0)


Without dayfirst, the month is assumed to come first
parse('6/12/2018')
Out[53]: datetime.datetime(2018, 6, 12, 0, 0)


Using pd.to_datetime():
>>datestr
Out[54]: ['7/12/2018', '5/11/2018']

>>pd.to_datetime(datestr)
Out[55]: DatetimeIndex(['2018-07-12', '2018-05-11'], dtype='datetime64[ns]', freq=None)

>>pd.to_datetime(datestr+[None])
Out[56]: DatetimeIndex(['2018-07-12', '2018-05-11', 'NaT'], dtype='datetime64[ns]', freq=None)

>>id=pd.to_datetime(datestr+[None])
id
Out[58]: DatetimeIndex(['2018-07-12', '2018-05-11', 'NaT'], dtype='datetime64[ns]', freq=None)

>>id[2]
Out[59]: NaT
NaT (Not a Time) is pandas's NA value for timestamp data

>>pd.isna(id)
Out[60]: array([False, False,  True])


Selecting by time
ts=Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
1000 consecutive days starting on 1 January 2000

Out[70]: 
2000-01-01   -1.56
2000-01-02    0.82
2000-01-03    2.00
....
2002-09-24   -0.08
2002-09-25    0.80
2002-09-26    2.85

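dates=pd.date_range('1/1/2000',periods=100,freq='W-WED')  # one date per week, on Wednesdays
DatetimeIndex(['2000-01-05', '2000-01-12', '2000-01-19', ...],freq='W-WED')

Values for the freq parameter
D: Day, every calendar day
B: BusinessDay, every business day
H: Hour, every hour
...
W-MON: Week, weekly on the given day of the week
BM: BusinessMonthEnd, the last business day of each month
MS: MonthBegin, the first calendar day of each month

and others...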



Time zone conversion:

>>stamp=pd.Timestamp('2018-03-04 05:00:00')
>>stamp_utc=stamp.tz_localize('utc')
>>stamp_utc
Out[81]: Timestamp('2018-03-04 05:00:00+0000', tz='UTC')

>>stamp_utc.tz_convert('US/Eastern')
Out[83]: Timestamp('2018-03-04 00:00:00-0500', tz='US/Eastern')


Resampling:
rng=pd.date_range('1/1/2018',periods=12,freq='T')
ts=Series(np.arange(12),index=rng)
out:
2018-01-01 00:00:00     0
2018-01-01 00:01:00     1
2018-01-01 00:02:00     2
2018-01-01 00:03:00     3
2018-01-01 00:04:00     4
2018-01-01 00:05:00     5
2018-01-01 00:06:00     6
2018-01-01 00:07:00     7


ts.resample('5min').sum()   # the how= argument was removed from pandas; call the aggregation on the resampler instead

Out[90]: 
2018-01-01 00:00:00    10
2018-01-01 00:05:00    35
2018-01-01 00:10:00    21
Freq: 5T, dtype: int32

The default is closed='left', so each bin covers minutes 0-4 (5 values); closed='right' makes it cover minutes 1-5

>>ts.resample('5min',closed='right',label='right').sum()


Out[99]: 
2018-01-01 00:00:00     0
2018-01-01 00:05:00    15
2018-01-01 00:10:00    40
2018-01-01 00:15:00    11
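
Both bin conventions in the current resample style, as a sketch on the same 12-minute series (freq='min' is used here in place of 'T'):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2018-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Default: bins are closed on the left, [00:00, 00:05), [00:05, 00:10), ...
left_closed = ts.resample("5min").sum()

# Closed on the right and labelled by the right edge: (23:55, 00:00], (00:00, 00:05], ...
right_closed = ts.resample("5min", closed="right", label="right").sum()
```

With right-closed bins the value at 00:00 sits alone in the first bin, which is why that bin sums to 0.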



Note: with fillna, the ffill method fills a gap with the previous value



rng=pd.date_range('2012-06-01 09:30','2012-06-01 15:59',freq='T')
rng=rng.append([rng+pd.offsets.BDay(i) for i in range(1,4)])
ts=Series(np.arange(len(rng),dtype=float),index=rng)


2012-06-01 09:30:00      0.00
2012-06-01 09:31:00      1.00
2012-06-01 09:32:00      2.00
2012-06-01 09:33:00      3.00
2012-06-01 09:34:00      4.00
2012-06-01 09:35:00      5.00
2012-06-01 09:36:00      6.00
2012-06-01 09:37:00      7.00
        .....
2012-06-06 15:54:00   1554.00
2012-06-06 15:55:00   1555.00
2012-06-06 15:56:00   1556.00
2012-06-06 15:57:00   1557.00
2012-06-06 15:58:00   1558.00
2012-06-06 15:59:00   1559.00


from datetime import time
>> ts[time(10,0)]  # the values recorded at 10:00 on each day

is equivalent to
>>ts.at_time(time(10,0))

Out[115]: 
2012-06-01 10:00:00     30.00
2012-06-04 10:00:00    420.00
2012-06-05 10:00:00    810.00
2012-06-06 10:00:00   1200.00

ts.between_time(time(10,0),time(10,2))
Returns the values between 10:00 and 10:02, both ends included

Out[118]: 
2012-06-01 10:00:00     30.00
2012-06-01 10:01:00     31.00
2012-06-01 10:02:00     32.00
2012-06-04 10:00:00    420.00
2012-06-04 10:01:00    421.00
2012-06-04 10:02:00    422.00
2012-06-05 10:00:00    810.00
2012-06-05 10:01:00    811.00
2012-06-05 10:02:00    812.00
2012-06-06 10:00:00   1200.00
2012-06-06 10:01:00   1201.00
2012-06-06 10:02:00   1202.00

Advanced usage

arr1=np.array([[1,2,3],[4,5,6]])
arr2=np.array([[1,2,3],[4,5,6]])

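np.concatenate([arr1,arr2],axis=1)  # pass the arrays in a list; axis (0 or 1) sets the direction
is equivalent to
np.hstack([arr1,arr2])  # horizontal stacking: the arrays are placed side by side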

Out[8]: 
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

np.concatenate([arr1,arr2],axis=0)
is equivalent to
np.vstack([arr1,arr2])  # vertical stacking: the arrays are placed one below the other

Out[9]: 
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])



repeat: repeating elements

arr=np.random.randn(2,2)
Out[14]: 
array([[1.09629858, 1.71509561],
       [0.41436319, 0.13921361]])
       

>>arr.repeat(2,axis=1)  # repeat each element along the columns (horizontally)
Out[18]: 
array([[1.09629858, 1.09629858, 1.71509561, 1.71509561],
       [0.41436319, 0.41436319, 0.13921361, 0.13921361]])


arr.repeat(2,axis=0)  # repeat each row (vertically)
Out[19]: 
array([[1.09629858, 1.71509561],
       [1.09629858, 1.71509561],
       [0.41436319, 0.13921361],
       [0.41436319, 0.13921361]])


arr.repeat([2,3],axis=0)   # per-row repeat counts: the first row twice, the second row three times
Out[21]: 
array([[1.09629858, 1.71509561],
       [1.09629858, 1.71509561],
       [0.41436319, 0.13921361],
       [0.41436319, 0.13921361],
       [0.41436319, 0.13921361]])



np.tile(arr,2)   # copies the whole array; a single count tiles along the last axis, i.e. side by side
Out[23]: 
array([[1.09629858, 1.71509561, 1.09629858, 1.71509561],
       [0.41436319, 0.13921361, 0.41436319, 0.13921361]])

np.tile(arr,(2,1))  # to stack the copies vertically, pass (2,1)
Out[24]: 
array([[1.09629858, 1.71509561],
       [0.41436319, 0.13921361],
       [1.09629858, 1.71509561],
       [0.41436319, 0.13921361]])
np.tile(arr,(3,2))  # both directions at once: (3,2) means 3 copies down and 2 copies across
Out[25]: 
array([[1.09629858, 1.71509561, 1.09629858, 1.71509561],
       [0.41436319, 0.13921361, 0.41436319, 0.13921361],
       [1.09629858, 1.71509561, 1.09629858, 1.71509561],
       [0.41436319, 0.13921361, 0.41436319, 0.13921361],
       [1.09629858, 1.71509561, 1.09629858, 1.71509561],
       [0.41436319, 0.13921361, 0.41436319, 0.13921361]])
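
The difference between repeat and tile is easier to see with small integers. A sketch, assuming a 2x2 integer array in place of the random one above:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

repeated = arr.repeat(2, axis=1)   # each element doubled in place
tiled = np.tile(arr, 2)            # the whole array copied side by side
tiled_down = np.tile(arr, (2, 1))  # the whole array copied downwards
```

repeat works element by element, giving [[1, 1, 2, 2], [3, 3, 4, 4]], while tile copies the block, giving [[1, 2, 1, 2], [3, 4, 3, 4]].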




arr=np.arange(10)*100
Out[27]: array([  0, 100, 200, 300, 400, 500, 600, 700, 800, 900])

>>inds=[7,1,2,6]

>>arr[inds]
is equivalent to
>>arr.take(inds)
Out[31]: array([700, 100, 200, 600])

arr.put(inds,42)  # overwrite the positions listed in inds with 42
arr
Out[33]: array([  0,  42,  42, 300, 400, 500,  42,  42, 800, 900])

arr.put(inds,[40,41,42,43])  # one value per position
arr
Out[35]: array([  0,  41,  42, 300, 400, 500,  43,  40, 800, 900])


Selecting columns

inds=[2,0,2,1]
arr=np.random.randn(2,4)
arr
Out[40]: 
array([[-1.06828244,  0.17687066, -0.02489764, -0.74426123],
       [-0.87043835, -0.57858744, -0.08275799, -0.34850213]])
       
>>arr.take(inds,axis=-1)  # pick columns 2, 0, 2, 1 in that order
Out[41]: 
array([[-0.02489764, -1.06828244, -0.02489764,  0.17687066],
       [-0.08275799, -0.87043835, -0.08275799, -0.57858744]])
       
Note: put has no axis argument; it always indexes the flattened array



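Broadcasting:
For example

(3,3)  +  (3,1)  ->  (3,3)   the single column is added to every column of the 3x3 array

a (3,3)    b (3,1)    b broadcast    a + b
1,2,3      1          1,1,1          2,3,4
4,5,6      2          2,2,2          6,7,8
7,8,9      3          3,3,3          10,11,12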
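
The same addition in code, with a row vector added for contrast. A minimal sketch; the names a, col and row are illustrative:

```python
import numpy as np

a = np.arange(1, 10).reshape((3, 3))   # shape (3, 3)
col = np.array([[1], [2], [3]])        # shape (3, 1)
row = np.array([1, 2, 3])              # shape (3,)

plus_col = a + col   # col is stretched across the columns
plus_row = a + row   # row is stretched down the rows
```

A (3,) array lines up with the last axis, so it behaves as a row; to use it as a column it has to be reshaped to (3,1), for example with row[:, np.newaxis].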





Using accumulate

arr=np.arange(15).reshape((3,5))
arr
Out[52]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
       
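np.add.accumulate(arr,axis=0)  # axis defaults to 0
Out[54]: 
array([[ 0,  1,  2,  3,  4],        # running sum down the rows: each row adds the rows above it
       [ 5,  7,  9, 11, 13],
       [15, 18, 21, 24, 27]])
       
np.add.accumulate(arr,axis=1)        # running sum across the columns: each column adds the columns before it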
Out[55]: 
array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]])


outer: the result has as many dimensions as both inputs combined (their shapes are concatenated)
arr1=np.random.randn(3,4)
arr2=np.random.randn(5)
result=np.subtract.outer(arr1,arr2)
result.shape
Out[61]: (3, 4, 5)


reduceat: aggregation over slices

arr=np.arange(10)
arr
Out[67]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.add.reduceat(arr,[0,5,8])  # equivalent to summing [0:5], [5:8] and [8:]
Out[68]: array([10, 18, 17])
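
The three ufunc methods above can be checked against plain slicing and cumsum. A sketch; by_hand, grid, running and diff are illustrative names, and fixed inputs replace the random arrays:

```python
import numpy as np

arr = np.arange(10)

# reduceat sums the slices [0:5], [5:8] and [8:].
by_hand = np.array([arr[0:5].sum(), arr[5:8].sum(), arr[8:].sum()])
with_reduceat = np.add.reduceat(arr, [0, 5, 8])

# accumulate along an axis matches cumsum for np.add.
grid = np.arange(15).reshape((3, 5))
running = np.add.accumulate(grid, axis=1)

# outer concatenates the input shapes: (3, 4) and (5,) give (3, 4, 5).
diff = np.subtract.outer(np.ones((3, 4)), np.arange(5))
```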