Common usage of pandas, numpy, and scipy

若云流风

Article link: https://blog.csdn.net/ruoyunliufeng/article/details/80012206


Importing the standard libraries

    In [139]: 
  

import matplotlib.pyplot as plt
%matplotlib notebook 
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
sns.reset_orig()

import pandas as pd
import numpy as np
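import scipy as sp
# Import each SciPy submodule used below explicitly; older SciPy releases do not load them with a bare `import scipy`
import scipy.io
import scipy.stats
import scipy.sparse
import scipy.optimize
import scipy.linalg
import scipy.interpolate
import scipy.special
import scipy.ndimage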

I. pandas

1. DataFrame: a tabular data structure

    In [111]: 
  

df = pd.DataFrame(data={'y': [1, 2, 3],
                       'score': [93.5, 89.4, 90.3],
                       'name': ['Dirac', 'Pauli', 'Bohr'],
                       'birthday': ['1902-08-08', '1900-04-25', '1895-10-07']})
print(type(df))
print(df.dtypes)
df

<class 'pandas.core.frame.DataFrame'>
birthday     object
name         object
score       float64
y             int64
dtype: object

      Out[111]: 
    

	birthday	name	score	y
0	1902-08-08	Dirac	93.5	1
1	1900-04-25	Pauli	89.4	2
2	1895-10-07	Bohr	90.3	3

2. read_csv: reading CSV data

    In [112]: 
  

df.to_csv("./test.csv")

    In [113]: 
  

df = pd.read_csv('./test.csv')
df

      Out[113]: 
    

	Unnamed: 0	birthday	name	score	y
0	0	1902-08-08	Dirac	93.5	1
1	1	1900-04-25	Pauli	89.4	2
2	2	1895-10-07	Bohr	90.3	3
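The `Unnamed: 0` column above appears because `to_csv` writes the index by default and `read_csv` then reads it back as an ordinary column. A minimal sketch of two common ways to avoid it, using an in-memory buffer (`io.StringIO`) instead of a file so the example stays self-contained; the small frame here is illustrative, not the one from the cells above:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Dirac', 'Pauli', 'Bohr'],
                   'score': [93.5, 89.4, 90.3]})

# Option 1: do not write the index at all
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_no_index = pd.read_csv(buf)

# Option 2: write the index, then tell read_csv that column 0 holds it
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
df_index_col = pd.read_csv(buf, index_col=0)

print(list(df_no_index.columns))   # ['name', 'score']
print(list(df_index_col.columns))  # ['name', 'score']
```

Either option works; `index=False` is usually simpler when the index carries no information.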

3. Series: a one-dimensional array-like object

    In [114]: 
  

items = pd.Series(data=[93.5, 89.4, 90.3], name='score')
print(type(items))
items

<class 'pandas.core.series.Series'>

      Out[114]: 
    

0    93.5
1    89.4
2    90.3
Name: score, dtype: float64

4. concat: concatenating data along an axis

    In [115]: 
  

items2 = pd.Series(data=['1902-08-08', '1900-04-25'], name='birthday')
print('')
print(items2)
print('')
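print('Concatenated along axis=0 (stacked end to end):')
print(pd.concat(objs=[items, items2], axis=0))
print('')
print('Concatenated along axis=1 (side by side as columns):')
print(pd.concat(objs=[items, items2], axis=1))

0    1902-08-08
1    1900-04-25
Name: birthday, dtype: object

Concatenated along axis=0 (stacked end to end):
0          93.5
1          89.4
2          90.3
0    1902-08-08
1    1900-04-25
dtype: object

Concatenated along axis=1 (side by side as columns):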
   score    birthday
0   93.5  1902-08-08
1   89.4  1900-04-25
2   90.3         NaN
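With `axis=0` the original index labels are kept, so the labels 0 and 1 appear twice in the result. A small sketch of how `ignore_index=True` renumbers the rows, and how `axis=1` aligns on the index and fills the gap with NaN; the two Series mirror the ones used above:

```python
import pandas as pd

items = pd.Series([93.5, 89.4, 90.3], name='score')
items2 = pd.Series(['1902-08-08', '1900-04-25'], name='birthday')

# axis=0: stack end to end; ignore_index=True builds a fresh 0..n-1 index
stacked = pd.concat([items, items2], axis=0, ignore_index=True)

# axis=1: align on the index; rows missing from one input become NaN
side_by_side = pd.concat([items, items2], axis=1)

print(list(stacked.index))   # [0, 1, 2, 3, 4]
print(side_by_side.shape)    # (3, 2)
```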

5. to_datetime: converting strings to datetime values

    In [116]: 
  

pd.to_datetime(arg=df.birthday, format='%Y-%m-%d')

      Out[116]: 
    

0            1902-08-08
1            1900-04-25
2   1895-10-07 00:00:00
Name: birthday, dtype: datetime64[ns]
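Real data often contains entries that do not match the expected format. As a hedged illustration (the value `'unknown'` is made up for this example), `errors='coerce'` turns unparseable entries into `NaT` instead of raising an exception:

```python
import pandas as pd

raw = pd.Series(['1902-08-08', '1900-04-25', 'unknown'])

# Entries that do not match the format become NaT rather than raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().tolist())  # [False, False, True]
print(parsed.dt.year.iloc[0])  # 1902
```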

6. merge: merging data

    In [41]: 
  

df_new = pd.DataFrame(data=list(zip(['Dirac', 'Pauli', 'Bohr', 'Einstein'],
                                    [True, False, True, True])),
                      columns=['name', 'friendly'])

df_merge = pd.merge(left=df, right=df_new, on='name', how='outer')
df_merge

      Out[41]: 
    

	Unnamed: 0	birthday	name	score	y	friendly
0	0.0	1902-08-08	Dirac	93.5	1.0	True
1	1.0	1900-04-25	Pauli	89.4	2.0	False
2	2.0	1895-10-07	Bohr	90.3	3.0	True
3	NaN	NaN	Einstein	NaN	NaN	True
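The `how` argument decides which keys survive the merge. The following sketch, on small illustrative frames rather than the ones above, contrasts `inner` and `outer` and uses `indicator=True` to record where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'name': ['Dirac', 'Pauli', 'Bohr'],
                     'score': [93.5, 89.4, 90.3]})
right = pd.DataFrame({'name': ['Dirac', 'Pauli', 'Bohr', 'Einstein'],
                      'friendly': [True, False, True, True]})

# inner: keep only names present in both frames
inner = pd.merge(left, right, on='name', how='inner')

# outer: keep every name; indicator=True adds a '_merge' column
outer = pd.merge(left, right, on='name', how='outer', indicator=True)

print(len(inner), len(outer))  # 3 4
print(outer.loc[outer['name'] == 'Einstein', '_merge'].iloc[0])  # right_only
```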

7. date_range: building a time-series index

    In [117]: 
  

# freq='M' means month-end; pandas 2.2+ prefers the alias 'ME'
pd.date_range(start=df.birthday[2], end=df.birthday[0],
              freq='M')

      Out[117]: 
    

DatetimeIndex(['1895-10-31', '1895-11-30', '1895-12-31', '1896-01-31',
               '1896-02-29', '1896-03-31', '1896-04-30', '1896-05-31',
               '1896-06-30', '1896-07-31', '1896-08-31', '1896-09-30',
               '1896-10-31', '1896-11-30', '1896-12-31', '1897-01-31',
               '1897-02-28', '1897-03-31', '1897-04-30', '1897-05-31',
               '1897-06-30', '1897-07-31', '1897-08-31', '1897-09-30',
               '1897-10-31', '1897-11-30', '1897-12-31', '1898-01-31',
               '1898-02-28', '1898-03-31', '1898-04-30', '1898-05-31',
               '1898-06-30', '1898-07-31', '1898-08-31', '1898-09-30',
               '1898-10-31', '1898-11-30', '1898-12-31', '1899-01-31',
               '1899-02-28', '1899-03-31', '1899-04-30', '1899-05-31',
               '1899-06-30', '1899-07-31', '1899-08-31', '1899-09-30',
               '1899-10-31', '1899-11-30', '1899-12-31', '1900-01-31',
               '1900-02-28', '1900-03-31', '1900-04-30', '1900-05-31',
               '1900-06-30', '1900-07-31', '1900-08-31', '1900-09-30',
               '1900-10-31', '1900-11-30', '1900-12-31', '1901-01-31',
               '1901-02-28', '1901-03-31', '1901-04-30', '1901-05-31',
               '1901-06-30', '1901-07-31', '1901-08-31', '1901-09-30',
               '1901-10-31', '1901-11-30', '1901-12-31', '1902-01-31',
               '1902-02-28', '1902-03-31', '1902-04-30', '1902-05-31',
               '1902-06-30', '1902-07-31'],
              dtype='datetime64[ns]', freq='M')

8. read_table: reading tabular data, similar to read_csv

    In [119]: 
  

df = pd.read_table(filepath_or_buffer='test.csv')
df

      Out[119]: 
    

	,birthday,name,score,y
0	0,1902-08-08,Dirac,93.5,1
1	1,1900-04-25,Pauli,89.4,2
2	2,1895-10-07,Bohr,90.3,3
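In the output above every row landed in a single column because `read_table` splits on tabs by default, while the file is comma-separated. A minimal sketch with an in-memory CSV string (the text is made up for this example) shows that passing `sep=','` restores the columns:

```python
import io
import pandas as pd

csv_text = "name,score\nDirac,93.5\nPauli,89.4\n"

# Default separator is a tab, so each line becomes one field
one_column = pd.read_table(io.StringIO(csv_text))

# With sep=',' read_table behaves like read_csv
split = pd.read_table(io.StringIO(csv_text), sep=',')

print(one_column.shape)     # (2, 1)
print(list(split.columns))  # ['name', 'score']
```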

9. util.testing: a module that bundles many commonly used helpers

    In [128]: 
  

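# pandas.util.testing has been deprecated since pandas 1.0 (pandas.testing is the public module).
# `tm.np` was simply NumPy as imported inside that module, so calling NumPy directly is equivalent.
np.random.choice(['red', 'green'], 10)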

      Out[128]: 
    

array(['green', 'red', 'green', 'red', 'red', 'green', 'green', 'red',
       'red', 'red'], 
      dtype='|S5')

10. isnull: checking for missing values

    In [121]: 
  

test_list = [[None, 1, 2, 3, 4], [None, 1, None, 3, None]]
print(pd.isnull(test_list))

pd.isnull(df_merge)

[[ True False False False False]
 [ True False  True False  True]]

      Out[121]: 
    

	Unnamed: 0	birthday	name	score	y	friendly
0	False	False	False	False	False	False
1	False	False	False	False	False	False
2	False	False	False	False	False	False
3	True	True	False	True	True	False

11. value_counts: counting how often each value occurs

       In [121]: 
     

# Count how often each value occurs in column y (NaN is excluded by default)
df_merge.y.value_counts()
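Since no output is shown for this cell, here is a small self-contained sketch on a made-up Series of colors. It shows that `value_counts` sorts by frequency in descending order and that `normalize=True` returns proportions instead of counts:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'red', 'red', 'blue'])

counts = colors.value_counts()                # most frequent value first
shares = colors.value_counts(normalize=True)  # fractions that sum to 1

print(counts['red'])   # 3
print(shares['red'])   # 0.6
```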

II. numpy

1. array: the basic array type

    In [46]: 
  

np.array(object=[[1, 9, 9, 1], [2, 0, 1, 6]], dtype=np.float32)

      Out[46]: 
    

array([[ 1.,  9.,  9.,  1.],
       [ 2.,  0.,  1.,  6.]], dtype=float32)

2. zeros: creating an array filled with zeros

    In [47]: 
  

np.zeros(shape=(2, 4), dtype=int)

      Out[47]: 
    

array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

3. arange: generating an array (start, stop, step)

    In [48]: 
  

np.arange(start=1.5, stop=8.5, step=0.7, dtype=float)

      Out[48]: 
    

array([ 1.5,  2.2,  2.9,  3.6,  4.3,  5. ,  5.7,  6.4,  7.1,  7.8])

4. sqrt: element-wise square root

    In [49]: 
  

np.sqrt([16, 9, 4])

      Out[49]: 
    

array([ 4.,  3.,  2.])

5. ones: creating an array filled with ones

    In [50]: 
  

np.ones(shape=(2, 3, 1), dtype=np.str_)  # np.unicode was removed from recent NumPy; np.str_ is the equivalent

      Out[50]: 
    

array([[[u'1'],
        [u'1'],
        [u'1']],

       [[u'1'],
        [u'1'],
        [u'1']]], 
      dtype='<U1')

6. sum: summation

    In [51]: 
  

vals = np.arange(0, 12, 1).reshape((3, 4))
print(vals)
print('')
print('sum entire array =', np.sum(vals))
print('sum along columns =', np.sum(vals, axis=0))
print('sum along rows =', np.sum(vals, axis=1))

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

('sum entire array =', 66)
('sum along columns =', array([12, 15, 18, 21]))
('sum along rows =', array([ 6, 22, 38]))

7. mean: computing the average

    In [52]: 
  

vals = np.array([1, 2, 3, 4]*3).reshape((3, 4))
print(vals)
print('')
print('mean entire array =', np.mean(vals))
print('mean along columns =', np.mean(vals, axis=0))
print('mean along rows =', np.mean(vals, axis=1))

[[1 2 3 4]
 [1 2 3 4]
 [1 2 3 4]]

('mean entire array =', 2.5)
('mean along columns =', array([ 1.,  2.,  3.,  4.]))
('mean along rows =', array([ 2.5,  2.5,  2.5]))

8. linspace: evenly spaced values (an arithmetic sequence)

    In [53]: 
  

np.linspace(0, 19.3, 6)

      Out[53]: 
    

array([  0.  ,   3.86,   7.72,  11.58,  15.44,  19.3 ])

9. asarray: converts to an array without copying when the input is already an ndarray

    In [54]: 
  

vals = np.array([9, 2, 3, 5])
print(type(vals))
print(vals)
a = np.asarray(vals)
a += 1
print(vals) # vals changes because it was not copied when assigning 'a'

<type 'numpy.ndarray'>
[9 2 3 5]
[10  3  4  6]
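To make the no-copy behavior concrete, the sketch below contrasts `np.asarray` with `np.array`, which copies by default. The variable names are illustrative:

```python
import numpy as np

vals = np.array([9, 2, 3, 5])

same = np.asarray(vals)   # input is already an ndarray: returned as-is
copied = np.array(vals)   # np.array copies by default

print(same is vals)    # True
print(copied is vals)  # False

copied += 1            # modifies only the copy
print(vals)            # [9 2 3 5]

# A list has to be converted, so asarray necessarily builds a new array here
from_list = np.asarray([1, 2, 3])
print(type(from_list).__name__)  # ndarray
```

When the caller must not see modifications, `np.array(vals)` or `vals.copy()` is the safer choice.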

III. SciPy

1. stats: statistical distributions and hypothesis tests

    In [129]: 
  

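# Standard deviation, ignoring NaN values
vals = [0.0, np.nan, 8.3, 2.4, np.nan, 3.2]
np.nanstd(vals)  # sp.nanstd is no longer available in current SciPy; NumPy provides the same function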

      Out[129]: 
    

3.0243801017729237

    In [140]: 
  

# Normal distribution
x = np.linspace(0,10,50)
# Plot the Gaussian curve
plt.plot(x, sp.stats.norm.pdf(x=x, loc=5, scale=2))
# Draw random samples from the same Gaussian
sp.stats.norm.rvs(loc=5, scale=2, size=4)
plt.show()

2. sparse: compressed storage for sparse matrices

    In [59]: 
  

vals = np.array([[0, 3.4, 2], [0, 9.9, 0], [0, 0, -5.4]])
print(vals)
print('')
a = sp.sparse.csr_matrix(vals)
print(type(a))
print('non-zero entries =', a.data) # the stored non-zero values (not their count)
print('diagonal entries =',a.diagonal()) # entries on the main diagonal
print('upper triangular =\n',sp.sparse.triu(a))

[[ 0.   3.4  2. ]
 [ 0.   9.9  0. ]
 [ 0.   0.  -5.4]]

<class 'scipy.sparse.csr.csr_matrix'>
('non-zero entries =', array([ 3.4,  2. ,  9.9, -5.4]))
('diagonal entries =', array([ 0. ,  9.9, -5.4]))
('upper triangular =\n', <3x3 sparse matrix of type '<type 'numpy.float64'>'
	with 4 stored elements in COOrdinate format>)
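The CSR format stores three arrays: the non-zero values (`data`), their column indices (`indices`), and row pointers (`indptr`). A short sketch on the same matrix shows these arrays, the count of stored entries (`nnz`), and the conversion back to a dense array:

```python
import numpy as np
import scipy.sparse

vals = np.array([[0, 3.4, 2], [0, 9.9, 0], [0, 0, -5.4]])
a = scipy.sparse.csr_matrix(vals)

print(a.nnz)               # 4 stored entries
print(a.indices.tolist())  # column index of each stored value: [1, 2, 1, 2]
print(a.indptr.tolist())   # row i occupies data[indptr[i]:indptr[i+1]]: [0, 2, 3, 4]

dense = a.toarray()        # back to an ordinary ndarray
print(np.array_equal(dense, vals))  # True
```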

3. optimize: optimization and root-finding routines

    In [141]: 
  

# Find the roots of a function

f = lambda x: x**2 - 3*x + 2 # = (x-1)*(x-2)
print(f)
roots = (sp.optimize.brentq(f=f, a=0, b=1.5),
         sp.optimize.brentq(f=f, a=1.5, b=5))
print('First root =', roots[0])
print('Second root =', roots[1])

<function <lambda> at 0x0D724CF0>
('First root =', 1.0000000000000002)
('Second root =', 1.9999999999999998)
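The two separate intervals above are needed because `brentq` requires the function to change sign between `a` and `b`. A minimal sketch of what happens when the bracket does not satisfy this, with `np.roots` as a cross-check for this particular polynomial:

```python
import numpy as np
import scipy.optimize

def f(x):
    return x**2 - 3*x + 2  # = (x-1)*(x-2)

# Valid bracket: f(0) = 2 > 0 and f(1.5) = -0.25 < 0
root = scipy.optimize.brentq(f, a=0, b=1.5)

# Invalid bracket: f(0) = 2 and f(5) = 12 have the same sign
try:
    scipy.optimize.brentq(f, a=0, b=5)
    bracket_ok = True
except ValueError:
    bracket_ok = False

# For a polynomial, np.roots returns all roots from the coefficients
all_roots = np.sort(np.roots([1, -3, 2]))

print(bracket_ok)  # False
print(all_roots)
```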

    In [143]: 
  

# Least-squares parameter fitting

x = np.linspace(0, 10, 10)
y = np.array([-0.5, -1.8, -1.3, -0.1, 0.4,
              1.6, 3.5, 8.9, 12.6, 24.8])

# Model: a quadratic function
f = lambda beta, x:  beta[0] + beta[1]*x + beta[2]*x**2

# Residuals between the model f and the observed values
error_function = lambda beta, x, y: f(beta, x) - y

beta_0 = (0.0, 0.0, 0.0)

beta, _ = sp.optimize.leastsq(func=error_function, x0=beta_0, args=(x, y))
print('optimal parameters =', beta)
plt.scatter(x, y);
plt.plot(x, [f(beta, xx) for xx in x])
plt.show()

('optimal parameters =', array([ 0.6, -2.2,  0.4]))
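Because the model is a polynomial, the fit can be cross-checked against `np.polyfit`, which solves the same least-squares problem directly. This is only a sanity check sketched here, not part of the original notebook; note that `np.polyfit` returns coefficients from the highest power down:

```python
import numpy as np
import scipy.optimize

x = np.linspace(0, 10, 10)
y = np.array([-0.5, -1.8, -1.3, -0.1, 0.4,
              1.6, 3.5, 8.9, 12.6, 24.8])

def residuals(beta, x, y):
    # Quadratic model minus observations
    return beta[0] + beta[1]*x + beta[2]*x**2 - y

beta, _ = scipy.optimize.leastsq(residuals, x0=(0.0, 0.0, 0.0), args=(x, y))

# polyfit orders coefficients as [x**2, x**1, x**0], so reverse for comparison
coeffs = np.polyfit(x, y, deg=2)

print(beta)
print(coeffs[::-1])
```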

4. io: reading and writing MATLAB files

    In [62]: 
  

# Save an array to a MATLAB file

# Initialize the array
np.set_printoptions(precision=1)
matrix = np.random.random(size=(8, 6))
print(matrix)

# Build a dictionary with one entry per row
data_dict = {'row'+str(r_id): row for r_id, row in enumerate(matrix)}
# Write each row as a separate variable in the MATLAB file
scipy.io.savemat('random_array.mat', mdict=data_dict, oned_as='row')

# Read back the data that was just saved
loaded_data_dict = scipy.io.loadmat('random_array.mat')
loaded_data_dict

[[ 0.5  0.3  0.7  0.5  0.6  0.2]
 [ 0.2  0.8  0.6  0.6  0.7  0.6]
 [ 0.5  0.3  0.9  0.4  0.6  0.3]
 [ 0.9  0.   0.5  0.6  0.9  0.4]
 [ 0.5  0.3  0.2  0.5  0.7  0.1]
 [ 0.6  0.9  1.   0.8  0.2  0.3]
 [ 0.5  0.6  0.5  0.6  0.8  0.9]
 [ 0.6  0.8  0.   0.8  0.5  0.2]]

      Out[62]: 
    

{'__globals__': [],
 '__header__': 'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Apr 19 09:47:48 2018',
 '__version__': '1.0',
 'row0': array([[ 0.5,  0.3,  0.7,  0.5,  0.6,  0.2]]),
 'row1': array([[ 0.2,  0.8,  0.6,  0.6,  0.7,  0.6]]),
 'row2': array([[ 0.5,  0.3,  0.9,  0.4,  0.6,  0.3]]),
 'row3': array([[ 0.9,  0. ,  0.5,  0.6,  0.9,  0.4]]),
 'row4': array([[ 0.5,  0.3,  0.2,  0.5,  0.7,  0.1]]),
 'row5': array([[ 0.6,  0.9,  1. ,  0.8,  0.2,  0.3]]),
 'row6': array([[ 0.5,  0.6,  0.5,  0.6,  0.8,  0.9]]),
 'row7': array([[ 0.6,  0.8,  0. ,  0.8,  0.5,  0.2]])}

5. linalg: linear algebra

    In [63]: 
  

matrix = np.array([[4.3, 8.9],[2.2, 3.4]])
print(matrix)
print('')

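# Compute the norm (the Frobenius norm by default for a matrix)
norm = sp.linalg.norm(matrix)
print('norm =', norm)
# Alternate method; compare with a tolerance rather than exact float equality
print(np.isclose(norm, np.square(matrix).sum()**0.5))
print('')

# Compute the eigenvalues and eigenvectors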
eigvals, eigvecs = sp.linalg.eig(matrix)
print('eigenvalues =', eigvals)
print('eigenvectors =\n', eigvecs)

[[ 4.3  8.9]
 [ 2.2  3.4]]

('norm =', 10.681760154581267)
True

('eigenvalues =', array([ 8.3+0.j, -0.6+0.j]))
('eigenvectors =\n', array([[ 0.9, -0.9],
       [ 0.4,  0.5]]))
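The eigenvectors are returned as the columns of the second array, which is easy to get wrong. A short check, sketched on the same matrix, confirms that each column `v` satisfies `A v = λ v`:

```python
import numpy as np
import scipy.linalg

matrix = np.array([[4.3, 8.9], [2.2, 3.4]])
eigvals, eigvecs = scipy.linalg.eig(matrix)

# Column i of eigvecs pairs with eigvals[i]
checks = [np.allclose(matrix @ eigvecs[:, i], eigvals[i] * eigvecs[:, i])
          for i in range(len(eigvals))]
print(checks)  # [True, True]

# The sum of the eigenvalues equals the trace of the matrix
print(np.isclose(eigvals.sum().real, np.trace(matrix)))  # True
```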

6. interpolate: interpolation

    In [144]: 
  

# Fit a spline to scattered points

x = np.linspace(0, 10, 10)
xs = np.linspace(0, 11, 50)
y = np.array([0.5, 1.8, 1.3, 3.5, 3.4,
              5.2, 3.5, 1.0, -2.3, -6.3])
spline = sp.interpolate.UnivariateSpline(x, y)
plt.scatter(x, y);
plt.plot(xs, spline(xs))
plt.show()
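By default `UnivariateSpline` smooths the data, so the curve need not pass through every point. A minimal sketch of the smoothing factor `s`, using the same data: with `s=0` the spline interpolates the points exactly.

```python
import numpy as np
import scipy.interpolate

x = np.linspace(0, 10, 10)
y = np.array([0.5, 1.8, 1.3, 3.5, 3.4,
              5.2, 3.5, 1.0, -2.3, -6.3])

smooth = scipy.interpolate.UnivariateSpline(x, y)      # default smoothing
exact = scipy.interpolate.UnivariateSpline(x, y, s=0)  # passes through every point

# Largest deviation from the data at the sample positions
print(np.max(np.abs(exact(x) - y)))
print(np.max(np.abs(smooth(x) - y)))
```

Note that the plot above evaluates the spline up to x = 11, beyond the last data point at x = 10, so the final stretch of the curve is an extrapolation.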

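7. special: special functions (Bessel functions here; the module also covers permutations, combinations, and factorials)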

    In [145]: 
  

x = np.linspace(0,10,500)
fig, ax = plt.subplots(2)

ax[0].set_title('Zero and first order bessel functions of the first kind')
ax[0].plot(x, sp.special.j0(x), c='blue', alpha=0.6)
ax[0].plot(x, sp.special.j1(x), c='red', alpha=0.6)

ax[1].set_title('Zero and first order bessel functions of the second kind')
ax[1].plot(x, sp.special.y0(x), c='blue', alpha=0.6)
ax[1].plot(x, sp.special.y1(x), c='red', alpha=0.6)
ax[1].set_ylim(-2,1); ax[1].set_xlim(0.5,10)
ax[1].annotate(r'$Y_0$ and $Y_1$ approach -$\infty$', xy=(1,-1.7), xytext=(2.5, -0.9),
               arrowprops=dict(arrowstyle='->', lw=1), fontsize=15)

plt.show()
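The section title mentions permutations, combinations, and factorials, which the Bessel example does not show. A brief sketch of those functions, with `exact=True` requesting integer arithmetic instead of floating point:

```python
import scipy.special

# Number of ways to choose 2 items out of 5, order ignored
n_comb = scipy.special.comb(5, 2, exact=True)

# Number of ordered arrangements of 2 items out of 5
n_perm = scipy.special.perm(5, 2, exact=True)

# 5! = 5 * 4 * 3 * 2 * 1
n_fact = scipy.special.factorial(5, exact=True)

print(n_comb, n_perm, n_fact)  # 10 20 120
```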

8. signal: signal processing

    In [146]: 
  

# A modified example posted in the docs:
# http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html#scipy.signal.lfilter

import scipy.signal
np.random.seed(0)

x = np.linspace(0,6*np.pi,100)
# Order-0 spherical Bessel function; spherical_jn replaces the deprecated sph_jn
y = [sp.special.spherical_jn(0, xi) for xi in x]
y = [yi + (np.random.random()-0.5)*0.7 for yi in y]
# y = np.sin(x)

# Coefficients of a 3rd-order low-pass Butterworth filter
b, a = sp.signal.butter(3, 0.08)

# Initialize filter
zi = sp.signal.lfilter_zi(b, a)

# Apply filter
y_smooth, _ = sp.signal.lfilter(b, a, y, zi=zi*y[0])

plt.plot(x, y, c='blue', alpha=0.6)
plt.plot(x, y_smooth, c='red', alpha=0.6)
plt.title('Noisy spherical bessel function signal processing')
plt.savefig('noisy_signal_fit.png', bbox_inches='tight')
plt.show()
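`lfilter` runs the filter once in the forward direction, so the smoothed curve lags behind the input. As a hedged alternative, not used in the original notebook, `scipy.signal.filtfilt` applies the filter forward and backward to cancel the phase shift. The sketch below uses a noisy sine wave as a stand-in signal, which is an assumption made for this example:

```python
import numpy as np
import scipy.signal

rng = np.random.default_rng(0)
x = np.linspace(0, 6 * np.pi, 200)
clean = np.sin(x)
noisy = clean + (rng.random(x.size) - 0.5) * 0.7

b, a = scipy.signal.butter(3, 0.08)

y_causal = scipy.signal.lfilter(b, a, noisy)       # single pass: output is delayed
y_zero_phase = scipy.signal.filtfilt(b, a, noisy)  # forward + backward pass: no delay

def rms_error(estimate):
    # Root-mean-square distance to the noise-free signal
    return float(np.sqrt(np.mean((estimate - clean) ** 2)))

print('noisy      :', rms_error(noisy))
print('lfilter    :', rms_error(y_causal))
print('filtfilt   :', rms_error(y_zero_phase))
```

`filtfilt` needs the whole signal up front, so it suits offline analysis; for streaming data `lfilter` remains the appropriate tool.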

9. ndimage: image processing

    In [148]: 
  

# Blur an image

# Load the image
figure = plt.imread('noisy_signal_fit.png')

# Blur the image
figure_blur = sp.ndimage.gaussian_filter(figure, sigma=2) # the larger sigma is, the stronger the blur

# Plot
pics = [figure, figure_blur]
sns.set_style('white')
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for pic, ax in zip(pics, axes):
    ax.imshow(pic); ax.set_xticks([]); ax.set_yticks([])
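One subtlety: a color image is a 3-D array (height, width, channels), and a scalar `sigma` blurs along all three axes, including the channel axis. A small sketch on a synthetic image (a single red pixel, made up for this example) shows how a per-axis `sigma` keeps the color channels separate:

```python
import numpy as np
import scipy.ndimage

# 9x9 RGB image with one bright pixel in the red channel only
img = np.zeros((9, 9, 3))
img[4, 4, 0] = 1.0

# Scalar sigma: blurs across rows, columns, AND color channels
blur_all_axes = scipy.ndimage.gaussian_filter(img, sigma=2)

# Per-axis sigma: blur spatially, leave the channel axis untouched
blur_spatial = scipy.ndimage.gaussian_filter(img, sigma=(2, 2, 0))

print(blur_all_axes[..., 1].max() > 0)   # True: red leaked into green
print(blur_spatial[..., 1].max() == 0)   # True: green channel stays empty
```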

10. misc: miscellaneous utilities (a sample image here)

    In [149]: 
  

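# Get the raccoon face sample image

# Load the raccoon in color and in grayscale
# (newer SciPy releases provide this image as scipy.datasets.face instead of sp.misc.face)
pics = sp.misc.face(), sp.misc.face(gray=True)

# Plot them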
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for pic, ax in zip(pics, axes):
    ax.imshow(pic); ax.set_xticks([]); ax.set_yticks([])
plt.show()