Python Data Analysis: Day 8
Applying Pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_format = 'svg'
Other Methods of DataFrame Objects
df1 = pd.DataFrame({
'col1': [[1, 2, 3], [4, 5, 6]],
'col2': [[0] * 3, [1] * 4]
})
df1
col1 col2
0 [1, 2, 3] [0, 0, 0]
1 [4, 5, 6] [1, 1, 1, 1]
df1.col1.explode()
0 1
0 2
0 3
1 4
1 5
1 6
df1.explode('col1', ignore_index=True)
col1 col2
0 1 [0, 0, 0]
1 2 [0, 0, 0]
2 3 [0, 0, 0]
3 4 [1, 1, 1, 1]
4 5 [1, 1, 1, 1]
5 6 [1, 1, 1, 1]
df1.explode('col2')
col1 col2
0 [1, 2, 3] 0
0 [1, 2, 3] 0
0 [1, 2, 3] 0
1 [4, 5, 6] 1
1 [4, 5, 6] 1
1 [4, 5, 6] 1
1 [4, 5, 6] 1
df1.explode('col1').explode('col2')
col1 col2
0 1 0
0 1 0
0 1 0
0 2 0
0 2 0
0 2 0
0 3 0
0 3 0
0 3 0
1 4 1
1 4 1
1 4 1
1 4 1
1 5 1
1 5 1
1 5 1
1 5 1
1 6 1
1 6 1
1 6 1
1 6 1
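The chained call above yields every combination of `col1` and `col2` elements within a row (3 × 3 rows for index 0, 3 × 4 rows for index 1). When the lists in a row have equal lengths and should instead be paired element by element, pandas 1.3 and later accept a list of columns. The sketch below contrasts the two behaviors on a hypothetical frame `df2`, which is not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame for illustration: lists in each row have equal lengths.
df2 = pd.DataFrame({
    'col1': [[1, 2, 3], [4, 5]],
    'col2': [['a', 'b', 'c'], ['d', 'e']]
})

# Chained explode: Cartesian product of the two lists within each row.
chained = df2.explode('col1').explode('col2')

# Multi-column explode (pandas >= 1.3): elements are paired by position.
paired = df2.explode(['col1', 'col2'])

print(len(chained))  # 3 * 3 + 2 * 2 rows
print(len(paired))   # 3 + 2 rows
print(paired)
```

Note that `df1` above could not be exploded this way, because the lists in its second row have different lengths (3 and 4).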
ser1 = pd.Series(np.random.randint(10, 100, 10))
ser1
0 76
1 93
2 31
3 94
4 87
5 86
6 94
7 39
8 25
9 54
ser1.rolling(3).sum()
0 NaN
1 NaN
2 200.0
3 218.0
4 212.0
5 267.0
6 267.0
7 219.0
8 158.0
9 118.0
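The first two results are `NaN` because a window of 3 needs three observations by default. As a small sketch, the ten values printed above are hard-coded below (so the result is reproducible) to show how `min_periods` relaxes that requirement. The names `ser_fixed`, `strict` and `relaxed` are introduced here for illustration.

```python
import pandas as pd

# Same ten values as the ser1 output above, hard-coded for reproducibility.
ser_fixed = pd.Series([76, 93, 31, 94, 87, 86, 94, 39, 25, 54])

# Default: a result is produced only when the window holds 3 values.
strict = ser_fixed.rolling(3).sum()

# min_periods=1: partial windows at the start are summed as well.
relaxed = ser_fixed.rolling(3, min_periods=1).sum()

print(strict.head(4).tolist())   # [nan, nan, 200.0, 218.0]
print(relaxed.head(4).tolist())  # [76.0, 169.0, 200.0, 218.0]
```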
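# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
boston_dataset = load_boston()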
boston_dataset.data
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
4.0300e+00],
...,
[6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
5.6400e+00],
[1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
6.4800e+00],
[4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
7.8800e+00]])
boston_dataset.target.shape
(506,)
boston_dataset.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
boston_df = pd.DataFrame(
data=boston_dataset.data,
columns=boston_dataset.feature_names
)
boston_df
![Output of boston_df](https://img-blog.csdnimg.cn/5dba7cd49e1b4613ab0241d0359eaa32.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA6Zuq55CD5rua5rua5rua,size_20,color_FFFFFF,t_70,g_se,x_16#pic_center)
boston_df['PRICE'] = boston_dataset.target
boston_df
boston_df.cov()
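Covariance
Covariance measures the degree to which two random variables vary together. If larger values of a variable $X$ mostly correspond to larger values of another variable $Y$, and their smaller values likewise correspond, the two variables tend to behave similarly and the covariance is positive. If larger values of one variable mostly correspond to smaller values of the other, the two variables tend to behave in opposite ways and the covariance is negative. In short, the sign of the covariance shows the direction of the linear relationship between the two variables. Variance is a special case of covariance: the covariance of a variable with itself.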
$$cov(X,Y) = E((X - \mu)(Y - \upsilon)) = E(X \cdot Y) - \mu\upsilon$$
If $X$ and $Y$ are statistically independent, their covariance is 0, because under independence:
$$E(X \cdot Y) = E(X) \cdot E(Y) = \mu\upsilon$$
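As a quick numerical check of the identity above, the sketch below uses two small made-up arrays (`x` and `y` are illustrative and do not come from the Boston data). One assumption is worth stating: the formula describes the population covariance (divide by $n$), whereas `Series.cov` and `DataFrame.cov` compute the sample covariance (divide by $n - 1$) by default.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

mu, upsilon = x.mean(), y.mean()

# Population covariance, written both ways from the formula.
cov_def = np.mean((x - mu) * (y - upsilon))
cov_alt = np.mean(x * y) - mu * upsilon

# pandas uses the sample covariance (ddof=1) by default.
cov_sample = pd.Series(x).cov(pd.Series(y))

print(cov_def, cov_alt)  # the two forms agree
print(cov_sample)        # equals cov_def * n / (n - 1)
```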
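Correlation Coefficient
The magnitude of the covariance depends on the scale of the variables and is usually hard to interpret, but the normalized form of the covariance shows the strength of the linear relationship between the two variables. In statistics, the Pearson product-moment correlation coefficient measures the degree of (linear) correlation between two variables $X$ and $Y$; its value lies between -1 and 1.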
$$\rho_{X,Y} = \frac {cov(X, Y)} {\sigma_{X}\sigma_{Y}}$$
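Estimating the covariance and standard deviations from a sample gives the sample Pearson coefficient, usually denoted by the lowercase letter $r$.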
$$r = \frac {\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})} {\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
An equivalent expression is:
$$r = \frac {1} {n - 1} \sum_{i=1}^n \left( \frac {X_i - \bar{X}} {\sigma_X} \right) \left( \frac {Y_i - \bar{Y}} {\sigma_{Y}} \right)$$
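The two expressions for $r$ can be checked against `Series.corr`, which computes the Pearson coefficient by default. The arrays below are made up for illustration, and $\sigma_X$ and $\sigma_Y$ are assumed to be the sample standard deviations (`ddof=1`), which is what makes the $\frac{1}{n-1}$ form agree with the first one.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = x.size

dx, dy = x - x.mean(), y - y.mean()

# First form: sum of cross products over the product of root sums of squares.
r1 = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

# Second form: summed product of standardized values, using sample std (ddof=1).
r2 = ((dx / x.std(ddof=1)) * (dy / y.std(ddof=1))).sum() / (n - 1)

# pandas reference value (method='pearson' is the default).
r_pd = pd.Series(x).corr(pd.Series(y))

print(r1, r2, r_pd)
```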
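- Pearson correlation coefficient. It is appropriate when:
    - The two variables are linearly related and both are continuous.
    - The populations of both variables are normally distributed, or unimodal and close to normal.
    - The observations come in pairs, and the pairs are independent of one another.
- Spearman correlation coefficient
The Spearman correlation coefficient places looser requirements on the data than the Pearson coefficient. As long as the observations of the two variables are paired rank data, or rank data converted from observations of continuous variables, the Spearman rank correlation coefficient can be used regardless of the shape of the two population distributions or the sample size.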
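One way to see the difference: the Spearman coefficient is the Pearson coefficient computed on ranks, so any strictly increasing relationship scores 1 even when it is not linear. A minimal sketch with made-up data follows; `df_demo` is a hypothetical frame, not the Boston data.

```python
import numpy as np
import pandas as pd

# y grows monotonically with x, but not linearly.
x = np.arange(1, 11, dtype=float)
df_demo = pd.DataFrame({'x': x, 'y': np.exp(x)})

# Pearson is below 1 because the relationship is not a straight line.
pearson = df_demo.corr().loc['x', 'y']

# Spearman is 1 because the rank orderings agree perfectly.
spearman = df_demo.corr(method='spearman').loc['x', 'y']

# Spearman computed by hand: Pearson applied to the ranks.
spearman_by_rank = df_demo.rank().corr().loc['x', 'y']

print(pearson, spearman, spearman_by_rank)
```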
boston_df.corr()
boston_df.corr('spearman').style.background_gradient('RdYlBu', subset=['PRICE'])
import pandas_datareader as pdr
baidu_df = pdr.get_data_stooq('BIDU', start='2021-11-1', end='2021-12-8')
baidu_df.sort_index(inplace=True)
baidu_df
baidu_df.index
DatetimeIndex(['2021-11-01', '2021-11-02', '2021-11-03', '2021-11-04',
'2021-11-05', '2021-11-08', '2021-11-09', '2021-11-10',
'2021-11-11', '2021-11-12', '2021-11-15', '2021-11-16',
'2021-11-17', '2021-11-18', '2021-11-19', '2021-11-22',
'2021-11-23', '2021-11-24', '2021-11-26', '2021-11-29',
'2021-11-30', '2021-12-01', '2021-12-02', '2021-12-03',
'2021-12-06', '2021-12-07', '2021-12-08'],
dtype='datetime64[ns]', name='Date', freq=None)
baidu_df.Close.rolling(10).mean()
plt.plot(np.arange(baidu_df.index.size), baidu_df.Close)
plt.xticks(np.arange(baidu_df.index.size),
rotation=45,
labels=baidu_df.index.month.astype(str).values + '-' + baidu_df.index.day.astype(str).values)
plt.yticks(np.arange(100, 181, 10))
plt.show()
Index Objects
# Range index
sales_data = np.random.randint(400, 1000, 12)
month_index = pd.RangeIndex(1, 13, name='月份')
ser = pd.Series(data=sales_data, index=month_index)
ser
月份
1 581
2 639
3 559
4 940
5 638
6 616
7 853
8 865
9 665
10 747
11 485
12 641
ser.index
RangeIndex(start=1, stop=13, step=1, name='月份')
# Multi-level index
ids = np.arange(1001, 1006)
sms = ['期中', '期末']
index = pd.MultiIndex.from_product((ids, sms), names=['学号', '学期'])
courses = ['语文', '数学', '英语']
scores = np.random.randint(60, 101, (10, 3))
df = pd.DataFrame(data=scores, columns=courses, index=index)
df
df.reset_index(level=1)
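Besides flattening a level with `reset_index`, rows of a multi-level index can be selected directly. The sketch below rebuilds a smaller frame with the same index layout but fixed scores (an assumption made so the results are reproducible) and shows three common selections. The name `df_demo` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same index layout as above, with fixed scores instead of random ones.
ids = np.arange(1001, 1004)
sms = ['期中', '期末']  # midterm, final
index = pd.MultiIndex.from_product((ids, sms), names=['学号', '学期'])
scores = np.arange(60, 78).reshape(6, 3)
df_demo = pd.DataFrame(scores, columns=['语文', '数学', '英语'], index=index)

# All rows for one student (selects on the outer level).
one_student = df_demo.loc[1001]

# One row: a (student, term) tuple addresses both levels.
one_row = df_demo.loc[(1002, '期末')]

# Cross-section on the inner level: every student's final-term scores.
finals = df_demo.xs('期末', level='学期')

print(one_student)
print(one_row)
print(finals)
```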
pd.date_range('2021-1-1', '2021-6-1', periods=10)
DatetimeIndex(['2021-01-01 00:00:00', '2021-01-17 18:40:00',
'2021-02-03 13:20:00', '2021-02-20 08:00:00',
'2021-03-09 02:40:00', '2021-03-25 21:20:00',
'2021-04-11 16:00:00', '2021-04-28 10:40:00',
'2021-05-15 05:20:00', '2021-06-01 00:00:00'],
dtype='datetime64[ns]', freq=None)
temp = pd.date_range('2021-1-1', '2021-6-1', freq='W')
temp
DatetimeIndex(['2021-01-03', '2021-01-10', '2021-01-17', '2021-01-24',
'2021-01-31', '2021-02-07', '2021-02-14', '2021-02-21',
'2021-02-28', '2021-03-07', '2021-03-14', '2021-03-21',
'2021-03-28', '2021-04-04', '2021-04-11', '2021-04-18',
'2021-04-25', '2021-05-02', '2021-05-09', '2021-05-16',
'2021-05-23', '2021-05-30'],
dtype='datetime64[ns]', freq='W-SUN')
temp - pd.DateOffset(days=2)
DatetimeIndex(['2021-01-01', '2021-01-08', '2021-01-15', '2021-01-22',
'2021-01-29', '2021-02-05', '2021-02-12', '2021-02-19',
'2021-02-26', '2021-03-05', '2021-03-12', '2021-03-19',
'2021-03-26', '2021-04-02', '2021-04-09', '2021-04-16',
'2021-04-23', '2021-04-30', '2021-05-07', '2021-05-14',
'2021-05-21', '2021-05-28'],
dtype='datetime64[ns]', freq=None)
temp + pd.DateOffset(days=2)
DatetimeIndex(['2021-01-05', '2021-01-12', '2021-01-19', '2021-01-26',
'2021-02-02', '2021-02-09', '2021-02-16', '2021-02-23',
'2021-03-02', '2021-03-09', '2021-03-16', '2021-03-23',
'2021-03-30', '2021-04-06', '2021-04-13', '2021-04-20',
'2021-04-27', '2021-05-04', '2021-05-11', '2021-05-18',
'2021-05-25', '2021-06-01'],
dtype='datetime64[ns]', freq=None)
baidu_df.head(5)
Open High Low Close Volume
Date
2021-11-01 162.5400 170.5300 162.46 170.32 3922601
2021-11-02 166.1400 166.1400 161.33 162.27 4162546
2021-11-03 163.1500 165.5000 162.66 165.37 2482714
2021-11-04 167.0200 167.7800 162.28 162.58 3126021
2021-11-05 163.6777 163.6777 157.76 158.23 4272586
baidu_df.shift(3, fill_value=0)
Open High Low Close Volume
Date
2021-11-01 0.0000 0.0000 0.00 0.00 0
2021-11-02 0.0000 0.0000 0.00 0.00 0
2021-11-03 0.0000 0.0000 0.00 0.00 0
2021-11-04 162.5400 170.5300 162.46 170.32 3922601
2021-11-05 166.1400 166.1400 161.33 162.27 4162546
2021-11-08 163.1500 165.5000 162.66 165.37 2482714
2021-11-09 167.0200 167.7800 162.28 162.58 3126021
2021-11-10 163.6777 163.6777 157.76 158.23 4272586
2021-11-11 159.6600 161.5800 158.41 161.42 2271899
2021-11-12 160.5200 162.8800 158.59 161.63 2405390
2021-11-15 161.7700 164.7500 160.62 161.56 3052889
2021-11-16 164.8800 168.8000 164.10 167.26 3121199
2021-11-17 166.9750 171.3500 166.33 170.57 3143286
2021-11-18 171.4050 171.9900 167.69 168.67 1940542
2021-11-19 171.9100 173.1700 169.05 171.27 3232698
2021-11-22 173.3800 173.6000 160.07 161.82 6474141
2021-11-23 155.7800 158.2000 152.41 154.36 5584166
2021-11-24 154.1100 154.9000 151.04 151.77 3989447
2021-11-26 152.0000 152.4700 146.89 147.81 4423414
2021-11-29 147.9000 151.6200 147.76 150.49 3376812
2021-11-30 148.9300 151.5800 146.89 151.39 2663238
2021-12-01 148.5200 154.4500 147.89 153.06 3267102
2021-12-02 153.0000 153.0000 148.80 150.29 3745624
2021-12-03 149.1200 151.4500 147.01 149.84 4815568
2021-12-06 150.6600 152.0000 147.51 148.83 4139282
2021-12-07 148.4000 151.5500 145.20 148.96 4979653
2021-12-08 141.8500 143.6000 132.14 137.39 10411847
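A common use of `shift` is comparing each row with an earlier one. The sketch below hard-codes the five closing prices from `baidu_df.head(5)` above (so it runs without network access) and computes the day-over-day change. The names `close`, `previous`, `change` and `pct` are introduced here for illustration.

```python
import pandas as pd

# Closing prices copied from the head(5) output above.
close = pd.Series(
    [170.32, 162.27, 165.37, 162.58, 158.23],
    index=pd.to_datetime(['2021-11-01', '2021-11-02', '2021-11-03',
                          '2021-11-04', '2021-11-05']),
    name='Close'
)

# shift(1) moves each value down one row, so each row holds the previous close.
previous = close.shift(1)

# Absolute and relative change versus the previous trading day.
change = close - previous
pct = change / previous

# The first row has no predecessor, so it is NaN unless fill_value is given.
print(change.round(2))
print(pct.round(4))
```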
baidu_df.asfreq('10D', method='ffill')
Open High Low Close Volume
Date
2021-11-01 162.54 170.53 162.46 170.32 3922601
2021-11-11 164.88 168.80 164.10 167.26 3121199
2021-11-21 154.11 154.90 151.04 151.77 3989447
2021-12-01 150.66 152.00 147.51 148.83 4139282
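In the output above, 2021-11-21 is not a trading day, so `method='ffill'` carries the most recent earlier row forward (here the 2021-11-19 row). A minimal sketch of that behavior follows, on a small hand-made series whose values are chosen only for illustration.

```python
import pandas as pd

# Hypothetical values on business days only (the weekend of Nov 6-7 is absent).
s = pd.Series(
    [10.0, 11.0, 12.0, 13.0],
    index=pd.to_datetime(['2021-11-04', '2021-11-05', '2021-11-08', '2021-11-09'])
)

# Re-index to every calendar day; dates missing from s become NaN.
daily = s.asfreq('D')

# Same re-indexing, but missing dates take the last earlier observation.
daily_ffill = s.asfreq('D', method='ffill')

print(daily)
print(daily_ffill)
```

`asfreq` only picks or fills values at the new timestamps. It does not aggregate the rows in between.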