Python Data Analysis: NumPy, Pandas, and the Tushare Financial Data Interface Package

Jianhao92

Article link: https://blog.csdn.net/qq_36565509/article/details/107319112

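1 The NumPy Module

1.1 Introduction

NumPy (Numerical Python) is the foundational library for scientific computing in Python. It supports operations on multidimensional arrays and matrices.

1.2 The ndarray Object

1.2.1 Introduction

An ndarray is a multidimensional array that holds elements of a single type.
Every element of an ndarray occupies the same amount of memory.

Unlike a list, all elements of an array must share one type. When the input mixes types, NumPy promotes them in the order string > float > integer.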

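1.2.2 Creating ndarray Objects

1.2.2.1 numpy.array

numpy.array(object, dtype = None, copy = True, order = 'K', subok = False, ndmin = 0)

Parameter	Description
object	An array or a nested sequence
dtype	Data type of the elements
copy	Whether the object is copied
order	Memory layout of the new array: C is row-major, F is column-major, A means either; recent NumPy versions default to K, which keeps the layout of the input
subok	If False (the default), the result is forced to the base-class array type; if True, subclasses are passed through
ndmin	Minimum number of dimensions of the resulting array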

import numpy as np

# One-dimensional array
np.array([1, 2, 3])  # array([1, 2, 3])
# Two-dimensional array
np.array([[1, 2, 3], [4, 5, 6]])
'''
array([[1, 2, 3],
       [4, 5, 6]])
'''
# All elements must share one type; promotion order: string > float > integer
np.array([1, 1.2, '12'])  # array(['1', '1.2', '12'], dtype='<U32')
np.array([1, 1.2, 12])  # array([ 1. ,  1.2, 12. ])
# Specify the minimum number of dimensions
np.array([1, 2, 3], ndmin=5)  # array([[[[[1, 2, 3]]]]])
# Specify dtype
np.array([1, 2, 3], dtype=complex)  # array([1.+0.j, 2.+0.j, 3.+0.j])

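1.2.2.2 Array-Creation Routines Provided by NumPy

numpy.empty
Creates an uninitialized array; its elements hold whatever values happen to be in memory.

arr = np.empty([3, 2], dtype=int)
arr.fill(100)
arr
'''
array([[100, 100],
       [100, 100],
       [100, 100]])
'''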

numpy.zeros

np.zeros(shape=(2, 3))  # float by default
'''
array([[0., 0., 0.],
       [0., 0., 0.]])
'''
np.zeros(3, dtype=int)  # integer dtype; the np.int alias was removed in NumPy 1.24
'''
array([0, 0, 0])
'''

numpy.ones

np.ones(3)  # float by default
'''
array([1., 1., 1.])
'''
np.ones(3, dtype=int)  # integer dtype; the np.int alias was removed in NumPy 1.24
'''
array([1, 1, 1])
'''

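np.linspace
Returns a one-dimensional arithmetic sequence with a given number of elements. The stop value is included by default.

np.linspace(start, stop, number of elements)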

np.linspace(0, 20, num=10)
'''
array([ 0.        ,  2.22222222,  4.44444444,  6.66666667,  8.88888889,
       11.11111111, 13.33333333, 15.55555556, 17.77777778, 20.        ])
'''

np.arange
Returns a one-dimensional arithmetic sequence with a given step. The stop value is excluded.

np.arange(start, stop, step)

np.arange(0, 20, step=5)  # array([ 0,  5, 10, 15])
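The two routines differ mainly in how they treat the end of the interval. The following is a small sketch of that difference; the variable names are illustrative only.

```python
import numpy as np

# linspace: fixed number of points, stop value included by default
points = np.linspace(0, 20, num=5)

# arange: fixed step, stop value excluded
steps = np.arange(0, 20, 5)

# The spacing used by linspace is (stop - start) / (num - 1)
spacing = (20 - 0) / (5 - 1)

print(points)   # five values from 0 to 20 inclusive
print(steps)    # four values from 0 to 15
print(spacing)  # 5.0
```

When the number of samples matters more than the step size, linspace tends to be the safer choice, because arange with a non-integer step can produce an unexpected element count due to floating-point rounding.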

np.random.randint
Returns an array of random integers drawn from the half-open interval [low, high).

np.random.randint(0, 100, size=(2, 3))
'''
array([[84, 26, 20],
       [10, 88,  4]])
'''

np.random.random
Returns an array of random floats drawn from the half-open interval [0, 1).

np.random.random(size=(2, 3))
'''
array([[0.34844375, 0.44087602, 0.82370203],
       [0.04277734, 0.8713185 , 0.57144526]])
'''

1.2.2.3 matplotlib.pyplot

Build an ndarray from image data.

import matplotlib.pyplot as plt

# imread returns a NumPy array.
img_arr = plt.imread('./test.jpg')
# Display the image from the NumPy array.
plt.imshow(img_arr)

1.2.3 Basic NumPy Types

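arr0 = np.array([1, 2, 3], dtype='float32')  # array([1., 2., 3.], dtype=float32)
# astype returns a converted copy and leaves the original untouched
arr1 = arr0.astype('int8')
print(arr1)  # [1 2 3]
print(arr0)  # [1. 2. 3.]

# Assigning to dtype does not convert values: it reinterprets the same bytes,
# so each float32 is read as two int16 values (little-endian output shown)
arr0.dtype = 'int16'
print(arr0)  # [    0 16256     0 16384     0 16448]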
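To make the contrast between converting values and reinterpreting bytes more explicit, here is a minimal sketch that uses ndarray.view in place of assigning to the dtype attribute. The variable names are illustrative, and the exact integers produced by the reinterpretation depend on the machine's byte order, so they are not printed here.

```python
import numpy as np

src = np.array([1, 2, 3], dtype='float32')

# astype builds a new array whose values are converted one by one
converted = src.astype('int8')

# view reads the same 12 bytes as int16, so the length doubles from 3 to 6
reinterpreted = src.view('int16')

print(converted)            # [1 2 3]
print(reinterpreted.shape)  # (6,)
print(src.dtype)            # float32, the source array keeps its type
```

For ordinary type changes astype is usually what is wanted; reinterpretation is mainly useful when working with raw binary data.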

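1.2.4 Attributes of ndarray Objects

Attribute	Description
ndim	Rank, i.e. the number of axes (dimensions)
shape	Size of the array along each dimension
size	Total number of elements
dtype	Element type of the array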

arr = np.random.random(size=(2, 3))

arr.ndim  # 2
arr.shape  # (2, 3)
arr.size  # 6
arr.dtype  # dtype('float64')

1.3 Working with ndarray Objects

1.3.1 Indexing

arr = np.random.randint(0, 100, size=(3, 4))
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
arr[1][3]  # 9

1.3.2 Slicing

# Take the first two rows
arr[0:2]
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9]])
'''
# Take the first two columns
arr[:, 0:2]
'''
array([[36, 50],
       [71,  3],
       [47, 12]])
'''
# Take the first two rows and first two columns
arr[0:2, 0:2]
'''
array([[36, 50],
       [71,  3]])
'''

1.3.3 Flipping

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
# Reverse the row order
arr[::-1]
'''
array([[47, 12,  8, 47],
       [71,  3,  8,  9],
       [36, 50, 31, 15]])
'''
# Reverse the column order
arr[:, ::-1]
'''
array([[15, 31, 50, 36],
       [ 9,  8,  3, 71],
       [47,  8, 12, 47]])
'''
# Reverse both rows and columns
arr[::-1, ::-1]
'''
array([[47,  8, 12, 47],
       [ 9,  8,  3, 71],
       [15, 31, 50, 36]])
'''

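1.3.4 Flipping Example: Flipping an Image

img_arr.shape  # (626, 413, 3)
# Flip the image vertically (both forms are equivalent)
plt.imshow(img_arr[::-1])
plt.imshow(img_arr[::-1, :, :])
# Flip the image horizontally
plt.imshow(img_arr[:, ::-1, :])
# Reverse the channel order (RGB becomes BGR); this swaps colors rather than inverting them
plt.imshow(img_arr[:, :, ::-1])
# Crop the image
plt.imshow(img_arr[170:390, 100:320, :])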
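The same slicing operations can be checked without an image file. The sketch below builds a tiny hypothetical 2x3 RGB array (not the article's test.jpg) and assumes 8-bit pixel data; it also shows a true color inversion for comparison with the channel reversal.

```python
import numpy as np

# Hypothetical 2x3 RGB "image": shape is (height, width, channels)
img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

flipped_ud = img[::-1]        # reverse the rows: upside down
flipped_lr = img[:, ::-1]     # reverse the columns: mirrored
swapped = img[:, :, ::-1]     # reverse the channels: RGB -> BGR
inverted = 255 - img          # color negative, valid for uint8 pixel data

print(img[0, 0])       # first pixel: [0 1 2]
print(swapped[0, 0])   # same pixel with channels reversed: [2 1 0]
print(inverted[0, 0])  # same pixel inverted: [255 254 253]
```

All three slices are views on the original data, so they cost no extra memory until they are modified or copied.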

1.3.5 Reshaping: reshape

Reshaping cannot change the number of elements in the array.

arr.shape  # (3, 4)
# 2-D array => 1-D array
arr1 = arr.reshape((12,))  # array([36, 50, 31, 15, 71,  3,  8,  9, 47, 12,  8, 47])
# Reshape to another 2-D shape
arr2 = arr.reshape(2, 6)
'''
array([[36, 50, 31, 15, 71,  3],
       [ 8,  9, 47, 12,  8, 47]])
'''
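A short sketch of the element-count rule follows. It uses a stand-in array from np.arange rather than the random data above, and shows both the -1 placeholder (NumPy infers that dimension) and the error raised when the counts do not match.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

flat = arr.reshape(-1)       # -1 lets NumPy infer the length: 12
wide = arr.reshape(2, -1)    # the second dimension is inferred as 6

try:
    arr.reshape(5, 3)        # 15 elements requested, only 12 available
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(flat.shape)            # (12,)
print(wide.shape)            # (2, 6)
print(type(mismatch_error).__name__)  # ValueError
```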

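1.3.6 Concatenation: concatenate

Joins several NumPy arrays either vertically (stacking rows) or horizontally (placing columns side by side).

numpy.concatenate((a1, a2, ...), axis)

a1, a2, …: arrays whose shapes match on every axis except the one being joined;
axis: the axis to join along, 0 (default) or 1.
axis=0 joins vertically, so the arrays need the same number of columns;
axis=1 joins horizontally, so the arrays need the same number of rows.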

arr1 = np.random.randint(0, 100, size=(2, 1))
arr2 = np.random.randint(0, 100, size=(2, 3))
np.concatenate((arr1, arr2), axis=1)
'''
array([[83, 24, 42, 66],
       [96, 66, 25, 52]])
'''
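The shape requirement can be illustrated with two small arrays of the same shapes as above. This is only a sketch with fixed values instead of random ones; joining along axis=0 fails because the column counts differ.

```python
import numpy as np

a = np.zeros((2, 1), dtype=int)
b = np.ones((2, 3), dtype=int)

# Both arrays have 2 rows, so they can be placed side by side
side_by_side = np.concatenate((a, b), axis=1)

try:
    np.concatenate((a, b), axis=0)   # 1 column vs 3 columns
    axis0_error = None
except ValueError as exc:
    axis0_error = exc

print(side_by_side.shape)            # (2, 4)
print(type(axis0_error).__name__)    # ValueError
```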

1.3.7 Concatenation Example: A 3x3 Image Grid

img_arr_3 = np.concatenate((img_arr, img_arr, img_arr), axis=1)  # join horizontally
img_arr_9 = np.concatenate((img_arr_3, img_arr_3, img_arr_3), axis=0)  # join vertically
plt.imshow(img_arr_9)

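1.4 Functions

1.4.1 Statistical Functions

1.4.1.1 amin, amax

numpy.amin() returns the minimum of the elements along a given axis.
numpy.amax() returns the maximum of the elements along a given axis.
The axis parameter selects the direction: 0 reduces down each column, 1 reduces across each row. If axis is omitted, the function reduces over the whole array.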

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
np.amin(arr, axis=0)  # array([36,  3,  8,  9])
np.amin(arr, axis=1)  # array([15,  3,  8])
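As a quick check of how the axis argument behaves, the sketch below rebuilds the example array with fixed values and compares the three cases: no axis, axis=0, and axis=1.

```python
import numpy as np

arr = np.array([[36, 50, 31, 15],
                [71,  3,  8,  9],
                [47, 12,  8, 47]])

overall_min = np.amin(arr)        # no axis: minimum of all 12 elements
col_min = np.amin(arr, axis=0)    # one result per column
row_max = np.amax(arr, axis=1)    # one result per row

print(overall_min)  # 3
print(col_min)      # [36  3  8  9]
print(row_max)      # [50 71 47]
```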

1.4.1.2 Range: ptp

numpy.ptp() computes the difference between the maximum and minimum of the elements along a given axis, i.e. max - min.

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
np.ptp(arr, axis=0)  # array([35, 47, 23, 38])
np.ptp(arr, axis=1)  # array([35, 68, 39])

1.4.1.3 Median: median

numpy.median() computes the median.

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
np.median(arr, axis=0)  # array([47., 12.,  8., 15.])
np.median(arr, axis=1)  # array([33.5,  8.5, 29.5])

1.4.1.4 Arithmetic Mean: mean

numpy.mean() computes the arithmetic mean.

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
np.mean(arr, axis=0)  # array([51.33333333, 21.66666667, 15.66666667, 23.66666667])
np.mean(arr, axis=1)  # array([33.  , 22.75, 28.5 ])

1.4.1.5 Variance: var

The variance is the mean of the squared deviations of each value from the mean of all values, i.e. mean((x - x.mean())**2).

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
np.var(arr, axis=0)  # array([213.55555556, 414.88888889, 117.55555556, 278.22222222])
np.var(arr, axis=1)  # array([156.5   , 781.1875, 344.25  ])

1.4.1.6 Standard Deviation: std

The standard deviation is the square root of the variance and measures how spread out a data set is.

std = sqrt(mean((x - x.mean())**2))

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
np.std(arr, axis=0)  # array([14.61354014, 20.36882149, 10.84230398, 16.67999467])
np.std(arr, axis=1)  # array([12.509996  , 27.94973166, 18.55397532])
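The two formulas above can be verified by hand. The following sketch recomputes the row-wise variance and standard deviation of the example array directly from the definitions and compares them with np.var and np.std. The ddof=1 line is an addition for context: it gives the sample variance, which divides by n - 1 instead of n.

```python
import numpy as np

arr = np.array([[36, 50, 31, 15],
                [71,  3,  8,  9],
                [47, 12,  8, 47]])

# keepdims=True keeps the means as a column so they broadcast against each row
row_means = arr.mean(axis=1, keepdims=True)

manual_var = ((arr - row_means) ** 2).mean(axis=1)
manual_std = np.sqrt(manual_var)

# Sample variance: divides by n - 1 rather than n
sample_var = np.var(arr, axis=1, ddof=1)

print(manual_var)                                  # matches np.var(arr, axis=1)
print(np.allclose(manual_std, np.std(arr, axis=1)))  # True
```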

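1.4.2 Mathematical Functions

1.4.2.1 Trigonometric Functions

The standard trigonometric functions are sin(), cos(), and tan(). They take angles in radians, so the example converts degrees by multiplying by pi/180. Results such as 3.67e-16 are floating-point approximations of zero.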

a = np.array([0, 45, 60, 270, 540])
np.sin(a*np.pi/180)  # array([ 0.00000000e+00,  7.07106781e-01,  8.66025404e-01, -1.00000000e+00, 3.67394040e-16])
np.cos(a*np.pi/180)  # array([ 1.00000000e+00,  7.07106781e-01,  5.00000000e-01, -1.83697020e-16, -1.00000000e+00])
np.tan(a*np.pi/180)  # array([ 0.00000000e+00,  1.00000000e+00,  1.73205081e+00,  5.44374645e+15, -3.67394040e-16])

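1.4.2.2 Rounding: around

numpy.around() rounds its input to a given number of decimals. Values exactly halfway between two candidates are rounded to the nearest even value.
The decimals parameter sets the number of decimal places and defaults to 0; a negative value rounds to positions left of the decimal point.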

a = np.array([1.0, 5.55, 123, 0.567, 25.532])  
np.around(a)  # array([  1.,   6., 123.,   1.,  26.])
np.around(a, decimals=1)  # array([  1. ,   5.6, 123. ,   0.6,  25.5])
np.around(a, decimals=-1)  # array([  0.,  10., 120.,   0.,  30.])
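One detail worth checking is how exact halves are handled. NumPy rounds them to the nearest even value rather than always rounding up, as this small sketch shows (the input values are chosen so that they are exactly representable as floats).

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

# Exact halves go to the nearest even integer, not always upward
rounded = np.around(halves)

print(rounded)  # [0. 2. 2. 4.]
```

Decimal values that are not exactly representable in binary, such as 5.55, may round in either direction depending on their stored value.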

1.4.3 Linear Algebra and Matrices

1.4.3.1 Dot Product: dot

For two-dimensional arrays, numpy.dot() computes the matrix product.

np.dot([[2,1], [4,3]], [[1,2], [1,0]])
'''
array([[3, 4],
       [7, 8]])
'''
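To see where each entry of the result comes from, the sketch below recomputes the top-left element by hand as a row-times-column sum and contrasts the matrix product with elementwise multiplication. The @ operator is shown as an equivalent spelling for 2-D arrays.

```python
import numpy as np

a = np.array([[2, 1], [4, 3]])
b = np.array([[1, 2], [1, 0]])

product = np.dot(a, b)
same_product = a @ b          # @ is the matrix-multiplication operator

# Entry (0, 0) is row 0 of a times column 0 of b: 2*1 + 1*1
top_left = a[0, 0] * b[0, 0] + a[0, 1] * b[1, 0]

# The * operator multiplies element by element instead
elementwise = a * b

print(product)      # [[3 4] [7 8]]
print(top_left)     # 3
print(elementwise)  # [[2 2] [4 0]]
```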

1.4.3.2 Transpose: T

arr
'''
array([[36, 50, 31, 15],
       [71,  3,  8,  9],
       [47, 12,  8, 47]])
'''
arr.T
'''
array([[36, 71, 47],
       [50,  3, 12],
       [31,  8,  8],
       [15,  9, 47]])
'''

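1.4.3.3 A Brief Note on the numpy.matlib Matrix Library

NumPy includes a matrix library, numpy.matlib, whose functions return matrix objects rather than plain ndarray objects.

2 Pandas

Pandas is a toolkit for analyzing structured data. It is used for data mining and data analysis, and it also provides data-cleaning features.

Data structures

Data structure	Dimensions	Description
Series	1	Labeled one-dimensional homogeneous array
DataFrame	2	Labeled, size-mutable, two-dimensional heterogeneous table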

2.1 Series

2.1.1 Introduction

A Series is an object similar to a one-dimensional array. It consists of the data (a NumPy ndarray) and the index associated with it.

Imports

import pandas as pd
import numpy as np
from pandas import Series, DataFrame

2.1.2 Creating a Series

list1 = [1, 2, 3, 4, 5]
Series(data=list1)  # implicit index
'''
0    1
1    2
2    3
3    4
4    5
dtype: int64
'''

dict1 = {
    'A': 100,
    'B': 99,
    'C': 120,
}
Series(data=dict1)  # explicit index
'''
A    100
B     99
C    120
dtype: int64
'''

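2.1.3 Indexing and Slicing

Setting the index
Implicit index: generated automatically (0, 1, 2, ...) when none is specified;
Explicit index: a custom index, set through the index parameter or by passing a dict.

s = Series(data=np.random.randint(0, 100, size=(3,)), index=['A', 'B', 'C'])
'''
A    44
B    90
C    39
dtype: int32
'''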

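Selecting by index

s.iloc[0]  # 44, by position
s['A']  # 44, by label
s.A  # 44, by attribute access

Selecting by slice

s[0: 3]  # positional slice, end excluded
s['A': 'C']  # label slice, end included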
'''
A    44
B    90
C    39
dtype: int32
'''

2.1.4 Attributes

# Index
s.index  # Index(['A', 'B', 'C'], dtype='object')
# Values
s.values  # array([44, 90, 39])

s.size  # 3
s.shape  # (3,)

2.1.5 Common Methods

head and tail

s.head(2)
'''
A    44
B    90
dtype: int32
'''

s.tail(2)
'''
B    90
C    39
dtype: int32
'''

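unique and nunique

s = Series(data=[1,1,2,2,3,3,3,3,3,3,4,5,6,7,7,7])
# Remove duplicate elements
s.unique()  # array([1, 2, 3, 4, 5, 6, 7])
# Count the distinct elements
s.nunique()  # 7

Arithmetic
Elements with matching index labels are combined; a label present in only one Series yields NaN.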

s1 = Series(data=[1,2,3,4,5], index=['a','b','c','d','e'])
s2 = Series(data=[1,2,3,4,5], index=['a','b','f','d','e'])
s3 = s1 + s2
'''
a     2.0
b     4.0
c     NaN
d     8.0
e    10.0
f     NaN
dtype: float64
'''
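If NaN is not the desired result for unmatched labels, the Series.add method accepts a fill_value that stands in for the missing side. The sketch below repeats the two Series from the example and compares both behaviors.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'f', 'd', 'e'])

# The + operator aligns on labels; 'c' and 'f' exist on one side only
plain = s1 + s2

# add() with fill_value=0 treats the missing side as 0 instead of NaN
filled = s1.add(s2, fill_value=0)

print(plain.isna().sum())        # 2
print(filled['c'], filled['f'])  # 3.0 3.0
```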

isnull and notnull

# Check whether each element of the Series is missing: True if missing, False otherwise.
s3.isnull()
'''
a    False
b    False
c     True
d    False
e    False
f     True
dtype: bool
'''
# Select the missing entries
s3[s3.isnull()]
'''
c   NaN
f   NaN
dtype: float64
'''

# Check whether each element is present: True if not missing, False otherwise.
s3.notnull()
'''
a     True
b     True
c    False
d     True
e     True
f    False
dtype: bool
'''
# Select the non-missing entries (a basic data-cleaning step).
s3[s3.notnull()]
'''
a     2.0
b     4.0
d     8.0
e    10.0
dtype: float64
'''
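Filtering with a notnull mask is equivalent to calling dropna. The sketch below rebuilds a Series with the same values as the sum above (typed in by hand rather than computed) and confirms that both approaches give the same result.

```python
import numpy as np
import pandas as pd

s3 = pd.Series([2.0, 4.0, np.nan, 8.0, 10.0, np.nan],
               index=['a', 'b', 'c', 'd', 'e', 'f'])

kept_by_mask = s3[s3.notnull()]   # boolean mask selects the present values
kept_by_dropna = s3.dropna()      # shortcut with the same effect

missing_count = int(s3.isnull().sum())

print(kept_by_mask.equals(kept_by_dropna))  # True
print(missing_count)                        # 2
```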

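2.2 DataFrame

2.2.1 Introduction

A DataFrame is the tabular data structure in Pandas. It holds an ordered set of columns, and different columns may have different types (numeric, string, boolean, and so on). It can be viewed as a dict of Series.

Row index: index
Column index: columns
Values: values

2.2.2 Creating a DataFrame

From an ndarray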

df = DataFrame(data=np.random.randint(0, 100, size=(5, 6)), columns=['a','b','c','d','e','f'], index=['A','B','C','D','E'])
'''
    a	b	c	d	e	f
A	16	49	89	28	14	17
B	86	35	95	90	4	85
C	88	67	57	13	1	76
D	31	34	62	30	52	89
E	92	56	98	20	1	16
'''

From a dict

dict1 = {
    'name': ['A', 'B', 'C'],
    'salary': [10000, 20000, 30000]
}
df = DataFrame(data=dict1, index=['a', 'b', 'c'])
'''
   name	salary
a	A	10000
b	B	20000
c	C	30000
'''

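Exercise
Create a DataFrame named score_df from the following exam score table. The columns are the students 张三 (Zhang San) and 李四 (Li Si); the rows are the subjects 语文 (Chinese), 数学 (math), 英语 (English), and 理综 (combined science).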

	张三	李四
语文	150	0
数学	150	0
英语	150	0
理综	300	0

score_dict = {
    '张三': [150, 150, 150, 300],
    '李四': [0, 0, 0, 0]
}
score_df = DataFrame(data=score_dict, index=['语文', '数学', '英语', '理综'])

2.2.3 Attributes

dict1 = {
    'name': ['A', 'B', 'C'],
    'salary': [10000, 20000, 30000]
}
df = DataFrame(data=dict1, index=['a','b','c'])

df.values
'''
array([['A', 10000],
       ['B', 20000],
       ['C', 30000]], dtype=object)
'''

df.columns
'''
Index(['name', 'salary'], dtype='object')
'''

df.index
'''
Index(['a', 'b', 'c'], dtype='object')
'''

df.shape  # (3, 2)

2.2.4 Indexing

df
'''
	    张三	李四
语文	150	    0
数学	150	    0
英语	150	    0
理综	300	    0
'''

Select columns by name.

df['张三']
df[['张三', '李四']]

iloc and loc select rows.
iloc selects rows by implicit (positional) index;
loc selects rows by explicit (label) index.

df.loc['语文']
df.iloc[0]
df.iloc[[1, 2, 3]]

Selecting elements

df.loc['数学', '张三']  # 150
df.iloc[1, 0]  # 150

df.iloc[[0, 2], 0]
'''
语文    150
英语    150
Name: 张三, dtype: int64
'''

2.2.5 Slicing

Slicing rows

df[1: 3]
'''
	    张三	李四
数学	150	     0
英语	150	     0
'''

Slicing columns

df.iloc[:, 0: 1]
'''
	    张三
语文	150
数学	150
英语	150
理综	300
'''

Indexing
df[col]: select a column
df.loc[index]: select a row
df.iloc[index, col]: select an element

Slicing
df[index1: index3]: slice rows
df.iloc[:, col1: col3]: slice columns
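The distinction between loc and iloc is easiest to see when the row labels are integers that do not match the positions. The following sketch uses a small hypothetical DataFrame (not the score table above) to show that loc works with labels and includes the end of a slice, while iloc works with positions and excludes it.

```python
import pandas as pd

# Hypothetical data: integer labels 10, 20, 30 sit at positions 0, 1, 2
df = pd.DataFrame({'score': [90, 80, 70]}, index=[10, 20, 30])

by_label = df.loc[10, 'score']   # label 10, column 'score'
by_position = df.iloc[0, 0]      # first row, first column

label_slice = df.loc[10:20]      # labels 10 through 20, end included
position_slice = df.iloc[0:1]    # positions 0 up to 1, end excluded

print(by_label, by_position)     # 90 90
print(len(label_slice))          # 2
print(len(position_slice))       # 1
```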

2.2.6 Exercise

Initial data

# Midterm exam scores
midterm_score_dict = {
    '张三': [150, 150, 150, 300],
    '李四': [0, 0, 0, 0]
}
midterm_score_df = DataFrame(data=midterm_score_dict, index=['语文', '数学', '英语', '理综'])

# Final exam scores
final_score_dict = {
    '张三': [100, 90, 90, 100],
    '李四': [0, 0, 0, 0]
}
final_score_df = DataFrame(data=final_score_dict, index=['语文', '数学', '英语', '理综'])

Compute the average of the midterm and final scores.

average_df = (midterm_score_df + final_score_df) / 2

张三 was caught cheating on the midterm 数学 (math) exam, so that score is set to 0.

midterm_score_df.loc['数学', '张三'] = 0

李四 reported the cheating and is rewarded with 100 extra points in every midterm subject.

midterm_score_df['李四'] += 100

Add 10 points to every subject for every student on the midterm.

midterm_score_df += 10

2.2.7 Converting to a Datetime Type

pd.to_datetime(col)

Prepare the data

info_dict = {
    'name': ['Jay', 'Tom', 'Bobo'],
    'hire_date': ['2010-10-11', '2012-12-01', '2011-11-12'],
    'salary': [10000, 20000, 30000]
}
df = DataFrame(data=info_dict)
'''
    name	hire_date	salary
0	Jay	    2010-10-11	10000
1	Tom	    2012-12-01	20000
2	Bobo	2011-11-12	30000
'''

Inspect the info

df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       3 non-null      object
 1   hire_date  3 non-null      object
 2   salary     3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
'''

Convert the date strings to a datetime type

df['hire_date'] = pd.to_datetime(df['hire_date'])

Inspect the info again

df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   name       3 non-null      object        
 1   hire_date  3 non-null      datetime64[ns]
 2   salary     3 non-null      int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
'''

2.2.8 Using a Column as the Row Index

Set the hire_date column as the row index.

new_df = df.set_index('hire_date')
'''
	        name	salary
hire_date		
2010-10-11	Jay	    10000
2012-12-01	Tom	    20000
2011-11-12	Bobo	30000
'''

new_df.shape  # (3, 2)
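Once a datetime column becomes the index, rows can be selected with partial date strings, which is the mechanism the stock analysis in section 3 relies on. The sketch below repeats the sample data and sorts the index first, since range selection by date is most predictable on a sorted index; the variable names by_date and hired_2011_2012 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jay', 'Tom', 'Bobo'],
    'hire_date': ['2010-10-11', '2012-12-01', '2011-11-12'],
    'salary': [10000, 20000, 30000],
})
df['hire_date'] = pd.to_datetime(df['hire_date'])

# The .dt accessor exposes date components of a datetime column
years = df['hire_date'].dt.year.tolist()

# Use the dates as a sorted row index, then select by year strings
by_date = df.set_index('hire_date').sort_index()
hired_2011_2012 = by_date.loc['2011':'2012']

print(years)                               # [2010, 2012, 2011]
print(hired_2011_2012['name'].tolist())    # ['Bobo', 'Tom']
```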

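3 The Tushare Financial Data Interface Package

3.1 Overview

Tushare is a financial data interface package that provides stock and other financial data in a form that is easy to analyze. Most of the data it returns comes as Pandas DataFrames, which makes analysis and visualization with Pandas, NumPy, and Matplotlib straightforward.

Installing Tushare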

python -m pip install tushare

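3.2 Stock Analysis

Requirements:

Use Tushare to fetch roughly the last ten years of market data for Kweichow Moutai [600519];
List every date on which the close was more than 3% above the open;
List every date on which the open was more than 2% below the previous day's close;
Suppose that starting on January 1, 2010 we buy 1 lot on the first trading day of every month and sell all shares on the last trading day of every year. What is the return to date?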

3.2.1 Question 1

Use Tushare to fetch the historical market data of a stock (600519).

Imports

import tushare as ts
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Fetch the stock's historical market data with Tushare

df = ts.get_k_data(code='600519', start='2010-01-01')

Persist the data to disk

df.to_csv('./maotai.csv')

Load the data from the file

df = pd.read_csv('./maotai.csv')
df.head()

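Drop the Unnamed: 0 column, which holds the old row index written by to_csv.
Note that in the drop family of functions, axis=0 refers to rows and axis=1 to columns.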

df.drop(labels='Unnamed: 0', axis=1, inplace=True)

Inspect the data.

df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2551 entries, 0 to 2550
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    2551 non-null   object 
 1   open    2551 non-null   float64
 2   close   2551 non-null   float64
 3   high    2551 non-null   float64
 4   low     2551 non-null   float64
 5   volume  2551 non-null   float64
 6   code    2551 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 139.6+ KB
'''

Convert the date column from strings to a datetime type.

df['date'] = pd.to_datetime(df['date'])

df['date'].dtype  # dtype('<M8[ns]')

df.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2551 entries, 0 to 2550
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    2551 non-null   datetime64[ns]
 1   open    2551 non-null   float64       
 2   close   2551 non-null   float64       
 3   high    2551 non-null   float64       
 4   low     2551 non-null   float64       
 5   volume  2551 non-null   float64       
 6   code    2551 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 139.6 KB
'''

Use the date column as the row index of the data.

df.set_index(keys='date', inplace=True)

df.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2551 entries, 2010-01-04 to 2020-07-13
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   open    2551 non-null   float64
 1   close   2551 non-null   float64
 2   high    2551 non-null   float64
 3   low     2551 non-null   float64
 4   volume  2551 non-null   float64
 5   code    2551 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 139.5 KB
'''

3.2.2 Question 2

List every date on which the close was more than 3% above the open.

(close - open) / open > 0.03
(df['close'] - df['open']) / df['open'] > 0.03

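Note: whenever a DataFrame operation yields a boolean Series, the natural next step is to use it as a row indexer on the original data.

Get the rows that meet the condition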

df.loc[(df['close'] - df['open']) / df['open'] > 0.03]

Get the dates that meet the condition

df.loc[(df['close'] - df['open']) / df['open'] > 0.03].index

3.2.3 Question 3

List every date on which the open was more than 2% below the previous day's close.

(open - previous close) / previous close < -0.02

df['close'].shift(1).head()
'''
date
2010-01-04        NaN
2010-01-05    108.446
2010-01-06    108.127
2010-01-07    106.417
2010-01-08    104.477
Name: close, dtype: float64
'''

(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02
df.loc[(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02]
df.loc[(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02].index
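The role of shift(1) is to move each close down one row so that it lines up with the next day's open. The sketch below demonstrates this on three rows of made-up prices (hypothetical values). The first row has no previous close, so its comparison evaluates to False and the row is never selected.

```python
import pandas as pd

# Hypothetical prices for three trading days
prices = pd.DataFrame(
    {'open': [100.0, 97.0, 99.0], 'close': [100.0, 98.0, 99.5]},
    index=pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06']),
)

# Previous day's close aligned with today's row: NaN, 100.0, 98.0
prev_close = prices['close'].shift(1)

# Overnight gap: NaN, -3%, about +1%
gap = (prices['open'] - prev_close) / prev_close

# Comparisons against NaN are False, so the first day drops out
gap_down_days = prices.loc[gap < -0.02].index

print(list(gap_down_days))  # only 2020-01-03
```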

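3.2.4 Question 4

Suppose that starting on January 1, 2010 we buy 1 lot (100 shares) on the first trading day of every month and sell all shares on the last trading day of every year. What is the return to date?

Analysis:

Buying
On the first trading day of each month, buy one lot (100 shares) at the opening price,
so a full year buys 12 months * 100 shares = 1200 shares.

Selling
On the last trading day of each year (on or just before 12-31), sell all shares at the opening price,
so each year sells 1200 shares.

The data ends in July 2020, so only 700 shares are bought in 2020 and they cannot be sold yet. The value of these remaining shares must therefore be included when computing the total return.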

new_df = df['2010':'2020']

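Resampling the data: resample

# Rows for the first trading day of each month.
# The alias 'M' means month end; pandas 2.2 and later prefer 'ME'.
df_monthly = new_df.resample(rule='M').first()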

Compute the total cost of the purchases

cost = df_monthly['open'].sum() * 100  # 4636917.100000001

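Compute the income from selling; 'A' means year end (pandas 2.2 and later prefer 'YE').

# Drop the last row (2020) because that year is not finished
df_yearly = new_df.resample('A').last()[0:-1]
recv = df_yearly['open'].sum() * 1200  # 4368184.8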

Compute the value of the remaining shares

last = 700 * df['open'].iloc[-1]

Compute the total return

recv + last - cost  # 913567.6999999993
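The whole calculation can be sanity-checked on synthetic data. The sketch below is not the article's method: it swaps resample for groupby on the year and month of the index, which avoids the frequency aliases that changed between pandas versions, and it uses a made-up constant opening price of 10.0 over business days from January 2019 to March 2020. With a flat price the strategy should break even, which makes the bookkeeping easy to verify.

```python
import pandas as pd

# Synthetic data (hypothetical, not real quotes): constant open price of 10.0
dates = pd.bdate_range('2019-01-01', '2020-03-31')
quotes = pd.DataFrame({'open': 10.0}, index=dates)

year = quotes.index.year
month = quotes.index.month

# First trading row of every (year, month): one purchase of 100 shares each
monthly_first = quotes.groupby([year, month]).first()
cost = monthly_first['open'].sum() * 100

# Last trading row of every year; the final year is unfinished, so drop it
yearly_last = quotes.groupby(year).last()
completed_years = yearly_last.iloc[:-1]
income = completed_years['open'].sum() * 1200

# Shares bought in the unfinished year, valued at the most recent open
months_in_last_year = int(
    (monthly_first.index.get_level_values(0) == year.max()).sum()
)
holding_value = months_in_last_year * 100 * quotes['open'].iloc[-1]

profit = income + holding_value - cost

print(cost, income, holding_value)  # 15000.0 12000.0 3000.0
print(profit)                       # 0.0
```

Like the original calculation, this sketch assumes every completed year contains twelve purchases and ignores fees and taxes.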