cp11_statistics_dropna_np.isnan(raw).any()_columns==shape[1]_

Normality tests
A large number of important financial models, like the mean-variance portfolio theory and the capital asset pricing model (CAPM), rest on the assumption that returns of securities are normally distributed; therefore, this chapter presents some approaches to test a given time series for normality of returns.


Portfolio theory
Modern portfolio theory (MPT) can be considered one of the biggest successes of
statistics in finance; starting in the early 1950s with the work of pioneer Harry
Markowitz, this theory began to replace people’s reliance on judgment and
experience with rigorous mathematical and statistical methods when it comes to the
investment of money in financial markets. In that sense, it is maybe the first real
quantitative approach in finance.


Principal component analysis
Principal component analysis (PCA) is quite a popular tool in finance, for example,
when it comes to implementing equity investment strategies or analyzing the
principal components that explain the movement in interest rates. Its major benefit is
“complexity reduction,” achieved by deriving a small set of linearly independent
(noncorrelated, orthogonal) components from a potentially large set of maybe highly
correlated time series components; we illustrate the application based on the German
DAX index and the 30 stocks contained in that index.


Bayesian regression
On a fundamental level, Bayesian statistics introduces the notion of beliefs of agents
and the updating of beliefs to statistics; when it comes to linear regression, for
example, this might take on the form of having a statistical distribution for regression
parameters instead of single point estimates (e.g., for the intercept and slope of the
regression line). Nowadays, Bayesian methods are rather popular and important in
finance, which is why we illustrate some (advanced) applications in this chapter.

Machine learning

Machine learning (or statistical learning) is based on advanced statistical methods and is
considered a subdiscipline of artificial intelligence (AI). Like statistics itself, machine learning
offers a rich set of approaches and models to learn from data sets and create predictions based
on what is learned. Different algorithms of learning are distinguished, such as those for
supervised learning or unsupervised learning. The types of problems solved by the algorithms
differ as well, such as estimation or classification. The examples presented in this chapter fall
in the category of supervised learning for classification.

 

Normality Tests
The normal distribution can be considered the most important distribution in finance and
one of the major statistical building blocks of financial theory. Among others, the
following cornerstones of financial theory rest to a large extent on the normal distribution
of stock market returns:


Portfolio theory
When stock returns are normally distributed, optimal portfolio choice can be cast into
a setting where
only the mean return and the variance of the returns (or the volatility)
as well as the covariances between different stocks are relevant for an investment
decision (i.e., an optimal portfolio composition).


Capital asset pricing model
Again, when stock returns are normally distributed, prices of single stocks can be
elegantly expressed in relationship to a broad market index; the relationship is
generally expressed by a measure for the comovement of a single stock with the
market index, called beta (β).


Efficient markets hypothesis
An efficient market is a market where prices reflect all available information, where
“all” can be defined more narrowly or more widely (e.g., as in “all publicly
available” information vs. including also “only privately available” information); if
this hypothesis holds true, then stock prices fluctuate randomly and returns are
normally distributed.


Option pricing theory
Brownian motion is the standard and benchmark model for the modeling of random
stock (and other security) price movements; the famous Black-Scholes-Merton option
pricing formula uses a geometric Brownian motion as the model for a stock’s random
fluctuations over time, leading to normally distributed returns.


This by far nonexhaustive list underpins the importance of the normality assumption in
finance.

 

Benchmark Case
To set the stage for further analyses, we start with the geometric Brownian motion as one
of the canonical stochastic processes used in financial modeling. The following can be
said about the characteristics of paths from a geometric Brownian motion S:
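For reference, the dynamics simulated below are those of geometric Brownian motion under a constant short rate r and constant volatility sigma; the exact discretization scheme implemented in gen_paths() is (standard results, stated here for context):

dS_t = r S_t \, dt + \sigma S_t \, dZ_t

S_t = S_{t - \Delta t} \exp\!\Big( \big(r - \tfrac{1}{2}\sigma^2\big)\,\Delta t + \sigma \sqrt{\Delta t}\, z_t \Big), \qquad z_t \sim \mathcal{N}(0, 1)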

import math
import numpy as np               # for array handling and random number generation
import scipy.stats as scs
import statsmodels.api as sm
import matplotlib as mpl
import matplotlib.pyplot as plt

#from pylab import mpl, plt
%matplotlib inline

plt.style.use('seaborn')
mpl.rcParams['font.family'] = 'serif'

def gen_paths(s0, r, sigma, T, M, I):
    ''' Generate Monte Carlo paths for geometric Brownian motion.
    Parameters
    ==========
    S0: float
        initial stock/index value
    r: float
        constant short rate
    sigma: float
        constant volatility
    T: float
        final time horizon
    M: int
        number of time steps/intervals
    I: int
        number of paths to be simulated
    Returns
    =======
    paths: ndarray, shape (M + 1, I)
        simulated paths given the parameters
    '''
    dt = T/M
    paths = np.zeros((M+1, I))
    paths[0] = s0         # set the initial value s0 for all I paths (s0 is broadcast across the first row)
    for t in range(1, M+1):
        rand = np.random.standard_normal(I)  # I standard normal draws; normality is assumed here, so the simulated log returns should also test as normal
        rand = (rand - rand.mean()) / rand.std()  # moment matching: standardize to mean 0 and std 1
        # vectorized exact discretization of geometric Brownian motion
        paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * rand)  # rand plays the role of z
    return paths

 

s0 = 100.
r = 0.05
sigma = 0.2 #Constant volatility factor
T = 1.0     #Time horizon in year fractions.
M = 50      #Number of time intervals.
I = 250000  #Number of simulated processes.

np.random.seed(1000)

paths = gen_paths(s0, r, sigma, T, M, I)

plt.figure( figsize=(10,6) )
plt.plot(paths[:, :10])  #shows the first 10 simulated paths from the simulation:
plt.xlabel('time steps')
plt.ylabel('index level')
plt.title("Ten simulated paths of geometric Brownian motion")

The main interest is in the distribution of the log returns. To this end, an ndarray object with all the log returns is created based on the simulated paths. Here, a single simulated path and the resulting log returns are shown:
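(The code for this step is not included in the notes; a minimal version, following the source's approach and assuming paths has been generated as above:)

log_returns = np.log(paths[1:] / paths[:-1])  # shape (M, I): per-step log returns for every path

paths[:, 0].round(4)        # a single simulated path (index levels)

log_returns[:, 0].round(4)  # the corresponding log returns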

def print_statistics(array):
    ''' Prints selected statistics.
    Parameters
    ==========
    array: ndarray
        object to generate statistics on
    '''
    sta = scs.describe(array)
    print('%14s %15s' % ('statistic', 'value'))
    print(30 * '-')
    print('%14s %15.5f' % ('size', sta[0]))  # M*I= 50*250 000
    print('%14s %15.5f' % ('min', sta[1][0]))
    print('%14s %15.5f' % ('max', sta[1][1]))
    print('%14s %15.5f' % ('mean', sta[2]))
    print('%14s %15.5f' % ('std', np.sqrt(sta[3])))
    print('%14s %15.5f' % ('skew', sta[4]))
    print('%14s %15.5f' % ('kurtosis', sta[5]))

print_statistics(log_returns.flatten())

The next plot compares the frequency distribution of the simulated log returns with the probability density function (PDF) of the normal distribution given the parameterizations for r and sigma. The function used is norm.pdf() from the scipy.stats subpackage. There is obviously quite a good fit:

 

plt.figure( figsize=(10,6) )

plt.hist(log_returns.flatten(), bins=70, density=True, label='frequency', color='b')  # density=True replaces the deprecated normed=True
plt.xlabel('log return')
plt.ylabel('frequency')

x=np.linspace( plt.axis()[0], plt.axis()[1] )
plt.plot( x,  #The location (loc) keyword specifies the mean. 
              #The scale (scale) keyword specifies the standard deviation
         scs.norm.pdf( x, loc =r/M, scale=sigma/np.sqrt(M) ),  
         'r', 
         lw=2.0, 
         label='pdf'
        )
plt.legend()

Comparing a frequency distribution (histogram) with a theoretical PDF is not the only way to graphically “test” for normality. So-called quantile-quantile (QQ) plots are also well suited for this task. Here, sample quantile values are compared to theoretical quantile values. For normally distributed sample data sets, such a plot might look like Figure 13-3, with the absolute majority of the quantile values (dots) lying on a straight line:

sm.qqplot(log_returns.flatten()[::500], line="s")
plt.grid(True)
plt.xlabel("theoretical quantiles")
plt.ylabel('sample quantiles')
plt.title("Figure 13-3. Quantile-quantile plot for log returns of geometric Brownian motion")

However appealing the graphical approaches might be, they generally cannot replace more rigorous testing procedures. The function normality_test() defined below combines three different statistical tests:

Skewness test (skewtest) This tests whether the skew of the sample data is “normal” (i.e., has a value close enough to zero).

Kurtosis test (kurtosistest) Similarly, this tests whether the kurtosis of the sample data is “normal” (again, close enough to zero).

Normality test (normaltest) This combines the other two test approaches to test for normality.

The test values indicate that the log returns of the geometric Brownian motion are indeed normally distributed — i.e., they show p-values of 0.05 or above:

def normality_test(arr):
    ''' Tests for normality of a given data set.
    Parameters
    ==========
    arr: ndarray
        object to generate statistics on
    '''
    print("Skew of data set  %14.3f" % scs.skew(arr))
    print("Skew test p-value %14.3f" % scs.skewtest(arr)[1])
    print("Kurt of data set  %14.3f" % scs.kurtosis(arr))
    print("Kurt test p-value %14.3f" % scs.kurtosistest(arr)[1])
    print("Norm test p-value %14.3f" % scs.normaltest(arr)[1])

normality_test(log_returns.flatten()) #All p-values are well above 0.05.

(The analysis above refers to the log returns, log_returns.)

End-of-period values

Finally, a check of whether the end-of-period values are indeed log-normally distributed. This boils down to a normality test, since one only has to transform the data by applying the log function to arrive at normally distributed values (or maybe not). Figure 13-4 plots both the log-normally distributed end-of-period values and the transformed ones (“log index level”):

f, (ax1, ax2) = plt.subplots(1,2, figsize=(10,6))

ax1.hist(paths[-1], bins=30)
ax1.set_xlabel('index level')
ax1.set_ylabel('frequency')
ax1.set_title('regular data')

ax2.hist(np.log(paths[-1]), bins=30)
ax2.set_xlabel('log index level')
ax2.set_title('log data')
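What makes the log transform work here is a standard property of geometric Brownian motion (stated for context, not part of the original notes): the terminal value satisfies

S_T = S_0 \exp\!\Big( \big(r - \tfrac{1}{2}\sigma^2\big) T + \sigma \sqrt{T}\, z \Big), \qquad z \sim \mathcal{N}(0, 1),

so \log S_T = \log S_0 + \big(r - \tfrac{1}{2}\sigma^2\big) T + \sigma \sqrt{T}\, z is normally distributed, while S_T itself is log-normal.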

The statistics for the data set show expected behavior — for example, a mean value close to 105 and a standard deviation (volatility) close to 20%:

print_statistics(paths[-1])

The log index level values also have skew and kurtosis values close to zero:

print_statistics(np.log(paths[-1]))

This data set also shows high p-values (the null hypothesis of normality cannot be rejected), providing strong support for the normal distribution hypothesis:

normality_test(np.log(paths[-1]))

(Up to this point, the takeaway is that the raw end-of-period values are not normally distributed; only after taking logs do they become normal, i.e., they are log-normally distributed.)

The next plot again compares the frequency distribution with the PDF of the normal distribution, showing a pretty good fit (as now is, of course, to be expected):

plt.figure(figsize=(10,6))
log_data = np.log(paths[-1])
plt.hist(log_data, bins=70, density=True, label='observed')  # density=True replaces the deprecated normed=True
plt.xlabel('log index level')
plt.ylabel('frequency')

x = np.linspace(plt.axis()[0], plt.axis()[1])
plt.plot(x, scs.norm.pdf(x, log_data.mean(), log_data.std()), 'r', lw=2.0, label='pdf')
plt.legend()

The QQ plot also supports the hypothesis that the log index levels are normally distributed:

sm.qqplot(log_data, line='s')
plt.grid(True)
plt.xlabel('theoretical quantiles')
plt.ylabel('sample quantiles')

 

#################

rand = np.random.standard_normal(I)  # reminder: standard normal draws are assumed in the simulation, so the normality tests on the simulated returns must confirm normality

#################

Real-World Data

The data management tool of choice is pandas (see Chapter 8). Figure 13-7 shows the normalized prices over time:

import pandas as pd
import numpy as np

                                                     # first column as index
raw = pd.read_csv('../source/tr_eikon_eod_data.csv', index_col = 0, parse_dates = True)#.dropna()
raw.head(n=15)

raw.info()

np.isnan(raw).any()

raw=raw.dropna()

symbols = ['SPY', 'GLD', 'AAPL.O', 'MSFT.O']
data = raw[symbols]
data = data.dropna()
np.isnan(data).any()  # no NaN values remain

data.info()

data.head()

data.iloc[0] 

(data/data.iloc[0]*100).plot(figsize=(10,6), title="Normalized prices of financial instruments over time")

log_returns = np.log(data/data.shift(1))
log_returns.head()

log_returns.hist(bins=50, figsize=(10,8))

The histograms show all log returns for the four instruments. Although not easy to judge visually, one can guess that these frequency distributions might not be normal.

for sym in symbols:
    print("\nResults for symbol {}".format(sym))
    print(30*'-')
    log_data = np.array(log_returns[sym].dropna())
    print_statistics(log_data)

The kurtosis values seem to be especially far from normal for all four data sets.

sm.qqplot(log_returns['SPY'].dropna(), line='s') # the first item is NAN, so we need to drop it
plt.title('SPY')
plt.xlabel("theoretical quantiles")
plt.ylabel('sample quantiles')

The QQ plot for the SPY ETF shows that the sample quantile values obviously do not lie on a straight line, indicating “non-normality.” On the left and right sides there are many values that lie well below the line and well above the line, respectively. In other words, the time series data exhibits fat tails. This term refers to a (frequency) distribution where large negative and positive values are observed more often than a normal distribution would imply.

sm.qqplot(log_returns['MSFT.O'].dropna(), line='s')
plt.grid(True)
plt.xlabel('theoretical quantiles')
plt.ylabel('sample quantiles')
plt.title('Quantile-quantile plot for Microsoft log returns')

for sym in symbols:
    print('\nResults for symbol {}'.format(sym))
    print(32*'-')
    log_data = np.array(log_returns[sym].dropna())
    normality_test(log_data) #Normality test results for the times series of the financial instruments.

The p-values of the different tests are all zero, strongly rejecting the test hypothesis that the different sample data sets are normally distributed. This shows that the normal assumption for stock market returns and other asset classes, as embodied for example in the geometric Brownian motion model, cannot be justified in general, and that one might have to use richer models that are able to generate fat tails (e.g., jump diffusion models or models with stochastic volatility).

 

Portfolio Optimization

Modern or mean-variance portfolio theory (MPT) is a major cornerstone of financial theory.

As pointed out previously, the assumption of normally distributed returns is fundamental to the theory:

By looking only at mean and variance, we are necessarily assuming that no other statistics are necessary to describe the distribution of end-of-period wealth. Unless investors have a special type of utility function (quadratic utility function), it is necessary to assume that returns have a normal distribution, which can be completely described by mean and variance.

The Data

The analysis and examples that follow use the same financial instruments as before. The basic idea of MPT is to make use of diversification to achieve a minimal portfolio risk given a target return level, or a maximum portfolio return given a certain level of risk. One would expect such diversification effects for the right combination of a larger number of assets and a certain diversity in the assets. However, to convey the basic ideas and to show typical effects, four financial instruments shall suffice.

symbols = ['AAPL.O', 'MSFT.O', 'SPY', 'GLD'] #Four financial instruments for portfolio composition.
noa = len(symbols) #Number of financial instruments defined.
data = raw[symbols]
rets = np.log( data/data.shift(1))
rets.hist(bins=40, figsize=(10,8))

The covariance matrix for the financial instruments to be invested in is the central piece of the portfolio selection process. pandas has a built-in method to generate the covariance matrix, to which the same scaling factor of 252 trading days is applied.
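The corresponding code is not shown in the notes; it reads:

rets.cov() * 252  # annualized covariance matrix of the daily log returns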

Over the period of the time series data, we see significant differences in the annualized performance. We use a factor of 252 trading days to annualize the daily returns:

rets.mean()*252 # 252 trading days

The Basic Theory

In what follows, we assume that an investor is not allowed to set up short positions in a security. Only long positions are allowed, which means that 100% of the investor’s wealth has to be divided among the available assets in such a way that all positions are long (positive) and that the positions add up to 100%. Given the four securities, you could for example invest equal amounts into every security (i.e., 25% of your wealth in each). The following code generates four random numbers between 0 and 1 and then normalizes the values such that the sum of all values equals 1:

#Random portfolio weights
weights = np.random.random(noa) # noa: Number of financial instruments defined
weights /= np.sum(weights) #Random portfolio weights
weights

array([0.07650728, 0.06021919, 0.63364218, 0.22963135])
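For reference, the quantities computed next are the standard mean-variance expressions for a weight vector w (not written out in the original notes): the expected annualized portfolio return and the portfolio variance,

\mu_p = \mathbf{E}\Big(\sum_i w_i r_i\Big) = \sum_i w_i \mu_i = w^T \mu, \qquad \sigma_p^2 = w^T \Sigma w,

where \mu is the vector of annualized mean log returns and \Sigma the annualized covariance matrix.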

Translated into Python, the expected portfolio return boils down to a single line of code, including annualization (we multiply again by 252 to get annualized return values):

np.sum( rets.mean() * weights ) * 252  #expected portfolio return
        #rets = np.log( data/data.shift(1)

np.dot(rets.cov() * 252, weights)  # the annualized covariance matrix multiplied by the weights vector

np.dot( np.dot( rets.cov()*252, weights  ), weights.T)

            # w^T           
np.dot( weights.T, np.dot( rets.cov()*252, weights  ) )#Annualized portfolio variance given the portfolio weights.

The annualized portfolio volatility given the portfolio weights is the square root of the portfolio variance.
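The corresponding line of code is missing from the notes; it is simply the square root of the variance expression above:

np.sqrt(np.dot(weights.T, np.dot(rets.cov() * 252, weights)))  # annualized portfolio volatility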

This mainly completes the tool set for mean-variance portfolio selection. Of paramount interest to investors is what risk-return profiles are possible for a given set of financial instruments, and their statistical characteristics. To this end, the following implements a Monte Carlo simulation (see Chapter 12) to generate random portfolio weight vectors on a larger scale. For every simulated allocation, the code records the resulting expected portfolio return and variance. To simplify the code, two functions, port_ret() and port_vol(), are defined:

def port_ret(weights):
        #Annualized portfolio return given the portfolio weights.
    return np.sum(rets.mean() * weights) *252

def port_vol(weights):
            #Annualized portfolio volatility (standard deviation) given the portfolio weights.
    return np.sqrt(np.dot(weights.T, np.dot(rets.cov()*252, weights)))

prets = []
pvols = []
for p in range(2500):  # simulate 2,500 random portfolios
    weights = np.random.random(noa) # noa: Number of financial instruments defined
    weights /= np.sum(weights)  #  generates "noa" random numbers between 0 and 1 and then normalizes 
                                #the values such that the sum of all values equals 1      
    prets.append(port_ret(weights))
    pvols.append(port_vol(weights))#Collects the resulting statistics in list objects.
    
prets = np.array(prets)
pvols = np.array(pvols)

prets[:10]

The Sharpe ratio (also called the Sharpe index) is a standardized measure for evaluating the performance of funds and portfolios. Modern portfolio theory shows that risk plays a fundamental role in how a portfolio performs, so a risk-adjusted return is a composite measure that accounts for both return and risk and removes the distorting effect of risk on performance evaluation. The Sharpe ratio is one of the classic measures that consider return and risk jointly. A general feature of investing is that the higher the expected return of an asset, the higher the volatility risk the investor has to tolerate, and vice versa. A rational investor therefore aims either to maximize return for a fixed level of acceptable risk, or to minimize risk for a fixed target return. The ratio is defined as

Sharpe ratio = (E(R_p) - R_f) / σ_p

where E(R_p) is the expected portfolio return, R_f is the risk-free rate, and σ_p is the standard deviation of the portfolio returns.

The ratio measures how much excess return a portfolio generates per unit of total risk borne. It derives from the concept of the capital allocation line (CAL) and is the most common measure of this kind; it is appropriate when all assets in the portfolio are risky. A positive Sharpe ratio means the portfolio's excess return outweighs its volatility risk, while a negative value means the risk taken exceeds the reward. Every portfolio can thus be ranked by its Sharpe ratio: the higher the ratio, the better the risk-return trade-off.

For example, if government bonds return 3% (the risk-free rate), your portfolio's expected return is 15%, and its standard deviation is 6%, then 15% - 3% = 12% is the excess return over the risk-free investment, and 12% / 6% = 2 means the investor earns 2% of excess return for every 1% of additional risk.

The lesson is that investments should always be judged by comparing return against the risk taken, ideally accepting small risks in exchange for large rewards and avoiding risks that are not worth taking; the Sharpe ratio offers a simple way to measure the risk-return trade-off of any candidate portfolio.

plt.figure(figsize=(10,6))
plt.scatter(pvols, prets, c=(prets-0)/pvols, marker='o', cmap='coolwarm')
plt.xlabel('expected volatility')
plt.ylabel('expected return')
plt.colorbar(label='Sharpe ratio')
plt.title("Expected return and volatility for random portfolio weights")

It is clear by inspection of Figure 13-12 that not all weight distributions perform well when measured in terms of mean (the annualized portfolio return, np.sum(rets.mean() * weights) * 252) and volatility (the annualized standard deviation, np.sqrt(np.dot(weights.T, np.dot(rets.cov() * 252, weights)))). For example, for a fixed risk level of, say, 15%, there are multiple portfolios that all show different returns. As an investor, one is generally interested in the maximum return given a fixed risk level, or the minimum risk given a fixed return expectation. This set of portfolios then makes up the so-called efficient frontier, which is derived later in this section.

Optimal Portfolios

The optimizations that follow use the minimize() function from scipy.optimize (imported below as sco). This minimization function is quite general and allows for equality constraints, inequality constraints, and numerical bounds for the parameters.

First, the maximization of the Sharpe ratio. Formally, the negative value of the Sharpe ratio is minimized to derive the maximum value and the optimal portfolio composition. The constraint, the bounds on the weights, and the starting values required by minimize() are set up step by step below.

import scipy.optimize as sco

#we minimize the negative value of the Sharpe ratio
def min_func_sharpe(weights): #Function to be minimized.
    return -( port_ret(weights) - 0) / port_vol(weights)

The constraint is that all parameters (weights) add up to 1.

cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  # equality constraint: the weights sum to 1

We also bound the parameter values (weights) to be within 0 and 1. These values are provided to the minimization function as a tuple of tuples in this case:

bnds = tuple( (0,1) for x in range(noa) ) #Bounds for the parameters.

The only input that is missing for a call of the optimization function is a starting parameter list (initial guesses for the weights). We simply use an equal distribution:

eweights = np.array( noa* [1. /noa,] ) #Equal weights vector.
eweights

Calling the function returns more than just the optimal parameter values. The results are stored in an object called opts. The main interest lies in getting the optimal portfolio composition. To this end, one can access the results object by providing the key of interest; i.e., x in this case:

%%time

#the portfolio with the maximum Sharpe ratio:
opts = sco.minimize(min_func_sharpe, eweights,
                   method='SLSQP', #a sequential least squares programming algorithm 
                   bounds = bnds,
                   constraints = cons)

opts

opts['x'].round(3)         #The optimal portfolio weights.

np.round( port_ret( opts['x'] ), 3) #The resulting portfolio return.

port_vol(opts['x']).round(3) #The resulting portfolio volatility.

port_ret(opts['x']) / port_vol(opts['x']) # The maximum Sharpe ratio.

Next, the minimization of the variance of the portfolio. This is the same as minimizing the volatility:

# optv['x'].round(3) will give the weights of the absolute minimum volatility (minimum standard deviation) portfolio

optv = sco.minimize( port_vol, eweights,
                  method = 'SLSQP',
                  bounds = bnds,
                  constraints = cons)

optv

Efficient Frontier

The derivation of all optimal portfolios — i.e., all portfolios with minimum volatility for a given target return level (or all portfolios with maximum return for a given risk level) — is similar to the previous optimizations. The only difference is that one has to iterate over multiple starting conditions.

The approach taken is to fix a target return level and to derive for each such level those portfolio weights that lead to the minimum volatility value. For the optimization, this leads to two conditions: one for the target return level, tret, and one for the sum of the portfolio weights as before. The boundary values for each parameter stay the same. When iterating over different target return levels (trets), one condition for the minimization changes. That is why the constraints dictionary is updated during every loop:

# the two binding constraints for the efficient frontier:
cons = ({'type': 'eq', 'fun': lambda x: port_ret(x) - tret},  # fix the annualized portfolio return at the target level tret
        {'type': 'eq', 'fun': lambda x: np.sum(x) - 1})       # the weights add up to 1

bnds = tuple((0, 1) for x in range(noa))  # bounds for the parameters (each weight between 0 and 1)

%%time
trets = np.linspace(0.05, 0.2, 50)
tvols = []
# minimization of portfolio volatility for the different target returns
for tret in trets:
    res = sco.minimize(port_vol, eweights, method='SLSQP',
                       bounds=bnds, constraints=cons)
    tvols.append(res['fun'])  # the minimal volatility achievable for the target return tret
tvols = np.array(tvols)

sco.minimize(port_vol, eweights, method='SLSQP',
             bounds=bnds, constraints=cons)  # rerunning the last optimization to inspect the full result object

Figure 13-13 shows the optimization results. The blue curve indicates the optimal portfolios for the given target returns; the dots are, as before, the random portfolios. In addition, the figure shows two larger stars: a red one for the minimum volatility/variance portfolio (the leftmost portfolio on the frontier) and a yellow one for the portfolio with the maximum Sharpe ratio:

plt.figure( figsize=(10,6) )
plt.scatter(pvols, prets, c=prets/pvols,
           marker='.', alpha=0.8, cmap='coolwarm')
plt.plot(tvols, trets, 'b', lw=4.0)  #blue curve and y-axis: 0.05~0.2
plt.plot(port_vol(opts['x']), #opts:the portfolio with the maximum Sharpe ratio:
         port_ret(opts['x']),
         'y*', # yellow star
         markersize=15.0
        )
plt.plot(port_vol(optv['x']),#optv: the minimization of the volatility of the portfolio
         port_ret(optv['x']),
         'r*', # red star
         markersize=15.0
        )
plt.xlabel('expected volatility')
plt.ylabel('expected return')
plt.colorbar(label='Sharpe ratio')
plt.title('Figure 13-13. Minimum risk portfolios for given return levels (efficient frontier)')

The efficient frontier comprises all optimal portfolios with a higher return than the absolute minimum variance portfolio. These portfolios dominate all other portfolios in terms of expected return for a given risk level.

Capital Market Line

In addition to risky financial instruments like stocks or commodities (such as gold), there is in general one universal, riskless investment opportunity available: cash or cash accounts. In an idealized world, money held in a cash account with a large bank can be considered riskless (e.g., through public deposit insurance schemes). The downside is that such a riskless investment generally yields only a small return, sometimes close to zero.

However, taking into account such a riskless asset enhances the efficient investment opportunity set for investors considerably. The basic idea is that investors first determine an efficient portfolio of risky assets (the one with the highest Sharpe ratio) and then add the riskless asset to the mix (it sits on the expected return axis, at zero volatility). By adjusting the proportion of the investor’s wealth invested in the riskless asset, it is possible to achieve any risk-return profile that lies on the straight line (in risk-return space) between the riskless asset and the efficient portfolio.
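Concretely, if a fraction w of wealth is invested in the efficient (tangency) portfolio with return/volatility (\mu_T, \sigma_T) and the remainder in the riskless asset paying r_f, the combined position satisfies (a standard result, added for context)

\mu_p = r_f + w\,(\mu_T - r_f), \qquad \sigma_p = w\,\sigma_T \quad\Longrightarrow\quad \mu_p = r_f + \frac{\mu_T - r_f}{\sigma_T}\,\sigma_p,

which is exactly the straight line, the capital market line, referred to above.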

################################################################################################

Interpolation

Compared to regression, interpolation (e.g., with cubic splines) is more involved mathematically. It is also limited to low-dimensional problems. Given an ordered set of observation points (ordered in the x dimension), the basic idea is to do a regression between two neighboring data points in such a way that not only are the data points perfectly matched by the resulting piecewise-defined interpolation function, but also the function is continuously differentiable at the data points. Continuous differentiability requires at least interpolation of degree 3, i.e., with cubic splines. However, the approach also works in general with quadratic and even linear splines.

The splrep() function handles the estimation of the spline coefficients, and splev() evaluates the fitted spline at given points.

import scipy.interpolate as spi
x = np.linspace(-2 * np.pi, 2*np.pi, 25)

def f(x):
    return np.sin(x) + 0.5*x

ipo = spi.splrep(x,f(x),k=1) #Sequence of length 3 returned by splrep (knots, coefficients, degree)
iy = spi.splev(x, ipo) #the interpolation already seems really good with linear splines (i.e.,k=1)

plt.plot(x, f(x), 'b', label='f(x)')
plt.plot(x, iy, 'r.', label='interpolation')

plt.legend(loc=0)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('f(x)')

xd = np.linspace(1.0, 3.0, 50)###a much smaller interval
                    #coeff
iyd = spi.splev(xd, ipo)       #ipo = spi.splrep(x,f(x),k=1) #spline regression plotting

plt.plot(xd, f(xd), 'b', label='f(x)')
plt.plot(xd, iyd, 'r.', label='interpolation')
plt.legend(loc=0)
plt.grid(True)

plt.xlabel('x')
plt.ylabel('y')

ipo = spi.splrep(x, f(x), k=3)  #Sequence of length 3 returned by splrep (knots, coefficients, degree)
iyd = spi.splev(xd, ipo)        #ipo = spi.splrep(x,f(x),k=1)

plt.plot(xd, f(xd),'b', label='f(x)')
plt.plot(xd, iyd, 'r.', label='interpolation')

plt.legend(loc=0)
plt.grid(True)
plt.xlabel('x')
plt.ylabel('f(x)')

INTERPOLATION

In those cases where spline interpolation can be applied you can expect better approximation results compared to a least-squares regression approach. However, remember that you need to have sorted (and “nonnoisy”) data and that the approach is limited to low-dimensional problems. It is also computationally more demanding and might therefore take (much) longer than regression in certain use cases.

##############################################################################################

For the calculations that follow, a functional approximation and the first derivative for the efficient frontier are used. Cubic splines interpolation provides such a differentiable functional approximation (see Chapter 9). For the spline interpolation, only those portfolios from the efficient frontier are used. Via this numerical approach it is possible to define a continuously differentiable function f(x) for the efficient frontier and the respective first derivative function df(x):

import scipy.interpolate as sci

tvols

np.argmin(tvols)  # index position of the minimum volatility on the frontier

For the spline interpolation, we only use the portfolios from the efficient frontier. The following code selects exactly these portfolios from our previously used sets tvols and trets:

ind = np.argmin(tvols) #Index position of minimum volatility portfolio.
evols = tvols[ind:]  #Relevant portfolio volatility 
erets = trets[ind:]  #Relevant portfolio return values   #trets = np.linspace(0.05, 0.2, 50)

tck = sci.splrep( evols, erets )  #Cubic splines interpolation on these values.

def f(x):  # ''' Efficient frontier function (splines approximation). '''
    return sci.splev(x, tck, der=0)

def df(x): # ''' First derivative of efficient frontier function. '''
    return sci.splev(x, tck, der=1)

def equations(p, rf=0.01):  # a=p[0]  b=p[1]  x=p[2]   #The equation values are all zero.
    eq1 = rf - p[0]                 #The equations describing the capital market line (CML).
    eq2 = rf + p[1]*p[2] - f(p[2]) #The equations describing the capital market line (CML).
    eq3 = p[1] - df(p[2])           #The equations describing the capital market line (CML).
    return eq1, eq2, eq3
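For reference, the three equations above encode the defining conditions of the capital market line t(x) = a + b x in risk-return space, with x denoting volatility:

a = r_f \qquad \text{(the line starts at the riskless rate)}
a + b\,x = f(x) \qquad \text{(the line meets the efficient frontier)}
b = f'(x) \qquad \text{(the line is tangent to the frontier at } x\text{)}

Solving this system with fsolve() yields the intercept a, the slope b, and the volatility x of the tangency portfolio.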

opt = sco.fsolve( equations, [0.01, 0.5, 0.15]) #Solving these equations for given initial values.
opt  # a=0.01, b=0.8447, x (volatility)=0.195, approximately the volatility of the maximum Sharpe ratio portfolio

np.round(equations(opt), 6)  #The equation values are all zero. # check

Figure 13-14 presents the results graphically; the star represents the optimal portfolio from the efficient frontier for which the tangent line passes through the riskless asset point.

plt.figure( figsize=(10,6) )
plt.scatter( pvols, prets, c=(prets-0.01)/pvols, marker='.', cmap='coolwarm' )
plt.plot(evols, erets, 'b', lw=4.0) #blue curve
cx = np.linspace(0.0, 0.3)
plt.plot(cx, opt[0] + opt[1]*cx, 'r', lw=1.5) # tangent line
plt.plot(opt[2], f(opt[2]), 'y*', markersize=15.0 )

plt.grid(True)
plt.axhline(0, color='k', ls='--', lw=2.0)
plt.axvline(0, color='k', ls='--', lw=2.0)
plt.xlabel('expected volatility')
plt.ylabel('expected return')
plt.colorbar(label='Sharpe ratio')

The portfolio weights of the optimal (tangent) portfolio are as follows. Only three of the four assets are in the mix:

cons = ({'type':'eq', 'fun': lambda x: port_ret(x)-f(opt[2])},
        {'type':'eq', 'fun': lambda x: np.sum(x) -1}
       )#Binding constraints for the tangent portfolio (gold star in Figure 13-14).
res = sco.minimize(port_vol, eweights, method='SLSQP', bounds=bnds, constraints=cons)

# portfolio weights of the optimal (tangent) portfolio are as follows. Only three of the four assets
#are in the mix since the last value is 0:
res['x'].round(3) #The portfolio weights for this particular portfolio.

port_ret(res['x'])

port_vol(res['x'])

port_ret(res['x']) / port_vol(res['x'])

summary:

Portfolio Optimization

Efficient Frontier

Capital Market Line

 

Principal Component Analysis

Principal component analysis (PCA) has become a popular tool in finance. Wikipedia defines the technique as follows:

Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.
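As a brief sketch of the linear algebra behind this (standard linear PCA; the KernelPCA object used below, with its default linear kernel, yields the same nonzero eigenvalues up to a constant factor that cancels once they are normalized): for a standardized data matrix X with observations in rows,

\Sigma = \frac{1}{n-1} X^T X, \qquad \Sigma v_k = \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots,

the k-th principal component time series is X v_k, and the share of variance explained by component k is \lambda_k / \sum_j \lambda_j, which is exactly the relative weighting computed by get_we() further below.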

import numpy as np
import pandas as pd
#import pandas.io.data as web   #The pandas.io.data module is moved to a separate package
#pip install pandas_datareader in your Anaconda Prompt
import pandas_datareader.data as web
from sklearn.decomposition import KernelPCA

symbols = ['ADS.DE', 'ALV.DE', 'BAS.DE', 'BAYN.DE', 'BEI.DE',
           'BMW.DE', 'CBK.DE', 'CON.DE', 'DAI.DE', 'DB1.DE',
           'DBK.DE', 'DPW.DE', 'DTE.DE', 'EOAN.DE', 'FME.DE',
           'FRE.DE', 'HEI.DE', 'HEN3.DE', 'IFX.DE', 'LHA.DE',
           'LIN.DE', 'LXS.DE', 'MRK.DE', 'MUV2.DE', 'RWE.DE',
           'SAP.DE', 'SIE.DE', 'TKA.DE', 'VOW3.DE',  # 'SDF.DE' is excluded since its data is NaN before 2015
           '^GDAXI']  # the last item is the German DAX index

# only 29 stocks are used here, since SDF.DE has NaN values before 2015

%%time
data = pd.DataFrame()
for sym in symbols:  #We work only with the closing values of each data set that we retrieve
    data[sym] = web.DataReader(sym, data_source='yahoo')['Close']

Only the data before 2014-09-30 is used here.

np.isnan(data).any()   # data is a DataFrame; the result is a per-column judgment

data=data.dropna()  # drop nan on each row # actually, all data exists so we don't need to do it

np.isnan(data).any()

data.columns.values.tolist()  # the remaining column names (converts the DataFrame's column names to a list)

(without the 'SDF.DE' column)

The DataFrame object data now contains the closing prices for the 29 DAX stocks as well as the DAX index itself:

data[data.columns[:6]].head()
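One step that the later code relies on (it references a dax object) is missing from the notes: before applying PCA to the stocks only, the DAX index series is moved into its own DataFrame, mirroring the approach in the source book:

dax = pd.DataFrame(data.pop('^GDAXI'))  # pop the index column out of data and store it separately

dax.head()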

Applying PCA

Usually, PCA works with normalized data sets.

scale_function = lambda x: (x-x.mean()) / x.std()

pca = KernelPCA().fit( data.apply(scale_function ) )

The importance or explanatory power of each component is given by its eigenvalue. The eigenvalues are found in an attribute of the KernelPCA object. The analysis gives too many components:

len(pca.lambdas_)  # components are sorted by eigenvalue (explained variance), from largest to smallest

 

Therefore, let us only have a look at the first 10 components. The tenth component already has almost negligible influence.

pca.lambdas_[:10].round()  # the (rounded) eigenvalues of the first 10 components

  

We are mainly interested in the relative importance of each component, so we will
normalize these values. Again, we use a convenience function for this:

get_we = lambda x: x/x.sum()
get_we(pca.lambdas_)[:10]

With this information, the picture becomes much clearer. The first component already explains about 65% of the variability in the stock time series. The first five components explain about 95% of the variability:

get_we(pca.lambdas_)[:5].sum()

Constructing a PCA Index

Next, we use PCA to construct a PCA (or factor) index over time and compare it with the original index. First, we have a PCA index with a single component only:

pca = KernelPCA(n_components=1).fit(data.apply(scale_function))  #65% of the variability
dax['PCA_1'] = pca.transform(-data)  # store the constructed PCA index in the column 'PCA_1' (the minus sign flips the arbitrary sign of the component)

import matplotlib.pyplot as plt
%matplotlib inline

dax.apply(scale_function).plot(figsize=(8,4))

The figure shows the results for the normalized data; already not too bad, given the rather simple application of the approach.

Let us see if we can improve the results by adding more components. To this end, we need to calculate a weighted average from the single resulting components:

pca = KernelPCA(n_components=5).fit(data.apply(scale_function))    #95% of the variability                                      
pca_components = pca.transform(-data)
weights = get_we(pca.lambdas_)  #the relative importance
dax['PCA_5'] = np.dot(pca_components, weights)

Note: the sign of a principal component is arbitrary. If the PCA index in the figure above appears mirrored (i.e., negatively correlated with the DAX), either restart the kernel and rerun the notebook, or simply transform the data without the minus sign:

pca = KernelPCA(n_components=5).fit(data.apply(scale_function))                                          
pca_components = pca.transform(data)
weights = get_we(pca.lambdas_)
dax['PCA_5'] = np.dot(pca_components, weights)

import matplotlib.pyplot as plt
%matplotlib inline

dax.apply(scale_function).plot(figsize=(8,4))

The results are still “good,” but not that much better than before, at least upon visual inspection.

In view of the results so far, we want to inspect the relationship between the DAX index
and the PCA index in a different way — via a scatter plot, adding date information to the
mix.

First, we convert the DatetimeIndex of the DataFrame object to a matplotlib compatible
format:

import matplotlib as mpl
mpl_dates = mpl.dates.date2num(data.index)  # convert the DatetimeIndex to matplotlib date numbers
mpl_dates[:20]

# alternative, in case the call above complains about a missing 'toordinal' attribute:
mpl_dates = mpl.dates.date2num(data.index._mpl_repr())
mpl_dates[:20]

plt.figure(figsize=(8,4))
plt.scatter(dax['PCA_5'], dax['^GDAXI'], c=mpl_dates)


# after fitting, we get the polynomial coefficients
lin_reg = np.polyval(np.polyfit(dax['PCA_5'],
                                dax['^GDAXI'],
                                1),
                     dax['PCA_5'])  # evaluate the linear fit at the PCA index values
plt.plot(dax['PCA_5'], lin_reg, 'r', lw=3)


plt.grid(True)
plt.xlabel('PCA_5')
plt.ylabel('^GDAXI')


plt.colorbar(ticks=mpl.dates.DayLocator(interval=250), #every 250 days get a date
             format=mpl.dates.DateFormatter('%d %b %y')
            )


plt.title('DAX return values against PCA return values with linear regression')

The figure reveals that there is obviously some kind of structural break sometime in the middle of 2011. If the PCA index were to perfectly replicate the DAX index, we would expect all the points to lie on a straight line and to see the regression line going through these points. Perfection is hard to achieve, but we can maybe do better.

###########################################################

Compare this to a fit with a higher-order (degree 20) polynomial:

plt.figure(figsize=(8,4))
plt.scatter(dax['PCA_5'], dax['^GDAXI'], c=mpl_dates)
# after fitting, we get the polynomial coefficients
lin_reg = np.polyval(np.polyfit(dax['PCA_5'],
                                dax['^GDAXI'],
                                20),
                     dax['PCA_5'])  # evaluate the polynomial fit at the PCA index values
plt.plot(dax['PCA_5'], lin_reg, 'r', lw=3)
plt.grid(True)
plt.xlabel('PCA_5')
plt.ylabel('^GDAXI')
plt.colorbar(ticks=mpl.dates.DayLocator(interval=250), #every 250 days get a date
             format=mpl.dates.DateFormatter('%d %b %y')
            )
plt.title('DAX return values against PCA return values with linear regression')

#############################################################

To this end, let us divide the total time frame into two subintervals. We can then
implement an early and a late regression:

dax.index

cut_date = '2011/7/1'  #get it from the figure's color bar 
early_pca = dax[dax.index < cut_date]['PCA_5']
# fit the early regression on the data before the cut date
early_reg = np.polyval(np.polyfit(early_pca,
                                  dax['^GDAXI'][dax.index < cut_date], 1),
                       early_pca)  # fitted y values on the early subsample
early_pca.head()

late_pca = dax[dax.index >= cut_date]['PCA_5']
late_reg = np.polyval(np.polyfit(late_pca,
                                dax['^GDAXI'][dax.index>=cut_date],1),
                     late_pca)

plt.figure(figsize=(8,4))

plt.scatter(dax['PCA_5'], dax['^GDAXI'], c=mpl_dates)
plt.plot(early_pca, early_reg, 'r', lw=3)
plt.plot(late_pca, late_reg, 'r', lw=3)

plt.grid(True)
plt.xlabel('PCA_5')
plt.ylabel('^GDAXI')
plt.title('DAX index values against PCA index values with early and late regression (regime switch)')

plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
             format=mpl.dates.DateFormatter('%d %b %y')
            )  # regime switch around mid-2011

Bayesian Regression

H stands for an event, the hypothesis, and D represents the data that an experiment or the real world might present.
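These symbols belong to Bayes' formula, which is not written out in the notes; it reads

p(H \mid D) = \frac{p(D \mid H)\, p(H)}{p(D)},

where p(H) is the prior probability of the hypothesis, p(D \mid H) the likelihood of the data given the hypothesis, p(D) the probability of the data, and p(H \mid D) the posterior probability of the hypothesis after seeing the data.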

import numpy as np
import pandas as pd
import datetime as dt
from pylab import mpl, plt

plt.style.use('seaborn')
mpl.rcParams['font.family'] = 'serif'
np.random.seed(1000)
%matplotlib inline

x = np.linspace(0,10, 500)
y = 4 + 2*x + np.random.standard_normal( len(x) )*2
len(x)

As a benchmark, consider first an ordinary least-squares regression given the noisy data, using NumPy’s polyfit function

reg = np.polyfit(x,y,1) #linear regression
reg

Note that the highest-order monomial factor (in this case, the slope of the regression line) is at index level 0 and that the intercept is at index level 1.

plt.figure(figsize=(10,6))
plt.scatter(x,y,c=y, marker='v', cmap='coolwarm')
                # 4 + 2*x
plt.plot(x, reg[1] + reg[0]*x, lw=2.0)
plt.colorbar()
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sample data points and regression line')

#http://people.duke.edu/~ccc14/sta-663/PyMC3.html

#pip install --process-dependency-links git+https://github.com/pymc-devs/pymc3

import pymc3 as pm

Estimating the parameters of a linear regression model

%%time
with pm.Model() as model:
    #model
    alpha = pm.Normal('alpha', mu=0, sd=20)        #Defines the priors.
    beta = pm.Normal('beta', mu=0, sd=10)          #Defines the priors.
    sigma = pm.Uniform('sigma', lower=0, upper=10) #Defines the priors.# standard deviation 
    #x = np.linspace(0,10, 500)
    y_est = alpha + beta*x # Specifies the linear regression.
    #Defines the likelihood.  #y = 4 + 2*x + np.random.standard_normal( len(x) )*2
    #y - N(ax+b, sigma^2)
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed = y)
    
    #inference
    start = pm.find_MAP() # find starting value by optimization
    step = pm.NUTS()      # Hamiltonian MCMC with the No-U-Turn Sampler

    #trace = pm.sample(niter, step, start, random_seed=123, progressbar=True)
                                        #np.random.seed(1000)
    trace = pm.sample(100, step, start, progressbar=True) # better to set progressbar to True
    #trace = pm.sample(100, step, start=start, progressbar=False)
    ## draw 100 posterior samples using NUTS sampling

pm.summary(trace)  #Shows summary statistics from samplings.

trace[0]

The three estimates shown are rather close to the original values (4, 2, 2). However, the whole procedure yields more estimates. They are best illustrated with the help of a trace plot, as in Figure 13-16 — i.e., a plot showing the resulting posterior distribution for the different parameters as well as all single estimates per sample. The posterior distribution gives an intuitive sense about the uncertainty in the estimates:
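The plotting call itself is not included in the notes; with the PyMC3 3.x API used here, the trace plot can be produced as follows:

pm.traceplot(trace)  # posterior distributions and per-sample traces for alpha, beta, and sigma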

Taking only the alpha and beta values from the regression, one can draw all resulting regression lines, as shown in the following figure:

plt.figure(figsize=(10,6))
plt.scatter(x,y,c=y, marker='v',cmap='coolwarm')
plt.colorbar()
plt.xlabel('x')
plt.ylabel('y')
for i in range(len(trace)):
    plt.plot(x,trace['alpha'][i] + trace['beta'][i]*x)
