7.1 Basic Concepts of Statistics

7.1.1 Descriptive Statistics vs. Inferential Statistics

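• Descriptive statistics: describe (summarize) the important characteristics of large data sets.
• Inferential statistics: used to make forecasts, estimates, or judgments about a large data set on the basis of the statistical characteristics of a sample.

forecast: n. a prediction (of weather, financial results, etc.); an expectation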

7.1.2 Populations and Samples

• Population: the set of all possible members of a stated group.
• Sample: a subset of the population.
• Parameter: any descriptive measure of a population characteristic.
• Sample statistic: a quantity computed from or used to describe a sample.

subset: n. [math] a set contained within another set; a subgroup

compute: vt. to calculate; to estimate; to calculate by computer

measure: n. a measurement; a step taken; a degree; a size

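7.1.3 Measurement Scales

I. Nominal Scale

• contains the least information;
• classifies observations by characteristics without order.

least: adj. smallest in amount or degree

observation: n. the act of observing; monitoring; an observation report (here: a recorded data point)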

II. Ordinal Scale

• a higher level of measurement than the nominal scale;
• classifies observations by characteristics with order;
• can tell which observation ranks higher or lower;
• but can’t tell the size of the difference between any two observations.

III. Interval Scale

• provides relative ranking, like the ordinal scale;
• can tell how large the difference is (that is, differences between scale values are equal);
• doesn’t have a true zero (zero doesn’t mean the absence of the quantity), so ratios are not meaningful.

ranking: n. rank; position

IV. Ratio Scale

• the most refined level of measurement;
• provides ranking and equal differences between scale values;
• has a true zero as the origin.

origin: n. source; the zero point of a scale; descent; beginning

7.1.4 Presenting Data

• Holding period return formula

$$
\begin{array}{l}
R_t = \frac{P_t - P_{t-1} + D_t}{P_{t-1}} \\
\text{where:} \\
P_t = \text{price per share at the end of time period } t \\
P_{t-1} = \text{price per share at the end of time period } t-1 \text{, the time period immediately preceding time period } t \\
D_t = \text{cash distributions received during time period } t
\end{array}
$$

• A holding period return has two characteristics:
• First, it has an element of time attached to it.
• Second, rate of return has no currency unit attached to it.
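
As a quick numerical check of the holding period return formula, here is a minimal Python sketch. The function name and the sample prices are illustrative assumptions, not taken from the notes.

```python
def holding_period_return(p_begin, p_end, dist=0.0):
    """R_t = (P_t - P_{t-1} + D_t) / P_{t-1}."""
    return (p_end - p_begin + dist) / p_begin

# Hypothetical example: buy at 100, sell at 105, receive a 2.00 distribution.
hpr = holding_period_return(p_begin=100.0, p_end=105.0, dist=2.0)
print(f"{hpr:.2%}")
```

The result is a pure number for one specific period: it carries no currency unit, but it only has meaning together with the length of the holding period.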

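formula: n. a formula; a rule; a recipe; infant formula

holding period return (HPR): the return earned over the holding period; same as holding period yield (HPY)

distribution: n. distribution; allocation; supply

element: n. element; essential part; principle; component; natural environment

currency: n. money; a medium of exchange

unit: n. unit; module; device; military unit; component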

I. Frequency Distribution

• Frequency distribution: summarizes statistical data by assigning them to specified groups or intervals.

• Steps to construct a frequency distribution:

1. Step 1: Define the intervals (sort the data in ascending order, calculate the range of the data, decide on the number of intervals (k), and determine the interval width as range / k).
2. Step 2: Tally the observations (assign each observation to the interval that contains it).
3. Step 3: Count the observations (count the number of observations in each interval and construct a table of the intervals).
• As the above procedure makes clear, a frequency distribution groups data into a set of intervals.
• Each observation falls into only one interval (mutually exclusive), and the total number of intervals covers all the values represented in the data.
• In practice, we may want the intervals to begin and end with whole numbers for ease of interpretation.
• In practice, we also need to explain the choice of the number of intervals, k.
• A large number of empty intervals may indicate that we are trying to organize the data to present too much detail.
• We can consider increasingly larger intervals (smaller values of k) until we have a frequency distribution that effectively summarizes the distribution.
• Modal interval: for any frequency distribution, the interval with the greatest frequency.

• Absolute frequency (or simply the frequency): the actual number of observations in a given interval.

• Relative frequency: the absolute frequency of each interval divided by the total number of observations.
$$
\text{relative frequency} = \frac{\text{absolute frequency}}{\text{total number of observations}}
$$

• Cumulative frequency (cumulative absolute frequency, cumulative relative frequency): sum the absolute or relative frequencies starting at the lowest interval and progressing through the highest.

• Cumulative frequency distribution:

• plot either the cumulative absolute or cumulative relative frequency against the upper interval limit.
• X-axis: upper limits;
• Y-axis: cumulative (absolute/relative) frequency.
• allows us to see how many or what percent of the observations lie below a certain value.
• The change in the cumulative relative (or absolute) frequency as we move from one interval to the next is the next interval’s relative (or absolute) frequency.

• A steep slope in the cumulative frequency plot therefore indicates that the frequencies in those intervals are large.
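
The construction steps and the absolute, relative, and cumulative frequencies can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the equal-width intervals, the rule that the maximum value falls in the last interval, and the sample data are all hypothetical choices, and the data must not all be equal.

```python
def frequency_distribution(data, k):
    """Build a frequency table with k equal-width intervals."""
    xs = sorted(data)                      # sort ascending
    lo, hi = xs[0], xs[-1]
    width = (hi - lo) / k                  # interval width = range / k
    counts = [0] * k
    for x in xs:                           # tally each observation once
        idx = min(int((x - lo) / width), k - 1)  # max value goes in last interval
        counts[idx] += 1
    n = len(xs)
    rows, cum = [], 0
    for i, c in enumerate(counts):
        cum += c
        rows.append({
            "lower": lo + i * width,
            "upper": lo + (i + 1) * width,
            "abs": c,                      # absolute frequency
            "rel": c / n,                  # relative frequency
            "cum_abs": cum,                # cumulative absolute frequency
            "cum_rel": cum / n,            # cumulative relative frequency
        })
    return rows

# Hypothetical returns in percent.
returns = [-8, -3, 1, 2, 4, 5, 6, 9, 11, 12]
table = frequency_distribution(returns, k=4)
modal = max(table, key=lambda r: r["abs"])  # modal interval
for row in table:
    print(row)
```

Plotting `cum_rel` against `upper` gives the cumulative frequency distribution described above; `modal` is the interval with the greatest absolute frequency.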

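summarize: vt. to sum up; to outline

assigning: v. allocating or handing out (work); designating; delegating; fixing (a value, a time, etc.); attributing; transferring

tally: vt. to make agree; to count; to record

procedure: n. a process; formalities; a step

represented: v. stood for; expressed; depicted

ease: n. easiness; comfort; leisure

ease: v. to relieve or alleviate; to move slowly and carefully; to make easier; to relax; to depreciate; (of share prices, interest rates, etc.) to decline

interpretation: n. explanation; translation; performance

choice: n. selection; the right to choose; a select item

indicate: vt. to show; to point out; to foreshadow; to symbolize

plot: vt. to conspire; to draw a graph; to divide up; to mark on a chart

against: prep. in opposition to; to the disadvantage of; leaning on or touching; directed at; as protection from; (in sports) versus; with a background of; with reference to, compared with; (in betting) expecting the failure of

slope: n. an incline; a slant; the gradient of a line; the slope-arms rifle position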

II. Histogram vs. Frequency Polygon

• Histogram

• the bar chart of absolute frequency distribution;
• X-axis: intervals;
• Y-axis: absolute frequency of intervals.
• The modal interval (where the observations are most concentrated) can be read quickly from a histogram.

• Frequency Polygon

• X-axis: the midpoint of the interval;
• Y-axis: absolute frequency of intervals.

midpoint: n. the middle point; the exact center

7.2 Central Tendency

• Measures of location include not only measures of central tendency but other measures that illustrate the location or distribution of data.
• A measure of central tendency specifies where the data are centered.
• Common measures of central tendency include the arithmetic mean, the median, the mode, the weighted mean, and the geometric mean.

tendency: n. an inclination; a trend; a habit

illustrate: vt. to clarify, to explain with examples; to show with pictures

specifies: vt. designates; states in detail; lists; includes in a specification

7.2.1 Mean

I. Arithmetic Mean

• Arithmetic mean: the sum of the observation values divided by the number of observations (single-period).

• population mean:
$$
\mu = \frac{\sum_{i=1}^N X_i}{N}
$$
where $N$ is the number of observations in the population.

• sample mean:
$$
\bar X = \frac{\sum_{i=1}^n X_i}{n}
$$
where $n$ is the sample size.

• Features of arithmetic means:

• All interval and ratio data sets have an arithmetic mean.
• All data values (including outliers) are considered and included in the arithmetic mean computation.
• A data set has only one arithmetic mean.
• Tips:

• Because the arithmetic mean uses all observations, it can be distorted by outliers and may then be a poor representation of the data set.

• Because it uses all observations, the arithmetic mean is the best estimate of the true mean of the data set and of the value of the next observation.

• The arithmetic mean is the only measure of central tendency for which the sum of the deviations from the mean is zero. Mathematically, this property can be expressed as follows:
$$
\text{sum of mean deviations} = \sum_{i=1}^n (X_i - \bar X) = 0
$$
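
The zero-sum property of deviations and the sensitivity of the mean to outliers can be checked with a short Python sketch. The sample returns below are made-up numbers chosen so that the last value is an obvious outlier.

```python
import statistics

# Hypothetical returns; 0.40 is an outlier.
returns = [0.02, -0.01, 0.05, 0.03, 0.40]

mean = sum(returns) / len(returns)
deviations = [x - mean for x in returns]
total_deviation = sum(deviations)      # zero up to floating-point error

median = statistics.median(returns)    # not pulled up by the outlier

print(f"mean={mean:.3f}, median={median:.3f}, sum of deviations={total_deviation:.1e}")
```

The mean (0.098) sits well above the median (0.03) because the single outlier enters the sum with full weight.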

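feature: n. a characteristic; facial features; a special article or program

outliers: n. people or things set apart from the main body; (geology) outlying rock formations; (statistics) extreme values; outsiders

computation: n. estimation, calculation

representation: n. standing for something; depiction; notation; a statement

deviation: n. a difference from a reference value; an error; a departure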

II. Geometric Mean

• calculates investment returns over multiple periods (for the past or for the future);

• measures compound growth rates.
$$
\begin{array}{l}
G = \sqrt[n]{X_1 X_2 \cdots X_n} = (X_1 X_2 \cdots X_n)^{\frac{1}{n}} \\
R_G = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1 \\
R_n: \text{the rate of return in period } n
\end{array}
$$

• Tips:

1. The geometric mean is always less than or equal to the arithmetic mean.
2. The two means are equal only when all values are equal.
3. In general, the difference between the arithmetic and geometric means increases with the variability in the period-by-period observations.
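
A minimal Python sketch of the geometric mean return, compared with the arithmetic mean of the same returns. The function name and the three sample returns are illustrative assumptions; `math.prod` requires Python 3.8 or later.

```python
import math

def geometric_mean_return(returns):
    """R_G = [(1 + R_1)(1 + R_2)...(1 + R_n)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

# Hypothetical annual returns.
returns = [0.10, -0.05, 0.20]
geo = geometric_mean_return(returns)
arith = sum(returns) / len(returns)

print(f"geometric={geo:.4%}, arithmetic={arith:.4%}")
```

Compounding at `geo` for three periods reproduces the actual cumulative growth of 1.10 × 0.95 × 1.20, which is why it is read as the compound growth rate.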

variability: n. changeability; the degree to which values vary

III. Harmonic Mean

• Harmonic mean: used to compute the average cost of shares purchased over time (for example, when an equal amount of money is invested each period).

$$
\bar X_{Harmonic} = \frac{N}{\sum_{i=1}^N \frac{1}{X_i}}
$$

• Tips:
1. For values that are not all equal: harmonic mean < geometric mean < arithmetic mean.
2. The geometric mean of past annual returns is the appropriate measure of past performance and of future multi-year performance; it gives the average annual compound return.
3. The arithmetic mean of past annual returns is the statistically best estimator of next year’s (single-period) return.
4. In a forward-looking model, compounding at the arithmetic mean of the possible future returns is more appropriate than compounding at the geometric mean of those returns.
• Uncertainty in cash flows or returns causes the arithmetic mean to be larger than the geometric mean.
• The more uncertain the returns, the more divergence exists between the arithmetic and geometric means.
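
The link between the harmonic mean and the average cost per share can be illustrated with a small Python sketch. It assumes the same amount of money is invested at each purchase; the prices and the amount are hypothetical, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Hypothetical: invest 1000 at each of three purchase prices.
prices = [8.0, 9.0, 10.0]
amount = 1000.0

shares_bought = sum(amount / p for p in prices)
avg_cost = amount * len(prices) / shares_bought   # total spent / total shares

h = statistics.harmonic_mean(prices)
g = statistics.geometric_mean(prices)
a = statistics.mean(prices)

print(f"average cost={avg_cost:.4f}, harmonic={h:.4f}, geometric={g:.4f}, arithmetic={a:.4f}")
```

The average cost per share equals the harmonic mean of the prices, and the three means come out in the order harmonic < geometric < arithmetic.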

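uncertainty: n. hesitation; something that is uncertain; the degree of uncertainty

divergence: n. a difference; a disagreement; a spreading apart; (of air or ocean currents) a zone of separation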

IV. Weighted Mean

• Weighted mean: assigns different weights to different observations.
$$
\begin{array}{l}
\bar X_w = \sum_{i=1}^n w_i X_i = (w_1 X_1 + w_2 X_2 + \cdots + w_n X_n) \\
\text{where:} \\
X_1, X_2, \cdots, X_n: \text{observed values} \\
w_1, w_2, \cdots, w_n: \text{weights of the corresponding } X \text{ values } (\sum w_i = 1).
\end{array}
$$
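
A portfolio return is a typical weighted mean. The sketch below is a minimal Python illustration; the function name, the weights, and the asset returns are assumptions made for the example.

```python
import math

def weighted_mean(values, weights):
    """Sum of w_i * X_i, with the weights required to sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical portfolio: 50% stocks, 30% bonds, 20% cash-like asset.
weights = [0.5, 0.3, 0.2]
asset_returns = [0.10, 0.04, -0.02]
portfolio_return = weighted_mean(asset_returns, weights)

print(f"{portfolio_return:.2%}")
```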

7.2.2 Median

• Median: the midpoint of a data set when the data are arranged in ascending or descending order.

• The median is not affected by outliers, so when outliers are present it is a better measure of central tendency than the arithmetic mean.

• The median is less mathematically tractable than the mean.

• The median, however, does not use all the information about the size and magnitude of the observations.
• Calculating the median is also more complex.

magnitude: n. size; order of magnitude; (earthquake) magnitude; importance; brightness

7.2.3 Mode

• Mode: the value that occurs most frequently in a data set.
• there may be more than one mode or no mode.
• unimodal, bimodal, trimodal.
• When all the values in a data set are different, the distribution has no mode because no value occurs more frequently than any other value.
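
The median and the mode rules above (several modes possible, no mode when every value is distinct) can be sketched in Python. The helper `modes` and the sample data are hypothetical; the helper follows the convention in these notes that a data set with all-different values has no mode.

```python
import statistics
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] if every value occurs once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []                      # all values different: no mode
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical data with an outlier (50) and two modes (2 and 4).
data = [1, 2, 2, 3, 4, 4, 50]
median = statistics.median(data)
mean = statistics.mean(data)

print(f"median={median}, mean={mean:.2f}, modes={modes(data)}")
```

Here the data set is bimodal, and the outlier moves the mean far more than the median.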

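7.3 Dispersion

• Dispersion: the variability around the central tendency.

7.3.1 Absolute Dispersion

• Absolute dispersion is the amount of variability present without comparison to any reference point or benchmark.
• Measures of absolute dispersion include the range, mean absolute deviation, variance, and standard deviation.

reward: n. payment; repayment; a token of thanks

comparison: n. comparing; contrast; analogy; comparative relationship

reference: n. consultation; mention; a bibliography entry; a letter of recommendation; a certificate

benchmark: n. a baseline; a standard test

variance: n. variation; change; inconsistency; disagreement; (statistics) variance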

I. Quantiles

• Quartiles: the distribution is divided into quarters

• Quintiles: the distribution is divided into fifths

• Deciles: the distribution is divided into tenths

• Percentiles: the distribution is divided into hundredths

• Given a set of observations, the yth percentile is the value at or below which y percent of observations lie.

• To find quantiles:

1. Determine the position using the following formula:
$$
\begin{array}{l}
L_y = (n + 1)\frac{y}{100} \\
\text{where:} \\
y: \text{the given percentile} \\
n: \text{the number of observations, sorted in ascending order} \\
L_y: \text{the location (position) of the } y\text{th percentile}
\end{array}
$$

2. When $L_y$ is a whole number, read the value at that position directly; when $L_y$ is not a whole number, use linear interpolation between the two adjacent observations.

lie: vi. to recline; to tell a lie; to be located; to be displayed

formula: n. a formula; a rule; a recipe; infant formula

interpolation: n. estimation of a value between two known values
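
The percentile location formula, with linear interpolation when the location is not a whole number, can be sketched in Python as follows. The function name and the sample data are illustrative assumptions; the sketch only handles locations that fall between the first and last observation.

```python
def percentile(data, y):
    """Value at the y-th percentile using L_y = (n + 1) * y / 100."""
    xs = sorted(data)
    n = len(xs)
    loc = (n + 1) * y / 100
    if loc < 1 or loc > n:
        raise ValueError("location falls outside the data")
    lower = int(loc)                   # whole-number part of the location
    frac = loc - lower                 # fractional part
    if frac == 0:
        return xs[lower - 1]           # positions are 1-based
    # Linear interpolation between positions `lower` and `lower + 1`.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

# Hypothetical sample of 11 observations.
data = [8, 10, 12, 13, 15, 17, 17, 18, 19, 23, 24]
q3 = percentile(data, 75)      # L = 12 * 0.75 = 9, the 9th value
p20 = percentile(data, 20)     # L = 2.4, between the 2nd and 3rd values
print(q3, p20)
```

Note that statistical packages use several different percentile conventions, so library results may differ slightly from this (n + 1) rule.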

II. Range

• Range: the distance between the largest and the smallest value in a data set.
• Range = maximum value − minimum value
• One advantage of the range is ease of computation.
• A disadvantage is that the range uses only two pieces of information from the distribution. It cannot tell us how the data are distributed (that is, the shape of the distribution).

III. Mean Absolute Deviation (MAD)

• MAD: the mean of the absolute values of the deviations of individual observations from the arithmetic mean.
$$
MAD = \frac{\sum_{i=1}^n \mid X_i - \bar X \mid}{n}
$$

• The mean absolute deviation uses all of the observations in the sample and is thus superior to the range as a measure of dispersion.
• One technical drawback of MAD is that it is difficult to manipulate mathematically compared with the next measure we will introduce, variance.
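
A minimal Python sketch of the range and the mean absolute deviation, using made-up data and hypothetical function names:

```python
def data_range(xs):
    """Range = maximum value - minimum value."""
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    """MAD = sum(|X_i - mean|) / n."""
    x_bar = sum(xs) / len(xs)
    return sum(abs(x - x_bar) for x in xs) / len(xs)

data = [2, 4, 6, 8, 10]
rng = data_range(data)                 # uses only two observations
mad = mean_absolute_deviation(data)    # uses every observation
print(rng, mad)
```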

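absolute value: the magnitude of a number regardless of its sign

drawback: n. a shortcoming, a disadvantage; a tax refund

manipulate: vt. to control; to operate; to handle skillfully; to tamper with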

IV. Variance and Standard Deviation

• Variance is defined as the average of the squared deviations around the mean.

• Standard deviation is the positive square root of the variance.

• Population variance: the arithmetic average of the squared deviations from the mean.
$$
\sigma^2 = \frac{\sum_{i=1}^N (X_i - \mu)^2}{N}
$$

• Population standard deviation: the positive square root of the population variance.
$$
\sigma = \sqrt{\frac{\sum_{i=1}^N (X_i - \mu)^2}{N}}
$$

• Sample variance: the statistic that measures the dispersion of a sample of n observations drawn from a population.
$$
s^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n - 1}
$$
Using the entire number of sample observations, n, instead of n − 1 as the divisor would systematically underestimate the population parameter $\sigma^2$, particularly for small sample sizes.

• Sample standard deviation: the positive square root of the sample variance.
$$
s = \sqrt{\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n - 1}}
$$
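
The difference between the population (divide by N) and sample (divide by n − 1) formulas can be seen in a short Python sketch. The function names and data are illustrative assumptions.

```python
import math

def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 6, 8, 10]               # squared deviations sum to 40
pop_var = population_variance(data)   # 40 / 5
samp_var = sample_variance(data)      # 40 / 4
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)
print(pop_var, samp_var, pop_sd, samp_sd)
```

For the same data the sample variance is always the larger of the two, which offsets the tendency of the n-divisor version to underestimate $\sigma^2$.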

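square root: a number that, multiplied by itself, gives the original number; second root

underestimate: vt. to estimate too low; to think too little of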

7.3.2 Relative Dispersion

I. Coefficient of Variation (CV)

• Relative dispersion: the amount of variability in a distribution relative to a reference point or benchmark.

• Relative dispersion is commonly measured with the coefficient of variation (CV), which is the ratio of the standard deviation of a set of observations to their mean value.

$$
CV = \frac{S_X}{\bar X} = \frac{\text{standard deviation of X}}{\text{average value of X}}
$$

• CV measures the amount of dispersion relative to the distribution’s mean.

• CV measures the risk (variability) per unit of expected return (mean).

• The coefficient of variation permits direct comparisons of dispersion across different data sets.

• The coefficient of variation is a scale-free measure (that is, it has no units of measurement).

• The coefficient of variation measures risk per unit of return, whereas the Sharpe ratio measures excess return per unit of risk.
$$
\begin{array}{l}
\text{Sharpe ratio} = \frac{R_P - R_F}{\sigma_P} \\
R_P: \text{the return on asset } P \\
\sigma_P: \text{the standard deviation of the return on asset } P \\
R_F: \text{the risk-free rate of return} \\
R_P - R_F: \text{the excess return of asset } P \text{ over the risk-free rate}
\end{array}
$$

• Unlike the coefficient of variation, the Sharpe ratio is better when larger: a risk-averse investor prefers a higher Sharpe ratio.
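
A small Python sketch comparing two assets by coefficient of variation and by Sharpe ratio. The function names, the two assets, and the 3% risk-free rate are hypothetical numbers chosen for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = standard deviation / mean (risk per unit of return)."""
    return std_dev / mean

def sharpe_ratio(r_p, r_f, sigma_p):
    """Sharpe ratio = (R_P - R_F) / sigma_P (excess return per unit of risk)."""
    return (r_p - r_f) / sigma_p

risk_free = 0.03
# Hypothetical assets: (mean return, standard deviation of return).
asset_a = (0.08, 0.12)
asset_b = (0.05, 0.10)

cv_a = coefficient_of_variation(asset_a[1], asset_a[0])
cv_b = coefficient_of_variation(asset_b[1], asset_b[0])
sr_a = sharpe_ratio(asset_a[0], risk_free, asset_a[1])
sr_b = sharpe_ratio(asset_b[0], risk_free, asset_b[1])

print(f"A: CV={cv_a:.2f}, Sharpe={sr_a:.2f}")
print(f"B: CV={cv_b:.2f}, Sharpe={sr_b:.2f}")
```

With these inputs asset A has both the lower CV and the higher Sharpe ratio, so both measures rank it ahead of B.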

permit: vi. to allow; to make possible

II. Chebyshev’s Inequality

• Chebyshev’s inequality: According to Chebyshev’s inequality, for any distribution with finite variance, the proportion of the observations within k standard deviations of the arithmetic mean is at least $1 - \frac{1}{k^2}$ for all $k > 1$.
• The importance of Chebyshev’s inequality stems from its generality. The inequality holds for samples and populations and for discrete and continuous data regardless of the shape of the distribution.
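
Chebyshev’s bound can be checked empirically on any data set. The Python sketch below uses a deliberately skewed, made-up sample and the population-form standard deviation (`statistics.pstdev`); the function names are hypothetical.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations: 1 - 1/k^2, for k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def fraction_within(xs, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(abs(x - mu) <= k * sd for x in xs) / len(xs)

# Hypothetical, strongly right-skewed data.
data = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]
for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4), fraction_within(data, k))
```

The observed proportion is never below the bound, even though the data are far from normal; for a normal distribution the actual proportions are much higher than the bound.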

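proportion: n. a ratio, a share; a part; area; balance

stems: n. trunks; stalks; bows of ships; lineages (in the text, the verb "stems from" means "originates from")

generality: n. a general statement; universality; the majority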

7.4 Skewness and Kurtosis

7.4.1 Skewness

• Symmetrical distribution: shaped identically on both sides of its mean.

• One of the most important distributions is the normal distribution, which has the following characteristics:

• Its mean, median, and mode are equal.
• It is completely described by two parameters—its mean and variance.
• Roughly 68 percent of its observations lie between plus and minus one standard deviation from the mean; 95 percent lie between plus and minus two standard deviations; and 99 percent lie between plus and minus three standard deviations.
• Skewness: the extent to which a distribution is not symmetrical (typically the result of outliers in one tail).

• Positively skewed (right skewed): many outliers in the right tail; long upper (right) tail; frequent small losses and a few extreme gains.

• Negatively skewed (left skewed): many outliers in the left tail; long lower (left) tail; frequent small gains and a few extreme losses.

• For a skewed unimodal distribution, the median typically lies between the mode and the mean (positively skewed: mode < median < mean; negatively skewed: mean < median < mode).

• Investors should be attracted by a positive skew because the mean return falls above the median. Relative to the mean return, positive skew amounts to a limited, though frequent, downside compared with a somewhat unlimited, but less frequent, upside.

• Skewness (sometimes referred to as relative skewness) is computed as the average cubed deviation from the mean standardized by dividing by the standard deviation cubed to make the measure free of scale.
$$
\begin{array}{l}
\text{sample skewness} = \frac{1}{n} \frac{\sum_{i=1}^n (X_i - \bar X)^3}{s^3} \\
s: \text{sample standard deviation}
\end{array}
$$

• Symmetric: sample skewness = 0.

• Right skewed: sample skewness > 0; the average magnitude of positive deviations is larger than the average magnitude of negative deviations.

• Left skewed: sample skewness < 0.
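
The sign behavior of sample skewness can be illustrated with a minimal Python sketch that uses the 1/n form of the formula given in these notes (an approximation suited to large samples). The function name and the three small data sets are assumptions for the example.

```python
import math

def sample_skewness(xs):
    """(1/n) * sum((X_i - mean)^3) / s^3, with s the sample standard deviation."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return sum((x - x_bar) ** 3 for x in xs) / n / s ** 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]       # one extreme gain
left_skewed = [-20, -4, -3, -2, -1]   # one extreme loss

for name, xs in [("symmetric", symmetric),
                 ("right", right_skewed),
                 ("left", left_skewed)]:
    print(name, round(sample_skewness(xs), 4))
```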

extent: n. degree; scope; length

7.4.2 Kurtosis

• Kurtosis is a measure of the combined weight of the tails of a distribution relative to the rest of the distribution, that is, the proportion of the total probability that is in the tails.

• Leptokurtic: more peaked than a normal distribution, with fatter tails.
• Platykurtic: less peaked (flatter) than a normal distribution, with thinner tails.
• Mesokurtic: the same kurtosis as a normal distribution.

$$
\begin{array}{l}
\text{sample kurtosis} = \frac{1}{n} \frac{\sum_{i=1}^n (X_i - \bar X)^4}{s^4} \\
s: \text{sample standard deviation}
\end{array}
$$

• Leptokurtic distributions

• The leptokurtic distribution is more likely than the normal distribution to generate observations in the tail regions defined by the intersection of graphs near a standard deviation of about ±2.5.
• The leptokurtic distribution is also more likely to generate observations that are near the mean, defined here as the region ±1 standard deviation around the mean.
• In compensation, to have probabilities sum to 1, the leptokurtic distribution generates fewer observations in the regions between the central region and the two tail regions.
• Excess kurtosis

• Excess kurtosis: kurtosis measured relative to the normal distribution, whose kurtosis is 3.

$$
\text{excess kurtosis} = \text{sample kurtosis} - 3
$$

• For a sample of 100 or larger taken from a normal distribution, a sample excess kurtosis of 1.0 or larger would be considered unusually large.

• In general, greater positive kurtosis and more negative skew in returns distributions indicate increased risk.
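
Sample kurtosis and excess kurtosis can be sketched in Python using the 1/n form of the formula given in these notes. The function names and the two small data sets are illustrative assumptions; with so few observations the numbers only show the direction of the effect.

```python
import math

def sample_kurtosis(xs):
    """(1/n) * sum((X_i - mean)^4) / s^4, with s the sample standard deviation."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return sum((x - x_bar) ** 4 for x in xs) / n / s ** 4

def sample_excess_kurtosis(xs):
    """Kurtosis relative to the normal distribution's value of 3."""
    return sample_kurtosis(xs) - 3

flat = [1, 2, 3, 4, 5]                              # evenly spread, no tails
peaked = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]         # clustered center, extreme tails

print(round(sample_excess_kurtosis(flat), 3))
print(round(sample_excess_kurtosis(peaked), 3))
```

The evenly spread data give negative excess kurtosis (platykurtic), while the data with a tight center and two extreme observations give positive excess kurtosis (leptokurtic).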
