Part 1
Probability is often characterized as “a precise way
to deal with our ignorance or uncertainty”. Everyone
has an intuitive understanding of the question “what are the chances
of (something happening)?”. A stochastic process then
deals with probabilities over time (or over some independent
and indexed variable such as distance). There exist a number of
excellent or classic textbooks on probability and stochastic
processes. The subject is one of my favorite oral examination
questions, which I always tell students beforehand to prepare for,
and it is, in my opinion, among the most useful tools of an applied
mathematician and/or engineer.
Yet in my experience it is also one of the most confusing
subjects for many students to learn. Why?
In this series of blog articles (of which this is the
first) I shall try to explain the subject in my
own way, drawing on my experience in learning it. The main
purpose of these articles, I hope, is to make the subject matter
more approachable and less imposing. They are NOT meant to replace
the many excellent textbooks on the subject. I write not
in the rigorous style required for a scholastic textbook but more
in the spirit of a teacher engaged in a face-to-face session
with a student. It will be highly informal, but it
will make the big picture come across more easily. Hopefully, it will
even make it possible to read and gain insight into textbooks and
articles written in measure-theoretic language. My approach will be
strictly from a user's point of view, requiring nothing beyond
freshman calculus and the ability to visualize n-dimensional space as a
natural generalization of our familiar 3-D space. So here goes . . .
Let us start by making one simplifying assumption which, for
people interested in practical applications, is not at all important
or restrictive. This is the
Finiteness Assumption (FA) – We assume there is
no INFINITELY large number, i.e., no infinity, but there can be very
large numbers, e.g., 10^100 (a number estimated to be larger than the
total number of atoms in the universe). If one deals only with real
computation on digital computers, this assumption is automatically
satisfied. By making this assumption we assume away all the
measure-theoretic terminology that populates the theoretical
probability literature and confuses the uninitiated.
With the FA assumption we now define what a random
variable is.
Random Variable (r.v.) – a random variable is a
variable that may take on any one of a finite number of values when
sampled (i.e., looked at). We characterize a r.v. by specifying its
histogram. A histogram spells out what percentage of the time the
sampled values of the r.v. fall within each range of values.
Fig. 1 is a typical histogram. It is actually the histogram
of a random variable: the readership (or hits) of my blog
articles over the past four years.
Fig. 1 Histogram of readership of my blog articles (2009-2013):
x-axis is # of hits, y-axis is # of articles in this hit range
Note each bar of the histogram is expressed as a percentage so
that the bars add up to one or 100%, i.e., with
probability one (for sure) the r.v. takes on a value somewhere in
the total range. While the range of values this r.v. may take on is
finite by virtue of assumption FA, completely
specifying a r.v. can still take a great deal of data. (In fact, it
took me about 3 hours to collect the data and make this graph, which is
why I did not compile the data for all 5+ years of my blog life.)
This is inconvenient in computation. To simplify
the description (specification) we develop two common rough
characterizations.
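To make this concrete, here is a minimal Python sketch of turning raw counts into a normalized histogram; the hit counts and bin edges are hypothetical, not my actual blog data:

```python
# A minimal sketch: turning raw counts into a normalized histogram.
# The hit counts and bin edges are hypothetical, not real blog data.
hits = [520, 4300, 1250, 980, 15000, 3400, 610, 7800, 2300, 5100]
bins = [0, 1000, 2000, 5000, 10000, 20000]   # bin edges along the x-axis

counts = [0] * (len(bins) - 1)
for h in hits:
    for i in range(len(bins) - 1):
        if bins[i] <= h < bins[i + 1]:
            counts[i] += 1
            break

# Normalize so the bars sum to one (100%), as a histogram must.
total = sum(counts)
histogram = [c / total for c in counts]
print(histogram, sum(histogram))   # sums to 1 (up to floating-point rounding)
```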
The Mean of a r.v. – Intuitively, if you
imagine a cardboard cutout of the shape of the histogram, then the
value along the x-axis at which a knife edge placed perpendicular
to the x-axis will balance this cardboard shape is the mean of
this r.v.. Mathematically, it is simply the
average number of hits per article; Science Net in
fact computes this value for all bloggers and displays the top 100
bloggers. My own current average happens to be 4130 per article and
ranks 26th on the list.
Variance of a r.v. – This is a measure of the
spread of the histogram. A small variance roughly means the
histogram is mostly concentrated in a small range of numbers around its
mean, and vice versa for a large variance. It is a measure of the
variability of the values of the r.v.. In stock market terminology,
the β of a stock is a measure of its volatility, related to the
variance of the daily value of the stock. Mathematically, the variance
is called the second central moment of the
histogram.
Now we can develop further rough characterizations of the
histogram by defining what are called its higher central moments,
such as the skewness of the histogram, which is the
third central moment. But in practice such higher
moments are rarely needed, nor are data on these moments often
available.
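As a concrete sketch in Python, these moments are just weighted sums over the histogram; the bin values and percentages below are made up for illustration:

```python
# A minimal sketch: mean, variance, and skewness as weighted sums
# over a histogram. values[i] is a representative x-value for bin i;
# probs[i] is the bar height. All numbers are hypothetical.
values = [500, 1500, 3500, 7500, 15000]
probs  = [0.2, 0.3, 0.3, 0.15, 0.05]

mean = sum(v * p for v, p in zip(values, probs))                    # balance point
variance = sum((v - mean) ** 2 * p for v, p in zip(values, probs))  # 2nd central moment
skewness = sum((v - mean) ** 3 * p for v, p in zip(values, probs))  # 3rd central moment
print(mean, variance, skewness)
```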
So much for a single r.v.. But we often have to deal with more
than one random variable. Let us consider two r.v.s, x and y. Now
the histogram of the random variables x and y becomes a 3D
object. Graphically it looks like a multi-peak terrain map (think of
Guilin in the Guangxi province of south China or the skyscrapers of
Manhattan island in NY). But here a new concept intrudes. It is
called the “joint probability”, or the
“correlation/covariance (in the case of an approximate
specification)”, between the r.v.s x and y. It captures the
relationship, if any, between the r.v.s. We are all familiar with the
notion that smart parents tend to produce smart children. If we
represent the intelligence of the parents as r.v. x and that of the
child as r.v. y, then mathematically we say y is positively
correlated with x. If we look down on the 3D histogram of x and y,
we shall see the peaks scattered along a northeast-to-southwest
direction, as illustrated in Fig. 2.
Fig. 2 Bird’s-eye view of a 3D histogram with correlation
In other words, knowing the value of y will give a different
idea about the probable value of x. More generally, we say x and y
are NOT independent but
correlated. Mathematically we denote the joint
probability p(x,y) (i.e., the histogram) as a general 3D function.
We also define the conditional probability of x given
the value of y as
p(x/y) = p(x,y)/p(y) or p(y/x) = p(x,y)/p(x)
where p(y) and p(x), called the marginal
probability of y and x respectively, are simply the
resultant 2D histograms when we collapse the 3D histogram onto the y
or x axis respectively. Graphically, the conditional probability
p(x/y) is simply the 2D histogram one sees if we
take a cross-sectional view of the 3D histogram at that particular
value of y. Mathematically we need to divide p(x,y) by p(y) to
normalize the values so that p(x/y) will still have area equal to
one (100%), satisfying the definition of a histogram.
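These definitions can be sketched in a few lines of Python; the joint probabilities below are invented purely for illustration:

```python
# A minimal sketch: marginals and conditionals from a discrete joint
# histogram. joint[i][j] stands for a hypothetical p(x = i, y = j);
# all entries sum to one.
joint = [
    [0.30, 0.10],
    [0.10, 0.20],
    [0.05, 0.25],
]

# Marginals: collapse the joint histogram onto each axis.
p_x = [sum(row) for row in joint]                        # p(x)
p_y = [sum(row[j] for row in joint) for j in range(2)]   # p(y)

# Conditional p(x/y=0): take the y = 0 cross-section and renormalize
# by p(y=0) so the result again sums to one.
p_x_given_y0 = [row[0] / p_y[0] for row in joint]
print(p_x_given_y0, sum(p_x_given_y0))
```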
Now it is possible that the bird’s-eye view of the 3D histogram
is a rectangle (vs. the view of Fig. 2). In other words, p(x/y)=p(x)
no matter which value of y we choose. In this case, by the definition
of p(x/y), we have p(x,y)=p(y)p(x). We say the r.v.s x and y are
independent. Intuitively this satisfies the notion
that knowing y does not tell us anything new about the probable
values of x, and vice versa about y when knowing x.
Computationally, this simplifies a function of 2 variables into a
product of single-variable functions, a great computational
simplification when n random variables are involved.
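Continuing the sketch above, independence can be checked numerically by comparing the joint histogram against the product of its marginals; the numbers here are again invented, chosen so the joint factorizes exactly:

```python
# A minimal sketch: x and y are independent iff p(x,y) = p(x)p(y)
# for every pair of values. This hypothetical joint is the product of
# row weights (0.5, 0.3, 0.2) and column weights (0.6, 0.4).
joint = [
    [0.30, 0.20],
    [0.18, 0.12],
    [0.12, 0.08],
]
p_x = [sum(row) for row in joint]
p_y = [sum(row[j] for row in joint) for j in range(2)]

independent = all(
    abs(joint[i][j] - p_x[i] * p_y[j]) < 1e-12
    for i in range(3) for j in range(2)
)
print(independent)   # True for this joint
```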
To roughly characterize the two general r.v.s we have a mean
vector [x̄, ȳ] and a 2x2 covariance matrix, with the variances of
x and y as the diagonal elements and the symmetric covariance in the
off-diagonal positions:
[ σx²   σxy ]
[ σyx   σy² ]
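As a practical sketch, NumPy can estimate this mean vector and covariance matrix from paired samples; the data below are randomly generated, purely for illustration:

```python
# A minimal sketch: estimating the mean vector and 2x2 covariance
# matrix from paired samples of x and y. The data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100.0, 15.0, size=1000)            # hypothetical parent scores
y = 0.6 * x + rng.normal(40.0, 12.0, size=1000)   # positively correlated child scores

mean_vector = np.array([x.mean(), y.mean()])
cov_matrix = np.cov(x, y)   # [[var(x), cov(x,y)], [cov(y,x), var(y)]]
print(mean_vector)
print(cov_matrix)           # off-diagonal entries come out positive here
```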
To summarize, we have so far introduced the following concepts:
1. Random variable, characterized by a histogram
2. Rough characterization of a histogram by its mean and variance
3. Joint probability (3D histogram) of two r.v.s
4. Independence and conditional probability
5. Covariance matrix
Now suppose we have n r.v.s [x1, x2, . . . , xn] instead of two;
everything I said about the two r.v.s applies. We merely have to
change 2D and 3D to n and n+1 dimensions. The mean of n r.v.s becomes
an n-vector and the covariance matrix is an nxn matrix. In your
mind’s eye you can visualize everything in n dimensions the same way
as in Figs. 1 and 2. The joint probability (histogram)
p(x1, x2, . . . , xn) is an n-variable function. And if the n
variables are independent from each other, we write
p(x1, x2, . . . , xn) = p(x1)p(x2) . . . p(xn). No new
concepts are involved.
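A sketch of why this factorization matters: a full joint histogram over n variables with k values each needs k^n numbers, while n independent marginals need only n·k. The marginals below are hypothetical:

```python
# A minimal sketch: under independence, an n-variable joint probability
# is just a product of one-variable marginals. All marginals are
# hypothetical, chosen only for illustration.
marginals = [
    [0.2, 0.8],        # p(x1)
    [0.5, 0.5],        # p(x2)
    [0.1, 0.6, 0.3],   # p(x3)
]

def joint_prob(values):
    """p(x1, x2, ..., xn) = p(x1)p(x2)...p(xn) for independent r.v.s."""
    prob = 1.0
    for p, v in zip(marginals, values):
        prob *= p[v]
    return prob

print(joint_prob([1, 0, 2]))   # 0.8 * 0.5 * 0.3 = 0.12
```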
Concept-wise, believe it or not, these in my opinion
are all you need to know about probability and stochastic processes
to function in the engineering world, even if your interest is
academic and theoretical. In my 46 years of active
research and engineering consulting in stochastic control and
optimization, I never had to go beyond the knowledge described
above. The following articles will simply illustrate and explain
how to apply these ideas to more practical uses.
Computationally, because of exponential growth, dealing with an
arbitrary n-variable function is impossible
(see http://blog.sciencenet.cn/blog-1565-26889.html).
Data-wise, it also involves an astronomically large amount of data. To
simplify notation, at least theoretically, we make a continuous
approximation of these discrete data and introduce continuous
variables and functions. To emphasize: for our purpose, this is only
a convenient approximation and simplification. No new ideas are
involved. This will be the content of the next article. Beyond
introducing continuous variables, we also need to develop various
special cases of joint probability structures to simplify
description and calculations; subsequent articles will address these
issues. Once again, let me emphasize that from my viewpoint these
simplifications and special cases are needed for computational
feasibility and practicality. Nothing conceptually new is
involved.