PACSLZ-501001: Statistical Inference for Big Data

PACSLZ-501001: Statistical Inference for Big Data Exercise 1


The goal of this homework is to get started with Python for data processing. We will practice generating random data, representing functions, computing statistics and empirical distributions from data, and selecting subsets of data with conditions. Mathematically, the most important part is working with conditional distributions. We will also practice making simple plots.

  1. Generate Random Variables and Compute Empirical Distributions

We will start with generating some data samples with a given distribution. For the following two PMFs,

Task 1: Generate 1000 samples from PMF1, and 500 samples from PMF2. You can either write your own random sample generator by using the inverse CDF approach we talked about in class, or use the np.random.choice() function.
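A minimal sketch of the two sampling approaches mentioned in Task 1. The PMF values below are placeholders, since the assignment's PMF1 and PMF2 are not reproduced here; the support {0, ..., 7} is taken from Task 6, and the helper name `sample_inverse_cdf` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Placeholder PMFs on {0, ..., 7}; substitute the PMF1 and PMF2 from the assignment.
PMF1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
PMF2 = np.full(8, 1 / 8)


def sample_inverse_cdf(pmf, n, rng):
    """Draw n samples from pmf using the inverse-CDF method."""
    cdf = np.cumsum(pmf)
    cdf[-1] = 1.0  # guard against floating-point round-off
    u = rng.random(n)  # uniform draws in [0, 1)
    # For each u, return the smallest x with u < cdf[x].
    return np.searchsorted(cdf, u, side="right")


# Approach 1: hand-written inverse-CDF sampler.
D1 = sample_inverse_cdf(PMF1, 1000, rng)

# Approach 2: let np.random.choice draw directly from the PMF.
D2 = np.random.choice(len(PMF2), size=500, p=PMF2)
```

Both approaches return integer arrays with values in 0, ..., 7, so either can feed the later tasks.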

Task 2: Write a function compareHIST(D, p), where D is a 1-D array of data samples, and p is a valid PMF. Compute and plot the empirical distribution of D using matplotlib.pyplot.hist(), and plot p against it in the same plot for comparison.
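One way compareHIST() might look is sketched below. It assumes the samples are integers in 0, ..., len(p) - 1; returning the empirical PMF is an addition for convenience rather than something the task asks for, and the example PMF is a placeholder.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line to show plots on screen
import matplotlib.pyplot as plt


def compareHIST(D, p):
    """Plot the empirical distribution of D against the PMF p; return the empirical PMF."""
    p = np.asarray(p, dtype=float)
    values = np.arange(len(p))
    # One unit-width bin centred on each integer value 0, ..., len(p) - 1,
    # so density=True yields relative frequencies.
    edges = np.arange(len(p) + 1) - 0.5
    fig, ax = plt.subplots()
    emp, _, _ = ax.hist(D, bins=edges, density=True, alpha=0.5, label="empirical")
    ax.plot(values, p, "ro", label="PMF")
    ax.set_xlabel("x")
    ax.set_ylabel("probability")
    ax.legend()
    return emp


# Usage with a placeholder PMF (not the one from the assignment).
rng = np.random.default_rng(0)
p_example = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
D_example = rng.choice(len(p_example), size=1000, p=p_example)
emp_example = compareHIST(D_example, p_example)
```

Centring the bins on the integers keeps each bar aligned with the PMF point it is compared against.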

Task 3: Mix the two datasets generated in Task 1 into an array of 1500 samples. Compute the ensemble distribution of this mixture from PMF1 and PMF2, and compare that with the empirical distribution of the mixed dataset, by using the compareHIST() function in Task 2.
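The ensemble distribution of the mixture is the average of the two PMFs weighted by sample counts, here 1000/1500 and 500/1500. A short sketch, again with placeholder PMFs standing in for the assignment's PMF1 and PMF2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder PMFs on {0, ..., 7}; substitute the ones from the assignment.
PMF1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
PMF2 = np.full(8, 1 / 8)

n1, n2 = 1000, 500
D1 = rng.choice(8, size=n1, p=PMF1)
D2 = rng.choice(8, size=n2, p=PMF2)

# Mixed dataset of 1500 samples.
mixed = np.concatenate([D1, D2])

# Ensemble distribution: weight each PMF by its share of the samples.
w1 = n1 / (n1 + n2)
mix_pmf = w1 * PMF1 + (1 - w1) * PMF2

# Empirical distribution of the mixed data, for a numerical comparison.
mix_emp = np.bincount(mixed, minlength=8) / mixed.size
```

Passing `mixed` and `mix_pmf` to compareHIST() would give the visual version of the same comparison.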

  2. Data Manipulation and Conditional Distributions

Use the following code to generate labels for the two datasets generated from the previous tasks, stack with the data samples, and mix into a dataset of length 1500.

Dataset = Dataset[:, np.random.permutation(1500)]
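Only the final shuffling line of that code appears above. The sketch below shows what the label-generation and stacking steps might look like; it is a guess, not the assignment's code. The names `D1`, `D2`, `Labels1`, and `Labels2` are hypothetical, the label values 1 and 2 are taken from Task 5, and the 2 x 1500 layout is inferred from the column indexing in the shuffling line.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the Task 1 samples (uniform here, purely for illustration).
D1 = rng.integers(0, 8, size=1000)
D2 = rng.integers(0, 8, size=500)

# One label per sample: 1 for samples from PMF1, 2 for samples from PMF2.
Labels1 = np.ones(1000, dtype=int)
Labels2 = 2 * np.ones(500, dtype=int)

# Row 0 holds the samples, row 1 the labels.
Dataset = np.vstack([
    np.concatenate([D1, D2]),
    np.concatenate([Labels1, Labels2]),
])

# Shuffle the columns so each sample stays paired with its label.
Dataset = Dataset[:, np.random.permutation(1500)]

Data, Labels = Dataset[0], Dataset[1]
```

Permuting columns rather than each row separately is what keeps every sample attached to its own label.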

Task 4: Compute the empirical distribution of the mixture dataset and the correct ensemble distribution, then use compareHIST() to verify that they agree.

Task 5: Select the sub-sequences with Labels == 1 and Labels == 2. Again, verify each sub-sequence against its distribution using compareHIST().

Task 6: For each value x ∈ {0, 1, ..., 7}, compute the conditional probability $P_{\text{label}|X}(1 \mid x)$ in two ways: using Bayes' rule, and using the empirical distributions (by selecting the right sub-sequences). Compare your results in a single plot.
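A sketch of both computations, assuming placeholder PMFs and a prior of 1000/1500 for label 1 (from the dataset sizes in Task 1). Bayes' rule gives P(label = 1 | x) = P(1) PMF1(x) / (P(1) PMF1(x) + P(2) PMF2(x)); the empirical version is the fraction of samples equal to x that carry label 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder PMFs on {0, ..., 7}; substitute the ones from the assignment.
PMF1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
PMF2 = np.full(8, 1 / 8)

n1, n2 = 1000, 500
Data = np.concatenate([rng.choice(8, size=n1, p=PMF1),
                       rng.choice(8, size=n2, p=PMF2)])
Labels = np.concatenate([np.ones(n1, dtype=int), 2 * np.ones(n2, dtype=int)])

# Bayes' rule: posterior = prior * likelihood / marginal.
prior1 = n1 / (n1 + n2)
joint1 = prior1 * PMF1
marginal = joint1 + (1 - prior1) * PMF2
posterior_bayes = joint1 / marginal

# Empirical estimate: among samples equal to x, the fraction with label 1.
count_x = np.bincount(Data, minlength=8)
count_x_label1 = np.bincount(Data[Labels == 1], minlength=8)
posterior_emp = np.divide(count_x_label1, count_x,
                          out=np.full(8, np.nan), where=count_x > 0)
```

Values of x that never occur in the data are left as NaN in `posterior_emp`, since the empirical conditional probability is undefined there. Plotting both arrays against x on the same axes gives the single-plot comparison the task asks for.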

  3. Convergence of Empirical Average

Task 7: Randomly select a function f by choosing f(x) for x ∈ {0, ..., 7}, with whatever distribution you like.

For the number of samples n varying over the set steps = [10, 20, 40, 80, 160, 320, 630, 1280, 2560, 5120, 10240], generate n random samples x_1, ..., x_n from a given PMF (use PMF1 from the previous tasks). Compute the empirical average

$$\frac{1}{n} \sum_{i=1}^{n} f(x_i)$$

Make a plot of the empirical average as the number of samples increases, and observe the empirical average approaching the ensemble average.

Repeat this experiment 20 times, and observe that in all cases the empirical average converges to the same limit.
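The experiment can be sketched as follows. The PMF is a placeholder for PMF1, and drawing f(x) uniformly from [-1, 1] is just one arbitrary choice of distribution; the `steps` list is copied from the task as given.

```python
import numpy as np

rng = np.random.default_rng(3)

# Placeholder for PMF1 on {0, ..., 7}.
PMF1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])

# Random function f, stored as a table of its 8 values (arbitrary choice of distribution).
f = rng.uniform(-1, 1, size=8)

steps = [10, 20, 40, 80, 160, 320, 630, 1280, 2560, 5120, 10240]

# Ensemble average E[f(X)] under PMF1.
ensemble_avg = PMF1 @ f

# runs[r, j] is the empirical average in repetition r with steps[j] samples.
runs = np.empty((20, len(steps)))
for r in range(20):
    for j, n in enumerate(steps):
        x = rng.choice(8, size=n, p=PMF1)
        runs[r, j] = f[x].mean()  # f[x] evaluates f at every sample
```

Plotting each row of `runs` against `steps` on a logarithmic x-axis, with a horizontal line at `ensemble_avg`, shows the 20 curves closing in on the same limit.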

Task 8: Only if you know this from somewhere else, give a theoretical range of the empirical averages computed from Task 7. Plot them in the same picture. You should see something like the following.
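The assignment does not say which theoretical range it has in mind. As one possible choice, the central limit theorem suggests a band of roughly two standard deviations, E[f(X)] ± 2σ/√n, where σ is the standard deviation of f(X). The sketch below computes that band under this assumption, with a placeholder PMF and an arbitrary fixed f.

```python
import numpy as np

# Placeholder for PMF1 on {0, ..., 7} and an arbitrary fixed table of f values.
PMF1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
f = np.array([0.3, -0.8, 0.5, 0.9, -0.2, 0.1, -0.6, 0.7])

steps = np.array([10, 20, 40, 80, 160, 320, 630, 1280, 2560, 5120, 10240])

# Mean and standard deviation of f(X) under PMF1.
mu = PMF1 @ f
sigma = np.sqrt(PMF1 @ (f - mu) ** 2)

# Assumed CLT-based band: about 95% of empirical averages should fall inside it.
half_width = 2 * sigma / np.sqrt(steps)
lower = mu - half_width
upper = mu + half_width
```

Drawing `lower` and `upper` against `steps` on the Task 7 plot shows the band shrinking like 1/√n around the ensemble average.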
