PACSLZ-501001: Statistical Inference for Big Data Exercise 1
The goal of this homework is to get started with Python for data processing. We will practice generating random data, representing functions, computing statistics and empirical distributions from data, and selecting subsets of data with conditions. Mathematically, the most important part is working with conditional distributions. We will also practice making simple plots.
- Generate Random Variables and Compute Empirical Distributions
Task 1: Generate 1000 samples from PMF1, and 500 samples from PMF2. You can either write your own random sample generator by using the inverse CDF approach we talked about in class, or use the np.random.choice() function.
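The inverse CDF approach mentioned in Task 1 can be sketched as follows. This is a minimal illustration, not a reference solution: the arrays pmf1 and pmf2 are hypothetical placeholders over {0, ..., 7} (substitute the PMF1 and PMF2 given for the assignment), and the helper name sample_inverse_cdf is made up for this example.

```python
import numpy as np

def sample_inverse_cdf(pmf, size, rng):
    """Draw `size` samples from a PMF over {0, ..., len(pmf)-1} via the inverse CDF."""
    cdf = np.cumsum(pmf)
    cdf[-1] = 1.0  # guard against floating-point round-off in the last entry
    u = rng.random(size)  # uniform samples on [0, 1)
    # For each u, return the smallest index k with cdf[k] > u
    return np.searchsorted(cdf, u, side="right")

rng = np.random.default_rng(0)

# Hypothetical PMFs over {0, ..., 7}; NOT the PMF1/PMF2 of the assignment.
pmf1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
pmf2 = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])

data1 = sample_inverse_cdf(pmf1, 1000, rng)        # hand-written generator
data2 = rng.choice(len(pmf2), size=500, p=pmf2)    # library alternative
```

The Generator method rng.choice plays the same role here as np.random.choice(); either should be acceptable for the task.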
Task 2: Write a function compareHIST(D, p), where D is a 1-D array of data samples, and p is a valid PMF. Compute and plot the empirical distribution of D using matplotlib.pyplot.hist(), and plot p against it in the same plot for comparison.
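One possible shape for compareHIST() is sketched below, assuming the samples are integers in {0, ..., len(p)-1}. The bin placement, the plot styling, and the returned array are choices made for this example rather than requirements of the task. The demo PMF p_demo is hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop for interactive use
import matplotlib.pyplot as plt

def compareHIST(D, p):
    """Plot the empirical distribution of D against the PMF p; return the empirical PMF."""
    D = np.asarray(D)
    p = np.asarray(p, dtype=float)
    k = len(p)
    # Bin edges centred on the integers 0..k-1, so each bin holds exactly one value
    edges = np.arange(k + 1) - 0.5
    fig, ax = plt.subplots()
    # density=True normalises the counts; with unit-width bins the bar heights
    # are the relative frequencies
    heights, _, _ = ax.hist(D, bins=edges, density=True, alpha=0.5, label="empirical")
    ax.plot(np.arange(k), p, "ro-", label="PMF")
    ax.set_xlabel("x")
    ax.set_ylabel("probability")
    ax.legend()
    return heights

rng = np.random.default_rng(1)
p_demo = np.full(8, 1 / 8)  # hypothetical uniform PMF, for the demo only
emp = compareHIST(rng.choice(8, size=1000, p=p_demo), p_demo)
```

Call plt.show() (with an interactive backend) or plt.savefig() afterwards to view the figure.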
Task 3: Mix the two datasets generated in Task 1 into an array of 1500 samples. Compute the ensemble distribution of this mixture from PMF1 and PMF2, and compare that with the empirical distribution of the mixed dataset, by using the compareHIST() function in Task 2.
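As a rough sketch of the computation in Task 3: when 1000 samples come from PMF1 and 500 from PMF2, the ensemble distribution of the mixture is the weighted sum of the two PMFs, with weights 1000/1500 and 500/1500. The PMFs below are hypothetical placeholders, not the ones from the assignment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical PMFs over {0, ..., 7}; NOT the PMF1/PMF2 of the assignment.
pmf1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
pmf2 = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])

n1, n2 = 1000, 500
data1 = rng.choice(8, size=n1, p=pmf1)
data2 = rng.choice(8, size=n2, p=pmf2)
mixed = np.concatenate([data1, data2])  # 1500 samples in total

# Ensemble distribution: weight each PMF by its share of the samples
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
pmf_mix = w1 * pmf1 + w2 * pmf2

# Empirical distribution of the mixed data, for a numerical comparison
empirical = np.bincount(mixed, minlength=8) / mixed.size
```

Passing mixed and pmf_mix to compareHIST() would give the visual version of the same comparison.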
- Data Manipulation and Conditional Distributions
Dataset = Dataset[:, np.random.permutation(1500)]
Task 4: Compute the empirical distribution of the mixture dataset and the correct ensemble distribution, then use compareHIST() to verify that they agree.
Task 5: Select the sub-sequences with Labels == 1 and Labels == 2. Again verify the result using compareHIST().
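Selecting sub-sequences by label can be done with boolean masks. The sketch below assumes a layout that the handout does not spell out: Dataset is a 2 x 1500 array whose first row holds the samples and whose second row holds the labels (1 or 2). The PMFs are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical PMFs over {0, ..., 7}; NOT the PMF1/PMF2 of the assignment.
pmf1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
pmf2 = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])

values = np.concatenate([rng.choice(8, size=1000, p=pmf1),
                         rng.choice(8, size=500, p=pmf2)])
labels = np.concatenate([np.full(1000, 1), np.full(500, 2)])

# Assumed layout: row 0 = samples, row 1 = labels
Dataset = np.vstack([values, labels])
# Shuffle the columns; each sample stays paired with its label
Dataset = Dataset[:, rng.permutation(1500)]

X, Labels = Dataset[0], Dataset[1]
sub1 = X[Labels == 1]  # samples that were drawn from the first PMF
sub2 = X[Labels == 2]  # samples that were drawn from the second PMF

emp1 = np.bincount(sub1, minlength=8) / sub1.size
emp2 = np.bincount(sub2, minlength=8) / sub2.size
```

Under this layout, emp1 and emp2 should be close to the first and second PMF respectively, which is what compareHIST(sub1, pmf1) and compareHIST(sub2, pmf2) would show.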
Task 6: For each value $x \in \{0, 1, \ldots, 7\}$, compute the conditional probability $P_{\mathrm{label}|X}(1 \mid x)$ in two ways: using Bayes' rule, and using the empirical distributions (by selecting the right sub-sequences). Compare your results in a single plot.
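The two computations in Task 6 might look like the sketch below. It assumes the priors are the mixing proportions 1000/1500 and 500/1500, and it uses hypothetical PMFs and variable names (X for the samples, Labels for the labels).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical PMFs over {0, ..., 7}; NOT the PMF1/PMF2 of the assignment.
pmf1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
pmf2 = np.array([0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])

n1, n2 = 1000, 500
X = np.concatenate([rng.choice(8, size=n1, p=pmf1),
                    rng.choice(8, size=n2, p=pmf2)])
Labels = np.concatenate([np.full(n1, 1), np.full(n2, 2)])

# Bayes' rule: P(label=1 | x) = P(1) P(x | 1) / (P(1) P(x | 1) + P(2) P(x | 2))
prior1, prior2 = n1 / (n1 + n2), n2 / (n1 + n2)
post_bayes = prior1 * pmf1 / (prior1 * pmf1 + prior2 * pmf2)

# Empirical estimate: among samples with X == x, the fraction with label 1
count_x = np.bincount(X, minlength=8)
count_x_and_1 = np.bincount(X[Labels == 1], minlength=8)
post_emp = count_x_and_1 / np.maximum(count_x, 1)  # avoid 0/0 for unseen values
```

Plotting post_bayes and post_emp against x on the same axes gives the comparison the task asks for.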
- Convergence of Empirical Average
Task 7: Randomly select a function f by choosing $f(x)$ for $x \in \{0, \ldots, 7\}$, with whatever distribution you like.
For the number of samples n varying in the set steps = [10, 20, 40, 80, 160, 320, 630, 1280, 2560, 5120, 10240], generate n random samples $x_1, \ldots, x_n$ from a given PMF (use PMF1 from the previous tasks). Compute the empirical average
$\frac{1}{n} \sum_{i=1}^{n} f(x_i)$
Make a plot of the empirical average as the number of samples increases, and observe that the empirical average approaches the ensemble average.
Repeat this experiment 20 times, and observe that in all cases the empirical average converges to the same limit.
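The experiment in Task 7 can be sketched as below. The PMF is a hypothetical placeholder for PMF1, and drawing f(x) uniformly from [-1, 1] is just one arbitrary choice of distribution.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical PMF over {0, ..., 7}; NOT the PMF1 of the assignment.
pmf = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])

# Random function: f[x] is the value f(x) for x = 0, ..., 7
f = rng.uniform(-1.0, 1.0, size=8)

steps = [10, 20, 40, 80, 160, 320, 630, 1280, 2560, 5120, 10240]
ensemble_avg = float(np.dot(pmf, f))  # sum over x of p(x) f(x)

n_runs = 20
emp_avg = np.empty((n_runs, len(steps)))
for r in range(n_runs):
    for j, n in enumerate(steps):
        x = rng.choice(8, size=n, p=pmf)  # n fresh samples from the PMF
        emp_avg[r, j] = f[x].mean()       # (1/n) * sum of f(x_i)
```

Each row of emp_avg is one repetition. Plotting the rows against steps on a logarithmic x-axis (for instance with matplotlib.pyplot.semilogx), together with a horizontal line at ensemble_avg, should show all 20 curves closing in on the same value.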
Task 8: Only if you know this from somewhere else, give a theoretical range for the empirical averages computed in Task 7. Plot the range in the same picture. You should see something like the following.
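One candidate for the theoretical range in Task 8, offered here as an assumption rather than as the intended answer, is the band suggested by the central limit theorem: the ensemble average plus or minus two standard deviations of the empirical average, i.e. $\mu \pm 2\sigma/\sqrt{n}$ with $\mu = \sum_x p(x) f(x)$ and $\sigma^2 = \sum_x p(x) f(x)^2 - \mu^2$. The PMF and f below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical PMF over {0, ..., 7}; NOT the PMF1 of the assignment.
pmf = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05])
f = rng.uniform(-1.0, 1.0, size=8)  # arbitrary random function values f(0..7)
steps = np.array([10, 20, 40, 80, 160, 320, 630, 1280, 2560, 5120, 10240])

mu = np.dot(pmf, f)                         # ensemble average of f(X)
sigma = np.sqrt(np.dot(pmf, f**2) - mu**2)  # standard deviation of f(X)

# Assumed band: two standard deviations of the empirical average at each n
half_width = 2.0 * sigma / np.sqrt(steps)
lower, upper = mu - half_width, mu + half_width

# 20 repetitions of the experiment, one row per repetition
n_runs = 20
emp_avg = np.array([[f[rng.choice(8, size=n, p=pmf)].mean() for n in steps]
                    for _ in range(n_runs)])

# Fraction of empirical averages that fall inside the band
coverage = ((emp_avg >= lower) & (emp_avg <= upper)).mean()
```

With a two-sigma band, most of the empirical averages should fall between lower and upper, and the band narrows as n grows.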