sampling brief —— python data science cookbook

  • Simple random sampling

Typically, in scenarios where it is very expensive to access the whole dataset, sampling can be used to extract a portion of the dataset for analysis. Sampling is also useful in exploratory data analysis (EDA). A sample should be a good representative of the underlying dataset: it should have approximately the same characteristics as the data it was drawn from. For example, the sample mean should be as close as possible to the mean of the original data. There are several sampling techniques; we will cover one of them here. In simple random sampling, every tuple has an equal chance of being selected.
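The claim above can be checked empirically: the mean of a simple random sample should sit close to the population mean. A minimal sketch, using synthetic (hypothetical) normally distributed data rather than any dataset from the recipe:

```python
# Sketch: a simple random sample's mean approximates the population mean.
# The population here is synthetic, drawn from a normal distribution
# purely for illustration.
import numpy as np

rng = np.random.default_rng(42)           # seeded for reproducibility
population = rng.normal(loc=5.0, scale=2.0, size=100_000)

# Draw 1,000 records without replacement
sample = rng.choice(population, size=1_000, replace=False)

print("population mean:", round(population.mean(), 3))
print("sample mean    :", round(sample.mean(), 3))
```

With a sample of 1,000 from 100,000, the two means typically differ by only a few hundredths.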

For our example, we want to sample ten records randomly from the Iris dataset.

In step 2, we do a random selection using the choice function from numpy.random. The choice function randomly picks n integers, where n is the size of the sample, dictated by no_records in our case.

numpy.random.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Parameters:

a : 1-D array-like or int

If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a was np.arange(n)

size : int or tuple of ints, optional

Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

replace : boolean, optional

Whether the sample is with or without replacement

p : 1-D array-like, optional

The probabilities associated with each entry in a. If not given the sample assumes a uniform distribution over all entries in a.

Returns:

samples : 1-D ndarray, shape (size,)

The generated random samples

Raises:

ValueError

If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

For more on numpy.random.choice, refer to: https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.random.choice.html
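The p parameter described above lets us draw a weighted (non-uniform) sample. A small sketch, with illustrative weights of my own choosing:

```python
# Sketch: weighted sampling with the p parameter of numpy.random.choice.
# The items and weights are illustrative, not from the recipe.
import numpy as np

np.random.seed(0)                          # reproducible draws
items = np.array(['a', 'b', 'c'])
weights = [0.7, 0.2, 0.1]                  # probabilities; must sum to 1

draws = np.random.choice(items, size=10_000, replace=True, p=weights)

# Observed frequencies should sit close to the requested weights
for item in items:
    print(item, round((draws == item).mean(), 2))
```

Over 10,000 draws, the observed frequency of each item converges to its weight.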

Interpreting the replace parameter

It specifies whether we sample with or without replacement. Sampling without replacement removes each sampled item from the pool, so it cannot be a candidate for future draws. Sampling with replacement does the opposite: every element has an equal chance of being sampled again even if it has already been drawn.
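The contrast can be sketched directly. Note that with replace=False, asking for more items than the pool contains raises a ValueError, as the docstring above lists:

```python
# Sketch: replace=True allows duplicates; replace=False does not,
# and oversampling without replacement raises ValueError.
import numpy as np

np.random.seed(1)
pool = np.arange(5)

with_repl = np.random.choice(pool, size=10, replace=True)     # duplicates allowed
print("with replacement   :", with_repl)

without_repl = np.random.choice(pool, size=5, replace=False)  # a permutation of pool
print("without replacement:", without_repl)

try:
    np.random.choice(pool, size=10, replace=False)
except ValueError as err:
    print("oversampling without replacement fails:", err)
```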

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from sklearn.datasets import load_iris
import numpy as np

# 1. Load the Iris dataset
data = load_iris()
x = data['data']

# Let's demonstrate how sampling is performed:
# 2. Randomly sample 10 records from the loaded dataset
no_records = 10
print("data set dimensions :")
print(x.shape)
print("original data set :")
print(x)
# Pick no_records row indices at random (with replacement, the default)
x_sample_indx = np.random.choice(x.shape[0], no_records)
print("sampling data indices :")
print(x_sample_indx)
print("show sampling data : ")
print(x[x_sample_indx, :])

Output:

data set dimensions :
(150, 4)
original data set :
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 ..., 
 [ 6.5  3.   5.2  2. ]
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]
sampling data indices :
[  5 125  30  62  87  45  53  83  61  98]
show sampling data : 
[[ 5.4  3.9  1.7  0.4]
 [ 7.2  3.2  6.   1.8]
 [ 4.8  3.1  1.6  0.2]
 ..., 
 [ 6.   2.7  5.1  1.6]
 [ 5.9  3.   4.2  1.5]
 [ 5.1  2.5  3.   1.1]]

  • Stratified sampling

If the underlying dataset consists of different groups, a simple random sample may fail to capture enough records from each group to represent the data. For example, in a two-class classification problem, 10% of the data may belong to the positive class and 90% to the negative class. This kind of problem is called the class imbalance problem in machine learning. When we sample such imbalanced datasets, the sample should reflect the preceding percentages. This kind of sampling is called stratified sampling. We will look further into stratified sampling in future chapters on machine learning.

For more about stratified sampling, refer to: https://en.wikipedia.org/wiki/Stratified_sampling
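A minimal sketch of the idea on the Iris labels: draw the same fraction from each class so the sample preserves the class proportions. The 20% fraction is an arbitrary choice for illustration:

```python
# Sketch of stratified sampling: sample a fixed fraction from each
# class separately, so class proportions are preserved in the sample.
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
y = data['target']                      # 3 classes, 50 records each
rng = np.random.default_rng(0)
fraction = 0.2                          # illustrative sampling fraction

sample_idx = []
for cls in np.unique(y):
    cls_idx = np.where(y == cls)[0]     # indices belonging to this class
    n = int(len(cls_idx) * fraction)    # per-class sample size
    sample_idx.extend(rng.choice(cls_idx, size=n, replace=False))

sample_idx = np.array(sample_idx)
print("sample size:", len(sample_idx))
print("class counts in sample:", np.bincount(y[sample_idx]))
```

Each of the three Iris classes contributes exactly 10 records, so the sample keeps the original 1:1:1 class ratio.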

  • Progressive sampling (iteratively growing the sample from small to large)

How do we determine the correct sample size for a given problem? We have discussed several sampling techniques, but none of them tells us what sample size to use, and there is no simple answer. One approach is progressive sampling: select a sample size, draw the samples using any of the sampling techniques, apply the desired operation on the data, and record the results. Then increase the sample size and repeat the steps. This iterative process is called progressive sampling.
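The loop above can be sketched as follows, using the mean of one Iris feature as the recorded result. The size schedule and stopping tolerance are illustrative choices, not part of the recipe:

```python
# Sketch of progressive sampling: grow the sample until the statistic
# of interest (here, the mean of sepal length) stabilizes between
# successive sample sizes. Schedule and tolerance are illustrative.
import numpy as np
from sklearn.datasets import load_iris

x = load_iris()['data'][:, 0]           # sepal length column (150 values)
rng = np.random.default_rng(0)
tol = 0.05                              # stop when the mean moves less than this

prev_mean = None
for size in (10, 20, 40, 80, 150):
    idx = rng.choice(x.shape[0], size=size, replace=False)
    mean = x[idx].mean()                # the "desired operation" on the sample
    print(f"n={size:3d}  sample mean={mean:.3f}")
    if prev_mean is not None and abs(mean - prev_mean) < tol:
        print("mean stabilized; stop growing the sample")
        break
    prev_mean = mean
```

In practice the recorded result might be model accuracy rather than a mean, with the same grow-and-compare loop.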
