sampling brief —— python data science cookbook

  • Simple random sampling

Typically, in scenarios where it is very expensive to access the whole dataset, sampling can be used to extract a portion of the dataset for analysis. Sampling is also useful in exploratory data analysis (EDA). A sample should be a good representative of the underlying dataset: it should have approximately the same characteristics as the data it was drawn from. For example, the sample mean should be as close as possible to the mean of the original data. There are several sampling techniques; we will cover one of them here. In simple random sampling, every tuple has an equal chance of being selected.
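The claim above can be checked empirically: the mean of a simple random sample should sit close to the population mean. A minimal sketch, using synthetic (hypothetical) normally distributed data rather than any dataset from the recipe:

```python
# Sketch: a simple random sample's mean approximates the population mean.
# The population here is synthetic, drawn from a normal distribution
# purely for illustration.
import numpy as np

rng = np.random.default_rng(42)           # seeded for reproducibility
population = rng.normal(loc=5.0, scale=2.0, size=100_000)

# Draw 1,000 records without replacement
sample = rng.choice(population, size=1_000, replace=False)

print("population mean:", round(population.mean(), 3))
print("sample mean    :", round(sample.mean(), 3))
```

With a sample of 1,000 from 100,000, the two means typically differ by only a few hundredths.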

For our example, we want to sample ten records randomly from the Iris dataset.

In step 2, we do a random selection using the choice function from numpy.random. The choice function randomly picks n integers, where n is the size of the sample, dictated by no_records in our case.

numpy.random.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Parameters:

a : 1-D array-like or int

If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a was np.arange(n)

size : int or tuple of ints, optional

Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

replace : boolean, optional

Whether the sample is with or without replacement

p : 1-D array-like, optional

The probabilities associated with each entry in a. If not given the sample assumes a uniform distribution over all entries in a.

Returns:

samples : 1-D ndarray, shape (size,)

The generated random samples

Raises:

ValueError

If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

For more on numpy.random.choice, refer to: https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.random.choice.html
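The p parameter described above lets us draw a weighted (non-uniform) sample. A small sketch, with illustrative weights of my own choosing:

```python
# Sketch: weighted sampling with the p parameter of numpy.random.choice.
# The items and weights are illustrative, not from the recipe.
import numpy as np

np.random.seed(0)                          # reproducible draws
items = np.array(['a', 'b', 'c'])
weights = [0.7, 0.2, 0.1]                  # probabilities; must sum to 1

draws = np.random.choice(items, size=10_000, replace=True, p=weights)

# Observed frequencies should sit close to the requested weights
for item in items:
    print(item, round((draws == item).mean(), 2))
```

Over 10,000 draws, the observed frequency of each item converges to its weight.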

Interpreting the replace parameter

It specifies whether we sample with or without replacement. Sampling without replacement removes each sampled item from the pool, so it cannot be a candidate for future draws. Sampling with replacement does the opposite: every element has an equal chance of being sampled again even if it has already been drawn.
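The contrast can be sketched directly. Note that with replace=False, asking for more items than the pool contains raises a ValueError, as the docstring above lists:

```python
# Sketch: replace=True allows duplicates; replace=False does not,
# and oversampling without replacement raises ValueError.
import numpy as np

np.random.seed(1)
pool = np.arange(5)

with_repl = np.random.choice(pool, size=10, replace=True)     # duplicates allowed
print("with replacement   :", with_repl)

without_repl = np.random.choice(pool, size=5, replace=False)  # a permutation of pool
print("without replacement:", without_repl)

try:
    np.random.choice(pool, size=10, replace=False)
except ValueError as err:
    print("oversampling without replacement fails:", err)
```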

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from sklearn.datasets import load_iris
import numpy as np

# 1. Load the Iris dataset
data = load_iris()
x = data['data']

# Let's demonstrate how sampling is performed:
# 2. Randomly sample 10 records from the loaded dataset
no_records = 10
print("data set dimensions :")
print(x.shape)
print("original data set :")
print(x)
# Pick no_records row indices at random (with replacement, the default)
x_sample_indx = np.random.choice(x.shape[0], no_records)
print("sampling data indices :")
print(x_sample_indx)
print("show sampling data : ")
print(x[x_sample_indx, :])

Output:

data set dimensions :
(150, 4)
original data set :
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 ..., 
 [ 6.5  3.   5.2  2. ]
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]
sampling data indices :
[  5 125  30  62  87  45  53  83  61  98]
show sampling data : 
[[ 5.4  3.9  1.7  0.4]
 [ 7.2  3.2  6.   1.8]
 [ 4.8  3.1  1.6  0.2]
 ..., 
 [ 6.   2.7  5.1  1.6]
 [ 5.9  3.   4.2  1.5]
 [ 5.1  2.5  3.   1.1]]

  • Stratified sampling

If the underlying dataset consists of different groups, a simple random sample may fail to capture enough records from each group to represent the data. For example, in a two-class classification problem, 10% of the data may belong to the positive class and 90% to the negative class. This kind of problem is called the class imbalance problem in machine learning. When we sample such imbalanced datasets, the sample should reflect the preceding percentages. This kind of sampling is called stratified sampling. We will look further into stratified sampling in future chapters on machine learning.

For more about stratified sampling, refer to: https://en.wikipedia.org/wiki/Stratified_sampling
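A minimal sketch of the idea on the Iris labels: draw the same fraction from each class so the sample preserves the class proportions. The 20% fraction is an arbitrary choice for illustration:

```python
# Sketch of stratified sampling: sample a fixed fraction from each
# class separately, so class proportions are preserved in the sample.
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
y = data['target']                      # 3 classes, 50 records each
rng = np.random.default_rng(0)
fraction = 0.2                          # illustrative sampling fraction

sample_idx = []
for cls in np.unique(y):
    cls_idx = np.where(y == cls)[0]     # indices belonging to this class
    n = int(len(cls_idx) * fraction)    # per-class sample size
    sample_idx.extend(rng.choice(cls_idx, size=n, replace=False))

sample_idx = np.array(sample_idx)
print("sample size:", len(sample_idx))
print("class counts in sample:", np.bincount(y[sample_idx]))
```

Each of the three Iris classes contributes exactly 10 records, so the sample keeps the original 1:1:1 class ratio.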

  • Progressive sampling (iteratively growing the sample from small to large)

How do we determine the correct sample size for a given problem? We have discussed several sampling techniques, but none of them tells us what sample size to use, and there is no simple answer. One approach is progressive sampling: select a sample size, draw the samples using any of the sampling techniques, apply the desired operation on the data, and record the results. Then increase the sample size and repeat the steps. This iterative process is called progressive sampling.
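The loop above can be sketched as follows, using the mean of one Iris feature as the recorded result. The size schedule and stopping tolerance are illustrative choices, not part of the recipe:

```python
# Sketch of progressive sampling: grow the sample until the statistic
# of interest (here, the mean of sepal length) stabilizes between
# successive sample sizes. Schedule and tolerance are illustrative.
import numpy as np
from sklearn.datasets import load_iris

x = load_iris()['data'][:, 0]           # sepal length column (150 values)
rng = np.random.default_rng(0)
tol = 0.05                              # stop when the mean moves less than this

prev_mean = None
for size in (10, 20, 40, 80, 150):
    idx = rng.choice(x.shape[0], size=size, replace=False)
    mean = x[idx].mean()                # the "desired operation" on the sample
    print(f"n={size:3d}  sample mean={mean:.3f}")
    if prev_mean is not None and abs(mean - prev_mean) < tol:
        print("mean stabilized; stop growing the sample")
        break
    prev_mean = mean
```

In practice the recorded result might be model accuracy rather than a mean, with the same grow-and-compare loop.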
