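Python: Data Sampling (Extracting a Given Proportion of Data)

Sampling is a way to reduce the volume of data. Common approaches are random sampling and stratified sampling.

1. Random sampling

Depending on whether a drawn record is put back and can take part in the next draw, random sampling is either sampling with replacement or sampling without replacement.

(1) Sampling without replacement

Step 1: Load the data

Here we use the iris dataset from the datasets module of the sklearn library.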

{'data': array([[5.1, 3.5, 1.4, 0.2],

[4.9, 3. , 1.4, 0.2],

[4.7, 3.2, 1.3, 0.2],

[4.6, 3.1, 1.5, 0.2],

[5. , 3.6, 1.4, 0.2],

[5.4, 3.9, 1.7, 0.4],

[4.6, 3.4, 1.4, 0.3],

[5. , 3.4, 1.5, 0.2],

[4.4, 2.9, 1.4, 0.2],

[4.9, 3.1, 1.5, 0.1],

[5.4, 3.7, 1.5, 0.2],

[4.8, 3.4, 1.6, 0.2],

[4.8, 3. , 1.4, 0.1],

[4.3, 3. , 1.1, 0.1],

[5.8, 4. , 1.2, 0.2],

[5.7, 4.4, 1.5, 0.4],

[5.4, 3.9, 1.3, 0.4],

[5.1, 3.5, 1.4, 0.3],

[5.7, 3.8, 1.7, 0.3],

[5.1, 3.8, 1.5, 0.3],

[5.4, 3.4, 1.7, 0.2],

[5.1, 3.7, 1.5, 0.4],

[4.6, 3.6, 1. , 0.2],

[5.1, 3.3, 1.7, 0.5],

[4.8, 3.4, 1.9, 0.2],

[5. , 3. , 1.6, 0.2],

[5. , 3.4, 1.6, 0.4],

[5.2, 3.5, 1.5, 0.2],

[5.2, 3.4, 1.4, 0.2],

[4.7, 3.2, 1.6, 0.2],

[4.8, 3.1, 1.6, 0.2],

[5.4, 3.4, 1.5, 0.4],

[5.2, 4.1, 1.5, 0.1],

[5.5, 4.2, 1.4, 0.2],

[4.9, 3.1, 1.5, 0.2],

[5. , 3.2, 1.2, 0.2],

[5.5, 3.5, 1.3, 0.2],

[4.9, 3.6, 1.4, 0.1],

[4.4, 3. , 1.3, 0.2],

[5.1, 3.4, 1.5, 0.2],

[5. , 3.5, 1.3, 0.3],

[4.5, 2.3, 1.3, 0.3],

[4.4, 3.2, 1.3, 0.2],

[5. , 3.5, 1.6, 0.6],

[5.1, 3.8, 1.9, 0.4],

[4.8, 3. , 1.4, 0.3],

[5.1, 3.8, 1.6, 0.2],

[4.6, 3.2, 1.4, 0.2],

[5.3, 3.7, 1.5, 0.2],

[5. , 3.3, 1.4, 0.2],

[7. , 3.2, 4.7, 1.4],

[6.4, 3.2, 4.5, 1.5],

[6.9, 3.1, 4.9, 1.5],

[5.5, 2.3, 4. , 1.3],

[6.5, 2.8, 4.6, 1.5],

[5.7, 2.8, 4.5, 1.3],

[6.3, 3.3, 4.7, 1.6],

[4.9, 2.4, 3.3, 1. ],

[6.6, 2.9, 4.6, 1.3],

[5.2, 2.7, 3.9, 1.4],

[5. , 2. , 3.5, 1. ],

[5.9, 3. , 4.2, 1.5],

[6. , 2.2, 4. , 1. ],

[6.1, 2.9, 4.7, 1.4],

[5.6, 2.9, 3.6, 1.3],

[6.7, 3.1, 4.4, 1.4],

[5.6, 3. , 4.5, 1.5],

[5.8, 2.7, 4.1, 1. ],

[6.2, 2.2, 4.5, 1.5],

[5.6, 2.5, 3.9, 1.1],

[5.9, 3.2, 4.8, 1.8],

[6.1, 2.8, 4. , 1.3],

[6.3, 2.5, 4.9, 1.5],

[6.1, 2.8, 4.7, 1.2],

[6.4, 2.9, 4.3, 1.3],

[6.6, 3. , 4.4, 1.4],

[6.8, 2.8, 4.8, 1.4],

[6.7, 3. , 5. , 1.7],

[6. , 2.9, 4.5, 1.5],

[5.7, 2.6, 3.5, 1. ],

[5.5, 2.4, 3.8, 1.1],

[5.5, 2.4, 3.7, 1. ],

[5.8, 2.7, 3.9, 1.2],

[6. , 2.7, 5.1, 1.6],

[5.4, 3. , 4.5, 1.5],

[6. , 3.4, 4.5, 1.6],

[6.7, 3.1, 4.7, 1.5],

[6.3, 2.3, 4.4, 1.3],

[5.6, 3. , 4.1, 1.3],

[5.5, 2.5, 4. , 1.3],

[5.5, 2.6, 4.4, 1.2],

[6.1, 3. , 4.6, 1.4],

[5.8, 2.6, 4. , 1.2],

[5. , 2.3, 3.3, 1. ],

[5.6, 2.7, 4.2, 1.3],

[5.7, 3. , 4.2, 1.2],

[5.7, 2.9, 4.2, 1.3],

[6.2, 2.9, 4.3, 1.3],

[5.1, 2.5, 3. , 1.1],

[5.7, 2.8, 4.1, 1.3],

[6.3, 3.3, 6. , 2.5],

[5.8, 2.7, 5.1, 1.9],

[7.1, 3. , 5.9, 2.1],

[6.3, 2.9, 5.6, 1.8],

[6.5, 3. , 5.8, 2.2],

[7.6, 3. , 6.6, 2.1],

[4.9, 2.5, 4.5, 1.7],

[7.3, 2.9, 6.3, 1.8],

[6.7, 2.5, 5.8, 1.8],

[7.2, 3.6, 6.1, 2.5],

[6.5, 3.2, 5.1, 2. ],

[6.4, 2.7, 5.3, 1.9],

[6.8, 3. , 5.5, 2.1],

[5.7, 2.5, 5. , 2. ],

[5.8, 2.8, 5.1, 2.4],

[6.4, 3.2, 5.3, 2.3],

[6.5, 3. , 5.5, 1.8],

[7.7, 3.8, 6.7, 2.2],

[7.7, 2.6, 6.9, 2.3],

[6. , 2.2, 5. , 1.5],

[6.9, 3.2, 5.7, 2.3],

[5.6, 2.8, 4.9, 2. ],

[7.7, 2.8, 6.7, 2. ],

[6.3, 2.7, 4.9, 1.8],

[6.7, 3.3, 5.7, 2.1],

[7.2, 3.2, 6. , 1.8],

[6.2, 2.8, 4.8, 1.8],

[6.1, 3. , 4.9, 1.8],

[6.4, 2.8, 5.6, 2.1],

[7.2, 3. , 5.8, 1.6],

[7.4, 2.8, 6.1, 1.9],

[7.9, 3.8, 6.4, 2. ],

[6.4, 2.8, 5.6, 2.2],

[6.3, 2.8, 5.1, 1.5],

[6.1, 2.6, 5.6, 1.4],

[7.7, 3. , 6.1, 2.3],

[6.3, 3.4, 5.6, 2.4],

[6.4, 3.1, 5.5, 1.8],

[6. , 3. , 4.8, 1.8],

[6.9, 3.1, 5.4, 2.1],

[6.7, 3.1, 5.6, 2.4],

[6.9, 3.1, 5.1, 2.3],

[5.8, 2.7, 5.1, 1.9],

[6.8, 3.2, 5.9, 2.3],

[6.7, 3.3, 5.7, 2.5],

[6.7, 3. , 5.2, 2.3],

[6.3, 2.5, 5. , 1.9],

[6.5, 3. , 5.2, 2. ],

[6.2, 3.4, 5.4, 2.3],

[5.9, 3. , 5.1, 1.8]]),

'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),

'frame': None,

'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='

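This is a Bunch object. Its structure resembles a dictionary, so values can be looked up by key; the values here are NumPy arrays.

We only need the value under the data key. It is converted to a DataFrame, and column names are then added.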
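A minimal sketch of this step, assuming scikit-learn and pandas are installed. The names `rawdata` and `iris` follow the ones used later in this post; taking the column names from `rawdata.feature_names` is an assumption, since the post does not say which names were used.

```python
import pandas as pd
from sklearn import datasets

# Load iris as a Bunch (dictionary-like) object
rawdata = datasets.load_iris()

# Keep only the feature matrix and attach column names.
# Assumption: the names come from rawdata.feature_names.
iris = pd.DataFrame(rawdata.data, columns=rawdata.feature_names)

print(iris.shape)   # (150, 4)
print(iris.head())  # first five rows
```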

The data looks like this:

Step 2: Draw the sample

Sampling uses the DataFrame's built-in sample() method. Its help text is shown below:

help(iris.sample)

Help on method sample in module pandas.core.generic:

sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) -> ~FrameOrSeries method of pandas.core.frame.DataFrame instance

Return a random sample of items from an axis of object.

You can use `random_state` for reproducibility.

Parameters

----------

n : int, optional  # number of records to draw; must be an integer. Cannot be set together with frac.

Number of items from axis to return. Cannot be used with `frac`.

Default = 1 if `frac` = None.

frac : float, optional  # fraction of the total data to draw. Cannot be set together with n.

Fraction of axis items to return. Cannot be used with `n`.

replace : bool, default False

Allow or disallow sampling of the same row more than once.  # whether to sample with replacement

weights : str or ndarray-like, optional

Default 'None' results in equal probability weighting.

If passed a Series, will align with target object on index. Index

values in weights not found in sampled object will be ignored and

index values in sampled object not in weights will be assigned

weights of zero.

If called on a DataFrame, will accept the name of a column

when axis = 0.

Unless weights are a Series, weights must be same length as axis

being sampled.

If weights do not sum to 1, they will be normalized to sum to 1.

Missing values in the weights column will be treated as zero.

Infinite values not allowed.

random_state : int or numpy.random.RandomState, optional

Seed for the random number generator (if int), or numpy RandomState

object.

axis : {0 or ‘index’, 1 or ‘columns’, None}, default None

Axis to sample. Accepts axis number or name. Default is stat axis

for given data type (0 for Series and DataFrames).  # defaults to 0, i.e. rows of the DataFrame are sampled; set to 1 to sample columns.

Returns

-------

Series or DataFrame

A new object of same type as caller containing `n` items randomly

sampled from the caller object.

See Also

--------

numpy.random.choice: Generates a random sample from a given 1-D numpy

array.

Notes

-----

If `frac` > 1, `replacement` should be set to `True`.

Examples

--------

>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],

... 'num_wings': [2, 0, 0, 0],

... 'num_specimen_seen': [10, 2, 1, 8]},

... index=['falcon', 'dog', 'spider', 'fish'])

>>> df

        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the ``Series`` ``df['num_legs']``:

Note that we use `random_state` to ensure the reproducibility of

the examples.

>>> df['num_legs'].sample(n=3, random_state=1)

fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

A random 50% sample of the ``DataFrame`` with replacement:

>>> df.sample(frac=0.5, replace=True, random_state=1)

      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

An upsample sample of the ``DataFrame`` with replacement:

Note that `replace` parameter has to be `True` for `frac` parameter > 1.

# If the sampling fraction is greater than 1, replace must also be set to True.

>>> df.sample(frac=2, replace=True, random_state=1)

        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2

Using a DataFrame column as weights. Rows with larger value in the

`num_specimen_seen` column are more likely to be sampled.

>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)

        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8

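a. Sampling a specified number of records

I drew the sample twice and got the same result both times. Both calls used the same random_state, which makes the random sample reproducible. This is similar to setting a random seed in R.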
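The reproducibility described above can be sketched as follows. To keep the example self-contained, `iris` here is a 150-row stand-in frame with made-up column names rather than the real dataset, and the sample size of 10 and seed of 1 are illustrative choices.

```python
import numpy as np
import pandas as pd

# Stand-in for the iris frame: 150 rows, 4 numeric columns (not the real data)
rng = np.random.default_rng(0)
iris = pd.DataFrame(rng.random((150, 4)), columns=["a", "b", "c", "d"])

# Draw 10 rows without replacement (replace=False is the default)
sample1 = iris.sample(n=10, random_state=1)
sample2 = iris.sample(n=10, random_state=1)

print(sample1.equals(sample2))   # True: same seed, same rows
print(sample1.index.is_unique)   # True: no row is drawn twice
```

Changing or omitting `random_state` in the second call would generally give a different set of rows.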

b. Sampling by proportion

The iris dataset has 150 records in total, so a 20% sample should contain 30 records.
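A minimal sketch of proportional sampling with the `frac` parameter. As an assumption for illustration, `iris` is again a 150-row stand-in frame with made-up columns, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in for the iris frame: 150 rows, 4 columns (not the real data)
iris = pd.DataFrame(np.arange(600).reshape(150, 4), columns=list("abcd"))

# Draw 20% of the rows without replacement
sample_20pct = iris.sample(frac=0.2, random_state=1)

print(len(sample_20pct))  # 30
```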

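(2) Sampling with replacement

a. Sampling a specified number of records

Duplicates can appear even when only 10 records are drawn. Note, however, that sampling with replacement only means the result may contain duplicates; not every draw with replacement will actually contain them.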
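A sketch of drawing a fixed number of rows with replacement and counting repeated rows, again on a 150-row stand-in frame (an assumption, not the real data). Whether duplicates show up depends on the seed, so the count is printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Stand-in for the iris frame: 150 rows, 4 columns (not the real data)
iris = pd.DataFrame(np.arange(600).reshape(150, 4), columns=list("abcd"))

# replace=True allows the same row to be drawn more than once
sample_wr = iris.sample(n=10, replace=True, random_state=1)

# Count rows whose index label already appeared earlier in the sample
n_dupes = int(sample_wr.index.duplicated().sum())

print(len(sample_wr))  # 10
print(n_dupes)         # may be 0: duplicates are possible, not guaranteed
```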

b. Sampling by proportion
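A sketch of proportional sampling with replacement, using a 150-row stand-in frame (an assumption for illustration). It also shows the rule from the help text above: a `frac` greater than 1 only works together with `replace=True`. The fractions 0.2 and 1.5 are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for the iris frame: 150 rows, 4 columns (not the real data)
iris = pd.DataFrame(np.arange(600).reshape(150, 4), columns=list("abcd"))

# 20% of the rows, drawn with replacement
sample_frac = iris.sample(frac=0.2, replace=True, random_state=1)
print(len(sample_frac))  # 30

# Upsampling: 150% of the rows requires replace=True
upsampled = iris.sample(frac=1.5, replace=True, random_state=1)
print(len(upsampled))                       # 225
print(upsampled.index.duplicated().any())   # True: 225 draws from 150 rows

# Without replace=True, frac > 1 is rejected
try:
    iris.sample(frac=1.5, random_state=1)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```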

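2. Stratified sampling

Stratified sampling divides the dataset into several strata according to some attribute and draws a certain number of samples from each stratum. The original iris dataset has a target field that gives the iris species of each record, with values 0, 1, and 2. This field can serve as the stratification field.

Step 1: Add the target field from the original data, rawdata, to the existing iris DataFrame.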
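A minimal sketch of this step, assuming scikit-learn and pandas are installed. The column name `target` follows the text; taking the feature column names from `rawdata.feature_names` is an assumption.

```python
import pandas as pd
from sklearn import datasets

rawdata = datasets.load_iris()
iris = pd.DataFrame(rawdata.data, columns=rawdata.feature_names)

# Attach the class labels (0, 1, 2) as a new column
iris["target"] = rawdata.target

# Each class has 50 records
print(iris["target"].value_counts().sort_index())

# Summary of columns, dtypes, and non-null counts
iris.info()
```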

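Basic information about the updated iris dataset:

Step 2: Draw a sample from each stratum defined by the distinct values of target, then combine the per-stratum samples.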
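One way to sketch this step is shown below. It is not necessarily the code used in the original post: it collects the per-stratum samples in a list and stacks them with `pd.concat`, which works on current pandas versions. The frame is a stand-in whose `target` column mirrors the iris layout (50 rows per class); the column `x`, the 20% fraction, and the seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in with a 'target' column laid out like iris: 50 rows each of 0, 1, 2
iris = pd.DataFrame({
    "x": np.arange(150, dtype=float),
    "target": np.repeat([0, 1, 2], 50),
})

# Sample 20% of each stratum without replacement
parts = []
for label in sorted(iris["target"].unique()):
    stratum = iris[iris["target"] == label]
    parts.append(stratum.sample(frac=0.2, random_state=1))

# Stack the per-stratum samples row-wise
stratified = pd.concat(parts, axis=0)

print(stratified["target"].value_counts().sort_index())  # 10 rows per class
```

On pandas 1.1 or later, `iris.groupby("target").sample(frac=0.2, random_state=1)` should give a comparable per-stratum draw in a single call.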

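Note: DataFrames have an append() method, which may come as a pleasant surprise. However, append() only adds new rows after the existing rows; it cannot join frames side by side (column-wise). Also be aware that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions concat() has to be used instead.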

Help on method append in module pandas.core.frame:

append(other, ignore_index=False, verify_integrity=False, sort=False) -> 'DataFrame' method of pandas.core.frame.DataFrame instance

Append rows of `other` to the end of caller, returning a new object.

Columns in `other` that are not in the caller are added as new columns.

Parameters

----------

other : DataFrame or Series/dict-like object, or list of these

The data to append.

ignore_index : bool, default False

If True, do not use the index labels.

verify_integrity : bool, default False

If True, raise ValueError on creating index with duplicates.

sort : bool, default False

Sort columns if the columns of `self` and `other` are not aligned.

.. versionadded:: 0.23.0

.. versionchanged:: 1.0.0

Changed to not sort by default.

Returns

-------

DataFrame

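Besides the DataFrame's append() method, the concat() function in the pandas library can also concatenate DataFrames.

Note: axis = 0 means concatenating by rows. concat() can also concatenate by columns, using axis = 1.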
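The two directions of concatenation can be illustrated with small made-up frames (the names `top`, `bottom`, and `extra` are hypothetical):

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5, 6], "b": [7, 8]})
extra = pd.DataFrame({"c": [9, 10]})

# axis=0: stack rows; ignore_index=True renumbers the result 0..3
rows = pd.concat([top, bottom], axis=0, ignore_index=True)
print(rows.shape)  # (4, 2)

# axis=1: place frames side by side, aligned on the row index
cols = pd.concat([top, extra], axis=1)
print(cols.shape)  # (2, 3)
```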

help(pd.concat)

Help on function concat in module pandas.core.reshape.concat:

concat(objs: Union[Iterable[Union[ForwardRef('DataFrame'), ForwardRef('Series')]], Mapping[Union[Hashable, NoneType], Union[ForwardRef('DataFrame'), ForwardRef('Series')]]], axis=0, join='outer', ignore_index: bool = False, keys=None, levels=None, names=None, verify_integrity: bool = False, sort: bool = False, copy: bool = True) -> Union[ForwardRef('DataFrame'), ForwardRef('Series')]

Concatenate pandas objects along a particular axis with optional set logic

along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis,

which may be useful if the labels are the same (or overlapping) on

the passed axis number.

Parameters

----------

objs : a sequence or mapping of Series or DataFrame objects

If a dict is passed, the sorted keys will be used as the `keys`

argument, unless it is passed, in which case the values will be

selected (see below). Any None objects will be dropped silently unless

they are all None in which case a ValueError will be raised.

axis : {0/'index', 1/'columns'}, default 0

The axis to concatenate along.

join : {'inner', 'outer'}, default 'outer'

How to handle indexes on other axis (or axes).

ignore_index : bool, default False

If True, do not use the index values along the concatenation axis. The

resulting axis will be labeled 0, ..., n - 1. This is useful if you are

concatenating objects where the concatenation axis does not have

meaningful indexing information. Note the index values on the other

axes are still respected in the join.

keys : sequence, default None

If multiple levels passed, should contain tuples. Construct

hierarchical index using the passed keys as the outermost level.

levels : list of sequences, default None

Specific levels (unique values) to use for constructing a

MultiIndex. Otherwise they will be inferred from the keys.

names : list, default None

Names for the levels in the resulting hierarchical index.

verify_integrity : bool, default False

Check whether the new concatenated axis contains duplicates. This can

be very expensive relative to the actual data concatenation.

sort : bool, default False

Sort non-concatenation axis if it is not already aligned when `join`

is 'outer'.

This has no effect when ``join='inner'``, which already preserves

the order of the non-concatenation axis.

.. versionadded:: 0.23.0

.. versionchanged:: 1.0.0

Changed to not sort by default.

copy : bool, default True

If False, do not copy data unnecessarily.

Note: be careful to distinguish concat() from merge().
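A small sketch of the difference, using made-up frames (`left`, `right`, and the column names are hypothetical). `concat()` glues objects together along an axis, while `merge()` performs a database-style join on key columns.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "y": ["B", "C", "D"]})

# concat: stacks the frames; columns missing from one frame become NaN
stacked = pd.concat([left, right], axis=0, ignore_index=True)
print(stacked.shape)  # (6, 3)

# merge: joins rows whose 'key' values match (inner join here)
joined = pd.merge(left, right, on="key", how="inner")
print(joined["key"].tolist())  # [2, 3]
```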

Reference: 用Python玩转数据, 中国大学MOOC (China University MOOC), www.icourse163.org
