Python for Data Analysis, Chapter 7: Data Wrangling (Clean, Transform, Merge, Reshape) [3]: Data Transformation

Link to this article: https://blog.csdn.net/wangdi_37927/article/details/104432756

Data Transformation

Removing Duplicates

duplicated, drop_duplicates
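
The notes list only the method names, so here is a minimal sketch of how the two are typically used. The DataFrame and its column names (k1, k2) are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical sample data, not from the original post.
data = pd.DataFrame({"k1": ["one", "two", "one", "two", "one"],
                     "k2": [1, 1, 1, 2, 1]})

# duplicated() flags every row that repeats an earlier row.
dup_mask = data.duplicated()

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = data.drop_duplicates()

# Compare on a single column only and keep the last occurrence instead.
last_by_k1 = data.drop_duplicates(subset=["k1"], keep="last")

print(dup_mask.tolist())
print(deduped)
print(last_by_k1)
```

Both methods return new objects; the original data is left unchanged.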

Transforming Data Using a Function or Mapping

map
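
As a rough illustration of Series.map, the sketch below maps values through a dict and through a plain function. The food names and the lookup table are invented sample data.

```python
import pandas as pd

# Hypothetical sample data, not from the original post.
foods = pd.Series(["bacon", "Pastrami", "honey ham"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "honey ham": "pig"}

# Normalize the case first, then look each value up in the dict.
animals = foods.str.lower().map(meat_to_animal)

# map also accepts any callable that takes one element.
lengths = foods.map(len)

print(animals.tolist())
print(lengths.tolist())
```

Values missing from the dict would come back as NaN, which is why the case is normalized before the lookup.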

Replacing Values

replace
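
A small sketch of replace, assuming -999 and -1000 are sentinel values standing in for missing data. Both the sentinels and the series are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical readings in which -999 and -1000 mark missing values.
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Replace several values with the same substitute.
cleaned = readings.replace([-999.0, -1000.0], np.nan)

# Replace each value with its own substitute by passing a dict.
mapped = readings.replace({-999.0: np.nan, -1000.0: 0.0})

print(cleaned.tolist())
print(mapped.tolist())
```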

Renaming Axis Indexes

.index.map

rename:

data.rename(index={'OHIO':'FHDJ'}, columns={'fdjh':'fhdhgj'})
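
To make the two approaches concrete, here is a self-contained sketch. The DataFrame, its labels, and the replacement names are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

# index.map returns a new Index; assign it to data.index to modify in place.
upper_index = data.index.map(str.upper)

# rename returns a new DataFrame; dicts rename only the listed labels.
renamed = data.rename(index={"Ohio": "INDIANA"}, columns={"three": "peekaboo"})

# rename also accepts functions that are applied to every label.
titled = data.rename(index=str.title, columns=str.upper)

print(upper_index.tolist())
print(renamed)
print(titled)
```

Labels that do not appear in the dict are left as they are, so a key that matches nothing is silently ignored.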

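Discretization and Binning

pd.cut, pd.qcut

>>> factors = np.random.randn(9)
>>> print(factors)
[ 2.12046097  0.24486218  1.64494175 -0.27307614 -2.11238291 2.15422205 -0.46832859  0.16444572  1.52536248]

pd.qcut()
qcut picks the bin edges from the sample quantiles, so the bins are equal in frequency rather than in width: each bin holds the same number of values, or as close to that as the data allow.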

Passing the q parameter

>>> pd.qcut(factors, 3)  # returns the bin each value falls into
[(1.525, 2.154], (-0.158, 1.525], (1.525, 2.154], (-2.113, -0.158], (-2.113, -0.158], (1.525, 2.154], (-2.113, -0.158], (-0.158, 1.525], (-0.158, 1.525]]
Categories (3, interval[float64]): [(-2.113, -0.158] < (-0.158, 1.525] < (1.525, 2.154]]

>>> pd.qcut(factors, 3).value_counts()  # counts the values in each bin
(-2.113, -0.158]    3
(-0.158, 1.525]     3
(1.525, 2.154]      3

Passing the labels parameter

>>> pd.qcut(factors, 3, labels=["a","b","c"])  # returns the bin for each value, named by labels
[c, b, c, a, a, c, a, b, b]
Categories (3, object): [a < b < c]

>>> pd.qcut(factors, 3, labels=False)  # returns only the integer index of each value's bin
[2 1 2 0 0 2 0 1 1]

Passing the retbins parameter

>>> pd.qcut(factors, 3, retbins=True)  # returns the bin for each value plus the bin edges
([(1.525, 2.154], (-0.158, 1.525], (1.525, 2.154], (-2.113, -0.158], (-2.113, -0.158], (1.525, 2.154], (-2.113, -0.158], (-0.158, 1.525], (-0.158, 1.525]]
Categories (3, interval[float64]): [(-2.113, -0.158] < (-0.158, 1.525] < (1.525, 2.154]], array([-2.113, -0.158,  1.525,  2.154]))

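Parameter   Description
x           1-D ndarray or Series
q           int, the number of quantile groups (an array of quantiles is also accepted)
labels      array or False, default None. An array supplies the bin names; False returns only the integer bin indices
retbins     bool, default False. When True, the bin edges are returned as well
precision   int, default 3. Precision used when storing and displaying the bin labels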

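pd.cut()
cut bins by the values themselves: given an integer, it splits the range of the data into that many equal-width bins, so every bin spans the same interval length.

Passing the bins parameter

>>> pd.cut(factors, 3)  # returns the bin each value falls into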
[(0.732, 2.154], (-0.69, 0.732], (0.732, 2.154], (-0.69, 0.732], (-2.117, -0.69], (0.732, 2.154], (-0.69, 0.732], (-0.69, 0.732], (0.732, 2.154]]
Categories (3, interval[float64]): [(-2.117, -0.69] < (-0.69, 0.732] < (0.732, 2.154]]

>>> pd.cut(factors, bins=[-3,-2,-1,0,1,2,3])
[(2, 3], (0, 1], (1, 2], (-1, 0], (-3, -2], (2, 3], (-1, 0], (0, 1], (1, 2]]
Categories (6, interval[int64]): [(-3, -2] < (-2, -1] < (-1, 0] < (0, 1] < (1, 2] < (2, 3]]

>>> pd.cut(factors, 3).value_counts()  # counts the values in each bin
(-2.117, -0.69]    1
(-0.69, 0.732]     4
(0.732, 2.154]     4

Passing the labels parameter

>>> pd.cut(factors, 3, labels=["a","b","c"])  # returns the bin for each value, named by labels
[c, b, c, b, a, c, b, b, c]
Categories (3, object): [a < b < c]

>>> pd.cut(factors, 3, labels=False)  # returns only the integer index of each value's bin
[2 1 2 1 0 2 1 1 2]

Passing the retbins parameter

>>> pd.cut(factors, 3, retbins=True)  # returns the bin for each value plus the bin edges
([(0.732, 2.154], (-0.69, 0.732], (0.732, 2.154], (-0.69, 0.732], (-2.117, -0.69], (0.732, 2.154], (-0.69, 0.732], (-0.69, 0.732], (0.732, 2.154]]
Categories (3, interval[float64]): [(-2.117, -0.69] < (-0.69, 0.732] < (0.732, 2.154]], array([-2.11664951, -0.69018126,  0.7320204 ,  2.15422205]))

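Parameter   Description
x           array; must be one-dimensional
bins        int or sequence of scalars: the number of equal-width bins, or the explicit bin edges
labels      array or False, default None. An array supplies the bin names; False returns only the integer bin indices
retbins     bool, default False. When True, the bin edges are returned as well
precision   int, default 3. Precision used when storing and displaying the bin labels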
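
The difference between equal-width and equal-frequency bins is easiest to see on skewed data. The sketch below uses a small made-up array with one large value; the numbers are an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: eight small values and one large one.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# cut: three bins of equal width over the range [1, 100].
width_bins = pd.cut(values, 3)

# qcut: three bins holding equal numbers of values.
count_bins = pd.qcut(values, 3)

# Counts are reported in bin order, including empty bins.
width_counts = width_bins.value_counts().tolist()
count_counts = count_bins.value_counts().tolist()

print(width_counts)
print(count_counts)
```

With cut almost everything lands in the first bin and the middle bin stays empty, while qcut spreads the nine values three per bin.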

Detecting and Filtering Outliers
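
The notes leave this section empty. One common approach, shown here as a sketch rather than as the method the notes had in mind, is to select rows with boolean masks and then cap the extreme values. The threshold of 3 and the sample data are assumptions.

```python
import pandas as pd

# Hypothetical data; the threshold of 3 is an arbitrary choice.
data = pd.DataFrame({"a": [0.5, -4.2, 1.1, 3.7],
                     "b": [2.9, 0.3, -3.5, 0.0]})

# Rows in which any column exceeds 3 in absolute value.
outlier_rows = data[(data.abs() > 3).any(axis=1)]

# Cap every value to the interval [-3, 3].
capped = data.clip(lower=-3, upper=3)

print(outlier_rows)
print(capped)
```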

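Permutation and Random Sampling

numpy.random.permutation

numpy.random.permutation + take

numpy.random.randint + take

numpy.random.permutation:

Given an integer n or an array, it returns a randomly ordered sequence: a permutation of np.arange(n) for an integer, or a shuffled copy for an array. A multi-dimensional array is shuffled along its first axis only, so whole rows change places while the contents of each row stay intact.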

>>> np.random.permutation([1, 4, 9, 12, 15])
array([15,  1,  9,  4, 12])

>>> arr = np.arange(9).reshape((3, 3))
>>> arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.random.permutation(arr)
array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])

>>> permutation = list(np.random.permutation(10))
>>> permutation
[5, 1, 7, 6, 8, 9, 4, 0, 2, 3]
>>> Y = np.array([[1,1,1,1,0,0,0,0,0,0]])
>>> Y_new = Y[:, permutation]
>>> Y_new
array([[0, 1, 0, 0, 0, 0, 0, 1, 1, 1]])

numpy.random.permutation + take:

df.take(numpy.random.permutation(len(df))[:3])

# the take function
arr = np.arange(6)*100  # arr is array([0, 100, 200, 300, 400, 500])
inds = [4,3,2]
arr.take(inds)
# picks the elements of arr at positions 4, 3, 2 in that order, so the result is array([400, 300, 200])
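
Putting the two pieces together, the following sketch shuffles the rows of a DataFrame and draws a random subset without replacement. The frame is made-up sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 5 rows, used only for illustration.
df = pd.DataFrame(np.arange(20).reshape((5, 4)), columns=list("abcd"))

# A random ordering of the row positions 0..4.
sampler = np.random.permutation(len(df))

# take reorders the rows by position; this shuffles the whole frame.
shuffled = df.take(sampler)

# Keeping only the first 3 positions samples 3 rows without replacement.
subset = df.take(sampler[:3])

print(shuffled)
print(subset)
```

Because a permutation contains each position exactly once, no row can be selected twice. pandas also offers df.sample(n=3) for the same purpose.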

numpy.random.randint + take:

numpy.random.randint(low, high=None, size=None, dtype='l')

Returns random integers from the half-open interval [low, high): low is included, high is excluded.
If high is omitted, the integers are drawn from [0, low).

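Parameters:

low: int
Lower bound; generated values are greater than or equal to low.
(When high = None, values are drawn from [0, low) instead.)
high: int (optional)
If given, values are drawn from [low, high).
size: int or tuple of ints (optional)
Shape of the output. For example, size = (m, n, k) returns an array of that shape holding m * n * k random integers. The default, None, returns a single random integer.
dtype: dtype (optional)
Desired dtype of the result, such as int64 or int.

Returns:

out: int or ndarray of ints
A single random integer, or an array of random integers.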
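
A short sketch of the three call forms described above. The bounds and sizes are arbitrary, and the printed values differ from run to run.

```python
import numpy as np

# Only low given: one integer drawn from [0, 5).
single = np.random.randint(5)

# low and high with an int size: four integers drawn from [2, 10).
ranged = np.random.randint(2, 10, size=4)

# A tuple size gives an array of that shape: 2 x 3 integers from [0, 3).
grid = np.random.randint(0, 3, size=(2, 3))

print(single)
print(ranged)
print(grid)
```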

bag = np.array([5,7,-1,6,4])
sampler = np.random.randint(0,len(bag),size=10)
draws = bag.take(sampler)

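Computing Indicator/Dummy Variables

get_dummies

get_dummies:

As I understand it, get_dummies turns a variable that takes several distinct values into 0/1 columns. For example, Xiao Ming has hats in three colors: yellow, red, and blue. Suppose the yellow hat is coded as 1, the red hat as 2, and the blue hat as 3. The magnitudes of 1, 2, and 3 carry no meaning; they only tell the colors apart. For analysis, the codes therefore need to be converted into 0/1 indicator columns, as the following code shows: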

import pandas as pd
xiaoming=pd.DataFrame([1,2,3],index=['yellow','red','blue'],columns=['hat'])
print(xiaoming)
hat_ranks=pd.get_dummies(xiaoming['hat'],prefix='hat')
print(hat_ranks.head())

        hat
yellow    1
red       2
blue      3
        hat_1  hat_2  hat_3
yellow      1      0      0
red         0      1      0
blue        0      0      1
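
get_dummies also works directly on string categories, so the intermediate numeric codes are not required. The sketch below assumes a made-up table of hat owners; passing dtype=int is a choice made here to get 0/1 output, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical data: four owners and the color of each one's hat.
hats = pd.DataFrame({"owner": ["a", "b", "c", "d"],
                     "color": ["yellow", "red", "blue", "red"]})

# One indicator column per distinct color; dtype=int forces 0/1 values.
dummies = pd.get_dummies(hats["color"], prefix="hat", dtype=int)

# Attach the indicator columns to the remaining data.
joined = hats[["owner"]].join(dummies)

print(joined)
```

Each row has exactly one 1 across the indicator columns, because every hat has exactly one color.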

Note: I did not fully follow pages 215 and 216 of the book; I will come back to them when I get the chance.