4.2 Data Transformation

import numpy as np
import pandas as pd

Removing duplicate data

data = pd.DataFrame({'k1': ['one','two']*3+['two'],
                     'k2': [1,1,2,3,3,4,4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
The duplicated method
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

duplicated returns a boolean Series indicating whether each row duplicates an earlier one. In the example, the rows at index 5 and 6 are clearly identical, so the value at index 6 is True.

The drop_duplicates() method

As the name suggests, this method removes duplicate rows: the first occurrence is kept and any later occurrence of the same row is dropped.

data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
Parameter 1: a subset of columns

Both methods can restrict duplicate detection to particular columns; just pass a list of column names, for example:

data['v1'] = np.arange(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6

Example 1: pass a single column

data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
data.duplicated(['k1'])
0    False
1    False
2     True
3     True
4     True
5     True
6     True
dtype: bool

Example 2: pass two columns

data.drop_duplicates(['k1','v1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.duplicated(['k1','k2'])
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
Parameter 2: keep

By default both methods treat the first occurrence as not a duplicate; set keep='last' to keep the last occurrence instead:

data.drop_duplicates(['k1','k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
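Besides 'first' (the default) and 'last', keep also accepts False, which drops every copy of a duplicated row. A minimal sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks all copies of a duplicated row, so both
# ('two', 4) rows are dropped rather than just the later one
deduped = data.drop_duplicates(['k1', 'k2'], keep=False)
print(deduped)
```

Only the five rows with no exact twin survive.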

Transforming data with a function or mapping

For example:

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami','corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
Task: add a column indicating the animal each meat product comes from
meat_to_animal = {'bacon':'pig', 
                  'pulled pork':'pig',
                  'pastrami': 'cow',
                  'corned beef': 'cow',
                  'honey ham': 'pig',
                  'nova lox': 'salmon'}
Using a mapping
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
data = data.drop('animal',axis=1)
Using a function (a lambda)
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

Replacing values

fillna can be viewed as a special case of value replacement (NA being the special value); the replace method is simpler and more flexible.

data = pd.Series([1.,-999.,2.,-999.,-1000,3])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999,np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

Want to replace -1000 as well? Just pass a list!

data.replace([-999,-1000],np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

Want to replace -999 and -1000 with different values? Again, pass a list of replacement values!

data.replace([-999,-1000],[np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Passing a dict also works

data.replace({-999:np.nan, -1000:0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
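An aside not covered by the numeric example above: for string data, replace also accepts regular expressions when regex=True is passed. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(['apple-01', 'banana-02', 'apple-03'])
# with regex=True the first argument is treated as a regular
# expression, so every "apple-XX" value collapses to one label
out = s.replace(r'^apple.*', 'apple', regex=True)
print(out)
```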

Renaming axis indexes

  • Axis indexes can be renamed with a map method analogous to the one introduced above
  • the rename method
df.index.map
data = pd.DataFrame(np.arange(12).reshape(3,4),
                    index=['Ohio','Colorado','New York'],
                    columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
rename
Pass functions directly to the index and columns parameters
data.rename(index=str.title,columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
Use a dict to rename only some of the index and column labels
data.rename(index={'OHIO':'INDIANA'}, columns={'three':'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
The inplace parameter

Modifies the dataset in place rather than returning a copy.

data.rename(index=str.upper, columns=str.title, inplace=True)
data
      One  Two  Three  Four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

Discretization and binning

The pd.cut method
  • partitions data into the given bins
  • returns a Categorical object
ages = np.random.randint(120,size=100)
ages
array([ 46,  21, 116, 119, 118, 119,  90,  88,  16,  83, 103,   8,  17,
        92,  75,   6,  19,  63,  19,  92,  91,  74,  10,  87,  25,  17,
       116, 118,  60,  98,  52,  54, 101,  26,  66,  58,  37,  68,  19,
        91,  65,  98,  57,  84,  85, 100,  32,  87, 105,  21,  29, 104,
       109,  15, 101,  81,  10,  34,  27, 117,  85,  42,  54,  16, 112,
       102,  51,  13,  93,  43,   8,  55,  33,  34,  72, 103,  79,  61,
        41,  36,  47,  72, 116, 100,   0,  38,  41,  84,  54,  64,  46,
        56,  10, 116,  78,  76, 116,  98,  20,  48])
bins = [18, 25, 35, 60, 100]  # the intervals (18,25], (25,35], (35,60], (60,100]
cats = pd.cut(ages,bins)
cats
[(35.0, 60.0], (18.0, 25.0], NaN, NaN, NaN, ..., (60.0, 100.0], NaN, (60.0, 100.0], (18.0, 25.0], (35.0, 60.0]]
Length: 100
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The codes attribute of the Categorical
  • each code is the index of the bin the value falls into
  • -1 means the value falls outside every bin
cats.codes
array([ 2,  0, -1, -1, -1, -1,  3,  3, -1,  3, -1, -1, -1,  3,  3, -1,  0,
        3,  0,  3,  3,  3, -1,  3,  0, -1, -1, -1,  2,  3,  2,  2, -1,  1,
        3,  2,  2,  3,  0,  3,  3,  3,  2,  3,  3,  3,  1,  3, -1,  0,  1,
       -1, -1, -1, -1,  3, -1,  1,  1, -1,  3,  2,  2, -1, -1, -1,  2, -1,
        3,  2, -1,  2,  1,  1,  3, -1,  3,  3,  2,  2,  2,  3, -1,  3, -1,
        2,  2,  3,  2,  3,  2,  2, -1, -1,  3,  3, -1,  3,  0,  2],
      dtype=int8)
The categories attribute of the Categorical
  • holds the intervals along with some metadata
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
The pd.value_counts() method
  • counts how many values fall into each category
pd.value_counts(cats)
(60, 100]    33
(35, 60]     21
(25, 35]      7
(18, 25]      7
dtype: int64
The right parameter of pd.cut
  • controls which side of each interval is closed
  • the default is left-open, right-closed: (a, b]
pd.cut(ages,bins,right=False)
[[35.0, 60.0), [18.0, 25.0), NaN, NaN, NaN, ..., [60.0, 100.0), NaN, [60.0, 100.0), [18.0, 25.0), [35.0, 60.0)]
Length: 100
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
The labels parameter of pd.cut
  • assigns custom names to the bins
pd.cut(ages,bins,labels=['Youth','Young Adult','Middle Age','Senior'])
[Middle Age, Youth, NaN, NaN, NaN, ..., Senior, NaN, Senior, Youth, Middle Age]
Length: 100
Categories (4, object): [Youth < Young Adult < Middle Age < Senior]
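Related to labels: passing labels=False makes pd.cut skip the Categorical altogether and return the integer bin codes directly. A sketch on a small hand-made array:

```python
import numpy as np
import pandas as pd

ages = np.array([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])
bins = [18, 25, 35, 60, 100]
# labels=False returns each value's bin index instead of a
# Categorical of intervals (bins are right-closed, so 25 -> bin 0)
codes = pd.cut(ages, bins, labels=False)
print(codes)
```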
What if you pass a number of bins instead of explicit edges?
  • equal-width bins are computed from the data's minimum and maximum
  • the optional precision parameter controls the number of decimal places
data = np.random.rand(20)
pd.cut(data,4)
[(0.338, 0.524], (0.152, 0.338], (0.524, 0.71], (0.524, 0.71], (0.524, 0.71], ..., (0.71, 0.895], (0.338, 0.524], (0.338, 0.524], (0.524, 0.71], (0.524, 0.71]]
Length: 20
Categories (4, interval[float64]): [(0.152, 0.338] < (0.338, 0.524] < (0.524, 0.71] < (0.71, 0.895]]
pd.cut(data,4,precision=2)
[(0.34, 0.52], (0.15, 0.34], (0.52, 0.71], (0.52, 0.71], (0.52, 0.71], ..., (0.71, 0.9], (0.34, 0.52], (0.34, 0.52], (0.52, 0.71], (0.52, 0.71]]
Length: 20
Categories (4, interval[float64]): [(0.15, 0.34] < (0.34, 0.52] < (0.52, 0.71] < (0.71, 0.9]]
The qcut method
  • bins by sample quantiles -> yields bins with roughly equal counts
data = np.random.rand(1000)
cats = pd.qcut(data,4)
cats
[(0.481, 0.745], (0.256, 0.481], (-0.00040400000000000006, 0.256], (-0.00040400000000000006, 0.256], (0.481, 0.745], ..., (0.256, 0.481], (-0.00040400000000000006, 0.256], (0.256, 0.481], (0.745, 0.999], (-0.00040400000000000006, 0.256]]
Length: 1000
Categories (4, interval[float64]): [(-0.00040400000000000006, 0.256] < (0.256, 0.481] < (0.481, 0.745] < (0.745, 0.999]]
cats.value_counts() 
(-0.00040400000000000006, 0.256]    250
(0.256, 0.481]                      250
(0.481, 0.745]                      250
(0.745, 0.999]                      250
dtype: int64

Note that every bin holds the same number of values (exactly equal whenever the count is divisible by the number of categories).
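The interior cut points qcut chooses are simply the sample quantiles, which can be checked against np.quantile. A sketch assuming a seeded generator; precision=10 keeps the interval edges essentially unrounded:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.random(1000)

cats = pd.qcut(data, 4, precision=10)
# the right edges of all but the last interval are the
# interior quartiles of the sample
edges = [iv.right for iv in cats.categories[:-1]]
print(np.allclose(edges, np.quantile(data, [0.25, 0.5, 0.75])))
```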

qcut: custom quantiles
cats = pd.qcut(data,[0,0.1,0.5,0.9,1.]) 
cats
[(0.481, 0.882], (0.0954, 0.481], (0.0954, 0.481], (-0.00040400000000000006, 0.0954], (0.481, 0.882], ..., (0.0954, 0.481], (0.0954, 0.481], (0.0954, 0.481], (0.882, 0.999], (0.0954, 0.481]]
Length: 1000
Categories (4, interval[float64]): [(-0.00040400000000000006, 0.0954] < (0.0954, 0.481] < (0.481, 0.882] < (0.882, 0.999]]
cats.value_counts()
(-0.00040400000000000006, 0.0954]    100
(0.0954, 0.481]                      400
(0.481, 0.882]                       400
(0.882, 0.999]                       100
dtype: int64

The counts follow the custom quantiles: [(0.1-0)*1000, (0.5-0.1)*1000, (0.9-0.5)*1000, (1-0.9)*1000] = [100, 400, 400, 100].

Detecting and filtering outliers

data = pd.DataFrame(np.random.randn(1000,4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.072513    -0.035990     0.000788    -0.057542
std       0.998585     1.011963     1.022648     0.985153
min      -2.869199    -3.294376    -3.174468    -3.809258
25%      -0.576538    -0.749310    -0.680436    -0.694334
50%       0.107212    -0.080716     0.013427    -0.045903
75%       0.717562     0.673483     0.699950     0.570491
max       3.525382     3.343722     3.196883     2.934261

Example 1: find the values in a single column whose absolute value exceeds 3

col = data[2]
col[np.abs(col)>3]
183    3.015125
437    3.196883
509   -3.174468
Name: 2, dtype: float64

Example 2: find every row containing a value whose absolute value exceeds 3
A quick refresher on DataFrame.any:
DataFrame.any(self, axis=0, bool_only=None, skipna=True, level=None, **kwargs)

np.abs(data) > 3 returns a boolean DataFrame; .any(1) then returns True for each row containing at least one value whose absolute value exceeds 3, and the outer data[...] selects those original rows.

data[(np.abs(data)>3).any(1)] # axis=1
            0         1         2         3
19   3.525382  0.857763 -0.464948 -1.067385
76   0.520871 -3.229111  0.392370  0.652889
183 -0.132046  1.198331  3.015125 -0.259849
368  3.289928  0.797813 -1.672622 -0.315966
437 -0.877511 -0.240920  3.196883 -0.595124
438  1.555436  3.343722 -0.838151 -0.353502
501  0.898879 -3.294376 -0.521216  0.363553
509 -1.798896  2.085908 -3.174468 -0.632310
679 -1.043170 -2.108286  0.940602 -3.809258
731  0.680952  3.000503  0.885260 -1.007260
881 -0.103924 -3.025645 -0.708571 -1.093002

Setting values

data[np.abs(data)>3] = np.sign(data)*3  # every value now lies within [-3, 3]

A refresher on np.sign():

$$\operatorname{sign}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$

data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.071697    -0.035785     0.000750    -0.056733
std       0.996008     1.009188     1.021479     0.982397
min      -2.869199    -3.000000    -3.000000    -3.000000
25%      -0.576538    -0.749310    -0.680436    -0.694334
50%       0.107212    -0.080716     0.013427    -0.045903
75%       0.717562     0.673483     0.699950     0.570491
max       3.000000     3.000000     3.000000     2.934261
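The same capping can be written without np.sign by using DataFrame.clip, which bounds all values to an interval; a sketch on fresh random data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randn(1000, 4))
# clip caps values below -3 at -3 and above 3 at 3, which is
# equivalent to data[np.abs(data) > 3] = np.sign(data) * 3
clipped = data.clip(-3, 3)
print(clipped.abs().values.max() <= 3)
```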

Depending on each value's sign, np.sign(data) produces 1 and -1:

np.sign(data)
       0    1    2    3
0    1.0  1.0  1.0  1.0
1   -1.0 -1.0 -1.0  1.0
2    1.0 -1.0 -1.0  1.0
3    1.0  1.0 -1.0 -1.0
4    1.0 -1.0  1.0  1.0
..   ...  ...  ...  ...
995  1.0  1.0  1.0 -1.0
996  1.0  1.0  1.0  1.0
997 -1.0 -1.0  1.0 -1.0
998 -1.0  1.0 -1.0  1.0
999 -1.0  1.0 -1.0 -1.0

1000 rows × 4 columns

Permutation and random sampling

numpy.random.permutation

A permutation function -> used to randomly reorder a Series or the rows of a DataFrame.

df = pd.DataFrame(np.arange(5*4).reshape(5,4))
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
sampler = np.random.permutation(5)
sampler
array([3, 2, 1, 0, 4])
Permuting rows
  • iloc
  • take (DataFrame.take(indices, axis=0, **kwargs); older versions also had a convert parameter that translated negative indices to positive ones)
df.iloc[sampler]
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19
df.take(sampler)
    0   1   2   3
3  12  13  14  15
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
4  16  17  18  19
The sample method
  • randomly selects n rows
df.sample(n=3)
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
2   8   9  10  11
The replace parameter
choices = pd.Series([5,7,-1,6,4])
choices.sample(n=10, replace=True)  # with replace=False (the default), n cannot exceed the length of the Series
0    5
0    5
3    6
0    5
1    7
3    6
2   -1
1    7
0    5
2   -1
dtype: int64
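A common shuffle idiom built on sample: frac=1 draws every row exactly once in random order, which matches the permutation/take approach above without writing a permutation by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
# frac=1 returns all rows in a random order -> a row shuffle
shuffled = df.sample(frac=1, random_state=0)
print(sorted(shuffled.index))  # every original row appears once
```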

Computing indicator/dummy variables

df = pd.DataFrame({'key': ['b','b','a','c','a','b'], 'data1': np.arange(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
The prefix parameter
  • adds a prefix to the column names
dummies = pd.get_dummies(df['key'],prefix='key')
dummies
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
df_with_dummy = df[['data1']].join(dummies)  # Series has no join method, so [[...]] is used to get a DataFrame
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
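The select-then-join above can also be done in one call: passing the whole DataFrame plus a columns list to get_dummies encodes just those columns (prefixing with the column name by default) and carries the rest through:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': np.arange(6)})
# columns=['key'] dummy-encodes only that column; data1 is kept
# unchanged and the prefix defaults to the column name 'key'
out = pd.get_dummies(df, columns=['key'])
print(list(out.columns))
```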
When a row belongs to multiple categories
mnames=['movie_id','title','genres']
movies = pd.read_table('PythonForDataAnalysis-master/ch07/ml-1m/movies.dat',sep='::',header=None,names=mnames,engine='python')
# engine defaults to 'c'; the python engine can parse more, e.g. the multi-character separator '::'
movies.head()
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
  1. Extract the distinct genre values
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
  2. Build the indicator DataFrame starting from an all-zero DataFrame
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i,indices]=1 # i is the current movie's row position; indices are the genre column positions
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic
   movie_id                               title                        genres  Genre_Animation  Genre_Children's  Genre_Comedy  ...  Genre_Mystery  Genre_Film-Noir  Genre_Western
0         1                    Toy Story (1995)   Animation|Children's|Comedy              1.0               1.0           1.0  ...            0.0              0.0            0.0
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy              0.0               1.0           0.0  ...            0.0              0.0            0.0
2         3             Grumpier Old Men (1995)                Comedy|Romance              0.0               0.0           1.0  ...            0.0              0.0            0.0
3         4            Waiting to Exhale (1995)                  Comedy|Drama              0.0               0.0           1.0  ...            0.0              0.0            0.0
4         5  Father of the Bride Part II (1995)                        Comedy              0.0               0.0           1.0  ...            0.0              0.0            0.0

5 rows × 21 columns

Note: for large data, building a multi-membership indicator this way is quite slow. It is better to use lower-level functions that write directly into a NumPy array, then wrap the result in a DataFrame.
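One vectorized alternative (not shown in the text) is Series.str.get_dummies, which splits each string on a separator and builds the whole indicator matrix in one pass. A sketch on a tiny hand-made frame rather than the full MovieLens file:

```python
import pandas as pd

movies = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)'],
                       'genres': ["Animation|Children's|Comedy",
                                  "Adventure|Children's|Fantasy"]})
# str.get_dummies splits on '|' and returns one 0/1 column per
# distinct genre, with columns in sorted order
dummies = movies['genres'].str.get_dummies('|')
print(dummies.columns.tolist())
```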

A useful recipe for statistics is to combine get_dummies with a discretization function like cut:

values = np.random.rand(10)
values
array([0.72757738, 0.11463144, 0.98715221, 0.67817538, 0.76816537,
       0.34697118, 0.5721833 , 0.24276292, 0.88097025, 0.30371988])
bins = np.arange(0,1,0.2)
bins
array([0. , 0.2, 0.4, 0.6, 0.8])
pd.get_dummies(pd.cut(values,bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]
0           0           0           0           1
1           1           0           0           0
2           0           0           0           0
3           0           0           0           1
4           0           0           0           1
5           0           1           0           0
6           0           0           1           0
7           0           1           0           0
8           0           0           0           0
9           0           1           0           0