Data Cleaning for Python Data Analysis

Latest recommended article published 2024-06-09 20:59:55

Pinned post by MeiNinghang

Categories: python, pandas. Tags: Python, pandas, Python for Data Analysis (2nd edition), data analysis

# Missing values

import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

Check for missing values:

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

# None is also treated as NA
string_data[0] = None

string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
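As a minimal sketch of the same check on a fresh Series (assuming a recent pandas, where `isna` and `notna` are the current spellings of `isnull` and `notnull`; the variable names are illustrative), both `None` and `NaN` are flagged, and the boolean mask can be counted or used as a filter:

```python
import numpy as np
import pandas as pd

# Illustrative Series holding both kinds of missing marker
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

mask = s.isna()              # True where the value is None or NaN
n_missing = int(mask.sum())  # booleans sum as 0/1
present = s[s.notna()]       # keep only the non-missing entries

print(mask.tolist())
print(n_missing)
print(present.tolist())
```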

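df = pd.DataFrame({'dropna': 'drop missing values', 'fillna': 'fill', 'isnull': 'check if NA', 'notnull': 'check if not NA'}, index=['1', '2', '3', '4'])

df.iloc[1]  # the main NA-handling methods

dropna     drop missing values
fillna                    fill
isnull             check if NA
notnull        check if not NA
Name: 2, dtype: object

## Filtering out missing values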

from numpy import nan as NA

data = pd.Series([1,NA,3.5,NA,7])

data.dropna()  # drop missing values

0    1.0
2    3.5
4    7.0
dtype: float64

data[data.notnull()]  # equivalent form

0    1.0
2    3.5
4    7.0
dtype: float64

### DataFrame operations

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])

data

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

cleaned = data.dropna()
data

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

cleaned

	0	1	2
0	1.0	6.5	3.0

#### Dropping rows that are all NA

data.dropna(how='all')

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
3	NaN	6.5	3.0

#### Dropping columns

data[4] = NA
data

	0	1	2	4
0	1.0	6.5	3.0	NaN
1	1.0	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
3	NaN	6.5	3.0	NaN

data.dropna(axis=1, how='all')  # drop columns that are all NA; pass axis by keyword

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0

#### Dropping rows based on how many values are missing

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

	0	1	2
0	-0.088513	NaN	NaN
1	0.637624	NaN	NaN
2	0.054991	NaN	0.795304
3	-1.069859	NaN	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

df.dropna()  # drop every row that contains at least one NA value

	0	1	2
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

df.dropna(thresh=2)  # keep only rows with at least 2 non-NA values

	0	1	2
2	0.054991	NaN	0.795304
3	-1.069859	NaN	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451
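Because `thresh` counts observed values rather than missing ones, it is easy to read backwards. The following sketch, on a small made-up frame, shows that `dropna(thresh=2)` keeps exactly the rows whose non-NA count is at least 2:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 0 has 1 non-NA value, row 1 has 2, row 2 has 3
frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 5.0],
                      [3.0, 4.0, 6.0]])

non_na_per_row = frame.notna().sum(axis=1)  # observed values per row
kept = frame.dropna(thresh=2)               # keep rows with >= 2 non-NA values

print(non_na_per_row.tolist())
print(kept.index.tolist())
```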

### Filling in missing values

df.fillna(0)  # fill with 0

	0	1	2
0	-0.088513	0.000000	0.000000
1	0.637624	0.000000	0.000000
2	0.054991	0.000000	0.795304
3	-1.069859	0.000000	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

df.fillna({1: 0.5, 2: 0})  # use a different fill value for each column

	0	1	2
0	-0.088513	0.500000	0.000000
1	0.637624	0.500000	0.000000
2	0.054991	0.500000	0.795304
3	-1.069859	0.500000	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

_ = df.fillna(0, inplace=True)  # modifies the original object in place
df

	0	1	2
0	-0.088513	0.000000	0.000000
1	0.637624	0.000000	0.000000
2	0.054991	0.000000	0.795304
3	-1.069859	0.000000	-1.572700
4	0.230697	-1.893626	1.205645
5	-0.918601	-0.827842	-1.712528
6	0.916691	-0.238712	0.724451

#### Forward fill

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

	0	1	2
0	-0.251059	-1.375272	0.538068
1	0.329612	1.157921	0.508928
2	-0.118987	NaN	0.694185
3	0.724308	NaN	-0.620832
4	0.045673	NaN	NaN
5	-0.317643	NaN	NaN

df.ffill()  # forward fill; same result as the older df.fillna(method='ffill')

	0	1	2
0	-0.251059	-1.375272	0.538068
1	0.329612	1.157921	0.508928
2	-0.118987	1.157921	0.694185
3	0.724308	1.157921	-0.620832
4	0.045673	1.157921	-0.620832
5	-0.317643	1.157921	-0.620832

df.ffill(limit=2)  # fill at most 2 consecutive NA values in each gap

	0	1	2
0	-0.251059	-1.375272	0.538068
1	0.329612	1.157921	0.508928
2	-0.118987	1.157921	0.694185
3	0.724308	1.157921	-0.620832
4	0.045673	NaN	-0.620832
5	-0.317643	NaN	-0.620832
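Recent pandas releases deprecate `fillna(method=...)` in favor of the dedicated `ffill` and `bfill` methods. A minimal sketch on an illustrative Series, showing forward fill, a capped forward fill, and backward fill side by side:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                # carry the last observed value forward
forward_capped = s.ffill(limit=1)  # fill at most one consecutive NA
backward = s.bfill()               # carry the next observed value backward

print(forward.tolist())
print(forward_capped.isna().tolist())
print(backward.tolist())
```

With `limit=1` only the first NA after an observed value is filled; the rest of the gap stays missing.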

#### Filling with a computed value

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())  # fill with the mean

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

fillna arguments:
Argument Description
value Scalar value or dict-like object to use to fill missing values
method Interpolation; by default 'ffill' if function called with no other arguments
axis Axis to fill on; default axis=0
inplace Modify the calling object without producing a copy
limit For forward and backward filling, maximum number of consecutive periods to fill

# Data transformation

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],  'k2': [1, 1, 2, 3, 3, 4, 4]})

data

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4
6	two	4

### Checking whether a row duplicates an earlier one

data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

#### Removing duplicates

data.drop_duplicates()

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4

#### Restricting to specific columns

data['v1'] = range(7)
data.drop_duplicates(['k1'])

	k1	k2	v1
0	one	1	0
1	two	1	1

#### Keeping the first or the last duplicate

data.drop_duplicates(['k1', 'k2'])  # default keep='first'

	k1	k2	v1
0	one	1	0
1	two	1	1
2	one	2	2
3	two	3	3
4	one	3	4
5	two	4	5

data.drop_duplicates(['k1','k2'],keep = 'last')

	k1	k2	v1
0	one	1	0
1	two	1	1
2	one	2	2
3	two	3	3
4	one	3	4
6	two	4	6
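The `keep` argument also accepts `False`, which flags every member of a duplicate group instead of sparing one. A small sketch on made-up data (the frame below is illustrative, not the one used above):

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2],
                      'v1': range(4)})

# keep=False marks every row that belongs to a duplicate group
all_dupes = frame.duplicated(subset=['k1', 'k2'], keep=False)
# keep='last' retains the final row of each duplicate group
last_kept = frame.drop_duplicates(subset=['k1', 'k2'], keep='last')

print(all_dupes.tolist())
print(last_kept.index.tolist())
```

This is handy for inspecting all conflicting rows before deciding which copy to keep.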

## Transforming data using a function or mapping

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
 'Pastrami', 'corned beef', 'Bacon',
 'pastrami', 'honey ham', 'nova lox'],
 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data

	food	ounces
0	bacon	4.0
1	pulled pork	3.0
2	bacon	12.0
3	Pastrami	6.0
4	corned beef	7.5
5	Bacon	8.0
6	pastrami	3.0
7	honey ham	5.0
8	nova lox	6.0

### Mapping

meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

#### Normalizing case

lowercased = data.food.str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

#### Adding a new column by applying the mapping

data['animal'] = lowercased.map(meat_to_animal)

data

	food	ounces	animal
0	bacon	4.0	pig
1	pulled pork	3.0	pig
2	bacon	12.0	pig
3	Pastrami	6.0	cow
4	corned beef	7.5	cow
5	Bacon	8.0	pig
6	pastrami	3.0	cow
7	honey ham	5.0	pig
8	nova lox	6.0	salmon
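One detail worth knowing: when `map` is given a dict, values that have no key come back as NaN rather than raising, whereas a function that indexes the dict directly would raise `KeyError`. A sketch with a deliberately incomplete, hypothetical lookup table:

```python
import pandas as pd

# Hypothetical lookup table that is missing an entry for 'nova lox'
meat_to_animal_partial = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['Bacon', 'pastrami', 'nova lox'])

# Passing a dict to map: values without a key become NaN
mapped = food.str.lower().map(meat_to_animal_partial)

# A function gives control over the fallback for unknown keys
mapped_default = food.map(
    lambda x: meat_to_animal_partial.get(x.lower(), 'unknown'))

print(mapped.isna().tolist())
print(mapped_default.tolist())
```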

#### Using a function instead

data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

### Replacing values

data = pd.Series([1., -999., 2., -999., -1000., 3.])

data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

data.replace([-999, -1000], [np.nan, 0])  # replace several values at once

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

data.replace({-999: np.nan, -1000: 0})  # equivalent form using a dict

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming axis indexes

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])

#### Mapping

transform = lambda x: x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
New York	8	9	10	11

data.index = data.index.map(transform)  # assign back to modify the original object

data

	one	two	three	four
OHIO	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

### rename

data.rename(index = str.title,columns = str.upper)

	ONE	TWO	THREE	FOUR
Ohio	0	1	2	3
Colo	4	5	6	7
New	8	9	10	11

data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})  # rename only specific labels

	one	two	peekaboo	four
INDIANA	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11

data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # modifies the original object in place

data

	one	two	three	four
INDIANA	0	1	2	3
COLO	4	5	6	7
NEW	8	9	10	11
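To make the copy-versus-in-place distinction concrete, here is a small sketch on an illustrative frame. It mixes a function for the index with a dict for the columns, and checks that the source frame is left untouched when `inplace` is not set:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['ohio', 'colorado'],
                     columns=['one', 'two', 'three'])

# rename returns a new DataFrame; functions and dicts can be mixed
renamed = frame.rename(index=str.title, columns={'three': 'peekaboo'})

print(renamed.index.tolist())
print(renamed.columns.tolist())
print(frame.index.tolist())  # original labels are unchanged
```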

## Discretization and binning

ages = [20,22,25,27,21,23,37,31,61,45,41,32]

bins = [18,25,35,60,100]

cats = pd.cut(ages,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
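Each code is simply a position into the `categories` index, so the two can be combined to recover the interval for any element. A short sketch with a shortened, illustrative list of ages:

```python
import pandas as pd

ages = [20, 27, 61, 45]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# Each code is a position into cats.categories
first_interval = cats.categories[cats.codes[0]]

# Intervals are closed on the right by default, so 25 falls in (18, 25]
edge_code = pd.cut([25], bins).codes[0]

print(cats.codes.tolist())
print(first_interval)
print(edge_code)
```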

cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

##### Choosing which side is closed

pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

##### Naming the bins

group_names = ['Y','YA','WA','S']

pd.cut(ages,bins,labels = group_names)

[Y, Y, Y, YA, Y, ..., YA, S, WA, WA, YA]
Length: 12
Categories (4, object): [Y < YA < WA < S]

#### Equal-width bins

data = np.random.rand(20)
pd.cut(data, 4, precision=2)  # precision limits the decimal digits shown in the bin edges

[(0.49, 0.74], (0.25, 0.49], (0.49, 0.74], (0.25, 0.49], (-3.6e-05, 0.25], ..., (0.49, 0.74], (0.25, 0.49], (0.74, 0.99], (0.74, 0.99], (0.74, 0.99]]
Length: 20
Categories (4, interval[float64]): [(-3.6e-05, 0.25] < (0.25, 0.49] < (0.49, 0.74] < (0.74, 0.99]]

#### qcut

data = np.random.randn(1000)
cats = pd.qcut(data,4)
cats

[(-2.9539999999999997, -0.658], (0.669, 3.876], (-2.9539999999999997, -0.658], (-0.0173, 0.669], (-0.658, -0.0173], ..., (0.669, 3.876], (-0.0173, 0.669], (-0.658, -0.0173], (-0.0173, 0.669], (0.669, 3.876]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -0.658] < (-0.658, -0.0173] < (-0.0173, 0.669] < (0.669, 3.876]]
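The difference between `cut` and `qcut` is easiest to see by counting what lands in each bin. The sketch below uses a seeded NumPy generator (an assumption made only so the run is repeatable) and wraps the results in a Series to count them:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)    # seeded so the sketch is repeatable
values = rng.standard_normal(1000)

equal_width = pd.cut(values, 4)   # 4 bins of equal width
equal_count = pd.qcut(values, 4)  # 4 bins cut at the sample quartiles

width_counts = pd.Series(equal_width).value_counts().tolist()
count_counts = pd.Series(equal_count).value_counts().tolist()

print(width_counts)
print(count_counts)
```

Equal-width bins on normally distributed data are crowded in the middle and sparse in the tails, while quantile bins hold the same number of points each.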

pd.value_counts(cats)

(0.669, 3.876]                   250
(-0.0173, 0.669]                 250
(-0.658, -0.0173]                250
(-2.9539999999999997, -0.658]    250
dtype: int64

#### Custom quantiles

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

[(-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-1.236, -0.0173], ..., (-0.0173, 1.233], (-0.0173, 1.233], (-1.236, -0.0173], (-0.0173, 1.233], (-0.0173, 1.233]]
Length: 1000
Categories (4, interval[float64]): [(-2.9539999999999997, -1.236] < (-1.236, -0.0173] < (-0.0173, 1.233] < (1.233, 3.876]]

### Detecting and filtering outliers

data = pd.DataFrame(np.random.randn(1000, 4))

data.describe()

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	-0.075999	0.020124	-0.002652	-0.024497
std	0.976555	0.955685	0.986620	1.010481
min	-3.955131	-3.403433	-3.214173	-3.405990
25%	-0.752617	-0.605502	-0.651444	-0.677522
50%	-0.077978	0.009801	-0.037225	-0.004654
75%	0.587452	0.686600	0.613769	0.641942
max	3.054668	3.369481	3.081614	2.983911

#### Filtering by condition

col = data[2]

col[np.abs(col) > 3]

824    3.081614
938   -3.214173
Name: 2, dtype: float64

#### Finding the rows that contain them

data[(np.abs(data) > 3).any(axis=1)]

	0	1	2	3
233	-3.955131	0.644402	0.906170	0.871523
268	3.054668	1.215981	0.069664	-0.956292
291	-0.989445	-0.333444	-0.302554	-3.118577
419	-0.334358	-3.403433	-1.207032	-0.812907
528	0.716480	3.369481	-0.067002	-0.062219
682	0.984525	-0.013239	0.127607	-3.405990
769	0.518519	-0.263145	-0.192145	-3.161299
822	-1.235697	-3.032942	0.594252	1.165504
824	0.945307	-2.096176	3.081614	1.307574
832	-3.816917	0.758124	-0.424847	0.016269
938	1.172387	0.353721	-3.214173	-0.088637

#### Capping the outliers

data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

np.sign(data).head()  # np.sign yields 1 or -1 according to the sign of each value (0 for exact zeros)

	0	1	2	3
0	1.0	-1.0	1.0	-1.0
1	-1.0	1.0	1.0	1.0
2	-1.0	1.0	-1.0	1.0
3	-1.0	-1.0	1.0	-1.0
4	-1.0	1.0	-1.0	1.0
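The sign-times-three assignment caps every value to the interval [-3, 3]. As a sketch on a tiny illustrative frame, `DataFrame.clip` should give the same result in a single call:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [0.5, -3.7, 2.0], 'b': [4.2, 1.0, -3.0]})

# Cap by hand: replace out-of-range entries with +/-3 according to sign
manual = frame.copy()
manual[manual.abs() > 3] = np.sign(manual) * 3

# clip bounds every value to the closed interval [-3, 3]
clipped = frame.clip(lower=-3, upper=3)

print(manual['a'].tolist(), manual['b'].tolist())
print(clipped['a'].tolist(), clipped['b'].tolist())
```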

data.iloc[938]  # verify that the outlier in column 2 was capped

0    1.172387
1    0.353721
2   -3.000000
3   -0.088637
Name: 938, dtype: float64

## Permutation and random sampling

df = pd.DataFrame(np.arange(5*4).reshape((5,4)))

sampler = np.random.permutation(5)

sampler

array([3, 2, 1, 0, 4])

df

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19

df.take(sampler)

	0	1	2	3
3	12	13	14	15
2	8	9	10	11
1	4	5	6	7
0	0	1	2	3
4	16	17	18	19

#### Random sampling

df.sample(n=3)  # 3 rows, without replacement

	0	1	2	3
1	4	5	6	7
3	12	13	14	15
0	0	1	2	3
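Since the draws above change on every run, it can help to fix a seed. In this sketch (the seed values are arbitrary), `random_state` makes `sample` reproducible, and `frac=1` returns a shuffled copy of the whole frame, which is a shorthand for the `permutation` plus `take` approach:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# frac=1 draws every row exactly once, i.e. a shuffled copy
shuffled = frame.sample(frac=1, random_state=42)
# The same seed reproduces the same draw
again = frame.sample(frac=1, random_state=42)
# Three distinct rows, without replacement
subset = frame.sample(n=3, random_state=0)

print(shuffled.index.tolist())
print(subset.index.tolist())
```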

#### Sampling with replacement

choicesv = pd.Series([5, 7, -1, 6, 4])

draws = choicesv.sample(n=10, replace=True)  # replace=True allows repeated picks

draws

4    4
0    5
3    6
1    7
1    7
3    6
4    4
1    7
2   -1
1    7
dtype: int64

### Computing Indicator/Dummy Variables

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})

df

	data1	key
0	0	b
1	1	b
2	2	a
3	3	c
4	4	a
5	5	b

pd.get_dummies(df.key)

	a	b	c
0	0	1	0
1	0	1	0
2	1	0	0
3	0	0	1
4	1	0	0
5	0	1	0

dummies = pd.get_dummies(df.key,prefix = 'key')

df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

	data1	key_a	key_b	key_c
0	0	0	1	0
1	1	0	1	0
2	2	1	0	0
3	3	0	0	1
4	4	1	0	0
5	5	0	1	0
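The 0/1 output shown here comes from an older pandas. Recent releases return boolean columns by default, so a sketch that reproduces the integer layout would pass `dtype=int` explicitly (the small frame below is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})

# dtype=int requests 0/1 columns; recent pandas defaults to booleans
dummies = pd.get_dummies(frame['key'], prefix='key', dtype=int)
with_dummies = frame[['data1']].join(dummies)

print(with_dummies.columns.tolist())
print(dummies['key_b'].tolist())
```

Each row has exactly one indicator set, because every row belongs to exactly one category.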

#### Rows that belong to several categories

mnames = ['movie_id', 'title', 'genres']

# engine='python' is needed because the C parser does not support
# multi-character separators; without it pandas falls back to the
# Python engine and emits a ParserWarning.
movies = pd.read_table('/Users/meininghang/Downloads/pydata-book-2nd-edition/datasets/movielens/movies.dat',
                       sep='::', header=None, names=mnames, engine='python')

movies[:10]

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action|Crime|Thriller
6	7	Sabrina (1995)	Comedy|Romance
7	8	Tom and Huck (1995)	Adventure|Children's
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action|Adventure|Thriller

#### Genres

all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy', 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir', 'Western'], dtype=object)

zero_matrix = np.zeros((len(movies),len(genres)))

dummies = pd.DataFrame(zero_matrix,columns = genres)

gen = movies.genres[0]

gen.split('|')

['Animation', "Children's", 'Comedy']

dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2])

for i,gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i,indices] = 1

movies_windic = movies.join(dummies.add_prefix('Genre_'))

movies_windic.iloc[0]

movie_id             1
title                Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation      1
Genre_Children's     1
Genre_Comedy         1
Genre_Adventure      0
Genre_Fantasy        0
Genre_Romance        0
Genre_Drama          0
Genre_Action         0
Genre_Crime          0
Genre_Thriller       0
Genre_Horror         0
Genre_Sci-Fi         0
Genre_Documentary    0
Genre_War            0
Genre_Musical        0
Genre_Mystery        0
Genre_Film-Noir      0
Genre_Western        0
Name: 0, dtype: object

#### get_dummies

np.random.seed(12345)

values = np.random.rand(10)

values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503, 0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

bins = [0,0.2,0.4,0.6,0.8,1]

pd.get_dummies(pd.cut(values,bins))

	(0.0, 0.2]	(0.2, 0.4]	(0.4, 0.6]	(0.6, 0.8]	(0.8, 1.0]
0	0	0	0	0	1
1	0	1	0	0	0
2	1	0	0	0	0
3	0	1	0	0	0
4	0	0	1	0	0
5	0	0	1	0	0
6	0	0	0	0	1
7	0	0	0	1	0
8	0	0	0	1	0
9	0	0	0	1	0
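For the multi-genre case handled with an explicit loop earlier, pandas also offers `Series.str.get_dummies`, which splits on a separator and builds the indicator columns in one step. A sketch on three made-up rows in the same `a|b|c` layout as the genres column:

```python
import pandas as pd

# Illustrative rows using the same '|'-separated layout
genres = pd.Series(['Animation|Comedy', 'Comedy|Romance', 'Action'])

# Split on the separator and build one 0/1 column per distinct label
indicator = genres.str.get_dummies('|').add_prefix('Genre_')

print(indicator.columns.tolist())
print(indicator['Genre_Comedy'].tolist())
```

The columns come out in sorted label order rather than in order of first appearance, which differs from the loop-based version.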

# String manipulation

## Built-in string methods

val = 'a,b,  guido'

val.split(',')  # split on commas

['a', 'b', '  guido']

### Trimming whitespace

pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

### Joining

f,s,t = pieces

f + '::' + s + '::' + t

'a::b::guido'
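Chaining `+` works for three pieces but does not scale. The more idiomatic form is the `join` method of the delimiter string, sketched here with the same `val`:

```python
val = 'a,b,  guido'
pieces = [x.strip() for x in val.split(',')]

# str.join concatenates any number of pieces with the given delimiter
joined = '::'.join(pieces)
print(joined)
```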

### Locating substrings

'guido' in val

True

val.index(',')

1

val.find(':')

-1

### Counting

val.count(',')

2

### Replacing

val.replace(',','::')

'a::b::  guido'

val.replace(',',' ')

'a b   guido'

Methods:
Method Description
count Return the number of non-overlapping occurrences of substring in the string.
endswith Returns True if string ends with suffix.
startswith Returns True if string starts with prefix.
join Use string as delimiter for concatenating a sequence of other strings.
index Return position of first character in substring if found in the string; raises ValueError if not found.
find Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found.
rfind Return position of first character of last occurrence of substring in the string; returns –1 if not found.
replace Replace occurrences of string with another string.
strip, rstrip, lstrip Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element.
split Break string into list of substrings using passed delimiter.
lower Convert alphabet characters to lowercase.
upper Convert alphabet characters to uppercase.
casefold Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.
ljust, rjust Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.

## Regular expressions

import re

text = 'foo    bar\t baz    \tqux'

re.split(r'\s+', text)  # raw string avoids an invalid-escape warning

['foo', 'bar', 'baz', 'qux']

### Compiling a pattern

regex = re.compile(r'\s+')

regex.split(text)

['foo', 'bar', 'baz', 'qux']

### Ways to search

regex.findall(text)

['    ', '\t ', '    \t']

regex.search(text)  # returns only the first match

<_sre.SRE_Match object; span=(3, 7), match='    '>

### Email addresses

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

### Substitution

regex.sub('REDACTED', text)

'Dave REDACTED\nSteve REDACTED\nRob REDACTED\nRyan REDACTED\n'

### Capture groups

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesm@bright.net')

m.groups()

('wesm', 'bright', 'net')

regex.findall(text)  # with groups in the pattern, returns one tuple per match

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

### Using groups with sub

regex.sub(r'Username: \1, Domain: \2, Suffix: \3',text)

'Dave Username: dave, Domain: google, Suffix: com\nSteve Username: steve, Domain: gmail, Suffix: com\nRob Username: rob, Domain: gmail, Suffix: com\nRyan Username: ryan, Domain: yahoo, Suffix: com\n'
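Numbered references such as `\1` get hard to read as patterns grow. As an alternative sketch (the group names `username`, `domain`, and `suffix` are my own labels, not from the original), the same pattern can use named groups, read back through `groupdict` and referenced in replacements as `\g<name>`:

```python
import re

# Same three-part email pattern, with named groups instead of numbered ones
pattern = (r'(?P<username>[A-Z0-9._%+-]+)@'
           r'(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})')
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesm@bright.net')
parts = m.groupdict()  # maps each group name to its matched text

# \g<name> refers to a named group inside a replacement string
rewritten = regex.sub(r'\g<username> at \g<domain>', 'Dave dave@google.com')

print(parts)
print(rewritten)
```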

Methods:
Method Description
findall Return all non-overlapping matching patterns in a string as a list
finditer Like findall, but returns an iterator
match Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None
search Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning
split Break string into pieces at each occurrence of pattern
sub, subn Replace occurrences of pattern in string with replacement expression (optionally limited by the count argument); subn additionally returns the number of substitutions made; use symbols \1, \2, … to refer to match group elements in the replacement string

## Vectorized string functions in pandas

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

data = pd.Series(data)

data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

data.isnull()

Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

### Containment

data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

### Slicing

data.str[:5]

Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
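The vectorized methods also accept regular expressions and skip missing values. A sketch using `str.extract` with the three-group email pattern from the regex section on an illustrative Series: each capture group becomes a column, and the NaN entry propagates as NaN in every column.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com',
                    'Rob': 'rob@gmail.com',
                    'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; missing inputs give NaN in every column
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
domains = parts[1]

print(parts.loc['Dave'].tolist())
print(domains['Rob'])
```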

Methods:
Method Description
cat Concatenate strings element-wise with optional delimiter
contains Return boolean array if each string contains pattern/regex
count Count occurrences of pattern
extract Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
endswith Equivalent to x.endswith(pattern) for each element
startswith Equivalent to x.startswith(pattern) for each element
findall Compute list of all occurrences of pattern/regex for each string
get Index into each element (retrieve i-th element)
isalnum Equivalent to built-in str.isalnum
isalpha Equivalent to built-in str.isalpha
isdecimal Equivalent to built-in str.isdecimal
isdigit Equivalent to built-in str.isdigit
islower Equivalent to built-in str.islower
isnumeric Equivalent to built-in str.isnumeric
isupper Equivalent to built-in str.isupper
join Join strings in each element of the Series with passed separator
len Compute length of each string
lower, upper Convert cases; equivalent to x.lower() or x.upper() for each element
match Use re.match with the passed regular expression on each element, returning True or False for whether the element matches (older pandas versions returned the matched groups)
pad Add whitespace to left, right, or both sides of strings
center Equivalent to pad(side='both')
repeat Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
replace Replace occurrences of pattern/regex with some other string
slice Slice each string in the Series
split Split strings on delimiter or regular expression
strip Trim whitespace from both sides, including newlines
rstrip Trim whitespace on right side
lstrip Trim whitespace on left side
