Chapter 7, Data Cleaning and Preparation: how to clean and process a raw dataset once you have it

Article link: https://blog.csdn.net/Marco458748194811/article/details/109862913

Relearning Python for Data Analysis

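1. Handling missing data

1.1 Detecting missing data

import numpy as np
import pandas as pd
## np.nan is a real missing value; the string 'NaN' would not be detected
strings = pd.Series(['a', 'cd', 'qwer', np.nan])
strings.isnull()

## None is also treated as missing
strings[0] = None
strings.isnull()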
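
As a quick check of which values isnull actually flags, here is a minimal sketch. The sample values are made up for illustration; the point is that the literal string 'NaN' is ordinary text, while np.nan and None both count as missing.

```python
import numpy as np
import pandas as pd

# Illustrative values: a real NaN, a Python None, and the literal string 'NaN'
s = pd.Series(['a', np.nan, None, 'NaN'])

mask = s.isnull()            # True only where the value is np.nan or None
n_missing = int(mask.sum())  # number of missing entries

print(mask.tolist())
print(n_missing)
```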

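1.2 Filtering out missing data

## Filtering missing values from a Series
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

## Filtering missing values from a DataFrame
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()  ## by default, dropna drops every row that contains a missing value

## Drop only the rows or columns that are entirely NA
data.dropna(how = 'all')
data.dropna(axis = 1, how = 'all')

## Keep only rows with at least `thresh` non-NA values (common with time series data)
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df.dropna()
df.dropna(thresh = 2)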
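
Because the random data above makes the result hard to predict, the three dropna variants can be compared on a small fixed table. This is only a sketch; the column names a, b, c and the values are assumptions chosen so each variant keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 3 is entirely NA, rows 0 and 1 are partly NA
df = pd.DataFrame(
    [[1.0, np.nan, np.nan],
     [2.0, np.nan, 3.0],
     [4.0, 5.0, 6.0],
     [np.nan, np.nan, np.nan]],
    columns=['a', 'b', 'c'],
)

complete = df.dropna()              # keep rows with no NA at all
not_empty = df.dropna(how='all')    # drop only rows that are entirely NA
at_least_two = df.dropna(thresh=2)  # keep rows with at least 2 non-NA values

print(complete.index.tolist())
print(not_empty.index.tolist())
print(at_least_two.index.tolist())
```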

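1.3 Filling in missing data

## Fill different columns with different values
df.fillna({1: 0.5, 2: 0})

## Forward fill
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df.ffill(limit = 2)  ## same as fillna(method = 'ffill', limit = 2), which recent pandas versions deprecate

## Fill with the mean
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())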
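
The effect of limit and of per-column fill values is easier to see on fixed data. The sketch below assumes two hypothetical columns x and y; it is an illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with gaps of different lengths
df = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan, 5.0],
                   'y': [np.nan, 2.0, np.nan, 4.0, np.nan]})

# Forward fill, but carry each value forward at most 2 rows
forward = df.ffill(limit=2)

# Per-column fill: a constant for x, the column mean for y
per_column = df.fillna({'x': 0.0, 'y': df['y'].mean()})

print(forward)
print(per_column)
```

In forward, the third gap in x stays NaN because the limit is exhausted, and the leading NaN in y stays because there is nothing before it to carry forward.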

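2. Data transformation

2.1 Removing duplicates

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})

## Check whether each row duplicates an earlier row
data.duplicated()

## Drop duplicate rows
data.drop_duplicates()

## Detect duplicates based on a subset of columns only
data.drop_duplicates(['k1'])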
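
By default both duplicated and drop_duplicates keep the first occurrence. A small sketch on the same data shows the difference when keep = 'last' is passed instead (the keep parameter is not used in the notes above; it is shown here only as an option).

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

dup_mask = data.duplicated()                           # True for rows that repeat an earlier row
first_kept = data.drop_duplicates(['k1'])              # keeps the first row for each k1 value
last_kept = data.drop_duplicates(['k1'], keep='last')  # keeps the last row for each k1 value

print(dup_mask.tolist())
print(first_kept.index.tolist())
print(last_kept.index.tolist())
```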

2.2 Transforming data using a function or mapping

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef'], 'ounces': [4, 3, 12, 6, 7.5]})

meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow'}

## Add a column giving the animal each meat comes from
### Method 1
lowercased = data['food'].str.lower()
data['animal'] = lowercased.map(meat_to_animal)
### Method 2
data['animal'] = data['food'].apply(lambda x: meat_to_animal[x.lower()])
### Method 3
data['animal'] = data['food'].str.lower().replace(meat_to_animal)  ## value replacement, covered in the next section

Note: method 2 could use map instead of apply, but method 1 cannot use apply, because what is passed there is a dict rather than a function, and apply expects a function. map accepts either a function or a dict-like object holding a mapping, so map works in both cases.
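
The two forms that map accepts can be compared side by side. The sketch below uses a shortened version of the mapping plus one made-up food ('nova lox') that is deliberately absent from it, to show how each form handles a missing key.

```python
import pandas as pd

food = pd.Series(['bacon', 'Pastrami', 'nova lox'])   # 'nova lox' has no entry below
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# map with a dict: keys missing from the dict become NaN
via_dict = food.str.lower().map(meat_to_animal)

# map with a function: the function decides what to do with unknown keys
via_func = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(via_dict.tolist())
print(via_func.tolist())
```

Indexing the dict directly inside the lambda, as method 2 does, would raise a KeyError for an unknown food; dict.get with a default is one way to avoid that.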

2.3 Replacing values

data = pd.Series([1., -999., 2., -999., -1000., 3.])
## Method 1: pass a list of values and a list of replacements
data.replace([-999, -1000], [np.nan, 0])
## Method 2: pass a dict
data.replace({-999: np.nan, -1000: 0})

2.4 Renaming axis indexes

data = pd.DataFrame(np.arange(12).reshape((3, 4)), index = ['Ohio', 'Colorado', 'New York'], columns = ['one', 'two', 'three', 'four'])

## Using map
transform = lambda x: x[:4].upper()
data.index = data.index.map(transform)

## Using rename
data.rename(index = str.title, columns = str.upper)
data.rename(index = {'OHIO': 'INDIANA'}, columns = {'three': 'peekaboo'}, inplace = True)  ## modifies the dataset in place

2.5 Discretization and binning

## Bin by explicit bin edges
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

## Some attributes
cats.codes
cats.categories
cats.value_counts()  ## pd.value_counts(cats) also works but is deprecated in recent pandas versions

## Custom bin names
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'oldAged']
pd.cut(ages, bins, labels = group_names)

## Bin by number of bins (equal-width)
data = np.random.rand(20)
pd.cut(data, 4, precision = 2)

## Bin by sample quantiles
data = np.random.randn(1000)
cats = pd.qcut(data, 4)

## Custom quantiles (use qcut, not cut)
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
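
To make the difference between cut and qcut concrete, here is a minimal sketch that counts how many values land in each bin. The right = False option and the use of np.arange(100) as sample data are additions for illustration and do not come from the notes above.

```python
import numpy as np
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)              # intervals like (18, 25]
left_closed = pd.cut(ages, bins, right=False)  # intervals like [18, 25)

# Count members per bin from the integer codes (one code per bin)
counts_right = np.bincount(right_closed.codes, minlength=4).tolist()
counts_left = np.bincount(left_closed.codes, minlength=4).tolist()

# qcut picks the edges from the data so that the bins have equal sizes
quartiles = pd.qcut(np.arange(100), 4)
counts_quartiles = np.bincount(quartiles.codes, minlength=4).tolist()

print(counts_right, counts_left, counts_quartiles)
```

The age 25 moves from the first bin to the second when the closed side switches, which is why the two cut results differ.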

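2.6 Detecting and filtering outliers

data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

## Values in one column whose absolute value exceeds 3
col = data[2]
col[np.abs(col) > 3]

## Rows containing at least one value whose absolute value exceeds 3
data[(np.abs(data) > 3).any(axis = 1)]

## Cap values to the interval -3 to 3
data[np.abs(data) > 3] = np.sign(data) * 3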
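
The capping step can also be written with DataFrame.clip. The sketch below runs both versions on the same seeded data and compares them; clip is an alternative I am swapping in, not the method used in the notes, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Rows that contain at least one value beyond +/-3
outlier_rows = data[(data.abs() > 3).any(axis=1)]

# Approach from the text: overwrite outliers with +/-3 via np.sign
manual = data.copy()
manual[manual.abs() > 3] = np.sign(manual) * 3

# Alternative: clip caps both tails in one call
capped = data.clip(lower=-3, upper=3)

print(len(outlier_rows))
print(capped.abs().max().max())
```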

2.7 Permutation and random sampling

## Permutation
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
df.take(sampler)
## or, equivalently
df.iloc[sampler]

## Random sampling (with replacement)
choices = pd.Series([5, 7, -1, 6, 4])
np.random.seed(1234)
draws = choices.sample(n = 10, replace = True)
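
Both operations can also be expressed through sample alone, with random_state controlling reproducibility per call instead of the global NumPy seed. This is a sketch of that variant; the seed 1234 simply mirrors the one used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# frac=1 without replacement returns every row once, in random order
shuffled = df.sample(frac=1, random_state=1234)

# A random subset of 3 distinct rows
subset = df.sample(n=3, random_state=1234)

# Sampling with replacement allows more draws than there are values
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=1234)

print(shuffled.index.tolist())
print(draws.tolist())
```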

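2.8 Computing indicator / dummy variables

## Derive an indicator matrix of 1s and 0s from one DataFrame column
## (recent pandas versions return True/False unless dtype = int is passed)
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
pd.get_dummies(df['key'])

## Add a prefix to the column names
dummies = pd.get_dummies(df['key'], prefix = 'key')
df_with_dummy = df[['data1']].join(dummies)

Note: a word on why double brackets are used to take the subset of df here.

df['data1'] selects a single column and returns a Series by default. To get a DataFrame back even when selecting a single column, use double brackets: df[['data1']].
The same holds when subsetting with df.loc and df.iloc. Selecting several rows (or columns) naturally yields a DataFrame, but selecting a single row (or column) with df.loc[val] or df.iloc[where] yields a Series, whereas df.loc[[val]] and df.iloc[[where]] yield a DataFrame.
Creating a NumPy array works similarly: np.array(vector) gives a one-dimensional array, and only np.array([vector]) gives a two-dimensional one.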

## Example
>>> df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

>>> type(df['key'])
<class 'pandas.core.series.Series'>

>>> type(df[['key']])
<class 'pandas.core.frame.DataFrame'>

>>> type(df.loc[0])
<class 'pandas.core.series.Series'>

>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>

>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>

>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
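
The NumPy half of the note can be checked the same way. A minimal sketch, using a made-up three-element list:

```python
import numpy as np

values = [1, 2, 3]            # illustrative data

vector = np.array(values)     # one pair of brackets: 1-D array
matrix = np.array([values])   # nested brackets: 2-D array with a single row

print(vector.shape, vector.ndim)
print(matrix.shape, matrix.ndim)
```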

## Read the data (fields in movies.dat are separated by '::')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep = '::', header = None, names = mnames, engine = 'python')
movies[:10]

## Collect all distinct genre values
### Method 1 (keeps the order of first appearance)
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(pd.Series(all_genres))
### Method 2 (np.union1d returns the sorted unique values)
genres = np.array([], dtype = str)
for x in movies['genres']:
    y = x.split('|')
    genres = np.union1d(genres, y)

## Create the indicator DataFrame
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns = genres)

## Fill in 0 or 1
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

## Join
movies_windic = movies.join(dummies.add_prefix('Genre_'))
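
When each row can belong to several categories separated by a delimiter, Series.str.get_dummies builds the same kind of indicator matrix in one call. The sketch below uses a tiny in-memory table with made-up titles in place of movies.dat, so it runs without the dataset; it is a shortcut I am suggesting, not the loop-based method from the notes.

```python
import pandas as pd

# Hypothetical stand-in for the MovieLens table
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Animation|Comedy', 'Adventure', 'Comedy|Romance'],
})

# Split each genres string on '|' and create one indicator column per genre
dummies = movies['genres'].str.get_dummies('|')

# Attach the indicators with a prefix, as in the loop-based version
movies_windic = movies.join(dummies.add_prefix('Genre_'))

print(movies_windic)
```

On large datasets either approach can be slow; the loop above is the more explicit one, while str.get_dummies is more compact.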

A useful recipe: combine get_dummies with a discretization function such as cut

np.random.seed(1234)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

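3. String manipulation

3.1 String object methods

## Split a string and strip the whitespace
val = 'a, b,  guido'
pieces = [x.strip() for x in val.split(',')]

## Concatenate with addition (similar to paste in R)
first, second, third = pieces
first + '::' + second + '::' + third

## Concatenate with the join method
'::'.join(pieces)

## Detect a substring
'guido' in val
val.index(',')  ## raises ValueError if the substring is not found
val.find(':')   ## returns -1 if the substring is not found

## Count the occurrences of a substring
val.count(',')

## Replace one pattern with another
val.replace(',', '::')
val.replace(',', '')  ## can be used to delete a pattern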
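
The difference between index and find when the substring is absent is worth seeing once. A minimal sketch on the same string:

```python
val = 'a, b,  guido'

# Split on commas, then strip the surrounding whitespace from each piece
pieces = [x.strip() for x in val.split(',')]
joined = '::'.join(pieces)

# find signals "not found" with -1 ...
found = val.find(':')

# ... while index raises ValueError instead
try:
    val.index(':')
except ValueError:
    index_failed = True
else:
    index_failed = False

print(pieces, joined, found, index_failed)
```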

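3.2 Regular expressions

## Split a string on one or more whitespace characters
## Whitespace includes tabs, spaces, and newlines
import re
text = 'foo    bar \t baz    \tqux'
re.split(r'\s+', text)

## Compile the regex yourself to get a reusable regex object
regex = re.compile(r'\s+')
regex.split(text)

## Get all substrings that match the regex
regex.findall(text)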
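
For reference, here is what split, findall, and search return on this text. search is not used in the notes above; it is included only to show how to get the position of the first match.

```python
import re

text = 'foo    bar \t baz    \tqux'
regex = re.compile(r'\s+')   # raw string, so \s reaches the regex engine unchanged

parts = regex.split(text)    # the pieces between whitespace runs
gaps = regex.findall(text)   # the whitespace runs themselves
first = regex.search(text)   # match object for the first run only

print(parts)
print(len(gaps))
print(first.span())
```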