Python数据清洗

栗子艾李子

已于 2022-09-16 09:07:53 修改

阅读量3.4k

点赞数 6

分类专栏：数据清洗文章标签： python pandas 数据分析

于 2022-09-15 21:03:58 首次发布

本文链接：https://blog.csdn.net/m0_53317797/article/details/126879093

版权

数据清洗专栏收录该内容

5 篇文章 2 订阅

订阅专栏

实验题目

Python 数据清洗

实验目和要求

1）了解 Python 数据清洗的特点；

2）掌握 Numpy、Pandas、Matplotlib 库的使用

3）能进行与Python 数据清洗有关的操作

实验步骤

1）CSV 文件数据集清洗和分析

2）泰坦尼克号数据集查看和清洗

3）写入数据集，并对数据集进行处理

4）导入外部数据集并对数据集进行清洗和整理

实验内容

1、CSV文件数据集清洗和分析

1）准备CSV文件，并保存为animal.csv

2）使用Pandas读取CSV文件内容，分别实现下述操作：

选择第0—2行数据

跳过第1—3行数据

选择white大于1的数据

选择blue大于1且green大于3的数据

删除有缺失值行的数据

删除有缺失值列的数据

分别用字符串“miss”和5填充有缺失值列的数据

csv文件：

white,red,blue,pink,black,green,animal
1,2,3,4,5,6,cat
2,3,6,NA,2,3,dog
1,2,5,NULL,7,6,pig
2,3,4,NA,2,1,moouse

import pandas as pd
import numpy as np
df = pd.read_csv("animal.csv")
# rows = df[0:3]
# print(rows)
# df  = pd.read_csv("animal.csv" , skiprows=[1,3])
print(df)
print(df[df.white >1])
print(df[(df.blue >1)&(df.green >3)])
print(df.dropna(axis=0))
print(df.dropna(axis=1))
print(df.fillna('miss'))
print(df.fillna('5'))

运行截图：

选择第0—2行数据

跳过第1—3行数据

选择white大于1的数据

选择blue大于1且green大于3的数据

删除有缺失值行的数据

删除有缺失值列的数据

Miss填充：

5填充：

2、泰坦尼克号数据集查看和清洗

1）导入泰坦尼克号数据集train.csv

Passenger,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Braund,male,22,1,0,A/5 21171,7.25,,S
2,1,1,Cumings,female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,Heikkiner,female,26,0,0,STON/02.,7.925,,S
4,1,1,Futrelle,female,35,1,0,113803,53.1,C123,S
5,0,3,Allen,male,35,0,0,373450,8.05,,S
6,0,3,Moran,male,,0,0,330877,8.4583,,Q
7,0,1,McCarthy,male,54,0,0,17463,51.8625,E46,S
8,0,3,Palsson,male,2,3,1,349909,21.075,,S
9,1,3,Johnson,female,27,0,2,347742,11.1333,,S
10,1,2,Nasser,female,14,1,0,237736,30.0708,,C
11,1,3,Sandstrom,female,4,1,1,PP9549,16.7,G6,S
12,1,1,Bonnell,female,58,0,0,113783,26.55,C103,S
13,0,3,Saundercon,male,20,0,0,A/5. 21514,8.05,,S
14,0,3,Andersson,male,39,1,5,347082,31.275,,S
15,0,3,Vestrom,female,14,0,0,350406,7.8542,,S
16,1,2,Hewlett,female,55,0,0,248706,16,,S
17,0,3,Rice,male,2,4,1,382652,29.125,,Q
18,1,2,Williams,male,,0,0,244373,13,,S
19,0,3,Vander,female,31,1,0,345763,18,,S
20,1,3,Masselman,female,,0,0,2649,7.225,,C
21,0,2,Fynney,male,35,0,0,239865,26,,S
22,1,2,Beesley,male,34,0,0,248698,13,D56,S
23,1,3,McGowan,female,15,0,0,330923,8.0292,,Q
24,1,1,Sloper,male,28,0,0,113788,35.5,A6,S
25,0,3,Palsson,female,8,3,1,349909,21.075,,S
26,1,3,Asplund,female,38,1,5,347077,31.3875,,S

2）使用Pandas读取该数据集，并查看数据维度，查看各列缺失值情况

实现代码：

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
train_df = pd.read_csv('train.csv',engine='python')
train_df.head(3)
('===数据维度:{:}行{:}列===\n'.format(train_df.shape[0],train_df.shape[1]))
print('===各类数据类型如下:===')
train_df.info()
print('===各列缺失值情况如下:===')
train_df.info()
print(train_df.columns.values)
print('===查看数据分布:===')
print(train_df.describe())
print('===Cabin列缺失过多，进行删除操作:===')
train_df = train_df.drop("Cabin", axis = 1)
print(train_df.columns.values)
print("对Age进行填充： ")
s =train_df['Age'].value_counts(normalize = True)
missing_age = train_df['Age'].isnull()
train_df.loc[missing_age,'Age'] = np.random.choice(s.index,size = len(train_df[missing_age]),p = s.values)
print(train_df['Age'])
print("中位数填充：")
age_median=train_df.Age.median()
train_df.Age.fillna(age_median, inplace=True)
print(train_df['Age'])
print("====Embarked填充====")
train_df['Embarked'] = train_df['Embarked'].fillna(method = 'ffill')
print(train_df['Embarked'])
# 男性和女性存活情况
print("====性别与存活关系====")
train_df[['Sex','Survived']].groupby('Sex').mean().plot.bar()
survive_sex=train_df.groupby(['Sex','Survived'])['Survived'].count()
print('女性存活率%.2f%%,男性存活率%.2f%%' %
     (survive_sex.loc['female',1]/survive_sex.loc['female'].sum()*100,
      survive_sex.loc['male',1]/survive_sex.loc['male'].sum()*100)
     )
# 查看survived 与 Sex的关系
Survived_Sex = train_df['Sex'].groupby(train_df['Survived'])
print(Survived_Sex.value_counts().unstack())
Survived_Sex.value_counts().unstack().plot(kind = 'bar', stacked = True)
plt.show()
print("====年龄和存活关系====")
average_age=train_df['Age'].mean()
std_age=train_df['Age'].std()
count_nan_age=train_df['Age'].isnull().sum()
rand1=np.random.randint(average_age-std_age,average_age+std_age,size=count_nan_age)
train_df['Age'][np.isnan(train_df['Age'])]=rand1
plt.figure(figsize=(12,5))
plt.subplot(121)
train_df['Age'].hist(bins=70)
plt.xlabel('Age')
plt.ylabel('Num')

plt.subplot(122)
train_df.boxplot(column='Age',showfliers=False)
plt.show()
train_df['Age'].describe()
children_df=train_df[train_df['Age']<=12]
juvenile_df = train_df[(train_df['Age'] > 12) & (train_df['Age'] < 18)]
adults_df=train_df[(train_df['Age']>=18)&(train_df['Age']<65)]
agedness_df=train_df[train_df['Age']>=65]
children_survived_sum = children_df['Survived'].sum()
juvenile_survived_sum = juvenile_df['Survived'].sum()
adults_survived_sum = adults_df['Survived'].sum()
agedness_survived_sum = agedness_df['Survived'].sum()
print('儿童生还人数为%d人，少年生还人数为%d人，成年人生还人数为%d人，老年人生还人数为%d人'%(children_survived_sum, juvenile_survived_sum, adults_survived_sum , agedness_survived_sum))
children_survived_rate = children_df["Survived"].mean()
juvenile_survived_rate = juvenile_df['Survived'].mean()
adults_survived_rate = adults_df['Survived'].mean()
agedness_survived_rate = agedness_df['Survived'].mean()
print('儿童生还率为%f，少年生还率为%f，成年人生还率为%f，老年人生还率为%f'%(children_survived_rate, juvenile_survived_rate, adults_survived_rate, agedness_survived_rate))
x = ['children', 'juvenile', 'adults', 'agedness']
b = [40, 26, 275, 1]
y = [children_survived_rate, juvenile_survived_rate , adults_survived_rate, agedness_survived_rate]
plt.figure(figsize=(12,5))
plt.subplot(121)
x_pos = list(range(len(x)))
rects = plt.bar(x_pos, b, align='center', alpha=0.5)
def autolabel(rects): #显示数据的高度
    for ii,rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., 1.02*height, '%s'% (b[ii]),
            ha='center', va='bottom')
autolabel(rects)
plt.ylabel('Survival num')
plt.xticks(x_pos, x)

plt.subplot(122)
x_pos = list(range(len(x)))
rects = plt.bar(x_pos, y, align='center', alpha=0.5)
def autolabel(rects):
    for ii,rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., 1.02*height, '%s'% (y[ii]),
            ha='center', va='bottom')
autolabel(rects)
plt.ylabel('Survival rate')
plt.xticks(x_pos, x)

plt.show()

实现效果：

各列缺失值：

查看列名、查看数据分布对Cabin列中数据直接删除

填充age列，随机选择适当的年龄进行填充或用中位数填充age列

随即填充：

中位数填充：

embarked列采用“ffill”方法填充

得出生还人数与年龄、性别的关系，并用图形表示

图形表示：
性别与存活率的关系

3.写入数据集，并对数据集进行处理

1)创建数据集，包含4名员工的信息（姓名、年龄、月薪）

2)实现以下功能

将姓名为“李刚”的员工月薪调整为10000

删除姓名为“鸿宇”的员工数据

找出月薪值为5000的员工的数据

3)输出操作结果

import pandas as pd

dt = {'name': ['张燕', '黎明', '李刚', '鸿宇'], 'age': ['21', '23', '34', '24'], 'salary': ['5000', '4000', '7000', '6000']}
df = pd.DataFrame(dt)
# df['salary'][df['name'] == '李刚'] = 10000
print(df)
print(df['salary'])

df['salary'][df['name'] == '李刚'] = 10000
print('调整李刚的月薪后：' )
print(df)
df.drop([3], axis=0, inplace=True)
print('删除鸿宇：' )
print(df)
df = df[df['salary'] == '5000']
print('月薪5000：' )
print(df)
# print(df['salary'])

实验结果：

4、导入外部数据集并对数据集进行清洗和整理

1)导入外部数据集file10.xls，包含商品的信息（价格，成交量，卖家位置）

序号	商品	价格	成交量	卖家位置
1	男鞋2019潮鞋英伦休闲鞋子春夏季板鞋韩	236	4	福建
2	秋季内增高男鞋百搭男士运动鞋低帮休闲鞋子	368	209	浙江
3	军迷T恤特种兵T恤男翻领军长袖春秋外套衣	198	46	北京
4	Dr.Martens马汀博士1461经典3孔马丁单鞋	1299	214	上海
5	奥莱购off ow white blazer aj1 af1 presto hd2017联名男女潮鞋	1090	21	江苏
6	男鞋运动鞋2018新款春秋季男士正品跑步	88	419	江苏
7	夏季中年男鞋透气网面男士爸爸鞋子男30-	171.4	171	浙江
8	特卖骆驼男鞋春季真皮户外工装鞋厚底休闲鞋	869	33	北京
9	舒适铆钉夏季轻便人字拖男鞋凉拖鞋防滑平	148	13	广东
10	老北京工艺布鞋中老年爸爸男军单板鞋开车鞋	99	2	河北

输入代码：

import pandas as pd
import numpy as np
df = pd.read_excel(r"file10.xlsx")
rows = df[0:5]
print("读取其中第0—4行数据：")
print(rows)

cols = df[['商品','价格']]
print("读取“商品”、“价格”列数：")
print(cols.head())
print("读取价格小于200的商品数据：")
print(df[(df['价格']<200)])

df['销售额'] = df['价格'] * df['成交量']
print("读取该XLS文档数据，并创建一个新列“销售列”，该列为价格X成交量:")
print(df.head)

grouped = df['成交量'].groupby(df['卖家位置'])
print("按“卖家位置”进行分组，并计算“成交量”列的平均值:")
print(grouped.mean())

2)实现以下功能

读取其中第0—4行数据