Python数据清洗

  实验题目

Python 数据清洗

实验目和要求

1)了解 Python 数据清洗的特点;

2)掌握 Numpy、Pandas、Matplotlib 库的使用

3)能进行与Python 数据清洗有关的操作

实验步骤

1)CSV 文件数据集清洗和分析

2)泰坦尼克号数据集查看和清洗

3)写入数据集,并对数据集进行处理

4)导入外部数据集并对数据集进行清洗和整理

实验内容

1CSV文件数据集清洗和分析

1)准备CSV文件,并保存为animal.csv

2)使用Pandas读取CSV文件内容,分别实现下述操作:

选择第0—2行数据

跳过第1—3行数据

选择white大于1的数据

选择blue大于1且green大于3的数据

删除有缺失值行的数据

删除有缺失值列的数据

分别用字符串“miss”和5填充有缺失值列的数据

csv文件:

white,red,blue,pink,black,green,animal
1,2,3,4,5,6,cat
2,3,6,NA,2,3,dog
1,2,5,NULL,7,6,pig
2,3,4,NA,2,1,moouse
import pandas as pd
import numpy as np
df = pd.read_csv("animal.csv")
# rows = df[0:3]
# print(rows)
# df  = pd.read_csv("animal.csv" , skiprows=[1,3])
print(df)
print(df[df.white >1])
print(df[(df.blue >1)&(df.green >3)])
print(df.dropna(axis=0))
print(df.dropna(axis=1))
print(df.fillna('miss'))
print(df.fillna('5'))

运行截图:

选择第0—2行数据

跳过第1—3行数据

 选择white大于1的数据

 选择blue大于1且green大于3的数据

 删除有缺失值行的数据

 删除有缺失值列的数据

Miss填充:

 5填充:

 

2、泰坦尼克号数据集查看和清洗

1)导入泰坦尼克号数据集train.csv

Passenger,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Braund,male,22,1,0,A/5 21171,7.25,,S
2,1,1,Cumings,female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,Heikkiner,female,26,0,0,STON/02.,7.925,,S
4,1,1,Futrelle,female,35,1,0,113803,53.1,C123,S
5,0,3,Allen,male,35,0,0,373450,8.05,,S
6,0,3,Moran,male,,0,0,330877,8.4583,,Q
7,0,1,McCarthy,male,54,0,0,17463,51.8625,E46,S
8,0,3,Palsson,male,2,3,1,349909,21.075,,S
9,1,3,Johnson,female,27,0,2,347742,11.1333,,S
10,1,2,Nasser,female,14,1,0,237736,30.0708,,C
11,1,3,Sandstrom,female,4,1,1,PP9549,16.7,G6,S
12,1,1,Bonnell,female,58,0,0,113783,26.55,C103,S
13,0,3,Saundercon,male,20,0,0,A/5. 21514,8.05,,S
14,0,3,Andersson,male,39,1,5,347082,31.275,,S
15,0,3,Vestrom,female,14,0,0,350406,7.8542,,S
16,1,2,Hewlett,female,55,0,0,248706,16,,S
17,0,3,Rice,male,2,4,1,382652,29.125,,Q
18,1,2,Williams,male,,0,0,244373,13,,S
19,0,3,Vander,female,31,1,0,345763,18,,S
20,1,3,Masselman,female,,0,0,2649,7.225,,C
21,0,2,Fynney,male,35,0,0,239865,26,,S
22,1,2,Beesley,male,34,0,0,248698,13,D56,S
23,1,3,McGowan,female,15,0,0,330923,8.0292,,Q
24,1,1,Sloper,male,28,0,0,113788,35.5,A6,S
25,0,3,Palsson,female,8,3,1,349909,21.075,,S
26,1,3,Asplund,female,38,1,5,347077,31.3875,,S
 

2)使用Pandas读取该数据集,并查看数据维度,查看各列缺失值情况

实现代码:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
train_df = pd.read_csv('train.csv',engine='python')
train_df.head(3)
('===数据维度:{:}行{:}列===\n'.format(train_df.shape[0],train_df.shape[1]))
print('===各类数据类型如下:===')
train_df.info()
print('===各列缺失值情况如下:===')
train_df.info()
print(train_df.columns.values)
print('===查看数据分布:===')
print(train_df.describe())
print('===Cabin列缺失过多,进行删除操作:===')
train_df = train_df.drop("Cabin", axis = 1)
print(train_df.columns.values)
print("对Age进行填充: ")
s =train_df['Age'].value_counts(normalize = True)
missing_age = train_df['Age'].isnull()
train_df.loc[missing_age,'Age'] = np.random.choice(s.index,size = len(train_df[missing_age]),p = s.values)
print(train_df['Age'])
print("中位数填充:")
age_median=train_df.Age.median()
train_df.Age.fillna(age_median, inplace=True)
print(train_df['Age'])
print("====Embarked填充====")
train_df['Embarked'] = train_df['Embarked'].fillna(method = 'ffill')
print(train_df['Embarked'])
# 男性和女性存活情况
print("====性别与存活关系====")
train_df[['Sex','Survived']].groupby('Sex').mean().plot.bar()
survive_sex=train_df.groupby(['Sex','Survived'])['Survived'].count()
print('女性存活率%.2f%%,男性存活率%.2f%%' %
     (survive_sex.loc['female',1]/survive_sex.loc['female'].sum()*100,
      survive_sex.loc['male',1]/survive_sex.loc['male'].sum()*100)
     )
# 查看survived 与 Sex的关系
Survived_Sex = train_df['Sex'].groupby(train_df['Survived'])
print(Survived_Sex.value_counts().unstack())
Survived_Sex.value_counts().unstack().plot(kind = 'bar', stacked = True)
plt.show()
print("====年龄和存活关系====")
average_age=train_df['Age'].mean()
std_age=train_df['Age'].std()
count_nan_age=train_df['Age'].isnull().sum()
rand1=np.random.randint(average_age-std_age,average_age+std_age,size=count_nan_age)
train_df['Age'][np.isnan(train_df['Age'])]=rand1
plt.figure(figsize=(12,5))
plt.subplot(121)
train_df['Age'].hist(bins=70)
plt.xlabel('Age')
plt.ylabel('Num')

plt.subplot(122)
train_df.boxplot(column='Age',showfliers=False)
plt.show()
train_df['Age'].describe()
children_df=train_df[train_df['Age']<=12]
juvenile_df = train_df[(train_df['Age'] > 12) & (train_df['Age'] < 18)]
adults_df=train_df[(train_df['Age']>=18)&(train_df['Age']<65)]
agedness_df=train_df[train_df['Age']>=65]
children_survived_sum = children_df['Survived'].sum()
juvenile_survived_sum = juvenile_df['Survived'].sum()
adults_survived_sum = adults_df['Survived'].sum()
agedness_survived_sum = agedness_df['Survived'].sum()
print('儿童生还人数为%d人,少年生还人数为%d人,成年人生还人数为%d人,老年人生还人数为%d人'%(children_survived_sum, juvenile_survived_sum, adults_survived_sum , agedness_survived_sum))
children_survived_rate = children_df["Survived"].mean()
juvenile_survived_rate = juvenile_df['Survived'].mean()
adults_survived_rate = adults_df['Survived'].mean()
agedness_survived_rate = agedness_df['Survived'].mean()
print('儿童生还率为%f,少年生还率为%f,成年人生还率为%f,老年人生还率为%f'%(children_survived_rate, juvenile_survived_rate, adults_survived_rate, agedness_survived_rate))
x = ['children', 'juvenile', 'adults', 'agedness']
b = [40, 26, 275, 1]
y = [children_survived_rate, juvenile_survived_rate , adults_survived_rate, agedness_survived_rate]
plt.figure(figsize=(12,5))
plt.subplot(121)
x_pos = list(range(len(x)))
rects = plt.bar(x_pos, b, align='center', alpha=0.5)
def autolabel(rects): #显示数据的高度
    for ii,rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., 1.02*height, '%s'% (b[ii]),
            ha='center', va='bottom')
autolabel(rects)
plt.ylabel('Survival num')
plt.xticks(x_pos, x)

plt.subplot(122)
x_pos = list(range(len(x)))
rects = plt.bar(x_pos, y, align='center', alpha=0.5)
def autolabel(rects):
    for ii,rect in enumerate(rects):
        height = rect.get_height()
        plt.text(rect.get_x()+rect.get_width()/2., 1.02*height, '%s'% (y[ii]),
            ha='center', va='bottom')
autolabel(rects)
plt.ylabel('Survival rate')
plt.xticks(x_pos, x)

plt.show()

实现效果:

 各列缺失值:

 查看列名、查看数据分布对Cabin列中数据直接删除

 填充age列,随机选择适当的年龄进行填充或用中位数填充age列

随即填充:

 中位数填充:

 embarked列采用“ffill”方法填充

 得出生还人数与年龄、性别的关系,并用图形表示

图形表示:
性别与存活率的关系

 

3.写入数据集,并对数据集进行处理

1)创建数据集,包含4名员工的信息(姓名、年龄、月薪)

2)实现以下功能

将姓名为“李刚”的员工月薪调整为10000

删除姓名为“鸿宇”的员工数据

找出月薪值为5000的员工的数据

    3)输出操作结果

import pandas as pd

dt = {'name': ['张燕', '黎明', '李刚', '鸿宇'], 'age': ['21', '23', '34', '24'], 'salary': ['5000', '4000', '7000', '6000']}
df = pd.DataFrame(dt)
# df['salary'][df['name'] == '李刚'] = 10000
print(df)
print(df['salary'])

df['salary'][df['name'] == '李刚'] = 10000
print('调整李刚的月薪后:' )
print(df)
df.drop([3], axis=0, inplace=True)
print('删除鸿宇:' )
print(df)
df = df[df['salary'] == '5000']
print('月薪5000:' )
print(df)
# print(df['salary'])

实验结果:

 4、导入外部数据集并对数据集进行清洗和整理 

1)导入外部数据集file10.xls,包含商品的信息(价格,成交量,卖家位置)

序号商品价格成交量卖家位置
1男鞋2019潮鞋英伦休闲鞋子春夏季板鞋韩2364福建
2秋季内增高男鞋百搭男士运动鞋低帮休闲鞋子368209浙江
3军迷T恤特种兵T恤男翻领军长袖春秋外套衣19846北京
4Dr.Martens马汀博士1461经典3孔马丁单鞋1299214上海
5奥莱购off ow white blazer aj1 af1 presto hd2017联名男女潮鞋109021江苏
6男鞋运动鞋2018新款春秋季男士正品跑步88419江苏
7夏季中年男鞋透气网面男士爸爸鞋子男30-171.4171浙江
8特卖骆驼男鞋春季真皮户外工装鞋厚底休闲鞋86933北京
9舒适 铆钉夏季轻便人字拖男鞋凉拖鞋防滑平14813广东
10老北京工艺布鞋中老年爸爸男军单板鞋开车鞋992河北

输入代码:

import pandas as pd
import numpy as np
df = pd.read_excel(r"file10.xlsx")
rows = df[0:5]
print("读取其中第0—4行数据:")
print(rows)

cols = df[['商品','价格']]
print("读取“商品”、“价格”列数:")
print(cols.head())
print("读取价格小于200的商品数据:")
print(df[(df['价格']<200)])

df['销售额'] = df['价格'] * df['成交量']
print("读取该XLS文档数据,并创建一个新列“销售列”,该列为价格X成交量:")
print(df.head)

grouped = df['成交量'].groupby(df['卖家位置'])
print("按“卖家位置”进行分组,并计算“成交量”列的平均值:")
print(grouped.mean())

2)实现以下功能

读取其中第0—4行数据

 读取“商品”、“价格”列数

 读取价格小于200的商品数据

 读取该XLS文档数据,并创建一个新列“销售列”,该列为价格X成交量

 按“卖家位置”进行分组,并计算“成交量”列的平均值

  • 6
    点赞
  • 62
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值