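This part is fairly fragmented and light on code. The focus is on walking through the individual steps of data cleaning and feature processing, to lay the groundwork for the data analysis that follows.
Mind map
Some code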
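import numpy as np
import pandas as pd
# Load the training set and take a first look
df = pd.read_csv('D:\\pythondata\\train.csv')
print(df.head(3))
print(df.info())
# Number of missing values per column
print(df.isnull().sum())
# Rows that are exact duplicates of an earlier row
print(df[df.duplicated()])
# Split Age into 5 equal-width bins
df['AgeBand'] = pd.cut(df['Age'], 5, labels=['1', '2', '3', '4', '5'])
print(df.head())
# Split Age at the 10%, 30%, 50%, 70% and 90% quantiles (quantile edges need pd.qcut, not pd.cut)
df['AgeBand'] = pd.qcut(df['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=['1', '2', '3', '4', '5'])
print(df.head())
# Split Age at explicit edges: (0,5], (5,15], (15,30], (30,50], (50,80]
df['AgeBand'] = pd.cut(df['Age'], [0, 5, 15, 30, 50, 80], labels=['1', '2', '3', '4', '5'])
print(df.head())
# Frequency of each category in the text columns
print(df['Sex'].value_counts())
print(df['Cabin'].value_counts())
print(df['Embarked'].value_counts())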
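To make the difference between the two binning functions concrete, here is a small sketch on made-up ages (the `ages` values are an assumption, not taken from train.csv). `pd.cut` with an integer splits the value range into equal-width intervals, while `pd.qcut` with an integer tries to put roughly the same number of rows in each bin.

```python
import pandas as pd

# Hypothetical ages, only for illustration (not from train.csv)
ages = pd.Series([2, 8, 15, 22, 30, 38, 45, 52, 60, 80])
labels = ["1", "2", "3", "4", "5"]

# Equal-width bins: the range of ages is cut into 5 intervals of the same length
equal_width = pd.cut(ages, 5, labels=labels)

# Equal-frequency bins: edges are placed at the 20/40/60/80% quantiles
equal_freq = pd.qcut(ages, 5, labels=labels)

# Compare how many rows land in each bin
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With skewed data the equal-width bins can end up very uneven, which is usually the reason to reach for `pd.qcut`.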
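import numpy as np
import pandas as pd
df = pd.read_csv('D:\\pythondata\\train.csv')
# Distinct values in Sex, and how many distinct values there are
print(df['Sex'].unique())
print(df['Sex'].nunique())
# Encode Sex as a number, option 1: replace each label with its code
df['Sex_num'] = df['Sex'].replace(['male', 'female'], [1, 2])
print(df.head(3))
# Option 2: map the labels through a dict
df['Sex_num1'] = df['Sex'].map({'male': 1, 'female': 2})
print(df.head(3))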
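On the Sex column both approaches give the same result, but they behave differently when a value is not covered by the mapping. A minimal sketch, assuming a made-up series that contains an extra label `'unknown'` (not present in train.csv):

```python
import pandas as pd

# Hypothetical labels; 'unknown' is added only to show the difference
sex = pd.Series(["male", "female", "unknown"])

# replace: values that are not listed are left as they are
replaced = sex.replace(["male", "female"], [1, 2])

# map: values that are not in the dict become NaN
mapped = sex.map({"male": 1, "female": 2})

print(replaced.tolist())
print(mapped.tolist())
```

So `map` is the stricter choice: anything unexpected shows up as a missing value instead of slipping through unchanged.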
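Key points
(1) Two ways to find missing values: 1. df.info(): compare each column's non-null count with the total number of rows to see whether anything is missing.
2. .isnull() tells you whether each value is missing, and .isnull().sum() counts the missing values in each column.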
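A quick sketch of both checks on a toy frame (the `toy` data is made up and only borrows column names from the dataset). `count()` returns the same non-null numbers that `df.info()` prints, so the two approaches can be compared directly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, standing in for a few columns of train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Approach 1: non-null count per column versus the number of rows
total_rows = len(toy)
non_null_counts = toy.count()
missing_from_counts = total_rows - non_null_counts

# Approach 2: boolean mask of missing cells, summed per column
missing_per_column = toy.isnull().sum()

print(missing_from_counts)
print(missing_per_column)
```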
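(2) Handling missing values
1. dropna(axis=): axis=0 drops the rows that contain missing values
axis=1 drops the columns that contain missing values (usually avoided, because it throws away whole variables)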
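import numpy as np
import pandas as pd
df = pd.read_csv('D:\\pythondata\\train.csv')
print(df.head(3))
# Original data
print(df.info())
# axis=0: drop rows that contain missing values
print(df.dropna(axis=0).info())
# axis=1: drop columns that contain missing values
print(df.dropna(axis=1).info())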
Result:
#Original data
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
#axis=0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
PassengerId 183 non-null int64
Survived 183 non-null int64
Pclass 183 non-null int64
Name 183 non-null object
Sex 183 non-null object
Age 183 non-null float64
SibSp 183 non-null int64
Parch 183 non-null int64
Ticket 183 non-null object
Fare 183 non-null float64
Cabin 183 non-null object
Embarked 183 non-null object
#axis=1
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
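2. dropna also takes a how option: dropna(axis=0, how='any') or dropna(axis=0, how='all')
'any' drops a row as soon as it contains one missing value; 'all' drops a row only when every value in it is missing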
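The two settings side by side, on a made-up frame where row 0 is complete, row 1 is partly missing and row 2 is entirely missing (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: complete, partly missing, entirely missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": ["C85", np.nan, np.nan],
    "Fare": [7.25, 8.05, np.nan],
})

# how='any': a single missing value is enough to drop the row
dropped_any = toy.dropna(axis=0, how="any")

# how='all': only rows where every value is missing are dropped
dropped_all = toy.dropna(axis=0, how="all")

print(dropped_any.index.tolist())
print(dropped_all.index.tolist())
```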
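3. dropna(axis=0, thresh=3)
thresh=3 keeps a row only when it has at least 3 non-NA values. This is useful when missing values are widespread and you want to hold on to as much usable data as possible. Note that recent pandas versions do not accept how and thresh in the same call, so pass only one of them.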
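A small sketch of the threshold rule, assuming a made-up frame with four columns so that rows with 4, 2, 0 and 3 non-NA values can be compared:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 4, 2, 0 and 3 non-NA values respectively
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Cabin": ["C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 8.05, np.nan, 53.1],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Keep only the rows that have at least 3 non-NA values
kept = toy.dropna(axis=0, thresh=3)

# Non-NA count per row, to check against the threshold
print(toy.notna().sum(axis=1).tolist())
print(kept.index.tolist())
```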
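[Reference]
.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
axis: default 0. 0 drops rows, 1 drops columns
how: default 'any'. 'any' drops every row/column that contains a missing value; 'all' drops only the rows/columns that are entirely missing
thresh: int. Keep the rows/columns that have at least this many non-NaN values
subset: only look for missing values in the given labels, for example when dropping rows, the columns to check
inplace: default False, which returns the filtered data as a new object; True modifies the original data directly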
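subset and inplace are the two parameters not shown in the code so far, so here is a minimal sketch of both. The frame is made up, and `working` is just a hypothetical name for a copy that gets modified in place.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in different columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# subset: only missing values in Age decide which rows are dropped
by_age = toy.dropna(subset=["Age"])

# inplace=False (the default) returns a new frame and leaves toy untouched
rows_after_default = len(toy)

# inplace=True changes the frame itself and returns None
working = toy.copy()
returned = working.dropna(subset=["Embarked"], inplace=True)

print(by_age.index.tolist())
print(rows_after_default)
print(working.index.tolist(), returned)
```

Because `inplace=True` returns None, writing `df = df.dropna(inplace=True)` would replace the frame with None; use one style or the other, not both.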
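4. fillna(value)
Fills in missing values, either with one value everywhere or with different values for different columns
fillna()
Parameters:
inplace: True modifies the data itself; False leaves the data unchanged and returns a filled copy
method: 'pad' / 'ffill': fill with the previous non-missing value
'backfill' / 'bfill': fill a missing value with the next non-missing value
None: replace with the specified value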
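The fill options can be sketched on a toy frame as follows. The data is made up, and the fill values (0, the column mean, `'Unknown'`) are only example choices. For forward and backward filling the sketch calls `.ffill()` and `.bfill()` directly, since the `method` argument of `fillna` is deprecated in newer pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in both columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Same value everywhere in one column
age_zero = toy["Age"].fillna(0)

# Different values per column, given as a dict
per_column = toy.fillna({"Age": toy["Age"].mean(), "Cabin": "Unknown"})

# Forward fill: copy the previous non-missing value downwards
forward = toy["Age"].ffill()

# Backward fill: copy the next non-missing value upwards;
# the last row stays missing because nothing follows it
backward = toy["Age"].bfill()

print(per_column)
print(forward.tolist())
print(backward.tolist())
```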