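# Missing data is so common that many pandas methods automatically filter it out.
# Although pandas provides functions for filtering out missing values, relying on them is not
# really recommended, because data should not be discarded lightly. The usual practice is to
# fill in a computed default value instead.
mean_age = titanic_survival['Age'].mean()
print(mean_age)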
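To make the "fill in a computed default value" idea concrete, here is a minimal sketch. It assumes a small hypothetical frame `toy` standing in for `titanic_survival`, and the column name `Age_filled` is made up for illustration; only `Series.mean()` and `Series.fillna()` are used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'Age' column of titanic_survival
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 36.0]})

# Series.mean() skips NaN by default, so this averages 22, 26 and 36
mean_age = toy["Age"].mean()

# Replace each missing age with the computed mean instead of dropping the row
toy["Age_filled"] = toy["Age"].fillna(mean_age)

print(mean_age)
print(toy)
```

Whether the mean is a sensible default depends on the column; a median or a per-group value may suit skewed data better.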
'''
index tells the method which column to group by
values is the column that we want to apply the calculation to
aggfunc specifies the calculation we want to perform
'''
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival)

# Note: if the aggfunc argument is omitted, it defaults to the mean
avg_age = titanic_survival.pivot_table(index='Pclass', values='Age')
print(avg_age)

age = titanic_survival.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)
print(age)
                Fare  Survived
Embarked
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
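The table above looks like per-port totals of `Fare` and `Survived`, but the call that produced it is not shown in these notes. The sketch below is therefore an assumption about the kind of call involved, run on a hypothetical frame `toy` rather than on `titanic_survival`: it passes a list of columns to `values` and `"sum"` to `aggfunc`.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset
toy = pd.DataFrame({
    "Embarked": ["C", "Q", "S", "S", "C"],
    "Fare": [10.0, 5.0, 7.5, 2.5, 30.0],
    "Survived": [1, 0, 1, 0, 1],
})

# Sum both columns within each port of embarkation
totals = toy.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(totals)

# The same totals expressed with groupby, for comparison
same = toy.groupby("Embarked")[["Fare", "Survived"]].sum()
print(same)
```

Both forms group by `Embarked`; `pivot_table` is convenient when the aggregation is switched often, while `groupby` reads more directly for a single fixed aggregation.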
# specifying axis = 1 or axis = 'columns' will drop any columns that have null values
drop_col = titanic_survival.dropna(axis=1)
print(drop_col.head())

# Drop a row (sample) if its Age or Sex value is missing
new_data = titanic_survival.dropna(axis=0, subset=['Age','Sex'])
print(new_data.head())

# The corresponding fillna function fills in null values instead of dropping them
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex  SibSp  Parch  \
0                            Braund, Mr. Owen Harris    male      1      0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female      1      0
2                             Heikkinen, Miss. Laina  female      0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female      1      0
4                           Allen, Mr. William Henry    male      0      0

             Ticket     Fare
0         A/5 21171   7.2500
1          PC 17599  71.2833
2  STON/O2. 3101282   7.9250
3            113803  53.1000
4            373450   8.0500
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S
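The difference between the two `dropna` calls is easier to see on a small frame. The sketch below uses a hypothetical frame `toy` (its values are made up): `axis=1` removes every column that contains a null, while `axis=0` with `subset` removes only the rows that are null in the listed columns and leaves nulls in other columns alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in Sex, Age and Cabin
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Sex": ["male", None, "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Column-wise: every column containing at least one null is dropped
no_null_cols = toy.dropna(axis=1)
print(no_null_cols.columns.tolist())

# Row-wise, restricted to Age and Sex: nulls in Cabin do not matter here
complete_rows = toy.dropna(axis=0, subset=["Age", "Sex"])
print(complete_rows)
```

Neither call modifies `toy`; each returns a new frame.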
   PassengerId  Survived  Pclass                                  Name   Sex  \
0          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male
1          852         0       3                   Svensson, Mr. Johan  male
2          494         0       1               Artagaveytia, Mr. Ramon  male
3           97         0       1             Goldschmidt, Mr. George B  male
4          117         0       3                  Connors, Mr. Patrick  male

    Age  SibSp  Parch    Ticket     Fare Cabin Embarked
0  80.0      0      0     27042  30.0000   A23        S
1  74.0      0      0    347060   7.7750   NaN        S
2  71.0      0      0  PC 17609  49.5042   NaN        C
3  71.0      0      0  PC 17754  34.6542    A5        C
4  70.5      0      0    370369   7.7500   NaN        Q
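The last table above is ordered by `Age` from oldest to youngest and carries a fresh 0, 1, 2, ... index, but the code that produced it does not appear in these notes. One plausible way to get output of that shape, shown here purely as an assumption on a hypothetical frame `toy`, is `sort_values` followed by `reset_index(drop=True)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one age is missing
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age": [22.0, 80.0, np.nan, 71.0],
})

# Sort by Age, oldest first; rows with a missing Age go last by default
by_age = toy.sort_values("Age", ascending=False)
print(by_age)

# Discard the old row labels and renumber from 0
by_age_reindexed = by_age.reset_index(drop=True)
print(by_age_reindexed)
```

Without `drop=True`, `reset_index` would keep the old labels as a new column named `index`.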
3.8 Custom functions
# Define a new function that returns the data in the hundredth row
# (label 99, since the default index starts at 0)
def hundredth_data(column):
    data = column.loc[99]
    return data

data = titanic_survival.apply(hundredth_data)
print(data)

# Count the samples with missing values in each column
def null_count(column):
    col_null = pd.isnull(column)
    null = column[col_null]
    return len(null)

count = titanic_survival.apply(null_count)
print('----------')
print(count)
print(help(pd.isnull))
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object
----------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Help on function isna in module pandas.core.dtypes.missing:

isna(obj)
    Detect missing values for an array-like object.

    This function takes a scalar or array-like object and indictates
    whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
    in object arrays, ``NaT`` in datetimelike).

    Parameters
    ----------
    obj : scalar or array-like
        Object to check for null or missing values.

    Returns
    -------
    bool or array-like of bool
        For scalar input, returns a scalar boolean.
        For array input, returns an array of boolean indicating whether each
        corresponding element is missing.

    See Also
    --------
    notna : boolean inverse of pandas.isna.
    Series.isna : Detetct missing values in a Series.
    DataFrame.isna : Detect missing values in a DataFrame.
    Index.isna : Detect missing values in an Index.

    Examples
    --------
    Scalar arguments (including strings) result in a scalar boolean.

    >>> pd.isna('dog')
    False

    >>> pd.isna(np.nan)
    True

    ndarrays result in an ndarray of booleans.

    >>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
    >>> array
    array([[ 1., nan,  3.],
           [ 4.,  5., nan]])
    >>> pd.isna(array)
    array([[False,  True, False],
           [False, False,  True]])

    For indexes, an ndarray of booleans is returned.

    >>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
    ...                           "2017-07-08"])
    >>> index
    DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
                  dtype='datetime64[ns]', freq=None)
    >>> pd.isna(index)
    array([False, False,  True, False])

    For Series and DataFrame, the same type is returned, containing booleans.

    >>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
    >>> df
         0     1    2
    0  ant   bee  cat
    1  dog  None  fly
    >>> pd.isna(df)
           0      1      2
    0  False  False  False
    1  False   True  False

    >>> pd.isna(df[1])
    0    False
    1     True
    Name: 1, dtype: bool

None
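The per-column null count computed with a custom function and `apply` can also be obtained by chaining `isna()` and `sum()`. The sketch below compares the two on a hypothetical frame `toy`; `null_count` is restated here only so the block is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: no nulls in Name, two in Age, three in Cabin
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

def null_count(column):
    # Keep only the null entries of the column and count them
    return len(column[pd.isnull(column)])

# Custom function applied to every column
via_apply = toy.apply(null_count)

# Built-in route: True counts as 1 when a boolean frame is summed
via_sum = toy.isna().sum()

print(via_apply)
print(via_sum)
```

The custom function remains useful as a template when the per-column logic is more involved than a simple count.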
3.9 Row-wise iteration and data transformation
ages = titanic_survival['Age']
print(ages.head())

def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    else:
        return 'Third Class'

# In apply, axis=1 means the function is evaluated on each row, i.e. the data is iterated row by row
result = titanic_survival.apply(which_class, axis=1)
print(result.head())

def age_class(row):
    age = row['Age']
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return 'Young'
    elif age < 40:
        return 'Middle-aged'
    else:
        return 'Elderly'

age_lable = titanic_survival.apply(age_class, axis=1)
print(age_lable.tail())

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
0    Third Class
1    First Class
2    Third Class
3    First Class
4    Third Class
dtype: object
886    Middle-aged
887    Middle-aged
888        Unknown
889    Middle-aged
890    Middle-aged
dtype: object
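Row-wise `apply` calls a Python function once per row, which can be slow on large frames. As a rough sketch of an alternative, the same kind of age bucketing can be written with `numpy.select`, which evaluates each condition on the whole column at once. The thresholds 18 and 40 come from the function above; the series `ages` and the label strings are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including boundary values and one missing entry
ages = pd.Series([5.0, 18.0, 39.9, 40.0, np.nan])

# Conditions are checked in order; the first one that holds decides the label
conditions = [ages.isna(), ages < 18, ages < 40]
choices = ["Unknown", "under 18", "18 to 39"]

# Anything matching none of the conditions falls through to the default
labels = pd.Series(
    np.select(conditions, choices, default="40 and over"),
    index=ages.index,
)
print(labels)
```

The order of the conditions plays the role of the `if`/`elif` chain, so the missing-value check has to come first here as well.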
3.10 Using grouping to compute relationships in the data
# Add a new column to the DataFrame
titanic_survival['age_label'] = age_lable
result = titanic_survival.pivot_table(index='age_label', values='Survived')
print(result)
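Because `Survived` holds 0 or 1, its mean within each group is that group's survival rate, and since `aggfunc` defaults to the mean, the pivot table above reports exactly that. The sketch below checks this on a hypothetical frame `toy` (labels `A` and `B` are placeholders) and shows the equivalent `groupby` form.

```python
import pandas as pd

# Hypothetical toy data: group A has 1 survivor of 2, group B has 2 of 3
toy = pd.DataFrame({
    "age_label": ["A", "A", "B", "B", "B"],
    "Survived": [1, 0, 1, 1, 0],
})

# Default aggfunc is the mean, i.e. the survival rate per label
via_pivot = toy.pivot_table(index="age_label", values="Survived")
print(via_pivot)

# Equivalent computation with groupby; returns a Series instead of a DataFrame
via_groupby = toy.groupby("age_label")["Survived"].mean()
print(via_groupby)
```

On the real data the rates should be read together with the group sizes, since a small group can produce an extreme rate by chance.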