pandas处理泰坦尼克号数据集(1)基础处理

数据集描述

Survived:0代表死亡,1代表存活

Pclass:乘客所持票类,有三种值(1,2,3)

Name:乘客姓名

Sex:乘客性别

Age:乘客年龄(有缺失)

SibSp:乘客兄弟姐妹/配偶的个数(整数值)

Parch:乘客父母/孩子的个数(整数值)

Ticket:票号(字符串)

Fare:乘客所持票的价格(浮点数,0-500不等)

Cabin:乘客所在船舱(有缺失)

Embark:乘客登船港口:S、C、Q(有缺失)

import pandas as pd
import numpy as np
titanic_df = pd.read_csv("titanic_train.csv")
titanic_df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
# 查看某一列缺失值
age = titanic_df.Age
age_is_null = pd.isnull(age)
# print(age_is_null)
age_null_true = age[age_is_null]
# print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)
177
# 求平均年龄
mean_age = sum(titanic_df.Age) / len(titanic_df.Age)
mean_age
nan
# 显然,我们必须去掉缺失值
mean_age = titanic_df['Age'][age_is_null == False]
correct_mean_age = sum(mean_age) / len(mean_age)
print(correct_mean_age)

# pandas 中内置函数
mean_age2 = titanic_df.Age.mean()
print(mean_age2)
29.69911764705882
29.69911764705882
# 计算每个舱位的票价平均值
pclass = [1, 2, 3]
mean_fares_by_class = {}
for c in pclass:
    pclass_rows = titanic_df[titanic_df.Pclass == c]
    pclass_fares = pclass_rows.Fare.mean()
    mean_fares_by_class[c] = pclass_fares
mean_fares_by_class
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
# 使用数据透视表的方式计算每个舱位的票价平均值
mean_fares_by_class = titanic_df.pivot_table(
                index="Pclass", values="Fare", aggfunc=np.mean)
<matplotlib.axes._subplots.AxesSubplot at 0x1caa0d94c88>

在这里插入图片描述

# 同理,计算多个变量之间的关系, 默认求平均值
port_stats = titanic_df.pivot_table(index="Embarked", values=["Age", "Fare"])
port_stats.plot(kind="bar")
<matplotlib.axes._subplots.AxesSubplot at 0x1caa0e677f0>

在这里插入图片描述

# 指定行或列,去除缺失值
print(titanic_df.shape)
drop_na_columns = titanic_df.dropna(axis='columns')
print(drop_na_columns.shape)

new_titanic_df = titanic_df.dropna(axis=0, subset=['Age', 'Sex'])
print(new_titanic_df.shape)
(891, 12)
(891, 9)
(714, 12)
# 提取某一 “单元格” 的值
row_index_99_age = titanic_df.loc[99,'Age']
print(row_index_99_age)
34.0
# 年龄从大到小排序
new_titanic_df = titanic_df.sort_values("Age", ascending=False)
print(new_titanic_df[:10])
# 重置索引
reset_index = titanic_df.reset_index(drop=True)
print(reset_index[:10])
     PassengerId  Survived  Pclass                                  Name  \
630          631         1       1  Barkworth, Mr. Algernon Henry Wilson   
851          852         0       3                   Svensson, Mr. Johan   
493          494         0       1               Artagaveytia, Mr. Ramon   
96            97         0       1             Goldschmidt, Mr. George B   
116          117         0       3                  Connors, Mr. Patrick   
672          673         0       2           Mitchell, Mr. Henry Michael   
745          746         0       1          Crosby, Capt. Edward Gifford   
33            34         0       2                 Wheadon, Mr. Edward H   
54            55         0       1        Ostby, Mr. Engelhart Cornelius   
280          281         0       3                      Duane, Mr. Frank   

      Sex   Age  SibSp  Parch      Ticket     Fare Cabin Embarked  
630  male  80.0      0      0       27042  30.0000   A23        S  
851  male  74.0      0      0      347060   7.7750   NaN        S  
493  male  71.0      0      0    PC 17609  49.5042   NaN        C  
96   male  71.0      0      0    PC 17754  34.6542    A5        C  
116  male  70.5      0      0      370369   7.7500   NaN        Q  
672  male  70.0      0      0  C.A. 24580  10.5000   NaN        S  
745  male  70.0      1      1   WE/P 5735  71.0000   B22        S  
33   male  66.0      0      0  C.A. 24579  10.5000   NaN        S  
54   male  65.0      0      1      113509  61.9792   B30        C  
280  male  65.0      0      0      336439   7.7500   NaN        Q  
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
5                                   Moran, Mr. James    male   NaN      0   
6                            McCarthy, Mr. Timothy J    male  54.0      0   
7                     Palsson, Master. Gosta Leonard    male   2.0      3   
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
9                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
5      0            330877   8.4583   NaN        Q  
6      0             17463  51.8625   E46        S  
7      1            349909  21.0750   NaN        S  
8      2            347742  11.1333   NaN        S  
9      0            237736  30.0708   NaN        C  

自定义函数使用


def row_100(column):
    """提取第100行"""
    return column.iloc[99]
titanic_df.apply(row_100)
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
dtype: object
# 查看不含空值的列
def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_df.apply(not_null_count)
column_null_count
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second class"
    elif pclass == 3:
        return "Third Class"
    
classes = titanic_df.apply(which_class, axis=1)
classes
0       Third Class
1       First Class
2       Third Class
3       First Class
4       Third Class
5       Third Class
6       First Class
7       Third Class
8       Third Class
9      Second class
10      Third Class
11      First Class
12      Third Class
13      Third Class
14      Third Class
15     Second class
16      Third Class
17     Second class
18      Third Class
19      Third Class
20     Second class
21     Second class
22      Third Class
23      First Class
24      Third Class
25      Third Class
26      Third Class
27      First Class
28      Third Class
29      Third Class
           ...     
861    Second class
862     First Class
863     Third Class
864    Second class
865    Second class
866    Second class
867     First Class
868     Third Class
869     Third Class
870     Third Class
871     First Class
872     First Class
873     Third Class
874    Second class
875     Third Class
876     Third Class
877     Third Class
878     Third Class
879     First Class
880    Second class
881     Third Class
882     Third Class
883    Second class
884     Third Class
885     Third Class
886    Second class
887     First Class
888     Third Class
889     First Class
890     Third Class
Length: 891, dtype: object
def is_minor(row):
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_df.apply(is_minor, axis=1)
# print(minors)

# 制作年龄标签
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
    
age_labels = titanic_df.apply(generate_age_label, axis=1)
age_labels

0        adult
1        adult
2        adult
3        adult
4        adult
5      unknown
6        adult
7        minor
8        adult
9        minor
10       minor
11       adult
12       adult
13       adult
14       minor
15       adult
16       minor
17     unknown
18       adult
19     unknown
20       adult
21       adult
22       minor
23       adult
24       minor
25       adult
26     unknown
27       adult
28     unknown
29     unknown
        ...   
861      adult
862      adult
863    unknown
864      adult
865      adult
866      adult
867      adult
868    unknown
869      minor
870      adult
871      adult
872      adult
873      adult
874      adult
875      minor
876      adult
877      adult
878    unknown
879      adult
880      adult
881      adult
882      adult
883      adult
884      adult
885      adult
886      adult
887      adult
888    unknown
889      adult
890      adult
Length: 891, dtype: object
titanic_df['age_labels'] = age_labels
age_group_survival = titanic_df.pivot_table(index="age_labels", values="Survived")
age_group_survival
Survived
age_labels
adult0.381032
minor0.539823
unknown0.293785
age_group_survival.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1caa0d2c7b8>

在这里插入图片描述


  • 1
    点赞
  • 14
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 1
    评论
FProwth算法可以用于处理分类数据,可以帮助我们发现在数据集频繁现的组合。下面我们将使用FP-growth算法来处理著名的泰坦尼克号数据集。 首先,我们需要将泰坦尼克号数据集转换成适合FP-growth算法使用的格式,即每个数据样本表示为一个项集。我们将使用Pythonpandas库和sklearn库来加载和处理数据。 ```python import pandas as pd from sklearn.preprocessing import LabelEncoder # 读取数据 data = pd.read_csv('titanic.csv') # 删除无用列 data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True) # 处理缺失值 data.fillna(method='ffill', inplace=True) # 将分类变量转换为数字 le = LabelEncoder() data['Sex'] = le.fit_transform(data['Sex']) data['Embarked'] = le.fit_transform(data['Embarked'].astype(str)) # 将数据转换为项集格式 dataset = [] for i in range(len(data)): items = [] items.append('Pclass=' + str(data.iloc[i]['Pclass'])) items.append('Sex=' + str(data.iloc[i]['Sex'])) items.append('Age=' + str(int(data.iloc[i]['Age']))) items.append('SibSp=' + str(data.iloc[i]['SibSp'])) items.append('Parch=' + str(data.iloc[i]['Parch'])) items.append('Fare=' + str(int(data.iloc[i]['Fare']))) items.append('Embarked=' + str(data.iloc[i]['Embarked'])) items.append('Survived=' + str(data.iloc[i]['Survived'])) dataset.append(items) ``` 接下来,我们使用mlxtend库的FP-growth算法来挖掘频繁项集。 ```python from mlxtend.preprocessing import TransactionEncoder from mlxtend.frequent_patterns import fpgrowth # 将项集转换为布尔矩阵 te = TransactionEncoder() te_ary = te.fit_transform(dataset) # 将布尔矩阵转换为DataFrame df = pd.DataFrame(te_ary, columns=te.columns_) # 使用FP-growth算法挖掘频繁项集 frequent_itemsets = fpgrowth(df, min_support=0.1, use_colnames=True) # 输频繁项集 print(frequent_itemsets) ``` 上述代码,我们使用了mlxtend库的TransactionEncoder类将项集转换为布尔矩阵,然后使用fpgrowth函数来挖掘频繁项集。在本例,我们将最小支持度设置为0.1。 运行上述代码后,我们可以得到以下的频繁项集: ``` support itemsets 0 0.629630 (Survived) 1 0.551066 (Sex) 2 0.397306 (Pclass) 3 0.306958 (Age) 4 0.282871 (Fare) 5 0.276094 (SibSp) 6 0.237785 (Parch) 7 0.188552 (Embarked) 8 0.190797 (Sex, Pclass) 9 0.162738 (Sex, Age) 10 0.159371 (Survived, Sex) 11 0.146023 (Pclass, Age) 12 0.113804 (Survived, Age) 13 0.107028 (Survived, Fare) 14 0.101010 (Sex, Survived) 15 0.100102 (Survived, Sex, Age) ``` 从上述结果,我们可以发现: - 63%的乘客在事故幸存; - 55%的乘客是男性; - 40%的乘客乘坐的是3等舱; - 31%的乘客年龄在20到40岁之间; - 28%的乘客支付的船票价格在10到30英镑之间; - 27%的乘客有一个兄弟姐妹或配偶; - 24%的乘客有一个父母或子女; - 19%的乘客从南安普敦登船; - 19%的乘客是女性且乘坐的是3等舱; - 16%的乘客是男性且年龄在20到40岁之间; - 15%的幸存者是女性; - 14%的乘客是1等舱且年龄在20到40岁之间; - 11%的幸存者年龄在20到30岁之间; - 11%的幸存者支付的船票价格在10到30英镑之间; - 10%的女性幸存者。 这些信息可以帮助我们更好地理解泰坦尼克号数据集,并从挖掘一些有用的信息。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

jayvee_

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值