Current ranking: 1420/19000
Modeling and prediction are covered in the next post:
https://blog.csdn.net/QianLong_/article/details/105780130
# View the mounted dataset directory. Changes under this directory are reverted automatically when the environment restarts.
!ls /home/aistudio/data
data31483
# View the personal work directory. Changes under this directory persist across resets. Clean up unneeded files promptly so the environment does not load slowly.
!ls /home/aistudio/work
gender_submission.csv test.csv train.csv
# For a persistent installation, install into a persistent path, as in the example below:
!mkdir /home/aistudio/external-libraries
!pip install seaborn -t /home/aistudio/external-libraries
Looking in indexes: https://pypi.mirrors.ustc.edu.cn/simple/
Collecting seaborn
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/70/bd/5e6bf595fe6ee0f257ae49336dd180768c1ed3d7c7155b2fdf894c1c808a/seaborn-0.10.0-py3-none-any.whl (215kB)
Collecting numpy>=1.13.3 (from seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/e7/38/f14d6706ae4fa327bdb023ef40b4d902bccd314d886fac4031687a8acc74/numpy-1.18.3-cp37-cp37m-manylinux1_x86_64.whl (20.2MB)
Collecting pandas>=0.22.0 (from seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0MB)
Collecting matplotlib>=2.1.2 (from seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/b2/c2/71fcf957710f3ba1f09088b35776a799ba7dd95f7c2b195ec800933b276b/matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4MB)
Collecting scipy>=1.0.1 (from seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1MB)
Collecting pytz>=2017.2 (from pandas>=0.22.0->seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509kB)
Collecting python-dateutil>=2.6.1 (from pandas>=0.22.0->seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
Collecting cycler>=0.10 (from matplotlib>=2.1.2->seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Collecting kiwisolver>=1.0.1 (from matplotlib>=2.1.2->seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/31/b9/6202dcae729998a0ade30e80ac00f616542ef445b088ec970d407dfd41c0/kiwisolver-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (88kB)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=2.1.2->seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/8a/bb/488841f56197b13700afd5658fc279a2025a39e22449b7cf29864669b15d/pyparsing-2.4.7-py2.py3-none-any.whl (67kB)
Collecting six>=1.5 (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn)
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl
ERROR: paddlepaddle 1.7.1 has requirement scipy<=1.3.1; python_version >= "3.5", but you'll have scipy 1.4.1 which is incompatible.
Installing collected packages: numpy, pytz, six, python-dateutil, pandas, cycler, kiwisolver, pyparsing, matplotlib, scipy, seaborn
Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.2.1 numpy-1.18.3 pandas-1.0.3 pyparsing-2.4.7 python-dateutil-2.8.1 pytz-2019.3 scipy-1.4.1 seaborn-0.10.0 six-1.14.0
# Also add the code below and run it each time the environment (kernel) starts:
import sys
sys.path.append('/home/aistudio/external-libraries')
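Re-running the cell above appends the same directory again each time. A minimal sketch of a guard against that, assuming duplicates are unwanted; the helper name `add_library_path` is hypothetical and only the directory comes from the cell above:

```python
import sys


def add_library_path(path, search_path=None):
    """Append `path` to the module search path once; return True if it was added."""
    # Default to the interpreter's real search path; a plain list can be
    # passed instead to try the helper without touching sys.path.
    search_path = sys.path if search_path is None else search_path
    if path in search_path:
        return False
    search_path.append(path)
    return True


add_library_path('/home/aistudio/external-libraries')
```

Calling it a second time with the same directory leaves `sys.path` unchanged.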
Project start
# Unzip the data
!unzip /home/aistudio/data/data31483/titanic.zip -d /home/aistudio/work/
Archive: /home/aistudio/data/data31483/titanic.zip
inflating: /home/aistudio/work/gender_submission.csv
inflating: /home/aistudio/work/test.csv
inflating: /home/aistudio/work/train.csv
# Load the data
import pandas as pd
trainSet = pd.read_csv('work/train.csv')
testSet = pd.read_csv('work/test.csv')
print(trainSet.shape)
trainSet.head()
(891, 12)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
print(testSet.shape)
testSet.head()
(418, 11)
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
trainSet.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
testSet.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
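The two `info()` listings show which columns have gaps: Age, Cabin, and Embarked in the training set, and Age, Fare, and Cabin in the test set. A hedged sketch of getting the same counts as a table, using a small stand-in DataFrame instead of the real files; `missing_summary` is a hypothetical helper, not something from the original notebook:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and fractions for columns that contain NaN."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "fraction": counts / len(df)})
    # Keep only columns with at least one missing value, largest gap first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in data for illustration only (not the real Titanic rows).
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})
print(missing_summary(demo))
```

On the real data the call would be `missing_summary(trainSet)` or `missing_summary(testSet)`.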
Data analysis
trainSet.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Visual data analysis
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
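# Plot the correlation matrix
ax = sns.heatmap(trainSet[["Survived", "SibSp", "Parch", "Age", "Fare"]].corr(), annot=True, cmap="coolwarm")
The heatmap shows that SibSp and Parch are fairly strongly correlated with each other, which may be useful during feature processing.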
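The post does not say here how the SibSp/Parch correlation would be used. One common option, shown purely as an assumption, is to merge the two counts into a single family-size feature. The names `add_family_size` and `FamilySize` are illustrative and do not come from the original notebook:

```python
import pandas as pd


def add_family_size(df):
    """Return a copy with FamilySize = SibSp + Parch + 1 (the passenger counts too)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out


# Stand-in rows for illustration only.
demo = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 1]})
print(add_family_size(demo)["FamilySize"].tolist())
```

Whether a combined feature actually helps would have to be checked against the model's validation score.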
trainSet.hist(figsize=(15,10))
Survival rate by category
print(trainSet[["Pclass","Survived"]].groupby