Titanic Data Analysis, Part 1: Titanic - Basics of Data Analysis

weixin_26713521

Original article: https://medium.com/swlh/part-1-titanic-basic-of-data-analysis-ab3025d29f6e

My goal was to get a better understanding of how to work with tabular data, so I challenged myself and started with the Titanic project. I think this was an excellent way to learn the basics of data analysis with Python.

You can find the competition here: https://www.kaggle.com/c/titanic

I really recommend trying it yourself if you want to learn how to analyze data and build machine learning models.

I started by importing the packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Pandas is a great package for tabular data analysis. NumPy provides a high-performance multidimensional array object and tools for working with these arrays. Matplotlib helps you generate plots, histograms, power spectra, bar charts, etc., with just a few lines of code. Seaborn is built on top of Matplotlib and can be used to create attractive and informative statistical graphics.

After loading these packages I loaded the data:

df=pd.read_csv("train.csv")

Then I had a quick look at the data:

df.head()
# Shows the first 5 rows of the table
# To show 10 rows instead of 5, pass the number of rows:
df.head(10)

Figure: screenshot of the first rows of the table.

df.tail()
# Shows the last 5 rows of the table

I recommend starting with a look at the data so that you can be sure everything is as it should be. This helps you avoid careless mistakes later in the analysis.
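Beyond head() and tail(), a quick look can also cover column types and missing values. The following is only a minimal sketch: the few rows in csv_text are made up for illustration (they are not taken from train.csv), and the variable names are my own.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: a few made-up rows with Titanic-style columns.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
4,0,2,male,35,13.00
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Data type of each column; a column with missing numbers is read as float.
print(sample.dtypes)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With the real file you would call the same methods on df after pd.read_csv("train.csv").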

df.shape
# Returns the number of rows and columns as a (rows, columns) tuple

It is a good habit to print out the shape of the data at the beginning so you can check the number of rows and columns and, later on, confirm that you haven't lost any data during the analysis.
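One way to use the shape as a safety check is to compare it before and after a processing step. This is a sketch on a made-up miniature table, and the dropna step is only an assumed example of an operation that can silently remove rows.

```python
import pandas as pd

# Made-up miniature table; the real data would come from train.csv.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0],
})

rows_before, cols_before = sample.shape

# Dropping rows with a missing Age shrinks the table without any warning.
cleaned = sample.dropna(subset=["Age"])
rows_after, cols_after = cleaned.shape

print(f"rows: {rows_before} -> {rows_after}, columns: {cols_before} -> {cols_after}")
```

If the row count changes when you did not expect it to, that is a signal to go back and look at the step in between.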

Analyze the data

Then I continued to look at the data by counting the values. This gave me a lot of information about the content of the data.

df['Pclass'].value_counts()
# Counts how many passengers are in each ticket class

I prefer using percentages to showcase values. It is easier to understand the values in percentages.

df['Pclass'].value_counts(normalize=True)
# Same as above, but normalize=True returns proportions (fractions of 1) instead of counts; multiply by 100 for percentages

I counted the values for each column separately. In the future, I want to challenge myself to write a function that prints the value counts for every column, but that was outside the scope of this project.
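Such a function could look roughly like the sketch below. It is not part of the original project: the name print_value_counts, its parameters, and the small sample table are all hypothetical.

```python
import pandas as pd

def print_value_counts(frame, columns, normalize=False):
    """Print value_counts() for each listed column and return the results in a dict."""
    results = {}
    for column in columns:
        counts = frame[column].value_counts(normalize=normalize)
        print(f"--- {column} ---")
        print(counts)
        results[column] = counts
    return results

# Made-up rows, used only to show the function running.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male", "male"],
})
counts = print_value_counts(sample, ["Pclass", "Sex"])
```

Passing the column names explicitly keeps the output readable, since columns with mostly unique values (such as names or ticket numbers) produce very long listings.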

I also wanted to understand the distribution of values in different columns, so I used the describe() method.

df['Fare'].describe()
# describe() is used to view basic statistical details like count, mean, minimum and maximum values.

Here you can see, for example, that the minimum ticket price was $0.00 and the maximum was $512.33.
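The result of describe() is a pandas Series, so individual statistics can be read by label. The sketch below uses made-up fares (not the real dataset) and also pulls out the rows with a fare of zero, since a free ticket is the kind of value worth a second look.

```python
import pandas as pd

# Made-up fares for illustration only.
sample = pd.DataFrame({"Fare": [7.25, 71.28, 0.0, 13.0]})

stats = sample["Fare"].describe()
lowest, highest = stats["min"], stats["max"]
print(f"min fare: {lowest:.2f}, max fare: {highest:.2f}")

# Rows where the fare is exactly zero.
zero_fares = sample[sample["Fare"] == 0]
print(zero_fares)
```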

I built several cross-tabulations to understand which variables were most closely related to survival.

pd.crosstab(df['Survived'], df['Sex'])
# Counts passengers by survival (rows) and sex (columns)

pd.crosstab(df['Survived'], df['Sex'], normalize=True)
# With normalize=True, each cell is a fraction of all passengers (multiply by 100 for percentages)

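Cross-tabulating different columns gives you information about possible relationships between variables, for example sex and survival. Because normalize=True divides every cell by the total number of passengers, the table shows that about 26% of all passengers were women who survived, while about 52% of all passengers were men who did not survive.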
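Note that normalize=True divides each cell by the grand total. To get the survival rate within each sex instead, pd.crosstab also accepts normalize="index", which makes every row sum to 1. The sketch below shows both on a made-up table, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up rows: 3 women (2 survived) and 5 men (1 survived).
sample = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Each cell as a share of the whole table (what normalize=True gives).
overall = pd.crosstab(sample["Sex"], sample["Survived"], normalize=True)

# Each cell as a share of its own row, i.e. the survival rate within each sex.
by_sex = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")

print(overall)
print(by_sex)
```

Putting Sex on the rows here is a choice made for readability: with normalize="index", each row then reads directly as "of this group, what share survived".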

Visualize the data

Numerical values in tables are useful, but visualized data is easier to understand, at least for me. This is why I plotted histograms and bar charts, and in doing so I learned how to visualize the data. Here are a few examples:

df.hist(column='Age')

I used the seaborn library for the bar charts.

sns.countplot(x='Sex', hue='Survived', data=df);

Also, I used a heatmap to see the correlation between different columns.

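# numeric_only=True (pandas 1.5+) leaves text columns such as Name and Sex out of the correlation;
# on pandas 2.0 and later, calling df.corr() without it raises an error for this data
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, annot=True, square=True, annot_kws={'size': 15});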

The heatmap shows a clear negative correlation between Fare and Pclass: as one increases, the other decreases. This is logical because 1st-class tickets cost more than 3rd-class tickets.

If we focus on the correlations between survival and the other columns, we see a positive correlation between survival and fare, although it is not a strong one: passengers who paid more for their tickets were more likely to survive.
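To read these correlations as numbers rather than colors, one option is to take the Survived column of the correlation matrix and sort it. This is only a sketch on made-up rows, so the values it prints say nothing about the real dataset; only the signs were chosen to mirror the pattern described above.

```python
import pandas as pd

# Made-up rows for illustration; Sex is text and is excluded below.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 2, 3, 3, 1],
    "Fare": [7.25, 71.28, 13.0, 8.05, 7.9, 53.1],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

# Keep only numeric columns before computing correlations.
numeric = sample.select_dtypes(include="number")

# Correlation of every numeric column with Survived, lowest first.
corr_with_survived = numeric.corr()["Survived"].drop("Survived").sort_values()
print(corr_with_survived)
```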

You can find the project on GitHub. Please feel free to try it yourself and comment if there is something that needs clarifying!

Thank you to the highly trained monkey (Risto Hinno) for motivating and inspiring me!
