Simple Data Visualization

Data science is, in part, the art of storytelling: it shows people who do not work with data how useful that data can be, by turning it into a form they can understand. Data visualization is one of the strongest tools in data science for making that translation.

This post is for beginners who have just started using data visualization for exploratory data analysis (EDA).

What is Data Visualization?

Data visualization is the graphical representation of information and data, intended to make them useful and understandable to everyone. It relies on visual tools such as charts, graphs and maps.

Today we are surrounded by huge amounts of data from every aspect of life: social, technical, personal and medical. To deal with it, data scientists perform various steps that transform the data into a usable form, and data visualization is one of the ways data is given a form that everyone can read. As the saying goes, "A picture is worth a thousand words", and the same holds for data.

Data visualization is typically used at two points in a modelling project: during EDA, and again at the end of the analysis to check correctness, accuracy, predictions and so on. EDA (Exploratory Data Analysis) is the step in the data science methodology in which the analyst gets familiar with the data and performs the manipulations needed to remove discrepancies from it. This step is not complete without visualizations.

Data visualization is commonly done with Python libraries such as matplotlib and seaborn, or with an application such as Tableau. In this post I will focus on matplotlib.

>>import matplotlib.pyplot as plt

Dataset Used

The best way to understand a technique is to apply it, so I will use an example dataset to show how visualization helps.

The dataset used is: https://www.kaggle.com/saurograndi/airplane-crashes-since-1908/downloa

Understanding Dataset

Getting to know the dataset comes before any actual visualization. The dataset "Airplane Crashes Since 1908" has 5268 entries and 13 features.
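A first look at the data can be sketched as follows. This is only a minimal sketch, assuming pandas is used for loading (the post does not show this step). The three inline rows are made-up stand-ins for the real CSV, and only column names mentioned in this post are used.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real CSV file (assumption:
# the real file has a header row and dates written as MM/DD/YYYY).
sample_csv = io.StringIO(
    "Date,Time,Aboard,Fatalities\n"
    "09/17/1908,17:18,2,1\n"
    "07/12/1912,06:30,5,5\n"
    "08/06/1913,,1,1\n"
)

# parse_dates turns the Date column into datetime values on load.
data = pd.read_csv(sample_csv, parse_dates=["Date"])

# shape is (number of rows, number of columns).
n_entries, n_features = data.shape
print("entries:", n_entries, "features:", n_features)
print(data.dtypes)
```

With the real file, passing the path of the downloaded CSV instead of `sample_csv` should report the entry and feature counts quoted above.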

[Figure: Features Include]

Let's see what the dataset looks like:

[Figure: Dataset]

Data Manipulation

Data manipulation is a crucial task: it is the process of changing the data to eliminate discrepancies, for example by removing missing values or replacing them, so that the data is easier to work with and study.

To do this we need to check for discrepancies such as missing values, outliers and so on. In this article I focus on missing values and how to deal with them.

Checking for missing values gave the following result:

[Figure: % of Missing Values]

This picture shows the percentage of missing values in each feature column of the dataset. The columns with a substantial share of missing data are Time, Flight #, Route, Registration, cn/In and Summary, which makes them candidates for removal. We will keep Summary, however, because it holds important information about the individual entries.
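One way to produce such percentages is sketched below. It assumes pandas and uses a toy frame with made-up values; the post does not show how its own figures were computed, so treat this as one possible approach.

```python
import pandas as pd

# Toy frame with made-up values; None marks a missing cell.
toy = pd.DataFrame({
    "Time": [None, "17:18", None, None],
    "Route": ["Test flight", None, None, "Demonstration"],
    "Aboard": [2.0, 5.0, 1.0, 20.0],
})

# isna() gives True/False per cell; the column mean of those flags
# is the fraction missing, so multiplying by 100 gives a percentage.
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Sorting the result puts the sparsest columns first, which makes the candidates for removal easy to spot.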

So we delete Time, Flight #, Route, Registration and cn/In, and then drop the rows that still have missing values in the remaining features, to get a complete dataset for visualization.
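The two cleaning steps might look like this in pandas. This is a sketch on a toy frame with made-up values; the name `data_copy` matches the variable used later in the post, while `toy` and `sparse_columns` are hypothetical names introduced here.

```python
import pandas as pd

# Toy frame with made-up values, using the column names from the post.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1950-01-01", "1950-06-01", "1951-03-01"]),
    "Time": [None, "17:18", None],
    "Flight #": [None, None, "7"],
    "Route": [None, "A - B", None],
    "Registration": ["X-1", None, None],
    "cn/In": [None, None, None],
    "Aboard": [10.0, None, 4.0],
    "Fatalities": [10.0, 10.0, 1.0],
    "Summary": ["Crashed on takeoff.", "Engine failure.", "Struck terrain."],
})

sparse_columns = ["Time", "Flight #", "Route", "Registration", "cn/In"]

# Drop the sparse columns first, then drop rows with any remaining gap.
# The final copy() makes data_copy independent of the toy frame.
data_copy = toy.drop(columns=sparse_columns).dropna().copy()
print(data_copy)
```

The order matters: calling `dropna()` before removing the sparse columns would discard almost every row, because nearly every row has a gap in at least one of them.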

Data Visualization

Finally, let's draw a simple graph of the average survival rate.

In the dataset we now have after data manipulation, we calculate the survival rate as:

>># Survival rate of each crash, as a percentage of the people aboard
>>data_copy["Survival Rate"] = 100 * (data_copy["Aboard"] - data_copy["Fatalities"]) / data_copy["Aboard"]

>># Mean over all crashes, drawn later as a horizontal reference line
>>data_copy_mean = data_copy["Survival Rate"].mean()

>># Date must already be a datetime column (e.g. via pd.to_datetime) for .dt.year to work
>>survival_per_year = data_copy.groupby(data_copy["Date"].dt.year)["Survival Rate"].mean()

>>survival_per_year.plot(legend=False)

>>plt.ylabel("Average Survival Rate, %")

>>plt.xlabel("Year")

>>plt.title("Average Survival Rate/Year")

>>plt.xticks([x for x in range(1908, 2009, 10)], rotation='vertical')

>>plt.axhline(y=data_copy_mean, color='g', linestyle='--')

>>plt.show()

Running this code gives the following graph:

[Figure: Graph for Average Survival Rate]

The average survival rate, taken over all crashes in the cleaned dataset and drawn as the dashed green line, is ~16.75%.
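The survival-rate formula and the yearly grouping can be checked by hand on a tiny made-up table. The sketch below assumes pandas and invents three crashes. It also shows that the mean over individual crashes (what `data_copy_mean` holds) is not the same number as the mean of the yearly averages.

```python
import pandas as pd

# Toy data with made-up values: two crashes in 1950, one in 1951.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1950-01-01", "1950-06-01", "1951-03-01"]),
    "Aboard": [10, 20, 4],
    "Fatalities": [10, 10, 1],
})

# Same formula as in the post: survivors as a percentage of those aboard.
toy["Survival Rate"] = 100 * (toy["Aboard"] - toy["Fatalities"]) / toy["Aboard"]

# Mean over individual crashes (the horizontal reference line).
overall_mean = toy["Survival Rate"].mean()

# Mean per calendar year (the plotted curve).
per_year = toy.groupby(toy["Date"].dt.year)["Survival Rate"].mean()

print(toy["Survival Rate"].tolist())
print(per_year)
print(overall_mean, per_year.mean())
```

Here the per-crash rates are 0%, 50% and 75%, so 1950 averages 25% and 1951 averages 75%. The mean over crashes is about 41.7%, while the mean of the two yearly values is 50%, because years with more crashes weigh more in the first figure.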

Conclusion:

  1. Identified missing values.
  2. Dealt with missing values.
  3. Performed visualization using the matplotlib.pyplot library.
  4. Found the average survival rate with the help of the graph: ~16.75%

For the code, see: https://www.kaggle.com/tanvi05/datavisualization-airplane-crashes-since-1908

Translated from: https://medium.com/swlh/simple-data-visualization-68cf9f008d0b
