ML - Understanding Data with Visualization

Introduction

In the previous chapter, we discussed the importance of data for Machine Learning algorithms, along with some Python recipes for understanding the data with statistics. Visualization is another way to understand the data.

With data visualization, we can see what the data looks like and what kind of correlation holds among its attributes. It is often the fastest way to see whether the features correspond to the output. The following Python recipes show how to understand ML data with visualization.

Data Visualization Techniques

Univariate Plots: Understanding Attributes Independently

The simplest type of visualization is single-variable, or "univariate", visualization. With univariate visualization, we can understand each attribute of our dataset independently. The following are some techniques in Python for univariate visualization:

Histograms

Histograms group the data into bins and are one of the fastest ways to get an idea of the distribution of each attribute in a dataset. The following are some of the characteristics of histograms:

  • A histogram gives a count of the number of observations in each bin created for the visualization.

  • From the shape of the bins, we can easily observe the distribution, i.e. whether it is Gaussian, skewed or exponential.

  • Histograms also help us to see possible outliers.
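The bin counts behind a histogram can be inspected directly. The following is a minimal sketch, not part of the original tutorial: the ten values are made-up toy data and the choice of four bins is an assumption for illustration (pandas' hist() uses 10 bins by default). It uses numpy.histogram to count the observations in each bin.

```python
import numpy as np

# Toy data (hypothetical): nine small values and one suspiciously large one.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

# Split the range [min, max] into 4 equal-width bins and count observations.
counts, edges = np.histogram(values, bins=4)

# Each bin is half-open [left, right), except the last, which includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.2f} to {right:5.2f}: {count} observations")
```

Nine of the ten observations fall into the first bin, two bins are empty, and the single value 20 sits alone in the last bin, which is how a possible outlier shows up in a histogram.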

Example

The code below is an example of a Python script that creates histograms of the attributes of the Pima Indian Diabetes dataset. Here, we use the hist() function on a Pandas DataFrame to generate the histograms and matplotlib to plot them.


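from matplotlib import pyplot
from pandas import read_csv

# Column names are supplied explicitly, so the file is read without a header row.
path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)

# Draw one histogram per attribute, then display the figure.
data.hist()
pyplot.show()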

Output

[Figure: grid of histograms, one per attribute]

The output above shows one histogram for each attribute in the dataset. From it, we can observe that the age, pedi and test attributes may have an exponential distribution, while mass and plas appear roughly Gaussian.
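A visual reading of a histogram can be cross-checked with a number such as sample skewness. The sketch below is an illustration only, not part of the original tutorial: it uses synthetic samples rather than the Pima data, and the column names gaussian_like and exponential_like are hypothetical. It relies on the skew() method of a Pandas DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic samples (assumption: stand-ins for real attributes); fixed seed for repeatability.
rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "gaussian_like": rng.normal(loc=0.0, scale=1.0, size=10_000),
    "exponential_like": rng.exponential(scale=1.0, size=10_000),
})

# Skewness near 0 suggests a symmetric shape; a clearly positive value
# suggests a long right tail, as in an exponential-looking histogram.
skewness = samples.skew()
print(skewness)
```

Applied to the real dataset, data.skew() would give one such value per attribute, which can be compared with the shapes seen in the histograms.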

Density Plots

Another quick and easy technique for getting each attribute's distribution is the density plot. A density plot resembles a histogram, but with a smooth curve drawn through the top of each bin. Density plots can be thought of as abstracted histograms.
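The smooth curve is usually produced by kernel density estimation: a small Gaussian bump is placed on every observation and the bumps are averaged. The following is a hand-rolled sketch of that idea under stated assumptions, not the routine pandas uses internally: the function name gaussian_kde_1d, the toy samples and the bandwidth of 0.8 are all hypothetical choices for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

# Toy data (hypothetical), symmetric around 3.
samples = [1, 2, 2, 3, 3, 3, 4, 4, 5]
grid = np.linspace(-5.0, 11.0, 801)
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# A density curve should enclose an area of about 1 (simple Riemann sum here).
area = density.sum() * (grid[1] - grid[0])
peak = grid[np.argmax(density)]
print(f"area under curve: {area:.4f}, peak near x = {peak:.2f}")
```

A smaller bandwidth follows the individual observations more closely, while a larger one gives a smoother, flatter curve.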

Example

In the following example, the Python script generates density plots for the distribution of the attributes of the Pima Indian Diabetes dataset.


from matplotlib import pyplot
from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)

# One density curve per attribute in a 3x3 grid; each panel keeps its own x-axis.
data.plot(kind='density', subplots=True, layout=(3, 3), sharex=False)
pyplot.show()

Output

[Figure: grid of density plots, one per attribute]

Comparing this output with the histograms makes the difference clear: a density plot shows the same distributions as smooth curves rather than as binned bars.

Box and Whisker Plots

Box and Whisker plots, or boxplots for short, are another useful technique for reviewing the distribution of each attribute. The following are the characteristics of this technique:

  • It is univariate in nature and summarizes the distribution of each attribute.

  • It draws a line for the middle value, i.e. the median.

  • It draws a box around the 25th and 75th percentiles.

  • It also draws whiskers, which give us an idea of the spread of the data.

  • The dots outside the whiskers signify outlier values. By the usual convention, a value is drawn as an outlier when it lies more than 1.5 times the interquartile range (the spread of the middle 50% of the data) beyond the edge of the box.
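The numbers a boxplot summarizes can be computed by hand. The sketch below is an illustration on made-up toy values, not part of the original tutorial. It assumes the common 1.5 x IQR rule for the outlier fences and uses numpy.percentile with its default linear interpolation; other quartile conventions give slightly different values.

```python
import numpy as np

# Toy data (hypothetical): the value 20 is far from the rest.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20], dtype=float)

# Box edges and the median line.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # spread of the middle 50% of the data

# Fences: anything beyond them is drawn as an individual outlier dot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)].tolist()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

For this toy data the box runs from 2.25 to 4.0 with the median at 3.0, and only the value 20 falls outside the fences.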

Example

In the following example, the Python script generates Box and Whisker plots for the distribution of the attributes of the Pima Indian Diabetes dataset.


from matplotlib import pyplot
from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)

# One boxplot per attribute in a 3x3 grid; axes are not shared because the scales differ.
data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False)
pyplot.show()

Output

[Figure: grid of Box and Whisker plots, one per attribute]

From the above plot of the attributes' distributions, it can be observed that age, test and skin appear skewed towards smaller values.

Multivariate Plots: Interaction Among Multiple Variables

Another type of visualization is multi-variable, or "multivariate", visualization. With multivariate visualization, we can understand the interactions between multiple attributes of our dataset. The following are some techniques in Python for multivariate visualization:

Correlation Matrix Plot

Correlation indicates how changes in two variables are related. In the previous chapters, we discussed Pearson's correlation coefficient and the importance of correlation. We can plot the correlation matrix to show which variables have a high or low correlation with one another.
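To make the numbers behind such a plot concrete, the sketch below builds a tiny correlation matrix and checks one entry against Pearson's formula. It is an illustration only, not part of the original tutorial: the three columns are made-up toy data and the helper name pearson is hypothetical. DataFrame.corr() computes Pearson correlation by default.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "up" rises with x, "down" falls with x.
df = pd.DataFrame({
    "x":    [1.0, 2.0, 3.0, 4.0, 5.0],
    "up":   [3.0, 5.0, 7.0, 9.0, 11.0],
    "down": [5.0, 4.0, 3.0, 2.0, 1.0],
})
corr = df.corr()  # pairwise Pearson coefficients, each between -1 and 1

def pearson(a, b):
    """Pearson coefficient: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))

manual = pearson(df["x"], df["down"])
print(corr)
print("manual x vs down:", manual)
```

In this toy case x and up are perfectly positively correlated, x and down are perfectly negatively correlated, and the matrix is symmetric with ones on the diagonal.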

Example

In the following example, the Python script generates and plots the correlation matrix for the Pima Indian Diabetes dataset. The matrix is generated with the corr() function on a Pandas DataFrame and plotted with pyplot.


from matplotlib import pyplot
from pandas import read_csv
import numpy

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)

# Pairwise Pearson correlations between all attributes.
correlations = data.corr()

# Show the matrix as a colored grid, with the color scale fixed to [-1, 1].
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)

# Label every row and column with its attribute name.
ticks = numpy.arange(0, 9, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()

Output

[Figure: correlation matrix plot of all attributes, with color bar]

From the above correlation matrix, we can see that it is symmetrical, i.e. the bottom left is the same as the top right. The diagonal from top left to bottom right shows that each variable is perfectly positively correlated with itself.

Scatter Matrix Plot

A scatter plot uses dots in two dimensions to show the relationship between two variables, that is, how one variable changes with another. Scatter plots resemble line graphs in that they use horizontal and vertical axes to plot data points. A scatter matrix arranges the scatter plots for every pair of attributes in a grid.
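The grid layout can be seen on a small example. The sketch below is an illustration under stated assumptions, not part of the original tutorial: it uses three synthetic columns with hypothetical names instead of the Pima data, and it selects matplotlib's non-interactive "Agg" backend on the assumption that no display is available. scatter_matrix() from pandas.plotting returns the grid of axes it draws on.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, so nothing is shown on screen

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data (hypothetical): "linear" depends on x, "noise" does not.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})

# One panel per ordered pair of columns; the diagonal shows each column's histogram.
axes = scatter_matrix(toy)
print(axes.shape)
```

With three columns the result is a 3 x 3 grid. In the x versus linear panels the dots should fall along a line, while the panels involving noise should show a shapeless cloud.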

Example

In the following example, the Python script generates and plots the scatter matrix for the Pima Indian Diabetes dataset. The matrix is generated with the scatter_matrix() function from pandas.plotting, applied to a Pandas DataFrame, and plotted with pyplot.


from matplotlib import pyplot
from pandas import read_csv
# scatter_matrix lives in pandas.plotting; the older pandas.tools.plotting path was removed.
from pandas.plotting import scatter_matrix

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)

# One scatter plot for every pair of attributes.
scatter_matrix(data)
pyplot.show()

Output

[Figure: scatter matrix plot of all attribute pairs]

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_understanding_data_with_visualization.htm
