Creating Box and Whisker Plots in R

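This article explains how to create box plots (also called box and whisker plots) in R to measure and visualize data distributions. A box plot describes a distribution through a five-number summary (median, quartiles, IQR, minimum and maximum) and helps identify outliers. The article also covers single box plots, multiple box plots, and adding notches to compare the medians of different distributions.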

Box plots in R are a good way to measure and visualize how your data is distributed and how tightly it clusters. They are also known as box and whisker plots. Every data distribution has certain measures of central tendency: mean, median and mode.

Some distributions sit tightly around the median and mean, while others spread across a wide range of values and contain a number of outliers. Box plots let you examine your data using a five-number summary. The quantities involved are:

  • Median – the middle value of the set, known as Q2
  • First quartile – the value below which about a quarter of the data falls (roughly the median of the lower half of the set), known as Q1
  • Third quartile – the value below which about three quarters of the data falls (roughly the median of the upper half of the set), known as Q3
  • Interquartile range – the distance between Q1 and Q3, known as IQR
  • Minimum – Q1 - 1.5*IQR, the lower limit for non-outlying values, which is not necessarily the smallest value in the data
  • Maximum – Q3 + 1.5*IQR, the upper limit for non-outlying values, which is not necessarily the largest value in the data

Any data point beyond the minimum and maximum limits is treated as an outlier. The box plot therefore gives you a compact overview of the data distribution.
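
The quantities listed above can be computed directly, which may make the definitions more concrete. The following is a minimal sketch using base R's quantile(), median() and IQR() on the Wind column of the built-in airquality dataset. The variable names (wind, lower_limit, upper_limit, outliers) are chosen here for illustration and do not come from the original article.

```r
# Wind speed column of the built-in airquality dataset
wind <- airquality$Wind

# Quartiles and interquartile range
q1  <- unname(quantile(wind, 0.25))
q2  <- median(wind)
q3  <- unname(quantile(wind, 0.75))
iqr <- IQR(wind)

# Limits beyond which a point is treated as an outlier
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Points that fall outside the limits
outliers <- wind[wind < lower_limit | wind > upper_limit]

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr))
print(c(lower = lower_limit, upper = upper_limit))
print(outliers)
```

Note that quantile() uses its default interpolation rule here, while boxplot() computes the box edges as hinges, so the two can differ slightly on some datasets.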

Creating Box Plots in R

Box plots can be created using the boxplot() function in R. Let us create our first box plot using R's built-in airquality dataset.

This is a data frame with 6 columns and 153 rows, recording weather data such as wind speed, temperature and ozone level. Let us make a box plot for the wind speed column of the dataset.


boxplot(airquality$Wind)
[Figure: Box Plot in R]
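
As a small sketch, the shape of the dataset can be checked before plotting, and the same plot can be given a title and an axis label through the standard main and ylab graphical arguments. The title and label text below are illustrative choices, not part of the original article.

```r
# Confirm the shape of the built-in dataset: rows first, then columns
print(dim(airquality))
print(names(airquality))

# Same box plot of wind speed, with a title and an axis label added
boxplot(airquality$Wind,
        main = "Wind speed in the airquality dataset",
        ylab = "Wind speed (mph)")
```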

Interpretations:

  • The thick line through the box represents the median of the data, which is roughly 10.
  • The lower half of the box looks larger than the upper half, indicating that the values between Q1 and the median are more spread out than those between the median and Q3.
  • The upper and lower boundaries of the box represent Q3 and Q1 respectively.
  • The short horizontal lines at the ends of the lines extending from the box, known as whiskers, mark the smallest and largest values that lie within the minimum and maximum limits.
  • The small circles above the upper whisker are the outliers.
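
The values read off the plot can also be obtained numerically. A minimal sketch, assuming base R's boxplot.stats(), which returns the statistics that boxplot() draws; the variable name wind_stats is illustrative.

```r
# Numbers that boxplot() uses when drawing the Wind plot
wind_stats <- boxplot.stats(airquality$Wind)

# stats holds, in order: lower whisker end, lower hinge (Q1),
# median, upper hinge (Q3), upper whisker end
print(wind_stats$stats)

# Values drawn as individual circles (the outliers)
print(wind_stats$out)
```

Comparing the printed median and outliers with the figure is a quick way to check a visual reading of the plot.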

Let us draw a box plot for another variable in the dataset.


boxplot(airquality$Ozone)
[Figure: Box Plot Ozone]

This variable has two outliers above the upper whisker, and the values above the median are more spread out than those below it.
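
One detail worth checking for this column is that Ozone contains missing values, which boxplot() leaves out of the plot. The sketch below counts them and lists the flagged outliers using boxplot.stats(); the variable names (ozone, n_missing, ozone_stats) are illustrative.

```r
ozone <- airquality$Ozone

# Ozone contains missing values, which are left out of the box plot
n_missing <- sum(is.na(ozone))
ozone_stats <- boxplot.stats(ozone)

print(n_missing)        # number of NA readings
print(ozone_stats$n)    # number of readings actually used
print(ozone_stats$out)  # values flagged as outliers
```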

Building Multiple Box Plots

R also makes it possible to compare the distributions of two variables by drawing multiple box plots side by side.


boxplot(airquality$Ozone, airquality$Temp, names = c('Ozone', 'Temperature'), col = c('red', 'orange'))
[Figure: Multi-Box Plot]

The command uses two different colors to distinguish the variables. The labels for the individual plots are supplied through the names argument of the function.
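
An alternative sketch: because a data frame is a list of columns, the selected columns can be passed to boxplot() in one object, in which case each box is labelled with its column name and the names argument is not needed. The variable names two_vars and res are illustrative.

```r
# Select the two columns; boxplot() draws one box per column
# and labels each box with the column name
two_vars <- airquality[, c("Ozone", "Temp")]

boxplot(two_vars, col = c("red", "orange"))

# plot = FALSE returns the underlying statistics without drawing
res <- boxplot(two_vars, plot = FALSE)
print(res$names)
```

This form is convenient when comparing more than two columns, although the variables should be on comparable scales for the plot to be readable.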

Plotting Variable Relationships with Box Plots

It is also possible to compare a variable across the levels of a categorical variable in the dataset. For example, to look at the distribution of temperature for each individual month, we only need to put the two variables in the formula, Temp ~ Month, and set data to the name of the data frame.

Temp ~ Month means that we want the distribution of Temp grouped by Month. Let us now run the command and build a horizontal plot instead of a vertical one.


boxplot(Temp ~ Month, data=airquality, horizontal= TRUE, col=c('red','green'))
[Figure: Multi Boxplot]
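
The command above passes two colors for five months, so R reuses them in turn. As a hedged variant, the sketch below gives each month its own color and replaces the month numbers with abbreviated names. The column MonthName and the variables aq, month_cols and month_medians are introduced here for illustration only.

```r
# Work on a copy so the built-in dataset is left untouched
aq <- airquality

# Month is stored as the numbers 5 to 9; turn it into a labelled factor
aq$MonthName <- factor(month.abb[aq$Month], levels = month.abb[5:9])

# One color per month, so that no color is reused
month_cols <- c("red", "green", "orange", "blue", "purple")

boxplot(Temp ~ MonthName, data = aq, horizontal = TRUE, col = month_cols)

# Median temperature per month, for reference alongside the plot
month_medians <- tapply(aq$Temp, aq$MonthName, median)
print(month_medians)
```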

Adding Notches to Box Plots in R

A variation of the box plot adds notches. A notch is simply a narrowing of the box around the median, characterized by its width and height.

Two plots whose notches overlap give no strong evidence that their medians differ. Conversely, if two notches do not overlap, the medians of the distributions are likely to be different. Notches are added by setting the notch parameter to TRUE.

Let us make a notched variant of the grouped plot above.


boxplot(Temp ~ Month, data = airquality, horizontal = TRUE, notch = TRUE, col = c('red', 'green', 'orange', 'blue', 'purple'))
[Figure: Notched Plot]
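
The notch limits can be inspected numerically instead of being judged by eye. The sketch below relies on the conf component returned by boxplot() and on the rule described in R's boxplot.stats() documentation, where notches extend to the median plus or minus 1.58 * IQR / sqrt(n). The variable names (res, may, s, manual_notch, overlap) are illustrative.

```r
# Statistics for the grouped plot, without drawing it
res <- boxplot(Temp ~ Month, data = airquality, plot = FALSE)

# res$conf has one column per month: row 1 is the lower end
# of the notch, row 2 is the upper end
print(res$names)
print(res$conf)

# Recompute the notch for the first group (May) by hand:
# median +/- 1.58 * IQR / sqrt(n), with the IQR taken between the hinges
may <- airquality$Temp[airquality$Month == 5]
s <- boxplot.stats(may)$stats
manual_notch <- s[3] + c(-1.58, 1.58) * (s[4] - s[2]) / sqrt(length(may))
print(manual_notch)

# Do the notches of May (column 1) and July (column 3) overlap?
overlap <- res$conf[1, 3] <= res$conf[2, 1] && res$conf[1, 1] <= res$conf[2, 3]
print(overlap)
```

If overlap is FALSE, the notches of the two months are disjoint, which is the situation the text describes as indicating that the medians are likely to differ.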

Translated from: https://www.journaldev.com/36405/creating-whisker-and-box-plots-in-r
