Probability and Mathematical Statistics 1: Overview and Descriptive Statistics (Part 1)

1.1 Populations, Samples, and Processes

Engineers and scientists are constantly exposed to collections of facts, or data, both in their professional capacities and in everyday activities.

An investigation will typically focus on a well-defined collection of objects constituting a population of interest.

When desired information is available for all objects in the population, we have what is called a census. In practice, a census is often impractical or infeasible.

Instead, a subset of the population, a sample, is selected in some prescribed manner.

A variable is any characteristic whose value may change from one object to another in the population. We shall initially denote variables by lowercase letters from the end of our alphabet, e.g., x = brand of calculator owned by a student.

Data results from making observations either on a single variable or simultaneously on two or more variables.

A univariate data set consists of observations on a single variable.

We have bivariate data when observations are made on each of two variables.

Variables are either numerical variables or categorical variables.

Multivariate data arises when observations are made on more than one variable (so bivariate is a special case of multivariate).

Branches of Statistics

An investigator who has collected data may wish simply to summarize and describe important features of the data. This entails using methods from descriptive statistics.

Some of these methods are graphical in nature; the construction of histograms, boxplots, and scatter plots are primary examples. Other descriptive methods involve calculation of numerical summary measures, such as means, standard deviations, and correlation coefficients.

Having obtained a sample from a population, an investigator would frequently like to use sample information to draw some type of conclusion (make an inference of some sort) about the population. That is, the sample is a means to an end rather than an end in itself. Techniques for generalizing from a sample to a population are gathered within the branch of our discipline called inferential statistics.

Mastery of probability leads to a better understanding of how inferential procedures are developed and used, how statistical conclusions can be translated into everyday language and interpreted, and when and where pitfalls can occur in applying the methods. Probability and statistics both deal with questions involving populations and samples, but do so in an "inverse manner" to one another. (Probability is the foundation of statistical inference: it studies the properties of data observed under a given data-generating process, whereas statistical inference reasons backward from the observed data to the process that generated them.)

Before we can understand what a particular sample can tell us about the population, we should first understand the uncertainty associated with taking a sample from a given population. This is why we study probability before statistics.

There are a number of problem situations in which we fit questions into the framework of inferential statistics by conceptualizing a population.

The Scope of Modern Statistics

These days statistical methodology is employed by investigators in virtually all disciplines, including such areas as
● molecular biology (analysis of microarray data)
● ecology (describing quantitatively how individuals in various animal and plant populations are spatially distributed)
● materials engineering (studying properties of various treatments to retard corrosion)
● marketing (developing market surveys and strategies for marketing new products)
● public health (identifying sources of diseases and ways to treat them)
● civil engineering (assessing the effects of stress on structural elements and the impacts of traffic flows on communities)

Enumerative Versus Analytic Studies

Enumerative studies: Interest is focused on a finite, identifiable, unchanging collection of individuals or objects that make up a population. A sampling frame, that is, a listing of the individuals or objects to be sampled, is either available to an investigator or else can be constructed.

Analytic studies: An analytic study is broadly defined as one that is not enumerative in nature. Such studies are often carried out with the objective of improving a future product by taking action on a process of some sort (e.g., recalibrating equipment or adjusting the level of some input such as the amount of a catalyst). Data can often be obtained only on an existing process, one that may differ in important respects from the future process. There is thus no sampling frame listing the individuals or objects of interest.

Collecting Data

When data collection entails selecting individuals or objects from a frame, the simplest method for ensuring a representative selection is to take a simple random sample. This is one for which any particular subset of the specified size (e.g., a sample of size 100) has the same chance of being selected.
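As a minimal sketch of simple random sampling, the snippet below draws a sample of size 100 from a frame using Python's standard library. The frame of 1,000 ID numbers, the function name `simple_random_sample`, and the seed are assumptions made for illustration only.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame without replacement.

    random.sample gives every subset of size n the same chance
    of being selected, which is the defining property above.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers 1..1000 (not from the text).
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 100, seed=1)
print(len(sample), "units selected, all distinct:", len(set(sample)) == 100)
```

Passing a seed only makes the illustration reproducible; a real study would not need one.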

Stratified sampling entails separating the population units into nonoverlapping groups and taking a sample from each one.
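The idea can be sketched as follows: group the units by stratum, then take a simple random sample within each group. The population of (id, class year) pairs, the helper name `stratified_sample`, and the choice of an equal sample size per stratum are all assumptions for this example; the text does not prescribe how many units to take from each group.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Split units into nonoverlapping strata and sample from each one."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Assumption: the same number of units is taken from every stratum.
    return {
        label: rng.sample(members, min(n_per_stratum, len(members)))
        for label, members in strata.items()
    }

# Hypothetical population: 200 students tagged with a class year.
years = ["freshman", "sophomore", "junior", "senior"]
population = [(i, years[i % 4]) for i in range(200)]
picked = stratified_sample(population, lambda u: u[1], 5, seed=2)
for label, members in picked.items():
    print(label, [student_id for student_id, _ in members])
```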

1.2 Pictorial and Tabular Methods in Descriptive Statistics

visual displays

Notation

The number of observations in a single sample, that is, the sample size, will often be denoted by n. For example, for the sample of universities {Stanford, Iowa State, Wyoming, Rochester}, n = 4.

If two samples are simultaneously under consideration, either m and n or n1 and n2 can be used to denote the numbers of observations.

Given a data set consisting of n observations on some variable x, the individual observations will be denoted by x1, x2, x3, …, xn.

Stem-and-Leaf Displays

Constructing a Stem-and-Leaf Display

  1. Select one or more leading digits for the stem values. The trailing digits become the leaves.
  2. List possible stem values in a vertical column.
  3. Record the leaf for each observation beside the corresponding stem value.
  4. Indicate the units for stems and leaves some place in the display.

In general, a display based on between 5 and 20 stems is recommended.

In a display constructed by hand, the leaves on each stem need not be arranged from smallest to largest.
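The four construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical non-negative integers, the stem is everything except the ones digit, the leaf is the ones digit, and the function name `stem_and_leaf` is made up here. Unlike a hand-made display, this sketch sorts the leaves on each stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)      # step 1: leading digits / trailing digit
        rows[stem].append(leaf)         # step 3: record the leaf beside its stem
    lines = []
    # step 2: list every possible stem, including stems with no leaves
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in sorted(rows.get(stem, [])))
        lines.append(f"{stem:>3} | {leaves}")
    # step 4: state the units somewhere in the display
    lines.append("Stem: tens digit   Leaf: ones digit")
    return lines

data = [12, 15, 21, 23, 23, 38, 40, 41, 47, 52]   # hypothetical observations
print("\n".join(stem_and_leaf(data)))
```

Listing empty stems as well makes gaps in the data visible in the display.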

A stem-and-leaf display conveys information about the following aspects of the data:

● identification of a typical or representative value
● extent of spread about the typical value
● presence of any gaps in the data
● extent of symmetry in the distribution of values
● number and locations of peaks
● presence of any outliers (values far from the rest of the data)

Dotplots

A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale. When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically.
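A rough text rendering shows how the stacking works. The sketch assumes hypothetical single-digit integer observations so that the columns line up; the function name `text_dotplot` is likewise an assumption.

```python
from collections import Counter

def text_dotplot(values):
    """Render a dotplot as text: one dot per observation, stacked vertically."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)   # horizontal measurement scale
    rows = []
    # Build the picture from the tallest stack down to the first level.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("." if counts[x] >= level else " " for x in scale)
        rows.append(row.rstrip())
    rows.append(" ".join(str(x) for x in scale))
    return "\n".join(rows)

print(text_dotplot([1, 2, 2, 3, 3, 3, 5]))   # hypothetical data
```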

Histograms

A numerical variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence (one in which there is a first number, a second number, and so on). A numerical variable is continuous if its possible values consist of an entire interval on the number line.

A discrete variable x almost always results from counting.

Continuous variables arise from making measurements.

The frequency of any particular x value is the number of times that value occurs in the data set. The relative frequency of a value is the fraction or proportion of times the value occurs:

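$$\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$$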

Multiplying a relative frequency by 100 gives a percentage.

The relative frequencies, or percentages, are usually of more interest than the frequencies themselves.

In theory, the relative frequencies should sum to 1, but in practice the sum may differ slightly from 1 because of rounding. A frequency distribution is a tabulation of the frequencies and/or relative frequencies.
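A frequency distribution is easy to tabulate in code. The sketch below counts each distinct value and divides by the number of observations; the data and the name `frequency_distribution` are assumptions for illustration.

```python
from collections import Counter

def frequency_distribution(data):
    """Map each distinct value to (frequency, relative frequency)."""
    n = len(data)
    counts = Counter(data)
    return {x: (f, f / n) for x, f in sorted(counts.items())}

data = [0, 1, 1, 2, 2, 2, 3, 3, 1, 0]   # hypothetical observations on a count variable
dist = frequency_distribution(data)
for x, (f, rel) in dist.items():
    print(f"x = {x}: frequency = {f}, relative frequency = {rel:.2f}, percentage = {100 * rel:.0f}%")
print("sum of relative frequencies:", sum(rel for _, rel in dist.values()))
```

Here the relative frequencies sum to 1 up to floating-point error; if each one were first rounded to a few decimals, the sum could drift slightly from 1.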

Constructing a Histogram for Discrete Data:
First, determine the frequency and relative frequency of each x value.

Then mark possible x values on a horizontal scale.

Above each value, draw a rectangle whose height is the relative frequency (or alternatively, the frequency) of that value.

The rectangles should have equal widths.

This construction ensures that the area of each rectangle is proportional to the relative frequency of the value.

Constructing a histogram for continuous data (measurements) entails subdividing the measurement axis into a suitable number of class intervals or classes, such that each observation is contained in exactly one class.

One potential difficulty is that occasionally an observation lies on a class boundary and therefore does not fall in exactly one interval, for example, 29.0.

One way to deal with this problem is to use boundaries such as 27.55, 28.05, …, 31.55; adding a hundredths digit to the class boundaries prevents observations from falling on a boundary. Another approach is to place an observation that lies on a boundary in the interval to the right of that boundary.
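The second convention (a boundary value goes to the class on its right) amounts to treating every class as closed on the left and open on the right. A minimal sketch using the standard `bisect` module follows; the boundary values and the function name `class_index` are assumptions chosen for the example.

```python
from bisect import bisect_right

def class_index(x, boundaries):
    """Index i of the class [boundaries[i], boundaries[i+1]) that contains x.

    A value lying exactly on a boundary is assigned to the class on its right.
    """
    if not boundaries[0] <= x < boundaries[-1]:
        raise ValueError(f"{x} lies outside the class intervals")
    return bisect_right(boundaries, x) - 1

# Hypothetical class boundaries with width 0.5.
boundaries = [27.5, 28.0, 28.5, 29.0, 29.5, 30.0]
i = class_index(29.0, boundaries)
print(f"29.0 falls in [{boundaries[i]}, {boundaries[i + 1]})")
```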

Constructing a Histogram for Continuous Data: Equal Class Widths

Determine the frequency and relative frequency for each class.

Mark the class boundaries on a horizontal measurement axis.

Above each class interval, draw a rectangle whose height is the corresponding relative frequency (or frequency).
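The equal-width procedure can be sketched as a table of class frequencies with a crude text bar for each rectangle. The data, the starting point, the class width, and the name `equal_width_histogram` are all assumptions for illustration; classes are taken as closed on the left and open on the right.

```python
def equal_width_histogram(data, start, width, k):
    """Return rows (lower, upper, frequency, relative frequency) for k equal-width classes."""
    counts = [0] * k
    for x in data:
        i = int((x - start) // width)   # class [start + i*width, start + (i+1)*width)
        if not 0 <= i < k:
            raise ValueError(f"{x} lies outside the class intervals")
        counts[i] += 1
    n = len(data)
    return [(start + i * width, start + (i + 1) * width, c, c / n)
            for i, c in enumerate(counts)]

data = [3, 12, 15, 27, 28, 29, 31, 44, 10, 20]   # hypothetical measurements
table = equal_width_histogram(data, start=0, width=10, k=5)
for lo, hi, f, rel in table:
    print(f"[{lo:>2}, {hi:>2})  {'*' * f:<5} {rel:.2f}")
```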

Equal-width classes may not be a sensible choice if there are some regions of the measurement scale that have a high concentration of data values and other parts where data is quite sparse.

If a large number of equal-width classes are used, many classes will have zero frequency. A sound choice is to use a few wider intervals near extreme observations and narrower intervals in the region of high concentration.

Constructing a Histogram for Continuous Data: Unequal Class Widths

After determining frequencies and relative frequencies, calculate the height of each rectangle using the formula

$$\text{rectangle height} = \frac{\text{relative frequency of the class}}{\text{class width}}$$

The resulting rectangle heights are usually called densities, and the vertical scale is the density scale. This prescription will also work when class widths are equal.
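A short sketch of the density calculation follows. Dividing each relative frequency by its class width makes the area of each rectangle equal to the relative frequency of the class, so the total area is 1. The class boundaries, frequencies, and the name `densities` are hypothetical.

```python
def densities(boundaries, frequencies):
    """Return rows (lower, upper, relative frequency, density) for each class."""
    n = sum(frequencies)
    rows = []
    for lo, hi, f in zip(boundaries, boundaries[1:], frequencies):
        rel = f / n
        rows.append((lo, hi, rel, rel / (hi - lo)))   # height = rel. freq. / class width
    return rows

# Hypothetical classes: narrow where data are concentrated, wide near the extremes.
boundaries = [0, 2, 4, 6, 10, 20]
frequencies = [2, 8, 6, 3, 1]
rows = densities(boundaries, frequencies)
for lo, hi, rel, d in rows:
    print(f"[{lo:>2}, {hi:>2})  relative frequency = {rel:.2f}  density = {d:.4f}")
print("total area:", sum(d * (hi - lo) for lo, hi, _, d in rows))
```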

Histogram Shapes

A unimodal histogram is one that rises to a single peak and then declines.

A bimodal histogram has two different peaks.

Bimodality can occur when the data set consists of observations on two quite different kinds of individuals or objects.

Only if the two separate histograms are far apart relative to their spreads will bimodality occur in the histogram of combined data.

The number of peaks may well depend on the choice of class intervals, particularly with a small number of observations. The larger the number of classes, the more likely it is that bimodality or multimodality will manifest itself.
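To see how the class choice can change the apparent number of peaks, the sketch below bins the same small data set with two different class widths and counts the local maxima of the class frequencies. The data, the definition of a peak used here (a class, or run of equal classes, strictly taller than its neighbours), and the function names are assumptions for illustration.

```python
def class_counts(data, start, width, k):
    """Frequencies for k equal-width classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for x in data:
        counts[(x - start) // width] += 1
    return counts

def count_peaks(counts):
    """Count local maxima, treating a run of equal heights as one class."""
    collapsed = [c for i, c in enumerate(counts) if i == 0 or c != counts[i - 1]]
    padded = [0] + collapsed + [0]        # nothing lies beyond the end classes
    return sum(1 for i in range(1, len(padded) - 1)
               if padded[i - 1] < padded[i] > padded[i + 1])

data = [1, 2, 2, 3, 3, 3, 4, 6, 7, 7, 8]   # hypothetical integer observations
wide = class_counts(data, start=0, width=4, k=3)
narrow = class_counts(data, start=0, width=2, k=6)
print(wide, "peaks:", count_peaks(wide))
print(narrow, "peaks:", count_peaks(narrow))
```

With three wide classes the display looks unimodal, while six narrower classes reveal a second, smaller peak.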

A histogram is symmetric if the left half is a mirror image of the right half. A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail and negatively skewed if the stretching is to the left.

Qualitative Data

Both a frequency distribution and a histogram can be constructed when the data set is qualitative (categorical) in nature. In some cases, there will be a natural ordering of classes (for example, freshmen, sophomores, juniors, seniors, graduate students), whereas in other cases the order will be arbitrary (for example, Catholic, Jewish, Protestant, and the like). With such categorical data, the intervals above which rectangles are constructed should have equal width.

Multivariate Data

Multivariate data is generally rather difficult to describe visually. Several methods for doing so appear later in the book, notably scatterplots for bivariate numerical data.
