- Chapter 1 Introduction and Data Collection
- Learning Objectives
- How Statistics is used in business
- The sources of data used in business
- The types of data used in business
- The basics of Microsoft Excel
- The basics of STATA
- What is statistics?
- A branch of mathematics taking and transforming numbers into useful information for decision makers
- Methods for processing & analyzing numbers
- Methods for helping reduce the uncertainty inherent in decision making
- Types of Statistics
- Statistics
- The branch of mathematics that transforms data into useful information for decision makers.
- Descriptive Statistics
- Collect data
- e.g., Survey
- Present data
- e.g., Tables and graphs
- Characterize data
- e.g., Sample mean = $\frac{\sum X_i}{n}$
- Inferential Statistics
- Estimation
- e.g., Estimate the population mean weight using the sample mean weight
- Hypothesis testing
- e.g., Test the claim that the population mean weight is 120 pounds
- Basic Vocabulary of Statistics⭐⭐⭐⭐⭐
- VARIABLE
- A variable is a characteristic of an item or individual.
- DATA
- Data are the different values associated with a variable.
- OPERATIONAL DEFINITIONS
- Data values are meaningless unless their variables have operational definitions, universally accepted meanings that are clear to all associated with an analysis.
- POPULATION
- A population consists of all the items or individuals about which you want to draw a conclusion.
- SAMPLE
- A sample is the portion of a population selected for analysis.
- STATISTIC
- A statistic is a numerical measure that describes a characteristic of a sample.
- PARAMETER
- A parameter is a numerical measure that describes a characteristic of a population.
- Sources of data fall into four categories
- Data distributed by an organization or an individual
- A designed experiment
- A survey
- An observational study
- Types of Variables
- Categorical (qualitative) variables have values that can only be placed into categories, such as “yes” and “no.”
- Numerical (quantitative) variables have values that represent quantities.
- Types of Data
- Two Types
- Chapter 2 Presenting Data in Tables and Charts
- Learning Objectives
- To develop tables and charts for categorical data
- To develop tables and charts for numerical data
- The principles of properly presenting graphs
- Categorical Data Are Summarized By Tables & Graphs
- Categorical variables
- Organizing Categorical Data:
- Summary Table
- A summary table indicates the frequency, amount, or percentage of items in a set of categories so that you can see differences between categories.
- Bar and Pie Charts
- Bar charts and Pie charts are often used for categorical data
- Length of bar or size of pie slice shows the frequency or percentage for each category
- Bar Chart
- In a bar chart, a bar shows each category, the length of which represents the amount, frequency or percentage of values falling into a category.
- Pie Chart
- The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.
- Pareto Chart
- Used to portray categorical data (nominal scale)
- A vertical bar chart, where categories are shown in descending order of frequency
- A cumulative polygon is shown in the same graph
- Used to separate the “vital few” from the “trivial many”
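The ordering and cumulative percentages behind a Pareto chart can be sketched as follows. This is a minimal illustration, not part of the original notes: the `complaints` counts are hypothetical data and `pareto_table` is a helper name introduced only for this example.

```python
# Hypothetical complaint counts by category (illustrative data only).
complaints = {"Billing": 12, "Delivery": 45, "Quality": 30, "Other": 3, "Service": 10}


def pareto_table(counts):
    """Return (category, frequency, cumulative %) rows in descending frequency order."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, freq in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += freq  # running total feeds the cumulative polygon
        rows.append((category, freq, 100 * running / total))
    return rows


pareto_rows = pareto_table(complaints)
for category, freq, cum_pct in pareto_rows:
    print(f"{category:<10}{freq:>5}{cum_pct:>8.1f}%")
```

In this made-up data set the first two categories already account for 75% of all complaints, which is the kind of "vital few" pattern the chart is meant to expose.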
- Tables and Charts for Numerical Data
- types
- Organizing Numerical Data:
- Ordered Array
- An ordered array is a sequence of data, in rank order, from the smallest value to the largest value.
- Shows range (minimum value to maximum value)
- May help identify outliers (unusual observations)
- Stem-and-Leaf Display @❓ (resolved)
- A simple way to see how the data are distributed and where concentrations of data exist
- METHOD:
- Separate the sorted data series into leading digits (the stems) and the trailing digits (the leaves)
- A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.
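The method above can be sketched in a few lines. The example reuses the 20 daily high temperatures from the frequency distribution example in these notes. The helper name `stem_and_leaf` is an assumption for illustration, and the sketch assumes non-negative integers with the tens digit as the stem.

```python
from collections import defaultdict

# Daily high temperatures from the insulation example in these notes.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit)."""
    display = defaultdict(list)
    for v in sorted(values):  # sort first so the leaves come out in order
        display[v // 10].append(v % 10)
    return dict(display)


display = stem_and_leaf(temps)
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```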
- Frequency Distribution ⭐⭐⭐ @❓ : Histogram & Polygon (ways of presenting the distribution)
- The frequency distribution is a summary table in which the data are arranged into numerically ordered classes.
- You must give attention to selecting the appropriate number of class groupings for the table, determining a suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping.
- In general, a frequency distribution should have at least 5 but no more than 15 classes.
- The width of a class interval = (Highest value − Lowest value) / Number of classes
- Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
- 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
- Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
- Find range: 58 - 12 = 46
- Select number of classes: 5 (usually between 5 and 15)
- Compute class interval (width): 10 (46/5 then round up)
- Determine class boundaries (limits):
- Class 1: 10 to less than 20
- Class 2: 20 to less than 30
- Class 3: 30 to less than 40
- Class 4: 40 to less than 50
- Class 5: 50 to less than 60
- Compute class midpoints: 15, 25, 35, 45, 55
- Count observations & assign to classes
- Frequency Distribution & Cumulative Frequency
- Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
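The counting step for this example can be sketched as below, using the class boundaries chosen in the steps above (five classes of width 10, starting at 10). The variable names are illustrative only. The percentage column is included because it is what you would compare across groups of different sizes.

```python
# Daily high temperatures from the example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

lower, width, n_classes = 10, 10, 5  # boundaries chosen in the worked example

# Each class is "lo to less than hi", so no value can fall into two classes.
boundaries = [(lower + i * width, lower + (i + 1) * width) for i in range(n_classes)]
frequencies = [sum(lo <= t < hi for t in temps) for lo, hi in boundaries]
percentages = [100 * f / len(temps) for f in frequencies]

cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

for (lo, hi), f, pct, cum in zip(boundaries, frequencies, percentages, cumulative):
    print(f"{lo} to less than {hi}: freq={f}, pct={pct:.0f}%, cumulative={cum}")
```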
- Why Use a Frequency Distribution?
- It condenses the raw data into a more useful form
- It allows for a quick visual interpretation of the data
- It enables the determination of the major characteristics of the data set including where the data are concentrated / clustered
- Some Tips
- Different class boundaries may provide different pictures for the same data (especially for smaller data sets)
- Shifts in data concentration may show up when different class boundaries are chosen
- As the size of the data set increases, the impact of alterations in the selection of class boundaries is greatly reduced (that is, when the sample is large enough)
- When comparing two or more groups with different sample sizes, you must use either a relative frequency or a percentage distribution
- The Histogram
- A vertical bar chart of the data in a frequency distribution is called a histogram.
- In a histogram there are no gaps between adjacent bars.
- The class boundaries (or class midpoints) are shown on the horizontal axis.
- The vertical axis is either frequency, relative frequency, or percentage.
- The height of each bar represents the frequency, relative frequency, or percentage
- The Polygon
- A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.
- Useful when there are two or more groups to compare.
- Graphing Cumulative Frequencies: The Ogive (Cumulative % Polygon)
- Scatter Plots
- Scatter plots are used for numerical data consisting of paired observations taken from two numerical variables
- One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
- Scatter plots are used to examine possible relationships between two numerical variables
- Example
- Principles of Excellent Graphs
- The graph should not distort the data.
- The graph should not contain unnecessary adornments (sometimes referred to as chart junk).
- The scale on the vertical axis should begin at zero.
- All axes should be properly labeled.
- The graph should contain a title.
- The simplest possible graph should be used for a given set of data.
- Graphical Errors:
- Chart Junk
- No Relative Basis
- Compressing the Vertical Axis
- No Zero Point on the Vertical Axis
- Chapter 3 Numerical Descriptive Measures
- Learning Objectives
- • To describe the properties of central tendency, variation, and shape in numerical data
- CENTRAL TENDENCY
- The central tendency is the extent to which the data values group around a typical or central value.
- VARIATION
- The variation is the amount of dispersion, or scattering, of values away from a central value.
- SHAPE
- The shape is the pattern of the distribution of values from the lowest value to the highest value.
- • To construct and interpret a boxplot
- • To compute descriptive summary measures for a population
- • To compute the covariance and the coefficient of correlation
- 3.1 Central Tendency 【Sample】
- The Mean
- The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean is the only common measure in which all the values play an equal role.
- Sample Mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$, where $n$ is the sample size and $X_i$ is the $i$-th value
- The Median
- The median is the middle value in an ordered array of data that has been ranked from smallest to largest. Half the values are smaller than or equal to the median, and half the values are larger than or equal to the median
- Rules:
- • Rule 1 Odd number of values,
- the median is the measurement associated with the middle-ranked value.
- • Rule 2 Even number of values,
- the median is the measurement associated with the average of the two middle-ranked values.
- Median = the $\frac{n+1}{2}$ ranked value
- The Mode
- The mode is the value in a set of data that appears most frequently.
- Like the median and unlike the mean, extreme values do not affect the mode.
- Often, there is no mode or there are several modes in a set of data.
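The three measures can be checked with Python's standard `statistics` module. This is a small sketch that reuses the 20 temperatures from the Chapter 2 example. That data set has two modes, which illustrates the last point above.

```python
import statistics

# Daily high temperatures from the Chapter 2 example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

mean = statistics.mean(temps)        # sum of the values divided by n
median = statistics.median(temps)    # n is even, so the two middle values are averaged
modes = statistics.multimode(temps)  # every value tied for the highest frequency

print(mean, median, modes)
```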
- 3.2 Variation and Shape 【Sample】
- The Range
- The range is the simplest numerical descriptive measure of variation in a set of data.
- Range $= X_{\text{largest}} - X_{\text{smallest}}$
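- The Variance and the Standard Deviation
- Variance
- These statistics measure the “average” scatter around the mean: how larger values fluctuate above it and how smaller values fluctuate below it (the fluctuation of the data relative to the mean). @❓
- Sample variance: $S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$; sample standard deviation: $S = \sqrt{S^2}$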
- The Coefficient of Variation
- expressed as a percentage rather than in terms of the units of the particular data
- measures the scatter in the data relative to the mean (the dispersion of the data relative to the mean) @❓
- $CV = \left(\dfrac{S}{\bar{X}}\right) \times 100\%$
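As a rough sketch, the sample measures of variation in this section can be computed as follows, again on the 20 temperatures from the Chapter 2 example. `statistics.variance` and `statistics.stdev` use the $n - 1$ divisor, which matches the sample definitions. The variable names are illustrative only.

```python
import statistics

# Daily high temperatures from the Chapter 2 example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

data_range = max(temps) - min(temps)        # largest value minus smallest value
sample_var = statistics.variance(temps)     # divides by n - 1
sample_sd = statistics.stdev(temps)         # square root of the sample variance
cv = sample_sd / statistics.mean(temps) * 100  # scatter relative to the mean, in percent

print(f"range={data_range}, variance={sample_var:.2f}, sd={sample_sd:.2f}, CV={cv:.1f}%")
```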
- Z Scores
- An extreme value or outlier is a value located far away from the mean. The Z score, which is the difference between the value and the mean, divided by the standard deviation, is useful in identifying outliers. Values located far away from the mean will have either very small (negative) Z scores or very large (positive) Z scores.
- $Z = \dfrac{X - \bar{X}}{S}$
- Tips
- As a general rule, a value is not considered an outlier if its Z score lies between -3.0 and +3.0.
- A value is considered an outlier if its Z score is less than -3.0 or greater than +3.0.
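A minimal sketch of computing Z scores and flagging candidate outliers is shown below, on the 20 temperatures from the Chapter 2 example. The `CUTOFF` of 3.0 is an assumption based on the common rule of thumb, and `z_scores` is a helper name introduced only for this example. Adjust the cutoff to whatever convention your course uses.

```python
import statistics

# Daily high temperatures from the Chapter 2 example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def z_scores(values):
    """Return (value - sample mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


CUTOFF = 3.0  # assumption: common rule-of-thumb threshold for |Z|

z = z_scores(temps)
outliers = [v for v, score in zip(temps, z) if abs(score) > CUTOFF]
print(outliers)
```

With this data set no value is flagged, since even the largest temperature is only about two standard deviations above the mean.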
- Shape
- Shape is the pattern of the distribution of data values throughout the entire range of all the values (how the values are distributed over their range).
- A distribution is either symmetrical or skewed.
- In a symmetrical distribution, the values below the mean are distributed in exactly the same way as the values above the mean. In this case, the low and high values balance each other out.
- In a skewed distribution, the values are not symmetrical around the mean. This skewness results in an imbalance of low values or high values.
- Shape also can influence the relationship of the mean to the median. In most cases:
- • Mean < median: negative, or left-skewed
- • Mean = median: symmetric, or zero skewness
- • Mean > median: positive, or right-skewed
- 3.3 Exploring Numerical Data 【Sample】
- This section covers quartiles, the five-number summary, and the construction of a boxplot.
- Quartiles
- Quartiles split a set of data into four equal parts
- The first quartile, Q1, divides the smallest 25.0% of the values from the other 75.0% that are larger.
- The second quartile, Q2, is the median—50.0% of the values are smaller than or equal to the median and 50.0% are larger than or equal to the median.
- The third quartile, Q3, divides the smallest 75.0% of the values from the largest 25.0%.
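- One common convention: $Q_1$ = the $\frac{n+1}{4}$ ranked value; $Q_3$ = the $\frac{3(n+1)}{4}$ ranked value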
- The Five-Number Summary
- The Boxplot
- based on the five-number summary
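The five-number summary (smallest value, Q1, median, Q3, largest value) that a boxplot is drawn from can be sketched as follows, on the 20 temperatures from the Chapter 2 example. One caveat: `statistics.quantiles` with `method="exclusive"` places quartiles at the $(n+1)p$ position and interpolates linearly between neighbors, so its results can differ slightly from textbook conventions that round the rank instead.

```python
import statistics

# Daily high temperatures from the Chapter 2 example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

# Three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(temps, n=4, method="exclusive")

five_number = (min(temps), q1, q2, q3, max(temps))
print(five_number)
```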
- 3.4 Numerical Descriptive Measures for a Population
- The Population Mean
- $\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$, where $N$ is the population size
- The Population Variance and Standard Deviation
- $\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$; $\sigma = \sqrt{\sigma^2}$
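To see how the population measures differ from the sample ones, the sketch below treats the 20 temperatures from the Chapter 2 example as if they were an entire population. That treatment is an assumption made purely for illustration. `statistics.pvariance` and `statistics.pstdev` divide by $N$, while `statistics.variance` divides by $n - 1$.

```python
import statistics

# Treated here as a complete population (an assumption for illustration only).
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

pop_mean = statistics.mean(temps)      # same arithmetic as the sample mean
pop_var = statistics.pvariance(temps)  # divides by N
pop_sd = statistics.pstdev(temps)      # square root of the population variance
sample_var = statistics.variance(temps)  # divides by n - 1, so it is slightly larger

print(pop_mean, pop_var, pop_sd, sample_var)
```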
- The Empirical Rule
- In most data sets, a large portion of the values tend to cluster somewhere near the median.
- In right-skewed data sets, this clustering occurs to the left of the mean—that is, at a value less than the mean.
- In left-skewed data sets, the values tend to cluster to the right of the mean—that is, greater than the mean.
- In symmetric data sets, where the median and mean are the same, the values often tend to cluster around the median and mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions:
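- Approximately 68% of the values are within ±1 standard deviation of the mean
- Approximately 95% of the values are within ±2 standard deviations of the mean
- Approximately 99.7% of the values are within ±3 standard deviations of the mean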
- 3.5 The Covariance and the Coefficient of Correlation
- The Covariance【Sample】
- The covariance measures the strength of the linear relationship between two numerical variables (X and Y).
- $\operatorname{cov}(X, Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
- The Coefficient of Correlation 【Population】
- When dealing with population data for two numerical variables, the Greek letter ρ (rho) is used as the symbol for the coefficient of correlation.
- Three different types of association between two variables.
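- Sample coefficient of correlation: $r = \dfrac{\operatorname{cov}(X, Y)}{S_X S_Y}$ @❓ : is the population value inferred from the sample?
- r lies in the interval [-1, 1]: its maximum value is 1 and its minimum value is -1.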
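A minimal sketch of the sample covariance and the sample coefficient of correlation, written out by hand so that each term is visible. The paired values `x` and `y` are hypothetical data, and the function names are introduced only for this example.

```python
import math

# Hypothetical paired observations (illustrative data only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sample_covariance(xs, ys):
    """Sum of the products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))  # cov(X, X) is the sample variance of X
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)


cov_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
print(cov_xy, round(r, 4))
```

Dividing by the two standard deviations removes the units, which is why r always falls between -1 and 1 while the covariance has no fixed scale.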