【01】C1-C3

Latest recommended article published on 2024-02-19 18:11:14

JessicaGD

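Article tags: c1

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/JessicaGD/article/details/122306194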

- Chapter 1 Introduction and Data Collection
  - Learning Objectives
    - How Statistics is used in business
    - The sources of data used in business
    - The types of data used in business
    - The basics of Microsoft Excel
    - The basics of STATA
  - What is statistics?
    - A branch of mathematics taking and transforming numbers into useful information for decision makers
    - Methods for processing & analyzing numbers
    - Methods for helping reduce the uncertainty inherent in decision making
  - Types of Statistics
    - Statistics
      - The branch of mathematics that transforms data into useful information for decision makers.
    - Descriptive Statistics
      - Collect data
        - e.g., Survey
      - Present data
        - e.g., Tables and graphs
      - Characterize data
        - e.g., Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
    - Inferential Statistics
      - Estimation
        - e.g., Estimate the population mean weight using the sample mean weight
      - Hypothesis testing
        - e.g., Test the claim that the population mean weight is 120 pounds
  - Basic Vocabulary of Statistics⭐⭐⭐⭐⭐
    - VARIABLE
      - A variable is a characteristic of an item or individual.
    - DATA
      - Data are the different values associated with a variable.
    - OPERATIONAL DEFINITIONS
      - Data values are meaningless unless their variables have operational definitions, universally accepted meanings that are clear to all associated with an analysis.
    - POPULATION
      - A population consists of all the items or individuals about which you want to draw a conclusion.
    - SAMPLE
      - A sample is the portion of a population selected for analysis.
    - STATISTIC
      - A statistic is a numerical measure that describes a characteristic of a sample.
    - PARAMETER
      - A parameter is a numerical measure that describes a characteristic of a population.
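The difference between a parameter and a statistic can be illustrated with a minimal sketch. The weights below are made-up values, and the way the sample is picked (the first four items) is only an assumption for illustration, not a recommended sampling method.

```python
import statistics

# Hypothetical population: weights (in pounds) of all 8 items of interest.
population = [118, 121, 125, 119, 130, 122, 117, 124]

# A sample is the portion of the population selected for analysis
# (here simply the first four items, chosen only for illustration).
sample = population[:4]

# Parameter: a numerical measure that describes the population.
population_mean = statistics.mean(population)

# Statistic: a numerical measure that describes the sample.
sample_mean = statistics.mean(sample)

print("population mean (parameter):", population_mean)
print("sample mean (statistic):", sample_mean)
```

The two values generally differ, which is why inferential statistics is needed to reason from the statistic back to the parameter.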
  - Sources of data fall into four categories
    - Data distributed by an organization or an individual
    - A designed experiment
    - A survey
    - An observational study
  - Types of Variables
    - Categorical (qualitative) variables have values that can only be placed into categories, such as “yes” and “no.”
    - Numerical (quantitative) variables have values that represent quantities.
  - Types of Data
    - Two Types
- Chapter 2 Presenting Data in Tables and Charts
  - Learning Objectives
    - To develop tables and charts for categorical data
    - To develop tables and charts for numerical data
    - The principles of properly presenting graphs
  - Categorical Data Are Summarized By Tables & Graphs
    - Categorical variables
  - Organizing Categorical Data:
    - Summary Table
      - A summary table indicates the frequency, amount, or percentage of items in a set of categories so that you can see differences between categories.
    - Bar and Pie Charts
      - Bar charts and Pie charts are often used for categorical data
      - Length of bar or size of pie slice shows the frequency or percentage for each category
      - Bar Chart
        In a bar chart, a bar shows each category, the length of which represents the amount, frequency or percentage of values falling into a category.
      - Pie Chart
        The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.
      - Pareto Chart
        Used to portray categorical data (nominal scale)
        
        A vertical bar chart, where categories are shown in descending order of frequency
        
        A cumulative polygon is shown in the same graph
        
        Used to separate the “vital few” from the “trivial many”
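The numbers behind a Pareto chart (categories in descending order of frequency plus a cumulative percentage) can be sketched as follows. The complaint categories and counts are hypothetical, and `pareto_table` is a helper name invented for this example.

```python
# Hypothetical category counts (made-up complaint categories).
counts = {
    "Late delivery": 45,
    "Damaged item": 30,
    "Wrong item": 15,
    "Billing error": 7,
    "Other": 3,
}


def pareto_table(counts):
    """Return rows of (category, frequency, percentage, cumulative percentage),
    sorted in descending order of frequency."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, freq in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += freq
        rows.append((category, freq, 100 * freq / total, 100 * cumulative / total))
    return rows


for category, freq, pct, cum_pct in pareto_table(counts):
    print(f"{category:15s} {freq:3d} {pct:6.1f}% {cum_pct:6.1f}%")
```

In this made-up data the first two categories already account for 75% of all complaints, which is the "vital few" pattern the chart is meant to expose.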
  - Tables and Charts for Numerical Data
    - types
  - Organizing Numerical Data:
    - Ordered Array
      - An ordered array is a sequence of data, in rank order, from the smallest value to the largest value.
        Shows range (minimum value to maximum value)
        
        May help identify outliers (unusual observations)
    - Stem-and-Leaf Display @❓ (resolved)
      - A simple way to see how the data are distributed and where concentrations of data exist
      - METHOD:
        Separate the sorted data series into leading digits (the stems) and the trailing digits (the leaves)
      - A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.
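The stem-and-leaf method can be sketched in a few lines. The data are the 20 daily high temperatures from the frequency distribution example in this chapter; the helper name `stem_and_leaf` is invented here, and the sketch assumes non-negative integers whose last digit is the leaf.

```python
from collections import defaultdict

# Daily high temperatures from the frequency distribution example in this chapter.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def stem_and_leaf(values):
    """Group sorted values by leading digits (stem), keeping the last digit as the leaf.

    Assumes non-negative integers.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)


display = stem_and_leaf(temps)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Each printed row is one stem with its leaves branching out to the right, so the longest row shows where the data are concentrated.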
    - Frequency Distribution ⭐⭐⭐ @❓: Histogram & Polygon (ways of presenting the distribution)
      - The frequency distribution is a summary table in which the data are arranged into numerically ordered classes.
      - You must give attention to selecting the appropriate number of class groupings for the table, determining a suitable width of a class grouping, and establishing the boundaries of each class grouping to avoid overlapping.
      - In general, a frequency distribution should have at least 5 but no more than 15 classes.
      - The width of a class interval = (Highest value - Lowest value) / Number of classes
        Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:
        24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
        
        Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
        
        Find range: 58 - 12 = 46
        
        Select number of classes: 5 (usually between 5 and 15)
        
        Compute class interval (width): 10 (46/5 then round up)
        
        Determine class boundaries (limits):
        Class 1: 10 to less than 20
        
        Class 2: 20 to less than 30
        
        Class 3: 30 to less than 40
        
        Class 4: 40 to less than 50
        
        Class 5: 50 to less than 60
        
        Compute class midpoints: 15, 25, 35, 45, 55
        
        Count observations & assign to classes
        Frequency Distribution & Cumulative Frequency
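The steps of the temperature example can be reproduced with a short sketch. The helper name `frequency_distribution` is invented for this illustration, and starting the first class at the minimum rounded down to a multiple of the width is an assumption that happens to match the boundaries used above; for other data the last class may need to be extended to cover the maximum.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def frequency_distribution(values, num_classes):
    """Return rows of (lower, upper, midpoint, frequency, cumulative frequency).

    Classes are "lower to less than upper", so they never overlap.
    """
    # Width = range / number of classes, rounded up (46 / 5 = 9.2 -> 10).
    width = math.ceil((max(values) - min(values)) / num_classes)
    # Assumption: start at the minimum rounded down to a multiple of the width.
    start = (min(values) // width) * width
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)
        cumulative += freq
        rows.append((lower, upper, (lower + upper) / 2, freq, cumulative))
    return rows


table = frequency_distribution(temps, 5)
for lower, upper, midpoint, freq, cum in table:
    print(f"{lower} to less than {upper}: midpoint {midpoint}, frequency {freq}, cumulative {cum}")
```

For these data the class frequencies come out as 3, 6, 5, 4 and 2, giving cumulative frequencies of 3, 9, 14, 18 and 20.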
      - Why Use a Frequency Distribution?
        It condenses the raw data into a more useful form
        
        It allows for a quick visual interpretation of the data
        It enables the determination of the major characteristics of the data set including where the data are concentrated / clustered
      - Some Tips
        Different class boundaries may provide different pictures for the same data (especially for smaller data sets)
        
        Shifts in data concentration may show up when different class boundaries are chosen
        
        As the size of the data set increases, the impact of alterations in the selection of class boundaries is greatly reduced (that is, when the sample is large enough)
        
        When comparing two or more groups with different sample sizes, you must use either a relative frequency or a percentage distribution
    - The Histogram
      - A vertical bar chart of the data in a frequency distribution is called a histogram.
        
        In a histogram there are no gaps between adjacent bars.
        
        The class boundaries (or class midpoints) are shown on the horizontal axis.
        
        The vertical axis is either frequency, relative frequency, or percentage.
        The height of each bar represents the frequency, relative frequency, or percentage
    - The Polygon
      - A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.
      - Useful when there are two or more groups to compare.
      - Graphing Cumulative Frequencies: The Ogive (Cumulative % Polygon)
      - Scatter Plots
        Scatter plots are used for numerical data consisting of paired observations taken from two numerical variables
        
        One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
        
        Scatter plots are used to examine possible relationships between two numerical variables
        
        Example
  - Principles of Excellent Graphs
    - The graph should not distort the data.
    - The graph should not contain unnecessary adornments (sometimes referred to as chart junk).
    - The scale on the vertical axis should begin at zero.
    - All axes should be properly labeled.
    - The graph should contain a title.
    - The simplest possible graph should be used for a given set of data.
  - Graphical Errors:
    - Chart Junk
    - No Relative Basis
    - Compressing the Vertical Axis
    - No Zero Point on the Vertical Axis
- Chapter 3 Numerical Descriptive Measures
  - Learning Objectives
    - To describe the properties of central tendency, variation, and shape in numerical data
      - CENTRAL TENDENCY
        The central tendency is the extent to which the data values group around a typical or central value.
      - VARIATION
        The variation is the amount of dispersion, or scattering, of values away from a central value.
      - SHAPE
        The shape is the pattern of the distribution of values from the lowest value to the highest value.
    - To construct and interpret a boxplot
    - To compute descriptive summary measures for a population
    - To compute the covariance and the coefficient of correlation
  - 3.1 Central Tendency 【Sample】
    - The Mean
      - The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean is the only common measure in which all the values play an equal role.
      - Sample mean: $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$
    - The Median
      - The median is the middle value in an ordered array of data that has been ranked from smallest to largest. Half the values are smaller than or equal to the median, and half the values are larger than or equal to the median
      - Rules:
      - Rule 1: Odd number of values,
        the median is the measurement associated with the middle-ranked value.
      - Rule 2: Even number of values,
        the median is the average of the measurements associated with the two middle-ranked values.
      - Median $= \frac{n + 1}{2}$ ranked value
    - The Mode
      - The mode is the value in a set of data that appears most frequently.
      - Like the median and unlike the mean, extreme values do not affect the mode.
      - Often, there is no mode or there are several modes in a set of data.
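The three measures of central tendency can be computed with Python's standard `statistics` module. This is only a sketch; the data are the 20 daily high temperatures from the Chapter 2 frequency distribution example.

```python
import statistics

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

mean = statistics.mean(temps)        # sum of the values divided by n
median = statistics.median(temps)    # n is even, so the two middle values are averaged
modes = statistics.multimode(temps)  # every value tied for the highest frequency

print("mean:", mean)
print("median:", median)
print("modes:", modes)
```

Here the mean is 32.4, the median is 31.0, and there are two modes (24 and 27), which shows that a data set can have several modes.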
  - 3.2 Variation and Shape 【Sample】
    - The Range
      - The range is the simplest numerical descriptive measure of variation in a set of data.
      - Range $= X_{\text{largest}} - X_{\text{smallest}}$
    - The Variance and the Standard Deviation
      - Variance
        These statistics measure the “average” scatter around the mean: how larger values fluctuate above it and how smaller values fluctuate below it (that is, how the data fluctuate relative to the mean). @❓
        
        Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$; sample standard deviation: $S = \sqrt{S^2}$
      - The Coefficient of Variation
        Expressed as a percentage rather than in terms of the units of the particular data
        
        Measures the scatter in the data relative to the mean @❓
        
        $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
    - Z Scores
      - An extreme value or outlier is a value located far away from the mean. The Z score, which is the difference between the value and the mean, divided by the standard deviation, is useful in identifying outliers. Values located far away from the mean will have either very small (negative) Z scores or very large (positive) Z scores.
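      - $Z = \frac{X - \bar{X}}{S}$
        
        Tips
        As a general rule, a value is not considered an outlier if its Z score is between -3.0 and +3.0.
        
        A value is considered an outlier if its Z score is less than -3.0 or greater than +3.0.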
    - Shape
      - Shape is the pattern of the distribution of data values throughout the entire range of all the values (that is, how the values are distributed over their range).
      - A distribution is either symmetrical or skewed.
        In a symmetrical distribution, the values below the mean are distributed in exactly the same way as the values above the mean. In this case, the low and high values balance each other out.
        
        In a skewed distribution, the values are not symmetrical around the mean. This skewness results in an imbalance of low values or high values.
      - Shape also can influence the relationship of the mean to the median. In most cases:
        • Mean < median: negative, or left-skewed
        
        • Mean = median: symmetric, or zero skewness
        
        • Mean > median: positive, or right-skewed
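The measures of variation and shape in this section can be sketched with the standard `statistics` module, again using the 20 daily high temperatures from the Chapter 2 example. The cutoff of 3 for flagging outliers is the common rule of thumb and is an assumption of this sketch, and comparing the mean with the median gives only a rough hint about skewness.

```python
import statistics

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

mean = statistics.mean(temps)
median = statistics.median(temps)

value_range = max(temps) - min(temps)   # largest value minus smallest value
variance = statistics.variance(temps)   # sample variance, divides by n - 1
std_dev = statistics.stdev(temps)       # square root of the sample variance
cv_percent = std_dev / mean * 100       # coefficient of variation, in percent

# Z score of each value: distance from the mean in standard deviations.
z_scores = [(x - mean) / std_dev for x in temps]

# Rule of thumb (assumption): |Z| > 3 flags an outlier.
outliers = [x for x, z in zip(temps, z_scores) if abs(z) > 3]

# Rough shape hint from the relationship between mean and median.
if mean > median:
    shape_hint = "right-skewed"
elif mean < median:
    shape_hint = "left-skewed"
else:
    shape_hint = "symmetric"

print("range:", value_range)
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2))
print("CV (%):", round(cv_percent, 1))
print("outliers:", outliers, "shape hint:", shape_hint)
```

Because the coefficient of variation is unit-free, it can be used to compare the variability of data sets measured in different units.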
  - 3.3 Exploring Numerical Data 【Sample】
    - This section covers the quartiles, the five-number summary, and how to construct a boxplot.
    - Quartiles
      - Quartiles split a set of data into four equal parts
        The first quartile, Q1, divides the smallest 25.0% of the values from the other 75.0% that are larger.
        
        The second quartile, Q2, is the median—50.0% of the values are smaller than or equal to the median and 50.0% are larger than or equal to the median.
        
        The third quartile, Q3, divides the smallest 75.0% of the values from the largest 25.0%.
      - $Q_1 = \frac{n + 1}{4}$ ranked value; $Q_3 = \frac{3(n + 1)}{4}$ ranked value
    - The Five-Number Summary
    - The Boxplot
      - based on the five-number summary
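A five-number summary (smallest value, Q1, median, Q3, largest value) can be sketched with `statistics.quantiles`. Its `"exclusive"` method locates quartiles at the (n + 1)p ranked position and interpolates linearly between neighboring values, so its results may differ slightly from hand rules that round the ranked position instead; treat that choice as an assumption of this sketch. The data are the 20 temperatures from the Chapter 2 example.

```python
import statistics

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

# Quartiles at the (n + 1) * p ranked positions, with linear interpolation.
q1, q2, q3 = statistics.quantiles(temps, n=4, method="exclusive")

# Five-number summary: the values a boxplot is drawn from.
five_number_summary = (min(temps), q1, q2, q3, max(temps))

# Interquartile range: the width of the box in a boxplot.
iqr = q3 - q1

print("five-number summary:", five_number_summary)
print("interquartile range:", iqr)
```

In a boxplot the box spans Q1 to Q3 with a line at the median, and the whiskers extend to the smallest and largest values.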
  - 3.4 Numerical Descriptive Measures for a Population
    - The Population Mean
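      - $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
    - The Population Variance and Standard Deviation
      - $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$, $\sigma = \sqrt{\sigma^2}$
    - The Empirical Rule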
      - In most data sets, a large portion of the values tend to cluster somewhere near the median.
        In right-skewed data sets, this clustering occurs to the left of the mean—that is, at a value less than the mean.
        
        In left-skewed data sets, the values tend to cluster to the right of the mean—that is, greater than the mean.
        
        In symmetric data sets, where the median and mean are the same, the values often tend to cluster around the median and mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions:
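        Approximately 68% of the values are within ±1 standard deviation of the mean, approximately 95% are within ±2 standard deviations, and approximately 99.7% are within ±3 standard deviations.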
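The percentages usually quoted for the empirical rule (about 68%, 95% and 99.7% within 1, 2 and 3 standard deviations of the mean) can be checked against an exactly normal distribution. This is a minimal sketch: `share_within` is a name invented here, and real bell-shaped data will only approximately match these figures.

```python
import math


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {share_within(k):.4f}")
```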
  - 3.5 The Covariance and the Coefficient of Correlation
    - The Covariance【Sample】
      - The covariance measures the strength of the linear relationship between two numerical variables (X and Y).
      - $\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
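    - The Coefficient of Correlation 【Population】
      - When dealing with population data for two numerical variables, the Greek letter $\rho$ (rho) is used as the symbol for the coefficient of correlation.
      - Three different types of association between two variables.
      - Sample coefficient of correlation: $r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$ @❓: is the population value inferred from the sample value?
      - r lies in the interval [-1, 1]: its maximum value is 1 and its minimum value is -1.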
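The sample covariance and the sample coefficient of correlation can be computed by hand as a sketch. The paired values below are made up, and the function names are invented for this example; both functions use the n - 1 denominator of the sample formulas.

```python
import math


def sample_covariance(xs, ys):
    """Sum of the products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # cov(X, X) is the sample variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
r = sample_correlation(x, y)

print("covariance:", cov_xy)
print("correlation:", round(r, 4))
```

Unlike the covariance, r has no units and always falls between -1 and 1, which makes the strength of the linear relationship easier to judge.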
