Data EDA
Statistics for Data Science and Machine Learning
Categorical variables take their values from a set of options, which may be predefined or open-ended. One example is a person's gender. For ordinal variables, the options can additionally be ordered by some rule, as in the Likert scale:
- Like
- Like Somewhat
- Neutral
- Dislike Somewhat
- Dislike
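As a minimal sketch of what "ordered by some rule" means in practice, the snippet below ranks the five Likert labels and sorts a handful of answers by that rank. The `responses` list is hypothetical data invented for illustration, and the names `LIKERT_ORDER` and `LIKERT_RANK` are assumptions of this sketch, not part of the original article.

```python
# Likert levels in the order given in the list above (most to least favourable).
LIKERT_ORDER = ["Like", "Like Somewhat", "Neutral", "Dislike Somewhat", "Dislike"]

# Map each label to its position so that labels can be compared and sorted.
LIKERT_RANK = {label: rank for rank, label in enumerate(LIKERT_ORDER)}

# Hypothetical survey answers, invented purely for illustration.
responses = ["Neutral", "Like", "Dislike", "Like Somewhat", "Neutral"]

# Sorting by rank respects the ordinal scale; alphabetical sorting would not.
sorted_responses = sorted(responses, key=LIKERT_RANK.__getitem__)
print(sorted_responses)
```

A purely nominal variable such as gender has no such rank, so only counts and proportions are meaningful for it.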
To keep the following examples simple, we will use a group of students who each passed or failed two different exams. The results are shown in the following R×C table:
![Image for post](https://img-service.csdnimg.cn/img_convert/9a3440fb9c61152819bedf44a9f9765a.png)
Statisticians have developed specific techniques for analyzing this kind of data. The most important are:
Measures of Agreement
Percent Agreement
Percent agreement is calculated as the number of cases in which the ratings fall in a given class, divided by the total number of ratings.
![Image for post](https://img-service.csdnimg.cn/img_convert/b27034cad79444d27687c98f5ac6053d.png)
- The percent agreement for passing exam 2 is 25/(25+60) = 25/85 ≈ 0.294, or 29.4%.
- The percent agreement for passing exam 1 is 30/85 ≈ 0.353, or 35.3%.
- The percent agreement for passing exam 1 and not passing exam 2 is 10/85 ≈ 0.118, or 11.8%.
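The arithmetic in the list above can be reproduced with a short sketch. It uses only the counts quoted in the text (25, 30, 10, and a total of 25 + 60 = 85); the function name `percent_agreement` and the dictionary labels are assumptions made for illustration.

```python
def percent_agreement(count: int, total: int) -> float:
    """Share of all cases that fall in the class of interest."""
    return count / total

# Counts quoted in the text; the total is 25 + 60 = 85 students.
total = 25 + 60
counts = {
    "passed exam 2": 25,
    "passed exam 1": 30,
    "passed exam 1, failed exam 2": 10,
}

for label, count in counts.items():
    # The .1% format multiplies by 100 and rounds to one decimal place.
    print(f"{label}: {percent_agreement(count, total):.1%}")
```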
The problem with percent agreement is that some of the observed agreement may be due to chance alone.
Cohen's Kappa
![Image for post](https://img-service.csdnimg.cn/img_convert/73a417d94c8f5c99bf3938e0253cf531.png)
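As a hedged illustration, the sketch below computes Cohen's kappa using its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the row and column totals. The 2×2 table is an assumption: its cells are inferred from the counts quoted earlier (30 passed exam 1, 25 passed exam 2, 10 passed exam 1 but not exam 2, 85 students in total) and may not match the table shown in the image. The function name `cohens_kappa` is likewise hypothetical.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table given as a list of rows."""
    k = len(table)
    n = sum(sum(row) for row in table)

    # Observed agreement: share of cases on the main diagonal.
    observed = sum(table[i][i] for i in range(k)) / n

    # Chance agreement: product of matching row and column marginals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2

    return (observed - expected) / (1 - expected)

# ASSUMED table, inferred from the counts quoted in the text.
# Rows: exam 1 (pass, fail); columns: exam 2 (pass, fail).
table = [
    [20, 10],
    [5, 50],
]

kappa = cohens_kappa(table)
print(f"kappa = {kappa:.3f}")
```

Under these assumed cell counts, the observed agreement is 70/85, while kappa comes out lower because part of that agreement is attributed to chance.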