1. Classification
Input
- A document d
- Often represented as a vector of features
- A fixed output set of classes C = {c1, c2, …, ck}
- Categorical, not continuous (regression) or ordinal (ranking)
Output
- A predicted class c ∈ C
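The input/output contract above can be sketched in a few lines of Python. This is only an illustration: the vocabulary `VOCAB`, the class names in `CLASSES`, and the hand-written scoring rule in `classify` are hypothetical stand-ins for a real feature extractor and a trained model.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each position in the feature vector counts one word.
VOCAB = ["good", "great", "bad", "awful"]
# The fixed output set of classes C (names chosen for illustration).
CLASSES = ("positive", "negative")


def featurize(document):
    """Represent a document d as a vector of word counts over VOCAB."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in VOCAB]


def classify(features):
    """Toy rule standing in for a trained model: returns one class c in C."""
    positive_score = features[0] + features[1]
    negative_score = features[2] + features[3]
    return CLASSES[0] if positive_score >= negative_score else CLASSES[1]


vector = featurize("Great plot but bad acting and awful pacing")
print(vector)            # [0, 1, 1, 1]
print(classify(vector))  # negative
```

Whatever the model, the output is always one member of the fixed set C, never a real number or a ranking.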
1.1 Text Classification Tasks
Some common examples:
- Topic classification
- Sentiment analysis
- Native-language identification
- Natural language inference
- Automatic fact-checking
- Paraphrase detection
The input need not be a long document:
- Sentence-level or tweet-level sentiment analysis
2. Algorithms for Classification
2.1 Choosing a Classification Algorithm
- Bias vs. Variance
- Bias: the assumptions built into the model
- Variance: sensitivity to the training set
- Underlying assumptions, e.g., independence
- Complexity
- Speed
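To make the bias/variance trade-off concrete, here is a small hedged sketch. It compares two deliberately extreme classifiers on three hand-made training sets: a majority-class model (strong assumption, so high bias and low variance) and a 1-nearest-neighbour model (weak assumption, so low bias and high variance). The data, the function names, and the choice of these two models are all assumptions made for illustration.

```python
from collections import Counter


def train_majority(train):
    """High-bias model: ignores the input and always predicts the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def train_1nn(train):
    """High-variance model: copies the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


# Three hypothetical training sets for the same task: (1-D feature, label).
training_sets = [
    [(1.0, "neg"), (2.0, "neg"), (4.9, "pos"), (8.0, "neg")],
    [(1.5, "neg"), (5.2, "neg"), (7.0, "neg"), (9.0, "pos")],
    [(0.5, "neg"), (4.8, "neg"), (5.1, "pos"), (6.0, "neg")],
]


def distinct_predictions(trainer, x):
    """Set of labels predicted for x across the different training sets."""
    return {trainer(train)(x) for train in training_sets}


# The majority model gives the same answer whichever training set it saw;
# the 1-NN prediction flips depending on which points happen to lie near x.
print(distinct_predictions(train_majority, 5.0))
print(distinct_predictions(train_1nn, 5.0))
```

The majority model is stable but can never predict the minority class; the 1-NN model can fit anything but changes its answer with small changes in the training data.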
2.2 Naïve Bayes
Pros:
- Fast to train and classify
- Robust and low-variance, so well suited to low-data situations
- Optimal classifier if the independence assumption holds
- Extremely simple to implement
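As a rough illustration of how little code the method needs, here is a minimal sketch of a multinomial Naïve Bayes text classifier. The add-one (Laplace) smoothing, the decision to skip tokens never seen in training, the toy training data, and the names `train_naive_bayes` and `predict` are assumptions of this sketch, not details from these notes.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    for tokens, label in documents:
        word_counts[label].update(tokens)
    vocab = {token for tokens, _ in documents for token in tokens}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log P(c)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # assumption: ignore words never seen in training
            # Independence assumption: each word contributes its own factor P(w | c),
            # estimated here with add-one smoothing.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
train = [
    (["great", "fun", "film"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "film"], "neg"),
    (["awful", "boring", "plot"], "neg"),
]
model = train_naive_bayes(train)
print(predict(model, ["great", "film"]))   # pos
print(predict(model, ["boring", "plot"]))  # neg
```

Training is a single counting pass over the data and classification is a sum of logs per class, which is why the method is fast at both stages.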
Cons:
- Independence assumption rarely holds
- low accuracy compared to similar me