Lecture 4 Text Classification

Fundamentals of Classification

Classification

  • Input:

    • A document d: often represented as a vector of features
    • A fixed set of output classes C = {c1, c2, …, ck}: categorical, not continuous or ordinal
  • Output:

    • A predicted class c ∈ C

Text Classification Tasks

  • Some common examples:

    • Topic classification
    • Sentiment analysis
    • Native-language identification
    • Natural language inference
    • Automatic fact-checking
    • Paraphrase detection
  • The input may not be a long document

Topic Classification
  • Motivation: Library science, information retrieval

  • Classes: Topic categories, e.g. "jobs", "international news"

  • Features:

    • Unigram bag-of-words, with stop words removed
    • Longer n-grams to capture phrases (a small feature-extraction sketch follows below)
  • Examples of corpora:

    • Reuters news corpus, e.g. RCV1 (available via NLTK)
    • PubMed abstracts
    • Tweets with hashtags
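
As an illustration of the features above, here is a minimal sketch of bag-of-words extraction using scikit-learn's CountVectorizer (an assumed dependency); the two documents are invented examples.

```python
# Minimal bag-of-words feature extraction (assumes scikit-learn >= 1.0 is installed);
# the two documents below are invented examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks fell sharply on international markets today",
    "the company is hiring for hundreds of new jobs",
]

# Unigram bag-of-words with English stop words removed;
# setting ngram_range=(1, 2) would also add bigrams to capture short phrases.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 1))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary (feature names)
print(X.toarray())                         # document-term count matrix
```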
Sentiment Analysis
  • Motivation: Opinion mining, business analytics

  • Classes: Positive / Negative / (Neutral)

  • Features:

    • N-grams
    • Polarity lexicons
  • Examples of corpora:

    • Movie review dataset in NLTK
    • SEMEVAL Twitter polarity datasets
Native-Language Identification
  • Motivation: Forensic linguistics, educational applications

  • Classes: First language of the author

  • Features:

    • Word n-grams
    • Syntactic patterns (POS tags, parse trees)
    • Phonological features
  • Examples of corpora:

    • TOEFL/IELTS essay corpora
Natural Language Inference
  • Also called textual entailment

  • Motivation: Language understanding

  • Classes: Entailment, contradiction, neutral

  • Features:

    • Word overlap
    • Length difference between the sentences
    • N-grams
  • Examples of corpora:

    • SNLI, MNLI

Building a Text Classifier

  1. Identify a task of interest
  2. Collect an appropriate corpus
  3. Carry out annotation
  4. Select features
  5. Choose a machine learning algorithm
  6. Train the model and tune hyperparameters using hold-out development data
  7. Repeat earlier steps as needed
  8. Train the final model
  9. Evaluate the model on hold-out test data
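
A minimal end-to-end sketch of steps 4-9, assuming scikit-learn is installed; the tiny labelled corpus below is invented purely for illustration.

```python
# Minimal sketch of steps 4-9 (assumes scikit-learn is installed);
# the tiny labelled corpus below is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

docs = ["loved this film", "great acting and plot", "terrible and boring",
        "waste of time", "an instant classic", "awful script"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

# Hold out a test set; in practice you would also keep a development set
# (or use cross-validation) for hyperparameter tuning before touching the test data.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, stratify=labels, random_state=0)

# Steps 4-5: bag-of-words (tf-idf) features plus a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

model.fit(X_train, y_train)                                   # steps 6/8: train
print(classification_report(y_test, model.predict(X_test)))   # step 9: evaluate
```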

Algorithms for Classification

Choosing a Classification Algorithm

  • Bias vs. variance

    • Bias: Assumptions made in the model
    • Variance: Sensitivity to the training set
  • Underlying assumptions, e.g. feature independence

  • Complexity

  • Speed

Naive Bayes

  • Finds the class with the highest probability under Bayes' rule:
    c* = argmax_{c ∈ C} P(c) · P(f1, …, fn | c)

    • Probability of the class times the probability of the features given the class
  • Naively assumes the features are independent:
    P(f1, …, fn | c) = P(f1 | c) × P(f2 | c) × … × P(fn | c)

  • Pros:

    • Fast to train and classify
    • Robust, low-variance -> good for low-data situations
    • Optimal classifier if the independence assumption is correct
    • Extremely simple to implement
  • Cons:

    • Independence assumption rarely holds
    • Low accuracy compared to similar methods in most situations
    • Smoothing required for unseen class/feature combinations
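
A minimal sketch of a Naive Bayes text classifier, assuming scikit-learn is installed; MultinomialNB with alpha=1.0 applies add-one (Laplace) smoothing, and the toy spam/ham documents are invented.

```python
# Minimal Naive Bayes sketch (assumes scikit-learn is installed).
# MultinomialNB with alpha=1.0 applies add-one (Laplace) smoothing,
# which handles unseen class/feature combinations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["cheap pills buy now", "meeting moved to friday",
              "win a free prize now", "lunch at noon tomorrow"]
train_labels = ["spam", "ham", "spam", "ham"]   # invented toy data

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(train_docs, train_labels)

# P(c) * prod_i P(f_i | c) is computed in log space internally;
# predict_proba returns the normalized posterior over classes.
print(clf.predict(["free pills now"]))
print(clf.predict_proba(["free pills now"]))
```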

Logistic Regression

  • A classifier, despite being called "regression"

  • A linear model, but uses the softmax function to squash its outputs into valid probabilities:
    P(c | x) = exp(w_c · x) / Σ_{c'} exp(w_{c'} · x)

  • Training maximizes the probability of the training data, subject to regularization that encourages low or sparse weights

  • Pros:

    • Unlike Naive Bayes, it is not confounded by diverse, correlated features, which gives better performance
  • Cons:

    • Slow to train
    • Feature scaling needed
    • Requires a lot of data to work well in practice
    • Choosing a regularization strategy is important, since overfitting is a big problem
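
A minimal sketch using scikit-learn's LogisticRegression (assumed available); C, the inverse regularization strength, and the toy sentiment data are illustrative choices only.

```python
# Minimal logistic regression sketch (assumes scikit-learn is installed).
# C is the inverse regularization strength: smaller C = stronger penalty.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["what a wonderful movie", "truly dreadful acting",
        "wonderful cast, loved it", "dreadful plot, avoid"]
labels = ["pos", "neg", "pos", "neg"]           # invented toy data

# Tf-idf features are already on a comparable scale; penalty="l2" keeps
# weights low, while penalty="l1" (with a compatible solver) makes them sparse.
clf = make_pipeline(TfidfVectorizer(),
                    LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
clf.fit(docs, labels)
print(clf.predict(["wonderful plot"]))
```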

Support Vector Machines

  • Finds the hyperplane that separates the training data with maximum margin

  • Pros:

    • Fast and accurate linear classifier
    • Can handle non-linearity with the kernel trick
    • Works well with huge feature sets
  • Cons:

    • Multiclass classification is awkward
    • Feature scaling needed
    • Deals poorly with class imbalance
    • Poor interpretability
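
A minimal sketch with scikit-learn's LinearSVC (assumed available); the toy topic data is invented, and class_weight="balanced" is one common way to mitigate class imbalance.

```python
# Minimal linear SVM sketch (assumes scikit-learn is installed).
# LinearSVC handles multiclass via one-vs-rest; class_weight="balanced"
# is one way to soften the effect of class imbalance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["goal scored in the final minute", "election results announced",
        "striker signs new contract", "parliament passes new budget"]
labels = ["sport", "politics", "sport", "politics"]   # invented toy data

clf = make_pipeline(TfidfVectorizer(),
                    LinearSVC(C=1.0, class_weight="balanced"))
clf.fit(docs, labels)
print(clf.predict(["the striker missed the goal"]))

# For non-linear decision boundaries, sklearn.svm.SVC(kernel="rbf")
# applies the kernel trick instead of a purely linear model.
```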

K-Nearest Neighbor

  • Classify based on the majority class of the k nearest training examples in feature space

  • The definition of "nearest" can vary:

    • Euclidean distance
    • Cosine distance
  • Pros:

    • Simple but surprisingly effective
    • No training required
    • Inherently multiclass
    • Optimal classifier with infinite data
  • Cons:

    • Have to select k
    • Issues with imbalanced classes
    • Finding the neighbors is often slow
    • Features must be selected carefully
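
A minimal sketch with scikit-learn's KNeighborsClassifier (assumed available), using cosine distance as the definition of "nearest"; the toy data and k=3 are illustrative only.

```python
# Minimal k-nearest-neighbor sketch (assumes scikit-learn is installed).
# With metric="cosine", "nearest" means smallest cosine distance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["rain expected over the weekend", "shares rallied after the report",
        "storm warnings issued for the coast", "markets closed slightly lower"]
labels = ["weather", "finance", "weather", "finance"]   # invented toy data

clf = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=3, metric="cosine"))
clf.fit(docs, labels)   # "training" just stores the vectors
print(clf.predict(["heavy rain and storms tomorrow"]))
```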

Decision Tree

  • Construct a tree where nodes correspond to tests on individual features

  • Leaves are the final class decisions

  • Built by greedily maximizing mutual information (information gain)

  • Pros:

    • Fast to build and test
    • Feature scaling is irrelevant
    • Good for small feature sets
    • Handles non-linearly-separable problems
  • Cons:

    • In practice, not very interpretable
    • Highly redundant sub-trees
    • Not competitive for large feature sets
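
A minimal sketch with scikit-learn's DecisionTreeClassifier (assumed available); criterion="entropy" corresponds to the information-gain splitting described above, and the toy data is invented.

```python
# Minimal decision tree sketch (assumes scikit-learn is installed).
# criterion="entropy" corresponds to greedily maximizing information gain.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

docs = ["free entry win cash", "see you at the meeting",
        "claim your free prize", "minutes from the meeting attached"]
labels = ["spam", "ham", "spam", "ham"]         # invented toy data

clf = make_pipeline(CountVectorizer(),
                    DecisionTreeClassifier(criterion="entropy", max_depth=5))
clf.fit(docs, labels)
print(clf.predict(["win a free prize"]))
```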

Random Forests

  • An ensemble classifier

  • Consists of decision trees trained on different subsets of the training data and feature space

  • The final class decision is a majority vote of the sub-classifiers

  • Pros:

    • Usually more accurate and more robust than decision trees
    • A great classifier for medium-sized feature sets
    • Training is easily parallelized
  • Cons:

    • Poor interpretability
    • Slow with large feature sets
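
A minimal sketch with scikit-learn's RandomForestClassifier (assumed available); the toy data is invented, and n_jobs=-1 illustrates the easy parallelization mentioned above.

```python
# Minimal random forest sketch (assumes scikit-learn is installed).
# Each tree sees a bootstrap sample of the data and random feature subsets;
# n_jobs=-1 parallelizes training over all available cores.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

docs = ["new vaccine trial shows promise", "team wins championship title",
        "hospital expands cancer unit", "coach praises young players"]
labels = ["health", "sport", "health", "sport"]   # invented toy data

clf = make_pipeline(CountVectorizer(),
                    RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                           random_state=0))
clf.fit(docs, labels)
print(clf.predict(["players celebrate the title"]))
```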

Neural Networks

  • An interconnected set of nodes, typically arranged in layers

  • Input layer (features), output layer (class probabilities), and one or more hidden layers

  • Each node performs a linear weighting of its inputs from the previous layer and passes the result through an activation function to the nodes in the next layer

  • Pros:

    • Extremely powerful; the dominant method in NLP and computer vision
    • Little feature engineering required
  • Cons:

    • Not an off-the-shelf classifier
    • Many hyperparameters, difficult to optimize
    • Slow to train
    • Prone to overfitting
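
A minimal sketch of a small feed-forward network using scikit-learn's MLPClassifier (assumed available); real NLP systems usually use dedicated deep learning libraries, and the toy data here is invented.

```python
# Minimal feed-forward network sketch (assumes scikit-learn is installed);
# dedicated deep learning libraries are the usual choice in practice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

docs = ["the film was fantastic", "utterly disappointing sequel",
        "a fantastic, moving story", "disappointing and dull"]
labels = ["pos", "neg", "pos", "neg"]           # invented toy data

# One hidden layer of 16 ReLU units; the output layer gives class probabilities.
clf = make_pipeline(TfidfVectorizer(),
                    MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                                  max_iter=2000, random_state=0))
clf.fit(docs, labels)
print(clf.predict_proba(["fantastic story"]))
```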

Hyperparameter Tuning

  • Dataset for tuning:

    • Development set
    • Not the training set or the test set
    • k-fold cross-validation
  • Specific hyperparameters are classifier-specific, but many relate to regularization

    • Regularization hyperparameters penalize model complexity
    • Used to prevent overfitting
  • For multiple hyperparameters, use grid search
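
A minimal grid-search sketch with scikit-learn's GridSearchCV (assumed available), tuning a regularization hyperparameter via k-fold cross-validation on an invented toy corpus.

```python
# Minimal grid-search sketch (assumes scikit-learn is installed):
# k-fold cross-validation over a small grid of regularization settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["great service", "rude staff", "friendly and helpful staff",
        "terrible service", "great food", "rude and slow"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]   # invented toy data

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# C is the inverse regularization strength; ngram_range is a feature choice.
param_grid = {"clf__C": [0.01, 0.1, 1, 10],
              "tfidf__ngram_range": [(1, 1), (1, 2)]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```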

Evaluation

Confusion Matrix

                       Actual class A      Actual class B
  Classified as A      True Positive       False Positive
  Classified as B      False Negative      True Negative

Evaluation Metrics

  • Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
  • Precision = True Positive / (True Positive + False Positive)
  • Recall = True Positive / (True Positive + False Negative)
  • F1-score = (2 * Precision * Recall) / (Precision + Recall)
    • Macroaverage: Average the F-scores across the classes
    • Microaverage: Calculate the F-score from the summed counts over all classes
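
A minimal sketch computing these metrics with scikit-learn (assumed available); y_true and y_pred are invented example labels, with class "A" treated as the positive class.

```python
# Minimal metrics sketch (assumes scikit-learn is installed);
# y_true and y_pred are invented example labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = ["A", "A", "A", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "B", "A", "B"]

print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
print("accuracy :", accuracy_score(y_true, y_pred))
# Treat class "A" as the positive class for precision/recall/F1.
print("precision:", precision_score(y_true, y_pred, pos_label="A"))
print("recall   :", recall_score(y_true, y_pred, pos_label="A"))
print("macro F1 :", f1_score(y_true, y_pred, average="macro"))
print("micro F1 :", f1_score(y_true, y_pred, average="micro"))
```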