Building an Article Classifier from Scratch with Logistic Regression (AG News)

This post shows how to build a text classifier from scratch using a log-linear model (logistic regression), walking through preprocessing, TF-IDF feature extraction, model training, and evaluation metrics (accuracy, precision, recall, and F1 score).

A Text Classifier Based on a Log-Linear Model

A simple model from scratch.

code repository

Dataset

AG News, a news dataset with four classes (World, Sports, Business, Sci/Tech); each sample has a title and a description.

Implementation

Preprocessing

  • Sample data to reduce dataset size
  • Merge the Title and Content
  • With the help of nltk (see the sketch after this list):
    • Remove punctuation and numbers
    • Remove URLs
    • Split the Content into words
    • Filter stopwords
    • Stem and lemmatize
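
A minimal preprocessing sketch using nltk is below; the `preprocess` helper and the lemmatize-then-stem order are illustrative assumptions, not fixed by the steps above.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(title: str, content: str) -> list:
    # Merge the Title and Content into a single text field
    text = f"{title} {content}".lower()
    # Remove URLs, then punctuation and numbers
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(rf"[{re.escape(string.punctuation)}\d]+", " ", text)
    # Split into words and filter stopwords
    tokens = [t for t in nltk.word_tokenize(text) if t not in STOPWORDS]
    # Lemmatize, then stem
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]
```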

Feature Extraction

Use TF-IDF as the Feature

  • Keep only the most frequent words for a reasonable feature size
  • Calculate TF-IDF
    $$\mathrm{TF}(t,d) = \frac{\text{count}(t, d)}{\sum_k \text{count}(k, d)}$$
    where $\text{count}(t, d)$ is the count of term $t$ in document $d$.
    $$\mathrm{IDF}(t, D) = \log \frac{N + 1}{\text{num}(t, D) + 1} + 1$$
    where $\text{num}(t, D)$ is the number of documents in $D$ that contain term $t$, $D$ is the set of all documents, and $N = |D|$.
    $$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$$
    Note that L2 normalisation is applied to the final TF-IDF vectors for better performance (see the sketch after this list).
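
Under the formulas above, TF-IDF for a document-term count matrix can be sketched in NumPy as follows (the `tf_idf` name and array layout are my own assumptions):

```python
import numpy as np

def tf_idf(counts: np.ndarray) -> np.ndarray:
    """counts: (N documents, V kept vocabulary terms) raw term counts."""
    N = counts.shape[0]
    # TF: term count normalised by document length
    doc_len = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    tf = counts / doc_len
    # IDF with add-one smoothing, matching the formula above
    num_t = (counts > 0).sum(axis=0)         # num(t, D)
    idf = np.log((N + 1) / (num_t + 1)) + 1
    # Final TF-IDF with L2 normalisation per document
    tfidf = tf * idf
    norms = np.maximum(np.linalg.norm(tfidf, axis=1, keepdims=True), 1e-12)
    return tfidf / norms
```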

Log-Linear Model

  • Logistic Regression Model
    $$\hat{y} = \text{softmax}\big( X_{N,F}\, W_{F,C} + b_{C} \big)$$
    where $N$ is the size of the training data, $F$ is the number of features, and $C$ is the number of classes. (The model, loss, and gradients are sketched together in code after this list.)

  • Cross Entropy Loss
    $$\text{loss} = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{C} y_{i,j} \log \hat{y}_{i,j}$$
    where $y_{i,j} = 1$ if training text $i$ belongs to class $j$, and $0$ otherwise.

  • Gradients
    $$dW = \frac{1}{N} X^T \big(\hat{y} - y\big), \qquad db = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i - y_i\big)$$
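
A NumPy sketch of the three pieces above; the helper names (`softmax`, `forward`, `cross_entropy`, `gradients`) are illustrative.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # X: (N, F), W: (F, C), b: (C,) -> y_hat: (N, C)
    return softmax(X @ W + b)

def cross_entropy(y_hat: np.ndarray, y: np.ndarray) -> float:
    # y is one-hot, shape (N, C); clip to avoid log(0)
    return float(-np.mean(np.sum(y * np.log(np.clip(y_hat, 1e-12, 1.0)), axis=1)))

def gradients(X: np.ndarray, y_hat: np.ndarray, y: np.ndarray):
    # dW: (F, C), db: (C,), matching the formulas above
    N = X.shape[0]
    dW = X.T @ (y_hat - y) / N
    db = (y_hat - y).sum(axis=0) / N
    return dW, db
```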

Update Algorithm

Gradient descent with a shrinking learning rate.
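
A sketch of the update loop, reusing the `forward` and `gradients` helpers above; the exact decay schedule is an assumption, since only a "shrinking" learning rate is specified.

```python
import numpy as np

def train(X, y_onehot, num_classes, epochs=200, lr0=0.5, decay=0.01):
    N, F = X.shape
    W = np.zeros((F, num_classes))
    b = np.zeros(num_classes)
    for epoch in range(epochs):
        # Shrinking learning rate; the 1 / (1 + decay * epoch) schedule is assumed
        lr = lr0 / (1 + decay * epoch)
        y_hat = forward(X, W, b)
        dW, db = gradients(X, y_hat, y_onehot)
        W -= lr * dW
        b -= lr * db
    return W, b
```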

Evaluation

  • Accuracy
    $$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$
  • F1 Score (macro)
    $$\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Recall} = \frac{TP}{TP+FN}, \qquad \text{F1} = \frac{2PR}{P+R}$$
    $$\text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} \text{F1}_i$$
    where $P$ and $R$ are the per-class precision and recall, and the macro F1 averages the per-class F1 scores. (A sketch of both metrics follows.)
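
Both metrics can be computed from integer label vectors; a minimal sketch (the `evaluate` name is mine).

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    # Multiclass accuracy: fraction of correct predictions
    accuracy = float(np.mean(y_true == y_pred))
    f1_scores = []
    for c in range(num_classes):  # one-vs-rest per class
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        f1_scores.append(f1)
    # Macro F1: unweighted mean of the per-class F1 scores
    return accuracy, float(np.mean(f1_scores))
```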