Naive Bayes

In short, naive Bayes is a classification method built on Bayes' theorem: it uses prior probabilities to compute the posterior probability of each class, then picks the class with the larger posterior as the prediction.

Suppose we have a sample data set D = {d1, d2, …, dn}, each sample has a feature vector X = {x1, x2, …, xd}, and the class variable is Y = {y1, y2, …, ym}. Given X, how do we estimate the class variable Y?

By Bayes' theorem,

P(Y|X) = P(X|Y) · P(Y) / P(X)

where P(Y) is called the prior probability, P(Y|X) the posterior probability, and P(X|Y)/P(X) the adjustment (likelihood) factor.

Now further assume that the features are conditionally independent of each other given the class (this is where the "naive" idea comes in). Expanding the denominator with the law of total probability, the probability of class yi is

P(yi | x1, …, xd) = P(yi) · ∏j P(xj | yi) / P(x1, …, xd)

Since the denominator is the same for every class, it is enough to compare the numerators.
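As a minimal sketch of this classification rule, the unnormalised posteriors can be computed directly from frequency counts. The training tuples below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical training samples, invented for illustration only:
# each row is ((handsome, personality, ambitious), decision)
data = [
    (("yes", "good", "yes"), "marry"),
    (("no",  "good", "yes"), "marry"),
    (("yes", "bad",  "yes"), "marry"),
    (("no",  "bad",  "no"),  "not marry"),
    (("yes", "bad",  "no"),  "not marry"),
    (("no",  "good", "no"),  "not marry"),
]

def naive_bayes_posterior(x, data):
    """Unnormalised posterior P(y) * prod_j P(x_j | y) for each class y."""
    classes = Counter(label for _, label in data)
    n = len(data)
    scores = {}
    for y, count_y in classes.items():
        prior = count_y / n
        likelihood = 1.0
        for j, xj in enumerate(x):
            # Count how often feature j takes value xj among class-y samples
            match = sum(1 for feats, label in data if label == y and feats[j] == xj)
            likelihood *= match / count_y
        scores[y] = prior * likelihood
    return scores

scores = naive_bayes_posterior(("yes", "bad", "no"), data)
print(max(scores, key=scores.get))  # → not marry
```

Note that "not ambitious" never occurs among the "marry" samples here, so that class's product collapses to zero; in practice one adds Laplace smoothing so a single unseen feature value does not zero out the whole product.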

Here is an example from https://www.bilibili.com/video/av36338359?from=search&seid=7919712705936593079 — the classic "to marry or not to marry" question: given whether a candidate is handsome, has a good personality, and is ambitious, predict whether the answer is "marry" or "not marry".

Plugging the counts from the table into the formula above, we can compute the posterior for "marry" and for "not marry" and compare them.

The discussion so far assumes discrete features. When a feature is continuous, its class-conditional probability density can be estimated with a Gaussian, i.e., we assume the continuous feature follows a normal distribution within each class:

P(x | y) = 1 / (√(2π) · σy) · exp( −(x − μy)² / (2σy²) )

where μy and σy are the mean and standard deviation of the feature among the samples of class y.
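This is exactly what scikit-learn's GaussianNB implements: it fits a per-class mean and variance for each feature. A minimal sketch on made-up one-dimensional data (the income values below are invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical monthly incomes (in k), invented for illustration
X = np.array([[4.0], [5.5], [6.0], [4.5], [2.0], [2.5], [3.0], [1.5]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = marry, 0 = not marry

model = GaussianNB()
model.fit(X, y)                 # estimates per-class mean and variance
print(model.predict([[4.0]]))   # → [1]
```

Internally each class's likelihood at x = 4.0 is the Gaussian density above, evaluated with that class's fitted mean and variance, multiplied by the class prior.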

For example, add a monthly income feature to the candidates above.

With the other features unchanged, suppose a candidate is (handsome, bad personality, not ambitious, income 4k). Among the "marry" samples the mean income is 4.8, and we assume a standard deviation of 2, so we can compute

P(income = 4k | marry)

Similarly, we can compute

P(income = 4k | not marry)

Multiplying these Gaussian likelihoods into the discrete ones gives the final result.
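The Gaussian likelihood above can be evaluated directly. Using the stated mean 4.8 and the assumed standard deviation 2 for the "marry" class (the "not marry" parameters are not given in the text, so only the first term is computed here):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# P(income = 4k | marry) with mean 4.8 and assumed std 2
print(round(gaussian_pdf(4, 4.8, 2), 4))  # ≈ 0.1841
```

Note this is a density value, not a probability; that is fine, since naive Bayes only compares the relative magnitudes across classes.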

 

 

As a fuller, runnable example, naive Bayes can also be applied to sentiment analysis of restaurant comments (assuming a CSV file with 'comments' and 'star' columns):

```python
import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the data
df = pd.read_csv('restaurant_comments.csv', encoding='gb18030')

# Create the sentiment label: ratings above 3 stars count as positive
df['sentiment'] = df['star'].apply(lambda x: 1 if x > 3 else 0)

# Tokenize the Chinese comments with jieba
df['comments'] = df['comments'].apply(lambda x: ' '.join(jieba.cut(x)))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['comments'], df['sentiment'], test_size=0.2)

# Create a pipeline with CountVectorizer and MultinomialNB
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = pipeline.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
```

The comments are segmented with jieba (scikit-learn's default tokenizer does not handle Chinese), turned into bag-of-words counts by CountVectorizer, and classified by MultinomialNB; the accuracy on the held-out test set is then printed. Other models and hyperparameter tuning can further improve the accuracy.