Sentiment Analysis of Twitter Data with a Word-Level N-gram Bag-of-Words Model

鱼弦

Published 2024-10-07 07:00:00

Article link: https://blog.csdn.net/feng1790291543/article/details/142702099

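Introduction

The word-level N-gram bag-of-words (BoW) model is a basic text-processing technique. It converts text into feature vectors that machine learning algorithms can consume. For sentiment analysis of Twitter text, it helps determine automatically whether a tweet expresses positive, negative, or neutral sentiment.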

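Application Scenarios

Market intelligence: analyze consumer sentiment toward a product or brand.
Political commentary: track public reactions to and opinions on policies.
Social media monitoring: identify trending topics and give early warning of potential PR crises.
Customer service: automatically classify and respond to user feedback and complaints.

The code examples below sketch how each of these scenarios can be implemented in Python with a few commonly used libraries:

1. Market intelligence: analyzing consumer sentiment toward a product or brand

TextBlob can be used for basic sentiment analysis: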

from textblob import TextBlob

def analyze_sentiment(text):
    blob = TextBlob(text)
    # Return the sentiment polarity, a float in the range -1 to 1
    return blob.sentiment.polarity

# Example
text = "I love the new design of this phone!"
sentiment = analyze_sentiment(text)
print(f"Sentiment polarity: {sentiment}")
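
The polarity score is a continuous value, while the goal stated in the introduction is to label text as positive, negative, or neutral. One possible way to bridge the two is a simple threshold rule. The sketch below is only an illustration: the function name `polarity_to_label` and the threshold of 0.1 are assumptions, not TextBlob defaults, and a real project would tune the threshold on labeled data.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    The threshold of 0.1 is an illustrative assumption, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, e.g. as returned by analyze_sentiment()
for score in (0.6, -0.45, 0.05):
    print(score, "->", polarity_to_label(score))
```

Scores that fall inside the band between -threshold and +threshold are treated as neutral, so widening the band makes the rule more conservative about calling a text positive or negative.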

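2. Political commentary: tracking public reactions to and opinions on policies

Public opinions can be fetched through the Twitter API and then analyzed for sentiment. This requires registering a Twitter developer account to obtain API keys.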

import tweepy
from textblob import TextBlob

# Twitter API configuration
api_key = 'YOUR_API_KEY'
api_secret_key = 'YOUR_API_SECRET_KEY'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

def get_political_opinions(keyword):
    # API.search was renamed to API.search_tweets in Tweepy 4.0
    tweets = api.search_tweets(q=keyword, count=100, lang='en')

    for tweet in tweets:
        analysis = TextBlob(tweet.text)
        print(f"Tweet: {tweet.text} | Sentiment: {analysis.sentiment.polarity}")

# Example
get_political_opinions("new policy")

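3. Social media monitoring: identifying trending topics and warning of potential PR crises

tweepy can be combined with the nltk library to count word frequencies: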

import tweepy
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
from collections import Counter
import re

# Twitter API configuration: reuse the authenticated `api` object from the previous example

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\S+', '', text)     # Remove @mentions
    text = re.sub(r'#', '', text)        # Remove the # symbol
    return text

def monitor_social_media(keyword):
    # API.search was renamed to API.search_tweets in Tweepy 4.0
    tweets = api.search_tweets(q=keyword, count=100, lang='en')
    words = []

    for tweet in tweets:
        clean_tweet = clean_text(tweet.text)
        words.extend(clean_tweet.lower().split())

    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words and len(word) > 1]

    word_freq = Counter(filtered_words)
    most_common = word_freq.most_common(10)
    print(f"Most common words related to '{keyword}': {most_common}")

# Example
monitor_social_media("brand crisis")

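4. Customer service: automatically classifying and responding to user feedback and complaints

scikit-learn can be used to train a simple text classification model, such as a Naive Bayes classifier: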

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Sample data
feedbacks = [
    "The product stopped working after a week",
    "Great customer service!",
    "I'm very satisfied with the purchase",
    "The delivery was late and the package was damaged",
    "Fantastic experience, highly recommend!"
]
labels = ["complaint", "praise", "praise", "complaint", "praise"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(feedbacks)
y = labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)

def classify_feedback(feedback):
    feedback_vector = vectorizer.transform([feedback])
    prediction = clf.predict(feedback_vector)
    return prediction[0]

# Example
new_feedback = "The staff were very helpful!"
category = classify_feedback(new_feedback)
print(f"The feedback is classified as: {category}")

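How It Works

A word-level N-gram is a sequence of n consecutive words extracted from a text; these sequences capture local context. The BoW model then turns the N-grams into a vector representation that ignores where each N-gram occurs in the text and records only how often it occurs.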
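
To make the idea concrete, here is a minimal sketch in plain Python that extracts unigrams and bigrams from a short sentence and counts them. The helper name `word_ngrams` and the naive whitespace tokenization are assumptions chosen for illustration; they are not taken from any library.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "not good not bad".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)

# The bag-of-words representation keeps only how often each N-gram occurs.
bag = Counter(unigrams + bigrams)

print(bigrams)     # ['not good', 'good not', 'not bad']
print(bag["not"])  # 2
```

The bigrams "not good" and "not bad" keep the negation attached to the word it modifies, which is the local context that unigrams alone would lose.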

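Algorithm Flowchart

+------------------------------+
|  Collect Twitter data        |
+--------------+---------------+
               |
               v
+------------------------------+
|  Preprocess data             |
|  - Remove special characters |
|  - Convert to lowercase      |
|  - Remove stop words, etc.   |
+--------------+---------------+
               |
               v
+------------------------------+
|  Generate N-gram features    |
+--------------+---------------+
               |
               v
+------------------------------+
|  Build bag-of-words model    |
+--------------+---------------+
               |
               v
+------------------------------+
|  Vectorize features          |
+--------------+---------------+
               |
               v
+------------------------------+
|  Train classifier, predict   |
+------------------------------+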

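Algorithm Steps

Data preprocessing: clean the tweet text, including removing punctuation and links, expanding abbreviations, and filtering stop words.
Generate N-gram features: for a chosen n (e.g., 2 or 3), generate the sequences of n consecutive words.
Build the bag-of-words model: build a vocabulary containing every N-gram and record its occurrence count.
Feature vectorization: convert each tweet into a vector of the frequencies of the corresponding N-grams.
Classifier training and prediction: train a classifier with a machine learning algorithm (e.g., SVM or logistic regression) and predict the sentiment of new tweets.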
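
The first four steps can be sketched with the standard library alone, which makes it easier to see what a vectorizer does internally. Everything below is an illustrative assumption rather than a reference implementation: the function names, the tiny `STOP_WORDS` set, the regular expressions, the two sample tweets, and the choice to drop stop words before forming N-grams. Abbreviation expansion is left out for brevity.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "this", "to", "of"}


def preprocess(tweet):
    """Lowercase, strip URLs/mentions/punctuation, and drop stop words."""
    text = tweet.lower()
    text = re.sub(r"http\S+", " ", text)      # links
    text = re.sub(r"@\w+", " ", text)         # @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n_min=1, n_max=2):
    """All word-level N-grams for n_min <= n <= n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs):
    """Map every N-gram seen in the corpus to a column index."""
    vocab = sorted({gram for doc in docs for gram in doc})
    return {gram: idx for idx, gram in enumerate(vocab)}


def vectorize(doc, vocab):
    """Count vector for one document; unseen N-grams are ignored."""
    counts = Counter(gram for gram in doc if gram in vocab)
    return [counts.get(gram, 0) for gram in vocab]


tweets = [
    "I love this product! https://example.com",
    "@support This is the worst service ever",
]
docs = [ngrams(preprocess(t)) for t in tweets]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(list(vocab))
print(vectors)
```

Each tweet becomes a fixed-length list of counts with one column per N-gram in the vocabulary, which is the form a classifier in the final step would take as input.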

Detailed Code Example

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample tweet data
data = {
    'tweet': ["I love this product", "This is the worst service ever",
              "Had a fantastic experience", "Not good, not bad",
              "Absolutely terrible customer support"],
    'sentiment': [1, 0, 1, 2, 0]  # 1: Positive, 0: Negative, 2: Neutral
}

df = pd.DataFrame(data)

# Data preprocessing
def preprocess_text(text):
    return text.lower()

df['tweet'] = df['tweet'].apply(preprocess_text)

# N-gram feature extraction (unigrams and bigrams)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(df['tweet'])
y = df['sentiment']

# Split the dataset; with only 5 samples the test set is a single tweet,
# so the accuracy printed below is purely illustrative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
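
The example above delegates classification to scikit-learn's MultinomialNB. To show roughly what a multinomial Naive Bayes classifier does with N-gram counts, here is a simplified from-scratch sketch using only the standard library. It is not scikit-learn's implementation: the names `train_nb` and `predict_nb`, the whitespace tokenization, the string labels, and the decision to skip N-grams never seen in training are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Lowercased word-level N-grams for n = 1..n_max."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def train_nb(texts, labels):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_docs = Counter(labels)        # documents per class
    gram_counts = defaultdict(Counter)  # N-gram counts per class
    vocab = set()
    for text, label in zip(texts, labels):
        grams = ngrams(text)
        gram_counts[label].update(grams)
        vocab.update(grams)
    return class_docs, gram_counts, vocab


def predict_nb(text, model, alpha=1.0):
    """Return the class with the highest log posterior (add-alpha smoothing)."""
    class_docs, gram_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total = sum(gram_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for gram in ngrams(text):
            if gram not in vocab:
                continue  # N-grams never seen in training carry no evidence
            count = gram_counts[label][gram]
            score += math.log((count + alpha) / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


train_texts = ["I love this product", "Had a fantastic experience",
               "This is the worst service ever",
               "Absolutely terrible customer support"]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_nb(train_texts, train_labels)
print(predict_nb("love this fantastic product", model))  # positive
print(predict_nb("terrible service", model))             # negative
```

The add-alpha smoothing keeps an N-gram that appears in only one class from driving the other class's probability to zero, which matters a great deal on a training set this small.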