Disasters_on_social_media

Some discussions on social media are about disasters, diseases, or riots, while others are just jokes or movie plots. How can we teach a machine to tell these two kinds of discussion apart?

1. Data Processing
(1) Extract the useful columns from the raw data
For this task, the only useful columns are text (the post content) and choose_one (whether the post is related to a disaster). To make the later machine-learning steps easier, the textual choose_one labels are also converted into simple integer codes.

import pandas as pd

questions = pd.read_csv("socialmedia_relevant_cols_clean.csv")
questions = questions[['text', 'choose_one']]

# Encode the three possible labels as integers for the classifiers below.
A = {'Not Relevant': 0, 'Relevant': 1, "Can't Decide": 2}
for i in A:
    questions.loc[questions.choose_one == i, 'class_label'] = int(A[i])
#print(questions.head(5))

(2) Clean the text content
Regular expressions are used to strip the unwanted parts out of the text.

import re

def standardize_text(df, text_field):
    # Strip URLs, then any leftover 'http' fragments.
    df[text_field] = df[text_field].apply(lambda x: re.sub(r'http\S+', '', x))
    df[text_field] = df[text_field].apply(lambda x: re.sub(r'http', '', x))
    # Strip @mentions, then spell out any remaining '@' as 'at'.
    df[text_field] = df[text_field].apply(lambda x: re.sub(r'@\S+', '', x))
    df[text_field] = df[text_field].apply(lambda x: re.sub(r'@', 'at', x))
    # Keep only letters, digits and basic punctuation; replace everything else with a space.
    df[text_field] = df[text_field].apply(lambda x: re.sub(r'[^A-Za-z0-9(),!?@\'\`\"\_\n]', ' ', x))
    df[text_field] = df[text_field].str.lower()
    return df

questions = standardize_text(questions,'text')
questions.to_csv('clean_data.csv', encoding='utf-8')
#print(questions.head(5))

(3) Tokenization
The tokenizer splits each cleaned text directly into a list of tokens.

clean_questions = pd.read_csv("clean_data.csv")
from nltk.tokenize import RegexpTokenizer

# \w+ keeps runs of letters, digits and underscores, dropping the remaining punctuation.
tokenizer = RegexpTokenizer(r'\w+')
clean_questions.loc[:,'tokens'] = clean_questions['text'].apply(tokenizer.tokenize)
print(clean_questions['tokens'].head(5))

2. Testing with a Bag-of-Words Model
(1) Use a Bag-of-Words model to turn the texts into count vectors. The texts and labels first have to be split into training and test sets (X_train, X_test, y_train, y_test), as sketched below.
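A minimal split sketch, assuming the cleaned texts are used as the corpus and class_label as the target; the names list_corpus and list_labels and the split parameters are assumptions, not part of the original post:

from sklearn.model_selection import train_test_split

# Assumed setup: cleaned texts as the corpus, the encoded labels as targets.
list_corpus = clean_questions['text'].tolist()
list_labels = clean_questions['class_label'].tolist()

X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels,
                                                    test_size=0.2, random_state=40)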

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def cv(data):
    count_vectorizer = CountVectorizer()
    emb = count_vectorizer.fit_transform(data)
    return emb, count_vectorizer

X_train_counts, count_vectorizer = cv(X_train)
X_test_counts = count_vectorizer.transform(X_test)

(2) Train and test the model

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=30, class_weight='balanced', solver='newton-cg',
                        multi_class='multinomial', n_jobs=None, random_state=40)
clf.fit(X_train_counts, y_train)
y_predicted_counts = clf.predict(X_test_counts)

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def get_metrics(y_test, y_predicted):
    # true positives / (true positives + false positives)
    precision = precision_score(y_test, y_predicted, pos_label=None, average='weighted')
    # true positives / (true positives + false negatives)
    recall = recall_score(y_test, y_predicted, pos_label=None, average='weighted')
    # harmonic mean of precision and recall
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    # (true positives + true negatives)/total
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted_counts)
print("accuracy = %0.3f, precision = %0.3f, recall= %0.3f, f1 = %0.3f" % (accuracy, precision, recall, f1))


(3) TF-IDF bag of words. The function below builds the TF-IDF vectorizer; a usage sketch follows it.

def tfidf(data):
    tfidf_vectorizer = TfidfVectorizer()
    train = tfidf_vectorizer.fit_transform(data)
    return train, tfidf_vectorizer
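
The original code stops at the function definition; a minimal usage sketch, assuming the same X_train/X_test split and the same classifier settings as the count-vectorizer experiment above (the variable names here are assumptions):

X_train_tfidf, tfidf_vectorizer = tfidf(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

clf_tfidf = LogisticRegression(C=30, class_weight='balanced', solver='newton-cg',
                               multi_class='multinomial', random_state=40)
clf_tfidf.fit(X_train_tfidf, y_train)
y_predicted_tfidf = clf_tfidf.predict(X_test_tfidf)

accuracy_tfidf, precision_tfidf, recall_tfidf, f1_tfidf = get_metrics(y_test, y_predicted_tfidf)
print("accuracy = %0.3f, precision = %0.3f, recall = %0.3f, f1 = %0.3f"
      % (accuracy_tfidf, precision_tfidf, recall_tfidf, f1_tfidf))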

3. Testing with a Word2Vec Model
Each post is represented by the average of its tokens' 300-dimensional vectors from the pre-trained GoogleNews Word2Vec model; words missing from the vocabulary are mapped to zero vectors (or, optionally, to random vectors), and a logistic regression is trained on these averaged embeddings.

import gensim
import numpy as np
from sklearn.model_selection import train_test_split

word2vec_path = 'data/GoogleNews-vectors-negative300.bin'
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

def get_average_word2vec(tokens_list, vector, generate_missing=False, k=300):
    if len(tokens_list)<1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed,length)
    return averaged

def get_word2vec_embeddings(vectors, clean_questions, generate_missing=False):
    embeddings = clean_questions['tokens'].apply(lambda x: get_average_word2vec(x, vectors,
                                                                generate_missing=generate_missing))
    return list(embeddings)


# list_labels: the integer class labels (same list used for the bag-of-words split above)
embeddings = get_word2vec_embeddings(word2vec, clean_questions)
X_train_word2vec, X_test_word2vec, y_train_word2vec, y_test_word2vec = train_test_split(embeddings,
                                                                        list_labels, test_size=0.2,
                                                                        random_state=3)
clf_w2v = LogisticRegression(C=30, class_weight='balanced', solver='newton-cg',
                              multi_class='multinomial', n_jobs=None, random_state=40)
clf_w2v = clf_w2v.fit(X_train_word2vec, y_train_word2vec)
y_predicted_word2vec = clf_w2v.predict(X_test_word2vec)
accuracy_w2v, precision_w2v, recall_w2v, f1_w2v = get_metrics(y_test_word2vec, y_predicted_word2vec)
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" %
      (accuracy_w2v, precision_w2v, recall_w2v, f1_w2v ))


