2021 MCM Problem C - The Spread of the Asian Giant Hornet (image classification model, word2vec vectorization, building a multimodal model)

2021 MCM Problem C (the big-data problem) - The Spread of the Asian Giant Hornet (building a multimodal model from images, text, and geographic location information)

Taking part in MCM this year was honestly a bit frustrating. I was the only person on the team handling the model and the code, and since we were each at home over the break we could not work together closely, so the result was predictable 😅. Even so, I would like to share:

  • My approach (this post)
  • A similar approach I found online (F)
  • The Outstanding (O) award approach
  • A comparison of the three.

0. The Problem

Full problem statement: link

In short: using only the data provided by the organizers (images, text descriptions, and geographic location information), decide whether each citizen-reported "Asian giant hornet sighting" is real or not.

1. My Approach

1.1 Overview

  • Images: a CNN image classification model
  • Text: use word2vec to obtain a vector for each description, then measure its similarity to the Wikipedia description of the species
  • Location: compute the relative position (distance to confirmed sightings)
  • Combination: fuse the three values above into a single credibility score

1.2 Image Model

Build a CNN image classification model via transfer learning.
For background on transfer learning, see my earlier related posts.

  • Load the image data (since the two classes are heavily imbalanced, image augmentation is applied)

    import tensorflow as tf
    from tensorflow.keras.applications.vgg16 import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.layers import GlobalAveragePooling2D
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras import Model
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras.losses import BinaryCrossentropy
    from tensorflow.keras.preprocessing import image_dataset_from_directory
    
    import matplotlib.pyplot as plt
    
    
    IMG_SIZE = (224,224) # target image size
    IMG_SHAPE = IMG_SIZE + (3,) # input shape derived from IMG_SIZE -> (224, 224, 3)
    NUM_CLASSES = 2 # number of classes to predict
    BATCH_SIZE = 64 # batch size
    
    train_path = r"F:\PythonLearning\Project\IMAGE&TEXT-2021美赛\Problem C\data\train"
    
    
    train_gen = tf.keras.preprocessing.image.ImageDataGenerator(preprocessing_function=preprocess_input,
        rotation_range=30,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        validation_split=0.2
    )
    
    train_generator = train_gen.flow_from_directory(train_path, target_size=IMG_SIZE, batch_size=BATCH_SIZE, subset="training")
    val_generator = train_gen.flow_from_directory(train_path, target_size=IMG_SIZE, batch_size=BATCH_SIZE, subset="validation")
    
  • Build the transfer-learning model

    def add_input_top_model(base_model, class_num, input_shape):
        '''Add an input layer and a classification head on top of the base convolutional network,
           so that the input shape matches the model.
          Args:
            base_model: keras model excluding top
            class_num: number of classes
            input_shape: shape of input image
          Returns:
            the model with the fully connected head added
        '''
        inputs = tf.keras.Input(shape=input_shape)
        # preprocess_input is already applied by the ImageDataGenerator,
        # so it is not applied a second time inside the model
        x = base_model(inputs, training=False)
        x = GlobalAveragePooling2D()(x)
        # binary classification: a single logit output
        if class_num == 2:
            outputs = Dense(1)(x)  # logit
        else:
            outputs = Dense(class_num, activation='softmax')(x)
        model = Model(inputs=inputs, outputs=outputs)
        return model
    
    
    def model_compile(model, learning_rate=0.001):
        # a binary model ends in a single-logit Dense(1), so check the size of the last output dimension
        output_dim = model.output.shape[-1]
        if output_dim == 1:
            # binary classification
            model.compile(optimizer=Adam(learning_rate=learning_rate),
                          loss=BinaryCrossentropy(from_logits=True),
                          metrics=['accuracy'])
        else:
            model.compile(optimizer=Adam(learning_rate=learning_rate), 
                          loss='sparse_categorical_crossentropy', 
                          metrics=['accuracy'])
        return model
    
  • Build with fine-tuning

    '''1.1 Build the pre-trained base model'''
    print("BASE MODEL:")
    base_model = VGG16(input_shape=IMG_SHAPE, include_top=False, weights='imagenet')
    base_model.trainable = False  # freeze all pre-trained layers
    # Let's take a look at the base model architecture
    # base_model.summary()
    
    '''1.2 Fine-tuning setup'''
    base_model.trainable = True
    fine_tune_at = 16  # layers before index 16 stay frozen; layers from index 16 onward are retrained
    for layer in base_model.layers[:fine_tune_at]:
        layer.trainable = False
    
    '''1.3 Add the top classifier & the input layer'''
    print("ADD CLS TOP LAYER & INPUT LAYER:")
    final_model = add_input_top_model(base_model, NUM_CLASSES, IMG_SHAPE)
    # final_model.summary()
    
    '''1.4 Compile'''
    model = model_compile(final_model)
    model.summary()
    
  • Train the model

    callback = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=0.005, patience=3)
    history = model.fit(train_generator,validation_data=val_generator,epochs=100, callbacks=[callback])
    
  • Learning Curve

    '''4. Learning Curve'''
    # acc = history.history['accuracy']
    # val_acc = history.history['val_accuracy']
    
    # loss = history.history['loss']
    # val_loss = history.history['val_loss']
    
    # plt.figure(figsize=(8, 8))
    # plt.subplot(2, 1, 1)
    # plt.plot(acc, label='Training Accuracy')
    # plt.plot(val_acc, label='Validation Accuracy')
    # plt.legend(loc='lower right')
    # plt.ylabel('Accuracy')
    # plt.ylim([min(plt.ylim()),1])
    # plt.title('Training and Validation Accuracy')
    
    # plt.subplot(2, 1, 2)
    # plt.plot(loss, label='Training Loss')
    # plt.plot(val_loss, label='Validation Loss')
    # plt.legend(loc='upper right')
    # plt.ylabel('Cross Entropy')
    # plt.ylim([0,1.0])
    # plt.title('Training and Validation Loss')
    # plt.xlabel('epoch')
    # plt.show()
    
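  • Per-image probability: the model above ends in a single logit (Dense(1) trained with from_logits=True), while the fusion step in Section 1.5 needs a probability for each reported image. A minimal sketch of that conversion, reusing the constants and trained model from above; report_path is a hypothetical folder, and flow_from_directory expects the images to sit inside at least one subdirectory of it:

    report_path = r"report_images"  # hypothetical folder containing the images attached to the reports

    # same preprocessing as training, but no augmentation; shuffle=False keeps the file order stable
    pred_gen = tf.keras.preprocessing.image.ImageDataGenerator(preprocessing_function=preprocess_input)
    report_generator = pred_gen.flow_from_directory(report_path, target_size=IMG_SIZE,
                                                    batch_size=BATCH_SIZE, class_mode=None, shuffle=False)

    logits = model.predict(report_generator)            # shape: (num_images, 1)
    image_prob = tf.nn.sigmoid(logits).numpy().ravel()  # P(image shows an Asian giant hornet)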

1.3 Text Processing

Use word2vec to vectorize the text, then measure how close each description is with cosine similarity.

Since the sample size here is small, a word2vec model pre-trained on Google News is used.

  • Imports
import string
import numpy as np
import pandas as pd
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import gensim.models
  • Preprocessing
def remove_punc(data):
    texts = []
    for i in range(len(data)):
        # "have_lab_comment" is True when the Lab Comments field is blank (see where it is built below),
        # so in that case fall back to the citizen's Notes
        if data.loc[i,"have_lab_comment"] == True:
            texts.append(str(data.loc[i,"Notes"]))
        else:
            texts.append(str(data.loc[i,"Lab Comments"]))
    data["text"] = texts
    data["clean_text"] = data.text
    for each in string.punctuation:
        data["clean_text"] = data["clean_text"].apply(lambda x: x.replace(each,''))
    return data


def lower_case(data):
    data["clean_text"] = data.clean_text.apply(lambda x: x.lower())
    return data


def remove_number(data):
    data["clean_text"] = data.clean_text.apply(lambda x: ''.join(word for word in x if not word.isdigit()))
    return data


def remove_stopword(data):
    stop_words = set(stopwords.words('english'))
    index = 0
    for each_words in data.clean_text.apply(lambda x: word_tokenize(x)):
        words = [w for w in each_words if not w in stop_words]
        data.loc[index,"clean_text"] = " ".join(words)
        index += 1
    return data

def lemmatize_word(data):
    index = 0
    lemmatizer = WordNetLemmatizer()
    for each_text in data.clean_text.apply(lambda x: word_tokenize(x)):
        words = [w for w in each_text]
        lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in words] #'a',"n","r","v"
        data.loc[index,"clean_text"] = " ".join(lemmatized)
        index += 1
    return data


all_data = pd.read_excel("2021MCMProblemC_DataSet.xlsx")
all_text_data = all_data.copy()
all_text_data["have_lab_comment"] = [all_data.loc[i,"Lab Comments"] == ' ' for i in range(len(all_data))]  # True when the Lab Comments field is blank
No_lab_comment = all_text_data[all_text_data["have_lab_comment"] == True]
Have_lab_comment = all_text_data[all_text_data["have_lab_comment"] == False]


cleaned_text_data = all_text_data.copy()
cleaned_text_data = remove_punc(cleaned_text_data)
cleaned_text_data = lower_case(cleaned_text_data)
cleaned_text_data = remove_number(cleaned_text_data)
cleaned_text_data = remove_stopword(cleaned_text_data)
cleaned_text_data = lemmatize_word(cleaned_text_data)
cleaned_text_data.head()
  • Load the pre-trained model (the file can be found via Google, Baidu, etc.)
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('pretrained-model/GoogleNews-vectors-negative300.bin', binary=True)
  • Word vectorization
all_vec = []
cleaned_text = cleaned_text_data["clean_text"]
for i in range(len(cleaned_text)):
    text = cleaned_text.at[i]
    sum_vec = np.zeros((300,))
    words = word_tokenize(text)
    for each_word in words:
        try:
            sum_vec += word2vec_model[each_word]
        except KeyError:
            # word not in the pre-trained vocabulary
            pass
    # average of the word vectors (guard against empty descriptions)
    avg_vec = sum_vec / len(words) if len(words) > 0 else sum_vec
    all_vec.append(avg_vec)
  • Wikipedia text, vectorization & similarity computation
# compute cosine similarity
def cosine_similarity(x, y):
    # dot product
    num = float(np.dot(x, y))
    # product of the two vector norms
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom != 0:
        ret = num / denom
    else:
        ret = 0
    return ret


def similarity_cal(all_comment_vec, all_feature_vec):
    all_similarity = []
    for each_comment in all_comment_vec:
        similarity = 0
        for each_feature in all_feature_vec:
            temp = cosine_similarity(each_comment, each_feature)
            if temp > similarity:
                similarity = temp
        all_similarity.append(similarity)
    return all_similarity



# text describing the Asian giant hornet, taken from Wikipedia
bee_feature_text = """the world's largest hornet, native to temperate and tropical Eastern Asia. Its body length is approximately 50.8 mm (2.0 in), with a wingspan of about 76 mm (3 in).[1] Queens may reach a length of 55 mm (2.2 in).[2] Due to its size, it is known in Japan as the giant sparrow bee.
The head of the hornet is orange and quite wide in comparison to other hornet species. The compound eyes and ocelli are dark brown, and the antennae are dark brown with orange scapes. The clypeus (the shield-like plate on the front of the head) is orange and coarsely punctured; the posterior side of the clypeus has narrow, rounded lobes. The mandible is large and orange with a black tooth (inner biting surface).
The thorax and propodeum (the segment which forms the posterior part of the thorax) of the Asian giant hornet have a distinctive golden tint and a large scutellum (a shield-like scale on the thorax) that has a deeply-impressed medial line; the postscutellum (the plate behind the scutellum) bulges and overhangs the propodeum. The hornet's forelegs are orange with dark brown tarsi (the feet of the leg); the mid legs and hind legs are dark brown. Wings are a dark brownish gray. The tegulae are brown.
The gaster (the portion of the abdomen behind the thorax-abdomen connection) is dark brown with a white, powdery covering; with narrow yellow bands at the posterior margins of the tergites, the sixth segment is entirely yellow. It is similar in appearance to the established European hornet.
Asian giant hornets, like other social wasps, are predators of other insects. For reasons that aren't clear, Asian giant hornets switch from other prey sources to honey bees beginning in August and peaking in September and October. 
Queens are only seen outside the nest when they are hibernating or in the spring before workers have emerged. 
similar in size to other wasps
yellow heads, a black thorax, and yellow and black or brown striped abdomens
nests underground, in abandoned rodent burrows, often in association with pine roots; also in dead, hollow trunks or roots of trees, but these are never more than 3 to 6 feet above the ground
nest dispersion"""


# clean every sentence, then collect the word vectors of every word
all_feature_text = bee_feature_text.split('\n')
for i in range(len(all_feature_text)):
    feature_text = all_feature_text[i]
    for each in string.punctuation:
        feature_text = feature_text.replace(each,' ')
    feature_text = feature_text.lower()
    feature_text = ''.join([word for word in feature_text if not word.isdigit()])
    stop_words = set(stopwords.words('english'))
    words = []
    lemmatizer = WordNetLemmatizer()
    for each_word in word_tokenize(feature_text):
        if each_word not in stop_words:
            lemmatized = lemmatizer.lemmatize(each_word, pos='v')
            words.append(lemmatized)
    feature_text = " ".join(words)
    all_feature_text[i] = feature_text

all_feature_text_vec = []
for each_text in all_feature_text:
    text_vec_list = []
    for each_word in word_tokenize(each_text):
        try:
            text_vec_list.append(word2vec_model[each_word])
        except KeyError:
            # word not in the pre-trained vocabulary
            pass
    all_feature_text_vec.append(text_vec_list)

# sum and average the word vectors of each sentence
feature_vector_final = []
for each_feature_sentence in all_feature_text_vec:
    feature_sum = np.zeros(300,)
    for each_word in each_feature_sentence:
        feature_sum += each_word
    feature_vec = feature_sum / len(each_feature_sentence)
    feature_vector_final.append(feature_vec)

# all_vec = averaged vector of every comment; feature_vector_final = averaged vector of every Wikipedia sentence
similarities = similarity_cal(all_vec, feature_vector_final)

'''similarities now holds, for every report's description, its maximum similarity to the Wikipedia description'''

1.4 Location Processing

  • Find the confirmed-positive points
# find the points confirmed to be positive ("Positive ID")
all_true_point = cleaned_text_data[cleaned_text_data["Lab Status"] == "Positive ID"][["Latitude", "Longitude"]]
  • Compute each report's distance to these points and take the minimum
# planar Euclidean distance on latitude/longitude, used here as a rough proxy for geographic distance
def distance(x1, y1, x2, y2):
    return ((x2-x1)**2 + (y2-y1)**2)**0.5

# compute the distance from every report to each confirmed point and keep the closest
true_lats = list(all_true_point["Latitude"])
true_longs = list(all_true_point["Longitude"])
all_points = []  # minimum distance from each report to a confirmed sighting
for i in range(len(true_lats)):
    lati = true_lats[i]
    long = true_longs[i]
    for j in range(len(cleaned_text_data)):
        item_la = cleaned_text_data.at[j, "Latitude"]
        item_lo = cleaned_text_data.at[j, "Longitude"]
        dis = distance(lati, long, item_la, item_lo)
        if i == 0:
            all_points.append(dis)
        else:
            if dis < all_points[j]:
                all_points[j] = dis
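
  • (Optional) great-circle distance: the Euclidean distance on raw latitude/longitude above is only a rough proxy, since a degree of longitude shrinks with latitude. If distances in kilometres are preferred, a haversine function could be swapped in for distance(); a minimal sketch, not part of the original solution:
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in kilometres between two (lat, lon) points given in degrees
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# replacing distance(lati, long, item_la, item_lo) with haversine_km(lati, long, item_la, item_lo)
# in the loop above only changes the unit; the minimum-distance logic stays the same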

1.5 The Final Classifier

At this point we have three values for every report:

  1. the probability that the image shows an Asian giant hornet
  2. the similarity of the text description to the Asian giant hornet description
  3. the minimum distance to locations confirmed to be Asian giant hornet sightings

Basic approach: combine the three values with a hand-crafted formula (for example an unweighted average) to obtain the final classifier.
More advanced approach: train a machine learning model (SVM, random forest, etc.) on the three values to make the classification, as sketched below.
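
A minimal sketch of the more advanced route, assuming the three signals computed above (the hypothetical image_prob from the per-image probability sketch in Section 1.2, similarities from Section 1.3, and all_points from Section 1.4), and using the lab-verified rows of Lab Status as training labels ("Negative ID" as the negative class is an assumption about the dataset's label values):
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# one row per report: [image probability, text similarity, minimum distance to a confirmed sighting]
features = np.column_stack([image_prob, similarities, all_points])

# only lab-verified reports carry a label to train on
labeled_mask = cleaned_text_data["Lab Status"].isin(["Positive ID", "Negative ID"]).values
labels = (cleaned_text_data["Lab Status"] == "Positive ID").astype(int).values

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(features[labeled_mask], labels[labeled_mask])

# credibility score of every report = predicted probability of being a true sighting
credibility = clf.predict_proba(features)[:, 1]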
