Text Similarity Matching - Task 5

  • Task 5: text matching model (LSTM Siamese network)
    • Step 1: define the Siamese network (embedding layer, LSTM layers, fully connected layer)
    • Step 2: train the Siamese network on the text-matching data
    • Step 3: predict on the test data

1. Training the LSTM Siamese network

An RNN/LSTM Siamese network is a text matching model: it takes two text sequences as input, encodes each of them with shared weights, and combines the two encoded vectors into a similarity score that measures how alike the texts are. The network used here has three stages: text embedding, text feature extraction, and text matching.
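
The matching stage below scores the two sentence encodings h_left and h_right with the exponentiated negative Manhattan distance (exponent_neg_manhattan_distance in the code), which maps any pair of encodings to a similarity in (0, 1], where 1 means the encodings are identical:

    sim(h_left, h_right) = exp( -Σ_i |h_left[i] - h_right[i]| )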

References: see the repository link at the end of this post.

train.data is generated by concatenating the LCQMC train and valid splits into a single file.
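
A minimal sketch of that merge step, assuming the raw files live under ./data/ in the tab-separated sentence1<TAB>sentence2<TAB>label format, with the file names from the commented-out paths in __init__ below (merge_lcqmc is a hypothetical helper, not part of the original script):

import os

# Concatenate the LCQMC train and valid splits into data/train.data.
def merge_lcqmc(data_dir='./data'):
    out_path = os.path.join(data_dir, 'train.data')
    with open(out_path, 'w', encoding='utf8') as out:
        for name in ('LCQMC.train.data', 'LCQMC.valid.data'):
            with open(os.path.join(data_dir, name), encoding='utf8') as f:
                for line in f:
                    if line.strip():              # drop blank lines
                        out.write(line.rstrip('\n') + '\n')

merge_lcqmc()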

import os
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Lambda
from keras import backend as K


class SiameseNetwork:
    def __init__(self):
        cur = os.path.dirname(os.path.abspath(__file__))
        # Raw LCQMC splits, merged beforehand into data/train.data:
        # self.train_path = os.path.join(cur, 'data/LCQMC.train.data')
        # self.valid_path = os.path.join(cur, 'data/LCQMC.valid.data')
        self.train_path = os.path.join(cur, 'data/train.data')
        self.vocab_path = os.path.join(cur, 'model/vocab.txt')
        self.embedding_file = os.path.join(cur, 'model/token_vec_300.bin')
        self.timestamps_file = os.path.join(cur, 'model/timestamps.txt')
        self.model_path = os.path.join(cur, 'model/tokenvec_bilstm2_siamese_model.h5')
        self.datas, self.word_dict = self.build_data()
        self.EMBEDDING_DIM = 300
        self.EPOCHS = 10
        self.BATCH_SIZE = 512
        self.VOCAB_SIZE = len(self.word_dict)
        self.LIMIT_RATE = 0.95      # keep the length that covers 95% of the samples
        self.TIME_STAMPS = self.select_best_length()
        self.embedding_matrix = self.build_embedding_matrix()

        print(self.VOCAB_SIZE)

    '''Pick the max sequence length (TIME_STAMPS) that covers LIMIT_RATE of the samples'''
    def select_best_length(self):
        len_list = []
        max_length = 0
        cover_rate = 0.0
        with open(self.train_path, encoding='utf8') as f:
            for line in f:
                line = line.strip().split('\t')   # train.data is tab-separated
                if not line[0]:
                    continue
                # Collect character lengths of the left sentences.
                len_list.append(len(line[0]))
        all_sent = len(len_list)
        average_length = sum(len_list) / all_sent
        # Walk the length distribution in ascending order and stop once the
        # cumulative frequency reaches LIMIT_RATE (a 95th-percentile cut-off).
        for length, count in sorted(Counter(len_list).items()):
            cover_rate += count / all_sent
            if cover_rate >= self.LIMIT_RATE:
                max_length = length
                break
        print('average_length:', average_length)
        print('max_length:', max_length)
        print('saving timestamps.....')
        with open(self.timestamps_file, 'w') as f:
            f.write(str(max_length))
        return max_length

    '''Build the dataset and the character vocabulary'''
    def build_data(self):
        sample_y = []
        sample_x_left = []
        sample_x_right = []
        vocabs = {'UNK'}                      # reserve an UNK token for unseen characters
        with open(self.train_path, encoding='utf8') as f:
            for line in f:
                line = line.rstrip().split('\t')
                if len(line) < 3:
                    continue
                sent_left, sent_right, label = line[0], line[1], line[2]
                sample_x_left.append(list(sent_left))
                sample_x_right.append(list(sent_right))
                sample_y.append(label)
                for char in sent_left + sent_right:
                    vocabs.add(char)
        print(len(sample_x_left), len(sample_x_right))
        sample_x = [sample_x_left, sample_x_right]
        datas = [sample_x, sample_y]
        # Index characters from 1 so that 0 is reserved for padding
        # (the Embedding layer below uses mask_zero=True).
        word_dict = {wd: index + 1 for index, wd in enumerate(list(vocabs))}
        self.write_file(vocabs, self.vocab_path)
        return datas, word_dict

    '''Convert the data into the format Keras expects'''
    def modify_data(self):
        sample_x, sample_y = self.datas
        sample_x_left, sample_x_right = sample_x
        left_x_train = [[self.word_dict[char] for char in data] for data in sample_x_left]
        right_x_train = [[self.word_dict[char] for char in data] for data in sample_x_right]
        y_train = [int(i) for i in sample_y]
        # Pad (or truncate) both sides to TIME_STAMPS steps.
        left_x_train = pad_sequences(left_x_train, self.TIME_STAMPS)
        right_x_train = pad_sequences(right_x_train, self.TIME_STAMPS)
        y_train = np.expand_dims(np.asarray(y_train), 1)   # shape (n, 1) to match the model output
        return left_x_train, right_x_train, y_train

    '''Save the vocabulary file'''
    def write_file(self, wordlist, filepath):
        print(len(wordlist))
        with open(filepath, 'w', encoding='utf8') as f:
            for wd in wordlist:
                f.write(wd + '\n')

    '''Load the pretrained character embeddings'''
    def load_pretrained_embedding(self):
        embeddings_dict = {}
        with open(self.embedding_file, 'r', encoding='utf8') as f:
            for line in f:
                values = line.strip().split(' ')
                if len(values) < self.EMBEDDING_DIM + 1:   # skip the header and malformed lines
                    continue
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_dict[word] = coefs
        print('Found %s word vectors.' % len(embeddings_dict))
        return embeddings_dict

    '''Build the embedding matrix'''
    def build_embedding_matrix(self):
        embedding_dict = self.load_pretrained_embedding()
        # Row 0 stays all-zero: it is the padding index.
        embedding_matrix = np.zeros((self.VOCAB_SIZE + 1, self.EMBEDDING_DIM))
        for word, i in self.word_dict.items():
            embedding_vector = embedding_dict.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
        return embedding_matrix

    '''Similarity of two sentence encodings: exponentiated negative Manhattan distance'''
    def exponent_neg_manhattan_distance(self, inputX):
        (sent_left, sent_right) = inputX
        return K.exp(-K.sum(K.abs(sent_left - sent_right), axis=1, keepdims=True))

    '''Euclidean distance between two sentence encodings (unused alternative)'''
    def euclidean_distance(self, sent_left, sent_right):
        sum_square = K.sum(K.square(sent_left - sent_right), axis=1, keepdims=True)
        return K.sqrt(K.maximum(sum_square, K.epsilon()))


    '''Build the shared encoder network (weights are shared between both branches)'''
    def create_base_network(self, input_shape):
        inputs = Input(shape=input_shape)
        lstm1 = Bidirectional(LSTM(128, return_sequences=True))(inputs)
        lstm1 = Dropout(0.5)(lstm1)
        lstm2 = Bidirectional(LSTM(32))(lstm1)
        lstm2 = Dropout(0.5)(lstm2)
        return Model(inputs, lstm2)

    '''Assemble the Siamese network'''
    def bilstm_siamese_model(self):
        embedding_layer = Embedding(self.VOCAB_SIZE + 1,
                                    self.EMBEDDING_DIM,
                                    weights=[self.embedding_matrix],
                                    input_length=self.TIME_STAMPS,
                                    trainable=False,
                                    mask_zero=True)

        left_input = Input(shape=(self.TIME_STAMPS,), dtype='int32')
        right_input = Input(shape=(self.TIME_STAMPS,), dtype='int32')

        encoded_left = embedding_layer(left_input)
        encoded_right = embedding_layer(right_input)

        # Both branches run through the same encoder instance, so weights are shared.
        shared_lstm = self.create_base_network(input_shape=(self.TIME_STAMPS, self.EMBEDDING_DIM))
        left_output = shared_lstm(encoded_left)
        right_output = shared_lstm(encoded_right)
        distance = Lambda(self.exponent_neg_manhattan_distance)([left_output, right_output])
        model = Model([left_input, right_input], distance)
        model.compile(loss='binary_crossentropy',
                      optimizer='nadam',
                      metrics=['accuracy'])
        model.summary()
        return model


    '''Train the model'''
    def train_model(self):
        left_x_data, right_x_data, y_data = self.modify_data()
        # train.data concatenates LCQMC train (238766 pairs) and valid (8802 pairs),
        # so split it back at the original boundary.
        split = 238766
        left_x_train, left_x_valid = left_x_data[:split], left_x_data[split:]
        right_x_train, right_x_valid = right_x_data[:split], right_x_data[split:]
        y_train, y_valid = y_data[:split], y_data[split:]
        model = self.bilstm_siamese_model()
        history = model.fit(
                              x=[left_x_train, right_x_train],
                              y=y_train,
                              validation_data=([left_x_valid, right_x_valid], y_valid),
                              batch_size=self.BATCH_SIZE,
                              epochs=self.EPOCHS,
                            )
        self.draw_train(history)
        model.save_weights(self.model_path)
        return model

    '''Plot the training curves'''
    def draw_train(self, history):
        # Plot training & validation accuracy values
        plt.plot(history.history['acc'])
        plt.plot(history.history['val_acc'])
        plt.title('Model accuracy')
        plt.ylabel('Accuracy')
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Valid'], loc='upper left')
        plt.show()

        # Plot training & validation loss values
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('Model loss')
        plt.ylabel('Loss')
        plt.xlabel('Epoch')
        plt.legend(['Train', 'Valid'], loc='upper left')
        plt.show()


if __name__ == '__main__':
    handler = SiameseNetwork()
    handler.train_model()
'''
Using TensorFlow backend.
247568 247568
5056
average_length: 10.730158986621857
max_length: 5
saving timestamps.....
Found 20028 word vectors.
5056
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 5)            0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 5)            0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 5, 300)       1517100     input_1[0][0]
                                                                 input_2[0][0]
__________________________________________________________________________________________________
model_1 (Model)                 (None, 64)           513280      embedding_1[0][0]
                                                                 embedding_1[1][0]
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 1)            0           model_1[1][0]
                                                                 model_1[2][0]
==================================================================================================
Total params: 2,030,380
Trainable params: 513,280
Non-trainable params: 1,517,100
__________________________________________________________________________________________________
Train on 238767 samples, validate on 8802 samples
Epoch 1/10
238767/238767 [==============================] - 58s 243us/step - loss: 0.7997 - acc: 0.6185 - val_loss: 1.6662 - val_acc: 0.5201
Epoch 2/10
238767/238767 [==============================] - 57s 240us/step - loss: 0.6066 - acc: 0.6716 - val_loss: 1.5997 - val_acc: 0.5377
Epoch 3/10
238767/238767 [==============================] - 55s 229us/step - loss: 0.5883 - acc: 0.6895 - val_loss: 1.6456 - val_acc: 0.5377
Epoch 4/10
238767/238767 [==============================] - 58s 244us/step - loss: 0.5768 - acc: 0.6990 - val_loss: 1.6298 - val_acc: 0.5377
Epoch 5/10
238767/238767 [==============================] - 58s 244us/step - loss: 0.5659 - acc: 0.7083 - val_loss: 1.5989 - val_acc: 0.5515
Epoch 6/10
238767/238767 [==============================] - 58s 241us/step - loss: 0.5579 - acc: 0.7143 - val_loss: 1.5905 - val_acc: 0.5576
Epoch 7/10
238767/238767 [==============================] - 58s 242us/step - loss: 0.5507 - acc: 0.7195 - val_loss: 1.5879 - val_acc: 0.5564
Epoch 8/10
238767/238767 [==============================] - 58s 244us/step - loss: 0.5437 - acc: 0.7246 - val_loss: 1.5911 - val_acc: 0.5615
Epoch 9/10
238767/238767 [==============================] - 58s 241us/step - loss: 0.5356 - acc: 0.7303 - val_loss: 1.6027 - val_acc: 0.5606
Epoch 10/10
238767/238767 [==============================] - 59s 249us/step - loss: 0.5309 - acc: 0.7337 - val_loss: 1.5985 - val_acc: 0.5667
'''
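
One thing stands out in this log: max_length came out as 5 even though the average sentence length is 10.7. That is because the run accumulated the cover rate over Counter.most_common(), which sorts by frequency rather than by ascending length, so nearly every sentence was truncated to 5 characters, which likely caps the validation accuracy around 0.56. A quick sanity check of the percentile-style selection used in select_best_length above (a toy distribution, not LCQMC data):

from collections import Counter

# Toy length distribution: 95% of samples are covered at length 12.
len_list = [5] * 40 + [10] * 40 + [12] * 15 + [50] * 5

cover_rate, max_length = 0.0, 0
for length, count in sorted(Counter(len_list).items()):
    cover_rate += count / len(len_list)
    if cover_rate >= 0.95:
        max_length = length
        break
print(max_length)   # -> 12: truncating at 12 keeps 95% of the samples intact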

2. Prediction

The prediction script rebuilds the same network, loads the trained weights, and scores sentence pairs from the LCQMC test set:

import os

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Lambda
from keras import backend as K


class SiameseNetwork:
    def __init__(self):
        cur = os.path.dirname(os.path.abspath(__file__))
        self.train_path = os.path.join(cur, 'data/train.data')
        self.vocab_path = os.path.join(cur, 'model/vocab.txt')
        self.embedding_file = os.path.join(cur, 'model/token_vec_300.bin')
        self.model_path = os.path.join(cur, 'model/tokenvec_bilstm2_siamese_model.h5')
        self.timestamps_file = os.path.join(cur, 'model/timestamps.txt')
        self.word_dict = self.load_worddict()
        self.EMBEDDING_DIM = 300
        self.VOCAB_SIZE = len(self.word_dict)
        self.TIME_STAMPS = self.load_timestamps()
        self.embedding_matrix = self.build_embedding_matrix()
        self.model = self.load_siamese_model()

    '''Load the saved TIME_STAMPS value'''
    def load_timestamps(self):
        with open(self.timestamps_file) as f:
            timestamps = [i.strip() for i in f if i.strip()][0]
        return int(timestamps)

    '''Load the vocabulary, mirroring the index scheme used at training time'''
    def load_worddict(self):
        with open(self.vocab_path, encoding='utf8') as f:
            vocabs = [i.rstrip('\n') for i in f]
        # Index from 1: index 0 is reserved for padding, exactly as in training.
        word_dict = {wd: index + 1 for index, wd in enumerate(vocabs)}
        print(len(vocabs))
        return word_dict

    '''Convert an input sentence into a padded index sequence'''
    def represent_sent(self, s):
        sent = []
        for char in s:
            # Map characters unseen at training time to the UNK index.
            sent.append(self.word_dict.get(char, self.word_dict['UNK']))
        sent_rep = pad_sequences([sent], self.TIME_STAMPS)
        return sent_rep

    '''Load the pretrained character embeddings'''
    def load_pretrained_embedding(self):
        embeddings_dict = {}
        with open(self.embedding_file, 'r', encoding='utf8') as f:
            for line in f:
                values = line.strip().split(' ')
                if len(values) < self.EMBEDDING_DIM + 1:   # skip the header and malformed lines
                    continue
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_dict[word] = coefs
        print('Found %s word vectors.' % len(embeddings_dict))
        return embeddings_dict

    '''Build the embedding matrix'''
    def build_embedding_matrix(self):
        embedding_dict = self.load_pretrained_embedding()
        # Row 0 stays all-zero: it is the padding index.
        embedding_matrix = np.zeros((self.VOCAB_SIZE + 1, self.EMBEDDING_DIM))
        for word, i in self.word_dict.items():
            embedding_vector = embedding_dict.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
        return embedding_matrix

    '''Similarity of two sentence encodings: exponentiated negative Manhattan distance'''
    def exponent_neg_manhattan_distance(self, inputX):
        (sent_left, sent_right) = inputX
        return K.exp(-K.sum(K.abs(sent_left - sent_right), axis=1, keepdims=True))

    '''Euclidean distance between two sentence encodings (unused alternative)'''
    def euclidean_distance(self, sent_left, sent_right):
        sum_square = K.sum(K.square(sent_left - sent_right), axis=1, keepdims=True)
        return K.sqrt(K.maximum(sum_square, K.epsilon()))


    '''Build the shared encoder network (weights are shared between both branches)'''
    def create_base_network(self, input_shape):
        inputs = Input(shape=input_shape)
        lstm1 = Bidirectional(LSTM(128, return_sequences=True))(inputs)
        lstm1 = Dropout(0.5)(lstm1)
        lstm2 = Bidirectional(LSTM(32))(lstm1)
        lstm2 = Dropout(0.5)(lstm2)
        return Model(inputs, lstm2)

    '''Assemble the Siamese network (must match the training-time architecture)'''
    def bilstm_siamese_model(self):
        embedding_layer = Embedding(self.VOCAB_SIZE + 1,
                                    self.EMBEDDING_DIM,
                                    weights=[self.embedding_matrix],
                                    input_length=self.TIME_STAMPS,
                                    trainable=False,
                                    mask_zero=True)

        left_input = Input(shape=(self.TIME_STAMPS,), dtype='int32')
        right_input = Input(shape=(self.TIME_STAMPS,), dtype='int32')

        encoded_left = embedding_layer(left_input)
        encoded_right = embedding_layer(right_input)

        # Both branches run through the same encoder instance, so weights are shared.
        shared_lstm = self.create_base_network(input_shape=(self.TIME_STAMPS, self.EMBEDDING_DIM))
        left_output = shared_lstm(encoded_left)
        right_output = shared_lstm(encoded_right)

        distance = Lambda(self.exponent_neg_manhattan_distance)([left_output, right_output])
        model = Model([left_input, right_input], distance)
        model.compile(loss='binary_crossentropy',
                      optimizer='nadam',
                      metrics=['accuracy'])
        model.summary()
        return model

    '''Rebuild the network and load the trained weights'''
    def load_siamese_model(self):
        model = self.bilstm_siamese_model()
        model.load_weights(self.model_path)
        return model

    '''Score a sentence pair; the output is a similarity in (0, 1]'''
    def predict(self, s1, s2):
        rep_s1 = self.represent_sent(s1)
        rep_s2 = self.represent_sent(s2)
        res = self.model.predict([rep_s1, rep_s2])
        return res

    '''Smoke-test the model on a single sentence pair'''
    def test(self):
        s1 = '英雄联盟什么英雄最好'
        s2 = '英雄联盟最好英雄是什么'
        res = self.predict(s1, s2)
        print(res)
        return

    '''Evaluate accuracy on the LCQMC test set, thresholding the similarity at 0.5'''
    def test_data(self):
        lines = open('./data/LCQMC.test.data', 'r', encoding='utf8').readlines()
        correct = 0
        wrong = 0
        for line in lines:
            line = line.rstrip().split('\t')
            if len(line) < 3:
                continue
            sent_left, sent_right, label = line[0], line[1], line[2]
            res = self.predict(sent_left, sent_right)
            pred = '1' if res[0][0] >= 0.5 else '0'
            if pred == str(label):
                correct += 1
            else:
                wrong += 1
        print(wrong)
        print('accuracy:', correct / len(lines))

if __name__ == '__main__':
    handler = SiameseNetwork()
    # handler.test()
    handler.test_data()
'''
Using TensorFlow backend.
true
5056
Found 20028 word vectors.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 5)            0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 5)            0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 5, 300)       1517100     input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
model_1 (Model)                 (None, 64)           513280      embedding_1[0][0]                
                                                                 embedding_1[1][0]                
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 1)            0           model_1[1][0]                    
                                                                 model_1[2][0]                    
==================================================================================================
Total params: 2,030,380
Trainable params: 513,280
Non-trainable params: 1,517,100
__________________________________________________________________________________________________
5188
accuray: 0.58496
'''

Reference: https://github.com/liuhuanyong/SiameseSentenceSimilarity

