Learning Word Embeddings on the IMDB Text Dataset with TensorFlow 2.0

1. A sample of the IMDB dataset is shown below

[
    {
        "rating": 5, 
        "title": "The dark is rising!", 
        "movie": "tt0484562", 
        "review": "It is adapted from the book. I did not read the book and maybe that is why I still enjoyed the movie. There are recent famous books adapted into movies like Eragon which is an unsuccessful movie compared to the rest but I like it better than The Seeker adaptation, another one is The Chronicles of Narnia: The lion, The witch and The wardrobe which is successful and has a sequel under it. The Seeker is this year adaptation. It did a fair job. It is not bad and it is not good. It depends on the viewer. If fans hate the unfaithful adaptation because it does not really follow the line of the story, then be it. Those who have not read the book like me would want to go and watch this movie for entertainment. It did make me a little interested but not enough.It does have its good and bad points. The director failed to bring the spark of the movie. The cast are okay, not too bad. The special effects are considered good for a fantasy movie. What I don't like it is that it is quite short, it just bring straight to the point and that is it. By the time, you will realise it is going to end like that with some short fantasy action. The story is like any fantasy movies. Fast and straight-forward plot. The talking seems long and boring followed by some short action. That is about it. Nothing else. Nothing so interesting to catch your eyes.Overall, it makes a harmless movie to watch in free time or the boring weekends. It is considered dark for children but they still can handle it. It seems long but it is short. Overall, I still think Eragon is better than this. Either you don't like it or like it, it does not matter. It is your view. In this case, I can't say anything. It is just okay.", 
        "link": "http://www.imdb.com/title/tt0484562/reviews-73", 
        "user": "ur12930537"
    }, 
    {
        "rating": 5, 
        "title": "Bad attempt by the people that borough us Eragon.", 
        "movie": "tt0484562", 
        "review": "Ever since Lord of the Rings became a hit and was internationally acclaimed all other studios are trying to do the same thing and I can tell you now we are not getting many successes out of these half hearted attempts. The decent ones are Chronicles of Narnia which Disney snapped up and Harry Potter from Warner Brothers. Even the Golden Compass was pretty good by the same people who did Lord of the Rings but then we get to the bad ones. Fox studios gave us Eragon which I still believe is the worst movie I have ever seen. Now Fox studios tries again with the Seeker: The Dark is Rising and I can tell you it is a lot better than Eragon. However, it still is not very good. The director filmed the movie and then realised that his movie was too short so he had a great idea of just making characters appear for no reason and just look scary. I have not read the books but from what I have heard it isn't even faithful their. Overall, it was a decent try but still not worth seeing.", 
        "link": "http://www.imdb.com/title/tt0484562/reviews-108", 
        "user": "ur15303216"
    }, 
    {
        "rating": 3, 
        "title": "fantasy movie lacks magic", 
        "movie": "tt0484562", 
        "review": "I've not read the novel this movie was based on, but do enjoy fantasy movies, and thought it looked interesting. But after seeing it...... oh dear.An American boy, Will living with his family in a small village somewhere in England, discovers on his 14th birthday that he's The Seeker for a group of old ones, who fight for the Light. He's got days to find them, before the Rider who fights for the Dark comes to full strength....As I said, I've not read the novel, but seeing the movie several things spring to mind. There are echoes of Harry Potter, the Russian movies Night Watch and Day Watch amongst other fantasy movies tossed into the mix. The script is all over the place, though perhaps this is due to some brutal editing as the movie seems disjointed in parts and the director can't resist having his camera moving all the time and with some quick editing it's almost as if he's trying to be Micheal Bay!! You also get the feeling that despite the production team's efforts, the movie didn't have the budget it really needed. There are a couple of so-called twists in the mix, but they are too obvious to work effectively.The acting isn't too bad, with special mention going to Ian McShane, as one of the elder ones but try as they might, they can't save the movie.As the first of a trio of fantasy movies coming out, the others being Stardust and The Golden Compass, I hope this is not a sign of things to come.", 
        "link": "http://www.imdb.com/title/tt0484562/reviews-60", 
        "user": "ur0680065"
    }
]

2. Loading the dataset

The dataset is included in TensorFlow Datasets, which comes preinstalled on Colab. If you don't have it locally, install it with the following command:

!pip install -q tensorflow-datasets

Load the dataset:

import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
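
Since with_info=True also returns a DatasetInfo object, you can sanity-check what was loaded. A minimal sketch (the printed values are those of the standard imdb_reviews split):

print(info.name)                          # imdb_reviews
print(info.splits['train'].num_examples)  # 25000
print(info.splits['test'].num_examples)   # 25000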

3. Setting up the training and test sets

Use the .numpy() method to convert the dataset from the tensor format TensorFlow stores into the NumPy format needed for training.
In Python 3 you need str(s.numpy()) to turn each review tensor into a string.

import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

# str(s.numpy()) is needed in Python 3 instead of just s.numpy()
for s,l in train_data:
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())
  
for s,l in test_data:
  testing_sentences.append(str(s.numpy()))
  testing_labels.append(l.numpy())
  
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
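
As a quick sanity check, each split of imdb_reviews should contain 25,000 labeled reviews (assuming the standard split sizes):

print(len(training_sentences), training_labels_final.shape)  # 25000 (25000,)
print(len(testing_sentences), testing_labels_final.shape)    # 25000 (25000,)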

4. Text preprocessing

Set the vocabulary size to 10,000, the embedding dimension to 16, and the maximum sequence length to 120.
padding: 'pre' or 'post': pad at the start or at the end of each sequence. truncating: 'pre' or 'post': if a sequence is longer than maxlen, cut it from the start or from the end.
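
A minimal sketch of how these two arguments behave, using toy sequences rather than the IMDB data:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(pad_sequences(seqs, maxlen=4))  # defaults are 'pre': [[0 1 2 3] [5 6 7 8]]
print(pad_sequences(seqs, maxlen=4, padding='post', truncating='post'))  # [[1 2 3 0] [4 5 6 7]]

With the settings below, reviews longer than 120 tokens are cut from the end, while padding stays at the default 'pre'.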

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<>"


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)

5. Inspecting the data after pad_sequences

To display the data, invert the word index with reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]), then write a helper function decode_review that decodes the padded sequences back into readable text.

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[1]))
print(training_sentences[1])

b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <> and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own <> without any real concern for anything else i cant recommend this film at all '
b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'

You can see that after tokenization and padding the words are lowercased, punctuation is stripped, and out-of-vocabulary words are marked as <> (this normalization is done by the Tokenizer, not by pad_sequences itself).
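
You can check this directly by running the fitted tokenizer on a made-up sentence. The exact indices below are illustrative and depend on the fitted vocabulary, but Keras always assigns the oov_token index 1:

sample = ["This MOVIE was Phantasmagorical!"]
print(tokenizer.texts_to_sequences(sample))
# e.g. [[11, 17, 13, 1]]: lowercased, punctuation stripped, the unknown word maps to <> (index 1)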

6. Building the network

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Here you can also use GlobalAveragePooling1D instead of Flatten.

Flatten takes a tensor of any shape and turns it into a one-dimensional tensor (plus the samples dimension), keeping every value. For example, a tensor of shape (samples, 10, 20, 1) is flattened to (samples, 10 * 20 * 1).
GlobalAveragePooling does something different: it applies average pooling across the spatial (or time) dimensions, reducing each of them to one while leaving the remaining dimensions unchanged, so the values are averaged rather than all kept. For example, GlobalAveragePooling2D turns a tensor of shape (samples, 10, 20, 1), where dimensions 2 and 3 are spatial (channels last), into (samples, 1).
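
For this model the Embedding layer outputs (samples, 120, 16), so the 1-D variant applies: GlobalAveragePooling1D averages over the 120 time steps and outputs (samples, 16). A sketch of the alternative network, under the same hyperparameters as above:

model_gap = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  # (samples, 120, 16) -> (samples, 16)
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_gap.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])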

7. Training the model

num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
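
If you also want to plot the training curves, capture the History object that model.fit returns (this repeats the same fit call as above); a minimal sketch assuming matplotlib is installed:

import matplotlib.pyplot as plt

history = model.fit(padded, training_labels_final, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels_final))

plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()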


8. Checking the embedding matrix dimensions

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

(10000, 16)

9. Visualizing the embeddings

Displaying the embeddings on http://projector.tensorflow.org/ requires two files. The following code writes them out and, on Colab, downloads them:

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):  # start at 1; index 0 is reserved for padding
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')

Open the site, scroll down on the left and click Load, then upload the vecs.tsv you just downloaded as the vector file and meta.tsv as the metadata file.
Type the word interesting in the search box on the right to see where it sits in the embedding space and which words are its nearest neighbors.

