Basic Sentiment Analysis with Julia using LSTM

The idea of this post is to give an introduction to sentiment analysis using Julia, a language designed for high performance, with a syntax similar to Python's.

Sentiment analysis has grown within the artificial intelligence landscape over the last years, changing how we collect information about users' perception of a product, treat patients, discover diseases, and more. Many datasets have been used by researchers to measure performance, so for this post we are using the IMDB dataset, which contains movie reviews from users.

There are packages in Julia that provide pre-processed datasets. For this article, we will use the dataset provided by CorpusLoaders. Loading the dataset takes a single command:

using CorpusLoaders
dataset_train_pos = load(IMDB("train_pos"))

Variations of this command take one of the parameters "train_pos", "train_neg", "test_pos", "test_neg"; each returns the corresponding already-labeled part of the dataset.

dataset_test_pos = load(IMDB("test_pos"))
dataset_train_neg = load(IMDB("train_neg"))
dataset_test_neg = load(IMDB("test_neg"))

Let's transform this into a single array of tokens:

julia> using Base.Iterators

julia> docs = collect(take(dataset_train_pos, 2))

This will transform our dataset into an array of arrays (Array{Array{String,1}}); in summary, we get a list of tokenized sentences. But among the tokens we can still find stopwords. In the next step, let's remove them.
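
The same pattern collects the other splits. A minimal sketch for the negative training reviews (the variable name docs_neg is ours; the full IMDB training split has 12,500 reviews per class, so in practice you would take far more than two):

docs_neg = collect(take(dataset_train_neg, 2))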

Stopwords

Stopwords are words that add no value to a sentence; some of them are "is", "like", "as", and many others.

Julia has a package that comes populated with stopwords; this package is called Languages.

using Languages
list_stopwords = stopwords(Languages.English())

490-element Array{String,1}: "a" "about" "above" "across" "after" "again" "against" "all" "almost" "alone" "along" "already" "also" ⋮ "young" "younger" "youngest" "your" "you're" "yours" "yourself" "yourselves" "you've" "z" ""

This way, we can create a function to remove all the stopwords from our arrays:

function removeStopWords(tokens)
    filtered_sentence = []
    for token in tokens
        if !(lowercase(token) in list_stopwords)
            push!(filtered_sentence, lowercase(token))
        end
    end
    return filtered_sentence
end
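
For example, applied to the first document (tokens come back lowercased):

filtered = removeStopWords(docs[1])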

Removing punctuation

Since our sentences are tokenized, we will assume that every punctuation mark occupies its own cell in the array. This way, we can use the following function, which converts each token to a plain String, strips diacritics, and removes every non-alphanumeric character:

using Unicode

function convert_clean_arr(arr)
    arr = string.(arr)                             # ensure plain Strings
    arr = Unicode.normalize.(arr, stripmark=true)  # strip diacritics
    arr = map(x -> replace(x, r"[^a-zA-Z0-9_]" => ""), arr)  # drop punctuation
    return arr
end
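
Putting the two steps together, a hypothetical end-to-end cleaning pass (the helper name clean_doc is ours; the extra filter drops tokens that the regex reduced to empty strings, such as pure punctuation):

clean_doc(d) = filter(!isempty, convert_clean_arr(removeStopWords(d)))
docs_train_pos = clean_doc.(docs)
docs_train_neg = clean_doc.(docs_neg)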

Now it's time to create our vocab. The vocab will be used to transform the strings into numbers, which is what the model needs in order to fit its weights and activate its neurons.

First, let's put all the positives and negatives together:

train_set = [docs_train_pos; docs_train_neg]
all_letters = collect(Iterators.flatten(train_set));

Now we can build a counter over all_letters and map each unique token to an index of the vocab:

using StatsBase  # counter_letter was not defined in the original post; countmap is one way to build it

counter_letter = countmap(all_letters)
vocab = Dict()
index = 1
for (item, v) in counter_letter
    vocab[lowercase(item)] = index
    index = index + 1
end

Now let's transform the words into indices. Words that are not in the vocab map to 0 (the default value passed to get), which also matches the zero padding used later:

reviews_index_vocab = []
for review in train_set
    r = [get(vocab, lowercase(w), 0) for w in review]
    push!(reviews_index_vocab, r)
end

Pad sequences

In order to fit our data to our model, let us pad the sentences. This way, we normalize the array sizes.

function pad_features(reviews_int, length_max)
    features = []
    for review_int in reviews_int
        dim_review = size(review_int)[1]
        pad_size = length_max - dim_review
        if pad_size > 0
            pad_array = zeros(Int64, pad_size)
            result = append!(pad_array, review_int)
        else
            result = review_int[1:length_max]
        end
        push!(features, result)
    end
    return features
end

This function returns arrays of exactly length_max elements; any sentence longer than length_max gets truncated.
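
A sketch of applying it, assuming a maximum length of 300 so that each padded review matches the input size of the first LSTM layer defined below (test_reviews_index_vocab would be the analogous index array built from the test split):

new_features = pad_features(reviews_index_vocab, 300)
new_test_features = pad_features(test_reviews_index_vocab, 300)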

Create the LSTM model

Flux provides us with the Chain structure, which simplifies stacking multiple layers in our deep learning model. Let's build two LSTM layers with a softmax output layer:

using Flux

model = Chain(
    LSTM(300, 128),
    LSTM(128, 10),
    Dense(10, 2),
    softmax)
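
As a quick smoke test of the untrained network, you can push one padded review through it; converting to Float32 keeps Flux's default parameter type happy:

x = Float32.(new_features[1])
model(x)   # 2-element vector of class probabilities from the softmax layer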

Let's create a loss function. Fortunately, Flux provides several out of the box, so for this example we are using Flux.mse, with ADAM as our optimizer.

Loss(x, y) = Flux.mse(model(x), y)

opt = ADAM(0.001)
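
The loss compares the 2-element softmax output against a 2-element target, so the training labels must be one-hot encoded, while the accuracy check below uses plain 0/1 integers. A hypothetical construction, relying on train_set stacking all positives before all negatives (test_label would be built the same way for the test split, but left as raw 0/1 values):

using Flux: onehot

raw_labels = [ones(Int, length(docs_train_pos)); zeros(Int, length(docs_train_neg))]  # 1 = positive
train_label = [onehot(l, 0:1) for l in raw_labels]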

Now we need a way to measure whether our model is improving, so let's create a function that maps the model's softmax output for test example i back to a 0/1 label:

prediction(i) = findmax(model(new_test_features[i]))[2]-1

Before we train the model, we need to put the dataset into the format accepted by Flux.train!: tuples of (data, label).

train_set_full = [ (new_features[i], train_label[i])  for i = 1:size(new_features)[1]];
test_set_full = [ (new_test_features[i], test_label[i]) for i = 1:size(new_test_features)[1]];

Now we need to train our model on this data. For this we will use the Flux.train! function, a powerful tool from Flux that iterates over your dataset and updates the model parameters against the loss function. Here is an example of how to use it while saving the best model for later:

@info("Beginning training loop...")
best_acc = 0.0
last_improvement = 0
for epoch_idx in 1:200

global best_acc, last_improvement
Flux.train!(Loss, params(model), train_set_full, opt)

# Calculate accuracy:
acc = sum(prediction(i) == test_label[i] for i in 1:length(test_label))/length(test_label)
@info(@sprintf("[%d]: Test accuracy: %.4f", epoch_idx, acc))

# If our accuracy is good enough, quit out.
if acc >= 0.999
@info(" -> Early-exiting: We reached our target accuracy of 99.9%")
break
end

# If this is the best accuracy we've seen so far, save the model out
if acc >= best_acc
@info(" -> New best accuracy! Saving model out to mnist_conv.bson")
BSON.@save "mnist_conv.bson" model epoch_idx acc
best_acc = acc
last_improvement = epoch_idx
end

# If we haven't seen improvement in 5 epochs, drop our learning rate:
if epoch_idx - last_improvement >= 5 && opt.eta > 1e-6
opt.eta /= 10.0
@warn(" -> Haven't improved in a while, dropping learning rate to $(opt.eta)!")

# After dropping learning rate, give it a few epochs to improve
last_improvement = epoch_idx
end

if epoch_idx - last_improvement >= 10
@warn(" -> We're calling this converged.")
break
end
end
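
Because the best model is checkpointed with BSON during training, it can be restored in a later session:

using BSON: @load
@load "mnist_conv.bson" model epoch_idx acc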

After running the cell above, we see messages like these:

┌ Info: [30]: Test accuracy: 0.5057
└ @ Main In[91]:13
┌ Info: [31]: Test accuracy: 0.5057
└ @ Main In[91]:13
┌ Info: [32]: Test accuracy: 0.5057
└ @ Main In[91]:13
┌ Info: [33]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Info: [34]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Info: [35]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Info: [36]: Test accuracy: 0.5056
└ @ Main In[91]:13
┌ Warning: -> We're calling this converged.
└ @ Main In[91]:39

Conclusion

This post focused on creating a basic sentiment model, step by step. The best result I got so far was only 0.51 accuracy, but there is plenty of room for improvement: better pre-processing, changing the model, using a bi-directional LSTM, or even using more data from the sentiment dataset in the first place.

I hope you have enjoyed it; let me know what you think and share your ideas.

Translated from: https://medium.com/@EmoryRaphael/basic-sentiment-analysis-with-julia-using-lstm-e12d4754ee6b
