Stock Analysis Using LSTM
The idea of this post is to introduce sentiment analysis using Julia, a language designed for high performance with a syntax similar to Python's.
Sentiment analysis has grown within artificial intelligence in recent years, changing how we collect information about users' perception of a product, treat patients, discover diseases, and so on. Researchers have used many datasets to measure the performance of their models; for this post, we use the IMDB dataset, which contains reviews written by users.
There are packages in Julia that provide a pre-processed dataset. For this article, we will use the dataset provided by CorpusLoaders. To load the dataset, we just need a simple command:
using CorpusLoaders
dataset_train_pos = load(IMDB("train_pos"))
The command accepts any of the following parameters: "train_pos", "train_neg", "test_pos", "test_neg". Each one returns the corresponding part of the dataset, already labeled.
dataset_test_pos = load(IMDB("test_pos"))
dataset_train_neg = load(IMDB("train_neg"))
dataset_test_neg = load(IMDB("test_neg"))
Let’s transform this into a single array of tokens:
julia> using Base.Iterators
julia> docs = collect(take(dataset_train_pos, 2))
This will transform our dataset into an array of arrays (Array{Array{String,1}}); in summary, we get a list of tokenized sentences. Among the tokens, however, we can find stopwords. In the next step, let’s remove them.
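To make the "single array of tokens" idea concrete, here is a minimal sketch of flattening an array of tokenized sentences into one flat array. The `sample_docs` data is a hypothetical stand-in for `docs`, not the real IMDB content, and using `reduce(vcat, ...)` is just one possible way to do it.

```julia
# Hypothetical stand-in for `docs`: a list of tokenized sentences.
sample_docs = [
    ["This", "movie", "is", "great"],
    ["I", "did", "not", "like", "the", "ending"],
]

# Concatenate the inner arrays into one flat Vector{String}.
tokens = reduce(vcat, sample_docs)

println(tokens)
```

If the real data is nested one level deeper (documents containing sentences containing tokens), the same call can be applied once per document first.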
Stopwords
Stopwords are words that add little value to a sentence. Some examples are: is, like, as, and many others.
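As a rough illustration of what removing stopwords means, the sketch below filters a token array against a small hand-written set. The names `sample_stopwords`, `sample_tokens`, and `remove_stopwords` are hypothetical, and the stopword list is a tiny illustrative one, not the list any particular package ships.

```julia
# Hypothetical, hand-picked stopword list for illustration only.
sample_stopwords = Set(["is", "like", "as", "the", "a", "an", "and", "of"])

# Hypothetical tokenized sentence.
sample_tokens = ["The", "movie", "is", "as", "good", "as", "the", "book"]

# Keep only the tokens whose lowercase form is not in the stopword set.
remove_stopwords(tokens, stopwords) =
    filter(t -> !(lowercase(t) in stopwords), tokens)

filtered = remove_stopwords(sample_tokens, sample_stopwords)

println(filtered)
```

A `Set` is used so that each membership check is fast regardless of how long the stopword list grows.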