Spot vs. Sentiment: NLP & Sentiment Scoring in the Spot Copper Market

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

Introduction

After iron and aluminium, Copper is one of the most consumed metals in the world. An extremely versatile metal, Copper's electrical and thermal conductivity, antimicrobial, and corrosion-resistant properties lend it to widespread application in most sectors of the economy. From power infrastructure, homes, and factories to electronics and medical equipment, the global economic dependency on Copper is so profound that it is sometimes referred to as 'Dr. Copper', and is often cited as such by market and commodity analysts because the metal's price is seen as a barometer of global economic health and activity.

From a trading perspective, Copper pricing is determined by the supply and demand dynamics on the metal exchanges, particularly the London Metal Exchange (LME) and the Chicago Mercantile Exchange (CME) COMEX. The price Copper trades at, however, is affected by innumerable factors, many of which are very difficult to measure concurrently:

  • Global economic growth (GDP)
  • Emerging market economies
  • China's economy (China accounts for half of the global Copper demand)
  • Political and environmental instability in Copper ore producing countries
  • The U.S. housing market
  • Trade sanctions & tariffs
  • Many, many others.

As well as the aforementioned fundamental factors, Copper’s price can also be artificially influenced by hedge funds, investment institutions, bonded metal, and even domestic trading. From a systematic trading point of view, this makes for a very challenging situation when we want to develop a predictive model.

Short-term opportunities can exist, however, in relation to events that are announced in the form of news. The spot and forward prices of Copper have been buffeted throughout the US-China trade war and, like all markets, respond almost instantly to major news announcements.

Caught early enough, NLP-based systematic trading models can capitalize on these short-term price movements by parsing the announcements as a vector of tokens, evaluating the underlying sentiment, and subsequently taking a position prior to the anticipated (if applicable) price move, or during the movement, in the hope of capitalizing on a potential correction.

Problem

In this article, we are going to scrape historical (and current) tweets from a variety of financial news publications' Twitter feeds. We will then analyse this data in order to understand the underlying sentiment behind each tweet, develop a sentiment score, and examine the correlation between this score and Copper's spot price over the last five years.

We will cover:

  1. How to obtain historic tweets with GetOldTweets3.

  2. Basic Exploratory Data Analysis (EDA) techniques with our Twitter data.

  3. Text data preprocessing techniques (stopwords, tokenization, n-grams, stemming & lemmatization, etc.).

  4. Latent Dirichlet Allocation (LDA) to model & explore the distribution of topics and their content within our Twitter data, using Gensim, NLTK, and pyLDAvis.

  5. Sentiment scoring with NLTK's Valence Aware Dictionary and sEntiment Reasoner (VADER).

We will not go as far as developing and testing a fully-fledged trading strategy off the back of this work; the specifics of doing so are beyond the scope of this article. Rather, this article is intended to demonstrate the various techniques a Data Scientist can employ to extract potentially profitable signals from text data.

Spot Copper NLP Strategy Model

Let’s kick off by acquiring our data.

Spot Price Data

We will start by acquiring our spot Copper price data. The reason behind our choice to use Copper's spot price, rather than a Copper forward contract (an agreement to buy or sell a fixed amount of metal for delivery on an agreed fixed future date at a price agreed today), is that the spot price is the most reactive to market events: it is an offer to complete a commodity transaction immediately. Normally, we would use a Bloomberg terminal to acquire this data; however, we can get historical spot Copper data for free from Business Insider:

# Imports
import glob
import GetOldTweets3 as got
import gensim as gs
import os
import keras
import matplotlib.pyplot as plt
import numpy as np
import nltk
import pandas as pd
import pyLDAvis.gensim
import re
import seaborn as sns

from keras.preprocessing.text import Tokenizer
from nltk.stem import *
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Get Cu spot prices (read into cu_df, which the rest of the code uses)
cu_df = pd.read_csv(
    '/content/Copper_120115_073120',
    parse_dates=True,
    index_col='Date'
)

# Lower-case the column names
cu_df.columns = cu_df.columns.str.lower()

# Plot the close price
cu_df['close'].plot(figsize=(16, 4))
plt.ylabel('Spot, $/Oz')
plt.title('Cu Spot Close Price, $/Oz')
plt.legend()
plt.grid()

[Figure: Spot Copper, Close Price, 2015-01-01 to 2020-07-01, $/Mt]

Whilst our price data looks fine, it is important to note that we are considering daily price data. Consequently, we are limiting ourselves to a timeframe that could see us losing information: any market reaction to a news event is likely to take place within minutes, if not seconds, of its announcement. Ideally, we would be using 1-5-minute bars, but for the purposes of this article, daily data will do.

Tweet Data

We will extract our historical tweet data using a library called GetOldTweets3 (GOT). Unlike the official Twitter API, GOT3 enables users to access an extensive history of Twitter data. Given a list of Twitter handles belonging to financial news outlets and a few relevant keywords, we can define the search parameters for which we want to get data (note: I have posted a screenshot, rather than a code snippet, of the requisite logic to perform this action below for formatting reasons):

[Figure: Get historical Twitter data for specified handles w.r.t. search parameters]
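
The screenshot above wraps GetOldTweets3 in a small get_tweets helper. A minimal sketch of what that logic might look like is given below; the parameter names mirror the call further down, but the exact implementation behind the screenshot may differ:

# Sketch of the get_tweets helper referenced above (parameter names are assumptions)
def get_tweets(handles, search_term, start_date, end_date, top_only=False, max_tweets=0):
    """Collect historical tweets for a list of handles matching a search term."""
    all_tweets = []
    for handle in handles:
        criteria = (got.manager.TweetCriteria()
                    .setUsername(handle)
                    .setQuerySearch(search_term)
                    .setSince(start_date)
                    .setUntil(end_date)
                    .setTopTweets(top_only)
                    .setMaxTweets(max_tweets))  # 0 = no cap on returned tweets
        for tweet in got.manager.TweetManager.getTweets(criteria):
            all_tweets.append({'date': tweet.date, 'source': tweet.username, 'text': tweet.text})
    return pd.DataFrame(all_tweets)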

The method .setQuerySearch() accepts a single search query, so we are unable to extract tweets for multiple search criteria. We can easily solve this limitation using a loop. For example, one could simply assign variable names to each execution of a unique query, e.g. 'spot copper', 'copper prices', etc., but for the purposes of this article we can settle for a single query:

# Define handles
commodity_sources = ['reuters', 'wsj', 'financialtimes', 'bloomberg']

# Query
search_terms = 'spot copper'

# Get twitter data
tweets_df = get_tweets(
    commodity_sources,
    search_term=search_terms,
    top_only=False,
    start_date='2015-01-01',
    end_date='2020-01-01'
).sort_values('date', ascending=False).set_index('date')

tweets_df.head(10)

[Figure: Historical Twitter data]

So far so good.

We now need to process this text data in order to make it interpretable for our topic and sentiment models.

Preprocessing & Exploratory Data Analysis

Preprocessing of text data for Natural Language applications requires careful consideration. Composing a numerical vector from text data can be challenging from a loss point of view, with valuable information and subject context easily lost when performing seemingly basic tasks such as removing stopwords, as we shall see next.

Firstly, let's remove redundant information in the form of tags and URLs:

[Figure: Tweets from media outlets typically contain handle tags, hashtags, and links to articles, all of which need removing.]

We define a couple of one-line lambda functions that use regex to remove the letters and characters matching the expressions we want to strip out:

#@title Strip chars & urls
remove_handles = lambda x: re.sub(r'@[^\s]+', '', x)
remove_urls = lambda x: re.sub(r'http[^\s]+', '', x)
remove_hashtags = lambda x: re.sub(r'#[^\s]*', '', x)

tweets_df['text'] = tweets_df['text'].apply(remove_handles)
tweets_df['text'] = tweets_df['text'].apply(remove_urls)
tweets_df['text'] = tweets_df['text'].apply(remove_hashtags)

Next, we perform some fundamental analysis of our twitter data by examining tweet composition, such as individual tweet lengths (words per tweet), number of characters, etc.
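
A minimal sketch of this kind of check is shown below (the 'text' column name assumes the tweets_df built above):

# Words-per-tweet and characters-per-tweet distributions
tweets_df['word_count'] = tweets_df['text'].str.split().apply(len)
tweets_df['char_count'] = tweets_df['text'].str.len()

fig, axes = plt.subplots(1, 2, figsize=(16, 4))
tweets_df['word_count'].hist(bins=30, ax=axes[0])
axes[0].set_title('Words per tweet')
tweets_df['char_count'].hist(bins=30, ax=axes[1])
axes[1].set_title('Characters per tweet')
print(tweets_df[['word_count', 'char_count']].describe())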

[Figure: Basic text EDA: word & character frequency distributions]

Stop Words

It is immediately apparent that the mean length of each tweet is relatively short (10.3 words, to be precise). This suggests that filtering stopwords, although it would reduce computational complexity and memory overhead, might not be a good idea once we consider the potential information loss.

Initially, this experiment was trialled with all stop words removed from tweets, using NLTK's very handy standard list of stop words:

# Standard NLTK stop words
stop_words_nltk = set(stopwords.words('english'))

# Custom stop words: the most frequent unigrams in the corpus
# (get_top_ngram is a small helper returning (word, count) tuples,
# analogous to the get_ngrams function defined further below)
stop_words = get_top_ngram(tweets_df['text'], 1)
stop_words_split = [
    w[0] for w in stop_words
    if w[0] not in [
        'price', 'prices',
        'china', 'copper',
        'spot', 'other_stop_words_etc'
    ]  # Keep stop words with hypothesised importance
]

stop_words_all = list(stop_words_nltk) + stop_words_split

However, this action led to a lot of miscategorised tweets (from a sentiment score point of view), which supports the notion of information loss and is therefore best avoided.

At this point, it is well worth highlighting NLTK's excellent support when it comes to dealing with Twitter data. It offers a comprehensive suite of tools and functions to help parse social media output, including emoticon interpretation. One can find a really helpful guide to getting started with NLTK on Twitter data here.

N-grams

The next step is to consider word order. When we vectorize a sequence of tokens into a bag-of-words (BOW; more on this below), we lose both the context and meaning inherent in the order of those words within a tweet. We can attempt to understand the importance of word order within our tweets DataFrame by examining the most frequent n-grams.

As observed in our initial analysis above, the average length of a given tweet is only 10 words. In light of this information, the order of words within a tweet and, specifically, ensuring we retain the context and meaning inherent within this ordering, is critical to generating an accurate sentiment score. We can extend the concept of a token to include multiword tokens i.e. n-grams in order to retain the meaning within the ordering of words.

NLTK has a very handy (and very efficient) n-gram tokenizer: from nltk.util import ngrams. The ngrams function returns a generator that yields the n-grams as tuples. We, however, are interested in exploring what these n-grams actually are in the first instance, so will make use of Scikit-learn's CountVectorizer to parse our tweet data:

def get_ngrams(doc, n=None):
    """
    Get a ranked list of n-gram counts for a given collection of text documents.
    Args:
        doc: An iterable of strings, the text documents to be vectorized into their constituent tokens.
        n: Int, the number of contiguous words (n-grams) to count.
    Returns:
        word_counts: A list of (n-gram, frequency) tuples, sorted by frequency.
    """
    # Instantiate CountVectorizer class
    vectorizer = CountVectorizer(ngram_range=(n, n)).fit(doc)
    bag_of_words = vectorizer.transform(doc)
    sum_of_words = bag_of_words.sum(axis=0)
    # Get word frequencies
    word_counts = [
        (word, sum_of_words[0, index])
        for word, index in vectorizer.vocabulary_.items()
    ]
    word_counts = sorted(word_counts, key=lambda x: x[1], reverse=True)
    return word_counts

# Get n-grams
top_bigrams = get_ngrams(tweets_df['text'], 2)[:20]
top_trigrams = get_ngrams(tweets_df['text'], 3)[:20]

[Figure: Contiguous word frequencies (n-grams)]

Upon examination of our n-gram plots, we can see that, apart from a few exceptions, an NLP-based predictive model would learn significantly more from our n-gram features. For example, the model will be able to correctly interpret 'copper price' as a reference to the physical price of copper, or 'china trade' as a reference to China's trade, rather than interpreting each individual word's meaning.

Tokenization & Lemmatization

Our next step is to tokenize our tweets for use in our LDA topic model. We will develop a function that will perform the necessary segmentation of our tweets (the Tokenizer’s job) and Lemmatization.

We will use NLTK’s TweetTokenizer to perform the tokenization of our tweets, which has been developed specifically to parse tweets and understand their semantics relative to this social media platform.

Given the relatively brief nature of each tweet, dimensionality reduction is not so much of a pressing issue for our model. With this in mind, it is reasonable not to perform any Stemming operations on our data in an attempt to eliminate the small meaning differences in the plural vs possessive forms of words.

We shall instead implement a Lemmatizer, WordNetLemmatizer, to normalise the words within our tweet data. Lemmatisation is arguably more accurate than stemming for our application as it takes into account a word’s meaning. WordNetLemmatizer can also help improve the accuracy of our topic model as it utilises part of speech (POS) tagging. The POS tag for a word indicates its role in the grammar of a sentence, such as drawing the distinction between a noun POS and an adjective POS, like “Copper” and “Copper’s price”.

Note: You must configure the POS tags manually within WordNetLemmatizer. Without a POS tag, it assumes everything you feed it is a noun.
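
A quick illustration of the effect of the POS tag (assuming the WordNet corpus has been downloaded via nltk.download):

nltk.download('wordnet')  # WordNet corpus required by the lemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('prices')           # 'price'  - nouns are the default POS
lemmatizer.lemmatize('rising')           # 'rising' - treated as a noun, left unchanged
lemmatizer.lemmatize('rising', pos='v')  # 'rise'   - verb POS tag supplied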

def preprocess_tweet(df: pd.DataFrame, stop_words=None):
    """
    Tokenize and lemmatize raw tweets in a given DataFrame.
    Args:
        df: A Pandas DataFrame of raw tweets indexed by a DateTime index.
        stop_words: Optional. A list of Strings containing stop words to be removed.
    Returns:
        processed_tweets: A list of lists of preprocessed tokens of type String.
    """
    stop_words = stop_words or []
    processed_tweets = []
    tokenizer = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()
    for text in df['text']:
        words = [w for w in tokenizer.tokenize(text) if w not in stop_words]
        words = [lemmatizer.lemmatize(w) for w in words if len(w) > 2]
        processed_tweets.append(words)
    return processed_tweets

# Tokenize & normalise tweets
tweets_preprocessed = preprocess_tweet(tweets_df, stop_words_all)

For the purposes of demonstrating the utility of the above function, we have also passed a list of stop words into the function.

Vectorisation & Continuous Bag-Of-Words

We now need to convert our tokenised tweets to vectors, using a document representation method known as a Bag Of Words (BOW). In order to perform this mapping, we will use Gensim’s Dictionary class:

tweets_dict = gs.corpora.Dictionary(tweets_preprocessed)

By passing the list of processed tweets as an argument, Gensim's Dictionary creates a unique integer id mapping for each unique, normalised word (similar to a hash map). We can view the word-to-id mapping via the .token2id attribute of our tweets_dict. We then count the number of occurrences of each distinct word, convert the word to its integer word id, and return the result as a sparse vector:

cbow_tweets = [tweets_dict.doc2bow(doc) for doc in tweets_preprocessed]

LDA Topic Modelling

Now for the fun part.

A precursor to developing our NLP-based trading strategy is to understand whether the data we have extracted contains topics/signals that are relevant to the price of Copper, and, more importantly, whether it contains information that we could potentially trade on.

This requires us to examine and evaluate the various topics and the words that are representative of these topics within our data. Garbage in, garbage out.

In order to explore the various topics (and the subjects of said topics) within our tweet corpus, we will use Gensim's Latent Dirichlet Allocation model. LDA is a generative probabilistic model applicable to collections of discrete data such as text. LDA functions as a hierarchical Bayesian model in which each item in a collection is modelled as a finite mixture over an underlying set of topics. Each topic is, in turn, modelled as an infinite mixture over an underlying set of topic probabilities (Blei, Ng & Jordan, 2003).

We pass our newly vectorized tweets, cbow_tweets, and the dictionary mapping each word to an id, tweets_dict, to Gensim's LDA model class:

# Instantiate model
model = gs.models.LdaMulticore(
    cbow_tweets,
    num_topics=4,
    id2word=tweets_dict,
    passes=10,
    workers=2
)

# Display topics
model.show_topics()

You can see that we are required to provide an estimate of the number of topics within our dataset, via the num_topics hyperparameter. There are, as far as I am aware, two methods of determining the optimal number of topics:

  1. Build multiple LDA models and compute their coherence scores with a Coherence Model (a minimal sketch follows this list).

  2. Domain expertise and intuition.
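
A minimal sketch of the first approach, reusing the cbow_tweets, tweets_dict, and tweets_preprocessed objects built above (the candidate topic range is an assumption):

# Sweep candidate topic counts and score each model's 'c_v' coherence
def sweep_num_topics(corpus, dictionary, texts, topic_range=range(2, 11)):
    scores = {}
    for k in topic_range:
        candidate = gs.models.LdaMulticore(
            corpus, num_topics=k, id2word=dictionary, passes=10, workers=2)
        cm = gs.models.CoherenceModel(
            model=candidate, texts=texts, dictionary=dictionary, coherence='c_v')
        scores[k] = cm.get_coherence()
    return scores

# e.g. coherence_by_k = sweep_num_topics(cbow_tweets, tweets_dict, tweets_preprocessed)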

From a trading point of view, this is where domain knowledge and market expertise can help. We would expect the topics within our Twitter data, bearing in mind they are the product of financial news publications, to focus primarily on the following subjects:

  • Copper price (naturally)
  • The U.S. / China trade war
  • The U.S. President Donald Trump
  • Major Copper miners
  • Macroeconomic announcements
  • Local producing country civil/political unrest

Aside from this, one should use their own judgment when determining this hyperparameter.

It is worth mentioning that a whole host of other hyperparameters exist. This flexibility makes Gensim's LDA model extremely powerful. For example, as a Bayesian model, if we had an a priori belief about a topic/word probability, our LDA model allows us to encode these priors for the Dirichlet distribution through the init_dir_prior method, or similarly through the eta hyperparameter.
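
As a hypothetical sketch, a sparser symmetric word prior could be encoded simply by passing an explicit eta value (the value of 0.01 here is purely illustrative):

# Hypothetical: re-fit the model with a sparser topic-word Dirichlet prior
model_sparse_prior = gs.models.LdaMulticore(
    cbow_tweets,
    num_topics=4,
    id2word=tweets_dict,
    eta=0.01,   # symmetric prior over the per-topic word distributions (illustrative value)
    passes=10,
    workers=2
)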

Getting back to our model, you will note that we have used the multicore variant of Gensim's LdaModel, which allows for a faster implementation (ops are parallelised across multicore machines):

[Figure: LDA model show_topics() output. Note the numbered topics containing the words and their associated weights, i.e. how much each word contributes to the topic.]

A cursory inspection of the topics within our model would suggest that we have both relevant data and that our LDA model has done a reasonable job of modelling said topics.

In order to understand the distribution of topics and their keywords, we will use pyLDAvis which launches an interactive widget making it ideal for use in Jupyter/Colab notebooks:

pyLDAvis.enable_notebook()
topic_vis = pyLDAvis.gensim.prepare(model, cbow_tweets, tweets_dict)
topic_vis

[Figure: LDA Model, Twitter News Data, Topic Distribution]

LDA Model Results

Upon inspection of the resulting topic plot, we can see that our LDA model has done a reasonable job of capturing the salient topics and their constituent words within our Twitter data.

What makes for a robust topic model?

A good topic model typically exhibits large, distinct topics (circles) with no overlap. The areas of said circles are proportional to the proportions of the topics across the 'N' total tokens in the corpus (namely, our Twitter data). The centres of each topic circle are set in two dimensions, PC1 and PC2, and the distance between them is set by the output of a dimensionality reduction model (Multidimensional Scaling, to be precise) that is run on the inter-topic distance matrix. A full explanation of the mathematical detail behind the pyLDAvis topic visual can be found here.

Interpreting our results

While not losing sight of the problem we are trying to solve, specifically, understanding whether there are any useful signals in our tweet data that might affect Copper's spot price, we must make a qualitative assessment.

Examining the individual topics in detail, we can see a promising set of results, specifically the top words appearing within the individual topics, which adhere largely to our expected topic criteria above:

Topic Number:

  1. Copper Mining & Copper Exporting Countries

Top words include major Copper miners (BHP Billiton, Antofagasta, Anglo American & Rio Tinto), along with mentions of major Copper exporting countries i.e. Peru, Chile, Mongolia, etc.

2. China Trade & Manufacturing Activity

Top words include ‘Copper’, ‘Copper price’, ‘China’, ‘Freeport’ and ‘Shanghai’.

3. U.S. / China Trade War

Top words include ‘Copper’, ‘Price’, ‘China’, ‘Trump’, ‘Dollar’, and the ‘Fed’, but also some unusual terms like ‘Chile’ and ‘Video’.

On the strength of the results above, we make the decision to proceed with our NLP trading strategy, on the basis that our Twitter data exhibits enough information relevant to the spot price of Copper. More importantly, we can be confident of the relevancy of our Twitter data with respect to the price of Copper: the topics that our LDA model uncovered adhered to our view of the expected topics that should be present within the data.

Validate LDA Model

As Data Scientists, we know that we must validate the integrity and robustness of any model. Our LDA model is no different. We can do so by checking the Coherence (mentioned above) of our model. In layman's terms, Coherence measures the relative distance between words within a topic. The mathematical detail behind precisely how this score is calculated can be found in this paper; I have omitted the various expressions for brevity's sake.

Generally speaking, a score between .55 and .70 is indicative of a skillful topic model:

# Compute Coherence Score
coherence_model = gs.models.CoherenceModel(
    model=model,
    texts=tweets_preprocessed,
    dictionary=tweets_dict,
    coherence='c_v'
)

coherence_score = coherence_model.get_coherence()
print(f'Coherence Score: {coherence_score}')

[Figure: Calculating the Coherence Score of our LDA model, based on the 'c_v' confirmation measure (as opposed to UMass).]

At a Coherence Score of .639, we can be reasonably confident that our LDA model has been trained on an appropriate number of topics, and retains an adequate degree of semantic similarity between the high-scoring words in each.

Our choice of score measure, observable in the signature of the above Coherence Model logic, is motivated by the results in the paper by Röder, Both & Hinneburg. You can see that we have chosen to score our model against the coherence='c_v' measure, as opposed to 'u_mass', 'c_uci', or 'c_npmi'. The 'c_v' measure was found to return superior results to the other measures, particularly in cases of small word sets, qualifying our choice.

Sentiment Score: VADER

Satisfied that our Twitter data contains enough relevant information to potentially be predictive of short-term Copper price movements, we move on to the sentiment analysis part of our problem.

We will use NLTK’s Valence Aware Dictionary and sEntiment Reasoner (VADER) to analyse our tweets, and, based on the sum of the underlying intensity of each word within each tweet, generate a sentiment score between -1 and 1.

Irrespective of whether we employ single-tokens, ngrams, stems, or lemmas in our NLP model, fundamentally, each token in our tweet data contains some information. Possibly the most important part of this information is the word’s sentiment.

VADER is a popular heuristic, rule-based (composed by humans) sentiment analysis model by Hutto and Gilbert. It is particularly accurate (and was designed specifically for this application) for use on social media text. It seems rational, therefore, to use it for our project.

VADER’s implementation is very straightforward:

# Instantiate SIA class
analyser = SentimentIntensityAnalyzer()

sentiment_score = []
for tweet in tweets_df['text']:
    sentiment_score.append(analyser.polarity_scores(tweet))

The SentimentIntensityAnalyzer contains a dictionary of tokens and their individual scores. We then generate a sentiment score for each tweet in our tweets DataFrame and access the result, a dictionary object containing the four separate score components generated by the VADER model:

  • The negative proportion of the text
  • The positive proportion of the text
  • The neutral proportion of the text
  • The combined intensity of sentiment polarity in the above, the 'Compound' score

#@title Extract Sentiment Score Elements
sentiment_prop_negative = []
sentiment_prop_positive = []
sentiment_prop_neutral = []
sentiment_score_compound = []

# Note: the 'pos' and 'neu' keys map to the positive and neutral proportions respectively
for item in sentiment_score:
    sentiment_prop_negative.append(item['neg'])
    sentiment_prop_positive.append(item['pos'])
    sentiment_prop_neutral.append(item['neu'])
    sentiment_score_compound.append(item['compound'])

# Append to tweets DataFrame
tweets_df['sentiment_prop_negative'] = sentiment_prop_negative
tweets_df['sentiment_prop_positive'] = sentiment_prop_positive
tweets_df['sentiment_prop_neutral'] = sentiment_prop_neutral
tweets_df['sentiment_score_compound'] = sentiment_score_compound

[Figure: Tweet data sentiment scores: negative, positive, and compound, daily]

After plotting the rolling scores for the various components (negative, positive, and compound; we leave neutral out), we can make a few observations:

  • Clearly, the sentiment score is very noisy/volatile: our Twitter data may simply contain largely redundant information, with a few tweets causing large spikes in scores. This is, however, the nature of signal finding: we only need that one piece of salient information.
  • Our Twitter data appears to be predominantly positive: the mean negative score is .09, whilst the mean positive score is .83.

Sentiment Score vs. Copper Spot Price

Now we must evaluate whether our hard work has paid off: is our sentiment score predictive of Copper's spot price?

At first glance, there does not appear to be any association between the spot price and our compound score:

[Figure: Compound sentiment score vs. spot Copper ($/mt), daily]

However, when we apply a classic smoothing technique and calculate the rolling average of our sentiment score, we see a different picture emerge:

[Figure: Rolling 21-day mean compound sentiment score vs. spot Copper ($/mt), daily]

This now looks a lot more promising. With the exception of the time period between January and August 2017, we can readily observe a near-symmetric, inverse relationship between our 21-day rolling mean compound score and the spot price of Copper.
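
For reference, the smoothed comparison above can be reproduced along these lines (a sketch; it assumes tweets_df carries a DatetimeIndex and the column names created earlier):

# Daily-mean compound score, smoothed with a 21-day rolling window, against the spot close
daily_sentiment = tweets_df['sentiment_score_compound'].resample('D').mean()
rolling_sentiment = daily_sentiment.rolling(window=21).mean()

fig, ax1 = plt.subplots(figsize=(16, 4))
rolling_sentiment.plot(ax=ax1, color='tab:blue', label='21d rolling compound score')
ax1.set_ylabel('Compound sentiment score')
ax2 = ax1.twinx()
cu_df['close'].plot(ax=ax2, color='tab:orange', label='Cu spot close')
ax2.set_ylabel('Spot close, $/mt')
ax1.grid()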

Conclusion

At this juncture, we pause to consider the options available to us on how we want our model to process and classify the underlying sentiment within a piece of text data, and, critically, how the model will act on this classification in terms of its trading decisions.

Consistent with the Occam's Razor principle, we implemented an out-of-the-box solution to analyse the underlying sentiment within our Twitter data. As well as exploring some well-established EDA and preprocessing techniques as a prerequisite, we used NLTK's Valence Aware Dictionary and sEntiment Reasoner (VADER) to generate an associated sentiment score for each tweet, and examined the correlation of said score against corresponding Copper spot price movements.

Interestingly, a correlation was observed between our rolling compound sentiment score and the price of Copper. This does not, of course, imply causation. Moreover, it may simply be that the news data trails the price of Copper, and our tweet data is simply reporting on its movements. Nonetheless, there is scope for further work.

Observations, Criticisms & Further Analysis

In reality, the design of a systematic trading strategy necessitates a great deal more mathematical and analytical rigour, as well as a good dose of domain expertise. One would typically invest a great deal of time designing a suitable label that best encompasses the signal and the magnitude of the price movement (if any!) found within said signal, not to mention a thorough investigation of the signal itself.

Scope for very interesting further work and analysis exists in abundance with respect to our problem:

  1. Neural Network Embeddings: As an example, in order to intimately understand how an NLP model, with an associated label (or labels), makes a trading decision we would look to train a Neural Network with an Embedding Layer. We could then examine the trained embedding layer to understand how the model treats the various tokens within the layer against those with a similar encoding, and the label(s). We could then visualise how the model groups words with respect to their effect on the class we wish to predict i.e. 0 for negative price movement, 1 for positive price movement. TensorFlow’s Embedding Projector, for example, is an invaluable tool for visualising such embeddings:

[Figure: TensorFlow Embedding Projector]
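
As a rough illustration, a minimal sketch of such a model is given below; the vocabulary size, sequence length, and the binary next-day price-movement label y are all assumptions for illustration:

# Sketch: tiny classifier with a trainable Embedding layer over tweet tokens
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=5000, oov_token='<unk>')
tokenizer.fit_on_texts(tweets_df['text'])
X = pad_sequences(tokenizer.texts_to_sequences(tweets_df['text']), maxlen=20, padding='post')

embed_model = models.Sequential([
    layers.Embedding(input_dim=5000, output_dim=32, input_length=20),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # 1 = positive next-day move, 0 = negative (hypothetical label)
])
embed_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# embed_model.fit(X, y, epochs=5, validation_split=0.2)   # y: hypothetical price-move labels
# embed_model.layers[0].get_weights()[0]  # the learned embedding matrix, to export and visualise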

2. Multinomial Naive Bayes

We used VADER to parse and interpret the underlying sentiment of our Twitter data, which it did a reasonable job of doing. The drawback of using VADER, however, is that it doesn’t consider all the words in a document, only about 7,500 in fact. Given the complexity of commodity trading and its associated terminology, we might be missing crucial information.

As an alternative, we could employ a Naive Bayes classifier to find sets of keywords that are predictive of our target, be it the price of Copper itself or a sentiment score.

3. Intra-day Data

Intra-day data is, in nearly all cases, a must when designing an NLP trading strategy model, for the reasons mentioned in the introduction. Time and trade execution are very much of the essence when attempting to capitalise on news/event-based price movements.

Thank you for taking the time to read my article, I hope you found it interesting.

Please do feel free to reach out — I very much welcome comments & constructive critiques.

Source: https://towardsdatascience.com/spot-vs-sentiment-nlp-sentiment-scoring-in-the-spot-copper-market-492456b031b0
