Quantitative Investing: The Machine Learning Workflow

The Machine Learning Workflow

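This post introduces the machine learning workflow.
It discusses how to use a range of supervised and unsupervised machine learning models for trading, outlines their use cases, and shows how to apply them with Python libraries.
These models include linear models, generalized additive models, ensemble models, unsupervised models for dimensionality reduction and clustering, neural network models, and reinforcement learning models.
It also covers how to embed these models in a trading strategy, optimize the portfolio, and evaluate strategy performance. It further discusses the difference between supervised and unsupervised learning and the use cases for algorithmic trading. Finally, it presents methods to diagnose sources of model error, such as overfitting, and to improve model performance.

The text draws on techniques such as supervised learning, unsupervised learning, linear models, nonlinear models, decision trees, random forests, gradient-boosting machines, neural networks, convolutional neural networks, recurrent neural networks, and reinforcement learning. It also introduces Python libraries such as scikit-learn and Yellowbrick for data preparation, feature extraction, model selection, and parameter tuning in the machine learning workflow.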

This chapter starts Part 2 of this book, where we illustrate how you can use a range of supervised and unsupervised machine learning (ML) models for trading. We will explain each model’s assumptions and use cases before we demonstrate relevant applications using various Python libraries. The categories of models that we will cover in Parts 2-4 include:

  • Linear models for the regression and classification of cross-section, time series, and panel data
  • Generalized additive models, including nonlinear tree-based models, such as decision trees
  • Ensemble models, including random forest and gradient-boosting machines
  • Unsupervised linear and nonlinear methods for dimensionality reduction and clustering
  • Neural network models, including recurrent and convolutional architectures
  • Reinforcement learning models

We will apply these models to the market, fundamental, and alternative data sources introduced in the first part of this book. We will build on the material covered so far by demonstrating how to embed these models in a trading strategy that translates model signals into trades, how to optimize the portfolio, and how to evaluate strategy performance.

There are several aspects that many of these models and their applications have in common. This chapter covers these common aspects so that we can focus on model-specific usage in the following chapters. They include the overarching goal of learning a functional relationship from data by optimizing an objective or loss function. They also include the closely related methods of measuring model performance.

We distinguish between unsupervised and supervised learning and outline use cases for algorithmic trading. We contrast supervised regression and classification problems, and we contrast the use of supervised learning for statistical inference of relationships between input and output data with its use for the prediction of future outputs. We also illustrate how prediction errors arise from the model’s bias or variance, or from a high noise-to-signal ratio in the data. Most importantly, we present methods to diagnose sources of error, such as overfitting, and to improve your model’s performance.

If you are already quite familiar with ML, feel free to skip ahead and dive right into learning how to use ML models to produce and combine alpha factors for an algorithmic trading strategy.

Content

  1. How machine learning from data works
       • The key challenge: Finding the right algorithm for the given task
       • Supervised Learning: teaching a task by example
       • Unsupervised learning: Exploring data to identify useful patterns
           • Use cases for trading strategies: From risk management to text processing
       • Reinforcement learning: Learning by doing, one step at a time
  2. The Machine Learning Workflow
       • Code Example: ML workflow with K-nearest neighbors
  3. Frame the problem: goals & metrics
  4. Collect & prepare the data
  5. How to explore, extract and engineer features
       • Code Example: Mutual Information
  6. Select an ML algorithm
  7. Design and tune the model
       • Code Example: Bias-Variance Trade-Off
  8. How to use cross-validation for model selection
       • Code Example: How to implement cross-validation in Python
  9. Parameter tuning with scikit-learn
       • Code Example: Learning and Validation curves with Yellowbrick
       • Code Example: Parameter tuning using GridSearchCV and pipeline
  10. Challenges with cross-validation in finance
       • Purging, embargoing, and combinatorial CV

How machine learning from data works

Many definitions of ML revolve around the automated detection of meaningful patterns in data. Two prominent examples include:

  • AI pioneer Arthur Samuel defined ML in 1959 as a subfield of computer science that gives computers the ability to learn without being explicitly programmed.
  • Tom Mitchell, one of the current leaders in the field, pinned down a well-posed learning problem more specifically in 1997: a computer program learns from experience with respect to a task and a performance measure if its performance on the task, as measured by that metric, improves with experience (Mitchell, 1997).

Experience is presented to an algorithm in the form of training data. The principal difference from previous attempts at building machines that solve problems is that the rules that an algorithm uses to make decisions are learned from the data, as opposed to being programmed by humans as was the case, for example, for the expert systems prominent in the 1980s.

Recommended textbooks that cover a wide range of algorithms and general applications include:

  • An Introduction to Statistical Learning, James et al. (2013)
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Hastie, Tibshirani, and Friedman (2009)
  • Pattern Recognition and Machine Learning, Bishop (2006)
  • Machine Learning, Mitchell (1997)

The key challenge: Finding the right algorithm for the given task

The key challenge of automated learning is to identify patterns in the training data that are meaningful when generalizing the model’s learning to new data. There are a large number of potential patterns that a model could identify, while the training data only constitutes a sample of the larger set of phenomena that the algorithm may encounter when performing the task in the future.

Supervised Learning: teaching a task by example

Supervised learning is the most commonly used type of ML. We will dedicate most of the chapters in this book to applications in this category. The term supervised implies the presence of an outcome variable that guides the learning process—that is, it teaches the algorithm the correct solution to the task at hand. Supervised learning aims to capture a functional input-output relationship from individual samples that reflect this relationship and to apply its learning by making valid statements about new data.

Unsupervised learning: Exploring data to identify useful patterns

When solving an unsupervised learning problem, we only observe the features and have no measurements of the outcome. Instead of predicting future outcomes or inferring relationships among variables, unsupervised algorithms aim to identify structure in the input that permits a new representation of the information contained in the data.

Use cases for trading strategies: From risk management to text processing

There are numerous trading use cases for unsupervised learning that we will cover in later chapters:

  • Grouping together securities with similar risk and return characteristics (see hierarchical risk parity in Chapter 13)
  • Finding a small number of risk factors driving the performance of a much larger number of securities using principal component analysis or autoencoders (Chapter 20)
  • Identifying latent topics in a body of documents (for example, earnings call transcripts) that comprise the most important aspects of those documents (Chapter 15)
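To make the second use case more concrete, here is a minimal sketch of how principal component analysis can recover a small number of drivers from many return series. The data are simulated (three hypothetical latent factors driving 50 securities), and all names and sizes are assumptions chosen for illustration, not the book's data or code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated returns (assumption): 50 securities driven by 3 latent factors plus noise
rng = np.random.default_rng(11)
n_days, n_assets, n_factors = 1000, 50, 3
factors = rng.normal(size=(n_days, n_factors))
loadings = rng.normal(size=(n_factors, n_assets))
returns = factors @ loadings + rng.normal(scale=0.3, size=(n_days, n_assets))

# Fit PCA and inspect how much variance each component explains
pca = PCA(n_components=10).fit(returns)
explained = pca.explained_variance_ratio_

for i, share in enumerate(explained[:5], start=1):
    print(f'component {i}: {share:.1%} of variance')
```

In this simulated setting the first three components should account for most of the variance, while the remaining ones mostly capture idiosyncratic noise. Real return data are rarely this clean.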

Reinforcement learning: Learning by doing, one step at a time

Reinforcement learning (RL) is the third type of ML. It centers on an agent that needs to pick an action at each time step based on information provided by the environment. The agent could be a self-driving car, a program playing a board game or a video game, or a trading strategy operating in a certain security market.

An excellent introduction can be found in Sutton and Barto (2018).

The Machine Learning Workflow

Developing an ML solution requires a systematic approach to maximize the chances of success while proceeding efficiently. It is also important to make the process transparent and replicable to facilitate collaboration, maintenance, and subsequent refinements.

The process is iterative throughout, and the effort at different stages will vary according to the project. Nevertheless, this process should generally include the following steps:

  1. Frame the problem, identify a target metric, and define success
  2. Source, clean, and validate the data
  3. Understand your data and generate informative features
  4. Pick one or more machine learning algorithms suitable for your data
  5. Train, test, and tune your models
  6. Use your model to solve the original problem

Code Example: ML workflow with K-nearest neighbors

The notebook machine_learning_workflow contains several examples that illustrate the machine learning workflow using a simple dataset of house prices.
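The following is a minimal sketch of how these workflow steps might look with K-nearest neighbors in scikit-learn. It does not reproduce the notebook: the synthetic features and target stand in for the house-price dataset, and the candidate values for the number of neighbors are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a house-price dataset (assumption, not the notebook's data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=500)

# Hold out a test set that is only used once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# KNN is distance-based, so features are standardized inside a pipeline
cv_mse = {}
for k in (1, 5, 25):
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring='neg_mean_squared_error')
    cv_mse[k] = -scores.mean()

# Refit the best configuration on all training data and evaluate on the test set
best_k = min(cv_mse, key=cv_mse.get)
final_model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_mse = np.mean((final_model.predict(X_test) - y_test) ** 2)
print(f'best k={best_k}, test MSE={test_mse:.3f}')
```

The target metric (mean squared error) is fixed before any model is trained, and the test set plays no role in choosing k, which mirrors the separation between the tuning and evaluation steps listed above.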

Frame the problem: goals & metrics

The starting point for any machine learning exercise is the ultimate use case it aims to address. Sometimes, this goal will be statistical inference in order to identify an association between variables or even a causal relationship. Most frequently, however, the goal will be the direct prediction of an outcome to yield a trading signal.

Collect & prepare the data

We addressed the sourcing of market and fundamental data in Chapter 2, and for alternative data in Chapter 3. We will continue to work with various examples of these sources as we illustrate the application of the various models in later chapters.

How to explore, extract and engineer features

Understanding the distribution of individual variables and the relationships among outcomes and features is the basis for picking a suitable algorithm. This typically starts with visualizations such as scatter plots, as illustrated in the companion notebook, but also includes numerical evaluations ranging from linear metrics, such as the correlation, to nonlinear statistics, such as the Spearman rank correlation coefficient that we encountered when we introduced the information coefficient. It also includes information-theoretic measures, such as mutual information.

Code Example: Mutual Information

The notebook mutual_information applies information theory to the financial data we created in the notebook feature_engineering, in the chapter [Alpha Factors – Research and Evaluation](…/04_alpha_factor_research).
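As a rough illustration of why mutual information complements correlation, the sketch below uses scikit-learn's mutual_info_regression on simulated data. The three features (one linear, one nonlinear, one pure noise) are assumptions made up for this example and are unrelated to the notebook's financial data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Simulated features (assumption): linear signal, symmetric nonlinear signal, pure noise
rng = np.random.default_rng(0)
n = 2000
x_linear = rng.normal(size=n)
x_nonlinear = rng.uniform(-3, 3, size=n)
x_noise = rng.normal(size=n)
y = x_linear + 0.5 * x_nonlinear ** 2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([x_linear, x_nonlinear, x_noise])

# Mutual information (in nats) between each feature and the target
mi = mutual_info_regression(X, y, random_state=0)
# Pearson correlation for comparison
corr = [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]

for name, m, c in zip(['linear', 'nonlinear', 'noise'], mi, corr):
    print(f'{name:>9}: MI={m:.3f}  corr={c:+.3f}')
```

The quadratic feature has a correlation close to zero because its effect is symmetric around zero, yet its mutual information with the target is clearly positive. The noise feature should score near zero on both measures.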

Select an ML algorithm

The remainder of this book will introduce several model families, ranging from linear models, which make fairly strong assumptions about the nature of the functional relationship between input and output variables, to deep neural networks, which make very few assumptions.

Design and tune the model

The ML process includes steps to diagnose and manage model complexity based on estimates of the model’s generalization error. An unbiased estimate requires a statistically sound and efficient procedure, as well as error metrics that align with the output variable type, which also determines whether we are dealing with a regression, classification, or ranking problem.

Code Example: Bias-Variance Trade-Off

The errors that an ML model makes when predicting outcomes for new input data can be broken down into reducible and irreducible parts. The irreducible part is due to random variation (noise) in the data that is not measured, such as relevant but missing variables or natural variation.

The notebook bias_variance demonstrates overfitting by approximating a cosine function using increasingly complex polynomials and measuring the in-sample error. It draws 10 random samples of size n = 30 with some added noise to learn polynomials of varying complexity. Each time, the model predicts new data points and we capture the mean-squared error for these predictions, as well as the standard deviation of these errors. It goes on to illustrate the impact of overfitting versus underfitting by trying to learn a ninth-degree Taylor series approximation of the cosine function with some added noise. To this end, the notebook draws random samples of the true function and fits polynomials that underfit, overfit, and provide an approximately correct degree of flexibility.
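The experiment can be sketched along the following lines. This is a simplified illustration rather than the notebook's code: the specific cosine function, the noise level, and the polynomial degrees (1, 4, and 15) are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_fn(x):
    # Target function (assumption): one and a half half-periods of a cosine on [0, 1]
    return np.cos(1.5 * np.pi * x)

# Ten noisy training samples of size n = 30, reused for every degree
samples = []
for _ in range(10):
    x = np.sort(rng.uniform(0, 1, 30))
    samples.append((x, true_fn(x) + rng.normal(scale=0.1, size=30)))

x_new = np.linspace(0, 1, 200)  # new data points for out-of-sample evaluation
results = {}
for degree in (1, 4, 15):
    in_sample, out_of_sample = [], []
    for x, y in samples:
        poly = Polynomial.fit(x, y, degree)  # least-squares polynomial fit
        in_sample.append(np.mean((poly(x) - y) ** 2))
        out_of_sample.append(np.mean((poly(x_new) - true_fn(x_new)) ** 2))
    results[degree] = (np.mean(in_sample), np.mean(out_of_sample))

for degree, (train_mse, test_mse) in results.items():
    print(f'degree={degree:2d}  in-sample MSE={train_mse:.4f}  '
          f'out-of-sample MSE={test_mse:.4f}')
```

The in-sample error can only fall as the degree rises, because each polynomial nests the simpler ones. The out-of-sample error is what reveals underfitting at low degree and, typically, overfitting at high degree.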

How to use cross-validation for model selection

When several candidate models (that is, algorithms) are available for your use case, the act of choosing one of them is called the model selection problem. Model selection aims to identify the model that will produce the lowest prediction error given new data.

Code Example: How to implement cross-validation in Python

The script cross_validation illustrates various options for splitting data into training and test sets by showing how the indices of a mock dataset with ten observations are assigned to the train and test set.
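A minimal sketch of this idea with two scikit-learn splitters is shown below. The choice of KFold and TimeSeriesSplit and the number of splits are assumptions for illustration; the script itself may cover other options.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

data = np.arange(10)  # mock dataset with ten observations

# Standard K-fold: every observation appears in exactly one test fold
kfold_splits = [(train.tolist(), test.tolist())
                for train, test in KFold(n_splits=5).split(data)]

# Time-series split: training indices always precede the test indices
ts_splits = [(train.tolist(), test.tolist())
             for train, test in TimeSeriesSplit(n_splits=4).split(data)]

print('KFold')
for train, test in kfold_splits:
    print(f'  train={train} test={test}')

print('TimeSeriesSplit')
for train, test in ts_splits:
    print(f'  train={train} test={test}')
```

With K-fold, later observations can end up in the training set of an earlier test fold, which is one reason the time-series variant is often preferred for ordered data.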

Parameter tuning with scikit-learn

Model selection typically involves repeated cross-validation of the out-of-sample performance of models using different algorithms (such as linear regression and random forest) or different configurations. Different configurations may involve changes to hyperparameters or the inclusion or exclusion of different variables.

Code Example: Learning and Validation curves with Yellowbrick

The notebook machine_learning_workflow demonstrates the use of learning and validation curves and illustrates various model selection techniques.

  • Yellowbrick: Machine Learning Visualization docs
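The numbers behind such curves can be computed with scikit-learn alone, as in the sketch below. It deliberately uses scikit-learn's learning_curve and validation_curve functions rather than Yellowbrick's visualizers, and the synthetic data, the KNN model, and the parameter ranges are assumptions.

```python
import numpy as np
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumption)
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(400, 2))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=400)

# Validation curve: vary one hyperparameter, hold the training size fixed
k_values = [1, 5, 20, 80]
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(), X, y,
    param_name='n_neighbors', param_range=k_values,
    cv=5, scoring='neg_mean_squared_error')

# Learning curve: vary the training size, hold the hyperparameter fixed
sizes, lc_train, lc_val = learning_curve(
    KNeighborsRegressor(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring='neg_mean_squared_error')

for k, tr, va in zip(k_values, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f'k={k:2d}  train MSE={tr:.3f}  validation MSE={va:.3f}')
for n, tr, va in zip(sizes, -lc_train.mean(axis=1), -lc_val.mean(axis=1)):
    print(f'n={n:3d}  train MSE={tr:.3f}  validation MSE={va:.3f}')
```

With a single neighbor the training error is zero because every training point is its own nearest neighbor, while the validation error is not. A gap of this kind is the signature of overfitting that validation curves are meant to expose.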

Code Example: Parameter tuning using GridSearchCV and pipeline

Since hyperparameter tuning is a key ingredient of the machine learning workflow, there are tools to automate this process. The sklearn library includes a GridSearchCV interface that cross-validates all combinations of parameters in parallel, captures the result, and automatically trains the model on the full dataset using the parameter setting that performed best during cross-validation.

In practice, the training and validation sets often require some processing prior to cross-validation. Scikit-learn offers the Pipeline to also automate any requisite feature-processing steps in the automated hyperparameter tuning facilitated by GridSearchCV.
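A minimal sketch of combining the two is shown below. The synthetic data, the step names scaler and knn, and the parameter grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)

# The scaler is refit on the training folds of every split, so no
# information from the validation fold leaks into the preprocessing
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsRegressor())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'knn__n_neighbors': [1, 5, 15, 45],
              'knn__weights': ['uniform', 'distance']}

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X, y)

print(search.best_params_)
print(f'best CV MSE: {-search.best_score_:.3f}')
predictions = search.best_estimator_.predict(X[:5])
```

Because refit defaults to True, best_estimator_ is the whole pipeline retrained on all of the data with the best parameter combination, so it can be used for prediction directly.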

The included machine_learning_workflow.ipynb notebook contains implementation examples that show these tools in action.

Challenges with cross-validation in finance

A key assumption for the cross-validation methods discussed so far is that the samples available for training are independently and identically distributed (iid).
For financial data, this is often not the case. On the contrary, financial data is neither independently nor identically distributed because of serial correlation and time-varying standard deviation, also known as heteroskedasticity.

Purging, embargoing, and combinatorial CV

For financial data, labels are often derived from overlapping data points as returns are computed from prices in multiple periods. In the context of trading strategies, the results of a model’s prediction, which may imply taking a position in an asset, may only be known later, when this decision is evaluated—for example, when a position is closed out.

The resulting risks include the leaking of information from the test into the training set, likely leading to an artificially inflated performance that needs to be addressed by ensuring that all data is point-in-time—that is, truly available and known at the time it is used as the input for a model. Several methods have been proposed by Marcos Lopez de Prado in Advances in Financial Machine Learning to address these challenges of financial data for cross-validation:

  • Purging: Eliminate training data points where the evaluation occurs after the prediction of a point-in-time data point in the validation set to avoid look-ahead bias.
  • Embargoing: Further eliminate training samples that follow a test period.
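The two ideas above can be sketched for a single contiguous test block as follows. This is a simplified illustration under stated assumptions, not the implementation from Advances in Financial Machine Learning: the function purged_embargoed_split is hypothetical, samples are indexed by integer time steps, and the label of sample t is assumed to depend on data from t through t + horizon.

```python
import numpy as np

def purged_embargoed_split(n_samples, test_start, test_end, horizon, embargo):
    """Return (train_idx, test_idx) for the test block [test_start, test_end).

    Hypothetical helper. Assumes the label of sample t is computed
    from data in the window [t, t + horizon].
    """
    idx = np.arange(n_samples)
    test_idx = idx[test_start:test_end]

    # Last time step used by any test label
    test_span_end = test_end - 1 + horizon

    # Purge: drop samples whose label window overlaps the span of the test labels
    label_end = idx + horizon
    overlaps = (idx <= test_span_end) & (label_end >= test_start)

    # Embargo: drop a further `embargo` samples right after that span
    embargoed = (idx > test_span_end) & (idx <= test_span_end + embargo)

    train_idx = idx[~overlaps & ~embargoed]
    return train_idx, test_idx

train_idx, test_idx = purged_embargoed_split(
    n_samples=20, test_start=8, test_end=12, horizon=2, embargo=1)
print('train:', train_idx.tolist())
print('test: ', test_idx.tolist())
```

In this example, samples 6 and 7 are purged because their labels reach into the test period, samples 12 and 13 because they overlap the windows of the test labels, and sample 14 is embargoed.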