Why And How To Use
Ensemble Methods in
Financial Machine
Learning?
Study carried out by the Quantitative Practice
Special thanks to Pierre-Edouard THIERY
JANUARY 2021

Summary
Introduction
1. From A Single Model To Ensemble Methods: Bagging and Boosting
2. The Three Errors Of A Machine Learning Model
3. Why Is It Better To Rely On Bagging In Finance?
Conclusion
References
Note Awalee
Introduction

Machine Learning techniques are gaining currency in finance nowadays; ever more strategies rely on Machine Learning models such as neural networks to detect ever subtler signals. Nonetheless, this rising popularity does not come without shortcomings, the most widespread being so-called "overfitting", where models tend to learn the data by heart and are thus unable to cope with unknown data. In our opinion, using Machine Learning algorithms in finance without a deep understanding of their inner logic is highly risky: promising initial results are often misleading, the real-life implementation proving disappointing for lack of comprehension of what is really happening.

In this paper we focus on a specific category of Machine Learning meta-algorithms: the ensemble methods. The ensemble methods are called meta-algorithms since they provide different ways of combining miscellaneous Machine Learning models in order to build a stronger model. Those techniques are well known for being extremely powerful in many areas; however, we believe it is important to understand what their advantages are from a mathematical point of view, to make sure they are used purposefully when dealing with a financial Machine Learning problem.

First we set forth how ensemble methods work from a general point of view. We then present the three sources of error in Machine Learning models before explaining what the advantages of bagging over boosting are in finance, and how to use bagging efficiently.

1 From A Single Model To Ensemble Methods: Bagging and Boosting

Machine Learning is mainly premised on predictive models. Once devised, a model is trained thanks to available data; its purpose is to predict the output value, also known as the outcome, corresponding to new input data. Formally, we can define a predictive model in the following manner:

Definition 1 (Predictive Model)
A predictive model is defined as an operator M, based on metaparameters denoted M, and on parameters denoted P. It uses a set of inputs, denoted x ∈ R^m, to compute an output, denoted O ∈ R, seen as the predicted value. We can write:

    M(M; P; •) : R^m → R
    x ↦ O = M(M; P; x)

Thus the idea of a predictive model is only to predict a value based on several features which are the inputs. If M is considered to be "the machine", the learning part consists in estimating the parameters P that enable us to use the model. The metaparameters M are chosen, and often optimized, by the user.

For instance, a neural network is a predictive model. The shape of the neural network, i.e. the number of layers and the number of neurons in each layer, as well as the functions within each neuron, forms the metaparameters M. The parameters of the neural network are the weights of each link between two neurons from two consecutive layers. Those parameters are estimated thanks to a training set D: formally, P = P(D). From now on, since the training sets which are used are of the utmost importance, we always write P(D) to clearly mention which training set is used to find the parameters of a given model.

The gist of ensemble methods is fairly simple: we combine several weak models to produce a single output. From now on and for the rest of this paper, the number of models is denoted N.

The ensemble methods can be divided into two main sets: the parallel methods, where the N models are independent, and the sequential methods, where the N models are built progressively.

• A Parallel Method: Bootstrap Aggregating

In this section, we set forth the bootstrap aggregating method, also known as "bagging", which is the most widespread of the parallel methods [1]. From now on, we assume a training set, denoted D, is at our disposal:

Definition 2 (Data Set)
A data set D is a set of couples of the following form

    D = {(x_i, y_i) ∈ R^m × R, 1 ≤ i ≤ n}

where n is the cardinal of D. For the i-th element in the data set, x_i ∈ R^m is called the vector of the m features, and y_i is called the output value.

To carry out the bagging, we construct N models M^j with 1 ≤ j ≤ N. To do so, we consider a generic model M(M; •; •), i.e. a predictive model whose metaparameters are fixed, for instance a neural network with a given shape. In order to get N models, the generic model is trained with N different training sets D_j:

    M^j = M(M; P(D_j); •)

Thus, M^j is now a function which, for every input vector x ∈ R^m, outputs a real value y = M^j(x).

The N models are different since they are not trained on the same training set; it means that the sets of parameters will be different; therefore we will have different output values for the same input vector of features x.

The N training sets are created thanks to the data set D. The size of the training sets is chosen by the user and denoted K, with K < n, otherwise the training sets would necessarily contain redundant information. The K elements of the training set D_j for 1 ≤ j ≤ N are sampled in D with replacement: for 1 ≤ j ≤ N,

    D_j = {(x_u(j,k), y_u(j,k)), 1 ≤ k ≤ K}

where u(j,k) is a uniform random variable in [1, n].
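As a sketch only, the bagging scheme above can be illustrated in Python. All function names are ours, and we stand in a simple least-squares line for the "generic model" instead of a neural network; the structure — N training sets drawn with replacement, N models trained independently — is the point:

```python
import random
import statistics

def make_training_sets(D, N, K, seed=0):
    """Create N bootstrap training sets of size K (K < n), each
    sampled from D with replacement, as in the scheme above."""
    rng = random.Random(seed)
    n = len(D)
    # u(j, k): a uniform draw over the indices of D for each slot
    return [[D[rng.randrange(n)] for _ in range(K)] for _ in range(N)]

def train(D_j):
    """Hypothetical 'weak' generic model: ordinary least squares on
    a single feature, fitted on the training set D_j."""
    xs = [x for x, _ in D_j]
    ys = [y for _, y in D_j]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs) or 1.0
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

def bagged_model(models):
    """Final model Mf for a regression problem: the average of the
    N individual predictions."""
    return lambda x: statistics.fmean(m(x) for m in models)

# Toy data set D = {(x_i, y_i)}: a noisy linear relation y ≈ 2x + 1
rng = random.Random(42)
D = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in range(20)]

models = [train(D_j) for D_j in make_training_sets(D, N=10, K=15)]
Mf = bagged_model(models)
```

Each call to `train` sees a different bootstrap sample, so the N fitted models differ slightly; `Mf` smooths those differences away by averaging.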
Once the N models M^j have been trained, they are combined into the final model Mf. For instance, if we consider a regression problem, meaning that the output value y does not belong to a predetermined finite set of values:

    Mf : R^m → R
    x ↦ (1/N) Σ_{j=1}^{N} M^j(x)

If we consider a classification problem, meaning that the output value y belongs to a finite set of values S, the output value of the final model is determined by a vote of the N models M^j: the outcome which appears the most among the N output values produced by the N models is the outcome of the final model.

Such a model can then be tested on a test set of data, as is usually done for every Machine Learning model.

It is also worth noticing that there are many bagging approaches, which all derive from the general principle presented above. Even though we do not delve into the details, we can for instance mention the so-called "feature bagging", where each one of the N models is trained using only a specific subset of features.

• A Sequential Method: Boosting

Sequential methods consist no longer in using a set of N independent models, but instead a sequence of N models, where the order of the models matters:

    {M^1, ..., M^N}   →   (M^1, ..., M^N)
    a mere set: no order      a sequence

So we have to construct the sequence of the N models, beginning with the first one, which will then sway how the second one is defined, and so on and so forth. In the rest of this section we present some of the principal ideas of boosting.

First, as with bagging, we assume we have a training set made of n elements and denoted D. If we choose to consider a generic model M(M; •; •), we can train a first model:

    M^1 = M(M; P(D); •)

For every element within D we can compute M^1(x_i) and compare it to the outcome y_i.

To devise the second model M^2, we are going to train the generic model M(M; •; •) on a new training set D_1; the new training set derives from D. It contains K < n elements, as will all the subsequent training sets.

We attribute to each element within D a weight depending on how far y_i is from M^1(x_i). The more important the error, the higher the weight associated with the element. We then use those weights to randomly sample D in order to generate the new training set D_1.

    M^2 = M(M; P(D_1); •)

The process is then exactly similar: thanks to the error of the second model, we can compute a new weight for each element within D. Those weights are used to create a new training set: we sample D using the new weights to get D_2, which is then used to train model M^3, and so on.

It is then possible to define the final model Mf as a weighted sum of the N models M^j, where the weight associated with a given model is derived from the error of this model on the data.

We have only presented the main ideas of boosting; the simplest implementation of those guidelines is probably the AdaBoost algorithm [2].

2 The Three Errors Of A Machine Learning Model

A Machine Learning model can suffer from three sources of error: the bias, the variance and the noise. It is important to understand what lies behind those words in order to understand why and how ensemble methods can prove helpful in finance.

The bias is the error spawned by unrealistic assumptions. When the bias is particularly important, it means that the model fails to recognize the important relations between the features and the outputs. The model is said to be underfitted when such a case occurs.

Figure 1 displays a model with an important bias. The dots represent the training data, which obviously do not exhibit a linear relation. If we assume that there is a linear relationship between the features and the outcomes, such a model clearly fails to capture any relation between the former and the latter.

Figure 1: Underfitted model

The variance stems from sensitivity to tiny changes in the training set. When the variance is too high, the model is overfitted on the training set: a "learning-by-heart" situation occurs. This explains why even a small change in the training set can lead to widely different predictions.
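The bias/variance distinction can be made concrete with a small numerical sketch. The toy data and model choices below are ours, not the paper's: a linear fit on quadratic data keeps a large training error no matter what (bias), while a 1-nearest-neighbour model, which literally learns the training set by heart, changes its predictions as soon as the noise in the training set changes (variance):

```python
import random
import statistics

def fit_linear(data):
    """High-bias model: assumes y is linear in x (least squares)."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    xb, yb = statistics.fmean(xs), statistics.fmean(ys)
    denom = sum((x - xb) ** 2 for x in xs) or 1.0
    a = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / denom
    b = yb - a * xb
    return lambda x: a * x + b

def fit_1nn(data):
    """High-variance model: 1-nearest-neighbour, i.e. pure
    'learning by heart' of the training set."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def sample(seed, n=30):
    """Noisy quadratic data, y = x^2 + noise: clearly not linear."""
    rng = random.Random(seed)
    return [(x / 3, (x / 3) ** 2 + rng.gauss(0, 0.5)) for x in range(n)]

train_a, train_b = sample(seed=1), sample(seed=2)

# Bias: the linear model cannot represent y = x^2, so its mean
# squared error on its own training set stays large.
lin = fit_linear(train_a)
mse = statistics.fmean((lin(x) - y) ** 2 for x, y in train_a)

# Variance: the 1-NN model reproduces its own training noise, so two
# training sets differing only by noise can give different predictions;
# the rigid linear model barely moves between the two.
gap_1nn = abs(fit_1nn(train_a)(4.5) - fit_1nn(train_b)(4.5))
gap_lin = abs(lin(4.5) - fit_linear(train_b)(4.5))
```

The 1-NN model also returns the training outcome exactly at any training point, which is the "learning-by-heart" situation described above.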