MathematicS In Action

A new method for stochastic control based on neural networks and using randomisation of discrete random variables is proposed and applied to optimal stopping time problems. The method directly models the policy and requires neither the derivation of a dynamic programming principle nor a backward stochastic differential equation. Unlike continuous optimization, where automatic differentiation is used directly, we propose a likelihood ratio method for the gradient computation. Numerical tests are performed on the pricing of American and swing options. The proposed algorithm succeeds in pricing high dimensional American and swing options in a reasonable computation time, which is not possible with classical algorithms.


Motivation
Optimal stopping problems are particularly important for risk management as they are involved in the pricing of American options. American-style options are used not only by traditional asset managers but also by energy companies, which hedge "optimised assets" by finding the decisions that optimise their P&L and determine their value. The P&L of a power plant unit is commonly modelled using swing options, which allow the holder to exercise the option at most l ≥ 1 times, possibly with a constraint on the delay between two exercise dates (see [13] or [32]).
Formally, for T > 0, we are given a stochastic process (X_t)_{t≥0} defined on a probability space (Ω, F, F = (F_t)_{t≥0}, P) and we search for an increasing sequence of F-stopping times τ = (τ_1, τ_2, . . . , τ_l) maximizing the expected objective E[ Σ_{i=1}^{l} f(τ_i, X_{τ_i}) 1_{τ_i ≤ T} ]. Numerical methods to solve the optimal stopping problem when l = 1, f(t, x) = e^{-rt} g(x) and X is Markovian include:
• Dynamic programming equation: the option price P_0 is computed using the following backward discrete scheme over a grid t_0 = 0 < t_1 < · · · < t_N = T:
P_{t_N} = g(X_T),  P_{t_i} = max(g(X_{t_i}), e^{-r(t_{i+1}-t_i)} E_P(P_{t_{i+1}} | F_{t_i})),  i = 0, . . . , N - 1.  (1.1)
• Partial differential equation (PDE): a variational inequality derived from the Hamilton-Jacobi-Bellman equation,
min(rP - ∂_t P - LP, P - g) = 0,
where L is the infinitesimal generator of X [28, Chapter 8, Section 3.3]. A numerical scheme can be applied to solve this PDE and find the option value.
• Reflected Backward Stochastic Differential Equation (BSDE): the value of the American option is the solution of a reflected BSDE [16]; [10] provides a numerical scheme to solve these equations.
• Policy search: the decision rule or exercise region is parametrized by a vector, and the parameters are usually optimised by Monte Carlo methods, as in reinforcement learning [19, Chapter 8, Section 2]; [2, 18]. The algorithm proposed in this paper is strongly related to this class of methods.
These approaches generalise well for l ≥ 1; see [13] for the dynamic programming principle or [9] for the BSDE method. The non-linear case where f is of the form φ(Σ_{i=1}^{l} e^{-rτ_i} g(X_{τ_i}) 1_{τ_i ≤ T}) is studied in [31]. We refer to [19, Chapter 8] for a more exhaustive account of numerical methods for American option pricing. All these algorithms suffer from the curse of dimensionality: the number of underlyings can hardly exceed 5. However, energy companies' portfolios may contain derivatives involving more than 4 commodities at a time (e.g. swing options indexed on CO2, natural gas, electricity, volume, fuel), and traditional numerical methods hardly provide good solutions in a reasonable computing time.
Recently, neural network-based approaches have shown good results on stochastic control problems and on the numerical resolution of PDEs in high dimension, see [14, 21, 29]. In the following, we describe the literature on optimal stopping time problems using neural networks. [6, 25] use neural networks for the regression in the dynamic programming equation (1.1). [3, 5, 22] also use the dynamic programming equation (1.1), but neural networks are used to parameterize the optimal policy. The weights and biases of the neural network(s) minimise, at each time step and going backward, the right hand side of the dynamic programming equation (1.1). The optimal decision consists in a continuous variable (instead of a discrete one) taking values in (0, 1), modeled by a neural network. [15, 21, 23] use neural networks to solve BSDEs. In [23], the neural networks parameterizing the solution and possibly its gradient minimise the L2 loss between the left hand side and the right hand side of the Euler discretisation of the BSDE, going backward from the terminal value. [3] and [23] need to maximise one criterion per time step. The approaches of [21] and [15] are quite different: the neural network parameterises the initial value of the BSDE and the gradient at each time step, and it minimises the distance between the terminal value obtained by the neural network and the terminal value of the BSDE, going forward. American put option prices are computed in [23] up to dimension 40 with 160 time steps. Neural network approaches have also been used in the context of swing option pricing in the gas market in [4]. The definition of swing options there slightly differs from ours as it considers a continuous control: the option owner buys a certain amount of gas between a minimum and a maximum quantity.
It is however related to our problem as, in continuous time, this option is bang-bang: it is optimal to exercise at the minimum or the maximum level at each date, that is, to choose between two actions. [4] directly models the policy by a neural network and optimises the objective function as in [12, 17].
Contrary to [3, 6, 15, 21, 22, 23, 25], the goal of this paper is to propose a reinforcement learning algorithm that solves optimal multi-exercise (rather than single-exercise) stopping time problems with constraints on the exercise times, and that needs neither the derivation of a dynamic programming equation nor an equivalent BSDE formulation of the problem. The only information needed is the dynamics of the state process X and the objective function. This kind of algorithm is called policy gradient and is well known in the area of reinforcement learning, see [30] for instance. Although continuous control approximation with reinforcement learning shows good results, see [12, 17] for European-style option hedging, the case of optimal stopping times is more difficult as it involves controls taking values in a discrete set of actions. The problem is similar to a combinatorial optimisation one: at each time step, an action belonging to a finite set needs to be taken. One way to solve this problem is to perform a relaxation, assuming that the control belongs to a continuous space. For instance, to price an American option, a decision represented by a value in {0, 1}, consisting in exercising or not, must be taken. Relaxing the problem consists in searching for solutions in [0, 1]: this relaxation has successfully been applied to Bermudan option pricing in a high dimensional setting (up to 1000) in [7]. These methods apply well to American-style option pricing but do not seem flexible enough to be extended to swing option pricing.

Main results
Our approach follows the spirit of [17] and [7]: one directly parameterises the optimal policy by a neural network and maximises the objective function moving forward. We propose an algorithm using reinforcement learning in order to solve optimal stopping time problems seen as combinatorial optimisation problems. Note that solving combinatorial optimisation problems with neural networks has been considered in [8], in a deterministic framework without dynamics on the state process. The stochastic optimization framework considered in this paper is described in Section 2.
Neural networks hardly handle integer outputs, which is the main difficulty of the problem addressed in this paper. To circumvent this problem, the first step of the algorithm consists in randomizing the optimization variables (that is, exercising the option or not) and modelling their law by a neural network. The second step consists in computing the gradient of the objective function. It cannot be computed as usual by automatic differentiation, as the neural network does not output the optimization variables but their law. The likelihood ratio method allows us to rewrite the gradient as a function of the gradient of the neural network output, which can be computed by automatic differentiation. The algorithm is given in Section 3. Compared to the papers referenced above, our approach solves stopping time problems without any knowledge of the dynamic programming equation or of an equivalent BSDE. Furthermore, it presents many advantages as it
• can solve multiple optimal stopping time problems;
• allows to add any constraint on the stopping times in a flexible way;
• can be combined with the method of [17], which considers continuous actions, in order to solve stochastic impulse control problems combining discrete and continuous controls (see [27, Chapter 6] for more information on impulse control problems).
Let us also note that our method does not take advantage of the linearity (the possible inversion of the sum and the expectation) of the considered optimal stopping problems, contrary to [7], and can thus be applied to non-linear problems, making it suitable for impulse control problems. The theoretical convergence study of our algorithm is out of the scope of this paper.
Numerical tests covering Bermudan and swing options are proposed in Section 4 and show good results in the pricing of a Bermudan option on 10 underlyings as well as of swing options on 5 underlyings with up to l = 6 exercise dates. However, our algorithm gives suboptimal results on one of the considered cases.

Continuous time modelling
We are given a financial market operating in continuous time. Let (Ω, F, F = (F_t)_{t≥0}, P) be a filtered probability space and W a d-dimensional F-Brownian motion. One assumes that F satisfies the usual conditions of right continuity and completeness. Let T > 0 be a finite horizon time and X = (X^1, X^2, . . . , X^d) be the unique strong solution of the Stochastic Differential Equation (SDE)
dX_t = μ(t, X_t) dt + σ(t, X_t) dW_t,  X_0 = x_0.  (2.1)
Using the notations of [13] and with X as defined in (2.1) for t ∈ [0, T] and X_t = X_T for t ≥ T, an optimal stopping time problem consists in solving
sup_{τ ∈ S^l} E[ Σ_{i=1}^{l} f(τ_i, X_{τ_i}) 1_{τ_i ≤ T} ],  (2.2)
where S^l is the collection of all vectors of increasing stopping times τ = (τ_1, . . . , τ_l) such that τ_{i+1} - τ_i ≥ γ for i = 1, . . . , l - 1; l corresponds to the number of possible exercises and γ ≥ 0 to the minimum delay between two exercise dates. One wants to find the optimal value (2.2) but also the optimal policy.

Discrete time modelling
In practice, one only considers optimal stopping on a discrete time grid (for instance, the valuation of a Bermudan option is used as a proxy of the American or swing option). Let us consider N + 1 exercise dates belonging to the discrete set D_N = {t_0 = 0 < t_1 < · · · < t_N = T}. The problem consists in finding
sup_{τ ∈ S^l_N} E[ Σ_{i=1}^{l} f(τ_i, X_{τ_i}) 1_{τ_i ≤ T} ],  (2.4)
where S^l_N is the set of stopping times belonging to S^l (the set of increasing stopping time vectors including the delay constraints) such that τ_i ∈ D_N on {τ_i ≤ T}, for i = 1, . . . , l. This discretisation is needed for our algorithm, as it is in classical methods such as [26]. Problem (2.4) is equivalent to the following one:
sup_Y E[ Σ_{j=0}^{N} Y_j f(t_j, X_{t_j}) ],  (2.5)
where the supremum is taken over adapted decision variables Y_j with values in {0, 1} (Y_j = 1 meaning that we exercise at t_j), subject to the constraints
Σ_{j=0}^{N} Y_j ≤ l,  (2.6)
∆_{t_j} ≥ γ on {Y_j = 1},  (2.7)
where ∆_{t_j} is the delay at t_j from the last exercise, assuming that we can exercise at time 0 (hence the γ term in (2.8) allowing to start at time 0 with a delay γ) and with the convention sup ∅ = -γ:
∆_{t_j} = t_j - sup{t_i < t_j : Y_i = 1}.  (2.8)

Neural network parametrization
As the Y_i's are discrete, we cannot assume that they are the output of a neural network whose weights are optimised by applying a stochastic gradient descent (SGD). To overcome this difficulty, one can randomize Y and consider that at each time step Y_j follows a Bernoulli distribution. The Bernoulli distribution parameter depends on the state variable of our control problem. This state variable, denoted S_{t_j}, is
• X_{t_j} in the case of a Bermudan option, that is with only one exercise date;
• (X_{t_j}, Σ_{i<j} Y_i, ∆_{t_j}) in the general case, adding the number of past exercises and the delay from the last exercise to the state.
Of course, one can adapt the state depending on the constraints (for instance, if there is no delay constraint (2.7), the state is (X_{t_j}, Σ_{i<j} Y_i)). In the case where X is a non-Markovian process, one needs to consider that the probability for Y_j to be equal to 1 is a function of all the values of X_{t_i} for i ≤ j and Y_i for i ≤ j - 1. In this case, one could use a Recurrent Neural Network to parameterise this function, but we do not consider this case here. The Bernoulli distribution parameter, which is P(Y_j = 1 | S_{t_j}), is parameterised by a neural network NN defined on [0, T] × S × Θ and taking values in R, where S is the state space and Θ represents the sets in which the biases and weights of the neural network lie. The neural network architecture is described in Section 3.3. The parametrization is then the following:
P(Y_j = 1 | S_{t_j}) = c(S_{t_j}) expit(C tanh(NN(t_j, S_{t_j}, θ))),
with expit : R → (0, 1), expit(x) = 1/(1 + e^{-x}) for x ∈ R, and c(S_{t_j}) equal to 0 if the constraints are saturated and 1 otherwise. Typically, if there are no constraints, c is always equal to 1, and if there are constraints (2.6) and (2.7),
c(S_{t_j}) = 1_{Σ_{i<j} Y_i < l} 1_{∆_{t_j} ≥ γ}.
Note that the methodology can be extended to any constraint on the policy. C tanh(NN(x, θ)) outputs the logit (the inverse function of expit) of P(Y_j = 1 | S_{t_j}) (when the constraints are not saturated). The function tanh is not necessary and one could use NN alone to parameterise the logit of the probability.
To reduce the range of the logit, we bound the output of the neural network using tanh and choose C such that expit(-C) ≈ 0 and expit(C) ≈ 1; C is given in Section 3.4. From now on, P is replaced by P_θ to indicate the dependence of the law of Y on θ. At this stage, we still cannot train our neural network by applying a stochastic gradient descent, because of the randomization of the Y's.
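As a concrete illustration, the constrained Bernoulli parametrization above can be sketched in a few lines of Python. The function and argument names (`exercise_probability`, `nn_output`, etc.) are ours, not the paper's, and the neural network output is taken as a given scalar:

```python
import numpy as np

C = 10.0  # bound on the logit, as chosen in Section 3.4

def expit(x):
    """Logistic function mapping R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def exercise_probability(nn_output, n_past_exercises, delay, l, gamma):
    """P_theta(Y_j = 1 | S_{t_j}) = c(S_{t_j}) * expit(C * tanh(NN)).

    c(S_{t_j}) masks the probability to 0 when the constraints are
    saturated: no exercise budget left (2.6), or delay below gamma (2.7).
    """
    c = 1.0 if (n_past_exercises < l and delay >= gamma) else 0.0
    return c * expit(C * np.tanh(nn_output))
```

With a zero logit the probability is exactly 0.5, and the tanh bound keeps it inside (expit(-C), expit(C)) however large the raw network output is.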

Optimization
To approximate a solution to (2.5), we search for
θ* ∈ argmax_θ E_{P_θ}[ Σ_{j=0}^{N} Y_j f(t_j, X_{t_j}) ].
Classical neural network parameter optimization consists first in evaluating the objective function by the Monte Carlo method, replacing it by
(1/N_batch) Σ_{m=1}^{N_batch} Σ_{j=0}^{N} Y_j^m f(t_j, X^m_{t_j}),
with N_batch ∈ N* and where X^m and Y^m correspond to one realization of (X, Y) simulated according to (2.1) for X and to P_θ for Y. Secondly, a gradient descent step updates the parameter θ using the gradient of the objective function, computed by backpropagation. However, in our case it is not possible to use backpropagation directly: Y is not a function of θ, but a discrete variable whose law depends on θ.
To bypass this problem, we use a likelihood ratio method, see [19, Section 7.3]. Let us consider a random variable Z : Ω → E with probability measure Q_a, a ∈ R^d, absolutely continuous with respect to a measure Q. Let ℓ_a(x) = (dQ_a/dQ)(x) be the likelihood function. Under some integrability conditions, we have ∇_a E_{Q_a}(Z) = E_{Q_a}(Z ∇_a log(ℓ_a(Z))). Using this method and iterated conditioning for the probability computation, we find that the gradient of E_{P_θ}[ Σ_{j=0}^{N} Y_j f(t_j, X_{t_j}) ] is given by:
∇_θ E_{P_θ}[ Σ_{j=0}^{N} Y_j f(t_j, X_{t_j}) ] = E_{P_θ}[ ( Σ_{j=0}^{N} Y_j f(t_j, X_{t_j}) ) Σ_{j=0}^{N} ∇_θ log P_θ(Y_j | S_{t_j}) ].  (3.2)
P_θ(Y_j | S_{t_j}) is a continuous function of the neural network output: the gradients appearing in Equation (3.2) can thus be easily computed using backpropagation.
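To make the likelihood ratio identity concrete, the following sketch verifies it numerically in the simplest setting: a single Bernoulli variable Y ~ B(p), with p playing the role of the parameter a, and a reward Y·f as in the objective. The function name `score_gradient` and the setup are our illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def score_gradient(p, reward, n_samples=200_000):
    """Monte Carlo estimate of d/dp E[R(Y)] for Y ~ Bernoulli(p), using
    the likelihood ratio identity
        d/dp E[R(Y)] = E[R(Y) * d/dp log P(Y)],
    where d/dp log P(Y) = Y/p - (1 - Y)/(1 - p) is the score.
    """
    y = rng.random(n_samples) < p
    r = np.where(y, reward, 0.0)                    # R(Y) = Y * reward
    score = np.where(y, 1.0 / p, -1.0 / (1.0 - p))  # d/dp log P(Y)
    return np.mean(r * score)

# Exact gradient of E[Y * reward] = p * reward with respect to p is `reward`.
grad = score_gradient(p=0.3, reward=2.0)
```

The Monte Carlo estimate recovers the exact derivative (here 2.0) without ever differentiating through the discrete sample Y, which is precisely why the method applies to the randomized stopping decisions.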
The main drawback of this method is the high variance of the right hand side of (3.2). Several methods exist to mitigate this issue, the most common in neural network optimisation being the use of a baseline, see [33].
Every ∆_test steps, the objective value is computed on the testing set. The parameters kept at the end are the ones minimising those evaluations. The objective function is finally evaluated on a validation set. While during the training phase the actions are sampled from the output probability, on the test and validation sets they are set equal to 1 if the probability is greater than 0.5 and to 0 otherwise. The algorithm is described in Algorithm 1, with hyperparameters given in Section 3.4.
Remark 3.1. The method does not take any advantage of the fact that we can exchange the sum and the expectation in (2.5). It is therefore suitable for optimization problems of the form
sup E[ φ( Σ_{j=0}^{N} Y_j f(t_j, X_{t_j}) ) ],
with φ possibly non-linear, and could be combined with forward neural network continuous optimization algorithms such as those of [12, 17] to solve impulse control problems.

Algorithm 1. Algorithm for optimal stopping. The lines in blue are the main difference with classical backpropagation.
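The structure of the training procedure can be illustrated by the following toy sketch, under heavy simplifying assumptions of ours: a single exercise right (l = 1), a deterministic payoff g_j standing in for f(t_j, X_{t_j}), and one logit per exercise date instead of a neural network, so that the score with respect to a logit is simply Y_j - p_j. All variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

g = np.array([0.2, 1.0, 0.5])   # payoff collected if we exercise at date j
theta = np.zeros(3)             # one logit per date, i.e. p_j = 0.5 initially
lr, n_batch, n_iter = 0.1, 256, 2000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(n_iter):
    p = expit(theta)
    rewards = np.zeros(n_batch)
    scores = np.zeros((n_batch, 3))
    for m in range(n_batch):
        for j in range(3):                            # forward pass over dates
            y = rng.random() < p[j]                   # sample Y_j ~ B(p_j)
            scores[m, j] = (1.0 if y else 0.0) - p[j] # d/dtheta_j log P(Y_j)
            if y:                                     # first exercise stops (l = 1)
                rewards[m] = g[j]
                break
    baseline = rewards.mean()                         # variance reduction [33]
    grad = ((rewards - baseline)[:, None] * scores).mean(axis=0)
    theta += lr * grad                                # gradient ascent step

p = expit(theta)
```

After training, the policy learns to wait at date 0 (where the payoff is 0.2) and to exercise at date 1 (payoff 1.0), mirroring how Algorithm 1 discovers the exercise rule from forward simulations only.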

Neural network architecture
The neural network architecture is inspired by [14] and consists in one single feed-forward neural network whose features are the time step t_i and the current state realisation S_{t_i}. Let t ∈ [0, T] and x = (x_1, . . . , x_r) ∈ S, which we assume to be included in R^r. The neural network is defined as follows:
NN(t, x, θ) = A_{L+1} ∘ ρ ∘ A_L ∘ · · · ∘ ρ ∘ A_1((t, x_1, . . . , x_r)),
where A_l(y) = W_l y + b_l for l = 1, . . . , L + 1, with W_1 ∈ R^{m×(r+1)}, W_l ∈ R^{m×m} for l = 2, . . . , L, W_{L+1} ∈ R^{1×m}, b_l ∈ R^m for l = 1, . . . , L and b_{L+1} ∈ R. L corresponds to the number of hidden layers and m to the number of neurons per layer (assumed to be the same for every layer). The (W_l)_{l=1,...,L+1} are the weights and the (b_l)_{l=1,...,L+1} the biases. The activation function ρ is chosen as the ReLU function, ρ(x) = max(0, x). θ is then equal to (W_1, . . . , W_{L+1}, b_1, . . . , b_{L+1}).
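A minimal sketch of this feed-forward architecture, with a Xavier-style initialisation as mentioned in Section 3.4 (function names and the exact initialisation formula are our assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def init_params(r, m, L, rng):
    """theta = (W_1, ..., W_{L+1}, b_1, ..., b_{L+1}) with the shapes of
    Section 3.3: input (t, x_1, ..., x_r), L hidden layers of m neurons,
    scalar output."""
    dims = [r + 1] + [m] * L + [1]
    return [(rng.normal(0.0, np.sqrt(2.0 / (din + dout)), (dout, din)),
             np.zeros(dout))
            for din, dout in zip(dims[:-1], dims[1:])]

def nn_forward(t, x, params):
    """NN(t, x, theta) = A_{L+1} o rho o A_L o ... o rho o A_1((t, x))."""
    y = np.concatenate(([t], x))
    for W, b in params[:-1]:
        y = relu(W @ y + b)        # A_l(y) = W_l y + b_l, then rho = ReLU
    W, b = params[-1]
    return (W @ y + b).item()      # scalar output: the (unbounded) logit
```

The scalar output is then passed through C·tanh and expit as in Section 3.2 to produce the exercise probability.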

Hyperparameters
• The batch size N_batch as well as the number of iterations N_iter depend on the use case and are specified later. The training set size is then equal to N_iter × N_batch. As the likelihood ratio estimator of the gradient has a high variance, choosing a large batch size (> 1000) allows for a better estimation. The drawback is that it tends to slow down the algorithm. To reduce the variance, one could also use a baseline function as in [33].
• The test set size is chosen equal to 500 000 and the validation set size to 4 096 000 (both are chosen large to get very accurate evaluations). The test set is evaluated every ∆_test = 100 steps.
• The number of hidden layers L is chosen equal to 3. The number of neurons m per layer is constant (but can vary from one case to another).
• The learning rate (α in Algorithm 1) is chosen equal to 0.001.
• Since we use the same network at each time step, we use a mean-variance normalisation over all the time steps, centering all the inputs (t_i, X_{t_i}) with the same coefficients for all t_i's. The scaling and recentering coefficients are estimated on 100 000 pre-simulated paths used only to this end. The mean and the standard deviation are first computed for every time step over the simulations, then averaged over the time steps.
• We use Xavier initialisation [20] for the weights and a zero initialisation for the biases.
• The parameter C that bounds the input of the expit function is chosen such that expit(−C) ≈ 0 and expit(C) ≈ 1. We choose C = 10.
• The library used is [1] and the algorithm runs on a laptop with 8 cores at 2.50 GHz, 15.6 GB of RAM and no GPU acceleration.
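The shared mean-variance normalisation described in the list above can be sketched as follows (the array layout and function name are our assumptions; the time coordinate would be normalised analogously):

```python
import numpy as np

def fit_normalisation(paths):
    """Estimate one (mean, std) pair shared by all time steps.

    `paths` has shape (n_sim, N + 1, d): pre-simulated trajectories used
    only for this calibration.  Per-time-step statistics are computed
    over the simulations first, then averaged over the time steps.
    """
    mean = paths.mean(axis=0).mean(axis=0)   # (d,) mean over sims, then steps
    std = paths.std(axis=0).mean(axis=0)     # (d,) std over sims, then steps
    return mean, std

def normalise(x, mean, std):
    """Recenter and rescale an input X_{t_i} with the shared coefficients."""
    return (x - mean) / std
```

Sharing one pair of coefficients across all t_i keeps the single network's inputs on a common scale whatever the time step, which is the point of the averaging over time steps.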

Numerical results
In this section, Algorithm 1 is applied to the valuation of Bermudan and swing options. The function f(s, x) is of the form e^{-rs} g(x), where g is the payoff of the option and r ≥ 0 the risk-free rate. We place ourselves in the Black-Scholes framework: μ(s, x) = (r - δ)x, with δ ≥ 0 the dividend rate, and σ(s, x) = diag(x)Σ with Σ a positive definite matrix. We choose to work with the regular time grid t_i = iT/N for i = 0, . . . , N. The probability measure corresponds to the risk-neutral probability and finding the value of the option consists in solving Problem (2.5).
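In this Black-Scholes setting, the paths feeding the algorithm can be simulated exactly on the regular grid; a sketch (function name ours), assuming the rows of Σ act on the d-dimensional Brownian increments:

```python
import numpy as np

def simulate_bs_paths(n_sim, d, N, T, r, delta, sigma, x0, rng):
    """Exact simulation of d-dimensional Black-Scholes on the grid
    t_i = i T / N under the risk-neutral measure:
        X^k_{t_{i+1}} = X^k_{t_i} exp((r - delta - ||Sigma_k||^2 / 2) dt
                                      + (Sigma dW_i)_k),
    where Sigma is the (d x d) volatility matrix and Sigma_k its k-th row.
    """
    dt = T / N
    paths = np.empty((n_sim, N + 1, d))
    paths[:, 0] = x0
    var = np.sum(sigma**2, axis=1)              # ||Sigma_k||^2 per component
    for i in range(N):
        dw = rng.normal(0.0, np.sqrt(dt), (n_sim, d))
        paths[:, i + 1] = paths[:, i] * np.exp(
            (r - delta - 0.5 * var) * dt + dw @ sigma.T
        )
    return paths
```

The log-normal exact scheme avoids any Euler discretisation bias; as a sanity check, E[X_T] = x0 e^{(r-δ)T} under the risk-neutral measure.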

Bermudan options
In this section, we assume that l = 1 (only one exercise) and we consider different options to price. Losses and times obtained with Algorithm 1 are given in Table 4.1 for each case, and the losses are compared to a reference value ([10] for the put option and [7] for the other options). The algorithm succeeds in pricing Bermudan options with a high precision (relative error < 1%) in dimension up to d = 10 and number of time steps up to N = 50. The computing time is more sensitive to the number of time steps than to the dimension: the number of neural network evaluations is equal to the number of time steps. The increase in computing time when the dimension increases is mostly caused by the need for a larger batch size and by a longer simulation time. Algorithm 1 succeeds in pricing Bermudan options and solves problems that are usually hard to solve and very expensive in terms of computation time, as they suffer from the curse of dimensionality. The training and testing learning curves are given in Figure 4.1. The testing errors are relatively stable for the put and the max-call options but less stable for the strangle spread. For the put and the 2 dimensional max-call, the testing error converges quickly to the optimal value. The testing error of the 10 dimensional max-call decreases more slowly. The different training errors are all noisy as they are computed on smaller samples.
Once trained, the neural network allows us to compute, in a few seconds, the probability to exercise as a function of the price and the time to maturity, and then the exercise region; see Figure 4.2 for the Bermudan put option. The probability to exercise has an S-shape with limit values 0 and 1 (do not exercise and exercise) and a small transition region between those two values. As expected, the first value at which the probability becomes 0 increases when the time to maturity decreases. This is confirmed by the exercise region in Figure 4.1, whose frontier decreases with time to maturity. The exercise region is the one below the curve, which is computed using the first value at which the probability of exercise falls below 0.5. The frontier is extrapolated from the neural network, which is trained only on a discrete time grid but can be evaluated at different time-to-maturity values (even ones not used in the training), which is not possible with classical backward optimisation.

Swing options without delay
In this section, we consider a swing option without delay constraint. We compare in Table 4.2 the results obtained by Algorithm 1 with the results of [24] in the case of a put option with d = 1, g(x) = (K - x)_+, K = 40, S_0 ∈ {35, 40, 45}, r = 0.0488, δ = 0, Σ = 0.25, N = 12, T = 0.25, l ∈ {1, 2, 3, 4, 5, 6} and no delay. We consider a batch size N_batch = 2000, a neural network with L = 3 hidden layers of size m = 10 and N_iter = 5000 iterations. Every case takes around 4 minutes to converge, see Table 4.3. The algorithm gives very accurate results.
Table 4.1. Results obtained on different Bermudan options pricing with Algorithm 1, with the relative difference between Algorithm 1 and a reference value (given by [11] for the put option and by [7] for the other options).
Table 4.2. Comparison with [24] for different initial values S_0 and different numbers of exercises l. The first value corresponds to the swing option value obtained with Algorithm 1, the second value to the one in [24] and the third value is the relative difference in %.
We then consider a high dimensional case, with a batch size N_batch = 8000, a neural network with L = 3 hidden layers of size m = 30 and N_iter = 5000 iterations. Results are given in Table 4.4. The algorithm succeeds in pricing this option with 5 underlyings in a reasonable time (less than 30 minutes). By using a less costly hyperparameterization (m = 20 neurons instead of m = 30, N_iter = 2000 iterations instead of 5000, N_batch = 3000 instead of 8000), we are able to obtain results in less than 4 minutes with a 2% accuracy, as shown in Table 4.5.

Swing options with delay
Let us now consider the case of a put option with d = 1, g(x) = (K - x)_+, K = 100, S_0 = 100, r = 0.05, δ = 0, Σ = 0.3, N = 50, T = 1, γ = 5T/N and l ∈ {1, 2, 3, 4, 5}. A delay constraint is now present and a higher number of dates is considered. We consider a batch size N_batch = 5000, a neural network with L = 3 hidden layers of size m = 10 and N_iter = 10000 iterations. We compare in Table 4.6 the results obtained with Algorithm 1 to the ones obtained with [13]. The algorithm gives satisfying results, but compared to the results of Table 4.5, we notice a loss of performance as l increases. This may be due to the delay constraint added in this case. We tried to change the architecture of the neural networks without noticing a significant improvement of the algorithm.

Conclusion and perspectives
A stochastic control algorithm able to deal with (discrete) optimal stopping variables has been presented. The different use cases show that the proposed algorithm is able to solve optimal stopping time problems in a reasonable time, even in high dimension and with multiple exercises. The algorithm is simple and allows us to find an optimal policy without any knowledge of the dynamic programming equation. The method presented in this paper avoids a costly backward pass and only needs a forward pass. While the computation time increases a little with the dimension, it increases much more with the number of time steps, and the algorithm can then have trouble converging. To confirm all these results, one should study the theoretical convergence of the algorithm. The algorithm could easily be extended to impulse control if combined with [17], in order to solve problems involving both continuous and discrete controls, such as hedging with fixed transaction costs.