Regression Monte Carlo for Impulse Control

I develop a numerical algorithm for stochastic impulse control in the spirit of Regression Monte Carlo for optimal stopping. The approach consists of generating statistical surrogates (also known as functional approximators) for the continuation function. The surrogates are recursively trained by empirical regression over simulated state trajectories. In parallel, the same surrogates are used to learn the intervention function characterizing the optimal impulse amounts. I discuss appropriate surrogate types for this task, as well as the choice of training sets. Case studies from forest rotation and irreversible investment illustrate the numerical scheme and highlight its flexibility and extensibility. Implementation in \texttt{R} is provided as a publicly available package posted on GitHub.


Introduction
Stochastic impulse control is concerned with systems where the state process (X_t) is subject to stochastic dynamics as well as repeated lumpy interventions or shocks by the controller. Such impulses make an instantaneous, rather than sustained, impact on (X_t) and carry an instantaneous cost/reward. The goal of the controller is to maximize total expected (discounted) profit, driven by the impulses and the continuous revenue function π(X_t). Impulse control problems have a long history, with manifold applications over the past 40+ years; see below. In particular, impulsive controls are common in commodities applications to describe management of natural resources or industrial capacity planning. Nevertheless, numerical methods for stochastic impulse control are surprisingly thinly studied, especially in comparison to the vast literature on numerical optimal stopping and the emergent literature on numerical continuous control via neural networks.
In this article I propose to leverage Monte Carlo based methods that have been developed for optimal stopping in order to create a related approach to stochastic impulse control. To do so, I rely on the interpretation of impulse control as two-stage decision making: at each time instant, the controller must first decide whether to act or to wait; conditional on acting, in the second stage the controller picks the optimal action. This perspective reduces impulse control to repeated optimal stopping with an implicit payoff function specified via the so-called intervention operator M. In turn, it permits importing algorithms for multiple stopping problems after incorporating the computation of the intervention operator. Through this lens, solvers for impulse control can be built on top of related code for optimal stopping. The proposed algorithm extends the realm of Regression Monte Carlo (RMC) methods to the setting of impulse control. It employs the main features of RMC (statistical surrogates for a functional approximation of the continuation value, and simulation for empirical training of these surrogates) in the context of multiple impulse actions. Additionally, I propose direct optimization of the surrogate over the action set to obtain optimal impulses. Methodologically, this strategy highlights the advantageous modularity of RMC, which makes the paradigm applicable beyond classical stop/continue decisions. On the implementation side, the algorithm is coded in R and is integrated into the existing mlOSP ("Machine Learning for Optimal Stopping Problems") package developed by the author over the past few years [30]. Thus, mlOSP effectively subsumes numerical resolution of impulse control into the existing framework of RMC. The package is publicly available via GitHub at github.com/mludkov/mlOSP and offers reproducible vignettes. Thus, all the underlying code and case studies can be fully examined by the readers or other researchers, facilitating reproducibility and future extensions.

Literature Review
Relative to other types of stochastic control, impulse control problems are rarely solvable explicitly, with only a few exceptions; see [2,19,17]. In part, this is because there are many problem ingredients to solve for: impulse thresholds, impulse targets, the intervention function, the value function, etc. Thus, typically only very special cases, such as time-stationary models with linear intervention costs and linear dynamics, have been studied in detail.
The most well-known sub-case is when the state (X_t) is one-dimensional and is expected to be a renewal process when optimally controlled. The so-called (s, S) strategies focus on determining an impulse threshold s and a target level S, reducing the computation of the optimal strategy to a two-dimensional optimization over the two constants s, S. Such (s, S) strategies have been studied in Operations Research for over 20 years; formulations similar to the one I discuss have appeared in optimal inventory [10,14,26] and dividend payout [7,11,18] problems.
Another major application of stochastic impulse control has been in real options, in the context of (ir)reversible investment [1,3,23]. Alternatively called the capacity expansion problem, this class of models considers gradual addition of capacity, e.g. power generation capacity in the context of owning a fleet of electricity power plants. More sophisticated models [13,24] treat separately the commodity price and the current capacity, leading to a two-dimensional impulse control formulation with one exogenous and one endogenous component; see Section 4.2. A conceptual opposite to investment are harvesting problems, especially the Faustmann problem of forest management [4,5,12]. The state variable represents the current forest stand value; actions correspond to cutting trees down for sale, known as a "rotation". Thus impulses are downward and are viewed as revenue rather than cost. I discuss the Faustmann problem further in Section 4.1. Other applied domains include control of foreign exchange rates by a central bank [15] and management of energy retail prices [9]. The standard approach to numerical solution of impulse control is via quasi-variational inequalities, which reduce to an HJB-type partial differential equation. However, the non-local term in the equation that corresponds to the impulses makes numerical schemes quite challenging, and so far there is limited literature available to handle it; see [8]. Indeed, eliminating this limitation of the HJB approach is one of the gaps I aim to address in the present article, which takes a completely probabilistic/statistical perspective and does not depend on the smoothness of the value function. Similarly, due to the associated analytic difficulties, the vast majority of works consider time-stationary one-dimensional impulse control on infinite horizon; see [12] and [22,21] for recent results on finite-horizon SIC. Multivariate SIC is treated in [16,7], among others.
The rest of the paper is organized as follows. Section 2 introduces the problem and the new RMC-based algorithm. Section 3 discusses implementation, including how to evaluate optimal impulses; that section concludes with a concrete example, illustrated with R code snippets. Section 4 contains two case studies, on forest rotation (Section 4.1) and two-dimensional capacity expansion (Section 4.2).

Problem Formulation
To set the stage, we focus on diffusion models where the system state (X_t), in the absence of any impulses, is assumed to satisfy a Stochastic Differential Equation of Itô type,

    dX_t = µ(X_t) dt + σ(X_t) dW_t,    (1)

where (W_t) is a (multi-dimensional) Brownian motion and the drift µ(·) and volatility σ(·) are smooth enough to yield a unique strong solution to (1). The state space of (X_t) is X ⊂ R and we assume the standard probabilistic structure (Ω, F, (F_t), P), where X_t is adapted to the filtration F = (F_t). Generalization to multiple dimensions is straightforward.
The above is the uncontrolled dynamics, which is subject to control shocks. To describe the latter, let T < ∞ be a given time horizon and denote by A the set of all admissible controls. An admissible control A = {τ_n, z_n} is a double sequence such that

• τ_n is an increasing sequence of F-stopping times, such that τ_n < τ_{n+1} P-a.s. and lim_{n→∞} τ_n = T P-a.s.;
• z_n are the respective impulse amounts, each z_n being F_{τ_n}-measurable and taking values in the action set Ξ.
A controlled process X^{t,x,A} is indexed by its initial condition X_t = x and the associated admissible impulse strategy, and satisfies for s > t

    X^{t,x,A}_s = x + ∫_t^s µ(X^{t,x,A}_u) du + ∫_t^s σ(X^{t,x,A}_u) dW_u + Σ_{n: t < τ_n ≤ s} z_n.    (2)

The corresponding expectation conditional on X_t = x is denoted by E_{t,x}[·]. The dynamics in (1) then correspond to the case of no control, A = ∅, and one may view (2) as the concatenation of the uncontrolled (1) on each [τ_n, τ_{n+1}) plus the instantaneous jumps X^{t,x,A}_{τ_n} = X^{t,x,A}_{τ_n−} + z_n. Let π : X → R be the running reward function, and κ : X × R → R be the impulse cost function, representing the net revenue of applying an impulse of size z ∈ R at time t and state x. Typically κ(x, z) < 0, capturing the cost of moving X_t from x ∈ X to x + z ∈ X. We assume a finite horizon T with a respective terminal condition φ(X_T). The controller aims to maximize discounted expected reward on the horizon T < ∞, where r ≥ 0 is the discount factor. Above we assume that π, φ are integrable enough that the expected rewards below are finite. For an admissible strategy A, denote by

    J(t, x; A) := E_{t,x}[ ∫_t^T e^{−r(s−t)} π(X^{t,x,A}_s) ds + Σ_{n: t < τ_n ≤ T} e^{−r(τ_n−t)} κ(X^{t,x,A}_{τ_n−}, z_n) + e^{−r(T−t)} φ(X^{t,x,A}_T) ]

the reward from applying the strategy A on the interval [t, T] and starting with X_t = x. Then our goal is to evaluate the value function

    V(t, x) := sup_{A ∈ A_t} J(t, x; A),

where A_t is the set of admissible strategies on [t, T].
The infinitesimal generator of the uncontrolled X^∅ is

    L f(x) = µ(x) ∂_x f(x) + (1/2) σ²(x) ∂²_{xx} f(x).

Also define the intervention operator

    MV(t, x) := sup_{z ∈ Ξ} { V(t, x + z) + κ(x, z) }.

Optimality for the controller's actions implies that V(t, x) ≥ MV(t, x) for all (t, x). At the same time, Itô's lemma and the supermartingale property of the discounted value process imply that ∂_t V(t, x) + LV(t, x) − rV(t, x) + π(x) ≤ 0. Putting the two together yields the quasi-variational inequality (QVI)

    max( ∂_t V(t, x) + LV(t, x) − rV(t, x) + π(x), MV(t, x) − V(t, x) ) = 0,    (6)

with the boundary condition V(T, x) = φ(x). The analytic approach then characterizes V as the viscosity solution of the QVI (6); see e.g. Chapter 6 in Øksendal and Sulem [31]. HJB-driven methods either attempt to find a classical smooth solution to the QVI, or consider finite-difference schemes for (6); the challenge being the non-local operator M.

Dynamic Programming Equation
For the remainder of the article I adopt the discrete-time paradigm, where decisions are made at prespecified instants t_0 = 0 < t_1 < ... < t_k < t_{k+1} < ... < t_K = T, where typically t_k = k∆t for a given discretization step ∆t. Henceforth, with a slight abuse of notation I index everything by k and work with T = (t_k)_{k=0}^K, presuming that t_k = k∆t for ease of exposition. In particular, this implies that we restrict τ_n ∈ T and rule out multiple instantaneous actions, so that A consists of at most K impulses.
The dynamic programming Bellman equation for impulse control on [t, t + ∆t] is:

    V(t, x) = max( E_{t,x}[ ∫_t^{t+∆t} e^{−r(s−t)} π(X^∅_s) ds + e^{−r∆t} V(t + ∆t, X^∅_{t+∆t}) ], sup_{z ∈ Ξ} { V(t, x + z) + κ(x, z) } ).    (7)

Discretizing in time and writing V(k, x) ≡ V(t_k, x), etc., we substitute E_{t,x}[∫_t^{t+∆t} e^{−r(s−t)} π(X_s) ds] ≈ π(x)∆t and end up with

    V(k, x) = max( Q(k, x), M(k, x) ),

where

    Q(k, x) := π(x)∆t + e^{−r∆t} E[ V(k + 1, X^∅_{k+1}) | X_k = x ],
    M(k, x) := sup_{z ∈ Ξ} { Q(k, x + z) + κ(x, z) }.

The latter intervention operator captures the value of making the best possible impulse. In line with the above, we view optimal impulse control as two-stage sequential decision making. At each time period k, the controller must decide whether to continue (no action) or act (z ≠ 0). In the latter case, she must further select the best action z*. This matches the appearance of max in V(k, x): one should continue if the Q-value dominates the intervention value, and one should impulse otherwise. Within the Markovian structure of (2), the impulse strategy can be encoded by mapping each input x according to the respective feedback action map Z_k(x) ∈ {0} ∪ Ξ:

    Z_k(x) = 0 if Q(k, x) ≥ M(k, x);  otherwise Z_k(x) = arg max_{z ∈ Ξ} { Q(k, x + z) + κ(x, z) }.    (8)

The action map Z_k gives rise to the action region S_k := { x ∈ X : Z_k(x) ≠ 0 } where the optimal choice is to act.
Regression Monte Carlo proceeds by recursively constructing surrogates Q(k, ·) that are used to induce the respective Z_k according to (8). The inductive loop is closed by employing Z_k to define the forward pathwise reward J_{k,K}(x; Z_{k:K}). To do so, given any set of (admissible) action maps Z_{k:K}(·), we define the corresponding discrete-time controlled state process X^{x,Z} according to the Euler scheme: X_k = x and, for ℓ = k, ..., K − 1,

    X̃_{ℓ+1} = X_ℓ + µ(X_ℓ)∆t + σ(X_ℓ)∆W_{ℓ+1},    X_{ℓ+1} = X̃_{ℓ+1} + Z_{ℓ+1}(X̃_{ℓ+1}).

Note that the action is only applied at the end of the period. The respective total revenue along the path X^{x,Z} is then (setting κ(x, 0) ≡ 0)

    J_{k,K}(x; Z_{k:K}) := Σ_{ℓ=k}^{K−1} e^{−r(ℓ−k)∆t} [ π(X_ℓ)∆t + e^{−r∆t} κ(X̃_{ℓ+1}, Z_{ℓ+1}(X̃_{ℓ+1})) ] + e^{−r(K−k)∆t} φ(X_K).

On a given path, we can also record the pathwise realized impulse times τ_n and respective impulses z_n:

    τ_{n+1} := min{ t_ℓ > τ_n : Z_ℓ(X̃_ℓ) ≠ 0 },    (11)
    z_{n+1} := Z_ℓ(X̃_ℓ) for ℓ = τ_{n+1}/∆t.    (12)

Denoting by Z* the optimal action maps, the true Q-value satisfies

    Q(k, x) = π(x)∆t + e^{−r∆t} E[ J_{k+1,K}(X̃_{k+1}; Z*_{k+1:K}) | X_k = x ],

where the first map Z*_{k+1} acts on the starting state X̃_{k+1} of the sub-path. In RMC, this conditional expectation is replaced by an empirical regression of simulated pathwise rewards against the inputs. The resulting loop is initialized with V(K, x) = φ(x) and proceeds backward over k = K − 1, ..., 0. Note that in principle the entire V is superfluous. Indeed, we have generically that for any w,

    V(k, x) = E[ Σ_{ℓ=k}^{k+w−1} e^{−r(ℓ−k)∆t} ( π(X_ℓ)∆t + e^{−r∆t} κ(X̃_{ℓ+1}, Z_{ℓ+1}(X̃_{ℓ+1})) ) + e^{−rw∆t} V(k + w, X_{k+w}) | X_k = x ].

The look-ahead horizon w ∈ {1, ..., K − k} allows combining pathwise rewards based on Z with the approximate value function V taken w steps into the future [20]. We focus on the cases w = 1, w = K − k, and w fixed. The choice w = 1 is analogous to the Tsitsiklis-van Roy [32] scheme for optimal stopping: simulate one-step-ahead paths and regress π(X_k)∆t + e^{−r∆t} V(k + 1, X_{k+1}) against X_k. The choice w = K − k is analogous to the Longstaff-Schwartz [28] scheme, where we regress the full future rewards-to-go on the interval {k, k + 1, ..., K} against X_k. These choices are not numerically identical, because the fitted action maps Z_ℓ differ from the optimal Z*_ℓ due to approximation error, so the regression responses are generated under slightly sub-optimal controls. In all examples below, we utilize the Longstaff-Schwartz version with w = K − k, so that V is never computed until the very end.
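To make the backward recursion concrete, here is a minimal self-contained sketch of the Longstaff-Schwartz-style loop (w = K − k) in Python; the paper's actual implementation is in R via mlOSP. The GBM dynamics, power reward π(x) = x^g/g, linear cost c_0 z + c_1, the polynomial surrogate, and all numerical values are illustrative assumptions, and the intervention operator is maximized by brute-force grid search rather than a dedicated optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy instance (assumed, not from the paper):
# GBM state, running reward pi(x) = x^g / g, linear impulse cost c0*z + c1 < 0
mu, sig, dt, r = 0.05, 0.3, 0.1, 0.1
K_steps, N = 20, 400
g, c0, c1 = 0.5, -1.0, -2.0
z_grid = np.linspace(0.5, 50.0, 100)       # candidate impulse sizes z > 0

def pi(x):
    return x**g / g

def step(x):                                # exact GBM transition over dt (no impulse)
    zn = rng.standard_normal(x.shape)
    return x * np.exp((mu - sig**2 / 2) * dt + sig * np.sqrt(dt) * zn)

def fit(x, y, deg=5):                       # stand-in surrogate: polynomial regression
    c = np.polyfit(x / 90.0, y, deg)        # rescale inputs for conditioning
    return lambda xx: np.polyval(c, xx / 90.0)

def act(Qf, x):                             # feedback map Z(x): grid search over M(k, x)
    xz = x[:, None] + z_grid[None, :]
    vals = np.where(xz <= 90.0, Qf(xz) + c0 * z_grid + c1, -np.inf)
    j = np.argmax(vals, axis=1)
    m = vals[np.arange(x.size), j]
    return np.where(m > Qf(x), z_grid[j], 0.0)   # impulse only if M > Q

Q_hat = [None] * K_steps
for k in range(K_steps - 1, -1, -1):        # backward induction over time steps
    x0 = rng.uniform(1.0, 90.0, N)          # space-filling training design D_k
    y = pi(x0) * dt                         # reward earned over [k, k+1)
    x = step(x0)
    for l in range(k + 1, K_steps):         # roll forward under fitted maps Z_{k+1:K}
        z = act(Q_hat[l], x)
        disc = np.exp(-r * (l - k) * dt)
        y += disc * np.where(z > 0, c0 * z + c1, 0.0)  # impulse cost, kappa(x,0)=0
        x = x + z
        y += disc * pi(x) * dt
        x = step(x)
    Q_hat[k] = fit(x0, y)                   # regress reward-to-go on inputs (w = K-k)

q0 = Q_hat[0](np.array([30.0, 60.0]))       # fitted continuation values at k = 0
```

The terminal condition is φ ≡ 0 here; swapping in other dynamics, rewards, or surrogates only requires changing `step`, `pi`, and `fit`.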

Algorithm
The proposed RMC approach reduces optimal impulse control to a double sequence of probabilistic function approximation tasks. The primary task entails fitting a functional approximator Q based on empirical simulations, utilizing a statistical model to capture the observed input-output relationship. To do so, we define a regression model and the training set used as its input. The three ingredients are the inputs x^{1:N} ∈ X, the outputs y^{1:N}, and the approximation class H_k. The inputs are the sampled states at step k. The outputs are viewed as random realizations Y(x) of the pathwise reward starting at (k, x); specifically, they are the realizations of J_{k,K}(x; Z_{k:K}) along a set of independent paths x^{(k),n} that start with x^{(k),n}_k = x^n. Statistically, these y^{1:N} are linked to the inputs by the observation model

    Y(x) = Q(k, x) + ε(x),    E[ε(x)] = 0.    (15)

Given a training collection D = x^{1:N}, henceforth called the simulation design, we collect the simulation outputs y^{1:N} = Y(x^{1:N}) and then obtain the approximate continuation value Q(k, ·) (viewed as a statistical object, rather than, say, a vector of numbers) as the empirical L² minimizer in the given function space H. Namely, we minimize the penalized mean squared error from the observations,

    Q(k, ·) = arg min_{f ∈ H} (1/N) Σ_{n=1}^N ( y^n − f(x^n) )² + λ ‖f‖²_H.    (16)

The summation in (16) is a measure of closeness of f to the data, while the right-most term penalizes the fluctuations of f to avoid over-fitting.
Indexing everything by the time steps k = 1, ..., K and allowing for time dependence, we summarize the following notation:
• N_k: number of training inputs at step k;
• D_k: simulation design, i.e. the collection of training inputs at step k;
• y^{1:N_k}: pathwise samples of reward-to-go, used as the responses in the regression model.
Equipped with the above, Algorithm 1 presents the overall scheme, which abstracts from the regression module for fitting Q(k, ·) and subsequently M(k, ·). The algorithm matches the mlOSP template from [29] as implemented in the eponymous R package.
The output of Algorithm 1 is the approximate action maps Z_k(·). Once computed, they induce the expected reward

    E[ J_{0,K}(x; Z_{0:K}) | X_0 = x ],    (17)

which can be evaluated over an out-of-sample set of test scenarios. Thus, we compute the sample average reward across a fresh set of forward controlled paths, where (τ_m, z_m) are the pathwise impulse times and impulse amounts, see (11)-(12). Note that the resulting V(0, x) is an unbiased estimator of E[ J_{0,K}(x; Z_{0:K}) | X_0 = x ], and the latter is a lower bound on the true optimal expected reward, so that the reported value is biased low.

Algorithm 1 Regression Monte Carlo for Impulse Control based on mlOSP template.
Require: K = T/∆t (time steps), (N_k) (simulation budget per step), w (path lookahead)
1: for k = K − 1, K − 2, ..., 0 do
2:   Generate training design D_k := (x^{(k),n})_{n=1}^{N_k}
3:   Set y^{1:N_k} ← 0  // pathwise rewards
4:   for n = 1, ..., N_k do  // pathwise controlled trajectories
5:     Simulate x^{(k),n}_ℓ forward over ℓ = k+1, ..., (k+w) ∧ K
6:     At each ℓ, evaluate the intervention value m^{(k),n} ← M(ℓ, x^{(k),n}_ℓ) and the continuation value q^{(k),n} ← Q(ℓ, x^{(k),n}_ℓ)
7:     if m^{(k),n} > q^{(k),n}, apply the impulse Z_ℓ(x^{(k),n}_ℓ)  // impulse
8:     Accumulate the discounted pathwise reward into y^n
9:   end for
10:  Fit the surrogate Q(k, ·) by regressing y^{1:N_k} on x^{(k),1:N_k} as in (16)
11: end for
12: return the surrogates Q(k, ·) and the induced action maps Z_k(·), k = 0, ..., K − 1

Relation to Stationary Impulse Control
The cited analytical works consider the infinite-horizon problem of solving for

    v(x) := sup_{A ∈ A} E[ ∫_0^∞ e^{−rs} π(X^{0,x,A}_s) ds + Σ_n e^{−rτ_n} κ(X^{0,x,A}_{τ_n−}, z_n) ].

Assuming time-stationary dynamics for X^∅, the optimal strategy is also time-stationary, meaning that there is a feedback action map Z*(x) that is independent of t and fully characterizes the impulses.
In contrast, the solution constructed above is explicitly time-dependent, as it is specified by the Q_k, which are intrinsically distinct for different k. Nevertheless, when far from the horizon K, the time-dependence should intuitively be weak, and we expect to recover the time-stationary Z*(x). Indeed, informally, if we parameterize in terms of time-to-maturity ℓ = K − k and set Ṽ_ℓ(x) := V(K − ℓ, x), then Ṽ_ℓ does not depend on the horizon K itself, since computing V(K − ℓ, ·) for any K corresponds to running the backward RMC algorithm for K − (K − ℓ) = ℓ rounds. Thus, Ṽ_ℓ(·) is well defined and the regime ℓ → ∞ corresponds to being far from the terminal condition, so that we expect Ṽ_ℓ → v(·) as ℓ → ∞.
The above perspective suggests that for k small, we can validate our Q(k, ·) by comparing with v(·). Conversely, we may approximate v(·) by Q(k, ·; K) for k small and K large. To speed up that convergence, we recall model predictive control, where during forward scenario generation one uses Z_k (rather than Z_ℓ, ℓ = k + 1, ..., K) for all time-steps. In other words, we compute expected reward based on a time-stationary control that is derived from the latest (in the sense of backward induction) surrogate Q(k, ·). The resulting J_{k,K}(x; Z_k) captures the reward over K − k steps and its expectation is a good approximation of v(x) for K large. Using model predictive control in Algorithm 1 is analogous to a policy iteration search for infinite-horizon problems; it reinterprets K as the receding horizon depth.

Implementation
In general, there is a huge range of potential statistical models for empirically fitting Q(k, ·). Thus, any statistical learning framework could be applied; see the mlOSP package, which allows the use of more than a dozen different regression modules, linking to the vast library of R regression packages, from random forests to support vector machines. However, the fact that Q is needed to evaluate M imposes requirements on what makes a good surrogate. For example, piecewise models (like a random forest, multivariate adaptive regression splines (MARS), or a hierarchical linear model) tend to be inappropriate, as they lead to discontinuities in defining Z_k(x) and hence unstable schemes due to error back-propagation. Similarly, polynomial bases might be problematic, since their gradient tends to be highly oscillatory, leading to unstable behavior in Z_k(x). Overall, we seek smooth regression models with an interpretable gradient.
Our two main proposals are smoothing splines (SS) and Gaussian processes (GPs). Splines intrinsically target C² fits and tend to be highly robust to noisy data. Their main limitation is poor scalability, but otherwise they are a great default choice in 1 or 2 dimensions. Gaussian processes yield smooth functional interpolators that work well with non-uniform training designs. Moreover, despite being non-parametric, GPs yield analytic gradients.
Both SS and GPs consider (16) for a certain smoothing parameter λ ≥ 0 and a Reproducing Kernel Hilbert Space (RKHS) H. The representer theorem implies that the minimizer of (16) has an expansion in terms of the RKHS eigen-functions,

    f(x) = Σ_{n=1}^N α_n C(x, x^n) + Σ_j β_j N_j(x),    (18)

where the N_j span the null space of H. Note that this is a non-parametric fit, since it involves the sum over the data-driven C(·, x^n), n = 1, ..., N.
Smoothing splines: Thin-plate splines take the RKHS H_TPS as the set of all twice continuously differentiable functions, with penalty ‖f‖²_{H_TPS} = ∫ |f''(x)|² dx. As λ → ∞, the optimization in (16) penalizes any curvature and ultimately reduces to the linear fit Q(x) = β_0 + β_1 x. Indeed, the null space of H_TPS consists of affine functions, cf. the second term in (18). A common parametrization for the smoothing parameter λ is through the effective degrees of freedom statistic df_λ; one may also select λ adaptively via cross-validation or Maximum Likelihood Estimation (MLE) [25, Chapter 5]. The respective eigenfunctions are C_TPS(x, x') = |x − x'|² log |x − x'|, and optimization of (16) gives a smooth C² surrogate that has the explicit form

    Q(x) = β_0 + β_1 x + Σ_{n=1}^N α_n |x − x^n|² log |x − x^n|.

See [27] for an implementation of RMC via splines.
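A smoothing spline fit with an accessible gradient takes only a few lines; the sketch below uses scipy in Python (the paper's spline fits use R's smooth.spline via mlOSP), and the test function sin(3x), the noise level, and the smoothing target are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)

# Noisy observations of an assumed smooth test function on [0, 1]
x = np.linspace(0.0, 1.0, 200)
noise_sd = 0.1
y = np.sin(3.0 * x) + noise_sd * rng.standard_normal(x.size)

# Smoothing level: target residual sum of squares ~ N * noise variance
spl = UnivariateSpline(x, y, k=3, s=x.size * noise_sd**2)

q_fit = spl(0.5)                  # smooth C^2 surrogate value at x = 0.5
q_grad = spl.derivative()(0.5)    # its gradient, available in closed form
```

The closed-form derivative is what makes splines attractive for the gradient-based impulse search discussed in Section 3.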
Gaussian Processes: GPs start with a positive definite kernel c(x, x'), which defines the function space H_C, and take λ = 1/2. The corresponding norm ‖·‖_H has a spectral decomposition in terms of differential operators [33, Ch. 6.2]. An intuitive interpretation is that GPs find Q(·) by applying Gaussian conditioning equations to the training data (x^{1:N}, y^{1:N}). To do so, a GP regression (GPR) model specifies the covariance function c(x, x') and a mean function m(x), assumed for simplicity to be constant, m(x) ≡ β_0. The GP estimate is then

    Q_GP(x) = β_0 + c(x, x^{1:N}) (C + σ²I)^{−1} (y^{1:N} − β_0 1),    (20)

where I is the N × N identity matrix, 1 is the N-vector of ones, and C is the N × N covariance matrix described through the kernel function, C_{i,j} = c(x^i, x^j; ϑ). The parameter σ² comes from the observation noise in (15), interpreted as i.i.d. Gaussian with the respective variance, and is inferred along with the rest of the GP hyperparameters.
A GPR model is implemented by fitting the hyperparameters ϑ governing the covariance kernel, the mean function, and the observation noise. The user first specifies a parametric family and then optimizes, typically through a nonlinear MLE procedure. The GP kernel c(x, x') controls the smoothness (in the sense of differentiability) of Q_GP and hence the roughness of its gradient. A popular choice for c(·, ·) is the (anisotropic) squared exponential (SE) family, parametrized by the lengthscales len_j and the process variance σ²_p:

    c(x, x') = σ²_p exp( − Σ_j (x_j − x'_j)² / (2 len²_j) ).    (22)

The SE kernel (22) yields infinitely differentiable fits Q(k, ·) and has hyperparameters ϑ := (len, σ²_p, σ²). Other popular kernels include those from the Matérn family.
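The Gaussian conditioning in (20) with the SE kernel (22) can be sketched directly in a few lines of numpy; the paper's GP fits use R packages such as DiceKriging, and the 1-D test function and fixed hyperparameters below are illustrative assumptions (in practice ϑ would be fitted by MLE).

```python
import numpy as np

def se_kernel(a, b, ls=1.0, sp2=1.0):       # squared exponential covariance
    d = a[:, None] - b[None, :]
    return sp2 * np.exp(-d**2 / (2.0 * ls**2))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 40)               # training inputs
y = np.sin(x) + 0.05 * rng.standard_normal(x.size)

s2 = 0.05**2                                # observation-noise variance (assumed known)
b0 = y.mean()                               # constant mean function beta_0
C = se_kernel(x, x)                         # N x N covariance matrix
alpha = np.linalg.solve(C + s2 * np.eye(x.size), y - b0)

def q_gp(x_star):                           # posterior mean: the GP surrogate Q_GP
    return b0 + se_kernel(np.atleast_1d(x_star), x) @ alpha
```

Because `alpha` is precomputed once, predicting at new inputs is a single matrix-vector product, which matters when M(k, ·) must be evaluated on every forward path.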
Remark: Artificial Neural Networks (ANNs) with smooth activation functions could be another appropriate framework, facilitating training via back-propagation. Machine learning libraries like TensorFlow provide facilities for fitting ANNs, as well as for efficiently differentiating them in order to evaluate ∂_x Q as in the next section. Those libraries are native to Python, rather than R, and are not yet supported in the current version of mlOSP.

Approximating the Intervention Function
The computation of Z_k(x) is embedded deep in Algorithm 1 and drives the outputs y^{1:N_k} used to fit Q. In this section I discuss how that piece of the solver should be implemented. The base implementation is to directly solve the intervention optimization M(k, x) = sup_{z ∈ Ξ} { Q(k, x + z) + κ(x, z) } by calling an optimization sub-routine. The objective function is given implicitly in terms of the object Q(k, ·), so ostensibly a general-purpose, gradient-free optimizer may be needed. Given that M(k, ·) has to be evaluated repeatedly on each forward path emanating from each training input x^{(k),n}, this is the major computational bottleneck. To overcome it, several efficiencies can be exploited.
First, one may speed up the computation by using a gradient-based optimizer. This requires specifying not just Q(k, ·) but also ∂_x Q(k, ·) in an explicit functional way. The latter is available for several types of surrogates, including splines and GPs. For the latter, we recall that given a fitted GP model Q(k, ·), its gradient forms another GP, with the respective mean at input x* being

    g*(x*) = ∂_{x*} c(x*, x^{1:N}) (C + σ²I)^{−1} (y^{1:N} − β_0 1).    (23)

Thus, the gradient of the surrogate is g*(x*) in (23), which can be interpreted as formally differentiating the expression in (20) with respect to x. For example, for the SE kernel (22) we have:

    ∂_{x*} c(x*, x^n) = − ((x* − x^n)/len²) c(x*, x^n).

Second, one may exploit specific features of the problem setting. As a foremost example, I now discuss the common case where κ(x, z) is linear in z, namely κ(x, z) = c_0 z + c_1 for some constants c_0, c_1. In this situation, the optimization defining M(k, x) simplifies considerably. Indeed, the first-order condition reduces to searching for the "global" impulse target S*_k:

    ∂_x Q(k, S*_k) + c_0 = 0.

In particular, the target level S*_k is independent of the current state x, and moreover can be determined by a single root search on the gradient of the value function. This drastically simplifies and stabilizes the numerics, since we just need to determine S*_k once, and can then immediately compute

    M(k, x) = Q(k, S*_k) + c_0 (S*_k − x) + c_1.

Third, when computing M(k, x^{1:N_k}) for each training input, one can record and save the resulting optimal impulse amounts z^{1:N_k}_k. Then in subsequent calls, instead of again solving for M(k, x') at some new x' by re-running the optimizer, one may train a separate functional representation Ẑ(k, ·) based on the dataset (x^{1:N_k}, z^{1:N_k}_k). Thus, we could build an auxiliary surrogate Ẑ(k, ·) (e.g. through another GP) and then use Ẑ(k, x') instead of arg max_z { Q(k, x' + z) + κ(x', z) }. This substitutes the prediction Ẑ(k, ·), which is typically much faster to compute, for calling the optimizer.
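For the linear-cost case, the single root search for S*_k can be sketched as follows, using the analytic SE-kernel gradient of a GP surrogate; the concave test surface Q(x) ≈ −(x − 2)², the fixed hyperparameters, and the cost slope c_0 = −0.5 are illustrative assumptions, for which the first-order condition ∂_x Q(S*) + c_0 = 0 gives S* = 1.75.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
ls, sp2, s2 = 0.8, 4.0, 0.01                # fixed GP hyperparameters (assumed)
x = np.linspace(0.0, 4.0, 60)
y = -(x - 2.0)**2 + 0.05 * rng.standard_normal(x.size)   # noisy concave Q-values

def se(a, b):                               # squared exponential kernel
    d = a[:, None] - b[None, :]
    return sp2 * np.exp(-d**2 / (2.0 * ls**2))

b0 = y.mean()
alpha = np.linalg.solve(se(x, x) + s2 * np.eye(x.size), y - b0)

def grad_q(xs):                             # analytic GP gradient: differentiate the kernel
    xs = np.atleast_1d(np.asarray(xs, dtype=float))
    dk = -((xs[:, None] - x[None, :]) / ls**2) * se(xs, x)
    return dk @ alpha

c0 = -0.5                                   # linear impulse cost kappa(x, z) = c0*z + c1
# First-order condition grad Q(S*) + c0 = 0, solved by a scalar root search
S_star = brentq(lambda s: grad_q(s)[0] + c0, 0.5, 3.5)
```

A single `brentq` call per time step replaces thousands of per-path optimizer calls, which is the source of the speed-up described above.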

Training Designs
To train the regression surrogate, the user must supply the simulation design(s) D_k. See [30] for a detailed description of various options for training optimal stopping emulators within mlOSP. With impulse control, the major difference is that (X_k) is no longer autonomous. Thus, it no longer makes sense to construct D_k ∼ p(X_k) as a sample from the uncontrolled dynamics. Instead, I propose to directly specify D_k, building on the idea that the quality of Q reflects the geometry of D_k: one learns best in regions where the training samples lie. Since the optimally controlled (X*_k) typically has a stationary distribution (modulo time-dependence imposed by the finite horizon), one may select a training region based on a prior guess of the latter. For example, one may choose a hyper-rectangle D and then set D_k as a finite, space-filling sample from D, yielding D_k that looks like a sample from a uniform density on D. Some of the ways to achieve this are:
• Direct specification of D_k, e.g. as a fixed lattice seq(a,b,by=∆x);
• Probabilistic sampling of D_k, either using i.i.d. Uniforms, a variance-reduced variant thereof (e.g. stratified sampling), or a Latin Hypercube sampling method (package lhs in R);
• Generation of D_k from a (scrambled) low-discrepancy sequence (LDS), such as Sobol (package randtoolbox in R); note that in this case D_k is deterministic.
In mlOSP, this choice is specified by the qmc.method field, which supports LHS (the default) and various LDS.
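These design choices have direct analogues outside R; for instance, a hypothetical two-dimensional training region D = [1, 90]² can be filled with Latin Hypercube or scrambled Sobol points via scipy, mirroring the lhs and randtoolbox R packages mentioned above.

```python
import numpy as np
from scipy.stats import qmc

lo, hi = np.array([1.0, 1.0]), np.array([90.0, 90.0])   # training region D (assumed)

# Latin Hypercube design with 128 sites
lhs = qmc.LatinHypercube(d=2, seed=0).random(n=128)
D_lhs = qmc.scale(lhs, lo, hi)

# Scrambled Sobol low-discrepancy design with 2^7 = 128 sites
sob = qmc.Sobol(d=2, scramble=True, seed=0).random_base2(m=7)
D_sob = qmc.scale(sob, lo, hi)
```

Either array can then serve directly as the step-k design D_k, regenerated or reused across time steps.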
Beyond targeting a uniform density of training samples on a given region, one may also take a non-uniform D_k that preferentially places more training inputs in some parts of X. For example, we may put more x_k's in the region where we expect the impulse target to be, in order to improve the quality of Q there, and hence the quality of M. The underlying intuition is that learning is achieved through balancing exploration (sampling a diverse collection of x_k's) and exploitation (sampling x_k's that are likely to be encountered on forward controlled paths).
A further training option that I highlight is replication. A replicated design is akin to a Monte Carlo forest, in the sense that some training inputs appear multiple times. In the most common batched design, we have N_unique distinct sites, the so-called macro-design, and each unique x^n is then repeated N_rep times, so that the total training budget is N_k = N_unique · N_rep, where the superscripts now index unique inputs. The corresponding simulator outputs are denoted y^{1,1}, y^{1,2}, ..., y^{n,i}, ..., y^{N_unique,N_rep}.
A replicated design allows pre-averaging the corresponding y-values, ȳ^n := (1/N_rep) Σ_{i=1}^{N_rep} y^{n,i}, and then calling the regression module on the reduced dataset (x^{1:N_unique}, ȳ^{1:N_unique}). Replication with pre-averaging offers a simple way of reducing the variance of the observations. This is often desirable because the pathwise rewards tend to be highly volatile, especially over longer periods of time, and many functional approximators struggle in low signal-to-noise settings. With a high degree of replication, one can view ȳ^n as almost deterministic, so that regression effectively reduces to interpolation.
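A replicated design with pre-averaging is only a few lines; the macro-design size, replication count, and noisy test response below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n_unique, n_rep = 50, 40                        # macro-design size and batch size
x_unique = np.linspace(0.0, 1.0, n_unique)      # unique design sites
x = np.repeat(x_unique, n_rep)                  # full replicated design, N = 2000

# Noisy simulator output: smooth signal plus heavy observation noise
y = np.sin(2 * np.pi * x) + rng.standard_normal(x.size)

# Pre-average the replicates: variance shrinks by a factor of n_rep
y_bar = y.reshape(n_unique, n_rep).mean(axis=1)
```

The regression module is then called on the reduced dataset (x_unique, y_bar), whose noise standard deviation is 1/√40 ≈ 0.16 instead of 1.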
Remark: As mentioned, since the entire algorithm proceeds step by step, D can depend on k. Similarly, one can straightforwardly incorporate all types of time-dependency in the dynamics, costs, etc. Such a generalization is conceptually trivial and requires only careful encoding of further k-dependent objects.

Illustration
To illustrate the overall workflow of solving an optimal impulse problem, I present a few brief code snippets. These utilize the mlOSP constructs and can be directly reproduced by any reader who installs the package. Consider impulsing a 1-D Geometric Brownian Motion (GBM) process with uncontrolled dynamics dX_t = µ X_t dt + σ X_t dW_t and scalar parameters µ, σ, x_0. Thus, (X^∅_t) can be simulated exactly by sampling from the respective log-normal distribution. The running payoff is of concave power-type, π(x) = x^γ/γ, 0 < γ < 1, and the intervention costs are linear, κ(x, z) = c_0 z + c_1. This setup is motivated by Federico et al. [23], who considered irreversible investment with fixed adjustment costs. The state process (X_t) represents an economic indicator, such as the production capacity of a firm, which drives the revenue rate π(X_t).
As discussed above, linear impulse costs yield an optimal strategy of (s, S) type: intervene as soon as (X_t) goes below s and bring it back up to S > s. Thanks to the linearity of GBM and of κ(x, ·), and the power form of π(x), the infinite-horizon problem is known to have an explicit solution ṽ. Thus, with infinite horizon the controlled (X_t) is a time-stationary renewal process and undergoes cyclical behavior with renewal times τ_{n+1} = τ_n + inf{ t > 0 : X^{τ_n,S}_t ≤ s } and impulses z_{n+1} = S − s.
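The renewal behavior under an (s, S) rule is easy to simulate forward; the sketch below estimates the expected discounted reward by Monte Carlo. The GBM parameters, the thresholds s = 10 and S = 40, and the cost constants are all illustrative assumptions, not the optimized values from the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative parameters (assumed): GBM with negative drift so X decays toward s
mu, sig, dt, r = -0.05, 0.25, 0.05, 0.1
gam, c0, c1 = 0.5, -1.0, -0.5                 # pi(x) = x^gam/gam, kappa = c0*z + c1
s_lvl, S_lvl = 10.0, 40.0                     # hypothetical (s, S) thresholds
K_steps, n_paths = 200, 2000                  # horizon T = 10 years, MC paths

x = np.full(n_paths, S_lvl)
total = np.zeros(n_paths)
n_imp = 0
for k in range(K_steps):
    disc = np.exp(-r * k * dt)
    total += disc * (x**gam / gam) * dt        # running revenue over [k, k+1)
    x = x * np.exp((mu - sig**2 / 2) * dt
                   + sig * np.sqrt(dt) * rng.standard_normal(n_paths))
    hit = x <= s_lvl                           # (s, S) rule: impulse up to S when below s
    z = np.where(hit, S_lvl - x, 0.0)
    total += disc * np.where(hit, c0 * z + c1, 0.0)
    x = x + z
    n_imp += int(hit.sum())

v_est = total.mean()                           # lower-bound estimate of the value at X_0 = S
```

For any fixed (s, S) pair this sample average is a lower bound on the optimal value, so a crude two-dimensional search over (s, S) recovers the classical benchmark against which the RMC solver can be compared.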
To implement this instance, one starts by defining the model, a list containing the parameters that determine the dynamics of (X^∅_t) in (1), along with the payoff and cost specifications. Next, we must choose a solver scheme, namely specifying the surrogate type and the training sets. In mlOSP, the solver implementing Algorithm 1 is named osp.impulse.control and comes with a method field that controls the surrogate and an input.domain field that controls the simulation designs. The code below utilizes a (cross-validated) spline surrogate that automatically sets the number of knots. For input.domain I utilize a non-uniform lattice that is dense for x ∈ [1, 18] (the region of the action set) and less so for x ∈ [18, 90]. I also employ replication, with each input replicated batch.nrep=40 times.
For the terminal condition I take the expected value of all future running rewards given the current state and no further impulses, i.e.

    φ(x) = E[ ∫_T^∞ e^{−r(s−T)} π(X^∅_s) ds | X_T = x ] = C x^γ,

where C is from (27). This is interpreted as T being the horizon for actions; thereafter X evolves autonomously, per the uncontrolled dynamics, without any further controls.
spl.solver <- osp.impulse.control(modelFRT, input.domain = input.dom, method = "cvspline")

Note that no output is printed: the produced object spl.solver contains an array of 99 fitted smoothing spline surrogates for Q(k, ·), one for each time step except at maturity. Technically, spl.solver is a list that includes a few other diagnostics beyond the collection of smooth.spline objects.
Figure 1 visualizes the fitted Q from three time steps. To do so, we simply predict the Q-value object over a collection of test locations. In the Figure, this is done for k = 1, 30, 60. The middle panel shows the gradient ∂_x Q(k, ·), obtained by finite-differencing, at the same time steps k.
To better understand the resulting strategy Z_{0:K}, we build an independent database of forward controlled paths. This is done via the forward.impulse.policy command, which evaluates (17) and also records all the associated actions (impulse amounts and times). The right panel of Figure 1 shows two different controlled forward paths based on the computed Z_{0:K}. On the blue path there are 3 impulse times τ_n; on the purple one, only two. One can clearly see the (s, S) policy, where (X^Z_t) is impulsed whenever it gets too low and is then brought up to the target level S*_k ≈ 60.
Below we also show code to plot the impulse strategy, namely the impulse boundary s*_k and the impulse target levels S*_k, displayed in Figure 2. Note that the shown s*_k is based on the forward paths, so at some times k there is no recorded s*_k, since none of the forward paths was impulsed at that specific k (impulses are not that frequent). The Figure shows the time-stationary s, S values, confirming the good approximation by the present solver. One can also observe the time-dependence, which manifests itself through the agent becoming impatient as the problem horizon is approached. As a result, impulses are applied sooner (lower timing value, i.e., lower opportunity cost of acting) and we see a "boundary layer" as k → K. The slight fluctuations observed in s*_k, S*_k are due to Monte Carlo-driven approximation errors and can be decreased with a larger simulation budget.
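Assembling such a boundary plot from recorded forward-path impulses amounts to a group-by over time steps. A sketch with a hypothetical log layout (k, pre-impulse state, post-impulse state); steps with no recorded impulses are simply absent, as noted in the text:

```python
import numpy as np
from collections import defaultdict

def empirical_boundaries(impulse_log):
    """impulse_log: list of (k, x_pre, x_post) records from forward paths.
    Returns per-step estimates: trigger s*_k as the largest pre-impulse
    state observed, target S*_k as the mean post-impulse state."""
    pre, post = defaultdict(list), defaultdict(list)
    for k, x_pre, x_post in impulse_log:
        pre[k].append(x_pre)
        post[k].append(x_post)
    s_star = {k: max(v) for k, v in pre.items()}
    S_star = {k: float(np.mean(v)) for k, v in post.items()}
    return s_star, S_star
```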

Case Studies
In this section I present two more case studies that showcase other problem settings and further features available in mlOSP.

Faustmann Problem of Forest Rotation
For the next example of an impulse control problem amenable to Algorithm 1, I take up the problem of forest management, as nicely summarized and analyzed by Alvarez and co-authors [4,6]. Let X_t represent the value of the forest stand, i.e. the economic value of existing timber. Forest growth is modeled as a stochastic process: timber growth is uncertain and fluctuates due to weather, precipitation and other environmental effects. Droughts or insect infestations might reduce the forest stand, justifying the use of stochastic differential equations for modeling (X_t).
The controller aims to maximize total profit from cutting down and selling timber on a given time horizon T. This is achieved through carrying out a sequence of so-called forest rotations. At each rotation, the forest stand is cut down and sold. The number of rotations is stochastic and up to the forest manager. The horizon T represents the lease term for the timberland, measured in years. The objective functional is then

  sup_{(τ_n, z_n)} E[ Σ_{n: τ_n ≤ T} e^{−r τ_n} κ(X_{τ_n−}, z_n) + e^{−rT} φ(X_T) ],

where the payoff function κ captures the value of selling z_n timber, net of the cutting costs, and r is the intertemporal discount rate. Above we assume zero salvage value at T, φ(X_T) ≡ 0. As might be expected, the timber is cut at some time-dependent threshold S*_k.

In the classic formulation, the problem is stated on an infinite horizon, the dynamics of X_t are linear and time-homogeneous, and the post-rotation level X_{τ_n} ≡ x is pre-specified. This admits an explicit solution, see [4], who showed that the action region is {x > S*} where S* is the solution of a certain nonlinear equation. The finite-horizon version is analyzed in [12] and I reproduce their example where (X_t^∅) is a standard arithmetic Brownian motion, r = 0.1, and there is a fixed cost for each rotation, κ(x, z) = (z − 1)_+. Thus, the forest is always cut down to the nominal level x = 0, so that z_n = X_{τ_n−}. The reference impulse boundary is S* = 1.84. I take a horizon of T = 5 years and time steps of ∆t = 0.1, yielding K = 50 periods. For the regression, I utilize a Gaussian process emulator with the squared-exponential kernel (22) and hyperparameters fitted via maximum likelihood estimation, as done in the DiceKriging package. I then use the exact surrogate gradient based on (23) to efficiently find S*_k in (8). For the training simulation design, the particular structure of this formulation implies that it is important to obtain an accurate estimate of Q(k, 0). To this end, I train in a Monte Carlo forest-like fashion, employing a high degree of replication, so that each unique input has multiple forward paths emanating from it. Namely, I take 100 unique inputs on a lattice in [−0.25, 2.5], each replicated 100 times, for a total of 10,000 forward training paths.
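The exact-gradient idea — differentiating the GP posterior mean in closed form instead of finite-differencing — can be sketched for a squared-exponential kernel. This is a minimal Python stand-in for the DiceKriging machinery; the lengthscale and nugget below are illustrative constants, not fitted by MLE as in the text:

```python
import numpy as np

def gp_fit(X, y, ell=0.5, sn2=1e-4):
    """1D GP regression with squared-exponential kernel; returns the weight
    vector alpha so that mean(x) = sum_i alpha_i * k(x, X_i)."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    alpha = np.linalg.solve(K + sn2 * np.eye(len(X)), y)
    return alpha

def gp_mean_and_grad(x, X, alpha, ell=0.5):
    """Posterior mean and its exact gradient: d/dx k(x, X_i) has closed
    form, so no finite differencing is needed."""
    kx = np.exp(-0.5 * (x - X)**2 / ell**2)
    mean = kx @ alpha
    grad = (kx * (X - x) / ell**2) @ alpha
    return mean, grad
```

Because the gradient is analytic, root-finding for the threshold where the surrogate gradient crosses a given level is both fast and noise-free.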
The left panel of Figure 3 effectively reproduces Figure 1 in Belak et al. [12]; in contrast to that paper, where the impulse boundary is obtained from the solution of an integral equation and requires specific assumptions on the impulse cost function and the dynamics of (X_t), my method is completely generic and can be trivially re-run if any of the ingredients change. In Figure 3 we clearly observe the effect of the terminal condition: the manager cuts down even a small stand of forest ahead of the deadline T, since any timber remaining at T yields no profit whatsoever. One also notes that the estimated impulse boundary lies below that of the infinite-horizon problem. This arises due to (i) restricting actions to take place in T_k, here with separation ∆t = 0.1, which is known to induce the manager to act sooner; and (ii) the finite horizon, whose effect remains non-negligible with T = 5.

Two Dimensional Capacity Expansion
In the 2-dimensional version of capacity expansion, the state processes are the production price (P_t) and the current capacity C_t. The price is exogenous and stochastic, while the capacity is fully endogenous and deterministic. We refer to [13] and [24], who provided explicit solutions (up to solving an integral equation) in some special cases.
The price follows a geometric Brownian motion dP_t = µP_t dt + σP_t dW_t, and between impulses the capacity undergoes deterministic exponential decay (aging) at rate δ:

  dC_t = −δ C_t dt,

with profit function π(P_t, C_t) = P_t C_t^α for α < 1, representing the decreasing efficiency of adding more capacity. We identify the above with a two-dimensional state X_t ∈ R²_+, where the impulses z ∈ R_+ only affect the second coordinate: an impulse of size z leads to P_{τ_n} = P_{τ_n−} and C_{τ_n} = C_{τ_n−} + z. The generalizations needed to handle the two coordinates are straightforward in code and are effectively abstracted away by the package as far as the user is concerned.
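A single period of these dynamics, with the impulse hitting only the capacity coordinate, can be sketched as follows; applying the impulse after the decay step within a period is a discretization convention of this sketch, and the default parameter values echo those quoted later in this section:

```python
import numpy as np

def step(p, c, z, mu=0.0, sigma=0.08, delta=0.1, dt=0.1, seed=None):
    """One period of the 2D state: exact GBM update for the price,
    exponential decay for capacity, and an impulse z that affects
    only the capacity coordinate (the price is untouched)."""
    rng = np.random.default_rng(seed)
    p_next = p * np.exp((mu - 0.5 * sigma**2) * dt +
                        sigma * np.sqrt(dt) * rng.standard_normal())
    c_next = c * np.exp(-delta * dt) + z
    return p_next, c_next
```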
Following [24] I consider concave investment costs κ((p, c), z) = z^β with α < β < 1. The concavity of κ encourages making large investments. The quantity Y_t := P_t C_t^{α−β} can be interpreted as the firm's return on assets (ROA) and affords a dimension reduction in the case of log-linear dynamics as above. Indeed, [24] shows that with these choices and infinite horizon, V(p, c) = c^β v(c^{α−β} p) where v(y) = B_0 y + B_1 y^γ for some explicit constants B_0, B_1, γ, and the optimal impulses are of (s, S)-type in Y_t, rather than in P_t, C_t separately. Guthrie [24] considers the parameter values r = 0.04, µ = 0, σ = 0.08, δ = 0.1, β = 0.95, α = 0.5, which give the ROA threshold y_0 = 0.224. Moreover, the respective optimal impulse is to increase capacity by 178; this happens on average once every 11 years. In this setup, the problem is strongly non-stationary: (P_t) is autonomous and its range spreads over time, while C_t is impulsed upwards. Consequently, one must select time-dependent simulation designs D_k, lest the forward paths end up extrapolating, rather than interpolating, Q.
For the terminal condition we set φ(p, c) = E[∫_0^∞ e^{−rs} π(X_s) ds | X_0 = (p, c)] = pc^α · 1/(r − µ). There is no simple way to summarize the resulting solution; Figure 4 displays a few optimally controlled paths, as well as the impulse target map (namely the arg max of M(k, ·)) on the training set D_k, here chosen to be a Sobol low-discrepancy sequence. We note that with the nonlinear impulse costs, the impulse target depends nontrivially on both coordinates, increasing both in the price p and in the current capacity c.
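Treating capacity as frozen at its current level c, so that only the GBM price is random and E[P_s] = p e^{µs}, the identity φ(p, c) = pc^α/(r − µ) can be sanity-checked by Monte Carlo; this sketch uses the parameter values quoted above:

```python
import numpy as np

def phi_mc(p, c, alpha=0.5, r=0.04, mu=0.0, sigma=0.08,
           T_tail=100.0, dt=0.1, n_paths=2000, seed=1):
    """Monte Carlo estimate of E[ int_0^inf e^{-rs} P_s c^alpha ds ] with
    capacity held at c; the integral is truncated at T_tail."""
    rng = np.random.default_rng(seed)
    n = int(T_tail / dt)
    P = np.full(n_paths, float(p))
    disc = np.exp(-r * dt * np.arange(n))
    total = np.zeros(n_paths)
    for k in range(n):
        total += disc[k] * P * c**alpha * dt       # left-Riemann sum
        P *= np.exp((mu - 0.5 * sigma**2) * dt +
                    sigma * np.sqrt(dt) * rng.standard_normal(n_paths))
    return total.mean()

est = phi_mc(1.0, 1.0)   # closed form gives 1 * 1 / 0.04 = 25
```

The small downward bias relative to 25 comes from the left-Riemann discretization and the tail truncation, both of which shrink as dt → 0 and T_tail → ∞.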

Figure 1: A 1D capacity expansion instance. Left: Q-value based on a smoothing spline emulator at k ∈ {1, 30, 60}. Middle: corresponding gradient ∂_x Q(k, x); the target level S*_k is the threshold where ∂_x Q(k, S*_k) = 1. Right: two resulting controlled paths of X^Z. Dots indicate the impulse times τ_n.

Figure 2: Threshold boundary s*_k (dots towards the bottom) and target levels S*_k (line towards the top) for the irreversible investment case study.

Figure 3: Left: estimated impulse boundary Ŝ*_k as a function of time step k; the infinite-horizon threshold S* is shown as a dashed line. To smooth out minor numerical artifacts we also display a smoothed version of Ŝ*_k as a blue curve. Right: value function V(k, 0) at x = 0.

Figure 4: Two-dimensional finite-horizon impulse control problem inspired by [24]. Left: 4 controlled trajectories of X_t = (P_t, C_t); capacity C_t decays exponentially without impulses. Right: impulse target (p, c) → c + Z_k(p, c) for a representative time step k.