Estimation for dynamical systems using a population-based Kalman filter – Applications in computational biology

Estimation of dynamical systems (in particular, identification of their parameters) is fundamental in computational biology, e.g., pharmacology, virology, or epidemiology, to reconcile model runs with available measurements. Unfortunately, when the data are sparse or corrupted by noise, the mean and variance priors of the parameters must be chosen very carefully to balance our distrust of the measurements; otherwise, the identification procedure fails. One option is to use repeated measurements collected in configurations with common priors (for example, with multiple subjects in a clinical trial or clusters in an epidemiological investigation). This shared information is beneficial and is typically modeled in statistics using nonlinear mixed-effects models. In this paper, we present a data assimilation method that is compatible with such a mixed-effects strategy without being compromised by the potential curse of dimensionality. We define population-based estimators through maximum likelihood estimation. We then develop an equivalent robust sequential estimator for large populations, based on filtering theory, that integrates the data sequentially. Finally, we limit the computational complexity by defining a reduced-order version of this population-based Kalman filter that clusters subpopulations with common observational backgrounds. The performance of the resulting algorithm is evaluated against classical pharmacokinetics benchmarks, and the versatility of the proposed method is then tested in an epidemiological study using real data on the hospitalisation of COVID-19 patients in the regions and departments of France.


Introduction
Mathematical models, such as systems of ordinary differential equations (ODE) or even stochastic differential equations (SDE) to account for additional uncertainties, can be used to understand the evolution of biological processes. For example, in pharmacology they can describe the pharmacokinetics (PK) of drugs in plasma [53], in virology the viral kinetics during HIV treatment [39], or in epidemiology the recent evolution of the COVID-19 epidemic [29,48].
When using these models in practical applications, a major difficulty is dealing with the many uncertain quantities that must be prescribed for the models to be truly predictive. These quantities include the initial conditions or parameters of the model, which are difficult to measure directly. Fortunately, collected data provide additional information that can be used to circumvent the uncertainties associated with the definition of the dynamical system.
Estimation strategies for these models can be viewed as deterministic inverse methods, where we wish to recover the individual parameters that led to particular observations. The solution of this inverse problem then typically leads to an ODE-constrained optimization problem. Alternatively, it can be thought of as a statistical estimation procedure with a variety of techniques based on maximum likelihood optimization, for example, but not limited to, Expectation-Maximization-type algorithms. In this landscape, deterministic or stochastic sequential estimators such as Kalman-based filters, which consist in correcting the original dynamics as the data are collected, are particularly attractive when they are affordable in terms of computational complexity [2,4,19,30,49].
Unfortunately, whatever method is chosen, the estimation procedure may fail to be robust when measurements are too sparse or noisy, unless strong priors on the parameters are available. One way to compensate for this underlying weakness in observability is to take advantage of the fact that multiple independent measurements have been collected in a configuration where the model variables describe a statistical unit of interest belonging to a larger population in which variability is constrained. This is typically the assumption in nonlinear mixed-effects models [26,28,52], where the use of available population data improves the identifiability of individual parameters through a coupled maximization of the likelihood function. Many algorithms and corresponding software packages have been developed using this principle. Among others, we can mention the First Order Conditional Estimation (FOCE) algorithm implemented in NONMEM [43], the Stochastic Approximation of Expectation-Maximization (SAEM) algorithm proposed in MONOLIX [25], penalized maximum likelihood approaches available in NIMROD [46], or even full Markov chain Monte Carlo (MCMC) methods implemented in WinBUGS [54], JAGS [14], or Stan [7]. Most of these estimation algorithms and software packages have been compared in previous work and shown to achieve similar performance for identifiable models of intermediate size, such as models in pharmacokinetics [16,32,44,46]. However, increasing the size of the underlying dynamics without increasing the complexity cost of the algorithm too much remains a challenge.
In this work, we propose an alternative strategy inspired by the data assimilation methods generally developed for the environmental sciences [2], which are often confronted with the complexity burden associated with large-scale systems. Our strategy is to capture all population information in a unified maximum likelihood estimation procedure. Assuming Gaussian disturbances up to a change of variables, we approximate the corresponding optimization problem with a Kalman-based filter. More specifically, we choose the Unscented Kalman Filter [22], but other Gaussian filters [2] could also lead to similar results. The potential curse of dimensionality caused by the population framework is then limited by covariance reduction techniques, such as those proposed in the Reduced-Order Unscented Kalman Filter [33]. Note that this is not the first time that Kalman-based approaches have been used in mixed-effects strategies. This is the case in [24,36,50], where a specific combination of the FOCE method with the extended Kalman filter was proposed, and in [12], where a coupling of the SAEM algorithm with the extended Kalman filter is performed for ODE models, and in [36] for SDE models. However, in this literature, the extended Kalman filters are used only to approximate the individual probability distribution function, whereas in this paper we use the Kalman approach at the population level as well.
To evaluate our approach, we propose a series of numerical tests starting from a classical toy problem in pharmacokinetics. First, a classical synthetic data benchmark is considered to validate our implementation. Then, a validation is performed on a real data set to evaluate the performance of the algorithm against the literature. We consider the COVID-19 pandemic to be a very general use case, since it requires aggregating multiple data collections from different locations in space using SIR-based models [37]. More specifically, we assume that the data collected in France at the regional (or departmental) level are independent realizations of the same underlying COVID-19 epidemic driven by random parameters specific to each region (or department), which requires a population approach to estimate them. This numerical experiment is a real challenge for our algorithm and illustrates its versatility.
The paper is organized as follows. Section 2 formalizes the estimation problem and introduces the notation. Section 3 introduces the population-based Kalman filter and its reduced-order version. In Section 4, we provide numerical experiments and an evaluation of the performance, before a final conclusion is drawn in Section 5.

Model formulation
Consider a population of N_P subjects. For each subject 1 ≤ i ≤ N_P, we denote by x^i(t) ∈ X ⊂ R^{N_x}, t ≥ 0, the system state (i.e. a vector containing all time-dependent variables of the system) and by θ^i ∈ P ⊂ R^{N_θ} the vector containing all time-independent parameters of the system. The dynamics over the time window [0, T] is modeled by a vector field f ∈ C^1(X) that represents a deterministic evolution law, such that for each subject

ẋ^i(t) = f(x^i(t), θ^i, t),  t ∈ [0, T].

The parameters of interest θ^i can be treated as state variables by expanding the state dimension and remembering that θ̇^i = 0. We therefore denote by z^i = (x^i, θ^i) the augmented state, and for 1 ≤ i ≤ N_P the dynamics reads

ż^i(t) = (f(x^i(t), θ^i, t), 0),  t ∈ [0, T].   (2.1)

Therefore, (2.1) is a general dynamics combining time-dependent and time-independent variables. In practice, we will ultimately rely on a discretization of the dynamics (2.1). We restrict ourselves here to a forward Euler time scheme, but it is quite easy to consider more general time schemes. We should solve numerically with a time step δT_n = T_{n+1} − T_n chosen small enough to satisfy the conditional stability constraint of the scheme. In the remainder of the paper, which focuses on the discrete-time formulation, we assume that there exists a discrete transition operator φ_{n+1|n} that yields the state x^i_{n+1} from knowledge of the state x^i_n, namely

x^i_{n+1} = φ_{n+1|n}(x^i_n, θ^i),

where here φ_{n+1|n}(x^i_n, θ^i) = x^i_n + δT_n f(x^i_n, θ^i, T_n). Note that more complex time schemes will lead to different definitions of the transition operator, but (2.5) is a fairly general discrete-time dynamics. As shown before, we define from φ_{n+1|n} a joint state–parameter transition operator ψ_{n+1|n} such that z^i_{n+1} = ψ_{n+1|n}(z^i_n). (2.4)
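As a sketch, the joint state–parameter transition operator ψ_{n+1|n} under the forward Euler scheme can be written as follows; the scalar vector field used in the example is a toy model of ours, not one from the paper:

```python
import numpy as np

def psi_step(z, f, t, dt, n_x):
    """One forward Euler step of the joint state-parameter transition operator.

    z   : augmented state (x, theta), with x = z[:n_x] and theta = z[n_x:]
    f   : vector field f(x, theta, t) returning dx/dt (length n_x)
    The parameter block is copied unchanged, encoding d(theta)/dt = 0.
    """
    x, theta = z[:n_x], z[n_x:]
    x_next = x + dt * f(x, theta, t)        # phi_{n+1|n}(x_n, theta)
    return np.concatenate([x_next, theta])  # psi_{n+1|n}(z_n)

# Toy example: scalar exponential decay dx/dt = -theta * x
f = lambda x, theta, t: -theta * x
z = np.array([1.0, 0.5])                    # x0 = 1, theta = 0.5
for n in range(100):
    z = psi_step(z, f, t=n * 0.01, dt=0.01, n_x=1)
```

After 100 steps of size 0.01, the state block approximates exp(-0.5) while the parameter block is untouched, illustrating the augmented-state convention.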

Uncertainties modeling
In the spirit of [30, Chapter 2], we introduce some stochasticity into the discrete-time formulation (2.3) and consequently into (2.4). In fact, we complete (2.3) by introducing an additive modeling error contribution through an operator G ∈ L(R^{N_ν}, R^{N_x}), so that for all 1 ≤ i ≤ N_P the state sequence is now a Markov chain defined by the random map

x^i_{n+1} = φ_{n+1|n}(x^i_n, θ^i) + G_n ν^i_n.

Note that we typically have the scalings G_n ∝ δT_n, due to the discretization, and Tr(W^i_n) ∝ 1/δT_n [17], to match a potential continuous-time limit. Consequently, upon completion of (2.4), there exists a joint operator B_n such that for the joint state–parameter sequence we have the random map

z^i_{n+1} = ψ_{n+1|n}(z^i_n) + B_n ν^i_n.

In addition to the stochasticity introduced in the dynamics, we assume that the initial state and the parameters of interest are subject to some stochasticity in the population under consideration. We have z^i_0 = z_0 + ξ^i, where z_0 is a deterministic known quantity and ξ^i, 1 ≤ i ≤ N_P, are random variables. A population-based approach typically assumes that the N_P members of the population (called subjects in this paper) are drawn at random from the same distribution, in this case a Gaussian distribution where ξ_pop is the population mean and P_0 is the associated covariance; we equivalently write ξ^i ∼ N(ξ_pop, P_0). In mixed-effects models, ξ_pop is called the population intercept, while ξ^i is the random effect. Normally, neither ξ_pop nor P_0 is known. However, we can assume that ξ_pop is bounded as a deterministic quantity with respect to a norm defined by a given positive matrix M, which is assumed to be known. Moreover, we have at our disposal an a priori value for P_0.
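The population assumption ξ^i ∼ N(ξ_pop, P_0) can be sketched by sampling a synthetic population; the dimensions and covariance values below are illustrative choices of ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

N_z, N_P = 3, 20                            # augmented-state dimension and population size
z0_bar = np.array([1.0, 0.5, 0.1])          # known deterministic part of the initial state
xi_pop = np.zeros(N_z)                      # population intercept (fixed effect)
P0 = np.diag([0.2**2, 0.25**2, 0.1**2])     # covariance of the random effects

# Subject-specific random effects xi^i ~ N(xi_pop, P0), one row per subject
xi = rng.multivariate_normal(xi_pop, P0, size=N_P)
z0 = z0_bar + xi                            # z0^i = z0_bar + xi^i
```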

Observations
Our goal is to estimate the unknown parts of the initial conditions in our population using the available data. In practice, we do not observe x^i directly, but we do have time-sampled observations (or measurements) y^i ∈ Y ⊂ R^{N_obs}. These observations y^i are related to x^i by the measurement process, which is modeled by introducing an observation operator h_k depending on the time t_k ∈ [0, T], such that

y^i_k = h_k(x^i(t_k)) + χ^i_k,

where χ^i_k represents an additive measurement error, typically Gaussian, with W^i_k the covariance matrix of the observation errors. Note that, to define a discrete-time white noise consistent with a potential continuous-time limit, the trace of W^i_k scales as the inverse of the sampling interval. For simplicity in introducing the method formalism, the number of observation times N_T^{obs} + 1 will not be subject dependent. However, in the results section we will explain how to account in practice for subject-dependent sampling.
Again, we can rewrite the system using the augmented state, namely y^i_k = c_k(z^i(t_k)) + χ^i_k. We note that, for simplicity of exposition, the model discretization and the observation discretization are consistent: max_i(δt^i) is small enough that N_T ≥ N_obs, and there exists a sequence of integers (n_k)_{0≤k≤N_obs} such that for every observation time t_k, 1 ≤ k ≤ N_obs, we have T_{n_k} = t_k.
Note that when we combine the augmented dynamics and the observations, we find that there is a nonlinear mapping Γ such that

y^i_k = Γ_k(z^pop_0 + ξ^i, (ν^i_n)_{0≤n≤N_T−1}) + χ^i_k,   (2.7)

where χ^i_k is a Gaussian noise, ξ_pop is the fixed effect, and ξ^i and (ν^i_n)_{0≤n≤N_T−1} are random effects, as in classical mixed-effects models [28]. Even if the model dynamics are linear and the observation operator is linear, the dependence of the dynamics on the parameters ultimately makes Γ a nonlinear operator with respect to z^pop_0 and ξ^i. Thus, to summarize the problem, our goal is to estimate ξ_pop, ξ^i, (ν^i_n)_{0≤n≤N_T−1} and P_0 from (y^i_k)_{1≤i≤N_P, 0≤k≤N_T^{obs}} given the relation (2.7), benefiting (from a data assimilation point of view) from the dynamic structure behind Γ.

A population-based likelihood objective function for Gaussian disturbances
Let us first aggregate all the subject uncertainties in a vector ξ = (ξ^1 ⋯ ξ^{N_P}) of dimension N_P × N_z. Similarly, we denote (ν_n)_{0≤n≤N_T−1} = ((ν^1_n ⋯ ν^{N_P}_n))_{0≤n≤N_T−1}. Throughout the rest of the presentation, bold notation will indicate quantities gathering all the subjects together.
In a classical Maximum Likelihood Estimation (MLE) strategy with Gaussian disturbances, the MLE should be obtained by minimizing a negative log-likelihood of the form (3.1), where all (z^i_n)_{0≤n≤N_T}, 1 ≤ i ≤ N_P, are forced to follow the dynamics (2.6). Here, this minimization includes the population intercept ξ_pop. To simplify the minimization problem, we note that the population intercept represents the expected mean of all the subject uncertainties. We therefore propose to model ξ_pop as an empirical mean over all subjects, namely

ξ_pop ≈ (1/N_P) Σ_{i=1}^{N_P} ξ^i.

The minimization problem then becomes (3.2), where the unknowns of our population optimization procedure reduce to ξ and (ν_n)_{0≤n≤N_T−1}. In fact, the functional in (3.2) depends on fewer unknowns than the one in (3.1), which means that our coupling reduces the minimization space, which is already quite large, namely R^{N_P×N_z} × L^2([0, T], R^{N_P×N_z}). The objective function can be rearranged by developing the first two terms. We then introduce the matrix M_0 of dimension (N_P × N_z)^2, defined by blocks, where δ_ij denotes the Kronecker delta, equal to 1 when i = j and 0 otherwise. Using the Kronecker product ⊗, with 1_{N_P} = (1 ⋯ 1) ∈ R^{N_P} and I_{N_P} denoting the identity matrix on R^{N_P}, the matrix M_0 can be rewritten in a form that enforces its invertibility. Note that the matrix N_P^{-2} (1_{N_P} 1_{N_P}^⊤) ⊗ M can be small (in the sense of its Frobenius norm Tr(M^2)^{1/2}) and can therefore be interpreted as a regularization term with respect to the second matrix. Indeed, if Tr(M) is small with respect to Tr(P_0^{-1}), our criterion means that the estimation is strongly supported by a shared population variable ξ_pop with small variations ξ^i around it. Conversely, a Tr(M) that is large with respect to Tr(P_0^{-1}) in a large population (N_P ≫ 1) leads to an estimation where all members of the population are decoupled.
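The Kronecker-product structure of M_0 can be sketched numerically. The centering-matrix form of the second term below is our reconstruction of the blockwise definition (obtained by substituting the empirical mean into the quadratic criterion; the paper's exact second term may differ), and it illustrates why the intercept term enforces invertibility:

```python
import numpy as np

N_P, N_z = 4, 2                               # small illustrative sizes
M = 0.1 * np.eye(N_z)                         # prior norm on the population intercept
P0_inv = np.linalg.inv(0.2**2 * np.eye(N_z))  # inverse covariance of the random effects

ones = np.ones((N_P, N_P))
C = np.eye(N_P) - ones / N_P                  # centering matrix from the empirical-mean substitution

# Coupled population matrix: the (rank-deficient) centering term is regularized
# by the intercept term, which is what makes M0 invertible.
M0 = np.kron(ones, M) / N_P**2 + np.kron(C, P0_inv)
```

The first term acts only on the mean direction that the centering term annihilates, so the sum is symmetric positive definite whenever M is.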
Ultimately, our estimator is defined as the minimizer of the least-squares objective function (3.2), rewritten in the general form (3.5). We have gathered all the individual states in z = (z^1 ⋯ z^{N_P}), which must follow the discrete-time "population" dynamics (3.6), and the measurements are concatenated in y = (y^1 ⋯ y^{N_P}), with an observation operator c_n and a measurement noise χ_k = (χ^1_k ⋯ χ^{N_P}_k) of corresponding covariance. Minimizing (3.5) under the constraint imposed by the dynamics (3.6) becomes a classical data assimilation problem [2,30]. The key to our uncertainty modeling is that P_0 (constructed from (3.3)) couples the population members, in contrast with the block-diagonal covariance P^un_0 of (3.7), which treats them independently. In the upcoming numerical section, the effects of considering P_0 rather than P^un_0 will be illustrated by comparing our population estimation strategy with a completely decoupled approach where each subject's dynamics is estimated separately.

Corresponding sequential estimator
Minimization of the functional (3.5) under the constraint of a dynamics of the form (3.6) can be naturally performed using variational data assimilation methods [2,5,9]. We denote by z̄_{n|N_T}, 0 ≤ n ≤ N_T, the trajectory associated with the minimization of L_T. It is also well known that a sequential estimation approach can equivalently give a recursive formulation of the so-called optimal sequential estimator defined by ẑ_n = z̄_{n|n}. This is the Kalman estimator in the case where all operators are linear (namely, for all n, ψ_{n+1|n}, c_n and B_n are linear) [4,23,30,49]. Given a nonlinear model operator or a nonlinear observation operator, it is classical to rely on an approximate optimal sequential estimator based on a generalization of the Kalman filter to nonlinear operators [30,49]. Here, we have chosen to use the Unscented Kalman Filter (UKF), where the mean and covariance operators are computed from the empirical mean and empirical covariance based on N_σ sample points (the so-called sigma points [22]). For a given set of positive coefficients α = (α_j)_{1≤j≤N_σ} and a given set of particles (v^{(j)})_{1≤j≤N_σ}, we define the empirical mean and covariance by

E_α[v] = Σ_j α_j v^{(j)},  Cov_α[v] = Σ_j α_j (v^{(j)} − E_α[v])(v^{(j)} − E_α[v])^⊤.

Remark 3.1. Different choices of sigma points can be used in practice, from canonical sigma points (N_σ = 2N_P × N_z), which are aligned with the canonical basis, to star sigma points (N_σ = 2N_P × N_z + 1), in which the origin is added to the canonical points, or simplex sigma points; see [22,34] for more details. In the numerical part, we will use canonical sigma points.
Let us now introduce a set of positive weight coefficients α = (α_j)_{1≤j≤N_σ} with Σ_{j=1}^{N_σ} α_j = 1, and a set of unitary sigma points (e^{(j)})_{1≤j≤N_σ} ∈ ((X × P)^{N_P})^{N_σ} with prescribed empirical mean and covariances, where δ is a scaling factor. The UKF algorithm consists of an initialization step, a correction step applied when T_n ∈ {t_0, …, t_{N_obs}}, and a sampling step (3.8), and it reduces to the classical Kalman filter when all operators (ψ_{n+1|n}, c_n and B_n) are linear. Note here that each sigma point is a combination of individual contributions in the population. We now recall (see [4,19,49] for in-depth presentations) that the resulting computed state vector z^+_n approximates the conditional expectation, z^+_n ≈ E(z_n | y_0, …, y_n), and that the positive definite matrix P^+_n is an approximation of the corresponding estimation error covariance, P^+_n ≈ Cov(z_n − z^+_n | y_0, …, y_n). Concerning the estimated parameters, we can consequently extract from each member of the population a covariance of the parameter estimation error. At the final time, averaging over the members of the population and taking the square gives an estimation of P_0 given the available data. Taking the square root of the diagonal entries of this matrix then yields an estimation of the standard deviation of the random effects for each estimated parameter.
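The weighted empirical mean and covariance operators at the heart of the UKF can be sketched as follows; the example uses canonical sigma points in dimension 2 with equal weights, and its normalization is illustrative rather than the paper's exact scaling:

```python
import numpy as np

def empirical_mean_cov(particles, alpha):
    """Weighted empirical mean and covariance over sigma points.

    particles : array of shape (N_sigma, d), one sigma point per row
    alpha     : positive weights summing to 1
    """
    mean = alpha @ particles
    centered = particles - mean
    cov = centered.T @ (alpha[:, None] * centered)
    return mean, cov

# Canonical sigma points for d = 2: +/- unit vectors, equal weights 1/(2d)
d = 2
e = np.vstack([np.eye(d), -np.eye(d)])
alpha = np.full(2 * d, 1.0 / (2 * d))
m, C = empirical_mean_cov(e, alpha)   # mean 0, covariance I/d
```

In an actual UKF step, the particles would be the propagated sigma points and a scaling factor would restore the target covariance.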
Additionally, ξ_pop can be seen as providing an update of M_θ, the parameter part of the introduced norm M, to be used in a subsequent estimation procedure.
Remark 3.2 (Initial condition estimation through smoothing). One classical drawback of Kalman estimation is that a quantity of the form E(z_0 | y_0, …, y_n) is not directly available. When considering the parameters, we circumvent this limitation by reusing the dynamics, which imposes θ^i_n = θ^i_0. However, this is not the case for E(x^i_0 | y_0, …, y_n), as the dynamics is not necessarily solvable backward in time. Such quantities can nevertheless be computed by considering Kalman smoothing instead of Kalman filtering [49]. In practice, this can be achieved by augmenting the subject state with a copy of its initial condition. The corresponding initial covariance should be modified accordingly, with two block lines identical since the two variables are fully correlated at the initial time. During the UKF algorithm, we can then extract from the last diagonal block of P^+_n an estimation of E(x^i_0 | y_0, …, y_n).

Remark 3.3 (Fading memory). It is common to rely on a so-called fading-memory effect when using a sequential estimator, in order to give greater emphasis to more recent data and, by contrast, to limit the risk of being hampered by the prediction history [49, Section 7.4]. In this respect, we introduce a forgetting factor 0 ≤ ρ ≤ 1 and modify the covariance prediction accordingly. In fact, such a modification can be interpreted as changing the functional so that past discrepancy terms are progressively down-weighted.

Remark 3.4 (Nonlinear transformation of state variables). In many cases, the state variable x^i_n modeled for each subject i at iteration n does not follow a Gaussian distribution. However, it can often be modeled as a nonlinear transformation of a Gaussian variable, as is frequently proposed in nonlinear mixed-effects models [28,35]. This is the case when one considers a variable that takes non-negative values and can be modeled as a Gaussian variable after a log transformation.
Similarly, a variable that takes bounded values can be modeled as a Gaussian variable after a logit transformation. It is then not difficult to extend the UKF to manage variables that follow a Gaussian distribution up to a given transformation. Introducing a general mapping ϕ that transforms the augmented state into Gaussian variables, we modify the initialization, correction, and sampling steps of the classical UKF algorithm accordingly (3.10). Note that these transformations also apply to the model parameters. Moreover, when B_n ≠ 0, this approach is only valid if the modeling error acts on the transformed variable ϕ(z) and not on z. This may mean that the columns of B_n must be transformed accordingly. We refer to Section 4.2 for an enlightening example. In addition, such a transformation could also be used to transform the measurement space to change how observational errors are perceived by the UKF, for example from a Gaussian distribution to a Poisson distribution using the Anscombe transformation. Finally, we would like to emphasize that transforming the variables before each Kalman analysis step shares common features with other transform-based Kalman-like algorithms such as [3].
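As a sketch, the elementwise transforms mentioned above (log for positive variables, logit for bounded ones) and their inverses, which a transformed UKF would apply before and after each analysis step:

```python
import numpy as np

# Elementwise transforms mapping constrained variables to Gaussian-like ones
log_fwd = np.log                          # positive variable  -> unbounded
log_inv = np.exp
logit = lambda u: np.log(u / (1 - u))     # variable in (0,1)  -> unbounded
expit = lambda v: 1 / (1 + np.exp(-v))

# The transformed-UKF idea: work on phi(z) in Gaussian space, then map back.
theta = np.array([0.5, 0.02])             # positive rates, e.g. absorption/elimination
g = log_fwd(theta)                        # Gaussian-space representation
theta_back = log_inv(g)                   # invert after the Kalman analysis
```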
Remark 3.5 (Iterative Kalman procedure). In the presence of strong nonlinearity, we know that the Kalman filter is only an approximation of the MLE, which is often overcome in the literature by iterating the Kalman filter a small number of times [19, Section 6.1]. In a case where we estimate the parameters and the initial condition by a smoothing procedure, an iterative procedure can easily be set up by considering multiple loops in which the new a priori comes from the final a posteriori of the previous estimation. Such a procedure is very similar to the Bayesian approach described in [45].
Remark 3.6 (Alternative filtering strategies). We rely here on the Unscented Kalman Filter, but we could also have proposed an equivalent strategy using an Extended Kalman Filter (EKF). The advantage of the UKF in this work is the ease of integration and implementation of the nonlinear transformation of variables, since the UKF can be seen as a derivative-free alternative to the EKF [2]. In addition, we could have used an Ensemble Kalman Filter [18], which would have provided greater sampling latitude without incurring the complexity of a fully justified particle filter. Our choice was in fact guided by the convenience of having at our disposal a reduced-order version of the UKF, which further limits the computational complexity of our strategy, as described in the next section.

A reduced-order version of the sequential optimal estimator
The computation of the full covariance matrices P^+_n and then P^−_{n+1} can be prohibitive when the number of unknowns, parameters, and subjects increases. In this section, we therefore propose an uncertainty reduction strategy. Assuming that P^−_n is of reduced rank (typically much smaller than the dimension N_P × N_z of the space), the main idea of reduced-order filtering strategies [34,40,42] is to maintain the covariance matrices in the factorized form

P^±_n = L_n (U^±_n)^{-1} L_n^⊤,

with U^+_n and U^−_{n+1} invertible matrices of small size that condense the system uncertainties, and L_n an extension operator. What is crucial here is to be able to perform all computations on L_n and U^±_n without storing P^±_n. In this work, we follow the path of our unscented Kalman strategy and use the Reduced-Order Unscented Kalman Filter (RoUKF) introduced in [34]. In fact, we only need to define the initial reduced covariance matrix U_0 and the initial extension matrix L_0 from the initial covariance matrix P_0.
In this work, we propose one possible strategy (among others, see Remark 3.7) for defining U_0 and L_0. We choose here a clustering approach applied to the observation sequences using the k-means algorithm. More precisely, the k-means algorithm is applied to the matrix gathering, row by row, the observation sequence of each subject. This clustering allows us to separate the N_P subjects into N_C clusters. For 1 ≤ s ≤ N_C, we denote by c_s the s-th cluster and by N_{c_s} the number of individuals belonging to the s-th cluster.
We then define the reduced initial covariance matrix accordingly (3.13). However, such a definition is not acceptable as such, since it constrains the dynamics of all the subjects of a given cluster to be identical. As the k-means algorithm is based on the squared Euclidean distance metric, we can quantify how much each subject i ∈ [1, N_P] belongs to a cluster c_s, s ∈ [1, N_C], by computing a corresponding membership weight.
Therefore, we prefer to replace (3.13) by the weighted version (3.14).

Remark 3.7 (Other reduction techniques). We propose here a practical definition of L_0, U_0 based on k-means, but alternative strategies could be followed, for example hierarchical clustering, or a principal component analysis of the observation matrix as performed in the Proper Orthogonal Decomposition approach presented in [10].
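The clustering step can be sketched as follows. The k-means implementation below is a minimal one of ours, and the inverse-distance membership weights are one plausible choice built on the same squared Euclidean distances (the paper's exact weight formula is not reproduced here):

```python
import numpy as np

def kmeans(Y, n_clusters, n_iter=50, seed=0):
    """Minimal k-means on the rows of Y (one row per subject's observation sequence)."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
        labels = d.argmin(1)
        for s in range(n_clusters):
            if np.any(labels == s):
                centers[s] = Y[labels == s].mean(0)
    d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1), centers, d

# Two well-separated synthetic groups of observation sequences (illustrative)
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0.0, 0.1, (10, 5)), rng.normal(3.0, 0.1, (10, 5))])
labels, centers, d = kmeans(Y, n_clusters=2)

# Soft membership weights: how much subject i belongs to cluster s (rows sum to 1)
w = 1.0 / (d + 1e-12)
w /= w.sum(axis=1, keepdims=True)
```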
We conclude this section by presenting our reduced estimator algorithm based on the Reduced-Order Unscented Kalman Filter (RoUKF) [34]. We emphasize that in this algorithm the number of sigma points N^r_σ is much smaller than the number N_σ used in the full UKF approach (3.8). Therefore, the associated unitary sigma points e^{(j)}_r generate a subspace of small dimension. Moreover, we here complement [34] with several features. First, we integrate the nonlinear transformation introduced in Remark 3.4. Second, we add the impact of the model error operator, which must be projected onto the reduced subspace generated by the columns of L_n to preserve the decomposition (3.11) over the time iterations. In this respect, we follow [40,41] using the orthogonal projector Π_n = L_n (L_n^⊤ L_n)^{-1} L_n^⊤ (i.e. Π_n^2 = Π_n and Π_n^⊤ = Π_n). Third, the forgetting factor 0 ≤ ρ ≤ 1 that we introduced in the UKF algorithm is preserved in the RoUKF algorithm, by analogy with [40,41]. We finally obtain the initialization, correction, and sampling steps of Algorithm (3.15).

Remark 3.8 (Fisher information matrix). An additional advantage of the reduced-order formulation is that it is based on a square-root formulation of Kalman filtering, which allows easy computation of the Fisher information matrix. Indeed, Λ_n is the sensitivity matrix of the observation errors with respect to the state. Therefore, computing the corresponding normal matrix, taking the parameter block, and finally averaging over the members of the population gives an estimation of the Fisher information matrix, as computed in other unscented Kalman filter works [6].
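A quick numerical check of the orthogonal projector used to map the model error onto the reduced subspace (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3
L = rng.normal(size=(n, r))            # extension operator, full column rank

# Orthogonal projector onto the subspace spanned by the columns of L
Pi = L @ np.linalg.solve(L.T @ L, L.T)

# Projector identities used in the RoUKF model-error projection
assert np.allclose(Pi @ Pi, Pi)        # idempotent
assert np.allclose(Pi, Pi.T)           # symmetric
assert np.allclose(Pi @ L, L)          # leaves the reduced subspace invariant
```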

Numerical results
In this section, we validate and illustrate our population-based Kalman filter with examples of a different nature. In Section 4.1, we consider a toy problem corresponding to a one-compartment pharmacokinetic model with first-order absorption and elimination. First, we validate our strategy on synthetic data by evaluating the performance of Algorithm (3.8) using the coupled covariance matrix P_0. We then consider the performance of Algorithm (3.15), obtained by applying the clustering algorithm to the coupled covariance matrix P_0 (reduced-order population estimation). We conclude our investigation of this toy problem by considering estimation from real data. In Section 4.2, we consider a more complex configuration arising from a real estimation problem: the early evolution of the COVID-19 epidemic at two geographic resolutions (regions, N_P = 12, and departments, N_P = 94) of the non-island French metropolitan area. The objective of the estimation is to reconstruct the impact of lockdown health policies on the transmission rate during the first months of virus spread (from March 1st to May 11th, 2020), using an extended SIR-type model [37]. We find this configuration emblematic, as it requires the use of all the components proposed in our methodology.

Model and observations
We consider a one-compartment pharmacokinetic model with first-order absorption and elimination. We denote by A_GI the amount of drug in the gastrointestinal compartment (in mg) and by A_P the amount of drug in plasma (in mg). The interactions between A_GI and A_P are described by the following system:

Ȧ_GI = −k_a A_GI,
Ȧ_P = k_a A_GI − k_e A_P,   (4.1)

where k_a corresponds to the absorption rate (in h^{-1}) and k_e corresponds to the elimination rate (in h^{-1}). In clinical practice, it is assumed that only the plasma concentration U_P = A_P / V_0 (in mg.L^{-1}) is observed at the N_T^{obs} observation times, where V_0 is the volume of the plasma compartment (in L). System (4.1) can be rewritten in explicit form; however, in this numerical part, we will solve it using the first-order Euler time scheme, since we want to validate Kalman-based filters, which are by definition written on the time-derivative part of ODE or PDE systems (or their discretizations).
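A minimal forward-Euler simulation of the one-compartment model above; the parameter values are illustrative choices of ours, not the paper's targets:

```python
import numpy as np

def simulate_pk(k_a, k_e, V0, dose, dt, n_steps):
    """One-compartment model with first-order absorption/elimination (forward Euler).

    dA_GI/dt = -k_a * A_GI
    dA_P /dt =  k_a * A_GI - k_e * A_P
    Observed plasma concentration: U_P = A_P / V0.
    """
    A_GI, A_P = dose, 0.0
    U_P = np.empty(n_steps + 1)
    U_P[0] = 0.0
    for n in range(n_steps):
        # simultaneous update: the right-hand sides use the previous values
        A_GI, A_P = (A_GI - dt * k_a * A_GI,
                     A_P + dt * (k_a * A_GI - k_e * A_P))
        U_P[n + 1] = A_P / V0
    return U_P

# Illustrative run over 10 time units with a unit dose of 100 mg
conc = simulate_pk(k_a=1.0, k_e=0.2, V0=10.0, dose=100.0, dt=0.01, n_steps=1000)
```

The trajectory rises to a peak concentration and then decays, matching the closed-form solution dose·k_a/(k_a − k_e)·(e^{−k_e t} − e^{−k_a t})/V_0 up to the O(δt) Euler error.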
As for the initial values, A_GI(0) corresponds to a known initial dose (in mg) given by oral absorption, and U_P(0) = 0. The parameters k_a, k_e and V_0 are unknown and will be estimated from the available observations. These parameters are all positive and are constrained by a log transformation as the estimation proceeds. Following the notation of Section 2 and considering a population of N_P patients, the state of each subject gathers its compartment variables, while the parameter vector gathers (k^i_a, k^i_e, V^i_0), handled through their logarithms. Note that this first example is thus based on a fully deterministic model, as is often the case in data assimilation [30, Section 2.1]. As for the measurements, the observation operator selects the plasma concentration, see (4.3). The measurements are subject to stochastic noise with constant variance W^i_k = w for each subject 1 ≤ i ≤ N_P and for each observation 0 ≤ k ≤ N_T^{obs}.
Another important aspect is the ability to handle the time sampling of the data, which is imposed in practice and can be very different from the time step of the model. In fact, the time sampling of the data may differ from patient to patient. To deal with it, we can think of two strategies: either use the data only when they are available, or rely on time interpolation. To simplify our implementation, we chose to interpolate the observations. More specifically, using the time discretization notation introduced in Section 3, we first reconstruct interpolated measurements for 1 ≤ i ≤ N_P, 0 ≤ n < N_T. These measurements are consistent with an observation operator c^k_n of the form (4.3), with an additional rescaling of the noise covariance W^i_n.
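The interpolation strategy can be sketched with np.interp, which also holds values constant outside the sampled interval; the measured concentrations below are made-up numbers for illustration:

```python
import numpy as np

# Observation times (min) and measured plasma concentrations for one subject
t_obs = np.array([30.0, 60.0, 90.0, 120.0, 180.0, 240.0, 360.0, 480.0, 600.0])
y_obs = np.array([2.1, 4.0, 5.2, 5.9, 6.1, 5.5, 3.9, 2.6, 1.7])  # illustrative values

# Model time grid (forward Euler steps), typically much finer than the sampling
t_model = np.arange(0.0, 601.0, 1.0)

# Linear interpolation of the measurements onto every model time step,
# one practical way to handle subject-dependent sampling
y_interp = np.interp(t_model, t_obs, y_obs)
```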

Validation using synthetic data
First, we evaluate the performance of Algorithm (3.8) using the coupled covariance matrix P_0 (see (3.4)) (population estimation), compared to using the diagonal covariance matrix P^un_0, see (3.7) (individual estimation). We then evaluate the performance of the reduced approach (3.15), obtained by applying the clustering algorithm to the coupled covariance matrix P_0 (reduced-order population estimation).

Data and statistical validation criteria.
We consider a sequence of simulations with different numbers of patients (N_P = 2, 20 and 100) and with different measurement errors (standard deviation of an additive Gaussian noise, w_t = 0.3 and 1). We generate 100 simulation replicates (N_R = 100) for each scenario. The observation times are 30, 60, 90, 120, 180, 240, 360, 480 and 600 minutes. The synthetic data (or simulation replicates) are generated using prescribed mean values and covariances for the initial state conditions and parameters. The covariances are diagonal, since we assume that the parameters are independent variables. For the estimation process, we set a priori values and covariances that deliberately differ from the target ones, and for the standard deviation of the measurement noise we set w = 10.
To evaluate the performance of the estimation, several statistical validation criteria can be considered. For a parameter θ, we denote by θ_t (respectively, θ_e) the matrix of size N_R × N_P that contains in each row r the target (respectively, estimated) values of θ for all patients of replicate r. From these we calculate the relative bias RBIAS and the mean square error MSE. The RBIAS describes the deviation of the mean of the estimated parameters from their true values, while the MSE summarizes both the bias and the variability of the estimates. Using P_n^i, we compute σ_θ^e, the matrix of size N_R × N_P that contains in each row r the estimated standard deviations of θ for all patients of replicate r, and we supplement the RBIAS and MSE with the mean estimated standard deviation (STD). We also calculate the relative bias of the standard deviation of the random effects, called BMIXED, where σ_θ^{e,m} is the matrix of size N_R × N_P containing in each row r the estimated mixed standard deviation of θ for all patients of replicate r, again formed using P_n^i. These indicators are completed by the coverage (COV), defined as the percentage of cases in which the true value is contained within the 95% confidence interval of the estimated values.

In contrast, as shown in Table 4.1, the population approach really strengthens the estimation: the RBIAS and MSE decrease with sample size and are dramatically smaller than for the individual approach (for N_P = 100, the RBIAS goes from 0.374 for the individual approach to −0.007 for the population approach). This is confirmed by the resulting individual trajectories in Figure 4.3 (first column), which shows the time evolution of a patient's state and parameters. We note that the STD is significantly larger than the ESTD.
This indicates an overestimation of the standard deviation of the estimates, due to the fact that the initial covariances of the estimates were chosen very roughly (3² instead of 0.2², 0.25², and 0.1² for k_a, k_e, and V_0, respectively). To obtain a more accurate covariance for the final estimate, we then iterate the estimation procedure according to Remark 3.5; each run is referred to as a loop. The population procedure is evaluated for w_t = 0.3 and N_P = 20. Table 4.2 summarizes the results for each parameter of the model. As expected, the RBIAS and MSE decrease when multiple loops are used. The second (respectively, third) column of Figure 4.3 shows the time evolution of the states and parameters of a patient for the second (respectively, third) loop. In addition, the STD gets closer to the ESTD, and the COV approaches its nominal value of 95%. We therefore recommend performing multiple estimation loops when starting from a poor a priori covariance. The estimation of the standard error of the random effect is also less biased with 3 loops, as BMIXED shows, but the bias remains substantial. This may be explained by the fact that we do not minimize the full functional (3.1), which includes the population intercept, but a reduced functional that assumes the population intercept can be taken as an empirical mean over all patients. Comparing our results with the full functional is a perspective for this work.
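The validation criteria used in this section can be sketched as follows. Since the formulas themselves are elided in this extraction, these are plausible reconstructions from the verbal definitions (RBIAS as a mean relative deviation, MSE as a mean squared deviation, STD as a mean estimated standard deviation, COV as a 95%-interval coverage rate), not the paper's exact expressions.

```python
import numpy as np

def rbias(theta_t, theta_e):
    """Relative bias: mean relative deviation of estimates from targets,
    over all replicates (rows) and patients (columns)."""
    return np.mean((theta_e - theta_t) / theta_t)

def mse(theta_t, theta_e):
    """Mean square error over all replicates and patients."""
    return np.mean((theta_e - theta_t) ** 2)

def mean_std(sigma_e):
    """Mean estimated standard deviation (STD) across replicates and patients."""
    return np.mean(sigma_e)

def coverage(theta_t, theta_e, sigma_e):
    """Percentage of cases in which the target lies in the 95% interval
    theta_e +/- 1.96 * sigma_e."""
    inside = np.abs(theta_t - theta_e) <= 1.96 * sigma_e
    return 100.0 * np.mean(inside)
```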

Reduced-order population Kalman filter validation.
The results obtained with the population Kalman filter are very encouraging, but the computational complexity can still be prohibitive, which motivates the reduced-order version of the population Kalman filter presented here. We focus on the scenario N_P = 100 and w_t = 0.3. Figure 4.4 illustrates the clustering algorithm (for N_C = 20 and N_C = 5) by plotting the observed variable U_P^{0.25} for one replicate. Table 4.3 presents the results for each parameter, and also aggregated across the 3 parameters. These estimates are supplemented by computation times obtained on a 2.3 GHz quad-core Intel Core i7 computer with a MATLAB-2020-b implementation; this gives an order of magnitude of the numerical complexity of the algorithm, albeit with a not fully optimized implementation. Focusing on the MSE, we obtain equivalent results for the full covariance approach and the reduced covariance one. As for the RBIAS, the results obtained with the reduced version remain acceptable. As expected, both standard deviation criteria (STD and ESTD) decrease with the reduced-order filter, but their values remain acceptable; the ESTD is even improved. With the reduced-order filter, the COV approaches its nominal value of 95%, and the BMIXED also improves. These improvements can be explained by the fact that the reduced-order version of the population Kalman filter increases the coupling between the subjects. This also means that the number of clusters must be chosen carefully so as not to constrain the system too much. Figure 4.5 shows the time evolution of the states and parameters of a patient obtained with the full population Kalman filter (top) and its reduced-order version with N_C = 20 (middle) and N_C = 5 (bottom). The time evolutions show similar behavior in all cases.
This means that the system is not too constrained, even though the COV for V_0 is low (while remaining in the acceptable range); V_0 is known to be the most difficult parameter of the system to estimate. Figure 4.5 also shows that increasing the coupling by reducing the number of clusters reduces the standard deviations, eliminating the need for multiple loops. Finally, computation times can be drastically reduced with the reduced-order version of the population Kalman filter: for example, by a factor of 7 between the full filter and the reduced-order version with N_C = 5.
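A minimal sketch of a clustering step that groups patients with similar observation histories is given below, here with a plain k-means on the measurement vectors as a stand-in for the paper's clustering algorithm:

```python
import numpy as np

def cluster_patients(obs, n_clusters, n_iter=50, seed=0):
    """Group patients with similar observation histories via a plain k-means
    on their measurement vectors (illustrative stand-in, not the paper's code).

    obs : array of shape (n_patients, n_times)."""
    rng = np.random.default_rng(seed)
    centers = obs[rng.choice(len(obs), n_clusters, replace=False)]
    for _ in range(n_iter):
        # assign each patient to the nearest cluster center
        d = np.linalg.norm(obs[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers (keep the old center if a cluster empties)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = obs[labels == c].mean(axis=0)
    return labels
```

The cluster labels would then define the subpopulations sharing an observational background in the reduced-order filter.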

Experiment with real data
To conclude this study on a toy problem, we propose a validation against a real data set. Theophylline is a methylxanthine drug used, under various brand names, in the treatment of respiratory diseases such as chronic obstructive pulmonary disease and asthma. Data were collected by Upton et al. on 12 subjects who received a single oral dose of A_GI(0) = 320 mg of theophylline and then gave 11 blood samples over a 25-hour period [51]. A spaghetti plot showing the plasma concentration of the drug over time is given in Figure 4.1 (b). Considering the small number of patients, only the filter for the whole population is used. The individual nonlinear estimation is implemented with the R function nls, and the population NLME model with the R-package saemix [11]. Our population-based Kalman filter yields results very similar to the population-based NLME, see Table 4.4. It is also interesting to note that the individual Kalman filter gives better results than the individual nonlinear strategy, especially in estimating V_0, the most difficult parameter to estimate, since it determines the slope of U_P and is very sensitive to observational noise. This shows that the individual Kalman filter can be efficient when the priors are better initialized, a typical advantage brought by the population approach.

COVID example
We model the evolution of the COVID-19 epidemic using an SIR-type model [37] estimated from hospitalization data. The objective of the estimation is to quantify the impact of the lockdown on the transmission rate between March 2, 2020 and May 11, 2020.

SEIRAH Model
More specifically, we use a SEIRAH model adapted from [37], in which the population of size N is divided into 6 compartments: susceptible S, latent E, ascertained infectious I, unascertained infectious A, hospitalized H, and removed R (i.e., both recovered and deceased). In the dynamics of this model, α, r_E, D_E, r_I, D_I, D_q, D_H are time-independent parameters described in Table 4.5, while t ↦ b(t) is a function of time modeling the disease transmission resulting from the public health policies followed in France during the first wave of the COVID-19 epidemic. We write b(t) in terms of a function g describing the health-policy effect on the transmission rate. For example, a transmission rate decreasing from b_init to b_end due to a lockdown policy centered around a time t_l, with an efficiency time-rate 1/τ, can typically be modeled with a logistic function. Moreover, the dynamics of multiple lockdowns can be modeled by a combination of such g-functions over successive intervals, while the lifting of a lockdown can be represented by a g-function with reversed sign.
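The logistic transmission-rate profile described above can be sketched as follows; the exact parameterization of the paper's g-function is elided in this extraction, so this particular form is an assumption:

```python
import numpy as np

def transmission_rate(t, b_init, b_end, t_l, tau):
    """Logistic decrease of the transmission rate from b_init to b_end,
    centered at the lockdown time t_l, with efficiency time-rate 1/tau
    (an assumed form of the g-function described in the text)."""
    return b_init + (b_end - b_init) / (1.0 + np.exp(-(t - t_l) / tau))
```

A combination of such functions over successive intervals, with reversed signs where needed, would model multiple lockdowns and their lifting.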
According to our methodology, we discretize the SEIRAH model using a forward Euler time scheme with a sufficiently small constant time step. The resulting discrete-time dynamical system involves the state variable x = (E, I, R, A, H) ∈ R⁵, coupled to the dynamics of b and to the parameters θ to be estimated, where f accounts for the dynamics in (4.4) and S is subsequently reconstructed using S = N − (E + I + R + A + H) in each region/department. Again, the discretization is completed by introducing model uncertainties, given for all 1 ≤ i ≤ N_P by a sequence of noise terms. Finally, recall that θ gathers all the unknown time-independent parameters to be estimated, namely here b_end and b_init, which characterize the g_i functions.
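Since the right-hand side of (4.4) is elided in this extraction, the forward Euler discretization can be illustrated with a plausible SEIRAH right-hand side consistent with the compartments and parameters of Table 4.5; the transition rates below are our assumption, not necessarily the paper's exact system:

```python
import numpy as np

def seirah_rhs(x, b, theta):
    """Assumed SEIRAH right-hand side (the paper's (4.4) is not reproduced here).
    x = (E, I, R, A, H); S is reconstructed from the conserved total N.
    r_I only enters the observation model, not these dynamics."""
    E, I, R, A, H = x
    alpha, r_E, D_E, r_I, D_I, D_q, D_H, N = theta
    S = N - (E + I + R + A + H)
    infection = b * S * (I + alpha * A) / N
    dE = infection - E / D_E
    dI = r_E * E / D_E - I / D_q - I / D_I
    dA = (1.0 - r_E) * E / D_E - A / D_I
    dH = I / D_q - H / D_H
    dR = (I + A) / D_I + H / D_H
    return np.array([dE, dI, dR, dA, dH])

def euler_step(x, b, theta, dt):
    """One forward-Euler step, as in the discretization described above."""
    return x + dt * seirah_rhs(x, b, theta)
```

By construction, the sum of the five derivatives equals the infection flux, so the total population S + E + I + R + A + H is conserved, consistent with the reconstruction of S.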
Since all state variables E, I, R, A, H are positive and bounded by N, the total population size of the department, they are, following Remark 3.4, mapped through a scaling and a logit transform. As for the variable b, we also use a weighted logit function, max_b logit(b), where max_b was set to 1.5 after a preliminary investigation; this constrains the estimated b to remain positive and smaller than max_b. Neglecting deaths, the proportion of infected individuals in the population of each region/department at a given time is obtained from the model compartments. This quantity, which can also be estimated by other methods, is used to validate our strategy.
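One common reading of the scaled logit transform described above can be sketched as follows (the exact weighting convention used by the paper is an assumption here): a variable constrained to (0, upper), with upper = N for state variables or upper = max_b for b, is mapped to an unconstrained variable and back.

```python
import numpy as np

def to_unconstrained(x, upper):
    """Logit transform of a variable constrained to (0, upper),
    e.g. upper = N for the state variables or upper = max_b for b."""
    p = x / upper
    return np.log(p / (1.0 - p))

def to_constrained(y, upper):
    """Inverse transform: maps the unconstrained estimate back to (0, upper)."""
    return upper / (1.0 + np.exp(-y))
```

The Kalman filter then operates on the unconstrained variables, which guarantees the bounds on the reconstructed quantities.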

Available Data
To proceed with the estimation of these uncertain quantities, we have at our disposal the hospitalization data, namely the prevalence and the daily incidence, denoted H^data and H^{in,data}, respectively. They are collected in the SI-VIC database, which records the COVID-19 hospitalizations of patients in French hospitals. These two quantities are related to the solutions of System (4.4) by H^data = H and H^{in,data} = (1 − r_I) I / D_q, respectively. We develop the following strategy, which avoids any a priori assumption about the shape of b:

Estimation in the 12 regions
Step 1. Estimate the parameter b_init (resp. b_end) using data obtained before the lockdown (resp. from 10 days after the start of the lockdown). The values are given in Table 4.6 and turn out to be quite similar in all regions. In this first step, the transmission rate b is assumed to be a piecewise-constant function, equal to b_init before the lockdown and to b_end after.
Step 2. Estimate the shape of b without the logistic a priori (i.e., assuming g(t) = 0 for all t), but setting b_init to the value estimated in Step 1. The results are shown in Figure 4.7 (a) for all regions. The estimated b takes the form of a logistic function, confirming our modeling choice. However, since no information about b is imposed, the model error is very large and leads to overfitting of the data: for many regions, oscillations with zero mean and a period of up to 7 days are visible.
Step 3. To assess whether these weekly oscillations merely compensate for the lack of information about b, we propose to fix g as a logistic function (with b_init and b_end resulting from Step 1, τ = 1, and t_l set to the 8th day of lockdown). The results are shown in Figure 4.7 (b). In Figure 4.7 (c), we plot the difference between the a priori logistic function b and its estimate. We can see that in the regions with a high hospitalization rate, namely Île-de-France (IDF), Grand-Est, and Bourgogne-Franche-Comté (BFC), we have b_estim(t) > b_a priori(t).
When individuals are homogeneous and mix uniformly, the effective reproductive ratio R_eff(t) is defined as the mean number of infections generated during the infectious period of a single infectious case at time t. In this model, as detailed in Appendix B, the effective reproductive ratio can be written as a function of the model parameters. Before the lockdown, the values of the effective reproductive ratio are around 3.5, comparable to the values found in the literature for the basic reproductive ratio, see [15,47]. At the end of the lockdown, the effective reproductive ratios are below 0.7, showing the efficiency of the lockdown. The proportion of infected individuals in the population of each region at the end of the first lockdown (May 11, 2020) is shown in Figure 4.8 (a). These results are very similar to those in [47], which allows us to validate our strategy.

Estimation in the 94 departments
Let us now move to a more refined population and consider the departments. Here we apply only Step 3, using the estimated mean values of b_init and b_end obtained in the estimation over the 12 regions. We thus obtain the proportion of infected individuals in the population at the department level, as shown in Figure 4.8 (b). These results, obtained with the Kalman filter for the entire population, are then compared with those obtained with a reduced-order version with N_C = 24. Figure 4.8 (d) shows that the estimated proportions of infected individuals are very close to the values obtained with the full population Kalman filter, while the computation time is divided by 5: 40 minutes for the full filter versus 8 minutes for its reduced-order version, on a 2.3 GHz quad-core Intel Core i7 computer with a sequential MATLAB-2020-b implementation of the algorithm. We also note that the clusters (see Figure 4.8 (c)) do not fully match the partition into regions, as the epidemic in France follows a north-east to south-west gradient. Our reduced-order strategy therefore allows better results at the department scale than an estimation at the region scale, at a comparable computational complexity.
All these results will be discussed in more detail in a forthcoming paper devoted to the epidemic. However, we believe that they already show how versatile our approach is in terms of model complexity with appealing estimation performance and controlled computational complexity.

Conclusion
In this work, we have proposed a new strategy that combines the principles of data assimilation with the population configuration classically considered in the formulation of mixed-effects models. Assuming Gaussian disturbances, up to some variable transformations, we defined a log-likelihood functional that couples the subjects in an elegant way. We approximate the resulting minimization with a dynamic programming approach, here the Unscented Kalman Filter [22]. To reduce computation time, we augment our approach with a reduced-order version of our population filter obtained after clustering the measurements. Note that our choice of UKF could be replaced by an Extended Kalman Filter [49] or an Ensemble Kalman Filter [18], provided that reduced-order versions are available [40,42].
We illustrated and evaluated the proposed strategy on a simple pharmacokinetics model. Our approach proved successful on synthetic data and, on real data, our results compare very well with the estimation procedures used as standards in the literature for nonlinear mixed-effects models. We then moved to a more complex configuration, with stronger nonlinearities and modeling errors. As explained in the introduction, the goal of this work was to propose a strategy that remains accurate, manageable, and within reasonable runtime as the dimension of the ODE system increases. The results obtained are very encouraging, and the computational times achieved pave the way for the use of our strategy on large-dimensional ODE models, or possibly partial differential equation (PDE) systems.
The correlation between H^data and H^{in,data} at the regional level allows us to calibrate D_H in each region using a mean-square estimate. The values obtained are given in Table A.1, and we set D_H to their mean value.
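The mean-square estimate of D_H mentioned above can be sketched as follows, assuming the daily hospital balance dH/dt ≈ H_in − H / D_H (a plausible reading of the model, not the paper's exact regression):

```python
import numpy as np

def estimate_dh(h, h_in, dt=1.0):
    """Least-squares estimate of the mean hospitalization duration D_H from
    the assumed balance dH/dt ≈ H_in - H / D_H.

    h    : daily hospitalization prevalence series
    h_in : daily hospital admissions series (same length)"""
    dh = np.diff(h) / dt
    residual = h_in[:-1] - dh    # should equal H / D_H under the balance
    hh = h[:-1]
    # regression through the origin: residual ≈ hh / D_H
    return np.sum(hh ** 2) / np.sum(residual * hh)
```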
Department-level hospitalization prevalence data cannot be fully trusted: each region consists of several departments, and the sum of hospitalizations over all departments of a region is not consistent with the hospitalizations reported for the region (whereas the daily incidence data are consistent). In addition, when we calibrate D_H at the department level, we obtain unrealistic values, with a minimum of 10 days and a maximum of 30.5 days, partly due to poor reporting of patient transfers between hospitals of different departments. For this reason, we use only the daily incidence data when considering departments, and we set D_H to the mean value estimated from the regional-level data (see Table A.1).

Appendix B. Computation of the effective reproductive ratio R_eff

To calculate the reproductive ratio R_eff of our SEIRAH model (4.4), we apply the Next Generation Matrix approach (see, for example, [21]). The principle is to focus on the three infected categories, (i) latent E, (ii) ascertained infectious I, and (iii) unascertained infectious A, and on their dynamics. We then build two matrices: V, which tracks the transitions between these categories and the departures from them, and F, which tracks the new arrivals from compartments exterior to the three categories. It is then well known (see for instance [38] for a proof) that R_eff is given by the spectral radius ρ(F V⁻¹) of the Next Generation Matrix F V⁻¹, which we compute explicitly for our model.
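The Next Generation Matrix computation described above can be illustrated numerically; the entries of F and V below follow an assumed SEIRAH reading of the categories (E, I, A), since the paper's exact matrices are not reproduced in this extraction.

```python
import numpy as np

def r_eff(b, s_over_n, alpha, r_E, D_E, D_I, D_q):
    """Effective reproductive ratio as the spectral radius of the Next
    Generation Matrix F V^{-1}, for an assumed SEIRAH model on the
    infected categories (E, I, A)."""
    # F: new infections arriving into E from contacts with I and A
    F = np.array([[0.0, b * s_over_n, alpha * b * s_over_n],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
    # V: transitions between the categories E, I, A and departures from them
    V = np.array([[1.0 / D_E, 0.0, 0.0],
                  [-r_E / D_E, 1.0 / D_q + 1.0 / D_I, 0.0],
                  [-(1.0 - r_E) / D_E, 0.0, 1.0 / D_I]])
    return np.max(np.abs(np.linalg.eigvals(F @ np.linalg.inv(V))))
```

For these assumed matrices, the spectral radius reduces to the closed form b (S/N) [r_E D_q D_I / (D_q + D_I) + α (1 − r_E) D_I], which the numerical computation reproduces.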