An important step in integrating the model equations is specifying the initial conditions. There are two challenges to overcome
Computing initial compartment sizes based on the values entered in the databook
Computing the initial compartment sizes at an arbitrary initial time
We will discuss these two issues in turn.
Characteristics to Compartments¶
The first challenge is related to the fact that in the databook, users enter initial values for characteristics, not just compartments. This is motivated by the fact that application data may correspond to characteristics rather than compartments. The guiding principle is that the databook should strive to directly accept as much data as the user has available. That is, the databook aims to be as complete a record of the available data as possible, with as little manual user processing as possible.
To see why initial values are often drawn via characteristics, consider the TB cascade. Because latent infection has a separate stream for vaccinated and unvaccinated individuals, the model contains the following compartments
|Short name||Long name|
||Early latent (undiagnosable)|
||Late latent (undiagnosable)|
All of the people in these compartments have recieved a vaccine, because the
ltlx compartments are specific to vaccinated individuals. Thus, the
vac compartment corresponds to people who have been vaccinated, but who have not acquired a latent TB infection. However, data is likely to be available for the total number of vaccinated people regardless of their latent infection status. Thus, the data is likely to correspond to a characteristic defined as
vac+ltex+ltlx. Similarly, there is typically good demographic data available for the total population size. However, the total population size is given by the sum of all of the compartments in the model.
Given a set of characteristics, the goal is then to compute the corresponding compartment sizes. For example, if we know
nvac = vac+ltex+ltlt (total number of vaccinated individuals) ltx = ltex+ltlx (total number of latent infections in the vaccinated population) ltlx (number of late latent infections)
ltx are the names of the characteristics, then we can readily use substitutions to infer the initial compartment sizes.
Optima TB and entry points¶
In Optima TB, this calculation was implemented as an iterative explicit substitution. The user was required to specify for each characteristic an ‘entry point’. The ‘entry point’ corresponds to the compartment that the characteristic is supplying values for. Each compartment is required to appear as the entry point for exactly one characteristic. For example
When computing the compartment value, it is assumed that an explicit value has already been calculated for all of the remaining characteristics. That is, we assume that it is possible to evaluate the equations listed above based on the values supplied either in the databook or computed from other characteristics. The computation would proceed as follows
Use the third equation to supply the value for
ltlxbased on the databook
Use the value computed for
ltlxtogether with the value of
ltxsupplied in the databook to compute the initial value for
Use the values computed for
ltlxtogether with the value of
nvacsupplied in the databook to compute the initial value for
The implementation essentially proceeds as described above, in a loop. In the general case, there could be many arbitrary combinations of compartments. The main issues with this approach are that a lot of logic is needed in the implementation to determine the correct order in which to initialize the compartments, which is complicated and difficult to ensure correctness, and that it is up to the modeller to decide which compartment to assign as the entry point for each characteristic (for example, in theory any one compartment could be listed as the entry point for the
The equations for the characteristics
nvac = vac+ltex+ltlt ltx = ltex+ltlx ltlx
can be readily recognized as a system of simultaneous equations. This can be written in matrix form as follows
[nvac] [1 1 1] [vac ] [ltx ] = [0 1 1] * [ltex] [ltlx] [0 0 1] [ltlx]
where the column vector on the left corresponds to the characteristics that the user enters in the databook, the matrix in the middle corresponds to the included compartments for each characteristic defined in the Framework, and the column vector on the right corresponds to the compartment sizes for all compartments in the model. We therefore want to solve for the values in the column vector on the right, given the values for the characteristics on the left, and the compartment membership matrix in the middle. This is the familiar system
y = A*x
which can be solved using any number of standard linear algebra functions (e.g.
numpy.linalg.solve). Subject to the limitations imposed on the system by the Optima TB databook (namely that each characteristic has one entry point, and every compartment appears exactly once in the list of entry points) this linear algebra based routine is functionally identical to the original implementation, except
It is considerably shorter because the complexity is offloaded to numpy/scipy
It is more reliable because no complex logic to validate
Problems and solutions¶
Negative initial compartment sizes¶
It is possible for the initial characteristic values provided by the user to lead to negative initial compartment sizes. For example, if the user specified
alive = sus + vac alive = 100 vac = 120
then the value for
vac is clearly critically inconsistent with the value for
alive, because the corresponding value for
sus must be
-20. In the previous Optima TB approach, this error can be isolated to the computation of the
alive characteristic, whereas in the linear algebra approach, it is the consequence of the matrix operation and thus cannot be isolated directly. However, in practice this did not make much difference in Optima TB because characteristics are typically nested. That is, if we instead had
alive = sus + nvac nvac = vac + ltex + ltlt ...
then a negative value for
sus would be a consequence of the value for the characteristic
nvac which itself could be the result of a computation. Thus, the user has to look at a set of characteristics and compartments to trace where the original inconsistency occurred. As a result, the most helpful output for the user to debug this is a dump of all of the compartment and characteristic values, which is the same in both the original and the linear algebra based approach.
With the introduction of the linear algebra based approach, specification of entry points is no longer required (because in actuality that information is redundant). In Optima TB, it was possible to throw an error if a compartment was not included as an entry point in any of the characteristics, thus preventing its value from being computed. In the linear algebra approach, this condition can be directly detected based on the
A matrix. For example, consider the system
nvac = vac + ltex + ltlx ltx = ltex + ltlx
By inspection, there is clearly insufficient information provided to be able to uniquely solve for the compartment sizes. The corresponding matrix equation is
[nvac] [1 1 1] [vac ] [ltx ] = [0 1 1] * [ltex] [ltlx]
(which is of course a valid matrix equation, with dimensions
(2x1) = (2x3)*(3x1)). The fact that there is insufficient information can be detected by comparing the rank of the
A matrix with the number of compartments. Specifically,
y=A*x is underdetermined if the rank of
A is less than the length of
x. This can be trivially checked using Numpy’s built in
numpy.linalg.matrix_rank function. Note that this is a stricter condition than simply counting the number of characteristics on the left hand side, because it is possible for a characteristic to not to provide any unique information if that characteristic can be expressed as a linear combination of other characteristics in the system. Thus, instead of using the entry points, we can check the rank of the matrix to directly identify cases where there is insufficient information for initialization.
In the case where the number of characteristics used for initialization is the same as the number of compartments, there exists a single unique solution for the compartment values that exactly matches the characteristic values. For example, if we have
alive = sus + vac
and are also given that
alive = 100 vac = 120
then we have the unique solution
sus = -20 vac = 120 alive = -20 + 120 = 100
which exactly matches the specified values
vac=120. The inconsistency in the system therefore results in a negative compartment size appearing in one or more compartments, and this can be detected regardless of the implementation used for initialization.
Inconsistent overdetermined system¶
As described above, the guiding principle for the databook is that it strives to be a complete, direct record of all available data. In pursuing this, it is possible that the user could provide an overspecification of the system. For example, suppose we have a characteristic
alive = sus + vac and the user has the following values from their data sources
alive = 100 sus = 60 vac = 50
This situation (where the data values are not entirely consistent) can occur when users are compiling data from multiple sources. How should this inconsistency be resolved? Note that in the case where the system is overdetermined, the inconsistent values mean that it is not possible to choose initial values for
vac that exactly satisfy the provided values. For example, we could have
sus = 60 vac = 50 alive = 60+50=110 (does not match provided value)
or we could have
sus = 50 (does not match provided value) vac = 50 alive = 50 + 50 = 100
In this case though, the consequence of the inconsistency is not that a compartment value becomes negative. Instead, it is manifested as a nonzero residual when comparing the provided input values to the calculated initialization. Which inconsistent initialization should be chosen? In Optima TB, the approach is that the entry points specify the initialization and redundant conflicting information is ignored. For example, if we have
then the initialization will proceed with
sus = alive-vac = 100-50 = 50
And note that the
sus = 60 information specified by the user is ignored entirely. The fact that this information was ignored is a consequence of the choice of entry points. If we instead had
then the calculation would proceed with the equally valid
sus = 60
vac = alive - sus = 100-60 = 40
In general, with a large number of characteristics and compartments, it is not easy to determine which information in the databook will take priority in the initialization. In the linear algebra based approach, we can write out the matrix equation as
[alive] [1 1] [sus ] [sus ] = [1 0] * [vac ] [vac ] [0 1]
Notice how this system is obviously overdetermined, because the column vector on the left is longer than the column vector on the right. We can solve this overdetermined system using a standard least squares approach (e.g.
np.linalg.lstsq). In the case where all of the information provided is consistent, the residual for the fit will be
0. If the values are inconsistent, the least squares solution will solve for the compartment sizes on the right that best match the set of characteristics on the left. In this case, this would be
sus = 56.66 vac = 46.66
and the corresponding value of alive would be 103.32. In this way, the difference between the data provided by the user and the values used for initialization is minimized across all of the values provided. The least squares solution also has the useful property that it is unique.
Resolving inconsistencies - initialization weight¶
In the example above, where the inconsistent input values of
sus = 60 vac = 50 alive = 100
were resolved with the valid initialization
sus = 56.66 vac = 46.66 alive = 103.32
the overall residual has been minimized across all of the provided characteristics. However, in general there are likely to be uncertainties in the data. The user may wish to more closely match the data they are most certain about. For example, if there was a census in that year so that
alive=100 is quite likely, and there was also good data on vaccination stating that
vac=50 then it may be preferable to use an initialization like
sus = 50.2 vac = 49.9 alive = 100.2
where the initialization penalizes discrepancies in
vac more than discrepancies in
sus. This is a simple case of weighted least squares, and can be readily incorporated into the algorithm. To cater for this, the user needs to specify a setup weight for each characteristic.
Additional open questions¶
The linear algebra method solves the core of the problem of inverting characteristic values to solve for compartment values given a set of characteristics. The key questions on the input side are
How and where to specify which characteristics to use for initialization
How and where to specify the setup weights
The simplest approach would be to simply read all of the characteristics in the databook and use them all to calculate the compartment sizes. This approach will work in general as long as the databook contains a sufficient set of characteristics for the system to not be underdetermined. It is currently up to the modeller to ensure that the ‘databook order’ in the Framework results in a sufficient set of charcteristics will appear in the databook.
However, a characteristic is ignored for initialization if the setup weight is 0. If the user is able to set setup weights, then it is possible that the user could set some of the characteristics to be ignored for initialization such that the system becomes underdetermined, resulting in an error. To prevent this, the setup weights are currently specified in the Framework, which users are not expected to edit.
The general philosophy is that
Modellers interact primarily with frameworks
Users interact primarily with databooks
Users are not expected to be able to diagnose and fix errors such as the system being underdetermined
On the other hand, the setup weight is currently proposed to mainly reflect data quality, which means that it ought to be specified in the databook so that different projects with different data can share the framework file but perform their initialization differently due to their differing data. This is currently not possible, because the setup weights are in the framework file.
A proposed simplification without loss of functionality is:
The ‘setup weights’ are removed from the Framework, with characteristics in the databook assumed to have a setup weight of
A nonzero setup weight can be optionally specified in the databook, which overrides the default value of
Here, it is still the modeller’s responsiblity to select a suitable set of characteristics to appear in the databook. However, the user is now able to provide weights that reflect their data quality without the possibility of those weights resulting in the system becoming underdetermined. The conditions for a nonzero setup weight to be used are
The provided characteristic values are inconsistent, and
Some of the data is known to be of higher quality than others, and
Those characteristics are also the ones where the inconsistency occurs
Thus, for the majority of applications, no setup weight would need to be entered. If the provided characteristic values are inconsistent, this is detected automatically based on the residual being large, and the user is warned accordingly. At that point, the user can decide whether or not they need to specify different setup weights.
Computing initial values at arbitrary times¶
The initial compartment values are computed based on the characteristic values at a specific point in time. However, the data values entered in the databook are in general provided at sparse, arbitrary times. Therefore, it is typically necessary to interpolate each of the characteristics onto the initial simulation time prior to calculating the initial compartment values.
The critical issue here is that inconsistencies in the values can be time dependent. For example, the user may enter a consistent set of values in 1990 but an invalid set of values in 2000. If the simulation is started in 1990 then it may run fine, but if started in 2000 it is possible that an inconsistency could result in negative compartment sizes. More insidious is the possibility that interpolation could result in inconsistent values. For example, suppose that the values are consistent in 1990 and in 2000. It is still possible that in 1995 the interpolated values are not consistent, particularly if some of the characteristics have data values between 1990 and 2000 and others do not.
Because the interpolation routine uses the nearest-neighbour value for times before and after the first or last data point, respectively, if the simulation is started from the first year in the databook then it is possible to directly read out the characteristic values used for interpolation, which makes diagnosing inconsistencies easier. However, if interpolation results in inconsistencies, the input characteristic values will not correspond to anything entered in the databook. So it is important to keep this issue in mind when diagnosing inconsistent initializations when the simulation start year is different to the data start year.