{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Uncertainty\n", "\n", "This notebook examines uncertainty in Atomica, going through\n", "\n", "- How uncertainty in model inputs is entered\n", "- How to perform simulations that are sampled from the distribution of parameters with uncertainty\n", "- How to perform analyses that incorporate uncertainty\n", "\n", "## Specifying uncertainty\n", "\n", "Fundamentally, uncertainty enters Atomica via the data entry spreadsheets. Both the databook and program book have columns that can be used to enter uncertainty. \n", "\n", "In the databook, uncertainty can be added to any TDVE table - that is, any of the sub-tables within the databook for each quantity. By default, those columns are not produced when the databook is written e.g. with \n", "\n", " proj.data.save('databook.xlsx')\n", " \n", "If you want to include those columns, use the `write_uncertainty` options:\n", "\n", " proj.data.save('databook.xlsx',write_uncertainty=True)\n", " \n", "This will add an 'uncertainty' column to all tables in the databook. If you only want to add uncertainty to a small number of quantities, you can simply insert an extra column for the tables that you want to add uncertainty to, by selecting the other cells and moving them over. The example below shows how an uncertainty column has been added to the 'Screened people' table and not to the 'All people with condition' table:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![image1](assets/databook_uncertainty.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Uncertainty should be entered as a standard deviation, with the same units as the quantity itself\n", "
\n", "\n", "In the example above, it would be accurate to state that the 2016 value for the number of screened people in the male rural population is $214 \\pm 20$. Mathematically, the entry above corresponds to specifying that the distribution of screened people in the male rural population has a mean of 214 and a standard deviation of 20. \n", "\n", "For the program book, uncertainty appears in two places - in the spending/coverage sheet, and in the outcome sheet, as shown below. These columns are always present in the program book.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![image2](assets/progbook_spending_uncertainty.png)\n", "![image3](assets/progbook_outcome_uncertainty.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To limit complexity in data entry, only a single uncertainty value can be entered for each quantity. That is, in the databook and spending/coverage sheet, uncertainty cannot vary over time, and in the outcome sheet, the uncertainty is specified at the parameter-population level, and applies to all programs reaching that parameter. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling inputs\n", "\n", "We have seen above how to enter uncertainty in the databook and program book. These uncertainties are loaded into the corresponding `ProjectData` and `ProgramSet` objects in Atomica. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import sys\n", "sys.path.append('..')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import atomica as at\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "import sciris as sc\n", "np.random.seed(3)\n", "testdir = '../../tests/'\n", "\n", "P = at.Project(framework=testdir + 'test_uncertainty_framework.xlsx', databook=testdir + 'test_uncertainty_databook.xlsx')\n", "P.load_progbook(testdir + 'test_uncertainty_high_progbook.xlsx')\n", "\n", "low_uncertainty_progset = at.ProgramSet.from_spreadsheet(testdir + 'test_uncertainty_low_progbook.xlsx',project=P)\n", "high_uncertainty_progset = at.ProgramSet.from_spreadsheet(testdir + 'test_uncertainty_high_progbook.xlsx',project=P)\n", "default_budget = at.ProgramInstructions(start_year=2018, alloc=P.progsets[0])\n", "doubled_budget = default_budget.scale_alloc(2)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, we have loaded in a databook with uncertainty values, and also loaded in two different program books, one with low uncertainty, and one with high uncertainty. We can inspect the objects to see how the uncertainty entered in the spreadsheets is reflected in code. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "P.data.tdve['all_screened'].ts[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The uncertainty value is attached to the underlying `TimeSeries` objects, and stored in the `sigma` property. 
When running a simulation, we specify a `ParameterSet` and optionally a `ProgramSet`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "parset = P.parsets[0]\n", "res = P.run_sim(parset)\n", "d = at.PlotData(res,'all_screened',pops='m_rural')\n", "at.plot_series(d);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there is uncertainty, we can obtain a sampled epidemiological prediction by sampling from all of the quantities containing uncertainty - for example, replacing the value of `214` above with a sample from the distribution $\mathcal{N}(214,20)$ (mean 214, standard deviation 20), and likewise for all of the other quantities. The input to the model is effectively a parameter set derived from the original set of parameters, but with values replaced by sampled values where appropriate. You can obtain a sampled parameter set simply by calling the `ParameterSet.sample()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_parset = parset.sample()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now compare the values for the screening parameter in the original parset and the sampled parset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Original parameters: %.2f' % (parset.pars['all_screened'].ts['m_rural'].vals[0]))\n", "print('Sampled parameters: %.2f' % (sampled_parset.pars['all_screened'].ts['m_rural'].vals[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original parameters contain the values entered in the databook, while the sampled parameters have been perturbed by the sampling. Although the sampled parameters retain the original uncertainty values, you can only perform sampling once:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", "    sampled_parset.sample()\n", "except Exception as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run a sampled simulation, it is thus only necessary to pass the sampled `ParameterSet` into `Project.run_sim` instead of the original parameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sim(sampled_parset,result_name='Sampled')\n", "d = at.PlotData([res,sampled_res],'all_screened',pops='m_rural')\n", "at.plot_series(d,axis='results');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exactly the same procedure is used for programs - the `ProgramSet` has a `sample()` method which returns a sampled `ProgramSet` that can be used instead of the original `ProgramSet` to perform a sampled simulation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_progset = high_uncertainty_progset.sample()\n", "res = P.run_sim(parset,high_uncertainty_progset,progset_instructions=default_budget,result_name='Original')\n", "sampled_res = P.run_sim(parset,sampled_progset,progset_instructions=default_budget,result_name='Sampled')\n", "d = at.PlotData([res,sampled_res],'all_screened',pops='m_rural')\n", "at.plot_series(d,axis='results');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we can combine the sampled parset and the sampled progset in a single simulation to sample from both sources of uncertainty simultaneously:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sim(sampled_parset,sampled_progset,progset_instructions=default_budget,result_name='Sampled')\n",
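"# Plot the unsampled 'Original' result against the doubly-sampled result\n",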
"d = at.PlotData([res,sampled_res],'all_screened',pops='m_rural')\n", "at.plot_series(d,axis='results');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One issue when sampling is that of satisfying initial conditions. Databooks often require some work to ensure that none of the initial compartment sizes are negative. If uncertainties are specified for characteristics that are used for initialization, it is possible that the independently-sampled characteristic values yield a bad initialization (where some of the compartment sizes are negative, or the characteristic values are otherwise inconsistent). If this happens, a `BadInitialization` error is raised. Therefore, to write reliable sampling code, it is essential to catch this exception:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(1) # This seed provides a BadInitialization at the next iteration\n", "sampled_parset = parset.sample()\n", "try:\n", "    sampled_res = P.run_sim(sampled_parset)\n", "except at.BadInitialization as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When this error occurs, the correct thing to do is usually to just sample again, on the assumption that other samples may satisfy the initial conditions. Thus, sampling would be carried out within a `while` loop that continues until a simulation is successfully run. To automate this, we can use `Project.run_sampled_sims`, which functions similarly to `Project.run_sim()` except that it incorporates a sampling step, together with automated resampling if the initial conditions are unsatisfactory:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sampled_sims(parset)\n", "print(sampled_res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the return value is a list of results, rather than just a `Result` like with `Project.run_sim()`. Typically, uncertainty analyses require working with multiple samples. Thus, you can also pass in a number of samples, and `Project.run_sampled_sims()` will return the required number of sampled results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sampled_sims(parset,n_samples=5)\n", "print(len(sampled_res))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When running multiple sampled simulations, a progress bar with estimated time remaining will be displayed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis\n", "\n", "Now that we are able to run simulations with sampled parameters, the next question is how to produce analysis outputs such as plots that have error bars or otherwise visually represent uncertainty. The first step is to obtain a collection of results. The number of samples required depends loosely on the number of quantities with uncertainty; as a general rule of thumb, on the order of 50-200 samples is typical. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sampled_sims(parset,n_samples=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, we will also run a 'baseline' simulation with no sampling. 
A plot of the baseline results looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "baseline = P.run_sim(parset)\n", "d = at.PlotData(baseline,outputs=['screen','diag','initiate'],pops='m_rural')\n", "at.plot_series(d);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic plotting\n", "\n", "The normal sequence for plotting is to\n", "\n", "1. Instantiate a `PlotData` object with the quantities to plot. Effectively, this operation maps a `Result` to a `PlotData`\n", "2. Call a plotting library function to generate the plot\n", "\n", "Plots generated with the plotting library compare multiple results (e.g. different scenarios) rather than showing uncertainty computed over an ensemble of results. Further, they only take in a single `PlotData` instance, where the results to compare have been included in the same `PlotData` object. \n", "\n", "To work with uncertainty, instead of plotting a single `PlotData` object, we plot a collection of `PlotData` objects, and plots are generated using a different set of plotting functions that take in multiple `PlotData` instances and use them to compute and display uncertainties (e.g. as error bars or as shaded areas).\n", "\n", "This functionality is accessed via the `Ensemble` class, which is designed to store and plot ensembles of results. Internally, it stores a list of `PlotData` objects - one for each sample. It also contains a 'mapping function' that takes in results and returns a `PlotData` instance. To use an Ensemble, first we need to define a mapping function. In the simplest case, this looks just like a normal call to `PlotData()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mapping_function = lambda x: at.PlotData(x,outputs=['screen','diag','initiate'],pops='m_rural')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we create an `Ensemble`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble = at.Ensemble(mapping_function=mapping_function)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then need to load all of the results into the `Ensemble`. Similar to Python's built-in `set` class, there are two ways of inserting the results:\n", "\n", "- One by one, using `Ensemble.add(result)`\n", "- All at once, using `Ensemble.update(list_of_results)`\n", "\n", "In our case, we already have a list of results generated by `P.run_sampled_sims`, so we can go ahead and use `Ensemble.update()` to insert them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.update(sampled_res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can make some plots. To start with, let's make a plot of the time series like the one shown above for the baseline results, but with uncertainty this time:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.plot_series();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have only loaded in the sampled results, the dashed line corresponds to the median of the samples, and the shaded area spans the first to third quartiles of the samples. However, rather than showing the median of the samples, it is generally more appropriate to show the actual baseline results. 
To do this, we can load the baseline results into the Ensemble and then regenerate the plot.\n", "\n", "
\n", "The baseline results need to have the same names as the sampled results. If you do not explicitly specify names, `Project.run_sampled_sims` produces default names that avoid collisions between multiple instructions. However, `Project.run_sim` produces default names that avoid collisions with other results stored in the project. Therefore, it may be necessary to explicitly set or change the result names if the names were set differently.\n", "
\n", "\n", "In this case, the baseline results need to have their name changed to `'default'` to match the sampled results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "baseline.name = 'default'\n", "ensemble.set_baseline(baseline)\n", "ensemble.plot_series();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the rest of the plotting library, the legend is 'maximally informative' in that it contains as much information as possible. This is because the most appropriate labelling for the figure (whether the axis labels, titles, or legend entries) varies greatly with context, and it is often necessary to relabel the figure on a plot-by-plot basis. This relabelling is facilitating by having all of the necessary content on the figure, rather than having to keep track of it separately. Thus, it is expected that the labels may need to be edited afterwards depending on usage.\n", "\n", "The `'` string corresponds to the name of the ensemble. We hadn't given the ensemble a name, but naming the ensemble can be useful if you want to superimpose multiple ensembles on the same plot. The ensemble can be named at construction, or at any point subsequently:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.name = 'Example'\n", "ensemble.plot_series();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the baseline results are displayed as solid lines. Other types of plots can be generated too. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.plot_bars(years=2018)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the `Ensemble` stores `PlotData` objects, you can perform any aggregation operations in the same way as normal. For example, we can plot the years lost to disability by integrating the number of people in the disease compartments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "yld = lambda x: at.PlotData(x,outputs={'disease':['undx','scr','dx','tx']},t_bins=[2018,2023],time_aggregation='integrate')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the usual way that this plot would be generated for a single `Result` without uncertainty, and indeed we can go ahead and generate that plot now" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = yld(baseline)\n", "at.plot_bars(d);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With uncertainty, we simply make the plot via an `Ensemble` instead" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble = at.Ensemble(name='Example',mapping_function=yld,baseline_results=baseline)\n", "ensemble.update(sampled_res)\n", "ensemble.plot_bars()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use the `Ensemble.summary_statistics()` method to get a dataframe with summary statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.summary_statistics()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing sampled results\n", "\n", "Moving up in complexity, a common analysis task is comparing results across scenarios. At the beginning, we instantiated two sets of instructions - one with default spending, and one with doubled spending. 
We will now produce samples from both of these instructions. There are two ways to approach this problem:\n", "\n", "- Storing the results in separate Ensembles\n", "- Putting multiple results into a single Ensemble\n", "\n", "For the comparison to be valid, each set of instructions must be run with the same parset and progset samples. That is, to generate a single result, it is necessary to first sample, and then use the same samples for all instructions. Written out explicitly, this would be:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(3)\n", "sampled_parset = parset.sample()\n", "sampled_progset = high_uncertainty_progset.sample()\n", "result_default_spend = P.run_sim(sampled_parset,sampled_progset,default_budget,result_name='Baseline')\n", "result_doubled_spend = P.run_sim(sampled_parset,sampled_progset,doubled_budget,result_name='Doubled budget')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing results across budget scenarios is extremely common, so Atomica has built-in functionality to facilitate this. You can pass multiple instructions and result names to `Project.run_sampled_sims()` to generate results as shown above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sampled_sims(parset,high_uncertainty_progset,n_samples=100,progset_instructions=[default_budget,doubled_budget],result_names=['Baseline','Doubled budget'])\n", "print(len(sampled_res))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is now a list of lists, where each inner list contains a set of results that has been generated with different instructions but the same sampled inputs. This is the input that would go to an `Ensemble` mapping function, to produce a `PlotData` instance that mixes results. Now, we can generate a plot comparing YLD in each population:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampled_res = P.run_sampled_sims(parset,high_uncertainty_progset,progset_instructions=[default_budget,doubled_budget],n_samples=100,result_names=['Baseline','Doubled budget'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mapping_function = lambda x: at.PlotData(x,outputs=['screen','diag','initiate'],pops='m_rural')\n", "ensemble = at.Ensemble(name='Example',mapping_function=mapping_function)\n", "ensemble.update(sampled_res)\n", "ensemble.plot_series();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble = at.Ensemble(name='Example',mapping_function=yld)\n", "ensemble.update(sampled_res)\n", "ensemble.plot_bars();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To aid comparison, it may be clearer to plot individual populations one at a time. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.plot_bars(pops='m_urban');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, there isn't much difference because, as shown in the time series, most of the variability occurs in the initial simulation years. 
We can see this if we plot a quantity with more variability - for example, the number of undiagnosed people in the first simulation year:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "undx = lambda x: at.PlotData(x,outputs='undx',pops='m_urban')\n", "ensemble = at.Ensemble(name='Example',mapping_function=undx)\n", "ensemble.update(sampled_res)\n", "ensemble.plot_bars(years=2017);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you pass in a figure to `Ensemble.plot_bars()`, it will attempt to append new bars to the existing figure. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = ensemble.plot_bars(years=2017);\n", "ensemble.plot_bars(years=2018,fig=fig);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, it was not essential to do this, because multiple `years` can be plotted in a single `plot_bars` call. However, you could superimpose bars from entirely separate Ensemble objects if you wanted to compare quantities computed using different mapping functions. \n", "\n", "Another important type of plot is the sampled distribution of a quantity. Effectively, the entire Atomica model is a function that maps probability distributions for model inputs into distributions of model outputs. These distributions can be plotted directly. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.plot_distribution(pops='m_urban');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distributions that are plotted are effectively vertical cross-sections of the time series plots. It can be important to inspect them to check for multi-modal or asymmetric distributions, which have implications for the validity of reporting summary statistics. \n", "\n", "There are two ways to compare distributions:\n", "\n", "- Within an Ensemble\n", "    - Comparing multiple outputs/pops\n", "    - Comparing multiple results\n", "- Across Ensembles\n", "    - Typically comparing corresponding outputs/pops\n", "\n", "To cater for the former, if a `PlotData` instance contains multiple years, results, outputs, or pops, they will all be shown on the distribution figure unless filtered using arguments to `Ensemble.plot_distribution`. For the latter, subsequent calls to `Ensemble.plot_distribution` can pass in a Figure, which will superimpose the distributions onto the existing figure. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble = at.Ensemble(lambda x: at.PlotData(x)) # All compartments\n", "ensemble.update(sampled_res)\n", "ensemble.plot_distribution(pops='m_rural',outputs='undx');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distributions of differences between results\n", "\n", "One of the most important analyses is quantifying the difference between one budget scenario and another - for example, the improvement from optimizing the budget. In the presence of uncertainty, this difference becomes a distribution. However, samples from this distribution must be computed pairwise - from the same sampled inputs, test both budgets, compute the difference between them, and then store the difference. This can be done by defining a mapping function that returns a `PlotData` containing differences rather than raw values. 
For example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_differences(res):\n", " d1 = at.PlotData(res[0],outputs={'disease':['undx','scr','dx','tx']},t_bins=[2018,2023],time_aggregation='integrate')\n", " d2 = at.PlotData(res[1],outputs={'disease':['undx','scr','dx','tx']},t_bins=[2018,2023],time_aggregation='integrate')\n", " return d2-d1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `PlotData` class implements operators for subtraction and division, to facilitate comparing derived/aggregated quantities. We can now create an `Ensemble` with this mapping function, add the results, and plot the distribution of differences:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble = at.Ensemble(get_differences,'difference')\n", "ensemble.update(sampled_res)\n", "ensemble.plot_distribution(pops='m_rural');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cascade ensembles\n", "\n", "Plotting uncertainty on cascades is a common task. However, the cascade plotting functions normally operate directly on `Result` objects, rather than `PlotData` which forms the basis of data storage in Ensembles. For normal analysis without uncertainty, this facilitates ease of use by not requiring an intermediate `PlotData` step when producing standard cascade plots. For uncertainty, the simplest option is to construct a `PlotData` object containing the required cascade values, and to generate the plot via `Ensemble.plot_bars` (the key difference between `Ensemble.plot_bars` and `plotting.plot_bars` is that `Ensemble.plot_bars` doesn't support any stacking - which would be potentially misleading with uncertainty - and therefore supports arbitrary ordering of the bars, whereas `plotting.plot_bars` requires that the inner group consist of outputs or pops only, as these are the only valid stacking targets). \n", "\n", "To facilitate generation of the appropriate mapping function and arguments to `Ensemble.plot_bars`, this functionality is built into the `CascadeEnsemble` class, which inherits from `Ensemble`. The constructor for `CascadeEnsemble` takes in the name of the cascade to plot. It also requires a `ProjectFramework` to be passed in, which provides the definition of the cascade being requested:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cascade_ensemble = at.CascadeEnsemble(P.framework, 'main')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results can be loaded into the cascade ensemble as normal" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cascade_ensemble.update(sampled_res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A cascade plot can then be plotted using `CascadeEnsemble.plot_multi_cascade()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cascade_ensemble.plot_multi_cascade(years=2018)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the bars are automatically grouped by cascade stage and labelled appropriately. It's also possible to call standard plotting functions that are inherited from the `Ensemble` class. 
For example, we could plot the time series of cascade stages using `Ensemble.plot_series`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cascade_ensemble.plot_series(results='Baseline',pops='m_rural');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Virtual stages\n", "\n", "Normal cascades must be formed from compartments and characteristics only, and these stages are automatically validated to ensure proper nesting. However, because `CascadeEnsemble.plot_multi_cascade` is based around `PlotData`, it is possible to display a cascade-like plot where stages are defined using arbitrary `PlotData` operations. That is, the `CascadeEnsemble` does not require that the cascade is actually valid. This opens up the possibility of generating cascade-like plots with 'virtual stages' that don't otherwise exist. An example from an actual application is the introduction of a 'Pre-diagnosed' stage, where application data (separately) indicated that about 80% of screened individuals would be in this pre-diagnosed stage. The cascade could be defined as:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cascade = {\n", "    'Prevalent':'all_people',\n", "    'Screened':'all_screened',\n", "    'Pre-diagnosed':'0.8*all_screened',\n", "    'Diagnosed':'all_dx',\n", "    'Treated':'all_tx',\n", "    'Controlled':'all_con'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "noting that the 'pre-diagnosed' stage is the result of a calculation and is thus not a compartment or characteristic at all. The cascade specification here is simply any `output` dict supported by `PlotData`. We can then create a `CascadeEnsemble` using this cascade, and plot it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cascade_ensemble = at.CascadeEnsemble(P.framework, cascade)\n", "cascade_ensemble.update(sampled_res)\n", "fig = cascade_ensemble.plot_multi_cascade(years=2018)\n", "fig.set_size_inches(8,4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Notice that because the model outputs are converted to `PlotData` on a per-sample basis, uncertainty is automatically propagated to the derived quantity (i.e. 'pre-diagnosed' has uncertainty even though it was defined on the fly). This would be the case for any `PlotData` operation. \n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory-limited environments\n", "\n", "A common issue that may occur when sampling large models is that there may be insufficient RAM to store all of the `Result` objects in memory prior to adding them to an `Ensemble`. The `PlotData` stored within an `Ensemble` is typically orders of magnitude smaller than the raw results - in the first instance, it only contains a small subset of the model outputs. A large model might have on the order of 150 parameters, 50 compartments, and 150 links. A typical `Ensemble` may have track 5-10 quantities, which represents a ~50x reduction in size. Further, the raw output has values at every timestep, which could be on the order of 100 values (e.g. a 25 year simulation with quarterly timesteps) but it may only be desired to compare simulations for a few time points - which could provide another ~20x reduction in size. \n", "\n", "In sum, the strategy for dealing with this situation is\n", "\n", "1. Define a mapping function that leads to a sufficient reduction in size of the outputs\n", "2. Add results to the Ensemble as they are generated, rather than keeping them\n", "\n", "For reducing the size of the mapping function, we have already seen examples of mapping functions that select only a subset of the outputs or populations. For time, an additional call to `PlotData.interpolate()` can be added to retain the values only at times of interest. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "store_minimal = lambda x: at.PlotData(x,outputs=['screen','diag','initiate'],pops='m_rural').interpolate(2020)\n", "ensemble = at.Ensemble(store_minimal)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, to perform the sampling, we simply have to generate results one at a time, rather than all at once" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for _ in range(100):\n", " result = P.run_sampled_sims(parset,high_uncertainty_progset,default_budget,n_samples=1)\n", " ensemble.add(result[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the peak amount of memory required is only that of 1 simulation. We can then plot the ensemble as usual:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.plot_bars();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform this procedure seamlessly, instead of calling `P.run_sampled_sims`, you can instead call `Ensemble.run_sims`. The idea is that `Ensemble.run_sims` runs simulations specifically for that Ensemble - instead of returning results, it adds them to the Ensemble. Under the hood, it adds the results as they are generated without storing them internally. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble = at.Ensemble(store_minimal)\n", "ensemble.run_sims(proj=P,parset=parset,progset=high_uncertainty_progset,progset_instructions=default_budget,n_samples=100)\n", "ensemble.plot_bars();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that if you use `Ensemble.run_sims` is that a baseline simulation without sampling will automatically be performed and stored in the Ensemble. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "- Use `P.run_sampled_sims` if you can store all of the results in memory, which allows you to generate new Ensembles and plots without re-running the simulations \n", "- Use `Ensemble.run_sims` if you don't have enough memory to hold all of the results\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sampling is a trivially parallizable problem, and Atomica supports parallization of sampling. To run samples in parallel, simply set `parallel=True` when calling `Project.run_sampled_sims()` or `Ensemble.run_sims()`. See `test_ensemble_parallel.py` in the `tests` folder for an executable example. Note that on Windows, the calling code must be wrapped in `if __name__ == '__main__'`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Experimental plots\n", "\n", "The plots below are not fully developed but serve as examples to show possibilities that can be built upon as applications of Atomica progress. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.pairplot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ensemble.boxplot()" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:atomica37]", "language": "python", "name": "conda-env-atomica37-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }