{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Frameworks\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The _framework_ is an object that represents the transition structure of the model in Atomica. It contains a listing of all of the compartments, characteristics, and parameters, as well as the transition matrix that links parameters with transitions between compartments. It does not contain a specification of the populations, because these are entered in the databook. This page assumes familiarity with basic Atomica concepts (e.g., compartments, characteristics, parameters, populations) and is intended as technical documentation on the implementation of the framework." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import atomica as at" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A 'framework' exists in two forms:\n", "\n", "- The 'framework file' which is an `xlsx` file, such as `framework_tb.xlsx`\n", "- A Python object, `ProjectFramework`, which is stored in `Project.framework` and is constructed by parsing the framework file\n", "\n", "On the user input side, users of Atomica implementing new cascade models etc. will typically be modifying the framework file. On the development size, Python code can generally expect to interact with the `ProjectFramework` object. A key goal is to make the `ProjectFramework` as flexible as possible to accommodate changes in the Excel file, to maximize the extent to which changes in a project can be performed in the framework file without needing complex changes in the codebase." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `ProjectFramework` basics\n", "\n", "The `ProjectFramework` class stores a parsed, validated representation of the framework file. The framework file consists of a set of sheets, with content on each of the sheets. An example from the TB framework is shown below\n", "\n", "![framework-overview-example](assets/framework_overview_example.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that there are a number of worksheets in the file ('Databook Pages', 'Compartments', 'Transitions',...) and there is of course content on each sheet. \n", "\n", "The `ProjectFramework` contains a member variable, `sheets`, which is an odict where the key is the sheet name and the value is a list of Pandas `DataFrames` that contains all of the tables on that sheet. A table is defined as a rectangular range of cells terminated by an empty row." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F = at.ProjectFramework(at.LIBRARY_PATH / \"tb_framework.xlsx\")\n", "F.sheets.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Notice that the sheet names are all converted to lowercase!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['databook pages'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some sheets contain multiple elements. For example, consider the 'Cascades' sheet here:\n", "\n", "![framework-cascade-example](assets/framework_cascade_example.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intuitively, this sheet contains two separate 'tables' that each have a heading row, and they are separated by a blank row. When a worksheet is parsed, if there are any blank rows, multiple `DataFrames` will be instantiated, one for each 'table' present on the page. As shown below, `F.sheets['cascades']` is a list of `DataFrames`, and each `DataFrame` stores the contents of one of the tables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(F.sheets['cascades'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(F.sheets['cascades'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['cascades'][0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['cascades'][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "The code parses Excel sheets into `DataFrames` in `ProjectFramework.__init__()` and **this initial stage of parsing is blind to the contents of the worksheet**. This means that all columns of all sheets will be loaded into `DataFrames`. Any new columns added by the user in the framework file will automatically appear in the `DataFrame` when the framework file is loaded.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing the data\n", "\n", "You can access the contents from the Excel file by operating on the `DataFrame`. Notice how in the previous examples, the first row of each 'table' is assumed to be a heading and is set as the `Columns` property of the `DataFrame`. This means those column names can be used as normal to index the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is the DataFrame from the 'Databook Pages' sheet\n", "F.sheets['databook pages'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Notice that the column titles are all converted to lowercase!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is a listing of just the 'Datasheet Title' column. The name should be an identical match\n", "# to the title in the Excel file. These titles are set directly based on the contents of the Excel\n", "# file. If the title is changed in the Excel file, then references to it in the code will need\n", "# to be updated in order to reflect this\n", "F.sheets['databook pages'][0]['datasheet title']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "As these are just ordinary `DataFrames`, you can work with them in the same way as any other `DataFrame` in regards to indexing, accessing rows/columns, filling values etc. \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exporting a Framework\n", "\n", "By design, although a `ProjectFramework` can be pickled and saved as an object, it cannot be exported back to a spreadsheet. Users are not expected to require the ability to programatically modify framework files (in contrast, they _are_ expected to programatically modify databooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making a new Framework\n", "\n", "It is no longer necessary to export a blank framework file from the code - simply copy `atomica/core/framework_template.xlsx` to start a new template. That Excel file now has a lot of extra validation and autocompletion to facilitate creating new Frameworks. This will be outlined in detail in the user guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Validating the input\n", "\n", "When a `ProjectFramework` is instantiated, it will read in everything in the Excel file. However, it is up to the code to actually use the variables contained in the framework. For example, a new column could be added to store an additional property of say, a compartment. However, it won't have any effect on the simulation unless code is written to take advantage of this (e.g., `model.py` might use the new property to treat those compartments differently).\n", "\n", "This presents a challenge - what should be done if the `ProjectFramework` does not contain something that is required by the code? The simplest solution is that the user's custom code can simply check whether a particular sheet, a particular column, or a particular value is present in the `ProjectFramework` and act accordingly. \n", "\n", "If you don't want the logic of your code to be complicated by the additional validation checks, or if the code is performance-critical and you only want to validate values once, you can instead perform the validation when the `ProjectFramework` is loaded. The method `ProjectFramework._validate()` is automatically called after the spreadsheet is read in. At the moment, it mainly contains validation of the basic elements of an Atomica model - for example, ensuring that compartments, characteristics, parameters, and the transition matrix are present. Here are some of the standard checks that are performed\n", "\n", "1. Requiring that a sheet is present - an error will be thrown if the sheet is missing. For example, if the 'Compartments' sheet is missing, then the Framework is essentially unusable\n", "2. Checking whether a property is present - for example, if the 'Compartments' page is missing the 'Code Name' column then the Framework is considered unusable \n", "3. Validating the contents of specified fields - for example, the 'Is Source' field for a compartment can only be 'y' or 'n' so an error will result if a different value was entered in the framework file\n", "4. Filling in any missing values with defaults - for example, if a 'Compartment' has an empty cell for 'Is Source' then it will default to 'no'\n", "\n", "The last three of those checks are implemented in `framework.py:sanitize_dataframe()` so they can be readily used for any `DataFrame` that is loaded in.\n", "\n", "In addition to this, arbitrary validation can also be performed at this point. For example, `ProjectFramework._validate()` checks here that there are no transitions from a normal compartment to a source compartment. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding a quantity to the Framework\n", "\n", "
\n", "Adding something to the Framework can be done simply by adding an extra column or extra sheet. `DataFrames` associated with every Table on the page will be created. These can then be accessed directly from the Framework. In the simplest case, you can perform validation (like checking if a quantity was provided in the framework) at the point you want to use the quantity. \n", "
\n", "\n", "However, if you want to validate the new variables when the framework is loaded, then you will need to modify `ProjectFramework._validate()`. To add a default quantity, look at how default values for compartments etc. are handled in that function. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convenience methods\n", "\n", "First and foremost, the role of the `ProjectFramework` is to store information about compartments, characteristics, parameters, interactions, and transitions, because those are are the fundamental building blocks of `Model` objects, and simulations cannot be run without them. These are accessed so frequently that the `ProjectFramework` provides some special methods to facilitate working with them. \n", "\n", "First, the `Transitions` sheet is handled in a special way. As with every other sheet, it appears in the `.sheets` property:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['transitions'][0].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, this format is not very helpful for actually working with transitions. The `ProjectFramework` stores a special parsed representation of this in `ProjectFramework.transitions` which is an odict where the key is a parameter name, and the value is a list of tuples for every transition that parameter governs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(F.transitions['b_rate'])\n", "print(F.transitions['doth_rate'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also illustrates an example of where validation is used - the parsing will fail unresolvably if the `Transitions` sheet is missing from the framework file, so `ProjectFramework._validate()` first checks that the sheet is present and displays an informative error if it is missing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, the `DataFrames` for compartments, characteristics, parameters, and interactions have the 'Code Name' column set as the index for the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['compartments'][0].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that you can access a row of this `DataFrame` using `.loc` and passing in the code name of the quantity. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['compartments'][0].loc['sus']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As usual, the `.loc` method returns a Pandas `Series` object, which is similar to a `dict()`. Note that the `name` of the `Series` is the code name of the quantity i.e." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['compartments'][0].loc['sus'].name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This allows you to retrieve the code name based only on the `Series` object - so for example, if you retrieved that row by index rather than by name " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row = F.sheets['compartments'][0].iloc[0]\n", "print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "then you can still recover the name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the column names have also been set in the `DataFrame`, you can index the `Series` using the column name to retrieve a specific property. For example, if we want to check which databook page the `vac` compartment is on:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.sheets['compartments'][0].loc['vac']['databook page']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column names are read straight from the Excel file - any additional columns will automatically appear and be usable. Note that there should be an exact match between the name of the column in the Excel file, and the string used in the code (for example, you could not use `F.sheets['Compartments'].loc['vac']['Databook page']` because it is case sensitive)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For convenience, the `.comps`, `.pars`, `.characs`, and `.interactions` property methods map to their corresponding sheets. That is, `F.comps` is shorthand for `F.sheets['Compartments']`. So the above command could instead be" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.comps.loc['vac']['databook page']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Those properties directly return the `DataFrame` in `F.sheets` so if you modify the `DataFrame` via `F.comps` it will be changed everywhere. \n", "\n", "Finding specific items by their code name is very common. To facilitate this, you can use the `.get_comp`, `.get_charac`, `.get_par`, and `.get_interaction` property methods to look up rows of the corresponding `DataFrames` by code name. For example, to retrieve the row for `vac`, instead of `F.comps.loc['vac']` you can use" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.get_comp('vac')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So looking up the databook page can be made even simpler:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.get_comp('vac')['databook page']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you might know the code name without knowing the type of the variable - for example, only given the code name `'alive'`, should you be looking up a compartment or a characteristic? In that case, you can use the `.get_variable()` method which will search for the name in each of compartments, characteristics, parameters, and interactions. 
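{ "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, since the properties above directly return the `DataFrame` stored in `F.sheets`, an identity check (illustrative only) confirms that they refer to the same object rather than a copy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# F.comps is the same DataFrame object as the one in F.sheets, not a copy,\n", "# so modifications made via either reference are visible through both\n", "F.comps is F.sheets['compartments'][0]" ] },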
{ "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes you might know the code name without knowing the type of the variable - for example, given only the code name `'alive'`, should you be looking up a compartment or a characteristic? In that case, you can use the `.get_variable()` method, which will search for the name in each of compartments, characteristics, parameters, and interactions. The return value is a tuple where the first entry is the row of the `DataFrame`, and the second entry is a string like `'comp'`, `'par'`, or `'charac'`, which identifies what type of variable was passed in" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.get_variable('vac')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, sometimes you might have the display name rather than the code name. You can also pass a display name into `get_variable()` and it will retrieve the corresponding item" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.get_variable('Vaccinated')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that these functions rely on the fact that code names and display names are supposed to be unique across all of the different variable types (i.e. not only can you not have two compartments with the same name, you also cannot give a parameter the same name as a compartment). This is checked in `ProjectFramework._validate()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterating over items\n", "\n", "The simplest way to iterate over rows of a `DataFrame` is using the `iterrows()` method of the `DataFrame`. For example, to loop over all compartments, you could use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for _, row in F.comps.iterrows():\n", "    print('%s-%s' % (row.name, row['is source']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the `.name` property can be used inside the loop in order to identify the code name for the row.\n", "\n", "Although simple, this operation is relatively computationally expensive because it creates a `Series` object for every row. If performance is critical (for example, during `Model.build()`) then it is better to use the `.at` indexer of the `DataFrame`. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F.comps.at['sus','is source']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a loop, you would then first look up the index values and then iterate over them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for comp_name in list(F.comps.index):\n", "    print('%s-%s' % (comp_name, F.comps.at[comp_name,'is source']))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:optima3]", "language": "python", "name": "conda-env-optima3-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }