tlo.analysis.utils module

General utility functions for TLO analysis

parse_log_file(log_filepath, level: int = 20)[source]

Parses logged output from a TLO run, splits it into smaller logfiles, and returns a class containing paths to these split logfiles.

Parameters:

  • log_filepath – file path to log file

  • level – parse everything from the given level

Returns:

a class containing paths to split logfiles

write_log_to_excel(filename, log_dataframes)[source]

Takes the output of parse_log_file() and creates an Excel file from dataframes


Returns a dictionary mapping calendar year to five-year period, i.e. { 1950: ‘1950-1954’, 1951: ‘1950-1954’, … }
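
As an illustration of the mapping described above, the following minimal sketch builds such a lookup in plain Python. The function name `sketch_calendar_period_lookup` and the year range (1950 to 2099) are assumptions made for this example; they are not taken from the TLO source.

```python
def sketch_calendar_period_lookup(first_year=1950, last_year=2099):
    """Map each calendar year to the label of its five-year period.

    Hypothetical helper: the name and the default year range are assumptions.
    """
    lookup = {}
    for year in range(first_year, last_year + 1):
        # Start of the five-year block that contains this year
        start = first_year + 5 * ((year - first_year) // 5)
        lookup[year] = f"{start}-{start + 4}"
    return lookup


calendar_lookup = sketch_calendar_period_lookup()
```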


Make an ordered categorical type for calendar periods. Returns a CategoricalDtype.


Returns a dictionary mapping age (in years) to five-year age-group, i.e. { 0: ‘0-4’, 1: ‘0-4’, …, 119: ‘100+’, 120: ‘100+’ }
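
A minimal sketch of this age-group lookup, assuming five-year bands below age 100 and a single open-ended band from 100 upwards (as the example values above suggest). The helper name and its arguments are hypothetical.

```python
def sketch_age_group_lookup(max_age=120, open_ended_from=100):
    """Map each exact age to the label of its age-group.

    Hypothetical helper: name, arguments and defaults are assumptions.
    """
    lookup = {}
    for age in range(0, max_age + 1):
        if age >= open_ended_from:
            # All ages in the open-ended band share one label
            lookup[age] = f"{open_ended_from}+"
        else:
            start = 5 * (age // 5)
            lookup[age] = f"{start}-{start + 4}"
    return lookup


age_lookup = sketch_age_group_lookup()

# Applying the lookup to a sequence of exact ages
age_groups = [age_lookup[a] for a in (0, 4, 5, 37, 99, 100, 120)]
```

The same kind of lookup, applied element-wise to a pd.Series of exact ages, gives the age-group labels that a categorical type can then order.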


Make an ordered categorical type for age-groups. Returns a CategoricalDtype.

to_age_group(_ages: Series)[source]

Return a pd.Series with age-group formatted as a categorical type, created from a pd.Series with exact age.

get_scenario_outputs(scenario_filename: str, outputs_dir: Path) → list[source]

Returns paths of folders associated with a batch_file, in chronological order.

get_scenario_info(scenario_output_dir: Path) → dict[source]

Utility function to get the number of draws and the number of runs in a batch set.

TODO: read the JSON file to get further information

load_pickled_dataframes(results_folder: Path, draw=0, run=0, name=None) → dict[source]

Utility function to create a dict containing all the logs from the specified run within a batch set.
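
The idea can be sketched as follows, under the assumption that each run's logs are stored as `*.pickle` files in a `<results_folder>/<draw>/<run>/` directory and that `name` selects a single pickle file by its stem. Both the layout and the helper name `sketch_load_pickles` are assumptions for illustration, not a description of the actual implementation.

```python
import pickle
import tempfile
from pathlib import Path


def sketch_load_pickles(results_folder, draw=0, run=0, name=None):
    """Load the pickles of one run into a dict keyed by file stem.

    Hypothetical helper: the folder layout and the meaning of `name`
    are assumptions made for this example.
    """
    folder = Path(results_folder) / str(draw) / str(run)
    pattern = f"{name}.pickle" if name is not None else "*.pickle"
    output = {}
    for path in sorted(folder.glob(pattern)):
        with path.open("rb") as f:
            output[path.stem] = pickle.load(f)
    return output


# Build a throwaway results folder with one pickled log, then load it back
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "0" / "0"
    run_dir.mkdir(parents=True)
    with (run_dir / "tlo.methods.demography.pickle").open("wb") as f:
        pickle.dump({"death": [2010, 2011]}, f)
    logs = sketch_load_pickles(tmp, draw=0, run=0)
```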

extract_params(results_folder: Path) → Optional[DataFrame][source]

Utility function to get overridden parameters from scenario runs

Returns a dataframe summarizing the parameters that change across the draws. It produces a dataframe with an index of draw and a column for each parameter that is specified to be varied in the batch. NB. This does the extraction from run 0 in each draw, under the assumption that the overwritten parameters are the same in each run.

extract_results(results_folder: Path, module: str, key: str, column: Optional[str] = None, index: Optional[str] = None, custom_generate_series=None, do_scaling: bool = False) → DataFrame[source]

Utility function to unpack results.

Produces a dataframe by extracting information from a log, with a column multi-index for the draw/run.

If the column to be extracted exists in the log, the name of the column is provided as column. If the resulting dataframe should be indexed by another column that exists in the log, this can be provided as index.

If, instead, some work must be done to generate a new column from the log, then a function can be provided to do this as custom_generate_series.

Optionally, with do_scaling=True, each element is multiplied by the scaling_factor recorded in the simulation.

Note that if runs in the batch have failed (such that logs have not been generated), these are dropped silently.
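
The behaviour described above (one series per draw/run, optional scaling, failed runs dropped silently) can be sketched in plain Python. This is a simplified stand-in that uses dicts instead of dataframes; `sketch_extract_results`, `deaths_by_year` and the structure of `logs_by_run` are all hypothetical and chosen only for illustration.

```python
def sketch_extract_results(logs_by_run, generate_series, do_scaling=False):
    """Collect one series per (draw, run), skipping runs without logs.

    Hypothetical stand-in: `logs_by_run` maps (draw, run) to a dict with the
    keys "log" (a list of records) and "scaling_factor", or to None if the
    run failed and produced no log.
    """
    results = {}
    for (draw, run), entry in sorted(logs_by_run.items()):
        if entry is None:
            # Failed run: no log was generated, so it is dropped silently
            continue
        series = generate_series(entry["log"])
        if do_scaling:
            # Multiply each element by the scaling factor recorded for the run
            series = {k: v * entry["scaling_factor"] for k, v in series.items()}
        results[(draw, run)] = series
    return results


def deaths_by_year(log):
    """Example custom function: count logged deaths per year."""
    counts = {}
    for record in log:
        counts[record["year"]] = counts.get(record["year"], 0) + 1
    return counts


logs_by_run = {
    (0, 0): {"log": [{"year": 2010}, {"year": 2010}, {"year": 2011}], "scaling_factor": 100.0},
    (0, 1): None,  # this run failed
    (1, 0): {"log": [{"year": 2010}], "scaling_factor": 100.0},
}
results = sketch_extract_results(logs_by_run, deaths_by_year, do_scaling=True)
```

Because the failed run (0, 1) is simply absent from the result, callers that need to know how many runs contributed to each draw have to check the keys themselves.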

summarize(results: DataFrame, only_mean: bool = False, collapse_columns: bool = False) → DataFrame[source]

Utility function to compute summary statistics

Finds mean value and 95% interval across the runs for each draw.
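
A minimal sketch of this kind of summary, assuming that the 95% interval is taken as the 2.5th and 97.5th percentiles of the values across runs (the text above does not state how the interval is computed). The helper name and the output keys "mean", "lower" and "upper" are hypothetical.

```python
from statistics import mean, quantiles


def sketch_summarize(results_by_draw):
    """Summarize the run values of each draw as mean and 95% interval.

    Hypothetical helper: `results_by_draw` maps draw -> list of run values.
    """
    summary = {}
    for draw, run_values in results_by_draw.items():
        # n=40 gives cut points every 2.5%; the first and last cut points
        # are the 2.5th and 97.5th percentiles
        cuts = quantiles(run_values, n=40, method="inclusive")
        summary[draw] = {
            "mean": mean(run_values),
            "lower": cuts[0],
            "upper": cuts[-1],
        }
    return summary


summary = sketch_summarize({0: [1.0, 2.0, 3.0, 4.0, 5.0], 1: [10.0, 10.0, 10.0]})
```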

get_grid(params: DataFrame, res: Series)[source]

Utility function to create the arrays needed to plot a heatmap.

Parameters:

  • params (pd.DataFrame) – the dataframe of parameters with index=draw (made using extract_params()).

  • res (pd.Series) – results of interest with index=draw (can be derived from the output of extract_results()).

Returns:

grid as dictionary
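
To illustrate what "arrays needed to plot a heatmap" might look like, the sketch below arranges per-draw results on a two-dimensional grid defined by two varied parameters. It uses plain dicts and lists rather than pandas objects, and the helper name, the parameter names `beta` and `gamma`, and the keys of the returned dictionary are assumptions for this example.

```python
def sketch_get_grid(params, res):
    """Arrange per-draw results on a 2-D grid defined by two parameters.

    Hypothetical stand-in: `params` maps draw -> {param_name: value} with
    exactly two parameters; `res` maps draw -> result.
    """
    x_name, y_name = sorted(next(iter(params.values())))
    xs = sorted({p[x_name] for p in params.values()})
    ys = sorted({p[y_name] for p in params.values()})
    # One row per y value, one column per x value; None marks a missing cell
    z = [[None] * len(xs) for _ in ys]
    for draw, p in params.items():
        z[ys.index(p[y_name])][xs.index(p[x_name])] = res[draw]
    return {x_name: xs, y_name: ys, "z": z}


params = {
    0: {"beta": 0.1, "gamma": 1},
    1: {"beta": 0.2, "gamma": 1},
    2: {"beta": 0.1, "gamma": 2},
    3: {"beta": 0.2, "gamma": 2},
}
res = {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}
grid = sketch_get_grid(params, res)
```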

format_gbd(gbd_df: DataFrame)[source]

Format GBD data to give standardized categories for age_group and period.

create_pickles_locally(scenario_output_dir, compressed_file_name_prefix=None)[source]

For a run from the Batch system that has not resulted in the creation of the pickles, reconstruct the pickles locally.

compare_number_of_deaths(logfile: Path, resourcefilepath: Path)[source]

Helper function to produce tables summarising deaths in the model run (given by a logfile) and the corresponding number of deaths in the GBD dataset. NB.

  • Requires output from the module tlo.methods.demography

  • Will do scaling automatically if the scaling-factor has been computed in the simulation (but not otherwise).

flatten_multi_index_series_into_dict_for_logging(ser: Series) → dict[source]

Helper function that converts a pd.Series with a multi-index into a dict format that is suitable for logging. It does this by converting the multi-index into keys of type str, in a format that can later be used to reconstruct the multi-index (using unflatten_flattened_multi_index_in_logging).

unflatten_flattened_multi_index_in_logging(_x: [DataFrame, Index]) → [DataFrame, Index][source]

Helper function that recreates the multi-index of logged results from a pd.DataFrame that is generated by parse_log.

If a pd.DataFrame created by parse_log is the result of repeated logging of a pd.Series with a multi-index that was transformed before logging using flatten_multi_index_series_into_dict_for_logging, then the pd.DataFrame’s columns will be those flattened labels. This helper function recreates the original multi-index from which the flattened labels were created and applies it to the pd.DataFrame.

Alternatively, if just the index of the “flattened” labels is provided, then the equivalent multi-index is returned.
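
The flatten/unflatten round trip can be sketched with plain dicts whose keys are tuples (standing in for a multi-index). The delimiter `"|"` and both helper names are assumptions for this example; the actual string format used by the two functions is not given here.

```python
def sketch_flatten(series, sep="|"):
    """Turn tuple keys into delimited strings (assumed delimiter: "|")."""
    return {sep.join(str(level) for level in key): value for key, value in series.items()}


def sketch_unflatten(flat, sep="|"):
    """Rebuild tuple keys from delimited strings; levels come back as str."""
    return {tuple(key.split(sep)): value for key, value in flat.items()}


deaths = {("F", "0-4"): 3, ("M", "0-4"): 5}
flat = sketch_flatten(deaths)        # keys are now plain strings
restored = sketch_unflatten(flat)    # keys are tuples again
```

In this sketch every level is returned as a string, so the round trip is only exact when the original levels were strings, and the delimiter must not occur inside any level value.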

class LogsDict(file_names_and_paths, level)[source]

Bases: Mapping

Parses module-specific log files and returns Pandas dataframes.

The dictionary returned has the format:

    <logger 1 name>: {
                       <log key 1>: <pandas dataframe>,
                       <log key 2>: <pandas dataframe>,
                       <log key 3>: <pandas dataframe>
                     },

    <logger 2 name>: {
                       <log key 4>: <pandas dataframe>,
                       <log key 5>: <pandas dataframe>,
                       <log key 6>: <pandas dataframe>
                     }

items() → a set-like object providing a view on D's items[source]

keys() → a set-like object providing a view on D's keys[source]

values() → an object providing a view on D's values[source]

get_filtered_treatment_ids(depth: Optional[int] = None) → List[str][source]

Return a list of treatment_ids that are defined in the model, filtered to a specified depth.

colors_in_matplotlib() → tuple[source]

Return tuple of the strings for all the colours defined in Matplotlib.

get_coarse_appt_type(appt_type: str) → str[source]

Return the coarser categorization of appt_types for a given appt_type.

order_of_coarse_appt(_coarse_appt: Union[str, Index]) → Union[int, Index][source]

Define a standard order for the coarse appointment types.

get_color_coarse_appt(coarse_appt_type: str) → str[source]

Return the colour (as matplotlib string) assigned to this appointment type. Returns np.nan if appointment-type is not recognised. Names of colors are selected with reference to:

order_of_short_treatment_ids(_short_treatment_id: Union[str, Index]) → Union[int, Index][source]

Define a standard order for short treatment_ids.

get_color_short_treatment_id(short_treatment_id: str) → str[source]

Return the colour (as matplotlib string) assigned to this shortened TREATMENT_ID. Returns np.nan if the treatment_id is not recognised.

order_of_cause_of_death_label(_cause_of_death_label: Union[str, Index]) → Union[int, Index][source]

Define a standard order for Cause-of-Death labels.

get_color_cause_of_death_label(cause_of_death_label: str) → str[source]

Return the colour (as matplotlib string) assigned to this Cause-of-Death label. Returns np.nan if the label is not recognised.

squarify_neat(sizes: array, label: array, colormap: Callable, numlabels=5, **kwargs)[source]

Pass through to squarify, with some customisation: …

  • Apply the colormap specified

  • Only give labels to a selection of the segments

N.B. The package squarify is required.

get_root_path(starter_path: Optional[Path] = None) → Path[source]

Returns the absolute path of the top level of the repository. starter_path optionally gives a reference location from which to begin the search; if omitted, the location of this file is used.
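
One common way to implement such a search is to walk upwards from the starting location until a directory containing a repository marker is found. The sketch below does this under two assumptions: the marker is a `.git` entry, and, to stay runnable outside a file, the fallback starting point is the current working directory rather than the location of the module file. The helper name is hypothetical and the actual function may locate the root differently.

```python
import tempfile
from pathlib import Path


def sketch_get_root_path(starter_path=None, marker=".git"):
    """Walk upwards from `starter_path` to the first directory holding `marker`.

    Hypothetical helper: the marker and the cwd fallback are assumptions.
    """
    start = Path(starter_path if starter_path is not None else Path.cwd()).resolve()
    # Check the starting directory itself (if it is one), then each parent
    candidates = [start] if start.is_dir() else []
    candidates += list(start.parents)
    for candidate in candidates:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} found above {start}")


# Build a throwaway "repository" and search from a nested folder
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve()
    (repo / ".git").mkdir()
    nested = repo / "src" / "scripts"
    nested.mkdir(parents=True)
    found = sketch_get_root_path(nested)
```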

plot_clustered_stacked(dfall, ax, color_for_column_map=None, legends=True, H='/', **kwargs)[source]

Given a dict of dataframes, with identical columns and index, create a clustered stacked bar plot.

  • H is the hatch used to identify the different dataframes.

  • color_for_column_map should return a color for every column in the dataframes.

  • legends=False suppresses generation of the legends.

From: