tlo.analysis.utils module
General utility functions for TLO analysis
- parse_log_file(log_filepath, level: int = 20)
Parses logged output from a TLO run, splits it into smaller logfiles, and returns a class containing paths to these split logfiles.
- Parameters:
log_filepath – file path to log file
level – parse everything from the given level
- Returns:
a class containing paths to split logfiles
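A minimal sketch of typical usage; the log path and the logger/key names are illustrative assumptions, not guaranteed to be present in a given run:

```python
from pathlib import Path

from tlo.analysis.utils import parse_log_file

# Parse a completed run's log file (path is illustrative)
output = parse_log_file(Path("outputs/simulation.log"))

# The result behaves like a nested mapping: logger name -> log key -> DataFrame
# (see the LogsDict class below; logger and key names here are assumptions)
deaths_df = output["tlo.methods.demography"]["death"]
print(deaths_df.head())
```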
- merge_log_files(log_path_1: Path, log_path_2: Path, output_path: Path) → None
Merge two log files, skipping any repeated header lines.
- Parameters:
log_path_1 – Path to first log file to merge. Records from this log file will appear first in merged log file.
log_path_2 – Path to second log file to merge. Records from this log file will appear after those in log file at log_path_1 and any header lines in this file which are also present in log file at log_path_1 will be skipped.
output_path – Path to write merged log file to. Must not be one of log_path_1 or log_path_2 as data is read from files while writing to this path.
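A minimal usage sketch (paths are illustrative):

```python
from pathlib import Path

from tlo.analysis.utils import merge_log_files

# Records from part1 appear first; header lines repeated in part2 are skipped
merge_log_files(
    log_path_1=Path("outputs/run_part1.log"),
    log_path_2=Path("outputs/run_part2.log"),
    output_path=Path("outputs/run_merged.log"),  # must not be either input path
)
```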
- write_log_to_excel(filename, log_dataframes)
Takes the output of parse_log_file() and creates an Excel file from the dataframes.
- make_calendar_period_lookup()
Returns a dictionary mapping calendar year (in years) to five year period i.e. { 1950: ‘1950-1954’, 1951: ‘1950-1954’, …}
- make_calendar_period_type()
Make an ordered categorical type for calendar periods. Returns CategoricalDtype.
- make_age_grp_lookup()
Returns a dictionary mapping age (in years) to five year period i.e. { 0: ‘0-4’, 1: ‘0-4’, …, 119: ‘100+’, 120: ‘100+’ }
- make_age_grp_types()
Make an ordered categorical type for age-groups. Returns CategoricalDtype.
- to_age_group(_ages: Series)
Return a pd.Series with age-group formatted as a categorical type, created from a pd.Series with exact age.
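A short sketch showing how these age-group helpers fit together (values are illustrative):

```python
import pandas as pd

from tlo.analysis.utils import make_age_grp_types, to_age_group

# Exact ages -> five-year age-group labels as an ordered categorical
ages = pd.Series([3, 17, 42, 105])
age_groups = to_age_group(ages)  # e.g. '0-4', '15-19', '40-44', '100+'

# The ordered categorical dtype can also be applied to existing label columns,
# so that sorting respects age order rather than string order
labels = pd.Series(["100+", "0-4", "15-19"]).astype(make_age_grp_types())
print(labels.sort_values())
```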
- get_scenario_outputs(scenario_filename: str, outputs_dir: Path) → list
Returns paths of folders associated with a batch_file, in chronological order.
- get_scenario_info(scenario_output_dir: Path) → dict
Utility function to get the number of draws and the number of runs in a batch set.
TODO: read the JSON file to get further information
- load_pickled_dataframes(results_folder: Path, draw=0, run=0, name=None) → dict
Utility function to create a dict containing all the logs from the specified run within a batch set.
- extract_draw_names(results_folder: Path) → dict[int, str]
Returns dict keyed by the draw-number giving the ‘draw-name’ declared for that draw in the Scenario at draw_names().
- extract_params(results_folder: Path, use_draw_names: bool = False) → DataFrame | None
Utility function to get overridden parameters from scenario runs
Returns a dataframe summarizing the parameters that change across the draws, with an index of draw and a column for each parameter that is specified to be varied in the batch. NB. This does the extraction from run 0 in each draw, under the assumption that the overridden parameters are the same in each run.
- extract_results(results_folder: Path, module: str, key: str, column: str = None, index: str = None, custom_generate_series=None, do_scaling: bool = False) → DataFrame
Utility function to unpack results.
Produces a dataframe from extracting information from a log with the column multi-index for the draw/run.
- If the column to be extracted exists in the log, the name of the column is provided as column. If the resulting dataframe should be indexed by another column that exists in the log, this can be provided as index.
- If instead some work must be done to generate a new column from the log, then a function to do this can be provided as custom_generate_series.
Optionally, with do_scaling=True, each element is multiplied by the scaling_factor recorded in the simulation.
Note that if runs in the batch have failed (such that logs have not been generated), these are dropped silently.
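A sketch of the two modes of use described above; the scenario filename and the column names in the log (‘age’, ‘date’) are assumptions for illustration:

```python
from pathlib import Path

from tlo.analysis.utils import extract_results, get_scenario_outputs

# Most recent outputs folder for a scenario (filename is illustrative)
results_folder = get_scenario_outputs("my_scenario.py", Path("outputs"))[-1]

# 1) Extract an existing column, indexed by another logged column
deaths = extract_results(
    results_folder,
    module="tlo.methods.demography",
    key="death",
    column="age",   # assumed column name in the log
    index="date",   # assumed column name in the log
)

# 2) Derive the quantity of interest with a custom function, with scaling
num_deaths_by_year = extract_results(
    results_folder,
    module="tlo.methods.demography",
    key="death",
    custom_generate_series=lambda df: df.groupby(df["date"].dt.year).size(),
    do_scaling=True,  # multiply by the simulation's scaling_factor
)
```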
- compute_summary_statistics(results: DataFrame, central_measure: Literal['mean', 'median'] | None = None, width_of_range: float = 0.95, use_standard_error: bool = False, only_central: bool = False, collapse_columns: bool = False) → DataFrame
Utility function to compute summary statistics
- Finds a central value and a specified interval across the runs for each draw. By default, this uses a central
measure of the median and a 95% interval range.
- Parameters:
results – The dataframe of results to compute summary statistics of.
central_measure – The name of the central measure to use - either ‘mean’ or ‘median’ (defaults to ‘median’)
width_of_range – The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
use_standard_error – Whether the range should represent the standard error; otherwise it is just a description of the variation of runs. If selected, then the central measure is always the mean.
collapse_columns – Whether to simplify the columnar index if there is only one run (cannot be done otherwise).
only_central – Whether to only report the central value (dropping the range).
- Returns:
A dataframe with computed summary statistics.
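Continuing the sketch above, summarising across the runs within each draw (the exact statistic labels in the output columns are an assumption):

```python
from tlo.analysis.utils import compute_summary_statistics

# `num_deaths_by_year` has a (draw, run) column multi-index from extract_results
summary = compute_summary_statistics(
    num_deaths_by_year,
    central_measure="median",
    width_of_range=0.95,
)
# Columns are now keyed by draw and summary statistic (e.g. a central value
# with a lower/upper range; exact labels are an assumption here)
print(summary.head())
```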
- summarize(results: DataFrame, only_mean: bool = False, collapse_columns: bool = False)
Utility function to compute summary statistics
Finds mean value and 95% interval across the runs for each draw.
- NOTE: This provides the legacy functionality of summarize, which is hard-wired to use means (the kwarg is only_mean and the name of the column in the output is mean). Please move to using the newer and more flexible compute_summary_statistics(), which allows the use of medians and is designed to accommodate other forms of summary measure in the future.
- get_grid(params: DataFrame, res: Series)
Utility function to create the arrays needed to plot a heatmap.
- Parameters:
params (pd.DataFrame) – the dataframe of parameters with index=draw (made using extract_params()).
res (pd.Series) – results of interest with index=draw (can be made using extract_results())
- Returns:
grid as dictionary
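A hedged sketch of preparing the heatmap inputs, continuing from the steps above; `results_folder` is as obtained earlier and `res` is assumed to be a pd.Series of a result of interest with index=draw:

```python
from tlo.analysis.utils import extract_params, get_grid

# Dataframe of varied parameters (index=draw) and a results series (index=draw)
params = extract_params(results_folder)
grid = get_grid(params, res)

# The returned dict holds the arrays needed for a heatmap; inspect its keys
# before passing the 2-D arrays to, e.g., matplotlib's pcolormesh
print(grid.keys())
```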
- format_gbd(gbd_df: DataFrame)
Format GBD data to give standardised categories for age_group and period.
- create_pickles_locally(scenario_output_dir, compressed_file_name_prefix=None)
For a run from the Batch system that has not resulted in the creation of the pickles, reconstruct the pickles locally.
- compare_number_of_deaths(logfile: Path, resourcefilepath: Path)
Helper function to produce tables summarising deaths in the model run (given by a logfile) and the corresponding number of deaths in the GBD dataset. NB.
- Requires output from the module tlo.methods.demography
- Will do scaling automatically if the scaling-factor has been computed in the simulation (but not otherwise).
- flatten_multi_index_series_into_dict_for_logging(ser: Series) → dict
Helper function that converts a pd.Series with multi-index into a dict format that is suitable for logging. It does this by converting the multi-index into keys of type str in a format that can later be used to reconstruct the multi-index (using unflatten_flattened_multi_index_in_logging).
- unflatten_flattened_multi_index_in_logging(_x: DataFrame | Index) → DataFrame | Index
Helper function that recreates the multi-index of logged results from a pd.DataFrame that is generated by parse_log.
If a pd.DataFrame created by parse_log is the result of repeated logging of a pd.Series with a multi-index that was transformed before logging using flatten_multi_index_series_into_dict_for_logging, then the pd.DataFrame’s columns will be those flattened labels. This helper function recreates the original multi-index from which the flattened labels were created and applies it to the pd.DataFrame.
Alternatively, if just the index of the “flattened” labels is provided, then the equivalent multi-index is returned.
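A small sketch of the flattening step (the series contents are illustrative):

```python
import pandas as pd

from tlo.analysis.utils import flatten_multi_index_series_into_dict_for_logging

# A multi-indexed series, as a module might want to log
ser = pd.Series(
    [10, 20, 30, 40],
    index=pd.MultiIndex.from_product(
        [["F", "M"], ["0-4", "5-9"]], names=["sex", "age_grp"]
    ),
)

flat = flatten_multi_index_series_into_dict_for_logging(ser)
# `flat` is a dict with str keys suitable for logging; after parsing the log,
# unflatten_flattened_multi_index_in_logging(df) restores the multi-index columns
```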
- class LogsDict(file_names_and_paths, level)
Bases: Mapping
Parses module-specific log files and returns Pandas dataframes.
The dictionary returned has the format:
{
    <logger 1 name>: {
        <log key 1>: <pandas dataframe>,
        <log key 2>: <pandas dataframe>,
        <log key 3>: <pandas dataframe>,
    },
    <logger 2 name>: {
        <log key 4>: <pandas dataframe>,
        <log key 5>: <pandas dataframe>,
        <log key 6>: <pandas dataframe>,
    },
    ...
}
- items() → a set-like object providing a view on D's items
- keys() → a set-like object providing a view on D's keys
- values() → an object providing a view on D's values
- get_filtered_treatment_ids(depth: int | None = None) → List[str]
Return a list of treatment_ids that are defined in the model, filtered to a specified depth.
- colors_in_matplotlib() → tuple
Return tuple of the strings for all the colours defined in Matplotlib.
- get_coarse_appt_type(appt_type: str) → str
Return the coarser categorization of appt_types for a given appt_type.
- order_of_coarse_appt(_coarse_appt: str | Index) → int | Index
Define a standard order for the coarse appointment types.
- get_color_coarse_appt(coarse_appt_type: str) → str
Return the colour (as matplotlib string) assigned to this appointment type.
Returns np.nan if appointment-type is not recognised.
Names of colors are selected with reference to: https://i.stack.imgur.com/lFZum.png
- order_of_short_treatment_ids(short_treatment_id: str | Index) → int | Index
Define a standard order for short treatment_ids.
- get_color_short_treatment_id(short_treatment_id: str) → str
Return the colour (as matplotlib string) assigned to this short TREATMENT_ID.
Returns np.nan if treatment_id is not recognised.
- order_of_cause_of_death_or_daly_label(cause_of_death_label: str | Index) → int | Index
Define a standard order for Cause-of-Death labels.
- get_color_cause_of_death_or_daly_label(cause_of_death_label: str) → str
Return the colour (as matplotlib string) assigned to this Cause-of-Death Label.
Returns np.nan if label is not recognised.
- squarify_neat(sizes: array, label: array, colormap: Callable = None, numlabels: int = 5, **kwargs)
Pass through to squarify, with some customisation:
- Apply the colormap specified
- Only give a label to a selection of the segments
N.B. The package squarify is required.
- get_root_path(starter_path: Path | None = None) → Path
Returns the absolute path of the top level of the repository. starter_path optionally gives a reference location from which to begin search; if omitted the location of this file is used.
- bin_hsi_event_details(results_folder: Path, get_counter_from_event_details: callable, start_date: Timestamp, end_date: Timestamp, do_scaling: bool = False) → Dict[Tuple[int, int], Counter]
Bin logged HSI event details into dictionary of counters for each draw and run.
- Parameters:
results_folder – Path to folder containing scenario outputs.
get_counter_from_event_details – Callable which, when passed an event details dictionary and a count, returns a Counter instance keyed by the properties to bin over.
start_date – Start date to filter log entries by when accumulating counts.
end_date – End date to filter log entries by when accumulating counts.
do_scaling – Whether to scale counts by population scaling factor value recorded in tlo.methods.population log.
- Returns:
Dictionary keyed by (draw, run) tuples with corresponding values the counters containing the binned event detail property counts for the corresponding scenario draw and run.
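A sketch of binning by TREATMENT_ID; the availability of a 'treatment_id' key in the event details dictionary is an assumption here, and results_folder is as obtained from get_scenario_outputs above:

```python
from collections import Counter

import pandas as pd

from tlo.analysis.utils import bin_hsi_event_details, compute_mean_across_runs

# Illustrative callable: bin counts by the event's TREATMENT_ID
counters_by_draw_and_run = bin_hsi_event_details(
    results_folder,
    get_counter_from_event_details=lambda details, count: Counter(
        {details["treatment_id"]: count}  # assumed key in the details dict
    ),
    start_date=pd.Timestamp("2010-01-01"),
    end_date=pd.Timestamp("2019-12-31"),
    do_scaling=True,
)

# Average the per-run counters within each draw
mean_counters = compute_mean_across_runs(counters_by_draw_and_run)
```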
- compute_mean_across_runs(counters_by_draw_and_run: Dict[Tuple[int, int], Counter]) → Dict[int, Counter]
Compute mean across scenario runs of dict of counters keyed by draw and run.
- Parameters:
counters_by_draw_and_run – Dictionary keyed by (draw, run) tuples with counter values.
- Returns:
Dictionary keyed by draw with counter values corresponding to mean of counters across all runs for each draw.
- plot_stacked_bar_chart(ax: Axes, binned_counts: Counter, inner_group_cmap: Dict | None = None, bar_width: float = 0.5, count_scale: float = 1.0)
Plot a stacked bar chart using count data binned over two levels of grouping.
- Parameters:
ax – Matplotlib axis to add bar chart to.
binned_counts – Counts keyed by pairs of string keys corresponding to the inner and outer groups the binning was performed over.
inner_group_cmap – Map from inner group keys to colors to plot the corresponding bars with. If None, the default color cycle will be used.
bar_width – Width of each bar as a proportion of the space between bars.
count_scale – Scaling factor to multiply all counts by.
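A minimal sketch (group labels and counts are illustrative; see the binned_counts description above for the key structure):

```python
from collections import Counter

import matplotlib.pyplot as plt

from tlo.analysis.utils import plot_stacked_bar_chart

# Counts keyed by pairs of string group labels (names are illustrative)
binned_counts = Counter(
    {
        ("2010", "Pharmacist"): 120,
        ("2010", "Nurse"): 80,
        ("2011", "Pharmacist"): 150,
        ("2011", "Nurse"): 90,
    }
)

fig, ax = plt.subplots()
plot_stacked_bar_chart(ax, binned_counts)
plt.show()
```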
- plot_clustered_stacked(dfall, ax, color_for_column_map=None, scaled=False, legends=True, H='/', **kwargs)
Given a dict of dataframes with identical columns and index, create a clustered stacked bar plot.
- H is the hatch used for identification of the different dataframes.
- color_for_column_map should return a color for every column in the dataframes.
- legends=False suppresses generation of the legends.
With scaled=True, the height of the stacked-bar is scaled to 1.0.
From: https://stackoverflow.com/questions/22787209/how-to-have-clusters-of-stacked-bars
- get_mappers_in_fullmodel(resourcefilepath: Path, outputpath: Path)
Returns the cause-of-death, cause-of-disability and cause-of-DALYs mappers that are created in a run of the fullmodel.
- get_parameters_for_status_quo() → Dict
Returns a dictionary of parameters and their updated values to indicate the “Status Quo” scenario. This is the configuration that is the target of calibrations.
The return dict is in the form, e.g.:
    {
        'Depression': {
            'pr_assessed_for_depression_for_perinatal_female': 1.0,
            'pr_assessed_for_depression_in_generic_appt_level1': 1.0,
        },
        'Hiv': {
            'prob_start_art_or_vs': 1.0,
        },
    }
- get_parameters_for_standard_mode2_runs() → Dict
Returns a dictionary of parameters and their updated values to indicate the “standard mode 2” scenario.
The return dict is in the form, e.g.:
    {
        'Depression': {
            'pr_assessed_for_depression_for_perinatal_female': 1.0,
            'pr_assessed_for_depression_in_generic_appt_level1': 1.0,
        },
        'Hiv': {
            'prob_start_art_or_vs': 1.0,
        },
    }
- get_parameters_for_hrh_historical_scaling_and_rescaling_for_mode2() → Dict
Returns a dictionary of parameters and their updated values for scenario runs that involve: a mode switch from 1 to 2 in 2020; rescaling of HRH capabilities to effective capabilities at the end of 2019 (the year before the mode switch); and HRH historical scaling from 2020 to 2024.
The return dict is in the form, e.g.:
    {
        'Depression': {
            'pr_assessed_for_depression_for_perinatal_female': 1.0,
            'pr_assessed_for_depression_in_generic_appt_level1': 1.0,
        },
        'Hiv': {
            'prob_start_art_or_vs': 1.0,
        },
    }
- get_parameters_for_improved_healthsystem_and_healthcare_seeking(resourcefilepath: Path, max_healthsystem_function: bool | None = False, max_healthcare_seeking: bool | None = False) → Dict
Returns a dictionary of parameters and their updated values to indicate an ideal healthcare system in terms of maximum health system function, and/or maximum healthcare seeking.
The return dict is in the form, e.g.:
    {
        'Depression': {
            'pr_assessed_for_depression_for_perinatal_female': 1.0,
            'pr_assessed_for_depression_in_generic_appt_level1': 1.0,
        },
        'Hiv': {
            'prob_start_art_or_vs': <<the dataframe named in the corresponding cell in the ResourceFile>>,
        },
    }
- mix_scenarios(*dicts) → Dict
Helper function to combine Dicts that show which parameters should be overwritten.
- If a parameter appears in more than one Dict, the value in the last-added dict is taken, and a UserWarning is raised.
- Items under the same top-level key (i.e., for the Module) are merged rather than being overwritten.
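A small sketch combining the helpers above; the override value is illustrative and the parameter name is taken from the example dicts above:

```python
from tlo.analysis.utils import get_parameters_for_status_quo, mix_scenarios

# Start from the status-quo parameters, then override one value; the later dict
# wins for any parameter that appears twice (a UserWarning is raised), while
# other items under the same module key ('Hiv') are merged rather than replaced
overrides = mix_scenarios(
    get_parameters_for_status_quo(),
    {"Hiv": {"prob_start_art_or_vs": 0.8}},  # illustrative override
)
```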