Skip to content

Avoidable Admissions Analysis Pipeline Example

Admitted Care Data Set

Site: Lancashire Teaching Hospitals NHS Trust
Created: 2023-02-04

Lead site: SCHARR Institute, Sheffield University

References

%load_ext autoreload
%autoreload 2

from datetime import datetime

now = datetime.today()
print("Starting pipeline execution at", now)
Starting pipeline execution at 2023-02-05 11:51:15.151223
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv("../.env")

dir_data = Path(os.getenv("DIR_DATA"))
dir_data
WindowsPath('T:/Business Intelligence/Data Science/Work/hdruk_avoidable_adms')
import numpy as np
import pandas as pd
from IPython.display import HTML

from avoidable_admissions.data.validate import (
    AdmittedCareEpisodeSchema,
    AdmittedCareFeatureSchema,
    get_schema_properties,
    validate_admitted_care_data,
    validate_admitted_care_features,
    validate_dataframe,
)
from avoidable_admissions.features import feature_maps
from avoidable_admissions.features.build_features import build_admitted_care_features

Load Data

# df_admcare = pd.read_csv(dir_data.joinpathpath('raw', 'admitted_care.csv')
df_admcare = pd.read_csv("../data/raw/admitted_care.csv")
# Create a copy of the data to fix DQ issues to avoid reloading data from source everytime
dfa = df_admcare.copy()

"Raw dataframe contains %d rows and %d columns" % dfa.shape
'Raw dataframe contains 47483 rows and 63 columns'

Pipeline Overview

See https://lthtr-dst.github.io/hdruk_avoidable_admissions/pipeline/ for more details.

First Validation

This is done using the raw data typically extracted from SQL databases maintained by Business Intelligence.

Receiving the data in a format that matches the Sheffield specification as closely as possible will make subsequent steps easier.

Always, always, have friends in BI. Who is your Quin?

Admitted Care Episode Validation Rules

Use get_schema_properties(Schema) for an overview of what the expectations of the first validation schema are.

Help improve the validation rules and documentation, eg. title, description, etc. by contributing to the GitHub repo or raising an issue.

schema_1 = (
    get_schema_properties(AdmittedCareEpisodeSchema)
    .sort_values("name")
    .set_index("name")
)
schema_1
dtype nullable unique coerce required checks regex title
name
admiage int64 False False False True [<Check greater_than_or_equal_to: greater_than_or_equal_to(18)>, <Check less_than_or_equal_to: less_than_or_equal_to(130)>] False None
admidate date False False False True [<Check greater_than_or_equal_to: greater_than_or_equal_to(2021-10-01)>, <Check less_than_or_equal_to: less_than_or_equal_to(2022-09-30)>] False None
admimeth str True False False True [<Check isin: isin({'31', '32', '28', '12', '24', '2A', '98', '82', '23', '81', '25', '13', '83', '2C', '21', '2D', '22', '2B', '11', '99'})>] False None
admisorc str True False False True [<Check isin: isin({'53', '54', '66', '55', '65', '39', '37', '19', '98', '87', '79', '51', '88', '86', '52', '40', '56', '85', '29', '49', '99'})>] False None
admitime str True False True True [<Check str_matches: str_matches(re.compile('2[0-3]|[01]?[0-9]:[0-5][0-9]'))>] False None
diag_01 str True False False True [] False None
diag_[0-9]{2} str True False False True [] True None
disdest str True False False True [<Check isin: isin({'53', '54', '66', '48', '30', '65', '49', '39', '37', '19', '98', '87', '79', '84', '51', '88', '52', '85', '50', '29', '38', '99'})>] False None
dismeth str True False False True [<Check isin: isin({'2', '8', '4', '1', '5', '9', '3'})>] False None
epiorder int64 True False False True [<Check greater_than_or_equal_to: greater_than_or_equal_to(0)>] False None
ethnos str False False False True [<Check isin: isin({'C', 'H', 'M', 'N', 'J', 'R', 'B', 'L', 'A', 'G', 'K', 'S', 'P', 'E', 'Z', 'F', '99', 'D'})>] False None
gender str False False False True [<Check isin: isin({'2', 'X', '1', '9', '0'})>] False None
length_of_stay float64 True False False True [<Check greater_than_or_equal_to: greater_than_or_equal_to(0)>] False None
opdate_01 datetime64[ns] True False False True [] False None
opdate_[0-9]{2} datetime64[ns] True False False True [] True None
opertn_01 str True False False True [] False None
opertn_[0-9]{2} str True False False True [] True None
patient_id int64 False False False True [] False None
procodet str False False False True [] False None
sitetret str False False False True [] False None
townsend_score_decile int64 False False False True [<Check greater_than_or_equal_to: greater_than_or_equal_to(0)>, <Check less_than_or_equal_to: less_than_or_equal_to(10)>] False None
visit_id int64 False True False True [] False None

Validate using AdmittedCareEpisodeSchema

The first time we try to validate the raw data, there are several errors for multiple reasons.

Spending a few minutes reading the schema validation error messages as well as reading the documentation for Pandera will save days for this and future projects.

validate_dataframe(dfa, AdmittedCareEpisodeSchema) is equivalent to validate_admitted_care_data(dfa)

good, bad = validate_dataframe(dfa, AdmittedCareEpisodeSchema)

print("Good dataframe has %d rows" % good.shape[0])
print("Bad dataframe has %d rows" % bad.shape[0])
Schema AdmittedCareEpisodeSchema: A total of 18 schema errors were found.

Error Counts
------------
- column_not_in_dataframe: 1
- schema_component_check: 17

Schema Error Summary
--------------------
                                                                                                                                                                                                                                     failure_cases  n_failes
schema_context  column                check
DataFrameSchema <NA>                  column_in_dataframe                                                                                                                                                                               [visit_id]         1
Column          admiage               greater_than_or_equal_to(18)                                                                                                                                   [0,3,12,14,4,15,2,6,11,5,1,16,17,10,7,8,13,9]        18
                admidate              greater_than_                                                                                                               [TypeError("'>=' not supported between instances of 'str' and 'datetime.date'")]         1
                                      less_than_or_equal_to(2022-09-30)                                                                                           [TypeError("'<=' not supported between instances of 'str' and 'datetime.date'")]         1
                admimeth              isin({'31','32','28','12','24','2A','98','82','23','81','25','13','83','2C','21','2D','22','2B','11','99'})                                                                                             [27]         1
                admisorc              dtype('str')                                                                                                                                                                                         [int64]         1
                                      isin({'53','54','66','55','65','39','37','19','98','87','79','51','88','86','52','40','56','85','29','49','99'})                                                 [51,19,66,52,29,56,99,88,40,49,33,87,79,53]        14
                disdest               dtype('str')                                                                                                                                                                                         [int64]         1
                                      isin({'53','54','66','48','30','65','49','39','37','19','98','87','79','84','51','88','52','85','50','29','38','99'})                       [19,99,79,88,29,51,52,85,84,54,38,50,37,66,65,53,87,49,48,98,30]        21
                dismeth               dtype('str')                                                                                                                                                                                         [int64]         1
                                      isin({'2','8','4','1','5','9','3'})                                                                                                                                                              [1,4,2,8,9]         5
                gender                dtype('str')                                                                                                                                                                                         [int64]         1
                                      isin({'2','X','1','9','0'})                                                                                                                                                                          [2,1,9]         3
                opdate_01             dtype('datetime64[ns]')                                                                                                                                                                             [object]         1
                townsend_score_decile dtype('int64')                                                                                                                                                                                     [float64]         1
                                      not_nullable                                                                                                                                                                                           [nan]         1

Usage Tip
---------

Directly inspect all errors by catching the exception:

```
try:
    schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
    err.failure_cases  # dataframe of schema errors
    err.data  # invalid dataframe
```

No data will pass validation due to column error. See output above.
Good dataframe has 0 rows
Bad dataframe has 246321 rows

Iterative DQ Fixes

Fix one column or one error at a time and rerun validate_dataframe(dfa, AdmittedCareEpisodeSchema) in the previous cell to check if it has worked.

If you have made an error in data transformation, rerun dfa = df_admcare.copy() to start again.

Error Behaviour: In case of errors other than SchemaErrors, validate_dataframe will print out the error message and return an empty good data frame and a bad dataframe containing all rows in input dataframe. Column errors, i.e., missing, misspelt or unexpected additional columns will also result in identical behaviour.

The bad dataframe may contain more rows than the input dataframe, as each error generates a new row. For instance, if one row in the input dataframe resulted in 5 errors in different columns, this will generate 5 rows in the bad dataframe, each with details on the specific errors available in additional columns.

TODO: vc to fix. If you get a ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' error, this is usually caused by a dtype mismatch between datetime in the input data when the validation code expects a date dtype.

Feature Maps: These set out what values are allowed in various columns to pass first validation, how these will be transformed during feature engineering and what values are allowed in the new columns to pass second validation. https://github.com/LTHTR-DST/hdruk_avoidable_admissions/blob/dev/avoidable_admissions/features/feature_maps.py

_Help improve this process by raising a GitHub issue if validation fails when it shouldn't or you run into unexpected errors.

"Starting with %d rows and %d columns" % dfa.shape
'Starting with 47483 rows and 63 columns'
dfa["visit_id"] = dfa.reset_index(drop=True).index
dfa.patient_id = dfa.patient_id.astype(np.int64)
dfa = dfa.loc[dfa.admiage >= 18].copy()
"Dataframe has %d rows and %d columns" % dfa.shape
'Dataframe has 39168 rows and 64 columns'
cols_opdate = dfa.filter(regex="opdate").columns
dfa[cols_opdate] = dfa[cols_opdate].apply(pd.to_datetime, dayfirst=True)
dfa.admidate = pd.to_datetime(dfa.admidate, dayfirst=True).dt.date
# dfa.admimeth = dfa.admimeth.astype(str)
dfa = dfa[dfa.admimeth != "27"]  # Careful
"Dataframe has %d rows and %d columns" % dfa.shape
'Dataframe has 39167 rows and 64 columns'
dfa.admisorc = dfa.admisorc.astype(str)
dfa = dfa[dfa.admisorc.isin(feature_maps.admisorc)]
dfa = dfa[dfa.admimeth != "27"]  # Careful
"Dataframe has %d rows and %d columns" % dfa.shape
'Dataframe has 39166 rows and 64 columns'
dfa.gender = dfa.gender.astype(str)
dfa.disdest = dfa.disdest.astype(str)
dfa = dfa[dfa.disdest.isin(feature_maps.disdest)]
"Dataframe has %d rows and %d columns" % dfa.shape
'Dataframe has 39166 rows and 64 columns'
dfa.dismeth = dfa.dismeth.astype(str)
dfa = dfa[dfa.dismeth.isin(feature_maps.dismeth)]
"Dataframe has %d rows and %d columns" % dfa.shape
'Dataframe has 39166 rows and 64 columns'
# Set missing IMD values to 0 to pass validation. Address how this is dealth with at the analytics stage.
# This is to get around the quirks of what nan means in the Pandas (and Pandera) world.
dfa.townsend_score_decile = dfa.townsend_score_decile.fillna(0).astype(np.int64)
# If everything above worked, this should not throw any errors.
good, bad = validate_admitted_care_data(dfa)

print("Good dataframe has %d rows" % good.shape[0])
print("Bad dataframe has %d rows" % bad.shape[0])
Good dataframe has 39166 rows
Bad dataframe has 0 rows

Feature engineering

See documentation here: https://lthtr-dst.github.io/hdruk_avoidable_admissions/features/

This has been written to Sheffield's specifications and will change if the specification changes.

_Help improve this process by raising a GitHub issue or contributing code.

dff = build_admitted_care_features(good.copy())

Second validation

Second validation and addressing DQ is identical to the first validation process.

Admitted Care Feature Validation Rules

The features dataframe should contain all columns from the initial dataframe and the new generated features.

Once again have a look at feature_maps.py for more details.

Let's look at only the new features and the rules that apply to them.

schema_2 = (
    get_schema_properties(AdmittedCareFeatureSchema)
    .sort_values("name")
    .set_index("name")
)
schema_2[~schema_2.index.isin(schema_1.index)]
dtype nullable unique coerce required checks regex title description
name
admiage_cat str False False False True [<Check isin: isin({'65 - 69', '25 - 29', '75 ... False None None
admidayofweek None True False False True [<Check isin: isin({'Sunday', 'Wednesday', 'Mo... False None None
admisorc_cat None True False False True [<Check isin: isin({'Medical care', 'Residence... False None None
diag_01_acsc None False False False True [<Check isin: isin({'COPD', 'Pneumothorax ', '... False None None
diag_seasonal_cat None True False False True [<Check isin: isin({'Respiratory infection', '... False None None
disdest_cat None True False False True [<Check isin: isin({'Medical care', 'Died', 'C... False None None
dismeth_cat None True False False True [<Check isin: isin({'Died', 'Discharged', 'Unk... False None None
ethnos_cat str True False False True [<Check isin: isin({'Other Ethnic Groups', 'As... False None None
gender_cat str False False False True [<Check isin: isin({'Female', 'Indeterminate',... False None None
length_of_stay_cat None True False False True [<Check isin: isin({'>=2 days', '<2 days'})>] False None None
opertn_count int64 False False False True [<Check greater_than_or_equal_to: greater_than... False None None
townsend_score_quintile int64 True False False True [<Check in_range: in_range(0, 5)>] False None None

Fix minor DQ issues and validate using AdmittedCareFeatureSchema

# TODO: vc to fix. Easiest will be to validate this as datetime rather than date
dff.admidate = dff.admidate.dt.date
# TODO: vc to fix. build_features makes this a categorical variable. Needs to be fixed either in build_features or in the validation schema
dff.admiage_cat = dff.admiage_cat.astype(str)
good_f, bad_f = validate_dataframe(dff, AdmittedCareFeatureSchema)
print("Good dataframe has %d rows" % good_f.shape[0])
print("Bad dataframe has %d rows" % bad_f.shape[0])
Good dataframe has 39166 rows
Bad dataframe has 0 rows

Analysis

Refer to https://docs.google.com/document/d/10PuNTnEG5zTkWOVaMGlfOInleXJ0KA-_5hSH4upwhi4/edit for details.

df = good_f.copy()

This section describes the analysis populations or the analysis of the APC data. These populations will be referred to in the analysis descriptions that follow.

All Acute Admissions

This analysis population includes all acute admissions.

Filter: (admimeth in 21, 22, 23, 24, 25, 2A, 2B, 2C, 2D, 28) & (epiorder = 1)

These filters should have been applied to the original data extract.

All Acute Admissions for ACSCs

This analysis population includes all acute admissions where the primary diagnosis from the first episode is an ACSC.

Filter: (admimeth in 21, 22, 23, 24, 25, 2A, 2B, 2C, 2D, 28) & (epiorder = 1) & (diag_01 = ACSC_Code_APC)

All Acute Admissions for Non-ACSCs

This analysis population includes all acute admissions where the primary diagnosis from the first episode is not an ACSC.

Filter: (admimeth in 21, 22, 23, 24, 25, 2A, 2B, 2C, 2D, 28) & (epiorder = 1) & (diag_01 != ACSC_Code_APC)

# All acute admissions - this should be redundant if done at extraction
df = df[
    (df.admimeth.isin({"21", "22", "23", "24", "25", "2A", "2B", "2C", "2D", "28"}))
    & (df.epiorder == 1)
]

# Acute admissions by ACSC status
df["is_acsc"] = df.diag_01_acsc.where(df.diag_01_acsc == "-", "ACSC").replace(
    "-", "Non-ACSC"
)

Age

pd.concat(
    [df.admiage.describe().rename("All"), df.groupby("is_acsc").admiage.describe().T],
    axis=1,
).T
count mean std min 25% 50% 75% max
All 3#.## 5#.## 2#.## 1#.## 3#.## 6#.## 7#.## 1#.##
ACSC 1#.## 6#.## 1#.## 1#.## 5#.## 6#.## 8#.## 1#.##
Non-ACSC 2#.## 5#.## 2#.## 1#.## 3#.## 5#.## 7#.## 1#.##
categorical_features = {
    "admiage_cat": "Age Bands",
    "gender_cat": "Gender",
    "ethnos_cat": "Ethnicity",
    "townsend_score_quintile": "Townsend Score Quintile",
    "admisorc_cat": "Admission Source",
    "admidayofweek": "Admission Day of Week",
    "diag_seasonal_cat": "Seasonal Diagnosis",
    "length_of_stay_cat": "Length of Stay",
    "disdest_cat": "Discharge Destination",
    "dismeth_cat": "Discharge Method",
}
def make_crosstab(colname, tablename):
    x = pd.crosstab(df[k], df.is_acsc, margins=True, dropna=False, margins_name="Total")

    y = (
        pd.crosstab(
            df[k],
            df.is_acsc,
            normalize="index",
            dropna=False,
            margins_name="Total",
        )
        .mul(100)
        .round(2)
        .rename(columns={"ACSC": "ACSC %", "Non-ACSC": "Non-ACSC %"})
    )

    z = pd.concat([x, y], axis=1).sort_index(axis=1).fillna("-")

    z.index = pd.MultiIndex.from_tuples([(v, i) for i in z.index])

    return z
# HTML Output
out = ""
df_results = []
for k, v in categorical_features.items():
    z = make_crosstab(k, v)
    df_results.append(z)
    out += f"""
        ### {v}
        {z.to_html()}

        """
df_results = pd.concat(df_results)
HTML(out)

Age Bands

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Age Bands 18-19 1### 1#.## 7### 8#.## 9###
20 - 24 4### 2#.## 1### 7#.## 2###
25 - 29 4### 1#.## 2### 8#.## 2###
30 - 34 5### 2#.## 1### 7#.## 2###
35 - 39 5### 2#.## 1### 7#.## 2###
40 - 44 6### 3#.## 1### 6#.## 1###
45 - 49 7### 3#.## 1### 6#.## 1###
50 - 54 9### 4#.## 1### 5#.## 2###
55 - 59 1### 4#.## 1### 5#.## 2###
60 - 64 1### 4#.## 1### 5#.## 2###
65 - 69 1### 4#.## 1### 5#.## 2###
70 - 74 1### 4#.## 1### 5#.## 3###
75 - 79 1### 4#.## 1### 5#.## 3###
80 - 84 1### 4#.## 1### 5#.## 3###
>85 2### 4#.## 2### 5#.## 4###
Total 1### - 2### - 3###

Gender

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Gender Female 8### 3#.## 1### 6#.## 2###
Indeterminate 4### 3#.## 9### 6#.## 1###
Male 7### 4#.## 9### 5#.## 1###
Total 1### - 2### - 3###

Ethnicity

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Ethnicity Asian or Asian British 7### 3#.## 1### 6#.## 2###
Black or Black British 1### 3#.## 1### 6#.## 3###
Mixed 9### 3#.## 2### 6#.## 3###
Not known 1### 0#.## 1### 1#.## 1###
Not stated 5### 2#.## 1### 7#.## 2###
Other Ethnic Groups 1### 3#.## 2### 6#.## 4###
White 1### 4#.## 2### 5#.## 3###
Total 1### - 2### - 3###

Townsend Score Quintile

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Townsend Score Quintile 0 1### 2#.## 2### 7#.## 3###
1 3### 3#.## 5### 6#.## 9###
2 3### 3#.## 4### 6#.## 7###
3 2### 3#.## 3### 6#.## 5###
4 3### 3#.## 5### 6#.## 8###
5 3### 4#.## 4### 5#.## 7###
Total 1### - 2### - 3###

Admission Source

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Admission Source Care Home 2### 1#.## 1### 8#.## 1###
Medical care 8### 3#.## 1### 6#.## 2###
Penal 5### 5#.## 5### 5#.## 1###
Residence 1### 3#.## 2### 6#.## 3###
Unknown 1### 1#.## 5### 8#.## 6###
Total 1### - 2### - 3###

Admission Day of Week

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Admission Day of Week Friday 2### 4#.## 3### 5#.## 6###
Monday 2### 3#.## 3### 6#.## 5###
Saturday 1### 3#.## 2### 6#.## 4###
Sunday 1### 3#.## 2### 6#.## 3###
Thursday 2### 4#.## 3### 5#.## 6###
Tuesday 2### 3#.## 3### 6#.## 6###
Wednesday 2### 4#.## 3### 5#.## 6###
Total 1### - 2### - 3###

Seasonal Diagnosis

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Seasonal Diagnosis - 1### 3#.## 2### 6#.## 3###
Chronic disease exacerbation 1### 9#.## 9### 7#.## 1###
Respiratory infection 1### 8#.## 2### 1#.## 2###
Total 1### - 2### - 3###

Length of Stay

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Length of Stay <2 days 7### 3#.## 1### 6#.## 1###
>=2 days 8### 4#.## 1### 5#.## 1###
Total 1### - 2### - 3###

Discharge Destination

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Discharge Destination Care Home 2### 3#.## 5### 6#.## 7###
Died 1### 3#.## 3### 6#.## 4###
Medical care 1### 2#.## 5### 7#.## 6###
Penal 3### 1#.## 2### 8#.## 2###
Residence 1### 4#.## 1### 5#.## 3###
Unknown 1### 3#.## 2### 6#.## 4###
Total 1### - 2### - 3###

Discharge Method

is_acsc ACSC ACSC % Non-ACSC Non-ACSC % Total
Discharge Method Died 5### 3#.## 1### 6#.## 1###
Discharged 1### 3#.## 2### 6#.## 3###
Not Applicable 1### 0#.## 1### 1#.## 1###
Unknown 1### 0#.## 1### 1#.## 1###
Total 1### - 2### - 3###

Export Output for Sheffield

timestamp = now.strftime("%Y_%m_%dT%H%M%S")
timestamp
'2023_02_05T115115'

Export as CSV

# save df_results to a CSV to send to Sheffield for programmatic meta-analysis
# df_results.to_csv('path/to/results_timestamp.csv')

Export as HTML

output_head = f"""
<h1> LTH Admitted Care Analysis Tables Draft</h1>

<pre>
prepared: {timestamp}
email:    datascience@lthtr.nhs.uk
github:   <a href='https://github.com/LTHTR-DST/'>https://github.com/LTHTR-DST/</a>
</pre>
"""

fp = f"../reports/lth_admitted_care_analysis_tables_draft_{timestamp}.html"
with open(fp, "w") as f:
    x = df_results.to_html()
    x = output_head + x
    f.write(x)
print(
    "Finished pipeline execution in",
    round((datetime.today() - now).total_seconds()),
    "seconds at",
    datetime.today(),
)
Finished pipeline execution in 9 seconds at 2023-02-05 11:51:24.288155

The above outputs can easily be shared and if all sites used the same code, Sheffield should be able to generate the 'meta-analysis' more easily.

END OF NOTEBOOK