
Data Validation

Multiple schemas are defined for validating different datasets at different pipeline stages.

validate_dataframe(df, schema, **kwargs)

Validates data against a specified schema.

The data should have been prepared as per the specification set by the lead site. Use the output of this function to iteratively identify and address data quality issues.

The following schemas are defined:

Admitted Care Data:

  • AdmittedCareEpisodeSchema
  • AdmittedCareFeatureSchema

Emergency Care Data:

  • EmergencyCareEpisodeSchema
  • EmergencyCareFeatureSchema

See source code for validation rules.

Returns a good dataframe containing rows that passed validation and a bad dataframe with rows that failed. The bad dataframe has additional columns describing the failure cause(s). If there is a column error (misspelt, missing or additional), all rows will be returned in the bad dataframe.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Dataframe to be validated | required |
| schema | DataFrameSchema | Pandera schema to validate against | required |
| kwargs | | The following keyword arguments are currently supported: | {} |
| start_date | datetime | Study start date (inclusive) | required |
| end_date | datetime | Study end date (excluded) | required |
| ignore_cols | list | Columns to ignore during validation checks | required |
| update_cols | dict[str, dict] | Dictionary of column:properties to update the schema | required |

Returns:

| Type | Description |
|------|-------------|
| Tuple[DataFrame, DataFrame] | Good and bad dataframes. See example below. |

Validation example

from avoidable_admissions.data.validate import (
    validate_dataframe,
    AdmittedCareEpisodeSchema
)


df = pd.read_csv('path/to/data.csv')
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

If df has rows that fail validation, the function will print output similar to the following.

Schema AdmittedCareEpisodeSchema: A total of 1 schema errors were found.

Error Counts
------------
- schema_component_check: 1

Schema Error Summary
--------------------
                                                    failure_cases  n_failure_cases
schema_context column  check
Column         admiage greater_than_or_equal_to(18)        [17.0]                1

This message indicates that there was a validation error in the admiage column which expects values >=18.

Fix data quality iteratively to ensure there are no errors.

If you find a bug in the validation code, and correct data fails validation, please raise a GitHub issue.

Customising validation

Customise study dates

By default, the study dates specified in the initial protocol are used (admidate>=datetime(2021,11,1) and admidate<datetime(2022,11,1)). These can be altered by passing keyword arguments.

The following rule is applied to admidate in the Acute Admissions dataset and to edarrivaldatetime in the emergency care dataset.

admidate>=start_date and admidate<end_date

The <end_date allows 31-10-2022 23:59:00 to pass validation when end_date is set to datetime(2022,11,1).
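The inclusive-start, exclusive-end behaviour can be confirmed with a small pandas sketch (dates taken from the defaults above; the sample timestamps are illustrative):

```python
import pandas as pd
from datetime import datetime

start_date = datetime(2021, 11, 1)  # inclusive
end_date = datetime(2022, 11, 1)    # exclusive

admidate = pd.to_datetime(pd.Series([
    "2021-10-31 23:59:00",  # before start -> fails
    "2021-11-01 00:00:00",  # equals start -> passes
    "2022-10-31 23:59:00",  # before end   -> passes
    "2022-11-01 00:00:00",  # equals end   -> fails
]))

passes = (admidate >= start_date) & (admidate < end_date)
print(passes.tolist())  # [False, True, True, False]
```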

Ignore selected columns

Passing a list of column names to ignore_cols as a keyword argument will apply the following properties, effectively turning off validation.

{
    'dtype': None,
    'checks': [],
    'nullable': False,
    'unique': False,
    'coerce': False,
    'required': True
}
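A sketch of how this works internally, mirroring the source listing further down: the blank property set above is applied to every listed column before the schema is rebuilt, and regex patterns are allowed.

```python
# The blank properties shown above, applied to each column in ignore_cols
blank_props = {"dtype": None, "checks": [], "nullable": False,
               "unique": False, "coerce": False, "required": True}

ignore_cols = ["eddiag_01", "eddiag_[0-9]{2}$"]  # regex patterns are allowed
updated_column_props = {col: blank_props for col in ignore_cols}
# schema = schema.update_columns(updated_column_props)  # as in the source
```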
Update validation rules

Passing a dictionary of {column_name: property_dict} allows fine-grained control. For example, to remove checks on edchiefcomplaint while preserving its other validation rules, pass {"edchiefcomplaint": {"checks": []}} to update_cols.

Custom validation example

The example below applies the following custom rules:

  • Change study start and end dates
  • Ignore validation on eddiag_NN columns. This requires both the _01 suffix and the _[0-9]{2}$ regex suffix to be set. Note the $ at the end of the regex.
  • Don't perform checks on edchiefcomplaint but retain other rules, e.g. dtype
  • Don't check the data type for accommodationstatus but retain other rules
good, bad = validate_dataframe(
    df,
    schema,
    start_date=datetime(2021, 10, 1),
    end_date=datetime(2022, 11, 1),
    ignore_cols=["eddiag_01", "eddiag_[0-9]{2}$"],
    update_cols={
        "edchiefcomplaint": {"checks": []},
        "accommodationstatus": {"dtype": None},
    }
)
Source code in avoidable_admissions/data/validate.py
def validate_dataframe(
    df: pd.DataFrame, schema: pa.DataFrameSchema, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Validates data against a specified schema.

    The data should have been prepared as per the specification set by the lead site.
    Use the output of this function to iteratively identify and address data quality issues.

    The following schema are defined:

    Admitted Care Data:

    - `AdmittedCareEpisodeSchema`
    - `AdmittedCareFeatureSchema`

    Emergency Care Data:

    - `EmergencyCareEpisodeSchema`
    - `EmergencyCareFeatureSchema`

    See __[source code](https://github.com/LTHTR-DST/hdruk_avoidable_admissions/blob/dev/avoidable_admissions/data/validate.py)__
    for validation rules.

    Returns a _good_ dataframe containing rows that passed validation and
    a _bad_ dataframe with rows that failed validation.
    The _bad_ dataframe has additional columns that provide information on failure cause(s).
    If there is a column error (misspelt, missing or additional), all rows will be returned in _bad_ dataframe.

    Args:
        df (pandas.DataFrame): Dataframe to be validated
        schema (pa.DataFrameSchema): Pandera schema to validate against
        kwargs: The following keyword arguments are currently supported
        start_date (datetime): Study start date (inclusive)
        end_date (datetime): Study end date (excluded)
        ignore_cols (list): Columns to ignore during validation checks.
        update_cols (dict[str:dict]): Dictionary of column:properties to update schema.

    Returns:
        _Good_ and _Bad_ dataframes. See example below.

    ## Validation example

    ``` python
    from avoidable_admissions.data.validate import (
        validate_dataframe,
        AdmittedCareEpisodeSchema
    )


    df = pd.read_csv('path/to/data.csv')
    good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)
    ```

    If `df` had rows that fail validation, the function will print an output similar to below.

        Schema AdmittedCareEpisodeSchema: A total of 1 schema errors were found.

        Error Counts
        ------------
        - schema_component_check: 1

        Schema Error Summary
        --------------------
                                                            failure_cases  n_failure_cases
        schema_context column  check
        Column         admiage greater_than_or_equal_to(18)        [17.0]                1

    This message indicates that there was a validation error in the `admiage` column which expects values >=18.

    Fix data quality iteratively to ensure there are no errors.

    If you find a bug in the validation code, and correct data fails validation,
    please raise a [GitHub issue](https://github.com/LTHTR-DST/hdruk_avoidable_admissions/issues).

    ## Customising validation

    ### Customise study dates

    As a default, the study dates specified in the initial protocol will be used
    (`admidate>=datetime(2021,11,1) and admidate<datetime(2022,11,1)`).
    However, these can be altered by providing these as keyword arguments.

    The following rule is applied to `admidate` in the Acute Admissions dataset
    and to `edarrivaldatetime` in the emergency care dataset.

    `admidate>=start_date` and `admidate<end_date`

    The `<end_date` allows `31-10-2022 23:59:00` to pass validation when
    `end_date` is set to `datetime(2022,11,1)`.

    ### Ignore selected columns

    Passing a list of column names to `ignore_cols` as a keyword argument will
    apply the following properties, effectively turning off validation.

    ```python
    {
        'dtype': None,
        'checks': [],
        'nullable': False,
        'unique': False,
        'coerce': False,
        'required': True
    }
    ```

    ### Update validation rules

    Passing a dictionary of {column_name: property_dict} allows fine-grained control.
    For example, to remove checks on `edchiefcomplaint` but preserve
    other validation rules, pass the following to `update_cols`.

    ### Custom validation example

    The example below applies the following custom rules:

    - Change study start and end dates
    - Ignore validation on `eddiag_NN` columns.
        This requires both *_01* and *_[0-9]{2}$* regex suffixes to be set.
        Note the _$_ at the end of regex.
    - Don't perform checks on `edchiefcomplaint` but retain other rules, e.g. dtype
    - Don't check the data type for `accommodationstatus` but retain other rules


    ```python
    good, bad = validate_dataframe(
        df,
        schema,
        start_date=datetime(2021, 10, 1),
        end_date=datetime(2022, 11, 1),
        ignore_cols=["eddiag_01", "eddiag_[0-9]{2}$"],
        update_cols={
            "edchiefcomplaint": {"checks": []},
            "accommodationstatus": {"dtype": None},
        }
    )
    ```


    """

    df_errors = pd.DataFrame()

    # todo: document this behaviour to warn user that index will be dropped.
    # alternatively find a way to set a unique key for each row - important for merging errors
    df = df.copy().reset_index(drop=True)

    start_date = kwargs.get("start_date", datetime(2021, 11, 1))
    end_date = kwargs.get("end_date", datetime(2022, 11, 1))

    date_checks = [
        pa.Check.ge(start_date),
        pa.Check.lt(end_date),
    ]

    if schema.name.startswith("AdmittedCare"):
        cohort_date_col = "admidate"
    elif schema.name.startswith("EmergencyCare"):
        cohort_date_col = "edarrivaldatetime"

    updated_column_props = {}

    updated_column_props[cohort_date_col] = {"checks": date_checks}

    # New feature - allow user to ignore checks on some columns
    ignore_cols = kwargs.get("ignore_cols", [])

    update_cols = kwargs.get("update_cols", {})

    blank_props = {
        "dtype": None,
        "checks": [],
        "nullable": False,
        "unique": False,
        "coerce": False,
        "required": True,
    }

    for col in ignore_cols:
        updated_column_props[col] = blank_props

    updated_column_props.update(update_cols)

    # If a column in ignore_cols is not present in schema, this will raise
    # a SchemaInitError with name of column causing the error.
    schema = schema.update_columns(updated_column_props)

    try:
        # Capture all errors
        # https://pandera.readthedocs.io/en/stable/lazy_validation.html
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as ex:

        print(ex.args[0])

        df_errors = ex.failure_cases.copy()

        # First get the rows that are causing errors
        df_errors["index"] = df_errors["index"].fillna(df.index.max() + 1)
        df_errors = df.merge(df_errors, how="right", left_index=True, right_on="index")
        df_errors["index"] = df_errors["index"].replace(df.index.max() + 1, None)

        # Column name mismatches will have an 'index' of NaN which causes merge to fail
        # If a column name is not present, then all rows should be returned as errors

        if df_errors["index"].hasnans:
            # there is a column error. drop all rows from the 'good' dataframe
            df = df.iloc[0:0]

            print("No data will pass validation due to column error. See output above.")

        else:
            df = df[~df.index.isin(df_errors["index"])]
    except Exception as ex:
        # This is to catch all other errors.
        print(ex.args[0])
        print(
            "No data will pass validation due to undefined error."
            "See output above and please raise an issue on GitHub."
        )

        df = df.iloc[0:0]
        df_errors = df.copy()

    finally:
        return df, df_errors

validate_admitted_care_data(df, **kwargs)

Convenience wrapper for validate_dataframe(df, AdmittedCareEpisodeSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_admitted_care_data(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, AdmittedCareEpisodeSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """

    return validate_dataframe(df, AdmittedCareEpisodeSchema, **kwargs)

validate_admitted_care_features(df, **kwargs)

Convenience wrapper for validate_dataframe(df, AdmittedCareFeatureSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_admitted_care_features(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, AdmittedCareFeatureSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """
    return validate_dataframe(df, AdmittedCareFeatureSchema, **kwargs)

validate_emergency_care_data(df, **kwargs)

Convenience wrapper for validate_dataframe(df, EmergencyCareEpisodeSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_emergency_care_data(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, EmergencyCareEpisodeSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """
    return validate_dataframe(df, EmergencyCareEpisodeSchema, **kwargs)

validate_emergency_care_features(df, **kwargs)

Convenience wrapper for validate_dataframe(df, EmergencyCareFeatureSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_emergency_care_features(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, EmergencyCareFeatureSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """
    return validate_dataframe(df, EmergencyCareFeatureSchema, **kwargs)

get_schema_properties(schema)

Get detailed information about a validation schema including checks, dtypes, nullability and more.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| schema | DataFrameSchema | One of AdmittedCareEpisodeSchema, AdmittedCareFeatureSchema, EmergencyCareEpisodeSchema, EmergencyCareFeatureSchema | required |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | Dataframe containing schema properties. |
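Under the hood this is essentially a one-liner: each column's property dict becomes one row of the result. A package-free sketch of the same idea, using hypothetical property dicts (the admiage/admidate shapes shown are illustrative only):

```python
import pandas as pd

# Hypothetical per-column property dicts, of the kind a Pandera schema holds
properties = [
    {"name": "admiage", "dtype": "float64", "nullable": False},
    {"name": "admidate", "dtype": "datetime64[ns]", "nullable": False},
]

# One row per column, one property per column of the result
props = pd.DataFrame(properties)
print(props[["name", "dtype"]])
```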

Source code in avoidable_admissions/data/validate.py
def get_schema_properties(schema: pa.DataFrameSchema) -> pd.DataFrame:
    """Get detailed information about a validation schema including checks, dtypes, nullability and more.

    Args:
        schema (Pandera DataFrameSchema): One of `AdmittedCareEpisodeSchema`, `AdmittedCareFeatureSchema`,
        `EmergencyCareEpisodeSchema`, `EmergencyCareFeatureSchema`

    Returns:
        pd.DataFrame: Dataframe containing schema properties.
    """

    df = pd.DataFrame([x.properties for x in schema.columns.values()])

    columns = [
        "name",
        "dtype",
        "nullable",
        "unique",
        "coerce",
        "required",
        "checks",
        "regex",
        "title",
        "description",
    ]

    return df[columns]

Fixing Errors

It is likely that data validation will fail on a subset of the data the first few times. Fixing errors is an iterative process; the following examples illustrate common fixes.

Please see https://mattstammers.github.io/hdruk_avoidable_admissions_collaboration_docs/ for more examples.

Errors in validation after feature generation may be caused by extraneous codes that are not specified in the data specification.

Examples

# Convert to date

df['admidate'] = pd.to_datetime(df['admidate'], yearfirst=True)
df['admidate'] = df['admidate'].dt.date


# Fill missing SNOMED codes with 0.
# Else validation will fail as nan is treated as float.
df['accommodationstatus'] = df['accommodationstatus'].fillna(0)

Missing Values

To be finalised after further discussion and testing.

There is an entire chapter in Pandas documentation on missing values which is an important read for any data scientist.

For the purposes of this project, several pragmatic choices have been made regarding how missing values are treated.

  1. Where a definition exists for how missing values should be coded, for instance in the NHS data model, use this.
  2. For SNOMED codes, which are always integers, use 0 (zero) to replace all missing values. This avoids validation errors caused by NaN values that are treated as float dtype by Pandas.
  3. For strings, use "-" (without the quotes) for missing values.
  4. During feature engineering, custom error values are assigned to codes that are missing from either the refsets or mapping.
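Rules 2 and 3 above can be applied with plain pandas. A minimal sketch (accommodationstatus appears in this document; the notes column and the sample values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "accommodationstatus": [1064831000000104, np.nan],  # SNOMED codes (sample value is illustrative)
    "notes": ["follow-up", np.nan],                     # hypothetical string column
})

# Rule 2: replace missing SNOMED codes with 0, then restore integer dtype
df["accommodationstatus"] = df["accommodationstatus"].fillna(0).astype("int64")

# Rule 3: replace missing strings with "-"
df["notes"] = df["notes"].fillna("-")

print(df["accommodationstatus"].tolist())  # [1064831000000104, 0]
print(df["notes"].tolist())                # ['follow-up', '-']
```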