
Data Validation

Multiple schemas are defined for validating different datasets at different pipeline stages.

validate_dataframe(df, schema, **kwargs)

Validates data against a specified schema.

The data should have been prepared as per the specification set by the lead site. Use the output of this function to iteratively identify and address data quality issues.

The following schemas are defined:

Admitted Care Data:

  • AdmittedCareEpisodeSchema
  • AdmittedCareFeatureSchema

Emergency Care Data:

  • EmergencyCareEpisodeSchema
  • EmergencyCareFeatureSchema

See source code for validation rules.

Returns a good dataframe containing rows that passed validation and a bad dataframe with rows that failed. The bad dataframe has additional columns describing the failure cause(s). If there is a column error (misspelt, missing or additional), all rows will be returned in the bad dataframe.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Dataframe to be validated | required |
| schema | DataFrameSchema | Pandera schema to validate against | required |
| kwargs | | The following keyword arguments are currently supported: | {} |
| start_date | datetime | Study start date (inclusive) | required |
| end_date | datetime | Study end date (excluded) | required |
| ignore_cols | list | Columns to ignore during validation checks | required |
| update_cols | dict[str, dict] | Dictionary of column:properties to update the schema | required |

Returns:

| Type | Description |
|------|-------------|
| Tuple[DataFrame, DataFrame] | Good and bad dataframes. See example below. |

Validation example

from avoidable_admissions.data.validate import (
    validate_dataframe,
    AdmittedCareEpisodeSchema
)


df = pd.read_csv('path/to/data.csv')
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

If df has rows that fail validation, the function will print output similar to the following.

Schema AdmittedCareEpisodeSchema: A total of 1 schema errors were found.

Error Counts
------------
- schema_component_check: 1

Schema Error Summary
--------------------
                                                    failure_cases  n_failure_cases
schema_context column  check
Column         admiage greater_than_or_equal_to(18)        [17.0]                1

This message indicates that there was a validation error in the admiage column which expects values >=18.

Fix data quality iteratively to ensure there are no errors.

If you find a bug in the validation code, and correct data fails validation, please raise a GitHub issue.

Customising validation

Customise study dates

By default, the study dates specified in the initial protocol are used (admidate>=datetime(2021,11,1) and admidate<datetime(2022,11,1)). These can be altered by passing keyword arguments.

The following rule is applied to admidate in the Acute Admissions dataset and to edarrivaldatetime in the emergency care dataset.

admidate>=start_date and admidate<end_date

The <end_date allows 31-10-2022 23:59:00 to pass validation when end_date is set to datetime(2022,11,1).
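The inclusive-start, exclusive-end behaviour can be confirmed with a small pandas sketch (dates taken from the defaults above; the sample timestamps are illustrative):

```python
import pandas as pd
from datetime import datetime

start_date = datetime(2021, 11, 1)  # inclusive
end_date = datetime(2022, 11, 1)    # exclusive

admidate = pd.to_datetime(pd.Series([
    "2021-10-31 23:59:00",  # before start -> fails
    "2021-11-01 00:00:00",  # equals start -> passes
    "2022-10-31 23:59:00",  # before end   -> passes
    "2022-11-01 00:00:00",  # equals end   -> fails
]))

passes = (admidate >= start_date) & (admidate < end_date)
print(passes.tolist())  # [False, True, True, False]
```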

Ignore selected columns

Passing a list of column names to ignore_cols as a keyword argument will apply the following properties, effectively turning off validation.

{
    'dtype': None,
    'checks': [],
    'nullable': False,
    'unique': False,
    'coerce': False,
    'required': True
}
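A sketch of how this works internally, mirroring the source listing further down: the blank property set above is applied to every listed column before the schema is rebuilt, and regex patterns are allowed.

```python
# The blank properties shown above, applied to each column in ignore_cols
blank_props = {"dtype": None, "checks": [], "nullable": False,
               "unique": False, "coerce": False, "required": True}

ignore_cols = ["eddiag_01", "eddiag_[0-9]{2}$"]  # regex patterns are allowed
updated_column_props = {col: blank_props for col in ignore_cols}
# schema = schema.update_columns(updated_column_props)  # as in the source
```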
Update validation rules

Passing a dictionary of {column_name: property_dict} allows fine-grained control. For example, to remove checks on edchiefcomplaint while preserving its other validation rules, pass {"edchiefcomplaint": {"checks": []}} to update_cols.

Custom validation example

The example below applies the following custom rules:

  • Change study start and end dates
  • Ignore validation on eddiag_NN columns. This requires both the _01 suffix and the _[0-9]{2}$ regex suffix to be set. Note the $ at the end of the regex.
  • Don't perform checks on edchiefcomplaint but retain other rules, e.g. dtype
  • Don't check the data type for accommodationstatus but retain other rules
good, bad = validate_dataframe(
    df,
    schema,
    start_date=datetime(2021, 10, 1),
    end_date=datetime(2022, 11, 1),
    ignore_cols=["eddiag_01", "eddiag_[0-9]{2}$"],
    update_cols={
        "edchiefcomplaint": {"checks": []},
        "accommodationstatus": {"dtype": None},
    }
)
Source code in avoidable_admissions/data/validate.py
def validate_dataframe(
    df: pd.DataFrame, schema: pa.DataFrameSchema, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Validates data against a specified schema.

    The data should have been prepared as per the specification set by the lead site.
    Use the output of this function to iteratively identify and address data quality issues.

    The following schema are defined:

    Admitted Care Data:

    - `AdmittedCareEpisodeSchema`
    - `AdmittedCareFeatureSchema`

    Emergency Care Data:

    - `EmergencyCareEpisodeSchema`
    - `EmergencyCareFeatureSchema`

    See __[source code](https://github.com/LTHTR-DST/hdruk_avoidable_admissions/blob/dev/avoidable_admissions/data/validate.py)__
    for validation rules.

    Returns a _good_ dataframe containing rows that passed validation and
    a _bad_ dataframe with rows that failed validation.
    The _bad_ dataframe has additional columns that provide information on failure cause(s).
    If there is a column error (misspelt, missing or additional), all rows will be returned in _bad_ dataframe.

    Args:
        df (pandas.DataFrame): Dataframe to be validated
        schema (pa.DataFrameSchema): Pandera schema to validate against
        kwargs: The following keyword arguments are currently supported
        start_date (datetime): Study start date (inclusive)
        end_date (datetime): Study end date (excluded)
        ignore_cols (list): Columns to ignore during validation checks.
        update_cols (dict[str:dict]): Dictionary of column:properties to update schema.

    Returns:
        _Good_ and _Bad_ dataframes. See example below.

    ## Validation example

    ``` python
    from avoidable_admissions.data.validate import (
        validate_dataframe,
        AdmittedCareEpisodeSchema
    )


    df = pd.read_csv('path/to/data.csv')
    good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)
    ```

    If `df` had rows that fail validation, the function will print an output similar to below.

        Schema AdmittedCareEpisodeSchema: A total of 1 schema errors were found.

        Error Counts
        ------------
        - schema_component_check: 1

        Schema Error Summary
        --------------------
                                                            failure_cases  n_failure_cases
        schema_context column  check
        Column         admiage greater_than_or_equal_to(18)        [17.0]                1

    This message indicates that there was a validation error in the `admiage` column which expects values >=18.

    Fix data quality iteratively to ensure there are no errors.

    If you find a bug in the validation code, and correct data fails validation,
    please raise a [GitHub issue](https://github.com/LTHTR-DST/hdruk_avoidable_admissions/issues).

    ## Customising validation

    ### Customise study dates

    As a default, the study dates specified in the initial protocol will be used
    (`admidate>=datetime(2021,11,1) and admidate<datetime(2022,11,1)`).
    However, these can be altered by providing these as keyword arguments.

    The following rule is applied to `admidate` in the Acute Admissions dataset
    and to `edarrivaldatetime` in the emergency care dataset.

    `admidate>=start_date` and `admidate<end_date`

    The `<end_date` allows `31-10-2022 23:59:00` to pass validation when
    `end_date` is set to `datetime(2022,11,1)`.

    ### Ignore selected columns

    Passing a list of column names to `ignore_cols` as a keyword argument will
    apply the following properties, effectively turning off validation.

    ```python
    {
        'dtype': None,
        'checks': [],
        'nullable': False,
        'unique': False,
        'coerce': False,
        'required': True
    }
    ```

    ### Update validation rules

    Passing a dictionary of {column_name: property_dict} allows fine-grained control.
    For example, to remove checks on `edchiefcomplaint` but preserve
    other validation rules, pass the following to `update_cols`.

    ### Custom validation example

    The example below applies the following custom rules:

    - Change study start and end dates
    - Ignore validation on `eddiag_NN` columns.
        This requires both *_01* and *_[0-9]{2}$* regex suffixes to be set.
        Note the _$_ at the end of regex.
    - Don't perform checks on `edchiefcomplaint` but retain other rules, e.g. dtype
    - Don't check the data type for `accommodationstatus` but retain other rules


    ```python
    good, bad = validate_dataframe(
        df,
        schema,
        start_date=datetime(2021, 10, 1),
        end_date=datetime(2022, 11, 1),
        ignore_cols=["eddiag_01", "eddiag_[0-9]{2}$"],
        update_cols={
            "edchiefcomplaint": {"checks": []},
            "accommodationstatus": {"dtype": None},
        }
    )
    ```


    """

    df_errors = pd.DataFrame()

    # todo: document this behaviour to warn user that index will be dropped.
    # alternatively find a way to set a unique key for each row - important for merging errors
    df = df.copy().reset_index(drop=True)

    start_date = kwargs.get("start_date", datetime(2021, 11, 1))
    end_date = kwargs.get("end_date", datetime(2022, 11, 1))

    date_checks = [
        pa.Check.ge(start_date),
        pa.Check.lt(end_date),
    ]

    if schema.name.startswith("AdmittedCare"):
        cohort_date_col = "admidate"
    elif schema.name.startswith("EmergencyCare"):
        cohort_date_col = "edarrivaldatetime"

    updated_column_props = {}

    updated_column_props[cohort_date_col] = {"checks": date_checks}

    # New feature - allow user to ignore checks on some columns
    ignore_cols = kwargs.get("ignore_cols", [])

    update_cols = kwargs.get("update_cols", {})

    blank_props = {
        "dtype": None,
        "checks": [],
        "nullable": False,
        "unique": False,
        "coerce": False,
        "required": True,
    }

    for col in ignore_cols:
        updated_column_props[col] = blank_props

    updated_column_props.update(update_cols)

    # If a column in ignore_cols is not present in schema, this will raise
    # a SchemaInitError with name of column causing the error.
    schema = schema.update_columns(updated_column_props)

    try:
        # Capture all errors
        # https://pandera.readthedocs.io/en/stable/lazy_validation.html
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as ex:

        print(ex.args[0])

        df_errors = ex.failure_cases.copy()

        # First get the rows that are causing errors
        df_errors["index"] = df_errors["index"].fillna(df.index.max() + 1)
        df_errors = df.merge(df_errors, how="right", left_index=True, right_on="index")
        df_errors["index"] = df_errors["index"].replace(df.index.max() + 1, None)

        # Column name mismatches will have an 'index' of NaN which causes merge to fail
        # If a column name is not present, then all rows should be returned as errors

        if df_errors["index"].hasnans:
            # there is a column error. drop all rows from the 'good' dataframe
            df = df.iloc[0:0]

            print("No data will pass validation due to column error. See output above.")

        else:
            df = df[~df.index.isin(df_errors["index"])]
    except Exception as ex:
        # This is to catch all other errors.
        print(ex.args[0])
        print(
            "No data will pass validation due to undefined error."
            "See output above and please raise an issue on GitHub."
        )

        df = df.iloc[0:0]
        df_errors = df.copy()

    finally:
        return df, df_errors

validate_admitted_care_data(df, **kwargs)

Convenience wrapper for validate_dataframe(df, AdmittedCareEpisodeSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_admitted_care_data(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, AdmittedCareEpisodeSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """

    return validate_dataframe(df, AdmittedCareEpisodeSchema, **kwargs)

validate_admitted_care_features(df, **kwargs)

Convenience wrapper for validate_dataframe(df, AdmittedCareFeatureSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_admitted_care_features(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, AdmittedCareFeatureSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """
    return validate_dataframe(df, AdmittedCareFeatureSchema, **kwargs)

validate_emergency_care_data(df, **kwargs)

Convenience wrapper for validate_dataframe(df, EmergencyCareEpisodeSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_emergency_care_data(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, EmergencyCareEpisodeSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """
    return validate_dataframe(df, EmergencyCareEpisodeSchema, **kwargs)

validate_emergency_care_features(df, **kwargs)

Convenience wrapper for validate_dataframe(df, EmergencyCareFeatureSchema)

See avoidable_admissions.data.validate.validate_dataframe for usage.

Source code in avoidable_admissions/data/validate.py
def validate_emergency_care_features(
    df: pd.DataFrame, **kwargs
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Convenience wrapper for `validate_dataframe(df, EmergencyCareFeatureSchema)`

    See [avoidable_admissions.data.validate.validate_dataframe][] for usage.
    """
    return validate_dataframe(df, EmergencyCareFeatureSchema, **kwargs)

get_schema_properties(schema)

Get detailed information about a validation schema including checks, dtypes, nullability and more.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| schema | DataFrameSchema | One of AdmittedCareEpisodeSchema, AdmittedCareFeatureSchema, EmergencyCareEpisodeSchema, EmergencyCareFeatureSchema | required |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | Dataframe containing schema properties. |
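Under the hood this is essentially a one-liner: each column's property dict becomes one row of the result. A package-free sketch of the same idea, using hypothetical property dicts (the admiage/admidate shapes shown are illustrative only):

```python
import pandas as pd

# Hypothetical per-column property dicts, of the kind a Pandera schema holds
properties = [
    {"name": "admiage", "dtype": "float64", "nullable": False},
    {"name": "admidate", "dtype": "datetime64[ns]", "nullable": False},
]

# One row per column, one property per column of the result
props = pd.DataFrame(properties)
print(props[["name", "dtype"]])
```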

Source code in avoidable_admissions/data/validate.py
def get_schema_properties(schema: pa.DataFrameSchema) -> pd.DataFrame:
    """Get detailed information about a validation schema including checks, dtypes, nullability and more.

    Args:
        schema (Pandera DataFrameSchema): One of `AdmittedCareEpisodeSchema`, `AdmittedCareFeatureSchema`,
        `EmergencyCareEpisodeSchema`, `EmergencyCareFeatureSchema`

    Returns:
        pd.DataFrame: Dataframe containing schema properties.
    """

    df = pd.DataFrame([x.properties for x in schema.columns.values()])

    columns = [
        "name",
        "dtype",
        "nullable",
        "unique",
        "coerce",
        "required",
        "checks",
        "regex",
        "title",
        "description",
    ]

    return df[columns]

Fixing Errors

It is likely that data validation will fail on a subset of the data the first few times. Fixing errors is an iterative process; the following examples illustrate common fixes.

Please see https://mattstammers.github.io/hdruk_avoidable_admissions_collaboration_docs/ for more examples.

Errors in validation after feature generation may be caused by extraneous codes that are not specified in the data specification.

Examples

# Convert to date

df['admidate'] = pd.to_datetime(df['admidate'], yearfirst=True)
df['admidate'] = df['admidate'].dt.date


# Fill missing SNOMED codes with 0.
# Else validation will fail as nan is treated as float.
df['accommodationstatus'] = df['accommodationstatus'].fillna(0)

Missing Values

To be finalised after further discussion and testing.

There is an entire chapter in Pandas documentation on missing values which is an important read for any data scientist.

For the purposes of this project, several pragmatic choices have been made regarding how missing values are treated.

  1. Where a definition exists for how missing values should be coded, for instance in the NHS data model, use this.
  2. For SNOMED codes, which are always integers, use 0 (zero) to replace all missing values. This avoids validation errors caused by NaN values that are treated as float dtype by Pandas.
  3. For strings, use "-" (without the quotes) for missing values.
  4. During feature engineering, custom error values are assigned to codes that are missing from either the refsets or mapping.
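Rules 2 and 3 above can be applied with plain pandas. A minimal sketch (accommodationstatus appears in this document; the notes column and the sample values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "accommodationstatus": [1064831000000104, np.nan],  # SNOMED codes (sample value is illustrative)
    "notes": ["follow-up", np.nan],                     # hypothetical string column
})

# Rule 2: replace missing SNOMED codes with 0, then restore integer dtype
df["accommodationstatus"] = df["accommodationstatus"].fillna(0).astype("int64")

# Rule 3: replace missing strings with "-"
df["notes"] = df["notes"].fillna("-")

print(df["accommodationstatus"].tolist())  # [1064831000000104, 0]
print(df["notes"].tolist())                # ['follow-up', '-']
```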