Feature Engineering

Feature engineering is the process of generating new variables from one or more existing variables.
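As a minimal illustration of this idea, the sketch below derives one new variable from two existing ones. The field names (`admission_date`, `discharge_date`, `length_of_stay`) and the helper `length_of_stay_days` are hypothetical, chosen only for the example. They are not taken from the project's data specification.

```python
from datetime import date


def length_of_stay_days(admission: date, discharge: date) -> int:
    """Return the number of whole days between admission and discharge."""
    return (discharge - admission).days


# A single hypothetical episode record with two existing variables
episode = {
    "admission_date": date(2022, 1, 3),
    "discharge_date": date(2022, 1, 7),
}

# The engineered feature is a new variable computed from the existing ones
episode["length_of_stay"] = length_of_stay_days(
    episode["admission_date"], episode["discharge_date"]
)
```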

The Data Processing document defined by the lead team explicitly documents which new features are expected. Refer to that document for more details.

HDRUK Data Processing V1 Google Docs

The functions described below generate these features automatically in preparation for the second validation step and further analysis.

Ensure that data has undergone preprocessing and has passed the first validation step as described in the analysis pipeline before using these functions.

Error codes

A pragmatic approach has been used in dealing with missing data, unmapped codes and codes not in refsets. Please also read the section on missing values in the Data Validation chapter.

During feature engineering, especially in the Emergency Care dataset, which has several columns with SNOMED codes, the following rules are applied to assign the appropriate categories.

Source Data	Mapping	Refset	Category	Who fixes
Yes	Yes	Yes	Assign to `Category`	-
Yes	No	Yes	`ERROR:Unmapped - In Refset`	Lead site to advise
Yes	Yes	No	`ERROR:Not In Refset|{Category}`	Lead site to fix
No	x	x	`ERROR:Missing Data`	Local site if feasible
Yes	No	No	`ERROR:Unmapped - Not In Refset`	Local site to fix
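
As a rough illustration, the rules in the table above can be expressed as a single function. This is only a sketch: the function name `assign_category`, the `mapping` dictionary, the `refset` set and the example codes are all hypothetical and are not taken from feature_maps.py, whose actual implementation may differ.

```python
from typing import Dict, Optional, Set


def assign_category(
    code: Optional[str], mapping: Dict[str, str], refset: Set[str]
) -> str:
    """Assign a category or an error code following the table's rules.

    code:    value from the source data (None or empty if missing)
    mapping: hypothetical lookup from code to category
    refset:  hypothetical set of codes that belong to the refset
    """
    # No source data: nothing else can be checked
    if code is None or str(code).strip() == "":
        return "ERROR:Missing Data"

    category = mapping.get(code)
    in_refset = code in refset

    if category is not None:
        # Mapped and in refset: the normal case
        if in_refset:
            return category
        # Mapped but not in refset: keep the category in the error code
        return f"ERROR:Not In Refset|{category}"

    # Unmapped codes, with or without refset membership
    if in_refset:
        return "ERROR:Unmapped - In Refset"
    return "ERROR:Unmapped - Not In Refset"


# Made-up codes and categories, for illustration only
example_mapping = {"111": "Category A", "222": "Category B"}
example_refset = {"111", "333"}

results = {
    code: assign_category(code, example_mapping, example_refset)
    for code in ["111", "222", "333", "999", ""]
}
```

Checking for missing data first mirrors the `x` entries in the table: when there is no source value, the mapping and refset columns do not matter.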

Please see the source code for feature_maps.py and raise a GitHub issue for any questions or bugs.

`build_admitted_care_features(df)`

Generate features described in the Admitted Care Data Specification.

See Analysis Pipeline for more information.

Parameters:

Name	Type	Description	Default
`df`	`Pandas DataFrame`	Dataframe that has passed the first validation step	required

Returns:

Type	Description
`DataFrame`	Dataframe with additional feature columns

Feature Engineering Example:

import pandas as pd
from avoidable_admissions.data.validate import (
    validate_dataframe,
    AdmittedCareEpisodeSchema
)
from avoidable_admissions.features.build_features import (
    build_admitted_care_features
)


# Load raw data typically extracted using SQL from source database
df = pd.read_csv('../data/raw/admitted_care')

# First validation step using Episode Schema
# Review, fix DQ issues and repeat this step until all data passes validation
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

# Feature engineering using the _good_ dataframe
df_features = build_admitted_care_features(good)

# Second validation step and continue...

See Analysis Pipeline for more information.

Source code in avoidable_admissions/features/build_features.py

def build_admitted_care_features(df: pd.DataFrame) -> pd.DataFrame:
    """Generate features described in the Admitted Care Data Specification

    See [Analysis Pipeline][data-analysis-pipeline] for more information

    Args:
        df (Pandas DataFrame): Dataframe that has passed the first validation step

    Returns:
        pd.DataFrame: Dataframe with additional feature columns


    ## Feature Engineering Example:

    ``` python
    import pandas as pd
    from avoidable_admissions.data.validate import (
        validate_dataframe,
        AdmittedCareEpisodeSchema
    )
    from avoidable_admissions.features.build_features import (
        build_admitted_care_features
    )


    # Load raw data typically extracted using SQL from source database
    df = pd.read_csv('../data/raw/admitted_care')

    # First validation step using Episode Schema
    # Review, fix DQ issues and repeat this step until all data passes validation
    good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

    # Feature engineering using the _good_ dataframe
    df_features = build_admitted_care_features(good)

    # Second validation step and continue...
    ```

    See [Analysis Pipeline][data-analysis-pipeline] for more information.
    """

    df = admitted_care_features.build_all(df)

    return df

`build_emergency_care_features(df)`

Source code in avoidable_admissions/features/build_features.py

def build_emergency_care_features(df: pd.DataFrame) -> pd.DataFrame:

    df = emergency_care_features.build_all(df)

    return df

Read the source code for generating admitted care features and emergency care features on GitHub.