Feature Engineering

Feature engineering is the process of generating new variables from one or more existing variables.

The Data Processing documents defined by the lead team explicitly describe which new features are expected. Refer to them for more details.

The functions described below generate these features automatically in preparation for the second validation step and further analysis.

Ensure that data has undergone preprocessing and has passed the first validation step as described in the analysis pipeline before using these functions.

Error codes

A pragmatic approach has been taken to missing data, unmapped codes and codes not in refsets. Please also read the section on missing values in the Data Validation chapter.

During feature engineering, especially in the Emergency Care dataset, which has several columns with SNOMED codes, the following rules are applied to assign the appropriate categories.

Source Data  Mapping  Refset  Category                        Who fixes
Yes          Yes      Yes     Assign to Category
Yes          No       Yes     ERROR:Unmapped - In Refset      Lead site to advise
Yes          Yes      No      ERROR:Not In Refset|{Category}  Lead site to fix
No           x        x       ERROR:Missing Data              Local site if feasible
Yes          No       No      ERROR:Unmapped - Not In Refset  Local site to fix
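
The rules in the table above can be sketched as a small lookup function. This is an illustrative sketch only, not the code in feature_maps.py: the function name `assign_category`, the dict-and-set representation of the mapping and refset, and the example codes are all assumptions made for illustration. The sketch also assumes that `None` and empty strings count as missing source data.

```python
from typing import Dict, Optional, Set

# Hypothetical placeholder codes and categories, not real SNOMED codes.
EXAMPLE_MAPPING: Dict[str, str] = {"111": "Category A", "222": "Category B"}
EXAMPLE_REFSET: Set[str] = {"111", "333"}


def assign_category(
    source_value: Optional[str],
    mapping: Dict[str, str],
    refset: Set[str],
) -> str:
    """Return a category or an ERROR label following the rules table."""
    # No source data: nothing to map or look up.
    if source_value is None or source_value == "":
        return "ERROR:Missing Data"

    in_mapping = source_value in mapping
    in_refset = source_value in refset

    if in_mapping and in_refset:
        return mapping[source_value]
    if in_mapping:
        # A category mapping exists, but the code is not in the refset.
        return f"ERROR:Not In Refset|{mapping[source_value]}"
    if in_refset:
        # The code is in the refset, but no category mapping exists.
        return "ERROR:Unmapped - In Refset"
    return "ERROR:Unmapped - Not In Refset"


if __name__ == "__main__":
    for code in ["111", "222", "333", "999", None]:
        print(repr(code), "->", assign_category(code, EXAMPLE_MAPPING, EXAMPLE_REFSET))
```

Keeping the category inside the `ERROR:Not In Refset|{Category}` label means the mapped value is still visible while the refset issue is being resolved.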

Please see the source code for feature_maps.py and raise a GitHub issue for any questions or bugs.

build_admitted_care_features(df)

Generate features described in the Admitted Care Data Specification

See Analysis Pipeline for more information

Parameters:

Name  Type              Description                                          Default
df    Pandas DataFrame  Dataframe that has passed the first validation step  required

Returns:

Type       Description
DataFrame  Dataframe with additional feature columns

Feature Engineering Example:

import pandas as pd
from avoidable_admissions.data.validate import (
    validate_dataframe,
    AdmittedCareEpisodeSchema
)
from avoidable_admissions.features.build_features import (
    build_admitted_care_features
)


# Load raw data typically extracted using SQL from source database
df = pd.read_csv('../data/raw/admitted_care')

# First validation step using Episode Schema
# Review, fix DQ issues and repeat this step until all data passes validation
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

# Feature engineering using the _good_ dataframe
df_features = build_admitted_care_features(good)

# Second validation step and continue...

See Analysis Pipeline for more information.

Source code in avoidable_admissions/features/build_features.py
def build_admitted_care_features(df: pd.DataFrame) -> pd.DataFrame:
    """Generate features described in the Admitted Care Data Specification

    See [Analysis Pipeline][data-analysis-pipeline] for more information

    Args:
        df (Pandas DataFrame): Dataframe that has passed the first validation step

    Returns:
        pd.DataFrame: Dataframe with additional feature columns


    ## Feature Engineering Example:

    ``` python
    import pandas as pd
    from avoidable_admissions.data.validate import (
        validate_dataframe,
        AdmittedCareEpisodeSchema
    )
    from avoidable_admissions.features.build_features import (
        build_admitted_care_features
    )


    # Load raw data typically extracted using SQL from source database
    df = pd.read_csv('../data/raw/admitted_care')

    # First validation step using Episode Schema
    # Review, fix DQ issues and repeat this step until all data passes validation
    good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

    # Feature engineering using the _good_ dataframe
    df_features = build_admitted_care_features(good)

    # Second validation step and continue...
    ```

    See [Analysis Pipeline][data-analysis-pipeline] for more information.
    """

    df = admitted_care_features.build_all(df)

    return df

build_emergency_care_features(df)

Source code in avoidable_admissions/features/build_features.py
def build_emergency_care_features(df: pd.DataFrame) -> pd.DataFrame:
    """Generate emergency care features from a dataframe that has passed
    the first validation step."""

    df = emergency_care_features.build_all(df)

    return df

Read the source code for generating admitted care features and emergency care features on GitHub.