Feature Engineering
Feature engineering is the process of generating new variables from one or more existing variables.
The Data Processing document defined by the lead team explicitly documents which new features are expected. Refer to it for more details.
The functions described below generate these features automatically in preparation for the second validation step and further analysis.
Ensure that data has undergone preprocessing and has passed the first validation step as described in the analysis pipeline before using these functions.
Error codes
A pragmatic approach has been used in dealing with missing data, unmapped codes and codes not in refsets. Please also read the section on missing values in the Data Validation chapter.
During feature engineering, especially in the Emergency Care dataset, which has several columns with SNOMED codes, the following rules are applied to assign the appropriate categories.
Source Data | Mapping | Refset | Category | Who fixes |
---|---|---|---|---|
Yes | Yes | Yes | Assign to Category | |
Yes | No | Yes | ERROR:Unmapped - In Refset | Lead site to advise |
Yes | Yes | No | ERROR:Not In Refset\|{Category} | Lead site to fix |
No | x | x | ERROR:Missing Data | Local site if feasible |
Yes | No | No | ERROR:Unmapped - Not In Refset | Local site to fix |
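The rules in the table above can be sketched as a small function. This is an illustrative sketch only, not the implementation in feature_maps.py: the function name `assign_category`, the `mapping` dictionary, the `refset` set and the example codes are all hypothetical. It assumes that "Mapping" means the source code has an entry in a code-to-category mapping and that "Refset" means the source code is a member of the relevant reference set.

```python
from typing import Dict, Optional, Set


def assign_category(
    source_value: Optional[str],
    mapping: Dict[str, str],
    refset: Set[str],
) -> str:
    """Apply the error-code rules from the table to a single source value.

    Hypothetical helper: the real logic lives in feature_maps.py.
    """
    # No source data: mapping and refset membership are irrelevant (x, x).
    if source_value is None or source_value.strip() == "":
        return "ERROR:Missing Data"

    mapped = source_value in mapping
    in_refset = source_value in refset

    if mapped and in_refset:
        # The only non-error outcome: assign the mapped category.
        return mapping[source_value]
    if mapped:
        # Mapped but outside the refset: keep the category after the pipe.
        return f"ERROR:Not In Refset|{mapping[source_value]}"
    if in_refset:
        return "ERROR:Unmapped - In Refset"
    return "ERROR:Unmapped - Not In Refset"


# Hypothetical codes and categories, for illustration only
mapping = {"A1": "Category A", "B2": "Category B"}
refset = {"A1", "C3"}

for code in ["A1", "C3", "B2", None, "Z9"]:
    print(code, "->", assign_category(code, mapping, refset))
```

Checking for missing data first mirrors the `x | x` row of the table, where mapping and refset membership are not evaluated at all.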
Please see the source code for feature_maps.py and raise a GitHub issue for any questions or bugs.
build_admitted_care_features(df)
Generate the features described in the Admitted Care Data Specification.
See Analysis Pipeline for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | Pandas DataFrame | Dataframe that has passed the first validation step | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: Dataframe with additional feature columns |
Feature Engineering Example:
import pandas as pd
from avoidable_admissions.data.validate import (
validate_dataframe,
AdmittedCareEpisodeSchema
)
from avoidable_admissions.features.build_features import (
build_admitted_care_features
)
# Load raw data typically extracted using SQL from source database
df = pd.read_csv('../data/raw/admitted_care')
# First validation step using Episode Schema
# Review, fix DQ issues and repeat this step until all data passes validation
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)
# Feature engineering using the _good_ dataframe
df_features = build_admitted_care_features(good)
# Second validation step and continue...
See Analysis Pipeline for more information.
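Once features have been built, it may help to tally the `ERROR:` categories in a feature column to see which issues belong to the lead site and which to the local site. The sketch below is one possible way to do this using only the standard library. The helper `summarise_errors` and the example values are hypothetical and are not part of the package; the `WHO_FIXES` lookup simply restates the "Who fixes" column of the error-code table.

```python
from collections import Counter

# "Who fixes" column of the error-code table, keyed by error code
WHO_FIXES = {
    "ERROR:Unmapped - In Refset": "Lead site to advise",
    "ERROR:Not In Refset": "Lead site to fix",
    "ERROR:Missing Data": "Local site if feasible",
    "ERROR:Unmapped - Not In Refset": "Local site to fix",
}


def summarise_errors(values):
    """Count error categories in an iterable of feature values.

    Hypothetical helper: returns a Counter keyed by (error, who fixes).
    """
    counts = Counter()
    for value in values:
        # Skip non-string values and successfully assigned categories.
        if not isinstance(value, str) or not value.startswith("ERROR:"):
            continue
        # Drop the "|{Category}" suffix of "ERROR:Not In Refset|{Category}".
        error = value.split("|", 1)[0]
        counts[(error, WHO_FIXES.get(error, "Unknown"))] += 1
    return counts


# Hypothetical feature values, for illustration only
values = [
    "Category A",
    "ERROR:Missing Data",
    "ERROR:Not In Refset|Category B",
    "ERROR:Missing Data",
]

for (error, owner), n in summarise_errors(values).most_common():
    print(f"{n:>3}  {error}  ({owner})")
```

Because the helper accepts any iterable, a single column of the features dataframe could in principle be passed in directly, assuming the column holds category strings.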
Source code in avoidable_admissions/features/build_features.py
build_emergency_care_features(df)
Source code in avoidable_admissions/features/build_features.py
Read the source code for generating admitted care features and emergency care features on GitHub.