Data Analysis Pipeline
Data harmonisation is a vital step in constructing a reproducible, reusable data analytics pipeline.
Please refer to the detailed data specification and analysis plan developed by the lead team at Sheffield University.
This documentation, together with detailed step-by-step instructions for setting up a Python environment and getting started on this project, is available at https://mattstammers.github.io/hdruk_avoidable_admissions_collaboration_docs/.
The Avoidable Admissions project requires the preparation and analysis of two distinct datasets, one for admitted care and one for emergency care. The steps are identical for both and are shown in the flowchart below.
Click on the flowchart elements for more information on each step as it applies to the Admitted Care Dataset. Similar functions are available for the Emergency Care Dataset.
flowchart TB
subgraph Admitted_Care_Pipeline
direction TB
subgraph Preprocessing
A(Extract) --> B(Validate)
B --> C{Errors?}
C -->|Yes| D(Fix Errors)
D --> B
end
subgraph Feature_Engineering
C -->|No| E(Generate Features)
E --> F(Validate)
F --> G{Errors?}
G -->|Yes| H(Fix Errors)
H --> F
end
subgraph Analysis
G -->|No| I(Analysis)
end
end
style A stroke:#526cfe,stroke-width:4px
style E stroke:#526cfe,stroke-width:4px
style I stroke:#526cfe,stroke-width:4px
style B stroke:#26b079,stroke-width:4px
style F stroke:#26b079,stroke-width:4px
style D stroke:#ff7872
style H stroke:#ff7872
click B "/hdruk_avoidable_admissions/validation/#avoidable_admissions.data.validate.validate_admitted_care_data"
click F "/hdruk_avoidable_admissions/validation/#avoidable_admissions.data.validate.validate_admitted_care_features"
click C "/hdruk_avoidable_admissions/validation/#avoidable_admissions.data.validate.validate_dataframe--validation-example"
click G "/hdruk_avoidable_admissions/validation/#avoidable_admissions.data.validate.validate_dataframe--validation-example"
click E "/hdruk_avoidable_admissions/features/#avoidable_admissions.features.build_features.build_admitted_care_features"
click D "/hdruk_avoidable_admissions/validation/#fixing-errors"
Pipeline Example
This is an example using the Admitted Care Dataset. The same principles apply for the Emergency Care Dataset.
import pandas as pd
from avoidable_admissions.data.validate import (
    validate_dataframe,
    AdmittedCareEpisodeSchema,
    AdmittedCareFeatureSchema,
)
from avoidable_admissions.features.build_features import (
    build_admitted_care_features,
)
# Load raw data, typically extracted from the source database using SQL
df = pd.read_csv("../data/raw/admitted_care.csv")
# First validation step using the Episode Schema
# Review and fix data quality (DQ) issues, then repeat this step until all data passes validation
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)
# Feature engineering using the _good_ dataframe
df_features = build_admitted_care_features(good)
# Second validation step using Feature Schema
# Review and fix DQ issues.
# This may require returning to the first validation step or even extraction.
good_f, bad_f = validate_dataframe(df_features, AdmittedCareFeatureSchema)
# Use the good_f dataframe for analysis as required by the lead site
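The review, fix and repeat cycle mentioned in the comments above can be illustrated without the package itself. The following is a minimal sketch using plain pandas and is not the `avoidable_admissions` implementation: the `split_valid_rows` function, the `age` and `sex` columns, the rules and the data are all hypothetical stand-ins, chosen only to show how rows might be split into passing and failing sets, corrected, and then checked again.

```python
from typing import Callable, Dict, Tuple

import pandas as pd

# A rule takes a column and returns True where the value is acceptable.
Rule = Callable[[pd.Series], pd.Series]


def split_valid_rows(
    df: pd.DataFrame, rules: Dict[str, Rule]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical stand-in for a schema validator.

    Rows that pass every rule are returned as `good`;
    all remaining rows are returned as `bad`.
    """
    passes = pd.Series(True, index=df.index)
    for column, check in rules.items():
        passes &= check(df[column])
    return df[passes].copy(), df[~passes].copy()


# Made-up rules and data, for illustration only.
rules = {
    "age": lambda s: s.between(0, 130),
    "sex": lambda s: s.isin(["1", "2", "9"]),
}
raw = pd.DataFrame({"age": [34, 210, 71], "sex": ["1", "2", "M"]})

# First pass: one row passes, two rows fail.
good, bad = split_valid_rows(raw, rules)

# "Fix Errors" step: recode a known non-standard label, then validate again.
fixed = raw.copy()
fixed["sex"] = fixed["sex"].replace({"M": "1"})
good_fixed, bad_fixed = split_valid_rows(fixed, rules)

# The implausible age is still in bad_fixed and needs review at source.
print(len(good_fixed), len(bad_fixed))
```

In this sketch the recoded row moves from the failing set to the passing set on the second pass, while the implausible age remains flagged. This mirrors the loop in the flowchart: fixes are applied upstream of validation and the whole dataset is validated again, rather than patching the `bad` rows in isolation.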
Please see the Pipeline Example Jupyter notebook for a more detailed walkthrough.